* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
       [not found]             ` <1407587152.24027.5.camel@usa>
@ 2014-08-11 17:22               ` Walid Nouri
  2014-08-11 20:15                 ` Michael R. Hines
  2014-08-11 20:15                 ` Michael R. Hines
  0 siblings, 2 replies; 23+ messages in thread
From: Walid Nouri @ 2014-08-11 17:22 UTC (permalink / raw)
  To: qemu-devel, michael

Hi,
I will do my best to make a contribution :-)

Are there alternative ways of replicating local storage, other than DRBD,
that would be feasible?
Perhaps some that are built directly into QEMU?

Walid

On 09.08.2014 14:25, Michael R. Hines wrote:
> On Sat, 2014-08-09 at 14:08 +0200, Walid Nouri wrote:
>> Hi Michael,
>> how is the weather in Beijing? :-)
> It's terrible. Lots of pollution =(
>
>> May I ask you some questions about your MC implementation?
>>
>> Currently I'm trying to understand the general working of the MC
>> protocol and possible problems that can occur so that I can discuss it
>> in my thesis.
>>
>> As far as I have understood, MC relies on a shared disk. Output of the
>> primary VM is written directly; network output is buffered until the
>> corresponding checkpoint is acknowledged.
>>
>> One problem that comes to my mind is: what happens when the primary VM
>> writes to the disk and crashes before sending a corresponding checkpoint?
>>
> The MC implementation itself is incomplete, today. (I need help).
>
> The Xen Remus implementation uses the DRBD system to "mirror" all disk
> writes to the source and destination before completing each checkpoint.
>
> The KVM (mc) implementation needs exactly the same support, but it is
> missing today.
>
> Until that happens, we are *required* to use root-over-iSCSI or
> root-over-NFS (meaning that the guest filesystem is mounted directly
> inside the virtual machine without the host knowing about it).
>
> This has the effect of translating all disk I/O into network I/O,
> and since network I/O is already buffered, then we are safe.
>
>
>> Here is an example: the primary state is in the current epoch (n), the
>> secondary state is in epoch (n-1). The primary writes to disk and
>> crashes before or while sending checkpoint n. In this case the
>> secondary memory state is still at epoch (n-1), while the state of the
>> shared disk corresponds to the primary state of epoch (n).
>>
>> How does MC guarantee that the disk state of the backup VM is consistent
>> with its memory state?
> As I mentioned above, we need the equivalent of the Xen solution, but I
> just haven't had the time to write it (or incorporate someone else's
> implementation). Patch is welcome =)
>
>> Is memory-VCPU / disk state consistency necessary under all circumstances?
>> Or can this be neglected because the secondary will (after a failover)
>> repeat the same instructions and finally write the same data to disk (as
>> the primary did before) a second time?
>> Could this lead to fatal inconsistencies?
>>
>> Walid
>>
>
>
>
> - Michael
>
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-08-11 17:22               ` [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency Walid Nouri
@ 2014-08-11 20:15                 ` Michael R. Hines
  2014-08-17  9:52                   ` Paolo Bonzini
  2014-08-11 20:15                 ` Michael R. Hines
  1 sibling, 1 reply; 23+ messages in thread
From: Michael R. Hines @ 2014-08-11 20:15 UTC (permalink / raw)
  To: Walid Nouri, qemu-devel, michael, Paolo Bonzini

Excellent question: QEMU does have a feature called "drive-mirror"
in block/mirror.c that was introduced a couple of years ago. I'm not sure
what the adoption rate of the feature is, but I would start with that one.
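For orientation, drive-mirror is driven through QMP. Below is a minimal Python sketch, assuming a QMP UNIX socket at /tmp/qmp.sock and a source drive id of "drive0" (both hypothetical); the argument names follow QEMU's QMP drive-mirror schema, but treat this as a sketch rather than a tested recipe:

```python
import json
import socket

def qmp_drive_mirror(device, target, sync="full"):
    """Build a QMP 'drive-mirror' command (argument names per the QMP schema)."""
    return {
        "execute": "drive-mirror",
        "arguments": {
            "device": device,        # id of the source drive
            "target": target,        # destination image path
            "sync": sync,            # "full": copy the entire disk first
            "mode": "absolute-paths",
        },
    }

def qmp_send(sock_path, cmd):
    """Send one command over a QMP UNIX socket (sock_path is hypothetical)."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)
    s.recv(4096)                                   # QMP greeting banner
    s.sendall(b'{"execute": "qmp_capabilities"}\n')  # leave capability mode
    s.recv(4096)
    s.sendall((json.dumps(cmd) + "\n").encode())
    return json.loads(s.recv(4096))

print(json.dumps(qmp_drive_mirror("drive0", "/mnt/mirror/disk0.qcow2")))
```

The helper only builds and ships the JSON command; error handling and block-job event tracking are omitted.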

There is also a second fault tolerance implementation that works a little
differently, called "COLO" - you may have seen those emails on the list
too, but their method does not require a disk replication solution, if I
recall correctly.

I know the time pressure that comes during a thesis, though =), so
there's no pressure to work on it - but that is the most pressing issue
in the implementation today (the lack of disk replication in
micro-checkpointing).

The MC implementation also needs to be re-based against the latest
master - I just haven't had a chance to do it yet because some of my
hardware has been taken away from me the last few months - will
see if I can find some reasonable hardware soon.

- Michael

On 08/12/2014 01:22 AM, Walid Nouri wrote:
> Hi,
> I will do my best to make a contribution :-)
>
> Are there alternative ways of replicating local storage, other than
> DRBD, that would be feasible?
> Perhaps some that are built directly into QEMU?
>
> Walid
>
> On 09.08.2014 14:25, Michael R. Hines wrote:
>> On Sat, 2014-08-09 at 14:08 +0200, Walid Nouri wrote:
>>> Hi Michael,
>>> how is the weather in Beijing? :-)
>> It's terrible. Lots of pollution =(
>>
>>> May I ask you some questions about your MC implementation?
>>>
>>> Currently I'm trying to understand the general working of the MC
>>> protocol and possible problems that can occur so that I can discuss it
>>> in my thesis.
>>>
>>> As far as I have understood, MC relies on a shared disk. Output of the
>>> primary VM is written directly; network output is buffered until the
>>> corresponding checkpoint is acknowledged.
>>>
>>> One problem that comes to my mind is: what happens when the primary VM
>>> writes to the disk and crashes before sending a corresponding checkpoint?
>>>
>> The MC implementation itself is incomplete, today. (I need help).
>>
>> The Xen Remus implementation uses the DRBD system to "mirror" all disk
>> writes to the source and destination before completing each checkpoint.
>>
>> The KVM (mc) implementation needs exactly the same support, but it is
>> missing today.
>>
>> Until that happens, we are *required* to use root-over-iSCSI or
>> root-over-NFS (meaning that the guest filesystem is mounted directly
>> inside the virtual machine without the host knowing about it).
>>
>> This has the effect of translating all disk I/O into network I/O,
>> and since network I/O is already buffered, then we are safe.
>>
>>
>>> Here is an example: the primary state is in the current epoch (n), the
>>> secondary state is in epoch (n-1). The primary writes to disk and
>>> crashes before or while sending checkpoint n. In this case the
>>> secondary memory state is still at epoch (n-1), while the state of the
>>> shared disk corresponds to the primary state of epoch (n).
>>>
>>> How does MC guarantee that the disk state of the backup VM is consistent
>>> with its memory state?
>> As I mentioned above, we need the equivalent of the Xen solution, but I
>> just haven't had the time to write it (or incorporate someone else's
>> implementation). Patch is welcome =)
>>
>>> Is memory-VCPU / disk state consistency necessary under all
>>> circumstances?
>>> Or can this be neglected because the secondary will (after a failover)
>>> repeat the same instructions and finally write the same data to disk (as
>>> the primary did before) a second time?
>>> Could this lead to fatal inconsistencies?
>>>
>>> Walid
>>>
>>
>>
>>
>> - Michael
>>
>>
>>
>
>



* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-08-11 20:15                 ` Michael R. Hines
@ 2014-08-13 14:03                   ` Walid Nouri
  2014-08-13 22:28                     ` Michael R. Hines
  0 siblings, 1 reply; 23+ messages in thread
From: Walid Nouri @ 2014-08-13 14:03 UTC (permalink / raw)
  To: Michael R. Hines, qemu-devel, michael, hinesmr

Yes...
Time is a problem, and it's currently running out... ;-)

I think the first step is to reason about possible approaches and how 
they can be implemented in QEMU. The implementation can follow later :-)

Thank you for the hint with the drive-mirror feature.
I will take a look at it and surely come back with new questions :-)

I also think that disk replication is the most pressing issue for MC.
While looking for ideas on how to replicate block devices I have read the
paper about the Remus implementation. I think MC can take a similar
approach for local disks.

Here are the main facts that I have understood:

Local disk content is viewed as internal state of the primary and secondary.
To keep the disk semantics of the primary and to allow the primary to run
speculatively, all disk state changes are written directly to the disk and,
in parallel, sent asynchronously to the secondary. The secondary keeps the
pending write requests in two disk buffers: a speculation buffer and a
write-out buffer.

After receiving the next checkpoint, the secondary copies the speculation
buffer to the write-out buffer, commits the checkpoint, and applies the
write-out buffer to its local disk.

When the primary fails, the secondary must wait until the write-out buffer
has been completely written to disk before changing the execution mode to
run as primary. In this case (failure of the primary) the secondary
discards the pending disk writes in its speculation buffer. This protocol
keeps the disk state consistent with the last checkpoint.

Remus uses the Xen-specific blktap driver. As far as I know this can't
be used with QEMU (KVM).

I must see how drive-mirror can be used for this kind of protocol.
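The two-buffer protocol described above can be sketched roughly as follows. This is illustrative Python only, not QEMU code; the class, buffer, and method names are made up for the sketch:

```python
from collections import deque

class SecondaryDisk:
    """Sketch of Remus-style disk buffering on the backup host.

    Writes streamed from the primary land in a speculation buffer; they
    only reach the real disk once the checkpoint covering them commits.
    """

    def __init__(self):
        self.speculation = deque()   # writes of the current, uncommitted epoch
        self.write_out = deque()     # writes of the last committed epoch
        self.disk = {}               # stand-in for the local block device

    def receive_write(self, offset, data):
        """Primary wrote locally and streamed the write to us asynchronously."""
        self.speculation.append((offset, data))

    def checkpoint_committed(self):
        """Checkpoint acknowledged: promote this epoch's writes and flush."""
        self.write_out.extend(self.speculation)
        self.speculation.clear()
        self._flush()

    def primary_failed(self):
        """Failover: drop speculative writes, finish flushing committed ones,
        then it is safe to resume execution as the new primary."""
        self.speculation.clear()
        self._flush()

    def _flush(self):
        # Apply all promoted writes to the local disk.
        while self.write_out:
            offset, data = self.write_out.popleft()
            self.disk[offset] = data
```

The key invariant: after a failover, the disk reflects exactly the writes of the last committed checkpoint, never the speculative epoch.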


On 11.08.2014 22:15, Michael R. Hines wrote:
> Excellent question: QEMU does have a feature called "drive-mirror"
> in block/mirror.c that was introduced a couple of years ago. I'm not
> sure what the
> adoption rate of the feature is, but I would start with that one.
>
> There is also a second fault tolerance implementation that works a
> little differently called
> "COLO" - you may have seen those emails on the list too, but their
> method does not require a disk replication solution, if I recall correctly.

I have taken a look at COLO. They have also published a good paper about
their approach; the paper covers the Xen implementation of COLO. It's an
interesting approach: coarse-grained lockstepping combined with
checkpointing. From what I have understood, the possible showstopper for
general application is COLO's dependency on custom changes to the TCP
stack to make it more deterministic.

IMHO there are two points. Custom changes to the TCP stack are a no-go
for proprietary operating systems like Windows. They make COLO
application-agnostic but not operating-system-agnostic. The other point
is that with I/O-intensive workloads COLO will tend to behave like MC.
This is my point of view, but I didn't invest much time in understanding
everything in detail.



>
> I know the time pressure that comes during a thesis, though =), so
> there's no pressure to work on it - but that is the most pressing issue
> in the implementation today. (Lack of disk replication in
> micro-checkpointing.)
>
> The MC implementation also needs to be re-based against the latest
> master - I just haven't had a chance to do it yet because some of my
> hardware has been taken away from me the last few months - will
> see if I can find some reasonable hardware soon.
>
> - Michael
>
> On 08/12/2014 01:22 AM, Walid Nouri wrote:
>> Hi,
>> I will do my best to make a contribution :-)
>>
>> Are there alternative ways of replicating local storage, other than
>> DRBD, that would be feasible?
>> Perhaps some that are built directly into QEMU?
>>
>> Walid
>>
>> On 09.08.2014 14:25, Michael R. Hines wrote:
>>> On Sat, 2014-08-09 at 14:08 +0200, Walid Nouri wrote:
>>>> Hi Michael,
>>>> how is the weather in Beijing? :-)
>>> It's terrible. Lots of pollution =(
>>>
>>>> May I ask you some questions about your MC implementation?
>>>>
>>>> Currently I'm trying to understand the general working of the MC
>>>> protocol and possible problems that can occur so that I can discuss it
>>>> in my thesis.
>>>>
>>>> As far as I have understood, MC relies on a shared disk. Output of the
>>>> primary VM is written directly; network output is buffered until the
>>>> corresponding checkpoint is acknowledged.
>>>>
>>>> One problem that comes to my mind is: what happens when the primary VM
>>>> writes to the disk and crashes before sending a corresponding checkpoint?
>>>>
>>> The MC implementation itself is incomplete, today. (I need help).
>>>
>>> The Xen Remus implementation uses the DRBD system to "mirror" all disk
>>> writes to the source and destination before completing each checkpoint.
>>>
>>> The KVM (mc) implementation needs exactly the same support, but it is
>>> missing today.
>>>
>>> Until that happens, we are *required* to use root-over-iSCSI or
>>> root-over-NFS (meaning that the guest filesystem is mounted directly
>>> inside the virtual machine without the host knowing about it).
>>>
>>> This has the effect of translating all disk I/O into network I/O,
>>> and since network I/O is already buffered, then we are safe.
>>>
>>>
>>>> Here is an example: the primary state is in the current epoch (n), the
>>>> secondary state is in epoch (n-1). The primary writes to disk and
>>>> crashes before or while sending checkpoint n. In this case the
>>>> secondary memory state is still at epoch (n-1), while the state of the
>>>> shared disk corresponds to the primary state of epoch (n).
>>>>
>>>> How does MC guarantee that the disk state of the backup VM is consistent
>>>> with its memory state?
>>> As I mentioned above, we need the equivalent of the Xen solution, but I
>>> just haven't had the time to write it (or incorporate someone else's
>>> implementation). Patch is welcome =)
>>>
>>>> Is memory-VCPU / disk state consistency necessary under all
>>>> circumstances?
>>>> Or can this be neglected because the secondary will (after a failover)
>>>> repeat the same instructions and finally write the same data to disk (as
>>>> the primary did before) a second time?
>>>> Could this lead to fatal inconsistencies?
>>>>
>>>> Walid
>>>>
>>>
>>>
>>>
>>> - Michael
>>>
>>>
>>>
>>
>>
>


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-08-13 14:03                   ` Walid Nouri
@ 2014-08-13 22:28                     ` Michael R. Hines
  2014-08-14 10:58                       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 23+ messages in thread
From: Michael R. Hines @ 2014-08-13 22:28 UTC (permalink / raw)
  To: Walid Nouri, qemu-devel, michael, hinesmr

On 08/13/2014 10:03 PM, Walid Nouri wrote:
>
> While looking to find some ideas for approaches to replicating block 
> devices I have read the paper about the Remus implementation. I think 
> MC can take a similar approach for local disk.
>

I agree.

> Here are the main facts that I have understood:
>
> Local disk content is viewed as internal state of the primary and
> secondary.
> To keep the disk semantics of the primary and to allow the primary to
> run speculatively, all disk state changes are written directly to the
> disk and, in parallel, sent asynchronously to the secondary. The
> secondary keeps the pending write requests in two disk buffers: a
> speculation buffer and a write-out buffer.
>
> After the reception of the next checkpoint the secondary copies the 
> speculation buffer to the write out buffer, commits the checkpoint and 
> applies the write out buffer to its local disk.
>
> When the primary fails, the secondary must wait until the write-out
> buffer has been completely written to disk before changing the
> execution mode to run as primary. In this case (failure of the primary)
> the secondary discards the pending disk writes in its speculation
> buffer. This protocol keeps the disk state consistent with the last
> checkpoint.
>
> Remus uses the Xen-specific blktap driver. As far as I know this can't
> be used with QEMU (KVM).
>
> I must see how drive-mirror can be used for this kind of protocol.
>

That's all correct. Theoretically, we would do exactly the same thing: 
drive-mirror on the source would write immediately to disk but follow 
the same commit semantics on the destination as Xen.

>
> I have taken a look at COLO.
>

> IMHO there are two points. Custom changes to the TCP stack are a no-go
> for proprietary operating systems like Windows. They make COLO
> application-agnostic but not operating-system-agnostic. The other point
> is that with I/O-intensive workloads COLO will tend to behave like MC.
> This is my point of view, but I didn't invest much time in
> understanding everything in detail.
>

Actually, if I remember correctly, the TCP stack is only modified at the 
hypervisor level - they are intercepting and translating TCP sequence 
numbers "in-flight" to detect divergence of the source and destination - 
which is not a big problem if the implementation is well-done.

My hope in the future was that the two approaches could be used in a 
"Hybrid" manner - actually MC has much more of a performance hit for I/O 
than COLO does because of its buffering requirements.

On the other hand, MC would perform better in a memory-intensive or 
CPU-intensive situation - so maybe QEMU could "switch" between the two 
mechanisms at different points in time when the resource bottleneck changes.

- Michael


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-08-13 22:28                     ` Michael R. Hines
@ 2014-08-14 10:58                       ` Dr. David Alan Gilbert
  2014-08-14 17:23                         ` Michael R. Hines
  2014-08-19  8:33                         ` Walid Nouri
  0 siblings, 2 replies; 23+ messages in thread
From: Dr. David Alan Gilbert @ 2014-08-14 10:58 UTC (permalink / raw)
  To: Michael R. Hines; +Cc: Walid Nouri, hinesmr, qemu-devel, michael

cc'ing in a couple of the COLOers.

* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 08/13/2014 10:03 PM, Walid Nouri wrote:
> >
> >While looking to find some ideas for approaches to replicating block
> >devices I have read the paper about the Remus implementation. I think MC
> >can take a similar approach for local disk.
> >
> 
> I agree.
> 
> >Here are the main facts that I have understood:
> >
> >Local disk content is viewed as internal state of the primary and secondary.
> >To keep the disk semantics of the primary and to allow the primary to run
> >speculatively, all disk state changes are written directly to the disk and,
> >in parallel, sent asynchronously to the secondary. The secondary keeps the
> >pending write requests in two disk buffers: a speculation buffer and a
> >write-out buffer.
> >
> >After the reception of the next checkpoint the secondary copies the
> >speculation buffer to the write out buffer, commits the checkpoint and
> >applies the write out buffer to its local disk.
> >
> >When the primary fails, the secondary must wait until the write-out buffer
> >has been completely written to disk before changing the execution mode
> >to run as primary. In this case (failure of the primary) the secondary
> >discards the pending disk writes in its speculation buffer. This protocol
> >keeps the disk state consistent with the last checkpoint.
> >
> >Remus uses the Xen-specific blktap driver. As far as I know this can't be
> >used with QEMU (KVM).
> >
> >I must see how drive-mirror can be used for this kind of protocol.
> >
> 
> That's all correct. Theoretically, we would do exactly the same thing:
> drive-mirror on the source would write immediately to disk but follow the
> same commit semantics on the destination as Xen.
> 
> >
> >I have taken a look at COLO.
> >
> 
> >IMHO there are two points. Custom changes to the TCP stack are a no-go for
> >proprietary operating systems like Windows. They make COLO
> >application-agnostic but not operating-system-agnostic. The other point is
> >that with I/O-intensive workloads COLO will tend to behave like MC. This is
> >my point of view, but I didn't invest much time in understanding everything
> >in detail.
> >
> 
> Actually, if I remember correctly, the TCP stack is only modified at the
> hypervisor level - they are intercepting and translating TCP sequence
> numbers "in-flight" to detect divergence of the source and destination -
> which is not a big problem if the implementation is well-done.

The 2013 paper says:
   'COLO modifies the guest OS’s TCP/IP stack in order to make the behavior
    more deterministic. '
but does say that an alternative might be to have a
  ' comparison function that operates transparently over re-assembled TCP streams'

> My hope in the future was that the two approaches could be used in a
> "Hybrid" manner - actually MC has much more of a performance hit for I/O
> than COLO does because of its buffering requirements.
> 
> On the other hand, MC would perform better in a memory-intensive or
> CPU-intensive situation - so maybe QEMU could "switch" between the two
> mechanisms at different points in time when the resource bottleneck changes.

If the primary were to rate-limit the number of resynchronisations
(and send the secondary a message as soon as it knew a resync was needed) that
would get some of the way, but then the only difference from microcheckpointing
at that point is the secondary doing a wasteful copy and sending the packets across;
it seems it should be easy to disable those if it knew that a resync was going to
happen.

Dave

> - Michael
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-08-14 10:58                       ` Dr. David Alan Gilbert
@ 2014-08-14 17:23                         ` Michael R. Hines
  2014-08-19  8:33                         ` Walid Nouri
  1 sibling, 0 replies; 23+ messages in thread
From: Michael R. Hines @ 2014-08-14 17:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Walid Nouri, hinesmr, qemu-devel, michael

On 08/14/2014 06:58 PM, Dr. David Alan Gilbert wrote:
> cc'ing in a couple of the COLOers.
Thanks, David. Glad to see their patches in last month - I need to take 
a look at them.

> The 2013 paper says: 'COLO modifies the guest OS’s TCP/IP stack in 
> order to make the behavior more deterministic. ' but does say that an 
> alternative might be to have a ' comparison function that operates 
> transparently over re-assembled TCP streams' 

Ouch - I didn't realize that.

It may or may not be a problem - but if it gets us further towards 
fault-tolerance, I'm open-minded. =)

The Xen paper did the same thing for databases - they also modified the 
guest TCP stack.

>> My hope in the future was that the two approaches could be used in a
>> "Hybrid" manner - actually MC has much more of a performance hit for I/O
>> than COLO does because of its buffering requirements.
>>
>> On the other hand, MC would perform better in a memory-intensive or
>> CPU-intensive situation - so maybe QEMU could "switch" between the two
>> mechanisms at different points in time when the resource bottleneck changes.
> If the primary were to rate-limit the number of resynchronisations
> (and send the secondary a message as soon as it knew a resync was needed) that
> would get some of the way, but then the only difference from microcheckpointing
> at that point is the secondary doing a wasteful copy and sending the packets across;
> it seems it should be easy to disable those if it knew that a resync was going to
> happen.
>
> Dave


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-08-11 20:15                 ` Michael R. Hines
@ 2014-08-17  9:52                   ` Paolo Bonzini
  2014-08-19  8:58                     ` Walid Nouri
  2014-09-10 15:43                     ` Walid Nouri
  0 siblings, 2 replies; 23+ messages in thread
From: Paolo Bonzini @ 2014-08-17  9:52 UTC (permalink / raw)
  To: Michael R. Hines, Walid Nouri, qemu-devel, michael

On 11/08/2014 22:15, Michael R. Hines wrote:
> Excellent question: QEMU does have a feature called "drive-mirror"
> in block/mirror.c that was introduced a couple of years ago. I'm not
> sure what the
> adoption rate of the feature is, but I would start with that one.

block/mirror.c is asynchronous, and there's no support for communicating
checkpoints back to the master.  However, the quorum disk driver could
be what you need.

There's also a series on the mailing list that lets quorum read only
from the primary, so that quorum can still do replication and fault
tolerance, but skip fault detection.
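As a rough illustration of what a quorum setup looks like, the sketch below builds a QMP blockdev-add command for the quorum driver. The node name and file paths are invented, and the exact QMP syntax varies between QEMU versions, so treat this as a shape rather than a recipe:

```python
import json

def quorum_blockdev_add(node_name, child_files, vote_threshold=1):
    """Build a QMP 'blockdev-add' for the quorum block driver.

    Illustrative only: node_name and child_files are hypothetical.
    """
    return {
        "execute": "blockdev-add",
        "arguments": {
            "driver": "quorum",
            "node-name": node_name,
            "vote-threshold": vote_threshold,  # 1: plain replication, no voting
            "children": [
                {"driver": "raw",
                 "file": {"driver": "file", "filename": f}}
                for f in child_files
            ],
        },
    }

cmd = quorum_blockdev_add("q0", ["/img/local.raw", "/img/replica.raw"])
print(json.dumps(cmd, indent=2))
```

With vote-threshold=1, writes fan out to all children while reads can be served from one, which matches the replicate-without-fault-detection mode mentioned above.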

Paolo

> There is also a second fault tolerance implementation that works a
> little differently called
> "COLO" - you may have seen those emails on the list too, but their
> method does not require a disk replication solution, if I recall correctly.


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-08-14 10:58                       ` Dr. David Alan Gilbert
  2014-08-14 17:23                         ` Michael R. Hines
@ 2014-08-19  8:33                         ` Walid Nouri
  1 sibling, 0 replies; 23+ messages in thread
From: Walid Nouri @ 2014-08-19  8:33 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: michael, hinesmr, qemu-devel, Michael R. Hines

Hi,
I have tried to find more information on how to use drive-mirror besides what is available on the wiki. This was not very satisfactory...

This may sound naive, but are there any code examples in C or any other language, documentation of any kind, developer blog entries, presentation videos, or other sources of information to get started?

Walid


> On 14.08.2014 at 12:58, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> cc'ing in a couple of the COLOers.
> 
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>>> On 08/13/2014 10:03 PM, Walid Nouri wrote:
>>> 
>>> While looking to find some ideas for approaches to replicating block
>>> devices I have read the paper about the Remus implementation. I think MC
>>> can take a similar approach for local disk.
>> 
>> I agree.
>> 
>>> Here are the main facts that I have understood:
>>> 
>>> Local disk content is viewed as internal state of the primary and secondary.
>>> To keep the disk semantics of the primary and to allow the primary to run
>>> speculatively, all disk state changes are written directly to the disk and,
>>> in parallel, sent asynchronously to the secondary. The secondary keeps the
>>> pending write requests in two disk buffers: a speculation buffer and a
>>> write-out buffer.
>>> 
>>> After the reception of the next checkpoint the secondary copies the
>>> speculation buffer to the write out buffer, commits the checkpoint and
>>> applies the write out buffer to its local disk.
>>> 
>>> When the primary fails, the secondary must wait until the write-out buffer
>>> has been completely written to disk before changing the execution mode
>>> to run as primary. In this case (failure of the primary) the secondary
>>> discards the pending disk writes in its speculation buffer. This protocol
>>> keeps the disk state consistent with the last checkpoint.
>>> 
>>> Remus uses the Xen-specific blktap driver. As far as I know this can't be
>>> used with QEMU (KVM).
>>> 
>>> I must see how drive-mirror can be used for this kind of protocol.
>> 
>> That's all correct. Theoretically, we would do exactly the same thing:
>> drive-mirror on the source would write immediately to disk but follow the
>> same commit semantics on the destination as Xen.
>> 
>>> 
>>> I have taken a look at COLO.
>> 
>>> IMHO there are two points. Custom changes to the TCP stack are a no-go for
>>> proprietary operating systems like Windows. It makes COLO application
>>> agnostic but not operating system agnostic. The other point is that with
>>> I/O-intensive workloads COLO will tend to behave like MC. This is my point
>>> of view, but I didn't invest much time to understand everything in detail.
>> 
>> Actually, if I remember correctly, the TCP stack is only modified at the
>> hypervisor level - they are intercepting and translating TCP sequence
>> numbers "in-flight" to detect divergence of the source and destination -
>> which is not a big problem if the implementation is well-done.
> 
> The 2013 paper says:
>   'COLO modifies the guest OS’s TCP/IP stack in order to make the behavior
>    more deterministic. '
> but does say that an alternative might be to have a
>  ' comparison function that operates transparently over re-assembled TCP streams'
> 
>> My hope in the future was that the two approaches could be used in a
>> "Hybrid" manner - actually MC has much more of a performance hit for I/O
>> than COLO does because of its buffering requirements.
>> 
>> On the other hand, MC would perform better in a memory-intensive or
>> CPU-intensive situation - so maybe QEMU could "switch" between the two
>> mechanisms at different points in time when the resource bottleneck changes.
> 
> If the primary were to rate-limit the number of resynchronisations
> (and send the secondary a message as soon as it knew a resync was needed) that
> would get some of the way, but then the only difference from microcheckpointing
> at that point is the secondary doing a wasteful copy and sending the packets across;
> it seems it should be easy to disable those if it knew that a resync was going to
> happen.
> 
> Dave
> 
>> - Michael
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-08-17  9:52                   ` Paolo Bonzini
@ 2014-08-19  8:58                     ` Walid Nouri
  2014-09-10 15:43                     ` Walid Nouri
  1 sibling, 0 replies; 23+ messages in thread
From: Walid Nouri @ 2014-08-19  8:58 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: michael, qemu-devel, Michael R. Hines

Hi Paolo,
thanks for your hint. I missed your mail from last Sunday.
I will take a look at that!

Walid

> On 17.08.2014 at 11:52, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 11/08/2014 22:15, Michael R. Hines wrote:
>> Excellent question: QEMU does have a feature called "drive-mirror"
>> in block/mirror.c that was introduced a couple of years ago. I'm not
>> sure what the
>> adoption rate of the feature is, but I would start with that one.
> 
> block/mirror.c is asynchronous, and there's no support for communicating
> checkpoints back to the master.  However, the quorum disk driver could
> be what you need.
> 
> There's also a series on the mailing list that lets quorum read only
> from the primary, so that quorum can still do replication and fault
> tolerance, but skip fault detection.
> 
> Paolo
> 
>> There is also a second fault tolerance implementation that works a
>> little differently called
>> "COLO" - you may have seen those emails on the list too, but their
>> method does not require a disk replication solution, if I recall correctly.
> 


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-08-17  9:52                   ` Paolo Bonzini
  2014-08-19  8:58                     ` Walid Nouri
@ 2014-09-10 15:43                     ` Walid Nouri
  2014-09-11  1:50                       ` Michael R. Hines
                                         ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Walid Nouri @ 2014-09-10 15:43 UTC (permalink / raw)
  To: Paolo Bonzini, Michael R. Hines, qemu-devel, michael

Hello Michael, Hello Paolo
I have „studied“ the available documentation/information and tried to 
get an idea of the QEMU live block operation possibilities.

I think the MC protocol doesn’t need synchronous block device 
replication because the primary and secondary VM are not synchronous. 
The state of the primary is always ahead of the state of the secondary. 
When the primary is in epoch(n), the secondary is in epoch(n-1).

What MC needs is a block device agnostic, controlled and asynchronous 
approach for replicating the contents of block devices and their state 
changes to the secondary VM while the primary VM is running. 
Asynchronous block transfer is important to allow maximum performance 
for the primary VM, while keeping the secondary VM updated with state 
changes.

The block device replication should be possible in two stages or modes.

The first stage is a live copy of all block devices of the primary to 
the secondary. This is necessary if the secondary doesn’t have an 
existing image which is in sync with the primary at the time MC is 
started. This is not very convenient, but as far as I know there is 
currently no mechanism for a persistent dirty bitmap in QEMU.

The second stage (mode) is the replication of block device state changes 
(modified blocks) to keep the image on the secondary in sync with the 
primary. The mirrored blocks must be buffered in RAM (block buffer) 
until the complete checkpoint (RAM, vCPU, device state) can be committed.
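
To make the two stages concrete, here is a rough Python sketch of the idea. All class and method names are invented for illustration; nothing here corresponds to an existing QEMU interface, and the dirty bitmap is deliberately in-memory only, matching the lack of a persistent one:

```python
class PrimaryReplicator:
    """Illustrative sketch: full copy (stage 1), then dirty-block sync (stage 2)."""

    def __init__(self, disk_size_blocks):
        self.size = disk_size_blocks
        self.dirty = set()          # in-memory dirty bitmap, lost on crash

    def guest_write(self, block_no):
        # Track every block the guest modifies since the last sync.
        self.dirty.add(block_no)

    def initial_copy(self):
        # Stage 1: live copy of every block to the secondary.
        return list(range(self.size))

    def incremental_sync(self):
        # Stage 2: ship only the blocks dirtied during the last epoch,
        # then reset the bitmap for the next epoch.
        batch, self.dirty = sorted(self.dirty), set()
        return batch
```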

For keeping the complete system state consistent on the secondary 
system, MC must be able to commit or discard block device state changes. 
In normal operation the mirrored block device state changes (block 
buffer) are committed to disk when the complete checkpoint is committed. 
If the primary system crashes while transferring a checkpoint, the data 
in the block buffer corresponding to the failed checkpoint must be 
discarded.
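
A minimal sketch of these commit/discard semantics on the secondary side (again, all names are invented; a real implementation would live inside a QEMU block driver):

```python
class SecondaryBlockBuffer:
    """Buffers mirrored writes in RAM until MC commits the whole checkpoint."""

    def __init__(self):
        self.disk = {}       # state already committed to the local image
        self.buffer = {}     # writes belonging to the in-flight checkpoint

    def mirror_write(self, block_no, data):
        # Mirrored blocks are only buffered, never applied directly.
        self.buffer[block_no] = data

    def commit_checkpoint(self):
        # Complete checkpoint (RAM, vCPU, device state) received:
        # make the buffered writes durable.
        self.disk.update(self.buffer)
        self.buffer.clear()

    def discard_checkpoint(self):
        # Primary crashed mid-transfer: drop the partial epoch so the
        # image stays consistent with the last committed checkpoint.
        self.buffer.clear()
```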

The storage architecture should be “shared nothing” so that no shared 
storage is required and primary/secondary can have separate block device 
images.

I think this can be achieved with drive-mirror and a filter block 
driver. Another approach could be to exploit the block migration 
functionality of live migration with a filter block driver.

Drive-mirror (and live migration) do not rely on shared storage and 
allow live block device copying and incremental syncing.

A block buffer can be implemented as a QEMU filter block driver. It 
should sit at the same position as the quorum driver in the block driver 
hierarchy. With a block filter approach, MC will be transparent and 
block device agnostic.

The block buffer filter must have an interface which allows MC to 
control the commits or discards of block device state changes. I have no 
idea where to put such an interface to stay conformant with QEMU coding 
style.
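
Just to illustrate the shape such a control interface might take (every name below is made up, and a real implementation would be C code in the block layer, not Python):

```python
class BlockBufferControl:
    """Hypothetical control surface the filter would expose to MC."""

    def __init__(self):
        self.pending = []    # mirrored writes of the in-flight epoch
        self.log = []        # records what MC asked the filter to do

    def queue(self, write):
        self.pending.append(write)

    def commit(self):
        # MC: checkpoint fully received -> make buffered writes durable.
        self.log.append(("commit", len(self.pending)))
        self.pending.clear()

    def discard(self):
        # MC: primary died mid-checkpoint -> drop the partial epoch.
        self.log.append(("discard", len(self.pending)))
        self.pending.clear()


def secondary_epoch(ctrl, checkpoint_arrived_intact):
    # The only coupling MC would need: one control call per epoch outcome.
    if checkpoint_arrived_intact:
        ctrl.commit()
    else:
        ctrl.discard()
```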


I’m sure there are alternative and better approaches, and I’m open to 
any ideas.


Walid

On 17.08.2014 11:52, Paolo Bonzini wrote:
> On 11/08/2014 22:15, Michael R. Hines wrote:
>> Excellent question: QEMU does have a feature called "drive-mirror"
>> in block/mirror.c that was introduced a couple of years ago. I'm not
>> sure what the
>> adoption rate of the feature is, but I would start with that one.
>
> block/mirror.c is asynchronous, and there's no support for communicating
> checkpoints back to the master.  However, the quorum disk driver could
> be what you need.
>
> There's also a series on the mailing list that lets quorum read only
> from the primary, so that quorum can still do replication and fault
> tolerance, but skip fault detection.
>
> Paolo
>
>> There is also a second fault tolerance implementation that works a
>> little differently called
>> "COLO" - you may have seen those emails on the list too, but their
>> method does not require a disk replication solution, if I recall correctly.
>


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-10 15:43                     ` Walid Nouri
@ 2014-09-11  1:50                       ` Michael R. Hines
  2014-09-12  1:34                         ` Hongyang Yang
  2014-09-11  7:27                       ` Paolo Bonzini
  2014-09-11 17:44                       ` Dr. David Alan Gilbert
  2 siblings, 1 reply; 23+ messages in thread
From: Michael R. Hines @ 2014-09-11  1:50 UTC (permalink / raw)
  To: Walid Nouri, Paolo Bonzini, qemu-devel, michael, hinesmr,
	Dr. David Alan Gilbert, Hongyang Yang, Dong Eddie,
	FNST-Gui Jianfeng, wency

On 09/10/2014 11:43 PM, Walid Nouri wrote:
> Hello Michael, Hello Paolo
> i have „studied“ the available documentation/Information and tried to 
> get an idea of the QEMU live block operation possibilities.
>
> I think the MC protocol doesn’t need synchronous block device 
> replication because primary and secondary VM are not synchronous. The 
> state of the primary is always ahead of the state of the secondary. 
> When the primary is in epoch(n) the secondary is in epoch(n-1).
>
> What MC needs is a block device agnostic, controlled and asynchronous 
> approach for replicating the contents of block devices and its state 
> changes to the secondary VM while the primary VM is running. 
> Asynchronous block transfer is important to allow maximum performance 
> for the primary VM, while keeping the secondary VM updated with state 
> changes.
>
> The block device replication should be possible in two stages or modes.
>
> The first stage is the live copy of all block devices of the primary 
> to the secondary. This is necessary if the secondary doesn’t have an 
> existing image which is in sync with the primary at the time MC has 
> started. This is not very convenient but as far as I know actually 
> there is no mechanism for persistent dirty bitmap in QEMU.
>
> The second stage (mode) is the replication of block device state 
> changes (modified blocks) to keep the image on the secondary in sync 
> with the primary. The mirrored blocks must be buffered in ram (block 
> buffer) until the complete Checkpoint (RAM, vCPU, device state) can be 
> committed.
>
> For keeping the complete system state consistent on the secondary 
> system there must be a possibility for MC to commit/discard block 
> device state changes. In normal operation the mirrored block device 
> state changes (block buffer) are committed to disk when the complete 
> checkpoint is committed. In case of a crash of the primary system 
> while transferring a checkpoint the data in the block buffer 
> corresponding to the failed Checkpoint must be discarded.
>
> The storage architecture should be “shared nothing” so that no shared 
> storage is required and primary/secondary can have separate block 
> device images.
>
> I think this can be achieved by drive-mirror and a filter block 
> driver. Another approach could be to exploit the block migration 
> functionality of live migration with a filter block driver.
>
> The drive-mirror (and live migration) does not rely on shared storage 
> and allow live block device copy and incremental syncing.
>
> A block buffer can be implemented with a QEMU filter block driver. It 
> should sit at the same position as the Quorum driver in the block 
> driver hierarchy. When using block filter approach MC will be 
> transparent and block device agnostic.
>
> The block buffer filter must have an Interface which allows MC control 
> the commits or discards of block device state changes. I have no idea 
> where to put such an interface to stay conform with QEMU coding style.
>
>
> I’m sure there are alternative and better approaches and I’m open for 
> any ideas
>
>
> Walid
>
> On 17.08.2014 11:52, Paolo Bonzini wrote:
>> On 11/08/2014 22:15, Michael R. Hines wrote:
>>> Excellent question: QEMU does have a feature called "drive-mirror"
>>> in block/mirror.c that was introduced a couple of years ago. I'm not
>>> sure what the
>>> adoption rate of the feature is, but I would start with that one.
>>
>> block/mirror.c is asynchronous, and there's no support for communicating
>> checkpoints back to the master. However, the quorum disk driver could
>> be what you need.
>>
>> There's also a series on the mailing list that lets quorum read only
>> from the primary, so that quorum can still do replication and fault
>> tolerance, but skip fault detection.
>>
>> Paolo
>>
>>> There is also a second fault tolerance implementation that works a
>>> little differently called
>>> "COLO" - you may have seen those emails on the list too, but their
>>> method does not require a disk replication solution, if I recall 
>>> correctly.
>>
>

Nice description of the problem - would you like to put this information 
on the MC wiki page? (Just send an email to the list that says "request 
for wiki account, please" in the subject, and they will make an account 
for you.)

A drive-mirror + filter driver solution sounds like a good plan overall,
of course the devil is in the details =)

I don't know how much time you have to spend on actual code, but even a 
description of what a "theoretical" interface between MC and 
drive-mirror would look like would go a long way even without code.

Your investigations would also help "drive" a solution to this problem 
for the COLO team as well - I believe they need the same thing....

- Michael


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-10 15:43                     ` Walid Nouri
  2014-09-11  1:50                       ` Michael R. Hines
@ 2014-09-11  7:27                       ` Paolo Bonzini
  2014-09-11 17:44                       ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 23+ messages in thread
From: Paolo Bonzini @ 2014-09-11  7:27 UTC (permalink / raw)
  To: Walid Nouri, Michael R. Hines, qemu-devel, michael

On 10/09/2014 17:43, Walid Nouri wrote:
> The drive-mirror (and live migration) does not rely on shared storage
> and allow live block device copy and incremental syncing.

I think your analysis is right.  However, just for completeness I'll
note that quorum doesn't need shared storage.

To do drive-mirror without shared storage, what you do is run an NBD
server on the destination and point mirroring on the source to the NBD
server.  Similarly, an NBD server on the secondary machine would let
quorum do remote replication without shared storage.
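
For completeness, a sketch of how that setup is typically driven over QMP. The command and argument names below reflect my understanding of the QMP schema of the time (`nbd-server-start`, `nbd-server-add`, `drive-mirror`), so treat the exact spellings as assumptions and verify them against qapi-schema.json before relying on them:

```python
import json

# On the destination: export the target image over NBD.
# (Argument shapes are illustrative; check the QMP schema of your QEMU.)
dest_cmds = [
    {"execute": "nbd-server-start",
     "arguments": {"addr": {"type": "inet",
                            "data": {"host": "0.0.0.0", "port": "10809"}}}},
    {"execute": "nbd-server-add",
     "arguments": {"device": "drive0", "writable": True}},
]

# On the source: mirror the running drive to the NBD export.
src_cmd = {"execute": "drive-mirror",
           "arguments": {"device": "drive0",
                         "target": "nbd:dest-host:10809:exportname=drive0",
                         "sync": "full",        # copy the whole device first
                         "mode": "existing"}}   # destination image already exists

# QMP is just JSON over a socket, so the commands serialize directly.
print(json.dumps(src_cmd))
```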

Paolo


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-10 15:43                     ` Walid Nouri
  2014-09-11  1:50                       ` Michael R. Hines
  2014-09-11  7:27                       ` Paolo Bonzini
@ 2014-09-11 17:44                       ` Dr. David Alan Gilbert
  2014-09-11 22:08                         ` Walid Nouri
                                           ` (2 more replies)
  2 siblings, 3 replies; 23+ messages in thread
From: Dr. David Alan Gilbert @ 2014-09-11 17:44 UTC (permalink / raw)
  To: Walid Nouri
  Cc: kwolf, eddie.dong, qemu-devel, Michael R. Hines, stefanha,
	Paolo Bonzini, yanghy

(I've cc'd in Fam, Stefan, and Kevin for Block stuff, and 
              Yang and Eddie for Colo)

* Walid Nouri (walid.nouri@gmail.com) wrote:
> Hello Michael, Hello Paolo
> I have „studied“ the available documentation/information and tried to
> get an idea of the QEMU live block operation possibilities.
> 
> I think the MC protocol doesn't need synchronous block device replication
> because primary and secondary VM are not synchronous. The state of the
> primary is always ahead of the state of the secondary. When the primary is
> in epoch(n) the secondary is in epoch(n-1).
> 
> What MC needs is a block device agnostic, controlled and asynchronous
> approach for replicating the contents of block devices and its state changes
> to the secondary VM while the primary VM is running. Asynchronous block
> transfer is important to allow maximum performance for the primary VM, while
> keeping the secondary VM updated with state changes.
> 
> The block device replication should be possible in two stages or modes.
> 
> The first stage is the live copy of all block devices of the primary to the
> secondary. This is necessary if the secondary doesn't have an existing
> image which is in sync with the primary at the time MC has started. This is
> not very convenient but as far as I know actually there is no mechanism for
> persistent dirty bitmap in QEMU.
> 
> The second stage (mode) is the replication of block device state changes
> (modified blocks)  to keep the image on the secondary in sync with the
> primary. The mirrored blocks must be buffered in ram (block buffer) until
> the complete Checkpoint (RAM, vCPU, device state) can be committed.
> 
> For keeping the complete system state consistent on the secondary system
> there must be a possibility for MC to commit/discard block device state
> changes. In normal operation the mirrored block device state changes (block
> buffer) are committed to disk when the complete checkpoint is committed. In
> case of a crash of the primary system while transferring a checkpoint the
> data in the block buffer corresponding to the failed Checkpoint must be
> discarded.

I think for COLO there's a requirement that the secondary can do reads/writes
in parallel with the primary, and the secondary can discard those reads/writes
- and that doesn't happen in MC (Yang or Eddie should be able to confirm that).

> The storage architecture should be “shared nothing” so that no shared
> storage is required and primary/secondary can have separate block device
> images.

MC/COLO with shared storage still needs some stuff like this; but it's subtly
different.   They still need to be able to buffer/release modifications
to the shared storage; if any of this code can also be used in the
shared-storage configurations it would be good.

> I think this can be achieved by drive-mirror and a filter block driver.
> Another approach could be to exploit the block migration functionality of
> live migration with a filter block driver.
> 
> The drive-mirror (and live migration) does not rely on shared storage and
> allow live block device copy and incremental syncing.
> 
> A block buffer can be implemented with a QEMU filter block driver. It should
> sit at the same position as the Quorum driver in the block driver hierarchy.
> When using block filter approach MC will be transparent and block device
> agnostic.
> 
> The block buffer filter must have an Interface which allows MC control the
> commits or discards of block device state changes. I have no idea where to
> put such an interface to stay conform with QEMU coding style.
> 
> 
> I'm sure there are alternative and better approaches and I'm open for
> any ideas
> 
> 
> Walid
> 
> On 17.08.2014 11:52, Paolo Bonzini wrote:
> >On 11/08/2014 22:15, Michael R. Hines wrote:
> >>Excellent question: QEMU does have a feature called "drive-mirror"
> >>in block/mirror.c that was introduced a couple of years ago. I'm not
> >>sure what the
> >>adoption rate of the feature is, but I would start with that one.
> >
> >block/mirror.c is asynchronous, and there's no support for communicating
> >checkpoints back to the master.  However, the quorum disk driver could
> >be what you need.
> >
> >There's also a series on the mailing list that lets quorum read only
> >from the primary, so that quorum can still do replication and fault
> >tolerance, but skip fault detection.
> >
> >Paolo
> >
> >>There is also a second fault tolerance implementation that works a
> >>little differently called
> >>"COLO" - you may have seen those emails on the list too, but their
> >>method does not require a disk replication solution, if I recall correctly.
> >
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-11 17:44                       ` Dr. David Alan Gilbert
@ 2014-09-11 22:08                         ` Walid Nouri
  2014-09-12  1:24                         ` Hongyang Yang
  2014-09-12 11:07                         ` Stefan Hajnoczi
  2 siblings, 0 replies; 23+ messages in thread
From: Walid Nouri @ 2014-09-11 22:08 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kwolf, eddie.dong, qemu-devel, Michael R. Hines, stefanha,
	Paolo Bonzini, yanghy

On 11.09.2014 19:44, Dr. David Alan Gilbert wrote:

>> For keeping the complete system state consistent on the secondary system
>> there must be a possibility for MC to commit/discard block device state
>> changes. In normal operation the mirrored block device state changes (block
>> buffer) are committed to disk when the complete checkpoint is committed. In
>> case of a crash of the primary system while transferring a checkpoint the
>> data in the block buffer corresponding to the failed Checkpoint must be
>> discarded.
>
> I think for COLO there's a requirement that the secondary can do reads/writes
> in parallel with the primary, and the secondary can discard those reads/writes
> - and that doesn't happen in MC (Yang or Eddie should be able to confirm that).
>
>> The storage architecture should be “shared nothing” so that no shared
>> storage is required and primary/secondary can have separate block device
>> images.

I admit that my formulation was unintentionally a bit ambiguous :)
I should have written that shared storage should not be mandatory.
I'm coming from an SMB environment, and (redundant) shared storage 
systems are still not usual in small companies :)

I looked for a storage-agnostic approach which keeps the number of 
system components as low as possible while still providing redundancy 
and fault tolerance.

>
> MC/COLO with shared storage still needs some stuff like this; but it's subtely
> different.   They still need to be able to buffer/release modifications
> to the shared storage; if any of this code can also be used in the
> shared-storage configurations it would be good.

The proposed approach with a block filter and the commit/discard 
protocol should be storage agnostic and will also work in a shared 
storage environment, but only with distinct images (because of the 
protocol).

In the case of shared storage with a common image used by the primary 
and secondary, another storage protocol must be used.

It's not commit/discard but commit/rollback

The primary still sends the block state changes asynchronously. The 
secondary buffers the block device state changes but doesn't apply them 
in normal operation. When the next checkpoint is complete, the secondary 
clears the buffer and forgets the old block state data.

If the primary fails, the secondary must roll back the common image with 
the block state data corresponding to the current checkpoint. Otherwise 
the state of the image and the rest of the system state on the secondary 
will not be in sync.

When there is no block state data corresponding to the current 
checkpoint, there is nothing for the secondary to do on the storage :)

There is a little danger in this, though. If the secondary fails during 
the rollback, the common image will be left in an inconsistent state.
I think this risk cannot be avoided when using a common image, but this 
unfortunate situation can also happen in other scenarios.

Sharing a common image with this protocol will lead to a longer failover 
time whenever block device state data exists for the current checkpoint. 
The secondary must initiate the rollback and wait until all blocks of 
the current checkpoint are committed to the common image before taking 
over the active role.
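
One plausible reading of this rollback variant, sketched below, is that the primary ships the pre-write ("undo") contents of each block, so the secondary can restore the shared image to the last committed checkpoint. That interpretation and all names are mine, invented for illustration:

```python
class SharedImageSecondary:
    """Sketch of the commit/rollback variant for a single shared image."""

    def __init__(self):
        self.undo_log = {}   # block_no -> pre-write data for the current epoch

    def mirror_write(self, block_no, old_data):
        # Buffer, but never apply: the primary already wrote the shared
        # image speculatively; we only keep what is needed to undo it.
        self.undo_log[block_no] = old_data

    def checkpoint_complete(self):
        # Next checkpoint committed: the old epoch's undo data is obsolete.
        self.undo_log.clear()

    def primary_failed(self, image):
        # Roll the shared image back to the last committed checkpoint
        # before taking over. If we crash in here, the image is left
        # inconsistent -- the unavoidable risk of a common image.
        for block_no, old_data in self.undo_log.items():
            image[block_no] = old_data
        self.undo_log.clear()
```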


Walid


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-11 17:44                       ` Dr. David Alan Gilbert
  2014-09-11 22:08                         ` Walid Nouri
@ 2014-09-12  1:24                         ` Hongyang Yang
  2014-09-12 11:07                         ` Stefan Hajnoczi
  2 siblings, 0 replies; 23+ messages in thread
From: Hongyang Yang @ 2014-09-12  1:24 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Walid Nouri
  Cc: kwolf, eddie.dong, qemu-devel, Michael R. Hines, stefanha, Paolo Bonzini



On 09/12/2014 01:44 AM, Dr. David Alan Gilbert wrote:
> (I've cc'd in Fam, Stefan, and Kevin for Block stuff, and
>                Yang and Eddie for Colo)
>
> * Walid Nouri (walid.nouri@gmail.com) wrote:
>> Hello Michael, Hello Paolo
>> I have „studied“ the available documentation/information and tried to
>> get an idea of the QEMU live block operation possibilities.
>>
>> I think the MC protocol doesn't need synchronous block device replication
>> because primary and secondary VM are not synchronous. The state of the
>> primary is always ahead of the state of the secondary. When the primary is
>> in epoch(n) the secondary is in epoch(n-1).
>>
>> What MC needs is a block device agnostic, controlled and asynchronous
>> approach for replicating the contents of block devices and its state changes
>> to the secondary VM while the primary VM is running. Asynchronous block
>> transfer is important to allow maximum performance for the primary VM, while
>> keeping the secondary VM updated with state changes.
>>
>> The block device replication should be possible in two stages or modes.
>>
>> The first stage is the live copy of all block devices of the primary to the
>> secondary. This is necessary if the secondary doesn't have an existing
>> image which is in sync with the primary at the time MC has started. This is
>> not very convenient but as far as I know actually there is no mechanism for
>> persistent dirty bitmap in QEMU.
>>
>> The second stage (mode) is the replication of block device state changes
>> (modified blocks)  to keep the image on the secondary in sync with the
>> primary. The mirrored blocks must be buffered in ram (block buffer) until
>> the complete Checkpoint (RAM, vCPU, device state) can be committed.
>>
>> For keeping the complete system state consistent on the secondary system
>> there must be a possibility for MC to commit/discard block device state
>> changes. In normal operation the mirrored block device state changes (block
>> buffer) are committed to disk when the complete checkpoint is committed. In
>> case of a crash of the primary system while transferring a checkpoint the
>> data in the block buffer corresponding to the failed Checkpoint must be
>> discarded.
>
> I think for COLO there's a requirement that the secondary can do reads/writes
> in parallel with the primary, and the secondary can discard those reads/writes
> - and that doesn't happen in MC (Yang or Eddie should be able to confirm that).

Exactly, COLO need this functionality to ensure consistency.

>
>> The storage architecture should be “shared nothing” so that no shared
>> storage is required and primary/secondary can have separate block device
>> images.
>
> MC/COLO with shared storage still needs some stuff like this; but it's subtely
> different.   They still need to be able to buffer/release modifications
> to the shared storage; if any of this code can also be used in the
> shared-storage configurations it would be good.

Shared storage is more complicated; we don't support shared storage currently...

>
>> I think this can be achieved by drive-mirror and a filter block driver.
>> Another approach could be to exploit the block migration functionality of
>> live migration with a filter block driver.
>>
>> The drive-mirror (and live migration) does not rely on shared storage and
>> allow live block device copy and incremental syncing.
>>
>> A block buffer can be implemented with a QEMU filter block driver. It should
>> sit at the same position as the Quorum driver in the block driver hierarchy.
>> When using block filter approach MC will be transparent and block device
>> agnostic.
>>
>> The block buffer filter must have an Interface which allows MC control the
>> commits or discards of block device state changes. I have no idea where to
>> put such an interface to stay conform with QEMU coding style.
>>
>>
>> I'm sure there are alternative and better approaches and I'm open for
>> any ideas
>>
>>
>> Walid
>>
>> On 17.08.2014 11:52, Paolo Bonzini wrote:
>>> On 11/08/2014 22:15, Michael R. Hines wrote:
>>>> Excellent question: QEMU does have a feature called "drive-mirror"
>>>> in block/mirror.c that was introduced a couple of years ago. I'm not
>>>> sure what the
>>>> adoption rate of the feature is, but I would start with that one.
>>>
>>> block/mirror.c is asynchronous, and there's no support for communicating
>>> checkpoints back to the master.  However, the quorum disk driver could
>>> be what you need.
>>>
>>> There's also a series on the mailing list that lets quorum read only
>>> from the primary, so that quorum can still do replication and fault
>>> tolerance, but skip fault detection.
>>>
>>> Paolo
>>>
>>>> There is also a second fault tolerance implementation that works a
>>>> little differently called
>>>> "COLO" - you may have seen those emails on the list too, but their
>>>> method does not require a disk replication solution, if I recall correctly.
>>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> .
>

-- 
Thanks,
Yang.


* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-11  1:50                       ` Michael R. Hines
@ 2014-09-12  1:34                         ` Hongyang Yang
  0 siblings, 0 replies; 23+ messages in thread
From: Hongyang Yang @ 2014-09-12  1:34 UTC (permalink / raw)
  To: Michael R. Hines, Walid Nouri, Paolo Bonzini, qemu-devel,
	michael, hinesmr, Dr. David Alan Gilbert, Dong Eddie,
	FNST-Gui Jianfeng, wency



On 09/11/2014 09:50 AM, Michael R. Hines wrote:
> On 09/10/2014 11:43 PM, Walid Nouri wrote:
>> Hello Michael, Hello Paolo
>> i have „studied“ the available documentation/Information and tried to get an
>> idea of the QEMU live block operation possibilities.
>>
>> I think the MC protocol doesn’t need synchronous block device replication
>> because primary and secondary VM are not synchronous. The state of the primary
>> is always ahead of the state of the secondary. When the primary is in epoch(n)
>> the secondary is in epoch(n-1).
>>
>> What MC needs is a block device agnostic, controlled and asynchronous approach
>> for replicating the contents of block devices and its state changes to the
>> secondary VM while the primary VM is running. Asynchronous block transfer is
>> important to allow maximum performance for the primary VM, while keeping the
>> secondary VM updated with state changes.
>>
>> The block device replication should be possible in two stages or modes.
>>
>> The first stage is the live copy of all block devices of the primary to the
>> secondary. This is necessary if the secondary doesn’t have an existing image
>> which is in sync with the primary at the time MC has started. This is not very
>> convenient but as far as I know actually there is no mechanism for persistent
>> dirty bitmap in QEMU.
>>
>> The second stage (mode) is the replication of block device state changes
>> (modified blocks) to keep the image on the secondary in sync with the primary.
>> The mirrored blocks must be buffered in ram (block buffer) until the complete
>> Checkpoint (RAM, vCPU, device state) can be committed.
>>
>> For keeping the complete system state consistent on the secondary system there
>> must be a possibility for MC to commit/discard block device state changes. In
>> normal operation the mirrored block device state changes (block buffer) are
>> committed to disk when the complete checkpoint is committed. In case of a
>> crash of the primary system while transferring a checkpoint the data in the
>> block buffer corresponding to the failed Checkpoint must be discarded.
>>
>> The storage architecture should be “shared nothing” so that no shared storage
>> is required and primary/secondary can have separate block device images.
>>
>> I think this can be achieved by drive-mirror and a filter block driver.
>> Another approach could be to exploit the block migration functionality of live
>> migration with a filter block driver.
>>
>> The drive-mirror (and live migration) does not rely on shared storage and
>> allow live block device copy and incremental syncing.
>>
>> A block buffer can be implemented with a QEMU filter block driver. It should
>> sit at the same position as the Quorum driver in the block driver hierarchy.
>> When using block filter approach MC will be transparent and block device
>> agnostic.
>>
>> The block buffer filter must have an interface which allows MC to control the
>> commits or discards of block device state changes. I have no idea where to put
>> such an interface so that it conforms to the QEMU coding style.
>>
>>
>> I’m sure there are alternative and better approaches and I’m open for any ideas
>>
>>
>> Walid
>>
>> Am 17.08.2014 11:52, schrieb Paolo Bonzini:
>>> Il 11/08/2014 22:15, Michael R. Hines ha scritto:
>>>> Excellent question: QEMU does have a feature called "drive-mirror"
>>>> in block/mirror.c that was introduced a couple of years ago. I'm not
>>>> sure what the
>>>> adoption rate of the feature is, but I would start with that one.
>>>
>>> block/mirror.c is asynchronous, and there's no support for communicating
>>> checkpoints back to the master. However, the quorum disk driver could
>>> be what you need.
>>>
>>> There's also a series on the mailing list that lets quorum read only
>>> from the primary, so that quorum can still do replication and fault
>>> tolerance, but skip fault detection.
>>>
>>> Paolo
>>>
>>>> There is also a second fault tolerance implementation that works a
>>>> little differently called
>>>> "COLO" - you may have seen those emails on the list too, but their
>>>> method does not require a disk replication solution, if I recall correctly.
>>>
>>
>
> Nice description of the problem - would you like to put this information on the
> MC wiki page? (Just send an email to the list that says "request for wiki
> account, please" in the subject, and they will make an account for you.)
>
> A drive-mirror + filter driver solution sounds like a good plan overall,
> of course the devil is in the details =)

If I understand correctly, this is similar to our approach. Disk replication
is definitely on our roadmap, and we will post RFC patches that include it.
You can keep an eye on the COLO patches:)

>
> I don't know how much time you have to spend on actual code, but even a
> description of what a "theoretical" interface between MC and drive-mirror would
> look like would go a long way even without code.
>
> Your investigations would also help "drive" a solution to this problem for the
> COLO team as well - I believe they need the same thing....
>
> - Michael
>
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-11 17:44                       ` Dr. David Alan Gilbert
  2014-09-11 22:08                         ` Walid Nouri
  2014-09-12  1:24                         ` Hongyang Yang
@ 2014-09-12 11:07                         ` Stefan Hajnoczi
  2014-09-17 20:53                           ` Walid Nouri
  2 siblings, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2014-09-12 11:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kwolf, eddie.dong, qemu-devel, Michael R. Hines, Paolo Bonzini,
	Walid Nouri, yanghy

On Thu, Sep 11, 2014 at 06:44:08PM +0100, Dr. David Alan Gilbert wrote:
> (I've cc'd in Fam, Stefan, and Kevin for Block stuff, and 
>               Yang and Eddie for Colo)
> 
> * Walid Nouri (walid.nouri@gmail.com) wrote:
> > Hello Michael, Hello Paolo
> > I have "studied" the available documentation/information and tried to
> > get an idea of the QEMU live block operation possibilities.
> > 
> > I think the MC protocol doesn't need synchronous block device replication
> > because primary and secondary VM are not synchronous. The state of the
> > primary is always ahead of the state of the secondary. When the primary is
> > in epoch(n) the secondary is in epoch(n-1).

Note that I haven't followed the microcheckpointing or COLO discussions
so I'm not aware of those designs...

> > What MC needs is a block device agnostic, controlled and asynchronous
> > approach for replicating the contents of block devices and their state changes
> > to the secondary VM while the primary VM is running. Asynchronous block
> > transfer is important to allow maximum performance for the primary VM, while
> > keeping the secondary VM updated with state changes.
> > 
> > The block device replication should be possible in two stages or modes.
> > 
> > The first stage is the live copy of all block devices of the primary to the
> > secondary. This is necessary if the secondary doesn't have an existing
> > image which is in sync with the primary at the time MC has started. This is
> > not very convenient, but as far as I know there is currently no mechanism
> > for a persistent dirty bitmap in QEMU.

I think you are trying to address the non-shared storage case where the
secondary needs to acquire the initial state of the primary.

drive-mirror copies the contents of a source disk image to a
destination.  If the guest is running while copying takes place then new
writes will also be mirrored.

drive-mirror should be sufficient for the initial phase where primary
and secondary get in sync.

Fam Zheng sent a patch series earlier this year to add dirty bitmaps for
block devices to QEMU.  It only supported in-memory bitmaps but
persistent bitmaps are fairly straightforward to implement.  I'm
interested in these patches for the incremental backup use case.
https://lists.gnu.org/archive/html/qemu-devel/2014-03/msg05250.html

I guess the reason you mention persistent bitmaps is to save time when
adding a host that previously participated and has an older version of
the disk image?

> > The second stage (mode) is the replication of block device state changes
> > (modified blocks)  to keep the image on the secondary in sync with the
> > primary. The mirrored blocks must be buffered in ram (block buffer) until
> > the complete Checkpoint (RAM, vCPU, device state) can be committed.
> > 
> > For keeping the complete system state consistent on the secondary system
> > there must be a possibility for MC to commit/discard block device state
> > changes. In normal operation the mirrored block device state changes (block
> > buffer) are committed to disk when the complete checkpoint is committed. In
> > case of a crash of the primary system while transferring a checkpoint the
> > data in the block buffer corresponding to the failed Checkpoint must be
> > discarded.

Thoughts:

Writing data safely to disk can take milliseconds.  Not sure how that
figures into your commit step, but I guess commit needs to be fast.

I/O requests happen in parallel with CPU execution, so could an I/O
request be pending across a checkpoint commit?  Live migration does not
migrate inflight requests, although it has special case code for
migration requests that have failed at the host level and need to be
retried.  Another way of putting this is that live migration uses
bdrv_drain_all() to quiesce disks before migrating device state - I
don't think you have that luxury since bdrv_drain_all() can take a long
time and is not suitable for microcheckpointing.

Block devices have the following semantics:
1. There is no ordering between parallel in-flight I/O requests.
2. The guest sees the disk state for completed writes but it may not see
   disk state of in-flight writes (due to #1).
3. Completed writes are only guaranteed to be persistent across power
   failure if a disk cache flush was submitted and completed after the
   writes completed.

> > I think this can be achieved by drive-mirror and a filter block driver.
> > Another approach could be to exploit the block migration functionality of
> > live migration with a filter block driver.

block-migration.c should be avoided because it may be dropped from QEMU.
It is unloved code and has been replaced by drive-mirror.

> > The drive-mirror (and live migration) does not rely on shared storage and
> > allow live block device copy and incremental syncing.
> > 
> > A block buffer can be implemented with a QEMU filter block driver. It should
> > sit at the same position as the Quorum driver in the block driver hierarchy.
> > When using block filter approach MC will be transparent and block device
> > agnostic.
> > 
> > The block buffer filter must have an interface which allows MC to control
> > the commits or discards of block device state changes. I have no idea where
> > to put such an interface so that it conforms to the QEMU coding style.
> > 
> > 
> > I'm sure there are alternative and better approaches and I'm open for
> > any ideas

You can use drive-mirror and the run-time NBD server in QEMU without
modification:

  Primary (drive-mirror)   ---writes--->   Secondary (NBD export in QEMU)
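For concreteness, the QMP plumbing behind that diagram might look roughly like the sketch below. This is only an illustration: the host address, port, and device/export names are assumptions, and the exact argument spellings should be checked against the QMP reference of the QEMU version in use.

```python
import json

# Hypothetical endpoints and device names, for illustration only.
SECONDARY_HOST = "192.168.0.2"   # assumption
NBD_PORT = 10809                 # NBD's default port
EXPORT = "drive0"                # assumption: the mirrored drive's id

# On the secondary: start QEMU's built-in NBD server and export the
# (writable) target image that drive-mirror will populate.
secondary_cmds = [
    {"execute": "nbd-server-start",
     "arguments": {"addr": {"type": "inet",
                            "data": {"host": "0.0.0.0",
                                     "port": str(NBD_PORT)}}}},
    {"execute": "nbd-server-add",
     "arguments": {"device": EXPORT, "writable": True}},
]

# On the primary: mirror the running drive to the secondary's NBD export.
# sync=full copies the whole image first, then keeps mirroring new writes;
# mode=existing reuses the image the secondary already exported.
primary_cmd = {
    "execute": "drive-mirror",
    "arguments": {"device": "drive0",
                  "target": f"nbd://{SECONDARY_HOST}:{NBD_PORT}/{EXPORT}",
                  "sync": "full",
                  "mode": "existing"},
}

for cmd in secondary_cmds + [primary_cmd]:
    print(json.dumps(cmd))
```

Each dict would be sent over the respective QEMU's QMP monitor socket; no QEMU patches are needed for this initial-sync stage.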

Your block filter idea can work and must have the logic so that a commit
operation sent via the microcheckpointing protocol causes the block
filter to write buffered data to disk and flush the host disk cache.

To ensure that the disk image on the secondary is always in a crash
consistent state (i.e. the state you get from power failure), the
secondary needs to know when disk cache flush requests were sent and the
write ordering.  That way, even if there is a power failure while the
secondary is committing, the disk will be in a crash consistent state.
After the secondary (or primary) is booted again file systems or
databases will be able to fsck and resume.

(In other words, in a catastrophic failure you won't be any worse off
than with a power failure on an unprotected single machine.)
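To make the crash-consistency argument concrete, here is a toy model (not QEMU code; all names are made up) of the secondary replaying the primary's I/O stream in order, flushes included. If power fails after any prefix of the stream, the durable state matches some state the primary's own disk could have been in:

```python
# Model: writes land in a volatile cache; a flush makes all completed
# writes durable, preserving the order the primary issued them in.

class ReplayDisk:
    def __init__(self):
        self.cache = {}      # writes not yet guaranteed durable
        self.durable = {}    # state guaranteed to survive power loss

    def apply(self, op):
        if op[0] == "write":
            _, sector, data = op
            self.cache[sector] = data
        elif op[0] == "flush":
            # The flush barrier: everything written so far becomes durable.
            self.durable.update(self.cache)

# A typical journaling-filesystem pattern: commit the journal entry,
# flush, then write the data block it describes.
stream = [("write", 0, b"journal-entry"),
          ("flush",),
          ("write", 7, b"data-block")]

disk = ReplayDisk()
for op in stream:
    disk.apply(op)

# Power failure now: the journal entry is durable, the later data block
# may be lost -- exactly the set of states possible on the primary.
print(disk.durable)   # {0: b'journal-entry'}
```

If the secondary reordered the stream or dropped the flush, it could expose a durable state (data block without journal entry) that the primary's filesystem can never produce, which is the failure mode described above.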

Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-12 11:07                         ` Stefan Hajnoczi
@ 2014-09-17 20:53                           ` Walid Nouri
  2014-09-18 13:56                             ` Stefan Hajnoczi
  0 siblings, 1 reply; 23+ messages in thread
From: Walid Nouri @ 2014-09-17 20:53 UTC (permalink / raw)
  To: Stefan Hajnoczi, Dr. David Alan Gilbert
  Cc: kwolf, eddie.dong, qemu-devel, Michael R. Hines, Paolo Bonzini, yanghy

Thank you for your Time and the detailed answer!
I needed some time to work through your answer ;-)

>>> What MC needs is a block device agnostic, controlled and asynchronous
>>> approach for replicating the contents of block devices and their state changes
>>> to the secondary VM while the primary VM is running. Asynchronous block
>>> transfer is important to allow maximum performance for the primary VM, while
>>> keeping the secondary VM updated with state changes.
>>>
>>> The block device replication should be possible in two stages or modes.
>>>
>>> The first stage is the live copy of all block devices of the primary to the
>>> secondary. This is necessary if the secondary doesn't have an existing
>>> image which is in sync with the primary at the time MC has started. This is
>>> not very convenient, but as far as I know there is currently no mechanism
>>> for a persistent dirty bitmap in QEMU.
>
> I think you are trying to address the non-shared storage cause where the
> secondary needs to acquire the initial state of the primary.

That's correct!
>
> drive-mirror copies the contents of a source disk image to a
> destination.  If the guest is running while copying takes place then new
> writes will also be mirrored.
>
> drive-mirror should be sufficient for the initial phase where primary
> and secondary get in sync.
>
> Fam Zheng sent a patch series earlier this year to add dirty bitmaps for
> block devices to QEMU.  It only supported in-memory bitmaps but
> persistent bitmaps are fairly straightforward to implement.  I'm
> interested in these patches for the incremental backup use case.
> https://lists.gnu.org/archive/html/qemu-devel/2014-03/msg05250.html
>
> I guess the reason you mention persistent bitmaps is to save time when
> adding a host that previously participated and has an older version of
> the disk image?

Yes, it is desirable not to always mirror the whole image before the MC 
protection can become active. This would save time after a loss of 
communication, a shutdown, or maintenance on the secondary.

The persistent dirty bitmap must have a mechanism to identify that a 
pair of images belong to each other and which of the two is the primary 
with the currently valid data. I think that's a self-contained "little" 
project... but the next logical step :)

>
>>> The second stage (mode) is the replication of block device state changes
>>> (modified blocks)  to keep the image on the secondary in sync with the
>>> primary. The mirrored blocks must be buffered in ram (block buffer) until
>>> the complete Checkpoint (RAM, vCPU, device state) can be committed.
>>>
>>> For keeping the complete system state consistent on the secondary system
>>> there must be a possibility for MC to commit/discard block device state
>>> changes. In normal operation the mirrored block device state changes (block
>>> buffer) are committed to disk when the complete checkpoint is committed. In
>>> case of a crash of the primary system while transferring a checkpoint the
>>> data in the block buffer corresponding to the failed Checkpoint must be
>>> discarded.
>
> Thoughts:
>
> Writing data safely to disk can take milliseconds.  Not sure how that
> figures into your commit step, but I guess commit needs to be fast.
>
We have no time to waste ;) but the disk semantics expected by the 
primary should be preserved. The acknowledgement of a checkpoint by the 
secondary will be delayed by the time needed to write all pending I/O 
requests of that checkpoint to disk.
I think for normal operation (just replication) the secondary can use 
the same semantics for disk writes as the primary. Wouldn't that be 
safe enough?

> I/O requests happen in parallel with CPU execution, so could an I/O
> request be pending across a checkpoint commit?  Live migration does not
> migrate inflight requests, although it has special case code for
> migration requests that have failed at the host level and need to be
> retried.  Another way of putting this is that live migration uses
> bdrv_drain_all() to quiesce disks before migrating device state - I
> don't think you have that luxury since bdrv_drain_all() can take a long
> time and is not suitable for microcheckpointing.
>
> Block devices have the following semantics:
> 1. There is no ordering between parallel in-flight I/O requests.
> 2. The guest sees the disk state for completed writes but it may not see
>     disk state of in-flight writes (due to #1).
> 3. Completed writes are only guaranteed to be persistent across power
>     failure if a disk cache flush was submitted and completed after the
>     writes completed.
>
I'm not sure if I got your point.

The proposed MC block device protocol sends all block device state 
updates to the secondary directly after writing them to the primary 
block devices. This keeps the disk semantics for the primary, and the 
secondary stays updated with the disk state changes of the current epoch.

At the end of an epoch the primary is paused to create a system state 
snapshot. At this moment there could be some pending write I/O requests 
on the primary which overlap with the generation of the system state 
snapshot. Do you mean a situation like that?

If this is your point then I think you are right, this is possible... and 
that raises your interesting question: how to deal with pending requests 
at the end of an epoch, or how to be sure that all disk state changes of 
an epoch have been replicated?

Currently the MC protocol only cares about a part of the system state 
(RAM, vCPU, devices) and excludes the block device state changes.

To correctly use drive-mirror functionality the MC protocol must also be 
extended to check that all disk state changes of the primary 
corresponding to the current epoch have been delivered to the secondary.

When all state data is completely sent the checkpoint transaction can be 
committed.

When the checkpoint transaction is complete the secondary commits its 
disk state buffer and the rest (RAM, vCPU, devices) of the checkpoint and 
ACKs the complete checkpoint to the primary.

IMHO the easiest way for MC to track that all block device changes have 
been replicated would be to ask drive-mirror if the paused primary has 
unprocessed write requests.

As long as there are dirty blocks or in-flight requests, the checkpoint 
transaction of the current epoch is not complete.

Maybe you can give me a hint what you think is the best way (API 
call(s)) to ask drive-mirror whether there are pending write operations?

>>> I think this can be achieved by drive-mirror and a filter block driver.
>>> Another approach could be to exploit the block migration functionality of
>>> live migration with a filter block driver.
>
> block-migration.c should be avoided because it may be dropped from QEMU.
> It is unloved code and has been replaced by drive-mirror.

Good to know!!!
I will avoid using block-migration.c.

>
>>> The drive-mirror (and live migration) does not rely on shared storage and
>>> allow live block device copy and incremental syncing.
>>>
>>> A block buffer can be implemented with a QEMU filter block driver. It should
>>> sit at the same position as the Quorum driver in the block driver hierarchy.
>>> When using block filter approach MC will be transparent and block device
>>> agnostic.
>>>
>>> The block buffer filter must have an interface which allows MC to control
>>> the commits or discards of block device state changes. I have no idea where
>>> to put such an interface so that it conforms to the QEMU coding style.
>>>
>>>
>>> I'm sure there are alternative and better approaches and I'm open for
>>> any ideas
>
> You can use drive-mirror and the run-time NBD server in QEMU without
> modification:
>
>    Primary (drive-mirror)   ---writes--->   Secondary (NBD export in QEMU)
>
> Your block filter idea can work and must have the logic so that a commit
> operation sent via the microcheckpointing protocol causes the block
> filter to write buffered data to disk and flush the host disk cache.

That's exactly what the block filter has to do. Where would be the right 
place to put the API call to the block filter flush logic, "blockdev.c"?

> To ensure that the disk image on the secondary is always in a crash
> consistent state (i.e. the state you get from power failure), the
> secondary needs to know when disk cache flush requests were sent and the
> write ordering.  That way, even if there is a power failure while the
> secondary is committing, the disk will be in a crash consistent state.
> After the secondary (or primary) is booted again file systems or
> databases will be able to fsck and resume.
>
> (In other words, in a catastrophic failure you won't be any worse off
> than with a power failure on an unprotected single machine.)

In case of a failover the secondary must drain all disks before becoming 
the new primary, even if there is a delay caused by flushing disk buffers. 
Otherwise the state of the block devices could be inconsistent with the 
rest of the system state when the (new) primary starts processing.


Walid

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-17 20:53                           ` Walid Nouri
@ 2014-09-18 13:56                             ` Stefan Hajnoczi
  2014-09-23 16:36                               ` Walid Nouri
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2014-09-18 13:56 UTC (permalink / raw)
  To: Walid Nouri
  Cc: kwolf, eddie.dong, Dr. David Alan Gilbert, Michael R. Hines,
	qemu-devel, Paolo Bonzini, yanghy

On Wed, Sep 17, 2014 at 10:53:32PM +0200, Walid Nouri wrote:
> >Writing data safely to disk can take milliseconds.  Not sure how that
> >figures into your commit step, but I guess commit needs to be fast.
> >
> We have no time to waste ;) but the disk semantics expected by the primary
> should be preserved. The acknowledgement of a checkpoint by the secondary
> will be delayed by the time needed to write all pending I/O requests of
> that checkpoint to disk.
> I think for normal operation (just replication) the secondary can use the
> same semantics for disk writes as the primary. Wouldn't that be safe
> enough?

There is the issue of request ordering (using write cache flushes).  The
secondary probably needs to perform requests in the same order and
interleave cache flushes in the same way as the primary.  Otherwise a
power failure on the secondary could leave the disk in an invalid state
that is impossible on the primary.  So I'm just pointing out that cache
flush operations matter, not just read/write.

The second, and bigger, point is that if disk commit holds back
checkpoint commit it could be a significant performance problem due to
the slow nature of disks.

There are fancier solutions using either a journal or snapshots that
provide data integrity without posing a performance bottleneck during
the commit phase.

The trick is to apply write requests as they come off the wire on the
secondary but use a journal or snapshot mechanism to enforce commit
semantics.  That way the commit doesn't have to wait for writing out all
the data to disk.
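A minimal sketch of that journal idea follows. This is an assumption about one possible design, not existing QEMU code; all class and method names are made up. Writes for the in-flight epoch land in a journal as they come off the wire; "commit" only seals the journal (one small durable record), so the checkpoint ACK need not wait for bulk writeback, and "discard" drops the open journal without ever touching the image:

```python
class JournaledSecondary:
    def __init__(self):
        self.image = {}        # the real disk image (sector -> data)
        self.open_journal = [] # writes of the in-flight epoch
        self.sealed = []       # committed epochs, not yet applied

    def write(self, sector, data):
        # Applied as it arrives off the wire, but only into the journal.
        self.open_journal.append((sector, data))

    def commit(self):
        # Fast path: seal the epoch with one durable record; the bulk
        # data was already received, so no writeback stalls the ACK.
        self.sealed.append(self.open_journal)
        self.open_journal = []

    def discard(self):
        # Primary died mid-checkpoint: forget the partial epoch.
        self.open_journal = []

    def background_apply(self):
        # Replay sealed journals into the image, preserving write order.
        for journal in self.sealed:
            for sector, data in journal:
                self.image[sector] = data
        self.sealed = []

s = JournaledSecondary()
s.write(1, b"A"); s.commit()    # epoch n completes normally
s.write(2, b"B"); s.discard()   # epoch n+1 fails mid-transfer
s.background_apply()
print(s.image)                  # {1: b'A'} -- only the committed epoch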

> >I/O requests happen in parallel with CPU execution, so could an I/O
> >request be pending across a checkpoint commit?  Live migration does not
> >migrate inflight requests, although it has special case code for
> >migration requests that have failed at the host level and need to be
> >retried.  Another way of putting this is that live migration uses
> >bdrv_drain_all() to quiesce disks before migrating device state - I
> >don't think you have that luxury since bdrv_drain_all() can take a long
> >time and is not suitable for microcheckpointing.
> >
> >Block devices have the following semantics:
> >1. There is no ordering between parallel in-flight I/O requests.
> >2. The guest sees the disk state for completed writes but it may not see
> >    disk state of in-flight writes (due to #1).
> >3. Completed writes are only guaranteed to be persistent across power
> >    failure if a disk cache flush was submitted and completed after the
> >    writes completed.
> >
> I'm not sure if I got your point.
> 
> The proposed MC block device protocol sends all block device state updates
> to the secondary directly after writing them to the primary block devices.
> This keeps the disc semantics for the primary and the secondary stays
> updated with the disc state changes of the actual epoch.
> 
> At the end of an epoch the primary gets paused to create a system state
> snapshot. A this moment there could be some pending write I/O requests on
> the primary which overlap with the generation of the system state snapshot?
> Do you meant a situation like that?

Yes, that's what I meant in the first paragraph.  The primary has not
completed the I/O request yet but QEMU's live migration is currently not
equipped to migrate in-flight requests so we're in trouble!

> If this is your point then I think you are right, this is possible...and
> that raises your interesting question: How to deal with pending requests at
> the end of an epoch or how to be sure that all disc state changes of an
> epoch have been replicated?
> 
> Currently the MC protocol only cares about a part of the system state
> (RAM,vCPU,devices) and excludes the block device state changes.
> 
> To correctly use drive-mirror functionality the MC protocol must also be
> extended to check that all disc state changes of the primary corresponding
> to the current epoch have been delivered to the secondary.
> 
> When all state data is completely sent the checkpoint transaction can be
> committed.
> 
> When the checkpoint transaction is complete the secondary commits its disc
> state buffer and the rest (RAM, vCPU,devices) of the checkpoint and ACKS the
> complete checkpoint to the primary.
> 
> IMHO the easiest way for MC to track that all block device changes have been
> replicated would be to ask drive-mirror if the paused primary has
> unprocessed write requests.
> 
> As long as there are dirty blocks or in-flights, the checkpoint transaction
> of the current epoch is not complete.
> 
> Maybe you can give me a hint what you think is the best way (api call(s)) to
> ask drive-mirror if there are pending write operations???

The details depend on the code and I don't remember everything well
enough.  Anyway, my mental model is:

1. The dirty bit is set *after* the primary has completed the write.
   See bdrv_aligned_pwritev().  Therefore you cannot use the dirty
   bitmap to query in-flight requests, instead you have to look at
   bs->tracked_requests.

2. The mirror block job periodically scans the dirty bitmap (when there
   is no rate-limit set it does this with no artifical delays) and
   writes the dirty blocks.

Given that cache flush requests probably need to be tracked too, maybe
you need MC-specific block driver on the primary to monitor and control
I/O requests.

But I haven't thought this through and it's non-trivial so we need to
break this down more.

> >>>I???m sure there are alternative and better approaches and I???m open for
> >>>any ideas
> >
> >You can use drive-mirror and the run-time NBD server in QEMU without
> >modification:
> >
> >   Primary (drive-mirror)   ---writes--->   Secondary (NBD export in QEMU)
> >
> >Your block filter idea can work and must have the logic so that a commit
> >operation sent via the microcheckpointing protocol causes the block
> >filter to write buffered data to disk and flush the host disk cache.
> 
> That's exactly what the block filter has to do. Where would be the right
> place to put the api call to the block filter flush logic "blockdev.c"?

block.c has the APIs that BlockDriverState nodes support.  For example,
bdrv_invalidate_cache_all() lives in block.c.

The other approach is for MC to offer a listener interface so interested
components can register callbacks that are invoked pre/post commit.

Both techniques are commonly used within QEMU, so I wouldn't worry about
that yet.  Best to decide when you are implementing the code.

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-18 13:56                             ` Stefan Hajnoczi
@ 2014-09-23 16:36                               ` Walid Nouri
  2014-09-24  8:47                                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 23+ messages in thread
From: Walid Nouri @ 2014-09-23 16:36 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, eddie.dong, Dr. David Alan Gilbert, Michael R. Hines,
	qemu-devel, Paolo Bonzini, yanghy

Am 18.09.2014 15:56, schrieb Stefan Hajnoczi:
> There is the issue of request ordering (using write cache flushes).  The
> secondary probably needs to perform requests in the same order and
> interleave cache flushes in the same way as the primary.  Otherwise a
> power failure on the secondary could leave the disk in an invalid state
> that is impossible on the primary.  So I'm just pointing out that cache
> flush operations matter, not just read/write.


To be honest, my thought was that drive-mirror handles all block device 
specific problems especially the cache flush requests for write 
ordering. So my naive approach was to use an existing functionality as a 
kind of black box transport mechanism and build on top of it. But that 
seems to be not possible for the subtle tricky part of the game.

This means the "block filter" on the secondary must ensure the commit 
semantics. But for doing that it must be able to interpret the write 
ordering semantic of a  stream of write requests.

>
> The second, and bigger, point is that if disk commit holds back
> checkpoint commit it could be a significant performance problem due to
> the slow nature of disks.
You are completely right. This would raise the latency for the primary. 
This can be done by changing the proposed protocol to write directly at 
the primary and asynchronously applying updates to the secondary.

> There are fancier solutions using either a journal or snapshots that
> provide data integrity without posing a performance bottleneck during
> the commit phase.
>
> The trick is to apply write requests as they come off the wire on the
> secondary but use a journal or snapshot mechanism to enforce commit
> semantics.  That way the commit doesn't have to wait for writing out all
> the data to disk.
>
Wouldn't that mean to send a kind of protocol information with the 
modified Blocks, a barrier or somthing like that?
Can you please explain a little more what you meant?

> The details depend on the code and I don't remember everything well
> enough.  Anyway, my mental model is:
>
> 1. The dirty bit is set *after* the primary has completed the write.
>     See bdrv_aligned_pwritev().  Therefore you cannot use the dirty
>     bitmap to query in-flight requests, instead you have to look at
>     bs->tracked_requests.
>
> 2. The mirror block job periodically scans the dirty bitmap (when there
>     is no rate-limit set it does this with no artifical delays) and
>     writes the dirty blocks.
>
> Given that cache flush requests probably need to be tracked too, maybe
> you need MC-specific block driver on the primary to monitor and control
> I/O requests.
>
> But I haven't thought this through and it's non-trivial so we need to
> break this down more.
>

As drive-mirror lacks this functionality a way (without changing the 
drive-mirror code) might be a MC-specific mechanism on the primary. This 
mechanism must respect write ordering requests (like forced cache flush, 
and Force Unit Access request) and send corresponding information for a 
stream of blocks to the secondary.

 From what I have learned i'm assuming most guest OS filesystem/block 
layer follows an ordering interface based on SCSI???? As those kind of 
requests must be flaged in an I/O request by the guest operating system 
this should be possible. Do we have the chance to access those 
information in a guest request?

If this is possible does this information survives the journey through 
the nbd-server or must there be another communication channel like the 
QEMUFile approach of “block-migration.c”?

Walid

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-23 16:36                               ` Walid Nouri
@ 2014-09-24  8:47                                 ` Stefan Hajnoczi
  2014-09-25 16:06                                   ` Walid Nouri
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2014-09-24  8:47 UTC (permalink / raw)
  To: Walid Nouri
  Cc: kwolf, eddie.dong, Dr. David Alan Gilbert, Michael R. Hines,
	qemu-devel, Stefan Hajnoczi, Paolo Bonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 4636 bytes --]

On Tue, Sep 23, 2014 at 06:36:42PM +0200, Walid Nouri wrote:
> Am 18.09.2014 15:56, schrieb Stefan Hajnoczi:
> >There is the issue of request ordering (using write cache flushes).  The
> >secondary probably needs to perform requests in the same order and
> >interleave cache flushes in the same way as the primary.  Otherwise a
> >power failure on the secondary could leave the disk in an invalid state
> >that is impossible on the primary.  So I'm just pointing out that cache
> >flush operations matter, not just read/write.
> 
> 
> To be honest, my thought was that drive-mirror handles all block-device
> specific problems, especially the cache flush requests for write ordering.
> So my naive approach was to use an existing functionality as a kind of
> black-box transport mechanism and build on top of it. But that seems not
> to be possible for the subtle, tricky part of the game.

I think the assumption with drive-mirror is that you throw away the
destination image if something fails.  That's the exact opposite of MC
where we want to fail over to the destination :).

> >There are fancier solutions using either a journal or snapshots that
> >provide data integrity without posing a performance bottleneck during
> >the commit phase.
> >
> >The trick is to apply write requests as they come off the wire on the
> >secondary but use a journal or snapshot mechanism to enforce commit
> >semantics.  That way the commit doesn't have to wait for writing out all
> >the data to disk.
> >
> Wouldn't that mean sending some kind of protocol information with the
> modified blocks, a barrier or something like that?
> Can you please explain a little more what you mean?

Here is one example of a mechanism like this:

QEMU has a block job called drive-backup which copies sectors that are
about to be overwritten to an external file.  Once the data has been
copied into the external file, the sectors in the original image file
can be overwritten safely.

The Secondary runs drive-backup so that writes coming from the Primary
stash rollback data into an external qcow2 file.  When the Primary
wishes to commit we drop the qcow2 rollback file since we no longer need
the ability to roll back - this is cheap and not a lot of I/O needs to
be performed for the commit operation.

If the Secondary needs to take over it can use the rollback qcow2 file
as its disk image and the guest will see the state of the disk at the
last commit point.  The sectors that were modified since commit in the
original image file are covered by the data in the rollback qcow2 file.

There are a bunch of details on making this efficient but in principle
this approach makes both commit and rollback fairly lightweight.
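As a toy illustration of the copy-before-write idea (plain Python, not QEMU code; the class and its methods are invented for this sketch):

```python
# Hypothetical sketch of the rollback mechanism described above: before
# a sector of the Secondary's image is overwritten by a write forwarded
# from the Primary, the old contents are stashed in a rollback file.
# Dropping the stash is the cheap commit; merging it back is fail-over.

class RollbackDisk:
    def __init__(self, image):
        self.image = dict(image)   # sector -> data (mirrored image file)
        self.rollback = {}         # stands in for the rollback qcow2 file

    def write(self, sector, data):
        # Copy-before-write: save old data once per commit interval.
        if sector in self.image and sector not in self.rollback:
            self.rollback[sector] = self.image[sector]
        self.image[sector] = data

    def commit(self):
        # Checkpoint acknowledged: rollback data is no longer needed.
        self.rollback.clear()

    def fail_over(self):
        # Primary failed mid-epoch: restore state of the last commit.
        self.image.update(self.rollback)
        self.rollback.clear()

disk = RollbackDisk({0: b"A", 1: b"B"})
disk.write(0, b"A2")      # write from the Primary after the last commit
disk.fail_over()          # Primary crashes before the next commit
assert disk.image == {0: b"A", 1: b"B"}   # last-commit state restored
```

Commit touches no image data at all, while fail-over only rewrites the sectors modified since the last commit, which is what makes both operations lightweight.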

> >The details depend on the code and I don't remember everything well
> >enough.  Anyway, my mental model is:
> >
> >1. The dirty bit is set *after* the primary has completed the write.
> >    See bdrv_aligned_pwritev().  Therefore you cannot use the dirty
> >    bitmap to query in-flight requests, instead you have to look at
> >    bs->tracked_requests.
> >
> >2. The mirror block job periodically scans the dirty bitmap (when there
> >    is no rate-limit set it does this with no artificial delays) and
> >    writes the dirty blocks.
> >
> >Given that cache flush requests probably need to be tracked too, maybe
> >you need an MC-specific block driver on the primary to monitor and
> >control I/O requests.
> >
> >But I haven't thought this through and it's non-trivial so we need to
> >break this down more.
> >
> 
> Since drive-mirror lacks this functionality, one option (without changing
> the drive-mirror code) might be an MC-specific mechanism on the primary.
> This mechanism must respect write-ordering requests (such as forced cache
> flushes and Force Unit Access requests) and send the corresponding
> information for a stream of blocks to the secondary.
> 
> From what I have learned, I assume that most guest OS filesystem/block
> layers follow an ordering interface based on SCSI. As such requests must
> be flagged in an I/O request by the guest operating system, this should
> be possible. Do we have a chance to access this information in a guest
> request?
> 
> If this is possible, does this information survive the journey through the
> NBD server, or must there be another communication channel like the QEMUFile
> approach of “block-migration.c”?

There isn't much information beyond the ordering of writes and cache
flushes, even in SCSI.  But that's okay, we just need to honor the
semantics of block devices.

Stefan

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
  2014-09-24  8:47                                 ` Stefan Hajnoczi
@ 2014-09-25 16:06                                   ` Walid Nouri
  0 siblings, 0 replies; 23+ messages in thread
From: Walid Nouri @ 2014-09-25 16:06 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, eddie.dong, Dr. David Alan Gilbert, Michael R. Hines,
	qemu-devel, Stefan Hajnoczi, Paolo Bonzini, yanghy

Am 24.09.2014 10:47, schrieb Stefan Hajnoczi:

> I think the assumption with drive-mirror is that you throw away the
> destination image if something fails.  That's the exact opposite of MC
> where we want to fail over to the destination :).
This was not obvious to me...

> Here is one example of a mechanism like this:
>
> QEMU has a block job called drive-backup which copies sectors that are
> about to be overwritten to an external file.  Once the data has been
> copied into the external file, the sectors in the original image file
> can be overwritten safely.
>
> The Secondary runs drive-backup so that writes coming from the Primary
> stash rollback data into an external qcow2 file.  When the Primary
> wishes to commit we drop the qcow2 rollback file since we no longer need
> the ability to roll back - this is cheap and not a lot of I/O needs to
> be performed for the commit operation.
>
> If the Secondary needs to take over it can use the rollback qcow2 file
> as its disk image and the guest will see the state of the disk at the
> last commit point.  The sectors that were modified since commit in the
> original image file are covered by the data in the rollback qcow2 file.
>
> There are a bunch of details on making this efficient but in principle
> this approach makes both commit and rollback fairly lightweight.
>
Until yesterday I had seen backup as a mechanism that makes a point-in-time 
snapshot of a block device and saves the contents of that snapshot to 
another block device. Your proposal is a new interpretation of backup :-)

I must admit I had to think twice to get an idea of what your point was.

I don’t know if I have understood all aspects of your proposal, as my 
mental model of a possible architecture is not quite clear yet.

I will try to summarize in “MC-words” what I have understood:

The general idea is to use drive-backup to get a consistent snapshot of 
a mirrored block device on the secondary for a given period of time, 
which I will call epoch(n), with its corresponding snapshot(n).

As a starting point we need two block devices with exactly the same state 
on primary and secondary. In other words, there must be an exact copy of 
the primary image on the secondary.

In epoch(n) the primary mirrors its writes to the image file of the 
secondary. This leads to a continuous stream of updated blocks arriving 
at the secondary's image.

In parallel, the secondary uses drive-backup to take a rollback-snapshot(n) 
of its own image file for each running epoch.

At the beginning of epoch(n+1) we start a (new) rollback-snapshot(n+1) 
and keep rollback-snapshot(n).

In normal operation we drop rollback-snapshot(n) once epoch(n+1) has 
successfully completed.

In case of a failure in epoch(n+1) we fail over and use 
rollback-snapshot(n) to get back the consistent block-device state of 
epoch(n).

Is this your idea?
Does this procedure guarantee the block-device semantics of the primary?

Walid

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2014-09-25 16:06 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <53D8FF52.9000104@gmail.com>
     [not found] ` <1406820870.2680.3.camel@usa>
     [not found]   ` <53DBE726.4050102@gmail.com>
     [not found]     ` <1406947532.2680.11.camel@usa>
     [not found]       ` <53E0AA60.9030404@gmail.com>
     [not found]         ` <1407376929.21497.2.camel@usa>
     [not found]           ` <53E60F34.1070607@gmail.com>
     [not found]             ` <1407587152.24027.5.camel@usa>
2014-08-11 17:22               ` [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency Walid Nouri
2014-08-11 20:15                 ` Michael R. Hines
2014-08-17  9:52                   ` Paolo Bonzini
2014-08-19  8:58                     ` Walid Nouri
2014-09-10 15:43                     ` Walid Nouri
2014-09-11  1:50                       ` Michael R. Hines
2014-09-12  1:34                         ` Hongyang Yang
2014-09-11  7:27                       ` Paolo Bonzini
2014-09-11 17:44                       ` Dr. David Alan Gilbert
2014-09-11 22:08                         ` Walid Nouri
2014-09-12  1:24                         ` Hongyang Yang
2014-09-12 11:07                         ` Stefan Hajnoczi
2014-09-17 20:53                           ` Walid Nouri
2014-09-18 13:56                             ` Stefan Hajnoczi
2014-09-23 16:36                               ` Walid Nouri
2014-09-24  8:47                                 ` Stefan Hajnoczi
2014-09-25 16:06                                   ` Walid Nouri
2014-08-11 20:15                 ` Michael R. Hines
2014-08-13 14:03                   ` Walid Nouri
2014-08-13 22:28                     ` Michael R. Hines
2014-08-14 10:58                       ` Dr. David Alan Gilbert
2014-08-14 17:23                         ` Michael R. Hines
2014-08-19  8:33                         ` Walid Nouri
