* [RFC] KVM Fault Tolerance: Kemari for KVM
@ 2009-11-09  3:53 ` Fernando Luis Vázquez Cao
From: Fernando Luis Vázquez Cao @ 2009-11-09  3:53 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: "大村圭(oomura kei)",
	Yoshiaki Tamura, Takuya Yoshikawa, avi, anthony,
	Andrea Arcangeli, Chris Wright

Hi all,

It has been a while coming, but we have finally started work on
Kemari's port to KVM. For those not familiar with it, Kemari provides
the basic building block to create a virtualization-based fault
tolerant machine: a virtual machine synchronization mechanism.

Traditional high availability solutions can be classified into two
groups: fault tolerant servers, and software clustering.

Broadly speaking, fault tolerant servers protect us against hardware
failures; they generally rely on redundant (often proprietary)
hardware and on hardware failure detection to trigger fail-over.

On the other hand, software clustering, as its name indicates, takes
care of software failures and usually requires a standby server whose
software configuration, at least for the part we are trying to make
fault tolerant, is identical to that of the active server.

Both solutions may be applied to virtualized environments. Indeed,
the current incarnation of Kemari (Xen-based) brings fault tolerant
server-like capabilities to virtual machines, and integration with
existing HA stacks (Heartbeat, RHCS, etc.) is under consideration.

After some time at the drawing board we completed the basic design of
Kemari for KVM, so we are sending an RFC at this point to get early
feedback and, hopefully, get things right from the start. Those
already familiar with Kemari and/or fault tolerance may want to skip
the "Background" and go directly to the design and implementation
bits.

This is a pretty long write-up, but please bear with me.

== Background ==

We started to play around with continuous virtual synchronization
technology about 3 years ago. As development progressed and, most
importantly, we got the first working Xen-based prototypes, it became
clear that we needed a proper name for our toy: Kemari.

The goal of Kemari is to provide a fault tolerant platform for
virtualization environments, so that in the event of a hardware
failure the virtual machine fails over from compromised to properly
operating hardware (a physical machine) in a way that is completely
transparent to the guest operating system.

Although hardware-based fault tolerant servers and HA servers
(software clustering) have been around for a (long) while, they
typically require specially designed hardware and/or modifications
to applications. In contrast, by abstracting the hardware through
virtualization, Kemari can be used on off-the-shelf hardware and
requires no application modifications.

After a period of in-house development the first version of Kemari for
Xen was released in Nov 2008 as open source. However, by then it was
already pretty clear that a KVM port would have several
advantages. First, KVM is integrated into the Linux kernel, which
means one gets support for a wide variety of hardware for
free. Second, and in the same vein, KVM can also benefit from Linux's
low-latency networking capabilities, including RDMA, which is of
paramount importance for an extremely latency-sensitive feature
like Kemari. Last, but not least, KVM and its community are growing
rapidly, and there is increasing demand for Kemari-like functionality
for KVM.

Although the basic design principles will remain the same, our plan is
to write Kemari for KVM from scratch, since there does not seem to be
much opportunity for sharing between Xen and KVM.

== Design outline ==

The basic premise of fault tolerant servers is that when things go
awry with the hardware the running system should transparently
continue execution on an alternate physical host. For this to be
possible the state of the fallback host has to be identical to that of
the primary.

Kemari runs paired virtual machines in an active-passive configuration
and achieves whole-system replication by continuously copying the
state of the system (dirty pages and the state of the virtual devices)
from the active node to the passive node. An interesting implication
of this is that during normal operation only the active node is
actually executing code.

Another possible approach is to run a pair of systems in lock-step
(à la VMware FT). Since both the primary and fallback virtual machines
are active, keeping them synchronized is a complex task, which usually
involves carefully injecting external events into both virtual
machines so that they result in identical states.

The latter approach is extremely architecture specific and not SMP
friendly. This spurred us to try the design that became Kemari, which
we believe lends itself to further optimizations.

== Implementation ==

The first step is to encapsulate the machine to be protected within a
virtual machine. Then the live migration functionality is leveraged to
keep the virtual machines synchronized.

During live migration dirty pages can be sent asynchronously from the
primary to the fallback server until the ratio of dirty pages is low
enough to guarantee very short downtimes. A fault tolerance solution,
in contrast, must synchronously send the changes to the virtual
machine since the previous synchronization point whenever a new
synchronization point is reached.

Since the virtual machine has to be stopped until the data reaches and
is acknowledged by the fallback server, the synchronization model is
of critical importance for performance (both in terms of raw
throughput and latencies). The model chosen for Kemari, along with
other implementation details, is described below.

* Synchronization model

The synchronization points were carefully chosen to minimize the
amount of traffic that goes over the wire while still keeping the
FT pair consistent at all times. To be precise, Kemari uses events
that modify externally visible state as synchronization points. This
means that all outgoing I/O needs to be trapped and sent to the
fallback host before the primary is resumed, so that it can be
replayed in the face of hardware failure.

The basic assumption here is that outgoing I/O operations are
idempotent, which is usually true for disk I/O and reliable network
protocols such as TCP (Kemari may trigger hidden bugs in applications
that use UDP or other unreliable protocols, so those may need minor
changes to ensure they work properly after failover).

The synchronization process can be broken down as follows:

   - Event tapping: On KVM all I/O generates a VMEXIT that is
     synchronously handled by the Linux kernel monitor, i.e. KVM (it is
     worth noting that this applies to virtio devices too, because they
     use MMIO and PIO just like a regular PCI device).

   - VCPU/Guest freezing: This is automatic in the UP case. In SMP
     environments we may need to send an IPI to stop the other VCPUs.

   - Notification to qemu: Taking a page from live migration's
     playbook, the synchronization process is user-space driven, which
     means that qemu needs to be woken up at each synchronization
     point. That is already the case for qemu-emulated devices, but we
     also have in-kernel emulators. To compound the problem, even for
     user-space emulated devices accesses to coalesced MMIO areas
     cannot be detected. As a consequence we need a mechanism to
     communicate KVM-handled events to qemu (see the sketch after
     this list).

     The channel for KVM-qemu communication can be easily built upon
     the existing infrastructure. We just need to add a new page to
     the kvm_run shared memory area that can be mmapped from user space
     and set the exit reason appropriately.

     Regarding in-kernel device emulators, we only need to care about
     writes. Specifically, making kvm_io_bus_write() fail when Kemari
     is activated and invoking the emulator again after re-entry
     from user space should suffice (this is somewhat similar to what
     we do in kvm_arch_vcpu_ioctl_run() for MMIO reads).

     To avoid missing synchronization points one should be careful with
     coalesced-MMIO-like optimizations. In the particular case of
     coalesced MMIO, the I/O operation that caused the exit to user
     space should act as a write barrier when the exit was due to an
     access to a non-coalesced MMIO area. This means that before
     proceeding to handle the exit in kvm_run() we have to make sure
     that all the coalesced MMIO has reached the fallback host.

   - Virtual machine synchronization: All the dirty pages since the
     last synchronization point and the state of the virtual devices are
     sent to the fallback node from the user-space qemu process. For this
     the existing savevm infrastructure and KVM's dirty page tracking
     capabilities can be reused. Regarding in-kernel devices, with the
     likely advent of in-kernel virtio backends we need a generic way
     to access their state from user space, for which, again, the kvm_run
     shared memory area could be used.

   - Virtual machine run: Execution of the virtual machine is resumed
     as soon as synchronization finishes.
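
To make the intended flow more concrete, below is a rough C sketch of
what the user-space side of a synchronization point could look like.
This is only an illustration of the steps above: the kemari_* helpers,
their signatures, and the socket parameter are hypothetical names
invented for this sketch, not existing KVM or qemu interfaces.

    /* Hypothetical qemu-side handling of a Kemari synchronization
     * point. All kemari_* helpers are invented for illustration. */

    extern void kemari_pause_all_vcpus(void);   /* IPI on SMP */
    extern void kemari_resume_all_vcpus(void);
    extern void kemari_flush_coalesced_mmio(void);
    extern int  kemari_send_dirty_pages(int sock);  /* savevm-based */
    extern int  kemari_send_device_state(int sock);
    extern int  kemari_wait_for_ack(int sock);

    /* Called when a trapped event modifies externally visible state. */
    static int kemari_sync_point(int sock)
    {
        kemari_pause_all_vcpus();

        /* Coalesced MMIO must reach the fallback host before the
         * event that triggered this exit is handled. */
        kemari_flush_coalesced_mmio();

        if (kemari_send_dirty_pages(sock) < 0 ||
            kemari_send_device_state(sock) < 0)
            return -1;  /* primary-side error: trigger failover */

        /* The guest stays frozen until the fallback node acks. */
        if (kemari_wait_for_ack(sock) < 0)
            return -1;

        kemari_resume_all_vcpus();
        return 0;
    }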

* Clock

Even though we do not need to worry about the clock that provides the
tick (the counter resides in memory, which we keep synchronized), the
same does not apply to counters such as the TSC (we certainly want to
avoid a situation where counters jump back in time right after
fail-over, breaking guarantees such as monotonicity).

To avoid big hiccups after migration the value of the TSC should be
sent to the fallback node frequently. An access from the guest
(through RDTSC, RDTSCP, RDMSR, or WRMSR) seems like the right moment
to do this. Fortunately, both VMX and SVM provide controls to
intercept accesses to the TSC, so it is just a matter of setting those
appropriately (the "RDTSC exiting" VM-execution control, and the
RDTSC, RDTSCP, RDMSR, and WRMSR instruction intercepts,
respectively). However, since synchronizing the virtual machines every
time the TSC is accessed would be prohibitive, the transmission of the
TSC will be done lazily, that is, delayed until a non-TSC
synchronization point arrives.
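
As a rough illustration of the lazy scheme, the intercept handler
could simply record the last TSC value the guest observed and
piggyback it on the next checkpoint. Again, every name below is
invented for this sketch:

    #include <stdint.h>
    #include <stdbool.h>

    static uint64_t pending_tsc;   /* last TSC value the guest saw */
    static bool     tsc_dirty;

    /* Called from the RDTSC/RDTSCP/RDMSR/WRMSR intercept: record the
     * value but do not synchronize yet. */
    void kemari_note_tsc(uint64_t guest_tsc)
    {
        pending_tsc = guest_tsc;
        tsc_dirty = true;
    }

    extern int kemari_send_tsc(int sock, uint64_t tsc); /* hypothetical */

    /* Called from the next non-TSC synchronization point. */
    int kemari_flush_tsc(int sock)
    {
        if (!tsc_dirty)
            return 0;
        tsc_dirty = false;
        return kemari_send_tsc(sock, pending_tsc);
    }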

* Failover

The failover process kicks in whenever a failure in the primary node is
detected. At the time of writing we just ping the virtual machine
periodically to determine whether it is still alive, but in the long
term we have plans to integrate Kemari with the major HA stacks
(Heartbeat, RHCS, etc).

Ideally, we would like to leverage the hardware failure detection
capabilities of newish x86 hardware to trigger failover, the idea
being that transferring control to the fallback node proactively
when a problem is detected is much faster than relying on the polling
mechanisms used by most HA software.

Finally, to restore the virtual machine on the fallback host the loadvm
infrastructure used for live migration is leveraged.
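
For illustration, the fallback node's side might be structured like
the sketch below: apply checkpoints while the primary is healthy, and
activate the replica through the loadvm path when the primary stops
responding. As before, every kemari_* name is hypothetical:

    #include <stdbool.h>

    extern bool kemari_primary_alive(void);       /* e.g. ping timeout */
    extern int  kemari_receive_checkpoint(int sock);
    extern void kemari_ack_checkpoint(int sock);
    extern void kemari_activate_via_loadvm(void); /* reuse loadvm code */

    void kemari_fallback_loop(int sock)
    {
        for (;;) {
            if (!kemari_primary_alive()) {
                /* Failover: resume the VM from the last acked state. */
                kemari_activate_via_loadvm();
                return;
            }
            if (kemari_receive_checkpoint(sock) == 0)
                kemari_ack_checkpoint(sock);
        }
    }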

* Further information

Please visit the link below for additional information, including
documentation and, most importantly, source code (for Xen only at the
moment).

http://www.osrg.net/kemari
==


Any comments and suggestions would be greatly appreciated.

If this is the right forum and people on the KVM mailing list do not
mind, we would like to use the CC'ed mailing lists for Kemari
development. Having more expert eyes looking at one's code always
helps.

Thanks,

Fernando

* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-09  3:53 ` [Qemu-devel] " Fernando Luis Vázquez Cao
@ 2009-11-12 21:51   ` Dor Laor
From: Dor Laor @ 2009-11-12 21:51 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: kvm, qemu-devel, "大村圭(oomura kei)",
	Yoshiaki Tamura, Takuya Yoshikawa, avi, anthony,
	Andrea Arcangeli, Chris Wright

On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
> [...]
>
> - Virtual machine synchronization: All the dirty pages since the
> last synchronization point and the state of the virtual devices are
> sent to the fallback node from the user-space qemu process. For this
> the existing savevm infrastructure and KVM's dirty page tracking

I failed to understand whether you take the lock-step approach (sync on
every vmexit and make sure the shadow host injects IRQs on the original
guest's instruction boundary) or alternatively use continuous live
snapshots.

If you use live snapshots, why do you need to track MMIO, etc.? Is it in
order to save the device sync stage in live migration? To do that you
would have to fully lock-step qemu execution (or send the entire vmstate
to the slave). Isn't the device part << the dirty pages part?

Thanks,
Dor



* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-12 21:51   ` [Qemu-devel] " Dor Laor
@ 2009-11-13 11:48     ` Yoshiaki Tamura
From: Yoshiaki Tamura @ 2009-11-13 11:48 UTC (permalink / raw)
  To: dlaor
  Cc: Fernando Luis Vázquez Cao, kvm, qemu-devel,
	"大村圭(oomura kei)",
	Takuya Yoshikawa, avi, anthony, Andrea Arcangeli, Chris Wright

Hi,

Thanks for your comments!

Dor Laor wrote:
> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>> [...]
>>
>> - Virtual machine synchronization: All the dirty pages since the
>> last synchronization point and the state of the virtual devices are
>> sent to the fallback node from the user-space qemu process. For this
>> the existing savevm infrastructure and KVM's dirty page tracking
> 
> I failed to understand whether you take the lock-step approach and sync
> every vmexit + make sure the shadow host will inject irqs on the original
> guest's instruction boundary, or alternatively use continuous live
> snapshots.

We'll take the live snapshots approach for now.

> If you use live snapshots, why do you need to track mmio, etc? Is it in
> order to skip the device sync stage of live migration? To do that you
> must fully lock-step qemu execution (or send the entire vmstate to the
> slave). Isn't the device part << the dirty pages part?

We're thinking of capturing mmio operations that affect the state of devices
as synchronization points.  The purpose is to lock-step qemu execution, as
you mentioned.
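
For illustration, a minimal sketch of the kvm_io_bus_write() idea from the
RFC, written against KVM internals of roughly this vintage. kemari_active()
is a hypothetical predicate, and this is a sketch, not the actual Kemari
patch; the loop simply mirrors the existing helper:

/*
 * Sketch only: bounce in-kernel device writes out to user space while
 * Kemari is active, so that qemu can treat them as synchronization
 * points and the write is retried after re-entry from user space.
 */
int kvm_io_bus_write(struct kvm_io_bus *bus, gpa_t addr, int len,
		     const void *val)
{
	int i;

	if (kemari_active())		/* hypothetical predicate */
		return -EOPNOTSUPP;	/* force an exit to user space */

	for (i = 0; i < bus->dev_count; i++)
		if (!kvm_iodevice_write(bus->devs[i], addr, len, val))
			return 0;
	return -EOPNOTSUPP;		/* no device claimed the write */
}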

Thanks,

Yoshi

> 
> Thanks,
> Dor
> 
> 
>> capabilities can be reused. Regarding in-kernel devices, with the
>> likely advent of in-kernel virtio backends we need a generic way
>> to access their state from user-space, for which, again, the kvm_run
>> shared memory area could be used.
>>
>> - Virtual machine run: Execution of the virtual machine is resumed
>> as soon as synchronization finishes.
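
Here is the coalesced-MMIO drain referred to above, as a hedged user-space
sketch. The ring layout is the standard KVM one that lives in the page after
kvm_run when KVM_CAP_COALESCED_MMIO is available; kemari_replicate_mmio()
is hypothetical:

#include <linux/kvm.h>

/* As many entries as fit in the one-page ring (mirrors qemu's own math). */
#define COALESCED_MAX \
	((4096 - sizeof(struct kvm_coalesced_mmio_ring)) / \
	 sizeof(struct kvm_coalesced_mmio))

/* Hypothetical: ship one MMIO write to the fallback and wait for the ack. */
extern void kemari_replicate_mmio(__u64 addr, const void *data, __u32 len);

/* Drain the coalesced-MMIO ring before handling a non-coalesced exit. */
static void kemari_flush_coalesced(struct kvm_coalesced_mmio_ring *ring)
{
	while (ring->first != ring->last) {
		struct kvm_coalesced_mmio *ent =
			&ring->coalesced_mmio[ring->first];

		kemari_replicate_mmio(ent->phys_addr, ent->data, ent->len);

		__sync_synchronize();	/* consume before advancing */
		ring->first = (ring->first + 1) % COALESCED_MAX;
	}
}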
>>
>> * Clock
>>
>> Even though we do not need to worry about the clock that provides the
>> tick (the counter resides in memory, which we keep synchronized), the
>> same does not apply to counters such as the TSC (we certainly want to
>> avoid a situation where counters jump back in time right after
>> fail-over, breaking guarantees such as monotonicity).
>>
>> To avoid big hiccups after migration the value of the TSC should be
>> sent to the fallback node frequently. An access from the guest
>> (through RDTSC, RDTSCP, RDMSR, or WRMSR) seems like the right moment
>> to do this. Fortunately, both vmx and SVM provide controls to
>> intercept accesses to the TSC, so it is just a matter of setting those
>> appropriately ("RDTSC exiting" VM-execution control, and RDTSC,
>> RDTSCP, RDMSR, WRMSR instruction intercepts, respectively). However,
>> since synchronizing the virtual machines every time the TSC is
>> accessed would be prohibitive, the transmission of the TSC will be
>> done lazily, which means delaying it until the next non-TSC
>> synchronization point arrives.
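
A rough kernel-side sketch of that lazy scheme follows; every kemari_* name
and the bookkeeping struct are hypothetical, and only the shape of the logic
is the point:

struct kemari_tsc {
	u64  value;
	bool dirty;
};

extern u64  kemari_read_guest_tsc(struct kvm_vcpu *vcpu); /* hypothetical */
extern void kemari_send_tsc(u64 tsc);                     /* hypothetical */

/* Reached via "RDTSC exiting" on vmx or the RDTSC/MSR intercepts on SVM. */
static int kemari_tsc_intercept(struct kvm_vcpu *vcpu, struct kemari_tsc *t)
{
	t->value = kemari_read_guest_tsc(vcpu);	/* record, don't sync */
	t->dirty = true;
	return 1;				/* resume the guest */
}

/* Called at every real (non-TSC) synchronization point. */
static void kemari_tsc_flush(struct kemari_tsc *t)
{
	if (t->dirty) {
		kemari_send_tsc(t->value);
		t->dirty = false;
	}
	/* ... dirty pages and device state follow, as described above. */
}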
>>
>> * Failover
>>
>> The failover process kicks in whenever a failure in the primary node is
>> detected. At the time of writing we just ping the virtual machine
>> periodically to determine whether it is still alive, but in the long
>> term we have plans to integrate Kemari with the major HA stacks
>> (Heartbeat, RHCS, etc).
>>
>> Ideally, we would like to leverage the hardware failure detection
>> capabilities of newish x86 hardware to trigger failover, the idea
>> being that transferring control to the fallback node proactively
>> when a problem is detected is much faster than relying on the polling
>> mechanisms used by most HA software.
>>
>> Finally, to restore the virtual machine in the fallback host the loadvm
>> infrastructure used for live-migration is leveraged.
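
The interim detector described above amounts to a polling loop along these
lines. A sketch only: the host name is a placeholder, the miss threshold is
arbitrary, and kemari_do_failover() is a stub for what would drive the
loadvm path on the fallback:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void kemari_do_failover(void)
{
	/* Placeholder: here the fallback would loadvm and resume. */
	fprintf(stderr, "primary lost, taking over\n");
}

static bool primary_alive(const char *host)
{
	char cmd[256];

	/* One ICMP echo with a 1s timeout; non-zero status = no reply. */
	snprintf(cmd, sizeof(cmd), "ping -c 1 -W 1 %s >/dev/null 2>&1", host);
	return system(cmd) == 0;
}

int main(void)
{
	int misses = 0;

	for (;;) {
		misses = primary_alive("primary.example") ? 0 : misses + 1;
		if (misses >= 3) {
			kemari_do_failover();
			return 0;
		}
		sleep(1);
	}
}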
>>
>> * Further information
>>
>> Please visit the link below for additional information, including
>> documentation and, most importantly, source code (for Xen only at the
>> moment).
>>
>> http://www.osrg.net/kemari
>> ==
>>
>>
>> Any comments and suggestions would be greatly appreciated.
>>
>> If this is the right forum and people on the KVM mailing list do not
>> mind, we would like to use the CC'ed mailing lists for Kemari
>> development. Having more expert eyes looking at one's code always
>> helps.
>>
>> Thanks,
>>
>> Fernando
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-09  3:53 ` [Qemu-devel] " Fernando Luis Vázquez Cao
@ 2009-11-15 10:35   ` Avi Kivity
  -1 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2009-11-15 10:35 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: kvm, qemu-devel, "大村圭(oomura kei)",
	Yoshiaki Tamura, Takuya Yoshikawa, anthony, Andrea Arcangeli,
	Chris Wright

On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>
> Kemari runs paired virtual machines in an active-passive configuration
> and achieves whole-system replication by continuously copying the
> state of the system (dirty pages and the state of the virtual devices)
> from the active node to the passive node. An interesting implication
> of this is that during normal operation only the active node is
> actually executing code.
>

Can you characterize the performance impact for various workloads?  I 
assume you are running continuously in log-dirty mode.  Doesn't this 
make memory-intensive workloads suffer?

>
> The synchronization process can be broken down as follows:
>
>   - Event tapping: On KVM all I/O generates a VMEXIT that is
>     synchronously handled by the Linux kernel monitor i.e. KVM (it is
>     worth noting that this applies to virtio devices too, because they
>     use MMIO and PIO just like a regular PCI device).

Some I/O (virtio-based) is asynchronous, but you still have well-known 
tap points within qemu.

>
>   - Notification to qemu: Taking a page from live migration's
>     playbook, the synchronization process is user-space driven, which
>     means that qemu needs to be woken up at each synchronization
>     point. That is already the case for qemu-emulated devices, but we
>     also have in-kernel emulators. To compound the problem, even for
>     user-space emulated devices, accesses to coalesced MMIO areas
>     cannot be detected. As a consequence we need a mechanism to
>     communicate KVM-handled events to qemu.

Do you mean the ioapic, pic, and lapic?  Perhaps it's best to start with 
those in userspace (-no-kvm-irqchip).

Why is access to those chips considered a synchronization point?

>   - Virtual machine synchronization: All the dirty pages since the
>     last synchronization point and the state of the virtual devices are
>     sent to the fallback node from the user-space qemu process. For this
>     the existing savevm infrastructure and KVM's dirty page tracking
>     capabilities can be reused. Regarding in-kernel devices, with the
>     likely advent of in-kernel virtio backends we need a generic way
>     to access their state from user-space, for which, again, the kvm_run
>     shared memory area could be used.

I wonder if you can pipeline dirty memory synchronization.  That is, 
write-protect those pages that are dirty, start copying them to the 
other side, and continue execution, copying memory if the guest faults 
it again.

How many pages do you copy per synchronization point for reasonably 
difficult workloads?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-13 11:48     ` [Qemu-devel] " Yoshiaki Tamura
@ 2009-11-15 13:42       ` Dor Laor
  -1 siblings, 0 replies; 26+ messages in thread
From: Dor Laor @ 2009-11-15 13:42 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: Fernando Luis Vázquez Cao, kvm, qemu-devel,
	"大村圭(oomura kei)",
	Takuya Yoshikawa, avi, anthony, Andrea Arcangeli, Chris Wright

On 11/13/2009 01:48 PM, Yoshiaki Tamura wrote:
> Hi,
>
> Thanks for your comments!
>
> Dor Laor wrote:
>> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>> Hi all,
>>>
>>> It has been a while coming, but we have finally started work on
>>> Kemari's port to KVM. For those not familiar with it, Kemari provides
>>> the basic building block to create a virtualization-based fault
>>> tolerant machine: a virtual machine synchronization mechanism.
>>>
>>> Traditional high availability solutions can be classified in two
>>> groups: fault tolerant servers, and software clustering.
>>>
>>> Broadly speaking, fault tolerant servers protect us against hardware
>>> failures and, generally, rely on redundant hardware (often
>>> proprietary), and hardware failure detection to trigger fail-over.
>>>
>>> On the other hand, software clustering, as its name indicates, takes
>>> care of software failures and usually requires a standby server whose
>>> software configuration for the part we are trying to make fault
>>> tolerant must be identical to that of the active server.
>>>
>>> Both solutions may be applied to virtualized environments. Indeed,
>>> the current incarnation of Kemari (Xen-based) brings fault tolerant
>>> server-like capabilities to virtual machines and integration with
>>> existing HA stacks (Heartbeat, RHCS, etc) is under consideration.
>>>
>>> After some time in the drawing board we completed the basic design of
>>> Kemari for KVM, so we are sending an RFC at this point to get early
>>> feedback and, hopefully, get things right from the start. Those
>>> already familiar with Kemari and/or fault tolerance may want to skip
>>> the "Background" and go directly to the design and implementation
>>> bits.
>>>
>>> This is a pretty long write-up, but please bear with me.
>>>
>>> == Background ==
>>>
>>> We started to play around with continuous virtual synchronization
>>> technology about 3 years ago. As development progressed and, most
>>> importantly, we got the first Xen-based working prototypes it became
>>> clear that we needed a proper name for our toy: Kemari.
>>>
>>> The goal of Kemari is to provide a fault tolerant platform for
>>> virtualization environments, so that in the event of a hardware
>>> failure the virtual machine fails over from compromised to properly
>>> operating hardware (a physical machine) in a way that is completely
>>> transparent to the guest operating system.
>>>
>>> Although hardware based fault tolerant servers and HA servers
>>> (software clustering) have been around for a (long) while, they
>>> typically require specifically designed hardware and/or modifications
>>> to applications. In contrast, by abstracting hardware using
>>> virtualization, Kemari can be used on off-the-shelf hardware and no
>>> application modifications are needed.
>>>
>>> After a period of in-house development the first version of Kemari for
>>> Xen was released in Nov 2008 as open source. However, by then it was
>>> already pretty clear that a KVM port would have several
>>> advantages. First, KVM is integrated into the Linux kernel, which
>>> means one gets support for a wide variety of hardware for
>>> free. Second, and in the same vein, KVM can also benefit from Linux'
>>> low latency networking capabilities including RDMA, which is of
>>> paramount importance for a extremely latency-sensitive functionality
>>> like Kemari. Last, but not the least, KVM and its community is growing
>>> rapidly, and there is increasing demand for Kemari-like functionality
>>> for KVM.
>>>
>>> Although the basic design principles will remain the same, our plan is
>>> to write Kemari for KVM from scratch, since there does not seem to be
>>> much opportunity for sharing between Xen and KVM.
>>>
>>> == Design outline ==
>>>
>>> The basic premise of fault tolerant servers is that when things go
>>> awry with the hardware the running system should transparently
>>> continue execution on an alternate physical host. For this to be
>>> possible the state of the fallback host has to be identical to that of
>>> the primary.
>>>
>>> Kemari runs paired virtual machines in an active-passive configuration
>>> and achieves whole-system replication by continuously copying the
>>> state of the system (dirty pages and the state of the virtual devices)
>>> from the active node to the passive node. An interesting implication
>>> of this is that during normal operation only the active node is
>>> actually executing code.
>>>
>>> Another possible approach is to run a pair of systems in lock-step
>>> (à la VMware FT). Since both the primary and fallback virtual machines
>>> are active keeping them synchronized is a complex task, which usually
>>> involves carefully injecting external events into both virtual
>>> machines so that they result in identical states.
>>>
>>> The latter approach is extremely architecture specific and not SMP
>>> friendly. This spurred us to try the design that became Kemari, which
>>> we believe lends itself to further optimizations.
>>>
>>> == Implementation ==
>>>
>>> The first step is to encapsulate the machine to be protected within a
>>> virtual machine. Then the live migration functionality is leveraged to
>>> keep the virtual machines synchronized.
>>>
>>> Whereas during live migration dirty pages can be sent asynchronously
>>> from the primary to the fallback server until the ratio of dirty pages
>>> is low enough to guarantee very short downtimes, a fault tolerance
>>> solution must send the changes made to the virtual machine since the
>>> previous synchronization point synchronously whenever a new one is
>>> reached.
>>>
>>> Since the virtual machine has to be stopped until the data reaches and
>>> is acknowledged by the fallback server, the synchronization model is
>>> of critical importance for performance (both in terms of raw
>>> throughput and latencies). The model chosen for Kemari, along with
>>> other implementation details, is described below.
>>>
>>> * Synchronization model
>>>
>>> The synchronization points were carefully chosen to minimize the
>>> amount of traffic that goes over the wire while still keeping the
>>> FT pair consistent at all times. To be precise, Kemari uses events
>>> that modify externally visible state as synchronization points. This
>>> means that all outgoing I/O needs to be trapped and sent to the
>>> fallback host before the primary is resumed, so that it can be
>>> replayed in the face of hardware failure.
>>>
>>> The basic assumption here is that outgoing I/O operations are
>>> idempotent, which is usually true for disk I/O and reliable network
>>> protocols such as TCP (Kemari may trigger hidden bugs in applications
>>> that use UDP or other unreliable protocols, so those may need minor
>>> changes to ensure they work properly after failover).
>>>
>>> The synchronization process can be broken down as follows:
>>>
>>> - Event tapping: On KVM all I/O generates a VMEXIT that is
>>> synchronously handled by the Linux kernel monitor i.e. KVM (it is
>>> worth noting that this applies to virtio devices too, because they
>>> use MMIO and PIO just like a regular PCI device).
>>>
>>> - VCPU/Guest freezing: This is automatic in the UP case. On SMP
>>> environments we may need to send an IPI to stop the other VCPUs.
>>>
>>> - Notification to qemu: Taking a page from live migration's
>>> playbook, the synchronization process is user-space driven, which
>>> means that qemu needs to be woken up at each synchronization
>>> point. That is already the case for qemu-emulated devices, but we
>>> also have in-kernel emulators. To compound the problem, even for
>>> user-space emulated devices, accesses to coalesced MMIO areas
>>> cannot be detected. As a consequence we need a mechanism to
>>> communicate KVM-handled events to qemu.
>>>
>>> The channel for KVM-qemu communication can easily be built upon
>>> the existing infrastructure. We just need to add a new page to
>>> the kvm_run shared memory area that can be mmapped from user space
>>> and set the exit reason appropriately.
>>>
>>> Regarding in-kernel device emulators, we only need to care about
>>> writes. Specifically, making kvm_io_bus_write() fail when Kemari
>>> is activated and invoking the emulator again after re-entry
>>> from user space should suffice (this is somewhat similar to what
>>> we do in kvm_arch_vcpu_ioctl_run() for MMIO reads).
>>>
>>> To avoid missing synchronization points one should be careful with
>>> coalesced MMIO-like optimizations. In the particular case of
>>> coalesced MMIO, the I/O operation that caused the exit to user
>>> space should act as a write barrier when it was due to an access
>>> to a non-coalesced MMIO area. This means that before proceeding to
>>> handle the exit in kvm_run() we have to make sure that all the
>>> coalesced MMIO has reached the fallback host.
>>>
>>> - Virtual machine synchronization: All the dirty pages since the
>>> last synchronization point and the state of the virtual devices are
>>> sent to the fallback node from the user-space qemu process. For this
>>> the existing savevm infrastructure and KVM's dirty page tracking
>>
>> I failed to understand whether you take the lock-step approach and
>> sync every vmexit + make sure the shadow host will inject irqs on the
>> original guest's instruction boundary, or alternatively use continuous
>> live snapshots.
>
> We'll take the live snapshots approach for now.
>
>> If you use live snapshots, why do you need to track mmio, etc? Is it
>> in order to skip the device sync stage of live migration? To do that
>> you must fully lock-step qemu execution (or send the entire vmstate
>> to the slave). Isn't the device part << the dirty pages part?
>
> We're thinking of capturing mmio operations that affect the state of
> devices as synchronization points. The purpose is to lock-step qemu
> execution, as you mentioned.

The hardest thing in this case will be to inject the virtual irqs into 
the guest on the slave host at the exact instruction boundary at which 
the original virq was injected on the master. You need to count guest 
instructions, use performance monitors, and in the final stages use 
guest breakpoints.

>
> Thanks,
>
> Yoshi
>
>>
>> Thanks,
>> Dor
>>
>>
>>> capabilities can be reused. Regarding in-kernel devices, with the
>>> likely advent of in-kernel virtio backends we need a generic way
>>> to access their state from user-space, for which, again, the kvm_run
>>> shared memory area could be used.
>>>
>>> - Virtual machine run: Execution of the virtual machine is resumed
>>> as soon as synchronization finishes.
>>>
>>> * Clock
>>>
>>> Even though we do not need to worry about the clock that provides the
>>> tick (the counter resides in memory, which we keep synchronized), the
>>> same does not apply to counters such as the TSC (we certainly want to
>>> avoid a situation where counters jump back in time right after
>>> fail-over, breaking guarantees such as monotonicity).
>>>
>>> To avoid big hiccups after migration the value of the TSC should be
>>> sent to the fallback node frequently. An access from the guest
>>> (through RDTSC, RDTSCP, RDMSR, or WRMSR) seems like the right moment
>>> to do this. Fortunately, both vmx and SVM provide controls to
>>> intercept accesses to the TSC, so it is just a matter of setting those
>>> appropriately ("RDTSC exiting" VM-execution control, and RDTSC,
>>> RDTSCP, RDMSR, WRMSR instruction intercepts, respectively). However,
>>> since synchronizing the virtual machines every time the TSC is
>>> accessed would be prohibitive, the transmission of the TSC will be
>>> done lazily, which means delaying it until the next non-TSC
>>> synchronization point arrives.
>>>
>>> * Failover
>>>
>>> The failover process kicks in whenever a failure in the primary node is
>>> detected. At the time of writing we just ping the virtual machine
>>> periodically to determine whether it is still alive, but in the long
>>> term we have plans to integrate Kemari with the major HA stacks
>>> (Heartbeat, RHCS, etc).
>>>
>>> Ideally, we would like to leverage the hardware failure detection
>>> capabilities of newish x86 hardware to trigger failover, the idea
>>> being that transferring control to the fallback node proactively
>>> when a problem is detected is much faster than relying on the polling
>>> mechanisms used by most HA software.
>>>
>>> Finally, to restore the virtual machine in the fallback host the loadvm
>>> infrastructure used for live-migration is leveraged.
>>>
>>> * Further information
>>>
>>> Please visit the link below for additional information, including
>>> documentation and, most importantly, source code (for Xen only at the
>>> moment).
>>>
>>> http://www.osrg.net/kemari
>>> ==
>>>
>>>
>>> Any comments and suggestions would be greatly appreciated.
>>>
>>> If this is the right forum and people on the KVM mailing list do not
>>> mind, we would like to use the CC'ed mailing lists for Kemari
>>> development. Having more expert eyes looking at one's code always
>>> helps.
>>>
>>> Thanks,
>>>
>>> Fernando
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-15 10:35   ` [Qemu-devel] " Avi Kivity
@ 2009-11-16 14:18     ` Fernando Luis Vázquez Cao
  -1 siblings, 0 replies; 26+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-11-16 14:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, qemu-devel, "大村圭(oomura kei)",
	Yoshiaki Tamura, Takuya Yoshikawa, anthony, Andrea Arcangeli,
	Chris Wright

Avi Kivity wrote:
> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>
>> Kemari runs paired virtual machines in an active-passive configuration
>> and achieves whole-system replication by continuously copying the
>> state of the system (dirty pages and the state of the virtual devices)
>> from the active node to the passive node. An interesting implication
>> of this is that during normal operation only the active node is
>> actually executing code.
>>
> 
> Can you characterize the performance impact for various workloads?  I 
> assume you are running continuously in log-dirty mode.  Doesn't this 
> make memory intensive workloads suffer?

Yes, we're running continuously in log-dirty mode.

We still do not have numbers to show for KVM, but
the snippets below from several runs of lmbench
using Xen+Kemari will give you an idea of what you
can expect in terms of overhead. All the tests were
run using a fully virtualized Debian guest with
hardware nested paging enabled.

                      fork exec   sh    P/F  C/S   [us]
------------------------------------------------------
Base                  114  349 1197 1.2845  8.2
Kemari(10GbE) + FC    141  403 1280 1.2835 11.6
Kemari(10GbE) + DRBD  161  415 1388 1.3145 11.6
Kemari(1GbE) + FC     151  410 1335 1.3370 11.5
Kemari(1GbE) + DRBD   162  413 1318 1.3239 11.6
* P/F=page fault, C/S=context switch

The benchmarks above are memory-intensive and, as you
can see, the overhead varies widely from 7% to 40%.
We also measured CPU-bound operations, but, as expected,
Kemari incurred almost no overhead.

>> The synchronization process can be broken down as follows:
>>
>>   - Event tapping: On KVM all I/O generates a VMEXIT that is
>>     synchronously handled by the Linux kernel monitor i.e. KVM (it is
>>     worth noting that this applies to virtio devices too, because they
>>     use MMIO and PIO just like a regular PCI device).
> 
> Some I/O (virtio-based) is asynchronous, but you still have well-known 
> tap points within qemu.

Yep, and in some cases we have polling from the backend, which I forgot to
mention in the RFC.

>>   - Notification to qemu: Taking a page from live migration's
>>     playbook, the synchronization process is user-space driven, which
>>     means that qemu needs to be woken up at each synchronization
>>     point. That is already the case for qemu-emulated devices, but we
>>     also have in-kernel emulators. To compound the problem, even for
>>     user-space emulated devices, accesses to coalesced MMIO areas
>>     cannot be detected. As a consequence we need a mechanism to
>>     communicate KVM-handled events to qemu.
> 
> Do you mean the ioapic, pic, and lapic?

Well, I was more worried about the in-kernel backends currently in the
works. To save the state of those devices we could leverage qemu's vmstate
infrastructure and even reuse struct VMStateDescription's pre_save()
callback, but we would like to pass the device state through the kvm_run
area to avoid an ioctl call right after returning to user space.

> Perhaps it's best to start with those in userspace (-no-kvm-irqchip).

That's precisely what we were planning to do. Once we get a working
prototype we will take care of existing optimizations such as in-kernel
emulators and add our own.

> Why is access to those chips considered a synchronization point?

The main problem with those is that to get the chip state we
use an ioctl when we could have copied it to qemu's memory
before going back to user space. Not all accesses to those chips
need to be treated as synchronization points.
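
To make that concrete, here is a hedged user-space sketch: the
KVM_GET_VCPU_MMAP_SIZE mapping on the /dev/kvm fd is the standard one,
while the extra Kemari page, its offset (which a new KVM_CAP_* check
would have to report), and its layout are purely hypothetical:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <stddef.h>

/* Hypothetical layout of the extra page carrying in-kernel chip state. */
struct kemari_dev_state {
	__u32 len;
	__u8  vmstate[4092];	/* savevm-style device state blob */
};

/* Map the vcpu region and locate the hypothetical Kemari page in it. */
static struct kemari_dev_state *map_kemari_page(int kvm_fd, int vcpu_fd,
						long kemari_page_off)
{
	int sz = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
	void *run;

	if (sz < 0 || kemari_page_off + 4096 > sz)
		return NULL;
	run = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu_fd, 0);
	if (run == MAP_FAILED)
		return NULL;
	return (struct kemari_dev_state *)((char *)run + kemari_page_off);
}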

>>   - Virtual machine synchronization: All the dirty pages since the
>>     last synchronization point and the state of the virtual devices are
>>     sent to the fallback node from the user-space qemu process. For this
>>     the existing savevm infrastructure and KVM's dirty page tracking
>>     capabilities can be reused. Regarding in-kernel devices, with the
>>     likely advent of in-kernel virtio backends we need a generic way
>>     to access their state from user-space, for which, again, the kvm_run
>>     shared memory area could be used.
> 
> I wonder if you can pipeline dirty memory synchronization.  That is, 
> write-protect those pages that are dirty, start copying them to the 
> other side, and continue execution, copying memory if the guest faults 
> it again.

Asynchronous transmission of dirty pages would be really helpful to
eliminate the performance hiccups that tend to occur at synchronization
points. What we can do is to copy dirty pages asynchronously until we reach
a synchronization point, where we need to stop the guest and send the
remaining dirty pages and the state of devices to the other side.

However, we cannot delay the transmission of a dirty page across a
synchronization point, because if the primary node crashed before the
page reached the fallback node the I/O operation that caused the
synchronization point cannot be replayed reliably.
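
In other words, the ordering constraint at a synchronization point is
roughly the following (a sketch only; every name below is a placeholder):

/* Sketch of a synchronization point; all identifiers are placeholders. */
static void kemari_sync_point(IOEvent *ev)
{
    stop_guest();
    /* Everything the fallback needs to replay this event... */
    send_remaining_dirty_pages(fallback);
    send_device_state(fallback);
    wait_for_ack(fallback);
    /* ...must be on the other side before the event becomes visible. */
    release_output(ev);
    resume_guest();
}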

> How many pages do you copy per synchronization point for reasonably 
> difficult workloads?

That is very workload-dependent, but if you take a look at the examples
below you can get a feeling of how Kemari behaves.

IOzone            Kemari sync interval[ms]  dirtied pages
---------------------------------------------------------
buffered + fsync                       400           3000
O_SYNC                                  10             80

In summary, if the guest executes few I/O operations, the interval
between Kemari synchronization points will increase and the number of
dirtied pages will grow accordingly.

Thanks,

Fernando

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-16 14:18     ` [Qemu-devel] " Fernando Luis Vázquez Cao
@ 2009-11-16 14:49       ` Avi Kivity
  -1 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2009-11-16 14:49 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: kvm, qemu-devel, "大村圭(oomura kei)",
	Yoshiaki Tamura, Takuya Yoshikawa, anthony, Andrea Arcangeli,
	Chris Wright

On 11/16/2009 04:18 PM, Fernando Luis Vázquez Cao wrote:
> Avi Kivity wrote:
>> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>>
>>> Kemari runs paired virtual machines in an active-passive configuration
>>> and achieves whole-system replication by continuously copying the
>>> state of the system (dirty pages and the state of the virtual devices)
>>> from the active node to the passive node. An interesting implication
>>> of this is that during normal operation only the active node is
>>> actually executing code.
>>>
>>
>> Can you characterize the performance impact for various workloads?  I 
>> assume you are running continuously in log-dirty mode.  Doesn't this 
>> make memory intensive workloads suffer?
>
> Yes, we're running continuously in log-dirty mode.
>
> We still do not have numbers to show for KVM, but
> the snippets below from several runs of lmbench
> using Xen+Kemari will give you an idea of what you
> can expect in terms of overhead. All the tests were
> run using a fully virtualized Debian guest with
> hardware nested paging enabled.
>
>                      fork exec   sh    P/F  C/S   [us]
> ------------------------------------------------------
> Base                  114  349 1197 1.2845  8.2
> Kemari(10GbE) + FC    141  403 1280 1.2835 11.6
> Kemari(10GbE) + DRBD  161  415 1388 1.3145 11.6
> Kemari(1GbE) + FC     151  410 1335 1.3370 11.5
> Kemari(1GbE) + DRBD   162  413 1318 1.3239 11.6
> * P/F=page fault, C/S=context switch
>
> The benchmarks above are memory intensive and, as you
> can see, the overhead varies widely from 7% to 40%.
> We also measured CPU bound operations, but, as expected,
> Kemari incurred almost no overhead.

Is lmbench fork that memory intensive?

Do you have numbers for benchmarks that use significant anonymous RSS?  
Say, a parallel kernel build.

Note that scaling vcpus will increase a guest's memory-dirtying power 
but snapshot rate will not scale in the same way.

>>>   - Notification to qemu: Taking a page from live migration's
>>>     playbook, the synchronization process is user-space driven, which
>>>     means that qemu needs to be woken up at each synchronization
>>>     point. That is already the case for qemu-emulated devices, but we
>>>     also have in-kernel emulators. To compound the problem, even for
>>>     user-space emulated devices accesses to coalesced MMIO areas can
>>>     not be detected. As a consequence we need a mechanism to
>>>     communicate KVM-handled events to qemu.
>>
>> Do you mean the ioapic, pic, and lapic?
>
> Well, I was more worried about the in-kernel backends currently in the
> works. To save the state of those devices we could leverage qemu's 
> vmstate
> infrastructure and even reuse struct VMStateDescription's pre_save()
> callback, but we would like to pass the device state through the kvm_run
> area to avoid an ioctl call right after returning to user space.

Hm, let's defer all that until we have something working so we can 
estimate the impact of userspace virtio in those circumstances.

>> Why is access to those chips considered a synchronization point?
>
> The main problem with those is that to get the chip state we
> use an ioctl when we could have copied it to qemu's memory
> before going back to user space. Not all accesses to those chips
> need to be treated as synchronization points.

Ok.  Note that piggybacking on an exit will work for the lapic, but not 
for the global irqchips (ioapic, pic) since they can still be modified 
by another vcpu.

>> I wonder if you can pipeline dirty memory synchronization.  That is, 
>> write-protect those pages that are dirty, start copying them to the 
>> other side, and continue execution, copying memory if the guest 
>> faults it again.
>
> Asynchronous transmission of dirty pages would be really helpful to
> eliminate the performance hiccups that tend to occur at synchronization
> points. What we can do is to copy dirty pages asynchronously until we 
> reach
> a synchronization point, where we need to stop the guest and send the
> remaining dirty pages and the state of devices to the other side.
>
> However, we cannot delay the transmission of a dirty page across a
> synchronization point, because if the primary node crashed before the
> page reached the fallback node the I/O operation that caused the
> synchronization point cannot be replayed reliably.

What I mean is:

- choose synchronization point A
- start copying memory for synchronization point A
   - output is delayed
- choose synchronization point B
- copy memory for A and B
    if guest touches memory not yet copied for A, COW it
- once A copying is complete, release A output
- continue copying memory for B
- choose synchronization point C

by keeping two synchronization points active, you don't have any 
pauses.  The cost is maintaining copy-on-write so we can copy dirty 
pages for A while keeping execution.
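
As a rough sketch (placeholder names, not working code):

/* Two checkpoints in flight; all names are placeholders. */
Checkpoint *a = begin_checkpoint();          /* guest keeps running */
delay_output_after(a);
Checkpoint *b = begin_checkpoint();
/* Transmit a's dirty pages in the background; a guest write to a page
 * not yet sent for a COWs just that page. */
transmit_dirty_pages(a, COW_ON_GUEST_WRITE);
release_output(a);                           /* a is complete remotely */
/* keep transmitting b's pages, open checkpoint c, and so on */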

>> How many pages do you copy per synchronization point for reasonably 
>> difficult workloads?
>
> That is very workload-dependent, but if you take a look at the examples
> below you can get a feeling of how Kemari behaves.
>
> IOzone            Kemari sync interval[ms]  dirtied pages
> ---------------------------------------------------------
> buffered + fsync                       400           3000
> O_SYNC                                  10             80
>
> In summary, if the guest executes few I/O operations, the interval
> between Kemari synchronization points will increase and the number of
> dirtied pages will grow accordingly.

In the example above, the externally observed latency grows to 400 ms, yes?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-16 14:49       ` [Qemu-devel] " Avi Kivity
@ 2009-11-17 11:04         ` Yoshiaki Tamura
  -1 siblings, 0 replies; 26+ messages in thread
From: Yoshiaki Tamura @ 2009-11-17 11:04 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Fernando Luis Vázquez Cao, kvm, qemu-devel,
	"大村圭(oomura kei)",
	Takuya Yoshikawa, anthony, Andrea Arcangeli, Chris Wright

Avi Kivity wrote:
> On 11/16/2009 04:18 PM, Fernando Luis Vázquez Cao wrote:
>> Avi Kivity wrote:
>>> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>>>
>>>> Kemari runs paired virtual machines in an active-passive configuration
>>>> and achieves whole-system replication by continuously copying the
>>>> state of the system (dirty pages and the state of the virtual devices)
>>>> from the active node to the passive node. An interesting implication
>>>> of this is that during normal operation only the active node is
>>>> actually executing code.
>>>>
>>>
>>> Can you characterize the performance impact for various workloads?  I 
>>> assume you are running continuously in log-dirty mode.  Doesn't this 
>>> make memory intensive workloads suffer?
>>
>> Yes, we're running continuously in log-dirty mode.
>>
>> We still do not have numbers to show for KVM, but
>> the snippets below from several runs of lmbench
>> using Xen+Kemari will give you an idea of what you
>> can expect in terms of overhead. All the tests were
>> run using a fully virtualized Debian guest with
>> hardware nested paging enabled.
>>
>>                      fork exec   sh    P/F  C/S   [us]
>> ------------------------------------------------------
>> Base                  114  349 1197 1.2845  8.2
>> Kemari(10GbE) + FC    141  403 1280 1.2835 11.6
>> Kemari(10GbE) + DRBD  161  415 1388 1.3145 11.6
>> Kemari(1GbE) + FC     151  410 1335 1.3370 11.5
>> Kemari(1GbE) + DRBD   162  413 1318 1.3239 11.6
>> * P/F=page fault, C/S=context switch
>>
>> The benchmarks above are memory intensive and, as you
>> can see, the overhead varies widely from 7% to 40%.
>> We also measured CPU bound operations, but, as expected,
>> Kemari incurred almost no overhead.
> 
> Is lmbench fork that memory intensive?
> 
> Do you have numbers for benchmarks that use significant anonymous RSS?  
> Say, a parallel kernel build.
> 
> Note that scaling vcpus will increase a guest's memory-dirtying power 
> but snapshot rate will not scale in the same way.

I don't think lmbench is intensive but it's sensitive to memory latency.
We'll measure kernel build time with minimum config, and post it later.

>>>>   - Notification to qemu: Taking a page from live migration's
>>>>     playbook, the synchronization process is user-space driven, which
>>>>     means that qemu needs to be woken up at each synchronization
>>>>     point. That is already the case for qemu-emulated devices, but we
>>>>     also have in-kernel emulators. To compound the problem, even for
>>>>     user-space emulated devices accesses to coalesced MMIO areas can
>>>>     not be detected. As a consequence we need a mechanism to
>>>>     communicate KVM-handled events to qemu.
>>>
>>> Do you mean the ioapic, pic, and lapic?
>>
>> Well, I was more worried about the in-kernel backends currently in the
>> works. To save the state of those devices we could leverage qemu's 
>> vmstate
>> infrastructure and even reuse struct VMStateDescription's pre_save()
>> callback, but we would like to pass the device state through the kvm_run
>> area to avoid an ioctl call right after returning to user space.
> 
> Hm, let's defer all that until we have something working so we can 
> estimate the impact of userspace virtio in those circumstances.

OK.  We'll start implementing everything in userspace first.

>>> Why is access to those chips considered a synchronization point?
>>
>> The main problem with those is that to get the chip state we
>> use an ioctl when we could have copied it to qemu's memory
>> before going back to user space. Not all accesses to those chips
>> need to be treated as synchronization points.
> 
> Ok.  Note that piggybacking on an exit will work for the lapic, but not 
> for the global irqchips (ioapic, pic) since they can still be modified 
> by another vcpu.
> 
>>> I wonder if you can pipeline dirty memory synchronization.  That is, 
>>> write-protect those pages that are dirty, start copying them to the 
>>> other side, and continue execution, copying memory if the guest 
>>> faults it again.
>>
>> Asynchronous transmission of dirty pages would be really helpful to
>> eliminate the performance hiccups that tend to occur at synchronization
>> points. What we can do is to copy dirty pages asynchronously until we 
>> reach
>> a synchronization point, where we need to stop the guest and send the
>> remaining dirty pages and the state of devices to the other side.
>>
>> However, we cannot delay the transmission of a dirty page across a
>> synchronization point, because if the primary node crashed before the
>> page reached the fallback node the I/O operation that caused the
>> synchronization point cannot be replayed reliably.
> 
> What I mean is:
> 
> - choose synchronization point A
> - start copying memory for synchronization point A
>   - output is delayed
> - choose synchronization point B
> - copy memory for A and B
>    if guest touches memory not yet copied for A, COW it
> - once A copying is complete, release A output
> - continue copying memory for B
> - choose synchronization point C
> 
> by keeping two synchronization points active, you don't have any 
> pauses.  The cost is maintaining copy-on-write so we can copy dirty 
> pages for A while keeping execution.

The overall idea seems good, but if I'm understanding correctly, we need a 
buffer for copying memory locally, and when it gets full, or when we COW the 
memory for B, we still have to pause the guest to prevent it from overwriting. Correct?

To make things simple, we would like to start with the synchronous transmission 
first, and tackle asynchronous transmission later.

>>> How many pages do you copy per synchronization point for reasonably 
>>> difficult workloads?
>>
>> That is very workload-dependent, but if you take a look at the examples
>> below you can get a feeling of how Kemari behaves.
>>
>> IOzone            Kemari sync interval[ms]  dirtied pages
>> ---------------------------------------------------------
>> buffered + fsync                       400           3000
>> O_SYNC                                  10             80
>>
>> In summary, if the guest executes few I/O operations, the interval
>> between Kemari synchronization points will increase and the number of
>> dirtied pages will grow accordingly.
> 
> In the example above, the externally observed latency grows to 400 ms, yes?

Not exactly.  The sync interval refers to the interval of synchronization points 
captured when the workload is running.  In the example above, when the observed 
sync interval is 400ms, it takes about 150ms to sync VMs with 3000 dirtied 
pages.  Kemari resumes I/O operations immediately once the synchronization is 
finished, and thus, the externally observed latency is 150ms in this case.

Thanks,

Yoshi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-17 11:04         ` [Qemu-devel] " Yoshiaki Tamura
@ 2009-11-17 12:15           ` Avi Kivity
  -1 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2009-11-17 12:15 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: Fernando Luis Vázquez Cao, kvm, qemu-devel,
	"大村圭(oomura kei)",
	Takuya Yoshikawa, anthony, Andrea Arcangeli, Chris Wright

On 11/17/2009 01:04 PM, Yoshiaki Tamura wrote:
>> What I mean is:
>>
>> - choose synchronization point A
>> - start copying memory for synchronization point A
>>   - output is delayed
>> - choose synchronization point B
>> - copy memory for A and B
>>    if guest touches memory not yet copied for A, COW it
>> - once A copying is complete, release A output
>> - continue copying memory for B
>> - choose synchronization point C
>>
>> by keeping two synchronization points active, you don't have any 
>> pauses.  The cost is maintaining copy-on-write so we can copy dirty 
>> pages for A while keeping execution.
>
>
> The overall idea seems good, but if I'm understanding correctly, we 
> need a buffer for copying memory locally, and when it gets full, or 
> when we COW the memory for B, we still have to pause the guest to 
> prevent it from overwriting. Correct?

Yes.  During COW the guest would not be able to access the page, but if 
other vcpus access other pages, they can still continue.  So generally 
synchronization would be pauseless.

> To make things simple, we would like to start with the synchronous 
> transmission first, and tackle asynchronous transmission later.

Of course.  I'm just worried that realistic workloads will drive the 
latency beyond acceptable limits.

>
>>>> How many pages do you copy per synchronization point for reasonably 
>>>> difficult workloads?
>>>
>>> That is very workload-dependent, but if you take a look at the examples
>>> below you can get a feeling of how Kemari behaves.
>>>
>>> IOzone            Kemari sync interval[ms]  dirtied pages
>>> ---------------------------------------------------------
>>> buffered + fsync                       400           3000
>>> O_SYNC                                  10             80
>>>
>>> In summary, if the guest executes few I/O operations, the interval
>>> between Kemari synchronization points will increase and the number of
>>> dirtied pages will grow accordingly.
>>
>> In the example above, the externally observed latency grows to 400 
>> ms, yes?
>
> Not exactly.  The sync interval refers to the interval of 
> synchronization points captured when the workload is running.  In the 
> example above, when the observed sync interval is 400ms, it takes 
> about 150ms to sync VMs with 3000 dirtied pages.  Kemari resumes I/O 
> operations immediately once the synchronization is finished, and thus, 
> the externally observed latency is 150ms in this case.

Not sure I understand.

If a packet is output from a guest immediately after a synchronization 
point, doesn't it need to be delayed until the next synchronization 
point?  So it's not just the guest pause time that matters, but also the 
interval between sync points?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-17 12:15           ` [Qemu-devel] " Avi Kivity
@ 2009-11-17 14:06             ` Yoshiaki Tamura
  -1 siblings, 0 replies; 26+ messages in thread
From: Yoshiaki Tamura @ 2009-11-17 14:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Fernando Luis Vázquez Cao, kvm, qemu-devel,
	大村圭(oomura kei),
	Takuya Yoshikawa, anthony, Andrea Arcangeli, Chris Wright

2009/11/17 Avi Kivity <avi@redhat.com>:
> On 11/17/2009 01:04 PM, Yoshiaki Tamura wrote:
>>>
>>> What I mean is:
>>>
>>> - choose synchronization point A
>>> - start copying memory for synchronization point A
>>>  - output is delayed
>>> - choose synchronization point B
>>> - copy memory for A and B
>>>   if guest touches memory not yet copied for A, COW it
>>> - once A copying is complete, release A output
>>> - continue copying memory for B
>>> - choose synchronization point C
>>>
>>> by keeping two synchronization points active, you don't have any pauses.
>>>  The cost is maintaining copy-on-write so we can copy dirty pages for A
>>> while keeping execution.
>>
>>
>> The overall idea seems good, but if I'm understanding correctly, we need a
>> buffer for copying memory locally, and when it gets full, or when we COW the
>> memory for B, we still have to pause the guest to prevent it from overwriting.
>> Correct?
>
> Yes.  During COW the guest would not be able to access the page, but if
> other vcpus access other pages, they can still continue.  So generally
> synchronization would be pauseless.

Understood.

>> To make things simple, we would like to start with the synchronous
>> transmission first, and tackle asynchronous transmission later.
>
> Of course.  I'm just worried that realistic workloads will drive the latency
> beyond acceptable limits.

We're paying attention to this issue too, and would like to do more advanced
stuff once there is a toy that runs on KVM.

>>>>> How many pages do you copy per synchronization point for reasonably
>>>>> difficult workloads?
>>>>
>>>> That is very workload-dependent, but if you take a look at the examples
>>>> below you can get a feeling of how Kemari behaves.
>>>>
>>>> IOzone            Kemari sync interval[ms]  dirtied pages
>>>> ---------------------------------------------------------
>>>> buffered + fsync                       400           3000
>>>> O_SYNC                                  10             80
>>>>
>>>> In summary, if the guest executes few I/O operations, the interval
>>>> between Kemari synchronization points will increase and the number of
>>>> dirtied pages will grow accordingly.
>>>
>>> In the example above, the externally observed latency grows to 400 ms,
>>> yes?
>>
>> Not exactly.  The sync interval refers to the interval of synchronization
>> points captured when the workload is running.  In the example above, when
>> the observed sync interval is 400ms, it takes about 150ms to sync VMs with
>> 3000 dirtied pages.  Kemari resumes I/O operations immediately once the
>> synchronization is finished, and thus, the externally observed latency is
>> 150ms in this case.
>
> Not sure I understand.
>
> If a packet is output from a guest immediately after a synchronization
> point, doesn't it need to be delayed until the next synchronization point?

Kemari kicks off synchronization in an event-driven manner,
so the packet itself is captured as a synchronization point
and starts the synchronization immediately.

>  So it's not just the guest pause time that matters, but also the interval
> between sync points?

It does matter, and in the case of Kemari, the interval between sync points
varies depending on what kind of workload is running.

In the IOzone example above, two types of workloads are demonstrated.
Buffered writes w/ fsync create fewer sync points, which leads to longer sync
intervals and more dirtied pages.  On the other hand, O_SYNC writes create
more sync points, which leads to shorter sync intervals and fewer dirtied pages.

The benefit of the event-driven approach is that you don't have to start
synchronization until there is a specific event to be captured, no matter how
many pages the guest may have dirtied.
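
Roughly (a sketch; all names are placeholders):

/* Event-driven trigger. */
static void on_outbound_event(IOEvent *ev)
{
    /* The packet tx (or disk write) is itself the synchronization
     * point; until such an event occurs nothing is sent to the
     * fallback, no matter how much memory the guest has dirtied. */
    kemari_synchronize();    /* dirty pages + device state, then ack */
    release_event(ev);
}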

Thanks,

Yoshi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-17 11:04         ` [Qemu-devel] " Yoshiaki Tamura
@ 2009-11-18 13:28           ` Yoshiaki Tamura
  -1 siblings, 0 replies; 26+ messages in thread
From: Yoshiaki Tamura @ 2009-11-18 13:28 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Fernando Luis Vázquez Cao, kvm, qemu-devel,
	大村圭(oomura kei),
	Takuya Yoshikawa, anthony, Andrea Arcangeli, Chris Wright

2009/11/17 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> Avi Kivity wrote:
>>
>> On 11/16/2009 04:18 PM, Fernando Luis Vázquez Cao wrote:
>>>
>>> Avi Kivity wrote:
>>>>
>>>> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>>>>
>>>>> Kemari runs paired virtual machines in an active-passive configuration
>>>>> and achieves whole-system replication by continuously copying the
>>>>> state of the system (dirty pages and the state of the virtual devices)
>>>>> from the active node to the passive node. An interesting implication
>>>>> of this is that during normal operation only the active node is
>>>>> actually executing code.
>>>>>
>>>>
>>>> Can you characterize the performance impact for various workloads?  I
>>>> assume you are running continuously in log-dirty mode.  Doesn't this make
>>>> memory intensive workloads suffer?
>>>
>>> Yes, we're running continuously in log-dirty mode.
>>>
>>> We still do not have numbers to show for KVM, but
>>> the snippets below from several runs of lmbench
>>> using Xen+Kemari will give you an idea of what you
>>> can expect in terms of overhead. All the tests were
>>> run using a fully virtualized Debian guest with
>>> hardware nested paging enabled.
>>>
>>>                     fork exec   sh    P/F  C/S   [us]
>>> ------------------------------------------------------
>>> Base                  114  349 1197 1.2845  8.2
>>> Kemari(10GbE) + FC    141  403 1280 1.2835 11.6
>>> Kemari(10GbE) + DRBD  161  415 1388 1.3145 11.6
>>> Kemari(1GbE) + FC     151  410 1335 1.3370 11.5
>>> Kemari(1GbE) + DRBD   162  413 1318 1.3239 11.6
>>> * P/F=page fault, C/S=context switch
>>>
>>> The benchmarks above are memory intensive and, as you
>>> can see, the overhead varies widely from 7% to 40%.
>>> We also measured CPU bound operations, but, as expected,
>>> Kemari incurred almost no overhead.
>>
>> Is lmbench fork that memory intensive?
>>
>> Do you have numbers for benchmarks that use significant anonymous RSS?
>>  Say, a parallel kernel build.
>>
>> Note that scaling vcpus will increase a guest's memory-dirtying power but
>> snapshot rate will not scale in the same way.
>
> I don't think lmbench is intensive but it's sensitive to memory latency.
> We'll measure kernel build time with minimum config, and post it later.

Here are some quick numbers for parallel kernel compile time.
The number of vcpus is 1, just for convenience.

time make -j 2 all
-----------------------------------------------------------------------------
Base:    real 1m13.950s (user 1m2.742s, sys 0m10.446s)
Kemari: real 1m22.720s (user 1m5.882s, sys 0m10.882s)

time make -j 4 all
-----------------------------------------------------------------------------
Base:    real 1m11.234s (user 1m2.582s, sys 0m8.643s)
Kemari: real 1m26.964s (user 1m6.530s, sys 0m12.194s)

The result of Kemari includes everything, meaning dirty page tracking and
synchronization upon I/O operations to the disk.
The compile time using j=4 under Kemari was worse than that with j=2,
but I'm not sure whether this is due to dirty page tracking or the sync interval.

Thanks,

Yoshi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-18 13:28           ` [Qemu-devel] " Yoshiaki Tamura
@ 2009-11-18 13:58             ` Avi Kivity
  -1 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2009-11-18 13:58 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: Fernando Luis Vázquez Cao, kvm, qemu-devel,
	"大村圭(oomura kei)",
	Takuya Yoshikawa, anthony, Andrea Arcangeli, Chris Wright

On 11/18/2009 03:28 PM, Yoshiaki Tamura wrote:
>
>> I don't think lmbench is intensive but it's sensitive to memory latency.
>> We'll measure kernel build time with minimum config, and post it later.
>>      
> Here are some quick numbers for parallel kernel compile time.
> The number of vcpus is 1, just for convenience.
>
> time make -j 2 all
> -----------------------------------------------------------------------------
> Base:    real 1m13.950s (user 1m2.742s, sys 0m10.446s)
> Kemari: real 1m22.720s (user 1m5.882s, sys 0m10.882s)
>
> time make -j 4 all
> -----------------------------------------------------------------------------
> Base:    real 1m11.234s (user 1m2.582s, sys 0m8.643s)
> Kemari: real 1m26.964s (user 1m6.530s, sys 0m12.194s)
>
> The result of Kemari includes everything, meaning dirty page tracking and
> synchronization upon I/O operations to the disk.
> The compile time using j=4 under Kemari was worse than that with j=2,
> but I'm not sure whether this is due to dirty page tracking or the sync interval.
>    

Do disk writes trigger synchronization?  Otherwise this is a very 
relaxed test.  I'm surprised the degradation is so low running 
continuously in log-dirty mode.

Is this an npt or ept system?  Without npt or ept I'd expect less 
degradation since the page tables are heavily manipulated anyway.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: [RFC] KVM Fault Tolerance: Kemari for KVM
  2009-11-18 13:58             ` [Qemu-devel] " Avi Kivity
@ 2009-11-19  3:43               ` Yoshiaki Tamura
  -1 siblings, 0 replies; 26+ messages in thread
From: Yoshiaki Tamura @ 2009-11-19  3:43 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Fernando Luis Vázquez Cao, kvm, qemu-devel,
	大村圭(oomura kei),
	Takuya Yoshikawa, anthony, Andrea Arcangeli, Chris Wright

2009/11/18 Avi Kivity <avi@redhat.com>:
> On 11/18/2009 03:28 PM, Yoshiaki Tamura wrote:
>>
>>> I don't think lmbench is memory intensive, but it is sensitive to memory latency.
>>> We'll measure kernel build time with a minimal config and post it later.
>>>
>>
>> Here are some quick numbers for parallel kernel compile time.
>> The number of vcpus is 1, just for convenience.
>>
>> time make -j 2 all
>>
>> -----------------------------------------------------------------------------
>> Base:    real 1m13.950s (user 1m2.742s, sys 0m10.446s)
>> Kemari: real 1m22.720s (user 1m5.882s, sys 0m10.882s)
>>
>> time make -j 4 all
>>
>> -----------------------------------------------------------------------------
>> Base:    real 1m11.234s (user 1m2.582s, sys 0m8.643s)
>> Kemari: real 1m26.964s (user 1m6.530s, sys 0m12.194s)
>>
>> The Kemari results include everything, i.e. dirty-page tracking and
>> synchronization upon I/O operations to the disk.
>> The compile time with -j 4 under Kemari was worse than with -j 2
>> (roughly 22% overhead versus roughly 12%), but I'm not sure whether
>> this is due to dirty-page tracking or to the synchronization interval.
>>
>
> Do disk writes trigger synchronization?  Otherwise this is a very relaxed
> test.  I'm surprised the degradation is so low running continuously in
> log-dirty mode.

They do.  I double-checked the traffic on the synchronization network.
The attached file is a graph showing the traffic during the kernel
compile under Kemari (a rough way to reproduce this sampling is
sketched below).  It seems -j 4 produces more traffic than -j 2, and
after the compilation finishes, both traffic rates drop back to the
baseline.
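
To make the ordering concrete, here is a toy model (stand-in functions
only, not our implementation) of why a disk write acts as a
synchronization point:

#include <stdio.h>

/* hypothetical stand-ins for the real transport and block layers */
static void send_dirty_pages_to_standby(void) { puts("sync: dirty pages sent"); }
static void wait_for_standby_ack(void)        { puts("sync: standby acked");    }
static void release_guest_write(void)         { puts("I/O: write hits disk");   }

/* invoked whenever the guest issues a disk write */
static void on_guest_disk_write(void)
{
    send_dirty_pages_to_standby();  /* close the current epoch         */
    wait_for_standby_ack();         /* standby now mirrors the primary */
    release_guest_write();          /* externally visible effect last  */
}

int main(void)
{
    on_guest_disk_write();
    return 0;
}

The only point is that the externally visible effect is delayed until
the standby could take over without the outside world noticing.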

> Is this an npt or ept system?  Without npt or ept I'd expect less
> degradation since the page tables are heavily manipulated anyway.

Indeed, this is an ept system.  If properly implemented, Kemari should
work not only on x86 but also on other archs, and I'm interested in how
the numbers would differ.
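
If anyone wants to reproduce the traffic measurement mentioned above,
sampling the tx byte counter of the synchronization NIC once a second
is enough.  A rough sketch (the interface name eth1 is an assumption;
substitute whatever carries the Kemari traffic):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* tx byte counter of the NIC carrying the synchronization traffic */
    const char *path = "/sys/class/net/eth1/statistics/tx_bytes";
    unsigned long long prev = 0, cur;

    for (;;) {
        FILE *f = fopen(path, "r");
        if (!f)
            return 1;
        if (fscanf(f, "%llu", &cur) != 1) {
            fclose(f);
            return 1;
        }
        fclose(f);
        if (prev)
            printf("%llu bytes/s\n", cur - prev);
        prev = cur;
        sleep(1);
    }
}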

Thanks,

Yoshi

> --
> Do not meddle in the internals of kernels, for they are subtle and quick to
> panic.
>

[-- Attachment #2: kernel-compile-traffic.pdf --]
[-- Type: application/pdf, Size: 23813 bytes --]

