[RFC] KVM Fault Tolerance: Kemari for KVM

* [RFC] KVM Fault Tolerance: Kemari for KVM
@ 2009-11-09  3:53 ` Fernando Luis Vázquez Cao
  0 siblings, 0 replies; 26+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-11-09  3:53 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: "大村圭(oomura kei)",
	Yoshiaki Tamura, Takuya Yoshikawa, avi, anthony,
	Andrea Arcangeli, Chris Wright

Hi all,

It has been a while coming, but we have finally started work on
Kemari's port to KVM. For those not familiar with it, Kemari provides
the basic building block to create a virtualization-based fault
tolerant machine: a virtual machine synchronization mechanism.

Traditional high availability solutions can be classified in two
groups: fault tolerant servers, and software clustering.

Broadly speaking, fault tolerant servers protect us against hardware
failures and, generally, rely on redundant hardware (often
proprietary), and hardware failure detection to trigger fail-over.

On the other hand, software clustering, as its name indicates, takes
care of software failures and usually requires a standby server whose
software configuration for the part we are trying to make fault
tolerant must be identical to that of the active server.

Both solutions may be applied to virtualized environments. Indeed,
the current incarnation of Kemari (Xen-based) brings fault tolerant
server-like capabilities to virtual machines and integration with
existing HA stacks (Heartbeat, RHCS, etc) is under consideration.

After some time in the drawing board we completed the basic design of
Kemari for KVM, so we are sending an RFC at this point to get early
feedback and, hopefully, get things right from the start. Those
already familiar with Kemari and/or fault tolerance may want to skip
the "Background" and go directly to the design and implementation
bits.

This is a pretty long write-up, but please bear with me.

== Background ==

We started to play around with continuous virtual synchronization
technology about 3 years ago. As development progressed and, most
importantly, we got the first Xen-based working prototypes it became
clear that we needed a proper name for our toy: Kemari.

The goal of Kemari is to provide a fault tolerant platform for
virtualization environments, so that in the event of a hardware
failure the virtual machine fails over from compromised to properly
operating hardware (a physical machine) in a way that is completely
transparent to the guest operating system.

Although hardware based fault tolerant servers and HA servers
(software clustering) have been around for a (long) while, they
typically require specifically designed hardware and/or modifications
to applications. In contrast, by abstracting hardware using
virtualization, Kemari can be used on off-the-shelf hardware and no
application modifications are needed.

After a period of in-house development the first version of Kemari for
Xen was released in Nov 2008 as open source. However, by then it was
already pretty clear that a KVM port would have several
advantages. First, KVM is integrated into the Linux kernel, which
means one gets support for a wide variety of hardware for
free. Second, and in the same vein, KVM can also benefit from Linux'
low latency networking capabilities including RDMA, which is of
paramount importance for a extremely latency-sensitive functionality
like Kemari. Last, but not the least, KVM and its community is growing
rapidly, and there is increasing demand for Kemari-like functionality
for KVM.

Although the basic design principles will remain the same, our plan is
to write Kemari for KVM from scratch, since there does not seem to be
much opportunity for sharing between Xen and KVM.

== Design outline ==

The basic premise of fault tolerant servers is that when things go
awry with the hardware the running system should transparently
continue execution on an alternate physical host. For this to be
possible the state of the fallback host has to be identical to that of
the primary.

Kemari runs paired virtual machines in an active-passive configuration
and achieves whole-system replication by continuously copying the
state of the system (dirty pages and the state of the virtual devices)
from the active node to the passive node. An interesting implication
of this is that during normal operation only the active node is
actually executing code.

Another possible approach is to run a pair of systems in lock-step
(à la VMware FT). Since both the primary and fallback virtual machines
are active keeping them synchronized is a complex task, which usually
involves carefully injecting external events into both virtual
machines so that they result in identical states.

The latter approach is extremely architecture specific and not SMP
friendly. This spurred us to try the design that became Kemari, which
we believe lends itself to further optimizations.

== Implementation ==

The first step is to encapsulate the machine to be protected within a
virtual machine. Then the live migration functionality is leveraged to
keep the virtual machines synchronized.

Whereas during live migration dirty pages can be sent asynchronously
from the primary to the fallback server until the ratio of dirty pages
is low enough to guarantee very short downtimes, when it comes to
fault tolerance solutions whenever a synchronization point is reached
changes to the virtual machine since the previous one have to be sent
synchronously.

Since the virtual machine has to be stopped until the data reaches and
is acknowledged by the fallback server, the synchronization model is
of critical importance for performance (both in terms of raw
throughput and latencies). The model chosen for Kemari along with
other implementation details is described below.

* Synchronization model

The synchronization points were carefully chosen to minimize the
amount of traffic that goes over the wire while still maintaining the
FT pair consistent at all times. To be precise, Kemari uses events
that modify externally visible state as synchronizations points. This
means that all outgoing I/O needs to be trapped and sent to the
fallback host before the primary is resumed, so that it can be
replayed in the face of hardware failure.

The basic assumption here is that outgoing I/O operations are
idempotent, which is usually true for disk I/O and reliable network
protocols such as TCP (Kemari may trigger hidden bugs on applications
that use UDP or other unreliable protocols, so those may need minor
changes to ensure they work properly after failover).

The synchronization process can be broken down as follows:

   - Event tapping: On KVM all I/O generates a VMEXIT that is
     synchronously handled by the Linux kernel monitor i.e. KVM (it is
     worth noting that this applies to virtio devices too, because they
     use MMIO and PIO just like a regular PCI device).

   - VCPU/Guest freezing: This is automatic in the UP case. On SMP
     environments we may need to send a IPI to stop the other VCPUs.

   - Notification to qemu: Taking a page from live migration's
     playbook, the synchronization process is user-space driven, which
     means that qemu needs to be woken up at each synchronization
     point. That is already the case for qemu-emulated devices, but we
     also have in-kernel emulators. To compound the problem, even for
     user-space emulated devices accesses to coalesced MMIO areas can
     not be detected. As a consequence we need a mechanism to
     communicate KVM-handled events to qemu.

     The channel for KVM-qemu communication can be easily build upon
     the existing infrastructure. We just need to add a new a page to
     the kvm_run shared memory area that can be mmapped from user space
     and set the exit reason appropriately.

     Regarding in-kernel device emulators, we only need to care about
     writes. Specifically, making kvm_io_bus_write() fail when Kemari
     is activated and invoking the emulator again after re-entrance
     from user space should suffice (this is somewhat similar to what
     we do in kvm_arch_vcpu_ioctl_run() for MMIO reads).

     To avoid missing synchronization points one should be careful with
     coalesced MMIO-like optimizations. In the particular case of
     coalesced MMIO, the I/O operation that caused the exit to user
     space should act as a write barrier when it was due to an access
     to a non-coalesced MMIO area. This means that before proceeding to
     handle the exit in kvm_run() we have to make sure that all the
     coalesced MMIO has reached the fallback host.

   - Virtual machine synchronization: All the dirty pages since the
     last synchronization point and the state of the virtual devices is
     sent to the fallback node from the user-space qemu process. For this
     the existing savevm infrastructure and KVM's dirty page tracking
     capabilities can be reused. Regarding in-kernel devices, with the
     likely advent of in-kernel virtio backends we need a generic way
     to access their state from user-space, for which, again, the kvm_run
     share memory area could be used.

   - Virtual machine run: Execution of the virtual machine is resumed
     as soon as synchronization finishes.

* Clock

Even though we do not need to worry about the clock that provides the
tick (the counter resides in memory, which we keep synchronized), the
same does not apply to counters such as the TSC (we certainly want to
avoid a situation where counters jump back in time right after
fail-over, breaking guarantees such as monotonicity).

To avoid big hiccups after migration the value of the TSC should be
sent to the fallback node frequently. An access from the guest
(through RDTSC, RDTSCP, RDMSR, or WRMSR) seems like the right moment
to do this. Fortunately, both vmx and SVM provide controls to
intercept accesses to the TSC, so it is just a matter of setting those
appropriately ("RDTSC exiting" VM-execution control, and RDTSC,
RDTSCP, RDMSR, WRMSR instruction intercepts, respectively). However,
since synchronizing the virtual machines every time the TSC is
accessed would be prohibitive, the transmission of the TSC will be
done lazily, which means delaying it until there is a non-TSC
synchronization point arrives.

* Failover

Failover process kicks in whenever a failure in the primary node is
detected. At the time of writing we just ping the virtual machine
periodically to determine whether it is still alive, but in the long
term we have plans to integrate Kemari with the major HA stacks
(Hearbeat, RHCS, etc).

Ideally, we would like to leverage the hardware failure detection
capabilities of newish x86 hardware to trigger failover, the idea
being that transferring control to the fallback node proactively
when a problem is detected is much faster than relying on the polling
mechanisms used by most HA software.

Finally, to restore the virtual machine in the fallback host the loadvm
infrastructure used for live-migration is leveraged.

* Further information

Please visit the link below for additional information, including
documentation and, most importantly, source code (for Xen only at the
moment).

http://www.osrg.net/kemari
==

Any comments and suggestions would be greatly appreciated.

If this is the right forum and people on the KVM mailing list do not
mind, we would like to use the CC'ed mailing lists for Kemari
development. Having more expert eyes looking at one's code always
helps.

Thanks,

Fernando

^ permalink raw reply	[flat|nested] 26+ messages in thread