From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yoshiaki Tamura
Subject: Re: [RFC] KVM Fault Tolerance: Kemari for KVM
Date: Fri, 13 Nov 2009 20:48:48 +0900
Message-ID: <4AFD47A0.6040202@lab.ntt.co.jp>
References: <4AF79242.20406@oss.ntt.co.jp> <4AFC837D.2060307@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Cc: =?UTF-8?B?RmVybmFuZG8gTHVpcyBWw6F6cXVleiBDYW8=?= , kvm@vger.kernel.org,
 qemu-devel@nongnu.org, =?UTF-8?B?IuWkp+adkeWcrShvb211cmEga2VpKSI=?= ,
 Takuya Yoshikawa , avi@redhat.com, anthony@codemonkey.ws,
 Andrea Arcangeli , Chris Wright
To: dlaor@redhat.com
Return-path:
Received: from tama50.ecl.ntt.co.jp ([129.60.39.147]:52739 "EHLO
 tama50.ecl.ntt.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with
 ESMTP id S1750999AbZKMMRl (ORCPT ); Fri, 13 Nov 2009 07:17:41 -0500
In-Reply-To: <4AFC837D.2060307@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

Hi,

Thanks for your comments!

Dor Laor wrote:
> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>> Hi all,
>>
>> It has been a while coming, but we have finally started work on
>> Kemari's port to KVM. For those not familiar with it, Kemari provides
>> the basic building block to create a virtualization-based fault
>> tolerant machine: a virtual machine synchronization mechanism.
>>
>> Traditional high availability solutions can be classified into two
>> groups: fault tolerant servers, and software clustering.
>>
>> Broadly speaking, fault tolerant servers protect us against hardware
>> failures and generally rely on redundant hardware (often proprietary)
>> and hardware failure detection to trigger fail-over.
>>
>> On the other hand, software clustering, as its name indicates, takes
>> care of software failures and usually requires a standby server whose
>> software configuration for the part we are trying to make fault
>> tolerant must be identical to that of the active server.
>>
>> Both solutions may be applied to virtualized environments. Indeed,
>> the current incarnation of Kemari (Xen-based) brings fault tolerant
>> server-like capabilities to virtual machines, and integration with
>> existing HA stacks (Heartbeat, RHCS, etc) is under consideration.
>>
>> After some time at the drawing board we completed the basic design of
>> Kemari for KVM, so we are sending an RFC at this point to get early
>> feedback and, hopefully, get things right from the start. Those
>> already familiar with Kemari and/or fault tolerance may want to skip
>> the "Background" and go directly to the design and implementation
>> bits.
>>
>> This is a pretty long write-up, but please bear with me.
>>
>> == Background ==
>>
>> We started to play around with continuous virtual synchronization
>> technology about 3 years ago. As development progressed and, most
>> importantly, we got the first Xen-based working prototypes, it became
>> clear that we needed a proper name for our toy: Kemari.
>>
>> The goal of Kemari is to provide a fault tolerant platform for
>> virtualization environments, so that in the event of a hardware
>> failure the virtual machine fails over from compromised to properly
>> operating hardware (a physical machine) in a way that is completely
>> transparent to the guest operating system.
>>
>> Although hardware-based fault tolerant servers and HA servers
>> (software clustering) have been around for a (long) while, they
>> typically require specifically designed hardware and/or modifications
>> to applications.
>> In contrast, by abstracting hardware using
>> virtualization, Kemari can be used on off-the-shelf hardware and no
>> application modifications are needed.
>>
>> After a period of in-house development the first version of Kemari for
>> Xen was released in Nov 2008 as open source. However, by then it was
>> already pretty clear that a KVM port would have several advantages.
>> First, KVM is integrated into the Linux kernel, which means one gets
>> support for a wide variety of hardware for free. Second, and in the
>> same vein, KVM can also benefit from Linux's low-latency networking
>> capabilities, including RDMA, which is of paramount importance for an
>> extremely latency-sensitive functionality like Kemari. Last but not
>> least, KVM and its community are growing rapidly, and there is
>> increasing demand for Kemari-like functionality for KVM.
>>
>> Although the basic design principles will remain the same, our plan is
>> to write Kemari for KVM from scratch, since there does not seem to be
>> much opportunity for sharing between Xen and KVM.
>>
>> == Design outline ==
>>
>> The basic premise of fault tolerant servers is that when things go
>> awry with the hardware, the running system should transparently
>> continue execution on an alternate physical host. For this to be
>> possible the state of the fallback host has to be identical to that of
>> the primary.
>>
>> Kemari runs paired virtual machines in an active-passive configuration
>> and achieves whole-system replication by continuously copying the
>> state of the system (dirty pages and the state of the virtual devices)
>> from the active node to the passive node. An interesting implication
>> of this is that during normal operation only the active node is
>> actually executing code.
>>
>> Another possible approach is to run a pair of systems in lock-step
>> (à la VMware FT). Since both the primary and fallback virtual machines
>> are active, keeping them synchronized is a complex task, which usually
>> involves carefully injecting external events into both virtual
>> machines so that they result in identical states.
>>
>> The latter approach is extremely architecture-specific and not SMP
>> friendly. This spurred us to try the design that became Kemari, which
>> we believe lends itself to further optimizations.
>>
>> == Implementation ==
>>
>> The first step is to encapsulate the machine to be protected within a
>> virtual machine. Then the live migration functionality is leveraged to
>> keep the virtual machines synchronized.
>>
>> During live migration, dirty pages can be sent asynchronously from the
>> primary to the fallback server until the ratio of dirty pages is low
>> enough to guarantee very short downtimes. In a fault tolerance
>> solution, by contrast, whenever a synchronization point is reached the
>> changes to the virtual machine since the previous one have to be sent
>> synchronously.
>>
>> Since the virtual machine has to be stopped until the data reaches and
>> is acknowledged by the fallback server, the synchronization model is
>> of critical importance for performance (both in terms of raw
>> throughput and latencies). The model chosen for Kemari, along with
>> other implementation details, is described below.
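As a rough illustration of the flow described above, one synchronization
cycle boils down to something like the following. This is illustrative C
only; every function here is a made-up stub standing in for the real
KVM/qemu machinery, not an existing interface.

/* Sketch of one Kemari synchronization cycle (hypothetical stubs only). */
#include <stdbool.h>
#include <stdio.h>

static bool guest_issued_outgoing_io(void) { return true; }           /* trapped VMEXIT */
static void freeze_all_vcpus(void)         { puts("vcpus frozen"); }  /* UP: implicit; SMP: IPI */
static void send_dirty_pages(void)         { puts("dirty pages sent"); }
static void send_device_state(void)        { puts("device state sent"); }
static bool wait_for_fallback_ack(void)    { return true; }
static void release_outgoing_io(void)      { puts("I/O released"); }
static void resume_all_vcpus(void)         { puts("vcpus resumed"); }

/* Everything is synchronous: the outgoing I/O only becomes externally
 * visible once the fallback node has acknowledged the matching state. */
static int kemari_sync_point(void)
{
    if (!guest_issued_outgoing_io())
        return 0;                 /* nothing externally visible, keep running */

    freeze_all_vcpus();
    send_dirty_pages();           /* pages dirtied since the last sync point */
    send_device_state();          /* savevm-style device sections */
    if (!wait_for_fallback_ack())
        return -1;                /* treated as a failure: trigger failover */
    release_outgoing_io();        /* the I/O may now leave the node */
    resume_all_vcpus();
    return 0;
}

int main(void) { return kemari_sync_point() ? 1 : 0; }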
>> * Synchronization model
>>
>> The synchronization points were carefully chosen to minimize the
>> amount of traffic that goes over the wire while still keeping the
>> FT pair consistent at all times. To be precise, Kemari uses events
>> that modify externally visible state as synchronization points. This
>> means that all outgoing I/O needs to be trapped and sent to the
>> fallback host before the primary is resumed, so that it can be
>> replayed in the face of hardware failure.
>>
>> The basic assumption here is that outgoing I/O operations are
>> idempotent, which is usually true for disk I/O and reliable network
>> protocols such as TCP (Kemari may trigger hidden bugs in applications
>> that use UDP or other unreliable protocols, so those may need minor
>> changes to ensure they work properly after failover).
>>
>> The synchronization process can be broken down as follows:
>>
>> - Event tapping: On KVM all I/O generates a VMEXIT that is
>> synchronously handled by the Linux kernel monitor, i.e. KVM (it is
>> worth noting that this applies to virtio devices too, because they
>> use MMIO and PIO just like a regular PCI device).
>>
>> - VCPU/Guest freezing: This is automatic in the UP case. In SMP
>> environments we may need to send an IPI to stop the other VCPUs.
>>
>> - Notification to qemu: Taking a page from live migration's
>> playbook, the synchronization process is user-space driven, which
>> means that qemu needs to be woken up at each synchronization
>> point. That is already the case for qemu-emulated devices, but we
>> also have in-kernel emulators. To compound the problem, even for
>> user-space emulated devices, accesses to coalesced MMIO areas
>> cannot be detected. As a consequence we need a mechanism to
>> communicate KVM-handled events to qemu.
>>
>> The channel for KVM-qemu communication can easily be built upon
>> the existing infrastructure. We just need to add a new page to
>> the kvm_run shared memory area that can be mmapped from user space
>> and set the exit reason appropriately.
>>
>> Regarding in-kernel device emulators, we only need to care about
>> writes. Specifically, making kvm_io_bus_write() fail when Kemari
>> is activated and invoking the emulator again after re-entrance
>> from user space should suffice (this is somewhat similar to what
>> we do in kvm_arch_vcpu_ioctl_run() for MMIO reads).
>>
>> To avoid missing synchronization points, one should be careful with
>> coalesced MMIO-like optimizations. In the particular case of
>> coalesced MMIO, the I/O operation that caused the exit to user
>> space should act as a write barrier when it was due to an access
>> to a non-coalesced MMIO area. This means that before proceeding to
>> handle the exit in kvm_run() we have to make sure that all the
>> coalesced MMIO has reached the fallback host.
>>
>> - Virtual machine synchronization: All the dirty pages since the
>> last synchronization point and the state of the virtual devices are
>> sent to the fallback node from the user-space qemu process. For this
>> the existing savevm infrastructure and KVM's dirty page tracking

> I failed to understand whether you take the lock-step approach and sync
> every vmexit + make sure the shadow host will inject the irq on the
> original guest's instruction boundary, or alternatively use continuous
> live snapshots.

We'll take the live snapshots approach for now.

> If you use live snapshots, why do you need to track mmio, etc? Is it in
> order to save the device sync stage in live migration? In order to do it
> you fully lock-step qemu execution (or send the entire vmstate to the
> slave). Isn't the device part << the dirty pages part?

We're thinking of capturing MMIO operations that affect the state of
devices as synchronization points. The purpose is to lock-step qemu
execution, as you mentioned.
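To make that a little more concrete, the in-kernel write path could be
bounced to user space along these lines. This is only a stand-alone
sketch: the structure layout, the kemari_enabled flag and the
KEMARI_EXIT_SYNC exit reason are all invented for illustration; it is not
part of the real kvm_run ABI nor a patch against actual KVM code.

/* Sketch: defer an in-kernel device write until qemu has reached a
 * Kemari synchronization point.  All names here are hypothetical. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KEMARI_EXIT_SYNC 100               /* hypothetical new exit reason */

struct fake_kvm_run {                      /* stand-in for the kvm_run page */
    int      exit_reason;
    uint64_t io_addr;
    uint32_t io_len;
    uint8_t  io_data[8];
};

static bool kemari_enabled = true;

/* In-kernel emulation path: refuse the write while Kemari is active and
 * record it in the shared area, so qemu can forward the pending state to
 * the fallback host and then re-enter the kernel to replay the write. */
static int io_bus_write(struct fake_kvm_run *run,
                        uint64_t addr, const void *data, uint32_t len)
{
    if (kemari_enabled) {
        if (len > sizeof(run->io_data))
            len = sizeof(run->io_data);
        run->exit_reason = KEMARI_EXIT_SYNC;
        run->io_addr = addr;
        run->io_len  = len;
        memcpy(run->io_data, data, len);
        return -1;                         /* force an exit to user space */
    }
    /* ... normal in-kernel device emulation would go here ... */
    return 0;
}

int main(void)
{
    struct fake_kvm_run run = { 0 };
    uint32_t val = 0xdeadbeef;

    if (io_bus_write(&run, 0xc000, &val, sizeof(val)) < 0)
        printf("exit to qemu, reason %d: sync with fallback, then replay\n",
               run.exit_reason);
    return 0;
}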
Thanks,

Yoshi

>
> Thanks,
> Dor
>
>
>> capabilities can be reused. Regarding in-kernel devices, with the
>> likely advent of in-kernel virtio backends we need a generic way
>> to access their state from user space, for which, again, the kvm_run
>> shared memory area could be used.
>>
>> - Virtual machine run: Execution of the virtual machine is resumed
>> as soon as synchronization finishes.
>>
>> * Clock
>>
>> Even though we do not need to worry about the clock that provides the
>> tick (the counter resides in memory, which we keep synchronized), the
>> same does not apply to counters such as the TSC (we certainly want to
>> avoid a situation where counters jump back in time right after
>> fail-over, breaking guarantees such as monotonicity).
>>
>> To avoid big hiccups after migration, the value of the TSC should be
>> sent to the fallback node frequently. An access from the guest
>> (through RDTSC, RDTSCP, RDMSR, or WRMSR) seems like the right moment
>> to do this. Fortunately, both VMX and SVM provide controls to
>> intercept accesses to the TSC, so it is just a matter of setting those
>> appropriately (the "RDTSC exiting" VM-execution control, and the
>> RDTSC, RDTSCP, RDMSR, WRMSR instruction intercepts, respectively).
>> However, since synchronizing the virtual machines every time the TSC
>> is accessed would be prohibitive, the transmission of the TSC will be
>> done lazily, which means delaying it until a non-TSC synchronization
>> point arrives.
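For the lazy TSC transmission, the bookkeeping could look roughly like the
following stand-alone sketch. The struct and function names are invented;
in reality the recording would happen in the RDTSC/RDTSCP/RDMSR/WRMSR exit
handlers and the flush in the regular synchronization path.

/* Sketch of lazy TSC shipping: record the value at every intercepted TSC
 * access, transmit it only at the next non-TSC synchronization point. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct tsc_state {
    uint64_t last_guest_tsc;   /* value seen at the last intercepted access */
    bool     dirty;            /* not yet sent to the fallback node */
};

/* Called from the (hypothetical) TSC access intercept. */
static void on_tsc_access(struct tsc_state *s, uint64_t guest_tsc)
{
    s->last_guest_tsc = guest_tsc;
    s->dirty = true;           /* do NOT synchronize here; far too expensive */
}

/* Called from the next non-TSC synchronization point. */
static void flush_tsc(struct tsc_state *s)
{
    if (s->dirty) {
        printf("piggybacking TSC %llu on this checkpoint\n",
               (unsigned long long)s->last_guest_tsc);
        s->dirty = false;
    }
}

int main(void)
{
    struct tsc_state s = { 0 };
    on_tsc_access(&s, 123456789ULL);   /* guest executed RDTSC */
    flush_tsc(&s);                     /* the next sync point carries it */
    return 0;
}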
>>
>> * Failover
>>
>> The failover process kicks in whenever a failure in the primary node
>> is detected. At the time of writing we just ping the virtual machine
>> periodically to determine whether it is still alive, but in the long
>> term we have plans to integrate Kemari with the major HA stacks
>> (Heartbeat, RHCS, etc).
>>
>> Ideally, we would like to leverage the hardware failure detection
>> capabilities of newish x86 hardware to trigger failover, the idea
>> being that transferring control to the fallback node proactively
>> when a problem is detected is much faster than relying on the polling
>> mechanisms used by most HA software.
>>
>> Finally, to restore the virtual machine in the fallback host, the
>> loadvm infrastructure used for live migration is leveraged.
>>
>> * Further information
>>
>> Please visit the link below for additional information, including
>> documentation and, most importantly, source code (for Xen only at the
>> moment).
>>
>> http://www.osrg.net/kemari
>> ==
>>
>>
>> Any comments and suggestions would be greatly appreciated.
>>
>> If this is the right forum and people on the KVM mailing list do not
>> mind, we would like to use the CC'ed mailing lists for Kemari
>> development. Having more expert eyes looking at one's code always
>> helps.
>>
>> Thanks,
>>
>> Fernando
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html