From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yoshiaki Tamura
Subject: Re: [RFC] KVM Fault Tolerance: Kemari for KVM
Date: Fri, 13 Nov 2009 20:48:48 +0900
Message-ID: <4AFD47A0.6040202@lab.ntt.co.jp>
References: <4AF79242.20406@oss.ntt.co.jp> <4AFC837D.2060307@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Cc: =?UTF-8?B?RmVybmFuZG8gTHVpcyBWw6F6cXVleiBDYW8=?= , kvm@vger.kernel.org,
 qemu-devel@nongnu.org, =?UTF-8?B?IuWkp+adkeWcrShvb211cmEga2VpKSI=?= ,
 Takuya Yoshikawa , avi@redhat.com, anthony@codemonkey.ws,
 Andrea Arcangeli , Chris Wright
To: dlaor@redhat.com
Return-path:
Received: from tama50.ecl.ntt.co.jp ([129.60.39.147]:52739 "EHLO
 tama50.ecl.ntt.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with
 ESMTP id S1750999AbZKMMRl (ORCPT ); Fri, 13 Nov 2009 07:17:41 -0500
In-Reply-To: <4AFC837D.2060307@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

Hi,

Thanks for your comments!

Dor Laor wrote:
> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>> Hi all,
>>
>> It has been a while coming, but we have finally started work on
>> Kemari's port to KVM. For those not familiar with it, Kemari provides
>> the basic building block to create a virtualization-based fault
>> tolerant machine: a virtual machine synchronization mechanism.
>>
>> Traditional high availability solutions can be classified into two
>> groups: fault tolerant servers, and software clustering.
>>
>> Broadly speaking, fault tolerant servers protect us against hardware
>> failures and generally rely on redundant hardware (often proprietary)
>> and hardware failure detection to trigger fail-over.
>>
>> On the other hand, software clustering, as its name indicates, takes
>> care of software failures and usually requires a standby server whose
>> software configuration for the part we are trying to make fault
>> tolerant must be identical to that of the active server.
>>
>> Both solutions may be applied to virtualized environments. Indeed,
>> the current incarnation of Kemari (Xen-based) brings fault tolerant
>> server-like capabilities to virtual machines, and integration with
>> existing HA stacks (Heartbeat, RHCS, etc) is under consideration.
>>
>> After some time at the drawing board we completed the basic design of
>> Kemari for KVM, so we are sending an RFC at this point to get early
>> feedback and, hopefully, get things right from the start. Those
>> already familiar with Kemari and/or fault tolerance may want to skip
>> the "Background" and go directly to the design and implementation
>> bits.
>>
>> This is a pretty long write-up, but please bear with me.
>>
>> == Background ==
>>
>> We started to play around with continuous virtual synchronization
>> technology about 3 years ago. As development progressed and, most
>> importantly, we got the first Xen-based working prototypes, it became
>> clear that we needed a proper name for our toy: Kemari.
>>
>> The goal of Kemari is to provide a fault tolerant platform for
>> virtualization environments, so that in the event of a hardware
>> failure the virtual machine fails over from compromised to properly
>> operating hardware (a physical machine) in a way that is completely
>> transparent to the guest operating system.
>>
>> Although hardware-based fault tolerant servers and HA servers
>> (software clustering) have been around for a (long) while, they
>> typically require specifically designed hardware and/or modifications
>> to applications.
>> In contrast, by abstracting hardware using
>> virtualization, Kemari can be used on off-the-shelf hardware and no
>> application modifications are needed.
>>
>> After a period of in-house development the first version of Kemari for
>> Xen was released in Nov 2008 as open source. However, by then it was
>> already pretty clear that a KVM port would have several advantages.
>> First, KVM is integrated into the Linux kernel, which means one gets
>> support for a wide variety of hardware for free. Second, and in the
>> same vein, KVM can also benefit from Linux's low-latency networking
>> capabilities, including RDMA, which is of paramount importance for an
>> extremely latency-sensitive functionality like Kemari. Last but not
>> least, KVM and its community are growing rapidly, and there is
>> increasing demand for Kemari-like functionality for KVM.
>>
>> Although the basic design principles will remain the same, our plan is
>> to write Kemari for KVM from scratch, since there does not seem to be
>> much opportunity for sharing between Xen and KVM.
>>
>> == Design outline ==
>>
>> The basic premise of fault tolerant servers is that when things go
>> awry with the hardware, the running system should transparently
>> continue execution on an alternate physical host. For this to be
>> possible the state of the fallback host has to be identical to that of
>> the primary.
>>
>> Kemari runs paired virtual machines in an active-passive configuration
>> and achieves whole-system replication by continuously copying the
>> state of the system (dirty pages and the state of the virtual devices)
>> from the active node to the passive node. An interesting implication
>> of this is that during normal operation only the active node is
>> actually executing code.
>>
>> Another possible approach is to run a pair of systems in lock-step
>> (à la VMware FT). Since both the primary and fallback virtual machines
>> are active, keeping them synchronized is a complex task, which usually
>> involves carefully injecting external events into both virtual
>> machines so that they result in identical states.
>>
>> The latter approach is extremely architecture-specific and not SMP
>> friendly. This spurred us to try the design that became Kemari, which
>> we believe lends itself to further optimizations.
>>
>> == Implementation ==
>>
>> The first step is to encapsulate the machine to be protected within a
>> virtual machine. Then the live migration functionality is leveraged to
>> keep the virtual machines synchronized.
>>
>> During live migration, dirty pages can be sent asynchronously from the
>> primary to the fallback server until the ratio of dirty pages is low
>> enough to guarantee very short downtimes. In a fault tolerance
>> solution, by contrast, whenever a synchronization point is reached the
>> changes to the virtual machine since the previous one have to be sent
>> synchronously.
>>
>> Since the virtual machine has to be stopped until the data reaches and
>> is acknowledged by the fallback server, the synchronization model is
>> of critical importance for performance (both in terms of raw
>> throughput and latencies). The model chosen for Kemari, along with
>> other implementation details, is described below.
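As a rough illustration of the flow described above, one synchronization
cycle boils down to something like the following. This is illustrative C
only; every function here is a made-up stub standing in for the real
KVM/qemu machinery, not an existing interface.

/* Sketch of one Kemari synchronization cycle (hypothetical stubs only). */
#include <stdbool.h>
#include <stdio.h>

static bool guest_issued_outgoing_io(void) { return true; }           /* trapped VMEXIT */
static void freeze_all_vcpus(void)         { puts("vcpus frozen"); }  /* UP: implicit; SMP: IPI */
static void send_dirty_pages(void)         { puts("dirty pages sent"); }
static void send_device_state(void)        { puts("device state sent"); }
static bool wait_for_fallback_ack(void)    { return true; }
static void release_outgoing_io(void)      { puts("I/O released"); }
static void resume_all_vcpus(void)         { puts("vcpus resumed"); }

/* Everything is synchronous: the outgoing I/O only becomes externally
 * visible once the fallback node has acknowledged the matching state. */
static int kemari_sync_point(void)
{
    if (!guest_issued_outgoing_io())
        return 0;                 /* nothing externally visible, keep running */

    freeze_all_vcpus();
    send_dirty_pages();           /* pages dirtied since the last sync point */
    send_device_state();          /* savevm-style device sections */
    if (!wait_for_fallback_ack())
        return -1;                /* treated as a failure: trigger failover */
    release_outgoing_io();        /* the I/O may now leave the node */
    resume_all_vcpus();
    return 0;
}

int main(void) { return kemari_sync_point() ? 1 : 0; }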
>> * Synchronization model
>>
>> The synchronization points were carefully chosen to minimize the
>> amount of traffic that goes over the wire while still keeping the
>> FT pair consistent at all times. To be precise, Kemari uses events
>> that modify externally visible state as synchronization points. This
>> means that all outgoing I/O needs to be trapped and sent to the
>> fallback host before the primary is resumed, so that it can be
>> replayed in the face of hardware failure.
>>
>> The basic assumption here is that outgoing I/O operations are
>> idempotent, which is usually true for disk I/O and reliable network
>> protocols such as TCP (Kemari may trigger hidden bugs in applications
>> that use UDP or other unreliable protocols, so those may need minor
>> changes to ensure they work properly after failover).
>>
>> The synchronization process can be broken down as follows:
>>
>> - Event tapping: On KVM all I/O generates a VMEXIT that is
>> synchronously handled by the Linux kernel monitor, i.e. KVM (it is
>> worth noting that this applies to virtio devices too, because they
>> use MMIO and PIO just like a regular PCI device).
>>
>> - VCPU/Guest freezing: This is automatic in the UP case. In SMP
>> environments we may need to send an IPI to stop the other VCPUs.
>>
>> - Notification to qemu: Taking a page from live migration's
>> playbook, the synchronization process is user-space driven, which
>> means that qemu needs to be woken up at each synchronization
>> point. That is already the case for qemu-emulated devices, but we
>> also have in-kernel emulators. To compound the problem, even for
>> user-space emulated devices, accesses to coalesced MMIO areas
>> cannot be detected. As a consequence we need a mechanism to
>> communicate KVM-handled events to qemu.
>>
>> The channel for KVM-qemu communication can easily be built upon
>> the existing infrastructure. We just need to add a new page to
>> the kvm_run shared memory area that can be mmapped from user space
>> and set the exit reason appropriately.
>>
>> Regarding in-kernel device emulators, we only need to care about
>> writes. Specifically, making kvm_io_bus_write() fail when Kemari
>> is activated and invoking the emulator again after re-entrance
>> from user space should suffice (this is somewhat similar to what
>> we do in kvm_arch_vcpu_ioctl_run() for MMIO reads).
>>
>> To avoid missing synchronization points, one should be careful with
>> coalesced MMIO-like optimizations. In the particular case of
>> coalesced MMIO, the I/O operation that caused the exit to user
>> space should act as a write barrier when it was due to an access
>> to a non-coalesced MMIO area. This means that before proceeding to
>> handle the exit in kvm_run() we have to make sure that all the
>> coalesced MMIO has reached the fallback host.
>>
>> - Virtual machine synchronization: All the dirty pages since the
>> last synchronization point and the state of the virtual devices are
>> sent to the fallback node from the user-space qemu process. For this
>> the existing savevm infrastructure and KVM's dirty page tracking

> I failed to understand whether you take the lock-step approach and sync
> every vmexit + make sure the shadow host will inject the irq on the
> original guest's instruction boundary, or alternatively use continuous
> live snapshots.

We'll take the live snapshots approach for now.

> If you use live snapshots, why do you need to track mmio, etc? Is it in
> order to save the device sync stage in live migration? In order to do it
> you fully lock-step qemu execution (or send the entire vmstate to the
> slave). Isn't the device part << the dirty pages part?

We're thinking of capturing MMIO operations that affect the state of
devices as synchronization points. The purpose is to lock-step qemu
execution, as you mentioned.
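To make that a little more concrete, the in-kernel write path could be
bounced to user space along these lines. This is only a stand-alone
sketch: the structure layout, the kemari_enabled flag and the
KEMARI_EXIT_SYNC exit reason are all invented for illustration; it is not
part of the real kvm_run ABI nor a patch against actual KVM code.

/* Sketch: defer an in-kernel device write until qemu has reached a
 * Kemari synchronization point.  All names here are hypothetical. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KEMARI_EXIT_SYNC 100               /* hypothetical new exit reason */

struct fake_kvm_run {                      /* stand-in for the kvm_run page */
    int      exit_reason;
    uint64_t io_addr;
    uint32_t io_len;
    uint8_t  io_data[8];
};

static bool kemari_enabled = true;

/* In-kernel emulation path: refuse the write while Kemari is active and
 * record it in the shared area, so qemu can forward the pending state to
 * the fallback host and then re-enter the kernel to replay the write. */
static int io_bus_write(struct fake_kvm_run *run,
                        uint64_t addr, const void *data, uint32_t len)
{
    if (kemari_enabled) {
        if (len > sizeof(run->io_data))
            len = sizeof(run->io_data);
        run->exit_reason = KEMARI_EXIT_SYNC;
        run->io_addr = addr;
        run->io_len  = len;
        memcpy(run->io_data, data, len);
        return -1;                         /* force an exit to user space */
    }
    /* ... normal in-kernel device emulation would go here ... */
    return 0;
}

int main(void)
{
    struct fake_kvm_run run = { 0 };
    uint32_t val = 0xdeadbeef;

    if (io_bus_write(&run, 0xc000, &val, sizeof(val)) < 0)
        printf("exit to qemu, reason %d: sync with fallback, then replay\n",
               run.exit_reason);
    return 0;
}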
Thanks,

Yoshi

>
> Thanks,
> Dor
>
>
>> capabilities can be reused. Regarding in-kernel devices, with the
>> likely advent of in-kernel virtio backends we need a generic way
>> to access their state from user space, for which, again, the kvm_run
>> shared memory area could be used.
>>
>> - Virtual machine run: Execution of the virtual machine is resumed
>> as soon as synchronization finishes.
>>
>> * Clock
>>
>> Even though we do not need to worry about the clock that provides the
>> tick (the counter resides in memory, which we keep synchronized), the
>> same does not apply to counters such as the TSC (we certainly want to
>> avoid a situation where counters jump back in time right after
>> fail-over, breaking guarantees such as monotonicity).
>>
>> To avoid big hiccups after migration, the value of the TSC should be
>> sent to the fallback node frequently. An access from the guest
>> (through RDTSC, RDTSCP, RDMSR, or WRMSR) seems like the right moment
>> to do this. Fortunately, both VMX and SVM provide controls to
>> intercept accesses to the TSC, so it is just a matter of setting those
>> appropriately (the "RDTSC exiting" VM-execution control, and the
>> RDTSC, RDTSCP, RDMSR, WRMSR instruction intercepts, respectively).
>> However, since synchronizing the virtual machines every time the TSC
>> is accessed would be prohibitive, the transmission of the TSC will be
>> done lazily, which means delaying it until a non-TSC synchronization
>> point arrives.
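For the lazy TSC transmission, the bookkeeping could look roughly like the
following stand-alone sketch. The struct and function names are invented;
in reality the recording would happen in the RDTSC/RDTSCP/RDMSR/WRMSR exit
handlers and the flush in the regular synchronization path.

/* Sketch of lazy TSC shipping: record the value at every intercepted TSC
 * access, transmit it only at the next non-TSC synchronization point. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct tsc_state {
    uint64_t last_guest_tsc;   /* value seen at the last intercepted access */
    bool     dirty;            /* not yet sent to the fallback node */
};

/* Called from the (hypothetical) TSC access intercept. */
static void on_tsc_access(struct tsc_state *s, uint64_t guest_tsc)
{
    s->last_guest_tsc = guest_tsc;
    s->dirty = true;           /* do NOT synchronize here; far too expensive */
}

/* Called from the next non-TSC synchronization point. */
static void flush_tsc(struct tsc_state *s)
{
    if (s->dirty) {
        printf("piggybacking TSC %llu on this checkpoint\n",
               (unsigned long long)s->last_guest_tsc);
        s->dirty = false;
    }
}

int main(void)
{
    struct tsc_state s = { 0 };
    on_tsc_access(&s, 123456789ULL);   /* guest executed RDTSC */
    flush_tsc(&s);                     /* the next sync point carries it */
    return 0;
}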
>>
>> * Failover
>>
>> The failover process kicks in whenever a failure in the primary node
>> is detected. At the time of writing we just ping the virtual machine
>> periodically to determine whether it is still alive, but in the long
>> term we have plans to integrate Kemari with the major HA stacks
>> (Heartbeat, RHCS, etc).
>>
>> Ideally, we would like to leverage the hardware failure detection
>> capabilities of newish x86 hardware to trigger failover, the idea
>> being that transferring control to the fallback node proactively
>> when a problem is detected is much faster than relying on the polling
>> mechanisms used by most HA software.
>>
>> Finally, to restore the virtual machine in the fallback host, the
>> loadvm infrastructure used for live migration is leveraged.
>>
>> * Further information
>>
>> Please visit the link below for additional information, including
>> documentation and, most importantly, source code (for Xen only at the
>> moment).
>>
>> http://www.osrg.net/kemari
>> ==
>>
>>
>> Any comments and suggestions would be greatly appreciated.
>>
>> If this is the right forum and people on the KVM mailing list do not
>> mind, we would like to use the CC'ed mailing lists for Kemari
>> development. Having more expert eyes looking at one's code always
>> helps.
>>
>> Thanks,
>>
>> Fernando
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html