From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dor Laor
Subject: Re: [RFC] KVM Fault Tolerance: Kemari for KVM
Date: Thu, 12 Nov 2009 23:51:57 +0200
Message-ID: <4AFC837D.2060307@redhat.com>
References: <4AF79242.20406@oss.ntt.co.jp>
Reply-To: dlaor@redhat.com
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org, "大村圭(oomura kei)", Yoshiaki Tamura, Takuya Yoshikawa, avi@redhat.com, anthony@codemonkey.ws, Andrea Arcangeli, Chris Wright
To: Fernando Luis Vázquez Cao
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:29452 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753546AbZKLVx7 (ORCPT ); Thu, 12 Nov 2009 16:53:59 -0500
In-Reply-To: <4AF79242.20406@oss.ntt.co.jp>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
> Hi all,
>
> It has been a while coming, but we have finally started work on Kemari's port to KVM. For those not familiar with it, Kemari provides the basic building block to create a virtualization-based fault tolerant machine: a virtual machine synchronization mechanism.
>
> Traditional high availability solutions can be classified into two groups: fault tolerant servers, and software clustering.
>
> Broadly speaking, fault tolerant servers protect us against hardware failures and, generally, rely on redundant hardware (often proprietary) and hardware failure detection to trigger fail-over.
>
> On the other hand, software clustering, as its name indicates, takes care of software failures and usually requires a standby server whose software configuration for the part we are trying to make fault tolerant must be identical to that of the active server.
>
> Both solutions may be applied to virtualized environments. Indeed, the current incarnation of Kemari (Xen-based) brings fault tolerant server-like capabilities to virtual machines, and integration with existing HA stacks (Heartbeat, RHCS, etc.) is under consideration.
>
> After some time on the drawing board we completed the basic design of Kemari for KVM, so we are sending an RFC at this point to get early feedback and, hopefully, get things right from the start. Those already familiar with Kemari and/or fault tolerance may want to skip the "Background" section and go directly to the design and implementation bits.
>
> This is a pretty long write-up, but please bear with me.
>
> == Background ==
>
> We started to play around with continuous virtual synchronization technology about 3 years ago. As development progressed and, most importantly, we got the first Xen-based working prototypes, it became clear that we needed a proper name for our toy: Kemari.
>
> The goal of Kemari is to provide a fault tolerant platform for virtualization environments, so that in the event of a hardware failure the virtual machine fails over from compromised to properly operating hardware (a physical machine) in a way that is completely transparent to the guest operating system.
>
> Although hardware-based fault tolerant servers and HA servers (software clustering) have been around for a (long) while, they typically require specifically designed hardware and/or modifications to applications.
> In contrast, by abstracting hardware using virtualization, Kemari can be used on off-the-shelf hardware and no application modifications are needed.
>
> After a period of in-house development the first version of Kemari for Xen was released in Nov 2008 as open source. However, by then it was already pretty clear that a KVM port would have several advantages. First, KVM is integrated into the Linux kernel, which means one gets support for a wide variety of hardware for free. Second, and in the same vein, KVM can also benefit from Linux's low-latency networking capabilities, including RDMA, which is of paramount importance for an extremely latency-sensitive functionality like Kemari. Last, but not least, KVM and its community are growing rapidly, and there is increasing demand for Kemari-like functionality for KVM.
>
> Although the basic design principles will remain the same, our plan is to write Kemari for KVM from scratch, since there does not seem to be much opportunity for sharing between Xen and KVM.
>
> == Design outline ==
>
> The basic premise of fault tolerant servers is that when things go awry with the hardware the running system should transparently continue execution on an alternate physical host. For this to be possible the state of the fallback host has to be identical to that of the primary.
>
> Kemari runs paired virtual machines in an active-passive configuration and achieves whole-system replication by continuously copying the state of the system (dirty pages and the state of the virtual devices) from the active node to the passive node. An interesting implication of this is that during normal operation only the active node is actually executing code.
>
> Another possible approach is to run a pair of systems in lock-step (à la VMware FT). Since both the primary and fallback virtual machines are active, keeping them synchronized is a complex task, which usually involves carefully injecting external events into both virtual machines so that they result in identical states.
>
> The latter approach is extremely architecture specific and not SMP friendly. This spurred us to try the design that became Kemari, which we believe lends itself to further optimizations.
>
> == Implementation ==
>
> The first step is to encapsulate the machine to be protected within a virtual machine. Then the live migration functionality is leveraged to keep the virtual machines synchronized.
>
> Whereas during live migration dirty pages can be sent asynchronously from the primary to the fallback server until the ratio of dirty pages is low enough to guarantee very short downtimes, when it comes to fault tolerance solutions, whenever a synchronization point is reached the changes to the virtual machine since the previous one have to be sent synchronously.
>
> Since the virtual machine has to be stopped until the data reaches and is acknowledged by the fallback server, the synchronization model is of critical importance for performance (both in terms of raw throughput and latencies). The model chosen for Kemari, along with other implementation details, is described below.
>
> * Synchronization model
>
> The synchronization points were carefully chosen to minimize the amount of traffic that goes over the wire while still keeping the FT pair consistent at all times. To be precise, Kemari uses events that modify externally visible state as synchronization points.
> This means that all outgoing I/O needs to be trapped and sent to the fallback host before the primary is resumed, so that it can be replayed in the face of hardware failure.
>
> The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP (Kemari may trigger hidden bugs in applications that use UDP or other unreliable protocols, so those may need minor changes to ensure they work properly after failover).
>
> The synchronization process can be broken down as follows:
>
> - Event tapping: On KVM all I/O generates a VMEXIT that is synchronously handled by the Linux kernel monitor, i.e. KVM (it is worth noting that this applies to virtio devices too, because they use MMIO and PIO just like a regular PCI device).
>
> - VCPU/Guest freezing: This is automatic in the UP case. In SMP environments we may need to send an IPI to stop the other VCPUs.
>
> - Notification to qemu: Taking a page from live migration's playbook, the synchronization process is user-space driven, which means that qemu needs to be woken up at each synchronization point. That is already the case for qemu-emulated devices, but we also have in-kernel emulators. To compound the problem, even for user-space emulated devices, accesses to coalesced MMIO areas cannot be detected. As a consequence we need a mechanism to communicate KVM-handled events to qemu.
>
> The channel for KVM-qemu communication can easily be built upon the existing infrastructure. We just need to add a new page to the kvm_run shared memory area that can be mmapped from user space and set the exit reason appropriately.
>
> Regarding in-kernel device emulators, we only need to care about writes. Specifically, making kvm_io_bus_write() fail when Kemari is activated and invoking the emulator again after re-entrance from user space should suffice (this is somewhat similar to what we do in kvm_arch_vcpu_ioctl_run() for MMIO reads).
>
> To avoid missing synchronization points one should be careful with coalesced MMIO-like optimizations. In the particular case of coalesced MMIO, the I/O operation that caused the exit to user space should act as a write barrier when it was due to an access to a non-coalesced MMIO area. This means that before proceeding to handle the exit in kvm_run() we have to make sure that all the coalesced MMIO has reached the fallback host.
>
> - Virtual machine synchronization: All the dirty pages since the last synchronization point and the state of the virtual devices are sent to the fallback node from the user-space qemu process. For this the existing savevm infrastructure and KVM's dirty page tracking

I failed to understand whether you take the lock-step approach and sync on every vmexit, making sure the shadow host will inject the irq on the original guest's instruction boundary, or alternatively use continuous live snapshots.

If you use live snapshots, why do you need to track mmio, etc.? Is it in order to save the device sync stage of live migration? To do that you would have to fully lock-step qemu execution (or send the entire vmstate to the slave). Isn't the device part << the dirty pages part?

Thanks,
Dor

> capabilities can be reused. Regarding in-kernel devices, with the likely advent of in-kernel virtio backends we need a generic way to access their state from user-space, for which, again, the kvm_run shared memory area could be used.
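For what it's worth, here is roughly how I picture the user-space half of the "notification to qemu" step above, as a sketch only: the KVM_RUN loop and the KVM_EXIT_IO/KVM_EXIT_MMIO cases follow the standard KVM userspace API, but KVM_EXIT_KEMARI_SYNC and kemari_send_dirty_state() are made-up names standing in for whatever you end up implementing.

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdio.h>

#define KVM_EXIT_KEMARI_SYNC 100        /* hypothetical new exit reason */

/* Stub: ship dirty pages + device state to the fallback and wait for its
 * ack.  A real implementation would live next to the savevm code. */
static int kemari_send_dirty_state(void)
{
        return 0;
}

static int vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
                        return -1;

                switch (run->exit_reason) {
                case KVM_EXIT_KEMARI_SYNC:
                        /* an externally visible event was trapped in the
                         * kernel: synchronize before the I/O takes effect */
                        if (kemari_send_dirty_state() < 0)
                                fprintf(stderr, "fallback lost, failing over\n");
                        break;
                case KVM_EXIT_IO:
                case KVM_EXIT_MMIO:
                        /* regular qemu device emulation; a sync point too */
                        break;
                default:
                        return 0;
                }
        }
}

That keeps the kernel side dumb (it only flags the event) and leaves the actual transfer in qemu, which seems to be what you are proposing.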
>
> - Virtual machine run: Execution of the virtual machine is resumed as soon as synchronization finishes.
>
> * Clock
>
> Even though we do not need to worry about the clock that provides the tick (the counter resides in memory, which we keep synchronized), the same does not apply to counters such as the TSC (we certainly want to avoid a situation where counters jump back in time right after fail-over, breaking guarantees such as monotonicity).
>
> To avoid big hiccups after migration the value of the TSC should be sent to the fallback node frequently. An access from the guest (through RDTSC, RDTSCP, RDMSR, or WRMSR) seems like the right moment to do this. Fortunately, both VMX and SVM provide controls to intercept accesses to the TSC, so it is just a matter of setting those appropriately (the "RDTSC exiting" VM-execution control, and the RDTSC, RDTSCP, RDMSR, WRMSR instruction intercepts, respectively). However, since synchronizing the virtual machines every time the TSC is accessed would be prohibitive, the transmission of the TSC will be done lazily, which means delaying it until a non-TSC synchronization point arrives.
>
> * Failover
>
> The failover process kicks in whenever a failure in the primary node is detected. At the time of writing we just ping the virtual machine periodically to determine whether it is still alive, but in the long term we have plans to integrate Kemari with the major HA stacks (Heartbeat, RHCS, etc.).
>
> Ideally, we would like to leverage the hardware failure detection capabilities of newish x86 hardware to trigger failover, the idea being that transferring control to the fallback node proactively when a problem is detected is much faster than relying on the polling mechanisms used by most HA software.
>
> Finally, to restore the virtual machine on the fallback host the loadvm infrastructure used for live migration is leveraged.
>
> * Further information
>
> Please visit the link below for additional information, including documentation and, most importantly, source code (for Xen only at the moment).
>
> http://www.osrg.net/kemari
> ==
>
> Any comments and suggestions would be greatly appreciated.
>
> If this is the right forum and people on the KVM mailing list do not mind, we would like to use the CC'ed mailing lists for Kemari development. Having more expert eyes looking at one's code always helps.
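And on the clock section above: if the TSC is forwarded lazily, I assume qemu would just sample it whenever convenient and attach it to the next non-TSC synchronization point. A minimal sketch of the sampling half, using the existing KVM_GET_MSRS interface (kemari_read_guest_tsc() is an invented name and error handling is trimmed; this is my reading of the proposal, not your code):

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <string.h>

#define MSR_IA32_TSC 0x00000010        /* architectural MSR index */

/* Read the current guest TSC so it can be stashed and piggybacked on the
 * next non-TSC synchronization point.  Returns 0 on success. */
static int kemari_read_guest_tsc(int vcpu_fd, uint64_t *tsc)
{
        struct {
                struct kvm_msrs hdr;
                struct kvm_msr_entry entry;
        } msrs;

        memset(&msrs, 0, sizeof(msrs));
        msrs.hdr.nmsrs = 1;
        msrs.entry.index = MSR_IA32_TSC;

        if (ioctl(vcpu_fd, KVM_GET_MSRS, &msrs) < 1)
                return -1;

        *tsc = msrs.entry.data;
        return 0;
}

Whether you read it from user space like this or capture it directly at the RDTSC/WRMSR intercept is probably just a question of where the bookkeeping is cheapest.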
>
> Thanks,
>
> Fernando
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html