From: Fernando Luis Vázquez Cao
Subject: Re: [RFC] KVM Fault Tolerance: Kemari for KVM
Date: Mon, 16 Nov 2009 23:18:42 +0900
Message-ID: <4B015F42.7070609@oss.ntt.co.jp>
In-Reply-To: <4AFFD96D.5090100@redhat.com>
References: <4AF79242.20406@oss.ntt.co.jp> <4AFFD96D.5090100@redhat.com>
To: Avi Kivity
Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org, "大村圭(oomura kei)",
    Yoshiaki Tamura, Takuya Yoshikawa, anthony@codemonkey.ws,
    Andrea Arcangeli, Chris Wright

Avi Kivity wrote:
> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>
>> Kemari runs paired virtual machines in an active-passive configuration
>> and achieves whole-system replication by continuously copying the
>> state of the system (dirty pages and the state of the virtual devices)
>> from the active node to the passive node. An interesting implication
>> of this is that during normal operation only the active node is
>> actually executing code.
>>
>
> Can you characterize the performance impact for various workloads? I
> assume you are running continuously in log-dirty mode. Doesn't this
> make memory intensive workloads suffer?

Yes, we're running continuously in log-dirty mode. We still do not have
numbers to show for KVM, but the snippets below from several runs of
lmbench using Xen+Kemari will give you an idea of what you can expect
in terms of overhead. All the tests were run using a fully virtualized
Debian guest with hardware nested paging enabled.

                         fork   exec     sh     P/F   C/S  [us]
------------------------------------------------------------------
Base                      114    349   1197  1.2845   8.2
Kemari(10GbE) + FC        141    403   1280  1.2835  11.6
Kemari(10GbE) + DRBD      161    415   1388  1.3145  11.6
Kemari(1GbE) + FC         151    410   1335  1.3370  11.5
Kemari(1GbE) + DRBD       162    413   1318  1.3239  11.6

* P/F = page fault, C/S = context switch

The benchmarks above are memory-intensive and, as you can see, the
overhead varies widely from 7% to 40%. We also measured CPU-bound
operations, but, as expected, Kemari incurred almost no overhead.

>> The synchronization process can be broken down as follows:
>>
>>   - Event tapping: On KVM all I/O generates a VMEXIT that is
>>     synchronously handled by the Linux kernel monitor, i.e. KVM (it is
>>     worth noting that this applies to virtio devices too, because they
>>     use MMIO and PIO just like a regular PCI device).
>
> Some I/O (virtio-based) is asynchronous, but you still have well-known
> tap points within qemu.

Yep, and in some cases we have polling from the backend, which I forgot
to mention in the RFC.

>>   - Notification to qemu: Taking a page from live migration's
>>     playbook, the synchronization process is user-space driven, which
>>     means that qemu needs to be woken up at each synchronization
>>     point. That is already the case for qemu-emulated devices, but we
>>     also have in-kernel emulators. To compound the problem, even for
>>     user-space emulated devices accesses to coalesced MMIO areas
>>     cannot be detected. As a consequence we need a mechanism to
>>     communicate KVM-handled events to qemu.
>
> Do you mean the ioapic, pic, and lapic?
Well, I was more worried about the in-kernel backends currently in the
works. To save the state of those devices we could leverage qemu's
vmstate infrastructure and even reuse struct VMStateDescription's
pre_save() callback, but we would like to pass the device state through
the kvm_run area to avoid an ioctl call right after returning to user
space.

> Perhaps it's best to start with those in userspace (-no-kvm-irqchip).

That's precisely what we were planning to do. Once we get a working
prototype we will take care of existing optimizations such as in-kernel
emulators and add our own.

> Why is access to those chips considered a synchronization point?

The main problem with those is that to get the chip state we use an
ioctl when we could have copied it to qemu's memory before going back
to user space. Not all accesses to those chips need to be treated as
synchronization points.

>>   - Virtual machine synchronization: All the dirty pages since the
>>     last synchronization point and the state of the virtual devices
>>     are sent to the fallback node from the user-space qemu process.
>>     For this the existing savevm infrastructure and KVM's dirty page
>>     tracking capabilities can be reused. Regarding in-kernel devices,
>>     with the likely advent of in-kernel virtio backends we need a
>>     generic way to access their state from user-space, for which,
>>     again, the kvm_run shared memory area could be used.
>
> I wonder if you can pipeline dirty memory synchronization. That is,
> write-protect those pages that are dirty, start copying them to the
> other side, and continue execution, copying memory if the guest faults
> it again.

Asynchronous transmission of dirty pages would be really helpful to
eliminate the performance hiccups that tend to occur at synchronization
points. What we can do is copy dirty pages asynchronously until we
reach a synchronization point, where we need to stop the guest and send
the remaining dirty pages and the state of devices to the other side.

However, we cannot delay the transmission of a dirty page across a
synchronization point, because if the primary node crashes before the
page reaches the fallback node, the I/O operation that caused the
synchronization point cannot be replayed reliably.

> How many pages do you copy per synchronization point for reasonably
> difficult workloads?

That is very workload-dependent, but if you take a look at the examples
below you can get a feeling for how Kemari behaves.

IOzone               Kemari sync interval [ms]   dirtied pages
---------------------------------------------------------------
buffered + fsync                           400            3000
O_SYNC                                      10              80

In summary, if the guest executes few I/O operations, the interval
between Kemari synchronization points will increase and the number of
dirtied pages will grow accordingly.
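
To make the flow discussed above a little more concrete, here is a
minimal C sketch of the per-synchronization-point loop. It is only
pseudocode: none of the kemari_*() helpers below exist, they merely
name the steps, and in a real implementation the dirty pages would come
from KVM's log-dirty mode (KVM_GET_DIRTY_LOG) and the device state
would go through the existing savevm/vmstate code.

/*
 * Pseudocode sketch of the Kemari loop described above.  All of the
 * kemari_*() helpers are hypothetical placeholders for the steps; the
 * real work would be done by KVM's log-dirty mode and the existing
 * savevm/vmstate infrastructure.
 */

#include <stdbool.h>

bool kemari_sync_point_pending(void);         /* tapped I/O event waiting?  */
void kemari_send_dirty_pages_async(void);     /* overlaps guest execution   */
void kemari_stop_guest(void);
void kemari_send_remaining_dirty_pages(void); /* flush what is still dirty  */
void kemari_send_device_state(void);          /* savevm/vmstate based       */
void kemari_wait_for_fallback_ack(void);
void kemari_release_pending_io(void);         /* let the tapped I/O proceed */
void kemari_resume_guest(void);

void kemari_sync_loop(void)
{
    for (;;) {
        /*
         * While the guest runs, push dirtied pages to the fallback node
         * in the background.  A page is never deferred past the
         * synchronization point that follows it.
         */
        while (!kemari_sync_point_pending())
            kemari_send_dirty_pages_async();

        /*
         * Synchronization point: stop the guest, transfer the remaining
         * dirty pages plus the virtual device state, and release the
         * tapped I/O only after the fallback node has acknowledged the
         * update, so the event can be replayed there if the primary
         * node crashes.
         */
        kemari_stop_guest();
        kemari_send_remaining_dirty_pages();
        kemari_send_device_state();
        kemari_wait_for_fallback_ack();
        kemari_release_pending_io();
        kemari_resume_guest();
    }
}

The key property is the one stated above: a dirty page may be sent
asynchronously, but never later than the synchronization point of the
I/O operation that dirtied it.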

Thanks,

Fernando