From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [RFC] KVM Fault Tolerance: Kemari for KVM
Date: Mon, 16 Nov 2009 16:49:35 +0200
Message-ID: <4B01667F.3000600@redhat.com>
References: <4AF79242.20406@oss.ntt.co.jp> <4AFFD96D.5090100@redhat.com> <4B015F42.7070609@oss.ntt.co.jp>
In-Reply-To: <4B015F42.7070609@oss.ntt.co.jp>
To: Fernando Luis Vázquez Cao
Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org, "大村圭(omura kei)", Yoshiaki Tamura, Takuya Yoshikawa, anthony@codemonkey.ws, Andrea Arcangeli, Chris Wright

On 11/16/2009 04:18 PM, Fernando Luis Vázquez Cao wrote:
> Avi Kivity wrote:
>> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>>
>>> Kemari runs paired virtual machines in an active-passive configuration
>>> and achieves whole-system replication by continuously copying the
>>> state of the system (dirty pages and the state of the virtual devices)
>>> from the active node to the passive node. An interesting implication
>>> of this is that during normal operation only the active node is
>>> actually executing code.
>>>
>>
>> Can you characterize the performance impact for various workloads? I
>> assume you are running continuously in log-dirty mode. Doesn't this
>> make memory intensive workloads suffer?
>
> Yes, we're running continuously in log-dirty mode.
>
> We still do not have numbers to show for KVM, but
> the snippets below from several runs of lmbench
> using Xen+Kemari will give you an idea of what you
> can expect in terms of overhead.
> All the tests were run using a fully virtualized Debian guest with
> hardware nested paging enabled.
>
>                          fork   exec     sh      P/F   C/S  [us]
> -----------------------------------------------------------------
> Base                      114    349   1197   1.2845    8.2
> Kemari(10GbE) + FC        141    403   1280   1.2835   11.6
> Kemari(10GbE) + DRBD      161    415   1388   1.3145   11.6
> Kemari(1GbE) + FC         151    410   1335   1.3370   11.5
> Kemari(1GbE) + DRBD       162    413   1318   1.3239   11.6
> * P/F = page fault, C/S = context switch
>
> The benchmarks above are memory intensive and, as you
> can see, the overhead varies widely from 7% to 40%.
> We also measured CPU bound operations, but, as expected,
> Kemari incurred almost no overhead.

Is lmbench fork that memory intensive?

Do you have numbers for benchmarks that use significant anonymous RSS?
Say, a parallel kernel build.

Note that scaling vcpus will increase a guest's memory-dirtying power,
but the snapshot rate will not scale in the same way.

>>>   - Notification to qemu: Taking a page from live migration's
>>>     playbook, the synchronization process is user-space driven, which
>>>     means that qemu needs to be woken up at each synchronization
>>>     point. That is already the case for qemu-emulated devices, but we
>>>     also have in-kernel emulators. To compound the problem, even for
>>>     user-space emulated devices, accesses to coalesced MMIO areas
>>>     cannot be detected. As a consequence we need a mechanism to
>>>     communicate KVM-handled events to qemu.
>>
>> Do you mean the ioapic, pic, and lapic?
>
> Well, I was more worried about the in-kernel backends currently in the
> works. To save the state of those devices we could leverage qemu's
> vmstate infrastructure and even reuse struct VMStateDescription's
> pre_save() callback, but we would like to pass the device state through
> the kvm_run area to avoid an ioctl call right after returning to user
> space.
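As an aside on the lmbench table above: the quoted "7% to 40%" range follows directly from the raw numbers. A quick sanity check (values copied verbatim from the table; nothing here is newly measured):

```python
# Recompute Kemari's relative overhead from the lmbench table above.
base = {"fork": 114, "exec": 349, "sh": 1197}
runs = {
    "Kemari(10GbE) + FC":   {"fork": 141, "exec": 403, "sh": 1280},
    "Kemari(10GbE) + DRBD": {"fork": 161, "exec": 415, "sh": 1388},
    "Kemari(1GbE) + FC":    {"fork": 151, "exec": 410, "sh": 1335},
    "Kemari(1GbE) + DRBD":  {"fork": 162, "exec": 413, "sh": 1318},
}

# Overhead in percent relative to the Base row, per (config, benchmark).
overheads = {
    (cfg, bench): round(100.0 * (val / base[bench] - 1), 1)
    for cfg, row in runs.items()
    for bench, val in row.items()
}

print(min(overheads.values()), max(overheads.values()))
# smallest: sh under Kemari(10GbE)+FC (~6.9%); largest: fork under
# Kemari(1GbE)+DRBD (~42.1%) -- i.e. the "7% to 40%" range quoted above
```

The spread also shows that fork is the hardest hit, which is what prompts the question below about how memory intensive it really is.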
Hm, let's defer all that until we have something working so we can
estimate the impact of userspace virtio in those circumstances.

>> Why is access to those chips considered a synchronization point?
>
> The main problem with those is that to get the chip state we
> use an ioctl when we could have copied it to qemu's memory
> before going back to user space. Not all accesses to those chips
> need to be treated as synchronization points.

Ok. Note that piggybacking on an exit will work for the lapic, but not
for the global irqchips (ioapic, pic) since they can still be modified
by another vcpu.

>> I wonder if you can pipeline dirty memory synchronization. That is,
>> write-protect those pages that are dirty, start copying them to the
>> other side, and continue execution, copying memory if the guest
>> faults it again.
>
> Asynchronous transmission of dirty pages would be really helpful to
> eliminate the performance hiccups that tend to occur at synchronization
> points. What we can do is to copy dirty pages asynchronously until we
> reach a synchronization point, where we need to stop the guest and send
> the remaining dirty pages and the state of devices to the other side.
>
> However, we cannot delay the transmission of a dirty page across a
> synchronization point, because if the primary node crashed before the
> page reached the fallback node, the I/O operation that caused the
> synchronization point could not be replayed reliably.

What I mean is:

- choose synchronization point A
- start copying memory for synchronization point A
  - output is delayed
- choose synchronization point B
- copy memory for A and B
  - if the guest touches memory not yet copied for A, COW it
- once copying for A is complete, release A's output
- continue copying memory for B
- choose synchronization point C

By keeping two synchronization points active, you don't have any pauses.
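The two-checkpoint scheme above can be sketched as a toy simulation (all names hypothetical, not Kemari code; "copying" here just moves page contents between dictionaries, and write protection is modelled as set membership):

```python
# Toy model of pipelined synchronization with two active checkpoints.
# At checkpoint A the dirty pages are (conceptually) write-protected and
# copied in the background; if the guest writes one of them before it has
# been sent, we COW the old contents so the A-epoch data can still be
# transmitted, and the page joins checkpoint B's dirty set.

class SyncPipeline:
    def __init__(self, memory):
        self.memory = memory   # guest memory: page number -> contents
        self.to_send = set()   # A-epoch pages not yet copied
        self.cow = {}          # A-epoch contents preserved by COW
        self.dirty_b = set()   # pages dirtied since checkpoint A
        self.received = {}     # fallback node's view of epoch A

    def checkpoint(self, dirty_pages):
        # Checkpoint A: remember which pages must be sent for this epoch.
        self.to_send = set(dirty_pages)

    def guest_write(self, page, value):
        # Write fault: page not yet copied for A gets COWed first.
        if page in self.to_send and page not in self.cow:
            self.cow[page] = self.memory[page]
        self.memory[page] = value
        self.dirty_b.add(page)  # it will be part of checkpoint B

    def copy_one(self):
        # Background copy of one A-epoch page to the fallback node.
        page = self.to_send.pop()
        self.received[page] = self.cow.get(page, self.memory[page])

    def a_complete(self):
        # Once true, A's delayed output can be released.
        return not self.to_send

memory = {0: "p0", 1: "p1", 2: "p2"}
pipe = SyncPipeline(memory)
pipe.checkpoint({0, 1, 2})    # synchronization point A
pipe.guest_write(1, "p1'")    # guest keeps running; page 1 is COWed
while not pipe.a_complete():
    pipe.copy_one()
# pipe.received now holds the A-epoch contents despite the later write,
# and checkpoint B will cover page 1.
```

The point of the model is that the guest never stops: the write to page 1 proceeds immediately, at the cost of one page copy.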
The cost is maintaining copy-on-write so we can keep copying dirty
pages for A while execution continues.

>> How many pages do you copy per synchronization point for reasonably
>> difficult workloads?
>
> That is very workload-dependent, but if you take a look at the examples
> below you can get a feeling of how Kemari behaves.
>
> IOzone             Kemari sync interval [ms]   dirtied pages
> -------------------------------------------------------------
> buffered + fsync                         400            3000
> O_SYNC                                    10              80
>
> In summary, if the guest executes few I/O operations, the interval
> between Kemari synchronization points will increase and the number of
> dirtied pages will grow accordingly.

In the example above, the externally observed latency grows to 400 ms, yes?

-- 
error compiling committee.c: too many arguments to function