From: Yoshiaki Tamura
Subject: Re: [RFC] KVM Fault Tolerance: Kemari for KVM
Date: Tue, 17 Nov 2009 20:04:20 +0900
Message-ID: <4B028334.1070004@lab.ntt.co.jp>
In-Reply-To: <4B01667F.3000600@redhat.com>
References: <4AF79242.20406@oss.ntt.co.jp> <4AFFD96D.5090100@redhat.com> <4B015F42.7070609@oss.ntt.co.jp> <4B01667F.3000600@redhat.com>
To: Avi Kivity
Cc: Fernando Luis Vázquez Cao, kvm@vger.kernel.org, qemu-devel@nongnu.org, 大村圭 (oomura kei), Takuya Yoshikawa, anthony@codemonkey.ws, Andrea Arcangeli, Chris Wright

Avi Kivity wrote:
> On 11/16/2009 04:18 PM, Fernando Luis Vázquez Cao wrote:
>> Avi Kivity wrote:
>>> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>>>
>>>> Kemari runs paired virtual machines in an active-passive configuration
>>>> and achieves whole-system replication by continuously copying the
>>>> state of the system (dirty pages and the state of the virtual devices)
>>>> from the active node to the passive node. An interesting implication
>>>> of this is that during normal operation only the active node is
>>>> actually executing code.
>>>
>>> Can you characterize the performance impact for various workloads?  I
>>> assume you are running continuously in log-dirty mode.  Doesn't this
>>> make memory intensive workloads suffer?
>>
>> Yes, we're running continuously in log-dirty mode.
>>
>> We still do not have numbers to show for KVM, but the snippets below
>> from several runs of lmbench using Xen+Kemari will give you an idea of
>> what you can expect in terms of overhead. All the tests were run using
>> a fully virtualized Debian guest with hardware nested paging enabled.
>>
>>                        fork  exec    sh     P/F   C/S  [us]
>> ------------------------------------------------------------
>> Base                    114   349   1197  1.2845   8.2
>> Kemari(10GbE) + FC      141   403   1280  1.2835  11.6
>> Kemari(10GbE) + DRBD    161   415   1388  1.3145  11.6
>> Kemari(1GbE)  + FC      151   410   1335  1.3370  11.5
>> Kemari(1GbE)  + DRBD    162   413   1318  1.3239  11.6
>>   * P/F = page fault, C/S = context switch
>>
>> The benchmarks above are memory intensive and, as you can see, the
>> overhead varies widely from 7% to 40%. We also measured CPU-bound
>> operations, but, as expected, Kemari incurred almost no overhead.
>
> Is lmbench fork that memory intensive?
>
> Do you have numbers for benchmarks that use significant anonymous RSS?
> Say, a parallel kernel build.
>
> Note that scaling vcpus will increase a guest's memory-dirtying power
> but snapshot rate will not scale in the same way.

I don't think lmbench fork is that memory intensive, but it is sensitive
to memory latency. We'll measure kernel build time with a minimal config
and post the numbers later.
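In case it helps to make "running continuously in log-dirty mode" concrete:
the userspace side is essentially the same dirty-bitmap retrieval that live
migration already does. Below is a rough sketch, not the actual Kemari code;
count_dirty_pages() and its arguments are made-up names for illustration,
and error handling and slot bookkeeping are omitted.

    /* Fetch the dirty log for one memory slot and count the pages that
     * will have to be sent at the next synchronization point. */
    #include <linux/kvm.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static unsigned long count_dirty_pages(int vm_fd, int slot,
                                           unsigned long slot_npages,
                                           unsigned long *bitmap)
    {
        struct kvm_dirty_log log;
        unsigned long bits_per_long = sizeof(unsigned long) * 8;
        unsigned long words = (slot_npages + bits_per_long - 1) / bits_per_long;
        unsigned long i, dirty = 0;

        memset(&log, 0, sizeof(log));
        log.slot = slot;
        log.dirty_bitmap = bitmap;

        /* KVM_GET_DIRTY_LOG hands back the bitmap of pages dirtied since
         * the previous call and clears it in the kernel. */
        if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0)
            return 0;

        for (i = 0; i < words; i++)
            dirty += __builtin_popcountl(bitmap[i]);

        return dirty;
    }

At each synchronization point we would walk the memory slots this way and
transmit the pages whose bits are set in the bitmap.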
>>>> - Notification to qemu: Taking a page from live migration's
>>>>   playbook, the synchronization process is user-space driven, which
>>>>   means that qemu needs to be woken up at each synchronization
>>>>   point. That is already the case for qemu-emulated devices, but we
>>>>   also have in-kernel emulators. To compound the problem, even for
>>>>   user-space emulated devices, accesses to coalesced MMIO areas
>>>>   cannot be detected. As a consequence we need a mechanism to
>>>>   communicate KVM-handled events to qemu.
>>>
>>> Do you mean the ioapic, pic, and lapic?
>>
>> Well, I was more worried about the in-kernel backends currently in the
>> works. To save the state of those devices we could leverage qemu's
>> vmstate infrastructure and even reuse struct VMStateDescription's
>> pre_save() callback, but we would like to pass the device state through
>> the kvm_run area to avoid an ioctl call right after returning to user
>> space.
>
> Hm, let's defer all that until we have something working so we can
> estimate the impact of userspace virtio in those circumstances.

OK. We'll start implementing everything in userspace first.

>>> Why is access to those chips considered a synchronization point?
>>
>> The main problem with those is that to get the chip state we use an
>> ioctl when we could have copied it to qemu's memory before going back
>> to user space. Not all accesses to those chips need to be treated as
>> synchronization points.
>
> Ok.  Note that piggybacking on an exit will work for the lapic, but not
> for the global irqchips (ioapic, pic) since they can still be modified
> by another vcpu.
>
>>> I wonder if you can pipeline dirty memory synchronization.  That is,
>>> write-protect those pages that are dirty, start copying them to the
>>> other side, and continue execution, copying memory if the guest
>>> faults it again.
>>
>> Asynchronous transmission of dirty pages would be really helpful to
>> eliminate the performance hiccups that tend to occur at synchronization
>> points. What we can do is to copy dirty pages asynchronously until we
>> reach a synchronization point, where we need to stop the guest and send
>> the remaining dirty pages and the state of devices to the other side.
>>
>> However, we cannot delay the transmission of a dirty page across a
>> synchronization point, because if the primary node crashed before the
>> page reached the fallback node, the I/O operation that caused the
>> synchronization point could not be replayed reliably.
>
> What I mean is:
>
> - choose synchronization point A
> - start copying memory for synchronization point A
> - output is delayed
> - choose synchronization point B
> - copy memory for A and B
>   if guest touches memory not yet copied for A, COW it
> - once A copying is complete, release A output
> - continue copying memory for B
> - choose synchronization point B
>
> by keeping two synchronization points active, you don't have any
> pauses.  The cost is maintaining copy-on-write so we can copy dirty
> pages for A while keeping execution.

The overall idea seems good, but if I'm understanding correctly, we need
a buffer for copying memory locally, and when it gets full, or when we
COW the memory for B, we still have to pause the guest to prevent it from
being overwritten. Correct?

To keep things simple, we would like to start with synchronous
transmission first and tackle asynchronous transmission later.
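To spell out what we mean by synchronous transmission, the per-event flow
we have in mind looks roughly like the pseudo-C sketch below. The kemari_*
helpers are made-up names, not existing qemu interfaces, the vm_stop()/
vm_start() calls just stand for "pause/resume the vcpus", and the details
of how output is buffered and released are glossed over.

    /* Called when the guest triggers an event that must not become
     * visible outside the node before the fallback node is up to date
     * (e.g. an outgoing I/O request). */
    static int kemari_sync_point(void)
    {
        int ret;

        vm_stop(0);                           /* pause the vcpus             */

        ret = kemari_send_dirty_pages();      /* pages dirtied since last sync */
        if (ret == 0)
            ret = kemari_send_device_state(); /* vmstate of emulated devices */
        if (ret == 0)
            ret = kemari_wait_for_ack();      /* passive node has applied it */

        if (ret == 0)
            kemari_release_buffered_output(); /* let the event leave the node */

        vm_start();                           /* resume the guest            */
        return ret;
    }

The pipelined variant discussed above would essentially overlap
kemari_send_dirty_pages() with guest execution, which is where the COW
bookkeeping comes in.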
>>> How many pages do you copy per synchronization point for reasonably
>>> difficult workloads?
>>
>> That is very workload-dependent, but if you take a look at the examples
>> below you can get a feeling of how Kemari behaves.
>>
>> IOzone              Kemari sync interval [ms]   dirtied pages
>> -------------------------------------------------------------
>> buffered + fsync              400                    3000
>> O_SYNC                         10                      80
>>
>> In summary, if the guest executes few I/O operations, the interval
>> between Kemari synchronization points will increase and the number of
>> dirtied pages will grow accordingly.
>
> In the example above, the externally observed latency grows to 400 ms,
> yes?

Not exactly. The sync interval refers to the interval between the
synchronization points observed while the workload is running. In the
example above, when the observed sync interval is 400ms, it takes about
150ms to synchronize the VMs with 3000 dirtied pages. Kemari resumes I/O
operations immediately once the synchronization is finished, and thus the
externally observed latency is 150ms in this case.
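As a back-of-the-envelope check of those numbers (assuming 4KB pages and
the 1GbE configuration, which the table does not state explicitly): 3000
dirtied pages are roughly 3000 * 4KB = ~12MB of page payload, which alone
takes about 12MB / 125MB/s = ~96ms at 1GbE wire speed; the remaining part
of the ~150ms goes to scanning the dirty bitmap, transferring the device
state and protocol overhead.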
Thanks,

Yoshi