From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yoshiaki Tamura
Subject: Re: [RFC] KVM Fault Tolerance: Kemari for KVM
Date: Tue, 17 Nov 2009 23:06:01 +0900
Message-ID: <87e9effc0911170606k2919eaa5v808ce3a90fff9d1a@mail.gmail.com>
In-Reply-To: <4B0293D9.7000302@redhat.com>
References: <4AF79242.20406@oss.ntt.co.jp> <4AFFD96D.5090100@redhat.com>
 <4B015F42.7070609@oss.ntt.co.jp> <4B01667F.3000600@redhat.com>
 <4B028334.1070004@lab.ntt.co.jp> <4B0293D9.7000302@redhat.com>
To: Avi Kivity
Cc: Fernando Luis Vázquez Cao, kvm@vger.kernel.org, qemu-devel@nongnu.org,
 Oomura Kei, Takuya Yoshikawa, anthony@codemonkey.ws, Andrea Arcangeli,
 Chris Wright

2009/11/17 Avi Kivity:
> On 11/17/2009 01:04 PM, Yoshiaki Tamura wrote:
>>>
>>> What I mean is:
>>>
>>> - choose synchronization point A
>>> - start copying memory for synchronization point A
>>>   - output is delayed
>>> - choose synchronization point B
>>> - copy memory for A and B
>>>   if guest touches memory not yet copied for A, COW it
>>> - once A copying is complete, release A output
>>> - continue copying memory for B
>>> - choose synchronization point C
>>>
>>> by keeping two synchronization points active, you don't have any
>>> pauses.  The cost is maintaining copy-on-write so we can copy dirty
>>> pages for A while keeping execution.
>>
>> The overall idea seems good, but if I'm understanding correctly, we
>> need a buffer for copying memory locally, and when it gets full, or
>> when we COW the memory for B, we still have to pause the guest to
>> prevent overwriting.  Correct?
>
> Yes.  During COW the guest would not be able to access the page, but if
> other vcpus access other pages, they can still continue.  So generally
> synchronization would be pauseless.

Understood.
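To make sure we are reading the scheme the same way, here is a rough
sketch of the bookkeeping as I understand it.  All names and types are
hypothetical and heavily simplified for illustration; this is not actual
Kemari or QEMU code:

  /* Checkpoint A is being transmitted while the guest keeps running and
   * dirtying pages for checkpoint B.  A guest write to a page that A
   * still needs triggers a copy-on-write, so A's snapshot stays
   * consistent without stopping the other vcpus. */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define PAGE_SIZE 4096
  #define NR_PAGES  1024

  static uint8_t guest_ram[NR_PAGES][PAGE_SIZE];
  static uint8_t *cow_copy[NR_PAGES]; /* private page copies for A */
  static int pending_a[NR_PAGES];     /* dirty at A, not yet sent */
  static int dirty_b[NR_PAGES];       /* dirtied since A was chosen */

  /* Write-protection fault path: preserve the page for A before the
   * guest modifies it, then let the write through and tag it for B. */
  static void guest_write(int pfn, size_t off, uint8_t val)
  {
      if (pending_a[pfn] && !cow_copy[pfn]) {
          cow_copy[pfn] = malloc(PAGE_SIZE);
          memcpy(cow_copy[pfn], guest_ram[pfn], PAGE_SIZE);
      }
      guest_ram[pfn][off] = val;
      dirty_b[pfn] = 1;
  }

  /* Transmit one page of checkpoint A; returns 1 while work remains.
   * Once it returns 0, A's delayed output can be released.  (A real
   * implementation would keep a list instead of rescanning.) */
  static int sync_one_page_for_a(void (*send)(int, const uint8_t *))
  {
      for (int pfn = 0; pfn < NR_PAGES; pfn++) {
          if (!pending_a[pfn])
              continue;
          send(pfn, cow_copy[pfn] ? cow_copy[pfn] : guest_ram[pfn]);
          free(cow_copy[pfn]);
          cow_copy[pfn] = NULL;
          pending_a[pfn] = 0;
          return 1;
      }
      return 0;
  }

  static void send_page(int pfn, const uint8_t *data)
  {
      (void)data;
      printf("sent page %d for checkpoint A\n", pfn);
  }

  int main(void)
  {
      pending_a[3] = 1;        /* page 3 was dirty at point A */
      guest_write(3, 0, 0xff); /* guest touches it: COW preserves it */
      while (sync_one_page_for_a(send_page))
          ;
      return 0;
  }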
>> To make things simple, we would like to start with synchronous
>> transmission first, and tackle asynchronous transmission later.
>
> Of course.  I'm just worried that realistic workloads will drive the
> latency beyond acceptable limits.

We're paying attention to this issue too, and would like to do more
advanced stuff once there is a toy that runs on KVM.

>>>>> How many pages do you copy per synchronization point for reasonably
>>>>> difficult workloads?
>>>>
>>>> That is very workload-dependent, but if you take a look at the
>>>> examples below you can get a feeling for how Kemari behaves.
>>>>
>>>> IOzone              Kemari sync interval [ms]   dirtied pages
>>>> ---------------------------------------------------------
>>>> buffered + fsync                        400             3000
>>>> O_SYNC                                   10               80
>>>>
>>>> In summary, if the guest executes few I/O operations, the interval
>>>> between Kemari synchronization points will increase and the number
>>>> of dirtied pages will grow accordingly.
>>>
>>> In the example above, the externally observed latency grows to
>>> 400 ms, yes?
>>
>> Not exactly.  The sync interval refers to the interval between the
>> synchronization points captured while the workload is running.  In
>> the example above, when the observed sync interval is 400ms, it takes
>> about 150ms to sync the VMs with 3000 dirtied pages.  Kemari resumes
>> I/O operations immediately once the synchronization is finished, and
>> thus the externally observed latency is 150ms in this case.
>
> Not sure I understand.
>
> If a packet is output from a guest immediately after a synchronization
> point, doesn't it need to be delayed until the next synchronization
> point?

Kemari kicks off synchronization in an event-driven manner: the outgoing
packet itself is captured as a synchronization point, and synchronization
starts immediately.

> So it's not just the guest pause time that matters, but also the
> interval between sync points?

It does matter, and in the case of Kemari the interval between sync
points varies depending on what kind of workload is running.

In the IOzone example above, two types of workloads are demonstrated.
Buffered writes w/ fsync create fewer sync points, which leads to a
longer sync interval and more dirtied pages.  On the other hand, O_SYNC
writes create more sync points, which leads to a shorter sync interval
and fewer dirtied pages.

The benefit of the event-driven approach is that you don't have to start
synchronization until there is a specific event to capture, no matter
how many pages the guest may have dirtied.
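The trigger itself would look something like the following rough sketch
(again with hypothetical names, not the actual Kemari implementation);
the point is that an outgoing packet is held back only for the duration
of the synchronization it triggers:

  /* The outgoing packet itself is the synchronization point: sync
   * starts immediately, and the packet is released once the secondary
   * has acknowledged the transferred state. */
  #include <stddef.h>
  #include <stdio.h>

  struct packet { const void *buf; size_t len; };

  /* Stub: in Kemari this would ship the pages dirtied since the last
   * event, plus device state, to the secondary and wait for its ack. */
  static void synchronize_with_secondary(void)
  {
      printf("sync: transferring dirtied pages, waiting for ack\n");
  }

  static void net_transmit(const struct packet *pkt)
  {
      printf("releasing %zu-byte packet\n", pkt->len);
  }

  /* Called where the emulated NIC hands a frame to the host.  If the
   * guest does no I/O, this never runs and no synchronization happens,
   * however many pages the guest has dirtied. */
  static void kemari_net_output(const struct packet *pkt)
  {
      synchronize_with_secondary();
      net_transmit(pkt);
  }

  int main(void)
  {
      struct packet pkt = { "frame", 5 };
      kemari_net_output(&pkt);
      return 0;
  }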
Thanks,

Yoshi