From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yoshiaki Tamura
Subject: Re: [RFC] KVM Fault Tolerance: Kemari for KVM
Date: Tue, 17 Nov 2009 23:06:01 +0900
Message-ID: <87e9effc0911170606k2919eaa5v808ce3a90fff9d1a@mail.gmail.com>
In-Reply-To: <4B0293D9.7000302@redhat.com>
References: <4AF79242.20406@oss.ntt.co.jp> <4AFFD96D.5090100@redhat.com>
 <4B015F42.7070609@oss.ntt.co.jp> <4B01667F.3000600@redhat.com>
 <4B028334.1070004@lab.ntt.co.jp> <4B0293D9.7000302@redhat.com>
To: Avi Kivity
Cc: Fernando Luis Vázquez Cao, kvm@vger.kernel.org, qemu-devel@nongnu.org,
 Oomura Kei, Takuya Yoshikawa, anthony@codemonkey.ws, Andrea Arcangeli,
 Chris Wright

2009/11/17 Avi Kivity:
> On 11/17/2009 01:04 PM, Yoshiaki Tamura wrote:
>>>
>>> What I mean is:
>>>
>>> - choose synchronization point A
>>> - start copying memory for synchronization point A
>>>   - output is delayed
>>> - choose synchronization point B
>>> - copy memory for A and B
>>>   if guest touches memory not yet copied for A, COW it
>>> - once A copying is complete, release A output
>>> - continue copying memory for B
>>> - choose synchronization point C
>>>
>>> by keeping two synchronization points active, you don't have any
>>> pauses.  The cost is maintaining copy-on-write so we can copy dirty
>>> pages for A while keeping execution.
>>
>> The overall idea seems good, but if I'm understanding correctly, we
>> need a buffer for copying memory locally, and when it gets full, or
>> when we COW the memory for B, we still have to pause the guest to
>> prevent overwriting.  Correct?
>
> Yes.  During COW the guest would not be able to access the page, but if
> other vcpus access other pages, they can still continue.  So generally
> synchronization would be pauseless.

Understood.
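To make sure we are reading the scheme the same way, here is a rough
sketch of the bookkeeping as I understand it.  All names and types are
hypothetical and heavily simplified for illustration; this is not actual
Kemari or QEMU code:

  /* Checkpoint A is being transmitted while the guest keeps running and
   * dirtying pages for checkpoint B.  A guest write to a page that A
   * still needs triggers a copy-on-write, so A's snapshot stays
   * consistent without stopping the other vcpus. */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define PAGE_SIZE 4096
  #define NR_PAGES  1024

  static uint8_t guest_ram[NR_PAGES][PAGE_SIZE];
  static uint8_t *cow_copy[NR_PAGES]; /* private page copies for A */
  static int pending_a[NR_PAGES];     /* dirty at A, not yet sent */
  static int dirty_b[NR_PAGES];       /* dirtied since A was chosen */

  /* Write-protection fault path: preserve the page for A before the
   * guest modifies it, then let the write through and tag it for B. */
  static void guest_write(int pfn, size_t off, uint8_t val)
  {
      if (pending_a[pfn] && !cow_copy[pfn]) {
          cow_copy[pfn] = malloc(PAGE_SIZE);
          memcpy(cow_copy[pfn], guest_ram[pfn], PAGE_SIZE);
      }
      guest_ram[pfn][off] = val;
      dirty_b[pfn] = 1;
  }

  /* Transmit one page of checkpoint A; returns 1 while work remains.
   * Once it returns 0, A's delayed output can be released.  (A real
   * implementation would keep a list instead of rescanning.) */
  static int sync_one_page_for_a(void (*send)(int, const uint8_t *))
  {
      for (int pfn = 0; pfn < NR_PAGES; pfn++) {
          if (!pending_a[pfn])
              continue;
          send(pfn, cow_copy[pfn] ? cow_copy[pfn] : guest_ram[pfn]);
          free(cow_copy[pfn]);
          cow_copy[pfn] = NULL;
          pending_a[pfn] = 0;
          return 1;
      }
      return 0;
  }

  static void send_page(int pfn, const uint8_t *data)
  {
      (void)data;
      printf("sent page %d for checkpoint A\n", pfn);
  }

  int main(void)
  {
      pending_a[3] = 1;        /* page 3 was dirty at point A */
      guest_write(3, 0, 0xff); /* guest touches it: COW preserves it */
      while (sync_one_page_for_a(send_page))
          ;
      return 0;
  }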
>> To make things simple, we would like to start with synchronous
>> transmission first, and tackle asynchronous transmission later.
>
> Of course.  I'm just worried that realistic workloads will drive the
> latency beyond acceptable limits.

We're paying attention to this issue too, and would like to do more
advanced stuff once there is a toy that runs on KVM.

>>>>> How many pages do you copy per synchronization point for reasonably
>>>>> difficult workloads?
>>>>
>>>> That is very workload-dependent, but if you take a look at the
>>>> examples below you can get a feeling for how Kemari behaves.
>>>>
>>>> IOzone              Kemari sync interval [ms]   dirtied pages
>>>> ---------------------------------------------------------
>>>> buffered + fsync                        400             3000
>>>> O_SYNC                                   10               80
>>>>
>>>> In summary, if the guest executes few I/O operations, the interval
>>>> between Kemari synchronization points will increase and the number
>>>> of dirtied pages will grow accordingly.
>>>
>>> In the example above, the externally observed latency grows to
>>> 400 ms, yes?
>>
>> Not exactly.  The sync interval refers to the interval between the
>> synchronization points captured while the workload is running.  In
>> the example above, when the observed sync interval is 400ms, it takes
>> about 150ms to sync the VMs with 3000 dirtied pages.  Kemari resumes
>> I/O operations immediately once the synchronization is finished, and
>> thus the externally observed latency is 150ms in this case.
>
> Not sure I understand.
>
> If a packet is output from a guest immediately after a synchronization
> point, doesn't it need to be delayed until the next synchronization
> point?

Kemari kicks off synchronization in an event-driven manner: the outgoing
packet itself is captured as a synchronization point, and synchronization
starts immediately.

> So it's not just the guest pause time that matters, but also the
> interval between sync points?

It does matter, and in the case of Kemari the interval between sync
points varies depending on what kind of workload is running.

In the IOzone example above, two types of workloads are demonstrated.
Buffered writes w/ fsync create fewer sync points, which leads to a
longer sync interval and more dirtied pages.  On the other hand, O_SYNC
writes create more sync points, which leads to a shorter sync interval
and fewer dirtied pages.

The benefit of the event-driven approach is that you don't have to start
synchronization until there is a specific event to capture, no matter
how many pages the guest may have dirtied.
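The trigger itself would look something like the following rough sketch
(again with hypothetical names, not the actual Kemari implementation);
the point is that an outgoing packet is held back only for the duration
of the synchronization it triggers:

  /* The outgoing packet itself is the synchronization point: sync
   * starts immediately, and the packet is released once the secondary
   * has acknowledged the transferred state. */
  #include <stddef.h>
  #include <stdio.h>

  struct packet { const void *buf; size_t len; };

  /* Stub: in Kemari this would ship the pages dirtied since the last
   * event, plus device state, to the secondary and wait for its ack. */
  static void synchronize_with_secondary(void)
  {
      printf("sync: transferring dirtied pages, waiting for ack\n");
  }

  static void net_transmit(const struct packet *pkt)
  {
      printf("releasing %zu-byte packet\n", pkt->len);
  }

  /* Called where the emulated NIC hands a frame to the host.  If the
   * guest does no I/O, this never runs and no synchronization happens,
   * however many pages the guest has dirtied. */
  static void kemari_net_output(const struct packet *pkt)
  {
      synchronize_with_secondary();
      net_transmit(pkt);
  }

  int main(void)
  {
      struct packet pkt = { "frame", 5 };
      kemari_net_output(&pkt);
      return 0;
  }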
Thanks,

Yoshi