From: Yoshiaki Tamura
Subject: Re: [RFC] KVM Fault Tolerance: Kemari for KVM
Date: Tue, 17 Nov 2009 20:04:20 +0900
Message-ID: <4B028334.1070004@lab.ntt.co.jp>
In-Reply-To: <4B01667F.3000600@redhat.com>
References: <4AF79242.20406@oss.ntt.co.jp> <4AFFD96D.5090100@redhat.com> <4B015F42.7070609@oss.ntt.co.jp> <4B01667F.3000600@redhat.com>
To: Avi Kivity
Cc: Fernando Luis Vázquez Cao, kvm@vger.kernel.org, qemu-devel@nongnu.org, 大村圭 (oomura kei), Takuya Yoshikawa, anthony@codemonkey.ws, Andrea Arcangeli, Chris Wright

Avi Kivity wrote:
> On 11/16/2009 04:18 PM, Fernando Luis Vázquez Cao wrote:
>> Avi Kivity wrote:
>>> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>>>
>>>> Kemari runs paired virtual machines in an active-passive configuration
>>>> and achieves whole-system replication by continuously copying the
>>>> state of the system (dirty pages and the state of the virtual devices)
>>>> from the active node to the passive node. An interesting implication
>>>> of this is that during normal operation only the active node is
>>>> actually executing code.
>>>
>>> Can you characterize the performance impact for various workloads?  I
>>> assume you are running continuously in log-dirty mode.  Doesn't this
>>> make memory intensive workloads suffer?
>>
>> Yes, we're running continuously in log-dirty mode.
>>
>> We still do not have numbers to show for KVM, but the snippets below
>> from several runs of lmbench using Xen+Kemari will give you an idea of
>> what you can expect in terms of overhead. All the tests were run using
>> a fully virtualized Debian guest with hardware nested paging enabled.
>>
>>                        fork  exec    sh     P/F   C/S  [us]
>> ------------------------------------------------------------
>> Base                    114   349   1197  1.2845   8.2
>> Kemari(10GbE) + FC      141   403   1280  1.2835  11.6
>> Kemari(10GbE) + DRBD    161   415   1388  1.3145  11.6
>> Kemari(1GbE)  + FC      151   410   1335  1.3370  11.5
>> Kemari(1GbE)  + DRBD    162   413   1318  1.3239  11.6
>>   * P/F = page fault, C/S = context switch
>>
>> The benchmarks above are memory intensive and, as you can see, the
>> overhead varies widely from 7% to 40%. We also measured CPU-bound
>> operations, but, as expected, Kemari incurred almost no overhead.
>
> Is lmbench fork that memory intensive?
>
> Do you have numbers for benchmarks that use significant anonymous RSS?
> Say, a parallel kernel build.
>
> Note that scaling vcpus will increase a guest's memory-dirtying power
> but snapshot rate will not scale in the same way.

I don't think lmbench fork is that memory intensive, but it is sensitive
to memory latency. We'll measure kernel build time with a minimal config
and post the numbers later.
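In case it helps to make "running continuously in log-dirty mode" concrete:
the userspace side is essentially the same dirty-bitmap retrieval that live
migration already does. Below is a rough sketch, not the actual Kemari code;
count_dirty_pages() and its arguments are made-up names for illustration,
and error handling and slot bookkeeping are omitted.

    /* Fetch the dirty log for one memory slot and count the pages that
     * will have to be sent at the next synchronization point. */
    #include <linux/kvm.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static unsigned long count_dirty_pages(int vm_fd, int slot,
                                           unsigned long slot_npages,
                                           unsigned long *bitmap)
    {
        struct kvm_dirty_log log;
        unsigned long bits_per_long = sizeof(unsigned long) * 8;
        unsigned long words = (slot_npages + bits_per_long - 1) / bits_per_long;
        unsigned long i, dirty = 0;

        memset(&log, 0, sizeof(log));
        log.slot = slot;
        log.dirty_bitmap = bitmap;

        /* KVM_GET_DIRTY_LOG hands back the bitmap of pages dirtied since
         * the previous call and clears it in the kernel. */
        if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0)
            return 0;

        for (i = 0; i < words; i++)
            dirty += __builtin_popcountl(bitmap[i]);

        return dirty;
    }

At each synchronization point we would walk the memory slots this way and
transmit the pages whose bits are set in the bitmap.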
>>>> - Notification to qemu: Taking a page from live migration's
>>>>   playbook, the synchronization process is user-space driven, which
>>>>   means that qemu needs to be woken up at each synchronization
>>>>   point. That is already the case for qemu-emulated devices, but we
>>>>   also have in-kernel emulators. To compound the problem, even for
>>>>   user-space emulated devices, accesses to coalesced MMIO areas
>>>>   cannot be detected. As a consequence we need a mechanism to
>>>>   communicate KVM-handled events to qemu.
>>>
>>> Do you mean the ioapic, pic, and lapic?
>>
>> Well, I was more worried about the in-kernel backends currently in the
>> works. To save the state of those devices we could leverage qemu's
>> vmstate infrastructure and even reuse struct VMStateDescription's
>> pre_save() callback, but we would like to pass the device state through
>> the kvm_run area to avoid an ioctl call right after returning to user
>> space.
>
> Hm, let's defer all that until we have something working so we can
> estimate the impact of userspace virtio in those circumstances.

OK. We'll start implementing everything in userspace first.

>>> Why is access to those chips considered a synchronization point?
>>
>> The main problem with those is that to get the chip state we use an
>> ioctl when we could have copied it to qemu's memory before going back
>> to user space. Not all accesses to those chips need to be treated as
>> synchronization points.
>
> Ok.  Note that piggybacking on an exit will work for the lapic, but not
> for the global irqchips (ioapic, pic) since they can still be modified
> by another vcpu.
>
>>> I wonder if you can pipeline dirty memory synchronization.  That is,
>>> write-protect those pages that are dirty, start copying them to the
>>> other side, and continue execution, copying memory if the guest
>>> faults it again.
>>
>> Asynchronous transmission of dirty pages would be really helpful to
>> eliminate the performance hiccups that tend to occur at synchronization
>> points. What we can do is to copy dirty pages asynchronously until we
>> reach a synchronization point, where we need to stop the guest and send
>> the remaining dirty pages and the state of devices to the other side.
>>
>> However, we cannot delay the transmission of a dirty page across a
>> synchronization point, because if the primary node crashed before the
>> page reached the fallback node, the I/O operation that caused the
>> synchronization point could not be replayed reliably.
>
> What I mean is:
>
> - choose synchronization point A
> - start copying memory for synchronization point A
> - output is delayed
> - choose synchronization point B
> - copy memory for A and B
>   if guest touches memory not yet copied for A, COW it
> - once A copying is complete, release A output
> - continue copying memory for B
> - choose synchronization point B
>
> by keeping two synchronization points active, you don't have any
> pauses.  The cost is maintaining copy-on-write so we can copy dirty
> pages for A while keeping execution.

The overall idea seems good, but if I'm understanding correctly, we need
a buffer for copying memory locally, and when it gets full, or when we
COW the memory for B, we still have to pause the guest to prevent it from
being overwritten. Correct?

To keep things simple, we would like to start with synchronous
transmission first and tackle asynchronous transmission later.
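To spell out what we mean by synchronous transmission, the per-event flow
we have in mind looks roughly like the pseudo-C sketch below. The kemari_*
helpers are made-up names, not existing qemu interfaces, the vm_stop()/
vm_start() calls just stand for "pause/resume the vcpus", and the details
of how output is buffered and released are glossed over.

    /* Called when the guest triggers an event that must not become
     * visible outside the node before the fallback node is up to date
     * (e.g. an outgoing I/O request). */
    static int kemari_sync_point(void)
    {
        int ret;

        vm_stop(0);                           /* pause the vcpus             */

        ret = kemari_send_dirty_pages();      /* pages dirtied since last sync */
        if (ret == 0)
            ret = kemari_send_device_state(); /* vmstate of emulated devices */
        if (ret == 0)
            ret = kemari_wait_for_ack();      /* passive node has applied it */

        if (ret == 0)
            kemari_release_buffered_output(); /* let the event leave the node */

        vm_start();                           /* resume the guest            */
        return ret;
    }

The pipelined variant discussed above would essentially overlap
kemari_send_dirty_pages() with guest execution, which is where the COW
bookkeeping comes in.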
>>> How many pages do you copy per synchronization point for reasonably
>>> difficult workloads?
>>
>> That is very workload-dependent, but if you take a look at the examples
>> below you can get a feeling of how Kemari behaves.
>>
>> IOzone              Kemari sync interval [ms]   dirtied pages
>> -------------------------------------------------------------
>> buffered + fsync              400                    3000
>> O_SYNC                         10                      80
>>
>> In summary, if the guest executes few I/O operations, the interval
>> between Kemari synchronization points will increase and the number of
>> dirtied pages will grow accordingly.
>
> In the example above, the externally observed latency grows to 400 ms,
> yes?

Not exactly. The sync interval refers to the interval between the
synchronization points observed while the workload is running. In the
example above, when the observed sync interval is 400ms, it takes about
150ms to synchronize the VMs with 3000 dirtied pages. Kemari resumes I/O
operations immediately once the synchronization is finished, and thus the
externally observed latency is 150ms in this case.
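As a back-of-the-envelope check of those numbers (assuming 4KB pages and
the 1GbE configuration, which the table does not state explicitly): 3000
dirtied pages are roughly 3000 * 4KB = ~12MB of page payload, which alone
takes about 12MB / 125MB/s = ~96ms at 1GbE wire speed; the remaining part
of the ~150ms goes to scanning the dirty bitmap, transferring the device
state and protocol overhead.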
Thanks,

Yoshi