From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [RFC] KVM Fault Tolerance: Kemari for KVM
Date: Mon, 16 Nov 2009 16:49:35 +0200
Message-ID: <4B01667F.3000600@redhat.com>
References: <4AF79242.20406@oss.ntt.co.jp> <4AFFD96D.5090100@redhat.com> <4B015F42.7070609@oss.ntt.co.jp>
In-Reply-To: <4B015F42.7070609@oss.ntt.co.jp>
To: Fernando Luis Vázquez Cao
Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org, "大村圭(omura kei)", Yoshiaki Tamura, Takuya Yoshikawa, anthony@codemonkey.ws, Andrea Arcangeli, Chris Wright

On 11/16/2009 04:18 PM, Fernando Luis Vázquez Cao wrote:
> Avi Kivity wrote:
>> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>>
>>> Kemari runs paired virtual machines in an active-passive configuration
>>> and achieves whole-system replication by continuously copying the
>>> state of the system (dirty pages and the state of the virtual devices)
>>> from the active node to the passive node. An interesting implication
>>> of this is that during normal operation only the active node is
>>> actually executing code.
>>>
>>
>> Can you characterize the performance impact for various workloads? I
>> assume you are running continuously in log-dirty mode. Doesn't this
>> make memory intensive workloads suffer?
>
> Yes, we're running continuously in log-dirty mode.
>
> We still do not have numbers to show for KVM, but
> the snippets below from several runs of lmbench
> using Xen+Kemari will give you an idea of what you
> can expect in terms of overhead.
> All the tests were run using a fully virtualized Debian guest with
> hardware nested paging enabled.
>
>                          fork   exec     sh      P/F   C/S  [us]
> -----------------------------------------------------------------
> Base                      114    349   1197   1.2845    8.2
> Kemari(10GbE) + FC        141    403   1280   1.2835   11.6
> Kemari(10GbE) + DRBD      161    415   1388   1.3145   11.6
> Kemari(1GbE) + FC         151    410   1335   1.3370   11.5
> Kemari(1GbE) + DRBD       162    413   1318   1.3239   11.6
> * P/F = page fault, C/S = context switch
>
> The benchmarks above are memory intensive and, as you
> can see, the overhead varies widely from 7% to 40%.
> We also measured CPU bound operations, but, as expected,
> Kemari incurred almost no overhead.

Is lmbench fork that memory intensive?

Do you have numbers for benchmarks that use significant anonymous RSS?
Say, a parallel kernel build.

Note that scaling vcpus will increase a guest's memory-dirtying power,
but the snapshot rate will not scale in the same way.

>>>   - Notification to qemu: Taking a page from live migration's
>>>     playbook, the synchronization process is user-space driven, which
>>>     means that qemu needs to be woken up at each synchronization
>>>     point. That is already the case for qemu-emulated devices, but we
>>>     also have in-kernel emulators. To compound the problem, even for
>>>     user-space emulated devices, accesses to coalesced MMIO areas
>>>     cannot be detected. As a consequence we need a mechanism to
>>>     communicate KVM-handled events to qemu.
>>
>> Do you mean the ioapic, pic, and lapic?
>
> Well, I was more worried about the in-kernel backends currently in the
> works. To save the state of those devices we could leverage qemu's
> vmstate infrastructure and even reuse struct VMStateDescription's
> pre_save() callback, but we would like to pass the device state through
> the kvm_run area to avoid an ioctl call right after returning to user
> space.
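As an aside on the lmbench table above: the quoted "7% to 40%" range follows directly from the raw numbers. A quick sanity check (values copied verbatim from the table; nothing here is newly measured):

```python
# Recompute Kemari's relative overhead from the lmbench table above.
base = {"fork": 114, "exec": 349, "sh": 1197}
runs = {
    "Kemari(10GbE) + FC":   {"fork": 141, "exec": 403, "sh": 1280},
    "Kemari(10GbE) + DRBD": {"fork": 161, "exec": 415, "sh": 1388},
    "Kemari(1GbE) + FC":    {"fork": 151, "exec": 410, "sh": 1335},
    "Kemari(1GbE) + DRBD":  {"fork": 162, "exec": 413, "sh": 1318},
}

# Overhead in percent relative to the Base row, per (config, benchmark).
overheads = {
    (cfg, bench): round(100.0 * (val / base[bench] - 1), 1)
    for cfg, row in runs.items()
    for bench, val in row.items()
}

print(min(overheads.values()), max(overheads.values()))
# smallest: sh under Kemari(10GbE)+FC (~6.9%); largest: fork under
# Kemari(1GbE)+DRBD (~42.1%) -- i.e. the "7% to 40%" range quoted above
```

The spread also shows that fork is the hardest hit, which is what prompts the question below about how memory intensive it really is.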
Hm, let's defer all that until we have something working so we can
estimate the impact of userspace virtio in those circumstances.

>> Why is access to those chips considered a synchronization point?
>
> The main problem with those is that to get the chip state we
> use an ioctl when we could have copied it to qemu's memory
> before going back to user space. Not all accesses to those chips
> need to be treated as synchronization points.

Ok. Note that piggybacking on an exit will work for the lapic, but not
for the global irqchips (ioapic, pic) since they can still be modified
by another vcpu.

>> I wonder if you can pipeline dirty memory synchronization. That is,
>> write-protect those pages that are dirty, start copying them to the
>> other side, and continue execution, copying memory if the guest
>> faults it again.
>
> Asynchronous transmission of dirty pages would be really helpful to
> eliminate the performance hiccups that tend to occur at synchronization
> points. What we can do is to copy dirty pages asynchronously until we
> reach a synchronization point, where we need to stop the guest and send
> the remaining dirty pages and the state of devices to the other side.
>
> However, we cannot delay the transmission of a dirty page across a
> synchronization point, because if the primary node crashed before the
> page reached the fallback node, the I/O operation that caused the
> synchronization point could not be replayed reliably.

What I mean is:

- choose synchronization point A
- start copying memory for synchronization point A
  - output is delayed
- choose synchronization point B
- copy memory for A and B
  - if the guest touches memory not yet copied for A, COW it
- once copying for A is complete, release A's output
- continue copying memory for B
- choose synchronization point C

By keeping two synchronization points active, you don't have any pauses.
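The two-checkpoint scheme above can be sketched as a toy simulation (all names hypothetical, not Kemari code; "copying" here just moves page contents between dictionaries, and write protection is modelled as set membership):

```python
# Toy model of pipelined synchronization with two active checkpoints.
# At checkpoint A the dirty pages are (conceptually) write-protected and
# copied in the background; if the guest writes one of them before it has
# been sent, we COW the old contents so the A-epoch data can still be
# transmitted, and the page joins checkpoint B's dirty set.

class SyncPipeline:
    def __init__(self, memory):
        self.memory = memory   # guest memory: page number -> contents
        self.to_send = set()   # A-epoch pages not yet copied
        self.cow = {}          # A-epoch contents preserved by COW
        self.dirty_b = set()   # pages dirtied since checkpoint A
        self.received = {}     # fallback node's view of epoch A

    def checkpoint(self, dirty_pages):
        # Checkpoint A: remember which pages must be sent for this epoch.
        self.to_send = set(dirty_pages)

    def guest_write(self, page, value):
        # Write fault: page not yet copied for A gets COWed first.
        if page in self.to_send and page not in self.cow:
            self.cow[page] = self.memory[page]
        self.memory[page] = value
        self.dirty_b.add(page)  # it will be part of checkpoint B

    def copy_one(self):
        # Background copy of one A-epoch page to the fallback node.
        page = self.to_send.pop()
        self.received[page] = self.cow.get(page, self.memory[page])

    def a_complete(self):
        # Once true, A's delayed output can be released.
        return not self.to_send

memory = {0: "p0", 1: "p1", 2: "p2"}
pipe = SyncPipeline(memory)
pipe.checkpoint({0, 1, 2})    # synchronization point A
pipe.guest_write(1, "p1'")    # guest keeps running; page 1 is COWed
while not pipe.a_complete():
    pipe.copy_one()
# pipe.received now holds the A-epoch contents despite the later write,
# and checkpoint B will cover page 1.
```

The point of the model is that the guest never stops: the write to page 1 proceeds immediately, at the cost of one page copy.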
The cost is maintaining copy-on-write so we can keep copying dirty
pages for A while execution continues.

>> How many pages do you copy per synchronization point for reasonably
>> difficult workloads?
>
> That is very workload-dependent, but if you take a look at the examples
> below you can get a feeling of how Kemari behaves.
>
> IOzone             Kemari sync interval [ms]   dirtied pages
> -------------------------------------------------------------
> buffered + fsync                         400            3000
> O_SYNC                                    10              80
>
> In summary, if the guest executes few I/O operations, the interval
> between Kemari synchronization points will increase and the number of
> dirtied pages will grow accordingly.

In the example above, the externally observed latency grows to 400 ms, yes?

-- 
error compiling committee.c: too many arguments to function