From: Ming Lei
Date: Tue, 12 Aug 2014 16:12:03 +0800
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
To: Paolo Bonzini
Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi

On Tue, Aug 12, 2014 at 3:37 AM, Paolo Bonzini wrote:
> On 10/08/2014 05:46, Ming Lei wrote:
>> Hi Kevin, Paolo, Stefan and all,
>>
>> On Wed, 6 Aug 2014 10:48:55 +0200
>> Kevin Wolf wrote:
>>
>>> On 06.08.2014 at 07:33, Ming Lei wrote:
>>>
>>> Anyhow, the coroutine
>>> version of your benchmark is buggy: it leaks all
>>> coroutines instead of exiting them, so it can't make any use of the
>>> coroutine pool. On my laptop, I get this (where "fixed coro" is a
>>> version that simply removes the yield at the end):
>>>
>>>                 | bypass      | fixed coro  | buggy coro
>>> ----------------+-------------+-------------+--------------
>>> time            | 1.09s       | 1.10s       | 1.62s
>>> L1-dcache-loads | 921,836,360 | 932,781,747 | 1,298,067,438
>>> insns per cycle | 2.39        | 2.39        | 1.90
>>>
>>> This begs the question whether you see a similar effect on a real
>>> qemu and the coroutine pool is simply still not big enough. With
>>> correct use of coroutines, the difference seems to be barely
>>> measurable even without any I/O involved.
>>
>> Now I have fixed the coroutine leak bug. The previous crypt benchmark
>> put a rather high load on the CPU, so its operation rate was very low
>> (~40K/sec). I therefore wrote a new, simpler benchmark that generates
>> hundreds of thousands of operations per second, a rate that should
>> match fast storage devices, and it does show a significant effect from
>> coroutines.
>>
>> In the extreme case, where each iteration runs just a getppid()
>> syscall, only 3M operations/sec can be reached with coroutines, while
>> without coroutines the number reaches 16M/sec: more than a 4x
>> difference!

> I should be on vacation, but I'm following a couple of threads on the
> mailing list and I'm a bit tired of hearing the same argument again and
> again...

I am sorry to interrupt your vacation and make you tired, but the
discussion isn't simply repeating itself: something new comes up every
time, or at least most of the time.

> The different characteristics of asynchronous I/O vs. any synchronous
> workload are such that it is hard to be sure that microbenchmarks make
> sense.
I don't think this is related to asynchronous vs. synchronous I/O. There
is no sleep (or wait for completion) at all, and we can treat the
benchmark as AIO by regarding completion as a nop (AIO model: submit and
complete). IMO the getppid() benchmark is a simple simulation of
bdrv_aio_readv/writev() with I/O plug/unplug with respect to coroutine
usage.

BTW, do you agree with the computation of the coroutine cost in my
previous mail? I don't think that computation depends on the I/O type.

> The below patch is basically the minimal change to bypass coroutines.
> Of course the block.c part is not acceptable as is (the change to
> refresh_total_sectors is broken, the others are just ugly), but it is a
> start. Please run it with your fio workloads, or write an aio-based
> version of a qemu-img/qemu-io *I/O* benchmark.

Could you explain why the new change is introduced?

I will hold off on it until we can agree on the coroutine cost
computation, because that is very important for this discussion.

Thank you again for taking the time for this discussion.

Thanks,