From: Ming Lei
Date: Tue, 12 Aug 2014 16:12:03 +0800
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
To: Paolo Bonzini
Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi

On Tue, Aug 12, 2014 at 3:37 AM, Paolo Bonzini wrote:
> On 10/08/2014 05:46, Ming Lei wrote:
>> Hi Kevin, Paolo, Stefan and all,
>>
>> On Wed, 6 Aug 2014 10:48:55 +0200
>> Kevin Wolf wrote:
>>
>>> On 06.08.2014 at 07:33, Ming Lei wrote:
>>>
>>> Anyhow, the coroutine
>>> version of your benchmark is buggy: it leaks all
>>> coroutines instead of exiting them, so it can't make any use of the
>>> coroutine pool. On my laptop, I get this (where "fixed coro" is a
>>> version that simply removes the yield at the end):
>>>
>>>                 | bypass      | fixed coro  | buggy coro
>>> ----------------+-------------+-------------+--------------
>>> time            | 1.09s       | 1.10s       | 1.62s
>>> L1-dcache-loads | 921,836,360 | 932,781,747 | 1,298,067,438
>>> insns per cycle | 2.39        | 2.39        | 1.90
>>>
>>> This begs the question whether you see a similar effect on a real
>>> qemu and the coroutine pool is simply still not big enough. With
>>> correct use of coroutines, the difference seems to be barely
>>> measurable even without any I/O involved.
>>
>> Now I have fixed the coroutine leak bug. The previous crypt benchmark
>> put a rather high load on the CPU, so its operation rate was very low
>> (~40K/sec). I therefore wrote a new, simpler benchmark that generates
>> hundreds of thousands of operations per second, a rate that should
>> match fast storage devices, and it does show a significant effect from
>> coroutines.
>>
>> In the extreme case, where each iteration runs just a getppid()
>> syscall, only 3M operations/sec can be reached with coroutines, while
>> without coroutines the number reaches 16M/sec: more than a 4x
>> difference!

> I should be on vacation, but I'm following a couple of threads on the
> mailing list and I'm a bit tired of hearing the same argument again and
> again...

I am sorry to interrupt your vacation and make you tired, but the
discussion isn't simply repeating itself: something new comes up every
time, or at least most of the time.

> The different characteristics of asynchronous I/O vs. any synchronous
> workload are such that it is hard to be sure that microbenchmarks make
> sense.
I don't think this is related to asynchronous vs. synchronous I/O. There
is no sleep (or wait for completion) at all, and we can treat the
benchmark as AIO by regarding completion as a nop (AIO model: submit and
complete). IMO the getppid() benchmark is a simple simulation of
bdrv_aio_readv/writev() with I/O plug/unplug with respect to coroutine
usage.

BTW, do you agree with the computation of the coroutine cost in my
previous mail? I don't think that computation depends on the I/O type.

> The below patch is basically the minimal change to bypass coroutines.
> Of course the block.c part is not acceptable as is (the change to
> refresh_total_sectors is broken, the others are just ugly), but it is a
> start. Please run it with your fio workloads, or write an aio-based
> version of a qemu-img/qemu-io *I/O* benchmark.

Could you explain why the new change is introduced?

I will hold off on it until we can agree on the coroutine cost
computation, because that is very important for this discussion.

Thank you again for taking the time for this discussion.

Thanks,