Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support

From: Ming Lei <tom.leiming@gmail.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, Fam Zheng <famz@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Fri, 15 Aug 2014 18:39:17 +0800
Message-ID: <CACVXFVNCg=VAOGOA1jpFH=mRi1OfRe_XXvDBbvsP4JdkPShQBQ@mail.gmail.com>
In-Reply-To: <20140814104637.GB3820@noname.redhat.com>

On Thu, Aug 14, 2014 at 6:46 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 11.08.2014 at 21:37, Paolo Bonzini wrote:
>> On 10/08/2014 05:46, Ming Lei wrote:
>> > Hi Kevin, Paolo, Stefan and all,
>> >
>> >
>> > On Wed, 6 Aug 2014 10:48:55 +0200
>> > Kevin Wolf <kwolf@redhat.com> wrote:
>> >
>> >> On 06.08.2014 at 07:33, Ming Lei wrote:
>> >
>> >>
>> >> Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>> >> coroutines instead of exiting them, so it can't make any use of the
>> >> coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> >> version that simply removes the yield at the end):
>> >>
>> >>                 | bypass        | fixed coro    | buggy coro
>> >> ----------------+---------------+---------------+--------------
>> >> time            | 1.09s         | 1.10s         | 1.62s
>> >> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> >> insns per cycle | 2.39          | 2.39          | 1.90
>> >>
>> >> Begs the question whether you see a similar effect on a real qemu and
>> >> the coroutine pool is still not big enough? With correct use of
>> >> coroutines, the difference seems to be barely measurable even without
>> >> any I/O involved.
>> >
>> > I have now fixed the coroutine leak bug. The previous crypt bench is a
>> > fairly heavy load, which makes its operations per second very low
>> > (~40K/sec), so I finally wrote a new and simple one that can generate
>> > hundreds of thousands of operations per second. That number should match
>> > some fast storage devices, and it does show that the effect of coroutines
>> > is not small.
>> >
>> > In the extreme case where only a getppid() syscall is run in each iteration,
>> > only 3M operations/sec can be reached with coroutines, while without
>> > coroutines the number reaches 16M/sec, a difference of more than 4 times!
>>
>> I should be on vacation, but I'm following a couple of threads on the mailing
>> list and I'm a bit tired of hearing the same argument again and again...
>>
>> The different characteristics of asynchronous I/O vs. any synchronous workload
>> are such that it is hard to be sure that microbenchmarks make sense.
>>
>> The below patch is basically the minimal change to bypass coroutines.  Of course
>> the block.c part is not acceptable as is (the change to refresh_total_sectors
>> is broken, the others are just ugly), but it is a start.  Please run it with
>> your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
>> benchmark.
>
> So to finally reply with some numbers... I'm running fio tests based on
> Ming's configuration on a loop-mounted tmpfs image using dataplane. I've
> extended the tests to not only test random reads, but also sequential
> reads. I have not yet tested writes, and have run almost no tests for block
> sizes larger than 4k, so I'm not including those here.
>
> The "base" case is with Ming's patches applied, but the set_bypass(true)
> call commented out in the virtio-blk code. All other cases are patches
> applied on top of this.
>
>                 | Random throughput | Sequential throughput
> ----------------+-------------------+-----------------------
> master          | 442 MB/s          | 730 MB/s
> base            | 453 MB/s          | 757 MB/s
> bypass (Ming)   | 461 MB/s          | 734 MB/s
> coroutine       | 468 MB/s          | 716 MB/s
> bypass (Paolo)  | 476 MB/s          | 682 MB/s

The difference between random read and sequential read looks
quite big, which shouldn't be the case since the whole file is
cached in RAM.
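
As a quick sanity check on the size of that gap, here is a small sketch
that computes the sequential/random ratio for each row, and the change
relative to "base". The names `results`, `seq_to_rand_ratio` and
`percent_change` are illustrative only; the throughput figures are copied
from the table above, nothing else is measured here.

```python
# Throughput in MB/s as (random, sequential), copied from the table above.
results = {
    "master": (442, 730),
    "base": (453, 757),
    "bypass (Ming)": (461, 734),
    "coroutine": (468, 716),
    "bypass (Paolo)": (476, 682),
}


def seq_to_rand_ratio(random_mbs, sequential_mbs):
    """Return sequential throughput divided by random throughput."""
    return sequential_mbs / random_mbs


def percent_change(value, baseline):
    """Return the change of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    base_rand, base_seq = results["base"]
    for name, (rand, seq) in results.items():
        print(
            f"{name:15s} seq/rand = {seq_to_rand_ratio(rand, seq):.2f}  "
            f"rand vs base = {percent_change(rand, base_rand):+.1f}%  "
            f"seq vs base = {percent_change(seq, base_seq):+.1f}%"
        )
```

With everything cached in RAM one would expect the seq/rand ratio to be
close to 1, so a ratio well above that in every row points at the setup
rather than at any single patch.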

>
> So while your patches look pretty good in Ming's test case of random
> reads, I think the sequential case is worrying. The same is true for my
> latest coroutine optimisations, even though the degradation is smaller
> there.

In my VM test, the random read and sequential read results are basically
the same, and the IO thread's CPU utilization is more than 93% with Paolo's
patch, over both nullblk and a loop device on a file in tmpfs.

I am using a 3.16 kernel.

>
> This needs some more investigation.

Maybe it is caused by your test setup and environment, or by your VM kernel;
I'm not sure.

Thanks,
-- 
Ming Lei