From: Ming Lei <ming.lei@canonical.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Peter Maydell <peter.maydell@linaro.org>,
	Fam Zheng <famz@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Thu, 7 Aug 2014 18:27:55 +0800
Message-ID: <CACVXFVN6Jnt1b0_J1yf+Yi4ykw+bPj4ggeABuCiNe=_NvHe0WA@mail.gmail.com>
In-Reply-To: <20140806154041.GD4090@noname.str.redhat.com>

On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 06.08.2014 at 13:28, Ming Lei wrote:
>> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > On 06.08.2014 at 11:37, Ming Lei wrote:
>> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> >> > On 06.08.2014 at 07:33, Ming Lei wrote:
>> >> >> Hi Kevin,
>> >> >>
>> >> >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> >> >> > On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
>> >> >> >> I have been wondering how to prove that the root cause is the ucontext
>> >> >> >> coroutine mechanism (stack switching).  Here is an idea:
>> >> >> >>
>> >> >> >> Hack your "bypass" code path to run the request inside a coroutine.
>> >> >> >> That way you can compare "bypass without coroutine" against "bypass with
>> >> >> >> coroutine".
>> >> >> >>
>> >> >> >> Right now I think there are doubts because the bypass code path is
>> >> >> >> indeed a different (and not 100% correct) code path.  So this approach
>> >> >> >> might prove that the coroutines are adding the overhead and not
>> >> >> >> something that you bypassed.
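
Just to check that I understand the suggestion, something like this rough
sketch? (bypass_co_entry/bypass_submit_req are placeholders for whatever the
bypass path actually calls today, not real function names:)

    /* run the existing bypass submission inside a coroutine */
    static void coroutine_fn bypass_co_entry(void *opaque)
    {
        VirtIOBlockReq *req = opaque;

        bypass_submit_req(req);    /* same work as the current bypass path */
    }

    /* "bypass with coroutine": instead of calling bypass_submit_req(req)
     * directly from the virtqueue handler: */
    Coroutine *co = qemu_coroutine_create(bypass_co_entry);
    qemu_coroutine_enter(co, req);

That way the only difference between the two runs would be the coroutine
creation and enter itself.
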
>> >> >> >
>> >> >> > My doubts aren't only that the overhead might not come from the
>> >> >> > coroutines, but also whether any coroutine-related overhead is really
>> >> >> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
>> >> >> > just that instead of introducing additional code paths.
>> >> >>
>> >> >> OK, thank you for taking a look at the problem; I hope we can
>> >> >> figure out the root cause. :-)
>> >> >>
>> >> >> >
>> >> >> > Another thought I had was this: If the performance difference is indeed
>> >> >> > only coroutines, then that is completely inside the block layer and we
>> >> >> > don't actually need a VM to test it. We could instead have something
>> >> >> > like a simple qemu-img based benchmark and should be observing the same.
>> >> >>
>> >> >> It is even simpler to run a coroutine-only benchmark, so I wrote a
>> >> >> rough one, and it looks like coroutines do decrease performance a lot;
>> >> >> please see the attached patch, and thanks for your template, which
>> >> >> helped me add the 'co_bench' command to qemu-img.
>> >> >
>> >> > Yes, we can look at coroutines microbenchmarks in isolation. I actually
>> >> > did do that yesterday with the yield test from tests/test-coroutine.c.
>> >> > And in fact profiling immediately showed something to optimise:
>> >> > pthread_getspecific() was quite high, replacing it by __thread on
>> >> > systems where it works is more efficient and helped the numbers a bit.
>> >> > Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
>> >> > in qemu-img bench), maybe there's even something that can be done here.
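
For reference, my understanding of that __thread change is roughly the
following (the names are my guesses based on coroutine-ucontext.c, not your
actual patch):

    /* before: the per-thread state is looked up on every switch */
    CoroutineThreadState *s = pthread_getspecific(thread_state_key);
    Coroutine *self = s->current;

    /* after: direct TLS access on toolchains where __thread works */
    static __thread Coroutine *current;
    static __thread CoroutineUContext leader;

    Coroutine *self = current ? current : &leader.base;
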
>> >>
>> >> The lock/unlock in dataplane often comes from memory_region_find(), and
>> >> Paolo has already done lots of work on that.
>
> qemu-img bench doesn't run that code. We have a few more locks that are
> taken, and one of them (the coroutine pool lock) is avoided by your
> bypass patches.
>
>> >> >
>> >> > However, I just wasn't sure whether a change on this level would be
>> >> > relevant in a realistic environment. This is the reason why I wanted to
>> >> > get a benchmark involving the block layer and some I/O.
>> >> >
>> >> >> From the profiling data in below link:
>> >> >>
>> >> >>     http://pastebin.com/YwH2uwbq
>> >> >>
>> >> >> With coroutines, the running time for the same load increases by ~50%
>> >> >> (1.325s vs. 0.903s), dcache load events increase by ~35% (693M vs.
>> >> >> 512M), and instructions per cycle drop from 1.63 to 1.35, compared
>> >> >> with bypassing coroutines (the -b parameter).
>> >> >>
>> >> >> The bypass code in the benchmark is very similar to the approach
>> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
>> >> >> blocks in the kernel I/O path.
>> >> >>
>> >> >> Maybe the benchmark is a bit extreme, but given that modern storage
>> >> >> devices may reach millions of IOPS, it is very easy for coroutines to
>> >> >> slow down the I/O.
>> >> >
>> >> > I think in order to optimise coroutines, such benchmarks are fair game.
>> >> > It's just not guaranteed that the effects are exactly the same on real
>> >> > workloads, so we should take the results with a grain of salt.
>> >> >
>> >> > Anyhow, the coroutine version of your benchmark is buggy: it leaks all
>> >> > coroutines instead of exiting them, so it can't make any use of the
>> >> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> >> > version that simply removes the yield at the end):
>> >> >
>> >> >                 | bypass        | fixed coro    | buggy coro
>> >> > ----------------+---------------+---------------+--------------
>> >> > time            | 1.09s         | 1.10s         | 1.62s
>> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> >> > insns per cycle | 2.39          | 2.39          | 1.90
>> >> >
>> >> > That begs the question: do you see a similar effect on a real qemu because
>> >> > the coroutine pool is still not big enough? With correct use of
>> >> > coroutines, the difference seems to be barely measurable even without
>> >> > any I/O involved.
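
I see, so the difference is roughly this (do_one_request() is just a
placeholder for whatever the benchmark does per iteration):

    static void coroutine_fn co_entry(void *opaque)
    {
        do_one_request(opaque);

        /* buggy version: a qemu_coroutine_yield() here parks the coroutine
         * forever, so it never terminates, never goes back to the pool, and
         * every iteration pays for a fresh coroutine allocation */

        /* fixed version: simply returning terminates the coroutine, so
         * qemu_coroutine_create() can recycle it from the pool next time */
    }
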
>> >>
>> >> When I comment out qemu_coroutine_yield(), the results for bypass and
>> >> fixed coro look very similar to your test, and I am just wondering
>> >> whether the stack is always switched in qemu_coroutine_enter(), even
>> >> without calling qemu_coroutine_yield().
>> >
>> > Yes, definitely. qemu_coroutine_enter() always involves calling
>> > qemu_coroutine_switch(), which is the stack switch.
>> >
>> >> Without the yield, the benchmark can't emulate coroutine usage in the
>> >> bdrv_aio_readv/writev() path anymore, and the bypass in the patchset
>> >> skips two qemu_coroutine_enter() calls and one qemu_coroutine_yield()
>> >> for each bdrv_aio_readv/writev().
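
To spell out where those come from (roughly, from memory of the current code,
with most arguments omitted and names approximate):

    /* enter #1: bdrv_aio_readv() wraps the request in a coroutine */
    co = qemu_coroutine_create(bdrv_co_do_rw);
    qemu_coroutine_enter(co, acb);

    /* inside the coroutine, the driver AIO is submitted and we wait */
    drv->bdrv_aio_readv(bs, ..., bdrv_co_io_em_complete, &co_data);
    qemu_coroutine_yield();                /* the one yield per request */

    /* enter #2: the AIO completion callback resumes the coroutine */
    qemu_coroutine_enter(co, NULL);
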
>> >
>> > It's not completely comparable anyway because you're not going through a
>> > main loop and callbacks from there for your benchmark.
>> >
>> > But fair enough: Keep the yield, but enter the coroutine twice then. You
>> > get slightly worse results then, but that's more like doubling the very
>> > small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
>> > / 2.37), not like the horrible performance of the buggy version.
>>
>> Yes, I compared that too, and there looks to be no big difference.
>>
>> >
>> > Actually, that's within the error of measurement for time and
>> > insns/cycle, so running it for a bit longer:
>> >
>> >                 | bypass    | coro      | + yield   | buggy coro
>> > ----------------+-----------+-----------+-----------+--------------
>> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
>> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
>> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
>> >
>> >> >> > I played a bit with the following, I hope it's not too naive. I couldn't
>> >> >> > see a difference with your patches, but at least one reason for this is
>> >> >> > probably that my laptop SSD isn't fast enough to make the CPU the
>> >> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
>> >> >> > thing. (I actually wrote the patch up just for some profiling on my own,
>> >> >> > not for comparing throughput, but it should be usable for that as well.)
>> >> >>
>> >> >> This might not be good for the test since it is basically a sequential
>> >> >> read test, which can be optimized a lot by the kernel. I always use a
>> >> >> randread benchmark instead.
>> >> >
>> >> > Yes, I shortly pondered whether I should implement random offsets
>> >> > instead. But then I realised that a quicker kernel operation would only
>> >> > help the benchmark because we want it to test the CPU consumption in
>> >> > userspace. So the faster the kernel gets, the better for us, because it
>> >> > should make the impact of coroutines bigger.
>> >>
>> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
>>
>> I use the /dev/nullb0 block device for testing, which is available in Linux
>> kernel 3.13+; the difference follows, and it doesn't look very big (< 10%):
>
> Sounds useful. I'm running on an older kernel, so I used a loop-mounted
> file on tmpfs instead for my tests.

Actually the loop device is slow, and recently I used kernel AIO and blk-mq
to speed it up a lot.

>
> Anyway, at some point today I figured I should take a different approach
> and not try to minimise the problems that coroutines introduce, but
> rather make the most use of them when we have them. After all, the
> raw-posix driver is still very callback-oriented and does things that
> aren't really necessary with coroutines (such as AIOCB allocation).
>
> The qemu-img bench time I ended up with looked quite nice. Maybe you
> want to take a look if you can reproduce these results, both with
> qemu-img bench and your real benchmark.
>
>
> $ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
> Sending 2000000 requests, 4096 bytes each, 64 in parallel
>
>         bypass (base) | bypass (patch) | coro (base) | coro (patch)
> ----------------------+----------------+-------------+---------------
> run 1   0m5.966s      | 0m5.687s       |  0m6.224s   | 0m5.362s
> run 2   0m5.826s      | 0m5.831s       |  0m5.994s   | 0m5.541s
> run 3   0m6.145s      | 0m5.495s       |  0m6.253s   | 0m5.408s
> run 4   0m5.683s      | 0m5.527s       |  0m6.045s   | 0m5.293s
> run 5   0m5.904s      | 0m5.607s       |  0m6.238s   | 0m5.207s

I suggest running the test a bit longer.

>
> You can find my working tree at:
>
>     git://repo.or.cz/qemu/kevin.git perf-bypass

I just tried your working tree, and it looks like qemu-img works well
with your linux-aio coroutine patches, but unfortunately there is
little improvement observed on my server; basically the result is the
same as without bypass. On my laptop, the improvement can be observed,
but it is still at least 5% less than bypass.

Here is the result on my server:

ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 /dev/nullb5
Sending 6400000 requests, 4096 bytes each, 64 in parallel
    read time: 38351ms, 166.000000K IOPS
ming@:~/git/qemu$
ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 -b /dev/nullb5
Sending 6400000 requests, 4096 bytes each, 64 in parallel
    read time: 35241ms, 181.000000K IOPS

Also, there are some problems with your patches that prevent booting a
VM in my environment:

- __thread patch: it looks like no '__thread' is actually used, and the patch
basically makes bypass unworkable.

- the bdrv_co_writev callback isn't set for raw-posix, and it looks like my
rootfs needs to write during boot (see the rough sketch after this list)

- another problem, which I am still investigating: laio isn't accessible
in qemu_laio_process_completion() sometimes
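
For the second problem, what I expected to find in your tree is roughly
something like the following (raw_co_writev/raw_co_rw are just the names I
would guess from the read side, so this is only a sketch):

    static int coroutine_fn raw_co_writev(BlockDriverState *bs,
                                          int64_t sector_num, int nb_sectors,
                                          QEMUIOVector *qiov)
    {
        return raw_co_rw(bs, sector_num, nb_sectors, qiov, QEMU_AIO_WRITE);
    }

    static BlockDriver bdrv_file = {
        ...
        .bdrv_co_readv  = raw_co_readv,
        .bdrv_co_writev = raw_co_writev,   /* this one seems to be missing */
        ...
    };
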

Actually I do care about the performance boost from multi queue, since
multi-queue can improve performance a lot compared with QEMU 2.0.
Once I have fixed these problems, I will run a VM to test MQ performance
with the linux-aio coroutine. Could you give suggestions about these problems?

> Please note that I added an even worse and even wronger hack to keep the
> bypass working so I can compare it (raw-posix exposes now both bdrv_aio*
> and bdrv_co_*, and enabling the bypass also switches). Also, once the
> AIO code that I kept for the bypass mode is gone, we can make the
> coroutine path even nicer.

This approach looks nice since it saves the intermediate callback.

Basically the current bypass approach bypasses the coroutine in the block
layer, while linux-aio takes a new coroutine, so these are two different
paths. And linux-aio's coroutine can still be bypassed easily too. :-)


Thanks,
