From: Ming Lei <ming.lei@canonical.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Peter Maydell <peter.maydell@linaro.org>,
	Fam Zheng <famz@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Thu, 7 Aug 2014 18:27:55 +0800
Message-ID: <CACVXFVN6Jnt1b0_J1yf+Yi4ykw+bPj4ggeABuCiNe=_NvHe0WA@mail.gmail.com>
In-Reply-To: <20140806154041.GD4090@noname.str.redhat.com>

On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 06.08.2014 at 13:28, Ming Lei wrote:
>> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > On 06.08.2014 at 11:37, Ming Lei wrote:
>> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> >> > On 06.08.2014 at 07:33, Ming Lei wrote:
>> >> >> Hi Kevin,
>> >> >>
>> >> >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> >> >> > On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
>> >> >> >> I have been wondering how to prove that the root cause is the ucontext
>> >> >> >> coroutine mechanism (stack switching).  Here is an idea:
>> >> >> >>
>> >> >> >> Hack your "bypass" code path to run the request inside a coroutine.
>> >> >> >> That way you can compare "bypass without coroutine" against "bypass with
>> >> >> >> coroutine".
>> >> >> >>
>> >> >> >> Right now I think there are doubts because the bypass code path is
>> >> >> >> indeed a different (and not 100% correct) code path.  So this approach
>> >> >> >> might prove that the coroutines are adding the overhead and not
>> >> >> >> something that you bypassed.
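
Just to check that I understand the suggestion, something like this rough
sketch? (bypass_co_entry/bypass_submit_req are placeholders for whatever the
bypass path actually calls today, not real function names:)

    /* run the existing bypass submission inside a coroutine */
    static void coroutine_fn bypass_co_entry(void *opaque)
    {
        VirtIOBlockReq *req = opaque;

        bypass_submit_req(req);    /* same work as the current bypass path */
    }

    /* "bypass with coroutine": instead of calling bypass_submit_req(req)
     * directly from the virtqueue handler: */
    Coroutine *co = qemu_coroutine_create(bypass_co_entry);
    qemu_coroutine_enter(co, req);

That way the only difference between the two runs would be the coroutine
creation and enter itself.
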
>> >> >> >
>> >> >> > My doubts aren't only that the overhead might not come from the
>> >> >> > coroutines, but also whether any coroutine-related overhead is really
>> >> >> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
>> >> >> > just that instead of introducing additional code paths.
>> >> >>
>> >> >> OK, thank you for taking a look at the problem; I hope we can
>> >> >> figure out the root cause. :-)
>> >> >>
>> >> >> >
>> >> >> > Another thought I had was this: If the performance difference is indeed
>> >> >> > only coroutines, then that is completely inside the block layer and we
>> >> >> > don't actually need a VM to test it. We could instead have something
>> >> >> > like a simple qemu-img based benchmark and should be observing the same.
>> >> >>
>> >> >> It is even simpler to run a coroutine-only benchmark, so I wrote a
>> >> >> rough one, and it looks like coroutines do decrease performance a lot;
>> >> >> please see the attached patch, and thanks for your template, which
>> >> >> helped me add the 'co_bench' command to qemu-img.
>> >> >
>> >> > Yes, we can look at coroutines microbenchmarks in isolation. I actually
>> >> > did do that yesterday with the yield test from tests/test-coroutine.c.
>> >> > And in fact profiling immediately showed something to optimise:
>> >> > pthread_getspecific() was quite high, replacing it by __thread on
>> >> > systems where it works is more efficient and helped the numbers a bit.
>> >> > Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
>> >> > in qemu-img bench), maybe there's even something that can be done here.
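
For reference, my understanding of that __thread change is roughly the
following (the names are my guesses based on coroutine-ucontext.c, not your
actual patch):

    /* before: the per-thread state is looked up on every switch */
    CoroutineThreadState *s = pthread_getspecific(thread_state_key);
    Coroutine *self = s->current;

    /* after: direct TLS access on toolchains where __thread works */
    static __thread Coroutine *current;
    static __thread CoroutineUContext leader;

    Coroutine *self = current ? current : &leader.base;
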
>> >>
>> >> The lock/unlock in dataplane often comes from memory_region_find(), and
>> >> Paolo has already done lots of work on that.
>
> qemu-img bench doesn't run that code. We have a few more locks that are
> taken, and one of them (the coroutine pool lock) is avoided by your
> bypass patches.
>
>> >> >
>> >> > However, I just wasn't sure whether a change on this level would be
>> >> > relevant in a realistic environment. This is the reason why I wanted to
>> >> > get a benchmark involving the block layer and some I/O.
>> >> >
>> >> >> From the profiling data in below link:
>> >> >>
>> >> >>     http://pastebin.com/YwH2uwbq
>> >> >>
>> >> >> With coroutines, the running time for the same load increases by ~50%
>> >> >> (1.325s vs. 0.903s), dcache load events increase by ~35% (693M vs.
>> >> >> 512M), and instructions per cycle drop from 1.63 to 1.35, compared
>> >> >> with bypassing coroutines (the -b parameter).
>> >> >>
>> >> >> The bypass code in the benchmark is very similar to the approach
>> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
>> >> >> blocks in the kernel I/O path.
>> >> >>
>> >> >> Maybe the benchmark is a bit extreme, but given that modern storage
>> >> >> devices may reach millions of IOPS, it is very easy for coroutines to
>> >> >> slow down the I/O.
>> >> >
>> >> > I think in order to optimise coroutines, such benchmarks are fair game.
>> >> > It's just not guaranteed that the effects are exactly the same on real
>> >> > workloads, so we should take the results with a grain of salt.
>> >> >
>> >> > Anyhow, the coroutine version of your benchmark is buggy: it leaks all
>> >> > coroutines instead of exiting them, so it can't make any use of the
>> >> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> >> > version that simply removes the yield at the end):
>> >> >
>> >> >                 | bypass        | fixed coro    | buggy coro
>> >> > ----------------+---------------+---------------+--------------
>> >> > time            | 1.09s         | 1.10s         | 1.62s
>> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> >> > insns per cycle | 2.39          | 2.39          | 1.90
>> >> >
>> >> > That begs the question: do you see a similar effect on a real qemu because
>> >> > the coroutine pool is still not big enough? With correct use of
>> >> > coroutines, the difference seems to be barely measurable even without
>> >> > any I/O involved.
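
I see, so the difference is roughly this (do_one_request() is just a
placeholder for whatever the benchmark does per iteration):

    static void coroutine_fn co_entry(void *opaque)
    {
        do_one_request(opaque);

        /* buggy version: a qemu_coroutine_yield() here parks the coroutine
         * forever, so it never terminates, never goes back to the pool, and
         * every iteration pays for a fresh coroutine allocation */

        /* fixed version: simply returning terminates the coroutine, so
         * qemu_coroutine_create() can recycle it from the pool next time */
    }
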
>> >>
>> >> When I comment out qemu_coroutine_yield(), the results for bypass and
>> >> fixed coro look very similar to your test, and I am just wondering
>> >> whether the stack is always switched in qemu_coroutine_enter(), even
>> >> without calling qemu_coroutine_yield().
>> >
>> > Yes, definitely. qemu_coroutine_enter() always involves calling
>> > qemu_coroutine_switch(), which is the stack switch.
>> >
>> >> Without the yield, the benchmark can't emulate coroutine usage in the
>> >> bdrv_aio_readv/writev() path anymore, and the bypass in the patchset
>> >> skips two qemu_coroutine_enter() calls and one qemu_coroutine_yield()
>> >> for each bdrv_aio_readv/writev().
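
To spell out where those come from (roughly, from memory of the current code,
with most arguments omitted and names approximate):

    /* enter #1: bdrv_aio_readv() wraps the request in a coroutine */
    co = qemu_coroutine_create(bdrv_co_do_rw);
    qemu_coroutine_enter(co, acb);

    /* inside the coroutine, the driver AIO is submitted and we wait */
    drv->bdrv_aio_readv(bs, ..., bdrv_co_io_em_complete, &co_data);
    qemu_coroutine_yield();                /* the one yield per request */

    /* enter #2: the AIO completion callback resumes the coroutine */
    qemu_coroutine_enter(co, NULL);
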
>> >
>> > It's not completely comparable anyway because you're not going through a
>> > main loop and callbacks from there for your benchmark.
>> >
>> > But fair enough: Keep the yield, but enter the coroutine twice then. You
>> > get slightly worse results then, but that's more like doubling the very
>> > small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
>> > / 2.37), not like the horrible performance of the buggy version.
>>
>> Yes, I compared that too, and there looks to be no big difference.
>>
>> >
>> > Actually, that's within the error of measurement for time and
>> > insns/cycle, so running it for a bit longer:
>> >
>> >                 | bypass    | coro      | + yield   | buggy coro
>> > ----------------+-----------+-----------+-----------+--------------
>> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
>> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
>> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
>> >
>> >> >> > I played a bit with the following, I hope it's not too naive. I couldn't
>> >> >> > see a difference with your patches, but at least one reason for this is
>> >> >> > probably that my laptop SSD isn't fast enough to make the CPU the
>> >> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
>> >> >> > thing. (I actually wrote the patch up just for some profiling on my own,
>> >> >> > not for comparing throughput, but it should be usable for that as well.)
>> >> >>
>> >> >> This might not be good for the test since it is basically a sequential
>> >> >> read test, which can be optimized a lot by the kernel. I always use a
>> >> >> randread benchmark instead.
>> >> >
>> >> > Yes, I shortly pondered whether I should implement random offsets
>> >> > instead. But then I realised that a quicker kernel operation would only
>> >> > help the benchmark because we want it to test the CPU consumption in
>> >> > userspace. So the faster the kernel gets, the better for us, because it
>> >> > should make the impact of coroutines bigger.
>> >>
>> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
>>
>> I use the /dev/nullb0 block device for testing, which is available in Linux
>> kernel 3.13+; the difference follows, and it doesn't look very big (< 10%):
>
> Sounds useful. I'm running on an older kernel, so I used a loop-mounted
> file on tmpfs instead for my tests.

Actually the loop device is slow, and recently I used kernel AIO and blk-mq
to speed it up a lot.

>
> Anyway, at some point today I figured I should take a different approach
> and not try to minimise the problems that coroutines introduce, but
> rather make the most use of them when we have them. After all, the
> raw-posix driver is still very callback-oriented and does things that
> aren't really necessary with coroutines (such as AIOCB allocation).
>
> The qemu-img bench time I ended up with looked quite nice. Maybe you
> want to take a look if you can reproduce these results, both with
> qemu-img bench and your real benchmark.
>
>
> $ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
> Sending 2000000 requests, 4096 bytes each, 64 in parallel
>
>         bypass (base) | bypass (patch) | coro (base) | coro (patch)
> ----------------------+----------------+-------------+---------------
> run 1   0m5.966s      | 0m5.687s       |  0m6.224s   | 0m5.362s
> run 2   0m5.826s      | 0m5.831s       |  0m5.994s   | 0m5.541s
> run 3   0m6.145s      | 0m5.495s       |  0m6.253s   | 0m5.408s
> run 4   0m5.683s      | 0m5.527s       |  0m6.045s   | 0m5.293s
> run 5   0m5.904s      | 0m5.607s       |  0m6.238s   | 0m5.207s

I suggest running the test a bit longer.

>
> You can find my working tree at:
>
>     git://repo.or.cz/qemu/kevin.git perf-bypass

I just tried your working tree, and it looks like qemu-img works well
with your linux-aio coroutine patches, but unfortunately there is
little improvement observed on my server; basically the result is the
same as without bypass. On my laptop, the improvement can be observed,
but it is still at least 5% less than bypass.

Here is the result on my server:

ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 /dev/nullb5
Sending 6400000 requests, 4096 bytes each, 64 in parallel
    read time: 38351ms, 166.000000K IOPS
ming@:~/git/qemu$
ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 -b /dev/nullb5
Sending 6400000 requests, 4096 bytes each, 64 in parallel
    read time: 35241ms, 181.000000K IOPS

Also, there are some problems with your patches that prevent booting a
VM in my environment:

- __thread patch: it looks like no '__thread' is actually used, and the patch
basically makes bypass unworkable.

- the bdrv_co_writev callback isn't set for raw-posix, and it looks like my
rootfs needs to write during boot (see the rough sketch after this list)

- another problem, which I am still investigating: laio isn't accessible
in qemu_laio_process_completion() sometimes
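
For the second problem, what I expected to find in your tree is roughly
something like the following (raw_co_writev/raw_co_rw are just the names I
would guess from the read side, so this is only a sketch):

    static int coroutine_fn raw_co_writev(BlockDriverState *bs,
                                          int64_t sector_num, int nb_sectors,
                                          QEMUIOVector *qiov)
    {
        return raw_co_rw(bs, sector_num, nb_sectors, qiov, QEMU_AIO_WRITE);
    }

    static BlockDriver bdrv_file = {
        ...
        .bdrv_co_readv  = raw_co_readv,
        .bdrv_co_writev = raw_co_writev,   /* this one seems to be missing */
        ...
    };
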

Actually I do care about the performance boost from multi queue, since
multi-queue can improve performance a lot compared with QEMU 2.0.
Once I have fixed these problems, I will run a VM to test MQ performance
with the linux-aio coroutine. Could you give suggestions about these problems?

> Please note that I added an even worse and even wronger hack to keep the
> bypass working so I can compare it (raw-posix exposes now both bdrv_aio*
> and bdrv_co_*, and enabling the bypass also switches). Also, once the
> AIO code that I kept for the bypass mode is gone, we can make the
> coroutine path even nicer.

This approach looks nice since it saves the intermediate callback.

Basically the current bypass approach bypasses the coroutine in the block
layer, while linux-aio takes a new coroutine, so these are two different
paths. And linux-aio's coroutine can still be bypassed easily too. :-)


Thanks,
