* [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
@ 2014-06-26 15:14 Ming Lei
  2014-06-26 15:29 ` Paolo Bonzini
  2014-06-27 12:01 ` Stefan Hajnoczi
  0 siblings, 2 replies; 43+ messages in thread
From: Ming Lei @ 2014-06-26 15:14 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, Michael S. Tsirkin

Hi Stefan,

I found that VM block I/O throughput has decreased by more than 40%
on my laptop, and it looks much worse in my server environment.
It is caused by your commit 580b6b2aa2:

          dataplane: use the QEMU block layer for I/O

I ran fio with the config below to test random reads:

[global]
direct=1
size=4G
bsrange=4k-4k
timeout=20
numjobs=4
ioengine=libaio
iodepth=64
filename=/dev/vdc
group_reporting=1

[f]
rw=randread

Along with the throughput drop, the latency improves a little.

With this commit, the I/O blocks submitted to the filesystem become much
smaller than before, and more io_submit() calls need to be made to the
kernel, which means the effective iodepth may become much lower.

I am not surprised by the result, since I previously compared VM I/O
performance between QEMU and lkvm (which has no big-QEMU-lock problem
and handles I/O in a dedicated thread). Even so, lkvm's block I/O
throughput is still much worse than QEMU's, IMO because lkvm doesn't
submit block I/O in batches the way the previous dataplane did.

Now that you have changed the way I/O is submitted, could you share
the motivation for the change? Is the throughput drop what you expected?



Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-26 15:14 [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2 Ming Lei
@ 2014-06-26 15:29 ` Paolo Bonzini
  2014-06-26 15:37   ` Ming Lei
  2014-06-27 12:01 ` Stefan Hajnoczi
  1 sibling, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2014-06-26 15:29 UTC (permalink / raw)
  To: Ming Lei, Stefan Hajnoczi, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

On 26/06/2014 17:14, Ming Lei wrote:
> Hi Stefan,
>
> I found that VM block I/O throughput has decreased by more than 40%
> on my laptop, and it looks much worse in my server environment.
> It is caused by your commit 580b6b2aa2:
>
>           dataplane: use the QEMU block layer for I/O
> [...]

What is your elevator setting in both the host and the guest?  Usually 
deadline gives the best performance.

Paolo


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-26 15:29 ` Paolo Bonzini
@ 2014-06-26 15:37   ` Ming Lei
  2014-06-26 15:43     ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2014-06-26 15:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin

On Thu, Jun 26, 2014 at 11:29 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 26/06/2014 17:14, Ming Lei wrote:
>> [...]
>
> What is your elevator setting in both the host and the guest?  Usually
> deadline gives the best performance.

The test was based on cfq, but I just ran a quick test with deadline,
and there looks to be no obvious difference.


Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-26 15:37   ` Ming Lei
@ 2014-06-26 15:43     ` Paolo Bonzini
  2014-06-26 15:47       ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2014-06-26 15:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin

On 26/06/2014 17:37, Ming Lei wrote:
> On Thu, Jun 26, 2014 at 11:29 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> [...]
>> What is your elevator setting in both the host and the guest?  Usually
>> deadline gives the best performance.
>
> The test was based on cfq, but I just ran a quick test with deadline,
> and there looks to be no obvious difference.

Can you give us your QEMU command line?

Paolo


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-26 15:43     ` Paolo Bonzini
@ 2014-06-26 15:47       ` Ming Lei
  2014-06-26 15:57         ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2014-06-26 15:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin

On Thu, Jun 26, 2014 at 11:43 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 26/06/2014 17:37, Ming Lei wrote:
>> [...]
>> The test was based on cfq, but I just ran a quick test with deadline,
>> and there looks to be no obvious difference.
>
> Can you give us your QEMU command line?

The data.img file is placed on ext4 over an SSD, and a basically similar
result can be observed with /dev/nullb1 as the backend too.

/home/tom/git/other/qemu/x86_64-softmmu/qemu-system-x86_64 \
    -name 'kvm-test' \
    -M pc \
    -vga none \
    -drive id=drive_image1,if=none,format=raw,cache=none,aio=native,file=/mnt/ssd/img/f19-fs.img \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=1,scsi=off,config-wce=off,x-data-plane=on,bus=pci.0,addr=02 \
    -drive id=drive_image3,if=none,format=raw,cache=none,aio=native,file=/dev/nullb1 \
    -device virtio-blk-pci,id=image3,drive=drive_image3,bootindex=3,scsi=off,config-wce=off,x-data-plane=on,bus=pci.0,addr=04 \
    -drive id=drive_image2,if=none,format=raw,cache=none,aio=native,file=/mnt/ssd/img/data.img \
    -device virtio-blk-pci,id=image2,drive=drive_image2,bootindex=2,scsi=off,config-wce=off,x-data-plane=on,bus=pci.0,addr=03 \
    -netdev user,id=idabMX4S,hostfwd=tcp::5000-:22 \
    -device virtio-net-pci,mac=9a:be:bf:c0:c1:c2,id=idDyAIbK,vectors=4,netdev=idabMX4S,bus=pci.0,addr=08 \
    -m 1024 \
    -smp 4,maxcpus=4 \
    -kernel /mnt/ssd/git/linux-2.6/linux-2.6-next/arch/x86_64/boot/bzImage \
    -append 'earlyprintk console=ttyS0 mem=1024M rootfstype=ext4 root=/dev/vda rw virtio_blk.queue_depth=128 loglevel=9 no_console_suspend ip=dhcp ftrace_dump_on_oops' \
    -nographic \
    -rtc base=utc,clock=host,driftfix=none \
    -enable-kvm



Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-26 15:47       ` Ming Lei
@ 2014-06-26 15:57         ` Paolo Bonzini
  2014-06-27  1:15           ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2014-06-26 15:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin

This is indeed a difference between the ioq-based and block-based 
backends.  ioq could submit more than one request with the same 
io_submit system call.

We can implement (advisory) calls like bdrv_plug/bdrv_unplug in order to 
restore the previous levels of performance.

Note that some fallout of the conversion was expected.  Dataplane told 
us experimentally what level of performance could be reached, but was a 
dead end in terms of functionality.  Now Stefan added a whole lot of 
functionality to dataplane (accounting, throttling, file formats and 
protocols, thread-pool based I/O, etc.) and we need to bring back any 
performance we lost in the process.

Out of curiosity, how does aio=threads (after the patch) fare in 
comparison to aio=native (before the patch)?

Paolo


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-26 15:57         ` Paolo Bonzini
@ 2014-06-27  1:15           ` Ming Lei
  2014-06-27  4:59             ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2014-06-27  1:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin

On Thu, Jun 26, 2014 at 11:57 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> This is indeed a difference between the ioq-based and block-based backends.
> ioq could submit more than one request with the same io_submit system call.
>

Yes, I have been thinking that this was the advantage of QEMU's virtio
dataplane, but it isn't any longer; QEMU dataplane no longer has any
performance advantage over lkvm.

> We can implement (advisory) calls like bdrv_plug/bdrv_unplug in order to
> restore the previous levels of performance.

Yes, that is also what I am thinking, or interfaces like bdrv_queue_io()
and bdrv_submit_io(), which may map better onto the AIO interfaces.

>
> Note that some fallout of the conversion was expected.  Dataplane told us
> experimentally what level of performance could be reached, but was a dead
> end in terms of functionality.  Now Stefan added a whole lot of
> functionality to dataplane (accounting, throttling, file formats and
> protocols, thread-pool based I/O, etc.) and we need to bring back any
> performance we lost in the process.

These features are very good, but it looks like the conversion came a bit early. :-(

>
> Out of curiosity, how does aio=threads (after the patch) fare in comparison
> to aio=native (before the patch)?

There looks to be no obvious difference.

Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-27  1:15           ` Ming Lei
@ 2014-06-27  4:59             ` Paolo Bonzini
  2014-06-27  6:23               ` Kevin Wolf
  2014-06-27  7:57               ` Ming Lei
  0 siblings, 2 replies; 43+ messages in thread
From: Paolo Bonzini @ 2014-06-27  4:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin

On 27/06/2014 03:15, Ming Lei wrote:
> On Thu, Jun 26, 2014 at 11:57 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> We can implement (advisory) calls like bdrv_plug/bdrv_unplug in order to
>> restore the previous levels of performance.
>
> Yes, that is also what I am thinking, or interfaces like bdrv_queue_io()
> and bdrv_submit_io(), which may map better onto the AIO interfaces.

Would you like to try preparing a patch?

>>
>> Note that some fallout of the conversion was expected.  Dataplane told us
>> experimentally what level of performance could be reached, but was a dead
>> end in terms of functionality.  Now Stefan added a whole lot of
>> functionality to dataplane (accounting, throttling, file formats and
>> protocols, thread-pool based I/O, etc.) and we need to bring back any
>> performance we lost in the process.
>
> These features are very good, but it looks like the conversion came a bit early. :-(

Dataplane is still (and has always been) experimental.  For now, it's a 
playground to get rid of the big QEMU lock in hot paths.  As such, 
performance going up and down is expected.  The good thing is that every 
performance improvement we do now will not be restricted to dataplane, 
it can be applied just as well to any other device.

Paolo


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-27  4:59             ` Paolo Bonzini
@ 2014-06-27  6:23               ` Kevin Wolf
  2014-06-27  7:35                 ` Paolo Bonzini
  2014-06-27 12:35                 ` Ming Lei
  2014-06-27  7:57               ` Ming Lei
  1 sibling, 2 replies; 43+ messages in thread
From: Kevin Wolf @ 2014-06-27  6:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ming Lei, Fam Zheng, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin

On 27.06.2014 at 06:59, Paolo Bonzini wrote:
> On 27/06/2014 03:15, Ming Lei wrote:
>> On Thu, Jun 26, 2014 at 11:57 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> We can implement (advisory) calls like bdrv_plug/bdrv_unplug in order to
>>> restore the previous levels of performance.
>>
>> Yes, that is also what I am thinking, or interfaces like bdrv_queue_io()
>> and bdrv_submit_io(), which may map better onto the AIO interfaces.
>
> Would you like to try preparing a patch?

Note that there is already an interface in block.c that takes multiple
requests at once, bdrv_aio_multiwrite(). It is currently used by
virtio-blk, even though not in dataplane mode. It also submits
individual requests to the block drivers currently, so effectively it
doesn't make a difference; the problem just occurs in the block layer
instead of the device.

We should either improve bdrv_aio_multiwrite() to submit the requests in
a batch to the block drivers and use it for dataplane as well (possibly
with a flag to disable request merging if we want to keep the current
behaviour for dataplane); or, if we consider it a bad interface, replace
it altogether with the new mechanism, even for normal virtio-blk.

If this makes a difference for dataplane, it probably makes a difference
for all block devices.

Kevin


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-27  6:23               ` Kevin Wolf
@ 2014-06-27  7:35                 ` Paolo Bonzini
  2014-06-27 12:35                 ` Ming Lei
  1 sibling, 0 replies; 43+ messages in thread
From: Paolo Bonzini @ 2014-06-27  7:35 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Ming Lei, Fam Zheng, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin

On 27/06/2014 08:23, Kevin Wolf wrote:
> Note that there is already an interface in block.c that takes multiple
> requests at once, bdrv_aio_multiwrite(). It is currently used by
> virtio-blk, even though not in dataplane mode. It also submits
> individual requests to the block drivers currently, so effectively it
> doesn't make a difference; the problem just occurs in the block layer
> instead of the device.
>
> We should either improve bdrv_aio_multiwrite() to submit the requests in
> a batch to the block drivers and use it for dataplane as well (possibly
> with a flag to disable request merging if we want to keep the current
> behaviour for dataplane); or, if we consider it a bad interface, replace
> it altogether with the new mechanism, even for normal virtio-blk.

In fact, what's the status of Fam's patches to unify request processing 
between dataplane and non-dataplane?  They would add multiwrite support 
(also rerror/werror and blockstats).

I was hoping that they could get in 2.1.

Paolo

> If this makes a difference for dataplane, it probably makes a difference
> for all block devices.


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-27  4:59             ` Paolo Bonzini
  2014-06-27  6:23               ` Kevin Wolf
@ 2014-06-27  7:57               ` Ming Lei
  1 sibling, 0 replies; 43+ messages in thread
From: Ming Lei @ 2014-06-27  7:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin

On Fri, Jun 27, 2014 at 12:59 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 27/06/2014 03:15, Ming Lei wrote:
>> On Thu, Jun 26, 2014 at 11:57 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> We can implement (advisory) calls like bdrv_plug/bdrv_unplug in order to
>>> restore the previous levels of performance.
>>
>> Yes, that is also what I am thinking, or interfaces like bdrv_queue_io()
>> and bdrv_submit_io(), which may map better onto the AIO interfaces.
>
> Would you like to try preparing a patch?

OK, let me try to do that.

>
>
>>>
>>> Note that some fallout of the conversion was expected.  Dataplane told us
>>> experimentally what level of performance could be reached, but was a dead
>>> end in terms of functionality.  Now Stefan added a whole lot of
>>> functionality to dataplane (accounting, throttling, file formats and
>>> protocols, thread-pool based I/O, etc.) and we need to bring back any
>>> performance we lost in the process.
>>
>>
>> These features are very good, but it looks like the conversion came a bit early. :-(
>
>
> Dataplane is still (and has always been) experimental.  For now, it's a
> playground to get rid of the big QEMU lock in hot paths.  As such,
> performance going up and down is expected.  The good thing is that every
> performance improvement we do now will not be restricted to dataplane, it
> can be applied just as well to any other device.

Yes, virtio-scsi may benefit from the improvement, and so may other
block devices.


Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-26 15:14 [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2 Ming Lei
  2014-06-26 15:29 ` Paolo Bonzini
@ 2014-06-27 12:01 ` Stefan Hajnoczi
  2014-06-27 12:21   ` Kevin Wolf
  2014-06-27 18:01   ` Ming Lei
  1 sibling, 2 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2014-06-27 12:01 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, qemu-devel, Michael S. Tsirkin


On Thu, Jun 26, 2014 at 11:14:16PM +0800, Ming Lei wrote:
> Hi Stefan,
>
> I found that VM block I/O throughput has decreased by more than 40%
> on my laptop, and it looks much worse in my server environment.
> It is caused by your commit 580b6b2aa2:
>
>           dataplane: use the QEMU block layer for I/O
> [...]
>
> Now that you have changed the way I/O is submitted, could you share
> the motivation for the change? Is the throughput drop what you expected?

Thanks for reporting this.  40% is a serious regression.

We were expecting a regression since the custom Linux AIO codepath has
been replaced with the QEMU block layer (which offers features like
image formats, snapshots, I/O throttling).

Let me know if you get stuck working on a patch.  Implementing batching
sounds like a good idea.  I never measured the impact when I wrote the
ioq code; it just seemed like a natural way to structure the code.

Hopefully this 40% number is purely due to batching and we can get most
of the performance back.

Stefan



* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-27 12:01 ` Stefan Hajnoczi
@ 2014-06-27 12:21   ` Kevin Wolf
  2014-06-27 14:50     ` Stefan Hajnoczi
  2014-06-27 18:01   ` Ming Lei
  1 sibling, 1 reply; 43+ messages in thread
From: Kevin Wolf @ 2014-06-27 12:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Ming Lei, Michael S. Tsirkin, Fam Zheng, qemu-devel, Paolo Bonzini


On 27.06.2014 at 14:01, Stefan Hajnoczi wrote:
> On Thu, Jun 26, 2014 at 11:14:16PM +0800, Ming Lei wrote:
>> [...]
>
> Thanks for reporting this.  40% is a serious regression.
>
> We were expecting a regression since the custom Linux AIO codepath has
> been replaced with the QEMU block layer (which offers features like
> image formats, snapshots, I/O throttling).
>
> Let me know if you get stuck working on a patch.  Implementing batching
> sounds like a good idea.  I never measured the impact when I wrote the
> ioq code; it just seemed like a natural way to structure the code.
>
> Hopefully this 40% number is purely due to batching and we can get most
> of the performance back.

Shouldn't it be easy enough to take the old code, remove the batching
there and then measure if you get the same 40%?

Kevin



* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-27  6:23               ` Kevin Wolf
  2014-06-27  7:35                 ` Paolo Bonzini
@ 2014-06-27 12:35                 ` Ming Lei
  1 sibling, 0 replies; 43+ messages in thread
From: Ming Lei @ 2014-06-27 12:35 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Paolo Bonzini, Fam Zheng, qemu-devel, Stefan Hajnoczi,
	Michael S. Tsirkin

On Fri, Jun 27, 2014 at 2:23 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 27.06.2014 at 06:59, Paolo Bonzini wrote:
>> On 27/06/2014 03:15, Ming Lei wrote:
>>> On Thu, Jun 26, 2014 at 11:57 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>> We can implement (advisory) calls like bdrv_plug/bdrv_unplug in order to
>>>> restore the previous levels of performance.
>>>
>>> Yes, that is also what I am thinking, or interfaces like bdrv_queue_io()
>>> and bdrv_submit_io(), which may map better onto the AIO interfaces.
>>
>> Would you like to try preparing a patch?
>
> Note that there is already an interface in block.c that takes multiple
> requests at once, bdrv_aio_multiwrite(). It is currently used by

It isn't ideal, because AIO supports handling both read and write
requests in one submission, and even supports different files.

> virtio-blk, even though not in dataplane mode. It also submits
> individual requests to the block drivers currently, so effectively it
> doesn't make a difference; the problem just occurs in the block layer
> instead of the device.
>
> We should either improve bdrv_aio_multiwrite() to submit the requests in
> a batch to the block drivers and use it for dataplane as well (possibly
> with a flag to disable request merging if we want to keep the current
> behaviour for dataplane); or, if we consider it a bad interface, replace
> it altogether with the new mechanism, even for normal virtio-blk.
>
> If this makes a difference for dataplane, it probably makes a difference
> for all block devices.

Yes, I think so.


Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-27 12:21   ` Kevin Wolf
@ 2014-06-27 14:50     ` Stefan Hajnoczi
  0 siblings, 0 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2014-06-27 14:50 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Ming Lei, Michael S. Tsirkin, Fam Zheng, qemu-devel, Paolo Bonzini


On Fri, Jun 27, 2014 at 02:21:06PM +0200, Kevin Wolf wrote:
> On 27.06.2014 at 14:01, Stefan Hajnoczi wrote:
>> [...]
>>
>> Hopefully this 40% number is purely due to batching and we can get most
>> of the performance back.
>
> Shouldn't it be easy enough to take the old code, remove the batching
> there and then measure if you get the same 40%?

Yes, that's a good idea.

Stefan



* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-27 12:01 ` Stefan Hajnoczi
  2014-06-27 12:21   ` Kevin Wolf
@ 2014-06-27 18:01   ` Ming Lei
  2014-06-27 21:51     ` Paolo Bonzini
  1 sibling, 1 reply; 43+ messages in thread
From: Ming Lei @ 2014-06-27 18:01 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, qemu-devel, Michael S. Tsirkin

On Fri, Jun 27, 2014 at 8:01 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Thu, Jun 26, 2014 at 11:14:16PM +0800, Ming Lei wrote:
>> Hi Stefan,
>>
>> I found VM block I/O thoughput is decreased by more than 40%
>> on my laptop, and looks much worsen in my server environment,
>> and it is caused by your commit 580b6b2aa2:
>>
>>           dataplane: use the QEMU block layer for I/O
>>
>> I run fio with below config to test random read:
>>
>> [global]
>> direct=1
>> size=4G
>> bsrange=4k-4k
>> timeout=20
>> numjobs=4
>> ioengine=libaio
>> iodepth=64
>> filename=/dev/vdc
>> group_reporting=1
>>
>> [f]
>> rw=randread
>>
>> Together with throughput drop, the latency is improved a little.
>>
>> With this commit, I/O block submitted to fs becomes much smaller
>> than before, and more io_submit() need to be called to kernel, that
>> means iodepth may become much less.
>>
>> I am not surprised with the result since I did compare VM I/O
>> performance between qemu and lkvm before, which has no big qemu
>> lock problem and handle I/O in a dedicated thread, but lkvm's block
>> IO is still much worse than qemu from view of throughput, because
>> lkvm doesn't submit block I/O at batch like the way of previous
>> dataplane, IMO.
>>
>> But now you change the way of submitting I/O, could you share
>> the motivation about the change? Is the throughput drop you expect?
>
> Thanks for reporting this.  40% is a serious regression.
>
> We were expecting a regression since the custom Linux AIO codepath has
> been replaced with the QEMU block layer (which offers features like
> image formats, snapshots, I/O throttling).
>
> Let me know if you get stuck working on a patch.  Implementing batching
> sounds like a good idea.  I never measured the impact when I wrote the
> ioq code, it just seemed like a natural way to structure the code.

I just implemented plug&unplug based batching, and it is working now.
But throughput still shows no obvious improvement.

The load in the IOThread looks a bit low, so I am wondering if there is
a blocking point caused by the QEMU block layer.

> Hopefully this 40% number is purely due to batching and we can get most
> of the performance back.

I will double-check it, but based on my previous comparison between
lkvm and qemu, batching is the only difference.
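For reference, the plug/unplug batching idea is roughly along these lines (a simplified sketch with invented names and counters, not the actual patch): queue requests while "plugged" and hand the whole queue to the kernel in one io_submit()-style call at unplug time.

```c
/* Illustrative sketch of plug/unplug batching (names invented, not
 * the real patch): while "plugged", requests are only queued; unplug
 * submits the whole queue with a single io_submit()-style call. */
#define MAX_BATCH 32

struct ioq {
    int plugged;          /* >0 while batching is active */
    int pending;          /* queued but not yet submitted requests */
    int submit_calls;     /* how many (batched) submissions were made */
    int submitted;        /* total requests handed to the kernel */
};

static void ioq_submit(struct ioq *q)
{
    if (q->pending == 0) {
        return;
    }
    /* One io_submit() for the whole queue instead of one per request. */
    q->submit_calls++;
    q->submitted += q->pending;
    q->pending = 0;
}

static void ioq_plug(struct ioq *q)
{
    q->plugged++;
}

static void ioq_unplug(struct ioq *q)
{
    if (--q->plugged == 0) {
        ioq_submit(q);
    }
}

static void ioq_enqueue(struct ioq *q)
{
    q->pending++;
    if (!q->plugged || q->pending == MAX_BATCH) {
        ioq_submit(q);    /* submit immediately when not batching */
    }
}
```

With iodepth=64 in the guest, this is what turns dozens of io_submit() calls into one.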

Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-27 18:01   ` Ming Lei
@ 2014-06-27 21:51     ` Paolo Bonzini
  2014-06-28  9:58       ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2014-06-27 21:51 UTC (permalink / raw)
  To: Ming Lei, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Michael S. Tsirkin

Il 27/06/2014 20:01, Ming Lei ha scritto:
> I just implemented plug&unplug based batching, and it is working now.
> But throughout still has no obvious improvement.
>
> Looks loading in IOthread is a bit low, so I am wondering if there is
> block point caused by Qemu QEMU block layer.

What does perf say?  Also, you can try using the QEMU trace subsystem 
and see where the latency goes.

Paolo


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-27 21:51     ` Paolo Bonzini
@ 2014-06-28  9:58       ` Ming Lei
  2014-06-30  8:08         ` Stefan Hajnoczi
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2014-06-28  9:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin

On Sat, Jun 28, 2014 at 5:51 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 27/06/2014 20:01, Ming Lei ha scritto:
>
>> I just implemented plug&unplug based batching, and it is working now.
>> But throughout still has no obvious improvement.
>>
>> Looks loading in IOthread is a bit low, so I am wondering if there is
>> block point caused by Qemu QEMU block layer.
>
>
> What does perf say?  Also, you can try using the QEMU trace subsystem and
> see where the latency goes.

Here are some test results against 8589744aaf07b62 of
upstream qemu; the tests were done on my 2-core (4-thread)
laptop:

1, with my draft batch patches[1](only linux-aio supported now)
- throughput: +16% compared to upstream qemu
- average time spent by handle_notify(): 310us
- average time between two handle_notify(): 1591us
(this time reflects latency of handling host_notifier)

2, same tests on 2.0.0 release(use custom Linux AIO)
- average time spent by handle_notify(): 68us
- average time between calling two handle_notify(): 269us
(this time reflects latency of handling host_notifier)

From the above tests, it looks like the root cause is late handling of
the notify, and the qemu block layer becomes 4 times slower than the
previous custom linux-aio code used by dataplane.

[1], git://kernel.ubuntu.com/ming/qemu.git #master-0626-batch

Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-28  9:58       ` Ming Lei
@ 2014-06-30  8:08         ` Stefan Hajnoczi
  2014-06-30  8:27           ` Ming Lei
  2014-07-01 13:53           ` Ming Lei
  0 siblings, 2 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2014-06-30  8:08 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, qemu-devel, Michael S. Tsirkin


On Sat, Jun 28, 2014 at 05:58:58PM +0800, Ming Lei wrote:
> On Sat, Jun 28, 2014 at 5:51 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Il 27/06/2014 20:01, Ming Lei ha scritto:
> >
> >> I just implemented plug&unplug based batching, and it is working now.
> >> But throughout still has no obvious improvement.
> >>
> >> Looks loading in IOthread is a bit low, so I am wondering if there is
> >> block point caused by Qemu QEMU block layer.
> >
> >
> > What does perf say?  Also, you can try using the QEMU trace subsystem and
> > see where the latency goes.
> 
> Follows some test result against 8589744aaf07b62 of
> upstream qemu, and the test is done on my 2core(4thread)
> laptop:
> 
> 1, with my draft batch patches[1](only linux-aio supported now)
> - throughput: +16% compared qemu upstream
> - average time spent by handle_notify(): 310us
> - average time between two handle_notify(): 1591us
> (this time reflects latency of handling host_notifier)

16% is still a worthwhile improvement.  I guess batching only benefits
aio=native since the threadpool ought to do better when it receives
requests as soon as possible.

Patch or an RFC would be welcome.

> 2, same tests on 2.0.0 release(use custom Linux AIO)
> - average time spent by handle_notify(): 68us
> - average time between calling two handle_notify(): 269us
> (this time reflects latency of handling host_notifier)
> 
> From above tests, looks root cause is late handling notify, and
> qemu block layer becomes 4times slower than previous custom
> linux aio taken by dataplane.

Try:
$ perf record -e syscalls:* --tid <iothread-tid>
^C
$ perf script # shows the trace log

The difference between syscalls in QEMU 2.0 and qemu.git/master could
reveal the problem.

Using perf you can also trace ioeventfd signalling in the host kernel
and compare against the QEMU handle_notify entry/return.  It may be
easiest to use the ftrace_marker tracing backend in QEMU so the trace is
unified with the host kernel trace (./configure
--enable-trace-backend=ftrace and see the ftrace section in QEMU
docs/tracing.txt).

This way you can see whether the ioeventfd signal -> handle_notify()
entry increased or something else is going on.

Stefan



* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-30  8:08         ` Stefan Hajnoczi
@ 2014-06-30  8:27           ` Ming Lei
  2014-07-01 13:53           ` Ming Lei
  1 sibling, 0 replies; 43+ messages in thread
From: Ming Lei @ 2014-06-30  8:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, qemu-devel, Michael S. Tsirkin

On Mon, Jun 30, 2014 at 4:08 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Sat, Jun 28, 2014 at 05:58:58PM +0800, Ming Lei wrote:
>> On Sat, Jun 28, 2014 at 5:51 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> > Il 27/06/2014 20:01, Ming Lei ha scritto:
>> >
>> >> I just implemented plug&unplug based batching, and it is working now.
>> >> But throughout still has no obvious improvement.
>> >>
>> >> Looks loading in IOthread is a bit low, so I am wondering if there is
>> >> block point caused by Qemu QEMU block layer.
>> >
>> >
>> > What does perf say?  Also, you can try using the QEMU trace subsystem and
>> > see where the latency goes.
>>
>> Follows some test result against 8589744aaf07b62 of
>> upstream qemu, and the test is done on my 2core(4thread)
>> laptop:
>>
>> 1, with my draft batch patches[1](only linux-aio supported now)
>> - throughput: +16% compared qemu upstream
>> - average time spent by handle_notify(): 310us
>> - average time between two handle_notify(): 1591us
>> (this time reflects latency of handling host_notifier)
>
> 16% is still a worthwhile improvement.  I guess batching only benefits
> aio=native since the threadpool ought to do better when it receives
> requests as soon as possible.

The 16% was obtained with the 'simple' trace backend enabled; the
actual number with the 'nop' backend looks quite a bit better than
16%, but it is still not as good as the 2.0.0 release.

>
> Patch or an RFC would be welcome.

Yes, I will post it soon.

>> 2, same tests on 2.0.0 release(use custom Linux AIO)
>> - average time spent by handle_notify(): 68us
>> - average time between calling two handle_notify(): 269us
>> (this time reflects latency of handling host_notifier)
>>
>> From above tests, looks root cause is late handling notify, and
>> qemu block layer becomes 4times slower than previous custom
>> linux aio taken by dataplane.

The above data was also obtained with the 'simple' trace backend
enabled; I need to find another way to test again without the extra
trace I/O.

> Try:
> $ perf record -e syscalls:* --tid <iothread-tid>
> ^C
> $ perf script # shows the trace log
>
> The difference between syscalls in QEMU 2.0 and qemu.git/master could
> reveal the problem.
> Using perf you can also trace ioeventfd signalling in the host kernel
> and compare against the QEMU handle_notify entry/return.  It may be
> easiest to use the ftrace_marker tracing backing in QEMU so the trace is
> unified with the host kernel trace (./configure
> --enable-trace-backend=ftrace and see the ftrace section in QEMU
> docs/tracing.txt).
>
> This way you can see whether the ioeventfd signal -> handle_notify()
> entry increased or something else is going on.

These look like good ideas; I will try them.

I have tried ftrace, but it looks like some trace events may be
dropped, and my current script can't handle that well.


Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-06-30  8:08         ` Stefan Hajnoczi
  2014-06-30  8:27           ` Ming Lei
@ 2014-07-01 13:53           ` Ming Lei
  2014-07-01 14:31             ` Stefan Hajnoczi
  1 sibling, 1 reply; 43+ messages in thread
From: Ming Lei @ 2014-07-01 13:53 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, qemu-devel, Michael S. Tsirkin

On Mon, Jun 30, 2014 at 4:08 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> Try:
> $ perf record -e syscalls:* --tid <iothread-tid>
> ^C
> $ perf script # shows the trace log
>
> The difference between syscalls in QEMU 2.0 and qemu.git/master could
> reveal the problem.

The difference is that there are tons of write() and rt_sigprocmask()
calls in qemu.git/master; I guess this is related to coroutines.

For linux-aio, the coroutine shouldn't be necessary, because
io_submit() won't block most of the time for O_DIRECT reads/writes.


Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-01 13:53           ` Ming Lei
@ 2014-07-01 14:31             ` Stefan Hajnoczi
  2014-07-01 14:49               ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Hajnoczi @ 2014-07-01 14:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Tue, Jul 1, 2014 at 3:53 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Mon, Jun 30, 2014 at 4:08 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>
>> Try:
>> $ perf record -e syscalls:* --tid <iothread-tid>
>> ^C
>> $ perf script # shows the trace log
>>
>> The difference between syscalls in QEMU 2.0 and qemu.git/master could
>> reveal the problem.
>
> The difference is that there are tons of write() and rt_sigprocmask()
> in qemu.git/master, I guess it is related coroutinue.
>
> For linux-aio, the coroutinue shouldn't be necessary because
> io_submit() won't block at most of times for O_DIRECT read/write.

You're forgetting about image formats and the other QEMU block layer
features like I/O throttling.  They do require coroutines.

Are you sure it's the extra syscall overhead?  Any ideas for avoiding them?

The sigprocmask can probably be optimized away since the thread's
signal mask remains unchanged most of the time.

I'm not sure what is causing the write().

Stefan


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-01 14:31             ` Stefan Hajnoczi
@ 2014-07-01 14:49               ` Ming Lei
  2014-07-01 16:49                 ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2014-07-01 14:49 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Tue, Jul 1, 2014 at 10:31 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Jul 1, 2014 at 3:53 PM, Ming Lei <tom.leiming@gmail.com> wrote:
>> On Mon, Jun 30, 2014 at 4:08 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>
>>> Try:
>>> $ perf record -e syscalls:* --tid <iothread-tid>
>>> ^C
>>> $ perf script # shows the trace log
>>>
>>> The difference between syscalls in QEMU 2.0 and qemu.git/master could
>>> reveal the problem.
>>
>> The difference is that there are tons of write() and rt_sigprocmask()
>> in qemu.git/master, I guess it is related coroutinue.
>>
>> For linux-aio, the coroutinue shouldn't be necessary because
>> io_submit() won't block at most of times for O_DIRECT read/write.
>
> You're forgetting about image formats and the other QEMU block layer
> features like I/O throttling.  They do require coroutines.

I mean that from the linux-aio point of view, io_submit() won't block
most of the time, as in your previous implementation of dataplane.

>
> Are you sure it's the extra syscall overhead?  Any ideas for avoiding them?

Yes, I am sure, and it is obvious when running perf to trace the
system calls. :-)

Let me provide some data from running randread (bs=4k, libaio)
in the VM for 10 seconds:

1), qemu.git/master
- write(): 731K
- rt_sigprocmask(): 417K
- read(): 21K
- ppoll(): 10K
- io_submit(): 5K
- io_getevents(): 4K

2), qemu 2.0
- write(): 9K
- read(): 28K
- ppoll(): 16K
- io_submit(): 12K
- io_getevents(): 10K

> The sigprocmask can probably be optimized away since the thread's
> signal mask remains unchanged most of the time.
>
> I'm not sure what is causing the write().

I am investigating it...


Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-01 14:49               ` Ming Lei
@ 2014-07-01 16:49                 ` Paolo Bonzini
  2014-07-02  0:48                   ` Ming Lei
  2014-07-02  8:54                   ` Stefan Hajnoczi
  0 siblings, 2 replies; 43+ messages in thread
From: Paolo Bonzini @ 2014-07-01 16:49 UTC (permalink / raw)
  To: Ming Lei, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin

Il 01/07/2014 16:49, Ming Lei ha scritto:
> Let me provide some data when running randread(bs 4k, libaio)
> from VM for 10sec:
>
> 1), qemu.git/master
> - write(): 731K
> - rt_sigprocmask(): 417K
> - read(): 21K
> - ppoll(): 10K
> - io_submit(): 5K
> - io_getevents(): 4K
>
> 2), qemu 2.0
> - write(): 9K
> - read(): 28K
> - ppoll(): 16K
> - io_submit(): 12K
> - io_getevents(): 10K
>
>> > The sigprocmask can probably be optimized away since the thread's
>> > signal mask remains unchanged most of the time.
>> >
>> > I'm not sure what is causing the write().
> I am investigating it...

I would guess sigprocmask is getcontext (from qemu_coroutine_new) and 
write is aio_notify (from qemu_bh_schedule).

Both can be eliminated by introducing a fast path in 
bdrv_aio_{read,write}v, that bypasses coroutines in the common case of 
no I/O throttling, no copy-on-write, etc.
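As a rough sketch of that run-time choice (the predicate names here are invented, not QEMU's actual fields), the dispatch would only take the fast path when no block-layer feature that needs coroutines is active:

```c
#include <stdbool.h>

/* Illustrative sketch of a run-time fast-path selection (field names
 * invented): fall back to the coroutine path whenever any feature
 * that requires it is active. */
struct bds {
    bool io_throttled;         /* I/O throttling enabled */
    bool has_backing_file;     /* copy-on-write possible */
    bool is_raw;               /* driver can submit native aio directly */
};

enum path { PATH_FAST_AIO, PATH_COROUTINE };

static enum path bdrv_pick_path(const struct bds *bs)
{
    if (bs->is_raw && !bs->io_throttled && !bs->has_backing_file) {
        return PATH_FAST_AIO;  /* submit directly, no coroutine/BH */
    }
    return PATH_COROUTINE;     /* full-featured slow path */
}
```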

Paolo


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-01 16:49                 ` Paolo Bonzini
@ 2014-07-02  0:48                   ` Ming Lei
  2014-07-02  8:54                   ` Stefan Hajnoczi
  1 sibling, 0 replies; 43+ messages in thread
From: Ming Lei @ 2014-07-02  0:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Stefan Hajnoczi

On Wed, Jul 2, 2014 at 12:49 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 01/07/2014 16:49, Ming Lei ha scritto:
>
>> Let me provide some data when running randread(bs 4k, libaio)
>> from VM for 10sec:
>>
>> 1), qemu.git/master
>> - write(): 731K
>> - rt_sigprocmask(): 417K
>> - read(): 21K
>> - ppoll(): 10K
>> - io_submit(): 5K
>> - io_getevents(): 4K
>>
>> 2), qemu 2.0
>> - write(): 9K
>> - read(): 28K
>> - ppoll(): 16K
>> - io_submit(): 12K
>> - io_getevents(): 10K
>>
>>> > The sigprocmask can probably be optimized away since the thread's
>>> > signal mask remains unchanged most of the time.
>>> >
>>> > I'm not sure what is causing the write().
>>
>> I am investigating it...
>
>
> I would guess sigprocmask is getcontext (from qemu_coroutine_new) and write
> is aio_notify (from qemu_bh_schedule).

Yes, the write is on ctx->notifier.

>
> Both can be eliminated by introducing a fast path in bdrv_aio_{read,write}v,
> that bypasses coroutines in the common case of no I/O throttling, no
> copy-on-write, etc.



Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-01 16:49                 ` Paolo Bonzini
  2014-07-02  0:48                   ` Ming Lei
@ 2014-07-02  8:54                   ` Stefan Hajnoczi
  2014-07-02  9:13                     ` Paolo Bonzini
  2014-07-02 15:45                     ` Ming Lei
  1 sibling, 2 replies; 43+ messages in thread
From: Stefan Hajnoczi @ 2014-07-02  8:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Ming Lei, qemu-devel,
	Stefan Hajnoczi


On Tue, Jul 01, 2014 at 06:49:30PM +0200, Paolo Bonzini wrote:
> Il 01/07/2014 16:49, Ming Lei ha scritto:
> >Let me provide some data when running randread(bs 4k, libaio)
> >from VM for 10sec:
> >
> >1), qemu.git/master
> >- write(): 731K
> >- rt_sigprocmask(): 417K
> >- read(): 21K
> >- ppoll(): 10K
> >- io_submit(): 5K
> >- io_getevents(): 4K
> >
> >2), qemu 2.0
> >- write(): 9K
> >- read(): 28K
> >- ppoll(): 16K
> >- io_submit(): 12K
> >- io_getevents(): 10K
> >
> >>> The sigprocmask can probably be optimized away since the thread's
> >>> signal mask remains unchanged most of the time.
> >>>
> >>> I'm not sure what is causing the write().
> >I am investigating it...
> 
> I would guess sigprocmask is getcontext (from qemu_coroutine_new) and write
> is aio_notify (from qemu_bh_schedule).

Aha!  We shouldn't be executing qemu_coroutine_new() very often since we
try to keep a freelist of coroutines.

I think a tweak to the freelist could make the rt_sigprocmask() calls go
away since we should be reusing coroutines instead of allocating/freeing
them all the time.
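A simplified sketch of such a freelist (sizes and names are illustrative, not QEMU's actual coroutine code): released coroutines are parked on a LIFO list and reused, so the expensive getcontext()/rt_sigprocmask() path only runs when the list is empty.

```c
#include <stdlib.h>

/* Illustrative coroutine freelist sketch (names/sizes invented):
 * qemu_coroutine_new() is expensive (getcontext -> rt_sigprocmask),
 * so released coroutines are parked on a LIFO freelist and reused. */
#define POOL_MAX 64

struct coroutine {
    struct coroutine *next;
};

static struct coroutine *pool;      /* freelist head */
static int pool_size;
static int slow_allocs;             /* how often the slow path ran */

static struct coroutine *coroutine_get(void)
{
    if (pool) {
        struct coroutine *co = pool;
        pool = co->next;
        pool_size--;
        return co;                  /* fast path: no getcontext() */
    }
    slow_allocs++;                  /* stands in for qemu_coroutine_new() */
    return calloc(1, sizeof(struct coroutine));
}

static void coroutine_put(struct coroutine *co)
{
    if (pool_size < POOL_MAX) {
        co->next = pool;            /* recycle instead of freeing */
        pool = co;
        pool_size++;
    } else {
        free(co);
    }
}
```

If the pool is too small for the workload's iodepth, coroutine_get() keeps hitting the slow path, which would explain the rt_sigprocmask() storm.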

> Both can be eliminated by introducing a fast path in bdrv_aio_{read,write}v,
> that bypasses coroutines in the common case of no I/O throttling, no
> copy-on-write, etc.

I tried that in 2012 and couldn't measure an improvement above the noise
threshold, although it was without dataplane.

BTW, we cannot eliminate the BH because the block layer guarantees that
callbacks are not invoked reentrantly.  They are always invoked
directly from the event loop through a BH.  This simplifies callers
since they don't need to worry about callbacks happening while they are
still in bdrv_aio_readv(), for example.

Removing this guarantee (by making callers safe first) is orthogonal to
coroutines.  But it's hard to do since it requires auditing a lot of
code.
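To illustrate the guarantee with a toy model (not QEMU's actual BH code, all names invented): even a synchronously satisfiable request defers its completion callback through a scheduled bottom half, so the caller never sees the callback run while still inside bdrv_aio_readv().

```c
/* Toy model of the no-reentrancy guarantee (invented names): the aio
 * read never runs the completion callback directly; it schedules a
 * bottom half, and the event loop invokes the callback later. */
typedef void BlockCB(void *opaque, int ret);

struct bh {
    BlockCB *cb;
    void *opaque;
    int ret;
    struct bh *next;
};

static struct bh *bh_list;          /* pending bottom halves */
static int completions;             /* bumped by the demo callback */

static void qemu_bh_schedule_demo(BlockCB *cb, void *opaque, int ret)
{
    static struct bh slots[16];
    static int used;
    struct bh *bh = &slots[used++ % 16];
    bh->cb = cb;
    bh->opaque = opaque;
    bh->ret = ret;
    bh->next = bh_list;
    bh_list = bh;
    /* real code also does aio_notify() here -> the extra write() */
}

static void aio_poll_once(void)     /* one event-loop iteration */
{
    while (bh_list) {
        struct bh *bh = bh_list;
        bh_list = bh->next;
        bh->cb(bh->opaque, bh->ret);
    }
}

static void count_cb(void *opaque, int ret)
{
    (void)opaque;
    (void)ret;
    completions++;
}

static void fake_bdrv_aio_readv(BlockCB *cb, void *opaque)
{
    /* even if data is ready synchronously, defer the callback */
    qemu_bh_schedule_demo(cb, opaque, 0);
}
```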

Another idea is to skip aio_notify() when we're sure the event loop
isn't blocked in g_poll().  Doing this in a thread-safe and lockless way
might be tricky though.

So to recap, three issues are being discussed here:

1. rt_sigprocmask due to getcontext() in qemu_coroutine_new().  We
shouldn't be invoking qemu_coroutine_new() often.  The freelist is
probably too small.

2. Coroutines might be slower than the non-coroutine aio codepath.  I
don't think this is the case, they are very cheap and I was never able
to measure a real difference.

3. The block layer requires a BH with aio_notify() for
bdrv_aio_readv()/bdrv_aio_writev()/bdrv_aio_flush() callbacks regardless
of coroutines or not.  Eliminating the BH or skipping aio_notify() will
take some work but could speed up QEMU as a whole.

Stefan



* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02  8:54                   ` Stefan Hajnoczi
@ 2014-07-02  9:13                     ` Paolo Bonzini
  2014-07-02  9:39                       ` Kevin Wolf
  2014-07-02 15:45                     ` Ming Lei
  1 sibling, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2014-07-02  9:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Ming Lei, qemu-devel,
	Stefan Hajnoczi

Il 02/07/2014 10:54, Stefan Hajnoczi ha scritto:
>> Both can be eliminated by introducing a fast path in bdrv_aio_{read,write}v,
>> that bypasses coroutines in the common case of no I/O throttling, no
>> copy-on-write, etc.
>
> I tried that in 2012 and couldn't measure an improvement above the noise
> threshold, although it was without dataplane.
>
> BTW, we cannot eliminate the BH because the block layer guarantees that
> callbacks are not invoked with reentrancy.  They are always invoked
> directly from the event loop through a BH.  This simplifies callers
> since they don't need to worry about callbacks happening while they are
> still in bdrv_aio_readv(), for example.
>
> Removing this guarantee (by making callers safe first) is orthogonal to
> coroutines.  But it's hard to do since it requires auditing a lot of
> code.

You could also audit the few implementations of bdrv_aio_readv/writev 
(including bdrv_aio_readv/writev_em) to guarantee that they do not 
directly invoke the callbacks.  The rule was there before conversion to 
coroutine, so the implementations should be fine.  In fact, most of them 
are just forwarding to another bdrv_aio_readv/writev, and the others go 
through an EventNotifier or bottom half.

Drivers that implement bdrv_co_readv/writev would not enjoy the fast 
path, and would keep using the BH.

> Another idea is to skip aio_notify() when we're sure the event loop
> isn't blocked in g_poll().  Doing this is a thread-safe and lockless way
> might be tricky though.

Yes, good idea for 2.2 but not now.

> So to recap, three issues are being discussed here:
>
> 1. rt_sigprocmask due to getcontext() in qemu_coroutine_new().  We
> shouldn't be invoking qemu_coroutine_new() often.  The freelist is
> probably too small.

Yes, right now it's 64 for the whole of QEMU.  Originally it was 64 per 
thread (using TLS) but then TLS was dropped because of problems when 
coroutines were created in the VCPU thread and destroyed in the 
iothread.  Nowadays, the size should probably be dynamic---like 64 per 
iothread to keep it simple.

> 2. Coroutines might be slower than the non-coroutine aio codepath.  I
> don't think this is the case, they are very cheap and I was never able
> to measure a real difference.
>
> 3. The block layer requires a BH with aio_notify() for
> bdrv_aio_readv()/bdrv_aio_writev()/bdrv_aio_flush() callbacks regardless
> of coroutines or not.  Eliminating the BH or skipping aio_notify() will
> take some work but could speed up QEMU as a whole.

I think (3) is not really true, and the BH is the actual reason why 
coroutines are slower.

Paolo


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02  9:13                     ` Paolo Bonzini
@ 2014-07-02  9:39                       ` Kevin Wolf
  2014-07-02  9:48                         ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Kevin Wolf @ 2014-07-02  9:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Michael S. Tsirkin, Ming Lei, qemu-devel,
	Stefan Hajnoczi, Stefan Hajnoczi

Am 02.07.2014 um 11:13 hat Paolo Bonzini geschrieben:
> Il 02/07/2014 10:54, Stefan Hajnoczi ha scritto:
> >>Both can be eliminated by introducing a fast path in bdrv_aio_{read,write}v,
> >>that bypasses coroutines in the common case of no I/O throttling, no
> >>copy-on-write, etc.
> >
> >I tried that in 2012 and couldn't measure an improvement above the noise
> >threshold, although it was without dataplane.
> >
> >BTW, we cannot eliminate the BH because the block layer guarantees that
> >callbacks are not invoked with reentrancy.  They are always invoked
> >directly from the event loop through a BH.  This simplifies callers
> >since they don't need to worry about callbacks happening while they are
> >still in bdrv_aio_readv(), for example.
> >
> >Removing this guarantee (by making callers safe first) is orthogonal to
> >coroutines.  But it's hard to do since it requires auditing a lot of
> >code.
> 
> You could also audit the few implementations of
> bdrv_aio_readv/writev (including bdrv_aio_readv/writev_em) to
> guarantee that they do not directly invoke the callbacks.  The rule
> was there before conversion to coroutine, so the implementations
> should be fine.  In fact, most of them are just forwarding to
> another bdrv_aio_readv/writev, and the others go through an
> EventNotifier or bottom half.
> 
> Drivers that implement bdrv_co_readv/writev would not enjoy the fast
> path, and would keep using the BH.

I don't think starting with that fast path as _the_ solution is a good
idea. It would essentially restrict dataplane to the scenarios that used
to work well in 2.0 - just look at what the block.c read/write functions
do: no image formats, (almost?) no block jobs, no 4k sector support, no
writethrough mode, no zero detection, no throttling, no nothing.
Anything we want to keep while using the fast path we would have to
duplicate there.

I much prefer the approach of making it use the normal path and
optimising that one, so that all cases benefit from it. We can probably get the
same results with coroutines - if the BH isn't necessary with AIO in the
common case, it also isn't necessary with coroutines.

> >Another idea is to skip aio_notify() when we're sure the event loop
> >isn't blocked in g_poll().  Doing this is a thread-safe and lockless way
> >might be tricky though.
> 
> Yes, good idea for 2.2 but not now.

Isn't it a first approximation that it's unnecessary when we're already
running in the thread with the AIO main loop? (Which pretty much means
always with dataplane.) Or can it be required even when we don't switch
to a different thread?

> >3. The block layer requires a BH with aio_notify() for
> >bdrv_aio_readv()/bdrv_aio_writev()/bdrv_aio_flush() callbacks regardless
> >of coroutines or not.  Eliminating the BH or skipping aio_notify() will
> >take some work but could speed up QEMU as a whole.
> 
> I think (3) is not really true, and the BH is the actual reason why
> coroutines are slower.

Can't we provide a mechanism with coroutines that checks whether we've
been reentered from the main loop at least once, so that we can avoid
the BH then?

We would still have to think about which stack we want to run it on, but
there would definitely be no aio_notify() involved then.
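Something along these lines, as a toy model with invented fields: the request records whether control has already returned to the caller; only a completion that fires before that point needs to be deferred through a BH.

```c
#include <stdbool.h>

/* Toy model (invented fields): track whether the request completed
 * before control returned to the caller; only then is a BH needed to
 * avoid invoking the callback reentrantly. */
struct req {
    bool returned_to_caller;   /* set once bdrv_aio_readv() has returned */
    bool completed;
    int bh_scheduled;          /* completions deferred through a BH */
    int direct_calls;          /* completions invoked directly */
};

static void req_complete(struct req *r)
{
    r->completed = true;
    if (r->returned_to_caller) {
        r->direct_calls++;     /* safe: caller is back in the event loop */
    } else {
        r->bh_scheduled++;     /* would reenter the caller -> defer */
    }
}
```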

Kevin


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02  9:39                       ` Kevin Wolf
@ 2014-07-02  9:48                         ` Paolo Bonzini
  2014-07-02 10:01                           ` Kevin Wolf
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2014-07-02  9:48 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Michael S. Tsirkin, Ming Lei, qemu-devel,
	Stefan Hajnoczi, Stefan Hajnoczi

Il 02/07/2014 11:39, Kevin Wolf ha scritto:
> Am 02.07.2014 um 11:13 hat Paolo Bonzini geschrieben:
> I don't think starting with that fast path as _the_ solution is a good
> idea. It would essentially restrict dataplane to the scenarios that used
> to work well in 2.0 - just look at what the block.c read/write functions
> do: no image formats, (almost?) no block jobs, no 4k sector support, no
> writethrough mode, no zero detection, no throttling, no nothing.
> Anything we want to keep while using the fast path we would have to
> duplicate there.

You're entirely right (I wouldn't duplicate it though, I would just 
sacrifice it).  But I think none of the bullets apply in maximum 
performance situations, and fast paths are okay as long as they are 
picked dynamically at run-time.

>>> Another idea is to skip aio_notify() when we're sure the event loop
>>> isn't blocked in g_poll().  Doing this is a thread-safe and lockless way
>>> might be tricky though.
>>
>> Yes, good idea for 2.2 but not now.
>
> Isn't it a first approximation that it's unnecessary when we're already
> running in the thread with the AIO main loop? (Which pretty much means
> always with dataplane.) Or can it be required even when we don't switch
> to a different thread?

That's not even that much of an approximation.  I think it's pretty much 
the definition of when it's unnecessary.  Clever!

An approximation is "it's unnecessary if we have the aio context lock 
taken".  Which is also always the case with dataplane, but never with 
non-dataplane (the main loop bypasses aio_context_acquire/release). 
Adding rfifolock_is_owned is trivial.
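A sketch of that check (simplified from what the real RFifoLock would need; fields and the FIFO queueing are elided here): record the owning thread and nesting depth, and let aio_notify() skip the eventfd write when the calling thread already holds the AioContext lock, i.e. the event loop is notifying itself.

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative sketch (simplified, fields invented): a recursive lock
 * that records its owner, plus an aio_notify() that skips the eventfd
 * write when the caller already holds the AioContext lock. */
struct rfifolock {
    pthread_t owner;
    unsigned nesting;          /* recursive depth, 0 == unlocked */
};

static bool rfifolock_is_owned(struct rfifolock *l)
{
    return l->nesting > 0 && pthread_equal(l->owner, pthread_self());
}

static void rfifolock_lock(struct rfifolock *l)
{
    /* real code takes a mutex and queues waiters in FIFO order */
    l->owner = pthread_self();
    l->nesting++;
}

static void rfifolock_unlock(struct rfifolock *l)
{
    l->nesting--;
}

struct aio_ctx {
    struct rfifolock lock;
    int notify_writes;         /* stands in for write() on the eventfd */
};

static void aio_notify(struct aio_ctx *ctx)
{
    if (rfifolock_is_owned(&ctx->lock)) {
        return;                /* the event loop thread notifies itself:
                                * no wakeup needed */
    }
    ctx->notify_writes++;      /* real code: write(ctx->notifier, ...) */
}
```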

Paolo


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02  9:48                         ` Paolo Bonzini
@ 2014-07-02 10:01                           ` Kevin Wolf
  2014-07-02 10:23                             ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Kevin Wolf @ 2014-07-02 10:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Michael S. Tsirkin, Ming Lei, qemu-devel,
	Stefan Hajnoczi, Stefan Hajnoczi

Am 02.07.2014 um 11:48 hat Paolo Bonzini geschrieben:
> Il 02/07/2014 11:39, Kevin Wolf ha scritto:
> >Am 02.07.2014 um 11:13 hat Paolo Bonzini geschrieben:
> >I don't think starting with that fast path as _the_ solution is a good
> >idea. It would essentially restrict dataplane to the scenarios that used
> >to work well in 2.0 - just look at what the block.c read/write functions
> >do: no image formats, (almost?) no block jobs, no 4k sector support, no
> >writethrough mode, no zero detection, no throttling, no nothing.
> >Anything we want to keep while using the fast path we would have to
> >duplicate there.
> 
> You're entirely right (I wouldn't duplicate it though, I would just
> sacrifice it).  But I think none of the bullets apply in maximum
> performance situations, and fast paths are okay as long as they are
> picked dynamically at run-time.

Fast paths are okay if there is no way to achieve the same performance
without them, but I'm not entirely convinced of that yet in our specific
case.

> >>>Another idea is to skip aio_notify() when we're sure the event loop
> >>>isn't blocked in g_poll().  Doing this in a thread-safe and lockless way
> >>>might be tricky though.
> >>
> >>Yes, good idea for 2.2 but not now.
> >
> >Isn't it a first approximation that it's unnecessary when we're already
> >running in the thread with the AIO main loop? (Which pretty much means
> >always with dataplane.) Or can it be required even when we don't switch
> >to a different thread?
> 
> That's not even that much of an approximation.  I think it's pretty
> much the definition of when it's unnecessary.  Clever!

Probably not quite, because the AIO main loop thread might be doing
something else at the moment and would come back to handling things in
its main loop even without being notified.

But it's probably close enough in practice.

> An approximation is "it's unnecessary if we have the aio context
> lock taken".  Which is also always the case with dataplane, but
> never with non-dataplane (the main loop bypasses
> aio_context_acquire/release). Adding rfifolock_is_owned is trivial.

Is the fix for the main loop as simple as just adding the acquire/
release pair, or does it involve more than that?

I would really prefer if the optimisations we apply for dataplane would
work even in the traditional case, improving the block layer as a whole
instead of just special cases.

Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02 10:01                           ` Kevin Wolf
@ 2014-07-02 10:23                             ` Paolo Bonzini
  0 siblings, 0 replies; 43+ messages in thread
From: Paolo Bonzini @ 2014-07-02 10:23 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Michael S. Tsirkin, Ming Lei, qemu-devel,
	Stefan Hajnoczi, Stefan Hajnoczi

On 02/07/2014 12:01, Kevin Wolf wrote:
> On 02.07.2014 at 11:48, Paolo Bonzini wrote:
>> On 02/07/2014 11:39, Kevin Wolf wrote:
>>> On 02.07.2014 at 11:13, Paolo Bonzini wrote:
>>> I don't think starting with that fast path as _the_ solution is a good
>>> idea. It would essentially restrict dataplane to the scenarios that used
>>> to work well in 2.0 - just look at what the block.c read/write functions
>>> do: no image formats, (almost?) no block jobs, no 4k sector support, no
>>> writethrough mode, no zero detection, no throttling, no nothing.
>>> Anything we want to keep while using the fast path we would have to
>>> duplicate there.
>>
>> You're entirely right (I wouldn't duplicate it though, I would just
>> sacrifice it).  But I think none of the bullets apply in maximum
>> performance situations, and fast paths are okay as long as they are
>> picked dynamically at run-time.
>
> Fast paths are okay if there is no way to achieve the same performance
> without them, but I'm not entirely convinced of that yet in our specific
> case.
>
>>>>> Another idea is to skip aio_notify() when we're sure the event loop
>>>>> isn't blocked in g_poll().  Doing this in a thread-safe and lockless way
>>>>> might be tricky though.
>>>>
>>>> Yes, good idea for 2.2 but not now.
>>>
>>> Isn't it a first approximation that it's unnecessary when we're already
>>> running in the thread with the AIO main loop? (Which pretty much means
>>> always with dataplane.) Or can it be required even when we don't switch
>>> to a different thread?
>>
>> That's not even that much of an approximation.  I think it's pretty
>> much the definition of when it's unnecessary.  Clever!
>
> Probably not quite, because the AIO main loop thread might be doing
> something else at the moment and would come back to handling things in
> its main loop even without being notified.

But then aio_context_prepare (for the main iothread) or aio_poll (for 
dataplane threads) would check for bottom halves.

The problem is how you define the AIO main loop.  One way is "who has 
the aio context lock", but sooner or later we will want to get rid of 
the "Big Dataplane Lock" that is aio_context_acquire/release.  It's very 
hard otherwise to avoid lock inversion deadlocks in 
virtio-scsi-dataplane (which will likely use dma-helpers.c, not 
address_space_rw).

> But it's probably close enough in practice.
>
>> An approximation is "it's unnecessary if we have the aio context
>> lock taken".  Which is also always the case with dataplane, but
>> never with non-dataplane (the main loop bypasses
>> aio_context_acquire/release). Adding rfifolock_is_owned is trivial.
>
> Is the fix for the main loop as simple as just adding the acquire/
> release pair, or does it involve more than that?

Yes, it should be.  But see above about the possible short life of the 
aio context lock.

> I would really prefer if the optimisations we apply for dataplane would
> work even in the traditional case, improving the block layer as a whole
> instead of just special cases.

I agree.  But my hope is to get there by removing more "special" parts 
of dataplane, since I consider the aio context lock to be one.

Paolo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02  8:54                   ` Stefan Hajnoczi
  2014-07-02  9:13                     ` Paolo Bonzini
@ 2014-07-02 15:45                     ` Ming Lei
  2014-07-02 16:13                       ` Ming Lei
  2014-07-02 16:21                       ` Paolo Bonzini
  1 sibling, 2 replies; 43+ messages in thread
From: Ming Lei @ 2014-07-02 15:45 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2443 bytes --]

On Wed, Jul 2, 2014 at 4:54 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Tue, Jul 01, 2014 at 06:49:30PM +0200, Paolo Bonzini wrote:
>> On 01/07/2014 16:49, Ming Lei wrote:
>> >Let me provide some data when running randread(bs 4k, libaio)
>> >from VM for 10sec:
>> >
>> >1), qemu.git/master
>> >- write(): 731K
>> >- rt_sigprocmask(): 417K
>> >- read(): 21K
>> >- ppoll(): 10K
>> >- io_submit(): 5K
>> >- io_getevents(): 4K
>> >
>> >2), qemu 2.0
>> >- write(): 9K
>> >- read(): 28K
>> >- ppoll(): 16K
>> >- io_submit(): 12K
>> >- io_getevents(): 10K
>> >
>> >>> The sigprocmask can probably be optimized away since the thread's
>> >>> signal mask remains unchanged most of the time.
>> >>>
>> >>> I'm not sure what is causing the write().
>> >I am investigating it...
>>
>> I would guess sigprocmask is getcontext (from qemu_coroutine_new) and write
>> is aio_notify (from qemu_bh_schedule).
>
> Aha!  We shouldn't be executing qemu_coroutine_new() very often since we
> try to keep a freelist of coroutines.
>
> I think a tweak to the freelist could make the rt_sigprocmask() calls go
> away since we should be reusing coroutines instead of allocating/freeing
> them all the time.
>
>> Both can be eliminated by introducing a fast path in bdrv_aio_{read,write}v,
>> that bypasses coroutines in the common case of no I/O throttling, no
>> copy-on-write, etc.
>
> I tried that in 2012 and couldn't measure an improvement above the noise
> threshold, although it was without dataplane.
>
> BTW, we cannot eliminate the BH because the block layer guarantees that
> callbacks are not invoked with reentrancy.  They are always invoked
> directly from the event loop through a BH.  This simplifies callers
> since they don't need to worry about callbacks happening while they are
> still in bdrv_aio_readv(), for example.
>
> Removing this guarantee (by making callers safe first) is orthogonal to
> coroutines.  But it's hard to do since it requires auditing a lot of
> code.
>
> Another idea is to skip aio_notify() when we're sure the event loop
> isn't blocked in g_poll().  Doing this in a thread-safe and lockless way
> might be tricky though.

The attachment debug patch skips aio_notify() if qemu_bh_schedule
is running from current aio context, but looks there is still 120K
writes triggered. (without the patch, 400K can be observed in
same test)

So is there still other writes not found in the path?


Thanks,
-- 
Ming Lei

[-- Attachment #2: bypass_aio_notify.patch --]
[-- Type: text/x-patch, Size: 1464 bytes --]

diff --git a/async.c b/async.c
index 5b6fe6b..5aa9982 100644
--- a/async.c
+++ b/async.c
@@ -40,6 +40,18 @@ struct QEMUBH {
     bool deleted;
 };
 
+static __thread AioContext *my_ctx = NULL;
+
+AioContext *get_current_aio_context(void)
+{
+    return my_ctx;
+}
+
+void set_current_aio_context(AioContext *ctx)
+{
+    my_ctx = ctx;
+}
+
 QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
 {
     QEMUBH *bh;
@@ -131,7 +143,9 @@ void qemu_bh_schedule(QEMUBH *bh)
      */
     smp_mb();
     bh->scheduled = 1;
-    aio_notify(ctx);
+
+    if (get_current_aio_context() != ctx)
+        aio_notify(ctx);
 }
 
 
diff --git a/include/block/aio.h b/include/block/aio.h
index a92511b..29f29e2 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -307,4 +307,7 @@ static inline void aio_timer_init(AioContext *ctx,
     timer_init(ts, ctx->tlg.tl[type], scale, cb, opaque);
 }
 
+AioContext *get_current_aio_context(void);
+void set_current_aio_context(AioContext *ctx);
+
 #endif
diff --git a/iothread.c b/iothread.c
index 1fbf9f1..beb32ad 100644
--- a/iothread.c
+++ b/iothread.c
@@ -36,6 +36,8 @@ static void *iothread_run(void *opaque)
     qemu_cond_signal(&iothread->init_done_cond);
     qemu_mutex_unlock(&iothread->init_done_lock);
 
+    set_current_aio_context(iothread->ctx);
+
     while (!iothread->stopping) {
         aio_context_acquire(iothread->ctx);
         while (!iothread->stopping && aio_poll(iothread->ctx, true)) {

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02 15:45                     ` Ming Lei
@ 2014-07-02 16:13                       ` Ming Lei
  2014-07-02 16:23                         ` Paolo Bonzini
  2014-07-02 16:21                       ` Paolo Bonzini
  1 sibling, 1 reply; 43+ messages in thread
From: Ming Lei @ 2014-07-02 16:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Paolo Bonzini

On Wed, Jul 2, 2014 at 11:45 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Wed, Jul 2, 2014 at 4:54 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Tue, Jul 01, 2014 at 06:49:30PM +0200, Paolo Bonzini wrote:
>>> On 01/07/2014 16:49, Ming Lei wrote:
>>> >Let me provide some data when running randread(bs 4k, libaio)
>>> >from VM for 10sec:
>>> >
>>> >1), qemu.git/master
>>> >- write(): 731K
>>> >- rt_sigprocmask(): 417K
>>> >- read(): 21K
>>> >- ppoll(): 10K
>>> >- io_submit(): 5K
>>> >- io_getevents(): 4K
>>> >
>>> >2), qemu 2.0
>>> >- write(): 9K
>>> >- read(): 28K
>>> >- ppoll(): 16K
>>> >- io_submit(): 12K
>>> >- io_getevents(): 10K
>>> >
>>> >>> The sigprocmask can probably be optimized away since the thread's
>>> >>> signal mask remains unchanged most of the time.
>>> >>>
>>> >>> I'm not sure what is causing the write().
>>> >I am investigating it...
>>>
>>> I would guess sigprocmask is getcontext (from qemu_coroutine_new) and write
>>> is aio_notify (from qemu_bh_schedule).
>>
>> Aha!  We shouldn't be executing qemu_coroutine_new() very often since we
>> try to keep a freelist of coroutines.
>>
>> I think a tweak to the freelist could make the rt_sigprocmask() calls go
>> away since we should be reusing coroutines instead of allocating/freeing
>> them all the time.
>>
>>> Both can be eliminated by introducing a fast path in bdrv_aio_{read,write}v,
>>> that bypasses coroutines in the common case of no I/O throttling, no
>>> copy-on-write, etc.
>>
>> I tried that in 2012 and couldn't measure an improvement above the noise
>> threshold, although it was without dataplane.
>>
>> BTW, we cannot eliminate the BH because the block layer guarantees that
>> callbacks are not invoked with reentrancy.  They are always invoked
>> directly from the event loop through a BH.  This simplifies callers
>> since they don't need to worry about callbacks happening while they are
>> still in bdrv_aio_readv(), for example.
>>
>> Removing this guarantee (by making callers safe first) is orthogonal to
>> coroutines.  But it's hard to do since it requires auditing a lot of
>> code.
>>
>> Another idea is to skip aio_notify() when we're sure the event loop
>> isn't blocked in g_poll().  Doing this in a thread-safe and lockless way
>> might be tricky though.
>
> The attachment debug patch skips aio_notify() if qemu_bh_schedule
> is running from current aio context, but looks there is still 120K
> writes triggered. (without the patch, 400K can be observed in
> same test)
>
> So is there still other writes not found in the path?

That must be for generating guest irq, which should have been
processed as batch easily.
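
As a toy illustration of the batching claim (hypothetical names; the real
code path is virtio-blk dataplane's completion handling), coalescing
per-completion notifications into one notification per drained batch looks
like:

```c
#include <assert.h>

/* Toy model of coalescing guest notifications: instead of notifying on
 * every I/O completion, remember that completions happened and notify
 * once when the batch of events has been drained.  Names hypothetical. */
typedef struct {
    int pending;        /* completions since the last notify */
    int notifies_sent;  /* stand-in for actual guest irq injections */
} NotifyBatch;

static void io_complete(NotifyBatch *b)
{
    b->pending++;       /* defer the notification */
}

static void batch_finish(NotifyBatch *b)
{
    if (b->pending > 0) {
        b->notifies_sent++;   /* one notify_guest() for the whole batch */
        b->pending = 0;
    }
}
```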


Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02 15:45                     ` Ming Lei
  2014-07-02 16:13                       ` Ming Lei
@ 2014-07-02 16:21                       ` Paolo Bonzini
  2014-07-03  4:54                         ` Ming Lei
  1 sibling, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2014-07-02 16:21 UTC (permalink / raw)
  To: Ming Lei, Stefan Hajnoczi
  Cc: Kevin Wolf, Stefan Hajnoczi, Fam Zheng, qemu-devel, Michael S. Tsirkin

On 02/07/2014 17:45, Ming Lei wrote:
> The attachment debug patch skips aio_notify() if qemu_bh_schedule
> is running from current aio context, but looks there is still 120K
> writes triggered. (without the patch, 400K can be observed in
> same test)

Nice.  Another observation is that after aio_dispatch we'll always
re-evaluate everything (bottom halves, file descriptors and timeouts),
so we can skip the aio_notify if we're inside aio_dispatch.

So what about this untested patch:

diff --git a/aio-posix.c b/aio-posix.c
index f921d4f..a23d85d 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -124,6 +124,9 @@ static bool aio_dispatch(AioContext *ctx)
     AioHandler *node;
     bool progress = false;
 
+    /* No need to set the event notifier during aio_notify.  */
+    ctx->running++;
+
     /*
      * We have to walk very carefully in case qemu_aio_set_fd_handler is
      * called while we're walking.
@@ -169,6 +171,11 @@ static bool aio_dispatch(AioContext *ctx)
     /* Run our timers */
     progress |= timerlistgroup_run_timers(&ctx->tlg);
 
+    smp_wmb();
+    ctx->iter_count++;
+    smp_wmb();
+    ctx->running--;
+
     return progress;
 }
 
diff --git a/async.c b/async.c
index 5b6fe6b..1f56afa 100644
--- a/async.c
+++ b/async.c
@@ -249,7 +249,19 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
 
 void aio_notify(AioContext *ctx)
 {
-    event_notifier_set(&ctx->notifier);
+    uint32_t iter_count;
+    do {
+        iter_count = ctx->iter_count;
+        /* Read ctx->iter_count before ctx->running.  */
+        smb_rmb();
+        if (!ctx->running) {
+            event_notifier_set(&ctx->notifier);
+            return;
+        }
+        /* Read ctx->running before ctx->iter_count.  */
+        smb_rmb();
+        /* ctx might have gone to sleep.  */
+    } while (iter_count != ctx->iter_count);
 }
 
 static void aio_timerlist_notify(void *opaque)
@@ -269,6 +279,7 @@ AioContext *aio_context_new(void)
     ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
     ctx->pollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
     ctx->thread_pool = NULL;
+    ctx->iter_count = ctx->running = 0;
     qemu_mutex_init(&ctx->bh_lock);
     rfifolock_init(&ctx->lock, aio_rfifolock_cb, ctx);
     event_notifier_init(&ctx->notifier, false);
diff --git a/include/block/aio.h b/include/block/aio.h
index a92511b..9f51c4f 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -51,6 +51,9 @@ struct AioContext {
     /* Protects all fields from multi-threaded access */
     RFifoLock lock;
 
+    /* Used to avoid aio_notify while dispatching event handlers.
+     * Writes protected by lock or BQL, reads are lockless.
+     */
+    uint32_t iter_count, running;
+
     /* The list of registered AIO handlers */
     QLIST_HEAD(, AioHandler) aio_handlers;
 

Please review carefully.
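
For reviewers, the intended protocol can be modeled in isolation (a sketch
using C11 atomics; the names and the notifier_set stand-in are hypothetical,
and QEMU's code uses its own barrier and event-notifier primitives instead):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Standalone model of the notify-elision protocol: a dispatcher marks
 * itself "running"; notifiers may skip the expensive wakeup while a
 * dispatch iteration is guaranteed to re-check all work sources. */
typedef struct {
    atomic_uint iter_count;    /* bumped at the end of each dispatch */
    atomic_uint running;       /* nonzero while inside dispatch */
    atomic_bool notifier_set;  /* stand-in for event_notifier_set() */
} Ctx;

static void dispatch_begin(Ctx *ctx)
{
    atomic_fetch_add(&ctx->running, 1);
}

static void dispatch_end(Ctx *ctx)
{
    atomic_fetch_add(&ctx->iter_count, 1);
    atomic_fetch_sub(&ctx->running, 1);
}

/* Returns true if the wakeup was actually issued. */
static bool notify(Ctx *ctx)
{
    unsigned iter;
    do {
        iter = atomic_load(&ctx->iter_count);
        if (atomic_load(&ctx->running) == 0) {
            atomic_store(&ctx->notifier_set, true);
            return true;
        }
        /* A dispatch is in flight; if it completed meanwhile
         * (iter_count changed), retry, because the loop may now go
         * to sleep without having seen our work. */
    } while (iter != atomic_load(&ctx->iter_count));
    return false;  /* the running dispatch will re-evaluate everything */
}
```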

> So is there still other writes not found in the path?

What do perf or gdb say? :)

Paolo

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02 16:13                       ` Ming Lei
@ 2014-07-02 16:23                         ` Paolo Bonzini
  2014-07-02 16:27                           ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2014-07-02 16:23 UTC (permalink / raw)
  To: Ming Lei, Stefan Hajnoczi
  Cc: Kevin Wolf, Stefan Hajnoczi, Fam Zheng, qemu-devel, Michael S. Tsirkin

On 02/07/2014 18:13, Ming Lei wrote:
> That must be for generating guest irq, which should have been
> processed as batch easily.

No, guest irqs are generated (with notify_guest) on every I/O completion 
even in 2.0.

Paolo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02 16:23                         ` Paolo Bonzini
@ 2014-07-02 16:27                           ` Ming Lei
  2014-07-02 16:38                             ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2014-07-02 16:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 3, 2014 at 12:23 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 02/07/2014 18:13, Ming Lei wrote:
>
>> That must be for generating guest irq, which should have been
>> processed as batch easily.
>
>
> No, guest irqs are generated (with notify_guest) on every I/O completion
> even in 2.0.

In 2.0, only notify_guest() is called after event batch is completed,
I wrote one patch days ago for fixing the problem, and today
it is just verified that writes can be decreased to 10K from 120K.

BTW, what do you think about the patch v4 for submitting I/O
as batch?


Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02 16:27                           ` Ming Lei
@ 2014-07-02 16:38                             ` Paolo Bonzini
  2014-07-02 16:41                               ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2014-07-02 16:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Stefan Hajnoczi


> On Thu, Jul 3, 2014 at 12:23 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > On 02/07/2014 18:13, Ming Lei wrote:
> >
> >> That must be for generating guest irq, which should have been
> >> processed as batch easily.
> >
> >
> > No, guest irqs are generated (with notify_guest) on every I/O completion
> > even in 2.0.
> 
> In 2.0, only notify_guest() is called after event batch is completed,

Ah, you're right.

> I wrote one patch days ago for fixing the problem, and today
> it is just verified that writes can be decreased to 10K from 120K.

Great.  Can you test/review my aio_notify patch and submit both then?  Bonus
points, but not a requirement IMO,  if it also helps non-dataplane.

This would only leave rt_sigprocmask.

> BTW, what do you think about the patch v4 for submitting I/O as batch?

I was waiting for Kevin to review it.

Paolo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02 16:38                             ` Paolo Bonzini
@ 2014-07-02 16:41                               ` Ming Lei
  0 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2014-07-02 16:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 3, 2014 at 12:38 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On Thu, Jul 3, 2014 at 12:23 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> > On 02/07/2014 18:13, Ming Lei wrote:
>> >
>> >> That must be for generating guest irq, which should have been
>> >> processed as batch easily.
>> >
>> >
>> > No, guest irqs are generated (with notify_guest) on every I/O completion
>> > even in 2.0.
>>
>> In 2.0, only notify_guest() is called after event batch is completed,
>
> Ah, you're right.
>
>> I wrote one patch days ago for fixing the problem, and today
>> it is just verified that writes can be decreased to 10K from 120K.
>
> Great.  Can you test/review my aio_notify patch and submit both then?  Bonus
> points, but not a requirement IMO,  if it also helps non-dataplane.

OK, I will review and test it tomorrow; it is a bit late for me today.


Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-02 16:21                       ` Paolo Bonzini
@ 2014-07-03  4:54                         ` Ming Lei
  2014-07-03 10:29                           ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2014-07-03  4:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 3, 2014 at 12:21 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 02/07/2014 17:45, Ming Lei wrote:
>> The attachment debug patch skips aio_notify() if qemu_bh_schedule
>> is running from current aio context, but looks there is still 120K
>> writes triggered. (without the patch, 400K can be observed in
>> same test)
>
> Nice.  Another observation is that after aio_dispatch we'll always
> re-evaluate everything (bottom halves, file descriptors and timeouts),

The idea is very good.

If aio_notify() is called from the 1st aio_dispatch() in aio_poll(),
ctx->notifier might need to be set, but it can be handled easily.

> so we can skip the aio_notify if we're inside aio_dispatch.
>
> So what about this untested patch:
>
> diff --git a/aio-posix.c b/aio-posix.c
> index f921d4f..a23d85d 100644
> --- a/aio-posix.c
> +++ b/aio-posix.c

#include "qemu/atomic.h"

> @@ -124,6 +124,9 @@ static bool aio_dispatch(AioContext *ctx)
>      AioHandler *node;
>      bool progress = false;
>
> +    /* No need to set the event notifier during aio_notify.  */
> +    ctx->running++;
> +
>      /*
>       * We have to walk very carefully in case qemu_aio_set_fd_handler is
>       * called while we're walking.
> @@ -169,6 +171,11 @@ static bool aio_dispatch(AioContext *ctx)
>      /* Run our timers */
>      progress |= timerlistgroup_run_timers(&ctx->tlg);
>
> +    smp_wmb();
> +    ctx->iter_count++;
> +    smp_wmb();
> +    ctx->running--;
> +
>      return progress;
>  }
>
> diff --git a/async.c b/async.c
> index 5b6fe6b..1f56afa 100644
> --- a/async.c
> +++ b/async.c

#include "qemu/atomic.h"

> @@ -249,7 +249,19 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
>
>  void aio_notify(AioContext *ctx)
>  {
> -    event_notifier_set(&ctx->notifier);
> +    uint32_t iter_count;
> +    do {
> +        iter_count = ctx->iter_count;
> +        /* Read ctx->iter_count before ctx->running.  */
> +        smb_rmb();

s/smb/smp

> +        if (!ctx->running) {
> +            event_notifier_set(&ctx->notifier);
> +            return;
> +        }
> +        /* Read ctx->running before ctx->iter_count.  */
> +        smb_rmb();

s/smb/smp

> +        /* ctx might have gone to sleep.  */
> +    } while (iter_count != ctx->iter_count);
>  }

Since both 'running' and 'iter_count'  may be read lockless, something
like ACCESS_ONCE() should be used to avoid compiler optimization.

>
>  static void aio_timerlist_notify(void *opaque)
> @@ -269,6 +279,7 @@ AioContext *aio_context_new(void)
>      ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
>      ctx->pollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
>      ctx->thread_pool = NULL;
> +    ctx->iter_count = ctx->running = 0;
>      qemu_mutex_init(&ctx->bh_lock);
>      rfifolock_init(&ctx->lock, aio_rfifolock_cb, ctx);
>      event_notifier_init(&ctx->notifier, false);
> diff --git a/include/block/aio.h b/include/block/aio.h
> index a92511b..9f51c4f 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -51,6 +51,9 @@ struct AioContext {
>      /* Protects all fields from multi-threaded access */
>      RFifoLock lock;
>
> +    /* Used to avoid aio_notify while dispatching event handlers.
> +     * Writes protected by lock or BQL, reads are lockless.
> +     */
> +    uint32_t iter_count, running;
> +
>      /* The list of registered AIO handlers */
>      QLIST_HEAD(, AioHandler) aio_handlers;
>

In my test, it does decrease write() very much, and I hope
a formal version can be applied soon.


Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-03  4:54                         ` Ming Lei
@ 2014-07-03 10:29                           ` Paolo Bonzini
  2014-07-03 11:50                             ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2014-07-03 10:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Stefan Hajnoczi

On 03/07/2014 06:54, Ming Lei wrote:
> On Thu, Jul 3, 2014 at 12:21 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 02/07/2014 17:45, Ming Lei wrote:
>>> The attachment debug patch skips aio_notify() if qemu_bh_schedule
>>> is running from current aio context, but looks there is still 120K
>>> writes triggered. (without the patch, 400K can be observed in
>>> same test)
>>
>> Nice.  Another observation is that after aio_dispatch we'll always
>> re-evaluate everything (bottom halves, file descriptors and timeouts),
>
> The idea is very good.
>
> If aio_notify() is called from the 1st aio_dispatch() in aio_poll(),
> ctx->notifier might need to be set, but it can be handled easily.

Yes, you can just move the atomic_inc/atomic_dec in aio_poll.

>> so we can skip the aio_notify if we're inside aio_dispatch.
>>
>> So what about this untested patch:
>>
>> diff --git a/aio-posix.c b/aio-posix.c
>> index f921d4f..a23d85d 100644
>> --- a/aio-posix.c
>> +++ b/aio-posix.c
>
> #include "qemu/atomic.h"
>
>> @@ -124,6 +124,9 @@ static bool aio_dispatch(AioContext *ctx)
>>      AioHandler *node;
>>      bool progress = false;
>>
>> +    /* No need to set the event notifier during aio_notify.  */
>> +    ctx->running++;
>> +
>>      /*
>>       * We have to walk very carefully in case qemu_aio_set_fd_handler is
>>       * called while we're walking.
>> @@ -169,6 +171,11 @@ static bool aio_dispatch(AioContext *ctx)
>>      /* Run our timers */
>>      progress |= timerlistgroup_run_timers(&ctx->tlg);
>>
>> +    smp_wmb();
>> +    ctx->iter_count++;
>> +    smp_wmb();
>> +    ctx->running--;
>> +
>>      return progress;
>>  }
>>
>> diff --git a/async.c b/async.c
>> index 5b6fe6b..1f56afa 100644
>> --- a/async.c
>> +++ b/async.c
>
> #include "qemu/atomic.h"
>
>> @@ -249,7 +249,19 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
>>
>>  void aio_notify(AioContext *ctx)
>>  {
>> -    event_notifier_set(&ctx->notifier);
>> +    uint32_t iter_count;
>> +    do {
>> +        iter_count = ctx->iter_count;
>> +        /* Read ctx->iter_count before ctx->running.  */
>> +        smb_rmb();
>
> s/smb/smp
>
>> +        if (!ctx->running) {
>> +            event_notifier_set(&ctx->notifier);
>> +            return;
>> +        }
>> +        /* Read ctx->running before ctx->iter_count.  */
>> +        smb_rmb();
>
> s/smb/smp
>
>> +        /* ctx might have gone to sleep.  */
>> +    } while (iter_count != ctx->iter_count);
>>  }
>
> Since both 'running' and 'iter_count'  may be read lockless, something
> like ACCESS_ONCE() should be used to avoid compiler optimization.

No, smp_rmb() is enough to avoid them.  See also include/qemu/seqlock.h

The first access to ctx->iter_count _could_ be protected by 
ACCESS_ONCE(), which in QEMU we call atomic_read()/atomic_set(), but 
it's not necessary.  See docs/atomics.txt for a description for QEMU's
atomic access functions.

> In my test, it does decrease write() very much, and I hope
> a formal version can be applied soon.

Can you take care of that (you can add my Signed-off-by), since you have 
the best testing environment?  v5 of the plug/unplug series will be good 
to go, I think.

Paolo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-03 10:29                           ` Paolo Bonzini
@ 2014-07-03 11:50                             ` Ming Lei
  2014-07-03 11:56                               ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2014-07-03 11:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 3, 2014 at 6:29 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 03/07/2014 06:54, Ming Lei wrote:
>
>> On Thu, Jul 3, 2014 at 12:21 AM, Paolo Bonzini <pbonzini@redhat.com>
>> wrote:
>>>
>>> On 02/07/2014 17:45, Ming Lei wrote:
>>>>
>>>> The attachment debug patch skips aio_notify() if qemu_bh_schedule
>>>> is running from current aio context, but looks there is still 120K
>>>> writes triggered. (without the patch, 400K can be observed in
>>>> same test)
>>>
>>>
>>> Nice.  Another observation is that after aio_dispatch we'll always
>>> re-evaluate everything (bottom halves, file descriptors and timeouts),
>>
>>
>> The idea is very good.
>>
>> If aio_notify() is called from the 1st aio_dispatch() in aio_poll(),
>> ctx->notifier might need to be set, but it can be handled easily.
>
>
> Yes, you can just move the atomic_inc/atomic_dec in aio_poll.

If you mean moving the inc/dec of 'running' into aio_poll, that won't
work: while aio_notify() sees 'running' set, it won't set the notifier,
so the loop may then block in ppoll() without being woken.

We can set a flag in aio_notify() when the notifier is bypassed, then
check the flag before ppoll() to decide whether we need to poll at all.
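
A standalone sketch of this flag idea (hypothetical names, not an actual
patch; QEMU would use its own atomics and event-notifier helpers):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the deferred-notify flag: when the notifier write is elided
 * because dispatch is running, remember it, and let the poll loop
 * consume the flag before blocking.  Names are hypothetical. */
typedef struct {
    atomic_uint running;   /* nonzero while dispatch is in progress */
    atomic_bool deferred;  /* set when a notify was bypassed */
} NotifyCtx;

static void notify(NotifyCtx *ctx)
{
    if (atomic_load(&ctx->running) > 0) {
        /* Record that work arrived instead of writing the notifier. */
        atomic_store(&ctx->deferred, true);
    } else {
        /* event_notifier_set(&ctx->notifier) would go here. */
    }
}

/* Called just before blocking in ppoll(): true means skip the blocking
 * poll and run another dispatch iteration instead. */
static bool should_skip_poll(NotifyCtx *ctx)
{
    return atomic_exchange(&ctx->deferred, false);
}
```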

>
>
>>> so we can skip the aio_notify if we're inside aio_dispatch.
>>>
>>> So what about this untested patch:
>>>
>>> diff --git a/aio-posix.c b/aio-posix.c
>>> index f921d4f..a23d85d 100644
>>> --- a/aio-posix.c
>>> +++ b/aio-posix.c
>>
>>
>> #include "qemu/atomic.h"
>>
>>> @@ -124,6 +124,9 @@ static bool aio_dispatch(AioContext *ctx)
>>>      AioHandler *node;
>>>      bool progress = false;
>>>
>>> +    /* No need to set the event notifier during aio_notify.  */
>>> +    ctx->running++;
>>> +
>>>      /*
>>>       * We have to walk very carefully in case qemu_aio_set_fd_handler is
>>>       * called while we're walking.
>>> @@ -169,6 +171,11 @@ static bool aio_dispatch(AioContext *ctx)
>>>      /* Run our timers */
>>>      progress |= timerlistgroup_run_timers(&ctx->tlg);
>>>
>>> +    smp_wmb();
>>> +    ctx->iter_count++;
>>> +    smp_wmb();
>>> +    ctx->running--;
>>> +
>>>      return progress;
>>>  }
>>>
>>> diff --git a/async.c b/async.c
>>> index 5b6fe6b..1f56afa 100644
>>> --- a/async.c
>>> +++ b/async.c
>>
>>
>> #include "qemu/atomic.h"
>>
>>> @@ -249,7 +249,19 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
>>>
>>>  void aio_notify(AioContext *ctx)
>>>  {
>>> -    event_notifier_set(&ctx->notifier);
>>> +    uint32_t iter_count;
>>> +    do {
>>> +        iter_count = ctx->iter_count;
>>> +        /* Read ctx->iter_count before ctx->running.  */
>>> +        smb_rmb();
>>
>>
>> s/smb/smp
>>
>>> +        if (!ctx->running) {
>>> +            event_notifier_set(&ctx->notifier);
>>> +            return;
>>> +        }
>>> +        /* Read ctx->running before ctx->iter_count.  */
>>> +        smb_rmb();
>>
>>
>> s/smb/smp
>>
>>> +        /* ctx might have gone to sleep.  */
>>> +    } while (iter_count != ctx->iter_count);
>>>  }
>>
>>
>> Since both 'running' and 'iter_count' may be read locklessly, something
>> like ACCESS_ONCE() should be used to avoid compiler optimizations.
>
>
> No, smp_rmb() is enough to avoid them.  See also include/qemu/seqlock.h

Yes, there is ordering between the reads/writes of iter_count and running.
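For reference, the pairing in the patch is the classic seqlock pattern.
A stripped-down, compilable illustration (names are made up; C11
seq_cst atomics are used here, which are stronger than the smp_rmb()
and smp_wmb() pairs in the patch):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Seqlock-style pairing: the dispatcher bumps iter_count between
 * setting and clearing running; a notifier retries if it may have
 * raced with the dispatcher going back to sleep. */
typedef struct {
    atomic_uint iter_count;
    atomic_uint running;
} SeqCtx;

static void dispatch_enter(SeqCtx *c)
{
    atomic_fetch_add(&c->running, 1);
}

static void dispatch_exit(SeqCtx *c)
{
    /* Publish the new iteration before clearing running, matching the
     * smp_wmb() pair in the patch. */
    atomic_fetch_add(&c->iter_count, 1);
    atomic_fetch_sub(&c->running, 1);
}

/* Returns true if the event notifier must be set. */
static bool notify_needs_wakeup(SeqCtx *c)
{
    unsigned iter;
    do {
        iter = atomic_load(&c->iter_count);
        if (!atomic_load(&c->running)) {
            return true;   /* dispatcher may be asleep: wake it */
        }
        /* Dispatcher was running; if iter_count is unchanged it is
         * still in the same iteration and will see our work. */
    } while (iter != atomic_load(&c->iter_count));
    return false;
}
```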

> The first access to ctx->iter_count _could_ be protected by ACCESS_ONCE(),
> which in QEMU we call atomic_read()/atomic_set(), but it's not necessary.
> See docs/atomics.txt for a description of QEMU's
> atomic access functions.
>
>
>> In my test it greatly reduces the number of write() calls, and I hope
>> a formal version can be applied soon.
>
>
> Can you take care of that (you can add my Signed-off-by), since you have the
> best testing environment?  v5 of the plug/unplug series will be good to go,

OK, I will do it and include it in v5, with you as author and your
Signed-off-by.

BTW, I do not have the best testing environment; all the plug/unplug
tests were done on my laptop. :-)


Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-03 11:50                             ` Ming Lei
@ 2014-07-03 11:56                               ` Paolo Bonzini
  2014-07-03 12:09                                 ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2014-07-03 11:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Stefan Hajnoczi

Il 03/07/2014 13:50, Ming Lei ha scritto:
>> Yes, you can just move the atomic_inc/atomic_dec in aio_poll.
>
> If you mean moving the inc/dec of 'running' into aio_poll(), that won't
> work: aio_notify() would see 'running' set, skip the notifier, and the
> event loop could then go to sleep in ppoll() and miss the wakeup.

I mean move it to aio_poll, around the second invocation of aio_dispatch.

IIRC the first invocation of aio_dispatch is only used when AioContext 
is running as a GSource.  It should never run for dataplane, so it's 
okay if we only trap aio_notify from the second aio_dispatch().

In 2.2, we should rewrite aio_poll() to follow the 
prepare/poll/check/dispatch model of glib.  Then the optimization 
becomes much simpler (you only call aio_dispatch once) and it also works 
for GSource mode.
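The glib-style structure mentioned above looks roughly like this; the
types and the 'ready' bookkeeping are stand-ins for illustration, not
actual glib or QEMU code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Rough sketch of glib's prepare/poll/check/dispatch loop shape.  In
 * the real loop, prepare computes the poll timeout from pending
 * timers, a single ppoll() over all fds fills in revents, and check
 * translates revents into per-source readiness. */
typedef struct {
    bool ready;                      /* would come from ppoll() revents */
    bool (*dispatch)(void *opaque);  /* handler to run when ready */
    void *opaque;
} Source;

static bool loop_iteration(Source *sources, size_t n)
{
    bool progress = false;
    /* prepare + poll + check elided; assume 'ready' is filled in. */
    for (size_t i = 0; i < n; i++) {
        if (sources[i].ready) {
            sources[i].ready = false;   /* consume readiness */
            progress |= sources[i].dispatch(sources[i].opaque);
        }
    }
    /* Each handler runs at most once per iteration, which is what
     * makes the aio_notify suppression simple in this model. */
    return progress;
}
```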

Stefan convinced me to send the patch myself, so I will do it now.

Paolo


* Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2
  2014-07-03 11:56                               ` Paolo Bonzini
@ 2014-07-03 12:09                                 ` Ming Lei
  0 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2014-07-03 12:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 3, 2014 at 7:56 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 03/07/2014 13:50, Ming Lei ha scritto:
>
>>> Yes, you can just move the atomic_inc/atomic_dec in aio_poll.
>>
>>
>> If you mean moving the inc/dec of 'running' into aio_poll(), that won't
>> work: aio_notify() would see 'running' set, skip the notifier, and the
>> event loop could then go to sleep in ppoll() and miss the wakeup.
>
>
> I mean move it to aio_poll, around the second invocation of aio_dispatch.
> IIRC the first invocation of aio_dispatch is only used when AioContext is
> running as a GSource.  It should never run for dataplane, so it's okay if we
> only trap aio_notify from the second aio_dispatch().

That is fine.

>
> In 2.2, we should rewrite aio_poll() to follow the
> prepare/poll/check/dispatch model of glib.  Then the optimization becomes
> much simpler (you only call aio_dispatch once) and it also works for GSource
> mode.
>
> Stefan convinced me to send the patch myself, so I will do it now.

That is great, and virtio-blk with multiple virtqueues can benefit
a lot from this patch. :-)

Thanks,
-- 
Ming Lei


end of thread, other threads:[~2014-07-03 12:09 UTC | newest]

Thread overview: 43+ messages
2014-06-26 15:14 [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2 Ming Lei
2014-06-26 15:29 ` Paolo Bonzini
2014-06-26 15:37   ` Ming Lei
2014-06-26 15:43     ` Paolo Bonzini
2014-06-26 15:47       ` Ming Lei
2014-06-26 15:57         ` Paolo Bonzini
2014-06-27  1:15           ` Ming Lei
2014-06-27  4:59             ` Paolo Bonzini
2014-06-27  6:23               ` Kevin Wolf
2014-06-27  7:35                 ` Paolo Bonzini
2014-06-27 12:35                 ` Ming Lei
2014-06-27  7:57               ` Ming Lei
2014-06-27 12:01 ` Stefan Hajnoczi
2014-06-27 12:21   ` Kevin Wolf
2014-06-27 14:50     ` Stefan Hajnoczi
2014-06-27 18:01   ` Ming Lei
2014-06-27 21:51     ` Paolo Bonzini
2014-06-28  9:58       ` Ming Lei
2014-06-30  8:08         ` Stefan Hajnoczi
2014-06-30  8:27           ` Ming Lei
2014-07-01 13:53           ` Ming Lei
2014-07-01 14:31             ` Stefan Hajnoczi
2014-07-01 14:49               ` Ming Lei
2014-07-01 16:49                 ` Paolo Bonzini
2014-07-02  0:48                   ` Ming Lei
2014-07-02  8:54                   ` Stefan Hajnoczi
2014-07-02  9:13                     ` Paolo Bonzini
2014-07-02  9:39                       ` Kevin Wolf
2014-07-02  9:48                         ` Paolo Bonzini
2014-07-02 10:01                           ` Kevin Wolf
2014-07-02 10:23                             ` Paolo Bonzini
2014-07-02 15:45                     ` Ming Lei
2014-07-02 16:13                       ` Ming Lei
2014-07-02 16:23                         ` Paolo Bonzini
2014-07-02 16:27                           ` Ming Lei
2014-07-02 16:38                             ` Paolo Bonzini
2014-07-02 16:41                               ` Ming Lei
2014-07-02 16:21                       ` Paolo Bonzini
2014-07-03  4:54                         ` Ming Lei
2014-07-03 10:29                           ` Paolo Bonzini
2014-07-03 11:50                             ` Ming Lei
2014-07-03 11:56                               ` Paolo Bonzini
2014-07-03 12:09                                 ` Ming Lei
