* [Qemu-devel] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
@ 2018-11-07 12:13 Richard W.M. Jones
  2018-11-07 14:36 ` Richard W.M. Jones
  2018-11-07 16:42 ` [Qemu-devel] " Eric Blake
  0 siblings, 2 replies; 15+ messages in thread
From: Richard W.M. Jones @ 2018-11-07 12:13 UTC (permalink / raw)
  To: qemu-devel, eblake, qemu-block, Edgar Kaziakhmedov; +Cc: nsoffer

(I'm not going to claim this is a bug, but it causes a large, easily
measurable performance regression in virt-v2v).

In qemu 2.10, when you do ‘qemu-img convert’ to an NBD target, qemu
interleaves write and zero requests.  We can observe this as follows:

  $ virt-builder fedora-28
  $ nbdkit --filter=log memory size=6G logfile=/tmp/log \
      --run './qemu-img convert ./fedora-28.img -n $nbd'
  $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
      1 Write
      2 Zero
      1 Write
      3 Zero
      1 Write
      1 Zero
      1 Write
    [etc for over 1000 lines]

Looking at the log file in detail we can see it is writing serially
from the beginning to the end of the disk.

In qemu 2.12 this behaviour changed:

  $ nbdkit --filter=log memory size=6G logfile=/tmp/log \
      --run './qemu-img convert ./fedora-28.img -n $nbd'
  $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
      193 Zero
     1246 Write

It now zeroes the whole disk up front and then writes data over the
top of the zeroed blocks.

The reason for the performance regression is that in the first case we
write 6G in total.  In the second case we write 6G of zeroes up front,
followed by the amount of data in the disk image (in this case the
test disk image contains 1G of non-sparse data, so we write about 7G
in total).

In real-world cases this makes a big difference: we might have hundreds
of gigabytes of data in the disk.  The ultimate backend storage (a Linux
block device) doesn't support efficient BLKZEROOUT, so zeroing is pretty
slow too.

I bisected the change to the commit shown at the end of this email.

Any suggestions how to fix or work around this problem welcome.

Rich.


commit 9776f0db6a19a0510e89b7aae38190b4811c95ba
Author: Edgar Kaziakhmedov <edgar.kaziakhmedov@virtuozzo.com>
Date:   Thu Jan 18 14:51:58 2018 +0300

    nbd: implement bdrv_get_info callback
    
    Since mirror job supports efficient zero out target mechanism (see
    in mirror_dirty_init()), implement bdrv_get_info to make it work
    over NBD. Such improvement will allow using the largest chunk possible
    and will decrease the number of NBD_CMD_WRITE_ZEROES requests on the wire.
    
    Signed-off-by: Edgar Kaziakhmedov <edgar.kaziakhmedov@virtuozzo.com>
    Message-Id: <20180118115158.17219-1-edgar.kaziakhmedov@virtuozzo.com>
    Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Eric Blake <eblake@redhat.com>


-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/


* Re: [Qemu-devel] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-07 12:13 [Qemu-devel] Change in qemu 2.12 causes qemu-img convert to NBD to write more data Richard W.M. Jones
@ 2018-11-07 14:36 ` Richard W.M. Jones
  2018-11-07 14:56   ` Nir Soffer
  2018-11-07 16:42 ` [Qemu-devel] " Eric Blake
  1 sibling, 1 reply; 15+ messages in thread
From: Richard W.M. Jones @ 2018-11-07 14:36 UTC (permalink / raw)
  To: qemu-devel, eblake, qemu-block, Edgar Kaziakhmedov; +Cc: nsoffer

Another thing I tried was to change the NBD server (nbdkit) so that it
doesn't advertise zero support to the client:

  $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
      --run './qemu-img convert ./fedora-28.img -n $nbd'
  $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c 
   2154 Write

Not surprisingly, no zero commands are issued.  The size of the write
commands is very uneven -- it appears to send one command per block
of zeroes or data.

Nir: If we could get information from imageio about whether zeroing is
implemented efficiently or not by the backend, we could change
virt-v2v / nbdkit to advertise this back to qemu.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


* Re: [Qemu-devel] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-07 14:36 ` Richard W.M. Jones
@ 2018-11-07 14:56   ` Nir Soffer
  2018-11-07 15:02     ` Richard W.M. Jones
  2018-11-07 17:27     ` [Qemu-devel] [Qemu-block] " Kevin Wolf
  0 siblings, 2 replies; 15+ messages in thread
From: Nir Soffer @ 2018-11-07 14:56 UTC (permalink / raw)
  To: Richard W.M. Jones
  Cc: QEMU Developers, Eric Blake, qemu-block, edgar.kaziakhmedov

Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjones@redhat.com> wrote:

> Another thing I tried was to change the NBD server (nbdkit) so that it
> doesn't advertise zero support to the client:
>
>   $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
>       --run './qemu-img convert ./fedora-28.img -n $nbd'
>   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
>    2154 Write
>
> Not surprisingly no zero commands are issued.  The size of the write
> commands is very uneven -- it appears to be send one command per block
> of zeroes or data.
>
> Nir: If we could get information from imageio about whether zeroing is
> implemented efficiently or not by the backend, we could change
> virt-v2v / nbdkit to advertise this back to qemu.
>

There is no way to detect the capability; ioctl(BLKZEROOUT) always
succeeds, silently falling back to manual zeroing in the kernel.
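
(To illustrate: a minimal sketch of the call in question, assuming a
Linux block device -- the ioctl reports success whether the device
offloaded the zeroing or the kernel wrote the zeroes itself, so the
caller cannot tell the fast and slow paths apart:)

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>     /* BLKZEROOUT */

  int main (int argc, char **argv)
  {
      uint64_t range[2];
      int fd;

      if (argc != 4) {
          fprintf (stderr, "usage: zeroout DEVICE OFFSET LENGTH\n");
          return 1;
      }
      fd = open (argv[1], O_WRONLY);
      if (fd == -1) {
          perror (argv[1]);
          return 1;
      }
      /* Offset and length in bytes, multiples of the logical block size. */
      range[0] = strtoull (argv[2], NULL, 0);
      range[1] = strtoull (argv[3], NULL, 0);
      if (ioctl (fd, BLKZEROOUT, range) == -1) {
          perror ("BLKZEROOUT");
          return 1;
      }
      /* Success only means the range now reads as zeroes; it does not
         say whether the device offloaded the work or the kernel wrote
         zero pages itself. */
      close (fd);
      return 0;
  }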

Even if we could, sending zeroes on the wire from qemu may be even
slower, and it looks like qemu sends even more requests in this case
(2154 vs ~1300).

It looks like this optimization on the qemu side leads to worse
performance, so it should not be enabled by default.

Nir


* Re: [Qemu-devel] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-07 14:56   ` Nir Soffer
@ 2018-11-07 15:02     ` Richard W.M. Jones
  2018-11-07 17:27     ` [Qemu-devel] [Qemu-block] " Kevin Wolf
  1 sibling, 0 replies; 15+ messages in thread
From: Richard W.M. Jones @ 2018-11-07 15:02 UTC (permalink / raw)
  To: Nir Soffer; +Cc: QEMU Developers, Eric Blake, qemu-block, edgar.kaziakhmedov

On Wed, Nov 07, 2018 at 04:56:48PM +0200, Nir Soffer wrote:
> Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjones@redhat.com> wrote:
> 
> > Another thing I tried was to change the NBD server (nbdkit) so that it
> > doesn't advertise zero support to the client:
> >
> >   $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
> >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> >    2154 Write
> >
> > Not surprisingly no zero commands are issued.  The size of the write
> > commands is very uneven -- it appears to be send one command per block
> > of zeroes or data.
> >
> > Nir: If we could get information from imageio about whether zeroing is
> > implemented efficiently or not by the backend, we could change
> > virt-v2v / nbdkit to advertise this back to qemu.
> >
> 
> There is no way to detect the capability, ioctl(BLKZEROOUT) always
> succeeds, falling back to manual zeroing in the kernel silently
> 
> Even if we could, sending zero on the wire from qemu may be even
> slower,

Yes, this is a very good point.  Sending zeroes would be terrible.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/


* Re: [Qemu-devel] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-07 12:13 [Qemu-devel] Change in qemu 2.12 causes qemu-img convert to NBD to write more data Richard W.M. Jones
  2018-11-07 14:36 ` Richard W.M. Jones
@ 2018-11-07 16:42 ` Eric Blake
  2018-11-11 15:25   ` Nir Soffer
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Blake @ 2018-11-07 16:42 UTC (permalink / raw)
  To: Richard W.M. Jones, qemu-devel, qemu-block, Edgar Kaziakhmedov; +Cc: nsoffer

On 11/7/18 6:13 AM, Richard W.M. Jones wrote:
> (I'm not going to claim this is a bug, but it causes a large, easily
> measurable performance regression in virt-v2v).

I haven't closely looked at this email thread yet, but a quick first 
impression:


> In qemu 2.12 this behaviour changed:
> 
>    $ nbdkit --filter=log memory size=6G logfile=/tmp/log \
>        --run './qemu-img convert ./fedora-28.img -n $nbd'
>    $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
>        193 Zero
>       1246 Write
> 
> It now zeroes the whole disk up front and then writes data over the
> top of the zeroed blocks.
> 
> The reason for the performance regression is that in the first case we
> write 6G in total.  In the second case we write 6G of zeroes up front,
> followed by the amount of data in the disk image (in this case the
> test disk image contains 1G of non-sparse data, so we write about 7G
> in total).

There was talk on the NBD list a while ago about the idea of letting the 
server advertise to the client when the image is known to start in an 
all-zero state, so that the client doesn't have to waste time writing 
zeroes (or relying on repeated NBD_CMD_BLOCK_STATUS calls to learn the 
same).  This may be justification for reviving that topic.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


* Re: [Qemu-devel] [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-07 14:56   ` Nir Soffer
  2018-11-07 15:02     ` Richard W.M. Jones
@ 2018-11-07 17:27     ` Kevin Wolf
  2018-11-07 17:55       ` Nir Soffer
  1 sibling, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2018-11-07 17:27 UTC (permalink / raw)
  To: Nir Soffer
  Cc: Richard W.M. Jones, edgar.kaziakhmedov, QEMU Developers, qemu-block

Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjones@redhat.com> wrote:
> 
> > Another thing I tried was to change the NBD server (nbdkit) so that it
> > doesn't advertise zero support to the client:
> >
> >   $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
> >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> >    2154 Write
> >
> > Not surprisingly no zero commands are issued.  The size of the write
> > commands is very uneven -- it appears to be send one command per block
> > of zeroes or data.
> >
> > Nir: If we could get information from imageio about whether zeroing is
> > implemented efficiently or not by the backend, we could change
> > virt-v2v / nbdkit to advertise this back to qemu.
> 
> There is no way to detect the capability, ioctl(BLKZEROOUT) always
> succeeds, falling back to manual zeroing in the kernel silently
> 
> Even if we could, sending zero on the wire from qemu may be even
> slower, and it looks like qemu send even more requests in this case
> (2154 vs ~1300).
> 
> Looks like this optimization in qemu side leads to worse performance,
> so it should not be enabled by default.

Well, that's overgeneralising your case a bit. If the backend does
support efficient zero writes (which file systems, the most common case,
generally do), doing one big write_zeroes request at the start can
improve performance quite a bit.

It seems the problem is that we can't really know whether the operation
will be efficient because the backends generally don't tell us. Maybe
NBD could introduce a flag for this, but in the general case it appears
to me that we'll have to have a command line option.

However, I'm curious what your exact use case is and which backend it
uses. Can something be improved there to actually get efficient zero
writes, and even better performance than just disabling the big zero
write?

Kevin


* Re: [Qemu-devel] [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-07 17:27     ` [Qemu-devel] [Qemu-block] " Kevin Wolf
@ 2018-11-07 17:55       ` Nir Soffer
  2018-11-11 16:11         ` Nir Soffer
  0 siblings, 1 reply; 15+ messages in thread
From: Nir Soffer @ 2018-11-07 17:55 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Richard W.M. Jones, edgar.kaziakhmedov, QEMU Developers, qemu-block

On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf <kwolf@redhat.com> wrote:

> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjones@redhat.com>
> wrote:
> >
> > > Another thing I tried was to change the NBD server (nbdkit) so that it
> > > doesn't advertise zero support to the client:
> > >
> > >   $ nbdkit --filter=log --filter=nozero memory size=6G
> logfile=/tmp/log \
> > >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> > >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> > >    2154 Write
> > >
> > > Not surprisingly no zero commands are issued.  The size of the write
> > > commands is very uneven -- it appears to be send one command per block
> > > of zeroes or data.
> > >
> > > Nir: If we could get information from imageio about whether zeroing is
> > > implemented efficiently or not by the backend, we could change
> > > virt-v2v / nbdkit to advertise this back to qemu.
> >
> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
> > succeeds, falling back to manual zeroing in the kernel silently
> >
> > Even if we could, sending zero on the wire from qemu may be even
> > slower, and it looks like qemu send even more requests in this case
> > (2154 vs ~1300).
> >
> > Looks like this optimization in qemu side leads to worse performance,
> > so it should not be enabled by default.
>
> Well, that's overgeneralising your case a bit. If the backend does
> support efficient zero writes (which file systems, the most common case,
> generally do), doing one big write_zeroes request at the start can
> improve performance quite a bit.
>
> It seems the problem is that we can't really know whether the operation
> will be efficient because the backends generally don't tell us. Maybe
> NBD could introduce a flag for this, but in the general case it appears
> to me that we'll have to have a command line option.
>
> However, I'm curious what your exact use case and the backend used in it
> is? Can something be improved there to actually get efficient zero
> writes and get even better performance than by just disabling the big
> zero write?


The backend is some NetApp storage connected via FC. I don't have
more info on this. We get a zero rate of about 1G/s on this storage,
which is quite slow compared with other storage we have tested.

One option we are checking now is whether this is the kernel's silent
fallback to manual zeroing when the server advertises a wrong value of
write_same_max_bytes.
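
(For example, something like the sketch below -- not existing imageio or
oVirt code; the helper name and the 512 byte threshold are assumptions --
could guess from sysfs whether zeroing is likely to be offloaded:)

  #include <stdio.h>

  /* Hypothetical helper: guess whether zeroing the block device named
     devname (e.g. "dm-29") will be offloaded, by reading
     write_same_max_bytes from sysfs.  A value of 0, or a trivially
     small one, suggests BLKZEROOUT will fall back to the kernel
     writing zeroes manually.  This is a heuristic, not a guarantee. */
  static int
  zeroing_looks_efficient (const char *devname)
  {
      char path[256];
      unsigned long long max_bytes = 0;
      FILE *fp;

      snprintf (path, sizeof path,
                "/sys/block/%s/queue/write_same_max_bytes", devname);
      fp = fopen (path, "r");
      if (fp == NULL)
          return 0;                 /* unknown: assume not efficient */
      if (fscanf (fp, "%llu", &max_bytes) != 1)
          max_bytes = 0;
      fclose (fp);
      return max_bytes > 512;       /* threshold is an assumption */
  }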

Having a command line option to control this behavior sounds good. I
don't have enough data to tell what the default should be, but I think the
safe way would be to keep the old behavior.

Nir


* Re: [Qemu-devel] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-07 16:42 ` [Qemu-devel] " Eric Blake
@ 2018-11-11 15:25   ` Nir Soffer
  0 siblings, 0 replies; 15+ messages in thread
From: Nir Soffer @ 2018-11-11 15:25 UTC (permalink / raw)
  To: Eric Blake
  Cc: Richard W.M. Jones, QEMU Developers, qemu-block, edgar.kaziakhmedov

On Wed, Nov 7, 2018 at 6:42 PM Eric Blake <eblake@redhat.com> wrote:

> On 11/7/18 6:13 AM, Richard W.M. Jones wrote:
> > (I'm not going to claim this is a bug, but it causes a large, easily
> > measurable performance regression in virt-v2v).
>
> I haven't closely looked at at this email thread yet, but a quick first
> impression:
>
>
> > In qemu 2.12 this behaviour changed:
> >
> >    $ nbdkit --filter=log memory size=6G logfile=/tmp/log \
> >        --run './qemu-img convert ./fedora-28.img -n $nbd'
> >    $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> >        193 Zero
> >       1246 Write
> >
> > It now zeroes the whole disk up front and then writes data over the
> > top of the zeroed blocks.
> >
> > The reason for the performance regression is that in the first case we
> > write 6G in total.  In the second case we write 6G of zeroes up front,
> > followed by the amount of data in the disk image (in this case the
> > test disk image contains 1G of non-sparse data, so we write about 7G
> > in total).
>
> There was talk on the NBD list a while ago about the idea of letting the
> server advertise to the client when the image is known to start in an
> all-zero state, so that the client doesn't have to waste time writing
> zeroes (or relying on repeated NBD_CMD_BLOCK_STATUS calls to learn the
> same).  This may be justification for reviving that topic.
>

This is a good idea in general, since in some cases we know that
a volume is already zeroed (e.g. a new file on NFS/Gluster storage). But
with block storage, we typically don't have any guarantee about the storage
content, and qemu needs to zero or write the entire device, so this does
not solve the issue discussed in this thread.

Nir


* Re: [Qemu-devel] [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-07 17:55       ` Nir Soffer
@ 2018-11-11 16:11         ` Nir Soffer
  2018-11-15 22:27           ` Nir Soffer
  0 siblings, 1 reply; 15+ messages in thread
From: Nir Soffer @ 2018-11-11 16:11 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Richard W.M. Jones, QEMU Developers, qemu-block

On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer <nsoffer@redhat.com> wrote:

> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf <kwolf@redhat.com> wrote:
>
>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjones@redhat.com>
>> wrote:
>> >
>> > > Another thing I tried was to change the NBD server (nbdkit) so that it
>> > > doesn't advertise zero support to the client:
>> > >
>> > >   $ nbdkit --filter=log --filter=nozero memory size=6G
>> logfile=/tmp/log \
>> > >       --run './qemu-img convert ./fedora-28.img -n $nbd'
>> > >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq
>> -c
>> > >    2154 Write
>> > >
>> > > Not surprisingly no zero commands are issued.  The size of the write
>> > > commands is very uneven -- it appears to be send one command per block
>> > > of zeroes or data.
>> > >
>> > > Nir: If we could get information from imageio about whether zeroing is
>> > > implemented efficiently or not by the backend, we could change
>> > > virt-v2v / nbdkit to advertise this back to qemu.
>> >
>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
>> > succeeds, falling back to manual zeroing in the kernel silently
>> >
>> > Even if we could, sending zero on the wire from qemu may be even
>> > slower, and it looks like qemu send even more requests in this case
>> > (2154 vs ~1300).
>> >
>> > Looks like this optimization in qemu side leads to worse performance,
>> > so it should not be enabled by default.
>>
>> Well, that's overgeneralising your case a bit. If the backend does
>> support efficient zero writes (which file systems, the most common case,
>> generally do), doing one big write_zeroes request at the start can
>> improve performance quite a bit.
>>
>> It seems the problem is that we can't really know whether the operation
>> will be efficient because the backends generally don't tell us. Maybe
>> NBD could introduce a flag for this, but in the general case it appears
>> to me that we'll have to have a command line option.
>>
>> However, I'm curious what your exact use case and the backend used in it
>> is? Can something be improved there to actually get efficient zero
>> writes and get even better performance than by just disabling the big
>> zero write?
>
>
> The backend is some NetApp storage connected via FC. I don't have
> more info on this. We get zero rate of about 1G/s on this storage, which
> is quite slow compared with other storage we tested.
>
> One option we check now is if this is the kernel silent fallback to manual
> zeroing when the server advertise wrong value of write_same_max_bytes.
>

We eliminated this using blkdiscard. This is what we get on this
storage when zeroing a 100G LV:

for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
/dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
done

real 4m50.851s
user 0m0.065s
sys 0m1.482s

real 4m30.504s
user 0m0.047s
sys 0m0.870s

real 4m19.443s
user 0m0.029s
sys 0m0.508s

real 4m13.016s
user 0m0.020s
sys 0m0.284s

real 2m45.888s
user 0m0.011s
sys 0m0.162s

real 2m10.153s
user 0m0.003s
sys 0m0.100s

We are investigating why we get low throughput on this server, and will
also check several other servers.

> Having a command line option to control this behavior sounds good. I don't
> have enough data to tell what should be the default, but I think the safe
> way would be to keep old behavior.
>

We filed this bug:
https://bugzilla.redhat.com/1648622

Nir


* Re: [Qemu-devel] [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-11 16:11         ` Nir Soffer
@ 2018-11-15 22:27           ` Nir Soffer
  2018-11-16 15:26             ` Kevin Wolf
  0 siblings, 1 reply; 15+ messages in thread
From: Nir Soffer @ 2018-11-15 22:27 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Richard W.M. Jones, QEMU Developers, qemu-block

On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer <nsoffer@redhat.com> wrote:

> On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer <nsoffer@redhat.com> wrote:
>
>> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf <kwolf@redhat.com> wrote:
>>
>>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
>>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjones@redhat.com>
>>> wrote:
>>> >
>>> > > Another thing I tried was to change the NBD server (nbdkit) so that
>>> it
>>> > > doesn't advertise zero support to the client:
>>> > >
>>> > >   $ nbdkit --filter=log --filter=nozero memory size=6G
>>> logfile=/tmp/log \
>>> > >       --run './qemu-img convert ./fedora-28.img -n $nbd'
>>> > >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq
>>> -c
>>> > >    2154 Write
>>> > >
>>> > > Not surprisingly no zero commands are issued.  The size of the write
>>> > > commands is very uneven -- it appears to be send one command per
>>> block
>>> > > of zeroes or data.
>>> > >
>>> > > Nir: If we could get information from imageio about whether zeroing
>>> is
>>> > > implemented efficiently or not by the backend, we could change
>>> > > virt-v2v / nbdkit to advertise this back to qemu.
>>> >
>>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
>>> > succeeds, falling back to manual zeroing in the kernel silently
>>> >
>>> > Even if we could, sending zero on the wire from qemu may be even
>>> > slower, and it looks like qemu send even more requests in this case
>>> > (2154 vs ~1300).
>>> >
>>> > Looks like this optimization in qemu side leads to worse performance,
>>> > so it should not be enabled by default.
>>>
>>> Well, that's overgeneralising your case a bit. If the backend does
>>> support efficient zero writes (which file systems, the most common case,
>>> generally do), doing one big write_zeroes request at the start can
>>> improve performance quite a bit.
>>>
>>> It seems the problem is that we can't really know whether the operation
>>> will be efficient because the backends generally don't tell us. Maybe
>>> NBD could introduce a flag for this, but in the general case it appears
>>> to me that we'll have to have a command line option.
>>>
>>> However, I'm curious what your exact use case and the backend used in it
>>> is? Can something be improved there to actually get efficient zero
>>> writes and get even better performance than by just disabling the big
>>> zero write?
>>
>>
>> The backend is some NetApp storage connected via FC. I don't have
>> more info on this. We get zero rate of about 1G/s on this storage, which
>> is quite slow compared with other storage we tested.
>>
>> One option we check now is if this is the kernel silent fallback to manual
>> zeroing when the server advertise wrong value of write_same_max_bytes.
>>
>
> We eliminated this using blkdiscard. This is what we get on with this
> storage
> zeroing 100G LV:
>
> for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
> /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
> done
>
> real 4m50.851s
> user 0m0.065s
> sys 0m1.482s
>
> real 4m30.504s
> user 0m0.047s
> sys 0m0.870s
>
> real 4m19.443s
> user 0m0.029s
> sys 0m0.508s
>
> real 4m13.016s
> user 0m0.020s
> sys 0m0.284s
>
> real 2m45.888s
> user 0m0.011s
> sys 0m0.162s
>
> real 2m10.153s
> user 0m0.003s
> sys 0m0.100s
>
> We are investigating why we get low throughput on this server, and also
> will check
> several other servers.
>
> Having a command line option to control this behavior sounds good. I don't
>> have enough data to tell what should be the default, but I think the safe
>> way would be to keep old behavior.
>>
>
> We file this bug:
> https://bugzilla.redhat.com/1648622
>

More data from even slower storage: zeroing a 10G LV on a Kaminario K2

# time blkdiscard -z -p 32m /dev/test_vg/test_lv2

real    50m12.425s
user    0m0.018s
sys     2m6.785s

Maybe something is wrong with this storage, since we see this:

# grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
/sys/block/dm-29/queue/write_same_max_bytes:512

Since BLKZEROOUT always falls back to slow manual zeroing silently,
maybe we can disable the aggressive pre-zeroing of the entire device
for block devices, and keep this optimization for files when fallocate()
is supported?
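
(For files, the sketch below shows the kind of call I mean, assuming
Linux and _GNU_SOURCE; the helper is illustrative only, not qemu or
imageio code.  Unlike BLKZEROOUT, an unsupported punch hole fails with
EOPNOTSUPP instead of silently writing zeroes, so the slow case is at
least detectable:)

  #define _GNU_SOURCE
  #include <fcntl.h>        /* fallocate, FALLOC_FL_* */

  /* Punch a hole in a regular file, deallocating the range while
     keeping the file size.  Returns 0 on success; on filesystems
     without hole punching it fails with errno == EOPNOTSUPP, so the
     caller can detect the slow case instead of falling back silently. */
  static int
  punch_hole (int fd, off_t offset, off_t length)
  {
      return fallocate (fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                        offset, length);
  }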

Nir


* Re: [Qemu-devel] [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-15 22:27           ` Nir Soffer
@ 2018-11-16 15:26             ` Kevin Wolf
  2018-11-17 20:59               ` Nir Soffer
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2018-11-16 15:26 UTC (permalink / raw)
  To: Nir Soffer; +Cc: Richard W.M. Jones, QEMU Developers, qemu-block

Am 15.11.2018 um 23:27 hat Nir Soffer geschrieben:
> On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer <nsoffer@redhat.com> wrote:
> 
> > On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer <nsoffer@redhat.com> wrote:
> >
> >> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf <kwolf@redhat.com> wrote:
> >>
> >>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> >>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjones@redhat.com>
> >>> wrote:
> >>> >
> >>> > > Another thing I tried was to change the NBD server (nbdkit) so that
> >>> it
> >>> > > doesn't advertise zero support to the client:
> >>> > >
> >>> > >   $ nbdkit --filter=log --filter=nozero memory size=6G
> >>> logfile=/tmp/log \
> >>> > >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> >>> > >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq
> >>> -c
> >>> > >    2154 Write
> >>> > >
> >>> > > Not surprisingly no zero commands are issued.  The size of the write
> >>> > > commands is very uneven -- it appears to be send one command per
> >>> block
> >>> > > of zeroes or data.
> >>> > >
> >>> > > Nir: If we could get information from imageio about whether zeroing
> >>> is
> >>> > > implemented efficiently or not by the backend, we could change
> >>> > > virt-v2v / nbdkit to advertise this back to qemu.
> >>> >
> >>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
> >>> > succeeds, falling back to manual zeroing in the kernel silently
> >>> >
> >>> > Even if we could, sending zero on the wire from qemu may be even
> >>> > slower, and it looks like qemu send even more requests in this case
> >>> > (2154 vs ~1300).
> >>> >
> >>> > Looks like this optimization in qemu side leads to worse performance,
> >>> > so it should not be enabled by default.
> >>>
> >>> Well, that's overgeneralising your case a bit. If the backend does
> >>> support efficient zero writes (which file systems, the most common case,
> >>> generally do), doing one big write_zeroes request at the start can
> >>> improve performance quite a bit.
> >>>
> >>> It seems the problem is that we can't really know whether the operation
> >>> will be efficient because the backends generally don't tell us. Maybe
> >>> NBD could introduce a flag for this, but in the general case it appears
> >>> to me that we'll have to have a command line option.
> >>>
> >>> However, I'm curious what your exact use case and the backend used in it
> >>> is? Can something be improved there to actually get efficient zero
> >>> writes and get even better performance than by just disabling the big
> >>> zero write?
> >>
> >>
> >> The backend is some NetApp storage connected via FC. I don't have
> >> more info on this. We get zero rate of about 1G/s on this storage, which
> >> is quite slow compared with other storage we tested.
> >>
> >> One option we check now is if this is the kernel silent fallback to manual
> >> zeroing when the server advertise wrong value of write_same_max_bytes.
> >>
> >
> > We eliminated this using blkdiscard. This is what we get on with this
> > storage
> > zeroing 100G LV:
> >
> > for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
> > /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
> > done
> >
> > real 4m50.851s
> > user 0m0.065s
> > sys 0m1.482s
> >
> > real 4m30.504s
> > user 0m0.047s
> > sys 0m0.870s
> >
> > real 4m19.443s
> > user 0m0.029s
> > sys 0m0.508s
> >
> > real 4m13.016s
> > user 0m0.020s
> > sys 0m0.284s
> >
> > real 2m45.888s
> > user 0m0.011s
> > sys 0m0.162s
> >
> > real 2m10.153s
> > user 0m0.003s
> > sys 0m0.100s
> >
> > We are investigating why we get low throughput on this server, and also
> > will check
> > several other servers.
> >
> > Having a command line option to control this behavior sounds good. I don't
> >> have enough data to tell what should be the default, but I think the safe
> >> way would be to keep old behavior.
> >>
> >
> > We file this bug:
> > https://bugzilla.redhat.com/1648622
> >
> 
> More data from even slower storage - zeroing 10G lv on Kaminario K2
> 
> # time blkdiscard -z -p 32m /dev/test_vg/test_lv2
> 
> real    50m12.425s
> user    0m0.018s
> sys     2m6.785s
> 
> Maybe something is wrong with this storage, since we see this:
> 
> # grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
> /sys/block/dm-29/queue/write_same_max_bytes:512
> 
> Since BLKZEROOUT always fallback to manual slow zeroing silently,
> maybe we can disable the aggressive pre-zero of the entire device
> for block devices, and keep this optimization for files when fallocate()
> is supported?

I'm not sure what the detour through NBD changes, but qemu-img directly
on a block device doesn't use BLKZEROOUT first, but
FALLOC_FL_PUNCH_HOLE. Maybe we can add a flag that avoids anything that
could be slow, such as BLKZEROOUT, as a fallback (and also the slow
emulation that QEMU itself would do if all kernel calls fail).

Kevin


* Re: [Qemu-devel] [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-16 15:26             ` Kevin Wolf
@ 2018-11-17 20:59               ` Nir Soffer
  2018-11-17 21:13                 ` Richard W.M. Jones
  2018-11-19 11:50                 ` Kevin Wolf
  0 siblings, 2 replies; 15+ messages in thread
From: Nir Soffer @ 2018-11-17 20:59 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Richard W.M. Jones, QEMU Developers, qemu-block, Eric Blake

On Fri, Nov 16, 2018 at 5:26 PM Kevin Wolf <kwolf@redhat.com> wrote:

> Am 15.11.2018 um 23:27 hat Nir Soffer geschrieben:
> > On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer <nsoffer@redhat.com> wrote:
> >
> > > On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer <nsoffer@redhat.com> wrote:
> > >
> > >> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > >>
> > >>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> > >>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjones@redhat.com>
> > >>> wrote:
> > >>> >
> > >>> > > Another thing I tried was to change the NBD server (nbdkit) so
> that
> > >>> it
> > >>> > > doesn't advertise zero support to the client:
> > >>> > >
> > >>> > >   $ nbdkit --filter=log --filter=nozero memory size=6G
> > >>> logfile=/tmp/log \
> > >>> > >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> > >>> > >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' |
> uniq
> > >>> -c
> > >>> > >    2154 Write
> > >>> > >
> > >>> > > Not surprisingly no zero commands are issued.  The size of the
> write
> > >>> > > commands is very uneven -- it appears to be send one command per
> > >>> block
> > >>> > > of zeroes or data.
> > >>> > >
> > >>> > > Nir: If we could get information from imageio about whether
> zeroing
> > >>> is
> > >>> > > implemented efficiently or not by the backend, we could change
> > >>> > > virt-v2v / nbdkit to advertise this back to qemu.
> > >>> >
> > >>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
> > >>> > succeeds, falling back to manual zeroing in the kernel silently
> > >>> >
> > >>> > Even if we could, sending zero on the wire from qemu may be even
> > >>> > slower, and it looks like qemu send even more requests in this case
> > >>> > (2154 vs ~1300).
> > >>> >
> > >>> > Looks like this optimization in qemu side leads to worse
> performance,
> > >>> > so it should not be enabled by default.
> > >>>
> > >>> Well, that's overgeneralising your case a bit. If the backend does
> > >>> support efficient zero writes (which file systems, the most common
> case,
> > >>> generally do), doing one big write_zeroes request at the start can
> > >>> improve performance quite a bit.
> > >>>
> > >>> It seems the problem is that we can't really know whether the
> operation
> > >>> will be efficient because the backends generally don't tell us. Maybe
> > >>> NBD could introduce a flag for this, but in the general case it
> appears
> > >>> to me that we'll have to have a command line option.
> > >>>
> > >>> However, I'm curious what your exact use case and the backend used
> in it
> > >>> is? Can something be improved there to actually get efficient zero
> > >>> writes and get even better performance than by just disabling the big
> > >>> zero write?
> > >>
> > >>
> > >> The backend is some NetApp storage connected via FC. I don't have
> > >> more info on this. We get zero rate of about 1G/s on this storage,
> which
> > >> is quite slow compared with other storage we tested.
> > >>
> > >> One option we check now is if this is the kernel silent fallback to
> manual
> > >> zeroing when the server advertise wrong value of write_same_max_bytes.
> > >>
> > >
> > > We eliminated this using blkdiscard. This is what we get on with this
> > > storage
> > > zeroing 100G LV:
> > >
> > > for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
> > >
> /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
> > > done
> > >
> > > real 4m50.851s
> > > user 0m0.065s
> > > sys 0m1.482s
> > >
> > > real 4m30.504s
> > > user 0m0.047s
> > > sys 0m0.870s
> > >
> > > real 4m19.443s
> > > user 0m0.029s
> > > sys 0m0.508s
> > >
> > > real 4m13.016s
> > > user 0m0.020s
> > > sys 0m0.284s
> > >
> > > real 2m45.888s
> > > user 0m0.011s
> > > sys 0m0.162s
> > >
> > > real 2m10.153s
> > > user 0m0.003s
> > > sys 0m0.100s
> > >
> > > We are investigating why we get low throughput on this server, and also
> > > will check
> > > several other servers.
> > >
> > > Having a command line option to control this behavior sounds good. I
> don't
> > >> have enough data to tell what should be the default, but I think the
> safe
> > >> way would be to keep old behavior.
> > >>
> > >
> > > We file this bug:
> > > https://bugzilla.redhat.com/1648622
> > >
> >
> > More data from even slower storage - zeroing 10G lv on Kaminario K2
> >
> > # time blkdiscard -z -p 32m /dev/test_vg/test_lv2
> >
> > real    50m12.425s
> > user    0m0.018s
> > sys     2m6.785s
> >
> > Maybe something is wrong with this storage, since we see this:
> >
> > # grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
> > /sys/block/dm-29/queue/write_same_max_bytes:512
> >
> > Since BLKZEROOUT always fallback to manual slow zeroing silently,
> > maybe we can disable the aggressive pre-zero of the entire device
> > for block devices, and keep this optimization for files when fallocate()
> > is supported?
>
> I'm not sure what the detour through NBD changes, but qemu-img directly
> on a block device doesn't use BLKZEROOUT first, but
> FALLOC_FL_PUNCH_HOLE.


Looking at block/file-posix.c (83c496599cc04926ecbc3e47a37debaa3e38b686)
we don't use PUNCH_HOLE for block devices:

1472     if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
1473         return handle_aiocb_write_zeroes_block(aiocb);
1474     }

qemu uses BLKZEROOUT, which is not guaranteed to be fast on the storage
side and, even worse, falls back silently to manual zeroing if the storage
does not support WRITE_SAME.

> Maybe we can add a flag that avoids anything that
> could be slow, such as BLKZEROOUT, as a fallback (and also the slow
> emulation that QEMU itself would do if all kernel calls fail).
>

But the issue here is not how qemu-img handles this case, but how the NBD
server can handle it. NBD may support zeroing, but there is no way to tell
whether zeroing is going to be fast, since the backend writing zeroes to
storage has the same limits as qemu-img.

So I think we need to fix the performance regression in 2.12 by enabling
pre-zeroing of the entire disk only if FALLOC_FL_PUNCH_HOLE can be used,
and only if it can be used without a fallback to a slow zeroing method.

Enabling this optimization for anything else requires changing the entire
stack (storage, kernel, NBD protocol) to support reporting a fast-zero
capability, or limiting zeroing to fast operations.

Nir


* Re: [Qemu-devel] [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-17 20:59               ` Nir Soffer
@ 2018-11-17 21:13                 ` Richard W.M. Jones
  2018-11-18  7:24                   ` Nir Soffer
  2018-11-19 11:50                 ` Kevin Wolf
  1 sibling, 1 reply; 15+ messages in thread
From: Richard W.M. Jones @ 2018-11-17 21:13 UTC (permalink / raw)
  To: Nir Soffer; +Cc: Kevin Wolf, QEMU Developers, qemu-block, Eric Blake

On Sat, Nov 17, 2018 at 10:59:26PM +0200, Nir Soffer wrote:
> On Fri, Nov 16, 2018 at 5:26 PM Kevin Wolf <kwolf@redhat.com> wrote:
> 
> > Am 15.11.2018 um 23:27 hat Nir Soffer geschrieben:
> > > On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer <nsoffer@redhat.com> wrote:
> > >
> > > > On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer <nsoffer@redhat.com> wrote:
> > > >
> > > >> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > >>
> > > >>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> > > >>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjones@redhat.com>
> > > >>> wrote:
> > > >>> >
> > > >>> > > Another thing I tried was to change the NBD server (nbdkit) so
> > that
> > > >>> it
> > > >>> > > doesn't advertise zero support to the client:
> > > >>> > >
> > > >>> > >   $ nbdkit --filter=log --filter=nozero memory size=6G
> > > >>> logfile=/tmp/log \
> > > >>> > >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> > > >>> > >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' |
> > uniq
> > > >>> -c
> > > >>> > >    2154 Write
> > > >>> > >
> > > >>> > > Not surprisingly no zero commands are issued.  The size of the
> > write
> > > >>> > > commands is very uneven -- it appears to be send one command per
> > > >>> block
> > > >>> > > of zeroes or data.
> > > >>> > >
> > > >>> > > Nir: If we could get information from imageio about whether
> > zeroing
> > > >>> is
> > > >>> > > implemented efficiently or not by the backend, we could change
> > > >>> > > virt-v2v / nbdkit to advertise this back to qemu.
> > > >>> >
> > > >>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
> > > >>> > succeeds, falling back to manual zeroing in the kernel silently
> > > >>> >
> > > >>> > Even if we could, sending zero on the wire from qemu may be even
> > > >>> > slower, and it looks like qemu send even more requests in this case
> > > >>> > (2154 vs ~1300).
> > > >>> >
> > > >>> > Looks like this optimization in qemu side leads to worse
> > performance,
> > > >>> > so it should not be enabled by default.
> > > >>>
> > > >>> Well, that's overgeneralising your case a bit. If the backend does
> > > >>> support efficient zero writes (which file systems, the most common
> > case,
> > > >>> generally do), doing one big write_zeroes request at the start can
> > > >>> improve performance quite a bit.
> > > >>>
> > > >>> It seems the problem is that we can't really know whether the
> > operation
> > > >>> will be efficient because the backends generally don't tell us. Maybe
> > > >>> NBD could introduce a flag for this, but in the general case it
> > appears
> > > >>> to me that we'll have to have a command line option.
> > > >>>
> > > >>> However, I'm curious what your exact use case and the backend used
> > in it
> > > >>> is? Can something be improved there to actually get efficient zero
> > > >>> writes and get even better performance than by just disabling the big
> > > >>> zero write?
> > > >>
> > > >>
> > > >> The backend is some NetApp storage connected via FC. I don't have
> > > >> more info on this. We get zero rate of about 1G/s on this storage,
> > which
> > > >> is quite slow compared with other storage we tested.
> > > >>
> > > >> One option we check now is if this is the kernel silent fallback to
> > manual
> > > >> zeroing when the server advertise wrong value of write_same_max_bytes.
> > > >>
> > > >
> > > > We eliminated this using blkdiscard. This is what we get on with this
> > > > storage
> > > > zeroing 100G LV:
> > > >
> > > > for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
> > > >
> > /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
> > > > done
> > > >
> > > > real 4m50.851s
> > > > user 0m0.065s
> > > > sys 0m1.482s
> > > >
> > > > real 4m30.504s
> > > > user 0m0.047s
> > > > sys 0m0.870s
> > > >
> > > > real 4m19.443s
> > > > user 0m0.029s
> > > > sys 0m0.508s
> > > >
> > > > real 4m13.016s
> > > > user 0m0.020s
> > > > sys 0m0.284s
> > > >
> > > > real 2m45.888s
> > > > user 0m0.011s
> > > > sys 0m0.162s
> > > >
> > > > real 2m10.153s
> > > > user 0m0.003s
> > > > sys 0m0.100s
> > > >
> > > > We are investigating why we get low throughput on this server, and also
> > > > will check
> > > > several other servers.
> > > >
> > > > Having a command line option to control this behavior sounds good. I
> > don't
> > > >> have enough data to tell what should be the default, but I think the
> > safe
> > > >> way would be to keep old behavior.
> > > >>
> > > >
> > > > We file this bug:
> > > > https://bugzilla.redhat.com/1648622
> > > >
> > >
> > > More data from even slower storage - zeroing 10G lv on Kaminario K2
> > >
> > > # time blkdiscard -z -p 32m /dev/test_vg/test_lv2
> > >
> > > real    50m12.425s
> > > user    0m0.018s
> > > sys     2m6.785s
> > >
> > > Maybe something is wrong with this storage, since we see this:
> > >
> > > # grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
> > > /sys/block/dm-29/queue/write_same_max_bytes:512
> > >
> > > Since BLKZEROOUT always fallback to manual slow zeroing silently,
> > > maybe we can disable the aggressive pre-zero of the entire device
> > > for block devices, and keep this optimization for files when fallocate()
> > > is supported?
> >
> > I'm not sure what the detour through NBD changes, but qemu-img directly
> > on a block device doesn't use BLKZEROOUT first, but
> > FALLOC_FL_PUNCH_HOLE.
> 
> 
> Looking at block/file-posix.c (83c496599cc04926ecbc3e47a37debaa3e38b686)
> we don't use PUNCH_HOLE for block devices:
> 
> 1472     if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
> 1473         return handle_aiocb_write_zeroes_block(aiocb);
> 1474     }
> 
> qemu uses BLKZEROOUT, which is not guaranteed to be fast on storage side,
> and even worse fallback silently to manual zero if storage does not support
> WRITE_SAME.
> 
> Maybe we can add a flag that avoids anything that
> > could be slow, such as BLKZEROOUT, as a fallback (and also the slow
> > emulation that QEMU itself would do if all kernel calls fail).
> >
> 
> But the issue here is not how qemu-img handles this case, but how NBD
> server can handle it. NBD may support zeroing, but there is no way to tell
> if zeroing is going to be fast, since the backend writing zeros to storage
> has the same limits of qemu-img.
> 
> So I think we need to fix the performance regression in 2.12 by enabling
> pre-zero of entire disk only if FALLOCATE_FL_PUNCH_HOLE can be used
> and only if it can be used without a fallback to slow zero method.
> 
> Enabling this optimization for anything else requires changing the entire
> stack (storage, kernel, NBD protocol) to support reporting fast zero
> capability
> or limit zero to fast operations.

I may be missing something here, but doesn't imageio know whether the
backing block device starts out as all zeroes?  If so, couldn't it
maintain a bitmap and simply ignore zero requests sent for unwritten
disk blocks?
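
(Roughly the data structure I have in mind -- an illustration only, not
imageio or nbdkit code; the 64K block size and the names are made up:)

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdlib.h>

  /* Track which 64K blocks have ever been written, so a server that
     knows the device started out all-zero can turn zero requests for
     untouched blocks into no-ops. */
  #define BLOCK_SIZE (64 * 1024)

  struct written_map {
      uint8_t *bits;
      uint64_t nblocks;
  };

  static struct written_map *
  written_map_new (uint64_t disk_size)
  {
      struct written_map *m = calloc (1, sizeof *m);
      if (m == NULL)
          return NULL;
      m->nblocks = (disk_size + BLOCK_SIZE - 1) / BLOCK_SIZE;
      m->bits = calloc ((m->nblocks + 7) / 8, 1);
      return m;
  }

  /* Call this for every write request (count > 0 assumed). */
  static void
  mark_written (struct written_map *m, uint64_t offset, uint64_t count)
  {
      for (uint64_t b = offset / BLOCK_SIZE;
           b <= (offset + count - 1) / BLOCK_SIZE; b++)
          m->bits[b / 8] |= 1 << (b % 8);
  }

  /* A zero request can be ignored if every block it touches is still
     in its initial, unwritten (all-zero) state. */
  static bool
  zero_is_noop (const struct written_map *m, uint64_t offset, uint64_t count)
  {
      for (uint64_t b = offset / BLOCK_SIZE;
           b <= (offset + count - 1) / BLOCK_SIZE; b++)
          if (m->bits[b / 8] & (1 << (b % 8)))
              return false;
      return true;
  }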

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


* Re: [Qemu-devel] [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-17 21:13                 ` Richard W.M. Jones
@ 2018-11-18  7:24                   ` Nir Soffer
  0 siblings, 0 replies; 15+ messages in thread
From: Nir Soffer @ 2018-11-18  7:24 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: Kevin Wolf, QEMU Developers, qemu-block, Eric Blake

On Sat, Nov 17, 2018 at 11:13 PM Richard W.M. Jones <rjones@redhat.com>
wrote:

> On Sat, Nov 17, 2018 at 10:59:26PM +0200, Nir Soffer wrote:
> > On Fri, Nov 16, 2018 at 5:26 PM Kevin Wolf <kwolf@redhat.com> wrote:
> >
> > > Am 15.11.2018 um 23:27 hat Nir Soffer geschrieben:
> > > > On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer <nsoffer@redhat.com>
> wrote:
> > > >
> > > > > On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer <nsoffer@redhat.com>
> wrote:
> > > > >
> > > > >> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf <kwolf@redhat.com>
> wrote:
> > > > >>
> > > > >>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> > > > >>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <
> rjones@redhat.com>
> > > > >>> wrote:
> > > > >>> >
> > > > >>> > > Another thing I tried was to change the NBD server (nbdkit)
> so
> > > that
> > > > >>> it
> > > > >>> > > doesn't advertise zero support to the client:
> > > > >>> > >
> > > > >>> > >   $ nbdkit --filter=log --filter=nozero memory size=6G
> > > > >>> logfile=/tmp/log \
> > > > >>> > >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> > > > >>> > >   $ grep '\.\.\.$' /tmp/log | sed
> 's/.*\([A-Z][a-z]*\).*/\1/' |
> > > uniq
> > > > >>> -c
> > > > >>> > >    2154 Write
> > > > >>> > >
> > > > >>> > > Not surprisingly no zero commands are issued.  The size of
> the
> > > write
> > > > >>> > > commands is very uneven -- it appears to be send one command
> per
> > > > >>> block
> > > > >>> > > of zeroes or data.
> > > > >>> > >
> > > > >>> > > Nir: If we could get information from imageio about whether
> > > zeroing
> > > > >>> is
> > > > >>> > > implemented efficiently or not by the backend, we could
> change
> > > > >>> > > virt-v2v / nbdkit to advertise this back to qemu.
> > > > >>> >
> > > > >>> > There is no way to detect the capability, ioctl(BLKZEROOUT)
> always
> > > > >>> > succeeds, falling back to manual zeroing in the kernel silently
> > > > >>> >
> > > > >>> > Even if we could, sending zero on the wire from qemu may be
> even
> > > > >>> > slower, and it looks like qemu send even more requests in this
> case
> > > > >>> > (2154 vs ~1300).
> > > > >>> >
> > > > >>> > Looks like this optimization in qemu side leads to worse
> > > performance,
> > > > >>> > so it should not be enabled by default.
> > > > >>>
> > > > >>> Well, that's overgeneralising your case a bit. If the backend
> does
> > > > >>> support efficient zero writes (which file systems, the most
> common
> > > case,
> > > > >>> generally do), doing one big write_zeroes request at the start
> can
> > > > >>> improve performance quite a bit.
> > > > >>>
> > > > >>> It seems the problem is that we can't really know whether the
> > > operation
> > > > >>> will be efficient because the backends generally don't tell us.
> Maybe
> > > > >>> NBD could introduce a flag for this, but in the general case it
> > > appears
> > > > >>> to me that we'll have to have a command line option.
> > > > >>>
> > > > >>> However, I'm curious what your exact use case and the backend
> used
> > > in it
> > > > >>> is? Can something be improved there to actually get efficient
> zero
> > > > >>> writes and get even better performance than by just disabling
> the big
> > > > >>> zero write?
> > > > >>
> > > > >>
> > > > >> The backend is some NetApp storage connected via FC. I don't have
> > > > >> more info on this. We get zero rate of about 1G/s on this storage,
> > > which
> > > > >> is quite slow compared with other storage we tested.
> > > > >>
> > > > >> One option we check now is if this is the kernel silent fallback
> to
> > > manual
> > > > >> zeroing when the server advertise wrong value of
> write_same_max_bytes.
> > > > >>
> > > > >
> > > > > We eliminated this using blkdiscard. This is what we get on with
> this
> > > > > storage
> > > > > zeroing 100G LV:
> > > > >
> > > > > for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
> > > > >
> > >
> /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
> > > > > done
> > > > >
> > > > > real 4m50.851s
> > > > > user 0m0.065s
> > > > > sys 0m1.482s
> > > > >
> > > > > real 4m30.504s
> > > > > user 0m0.047s
> > > > > sys 0m0.870s
> > > > >
> > > > > real 4m19.443s
> > > > > user 0m0.029s
> > > > > sys 0m0.508s
> > > > >
> > > > > real 4m13.016s
> > > > > user 0m0.020s
> > > > > sys 0m0.284s
> > > > >
> > > > > real 2m45.888s
> > > > > user 0m0.011s
> > > > > sys 0m0.162s
> > > > >
> > > > > real 2m10.153s
> > > > > user 0m0.003s
> > > > > sys 0m0.100s
> > > > >
> > > > > We are investigating why we get low throughput on this server, and
> also
> > > > > will check
> > > > > several other servers.
> > > > >
> > > > > Having a command line option to control this behavior sounds good.
> I
> > > don't
> > > > >> have enough data to tell what should be the default, but I think
> the
> > > safe
> > > > >> way would be to keep old behavior.
> > > > >>
> > > > >
> > > > > We file this bug:
> > > > > https://bugzilla.redhat.com/1648622
> > > > >
> > > >
> > > > More data from even slower storage - zeroing 10G lv on Kaminario K2
> > > >
> > > > # time blkdiscard -z -p 32m /dev/test_vg/test_lv2
> > > >
> > > > real    50m12.425s
> > > > user    0m0.018s
> > > > sys     2m6.785s
> > > >
> > > > Maybe something is wrong with this storage, since we see this:
> > > >
> > > > # grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
> > > > /sys/block/dm-29/queue/write_same_max_bytes:512
> > > >
> > > > Since BLKZEROOUT always fallback to manual slow zeroing silently,
> > > > maybe we can disable the aggressive pre-zero of the entire device
> > > > for block devices, and keep this optimization for files when
> fallocate()
> > > > is supported?
> > >
> > > I'm not sure what the detour through NBD changes, but qemu-img directly
> > > on a block device doesn't use BLKZEROOUT first, but
> > > FALLOC_FL_PUNCH_HOLE.
> >
> >
> > Looking at block/file-posix.c (83c496599cc04926ecbc3e47a37debaa3e38b686)
> > we don't use PUNCH_HOLE for block devices:
> >
> > 1472     if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
> > 1473         return handle_aiocb_write_zeroes_block(aiocb);
> > 1474     }
> >
> > qemu uses BLKZEROOUT, which is not guaranteed to be fast on storage side,
> > and even worse fallback silently to manual zero if storage does not
> support
> > WRITE_SAME.
> >
> > Maybe we can add a flag that avoids anything that
> > > could be slow, such as BLKZEROOUT, as a fallback (and also the slow
> > > emulation that QEMU itself would do if all kernel calls fail).
> > >
> >
> > But the issue here is not how qemu-img handles this case, but how NBD
> > server can handle it. NBD may support zeroing, but there is no way to
> tell
> > if zeroing is going to be fast, since the backend writing zeros to
> storage
> > has the same limits of qemu-img.
> >
> > So I think we need to fix the performance regression in 2.12 by enabling
> > pre-zero of entire disk only if FALLOCATE_FL_PUNCH_HOLE can be used
> > and only if it can be used without a fallback to slow zero method.
> >
> > Enabling this optimization for anything else requires changing the entire
> > stack (storage, kernel, NBD protocol) to support reporting fast zero
> > capability
> > or limit zero to fast operations.
>
> I may be missing something here, but doesn't imageio know if the
> backing block device starts out as all zeroes?


imageio cannot know this, since it is stateless: when you send a request,
we don't know what happened before it.

Engine knows whether an image is zeroed when creating a disk, for some
disk types (raw sparse on file storage), so it can report this info, and
nbdkit can use this when writing zeroes...


> If so couldn't it
> maintain a bitmap and simply ignore zero requests sent for unwritten
> disk blocks?
>

But if nbdkit knows that it is writing to a new zeroed image, it would be
better to report it back to qemu, so qemu can skip zeroing the entire
device.

This does not solve the issue with block storage, which is the most
interesting use case for oVirt.

Nir


* Re: [Qemu-devel] [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data
  2018-11-17 20:59               ` Nir Soffer
  2018-11-17 21:13                 ` Richard W.M. Jones
@ 2018-11-19 11:50                 ` Kevin Wolf
  1 sibling, 0 replies; 15+ messages in thread
From: Kevin Wolf @ 2018-11-19 11:50 UTC (permalink / raw)
  To: Nir Soffer; +Cc: Richard W.M. Jones, QEMU Developers, qemu-block, Eric Blake

Am 17.11.2018 um 21:59 hat Nir Soffer geschrieben:
> On Fri, Nov 16, 2018 at 5:26 PM Kevin Wolf <kwolf@redhat.com> wrote:
> 
> > Am 15.11.2018 um 23:27 hat Nir Soffer geschrieben:
> > > On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer <nsoffer@redhat.com> wrote:
> > >
> > > > On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer <nsoffer@redhat.com> wrote:
> > > >
> > > >> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > >>
> > > >>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> > > >>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjones@redhat.com>
> > > >>> wrote:
> > > >>> >
> > > >>> > > Another thing I tried was to change the NBD server (nbdkit) so
> > that
> > > >>> it
> > > >>> > > doesn't advertise zero support to the client:
> > > >>> > >
> > > >>> > >   $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
> > > >>> > >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> > > >>> > >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> > > >>> > >    2154 Write
> > > >>> > >
> > > >>> > > Not surprisingly no zero commands are issued.  The size of the
> > > >>> > > write commands is very uneven -- it appears to send one command
> > > >>> > > per block of zeroes or data.
> > > >>> > >
> > > >>> > > Nir: If we could get information from imageio about whether
> > > >>> > > zeroing is implemented efficiently or not by the backend, we
> > > >>> > > could change virt-v2v / nbdkit to advertise this back to qemu.
> > > >>> >
> > > >>> > There is no way to detect the capability; ioctl(BLKZEROOUT) always
> > > >>> > succeeds, falling back to manual zeroing in the kernel silently.
> > > >>> >
> > > >>> > Even if we could, sending zeroes on the wire from qemu may be even
> > > >>> > slower, and it looks like qemu sends even more requests in this case
> > > >>> > (2154 vs ~1300).
> > > >>> >
> > > >>> > It looks like this optimization on the qemu side leads to worse
> > > >>> > performance, so it should not be enabled by default.
> > > >>>
> > > >>> Well, that's overgeneralising your case a bit. If the backend does
> > > >>> support efficient zero writes (which file systems, the most common
> > > >>> case, generally do), doing one big write_zeroes request at the start
> > > >>> can improve performance quite a bit.
> > > >>>
> > > >>> It seems the problem is that we can't really know whether the
> > > >>> operation will be efficient because the backends generally don't tell
> > > >>> us. Maybe NBD could introduce a flag for this, but in the general case
> > > >>> it appears to me that we'll have to have a command line option.
> > > >>>
> > > >>> However, I'm curious what your exact use case and the backend used in
> > > >>> it is? Can something be improved there to actually get efficient zero
> > > >>> writes and get even better performance than by just disabling the big
> > > >>> zero write?
> > > >>
> > > >>
> > > >> The backend is some NetApp storage connected via FC. I don't have
> > > >> more info on this. We get a zero rate of about 1G/s on this storage,
> > > >> which is quite slow compared with other storage we tested.
> > > >>
> > > >> One option we are checking now is whether this is the kernel's silent
> > > >> fallback to manual zeroing when the server advertises a wrong value of
> > > >> write_same_max_bytes.
> > > >>
> > > >
> > > > We eliminated this using blkdiscard. This is what we get on this
> > > > storage when zeroing a 100G LV:
> > > >
> > > > for i in 1 2 4 8 16 32; do
> > > >     time blkdiscard -z -p ${i}m \
> > > >         /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade
> > > > done
> > > >
> > > > real 4m50.851s
> > > > user 0m0.065s
> > > > sys 0m1.482s
> > > >
> > > > real 4m30.504s
> > > > user 0m0.047s
> > > > sys 0m0.870s
> > > >
> > > > real 4m19.443s
> > > > user 0m0.029s
> > > > sys 0m0.508s
> > > >
> > > > real 4m13.016s
> > > > user 0m0.020s
> > > > sys 0m0.284s
> > > >
> > > > real 2m45.888s
> > > > user 0m0.011s
> > > > sys 0m0.162s
> > > >
> > > > real 2m10.153s
> > > > user 0m0.003s
> > > > sys 0m0.100s
> > > >
> > > > We are investigating why we get low throughput on this server, and
> > > > will also check several other servers.
> > > >
> > > >> Having a command line option to control this behavior sounds good. I
> > > >> don't have enough data to tell what should be the default, but I think
> > > >> the safe way would be to keep the old behavior.
> > > >>
> > > >
> > > > We filed this bug:
> > > > https://bugzilla.redhat.com/1648622
> > > >
> > >
> > > More data from even slower storage: zeroing a 10G LV on Kaminario K2
> > >
> > > # time blkdiscard -z -p 32m /dev/test_vg/test_lv2
> > >
> > > real    50m12.425s
> > > user    0m0.018s
> > > sys     2m6.785s
> > >
> > > Maybe something is wrong with this storage, since we see this:
> > >
> > > # grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
> > > /sys/block/dm-29/queue/write_same_max_bytes:512
> > >
> > > Since BLKZEROOUT always falls back silently to slow manual zeroing,
> > > maybe we can disable the aggressive pre-zeroing of the entire device
> > > for block devices, and keep this optimization for files when fallocate()
> > > is supported?
> >
> > I'm not sure what the detour through NBD changes, but qemu-img directly
> > on a block device doesn't use BLKZEROOUT first, but
> > FALLOC_FL_PUNCH_HOLE.
> 
> 
> Looking at block/file-posix.c (83c496599cc04926ecbc3e47a37debaa3e38b686)
> we don't use PUNCH_HOLE for block devices:
> 
> 1472     if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
> 1473         return handle_aiocb_write_zeroes_block(aiocb);
> 1474     }
> 
> qemu uses BLKZEROOUT, which is not guaranteed to be fast on the storage
> side and, even worse, falls back silently to manual zeroing if the storage
> does not support WRITE_SAME.

We only get there as a fallback. The normal path for supported zero
writes is like this:

1. convert_do_copy()
2. blk_make_zero(s->target, BDRV_REQ_MAY_UNMAP)
3. bdrv_make_zero(blk->root, flags)
4. a. bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL)
      Skip the zero write if the block is already zero
      Why don't we hit this path with NBD?
   b. bdrv_pwrite_zeroes(child, offset, bytes, flags)
5. bdrv_prwv_co(BDRV_REQ_ZERO_WRITE | flags)
   flags is now BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP
6. bdrv_co_pwritev()
   bdrv_co_do_zero_pwritev()
7. bdrv_aligned_pwritev(BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP)
8. bdrv_co_do_pwrite_zeroes(BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP)
9. drv->bdrv_co_pwrite_zeroes(flags & bs->supported_zero_flags)
   file-posix: bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP

10. raw_co_pwrite_zeroes(BDRV_REQ_MAY_UNMAP)
11. paio_submit_co(QEMU_AIO_WRITE_ZEROES | QEMU_AIO_DISCARD)
12. aio_worker()
13. handle_aiocb_write_zeroes_unmap()
14. do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                 aiocb->aio_offset, aiocb->aio_nbytes);

Only if this fails, we try other methods. And without the
BDRV_REQ_MAY_UNMAP flag, we would immediately try BLKZEROOUT.
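
To make the ordering concrete, here is a rough standalone sketch of the same
fallback logic (simplified, not the actual file-posix.c code): with MAY_UNMAP
semantics we try punching a hole first, and only fall back to BLKZEROOUT,
which the kernel may emulate slowly, if that fails.

  #define _GNU_SOURCE
  #include <fcntl.h>          /* fallocate(), FALLOC_FL_* */
  #include <linux/fs.h>       /* BLKZEROOUT */
  #include <stdint.h>
  #include <sys/ioctl.h>

  static int zero_range(int fd, uint64_t offset, uint64_t len, int may_unmap)
  {
      if (may_unmap) {
          /* Steps 13/14 above: punch a hole, keeping the apparent size. */
          if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                        offset, len) == 0)
              return 0;
      }

      /* Fallback for block devices: BLKZEROOUT.  As discussed in this
       * thread, the kernel silently writes zeroes by hand if the device
       * has no efficient method (e.g. no WRITE SAME support). */
      uint64_t range[2] = { offset, len };
      if (ioctl(fd, BLKZEROOUT, range) == 0)
          return 0;

      return -1;              /* a real implementation would emulate manually */
  }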

> > Maybe we can add a flag that avoids anything that
> > could be slow, such as BLKZEROOUT, as a fallback (and also the slow
> > emulation that QEMU itself would do if all kernel calls fail).
> >
> 
> But the issue here is not how qemu-img handles this case, but how the NBD
> server can handle it. NBD may support zeroing, but there is no way to tell
> if zeroing is going to be fast, since the backend writing zeros to storage
> has the same limits as qemu-img.

I suppose we could add the same flag to NBD (write zeroes, but only if
it's efficient). But first we need to solve the local case because NBD
will always build on that.
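
A purely hypothetical sketch of what such a flag could look like on the
client side (neither this flag bit nor these helpers exist in the NBD
protocol or in QEMU; the name is invented for the example):

  #include <stdbool.h>
  #include <stdint.h>

  #define MY_NBD_FLAG_FAST_ZERO  (1 << 15)   /* invented transmission flag */

  /* Only pre-zero the whole target if the server promises zeroing is
   * cheap; otherwise keep the old behaviour of interleaving writes. */
  static bool should_prezero_target(uint16_t export_flags)
  {
      return (export_flags & MY_NBD_FLAG_FAST_ZERO) != 0;
  }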

> So I think we need to fix the performance regression in 2.12 by enabling
> pre-zeroing of the entire disk only if FALLOC_FL_PUNCH_HOLE can be used,
> and only if it can be used without a fallback to a slow zeroing method.
> 
> Enabling this optimization for anything else requires changing the
> entire stack (storage, kernel, NBD protocol) to support reporting a
> fast-zero capability, or limiting zeroing to fast operations.

Yes, I agree.

Kevin

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-11-19 12:03 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-07 12:13 [Qemu-devel] Change in qemu 2.12 causes qemu-img convert to NBD to write more data Richard W.M. Jones
2018-11-07 14:36 ` Richard W.M. Jones
2018-11-07 14:56   ` Nir Soffer
2018-11-07 15:02     ` Richard W.M. Jones
2018-11-07 17:27     ` [Qemu-devel] [Qemu-block] " Kevin Wolf
2018-11-07 17:55       ` Nir Soffer
2018-11-11 16:11         ` Nir Soffer
2018-11-15 22:27           ` Nir Soffer
2018-11-16 15:26             ` Kevin Wolf
2018-11-17 20:59               ` Nir Soffer
2018-11-17 21:13                 ` Richard W.M. Jones
2018-11-18  7:24                   ` Nir Soffer
2018-11-19 11:50                 ` Kevin Wolf
2018-11-07 16:42 ` [Qemu-devel] " Eric Blake
2018-11-11 15:25   ` Nir Soffer
