* [Qemu-devel] Request for clarification on qemu-img convert behavior zeroing target host_device
       [not found] <DB6PR07MB33330E3562B0AF0ECC8307BCAEA00@DB6PR07MB3333.eurprd07.prod.outlook.com>
@ 2018-12-13 13:12 ` De Backer, Fred (Nokia - BE/Antwerp)
  2018-12-13 14:17   ` Eric Blake
  0 siblings, 1 reply; 11+ messages in thread
From: De Backer, Fred (Nokia - BE/Antwerp) @ 2018-12-13 13:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: Aamir T, Owais (Nokia - IN/Chennai)

Hi,

We're using OpenStack Ironic to deploy bare-metal servers. During the deployment process, an agent (ironic-python-agent) running on Fedora Linux uses qemu-img to write a qcow2 file to a block device.

Recently we saw a change in the behavior of qemu-img. Previously we were using Fedora 27 with the Fedora-packaged qemu-img v2.10.2 (qemu-img-2.10.2-1.fc27.x86_64.rpm); now we use Fedora 29 with the Fedora-packaged qemu-img v3.0.0 (qemu-img-3.0.0-2.fc29.x86_64.rpm).

The command run by the ironic-python-agent (the same on both FC27 and FC29) is: qemu-img convert -t directsync -O host_device /tmp/image.qcow2 /dev/sda

We observe that on Fedora 29, qemu-img fully zeroes the disk before imaging it. Given the disk size, the whole process now takes 35 minutes instead of 50 seconds, which causes the ironic-python-agent operation to time out. The Fedora 27 qemu-img doesn't do that.

Scanning through the qemu-img source code, we found that adding -S 0 to the command on Fedora 29 qemu-img restores the behavior as observed in Fedora 27 qemu-img.

Looking through the changelogs of qemu I couldn't find this behavior change documented.

Now the questions:
* Is it expected/required behavior that qemu-img first zeroes the complete target disk before writing the image? In other words: is this a qemu-img bug?
* Is applying the -S 0 parameter a safe/sound/sensible way to revert to the old behavior? In other words: can I file a bug against the ironic-python-agent to start using this parameter?
* If the behavior is expected: is there some pointer to documentation/changelogs I can read about this?

Thanks,
Regards,

Fred De Backer
SME Video Integration Engineer,
IP/Optical Networks, Nokia
fred.de_backer@nokia.com

Nokia Bell NV I Copernicuslaan 50, 2018 Antwerpen I BTW BE 0404 621 642 RPR Antwerpen I BNP Paribas Fortis 220-0002334-42 I IBAN BE 77 2200 0023 3442 I BIC GEBABEBB

<<
This message (including any attachments) contains confidential information intended for a specific individual and purpose, and is protected by law. If you are not the intended recipient, you should delete this message. Any disclosure, copying, or distribution of this message, or the taking of any action based on it, is strictly prohibited without the prior consent of its author.
>>


* Re: [Qemu-devel] Request for clarification on qemu-img convert behavior zeroing target host_device
  2018-12-13 13:12 ` [Qemu-devel] Request for clarification on qemu-img convert behavior zeroing target host_device De Backer, Fred (Nokia - BE/Antwerp)
@ 2018-12-13 14:17   ` Eric Blake
  2018-12-13 14:49     ` [Qemu-devel] [Qemu-block] " Kevin Wolf
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Blake @ 2018-12-13 14:17 UTC (permalink / raw)
  To: De Backer, Fred (Nokia - BE/Antwerp),
	qemu-devel, Nir Soffer, qemu block, Richard W.M. Jones
  Cc: Aamir T, Owais (Nokia - IN/Chennai)

On 12/13/18 7:12 AM, De Backer, Fred (Nokia - BE/Antwerp) wrote:
> Hi,
> 
> We're using Openstack Ironic to deploy baremetal servers. During the deployment process an agent (ironic-python-agent) running on Fedora linux uses qemu-img to write a qcow2 file to a blockdevice.
> 
> Recently we saw a change in behavior of qemu-img. Previously we were using Fedora 27 containing a fedora packaged version of qemu-img v2.10.2 (qemu-img-2.10.2-1.fc27.x86_64.rpm); now we use Fedora 29 containing a fedora packaged version of qemu-img v3.0.0 (qemu-img-3.0.0-2.fc29.x86_64.rpm).
> 
> The command that is run by the ironic-python-agent (the same in both FC27 and FC29) is: qemu-img convert -t directsync -O host_device /tmp/image.qcow2 /dev/sda
> 
> We observe that in Fedora 29 the qemu-img, before imaging the disk, it fully zeroes it. Taking into account the disk size, the whole process now takes 35 minutes instead of 50 seconds. This causes the ironic-python-agent operation to time-out. The Fedora 27 qemu-img doesn't do that.

Known issue; Nir and Rich have posted a previous thread on the topic, 
and the conclusion is that we need to make qemu-img smarter about NOT 
requesting pre-zeroing of devices where that is more expensive than just 
zeroing as we go.
https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg01182.html


> 
> Scanning through the qemu-img source code, we found that adding -S 0 to the command on Fedora 29 qemu-img restores the behavior as observed in Fedora 27 qemu-img.
> 
> Looking through the changelogs of qemu I couldn't find this behavior change documented.
> 
> Now the questions:
> * Is this the expected/required behavior that qemu-img first zeroes the complete target disk before writing the image. In other words: is this a qemu-img bug?

It's a performance bug. qemu-img convert has to ensure that the 
destination reads 0 (rather than is uninitialized), but the way in which 
it does so needs to be more careful about destinations that do not have 
efficient block status or bulk zeroing capabilities.

> * Is applying the -S 0 parameter a safe/sound/sensible thing to do to revert to the old behavior. In other words: can I write a bug against the ironic-python-agent to start using this parameter?

Using -S 0 avoids sparseness, which may introduce its own set of 
problems if you were expecting the destination to be sparse.

> * If the behavior is expected: is there some pointer to documentation/changelogs I can read about this?

Reading the mentioned thread will give some more insight, and hopefully 
qemu 4.0 will either improve the behavior by default or at least add 
knobs so that you can tweak the behavior based on your needs.

> This message (including any attachments) contains confidential information

Such disclaimers are unenforceable on publicly-archived lists.  Still, 
you may want to consider using a different email address that doesn't 
spam list readers with your employer's legalese gobbledygook.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


* Re: [Qemu-devel] [Qemu-block] Request for clarification on qemu-img convert behavior zeroing target host_device
  2018-12-13 14:17   ` Eric Blake
@ 2018-12-13 14:49     ` Kevin Wolf
  2018-12-13 15:05       ` Eric Blake
  0 siblings, 1 reply; 11+ messages in thread
From: Kevin Wolf @ 2018-12-13 14:49 UTC (permalink / raw)
  To: Eric Blake
  Cc: De Backer, Fred (Nokia - BE/Antwerp),
	qemu-devel, Nir Soffer, qemu block, Richard W.M. Jones, Aamir T,
	Owais (Nokia - IN/Chennai)

On 13.12.2018 at 15:17, Eric Blake wrote:
> On 12/13/18 7:12 AM, De Backer, Fred (Nokia - BE/Antwerp) wrote:
> > Hi,
> > 
> > We're using Openstack Ironic to deploy baremetal servers. During the deployment process an agent (ironic-python-agent) running on Fedora linux uses qemu-img to write a qcow2 file to a blockdevice.
> > 
> > Recently we saw a change in behavior of qemu-img. Previously we were using Fedora 27 containing a fedora packaged version of qemu-img v2.10.2 (qemu-img-2.10.2-1.fc27.x86_64.rpm); now we use Fedora 29 containing a fedora packaged version of qemu-img v3.0.0 (qemu-img-3.0.0-2.fc29.x86_64.rpm).
> > 
> > The command that is run by the ironic-python-agent (the same in both FC27 and FC29) is: qemu-img convert -t directsync -O host_device /tmp/image.qcow2 /dev/sda
> > 
> > We observe that in Fedora 29 the qemu-img, before imaging the disk, it fully zeroes it. Taking into account the disk size, the whole process now takes 35 minutes instead of 50 seconds. This causes the ironic-python-agent operation to time-out. The Fedora 27 qemu-img doesn't do that.
> 
> Known issue; Nir and Rich have posted a previous thread on the topic, and
> the conclusion is that we need to make qemu-img smarter about NOT requesting
> pre-zeroing of devices where that is more expensive than just zeroing as we
> go.
> https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg01182.html

Yes, we should be careful to avoid the fallback in this case.

However, how could this ever go from 50 seconds for writing the whole
image to 35 minutes?! Even if you end up writing the whole image twice
because you write zeros first and then overwrite them everywhere with
data, shouldn't the maximum be doubling the time, i.e. 100 seconds?

Why is the write_zeroes fallback _that_ slow? It will also hit guests
that request write_zeroes, so I feel this is worth investigating a bit
more nevertheless.

Can you check with strace which operation actually succeeds writing
zeros to /dev/sda? The first thing we try is fallocate with
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE. This should always be fast,
so I suppose this fails in your case. The next thing is BLKZEROOUT,
which I think can do a fallback in the kernel. Does this return success?
Otherwise we have another fallback mechanism inside of QEMU, which would
use normal pwrite calls with a zeroed buffer.

Once we know which mechanism is used, we can look into why it is so
abysmally slow.

Kevin
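The fallback order Kevin describes (fallocate with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, then BLKZEROOUT, then plain pwrite of a zeroed buffer) can be sketched roughly as follows. This is an illustration under assumptions, not QEMU's implementation: the helper names are hypothetical, the constants are the Linux values from <linux/falloc.h> and <linux/fs.h>, and the ctypes call assumes a 64-bit Linux libc.

```python
import ctypes
import fcntl
import os
import struct

# Linux constants (assumed values from <linux/falloc.h> and <linux/fs.h>).
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
BLKZEROOUT = 0x127F  # _IO(0x12, 127)


def _punch_hole(fd, offset, length):
    """First attempt: fallocate(PUNCH_HOLE | KEEP_SIZE). True on success."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        fallocate = libc.fallocate
    except (OSError, AttributeError):
        return False  # no fallocate() available on this platform
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    return fallocate(fd, mode, offset, length) == 0


def _blkzeroout(fd, offset, length):
    """Second attempt: the BLKZEROOUT ioctl (block devices only)."""
    try:
        fcntl.ioctl(fd, BLKZEROOUT, struct.pack("QQ", offset, length))
    except OSError:
        return False  # e.g. ENOTTY when fd is not a block device
    return True


def _pwrite_zeros(fd, offset, length, bufsize=1 << 20):
    """Last resort: write a zeroed buffer with ordinary pwrite calls."""
    zeros = bytes(bufsize)
    end = offset + length
    while offset < end:
        offset += os.pwrite(fd, zeros[:min(bufsize, end - offset)], offset)
    return True


def zero_range(fd, offset, length):
    """Zero [offset, offset + length) and report which mechanism worked."""
    for name, attempt in (("fallocate", _punch_hole),
                          ("BLKZEROOUT", _blkzeroout),
                          ("pwrite", _pwrite_zeros)):
        if attempt(fd, offset, length):
            return name
    raise OSError("no zeroing mechanism succeeded")
```

Returning the name of the mechanism that succeeded mirrors the question Kevin asks of the strace output: which of the three calls actually ends up doing the work on /dev/sda.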


* Re: [Qemu-devel] [Qemu-block] Request for clarification on qemu-img convert behavior zeroing target host_device
  2018-12-13 14:49     ` [Qemu-devel] [Qemu-block] " Kevin Wolf
@ 2018-12-13 15:05       ` Eric Blake
  2018-12-13 21:14         ` De Backer, Fred (Nokia - BE/Antwerp)
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Blake @ 2018-12-13 15:05 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: De Backer, Fred (Nokia - BE/Antwerp),
	qemu-devel, Nir Soffer, qemu block, Richard W.M. Jones, Aamir T,
	Owais (Nokia - IN/Chennai)

On 12/13/18 8:49 AM, Kevin Wolf wrote:

>>> We observe that in Fedora 29 the qemu-img, before imaging the disk, it fully zeroes it. Taking into account the disk size, the whole process now takes 35 minutes instead of 50 seconds. This causes the ironic-python-agent operation to time-out. The Fedora 27 qemu-img doesn't do that.
>>
>> Known issue; Nir and Rich have posted a previous thread on the topic, and
>> the conclusion is that we need to make qemu-img smarter about NOT requesting
>> pre-zeroing of devices where that is more expensive than just zeroing as we
>> go.
>> https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg01182.html
> 
> Yes, we should be careful to avoid the fallback in this case.
> 
> However, how could this ever go from 50 seconds for writing the whole
> image to 35 minutes?! Even if you end up writing the whole image twice
> because you write zeros first and then overwrite them everywhere with
> data, shouldn't the maximum be doubling the time, i.e. 100 seconds?
> 
> Why is the write_zeroes fallback _that_ slow? It will also hit guests
> that request write_zeroes, so I feel this is worth investigating a bit
> more nevertheless.
> 
> Can you check with strace which operation actually succeeds writing
> zeros to /dev/sda? The first thing we try is fallocate with
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE. This should always be fast,
> so I suppose this fails in your case. The next thing is BLKZEROOUT,
> which I think can do a fallback in the kernel. Does this return success?
> Otherwise we have another fallback mechanism inside of QEMU, which would
> use normal pwrite calls with a zeroed buffer.

It may also be a case of poor lseek(SEEK_HOLE) performance on the source 
(a known issue with at least some versions of tmpfs). The way qemu-img 
queries for block status, it ends up repeatedly hammering on lseek(), 
and if lseek() is already O(n) instead of O(1) in behavior, that 
explodes into some O(n^2) scaling because qemu-img isn't caching the 
answers it got previously.

> 
> Once we know which mechanism is used, we can look into why it is so
> abysmally slow.

Indeed, performance traces are important for issues like this.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
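Eric's point about repeated lseek() calls can be illustrated with a small sketch that walks a file once with SEEK_DATA/SEEK_HOLE and keeps the answers, so later "is this offset data?" queries become a binary search instead of another lseek(). This is not how qemu-img is written; `data_extents` and `is_data` are hypothetical helpers, and the result depends on the filesystem (one that does not report holes will return the whole file as a single data extent).

```python
import bisect
import errno
import os


def data_extents(fd):
    """Walk the file once and return sorted (start, end) data extents.

    Note: this moves the file offset of fd as a side effect.
    """
    size = os.fstat(fd).st_size
    extents = []
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:  # only a hole remains until EOF
                break
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end))
        offset = end
    return extents


def is_data(extents, offset):
    """Answer a block-status query from the cached extents, without lseek()."""
    i = bisect.bisect_right(extents, (offset, float("inf"))) - 1
    return i >= 0 and extents[i][0] <= offset < extents[i][1]
```

With the extents cached, the number of lseek() calls is proportional to the number of extents rather than to the number of queries, which is the scaling difference Eric is pointing at.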


* Re: [Qemu-devel] [Qemu-block] Request for clarification on qemu-img convert behavior zeroing target host_device
  2018-12-13 15:05       ` Eric Blake
@ 2018-12-13 21:14         ` De Backer, Fred (Nokia - BE/Antwerp)
  2018-12-13 21:53           ` Nir Soffer
  0 siblings, 1 reply; 11+ messages in thread
From: De Backer, Fred (Nokia - BE/Antwerp) @ 2018-12-13 21:14 UTC (permalink / raw)
  To: Eric Blake, Kevin Wolf
  Cc: qemu-devel, Nir Soffer, qemu block, Richard W.M. Jones, Aamir T,
	Owais (Nokia - IN/Chennai)

[-- Attachment #1: Type: text/plain, Size: 3923 bytes --]

> >>> We observe that in Fedora 29 the qemu-img, before imaging the disk, it fully
> >>> zeroes it. Taking into account the disk size, the whole process now takes 35
> >>> minutes instead of 50 seconds. This causes the ironic-python-agent operation to
> >>> time-out. The Fedora 27 qemu-img doesn't do that.
> >>
> >> Known issue; Nir and Rich have posted a previous thread on the topic,
> >> and the conclusion is that we need to make qemu-img smarter about NOT
> >> requesting pre-zeroing of devices where that is more expensive than
> >> just zeroing as we go.
> >> https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg01182.html
> >
> > Yes, we should be careful to avoid the fallback in this case.
> >
> > However, how could this ever go from 50 seconds for writing the whole
> > image to 35 minutes?! Even if you end up writing the whole image twice
> > because you write zeros first and then overwrite them everywhere with
> > data, shouldn't the maximum be doubling the time, i.e. 100 seconds?

I believe the situation is different from the one described, where I understand source and destination have a comparable size (hence the doubling of the time).
In the Ironic deployment scenario, the source is a relatively small cloud image compared to the destination, which is a disk on a bare-metal server. I've included below some properties of the source (10G image; mostly sparse; compressed qcow2 size is 584M) and the destination (300G RAID device on an HP SmartArray controller).

Source qcow2 image properties:
image: /tmp/centos7-biosmbr-lvm-1539159593.qcow2
file format: qcow2
virtual size: 9.3G (10000000000 bytes)
disk size: 567M
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

Destination blockdevice properties:
# blockdev --getsz --getdiscardzeroes --getss --getpbsz --getiomin --getioopt --getalignoff --getmaxsect --getbsz --getsize64 --getra --getfra /dev/sda
585871964
0
512
512
262144
262144
0
512
2048
299966445568
256
256
# lsblk /dev/sda
NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda    8:0    0 279.4G  0 disk

The observation is that the whole 300GB disk gets zeroed before the "small" image is written.

Here is the timing for FC27:
# time qemu-img convert -t directsync -O host_device /tmp/centos7-biosmbr-lvm-1539159593.qcow2 /dev/sda
real	0m50.935s
user	0m7.917s
sys	0m3.954s

And for FC29:
# time qemu-img convert -t directsync -O host_device /tmp/centos7-biosmbr-lvm-1539159593.qcow2 /dev/sda
real	35m41.981s
user	0m8.520s
sys	0m12.232s

> >
> > Why is the write_zeroes fallback _that_ slow? It will also hit guests
> > that request write_zeroes, so I feel this is worth investigating a bit
> > more nevertheless.
> >
> > Can you check with strace which operation actually succeeds writing
> > zeros to /dev/sda? The first thing we try is fallocate with
> > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE. This should always be
> > fast, so I suppose this fails in your case. The next thing is
> > BLKZEROOUT, which I think can do a fallback in the kernel. Does this return
> > success?
> > Otherwise we have another fallback mechanism inside of QEMU, which
> > would use normal pwrite calls with a zeroed buffer.
> 
> It may also be a case of poor lseek(SEEK_HOLE) performance on the source (a
> known issue with at least some versions of tmpfs). The way qemu-img queries
> for block status, it ends up repeatedly hammering on lseek(), and if lseek() is
> already O(n) instead of O(1) in behavior, that explodes into some O(n^2) scaling
> because qemu-img isn't caching the answers it got previously.
> 
> >
> > Once we know which mechanism is used, we can look into why it is so
> > abysmally slow.
 
> Indeed, performance traces are important for issues like this.

See strace of both FC27 and FC29 attached.

Fred
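The blockdev output in Fred's message prints one value per line, in the same order as the flags on the command line. A short sketch that pairs them up makes the numbers easier to read and lets the sizes be cross-checked. The flag list and values are copied from the message; the variable names are illustrative, and the 512-byte unit for --getsz is taken from the blockdev documentation.

```python
# Flags and output copied from the blockdev invocation in the message above.
flags = ("--getsz --getdiscardzeroes --getss --getpbsz --getiomin --getioopt "
         "--getalignoff --getmaxsect --getbsz --getsize64 --getra --getfra").split()
values = """585871964 0 512 512 262144 262144 0 512 2048
            299966445568 256 256""".split()

assert len(flags) == len(values)
report = dict(zip(flags, map(int, values)))

for flag, value in report.items():
    print(f"{flag:<20} {value}")

# --getsz counts 512-byte sectors, so it should agree with --getsize64.
size_bytes = report["--getsz"] * 512
print("size from sectors:", size_bytes, "bytes")
print("size in GiB:", round(size_bytes / 2**30, 1))
```

The size in GiB is the same quantity lsblk reports as 279.4G, and --getdiscardzeroes is reported as 0 for this device.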

[-- Attachment #2: fc27_qemu-img.strace.gz --]
[-- Type: application/x-gzip, Size: 86163 bytes --]

[-- Attachment #3: fc29_qemu-img.strace.gz --]
[-- Type: application/x-gzip, Size: 68136 bytes --]


* Re: [Qemu-devel] [Qemu-block] Request for clarification on qemu-img convert behavior zeroing target host_device
  2018-12-13 21:14         ` De Backer, Fred (Nokia - BE/Antwerp)
@ 2018-12-13 21:53           ` Nir Soffer
       [not found]             ` <VI1PR07MB3344412C5DE71936F9689909AEA10@VI1PR07MB3344.eurprd07.prod.outlook.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Nir Soffer @ 2018-12-13 21:53 UTC (permalink / raw)
  To: fred.de_backer
  Cc: Eric Blake, Kevin Wolf, QEMU Developers, qemu-block,
	Richard W.M. Jones, owais.aamir_t

On Thu, Dec 13, 2018 at 11:14 PM De Backer, Fred (Nokia - BE/Antwerp) <fred.de_backer@nokia.com> wrote:

> > Indeed, performance traces are important for issues like this.
> See strace of both FC27 and FC29 attached
>

Looks like you traced only the main thread. All the I/O is done in other threads.
These flags would be useful:

    strace -o log -f -T -tt

Nir
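A log produced with the flags Nir suggests carries a PID prefix on each line (from -f with -o), a wall-clock timestamp (-tt) and the time spent in each call in angle brackets (-T). As a rough sketch of how such a log could be summarized, the hypothetical helper below totals the time per syscall. The sample lines are made up to show the general shape of the output; they are not taken from the attached traces, and the argument formatting in particular is only indicative.

```python
import re
from collections import defaultdict

LINE_RE = re.compile(
    r"^\d+\s+"                      # PID prefix (-f together with -o)
    r"\d{2}:\d{2}:\d{2}\.\d+\s+"    # wall-clock timestamp (-tt)
    r"(?:<\.\.\. (?P<resumed>\w+) resumed>|(?P<name>\w+)\()"
    r".*<(?P<secs>\d+\.\d+)>$"      # time spent in the syscall (-T)
)


def time_per_syscall(lines):
    """Sum the -T durations per syscall name; unmatched lines are skipped."""
    totals = defaultdict(float)
    for line in lines:
        m = LINE_RE.match(line.rstrip())
        if m:
            name = m.group("name") or m.group("resumed")
            totals[name] += float(m.group("secs"))
    return dict(totals)


# Made-up sample lines, for illustration only.
SAMPLE = """\
1201 10:00:00.000100 ioctl(9, BLKZEROOUT, [0, 2147483648]) = 0 <12.100000>
1202 10:00:12.200000 ioctl(9, BLKZEROOUT, [2147483648, 2147483648] <unfinished ...>
1201 10:00:12.200500 pwrite64(9, "..."..., 65536, 0) = 65536 <0.000900>
1202 10:00:24.200000 <... ioctl resumed>) = 0 <12.000000>
""".splitlines()

if __name__ == "__main__":
    for name, secs in sorted(time_per_syscall(SAMPLE).items()):
        print(f"{name:<10} {secs:.6f} s")
```

Lines marked "<unfinished ...>" carry no duration and are skipped; the matching "resumed" line is the one that counts, which matters here because the I/O runs in several threads.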


* Re: [Qemu-devel] [Qemu-block] Request for clarification on qemu-img convert behavior zeroing target host_device
       [not found]             ` <VI1PR07MB3344412C5DE71936F9689909AEA10@VI1PR07MB3344.eurprd07.prod.outlook.com>
@ 2018-12-14 10:59               ` De Backer, Fred (Nokia - BE/Antwerp)
  2018-12-14 12:26                 ` Kevin Wolf
  0 siblings, 1 reply; 11+ messages in thread
From: De Backer, Fred (Nokia - BE/Antwerp) @ 2018-12-14 10:59 UTC (permalink / raw)
  To: Nir Soffer
  Cc: Eric Blake, Kevin Wolf, QEMU Developers, qemu-block,
	Richard W.M. Jones, Aamir T, Owais (Nokia - IN/Chennai)

[-- Attachment #1: Type: text/plain, Size: 457 bytes --]

>>> Indeed, performance traces are important for issues like this.
>>See strace of both FC27 and FC29 attached

>Looks like you traced only the main thread. All the I/O is done in other threads.
>These flags would be useful:
>
>    strace -o log -f -T -tt

New straces attached with mentioned flags.
Both truncated due to file size at what I believe to be the same position in the "write-phase". FC29 has the "zeroing-phase" in front.

Fred


[-- Attachment #2: fc27_qemu-img.strace.snip.gz --]
[-- Type: application/x-gzip, Size: 19318 bytes --]

[-- Attachment #3: fc29_qemu-img.strace.snip.gz --]
[-- Type: application/x-gzip, Size: 35998 bytes --]


* Re: [Qemu-devel] [Qemu-block] Request for clarification on qemu-img convert behavior zeroing target host_device
  2018-12-14 10:59               ` De Backer, Fred (Nokia - BE/Antwerp)
@ 2018-12-14 12:26                 ` Kevin Wolf
  2018-12-14 12:52                   ` De Backer, Fred (Nokia - BE/Antwerp)
  2018-12-14 13:22                   ` Daniel P. Berrangé
  0 siblings, 2 replies; 11+ messages in thread
From: Kevin Wolf @ 2018-12-14 12:26 UTC (permalink / raw)
  To: De Backer, Fred (Nokia - BE/Antwerp)
  Cc: Nir Soffer, Eric Blake, QEMU Developers, qemu-block,
	Richard W.M. Jones, Aamir T, Owais (Nokia - IN/Chennai)

On 14.12.2018 at 11:59, De Backer, Fred (Nokia - BE/Antwerp) wrote:
> >>> Indeed, performance traces are important for issues like this.
> >>See strace of both FC27 and FC29 attached
> 
> >Looks like you traced only the main thread. All the I/O is done in other threads.
> >These flags would be useful:
> >
> >    strace -o log -f -T -tt
> 
> New straces attached with mentioned flags.  Both truncated due to file
> size at what I believe to be the same position in the "write-phase".
> FC29 has the "zeroing-phase" in front.

So this is indeed using BLKZEROOUT, which has a slow fallback in the
kernel (slow means ~12 seconds for each 2 GB chunk).

We need to avoid calling BLKZEROOUT in the context of pre-zeroing the
image for qemu-img convert.

Of course, we should also think about the other problem you mentioned,
related to copying a smaller image to a larger block device. Does this
require zeroing the parts after the image or should we leave them alone?

I'd tend to say that since you're passing the whole block device as a
target to 'qemu-img convert', and the whole block device will be visible
to a guest run with the same block device configuration, we should
indeed zero out the whole device. But then we would declare the F27
behaviour buggy and this case would stay slow (it would become slightly
faster because we avoid the double writes, but we wouldn't save the
writes to the unused space).

Or we could just refuse to convert if source and target aren't the same
size. Then you would have to use a raw filter driver to select a part
from the target image with the offset/size options. But this would break
backwards compatibility, and its use is not very intuitive either.

Unfortunately, this is just the kind of problem that raw images give
you. :-/

Kevin
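Kevin's figure of roughly 12 seconds per 2 GB BLKZEROOUT chunk can be checked against the numbers Fred reported. The arithmetic below is a back-of-the-envelope estimate, assuming that "2 GB" means 2 GiB and that every chunk takes the same time; the device size and the observed run time are taken from Fred's earlier message.

```python
import math

DEVICE_BYTES = 299_966_445_568   # blockdev --getsize64 /dev/sda
CHUNK_BYTES = 2 * 1024**3        # assumption: "2 GB" means 2 GiB
SECONDS_PER_CHUNK = 12           # Kevin's approximate figure

chunks = math.ceil(DEVICE_BYTES / CHUNK_BYTES)
estimate_s = chunks * SECONDS_PER_CHUNK
observed_s = 35 * 60 + 41.981    # "real 35m41.981s" on FC29

print(f"chunks:    {chunks}")
print(f"estimate:  {estimate_s} s ({estimate_s / 60:.1f} min)")
print(f"observed:  {observed_s:.0f} s ({observed_s / 60:.1f} min)")
```

That gives about 140 chunks and roughly 28 minutes for the zeroing phase alone, the same order of magnitude as the 35 minutes observed, so the pre-zeroing of the full 300 GB device plausibly accounts for most of the slowdown.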


* Re: [Qemu-devel] [Qemu-block] Request for clarification on qemu-img convert behavior zeroing target host_device
  2018-12-14 12:26                 ` Kevin Wolf
@ 2018-12-14 12:52                   ` De Backer, Fred (Nokia - BE/Antwerp)
  2018-12-14 13:10                     ` Richard W.M. Jones
  2018-12-14 13:22                   ` Daniel P. Berrangé
  1 sibling, 1 reply; 11+ messages in thread
From: De Backer, Fred (Nokia - BE/Antwerp) @ 2018-12-14 12:52 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Nir Soffer, Eric Blake, QEMU Developers, qemu-block,
	Richard W.M. Jones, Aamir T, Owais (Nokia - IN/Chennai)

>Of course, we should also think about the other problem you mentioned, related to copying a smaller image to a larger block device. Does this require zeroing the parts after the image or should we leave them alone?
>
>I'd tend to say that since you're passing the whole block device as a target to 'qemu-img convert', and the whole block device will be visible to a guest run with the same block device configuration, we should indeed zero out the whole device. But then we would declare the F27 behaviour buggy and this case would stay slow (it would become slightly faster because we avoid the double writes, but we wouldn't save the writes to the unused space).

As long as it's outside the region of the source image, I think you can leave it alone. Similarly, deleting a file on a disk doesn't zero out the sectors that were used to store that file.

>Or we could just refuse to convert if source and target aren't the same size. Then you would have to use a raw filter driver to select a part from the target image with the offset/size options. But this would break backwards compatibility, and its use is not very intuitive either.

Going down that path would certainly break the way the OpenStack Ironic project uses qemu-img via the ironic-python-agent to image bare-metal servers. And I can imagine there are other cases out there that use qemu-img to write disk images to block devices.

Fred


* Re: [Qemu-devel] [Qemu-block] Request for clarification on qemu-img convert behavior zeroing target host_device
  2018-12-14 12:52                   ` De Backer, Fred (Nokia - BE/Antwerp)
@ 2018-12-14 13:10                     ` Richard W.M. Jones
  0 siblings, 0 replies; 11+ messages in thread
From: Richard W.M. Jones @ 2018-12-14 13:10 UTC (permalink / raw)
  To: De Backer, Fred (Nokia - BE/Antwerp)
  Cc: Kevin Wolf, Nir Soffer, Eric Blake, QEMU Developers, qemu-block,
	Aamir T, Owais (Nokia - IN/Chennai)

On Fri, Dec 14, 2018 at 12:52:34PM +0000, De Backer, Fred (Nokia - BE/Antwerp) wrote:
> >Of course, we should also think about the other problem you mentioned, related to copying a smaller image to a larger block device. Does this require zeroing the parts after the image or should we leave them alone?
> >
> >I'd tend to say that since you're passing the whole block device as a target to 'qemu-img convert', and the whole block device will be visible to a guest run with the same block device configuration, we should indeed zero out the whole device. But then we would declare the F27 behaviour buggy and this case would stay slow (it would become slightly faster because we avoid the double writes, but we wouldn't save the writes to the unused space).
> 
> As long as it's outside the region of the source image I think you can leave it alone. Similar to deleting a file on a disk also doesn't zero out the sectors that were used to store that file before.

It's really nothing at all like that case.  Kevin is right: the only
sensible thing to do is to zero-extend the image to the full size of
the target (in the absence of the user instructing qemu-img to do
something else).

Rich.

> >Or we could just refuse to convert if source and target aren't the same size. Then you would have to use a raw filter driver to select a part from the target image with the offset/size options. But this would break backwards compatibility, and its use is not very intuitive either.
> 
> Going that path would certainly break the way Openstack Ironic project uses qemu-img via the ironic-python-agent to image baremetal servers. And I can imagine there are other cases out there using qemu-img to write out disk images to blockdevices.
> 
> Fred

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v


* Re: [Qemu-devel] [Qemu-block] Request for clarification on qemu-img convert behavior zeroing target host_device
  2018-12-14 12:26                 ` Kevin Wolf
  2018-12-14 12:52                   ` De Backer, Fred (Nokia - BE/Antwerp)
@ 2018-12-14 13:22                   ` Daniel P. Berrangé
  1 sibling, 0 replies; 11+ messages in thread
From: Daniel P. Berrangé @ 2018-12-14 13:22 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: De Backer, Fred (Nokia - BE/Antwerp),
	qemu-block, QEMU Developers, Aamir T, Owais (Nokia - IN/Chennai),
	Richard W.M. Jones, Nir Soffer

On Fri, Dec 14, 2018 at 01:26:59PM +0100, Kevin Wolf wrote:
> On 14.12.2018 at 11:59, De Backer, Fred (Nokia - BE/Antwerp) wrote:
> > >>> Indeed, performance traces are important for issues like this.
> > >>See strace of both FC27 and FC29 attached
> > 
> > >Looks like you traced only the main thread. All the I/O is done in other threads.
> > >These flags would be useful:
> > >
> > >    strace -o log -f -T -tt
> > 
> > New straces attached with mentioned flags.  Both truncated due to file
> > size at what I believe to be the same position in the "write-phase".
> > FC29 has the "zeroing-phase" in front.
> 
> So this is indeed using BLKZEROOUT, which has a slow fallback in the
> kernel (slow means ~12 seconds for each 2 GB chunk).
> 
> We need to avoid calling BLKZEROOUT in the context of pre-zeroing the
> image for qemu-img convert.
> 
> Of course, we should also think about the other problem you mentioned,
> related to copying a smaller image to a larger block device. Does this
> require zeroing the parts after the image or should we leave them alone?
> 
> I'd tend to say that since you're passing the whole block device as a
> target to 'qemu-img convert', and the whole block device will be visible
> to a guest run with the same block device configuration, we should
> indeed zero out the whole device. But then we would declare the F27
> behaviour buggy and this case would stay slow (it would become slightly
> faster because we avoid the double writes, but we wouldn't save the
> writes to the unused space).

I think this behaviour needs to be configurable.

Exposing old data from the block device to the guest is indeed a
security flaw, however, qemu-img can't ever know if there is old
data there or not. Assuming the worst case is a sensible default,
but is too pessimistic to be hardcoded.

If the mgmt application has taken care to ensure the volume was
already zeroed out, there should be a way for it to tell qemu-img
not to do zero'ing again.

Zeroing of block devices is very expensive, so you really
don't want it to be done in the hot startup path. I recall
that to deal with this, Nova (or was it Cinder?) put zeroing
of block devices into the guest cleanup path, in the background.
That way the expensive zeroing operation is done at a place
where it won't impact any user-visible operations. It just
delays the time until the block device can be re-used for a
new guest.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Thread overview: 11+ messages
     [not found] <DB6PR07MB33330E3562B0AF0ECC8307BCAEA00@DB6PR07MB3333.eurprd07.prod.outlook.com>
2018-12-13 13:12 ` [Qemu-devel] Request for clarification on qemu-img convert behavior zeroing target host_device De Backer, Fred (Nokia - BE/Antwerp)
2018-12-13 14:17   ` Eric Blake
2018-12-13 14:49     ` [Qemu-devel] [Qemu-block] " Kevin Wolf
2018-12-13 15:05       ` Eric Blake
2018-12-13 21:14         ` De Backer, Fred (Nokia - BE/Antwerp)
2018-12-13 21:53           ` Nir Soffer
     [not found]             ` <VI1PR07MB3344412C5DE71936F9689909AEA10@VI1PR07MB3344.eurprd07.prod.outlook.com>
2018-12-14 10:59               ` De Backer, Fred (Nokia - BE/Antwerp)
2018-12-14 12:26                 ` Kevin Wolf
2018-12-14 12:52                   ` De Backer, Fred (Nokia - BE/Antwerp)
2018-12-14 13:10                     ` Richard W.M. Jones
2018-12-14 13:22                   ` Daniel P. Berrangé
