All of lore.kernel.org
 help / color / mirror / Atom feed
* qcow2 preallocation and backing files
@ 2019-11-20 12:06 Alberto Garcia
  2019-11-20 12:27 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 8+ messages in thread
From: Alberto Garcia @ 2019-11-20 12:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, qemu-block, Max Reitz

Hi,

as we discussed yesterday on IRC there's an inconsistency in the way
qcow2 preallocation works.

Let's create an image and fill it with data:

   $ qemu-img create -f raw base.img 1M
   $ qemu-io -f raw -c 'write -P 0xFF 0 1M' base.img

Now QEMU won't let us create a new image backed by base.img using
preallocation:

   $ qemu-img create -f qcow2 -b base.img -o preallocation=metadata active.img
   qemu-img: active.img: Backing file and preallocation cannot be used at the same time

The reason is that once a cluster is preallocated (i.e. it has a valid
L2 entry pointing to a host offset) the guest won't see the contents
of the backing file, so those options conflict with each other.

It is possible however to create an image that is smaller than
the backing file and then resize it using preallocation. In this
case qemu-img will happily accept any --preallocation option, with
different results from the guest's point of view:

   # This reads as 0xFF (the data comes from base.img)
   $ qemu-img create -f qcow2 -b base.img active.img 512K

   # The second half of the image also reads as 0xFF
   $ qemu-img resize --preallocation=off active.img 1M

   # Here the second half reads as zeroes
   $ qemu-img resize --preallocation=metadata active.img 1M

Apart from "qemu-img resize", the QMP block-resize command can also
extend an image like this, although it always uses PREALLOC_MODE_OFF
and the user cannot change that.

It does not seem right that the guest-visible data changes depending
on the preallocation mode. This could be solved by returning an error
when (backing_bs(blk_bs(blk)) && prealloc != PREALLOC_MODE_OFF) on
img_resize().

The important question is however: what behavior is the right one?
Should growing an image that was smaller than the backing file return
zeroes, or data from the backing file? I would opt for the latter, for
simplicity and consistency with the current behavior of block-resize,
although it was pointed out that this could be a security problem (I'm
not sure that I agree with that, but we can discuss it).

This also has a consequence on how preallocation should be implemented
for images with subclusters. Extended L2 entries allow us to allocate
a cluster but leave each one of its subclusters unallocated. That
would allow us to have a cluster that is simultaneously allocated but
whose data is read from the backing file. But it's up to us to decide
if that's what we should do when resizing an image.

Berto


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: qcow2 preallocation and backing files
  2019-11-20 12:06 qcow2 preallocation and backing files Alberto Garcia
@ 2019-11-20 12:27 ` Vladimir Sementsov-Ogievskiy
  2019-11-20 15:18   ` Alberto Garcia
  0 siblings, 1 reply; 8+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-20 12:27 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel; +Cc: Kevin Wolf, qemu-block, Max Reitz

20.11.2019 15:06, Alberto Garcia wrote:
> Hi,
> 
> as we discussed yesterday on IRC there's an inconsistency in the way
> qcow2 preallocation works.
> 
> Let's create an image and fill it with data:
> 
>     $ qemu-img create -f raw base.img 1M
>     $ qemu-io -f raw -c 'write -P 0xFF 0 1M' base.img
> 
> Now QEMU won't let us create a new image backed by base.img using
> preallocation:
> 
>     $ qemu-img create -f qcow2 -b base.img -o preallocation=metadata active.img
>     qemu-img: active.img: Backing file and preallocation cannot be used at the same time
> 
> The reason is that once a cluster is preallocated (i.e. it has a valid
> L2 entry pointing to a host offset) the guest won't see the contents
> of the backing file, so those options conflict with each other.
> 
> It is possible however to create an image that is smaller than
> the backing file and then resize it using preallocation. In this
> case qemu-img will happily accept any --preallocation option, with
> different results from the guest's point of view:
> 
>     # This reads as 0xFF (the data comes from base.img)
>     $ qemu-img create -f qcow2 -b base.img active.img 512K
> 
>     # The second half of the image also reads as 0xFF
>     $ qemu-img resize --preallocation=off active.img 1M
> 
>     # Here the second half reads as zeroes
>     $ qemu-img resize --preallocation=metadata active.img 1M
> 
> Apart from "qemu-img resize", the QMP block-resize command can also
> extend an image like this, although it always uses PREALLOC_MODE_OFF
> and the user cannot change that.
> 
> It does not seem right that the guest-visible data changes depending
> on the preallocation mode. This could be solved by returning an error
> when (backing_bs(blk_bs(blk)) && prealloc != PREALLOC_MODE_OFF) on
> img_resize().
> 
> The important question is however: what behavior is the right one?
> Should growing an image that was smaller than the backing file return
> zeroes, or data from the backing file? I would opt for the latter, for
> simplicity and consistency with the current behavior of block-resize,
> although it was pointed out that this could be a security problem (I'm
> not sure that I agree with that, but we can discuss it).

I'm for zeros way.

1. I'm sure that if guest after some operation may get access to that data
which it should not see, it's a security problem.

2. Seing backing file through new clusters is inconsistent with how read works:
read will return zeroes, not data from backing. Consider the following example:

      0         x     y
top: [---------------]
mid: [---------]
base:[111111111111111]

reading from [x,y] from top will return zeroes, not ones.

So, if we consider data after EOF as zeroes (not UNALLOCATED clusters), we should
not make these clusters UNALLOCATED after truncation.

3. Also, the latter way is inconsistent with discard. Discarded regions returns
zeroes, not clusters from backing. I think discard and truncate should behave
in the same safe zero way.

> 
> This also has a consequence on how preallocation should be implemented
> for images with subclusters. Extended L2 entries allow us to allocate
> a cluster but leave each one of its subclusters unallocated. That
> would allow us to have a cluster that is simultaneously allocated but
> whose data is read from the backing file. But it's up to us to decide
> if that's what we should do when resizing an image.
> 
> Berto
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: qcow2 preallocation and backing files
  2019-11-20 12:27 ` Vladimir Sementsov-Ogievskiy
@ 2019-11-20 15:18   ` Alberto Garcia
  2019-11-20 15:58     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 8+ messages in thread
From: Alberto Garcia @ 2019-11-20 15:18 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz

On Wed 20 Nov 2019 01:27:53 PM CET, Vladimir Semeeausntsov-Ogievskiy wrote:

> 3. Also, the latter way is inconsistent with discard. Discarded
> regions returns zeroes, not clusters from backing. I think discard and
> truncate should behave in the same safe zero way.

But then PREALLOC_MODE_OFF implies that the L2 metadata should be
preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at least
when there is a backing file.

Or maybe we just forbid PREALLOC_MODE_OFF during resize if there is a
backing file ?

Berto


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: qcow2 preallocation and backing files
  2019-11-20 15:18   ` Alberto Garcia
@ 2019-11-20 15:58     ` Vladimir Sementsov-Ogievskiy
  2019-11-20 16:35       ` Alberto Garcia
                         ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-20 15:58 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel; +Cc: Kevin Wolf, qemu-block, Max Reitz

20.11.2019 18:18, Alberto Garcia wrote:
> On Wed 20 Nov 2019 01:27:53 PM CET, Vladimir Semeeausntsov-Ogievskiy wrote:
> 
>> 3. Also, the latter way is inconsistent with discard. Discarded
>> regions returns zeroes, not clusters from backing. I think discard and
>> truncate should behave in the same safe zero way.
> 
> But then PREALLOC_MODE_OFF implies that the L2 metadata should be
> preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at least
> when there is a backing file.
> 
> Or maybe we just forbid PREALLOC_MODE_OFF during resize if there is a
> backing file ?
> 

Kevin proposed a fix that alters PREALLOC_MODE_OFF behavior if there is
a backing file, to allocate L2 metadata with ZERO clusters..

I don't think that it's the best thing to do, but it's already done, it works
and seems appropriate for rc3..

I see now, that change PREALLOC_MODE_OFF behavior may break things, first of
all qemu-img create, which creating UNALLOCATED qcow2 by default for years.

Still, I think that it would be safer to always ZERO expanded part of qcow2,
regardless of backing file..

We may add PREALLOC_MODE_ZERO, and use it in mirror, commit, and some other calls
to bdrv_truncate, except for qcow2 image creation of course.

Then, to improve this mode handling in qcow2, to not allocate all L2 tables, we
may add "zero" bit to L1 table entry.

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: qcow2 preallocation and backing files
  2019-11-20 15:58     ` Vladimir Sementsov-Ogievskiy
@ 2019-11-20 16:35       ` Alberto Garcia
  2019-11-20 16:46       ` Kevin Wolf
  2019-11-21  8:51       ` Max Reitz
  2 siblings, 0 replies; 8+ messages in thread
From: Alberto Garcia @ 2019-11-20 16:35 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz

On Wed 20 Nov 2019 04:58:38 PM CET, Vladimir Sementsov-Ogievskiy wrote:
>> But then PREALLOC_MODE_OFF implies that the L2 metadata should be
>> preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at
>> least when there is a backing file.
>
> Kevin proposed a fix that alters PREALLOC_MODE_OFF behavior if there
> is a backing file, to allocate L2 metadata with ZERO clusters..

Right, I just saw :)

Berto


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: qcow2 preallocation and backing files
  2019-11-20 15:58     ` Vladimir Sementsov-Ogievskiy
  2019-11-20 16:35       ` Alberto Garcia
@ 2019-11-20 16:46       ` Kevin Wolf
  2019-11-20 17:49         ` Vladimir Sementsov-Ogievskiy
  2019-11-21  8:51       ` Max Reitz
  2 siblings, 1 reply; 8+ messages in thread
From: Kevin Wolf @ 2019-11-20 16:46 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Alberto Garcia, qemu-devel, qemu-block, Max Reitz

Am 20.11.2019 um 16:58 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 20.11.2019 18:18, Alberto Garcia wrote:
> > On Wed 20 Nov 2019 01:27:53 PM CET, Vladimir Semeeausntsov-Ogievskiy wrote:
> > 
> >> 3. Also, the latter way is inconsistent with discard. Discarded
> >> regions returns zeroes, not clusters from backing. I think discard and
> >> truncate should behave in the same safe zero way.
> > 
> > But then PREALLOC_MODE_OFF implies that the L2 metadata should be
> > preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at least
> > when there is a backing file.
> > 
> > Or maybe we just forbid PREALLOC_MODE_OFF during resize if there is a
> > backing file ?
> > 
> 
> Kevin proposed a fix that alters PREALLOC_MODE_OFF behavior if there is
> a backing file, to allocate L2 metadata with ZERO clusters..
> 
> I don't think that it's the best thing to do, but it's already done,
> it works and seems appropriate for rc3..
> 
> I see now, that change PREALLOC_MODE_OFF behavior may break things,
> first of all qemu-img create, which creating UNALLOCATED qcow2 by
> default for years.

And it still does, because the backing file is added only after giving
the qcow2 image the right size.

But you're right, this is more accidental than by design. I wonder if
there are other problematic cases (and whether merging something like
this in -rc3 isn't rather risky).

> Still, I think that it would be safer to always ZERO expanded part of
> qcow2, regardless of backing file..
> 
> We may add PREALLOC_MODE_ZERO, and use it in mirror, commit, and some
> other calls to bdrv_truncate, except for qcow2 image creation of
> course.

What do we do with image formats that don't support zero clusters and
therefore can't provide PREALLOC_MODE_ZERO? Will commit just fail for
them?

> Then, to improve this mode handling in qcow2, to not allocate all L2
> tables, we may add "zero" bit to L1 table entry.

This would be an incompatible image format change that needs to be
explicitly enabled by the user. This might limit its usefulness a bit.

Kevin



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: qcow2 preallocation and backing files
  2019-11-20 16:46       ` Kevin Wolf
@ 2019-11-20 17:49         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 8+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-20 17:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Alberto Garcia, qemu-devel, qemu-block, Max Reitz

20.11.2019 19:46, Kevin Wolf wrote:
> Am 20.11.2019 um 16:58 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 20.11.2019 18:18, Alberto Garcia wrote:
>>> On Wed 20 Nov 2019 01:27:53 PM CET, Vladimir Semeeausntsov-Ogievskiy wrote:
>>>
>>>> 3. Also, the latter way is inconsistent with discard. Discarded
>>>> regions returns zeroes, not clusters from backing. I think discard and
>>>> truncate should behave in the same safe zero way.
>>>
>>> But then PREALLOC_MODE_OFF implies that the L2 metadata should be
>>> preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at least
>>> when there is a backing file.
>>>
>>> Or maybe we just forbid PREALLOC_MODE_OFF during resize if there is a
>>> backing file ?
>>>
>>
>> Kevin proposed a fix that alters PREALLOC_MODE_OFF behavior if there is
>> a backing file, to allocate L2 metadata with ZERO clusters..
>>
>> I don't think that it's the best thing to do, but it's already done,
>> it works and seems appropriate for rc3..
>>
>> I see now, that change PREALLOC_MODE_OFF behavior may break things,
>> first of all qemu-img create, which creating UNALLOCATED qcow2 by
>> default for years.
> 
> And it still does, because the backing file is added only after giving
> the qcow2 image the right size.
> 
> But you're right, this is more accidental than by design. I wonder if
> there are other problematic cases (and whether merging something like
> this in -rc3 isn't rather risky).
> 
>> Still, I think that it would be safer to always ZERO expanded part of
>> qcow2, regardless of backing file..
>>
>> We may add PREALLOC_MODE_ZERO, and use it in mirror, commit, and some
>> other calls to bdrv_truncate, except for qcow2 image creation of
>> course.
> 
> What do we do with image formats that don't support zero clusters and
> therefore can't provide PREALLOC_MODE_ZERO? Will commit just fail for
> them?


Hmm. consider committing to raw

                        x     y
qcow2 [----------------------]  - full of unallocated clusters
raw   [2987957285235298]        - full of data, but file is short


Before commit, data from [x,y] reads as zero. Therefore, we should zero
expanded part of base..

And this is for base of any format: [x,y] must be zero after commit. So,
if format can't do fast-zero, it should fallback to writing real zeros.

===

Hmm, actually after your patch all formats partly support PREALLOC_MDOE_ZERO,
which in the worst case is done by writing real zeros.

> 
>> Then, to improve this mode handling in qcow2, to not allocate all L2
>> tables, we may add "zero" bit to L1 table entry.
> 
> This would be an incompatible image format change that needs to be
> explicitly enabled by the user. This might limit its usefulness a bit.
> 

Yes, I understand this. Still it may make sense.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: qcow2 preallocation and backing files
  2019-11-20 15:58     ` Vladimir Sementsov-Ogievskiy
  2019-11-20 16:35       ` Alberto Garcia
  2019-11-20 16:46       ` Kevin Wolf
@ 2019-11-21  8:51       ` Max Reitz
  2 siblings, 0 replies; 8+ messages in thread
From: Max Reitz @ 2019-11-21  8:51 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, qemu-block


[-- Attachment #1.1: Type: text/plain, Size: 1910 bytes --]

On 20.11.19 16:58, Vladimir Sementsov-Ogievskiy wrote:
> 20.11.2019 18:18, Alberto Garcia wrote:
>> On Wed 20 Nov 2019 01:27:53 PM CET, Vladimir Semeeausntsov-Ogievskiy wrote:
>>
>>> 3. Also, the latter way is inconsistent with discard. Discarded
>>> regions returns zeroes, not clusters from backing. I think discard and
>>> truncate should behave in the same safe zero way.
>>
>> But then PREALLOC_MODE_OFF implies that the L2 metadata should be
>> preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at least
>> when there is a backing file.
>>
>> Or maybe we just forbid PREALLOC_MODE_OFF during resize if there is a
>> backing file ?
>>
> 
> Kevin proposed a fix that alters PREALLOC_MODE_OFF behavior if there is
> a backing file, to allocate L2 metadata with ZERO clusters..
> 
> I don't think that it's the best thing to do, but it's already done, it works
> and seems appropriate for rc3..
> 
> I see now, that change PREALLOC_MODE_OFF behavior may break things, first of
> all qemu-img create, which creating UNALLOCATED qcow2 by default for years.
> 
> Still, I think that it would be safer to always ZERO expanded part of qcow2,
> regardless of backing file..
> 
> We may add PREALLOC_MODE_ZERO, and use it in mirror, commit, and some other calls
> to bdrv_truncate, except for qcow2 image creation of course.

Well, the good news is that block_resize currently has no such parameter
and thus we could make a non-OFF value the default.

The bad news is that the reason it has no such parameter is that it
would need to be a job in order to support any form of preallocation.

So actually I don’t think we can make block_resize write zeroes to the
new region by default, because then it needs to be a job, and it just
isn’t.  (Of course, we can get to it through a deprecation cycle, but we
can’t “fix” block_resize directly.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-11-21  8:53 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-20 12:06 qcow2 preallocation and backing files Alberto Garcia
2019-11-20 12:27 ` Vladimir Sementsov-Ogievskiy
2019-11-20 15:18   ` Alberto Garcia
2019-11-20 15:58     ` Vladimir Sementsov-Ogievskiy
2019-11-20 16:35       ` Alberto Garcia
2019-11-20 16:46       ` Kevin Wolf
2019-11-20 17:49         ` Vladimir Sementsov-Ogievskiy
2019-11-21  8:51       ` Max Reitz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.