* qcow2 preallocation and backing files @ 2019-11-20 12:06 Alberto Garcia 2019-11-20 12:27 ` Vladimir Sementsov-Ogievskiy 0 siblings, 1 reply; 8+ messages in thread From: Alberto Garcia @ 2019-11-20 12:06 UTC (permalink / raw) To: qemu-devel; +Cc: Kevin Wolf, qemu-block, Max Reitz Hi, as we discussed yesterday on IRC there's an inconsistency in the way qcow2 preallocation works. Let's create an image and fill it with data: $ qemu-img create -f raw base.img 1M $ qemu-io -f raw -c 'write -P 0xFF 0 1M' base.img Now QEMU won't let us create a new image backed by base.img using preallocation: $ qemu-img create -f qcow2 -b base.img -o preallocation=metadata active.img qemu-img: active.img: Backing file and preallocation cannot be used at the same time The reason is that once a cluster is preallocated (i.e. it has a valid L2 entry pointing to a host offset) the guest won't see the contents of the backing file, so those options conflict with each other. It is possible however to create an image that is smaller than the backing file and then resize it using preallocation. In this case qemu-img will happily accept any --preallocation option, with different results from the guest's point of view: # This reads as 0xFF (the data comes from base.img) $ qemu-img create -f qcow2 -b base.img active.img 512K # The second half of the image also reads as 0xFF $ qemu-img resize --preallocation=off active.img 1M # Here the second half reads as zeroes $ qemu-img resize --preallocation=metadata active.img 1M Apart from "qemu-img resize", the QMP block-resize command can also extend an image like this, although it always uses PREALLOC_MODE_OFF and the user cannot change that. It does not seem right that the guest-visible data changes depending on the preallocation mode. This could be solved by returning an error when (backing_bs(blk_bs(blk)) && prealloc != PREALLOC_MODE_OFF) on img_resize(). The important question is however: what behavior is the right one? Should growing an image that was smaller than the backing file return zeroes, or data from the backing file? I would opt for the latter, for simplicity and consistency with the current behavior of block-resize, although it was pointed out that this could be a security problem (I'm not sure that I agree with that, but we can discuss it). This also has a consequence on how preallocation should be implemented for images with subclusters. Extended L2 entries allow us to allocate a cluster but leave each one of its subclusters unallocated. That would allow us to have a cluster that is simultaneously allocated but whose data is read from the backing file. But it's up to us to decide if that's what we should do when resizing an image. Berto ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: qcow2 preallocation and backing files 2019-11-20 12:06 qcow2 preallocation and backing files Alberto Garcia @ 2019-11-20 12:27 ` Vladimir Sementsov-Ogievskiy 2019-11-20 15:18 ` Alberto Garcia 0 siblings, 1 reply; 8+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2019-11-20 12:27 UTC (permalink / raw) To: Alberto Garcia, qemu-devel; +Cc: Kevin Wolf, qemu-block, Max Reitz 20.11.2019 15:06, Alberto Garcia wrote: > Hi, > > as we discussed yesterday on IRC there's an inconsistency in the way > qcow2 preallocation works. > > Let's create an image and fill it with data: > > $ qemu-img create -f raw base.img 1M > $ qemu-io -f raw -c 'write -P 0xFF 0 1M' base.img > > Now QEMU won't let us create a new image backed by base.img using > preallocation: > > $ qemu-img create -f qcow2 -b base.img -o preallocation=metadata active.img > qemu-img: active.img: Backing file and preallocation cannot be used at the same time > > The reason is that once a cluster is preallocated (i.e. it has a valid > L2 entry pointing to a host offset) the guest won't see the contents > of the backing file, so those options conflict with each other. > > It is possible however to create an image that is smaller than > the backing file and then resize it using preallocation. In this > case qemu-img will happily accept any --preallocation option, with > different results from the guest's point of view: > > # This reads as 0xFF (the data comes from base.img) > $ qemu-img create -f qcow2 -b base.img active.img 512K > > # The second half of the image also reads as 0xFF > $ qemu-img resize --preallocation=off active.img 1M > > # Here the second half reads as zeroes > $ qemu-img resize --preallocation=metadata active.img 1M > > Apart from "qemu-img resize", the QMP block-resize command can also > extend an image like this, although it always uses PREALLOC_MODE_OFF > and the user cannot change that. > > It does not seem right that the guest-visible data changes depending > on the preallocation mode. This could be solved by returning an error > when (backing_bs(blk_bs(blk)) && prealloc != PREALLOC_MODE_OFF) on > img_resize(). > > The important question is however: what behavior is the right one? > Should growing an image that was smaller than the backing file return > zeroes, or data from the backing file? I would opt for the latter, for > simplicity and consistency with the current behavior of block-resize, > although it was pointed out that this could be a security problem (I'm > not sure that I agree with that, but we can discuss it). I'm for zeros way. 1. I'm sure that if guest after some operation may get access to that data which it should not see, it's a security problem. 2. Seing backing file through new clusters is inconsistent with how read works: read will return zeroes, not data from backing. Consider the following example: 0 x y top: [---------------] mid: [---------] base:[111111111111111] reading from [x,y] from top will return zeroes, not ones. So, if we consider data after EOF as zeroes (not UNALLOCATED clusters), we should not make these clusters UNALLOCATED after truncation. 3. Also, the latter way is inconsistent with discard. Discarded regions returns zeroes, not clusters from backing. I think discard and truncate should behave in the same safe zero way. > > This also has a consequence on how preallocation should be implemented > for images with subclusters. Extended L2 entries allow us to allocate > a cluster but leave each one of its subclusters unallocated. That > would allow us to have a cluster that is simultaneously allocated but > whose data is read from the backing file. But it's up to us to decide > if that's what we should do when resizing an image. > > Berto > -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: qcow2 preallocation and backing files 2019-11-20 12:27 ` Vladimir Sementsov-Ogievskiy @ 2019-11-20 15:18 ` Alberto Garcia 2019-11-20 15:58 ` Vladimir Sementsov-Ogievskiy 0 siblings, 1 reply; 8+ messages in thread From: Alberto Garcia @ 2019-11-20 15:18 UTC (permalink / raw) To: Vladimir Sementsov-Ogievskiy, qemu-devel Cc: Kevin Wolf, qemu-block, Max Reitz On Wed 20 Nov 2019 01:27:53 PM CET, Vladimir Semeeausntsov-Ogievskiy wrote: > 3. Also, the latter way is inconsistent with discard. Discarded > regions returns zeroes, not clusters from backing. I think discard and > truncate should behave in the same safe zero way. But then PREALLOC_MODE_OFF implies that the L2 metadata should be preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at least when there is a backing file. Or maybe we just forbid PREALLOC_MODE_OFF during resize if there is a backing file ? Berto ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: qcow2 preallocation and backing files 2019-11-20 15:18 ` Alberto Garcia @ 2019-11-20 15:58 ` Vladimir Sementsov-Ogievskiy 2019-11-20 16:35 ` Alberto Garcia ` (2 more replies) 0 siblings, 3 replies; 8+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2019-11-20 15:58 UTC (permalink / raw) To: Alberto Garcia, qemu-devel; +Cc: Kevin Wolf, qemu-block, Max Reitz 20.11.2019 18:18, Alberto Garcia wrote: > On Wed 20 Nov 2019 01:27:53 PM CET, Vladimir Semeeausntsov-Ogievskiy wrote: > >> 3. Also, the latter way is inconsistent with discard. Discarded >> regions returns zeroes, not clusters from backing. I think discard and >> truncate should behave in the same safe zero way. > > But then PREALLOC_MODE_OFF implies that the L2 metadata should be > preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at least > when there is a backing file. > > Or maybe we just forbid PREALLOC_MODE_OFF during resize if there is a > backing file ? > Kevin proposed a fix that alters PREALLOC_MODE_OFF behavior if there is a backing file, to allocate L2 metadata with ZERO clusters.. I don't think that it's the best thing to do, but it's already done, it works and seems appropriate for rc3.. I see now, that change PREALLOC_MODE_OFF behavior may break things, first of all qemu-img create, which creating UNALLOCATED qcow2 by default for years. Still, I think that it would be safer to always ZERO expanded part of qcow2, regardless of backing file.. We may add PREALLOC_MODE_ZERO, and use it in mirror, commit, and some other calls to bdrv_truncate, except for qcow2 image creation of course. Then, to improve this mode handling in qcow2, to not allocate all L2 tables, we may add "zero" bit to L1 table entry. -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: qcow2 preallocation and backing files 2019-11-20 15:58 ` Vladimir Sementsov-Ogievskiy @ 2019-11-20 16:35 ` Alberto Garcia 2019-11-20 16:46 ` Kevin Wolf 2019-11-21 8:51 ` Max Reitz 2 siblings, 0 replies; 8+ messages in thread From: Alberto Garcia @ 2019-11-20 16:35 UTC (permalink / raw) To: Vladimir Sementsov-Ogievskiy, qemu-devel Cc: Kevin Wolf, qemu-block, Max Reitz On Wed 20 Nov 2019 04:58:38 PM CET, Vladimir Sementsov-Ogievskiy wrote: >> But then PREALLOC_MODE_OFF implies that the L2 metadata should be >> preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at >> least when there is a backing file. > > Kevin proposed a fix that alters PREALLOC_MODE_OFF behavior if there > is a backing file, to allocate L2 metadata with ZERO clusters.. Right, I just saw :) Berto ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: qcow2 preallocation and backing files 2019-11-20 15:58 ` Vladimir Sementsov-Ogievskiy 2019-11-20 16:35 ` Alberto Garcia @ 2019-11-20 16:46 ` Kevin Wolf 2019-11-20 17:49 ` Vladimir Sementsov-Ogievskiy 2019-11-21 8:51 ` Max Reitz 2 siblings, 1 reply; 8+ messages in thread From: Kevin Wolf @ 2019-11-20 16:46 UTC (permalink / raw) To: Vladimir Sementsov-Ogievskiy Cc: Alberto Garcia, qemu-devel, qemu-block, Max Reitz Am 20.11.2019 um 16:58 hat Vladimir Sementsov-Ogievskiy geschrieben: > 20.11.2019 18:18, Alberto Garcia wrote: > > On Wed 20 Nov 2019 01:27:53 PM CET, Vladimir Semeeausntsov-Ogievskiy wrote: > > > >> 3. Also, the latter way is inconsistent with discard. Discarded > >> regions returns zeroes, not clusters from backing. I think discard and > >> truncate should behave in the same safe zero way. > > > > But then PREALLOC_MODE_OFF implies that the L2 metadata should be > > preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at least > > when there is a backing file. > > > > Or maybe we just forbid PREALLOC_MODE_OFF during resize if there is a > > backing file ? > > > > Kevin proposed a fix that alters PREALLOC_MODE_OFF behavior if there is > a backing file, to allocate L2 metadata with ZERO clusters.. > > I don't think that it's the best thing to do, but it's already done, > it works and seems appropriate for rc3.. > > I see now, that change PREALLOC_MODE_OFF behavior may break things, > first of all qemu-img create, which creating UNALLOCATED qcow2 by > default for years. And it still does, because the backing file is added only after giving the qcow2 image the right size. But you're right, this is more accidental than by design. I wonder if there are other problematic cases (and whether merging something like this in -rc3 isn't rather risky). > Still, I think that it would be safer to always ZERO expanded part of > qcow2, regardless of backing file.. > > We may add PREALLOC_MODE_ZERO, and use it in mirror, commit, and some > other calls to bdrv_truncate, except for qcow2 image creation of > course. What do we do with image formats that don't support zero clusters and therefore can't provide PREALLOC_MODE_ZERO? Will commit just fail for them? > Then, to improve this mode handling in qcow2, to not allocate all L2 > tables, we may add "zero" bit to L1 table entry. This would be an incompatible image format change that needs to be explicitly enabled by the user. This might limit its usefulness a bit. Kevin ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: qcow2 preallocation and backing files 2019-11-20 16:46 ` Kevin Wolf @ 2019-11-20 17:49 ` Vladimir Sementsov-Ogievskiy 0 siblings, 0 replies; 8+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2019-11-20 17:49 UTC (permalink / raw) To: Kevin Wolf; +Cc: Alberto Garcia, qemu-devel, qemu-block, Max Reitz 20.11.2019 19:46, Kevin Wolf wrote: > Am 20.11.2019 um 16:58 hat Vladimir Sementsov-Ogievskiy geschrieben: >> 20.11.2019 18:18, Alberto Garcia wrote: >>> On Wed 20 Nov 2019 01:27:53 PM CET, Vladimir Semeeausntsov-Ogievskiy wrote: >>> >>>> 3. Also, the latter way is inconsistent with discard. Discarded >>>> regions returns zeroes, not clusters from backing. I think discard and >>>> truncate should behave in the same safe zero way. >>> >>> But then PREALLOC_MODE_OFF implies that the L2 metadata should be >>> preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at least >>> when there is a backing file. >>> >>> Or maybe we just forbid PREALLOC_MODE_OFF during resize if there is a >>> backing file ? >>> >> >> Kevin proposed a fix that alters PREALLOC_MODE_OFF behavior if there is >> a backing file, to allocate L2 metadata with ZERO clusters.. >> >> I don't think that it's the best thing to do, but it's already done, >> it works and seems appropriate for rc3.. >> >> I see now, that change PREALLOC_MODE_OFF behavior may break things, >> first of all qemu-img create, which creating UNALLOCATED qcow2 by >> default for years. > > And it still does, because the backing file is added only after giving > the qcow2 image the right size. > > But you're right, this is more accidental than by design. I wonder if > there are other problematic cases (and whether merging something like > this in -rc3 isn't rather risky). > >> Still, I think that it would be safer to always ZERO expanded part of >> qcow2, regardless of backing file.. >> >> We may add PREALLOC_MODE_ZERO, and use it in mirror, commit, and some >> other calls to bdrv_truncate, except for qcow2 image creation of >> course. > > What do we do with image formats that don't support zero clusters and > therefore can't provide PREALLOC_MODE_ZERO? Will commit just fail for > them? Hmm. consider committing to raw x y qcow2 [----------------------] - full of unallocated clusters raw [2987957285235298] - full of data, but file is short Before commit, data from [x,y] reads as zero. Therefore, we should zero expanded part of base.. And this is for base of any format: [x,y] must be zero after commit. So, if format can't do fast-zero, it should fallback to writing real zeros. === Hmm, actually after your patch all formats partly support PREALLOC_MDOE_ZERO, which in the worst case is done by writing real zeros. > >> Then, to improve this mode handling in qcow2, to not allocate all L2 >> tables, we may add "zero" bit to L1 table entry. > > This would be an incompatible image format change that needs to be > explicitly enabled by the user. This might limit its usefulness a bit. > Yes, I understand this. Still it may make sense. -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: qcow2 preallocation and backing files 2019-11-20 15:58 ` Vladimir Sementsov-Ogievskiy 2019-11-20 16:35 ` Alberto Garcia 2019-11-20 16:46 ` Kevin Wolf @ 2019-11-21 8:51 ` Max Reitz 2 siblings, 0 replies; 8+ messages in thread From: Max Reitz @ 2019-11-21 8:51 UTC (permalink / raw) To: Vladimir Sementsov-Ogievskiy, Alberto Garcia, qemu-devel Cc: Kevin Wolf, qemu-block [-- Attachment #1.1: Type: text/plain, Size: 1910 bytes --] On 20.11.19 16:58, Vladimir Sementsov-Ogievskiy wrote: > 20.11.2019 18:18, Alberto Garcia wrote: >> On Wed 20 Nov 2019 01:27:53 PM CET, Vladimir Semeeausntsov-Ogievskiy wrote: >> >>> 3. Also, the latter way is inconsistent with discard. Discarded >>> regions returns zeroes, not clusters from backing. I think discard and >>> truncate should behave in the same safe zero way. >> >> But then PREALLOC_MODE_OFF implies that the L2 metadata should be >> preallocated (all clusters should be QCOW2_CLUSTER_ZERO_PLAIN), at least >> when there is a backing file. >> >> Or maybe we just forbid PREALLOC_MODE_OFF during resize if there is a >> backing file ? >> > > Kevin proposed a fix that alters PREALLOC_MODE_OFF behavior if there is > a backing file, to allocate L2 metadata with ZERO clusters.. > > I don't think that it's the best thing to do, but it's already done, it works > and seems appropriate for rc3.. > > I see now, that change PREALLOC_MODE_OFF behavior may break things, first of > all qemu-img create, which creating UNALLOCATED qcow2 by default for years. > > Still, I think that it would be safer to always ZERO expanded part of qcow2, > regardless of backing file.. > > We may add PREALLOC_MODE_ZERO, and use it in mirror, commit, and some other calls > to bdrv_truncate, except for qcow2 image creation of course. Well, the good news is that block_resize currently has no such parameter and thus we could make a non-OFF value the default. The bad news is that the reason it has no such parameter is that it would need to be a job in order to support any form of preallocation. So actually I don’t think we can make block_resize write zeroes to the new region by default, because then it needs to be a job, and it just isn’t. (Of course, we can get to it through a deprecation cycle, but we can’t “fix” block_resize directly.) Max [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2019-11-21 8:53 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-11-20 12:06 qcow2 preallocation and backing files Alberto Garcia 2019-11-20 12:27 ` Vladimir Sementsov-Ogievskiy 2019-11-20 15:18 ` Alberto Garcia 2019-11-20 15:58 ` Vladimir Sementsov-Ogievskiy 2019-11-20 16:35 ` Alberto Garcia 2019-11-20 16:46 ` Kevin Wolf 2019-11-20 17:49 ` Vladimir Sementsov-Ogievskiy 2019-11-21 8:51 ` Max Reitz
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).