[Qemu-devel] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM
@ 2017-09-13 11:48 Darren Kenny
       [not found] ` <59B91DCA.5080405@oracle.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Darren Kenny @ 2017-09-13 11:48 UTC (permalink / raw)
  To: qemu-devel

Hi,

It was observed during some testing of Qemu 2.9 that it appeared that if you
resized a qcow2 block device while the VM is running, that an qemu-img check
would report that there were leaked clusters.

The steps to reproduce are:

- First create the test image:

     # /usr/bin/qemu-img create -f qcow2 test.qcow2 10G
     Formatting 'test.qcow2', fmt=qcow2 size=10737418240 encryption=off
     cluster_size=65536 lazy_refcounts=off refcount_bits=16

     # qemu-img check test.qcow2
     No errors were found on the image.

- Now run a VM based here on Oracle Linux 7, but the disto really isn't
   important here, since the test disk is not even mounted in the VM at this
   point in time:

     # /usr/bin/qemu-kvm \
         -name 'test-vm' \
         -monitor stdio  \
         -drive 
id=drive_image1,if=none,snapshot=on,format=qcow2,file=./ol73-64.qcow2 \
         -device 
virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x4 \
         -drive id=drive_test,if=none,format=qcow2,file=./stg.qcow2 \
         -device 
virtio-blk-pci,id=stg,drive=drive_test,bootindex=1,serial=TARGET_DISK0,bus=pci.0,addr=0x5 
\
         -net bridge,br=br1 -net 
nic,model=virtio,macaddr=52:54:00:90:91:92 \
         -m 4096  \
         -smp 2,maxcpus=2,cores=1,threads=1,sockets=2  \
         -vnc :0

- Resize the img size to 15g from qemu monitor (on stdio after above 
command)

     QEMU 2.5.0 monitor - type 'help' for more information
     (qemu) block_resize drive_test 15360

- Now, in a separate terminal, while leaving the VM running, check the img
   again from host side:

     # qemu-img check ./test.qcow2
     Leaked cluster 3 refcount=1 reference=0

     1 leaked clusters were found on the image.
     This means waste of disk space, but no harm to data.
     Image end offset: 327680

As it suggests above, this is not really corruption, but it is a bit
misleading, and could make people think there is an issue here
(hence the reason I've been asked to find a fix).

What I observed, then was that if I powered down the VM, or even just 
quit the
VM, that the subsequent check of the disk would say that everything was just
fine, and there no longer were leaked clusters.

In looking at the code in qcow2_truncate() it would appear that in the case
where prealloc has the value PREALLOC_MODE_OFF, that we don't flush the
metadata to disk - which seems to be the case here.

If I ignore the if test, and always execute the block in block/qcow2.c,
lines 3250 to 3258:

   if (prealloc != PREALLOC_MODE_OFF) {
       /* Flush metadata before actually changing the image size */
       ret = bdrv_flush(bs);
       if (ret < 0) {
           error_setg_errno(errp, -ret,
                            "Failed to flush the preallocated area to 
disk");
           return ret;
       }
   }

causing the flush to always be done, then the check will succeed when the VM
is still running.

While I know that this resolves the issue, I can only imagine that there was
some reason that this check for !PREALLOC_MODE_OFF was being done in the
first place.

So, I'm hoping that someone here might be able to explain to me why that 
check
is needed, but also why it might be wrong to do the flush regardless of the
value of prealloc here.

If it is wrong to do that flush here, then would anyone have suggestions 
as to
an alternative solution to this issue?

Thanks,

Darren.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM
       [not found] ` <59B91DCA.5080405@oracle.com>
@ 2017-09-13 12:20   ` Kevin Wolf
  2017-09-13 13:32     ` Darren Kenny
  0 siblings, 1 reply; 5+ messages in thread
From: Kevin Wolf @ 2017-09-13 12:20 UTC (permalink / raw)
  To: Darren Kenny; +Cc: qemu-block, qemu-devel

Am 13.09.2017 um 14:00 hat Darren Kenny geschrieben:
> [Cross-posted from qemu-devel, meant to send here first]

Just keep both lists in the CC for the same email.

> Hi,
> 
> It was observed during some testing of Qemu 2.9 that it appeared that if you
> resized a qcow2 block device while the VM is running, that an qemu-img check
> would report that there were leaked clusters.
> 
> The steps to reproduce are:
> 
> - First create the test image:
>   [...]
> 
> - Now run a VM based here on Oracle Linux 7, but the disto really isn't
>   important here, since the test disk is not even mounted in the VM at this
>   point in time:
>   [...]
> 
> - Resize the img size to 15g from qemu monitor (on stdio after above
> command)
> 
>     QEMU 2.5.0 monitor - type 'help' for more information
>     (qemu) block_resize drive_test 15360
> 
> - Now, in a separate terminal, while leaving the VM running, check the img
>   again from host side:
> 
>     # qemu-img check ./test.qcow2
>     Leaked cluster 3 refcount=1 reference=0
> 
>     1 leaked clusters were found on the image.
>     This means waste of disk space, but no harm to data.
>     Image end offset: 327680
> 
> As it suggests above, this is not really corruption, but it is a bit
> misleading, and could make people think there is an issue here
> (hence the reason I've been asked to find a fix).

There is an issue here, which is that you are accessing the image at the
same time from two separate processes. qemu is using writeback caches in
order to improve performance, so only after the guest has issued a flush
command to its disk or after you shut down or at least pause qemu, the
changes are fully written to the image file.

In qemu 2.10, you would probably see this instead:

$ qemu-img check ./test.qcow2
qemu-img: Could not open './test.qcow2': Failed to get shared "write" lock
Is another process using the image?

This lock can be overridden, but at least it shows clearly that you are
doing something that you probably shouldn't be doing.

> What I observed, then was that if I powered down the VM, or even just
> quit the VM, that the subsequent check of the disk would say that
> everything was just fine, and there no longer were leaked clusters.

Yes, quitting the VM writes out all of the cached data, so the on-disk
state and the in-memory state become fully consistent again.

> In looking at the code in qcow2_truncate() it would appear that in the case
> where prealloc has the value PREALLOC_MODE_OFF, that we don't flush the
> metadata to disk - which seems to be the case here.
> 
> If I ignore the if test, and always execute the block in block/qcow2.c,
> lines 3250 to 3258:
> 
>   if (prealloc != PREALLOC_MODE_OFF) {
>       /* Flush metadata before actually changing the image size */
>       ret = bdrv_flush(bs);
>       if (ret < 0) {
>           error_setg_errno(errp, -ret,
>                            "Failed to flush the preallocated area to disk");
>           return ret;
>       }
>   }
> 
> causing the flush to always be done, then the check will succeed when
> the VM is still running.
> 
> While I know that this resolves the issue, I can only imagine that
> there was some reason that this check for !PREALLOC_MODE_OFF was being
> done in the first place.
> 
> So, I'm hoping that someone here might be able to explain to me why
> that check is needed, but also why it might be wrong to do the flush
> regardless of the value of prealloc here.

Doing a flush here wouldn't be wrong, but it's also unnecessary and
would slow down the operation a bit.

For the preallocation case, the goal that is achieved with the flush is
that the header update with the new image size is only written if the
preallocation succeeded, so that in error cases you don't end up with an
image that seems to be resized, but isn't actually preallocated as
requested.

> If it is wrong to do that flush here, then would anyone have
> suggestions as to an alternative solution to this issue?

Don't call 'qemu-img check' (or basically any other operation on the
image) while a VM is using the image.

Kevin

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM
  2017-09-13 12:20   ` [Qemu-devel] [Qemu-block] " Kevin Wolf
@ 2017-09-13 13:32     ` Darren Kenny
  2017-09-13 14:07       ` Kevin Wolf
  0 siblings, 1 reply; 5+ messages in thread
From: Darren Kenny @ 2017-09-13 13:32 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block

Hi Kevin,

Thanks for getting back to me so quickly.

Kevin Wolf wrote:
> Am 13.09.2017 um 14:00 hat Darren Kenny geschrieben:
>> [Cross-posted from qemu-devel, meant to send here first]
>
> Just keep both lists in the CC for the same email.

Will do.
> There is an issue here, which is that you are accessing the image at 
> the same time from two separate processes. qemu is using writeback 
> caches in order to improve performance, so only after the guest has 
> issued a flush command to its disk or after you shut down or at least 
> pause qemu, the changes are fully written to the image file. In qemu 
> 2.10, you would probably see this instead: $ qemu-img check 
> ./test.qcow2 qemu-img: Could not open './test.qcow2': Failed to get 
> shared "write" lock Is another process using the image? This lock can 
> be overridden, but at least it shows clearly that you are doing 
> something that you probably shouldn't be doing.

Hmm, I've just updated to the HEAD of the Git repo, and I didn't see this
locking behaviour, it still did the same thing as before.

Does the disk need to be formatted/mounted before it's seen as locked?
Or even a configure option?

The version that have is:

     $ qemu-img --version
     qemu-img version 2.10.50 (v2.10.0-476-g04ef330)
     Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

     $ qemu-system-x86_64 --version
     QEMU emulator version 2.10.50 (v2.10.0-476-g04ef330)
     Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

The last commit I have is (as in the version string):

     04ef330 tcg/tci: do not use ldst label (never implemented)

>> What I observed, then was that if I powered down the VM, or even just
>> quit the VM, that the subsequent check of the disk would say that
>> everything was just fine, and there no longer were leaked clusters.
>
> Yes, quitting the VM writes out all of the cached data, so the on-disk
> state and the in-memory state become fully consistent again.

OK

>> In looking at the code in qcow2_truncate() it would appear that in the case
>> where prealloc has the value PREALLOC_MODE_OFF, that we don't flush the
>> metadata to disk - which seems to be the case here.
>>
>> If I ignore the if test, and always execute the block in block/qcow2.c,
>> lines 3250 to 3258:
>>
>>    if (prealloc != PREALLOC_MODE_OFF) {
>>        /* Flush metadata before actually changing the image size */
>>        ret = bdrv_flush(bs);
>>        if (ret<  0) {
>>            error_setg_errno(errp, -ret,
>>                             "Failed to flush the preallocated area to disk");
>>            return ret;
>>        }
>>    }
>>
>> causing the flush to always be done, then the check will succeed when
>> the VM is still running.
>>
>> While I know that this resolves the issue, I can only imagine that
>> there was some reason that this check for !PREALLOC_MODE_OFF was being
>> done in the first place.
>>
>> So, I'm hoping that someone here might be able to explain to me why
>> that check is needed, but also why it might be wrong to do the flush
>> regardless of the value of prealloc here.
>
> Doing a flush here wouldn't be wrong, but it's also unnecessary and
> would slow down the operation a bit.
Sure, but how often does a resize/truncate get done? Would seem like a
small impact to do it - but I agree w.r.t. the single-process access
as a better solution.
> For the preallocation case, the goal that is achieved with the flush is
> that the header update with the new image size is only written if the
> preallocation succeeded, so that in error cases you don't end up with an
> image that seems to be resized, but isn't actually preallocated as
> requested.
Understood, thanks.
>> If it is wrong to do that flush here, then would anyone have
>> suggestions as to an alternative solution to this issue?
>
> Don't call 'qemu-img check' (or basically any other operation on the
> image) while a VM is using the image.
Heh, sure - if the locking was working as you describe then I think
I would be fine with that :)

Thanks,

Darren.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM
  2017-09-13 13:32     ` Darren Kenny
@ 2017-09-13 14:07       ` Kevin Wolf
  2017-09-13 15:33         ` Darren Kenny
  0 siblings, 1 reply; 5+ messages in thread
From: Kevin Wolf @ 2017-09-13 14:07 UTC (permalink / raw)
  To: Darren Kenny; +Cc: qemu-devel, qemu-block

Am 13.09.2017 um 15:32 hat Darren Kenny geschrieben:
> Hi Kevin,
> 
> Thanks for getting back to me so quickly.
> 
> Kevin Wolf wrote:
> > Am 13.09.2017 um 14:00 hat Darren Kenny geschrieben:
> > > [Cross-posted from qemu-devel, meant to send here first]
> > 
> > Just keep both lists in the CC for the same email.
> 
> Will do.
> > There is an issue here, which is that you are accessing the image at the
> > same time from two separate processes. qemu is using writeback caches in
> > order to improve performance, so only after the guest has issued a flush
> > command to its disk or after you shut down or at least pause qemu, the
> > changes are fully written to the image file. In qemu 2.10, you would
> > probably see this instead: $ qemu-img check ./test.qcow2 qemu-img: Could
> > not open './test.qcow2': Failed to get shared "write" lock Is another
> > process using the image? This lock can be overridden, but at least it
> > shows clearly that you are doing something that you probably shouldn't
> > be doing.
> 
> Hmm, I've just updated to the HEAD of the Git repo, and I didn't see this
> locking behaviour, it still did the same thing as before.
> 
> Does the disk need to be formatted/mounted before it's seen as locked?
> Or even a configure option?
> 
> The version that have is:
> 
>     $ qemu-img --version
>     qemu-img version 2.10.50 (v2.10.0-476-g04ef330)
>     Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
> 
>     $ qemu-system-x86_64 --version
>     QEMU emulator version 2.10.50 (v2.10.0-476-g04ef330)
>     Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
> 
> The last commit I have is (as in the version string):
> 
>     04ef330 tcg/tci: do not use ldst label (never implemented)

This should have the locking code. It only works with relatively new
Linux kernels, though (it needs F_OFD_SETLK support). If you don't have
that, no locking is used even in qemu 2.10.

You could try enforcing some locking by adding file.locking=on to your
-drive option. If you're running an old kernel, this should print a
warning message (and use some less safe locking variant).

> > Doing a flush here wouldn't be wrong, but it's also unnecessary and
> > would slow down the operation a bit.
> Sure, but how often does a resize/truncate get done? Would seem like a
> small impact to do it - but I agree w.r.t. the single-process access
> as a better solution.

The thing is, truncate isn't the only operation that will lead to
qemu-img check reporting failure. Any cluster allocation in the image
can cause the same symptom, and there it is actually very important for
performance that we use the cache and do a batched write only later.

So changing truncate so that this specific operation looks as if
accessing the image from a second process were okay wouldn't actually
make a big difference for the overall state. Maybe it's in fact better
to have such attempts fail consistently.

Kevin

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM
  2017-09-13 14:07       ` Kevin Wolf
@ 2017-09-13 15:33         ` Darren Kenny
  0 siblings, 0 replies; 5+ messages in thread
From: Darren Kenny @ 2017-09-13 15:33 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block

Kevin Wolf wrote:
> Am 13.09.2017 um 15:32 hat Darren Kenny geschrieben:
>> Hi Kevin,
>>
>> Thanks for getting back to me so quickly.
>>
>> Kevin Wolf wrote:
>>> Am 13.09.2017 um 14:00 hat Darren Kenny geschrieben:
>>>> [Cross-posted from qemu-devel, meant to send here first]
>>> Just keep both lists in the CC for the same email.
>> Will do.
>>> There is an issue here, which is that you are accessing the image at the
>>> same time from two separate processes. qemu is using writeback caches in
>>> order to improve performance, so only after the guest has issued a flush
>>> command to its disk or after you shut down or at least pause qemu, the
>>> changes are fully written to the image file. In qemu 2.10, you would
>>> probably see this instead: $ qemu-img check ./test.qcow2 qemu-img: Could
>>> not open './test.qcow2': Failed to get shared "write" lock Is another
>>> process using the image? This lock can be overridden, but at least it
>>> shows clearly that you are doing something that you probably shouldn't
>>> be doing.
>> Hmm, I've just updated to the HEAD of the Git repo, and I didn't see this
>> locking behaviour, it still did the same thing as before.
>>
>> Does the disk need to be formatted/mounted before it's seen as locked?
>> Or even a configure option?
>>
>> The version that have is:
>>
>>      $ qemu-img --version
>>      qemu-img version 2.10.50 (v2.10.0-476-g04ef330)
>>      Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
>>
>>      $ qemu-system-x86_64 --version
>>      QEMU emulator version 2.10.50 (v2.10.0-476-g04ef330)
>>      Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
>>
>> The last commit I have is (as in the version string):
>>
>>      04ef330 tcg/tci: do not use ldst label (never implemented)
>
> This should have the locking code. It only works with relatively new
> Linux kernels, though (it needs F_OFD_SETLK support). If you don't have
> that, no locking is used even in qemu 2.10.
>
> You could try enforcing some locking by adding file.locking=on to your
> -drive option. If you're running an old kernel, this should print a
> warning message (and use some less safe locking variant).
Ah, OK - I will need to look into that.

>>> Doing a flush here wouldn't be wrong, but it's also unnecessary and
>>> would slow down the operation a bit.
>> Sure, but how often does a resize/truncate get done? Would seem like a
>> small impact to do it - but I agree w.r.t. the single-process access
>> as a better solution.
>
> The thing is, truncate isn't the only operation that will lead to
> qemu-img check reporting failure. Any cluster allocation in the image
> can cause the same symptom, and there it is actually very important for
> performance that we use the cache and do a batched write only later.
>
> So changing truncate so that this specific operation looks as if
> accessing the image from a second process were okay wouldn't actually
> make a big difference for the overall state. Maybe it's in fact better
> to have such attempts fail consistently.
>
Thanks for the explanation, and I agree that consistency is usually best.

Thanks,

Darren.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-09-13 15:33 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-13 11:48 [Qemu-devel] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM Darren Kenny
     [not found] ` <59B91DCA.5080405@oracle.com>
2017-09-13 12:20   ` [Qemu-devel] [Qemu-block] " Kevin Wolf
2017-09-13 13:32     ` Darren Kenny
2017-09-13 14:07       ` Kevin Wolf
2017-09-13 15:33         ` Darren Kenny

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.