* [Qemu-devel] qcow2, lazy_refcounts and killing qemu
@ 2014-08-30 14:53 Richard W.M. Jones
  2014-09-01 12:41 ` Greg Kurz
  2014-09-05 15:39 ` Stefan Hajnoczi
  0 siblings, 2 replies; 11+ messages in thread
From: Richard W.M. Jones @ 2014-08-30 14:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf

I found out a few days ago that if you:

(1) Open a qcow2 file that has lazy_refcounts = on and a backing file, and

(2) Write lots of stuff, and

(3) Kill qemu with SIGTERM [which I believed, maybe incorrectly, is a
"nice" way to kill qemu]

.. then you can end up with a corrupt qcow2 file.  In particular the
qcow2 file sometimes forgot that it had a backing file, but I suspect
this was just a symptom and in fact the qcow2 file header wasn't being
written to disk correctly.
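
One way to test the suspicion about the header is to decode it directly instead of relying on higher-level tools. The following is a minimal sketch, not code from QEMU. It assumes the field offsets of the published qcow2 format (big-endian; backing file offset at byte 8, backing file size at byte 16; in version 3 images, incompatible features at byte 72 with bit 0 meaning "dirty", and compatible features at byte 80 with bit 0 meaning "lazy refcounts"). The function name `read_qcow2_header` is made up for illustration, and the offsets are worth checking against the qcow2 specification before relying on them.

```python
import struct

QCOW2_MAGIC = 0x514649FB  # the bytes "QFI\xfb"


def read_qcow2_header(data):
    """Decode a few fixed header fields from the first bytes of a qcow2 image.

    `data` must contain at least the header and, if present, the backing
    file name (which lives in the first cluster of the image).
    """
    magic, version, backing_off, backing_len = struct.unpack_from(">IIQI", data, 0)
    if magic != QCOW2_MAGIC:
        raise ValueError("not a qcow2 image")

    info = {
        "version": version,
        "backing_file": None,
        "dirty": False,
        "lazy_refcounts": False,
    }
    # A backing file offset of 0 means the image has no backing file.
    if backing_off != 0:
        name = data[backing_off:backing_off + backing_len]
        info["backing_file"] = name.decode("utf-8", "replace")
    # Feature bitmaps only exist in version 3 headers.
    if version >= 3:
        incompat, compat = struct.unpack_from(">QQ", data, 72)
        info["dirty"] = bool(incompat & 1)
        info["lazy_refcounts"] = bool(compat & 1)
    return info


# Hypothetical usage (the file name is only an example):
#   with open("overlay.qcow2", "rb") as f:
#       print(read_qcow2_header(f.read(65536)))
```

If the backing file name has been lost, `backing_file` comes back as `None`; a `dirty` value of `True` after qemu has exited would suggest the image was not closed cleanly.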

Is it correct that sending SIGTERM to qemu should kill it cleanly, or
is that no longer the case, or is lazy_refcounts a special case, or
have I found a bug?

I can reproduce this easily, although of course the reproducer will
involve libguestfs.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

* Re: [Qemu-devel] qcow2, lazy_refcounts and killing qemu
  2014-08-30 14:53 [Qemu-devel] qcow2, lazy_refcounts and killing qemu Richard W.M. Jones
@ 2014-09-01 12:41 ` Greg Kurz
  2014-09-01 13:07   ` Richard W.M. Jones
  2014-09-01 14:19   ` Richard W.M. Jones
  2014-09-05 15:39 ` Stefan Hajnoczi
  1 sibling, 2 replies; 11+ messages in thread
From: Greg Kurz @ 2014-09-01 12:41 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: kwolf, qemu-devel

On Sat, 30 Aug 2014 15:53:13 +0100
"Richard W.M. Jones" <rjones@redhat.com> wrote:
> I found out a few days ago that if you:
> 
> (1) Open a qcow2 file that has lazy_refcounts = on and a backing file, and
> 
> (2) Write lots of stuff, and
> 
> (3) Kill qemu with SIGTERM [which I believed, maybe incorrectly, is a
> "nice" way to kill qemu]
> 
> .. then you can end up with a corrupt qcow2 file.  In particular the
> qcow2 file sometimes forgot that it had a backing file, but I suspect
> this was just a symptom and in fact the qcow2 file header wasn't being
> written to disk correctly.
> 

Hi Rich,

Someone in IBM hit a very similar issue with PowerKVM a few months ago.
The symptom was a corrupted filesystem in a qcow2 file. The steps
involved killing the QEMU process while the guest OS was shutting down.
Unfortunately, no easy reproducer could be found and investigations
halted...

> Is it correct that sending SIGTERM to qemu should kill it cleanly, or
> is that no longer the case, or is lazy_refcounts a special case, or
> have I found a bug?
> 

QEMU catches SIGTERM and calls bdrv_close(), so I would lean towards it
being a bug or an undocumented limitation (hence a documentation bug :)
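
The general pattern described here (catch SIGTERM, then close the image properly before exiting) can be sketched as follows. This is an illustration of the pattern only, not QEMU's implementation; the class `TermAwareWriter` and its methods are hypothetical names. The handler only sets a flag, and the flush, fsync and close happen in the normal control flow once the write loop notices the flag.

```python
import os
import signal


class TermAwareWriter:
    """Writes blocks to a file and closes it cleanly when asked to stop."""

    def __init__(self, path):
        self.f = open(path, "wb")
        self.stop = False

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.on_sigterm)

    def on_sigterm(self, signum, frame):
        # Only record the request; the real cleanup happens in run().
        self.stop = True

    def run(self, max_writes):
        written = 0
        try:
            while not self.stop and written < max_writes:
                self.f.write(b"\0" * 512)
                written += 1
        finally:
            # Clean shutdown path: push data to disk before closing.
            self.f.flush()
            os.fsync(self.f.fileno())
            self.f.close()
        return written
```

With this structure a SIGTERM never interrupts a write halfway through the cleanup; if metadata still ends up inconsistent after such a shutdown, the close path itself is the place to look.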

> I can reproduce this easily, although of course the reproducer will
> involve libguestfs.
> 
> Rich.
> 

Can you share this reproducer?

Cheers.

--
Greg

* Re: [Qemu-devel] qcow2, lazy_refcounts and killing qemu
  2014-09-01 12:41 ` Greg Kurz
@ 2014-09-01 13:07   ` Richard W.M. Jones
  2014-09-01 14:19   ` Richard W.M. Jones
  1 sibling, 0 replies; 11+ messages in thread
From: Richard W.M. Jones @ 2014-09-01 13:07 UTC (permalink / raw)
  To: Greg Kurz; +Cc: kwolf, qemu-devel

On Mon, Sep 01, 2014 at 02:41:02PM +0200, Greg Kurz wrote:
> On Sat, 30 Aug 2014 15:53:13 +0100
> "Richard W.M. Jones" <rjones@redhat.com> wrote:
> > I can reproduce this easily, although of course the reproducer will
> > involve libguestfs.
> > 
> > Rich.
> > 
> 
> Can you share this reproducer?

The immediate reproducer (not very useful for you) is virt-v2v, if you
enable lazy_refcounts by hacking the call to qemu-img here:

https://github.com/libguestfs/libguestfs/blob/master/v2v/v2v.ml#L102

I'll try to come up with an actual reproducer, but I stress it's still
going to use libguestfs because that's the only sane way to run qemu
for the purposes of this test.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

* Re: [Qemu-devel] qcow2, lazy_refcounts and killing qemu
  2014-09-01 12:41 ` Greg Kurz
  2014-09-01 13:07   ` Richard W.M. Jones
@ 2014-09-01 14:19   ` Richard W.M. Jones
  2014-09-01 14:23     ` Richard W.M. Jones
  2014-09-01 14:30     ` Greg Kurz
  1 sibling, 2 replies; 11+ messages in thread
From: Richard W.M. Jones @ 2014-09-01 14:19 UTC (permalink / raw)
  To: Greg Kurz; +Cc: kwolf, qemu-devel

A test case, attached.

Note that you have to look at the output of the final qemu-img info
command.  In the case where it goes wrong, the 'backing file:' and
'backing file format:' lines disappear completely.  In the case where
the bug is not reproduced, these lines are still present.

It's 100% reproducible for me when lazy_refcounts=on, and 0%
reproducible when lazy_refcounts=off.

BUT it only occurs if the backing file is a remote source (nbd:... in
this case), not if the backing file is a plain file.  Make of that
what you will.
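
The pass/fail check described above (are the 'backing file:' and 'backing file format:' lines still present in the qemu-img info output?) can be automated. The following is a rough sketch and not part of the attached script; the helper names `backing_fields` and `qemu_img_info` are invented here, and it assumes the human-readable output uses "key: value" lines with exactly those two labels.

```python
import subprocess


def backing_fields(info_text):
    """Return the backing-file lines found in `qemu-img info` text output."""
    fields = {}
    for line in info_text.splitlines():
        # Split on the first colon only, since values such as nbd:...
        # contain colons themselves.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in ("backing file", "backing file format"):
            fields[key] = value.strip()
    return fields


def qemu_img_info(path):
    """Run `qemu-img info` on an image and return its text output."""
    result = subprocess.run(
        ["qemu-img", "info", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Hypothetical usage: an empty dict means the bug was reproduced.
#   print(backing_fields(qemu_img_info("overlay.qcow2")))
```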

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

[-- Attachment #2: qemu-lazy-refcounts.sh --]
[-- Type: application/x-sh, Size: 835 bytes --]

* Re: [Qemu-devel] qcow2, lazy_refcounts and killing qemu
  2014-09-01 14:19   ` Richard W.M. Jones
@ 2014-09-01 14:23     ` Richard W.M. Jones
  2014-09-01 14:30     ` Greg Kurz
  1 sibling, 0 replies; 11+ messages in thread
From: Richard W.M. Jones @ 2014-09-01 14:23 UTC (permalink / raw)
  To: Greg Kurz; +Cc: kwolf, qemu-devel

> # Write stuff to the overlay.
> guestfish <<EOF
>   add-drive overlay.qcow2 format:qcow2 cachemode:unsafe

To head off any suggestions, removing cachemode:unsafe doesn't fix it.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

* Re: [Qemu-devel] qcow2, lazy_refcounts and killing qemu
  2014-09-01 14:19   ` Richard W.M. Jones
  2014-09-01 14:23     ` Richard W.M. Jones
@ 2014-09-01 14:30     ` Greg Kurz
  1 sibling, 0 replies; 11+ messages in thread
From: Greg Kurz @ 2014-09-01 14:30 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: kwolf, qemu-devel

On Mon, 1 Sep 2014 15:19:28 +0100
"Richard W.M. Jones" <rjones@redhat.com> wrote:

> A test case, attached.
> 
> Note that you have to look at the output of the final qemu-img info
> command.  In the case where it goes wrong, the 'backing file:' and
> 'backing file format:' lines disappear completely.  In the case where
> the bug is not reproduced, these lines are still present.
> 
> It's 100% reproducible for me when lazy_refcounts=on, and 0%
> reproducible when lazy_refcounts=off.
> 
> BUT it only occurs if the backing file is a remote source (nbd:... in
> this case), not if the backing file is a plain file.  Make of that
> what you will.
> 
> Rich.
> 

Thanks. I'll try that.

-- 
Gregory Kurz                                     kurzgreg@fr.ibm.com
                                                 gkurz@linux.vnet.ibm.com
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)562 165 496

"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.

* Re: [Qemu-devel] qcow2, lazy_refcounts and killing qemu
  2014-08-30 14:53 [Qemu-devel] qcow2, lazy_refcounts and killing qemu Richard W.M. Jones
  2014-09-01 12:41 ` Greg Kurz
@ 2014-09-05 15:39 ` Stefan Hajnoczi
  2014-09-05 17:41   ` Richard W.M. Jones
  1 sibling, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2014-09-05 15:39 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: kwolf, qemu-devel

On Sat, Aug 30, 2014 at 03:53:13PM +0100, Richard W.M. Jones wrote:
> I found out a few days ago that if you:
> 
> (1) Open a qcow2 file that has lazy_refcounts = on and a backing file, and
> 
> (2) Write lots of stuff, and
> 
> (3) Kill qemu with SIGTERM [which I believed, maybe incorrectly, is a
> "nice" way to kill qemu]
> 
> .. then you can end up with a corrupt qcow2 file.  In particular the
> qcow2 file sometimes forgot that it had a backing file, but I suspect
> this was just a symptom and in fact the qcow2 file header wasn't being
> written to disk correctly.
> 
> Is it correct that sending SIGTERM to qemu should kill it cleanly, or
> is that no longer the case, or is lazy_refcounts a special case, or
> have I found a bug?
> 
> I can reproduce this easily, although of course the reproducer will
> involve libguestfs.

That is very interesting, thanks for posting.

Did you try older QEMU versions?  I'm curious if this is something that
crept in later or is fundamentally broken in lazy_refcounts=on.

Stefan

* Re: [Qemu-devel] qcow2, lazy_refcounts and killing qemu
  2014-09-05 15:39 ` Stefan Hajnoczi
@ 2014-09-05 17:41   ` Richard W.M. Jones
  2014-09-08  7:16     ` Markus Armbruster
  2014-09-08  9:57     ` Stefan Hajnoczi
  0 siblings, 2 replies; 11+ messages in thread
From: Richard W.M. Jones @ 2014-09-05 17:41 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

On Fri, Sep 05, 2014 at 04:39:51PM +0100, Stefan Hajnoczi wrote:
> Did you try older QEMU versions?  I'm curious if this is something that
> crept in later or is fundamentally broken in lazy_refcounts=on.

At your prompting, I've done a bit more investigation.

I was basing my observations on qemu 2.1.0.  However I tried my test
against qemu from git today and the bug has gone.  Good!

For my entertainment, I bisected the problem, and the commit which
*fixes* it is:

  commit 91af7014125895cc74141be6b60f3a3e882ed743
  Author: Max Reitz <mreitz@redhat.com>
  Date:   Fri Jul 18 20:24:56 2014 +0200

    block: Add bdrv_refresh_filename()

I didn't believe this either, but I have checked the result manually
and I'm pretty sure that whatever this commit does, it does end up
fixing the lazy_refcounts problem as a side-effect.
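
Bisecting for the commit that fixes a bug is the usual bisection with the roles of the two ends swapped: the old end is broken and the new end is fixed. A minimal sketch of the search follows. It is not the procedure actually used here, and it assumes that the newest commit is fixed and that once the fix lands every later commit stays fixed. `first_fixed` and `is_fixed` are hypothetical names; `is_fixed` stands for building that commit and running the test script against it.

```python
def first_fixed(commits, is_fixed):
    """Return the earliest commit for which is_fixed(commit) is true.

    `commits` is ordered oldest to newest, and the last entry is assumed
    to be fixed. Needs about log2(len(commits)) test runs.
    """
    lo, hi = 0, len(commits) - 1
    # Invariant: commits[hi] is fixed, and every commit before lo is broken.
    while lo < hi:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid        # the fix is at mid or earlier
        else:
            lo = mid + 1    # the fix is after mid
    return commits[lo]
```

If the monotonicity assumption does not hold (for example, the bug comes and goes between commits), the result is only one of possibly several transition points, which is a good reason to check the identified commit manually.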

----------------------------------------------------------------------

I've updated the test script (attached), so you can now run it against
your own qemu, and also to improve the output.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

[-- Attachment #2: qemu-lazy-refcounts.sh --]
[-- Type: application/x-sh, Size: 1103 bytes --]

* Re: [Qemu-devel] qcow2, lazy_refcounts and killing qemu
  2014-09-05 17:41   ` Richard W.M. Jones
@ 2014-09-08  7:16     ` Markus Armbruster
  2014-09-08 18:29       ` Max Reitz
  2014-09-08  9:57     ` Stefan Hajnoczi
  1 sibling, 1 reply; 11+ messages in thread
From: Markus Armbruster @ 2014-09-08  7:16 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: kwolf, Stefan Hajnoczi, qemu-devel, Max Reitz

"Richard W.M. Jones" <rjones@redhat.com> writes:

> On Fri, Sep 05, 2014 at 04:39:51PM +0100, Stefan Hajnoczi wrote:
>> Did you try older QEMU versions?  I'm curious if this is something that
>> crept in later or is fundamentally broken in lazy_refcounts=on.
>
> At your prompting, I've done a bit more investigation.
>
> I was basing my observations on qemu 2.1.0.  However I tried my test
> against qemu from git today and the bug has gone.  Good!
>
> For my entertainment, I bisected the problem, and the commit which
> *fixes* it is:
>
>   commit 91af7014125895cc74141be6b60f3a3e882ed743
>   Author: Max Reitz <mreitz@redhat.com>
>   Date:   Fri Jul 18 20:24:56 2014 +0200
>
>     block: Add bdrv_refresh_filename()
>
> I didn't believe this either, but I have checked the result manually
> and I'm pretty sure that whatever this commit does, it does end up
> fixing the lazy_refcounts problem as a side-effect.

Weird.  Maybe Max (cc'ed) has an idea.

> ----------------------------------------------------------------------
>
> I've updated the test script (attached), so you can now run it against
> your own qemu, and also to improve the output.

* Re: [Qemu-devel] qcow2, lazy_refcounts and killing qemu
  2014-09-05 17:41   ` Richard W.M. Jones
  2014-09-08  7:16     ` Markus Armbruster
@ 2014-09-08  9:57     ` Stefan Hajnoczi
  1 sibling, 0 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2014-09-08  9:57 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: kwolf, qemu-devel

On Fri, Sep 05, 2014 at 06:41:37PM +0100, Richard W.M. Jones wrote:
> On Fri, Sep 05, 2014 at 04:39:51PM +0100, Stefan Hajnoczi wrote:
> > Did you try older QEMU versions?  I'm curious if this is something that
> > crept in later or is fundamentally broken in lazy_refcounts=on.
> 
> At your prompting, I've done a bit more investigation.
> 
> I was basing my observations on qemu 2.1.0.  However I tried my test
> against qemu from git today and the bug has gone.  Good!
> 
> For my entertainment, I bisected the problem, and the commit which
> *fixes* it is:
> 
>   commit 91af7014125895cc74141be6b60f3a3e882ed743
>   Author: Max Reitz <mreitz@redhat.com>
>   Date:   Fri Jul 18 20:24:56 2014 +0200
> 
>     block: Add bdrv_refresh_filename()
> 
> I didn't believe this either, but I have checked the result manually
> and I'm pretty sure that whatever this commit does, it does end up
> fixing the lazy_refcounts problem as a side-effect.

Thanks!

* Re: [Qemu-devel] qcow2, lazy_refcounts and killing qemu
  2014-09-08  7:16     ` Markus Armbruster
@ 2014-09-08 18:29       ` Max Reitz
  0 siblings, 0 replies; 11+ messages in thread
From: Max Reitz @ 2014-09-08 18:29 UTC (permalink / raw)
  To: Markus Armbruster, Richard W.M. Jones; +Cc: kwolf, Stefan Hajnoczi, qemu-devel

On 08.09.2014 09:16, Markus Armbruster wrote:
> "Richard W.M. Jones" <rjones@redhat.com> writes:
>
>> On Fri, Sep 05, 2014 at 04:39:51PM +0100, Stefan Hajnoczi wrote:
>>> Did you try older QEMU versions?  I'm curious if this is something that
>>> crept in later or is fundamentally broken in lazy_refcounts=on.
>> At your prompting, I've done a bit more investigation.
>>
>> I was basing my observations on qemu 2.1.0.  However I tried my test
>> against qemu from git today and the bug has gone.  Good!
>>
>> For my entertainment, I bisected the problem, and the commit which
>> *fixes* it is:
>>
>>    commit 91af7014125895cc74141be6b60f3a3e882ed743
>>    Author: Max Reitz <mreitz@redhat.com>
>>    Date:   Fri Jul 18 20:24:56 2014 +0200
>>
>>      block: Add bdrv_refresh_filename()
>>
>> I didn't believe this either, but I have checked the result manually
>> and I'm pretty sure that whatever this commit does, it does end up
>> fixing the lazy_refcounts problem as a side-effect.
> Weird.  Maybe Max (cc'ed) has an idea.

No, not really. The only remotely related effect I can imagine is that 
this patch allows a BDS to remember when it has been opened with runtime 
options differing from the standard (e.g. forcing lazy_refcounts=on for 
a qcow2 image without that flag set), but it seems this is not the case 
here. Also, this should even be ignored in most cases.

Other than that, the main intention for this commit is to allow block 
drivers to reconstruct a filename based solely on the BDS, i.e. a 
filename which more or less recreates the same BDS when opened. The 
filename in the BDS will therefore no longer be necessarily the one 
originally used for opening it.

Max
