From: Marc Bevand
Date: Thu, 12 Feb 2009 23:50:30 -0800
Subject: Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change
To: qemu-devel@nongnu.org

Forwarding a post I sent to the kvm ML:

I have a massive KVM installation with hundreds of guests running dozens of different OSes, and have also noticed multiple qcow2 corruption bugs. All my guests are using the qcow2 format, and my hosts are running vanilla Linux 2.6.28 x86_64 kernels and use NPT (Opteron 'Barcelona' 23xx processors).

My Windows 2000 guests BSOD just like yours with kvm-73 or newer. I have to run kvm-75 (I need the NPT fixes it contains) with block-qcow2.c reverted to the version from kvm-72 to fix the BSOD.

kvm-73+ also causes some of my Windows 2003 guests to exhibit this exact registry corruption error:
http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2001452&group_id=180599
This bug is also fixed by reverting block-qcow2.c to the version from kvm-72.

I tested kvm-81 and kvm-83 as well (can't test kvm-80 or older because of the qcow2 performance regression caused by the default writethrough caching policy), but they randomly trigger an even worse bug: the moment I shut down a guest by typing "quit" in the monitor, kvm sometimes overwrites the first 4kB of the disk image with mostly NUL bytes (!), which completely destroys it. I am familiar with the qcow2 format, and this 4kB block appears to be an L2 table with most entries set to zero. I have had to restore at least 6 or 7 disk images from backup after occurrences of that bug.

My intuition tells me this may be the qcow2 code trying to allocate a cluster to write a new L2 table, but not noticing that the allocation failed (represented by a 0 offset), and writing the L2 table at that 0 offset, overwriting the qcow2 header.

Fortunately this bug is also fixed by running kvm-75 with block-qcow2.c reverted to its kvm-72 version.

Basically qcow2 in kvm-73 or newer is completely unreliable.

-marc
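
To make the suspected failure mode above concrete, here is a minimal toy sketch in Python. It is not the block-qcow2.c code and does not claim to reproduce the actual bug. The function names (header_intact, make_image, alloc_cluster, write_l2_unchecked, write_l2_checked) and the in-memory "image" are all hypothetical and exist only for illustration. The only format facts assumed are that a qcow2 image starts with the magic bytes "QFI\xfb" followed by a big-endian version number.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"   # first four bytes of a qcow2 image
CLUSTER_SIZE = 4096        # 4 kB, matching the size of the damaged block


def header_intact(image) -> bool:
    """Return True if the image still starts with the qcow2 magic."""
    return image[:4] == QCOW2_MAGIC


def make_image(clusters: int) -> bytearray:
    """Toy image: magic + version 2 in cluster 0, filler bytes elsewhere."""
    image = bytearray(b"\xaa" * (clusters * CLUSTER_SIZE))
    image[:8] = QCOW2_MAGIC + struct.pack(">I", 2)
    return image


def alloc_cluster(image: bytearray, next_free: int) -> int:
    """Hypothetical allocator: returns a byte offset, or 0 on failure."""
    offset = next_free * CLUSTER_SIZE
    if offset + CLUSTER_SIZE > len(image):
        return 0  # failure is signalled in-band as offset 0
    return offset


def write_l2_unchecked(image: bytearray, next_free: int) -> None:
    """Suspected buggy pattern: the offset is used without checking for 0."""
    offset = alloc_cluster(image, next_free)
    # A fresh L2 table is all zero entries; at offset 0 it lands on the header.
    image[offset:offset + CLUSTER_SIZE] = bytes(CLUSTER_SIZE)


def write_l2_checked(image: bytearray, next_free: int) -> None:
    """Same write, but an allocation failure is reported instead of ignored."""
    offset = alloc_cluster(image, next_free)
    if offset == 0:
        raise OSError("cluster allocation failed")
    image[offset:offset + CLUSTER_SIZE] = bytes(CLUSTER_SIZE)
```

In this toy model, calling write_l2_unchecked(make_image(2), 5) zeroes the first 4 kB of the image, after which header_intact() returns False, while write_l2_checked() raises and leaves the header alone. A check like header_intact() on the first few bytes of a real image file would be a quick way to tell whether a disk image has been hit by this particular kind of damage.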