From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ian Campbell
Subject: Re: Fatal crash on xen4.2 HVM + qemu-xen dm + NFS
Date: Mon, 21 Jan 2013 17:29:09 +0000
Message-ID: <1358789349.3279.272.camel@zakaz.uk.xensource.com>
References: <5B4525F296F6ABEB38B0E614@nimrod.local>
 <50CEFDA602000078000B0B11@nat28.tlf.novell.com>
 <3B1D0701EAEA6532CEA91EA0@Ximines.local>
 <77822E2DDAEA8F94631B6A52@Ximines.local>
 <1358781790.3279.224.camel@zakaz.uk.xensource.com>
 <1358783420.3279.235.camel@zakaz.uk.xensource.com>
 <1358787073.3279.257.camel@zakaz.uk.xensource.com>
 <91736C8D6DB136290494B9F8@Ximines.local>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <91736C8D6DB136290494B9F8@Ximines.local>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Alex Bligh
Cc: Konrad Wilk, Xen Devel, Jan Beulich, Stefano Stabellini
List-Id: xen-devel@lists.xenproject.org

On Mon, 2013-01-21 at 17:06 +0000, Alex Bligh wrote:
> >> I'm wondering whether what's happening is that when the disk grows
> >> (or there's a backing file in place) some sort of different I/O is
> >> done by qemu. Perhaps irrespective of write cache setting, it does
> >> some form of zero copy I/O when there's a backing file in place.
> >
> > I doubt that, but I don't really know anything about qdisk.
> >
> > I'd be much more inclined to suspect a bug in the xen_qdisk
> > backend's handling of disk resizes, if that's what you are doing.
>
> We aren't resizing the qcow2 disk itself. What we're doing is
> creating a 20G (virtual size) qcow2 disk, containing a 3G (or so)
> Ubuntu image - i.e. the partition table says it's 3G. We then take a
> snapshot of it and use that as a backing file. The guest then writes
> to the partition table, enlarging it to the virtual size of the
> disk, then resizes the file system. This triggers it.
> Unless QEMU has some special reason to care about what is in the
> partition table (e.g. to support the old xen 'mount a file as a
> partition' stuff), it's just a pile of sectors being written.
>
> > tap == blktap2. I don't know if it supports qcow or not but I
> > don't think xl exposes it if it does.
>
> Well, in xl's conf file we are using
>
> disk = [ 'tap:qcow2:/my/nfs/directory/testdisk.qcow2,xvda,w' ]
>
> I think that's how you are meant to do qcow2 isn't it?

See docs/misc/xl-disk-configuration.txt; the "tap" prefix is
deprecated and ignored by xl. Sorry, I didn't think of this usage of
"tap" above.

With xend the tap: prefix did force blktap (1 or 2) to be used. xl
tries to pick the most suitable backend, and picks xen_qdisk for
qcow, I think always.

> > You could try with a test .vhd or .raw file though.
>
> We can do this but I'm betting it won't fail (at least with .raw)
> as it only breaks on qcow2 if there's a backing file associated
> with the qcow2 file (i.e. if we're writing to a snapshot).
>
> > Unfortunately it won't be zero. There will be at least one
> > reference from the page being part of the process, which won't be
> > dropped until the process dies.
>
> OK, well this is my ignorance of how the grant mechanism works.
> I had assumed the page from the relevant domU got mapped into the
> process in dom0, and that when it was unmapped it would be mapped
> back out of the process's memory. Otherwise would the process's
> memory map not fill up?

The page is mapped out of the user process like you expect. The
problem is that you cannot tell whether the network stack still holds
a reference to the page after the write() syscall has finished. If
you were to assume it did then you would indeed fill the process's
memory map.

Ian.
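[Editor's note: the backing-file layout Alex describes above (a qcow2
snapshot writing over a base image) can be set up with qemu-img along
the following lines. This is a hedged sketch only: the file names,
sizes, and the `backing_fmt` option are illustrative, not taken from
the original report, and creating the images alone does not reproduce
the crash, which also involved the guest repartitioning and resizing
over NFS.]

```shell
#!/bin/sh
# Sketch of the reported disk layout: a 20G virtual-size qcow2 base
# image, plus a qcow2 snapshot that uses it as a backing file so that
# guest writes land in the snapshot. Names/sizes are illustrative.
set -e

# Skip cleanly on hosts without qemu-img installed.
command -v qemu-img >/dev/null 2>&1 || { echo "qemu-img not found, skipping"; exit 0; }

dir=$(mktemp -d)
cd "$dir"

# 20G virtual-size base image (would hold the ~3G Ubuntu install).
qemu-img create -f qcow2 base.qcow2 20G >/dev/null

# Snapshot backed by the base image; guest writes go to this file.
qemu-img create -f qcow2 -b "$dir/base.qcow2" -o backing_fmt=qcow2 \
    snapshot.qcow2 >/dev/null

# The snapshot's metadata records the backing file.
if qemu-img info snapshot.qcow2 | grep -q "backing file"; then
    echo "snapshot backed by base.qcow2"
fi
```

A guest given `snapshot.qcow2` as its disk would then be in the
writing-to-a-snapshot situation described above.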