From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ian Campbell
Subject: Re: Fatal crash on xen4.2 HVM + qemu-xen dm + NFS
Date: Mon, 21 Jan 2013 16:51:13 +0000
Message-ID: <1358787073.3279.257.camel@zakaz.uk.xensource.com>
References: <5B4525F296F6ABEB38B0E614@nimrod.local>
 <50CEFDA602000078000B0B11@nat28.tlf.novell.com>
 <3B1D0701EAEA6532CEA91EA0@Ximines.local>
 <77822E2DDAEA8F94631B6A52@Ximines.local>
 <1358781790.3279.224.camel@zakaz.uk.xensource.com>
 <1358783420.3279.235.camel@zakaz.uk.xensource.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Alex Bligh
Cc: Konrad Wilk, Xen Devel, Jan Beulich, Stefano Stabellini
List-Id: xen-devel@lists.xenproject.org

On Mon, 2013-01-21 at 16:33 +0000, Alex Bligh wrote:
> > The fact that you can reproduce so easily makes me wonder if this is
> > really the same issue. To trigger the issue you need this sequence of
> > events:
> >       * Send an RPC
> >       * RPC is encapsulated into a TCP/IP frame (or several) and sent.
> >       * Wait for an ACK response to the TCP/IP frame
> >       * Timeout.
> >       * Queue a retransmit of the TCP/IP frame(s)
> >       * Receive the ACK to the original.
> >       * Receive the reply to the RPC as well
> >       * Report success up the stack
> >       * Userspace gets success and unmaps the page
> >       * Retransmit hits the front of the queue
> >       * BOOM
> >
> > To do this you need to be pretty unlucky or retransmitting a lot
> > (which would usually imply something up with either the network or
> > the filer).
>
> Well, the two things we are doing differently that potentially make this
> easier to replicate are:
>
> * We are using a QCOW2 backing file, and running a VM image which
>   expands the partition, and then the filing system. This is a
>   particularly write-heavy load. We're also using the upstream qemu DM,
>   which I think wasn't there when you last tested.

I've never tried to repro this with any version of qemu; we used to see
it with vhd+blktap2, and I had a PoC which showed the issue under native
too.

> * The filer we run this on is a dev filer which performs poorly and has
>   lots of LUNs (though I think we replicated it on another filer too).
>   Though the filer and network certainly aren't great, they can run VMs
>   just fine.

This could well be a factor I guess.

> >> I think that would also apply to iSCSI over tcp, which would
> >> presumably suffer similarly.
> >
> > Correct, iSCSI over TCP can also have this issue.
> >
> >> Is that analysis correct?
> >
> > The important thing is zero copy vs. non-zero copy. IOW it is only a
> > problem if the actual userspace page, which is a mapped domU page, is
> > what gets queued up. Whether zero copy is done or not depends on
> > things like O_DIRECT and write(2) vs. sendpage(2) etc. and what the
> > underlying fs implements. I thought NFS only did it for O_DIRECT; I
> > may be mistaken. aio is probably a factor too.
>
> Right, and I'm pretty sure we're not using O_DIRECT as we're using
> cache=writeback (which is the default). Is there some way to make it
> copy pages?

Not as far as I know, but per Trond zero-copy == O_DIRECT, so if you
aren't using O_DIRECT then you aren't using zero copy -- and that agrees
with my recollection. In that case your issue is something totally
unrelated.

You could try stracing the qemu-dm and see what it does.
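In the strace output the main thing to look for is whether the image
file is ever opened with O_DIRECT (in qemu terms I believe that is what
cache=none adds; cache=writeback should not use it). For illustration
only -- this is not qemu's actual code path, just a minimal userspace
sketch of the distinction, with a made-up helper name and block size --
a buffered pwrite() copies the caller's buffer into the page cache
before returning, while an O_DIRECT pwrite() makes the kernel pin and
use the caller's pages directly, which is the zero-copy case being
discussed:

  /*
   * Minimal sketch, not qemu code: the same pwrite() either goes through
   * the page cache (buffered, data is copied) or makes the kernel use
   * the caller's pages directly (O_DIRECT, i.e. zero copy).  The helper
   * name, path handling and block size are made up for the example.
   */
  #define _GNU_SOURCE             /* O_DIRECT on Linux */
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define BLK 4096

  static int write_block(const char *path, int use_direct)
  {
      void *buf;
      int fd, flags = O_WRONLY | O_CREAT | (use_direct ? O_DIRECT : 0);
      ssize_t n;

      /* O_DIRECT requires an aligned buffer (and aligned length/offset). */
      if (posix_memalign(&buf, BLK, BLK))
          return -1;
      memset(buf, 0, BLK);

      fd = open(path, flags, 0644);
      if (fd < 0) {
          free(buf);
          return -1;
      }

      n = pwrite(fd, buf, BLK, 0);
      close(fd);
      free(buf);
      return n == (ssize_t)BLK ? 0 : -1;
  }

If strace never shows O_DIRECT on the image file then what goes out on
the wire should be dom0 page cache pages rather than the mapped domU
page, which would point away from the zero-copy issue described above.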
> I'm wondering whether what's happening is that when the disk grows
> (or there's a backing file in place) some sort of different I/O is
> done by qemu. Perhaps irrespective of write cache setting, it does some
> form of zero copy I/O when there's a backing file in place.

I doubt that, but I don't really know anything about qdisk. I'd be much
more inclined to suspect a bug in the xen_qdisk backend's handling of
disk resizes, if that's what you are doing.

> > FWIW blktap2 always copies for pretty much this reason; I seem to
> > recall the maintainer saying the perf hit wasn't noticeable.
>
> I'm afraid I find the various blk* combinations a bit of an impenetrable
> maze. Is it possible (if only for testing purposes) to use blktap2
> with HVM domU and qcow2 disks with backing files? I had thought the
> alternatives were qdisk and tap?

tap == blktap2. I don't know if it supports qcow or not, but I don't
think xl exposes it if it does. You could try with a test .vhd or .raw
file though.

> And a late comment on your previous email:
>
> >> Surely before Xen removes the grant on the page, unmapping it from
> >> dom0's memory, it should check to see if there are any existing
> >> references to the page and, if there are, give the kernel its own
> >> COW copy, rather than unmap it totally, which is going to lead to
> >> problems.
> >
> > Unfortunately each page only has one reference count, so you cannot
> > distinguish between references from this particular NFS write and
> > other references (other writes, the ref held by the process itself,
> > etc).
>
> Sure, I understand that. But I wasn't suggesting the tcp layer triggered
> this (in which case it would need to get back to the NFS write). I
> think Trond said you were arranging for sendpage() to provide a
> callback. I'm not suggesting that.
>
> What I was (possibly naively) suggesting is that the single reference
> count to the page should be zero by the time the xen grant stuff is
> about to remove the mapping,

Unfortunately it won't be zero. There will be at least one reference
from the page being part of the process, which won't be dropped until
the process dies.

BTW, I'm talking about the dom0 kernel's page reference count. Xen's
page reference count is irrelevant here.

> else it's in use somewhere in the domain
> into which it's mapped. The xen grant stuff can't know whether that's
> for NFS, or iSCSI or whatever. But it does know some other bit of the
> kernel is going to use that page, and when it's finished with it will
> decrement the reference count and presumably free the page up. So if
> it finds a page like this, surely the right thing to do is to leave
> a copy of it in dom0, which is no longer associated with the domU
> page; it will then get freed when the tcp stack (or whatever is using
> it) decrements the reference count later. I don't know if that makes
> any sense.

The whole point is that there is no such reference count which drops to
zero under these circumstances; that's why my series "skb paged
fragment destructors" adds one.

I suggest you google up previous discussions on the netdev list about
this issue -- all these sorts of ideas were discussed back then.

Ian.
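A toy sketch of the distinction being drawn above, purely for
illustration: none of these structures or names exist in the kernel or
in the "skb paged fragment destructors" series, they just model the
argument. A single shared page refcount only says that someone still
holds the page, not who, and the process mapping alone keeps it
non-zero, so the grant-unmap code could never wait for it to reach zero;
a destructor tied to the network fragment fires exactly when the network
stack is finished with it, which is roughly the facility such a series
would add:

  /*
   * Toy illustration only: none of these types exist in the kernel or in
   * the destructor series, they just model the argument above.  A single
   * shared refcount cannot tell the grant code who still needs the page,
   * and the process mapping alone keeps it non-zero; a destructor tied
   * to the network fragment fires exactly when the network stack is done.
   */
  struct toy_page {
      int refcount;   /* shared by the process mapping, the page cache,
                         in-flight skb fragments, ... */
  };

  struct toy_frag {
      struct toy_page *page;
      void (*destroy)(struct toy_page *page);  /* run when the skb drops
                                                  its reference */
  };

  static void toy_frag_release(struct toy_frag *f)
  {
      if (f->destroy)
          f->destroy(f->page);   /* a safe point to tear down the grant */
      f->page->refcount--;       /* typically still > 0 afterwards, so the
                                    grant code could never key off this */
  }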