From: Stefano Stabellini
Subject: Re: Fatal crash on xen4.2 HVM + qemu-xen dm + NFS
Date: Wed, 23 Jan 2013 16:29:20 +0000
To: Alex Bligh
Cc: Konrad Wilk, Xen Devel, Ian Campbell, Jan Beulich, Stefano Stabellini
List-Id: xen-devel@lists.xenproject.org

On Wed, 23 Jan 2013, Alex Bligh wrote:
> Ian, Stefano,
>
> First bit of good news: avoiding the use of O_DIRECT in xen_disk.c
> fixes the problem (see patch below for what we tested).
>
> Therefore I think this problem will potentially occur:
>
> * when using qemu-xen or qemu-traditional (both of which use O_DIRECT);
>
> * with anything that uses a grant of another domX's memory (I
>   suspect that includes PV guests as well as HVM);
>
> * whenever dom0 uses TCP to the backing device. That includes
>   NFS and iSCSI.
>
> Disabling O_DIRECT is therefore a workaround, but I suspect not an
> attractive one performance-wise. Fixing it properly would appear to
> require Ian C's skbuff page tracking work.
>
> >> I suspect the best possible answer is a cache=[qemu-cache-identifier]
> >> config key, which gets put in xenstore, and which xen_disk.c then
> >> interprets using the same routine QEMU itself uses for cache= on the
> >> command line, then uses exactly those BDRV flags.
> >
> > that would be nice to have
>
> Good.
>
> >> For completeness one could also add an emulcache= option and just
> >> pass that straight through to the qemu command line for the emulated
> >> drives.
> >
> > In qemu-xen it should already be possible to change the cache setting
> > for the IDE disk by passing the right command line option to QEMU. Not
> > ideal, but it would work.
>
> It would indeed, but 'cache=writeback' is currently hard-coded. I was
> planning to change that to 'cache=%s' and pass through the value of
> emulcache=.
>
> >> I had a quick look at this on the train and it appears that doing it
> >> properly requires fiddling with the lex file and possibly xenstore,
> >> i.e. it is not completely trivial.
> >>
> >> An alternative, more disgusting arrangement would be to overload one
> >> of the existing options.
> >>
> >> What's the right approach here, and have I any hope of getting a patch
> >> into 4.2.2? Not a disaster if not, as I already need to maintain a
> >> local tree due to the absence of HVM live migration on qemu-upstream.
> >
> > Backports are only for bug fixes.
>
> Sure - and it would seem to me that dom0 crashing horribly is a bug! I
> have a reliable way to replicate it, but it can happen to others too.
>
> In terms of a local fix, a one-line change to xen_disk is all I need.
>
> What I'm asking is: what is the right way to fix this bug in 4.2?
> At one extreme, I can probably cook up a 'let's do it properly' patch
> which adds the proposed cache= option (most useful, most intrusive).
> At the other extreme, we could just apply the patch below, or a variant
> of it conditional on a configure flag (least useful, least intrusive).
>
> > However, it might be possible to backport a simple patch that, in
> > case of opening failure with O_DIRECT, tries again without it (see
> > appended). I don't think that would work for you though, unless you
> > manage to mount NFS in a way that would return EINVAL on
> > open(O_DIRECT), like it used to.
>
> I don't think that will work for any of the problem cases I've identified
> above, as NFS always permits O_DIRECT opens (I'm not even sure whether
> CONFIG_NFS_DIRECTIO still exists), and so does iSCSI.
>
> --
> Alex Bligh
>
> diff --git a/hw/xen_disk.c b/hw/xen_disk.c
> index a402ac8..1c3a6f5 100644
> --- a/hw/xen_disk.c
> +++ b/hw/xen_disk.c
> @@ -603,7 +603,7 @@ static int blk_init(struct XenDevice *xendev)
>      }
>
>      /* read-only ? */
> -    qflags = BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
> +    qflags = /* BDRV_O_NOCACHE | */ BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
>      if (strcmp(blkdev->mode, "w") == 0) {
>          qflags |= BDRV_O_RDWR;
>      } else {

Before going for something like that I would like confirmation from
Konrad about blkfront behavior regarding barriers and
BLKIF_OP_FLUSH_DISKCACHE. I certainly don't want to risk data
corruption.
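For reference, the retry-without-O_DIRECT fallback discussed above amounts to something like the sketch below. This is an illustration only, not the patch Stefano appended: `open_maybe_direct` is a hypothetical name, and the real change would retry the bdrv_open() in xen_disk.c without BDRV_O_NOCACHE rather than use a raw open(2).

```c
#define _GNU_SOURCE         /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: attempt the open with O_DIRECT first, and if
 * the filesystem rejects the flag with EINVAL (as NFS once did, and
 * as tmpfs still does), retry a plain buffered open instead. */
static int open_maybe_direct(const char *path, int flags, mode_t mode)
{
    int fd = open(path, flags | O_DIRECT, mode);
    if (fd < 0 && errno == EINVAL) {
        /* O_DIRECT unsupported here; fall back to the page cache */
        fd = open(path, flags, mode);
    }
    return fd;
}
```

As Alex notes, on current NFS (and iSCSI) the O_DIRECT open simply succeeds, so the EINVAL fallback never triggers and the problematic direct-I/O path is still taken.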