From mboxrd@z Thu Jan  1 00:00:00 1970
From: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Subject: Re: Fatal crash on xen4.2 HVM + qemu-xen dm + NFS
Date: Tue, 22 Jan 2013 15:42:18 +0000
Message-ID: <alpine.DEB.2.02.1301221531040.29727@kaball.uk.xensource.com>
References: <5B4525F296F6ABEB38B0E614@nimrod.local>
	<50CEFDA602000078000B0B11@nat28.tlf.novell.com>
	<3B1D0701EAEA6532CEA91EA0@Ximines.local>
	<alpine.DEB.2.02.1301161302020.4978@kaball.uk.xensource.com>
	<F1DF150F7B2469CFD587A2A4@Ximines.local>
	<alpine.DEB.2.02.1301161620390.4978@kaball.uk.xensource.com>
	<E57EC0AFE2B6901CB9A11068@Ximines.local>
	<alpine.DEB.2.02.1301161718120.4978@kaball.uk.xensource.com>
	<77822E2DDAEA8F94631B6A52@Ximines.local>
	<1358781790.3279.224.camel@zakaz.uk.xensource.com>
	<F7F59FF70A5F8648886565B5@Ximines.local>
	<1358783420.3279.235.camel@zakaz.uk.xensource.com>
	<F7775BEAD1475FBBDEB2B9C4@Ximines.local>
	<1358787073.3279.257.camel@zakaz.uk.xensource.com>
	<19EA31DDC3BEF4D66B42CBAC@Ximines.local>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <19EA31DDC3BEF4D66B42CBAC@Ximines.local>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Alex Bligh <alex@alex.org.uk>
Cc: Konrad Wilk <konrad.wilk@oracle.com>, Xen Devel <xen-devel@lists.xen.org>, Ian Campbell <Ian.Campbell@citrix.com>, Jan Beulich <JBeulich@suse.com>, Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>
List-Id: xen-devel@lists.xenproject.org

On Mon, 21 Jan 2013, Alex Bligh wrote:
> Ian, Stefano,
> 
> --On 21 January 2013 16:51:13 +0000 Ian Campbell <Ian.Campbell@citrix.com> 
> wrote:
> 
> > Not as far as I know, but Trond zero-copy == O_DIRECT so if you aren't
> > using O_DIRECT then you aren't using zero copy -- and that agrees with
> > my recollection. In that case your issue is something totally unrelated.
> 
> Further investigation suggests that Stefano's commit
>   47982cb00584371928e44ab6dfc6865d605a52fd
> (attached below) may have somewhat surprising results.
> 
> Firstly, changing the cache=writeback settings as passed to the QEMU
> command line probably only affects emulated disks, as the parameters
> for the PV disk appear to be hard coded per this commit, assuming I've
> understood correctly. I am guessing my fiddling with the cache=
> setting merely caused the emulated disk (used in HVM until the kernel
> has loaded) to break.

That is correct.


> Secondly, the chosen mode of cache operation is:
>   BDRV_O_NOCACHE | BDRV_O_CACHE_WB
> This appears to be the same as "cache=none" produces (see code
> fragment from bdrv_parse_cache_flags below), which is somewhat
> counterintuitive given the name of the second flag. "cache=writeback"
> (as appears on the command line) uses BDRV_O_CACHE_WB only.
>
> BDRV_O_NOCACHE appears to map on Linux to O_DIRECT, and BDRV_O_CACHE_WB
> to writeback caching. This implies O_DIRECT will always be used. This
> is somewhat surprising as qemu by default only uses O_DIRECT with
> cache=none, and yet the emulated devices are set up with the
> equivalent of cache=writeback.

Yes, it is counterintuitive, but you got it right: BDRV_O_NOCACHE |
BDRV_O_CACHE_WB means O_DIRECT.


> But this would explain why I'm still seeing the crash with O_DIRECT
> apparently off (cache=writeback), as the cache setting is being ignored.
> 
> This would also explain why Ian might not have seen it (it went in
> late and without O_DIRECT we think this crash can't happen).
> 
> Is the BDRV_O_NOCACHE | BDRV_O_CACHE_WB combination intentional or
> should BDRV_O_NOCACHE be removed? Why would the default be different
> for emulated and PV disks?

The setting differs from that of the emulated devices. After analyzing
the IDE code, we concluded that BDRV_O_CACHE_WB would be safe enough
there, because a guest that wants to make sure its data hits the disk
issues an IDE FLUSH_CACHE operation.

In the xen_disk case, by contrast, we weren't sure what the various PV
frontend drivers assume, so we went for the safe choice, which is
O_DIRECT.

In fact, if we wanted to change the cache setting for xen_disk, we would
probably have to go back to write-through (selected by passing neither
BDRV_O_NOCACHE nor BDRV_O_CACHE_WB), which is quite slow.
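
To make the three combinations discussed in this thread concrete, here
is a minimal standalone sketch of the mapping. It is not the QEMU
source: parse_cache_mode(), uses_o_direct() and needs_guest_flush() are
hypothetical helper names, the flag values are illustrative, and only
the modes mentioned above are handled.

```c
#include <string.h>

/* Illustrative values; the authoritative definitions are in QEMU's block.h. */
#define BDRV_O_NOCACHE  0x0020  /* bypass the host page cache (O_DIRECT on Linux) */
#define BDRV_O_CACHE_WB 0x0040  /* write-back: durable only after an explicit flush */

/* Map a cache= mode name to block-layer flags.
 * Returns 0 on success, -1 for a mode this sketch does not cover. */
static int parse_cache_mode(const char *mode, int *flags)
{
    if (strcmp(mode, "none") == 0) {
        /* The combination xen_disk hard-codes. */
        *flags = BDRV_O_NOCACHE | BDRV_O_CACHE_WB;
    } else if (strcmp(mode, "writeback") == 0) {
        *flags = BDRV_O_CACHE_WB;
    } else if (strcmp(mode, "writethrough") == 0) {
        /* Neither flag set. */
        *flags = 0;
    } else {
        return -1;
    }
    return 0;
}

/* Nonzero if the host image would be opened with O_DIRECT. */
static int uses_o_direct(int flags)
{
    return (flags & BDRV_O_NOCACHE) != 0;
}

/* Nonzero if the guest has to issue a flush to get durability. */
static int needs_guest_flush(int flags)
{
    return (flags & BDRV_O_CACHE_WB) != 0;
}
```

With these definitions the xen_disk flags behave like cache=none
(O_DIRECT on, guest flush required), whereas cache=writeback leaves
O_DIRECT off and write-through sets neither property.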

Recently, thanks to Konrad's work on blkfront cache flushes, a new flush
operation has been implemented in the block protocol:
BLKIF_OP_FLUSH_DISKCACHE. It was introduced in xen_disk by
7e7b7cba16faa7b721b822fa9ed8bebafa35700f "xen_disk: implement
BLKIF_OP_FLUSH_DISKCACHE, remove BLKIF_OP_WRITE_BARRIER". With the new
operation, it may now be safe to use write-back caching.

Konrad, what do you think? Is blkback using the Linux disk cache by
default? Or is it using O_DIRECT?
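
For reference, the reason a flush operation could make write-back safe
can be sketched as follows. This is not the xen_disk code: struct
sketch_req and handle_req() are hypothetical and heavily simplified
(real blkif requests carry grant references and segments), and the
sketch assumes the image is opened without O_DIRECT. Writes land in the
host page cache and become durable only when the frontend sends
BLKIF_OP_FLUSH_DISKCACHE.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Opcode values as in Xen's public io/blkif.h. */
#define BLKIF_OP_READ            0
#define BLKIF_OP_WRITE           1
#define BLKIF_OP_FLUSH_DISKCACHE 3

/* Hypothetical, simplified request used only for this sketch. */
struct sketch_req {
    uint8_t op;
    off_t   offset;
    void   *buf;
    size_t  len;
};

/* Serve one request against fd (opened WITHOUT O_DIRECT).
 * Returns 0 on success, -1 on error or unsupported opcode. */
static int handle_req(int fd, const struct sketch_req *r)
{
    switch (r->op) {
    case BLKIF_OP_READ:
        return pread(fd, r->buf, r->len, r->offset) == (ssize_t)r->len ? 0 : -1;
    case BLKIF_OP_WRITE:
        /* Write-back: the data sits in the host page cache for now. */
        return pwrite(fd, r->buf, r->len, r->offset) == (ssize_t)r->len ? 0 : -1;
    case BLKIF_OP_FLUSH_DISKCACHE:
        /* The guest asked for durability: push cached data to the disk. */
        return fsync(fd) == 0 ? 0 : -1;
    default:
        return -1;
    }
}
```

Whether this is safe in practice depends on every PV frontend actually
issuing the flush when it needs durability, which is the open question
above.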