From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christoph Hellwig
Subject: Notes on block I/O data integrity
Date: Tue, 25 Aug 2009 20:11:20 +0200
Message-ID: <20090825181120.GA4863@lst.de>
To: qemu-devel@nongnu.org, kvm@vger.kernel.org
Cc: rusty@rustcorp.com.au

As various people wanted to know how the various data integrity patches
I've sent out recently play together, here's a small writeup on what
issues we have in QEMU and how to fix them.

There are two major aspects of data integrity we need to care about in
the QEMU block I/O code:

 (1) stable data storage - we must be able to force data out of caches
     onto the stable media, and we must get a completion notification
     for it.
 (2) request ordering - we must be able to make sure some I/O requests
     do not get reordered with other in-flight requests before or after
     them.

Linux uses two related abstractions to implement this (other operating
systems are probably similar):

 (1) a cache flush request that flushes the whole volatile write cache
     to stable storage
 (2) a barrier request, which

     (a) is guaranteed to actually go all the way to stable storage
     (b) does not get reordered versus any requests before or after it

For disks not using volatile write caches the cache flush is a no-op,
and barrier requests are implemented by draining the queue of
outstanding requests before the barrier request, and only allowing new
requests to proceed after it has finished.  Instead of the queue drain,
tag ordering could be used, but at this point that is not the case in
Linux.

For disks using volatile write caches, the cache flush is implemented
by a protocol-specific request, and barrier requests are implemented by
performing cache flushes before and after the barrier request, in
addition to the draining mentioned above.  The second cache flush can
be replaced by setting the "Force Unit Access" bit on the barrier
request on modern disks.

The above is supported by the QEMU emulated disks in the following way
(a rough sketch of the host side of these flush commands follows the
list):

 - The IDE disk emulation implements the ATA WIN_FLUSH_CACHE/
   WIN_FLUSH_CACHE_EXT commands to flush the drive cache, but does not
   indicate a volatile write cache in the ATA IDENTIFY command.
   Because of that guests do not actually send down cache flush
   requests.  Linux guests do however drain the I/O queues to guarantee
   ordering in the absence of volatile write caches.
 - The SCSI disk emulation implements the SCSI SYNCHRONIZE_CACHE
   command, and also advertises the write cache enabled bit.  This
   means Linux sends down cache flush requests to implement barriers,
   and provides sufficient queue draining.
 - The virtio-blk driver does not implement any cache flush command,
   and while there is a virtio-blk feature bit for barrier support it
   is not supported by virtio-blk.  Due to the lack of a cache flush
   command it also is insufficient to implement the required data
   integrity semantics.  Currently the Linux virtio-blk driver does not
   advertise any form of barrier support, and we don't even get the
   queue draining required for proper operation in a cache-less
   environment.
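To make the flush path concrete, here is a rough sketch of what
handling an emulated flush command boils down to on a POSIX host.  This
is not actual QEMU code; handle_flush_cache() and image_fd are made-up
names for illustration only:

    #include <unistd.h>
    #include <errno.h>

    /* Called when the guest issues a "flush drive cache" command
     * (WIN_FLUSH_CACHE on IDE, SYNCHRONIZE_CACHE on SCSI).
     * image_fd is the file descriptor of the backing image. */
    static int handle_flush_cache(int image_fd)
    {
        /* fdatasync() forces dirty data for the image file out of the
         * host page cache; whether the physical disk's volatile write
         * cache is flushed as well depends on the host filesystem and
         * kernel (see the notes at the end of this mail). */
        if (fdatasync(image_fd) < 0)
            return -errno;
        return 0;
    }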
The I/O from these front end drivers maps to different host kernel I/O
patterns depending on the cache= drive command line option.  There are
three choices for it:

 (a) cache=writethrough
 (b) cache=writeback
 (c) cache=none

(a) means all writes are synchronous (O_DSYNC), which means the host
kernel guarantees us that data is on stable storage once the I/O
request has completed.  In cache=writethrough mode the IDE and SCSI
drivers are safe because the queue is properly drained to guarantee I/O
ordering.  Virtio-blk is not safe due to the lack of queue draining.

(b) means we use regular buffered writes and need an fsync/fdatasync to
actually guarantee that data is stable on disk.  In cache=writeback
mode only the SCSI emulation is safe, as all others miss the cache
flush requests.

(c) means we use direct I/O (O_DIRECT) to bypass the host cache and
perform direct DMA to/from the I/O buffer in QEMU.  While direct I/O
bypasses the host cache it does not guarantee flushing of volatile
write caches in disks, nor completion of metadata operations in
filesystems (e.g. block allocations).  In cache=none mode only the SCSI
emulation is entirely safe right now, due to the lack of cache flushes
in the other drivers.  (A rough sketch of how these modes map to host
open flags and flush calls follows at the end of this mail.)

Action plan for the guest drivers:

 - virtio-blk needs to advertise an ordered queue by default.  This
   makes cache=writethrough safe on virtio.

Action plan for QEMU:

 - IDE needs to set the write cache enabled bit
 - virtio needs to implement a cache flush command and advertise it
   (also needs a small change to the host driver)
 - we need to implement an aio_fsync to not stall the vcpu on cache
   flushes (a sketch of the idea is at the end of this mail)
 - investigate only advertising a write cache when we really have one
   to avoid the cache flush requests for cache=writethrough

Notes on disk cache flushes on Linux hosts:

 - barrier requests and cache flushes are supported by all local disk
   filesystems in popular use (btrfs, ext3, ext4, reiserfs, XFS).
   However, unlike the other filesystems, ext3 does _NOT_ enable
   barriers and cache flush requests by default.
 - currently O_SYNC writes or fsync on block device nodes do not flush
   the disk cache.
 - currently none of the filesystems, nor direct access to the block
   device nodes, implements flushes of the disk caches when using
   O_DIRECT|O_DSYNC or using fsync/fdatasync after an O_DIRECT request.
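To make the cache= mapping above a bit more concrete, here is a rough
sketch of how the three modes translate to host-side open(2) flags and
flush calls.  This is a simplification, not the actual QEMU
implementation, and the function name is made up:

    #define _GNU_SOURCE     /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <string.h>

    /* Map a cache= mode to the flags used when opening the image file. */
    static int image_open_flags(const char *cache_mode)
    {
        int flags = O_RDWR;

        if (strcmp(cache_mode, "writethrough") == 0) {
            /* (a): every write is synchronous; completion means the
             * host kernel has written the data out */
            flags |= O_DSYNC;
        } else if (strcmp(cache_mode, "none") == 0) {
            /* (c): bypass the host page cache; this does NOT flush the
             * disk's volatile write cache, nor does it force out
             * filesystem metadata such as block allocations */
            flags |= O_DIRECT;
        }
        /* (b) "writeback": plain buffered I/O; data is only known to
         * be stable after an explicit fsync()/fdatasync(). */
        return flags;
    }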
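And to illustrate the aio_fsync item in the QEMU action plan: the point
is simply that the cache flush must not be issued synchronously from
the vcpu thread.  One way this could look (plain pthreads, purely
illustrative; a real implementation would go through QEMU's existing
aio infrastructure, and the names below are made up):

    #include <pthread.h>
    #include <unistd.h>
    #include <errno.h>

    struct flush_req {
        int fd;                               /* image file descriptor */
        void (*done)(void *opaque, int ret);  /* completion callback */
        void *opaque;
    };

    /* Runs in a helper thread so the vcpu thread is not blocked for
     * the duration of the cache flush. */
    static void *flush_worker(void *arg)
    {
        struct flush_req *req = arg;
        int ret = fdatasync(req->fd) < 0 ? -errno : 0;

        /* In a real implementation the completion would be signalled
         * back to the main loop instead of being called from here. */
        req->done(req->opaque, ret);
        return NULL;
    }

    /* Issue the flush asynchronously; returns 0 or a pthread error. */
    static int submit_flush(struct flush_req *req)
    {
        pthread_t tid;
        int err = pthread_create(&tid, NULL, flush_worker, req);

        if (!err)
            pthread_detach(tid);
        return err;
    }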