All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kevin Wolf <kwolf@redhat.com>
To: qemu-block@nongnu.org
Cc: kwolf@redhat.com, qemu-devel@nongnu.org
Subject: [Qemu-devel] [PULL 50/69] block: Honor BDRV_REQ_FUA during write_zeroes
Date: Thu, 12 May 2016 16:35:30 +0200	[thread overview]
Message-ID: <1463063749-2201-51-git-send-email-kwolf@redhat.com> (raw)
In-Reply-To: <1463063749-2201-1-git-send-email-kwolf@redhat.com>

From: Eric Blake <eblake@redhat.com>

The block layer has a couple of cases where it can lose
Force Unit Access semantics when writing a large block of
zeroes, such that the request returns before the zeroes
have been guaranteed to land on underlying media.

SCSI does not support FUA during WRITESAME(10/16); FUA is only
supported if it falls back to WRITE(10/16).  But where the
underlying device is new enough to not need a fallback, it
means that any upper layer request with FUA semantics was
silently ignoring BDRV_REQ_FUA.

Conversely, NBD has situations where it can support FUA but not
ZERO_WRITE; when that happens, the generic block layer fallback
to bdrv_driver_pwritev() (or the older bdrv_co_writev() in qemu
2.6) was losing the FUA flag.

The problem of losing flags unrelated to ZERO_WRITE has been
latent in bdrv_co_do_write_zeroes() since commit aa7bfbff, but
back then, it did not matter because there was no FUA flag.  It
became observable when commit 93f5e6d8 paved the way for flags
that can impact correctness, when we should have been using
bdrv_co_writev_flags() with modified flags.  Compare to commit
9eeb6dd, which got flag manipulation right in
bdrv_co_do_zero_pwritev().

Symptoms: I tested with qemu-io with default writethrough cache
(which is supposed to use FUA semantics on every write), and
targetted an NBD client connected to a server that intentionally
did not advertise NBD_FLAG_SEND_FUA.  When doing 'write 0 512',
the NBD client sent two operations (NBD_CMD_WRITE then
NBD_CMD_FLUSH) to get the fallback FUA semantics; but when doing
'write -z 0 512', the NBD client sent only NBD_CMD_WRITE.

The fix is do to a cleanup bdrv_co_flush() at the end of the
operation if any step in the middle relied on a BDS that does
not natively support FUA for that step (note that we don't
need to flush after every operation, if the operation is broken
into chunks based on bounce-buffer sizing).  Each BDS gains a
new flag .supported_zero_flags, which parallels the use of
.supported_write_flags but only when accessing a zero write
operation (the flags MUST be different, because of SCSI having
different semantics based on WRITE vs. WRITESAME; and also
because BDRV_REQ_MAY_UNMAP only makes sense on zero writes).

Also fix some documentation to describe -ENOTSUP semantics,
particularly since iscsi depends on those semantics.

Down the road, we may want to add a driver where its
.bdrv_co_pwritev() honors all three of BDRV_REQ_FUA,
BDRV_REQ_ZERO_WRITE, and BDRV_REQ_MAY_UNMAP, and advertise
this via bs->supported_write_flags for blocks opened by that
driver; such a driver should NOT supply .bdrv_co_write_zeroes
nor .supported_zero_flags.  But none of the drivers touched
in this patch want to do that (the act of writing zeroes is
different enough from normal writes to deserve a second
callback).

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c                | 28 +++++++++++++++++++++++++---
 block/iscsi.c             |  1 +
 block/raw-posix.c         |  1 +
 block/raw_bsd.c           |  1 +
 include/block/block_int.h |  7 +++++--
 5 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index 1fb7afe..cd6d71a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -652,7 +652,8 @@ int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num,
  * Completely zero out a block device with the help of bdrv_write_zeroes.
  * The operation is sped up by checking the block status and only writing
  * zeroes to the device if they currently do not return zeroes. Optional
- * flags are passed through to bdrv_write_zeroes (e.g. BDRV_REQ_MAY_UNMAP).
+ * flags are passed through to bdrv_write_zeroes (e.g. BDRV_REQ_MAY_UNMAP,
+ * BDRV_REQ_FUA).
  *
  * Returns < 0 on error, 0 on success. For error codes see bdrv_write().
  */
@@ -1160,6 +1161,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     QEMUIOVector qiov;
     struct iovec iov = {0};
     int ret = 0;
+    bool need_flush = false;
 
     int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_write_zeroes,
                                         BDRV_REQUEST_MAX_SECTORS);
@@ -1192,13 +1194,29 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
         ret = -ENOTSUP;
         /* First try the efficient write zeroes operation */
         if (drv->bdrv_co_write_zeroes) {
-            ret = drv->bdrv_co_write_zeroes(bs, sector_num, num, flags);
+            ret = drv->bdrv_co_write_zeroes(bs, sector_num, num,
+                                            flags & bs->supported_zero_flags);
+            if (ret != -ENOTSUP && (flags & BDRV_REQ_FUA) &&
+                !(bs->supported_zero_flags & BDRV_REQ_FUA)) {
+                need_flush = true;
+            }
+        } else {
+            assert(!bs->supported_zero_flags);
         }
 
         if (ret == -ENOTSUP) {
             /* Fall back to bounce buffer if write zeroes is unsupported */
             int max_xfer_len = MIN_NON_ZERO(bs->bl.max_transfer_length,
                                             MAX_WRITE_ZEROES_BOUNCE_BUFFER);
+            BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
+
+            if ((flags & BDRV_REQ_FUA) &&
+                !(bs->supported_write_flags & BDRV_REQ_FUA)) {
+                /* No need for bdrv_driver_pwrite() to do a fallback
+                 * flush on each chunk; use just one at the end */
+                write_flags &= ~BDRV_REQ_FUA;
+                need_flush = true;
+            }
             num = MIN(num, max_xfer_len);
             iov.iov_len = num * BDRV_SECTOR_SIZE;
             if (iov.iov_base == NULL) {
@@ -1212,7 +1230,8 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
             qemu_iovec_init_external(&qiov, &iov, 1);
 
             ret = bdrv_driver_pwritev(bs, sector_num * BDRV_SECTOR_SIZE,
-                                      num * BDRV_SECTOR_SIZE, &qiov, 0);
+                                      num * BDRV_SECTOR_SIZE, &qiov,
+                                      write_flags);
 
             /* Keep bounce buffer around if it is big enough for all
              * all future requests.
@@ -1228,6 +1247,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     }
 
 fail:
+    if (ret == 0 && need_flush) {
+        ret = bdrv_co_flush(bs);
+    }
     qemu_vfree(iov.iov_base);
     return ret;
 }
diff --git a/block/iscsi.c b/block/iscsi.c
index 6d5c1f6..10f3906 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1553,6 +1553,7 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
     if (iscsilun->dpofua) {
         bs->supported_write_flags = BDRV_REQ_FUA;
     }
+    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
 
     /* Check the write protect flag of the LUN if we want to write */
     if (iscsilun->type == TYPE_DISK && (flags & BDRV_O_RDWR) &&
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 71ec463..a4f5a1b 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -517,6 +517,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 
     s->has_discard = true;
     s->has_write_zeroes = true;
+    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
     if ((bs->open_flags & BDRV_O_NOCACHE) != 0) {
         s->needs_alignment = true;
     }
diff --git a/block/raw_bsd.c b/block/raw_bsd.c
index 1a1618e..3385ed4 100644
--- a/block/raw_bsd.c
+++ b/block/raw_bsd.c
@@ -205,6 +205,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
 {
     bs->sg = bs->file->bs->sg;
     bs->supported_write_flags = BDRV_REQ_FUA;
+    bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP;
 
     if (bs->probed && !bdrv_is_read_only(bs)) {
         fprintf(stderr,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 10f4962..2709488 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -161,8 +161,8 @@ struct BlockDriver {
     /*
      * Efficiently zero a region of the disk image.  Typically an image format
      * would use a compact metadata representation to implement this.  This
-     * function pointer may be NULL and .bdrv_co_writev() will be called
-     * instead.
+     * function pointer may be NULL or return -ENOSUP and .bdrv_co_writev()
+     * will be called instead.
      */
     int coroutine_fn (*bdrv_co_write_zeroes)(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors, BdrvRequestFlags flags);
@@ -445,6 +445,9 @@ struct BlockDriverState {
     unsigned int request_alignment;
     /* Flags honored during pwrite (so far: BDRV_REQ_FUA) */
     unsigned int supported_write_flags;
+    /* Flags honored during write_zeroes (so far: BDRV_REQ_FUA,
+     * BDRV_REQ_MAY_UNMAP) */
+    unsigned int supported_zero_flags;
 
     /* the following member gives a name to every node on the bs graph. */
     char node_name[32];
-- 
1.8.3.1

  parent reply	other threads:[~2016-05-12 14:37 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-12 14:34 [Qemu-devel] [PULL 00/69] Block layer patches Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 01/69] block: Don't disable I/O throttling on sync requests Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 02/69] block: make bdrv_start_throttled_reqs return void Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 03/69] block: move restarting of throttled reqs to block/throttle-groups.c Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 04/69] block: extract bdrv_drain_poll/bdrv_co_yield_to_drain from bdrv_drain/bdrv_co_drain Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 05/69] block: introduce bdrv_no_throttling_begin/end Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 06/69] block: plug whole tree at once, introduce bdrv_io_unplugged_begin/end Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 07/69] linux-aio: make it more type safe Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 08/69] block: Introduce bdrv_driver_preadv() Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 09/69] block: Introduce bdrv_driver_pwritev() Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 10/69] block: Support AIO drivers in bdrv_driver_preadv/pwritev() Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 11/69] block: Rename bdrv_co_do_preadv/writev to bdrv_co_preadv/writev Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 12/69] block: Introduce .bdrv_co_preadv/pwritev BlockDriver function Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 13/69] bochs: Implement .bdrv_co_preadv() interface Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 14/69] cloop: " Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 15/69] dmg: " Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 16/69] vdi: " Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 17/69] vdi: Implement .bdrv_co_pwritev() interface Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 18/69] vmdk: Add vmdk_find_offset_in_cluster() Kevin Wolf
2016-05-12 14:34 ` [Qemu-devel] [PULL 19/69] vmdk: Implement .bdrv_co_preadv() interface Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 20/69] vmdk: Implement .bdrv_co_pwritev() interface Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 21/69] vpc: Implement .bdrv_co_preadv() interface Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 22/69] vpc: Implement .bdrv_co_pwritev() interface Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 23/69] vvfat: Implement .bdrv_co_preadv/pwritev interfaces Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 24/69] block: Remove BlockDriver.bdrv_read/write Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 25/69] block: Fix typo in comment Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 26/69] block: always compile-check debug prints Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 27/69] Allow users to specify the vmdk virtual hardware version Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 28/69] qemu-io: Fix memory leak in 'aio_write -z' Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 29/69] block: Allow BDRV_REQ_FUA through blk_pwrite() Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 30/69] block: Switch blk_read_unthrottled() to byte interface Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 31/69] block: Switch blk_*write_zeroes() " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 32/69] block: Introduce byte-based aio read/write Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 33/69] ide: Switch to byte-based aio block access Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 34/69] scsi-disk: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 35/69] virtio: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 36/69] xen_disk: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 37/69] fdc: Switch to byte-based " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 38/69] nand: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 39/69] onenand: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 40/69] pflash: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 41/69] sd: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 42/69] m25p80: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 43/69] atapi: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 44/69] nbd: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 45/69] qemu-img: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 46/69] qemu-io: " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 47/69] block: Kill unused sector-based blk_* functions Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 48/69] qcow2: improve qcow2_co_write_zeroes() Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 49/69] block: Make supported_write_flags a per-bds property Kevin Wolf
2016-05-12 14:35 ` Kevin Wolf [this message]
2016-05-12 14:35 ` [Qemu-devel] [PULL 51/69] nbd: Simplify client FUA handling Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 52/69] block: Invalidate all children Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 53/69] block: Drop superfluous invalidating bs->file from drivers Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 54/69] block: Inactivate all children Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 55/69] iotests: fix the redirection order in 083 Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 56/69] qemu-img: check block status of backing file when converting Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 57/69] Add new block driver interface to add/delete a BDS's child Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 58/69] quorum: implement bdrv_add_child() and bdrv_del_child() Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 59/69] qmp: add monitor command to add/remove a child Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 60/69] qemu-io: Add missing option documentation Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 61/69] qemu-io: Make 'open' subcommand more like command line Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 62/69] qemu-io: Use bool for command line flags Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 63/69] qemu-io: Allow unaligned access by default Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 64/69] qemu-io: Add 'write -f' to test FUA flag Kevin Wolf
2016-05-12 21:23   ` Eric Blake
2016-05-12 14:35 ` [Qemu-devel] [PULL 65/69] qemu-io: Add 'write -z -u' to test MAY_UNMAP flag Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 66/69] block: add support for --image-opts in block I/O tests Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 67/69] block: add support for encryption secrets " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 68/69] block: enable testing of LUKS driver with " Kevin Wolf
2016-05-12 14:35 ` [Qemu-devel] [PULL 69/69] qemu-iotests: iotests: fail hard if not run via "check" Kevin Wolf
2016-05-12 16:19 ` [Qemu-devel] [PULL 00/69] Block layer patches Peter Maydell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1463063749-2201-51-git-send-email-kwolf@redhat.com \
    --to=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.