All of lore.kernel.org
* [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer
@ 2016-06-20 23:39 Eric Blake
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 1/5] block: Fragment reads to max transfer length Eric Blake
                   ` (8 more replies)
  0 siblings, 9 replies; 21+ messages in thread
From: Eric Blake @ 2016-06-20 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf

We have max_transfer documented in BlockLimits, but while we
honor it during pwrite_zeroes, we were blindly ignoring it
during pwritev and preadv, leading to multiple drivers having
to implement fragmentation themselves.  This series moves
fragmentation to the block layer, then fixes the NBD driver to
use it; if you like this but it needs a v2, you can request that
I further do other drivers (I know at least iscsi and qcow2 do
some self-fragmenting and/or error reporting that can be
simplified by deferring fragmentation to the block layer).

Prerequisite: Kevin's block branch, plus my work on byte-based
block limits (v2 at the moment):
https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg04006.html

Also available as a tag at:
git fetch git://repo.or.cz/qemu/ericb.git nbd-fragment-v1

Eric Blake (5):
  block: Fragment reads to max transfer length
  block: Fragment writes to max transfer length
  raw_bsd: Don't advertise flags not supported by protocol layer
  nbd: Rely on block layer to break up large requests
  nbd: Drop unused offset parameter

 include/block/nbd.h |  1 -
 nbd/nbd-internal.h  |  4 +--
 block/io.c          | 84 +++++++++++++++++++++++++++++++++++++++--------------
 block/nbd-client.c  | 78 ++++++++++++++-----------------------------------
 block/nbd.c         | 12 ++------
 block/raw_bsd.c     |  6 ++--
 nbd/common.c        |  3 +-
 7 files changed, 95 insertions(+), 93 deletions(-)

-- 
2.5.5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 1/5] block: Fragment reads to max transfer length
  2016-06-20 23:39 [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer Eric Blake
@ 2016-06-20 23:39 ` Eric Blake
  2016-07-08 10:56   ` Kevin Wolf
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 2/5] block: Fragment writes " Eric Blake
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Eric Blake @ 2016-06-20 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, Stefan Hajnoczi, Fam Zheng, Max Reitz

Drivers should be able to rely on the block layer honoring the
max transfer length, rather than needing to return -EINVAL
(iscsi) or manually fragment things (nbd).  This patch adds
the fragmentation in the block layer, after requests have been
aligned (fragmenting before alignment would lead to multiple
unaligned requests, rather than just the head and tail).

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/io.c | 53 +++++++++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/block/io.c b/block/io.c
index 4e19868..a1443e3 100644
--- a/block/io.c
+++ b/block/io.c
@@ -971,8 +971,8 @@ err:

 /*
  * Forwards an already correctly aligned request to the BlockDriver. This
- * handles copy on read and zeroing after EOF; any other features must be
- * implemented by the caller.
+ * handles copy on read, zeroing after EOF, and fragmentation of large
+ * reads; any other features must be implemented by the caller.
  */
 static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
@@ -980,12 +980,16 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
 {
     int64_t total_bytes, max_bytes;
     int ret;
+    uint64_t bytes_remaining = bytes;
+    int max_transfer;

     assert(is_power_of_2(align));
     assert((offset & (align - 1)) == 0);
     assert((bytes & (align - 1)) == 0);
     assert(!qiov || bytes == qiov->size);
     assert((bs->open_flags & BDRV_O_NO_IO) == 0);
+    max_transfer = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_transfer, INT_MAX),
+                                   align);

     /* TODO: We would need a per-BDS .supported_read_flags and
      * potential fallback support, if we ever implement any read flags
@@ -1024,7 +1028,7 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
         }
     }

-    /* Forward the request to the BlockDriver */
+    /* Forward the request to the BlockDriver, possibly fragmenting it */
     total_bytes = bdrv_getlength(bs);
     if (total_bytes < 0) {
         ret = total_bytes;
@@ -1032,26 +1036,35 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     }

     max_bytes = ROUND_UP(MAX(0, total_bytes - offset), align);
-    if (bytes <= max_bytes) {
+    if (bytes <= max_bytes && bytes <= max_transfer) {
         ret = bdrv_driver_preadv(bs, offset, bytes, qiov, 0);
-    } else if (max_bytes > 0) {
-        QEMUIOVector local_qiov;
-
-        qemu_iovec_init(&local_qiov, qiov->niov);
-        qemu_iovec_concat(&local_qiov, qiov, 0, max_bytes);
-
-        ret = bdrv_driver_preadv(bs, offset, max_bytes, &local_qiov, 0);
-
-        qemu_iovec_destroy(&local_qiov);
-    } else {
-        ret = 0;
+        goto out;
     }

-    /* Reading beyond end of file is supposed to produce zeroes */
-    if (ret == 0 && total_bytes < offset + bytes) {
-        uint64_t zero_offset = MAX(0, total_bytes - offset);
-        uint64_t zero_bytes = offset + bytes - zero_offset;
-        qemu_iovec_memset(qiov, zero_offset, 0, zero_bytes);
+    while (bytes_remaining) {
+        int num;
+
+        if (max_bytes) {
+            QEMUIOVector local_qiov;
+
+            num = MIN(bytes_remaining, MIN(max_bytes, max_transfer));
+            assert(num);
+            qemu_iovec_init(&local_qiov, qiov->niov);
+            qemu_iovec_concat(&local_qiov, qiov, bytes - bytes_remaining, num);
+
+            ret = bdrv_driver_preadv(bs, offset + bytes - bytes_remaining,
+                                     num, &local_qiov, 0);
+            max_bytes -= num;
+            qemu_iovec_destroy(&local_qiov);
+            if (ret < 0) {
+                break;
+            }
+        } else {
+            num = bytes_remaining;
+            qemu_iovec_memset(qiov, bytes - bytes_remaining, 0,
+                              bytes_remaining);
+        }
+        bytes_remaining -= num;
     }

 out:
-- 
2.5.5


* [Qemu-devel] [PATCH 2/5] block: Fragment writes to max transfer length
  2016-06-20 23:39 [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer Eric Blake
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 1/5] block: Fragment reads to max transfer length Eric Blake
@ 2016-06-20 23:39 ` Eric Blake
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 3/5] raw_bsd: Don't advertise flags not supported by protocol layer Eric Blake
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Eric Blake @ 2016-06-20 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, Stefan Hajnoczi, Fam Zheng, Max Reitz

Drivers should be able to rely on the block layer honoring the
max transfer length, rather than needing to return -EINVAL
(iscsi) or manually fragment things (nbd).  We already fragment
write zeroes at the block layer; this patch adds the fragmentation
for normal writes, after requests have been aligned (fragmenting
before alignment would lead to multiple unaligned requests, rather
than just the head and tail).

When fragmenting a large request with FUA set, if we know that FUA
is implemented by flushing all requests rather than just the given
one, we can still get by with only one flush.  Note, however,
that we need a followup patch to the raw format driver to avoid a
regression in the number of flushes actually issued.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/io.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index a1443e3..727a2d4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1280,6 +1280,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,

     int64_t start_sector = offset >> BDRV_SECTOR_BITS;
     int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
+    uint64_t bytes_remaining = bytes;
+    int max_transfer;

     assert(is_power_of_2(align));
     assert((offset & (align - 1)) == 0);
@@ -1287,6 +1289,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     assert(!qiov || bytes == qiov->size);
     assert((bs->open_flags & BDRV_O_NO_IO) == 0);
     assert(!(flags & ~BDRV_REQ_MASK));
+    max_transfer = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_transfer, INT_MAX),
+                                   align);

     waited = wait_serialising_requests(req);
     assert(!waited || !req->serialising);
@@ -1309,9 +1313,34 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     } else if (flags & BDRV_REQ_ZERO_WRITE) {
         bdrv_debug_event(bs, BLKDBG_PWRITEV_ZERO);
         ret = bdrv_co_do_pwrite_zeroes(bs, offset, bytes, flags);
-    } else {
+    } else if (bytes <= max_transfer) {
         bdrv_debug_event(bs, BLKDBG_PWRITEV);
         ret = bdrv_driver_pwritev(bs, offset, bytes, qiov, flags);
+    } else {
+        bdrv_debug_event(bs, BLKDBG_PWRITEV);
+        while (bytes_remaining) {
+            int num = MIN(bytes_remaining, max_transfer);
+            QEMUIOVector local_qiov;
+            int local_flags = flags;
+
+            assert(num);
+            if (num < bytes_remaining && (flags & BDRV_REQ_FUA) &&
+                !(bs->supported_write_flags & BDRV_REQ_FUA)) {
+                /* If FUA is going to be emulated by flush, we only
+                 * need to flush on the last iteration */
+                local_flags &= ~BDRV_REQ_FUA;
+            }
+            qemu_iovec_init(&local_qiov, qiov->niov);
+            qemu_iovec_concat(&local_qiov, qiov, bytes - bytes_remaining, num);
+
+            ret = bdrv_driver_pwritev(bs, offset + bytes - bytes_remaining,
+                                      num, &local_qiov, local_flags);
+            qemu_iovec_destroy(&local_qiov);
+            if (ret < 0) {
+                break;
+            }
+            bytes_remaining -= num;
+        }
     }
     bdrv_debug_event(bs, BLKDBG_PWRITEV_DONE);

-- 
2.5.5


* [Qemu-devel] [PATCH 3/5] raw_bsd: Don't advertise flags not supported by protocol layer
  2016-06-20 23:39 [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer Eric Blake
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 1/5] block: Fragment reads to max transfer length Eric Blake
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 2/5] block: Fragment writes " Eric Blake
@ 2016-06-20 23:39 ` Eric Blake
  2016-07-08 11:05   ` Kevin Wolf
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 4/5] nbd: Rely on block layer to break up large requests Eric Blake
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Eric Blake @ 2016-06-20 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, Max Reitz

The raw format layer supports all flags via passthrough - but
it only makes sense to pass through flags that the lower layer
actually supports.

Thanks to the previous patch, the raw format layer now attempts
to fragment writes at the max_transfer limit it inherits from
the NBD protocol layer, recently set to 32m.  An attempt to do
'w -f 0 40m' to an NBD server that lacks FUA thus changed from
flushing once (after NBD fragmented a single 40m write itself)
to instead flushing twice (the format layer sees BDRV_REQ_FUA
in supported_write_flags, so it sends the flag on to both
fragments, and then the block layer emulates FUA by flushing
for both the 32m and 8m fragments at the protocol layer).
This patch fixes the performance regression (now that the
format layer no longer advertises a flag not present at the
protocol layer, the flush to emulate FUA is deferred to the
last fragment).

Note that 'w -f -z 0 40m' does not currently exhibit the same
problem, because there the fragmentation does not occur until the
NBD layer (the raw layer has .bdrv_co_pwrite_zeroes, and the NBD
layer doesn't advertise max_pwrite_zeroes to constrain things at
the raw layer).  But that problem is latent: without this patch,
the same excess of flushes would appear once the NBD layer
implements support for the new NBD_CMD_WRITE_ZEROES command and
sets max_pwrite_zeroes to the same 32m limit recommended by the
NBD protocol.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/raw_bsd.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/raw_bsd.c b/block/raw_bsd.c
index 351ed2a..47a8352 100644
--- a/block/raw_bsd.c
+++ b/block/raw_bsd.c
@@ -197,8 +197,10 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
                     Error **errp)
 {
     bs->sg = bs->file->bs->sg;
-    bs->supported_write_flags = BDRV_REQ_FUA;
-    bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP;
+    bs->supported_write_flags = BDRV_REQ_FUA &
+        bs->file->bs->supported_write_flags;
+    bs->supported_zero_flags = (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+        bs->file->bs->supported_zero_flags;

     if (bs->probed && !bdrv_is_read_only(bs)) {
         fprintf(stderr,
-- 
2.5.5


* [Qemu-devel] [PATCH 4/5] nbd: Rely on block layer to break up large requests
  2016-06-20 23:39 [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer Eric Blake
                   ` (2 preceding siblings ...)
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 3/5] raw_bsd: Don't advertise flags not supported by protocol layer Eric Blake
@ 2016-06-20 23:39 ` Eric Blake
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 5/5] nbd: Drop unused offset parameter Eric Blake
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Eric Blake @ 2016-06-20 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, Paolo Bonzini, Max Reitz

Now that the block layer will honor max_transfer, we can simplify
our code to rely on that guarantee.

The readv code can call directly into nbd-client, just as the
writev code has done since commit 52a4650.

Interestingly enough, while qemu-io 'w 0 40m' splits into a 32M
and an 8M transaction, 'w -z 0 40m' splits into two 16M
transactions and an 8M one, because the block layer caps the bounce
buffer for writing zeroes at 16M.  When we later introduce support
for NBD_CMD_WRITE_ZEROES,
we can get a full 32M zero write (or larger, if the client and
server negotiate that write zeroes can use a larger size than
ordinary writes).

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/nbd-client.c | 51 ++++++++-------------------------------------------
 block/nbd.c        | 12 +++---------
 2 files changed, 11 insertions(+), 52 deletions(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 420bce8..9f023f8 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -217,15 +217,15 @@ static void nbd_coroutine_end(NbdClientSession *s,
     }
 }

-static int nbd_co_readv_1(BlockDriverState *bs, int64_t sector_num,
-                          int nb_sectors, QEMUIOVector *qiov,
-                          int offset)
+int nbd_client_co_readv(BlockDriverState *bs, int64_t sector_num,
+                        int nb_sectors, QEMUIOVector *qiov)
 {
     NbdClientSession *client = nbd_get_client_session(bs);
     struct nbd_request request = { .type = NBD_CMD_READ };
     struct nbd_reply reply;
     ssize_t ret;

+    assert(nb_sectors <= NBD_MAX_SECTORS);
     request.from = sector_num * 512;
     request.len = nb_sectors * 512;

@@ -234,16 +234,15 @@ static int nbd_co_readv_1(BlockDriverState *bs, int64_t sector_num,
     if (ret < 0) {
         reply.error = -ret;
     } else {
-        nbd_co_receive_reply(client, &request, &reply, qiov, offset);
+        nbd_co_receive_reply(client, &request, &reply, qiov, 0);
     }
     nbd_coroutine_end(client, &request);
     return -reply.error;

 }

-static int nbd_co_writev_1(BlockDriverState *bs, int64_t sector_num,
-                           int nb_sectors, QEMUIOVector *qiov,
-                           int offset, int flags)
+int nbd_client_co_writev(BlockDriverState *bs, int64_t sector_num,
+                         int nb_sectors, QEMUIOVector *qiov, int flags)
 {
     NbdClientSession *client = nbd_get_client_session(bs);
     struct nbd_request request = { .type = NBD_CMD_WRITE };
@@ -255,11 +254,12 @@ static int nbd_co_writev_1(BlockDriverState *bs, int64_t sector_num,
         request.type |= NBD_CMD_FLAG_FUA;
     }

+    assert(nb_sectors <= NBD_MAX_SECTORS);
     request.from = sector_num * 512;
     request.len = nb_sectors * 512;

     nbd_coroutine_start(client, &request);
-    ret = nbd_co_send_request(bs, &request, qiov, offset);
+    ret = nbd_co_send_request(bs, &request, qiov, 0);
     if (ret < 0) {
         reply.error = -ret;
     } else {
@@ -269,41 +269,6 @@ static int nbd_co_writev_1(BlockDriverState *bs, int64_t sector_num,
     return -reply.error;
 }

-int nbd_client_co_readv(BlockDriverState *bs, int64_t sector_num,
-                        int nb_sectors, QEMUIOVector *qiov)
-{
-    int offset = 0;
-    int ret;
-    while (nb_sectors > NBD_MAX_SECTORS) {
-        ret = nbd_co_readv_1(bs, sector_num, NBD_MAX_SECTORS, qiov, offset);
-        if (ret < 0) {
-            return ret;
-        }
-        offset += NBD_MAX_SECTORS * 512;
-        sector_num += NBD_MAX_SECTORS;
-        nb_sectors -= NBD_MAX_SECTORS;
-    }
-    return nbd_co_readv_1(bs, sector_num, nb_sectors, qiov, offset);
-}
-
-int nbd_client_co_writev(BlockDriverState *bs, int64_t sector_num,
-                         int nb_sectors, QEMUIOVector *qiov, int flags)
-{
-    int offset = 0;
-    int ret;
-    while (nb_sectors > NBD_MAX_SECTORS) {
-        ret = nbd_co_writev_1(bs, sector_num, NBD_MAX_SECTORS, qiov, offset,
-                              flags);
-        if (ret < 0) {
-            return ret;
-        }
-        offset += NBD_MAX_SECTORS * 512;
-        sector_num += NBD_MAX_SECTORS;
-        nb_sectors -= NBD_MAX_SECTORS;
-    }
-    return nbd_co_writev_1(bs, sector_num, nb_sectors, qiov, offset, flags);
-}
-
 int nbd_client_co_flush(BlockDriverState *bs)
 {
     NbdClientSession *client = nbd_get_client_session(bs);
diff --git a/block/nbd.c b/block/nbd.c
index 08e5b67..8a13078 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -349,12 +349,6 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
     return ret;
 }

-static int nbd_co_readv(BlockDriverState *bs, int64_t sector_num,
-                        int nb_sectors, QEMUIOVector *qiov)
-{
-    return nbd_client_co_readv(bs, sector_num, nb_sectors, qiov);
-}
-
 static int nbd_co_flush(BlockDriverState *bs)
 {
     return nbd_client_co_flush(bs);
@@ -450,7 +444,7 @@ static BlockDriver bdrv_nbd = {
     .instance_size              = sizeof(BDRVNBDState),
     .bdrv_parse_filename        = nbd_parse_filename,
     .bdrv_file_open             = nbd_open,
-    .bdrv_co_readv              = nbd_co_readv,
+    .bdrv_co_readv              = nbd_client_co_readv,
     .bdrv_co_writev_flags       = nbd_client_co_writev,
     .bdrv_close                 = nbd_close,
     .bdrv_co_flush_to_os        = nbd_co_flush,
@@ -468,7 +462,7 @@ static BlockDriver bdrv_nbd_tcp = {
     .instance_size              = sizeof(BDRVNBDState),
     .bdrv_parse_filename        = nbd_parse_filename,
     .bdrv_file_open             = nbd_open,
-    .bdrv_co_readv              = nbd_co_readv,
+    .bdrv_co_readv              = nbd_client_co_readv,
     .bdrv_co_writev_flags       = nbd_client_co_writev,
     .bdrv_close                 = nbd_close,
     .bdrv_co_flush_to_os        = nbd_co_flush,
@@ -486,7 +480,7 @@ static BlockDriver bdrv_nbd_unix = {
     .instance_size              = sizeof(BDRVNBDState),
     .bdrv_parse_filename        = nbd_parse_filename,
     .bdrv_file_open             = nbd_open,
-    .bdrv_co_readv              = nbd_co_readv,
+    .bdrv_co_readv              = nbd_client_co_readv,
     .bdrv_co_writev_flags       = nbd_client_co_writev,
     .bdrv_close                 = nbd_close,
     .bdrv_co_flush_to_os        = nbd_co_flush,
-- 
2.5.5


* [Qemu-devel] [PATCH 5/5] nbd: Drop unused offset parameter
  2016-06-20 23:39 [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer Eric Blake
                   ` (3 preceding siblings ...)
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 4/5] nbd: Rely on block layer to break up large requests Eric Blake
@ 2016-06-20 23:39 ` Eric Blake
  2016-07-08 11:11   ` Kevin Wolf
  2016-06-21  3:19 ` [Qemu-devel] [PATCH 6/5] iscsi: Rely on block layer to break up large requests Eric Blake
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Eric Blake @ 2016-06-20 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, Paolo Bonzini, Max Reitz

Now that NBD relies on the block layer to fragment things, we no
longer need to track an offset argument for which fragment of
a request we are actually servicing.

While at it, use true and false instead of 0 and 1 for a bool
parameter.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 include/block/nbd.h |  1 -
 nbd/nbd-internal.h  |  4 ++--
 block/nbd-client.c  | 31 ++++++++++++++++---------------
 nbd/common.c        |  3 +--
 4 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index eeda3eb..503f514 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -89,7 +89,6 @@ enum {
 ssize_t nbd_wr_syncv(QIOChannel *ioc,
                      struct iovec *iov,
                      size_t niov,
-                     size_t offset,
                      size_t length,
                      bool do_read);
 int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint32_t *flags,
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 26a9f4d..93a6ca8 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -101,14 +101,14 @@ static inline ssize_t read_sync(QIOChannel *ioc, void *buffer, size_t size)
      * our request/reply.  Synchronization is done with recv_coroutine, so
      * that this is coroutine-safe.
      */
-    return nbd_wr_syncv(ioc, &iov, 1, 0, size, true);
+    return nbd_wr_syncv(ioc, &iov, 1, size, true);
 }

 static inline ssize_t write_sync(QIOChannel *ioc, void *buffer, size_t size)
 {
     struct iovec iov = { .iov_base = buffer, .iov_len = size };

-    return nbd_wr_syncv(ioc, &iov, 1, 0, size, false);
+    return nbd_wr_syncv(ioc, &iov, 1, size, false);
 }

 struct NBDTLSHandshakeData {
diff --git a/block/nbd-client.c b/block/nbd-client.c
index 9f023f8..5841a32 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -116,7 +116,7 @@ static void nbd_restart_write(void *opaque)

 static int nbd_co_send_request(BlockDriverState *bs,
                                struct nbd_request *request,
-                               QEMUIOVector *qiov, int offset)
+                               QEMUIOVector *qiov)
 {
     NbdClientSession *s = nbd_get_client_session(bs);
     AioContext *aio_context;
@@ -149,8 +149,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
         qio_channel_set_cork(s->ioc, true);
         rc = nbd_send_request(s->ioc, request);
         if (rc >= 0) {
-            ret = nbd_wr_syncv(s->ioc, qiov->iov, qiov->niov,
-                               offset, request->len, 0);
+            ret = nbd_wr_syncv(s->ioc, qiov->iov, qiov->niov, request->len,
+                               false);
             if (ret != request->len) {
                 rc = -EIO;
             }
@@ -167,8 +167,9 @@ static int nbd_co_send_request(BlockDriverState *bs,
 }

 static void nbd_co_receive_reply(NbdClientSession *s,
-    struct nbd_request *request, struct nbd_reply *reply,
-    QEMUIOVector *qiov, int offset)
+                                 struct nbd_request *request,
+                                 struct nbd_reply *reply,
+                                 QEMUIOVector *qiov)
 {
     int ret;

@@ -181,8 +182,8 @@ static void nbd_co_receive_reply(NbdClientSession *s,
         reply->error = EIO;
     } else {
         if (qiov && reply->error == 0) {
-            ret = nbd_wr_syncv(s->ioc, qiov->iov, qiov->niov,
-                               offset, request->len, 1);
+            ret = nbd_wr_syncv(s->ioc, qiov->iov, qiov->niov, request->len,
+                               true);
             if (ret != request->len) {
                 reply->error = EIO;
             }
@@ -230,11 +231,11 @@ int nbd_client_co_readv(BlockDriverState *bs, int64_t sector_num,
     request.len = nb_sectors * 512;

     nbd_coroutine_start(client, &request);
-    ret = nbd_co_send_request(bs, &request, NULL, 0);
+    ret = nbd_co_send_request(bs, &request, NULL);
     if (ret < 0) {
         reply.error = -ret;
     } else {
-        nbd_co_receive_reply(client, &request, &reply, qiov, 0);
+        nbd_co_receive_reply(client, &request, &reply, qiov);
     }
     nbd_coroutine_end(client, &request);
     return -reply.error;
@@ -259,11 +260,11 @@ int nbd_client_co_writev(BlockDriverState *bs, int64_t sector_num,
     request.len = nb_sectors * 512;

     nbd_coroutine_start(client, &request);
-    ret = nbd_co_send_request(bs, &request, qiov, 0);
+    ret = nbd_co_send_request(bs, &request, qiov);
     if (ret < 0) {
         reply.error = -ret;
     } else {
-        nbd_co_receive_reply(client, &request, &reply, NULL, 0);
+        nbd_co_receive_reply(client, &request, &reply, NULL);
     }
     nbd_coroutine_end(client, &request);
     return -reply.error;
@@ -284,11 +285,11 @@ int nbd_client_co_flush(BlockDriverState *bs)
     request.len = 0;

     nbd_coroutine_start(client, &request);
-    ret = nbd_co_send_request(bs, &request, NULL, 0);
+    ret = nbd_co_send_request(bs, &request, NULL);
     if (ret < 0) {
         reply.error = -ret;
     } else {
-        nbd_co_receive_reply(client, &request, &reply, NULL, 0);
+        nbd_co_receive_reply(client, &request, &reply, NULL);
     }
     nbd_coroutine_end(client, &request);
     return -reply.error;
@@ -309,11 +310,11 @@ int nbd_client_co_discard(BlockDriverState *bs, int64_t sector_num,
     request.len = nb_sectors * 512;

     nbd_coroutine_start(client, &request);
-    ret = nbd_co_send_request(bs, &request, NULL, 0);
+    ret = nbd_co_send_request(bs, &request, NULL);
     if (ret < 0) {
         reply.error = -ret;
     } else {
-        nbd_co_receive_reply(client, &request, &reply, NULL, 0);
+        nbd_co_receive_reply(client, &request, &reply, NULL);
     }
     nbd_coroutine_end(client, &request);
     return -reply.error;
diff --git a/nbd/common.c b/nbd/common.c
index 8ddb2dd..4d8e211 100644
--- a/nbd/common.c
+++ b/nbd/common.c
@@ -23,7 +23,6 @@
 ssize_t nbd_wr_syncv(QIOChannel *ioc,
                      struct iovec *iov,
                      size_t niov,
-                     size_t offset,
                      size_t length,
                      bool do_read)
 {
@@ -35,7 +34,7 @@ ssize_t nbd_wr_syncv(QIOChannel *ioc,

     nlocal_iov = iov_copy(local_iov, nlocal_iov,
                           iov, niov,
-                          offset, length);
+                          0, length);

     while (nlocal_iov > 0) {
         ssize_t len;
-- 
2.5.5


* [Qemu-devel] [PATCH 6/5] iscsi: Rely on block layer to break up large requests
  2016-06-20 23:39 [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer Eric Blake
                   ` (4 preceding siblings ...)
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 5/5] nbd: Drop unused offset parameter Eric Blake
@ 2016-06-21  3:19 ` Eric Blake
  2016-06-21  4:17 ` [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer Eric Blake
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Eric Blake @ 2016-06-21  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, qemu-block, Ronnie Sahlberg, Paolo Bonzini, Peter Lieven,
	Max Reitz

Now that the block layer honors max_transfer, we don't need to
bother with an EINVAL on overlarge requests, but can instead
assert that requests are well-behaved.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/iscsi.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 16d7522..0f027ce 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -473,11 +473,8 @@ iscsi_co_writev_flags(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
         return -EINVAL;
     }

-    if (bs->bl.max_transfer &&
-        nb_sectors << BDRV_SECTOR_BITS > bs->bl.max_transfer) {
-        error_report("iSCSI Error: Write of %d sectors exceeds max_xfer_len "
-                     "of %" PRIu32 " bytes", nb_sectors, bs->bl.max_transfer);
-        return -EINVAL;
+    if (bs->bl.max_transfer) {
+        assert(nb_sectors << BDRV_SECTOR_BITS <= bs->bl.max_transfer);
     }

     lba = sector_qemu2lun(sector_num, iscsilun);
@@ -651,11 +648,8 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs,
         return -EINVAL;
     }

-    if (bs->bl.max_transfer &&
-        nb_sectors << BDRV_SECTOR_BITS > bs->bl.max_transfer) {
-        error_report("iSCSI Error: Read of %d sectors exceeds max_xfer_len "
-                     "of %" PRIu32 " bytes", nb_sectors, bs->bl.max_transfer);
-        return -EINVAL;
+    if (bs->bl.max_transfer) {
+        assert(nb_sectors << BDRV_SECTOR_BITS <= bs->bl.max_transfer);
     }

     if (iscsilun->lbprz && nb_sectors >= ISCSI_CHECKALLOC_THRES &&
-- 
2.5.5


* Re: [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer
  2016-06-20 23:39 [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer Eric Blake
                   ` (5 preceding siblings ...)
  2016-06-21  3:19 ` [Qemu-devel] [PATCH 6/5] iscsi: Rely on block layer to break up large requests Eric Blake
@ 2016-06-21  4:17 ` Eric Blake
  2016-06-21 10:23 ` Stefan Hajnoczi
  2016-06-22  5:54 ` Fam Zheng
  8 siblings, 0 replies; 21+ messages in thread
From: Eric Blake @ 2016-06-21  4:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, qemu-block


On 06/20/2016 05:39 PM, Eric Blake wrote:
> We have max_transfer documented in BlockLimits, but while we
> honor it during pwrite_zeroes, we were blindly ignoring it
> during pwritev and preadv, leading to multiple drivers having
> to implement fragmentation themselves.  This series moves
> fragmentation to the block layer, then fixes the NBD driver to
> use it; if you like this but it needs a v2, you can request that
> I further do other drivers (I know at least iscsi and qcow2 do
> some self-fragmenting and/or error reporting that can be
> simplified by deferring fragmentation to the block layer).

iscsi turned out to be easy (see 6/5), but qcow2 is too tricky: we
still have to fragment in qcow2_alloc_cluster_offset() if cluster
allocation is not contiguous, so there is no single good value of
max_transfer to use.

> 
> Prerequisite: Kevin's block branch, plus my work on byte-based
> block limits (v2 at the moment):
> https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg04006.html
> 
> Also available as a tag at:
> git fetch git://repo.or.cz/qemu/ericb.git nbd-fragment-v1
> 
> Eric Blake (5):
>   block: Fragment reads to max transfer length
>   block: Fragment writes to max transfer length
>   raw_bsd: Don't advertise flags not supported by protocol layer
>   nbd: Rely on block layer to break up large requests
>   nbd: Drop unused offset parameter
> 
>  include/block/nbd.h |  1 -
>  nbd/nbd-internal.h  |  4 +--
>  block/io.c          | 84 +++++++++++++++++++++++++++++++++++++++--------------
>  block/nbd-client.c  | 78 ++++++++++++++-----------------------------------
>  block/nbd.c         | 12 ++------
>  block/raw_bsd.c     |  6 ++--
>  nbd/common.c        |  3 +-
>  7 files changed, 95 insertions(+), 93 deletions(-)
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer
  2016-06-20 23:39 [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer Eric Blake
                   ` (6 preceding siblings ...)
  2016-06-21  4:17 ` [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer Eric Blake
@ 2016-06-21 10:23 ` Stefan Hajnoczi
  2016-06-21 10:43   ` Kevin Wolf
  2016-06-21 22:05   ` Eric Blake
  2016-06-22  5:54 ` Fam Zheng
  8 siblings, 2 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2016-06-21 10:23 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, kwolf, qemu-block

On Mon, Jun 20, 2016 at 05:39:24PM -0600, Eric Blake wrote:
> We have max_transfer documented in BlockLimits, but while we
> honor it during pwrite_zeroes, we were blindly ignoring it
> during pwritev and preadv, leading to multiple drivers having
> to implement fragmentation themselves.  This series moves
> fragmentation to the block layer, then fixes the NBD driver to
> use it; if you like this but it needs a v2, you can request that
> I further do other drivers (I know at least iscsi and qcow2 do
> some self-fragmenting and/or error reporting that can be
> simplified by deferring fragmentation to the block layer).
> 
> Prerequisite: Kevin's block branch, plus my work on byte-based
> block limits (v2 at the moment):
> https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg04006.html
> 
> Also available as a tag at:
> git fetch git://repo.or.cz/qemu/ericb.git nbd-fragment-v1
> 
> Eric Blake (5):
>   block: Fragment reads to max transfer length
>   block: Fragment writes to max transfer length
>   raw_bsd: Don't advertise flags not supported by protocol layer
>   nbd: Rely on block layer to break up large requests
>   nbd: Drop unused offset parameter
> 
>  include/block/nbd.h |  1 -
>  nbd/nbd-internal.h  |  4 +--
>  block/io.c          | 84 +++++++++++++++++++++++++++++++++++++++--------------
>  block/nbd-client.c  | 78 ++++++++++++++-----------------------------------
>  block/nbd.c         | 12 ++------
>  block/raw_bsd.c     |  6 ++--
>  nbd/common.c        |  3 +-
>  7 files changed, 95 insertions(+), 93 deletions(-)

I'm concerned that requests A & B which should be atomic can now be
interleaved.

For example, two writes that are overlapping and fragmented.
Applications expect to either see A or B on disk when both requests have
completed.  Fragmentation must serialize overlapping requests in order
to prevent interleaved results where the application sees some of A and
some of B when both requests have completed.

A similar scenario happens when A is a read and B is a write, too.  Read
A is supposed to see either "before B" or "after B".  With fragmentation
it can see "some of before B and some of after B".


* Re: [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer
  2016-06-21 10:23 ` Stefan Hajnoczi
@ 2016-06-21 10:43   ` Kevin Wolf
  2016-06-22 11:41     ` Stefan Hajnoczi
  2016-06-21 22:05   ` Eric Blake
  1 sibling, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2016-06-21 10:43 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Eric Blake, qemu-devel, qemu-block

Am 21.06.2016 um 12:23 hat Stefan Hajnoczi geschrieben:
> On Mon, Jun 20, 2016 at 05:39:24PM -0600, Eric Blake wrote:
> > We have max_transfer documented in BlockLimits, but while we
> > honor it during pwrite_zeroes, we were blindly ignoring it
> > during pwritev and preadv, leading to multiple drivers having
> > to implement fragmentation themselves.  This series moves
> > fragmentation to the block layer, then fixes the NBD driver to
> > use it; if you like this but it needs a v2, you can request that
> > I further do other drivers (I know at least iscsi and qcow2 do
> > some self-fragmenting and/or error reporting that can be
> > simplified by deferring fragmentation to the block layer).
> 
> I'm concerned that requests A & B which should be atomic can now be
> interleaved.

I don't think there is any guarantee of atomicity for overlapping
requests, at least not with more than a single sector (logical block
size, not BDRV_SECTOR_SIZE).

That is, as far as I know neither hardware nor the Linux kernel nor the
qemu block layer (image formats fragment all the time!) protects against
this. If you have concurrent overlapping requests, you always get
undefined behaviour.

> For example, two writes that are overlapping and fragmented.
> Applications expect to either see A or B on disk when both requests have
> completed.  Fragmentation must serialize overlapping requests in order
> to prevent interleaved results where the application sees some of A and
> some of B when both requests have completed.
> 
> A similar scenario happens when A is a read and B is a write, too.  Read
> A is supposed to see either "before B" or "after B".  With fragmentation
> it can see "some of before B and some of after B".

If we wanted to achieve this semantics, it would be easy enough: Add a
mark_request_serialising() in the right place. But I'm pretty sure that
this isn't needed.

Kevin


* Re: [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer
  2016-06-21 10:23 ` Stefan Hajnoczi
  2016-06-21 10:43   ` Kevin Wolf
@ 2016-06-21 22:05   ` Eric Blake
  2016-06-22 11:41     ` Stefan Hajnoczi
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Blake @ 2016-06-21 22:05 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, kwolf, qemu-block

On 06/21/2016 04:23 AM, Stefan Hajnoczi wrote:
> On Mon, Jun 20, 2016 at 05:39:24PM -0600, Eric Blake wrote:
>> We have max_transfer documented in BlockLimits, but while we
>> honor it during pwrite_zeroes, we were blindly ignoring it
>> during pwritev and preadv, leading to multiple drivers having
>> to implement fragmentation themselves.  This series moves
>> fragmentation to the block layer, then fixes the NBD driver to
>> use it; if you like this but it needs a v2, you can request that
>> I further do other drivers (I know at least iscsi and qcow2 do
>> some self-fragmenting and/or error reporting that can be
>> simplified by deferring fragmentation to the block layer).
>>

> I'm concerned that requests A & B which should be atomic can now be
> interleaved.
> 
> For example, two writes that are overlapping and fragmented.
> Applications expect to either see A or B on disk when both requests have
> completed.  Fragmentation must serialize overlapping requests in order
> to prevent interleaved results where the application sees some of A and
> some of B when both requests have completed.
> 
> A similar scenario happens when A is a read and B is a write, too.  Read
> A is supposed to see either "before B" or "after B".  With fragmentation
> it can see "some of before B and some of after B".

This patch series doesn't change that, it just changes where the
atomicity is broken.  Consider that pre-patch, a raw format wrapping an
NBD protocol connection would see a 32M max_transfer, but we were not
fragmenting at the raw protocol.  Thus, if someone requests a 40M
operation, the raw protocol treats it as a single transaction, and NBD
then fragments it into two transactions over the wire; and NBD can
complete transactions out of order.  All this series does is move the
fragmentation to the raw layer, rather than the NBD layer, but the race
between overlapping oversize requests is present whether or not the
series is applied.

> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer
  2016-06-20 23:39 [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer Eric Blake
                   ` (7 preceding siblings ...)
  2016-06-21 10:23 ` Stefan Hajnoczi
@ 2016-06-22  5:54 ` Fam Zheng
  2016-07-06  2:04   ` Eric Blake
  8 siblings, 1 reply; 21+ messages in thread
From: Fam Zheng @ 2016-06-22  5:54 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, kwolf, qemu-block

On Mon, 06/20 17:39, Eric Blake wrote:
> We have max_transfer documented in BlockLimits, but while we
> honor it during pwrite_zeroes, we were blindly ignoring it
> during pwritev and preadv, leading to multiple drivers having
> to implement fragmentation themselves.  This series moves
> fragmentation to the block layer, then fixes the NBD driver to
> use it; if you like this but it needs a v2, you can request that
> I further do other drivers (I know at least iscsi and qcow2 do
> some self-fragmenting and/or error reporting that can be
> simplified by deferring fragmentation to the block layer).
> 
> Prerequisite: Kevin's block branch, plus my work on byte-based
> block limits (v2 at the moment):
> https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg04006.html
> 
> Also available as a tag at:
> git fetch git://repo.or.cz/qemu/ericb.git nbd-fragment-v1

Patches 1-6:

Reviewed-by: Fam Zheng <famz@redhat.com>


* Re: [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer
  2016-06-21 10:43   ` Kevin Wolf
@ 2016-06-22 11:41     ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2016-06-22 11:41 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Eric Blake, qemu-devel, qemu-block

On Tue, Jun 21, 2016 at 12:43:08PM +0200, Kevin Wolf wrote:
> Am 21.06.2016 um 12:23 hat Stefan Hajnoczi geschrieben:
> > On Mon, Jun 20, 2016 at 05:39:24PM -0600, Eric Blake wrote:
> > > We have max_transfer documented in BlockLimits, but while we
> > > honor it during pwrite_zeroes, we were blindly ignoring it
> > > during pwritev and preadv, leading to multiple drivers having
> > > to implement fragmentation themselves.  This series moves
> > > fragmentation to the block layer, then fixes the NBD driver to
> > > use it; if you like this but it needs a v2, you can request that
> > > I further do other drivers (I know at least iscsi and qcow2 do
> > > some self-fragmenting and/or error reporting that can be
> > > simplified by deferring fragmentation to the block layer).
> > 
> > I'm concerned that requests A & B which should be atomic can now be
> > interleaved.
> 
> I don't think there is any guarantee of atomicity for overlapping
> requests, at least not with more than a single sector (logical block
> size, not BDRV_SECTOR_SIZE).
> 
> That is, as far as I know neither hardware nor the Linux kernel nor the
> qemu block layer (image formats fragment all the time!) protects against
> this. If you have concurrent overlapping requests, you always get
> undefined behaviour.

Kevin, I think you are right and my mental model was wrong.

I spent some time looking at SCSI SAM and SBC documents.  The closest
information I found was SBC "4.27 Model for uninterrupted sequences on LBA
ranges" where ORWRITE, XDWRITEREAD, XPWRITE, and COMPARE AND WRITE are
singled out.  It implies that the device server may access LBAs while
other types of commands are executing.

If anyone knows a better reference in the SCSI specifications that
clarifies behavior of overlapping I/O requests, please share!

Stefan


* Re: [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer
  2016-06-21 22:05   ` Eric Blake
@ 2016-06-22 11:41     ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2016-06-22 11:41 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, kwolf, qemu-block

On Tue, Jun 21, 2016 at 04:05:22PM -0600, Eric Blake wrote:
> On 06/21/2016 04:23 AM, Stefan Hajnoczi wrote:
> > On Mon, Jun 20, 2016 at 05:39:24PM -0600, Eric Blake wrote:
> >> We have max_transfer documented in BlockLimits, but while we
> >> honor it during pwrite_zeroes, we were blindly ignoring it
> >> during pwritev and preadv, leading to multiple drivers having
> >> to implement fragmentation themselves.  This series moves
> >> fragmentation to the block layer, then fixes the NBD driver to
> >> use it; if you like this but it needs a v2, you can request that
> >> I further do other drivers (I know at least iscsi and qcow2 do
> >> some self-fragmenting and/or error reporting that can be
> >> simplified by deferring fragmentation to the block layer).
> >>
> 
> > I'm concerned that requests A & B which should be atomic can now be
> > interleaved.
> > 
> > For example, two writes that are overlapping and fragmented.
> > Applications expect to either see A or B on disk when both requests have
> > completed.  Fragmentation must serialize overlapping requests in order
> > to prevent interleaved results where the application sees some of A and
> > some of B when both requests have completed.
> > 
> > A similar scenario happens when A is a read and B is a write, too.  Read
> > A is supposed to see either "before B" or "after B".  With fragmentation
> > it can see "some of before B and some of after B".
> 
> This patch series doesn't change that, it just changes where the
> atomicity is broken.

I know but I needed to clarify what the right semantics are :).

Stefan


* Re: [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer
  2016-06-22  5:54 ` Fam Zheng
@ 2016-07-06  2:04   ` Eric Blake
  2016-07-08 11:15     ` Kevin Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Blake @ 2016-07-06  2:04 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, kwolf, qemu-block

On 06/21/2016 11:54 PM, Fam Zheng wrote:
> On Mon, 06/20 17:39, Eric Blake wrote:
>> We have max_transfer documented in BlockLimits, but while we
>> honor it during pwrite_zeroes, we were blindly ignoring it
>> during pwritev and preadv, leading to multiple drivers having
>> to implement fragmentation themselves.  This series moves
>> fragmentation to the block layer, then fixes the NBD driver to
>> use it; if you like this but it needs a v2, you can request that
>> I further do other drivers (I know at least iscsi and qcow2 do
>> some self-fragmenting and/or error reporting that can be
>> simplified by deferring fragmentation to the block layer).
>>
>> Prerequisite: Kevin's block branch, plus my work on byte-based
>> block limits (v2 at the moment):
>> https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg04006.html
>>
>> Also available as a tag at:
>> git fetch git://repo.or.cz/qemu/ericb.git nbd-fragment-v1
> 
> Patches 1-6:
> 
> Reviewed-by: Fam Zheng <famz@redhat.com>

ping - series still applies to latest master without tweaks

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [PATCH 1/5] block: Fragment reads to max transfer length
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 1/5] block: Fragment reads to max transfer length Eric Blake
@ 2016-07-08 10:56   ` Kevin Wolf
  2016-07-08 14:31     ` Eric Blake
  0 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2016-07-08 10:56 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block, Stefan Hajnoczi, Fam Zheng, Max Reitz

Am 21.06.2016 um 01:39 hat Eric Blake geschrieben:
> Drivers should be able to rely on the block layer honoring the
> max transfer length, rather than needing to return -EINVAL
> (iscsi) or manually fragment things (nbd).  This patch adds
> the fragmentation in the block layer, after requests have been
> aligned (fragmenting before alignment would lead to multiple
> unaligned requests, rather than just the head and tail).
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>

Doesn't build for me:

block/io.c: In function 'bdrv_aligned_preadv':
block/io.c:1071:5: error: 'ret' may be used uninitialized in this function [-Werror=maybe-uninitialized]
     return ret;

Kevin


* Re: [Qemu-devel] [PATCH 3/5] raw_bsd: Don't advertise flags not supported by protocol layer
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 3/5] raw_bsd: Don't advertise flags not supported by protocol layer Eric Blake
@ 2016-07-08 11:05   ` Kevin Wolf
  2016-07-08 14:32     ` Eric Blake
  0 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2016-07-08 11:05 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block, Max Reitz

Am 21.06.2016 um 01:39 hat Eric Blake geschrieben:
> The raw format layer supports all flags via passthrough - but
> it only makes sense to pass through flags that the lower layer
> actually supports.
> 
> Thanks to the previous patch, the raw format layer now attempts
> to fragment writes at the max_transfer limit it inherits from
> the NBD protocol layer, recently set to 32m.  An attempt to do
> 'w -f 0 40m' to an NBD server that lacks FUA thus changed from
> flushing once (after NBD fragmented a single 40m write itself)
> to instead flushing twice (the format layer sees BDRV_REQ_FUA
> in supported_write_flags, so it sends the flag on to both
> fragments, and then the block layer emulates FUA by flushing
> for both the 32m and 8m fragments at the protocol layer).
> This patch fixes the performance regression (now that the
> format layer no longer advertises a flag not present at the
> protocol layer, the flush to emulate FUA is deferred to the
> last fragment).
> 
> Note that 'w -f -z 0 40m' does not currently exhibit the same
> problem, because there, the fragmentation does not occur until
> at the NBD layer (the raw layer has .bdrv_co_pwrite_zeroes, and
> the NBD layer doesn't advertise max_pwrite_zeroes to constrain
> things at the raw layer) - but that problem is latent and would
> have the same problem with too many flushes without this patch
> once the NBD layer implements support for using the new
> NBD_CMD_WRITE_ZEROES and sets max_pwrite_zeroes to the same 32m
> limit as recommended by the NBD protocol.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>

Should this be moved before patch 2 so that we never get a regression in
the first place?

Kevin


* Re: [Qemu-devel] [PATCH 5/5] nbd: Drop unused offset parameter
  2016-06-20 23:39 ` [Qemu-devel] [PATCH 5/5] nbd: Drop unused offset parameter Eric Blake
@ 2016-07-08 11:11   ` Kevin Wolf
  0 siblings, 0 replies; 21+ messages in thread
From: Kevin Wolf @ 2016-07-08 11:11 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block, Paolo Bonzini, Max Reitz

Am 21.06.2016 um 01:39 hat Eric Blake geschrieben:
> Now that NBD relies on the block layer to fragment things, we no
> longer need to track an offset argument for which fragment of
> a request we are actually servicing.
> 
> While at it, use true and false instead of 0 and 1 for a bool
> parameter.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>

> @@ -35,7 +34,7 @@ ssize_t nbd_wr_syncv(QIOChannel *ioc,
> 
>      nlocal_iov = iov_copy(local_iov, nlocal_iov,
>                            iov, niov,
> -                          offset, length);
> +                          0, length);

This statement fits in a single line instead of three.

Kevin


* Re: [Qemu-devel] [PATCH 0/5] Auto-fragment large transactions at the block layer
  2016-07-06  2:04   ` Eric Blake
@ 2016-07-08 11:15     ` Kevin Wolf
  0 siblings, 0 replies; 21+ messages in thread
From: Kevin Wolf @ 2016-07-08 11:15 UTC (permalink / raw)
  To: Eric Blake; +Cc: Fam Zheng, qemu-devel, qemu-block

Am 06.07.2016 um 04:04 hat Eric Blake geschrieben:
> On 06/21/2016 11:54 PM, Fam Zheng wrote:
> > On Mon, 06/20 17:39, Eric Blake wrote:
> >> We have max_transfer documented in BlockLimits, but while we
> >> honor it during pwrite_zeroes, we were blindly ignoring it
> >> during pwritev and preadv, leading to multiple drivers having
> >> to implement fragmentation themselves.  This series moves
> >> fragmentation to the block layer, then fixes the NBD driver to
> >> use it; if you like this but it needs a v2, you can request that
> >> I further do other drivers (I know at least iscsi and qcow2 do
> >> some self-fragmenting and/or error reporting that can be
> >> simplified by deferring fragmentation to the block layer).
> >>
> >> Prerequisite: Kevin's block branch, plus my work on byte-based
> >> block limits (v2 at the moment):
> >> https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg04006.html
> >>
> >> Also available as a tag at:
> >> git fetch git://repo.or.cz/qemu/ericb.git nbd-fragment-v1
> > 
> > Patches 1-6:
> > 
> > Reviewed-by: Fam Zheng <famz@redhat.com>
> 
> ping - series still applies to latest master without tweaks

Apart from the build fix and the minor comments I made, this looks good
to me.

Kevin


* Re: [Qemu-devel] [PATCH 1/5] block: Fragment reads to max transfer length
  2016-07-08 10:56   ` Kevin Wolf
@ 2016-07-08 14:31     ` Eric Blake
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Blake @ 2016-07-08 14:31 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, Stefan Hajnoczi, Fam Zheng, Max Reitz

On 07/08/2016 04:56 AM, Kevin Wolf wrote:
> Am 21.06.2016 um 01:39 hat Eric Blake geschrieben:
>> Drivers should be able to rely on the block layer honoring the
>> max transfer length, rather than needing to return -EINVAL
>> (iscsi) or manually fragment things (nbd).  This patch adds
>> the fragmentation in the block layer, after requests have been
>> aligned (fragmenting before alignment would lead to multiple
>> unaligned requests, rather than just the head and tail).
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> Doesn't build for me:
> 
> block/io.c: In function 'bdrv_aligned_preadv':
> block/io.c:1071:5: error: 'ret' may be used uninitialized in this function [-Werror=maybe-uninitialized]
>      return ret;

One of those annoying problems detected at -O2 but not at -O0.  I'll respin.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [PATCH 3/5] raw_bsd: Don't advertise flags not supported by protocol layer
  2016-07-08 11:05   ` Kevin Wolf
@ 2016-07-08 14:32     ` Eric Blake
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Blake @ 2016-07-08 14:32 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, Max Reitz

On 07/08/2016 05:05 AM, Kevin Wolf wrote:
> Am 21.06.2016 um 01:39 hat Eric Blake geschrieben:
>> The raw format layer supports all flags via passthrough - but
>> it only makes sense to pass through flags that the lower layer
>> actually supports.
>>
>> Thanks to the previous patch, the raw format layer now attempts
>> to fragment writes at the max_transfer limit it inherits from
>> the NBD protocol layer, recently set to 32m.  An attempt to do
>> 'w -f 0 40m' to an NBD server that lacks FUA thus changed from
>> flushing once (after NBD fragmented a single 40m write itself)
>> to instead flushing twice (the format layer sees BDRV_REQ_FUA
>> in supported_write_flags, so it sends the flag on to both
>> fragments, and then the block layer emulates FUA by flushing
>> for both the 32m and 8m fragments at the protocol layer).
>> This patch fixes the performance regression (now that the
>> format layer no longer advertises a flag not present at the
>> protocol layer, the flush to emulate FUA is deferred to the
>> last fragment).
>>
>> Note that 'w -f -z 0 40m' does not currently exhibit the same
>> problem, because there, the fragmentation does not occur until
>> at the NBD layer (the raw layer has .bdrv_co_pwrite_zeroes, and
>> the NBD layer doesn't advertise max_pwrite_zeroes to constrain
>> things at the raw layer) - but that problem is latent and would
>> have the same problem with too many flushes without this patch
>> once the NBD layer implements support for using the new
>> NBD_CMD_WRITE_ZEROES and sets max_pwrite_zeroes to the same 32m
>> limit as recommended by the NBD protocol.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> Should this be moved before patch 2 so that we never get a regression in
> the first place?

Can do, although it will require some word-smithing to the commit message.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


