All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests
@ 2016-11-17 20:13 Eric Blake
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 1/9] nbd: Allow unmap and fua during write zeroes Eric Blake
                   ` (9 more replies)
  0 siblings, 10 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-17 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, pbonzini

The first 5 patches are candidates for inclusion in 2.8, since
they fix regressions and/or bugs exposed by the second half.
The last 4 patches are a new feature in blkdebug, and as such,
probably belong better in 2.9.  However, I wrote the patches by
swapping the order (last four patches first, and proving that the
new iotest fails, then the first five patches fix the failures),
which should give us some better assurance that the first half is
safe for inclusion during hard freeze.

Patches 3-5 have been on the list previously (hence calling this
series v2), although as two separate threads; patch 5 has been
improved since its last posting.  Patches 1-2 and 6-9 are new.

Eric Blake (9):
  nbd: Allow unmap and fua during write zeroes
  qcow2: Inform block layer about discard boundaries
  block: Let write zeroes fallback work even with small max_transfer
  block: Return -ENOTSUP rather than assert on unaligned discards
  block: Pass unaligned discard requests to drivers
  blkdebug: Sanity check block layer guarantees
  blkdebug: Add pass-through write_zero and discard support
  blkdebug: Add ability to override unmap geometries
  tests: Add coverage for recent block geometry fixes

 qapi/block-core.json       |  27 ++++++-
 block/blkdebug.c           | 176 ++++++++++++++++++++++++++++++++++++++++++++-
 block/io.c                 |  58 ++++++++++-----
 block/iscsi.c              |   4 +-
 block/nbd-client.c         |   4 ++
 block/qcow2.c              |   6 ++
 block/sheepdog.c           |   5 +-
 tests/qemu-iotests/173     | 115 +++++++++++++++++++++++++++++
 tests/qemu-iotests/173.out |  49 +++++++++++++
 tests/qemu-iotests/group   |   1 +
 10 files changed, 421 insertions(+), 24 deletions(-)
 create mode 100755 tests/qemu-iotests/173
 create mode 100644 tests/qemu-iotests/173.out

-- 
2.7.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 1/9] nbd: Allow unmap and fua during write zeroes
  2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
@ 2016-11-17 20:13 ` Eric Blake
  2016-11-17 21:10   ` Max Reitz
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 2/9] qcow2: Inform block layer about discard boundaries Eric Blake
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: Eric Blake @ 2016-11-17 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, pbonzini, qemu-stable, Max Reitz

Commit fa778fff wired up support to send the NBD_CMD_WRITE_ZEROES,
but forgot to inform the block layer that FUA unmapping of zeroes is
supported.  Without BDRV_REQ_MAY_UNMAP listed as a supported flag,
the block layer will always insist on the NBD layer passing
NBD_CMD_FLAG_NO_HOLE, resulting in the server always allocating
things even when it was desired to let the server punch holes.
Similarly, failing to set BDRV_REQ_FUA means that the client may
send unnecessary NBD_CMD_FLUSH when it could have instead used the
NBD_CMD_FLAG_FUA bit.

CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/nbd-client.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 2a302de..3779c6c 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -415,6 +415,10 @@ int nbd_client_init(BlockDriverState *bs,
     }
     if (client->nbdflags & NBD_FLAG_SEND_FUA) {
         bs->supported_write_flags = BDRV_REQ_FUA;
+        bs->supported_zero_flags |= BDRV_REQ_FUA;
+    }
+    if (client->nbdflags & NBD_FLAG_SEND_WRITE_ZEROES) {
+        bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP;
     }

     qemu_co_mutex_init(&client->send_mutex);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 2/9] qcow2: Inform block layer about discard boundaries
  2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 1/9] nbd: Allow unmap and fua during write zeroes Eric Blake
@ 2016-11-17 20:13 ` Eric Blake
  2016-11-17 21:24   ` Max Reitz
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v3 3/9] block: Let write zeroes fallback work even with small max_transfer Eric Blake
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: Eric Blake @ 2016-11-17 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, pbonzini, qemu-stable, Max Reitz

At the qcow2 layer, discard is only possible on a per-cluster
basis; at the moment, qcow2 silently rounds any unaligned
requests to this granularity.  However, an upcoming patch will
fix a regression in the block layer ignoring too much of an
unaligned discard request, by changing the block layer to
break up a discard request at alignment boundaries; for that
to work, the block layer must know about our limits.

However, we can't go one step further by changing
qcow2_discard_clusters() to assert that requests are always
aligned, since that helper function is reached on paths
outside of the block layer.

CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/qcow2.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 6d5689a..e22f6dc 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1206,6 +1206,7 @@ static void qcow2_refresh_limits(BlockDriverState *bs, Error **errp)
         bs->bl.request_alignment = BDRV_SECTOR_SIZE;
     }
     bs->bl.pwrite_zeroes_alignment = s->cluster_size;
+    bs->bl.pdiscard_alignment = s->cluster_size;
 }

 static int qcow2_set_key(BlockDriverState *bs, const char *key)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v3 3/9] block: Let write zeroes fallback work even with small max_transfer
  2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 1/9] nbd: Allow unmap and fua during write zeroes Eric Blake
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 2/9] qcow2: Inform block layer about discard boundaries Eric Blake
@ 2016-11-17 20:13 ` Eric Blake
  2016-11-17 21:40   ` Max Reitz
  2016-11-22 13:16   ` Kevin Wolf
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 4/9] block: Return -ENOTSUP rather than assert on unaligned discards Eric Blake
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-17 20:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Denis V . Lunev,
	Stefan Hajnoczi, Fam Zheng, Max Reitz

Commit 443668ca rewrote the write_zeroes logic to guarantee that
an unaligned request never crosses a cluster boundary.  But
in the rewrite, the new code assumed that at most one iteration
would be needed to get to an alignment boundary.

However, it is easy to trigger an assertion failure: the Linux
kernel limits loopback devices to advertise a max_transfer of
only 64k.  Any operation that requires falling back to writes
rather than more efficient zeroing must obey max_transfer during
that fallback, which means an unaligned head may require multiple
iterations of the write fallbacks before reaching the aligned
boundaries, when layering a format with clusters larger than 64k
atop the protocol of file access to a loopback device.

Test case:

$ qemu-img create -f qcow2 -o cluster_size=1M file 10M
$ losetup /dev/loop2 /path/to/file
$ qemu-io -f qcow2 /dev/loop2
qemu-io> w 7m 1k
qemu-io> w -z 8003584 2093056

In fairness to Denis (as the original listed author of the culprit
commit), the faulty logic for at most one iteration is probably all
my fault in reworking his idea.  But the solution is to restore what
was in place prior to that commit: when dealing with an unaligned
head or tail, iterate as many times as necessary while fragmenting
the operation at max_transfer boundaries.

Reported-by: Ed Swierk <eswierk@skyportsystems.com>
CC: qemu-stable@nongnu.org
CC: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/io.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index aa532a5..085ac34 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1214,6 +1214,8 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
     int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
                         bs->bl.request_alignment);
+    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+                                    MAX_WRITE_ZEROES_BOUNCE_BUFFER);

     assert(alignment % bs->bl.request_alignment == 0);
     head = offset % alignment;
@@ -1229,9 +1231,12 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
          * boundaries.
          */
         if (head) {
-            /* Make a small request up to the first aligned sector.  */
-            num = MIN(count, alignment - head);
-            head = 0;
+            /* Make a small request up to the first aligned sector. For
+             * convenience, limit this request to max_transfer even if
+             * we don't need to fall back to writes.  */
+            num = MIN(MIN(count, max_transfer), alignment - head);
+            head = (head + num) % alignment;
+            assert(num < max_write_zeroes);
         } else if (tail && num > alignment) {
             /* Shorten the request to the last aligned sector.  */
             num -= tail;
@@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,

         if (ret == -ENOTSUP) {
             /* Fall back to bounce buffer if write zeroes is unsupported */
-            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
-                                            MAX_WRITE_ZEROES_BOUNCE_BUFFER);
             BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;

             if ((flags & BDRV_REQ_FUA) &&
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 4/9] block: Return -ENOTSUP rather than assert on unaligned discards
  2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
                   ` (2 preceding siblings ...)
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v3 3/9] block: Let write zeroes fallback work even with small max_transfer Eric Blake
@ 2016-11-17 20:13 ` Eric Blake
  2016-11-17 22:01   ` Max Reitz
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers Eric Blake
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: Eric Blake @ 2016-11-17 20:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Ronnie Sahlberg,
	Peter Lieven, Max Reitz, Hitoshi Mitake, Liu Yuan, Jeff Cody,
	open list:Sheepdog

Right now, the block layer rounds discard requests, so that
individual drivers are able to assert that discard requests
will never be unaligned.  But there are some ISCSI devices
that track and coalesce multiple unaligned requests, turning it
into an actual discard if the requests eventually cover an
entire page, which implies that it is better to always pass
discard requests as low down the stack as possible.

In isolation, this patch has no semantic effect, since the
block layer currently never passes an unaligned request through.
But the block layer already has code that silently ignores
drivers that return -ENOTSUP for a discard request that cannot
be honored (as well as drivers that return 0 even when nothing
was done).  But the next patch will update the block layer to
fragment discard requests, so that clients are guaranteed that
they are either dealing with an unaligned head or tail, or an
aligned core, making it similar to the block layer semantics of
write zero fragmentation.

CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/iscsi.c    | 4 +++-
 block/qcow2.c    | 5 +++++
 block/sheepdog.c | 5 +++--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 71bd523..0960929 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1083,7 +1083,9 @@ coroutine_fn iscsi_co_pdiscard(BlockDriverState *bs, int64_t offset, int count)
     struct IscsiTask iTask;
     struct unmap_list list;

-    assert(is_byte_request_lun_aligned(offset, count, iscsilun));
+    if (!is_byte_request_lun_aligned(offset, count, iscsilun)) {
+        return -ENOTSUP;
+    }

     if (!iscsilun->lbp.lbpu) {
         /* UNMAP is not supported by the target */
diff --git a/block/qcow2.c b/block/qcow2.c
index e22f6dc..7cfcd84 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2491,6 +2491,11 @@ static coroutine_fn int qcow2_co_pdiscard(BlockDriverState *bs,
     int ret;
     BDRVQcow2State *s = bs->opaque;

+    if (!QEMU_IS_ALIGNED(offset | count, s->cluster_size)) {
+        assert(count < s->cluster_size);
+        return -ENOTSUP;
+    }
+
     qemu_co_mutex_lock(&s->lock);
     ret = qcow2_discard_clusters(bs, offset, count >> BDRV_SECTOR_BITS,
                                  QCOW2_DISCARD_REQUEST, false);
diff --git a/block/sheepdog.c b/block/sheepdog.c
index 1fb9173..4c9af89 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -2829,8 +2829,9 @@ static coroutine_fn int sd_co_pdiscard(BlockDriverState *bs, int64_t offset,
     iov.iov_len = sizeof(zero);
     discard_iov.iov = &iov;
     discard_iov.niov = 1;
-    assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
-    assert((count & (BDRV_SECTOR_SIZE - 1)) == 0);
+    if (!QEMU_IS_ALIGNED(offset | count, BDRV_SECTOR_SIZE)) {
+        return -ENOTSUP;
+    }
     acb = sd_aio_setup(bs, &discard_iov, offset >> BDRV_SECTOR_BITS,
                        count >> BDRV_SECTOR_BITS);
     acb->aiocb_type = AIOCB_DISCARD_OBJ;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
  2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
                   ` (3 preceding siblings ...)
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 4/9] block: Return -ENOTSUP rather than assert on unaligned discards Eric Blake
@ 2016-11-17 20:13 ` Eric Blake
  2016-11-17 22:26   ` Max Reitz
                     ` (2 more replies)
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 6/9] blkdebug: Sanity check block layer guarantees Eric Blake
                   ` (4 subsequent siblings)
  9 siblings, 3 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-17 20:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Stefan Hajnoczi,
	Fam Zheng, Max Reitz

Discard is advisory, so rounding the requests to alignment
boundaries is never semantically wrong from the data that
the guest sees.  But at least the Dell Equallogic iSCSI SANs
has an interesting property that its advertised discard
alignment is 15M, yet documents that discarding a sequence
of 1M slices will eventually result in the 15M page being
marked as discarded, and it is possible to observe which
pages have been discarded.

Between commits 9f1963b and b8d0a980, we converted the block
layer to a byte-based interface that ultimately ignores any
unaligned head or tail based on the driver's advertised
discard granularity, which means that qemu 2.7 refuses to
pass any discard request smaller than 15M down to the Dell
Equallogic hardware.  This is a slight regression in behavior
compared to earlier qemu, where a guest executing discards
in power-of-2 chunks used to be able to get every page
discarded, but is now left with various pages still allocated
because the guest requests did not align with the hardware's
15M pages.

Since the SCSI specification says nothing about a minimum
discard granularity, and only documents the preferred
alignment, it is best if the block layer gives the driver
every bit of information about discard requests, rather than
rounding it to alignment boundaries early.

Rework the block layer discard algorithm to mirror the write
zero algorithm: always peel off any unaligned head or tail
and manage that in isolation, then do the bulk of the request
on an aligned boundary.  The fallback when the driver returns
-ENOTSUP for an unaligned request is to silently ignore that
portion of the discard request; but for devices that can pass
the partial request all the way down to hardware, this can
result in the hardware coalescing requests and discarding
aligned pages after all.

Reported by: Peter Lieven <pl@kamp.de>
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>

---
v2: Split at more boundaries.
---
 block/io.c | 45 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/block/io.c b/block/io.c
index 085ac34..4f00562 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
 {
     BdrvTrackedRequest req;
     int max_pdiscard, ret;
-    int head, align;
+    int head, tail, align;

     if (!bs->drv) {
         return -ENOMEDIUM;
@@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
         return 0;
     }

-    /* Discard is advisory, so ignore any unaligned head or tail */
+    /* Discard is advisory, but some devices track and coalesce
+     * unaligned requests, so we must pass everything down rather than
+     * round here.  Still, most devices will just silently ignore
+     * unaligned requests (by returning -ENOTSUP), so we must fragment
+     * the request accordingly.  */
     align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
     assert(align % bs->bl.request_alignment == 0);
     head = offset % align;
-    if (head) {
-        head = MIN(count, align - head);
-        count -= head;
-        offset += head;
-    }
-    count = QEMU_ALIGN_DOWN(count, align);
-    if (!count) {
-        return 0;
-    }
+    tail = (offset + count) % align;

     bdrv_inc_in_flight(bs);
     tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD);
@@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,

     max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
                                    align);
-    assert(max_pdiscard);
+    assert(max_pdiscard >= bs->bl.request_alignment);

     while (count > 0) {
         int ret;
-        int num = MIN(count, max_pdiscard);
+        int num = count;
+
+        if (head) {
+            /* Make small requests to get to alignment boundaries. */
+            num = MIN(count, align - head);
+            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
+                num %= bs->bl.request_alignment;
+            }
+            head = (head + num) % align;
+            assert(num < max_pdiscard);
+        } else if (tail) {
+            if (num > align) {
+                /* Shorten the request to the last aligned cluster.  */
+                num -= tail;
+            } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) &&
+                       tail > bs->bl.request_alignment) {
+                tail %= bs->bl.request_alignment;
+                num -= tail;
+            }
+        }
+        /* limit request size */
+        if (num > max_pdiscard) {
+            num = max_pdiscard;
+        }

         if (bs->drv->bdrv_co_pdiscard) {
             ret = bs->drv->bdrv_co_pdiscard(bs, offset, num);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 6/9] blkdebug: Sanity check block layer guarantees
  2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
                   ` (4 preceding siblings ...)
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers Eric Blake
@ 2016-11-17 20:13 ` Eric Blake
  2016-11-17 22:36   ` Max Reitz
  2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 7/9] blkdebug: Add pass-through write_zero and discard support Eric Blake
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: Eric Blake @ 2016-11-17 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, pbonzini, Max Reitz

Commits 04ed95f4 and 1a62d0ac updated the block layer to auto-fragment
any I/O to fit within device boundaries. Additionally, when using a
minimum alignment of 4k, we want to ensure the block layer does proper
read-modify-write rather than requesting I/O on a slice of a sector.
Let's enforce that the contract is obeyed when using blkdebug.  For
now, blkdebug only allows alignment overrides, and just inherits other
limits from whatever device it is wrapping, but a future patch will
further enhance things.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/blkdebug.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 4127571..0a47977 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -445,6 +445,15 @@ static BlockAIOCB *blkdebug_aio_readv(BlockDriverState *bs,
     BDRVBlkdebugState *s = bs->opaque;
     BlkdebugRule *rule = NULL;

+    /* Sanity check block layer guarantees */
+    assert(QEMU_IS_ALIGNED(sector_num * BDRV_SECTOR_SIZE,
+                           bs->bl.request_alignment));
+    assert(QEMU_IS_ALIGNED(nb_sectors * BDRV_SECTOR_SIZE,
+                           bs->bl.request_alignment));
+    if (bs->bl.max_transfer) {
+        assert(nb_sectors * BDRV_SECTOR_SIZE <= bs->bl.max_transfer);
+    }
+
     QSIMPLEQ_FOREACH(rule, &s->active_rules, active_next) {
         if (rule->options.inject.sector == -1 ||
             (rule->options.inject.sector >= sector_num &&
@@ -468,6 +477,15 @@ static BlockAIOCB *blkdebug_aio_writev(BlockDriverState *bs,
     BDRVBlkdebugState *s = bs->opaque;
     BlkdebugRule *rule = NULL;

+    /* Sanity check block layer guarantees */
+    assert(QEMU_IS_ALIGNED(sector_num * BDRV_SECTOR_SIZE,
+                           bs->bl.request_alignment));
+    assert(QEMU_IS_ALIGNED(nb_sectors * BDRV_SECTOR_SIZE,
+                           bs->bl.request_alignment));
+    if (bs->bl.max_transfer) {
+        assert(nb_sectors * BDRV_SECTOR_SIZE <= bs->bl.max_transfer);
+    }
+
     QSIMPLEQ_FOREACH(rule, &s->active_rules, active_next) {
         if (rule->options.inject.sector == -1 ||
             (rule->options.inject.sector >= sector_num &&
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 7/9] blkdebug: Add pass-through write_zero and discard support
  2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
                   ` (5 preceding siblings ...)
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 6/9] blkdebug: Sanity check block layer guarantees Eric Blake
@ 2016-11-17 20:14 ` Eric Blake
  2016-11-17 22:47   ` Max Reitz
  2016-11-17 23:27   ` Max Reitz
  2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 8/9] blkdebug: Add ability to override unmap geometries Eric Blake
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-17 20:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, pbonzini, Max Reitz

In order to test the effects of artificial geometry constraints
on operations like write zero or discard, we first need blkdebug
to manage these actions.  Ideally, it would be nice to let these
operations also react to injected errors like read/write/flush,
but it is not trivial to turn bdrv_aio error injection (where
we return BlockAIOCB*) into bdrv_co (where we return int), not
to mention the fact that I don't want to conflict with Kevin's
concurrent work on refactoring away from bdrv_aio.  So for now,
the operations merely have a TODO comment for adding error
injection.

However, one thing we CAN test is the contract promised by the
block layer; namely, if a device has specified limits on
alignment or maximum size, then those limits must be obeyed (for
now, the blkdebug driver merely inherits limits from whatever it
is wrapping, but the next patch will further enhance it to allow
specific limit overrides).

Tested by setting up an NBD server with export 'foo', then invoking:
$ ./qemu-io
qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo
qemu-io> d 0 15M
qemu-io> w -z 0 15M

Pre-patch, the server never sees the discard (it was silently
eaten by the block layer); post-patch it is passed across the
wire.  Likewise, pre-patch the write is always passed with
NBD_WRITE (with 15M of zeroes on the wire), while post-patch
it can utilize NBD_WRITE_ZEROES (for less traffic).

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/blkdebug.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 0a47977..d45826d 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -1,6 +1,7 @@
 /*
  * Block protocol for I/O error injection
  *
+ * Copyright (C) 2016 Red Hat, Inc.
  * Copyright (c) 2010 Kevin Wolf <kwolf@redhat.com>
  *
  * Permission is hereby granted, free of charge, to any person obtaining a copy
@@ -383,6 +384,11 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
         goto out;
     }

+    bs->supported_write_flags = BDRV_REQ_FUA &
+        bs->file->bs->supported_write_flags;
+    bs->supported_zero_flags = (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+        bs->file->bs->supported_zero_flags;
+
     /* Set request alignment */
     align = qemu_opt_get_size(opts, "align", 0);
     if (align < INT_MAX && is_power_of_2(align)) {
@@ -522,6 +528,59 @@ static BlockAIOCB *blkdebug_aio_flush(BlockDriverState *bs,
 }


+static int coroutine_fn blkdebug_co_pwrite_zeroes(BlockDriverState *bs,
+                                                  int64_t offset, int count,
+                                                  BdrvRequestFlags flags)
+{
+    uint32_t align = MAX(bs->bl.request_alignment,
+                         bs->bl.pwrite_zeroes_alignment);
+
+    /* Regardless of whether the lower layer has a finer granularity,
+     * we want to treat any unaligned request as unsupported, and
+     * check that the block layer never hands us anything crossing an
+     * alignment boundary.  */
+    if (count < align) {
+        return -ENOTSUP;
+    }
+    assert(QEMU_IS_ALIGNED(offset, align));
+    assert(QEMU_IS_ALIGNED(count, align));
+    if (bs->bl.max_pwrite_zeroes) {
+        assert(count <= bs->bl.max_pwrite_zeroes);
+    }
+
+    /* TODO: Support rule injection? */
+
+    return bdrv_co_pwrite_zeroes(bs->file, offset, count, flags);
+}
+
+
+static int coroutine_fn blkdebug_co_pdiscard(BlockDriverState *bs,
+                                             int64_t offset, int count)
+{
+    uint32_t align = bs->bl.pdiscard_alignment;
+
+    /* Only pass through requests that are larger than requested
+     * minimum alignment, and ensure that unaligned requests do not
+     * cross optimum discard boundaries. */
+    if (count < bs->bl.request_alignment) {
+        return -ENOTSUP;
+    }
+    assert(QEMU_IS_ALIGNED(offset, bs->bl.request_alignment));
+    assert(QEMU_IS_ALIGNED(count, bs->bl.request_alignment));
+    if (align && count >= align) {
+        assert(QEMU_IS_ALIGNED(offset, align));
+        assert(QEMU_IS_ALIGNED(count, align));
+    }
+    if (bs->bl.max_pdiscard) {
+        assert(count <= bs->bl.max_pdiscard);
+    }
+
+    /* TODO: Support rule injection? */
+
+    return bdrv_co_pdiscard(bs->file->bs, offset, count);
+}
+
+
 static void blkdebug_close(BlockDriverState *bs)
 {
     BDRVBlkdebugState *s = bs->opaque;
@@ -773,6 +832,8 @@ static BlockDriver bdrv_blkdebug = {
     .bdrv_aio_readv         = blkdebug_aio_readv,
     .bdrv_aio_writev        = blkdebug_aio_writev,
     .bdrv_aio_flush         = blkdebug_aio_flush,
+    .bdrv_co_pwrite_zeroes  = blkdebug_co_pwrite_zeroes,
+    .bdrv_co_pdiscard       = blkdebug_co_pdiscard,

     .bdrv_debug_event           = blkdebug_debug_event,
     .bdrv_debug_breakpoint      = blkdebug_debug_breakpoint,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 8/9] blkdebug: Add ability to override unmap geometries
  2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
                   ` (6 preceding siblings ...)
  2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 7/9] blkdebug: Add pass-through write_zero and discard support Eric Blake
@ 2016-11-17 20:14 ` Eric Blake
  2016-11-17 23:02   ` Max Reitz
  2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes Eric Blake
  2016-11-22 16:05 ` [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Kevin Wolf
  9 siblings, 1 reply; 47+ messages in thread
From: Eric Blake @ 2016-11-17 20:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, pbonzini, Max Reitz, Markus Armbruster

Make it easier to simulate the Dell Equallogic iSCSI with its
unusual 15M preferred and maximum unmap and write zero sizing,
or to simulate Linux loopback block devices enforcing a small
max_transfer of 64k, by allowing blkdebug to wrap any other
device with further restrictions on various alignments.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 qapi/block-core.json | 27 ++++++++++++++-
 block/blkdebug.c     | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index c29bef7..26f3e9f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2068,6 +2068,29 @@
 # @align:           #optional required alignment for requests in bytes,
 #                   must be power of 2, or 0 for default
 #
+# @max-transfer:    #optional maximum size for I/O transfers in bytes,
+#                   must be multiple of the larger of @align and and 512
+#                   (but need not be a power of 2), or 0 for default
+#                   (since 2.9)
+#
+# @opt-write-zero:  #optional preferred alignment for write zero requests
+#                   in bytes, must be multiple of the larger of @align
+#                   and 512 (but need not be a power of 2), or 0 for
+#                   default (since 2.9)
+#
+# @max-write-zero:  #optional maximum size for write zero requests in bytes,
+#                   must be multiple of the larger of @align and 512 (but
+#                   need not be a power of 2), or 0 for default (since 2.9)
+#
+# @opt-discard:     #optional preferred alignment for discard requests
+#                   in bytes, must be multiple of the larger of @align
+#                   and 512 (but need not be a power of 2), or 0 for
+#                   default (since 2.9)
+#
+# @max-discard:     #optional maximum size for discard requests in bytes,
+#                   must be multiple of the larger of @align and 512 (but
+#                   need not be a power of 2), or 0 for default (since 2.9)
+#
 # @inject-error:    #optional array of error injection descriptions
 #
 # @set-state:       #optional array of state-change descriptions
@@ -2077,7 +2100,9 @@
 { 'struct': 'BlockdevOptionsBlkdebug',
   'data': { 'image': 'BlockdevRef',
             '*config': 'str',
-            '*align': 'int',
+            '*align': 'int', '*max-transfer': 'int32',
+            '*opt-write-zero': 'int32', '*max-write-zero': 'int32',
+            '*opt-discard': 'int32', '*max-discard': 'int32',
             '*inject-error': ['BlkdebugInjectErrorOptions'],
             '*set-state': ['BlkdebugSetStateOptions'] } }

diff --git a/block/blkdebug.c b/block/blkdebug.c
index d45826d..3ba7a96 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -39,6 +39,11 @@ typedef struct BDRVBlkdebugState {
     int state;
     int new_state;
     int align;
+    int max_transfer;
+    int opt_write_zero;
+    int max_write_zero;
+    int opt_discard;
+    int max_discard;

     /* For blkdebug_refresh_filename() */
     char *config_file;
@@ -344,6 +349,31 @@ static QemuOptsList runtime_opts = {
             .type = QEMU_OPT_SIZE,
             .help = "Required alignment in bytes",
         },
+        {
+            .name = "max-transfer",
+            .type = QEMU_OPT_SIZE,
+            .help = "Maximum transfer size in bytes",
+        },
+        {
+            .name = "opt-write-zero",
+            .type = QEMU_OPT_SIZE,
+            .help = "Optimum write zero size in bytes",
+        },
+        {
+            .name = "max-write-zero",
+            .type = QEMU_OPT_SIZE,
+            .help = "Maximum write zero size in bytes",
+        },
+        {
+            .name = "opt-discard",
+            .type = QEMU_OPT_SIZE,
+            .help = "Optimum discard size in bytes",
+        },
+        {
+            .name = "max-discard",
+            .type = QEMU_OPT_SIZE,
+            .help = "Maximum discard size in bytes",
+        },
         { /* end of list */ }
     },
 };
@@ -354,7 +384,8 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
     BDRVBlkdebugState *s = bs->opaque;
     QemuOpts *opts;
     Error *local_err = NULL;
-    uint64_t align;
+    uint64_t align, max_transfer;
+    uint64_t opt_write_zero, max_write_zero, opt_discard, max_discard;
     int ret;

     opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
@@ -389,7 +420,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
     bs->supported_zero_flags = (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
         bs->file->bs->supported_zero_flags;

-    /* Set request alignment */
+    /* Set alignment overrides */
     align = qemu_opt_get_size(opts, "align", 0);
     if (align < INT_MAX && is_power_of_2(align)) {
         s->align = align;
@@ -398,6 +429,53 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
         ret = -EINVAL;
         goto fail_unref;
     }
+    max_transfer = qemu_opt_get_size(opts, "max-transfer", 0);
+    if (max_transfer < INT_MAX &&
+        QEMU_IS_ALIGNED(max_transfer, MAX(align, BDRV_SECTOR_SIZE))) {
+        s->max_transfer = max_transfer;
+    } else if (max_transfer) {
+        error_setg(errp, "Invalid argument");
+        ret = -EINVAL;
+        goto fail_unref;
+    }
+    opt_write_zero = qemu_opt_get_size(opts, "opt-write-zero", 0);
+    if (opt_write_zero < INT_MAX &&
+        QEMU_IS_ALIGNED(opt_write_zero, MAX(align, BDRV_SECTOR_SIZE))) {
+        s->opt_write_zero = opt_write_zero;
+    } else if (opt_write_zero) {
+        error_setg(errp, "Invalid argument");
+        ret = -EINVAL;
+        goto fail_unref;
+    }
+    max_write_zero = qemu_opt_get_size(opts, "max-write-zero", 0);
+    if (max_write_zero < INT_MAX &&
+        QEMU_IS_ALIGNED(max_write_zero, MAX(opt_write_zero,
+                                         MAX(align, BDRV_SECTOR_SIZE)))) {
+        s->max_write_zero = max_write_zero;
+    } else if (max_write_zero) {
+        error_setg(errp, "Invalid argument");
+        ret = -EINVAL;
+        goto fail_unref;
+    }
+    opt_discard = qemu_opt_get_size(opts, "opt-discard", 0);
+    if (opt_discard < INT_MAX &&
+        QEMU_IS_ALIGNED(opt_discard, MAX(align, BDRV_SECTOR_SIZE))) {
+        s->opt_discard = opt_discard;
+    } else if (opt_discard) {
+        error_setg(errp, "Invalid argument");
+        ret = -EINVAL;
+        goto fail_unref;
+    }
+    max_discard = qemu_opt_get_size(opts, "max-discard", 0);
+    if (max_discard < INT_MAX &&
+        QEMU_IS_ALIGNED(max_discard, MAX(opt_discard,
+                                         MAX(align, BDRV_SECTOR_SIZE)))) {
+        s->max_discard = max_discard;
+    } else if (max_discard) {
+        error_setg(errp, "Invalid argument");
+        ret = -EINVAL;
+        goto fail_unref;
+    }

     ret = 0;
     goto out;
@@ -807,6 +885,21 @@ static void blkdebug_refresh_limits(BlockDriverState *bs, Error **errp)
     if (s->align) {
         bs->bl.request_alignment = s->align;
     }
+    if (s->max_transfer) {
+        bs->bl.max_transfer = s->max_transfer;
+    }
+    if (s->opt_write_zero) {
+        bs->bl.pwrite_zeroes_alignment = s->opt_write_zero;
+    }
+    if (s->max_write_zero) {
+        bs->bl.max_pwrite_zeroes = s->max_write_zero;
+    }
+    if (s->opt_discard) {
+        bs->bl.pdiscard_alignment = s->opt_discard;
+    }
+    if (s->max_discard) {
+        bs->bl.max_pdiscard = s->max_discard;
+    }
 }

 static int blkdebug_reopen_prepare(BDRVReopenState *reopen_state,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes
  2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
                   ` (7 preceding siblings ...)
  2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 8/9] blkdebug: Add ability to override unmap geometries Eric Blake
@ 2016-11-17 20:14 ` Eric Blake
  2016-11-17 23:19   ` Max Reitz
  2016-11-17 23:42   ` Max Reitz
  2016-11-22 16:05 ` [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Kevin Wolf
  9 siblings, 2 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-17 20:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, kwolf, pbonzini, Max Reitz

Use blkdebug's new geometry constraints to emulate setups that
have caused recent regression fixes: write zeroes asserting
when running through a loopback block device with max-transfer
smaller than cluster size, and discard rounding away requests
that were not aligned to power-of-two boundaries.  Also, add
coverage that the block layer is honoring max transfer limits.

For now, a single iotest performs all actions, with the idea
that we can add future blkdebug constraint test cases in the
same file; but it can be split into multiple iotests if we find
reason to run one portion of the test in more setups than what
are possible in the other.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/173     | 115 +++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/173.out |  49 +++++++++++++++++++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 165 insertions(+)
 create mode 100755 tests/qemu-iotests/173
 create mode 100644 tests/qemu-iotests/173.out

diff --git a/tests/qemu-iotests/173 b/tests/qemu-iotests/173
new file mode 100755
index 0000000..1215759
--- /dev/null
+++ b/tests/qemu-iotests/173
@@ -0,0 +1,115 @@
+#!/bin/bash
+#
+# Test corner cases with unusual block geometries
+#
+# Copyright (C) 2016 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=eblake@redhat.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt qcow2
+_supported_proto file
+
+CLUSTER_SIZE=1M
+size=128M
+options=driver=blkdebug,image.driver=qcow2
+
+echo
+echo "== setting up files =="
+
+_make_test_img $size
+$QEMU_IO -c "write -P 11 0 $size" "$TEST_IMG" | _filter_qemu_io
+mv "$TEST_IMG" "$TEST_IMG.base"
+_make_test_img -b "$TEST_IMG.base"
+$QEMU_IO -c "write -P 22 0 $size" "$TEST_IMG" | _filter_qemu_io
+
+# Linux loopback devices are restricted to 64k max-transfer
+echo
+echo "== constrained alignment and max-transfer =="
+limits=align=4k,max-transfer=64k
+$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
+         -c "write -P 33 1000 128k" -c "read -P 33 1000 128k" | _filter_qemu_io
+
+echo
+echo "== write zero with constrained max-transfer =="
+limits=max-transfer=64k,opt-write-zero=$CLUSTER_SIZE
+$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
+         -c "write -z 8003584 2093056" | _filter_qemu_io
+
+# Dell Equallogic iSCSI device is unusual with its 15M page size
+echo
+echo "== non-power-of-2 write zeroes =="
+
+limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
+$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
+         -c "write -z 32M 32M" | _filter_qemu_io
+
+echo
+echo "== non-power-of-2 discard =="
+
+limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
+$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
+         -c "discard 80000000 30M" | _filter_qemu_io
+
+echo
+echo "== verify image content =="
+
+function verify_io()
+{
+    if ($QEMU_IMG info -f "$IMGFMT" "$TEST_IMG" |
+	    grep "compat: 0.10" > /dev/null); then
+        # For v2 images, discarded clusters are read from the backing file
+        discarded=11
+    else
+        # Discarded clusters are zeroed for v3 or later
+        discarded=0
+    fi
+
+    echo read -P 22 0 1000
+    echo read -P 33 1000 128k
+    echo read -P 22 132072 7871512
+    echo read -P 0 8003584 2093056
+    echo read -P 22 10096640 23457792
+    echo read -P 0 32M 32M
+    echo read -P 22 64M 13M
+    echo read -P $discarded 77M 29M
+    echo read -P 22 106M 22M
+}
+
+verify_io | $QEMU_IO "$TEST_IMG" | _filter_qemu_io
+
+_check_test_img
+
+# success, all done
+echo "*** done"
+status=0
diff --git a/tests/qemu-iotests/173.out b/tests/qemu-iotests/173.out
new file mode 100644
index 0000000..4cd3f16
--- /dev/null
+++ b/tests/qemu-iotests/173.out
@@ -0,0 +1,49 @@
+QA output created by 173
+
+== setting up files ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
+wrote 134217728/134217728 bytes at offset 0
+128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/t.IMGFMT.base
+wrote 134217728/134217728 bytes at offset 0
+128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== constrained alignment and max-transfer ==
+wrote 131072/131072 bytes at offset 1000
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 131072/131072 bytes at offset 1000
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== write zero with constrained max-transfer ==
+wrote 2093056/2093056 bytes at offset 8003584
+1.996 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== non-power-of-2 write zeroes ==
+wrote 33554432/33554432 bytes at offset 33554432
+32 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== non-power-of-2 discard ==
+discard 31457280/31457280 bytes at offset 80000000
+30 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== verify image content ==
+read 1000/1000 bytes at offset 0
+1000 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 131072/131072 bytes at offset 1000
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 7871512/7871512 bytes at offset 132072
+7.507 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 2093056/2093056 bytes at offset 8003584
+1.996 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 23457792/23457792 bytes at offset 10096640
+22.371 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 33554432/33554432 bytes at offset 33554432
+32 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 13631488/13631488 bytes at offset 67108864
+13 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 30408704/30408704 bytes at offset 80740352
+29 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 23068672/23068672 bytes at offset 111149056
+22 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+No errors were found on the image.
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 866c1a0..0453e6c 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -165,3 +165,4 @@
 170 rw auto quick
 171 rw auto quick
 172 auto
+173 rw auto quick
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/9] nbd: Allow unmap and fua during write zeroes
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 1/9] nbd: Allow unmap and fua during write zeroes Eric Blake
@ 2016-11-17 21:10   ` Max Reitz
  2016-11-17 21:14     ` Eric Blake
  2016-11-18 14:12     ` Paolo Bonzini
  0 siblings, 2 replies; 47+ messages in thread
From: Max Reitz @ 2016-11-17 21:10 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, kwolf, pbonzini, qemu-stable

[-- Attachment #1: Type: text/plain, Size: 817 bytes --]

On 17.11.2016 21:13, Eric Blake wrote:
> Commit fa778fff wired up support to send the NBD_CMD_WRITE_ZEROES,
> but forgot to inform the block layer that FUA unmapping of zeroes is
> supported.  Without BDRV_REQ_MAY_UNMAP listed as a supported flag,
> the block layer will always insist on the NBD layer passing
> NBD_CMD_FLAG_NO_HOLE, resulting in the server always allocating
> things even when it was desired to let the server punch holes.
> Similarly, failing to set BDRV_REQ_FUA means that the client may
> send unnecessary NBD_CMD_FLUSH when it could have instead used the
> NBD_CMD_FLAG_FUA bit.
> 
> CC: qemu-stable@nongnu.org
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/nbd-client.c | 4 ++++
>  1 file changed, 4 insertions(+)

Reviewed-by: Max Reitz <mreitz@redhat.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/9] nbd: Allow unmap and fua during write zeroes
  2016-11-17 21:10   ` Max Reitz
@ 2016-11-17 21:14     ` Eric Blake
  2016-11-18 14:12     ` Paolo Bonzini
  1 sibling, 0 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-17 21:14 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: qemu-block, kwolf, pbonzini, qemu-stable

[-- Attachment #1: Type: text/plain, Size: 1210 bytes --]

On 11/17/2016 03:10 PM, Max Reitz wrote:
> On 17.11.2016 21:13, Eric Blake wrote:
>> Commit fa778fff wired up support to send the NBD_CMD_WRITE_ZEROES,
>> but forgot to inform the block layer that FUA unmapping of zeroes is
>> supported.  Without BDRV_REQ_MAY_UNMAP listed as a supported flag,
>> the block layer will always insist on the NBD layer passing
>> NBD_CMD_FLAG_NO_HOLE, resulting in the server always allocating
>> things even when it was desired to let the server punch holes.
>> Similarly, failing to set BDRV_REQ_FUA means that the client may
>> send unnecessary NBD_CMD_FLUSH when it could have instead used the
>> NBD_CMD_FLAG_FUA bit.
>>
>> CC: qemu-stable@nongnu.org

I guess this one is technically not for qemu-stable (as fa778fff is not
part of 2.7); rather, I just got carried away with the CC lines
distinguishing the first half of my series for 2.8 vs. the second for 2.9.

>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>>  block/nbd-client.c | 4 ++++
>>  1 file changed, 4 insertions(+)
> 
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/9] qcow2: Inform block layer about discard boundaries
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 2/9] qcow2: Inform block layer about discard boundaries Eric Blake
@ 2016-11-17 21:24   ` Max Reitz
  0 siblings, 0 replies; 47+ messages in thread
From: Max Reitz @ 2016-11-17 21:24 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, kwolf, pbonzini, qemu-stable

[-- Attachment #1: Type: text/plain, Size: 1620 bytes --]

On 17.11.2016 21:13, Eric Blake wrote:
> At the qcow2 layer, discard is only possible on a per-cluster
> basis; at the moment, qcow2 silently rounds any unaligned
> requests to this granularity.  However, an upcoming patch will
> fix a regression in the block layer ignoring too much of an
> unaligned discard request, by changing the block layer to
> break up a discard request at alignment boundaries; for that
> to work, the block layer must know about our limits.
> 
> However, we can't go one step further by changing
> qcow2_discard_clusters() to assert that requests are always
> aligned, since that helper function is reached on paths
> outside of the block layer.

Also, to do that, we'd have to establish that pdiscard_alignment is a
mandatory alignment. Currently, it's described as the "best alignment",
which doesn't sound too mandatory to me (although the block layer
apparently always does drop unaligned parts).

> CC: qemu-stable@nongnu.org
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/qcow2.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 6d5689a..e22f6dc 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1206,6 +1206,7 @@ static void qcow2_refresh_limits(BlockDriverState *bs, Error **errp)
>          bs->bl.request_alignment = BDRV_SECTOR_SIZE;
>      }
>      bs->bl.pwrite_zeroes_alignment = s->cluster_size;
> +    bs->bl.pdiscard_alignment = s->cluster_size;
>  }
> 
>  static int qcow2_set_key(BlockDriverState *bs, const char *key)

Reviewed-by: Max Reitz <mreitz@redhat.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/9] block: Let write zeroes fallback work even with small max_transfer
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v3 3/9] block: Let write zeroes fallback work even with small max_transfer Eric Blake
@ 2016-11-17 21:40   ` Max Reitz
  2016-11-22 13:16   ` Kevin Wolf
  1 sibling, 0 replies; 47+ messages in thread
From: Max Reitz @ 2016-11-17 21:40 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Denis V . Lunev,
	Stefan Hajnoczi, Fam Zheng

[-- Attachment #1: Type: text/plain, Size: 1741 bytes --]

On 17.11.2016 21:13, Eric Blake wrote:
> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> an unaligned request never crosses a cluster boundary.  But
> in the rewrite, the new code assumed that at most one iteration
> would be needed to get to an alignment boundary.
> 
> However, it is easy to trigger an assertion failure: the Linux
> kernel limits loopback devices to advertise a max_transfer of
> only 64k.  Any operation that requires falling back to writes
> rather than more efficient zeroing must obey max_transfer during
> that fallback, which means an unaligned head may require multiple
> iterations of the write fallbacks before reaching the aligned
> boundaries, when layering a format with clusters larger than 64k
> atop the protocol of file access to a loopback device.
> 
> Test case:
> 
> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
> $ losetup /dev/loop2 /path/to/file
> $ qemu-io -f qcow2 /dev/loop2
> qemu-io> w 7m 1k
> qemu-io> w -z 8003584 2093056
> 
> In fairness to Denis (as the original listed author of the culprit
> commit), the faulty logic for at most one iteration is probably all
> my fault in reworking his idea.  But the solution is to restore what
> was in place prior to that commit: when dealing with an unaligned
> head or tail, iterate as many times as necessary while fragmenting
> the operation at max_transfer boundaries.
> 
> Reported-by: Ed Swierk <eswierk@skyportsystems.com>
> CC: qemu-stable@nongnu.org
> CC: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/io.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 4/9] block: Return -ENOTSUP rather than assert on unaligned discards
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 4/9] block: Return -ENOTSUP rather than assert on unaligned discards Eric Blake
@ 2016-11-17 22:01   ` Max Reitz
  2016-11-18 22:48     ` Eric Blake
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2016-11-17 22:01 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Ronnie Sahlberg,
	Peter Lieven, Hitoshi Mitake, Liu Yuan, Jeff Cody,
	open list:Sheepdog

[-- Attachment #1: Type: text/plain, Size: 3946 bytes --]

On 17.11.2016 21:13, Eric Blake wrote:
> Right now, the block layer rounds discard requests, so that
> individual drivers are able to assert that discard requests
> will never be unaligned.  But there are some ISCSI devices
> that track and coalesce multiple unaligned requests, turning it
> into an actual discard if the requests eventually cover an
> entire page, which implies that it is better to always pass
> discard requests as low down the stack as possible.
> 
> In isolation, this patch has no semantic effect, since the
> block layer currently never passes an unaligned request through.
> But the block layer already has code that silently ignores
> drivers that return -ENOTSUP for a discard request that cannot
> be honored (as well as drivers that return 0 even when nothing
> was done).  But the next patch will update the block layer to
> fragment discard requests, so that clients are guaranteed that
> they are either dealing with an unaligned head or tail, or an
> aligned core, making it similar to the block layer semantics of
> write zero fragmentation.
> 
> CC: qemu-stable@nongnu.org
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/iscsi.c    | 4 +++-
>  block/qcow2.c    | 5 +++++
>  block/sheepdog.c | 5 +++--
>  3 files changed, 11 insertions(+), 3 deletions(-)

OK, so much about my remark that the comment describing
pdiscard_alignment only says it's an "optimal alignment"...

> 
> diff --git a/block/iscsi.c b/block/iscsi.c
> index 71bd523..0960929 100644
> --- a/block/iscsi.c
> +++ b/block/iscsi.c
> @@ -1083,7 +1083,9 @@ coroutine_fn iscsi_co_pdiscard(BlockDriverState *bs, int64_t offset, int count)
>      struct IscsiTask iTask;
>      struct unmap_list list;
> 
> -    assert(is_byte_request_lun_aligned(offset, count, iscsilun));
> +    if (!is_byte_request_lun_aligned(offset, count, iscsilun)) {
> +        return -ENOTSUP;
> +    }
> 
>      if (!iscsilun->lbp.lbpu) {
>          /* UNMAP is not supported by the target */

Next line is:

>          return 0;

Hmm... -ENOTSUP would be the obvious return value here, too. That might
interfere with your next patch, though.

> diff --git a/block/qcow2.c b/block/qcow2.c
> index e22f6dc..7cfcd84 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2491,6 +2491,11 @@ static coroutine_fn int qcow2_co_pdiscard(BlockDriverState *bs,
>      int ret;
>      BDRVQcow2State *s = bs->opaque;
> 
> +    if (!QEMU_IS_ALIGNED(offset | count, s->cluster_size)) {

Ha! I like "offset | count".

> +        assert(count < s->cluster_size);

Maybe add a comment for this assertion? E.g. "The block layer will only
generate unaligned discard requests that are smaller than the alignment".

> +        return -ENOTSUP;
> +    }
> +
>      qemu_co_mutex_lock(&s->lock);
>      ret = qcow2_discard_clusters(bs, offset, count >> BDRV_SECTOR_BITS,
>                                   QCOW2_DISCARD_REQUEST, false);
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index 1fb9173..4c9af89 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -2829,8 +2829,9 @@ static coroutine_fn int sd_co_pdiscard(BlockDriverState *bs, int64_t offset,
>      iov.iov_len = sizeof(zero);
>      discard_iov.iov = &iov;
>      discard_iov.niov = 1;
> -    assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> -    assert((count & (BDRV_SECTOR_SIZE - 1)) == 0);
> +    if (!QEMU_IS_ALIGNED(offset | count, BDRV_SECTOR_SIZE)) {
> +        return -ENOTSUP;
> +    }

Out of interest: Where does sheepdog tell the block layer that requests
have to be aligned to this value?

With this patch, it doesn't matter though, it only did before, so:

Reviewed-by: Max Reitz <mreitz@redhat.com>

>      acb = sd_aio_setup(bs, &discard_iov, offset >> BDRV_SECTOR_BITS,
>                         count >> BDRV_SECTOR_BITS);
>      acb->aiocb_type = AIOCB_DISCARD_OBJ;
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers Eric Blake
@ 2016-11-17 22:26   ` Max Reitz
  2016-11-17 23:01     ` Eric Blake
  2016-11-17 23:44   ` Max Reitz
  2016-11-22 14:03   ` Kevin Wolf
  2 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2016-11-17 22:26 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Stefan Hajnoczi, Fam Zheng

[-- Attachment #1: Type: text/plain, Size: 5753 bytes --]

On 17.11.2016 21:13, Eric Blake wrote:
> Discard is advisory, so rounding the requests to alignment
> boundaries is never semantically wrong from the data that
> the guest sees.  But at least the Dell Equallogic iSCSI SANs
> has an interesting property that its advertised discard
> alignment is 15M, yet documents that discarding a sequence
> of 1M slices will eventually result in the 15M page being
> marked as discarded, and it is possible to observe which
> pages have been discarded.
> 
> Between commits 9f1963b and b8d0a980, we converted the block
> layer to a byte-based interface that ultimately ignores any
> unaligned head or tail based on the driver's advertised
> discard granularity, which means that qemu 2.7 refuses to
> pass any discard request smaller than 15M down to the Dell
> Equallogic hardware.  This is a slight regression in behavior
> compared to earlier qemu, where a guest executing discards
> in power-of-2 chunks used to be able to get every page
> discarded, but is now left with various pages still allocated
> because the guest requests did not align with the hardware's
> 15M pages.
> 
> Since the SCSI specification says nothing about a minimum
> discard granularity, and only documents the preferred
> alignment, it is best if the block layer gives the driver
> every bit of information about discard requests, rather than
> rounding it to alignment boundaries early.
> 
> Rework the block layer discard algorithm to mirror the write
> zero algorithm: always peel off any unaligned head or tail
> and manage that in isolation, then do the bulk of the request
> on an aligned boundary.  The fallback when the driver returns
> -ENOTSUP for an unaligned request is to silently ignore that
> portion of the discard request; but for devices that can pass
> the partial request all the way down to hardware, this can
> result in the hardware coalescing requests and discarding
> aligned pages after all.
> 
> Reported by: Peter Lieven <pl@kamp.de>
> CC: qemu-stable@nongnu.org
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> v2: Split at more boundaries.
> ---
>  block/io.c | 45 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 32 insertions(+), 13 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 085ac34..4f00562 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
>  {
>      BdrvTrackedRequest req;
>      int max_pdiscard, ret;
> -    int head, align;
> +    int head, tail, align;
> 
>      if (!bs->drv) {
>          return -ENOMEDIUM;
> @@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
>          return 0;
>      }
> 
> -    /* Discard is advisory, so ignore any unaligned head or tail */
> +    /* Discard is advisory, but some devices track and coalesce
> +     * unaligned requests, so we must pass everything down rather than
> +     * round here.  Still, most devices will just silently ignore
> +     * unaligned requests (by returning -ENOTSUP), so we must fragment

Returning -ENOTSUP is not quite "silently" in my book.

> +     * the request accordingly.  */
>      align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
>      assert(align % bs->bl.request_alignment == 0);
>      head = offset % align;
> -    if (head) {
> -        head = MIN(count, align - head);
> -        count -= head;
> -        offset += head;
> -    }
> -    count = QEMU_ALIGN_DOWN(count, align);
> -    if (!count) {
> -        return 0;
> -    }
> +    tail = (offset + count) % align;
> 
>      bdrv_inc_in_flight(bs);
>      tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD);
> @@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
> 
>      max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
>                                     align);
> -    assert(max_pdiscard);
> +    assert(max_pdiscard >= bs->bl.request_alignment);
> 
>      while (count > 0) {
>          int ret;
> -        int num = MIN(count, max_pdiscard);
> +        int num = count;
> +
> +        if (head) {
> +            /* Make small requests to get to alignment boundaries. */
> +            num = MIN(count, align - head);
> +            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
> +                num %= bs->bl.request_alignment;
> +            }

Could be written as

num = num % bs->bl.request_alignment ?: num;

But that's up to you.

More importantly, is it possible that request_alignment >
pdiscard_alignment? In that case, align would be request_alignment, head
would be less than request_alignment but could be more than
pdiscard_alignment.

Thus, if that is possible, this might create a request longer than
pdiscard_alignment.

> +            head = (head + num) % align;
> +            assert(num < max_pdiscard);
> +        } else if (tail) {
> +            if (num > align) {
> +                /* Shorten the request to the last aligned cluster.  */
> +                num -= tail;
> +            } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) &&
> +                       tail > bs->bl.request_alignment) {
> +                tail %= bs->bl.request_alignment;
> +                num -= tail;

Same here.

Max

> +            }
> +        }
> +        /* limit request size */
> +        if (num > max_pdiscard) {
> +            num = max_pdiscard;
> +        }
> 
>          if (bs->drv->bdrv_co_pdiscard) {
>              ret = bs->drv->bdrv_co_pdiscard(bs, offset, num);
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 6/9] blkdebug: Sanity check block layer guarantees
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 6/9] blkdebug: Sanity check block layer guarantees Eric Blake
@ 2016-11-17 22:36   ` Max Reitz
  0 siblings, 0 replies; 47+ messages in thread
From: Max Reitz @ 2016-11-17 22:36 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, kwolf, pbonzini

[-- Attachment #1: Type: text/plain, Size: 764 bytes --]

On 17.11.2016 21:13, Eric Blake wrote:
> Commits 04ed95f4 and 1a62d0ac updated the block layer to auto-fragment
> any I/O to fit within device boundaries. Additionally, when using a
> minimum alignment of 4k, we want to ensure the block layer does proper
> read-modify-write rather than requesting I/O on a slice of a sector.
> Let's enforce that the contract is obeyed when using blkdebug.  For
> now, blkdebug only allows alignment overrides, and just inherits other
> limits from whatever device it is wrapping, but a future patch will
> further enhance things.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/blkdebug.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)

Reviewed-by: Max Reitz <mreitz@redhat.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 7/9] blkdebug: Add pass-through write_zero and discard support
  2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 7/9] blkdebug: Add pass-through write_zero and discard support Eric Blake
@ 2016-11-17 22:47   ` Max Reitz
  2016-11-18 23:08     ` Eric Blake
  2016-11-17 23:27   ` Max Reitz
  1 sibling, 1 reply; 47+ messages in thread
From: Max Reitz @ 2016-11-17 22:47 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, kwolf, pbonzini

[-- Attachment #1: Type: text/plain, Size: 2749 bytes --]

On 17.11.2016 21:14, Eric Blake wrote:
> In order to test the effects of artificial geometry constraints
> on operations like write zero or discard, we first need blkdebug
> to manage these actions.  Ideally, it would be nice to let these
> operations also react to injected errors like read/write/flush,
> but it is not trivial to turn bdrv_aio error injection (where
> we return BlockAIOCB*) into bdrv_co (where we return int), not
> to mention the fact that I don't want to conflict with Kevin's
> concurrent work on refactoring away from bdrv_aio.  So for now,
> the operations merely have a TODO comment for adding error
> injection.
> 
> However, one thing we CAN test is the contract promised by the
> block layer; namely, if a device has specified limits on
> alignment or maximum size, then those limits must be obeyed (for
> now, the blkdebug driver merely inherits limits from whatever it
> is wrapping, but the next patch will further enhance it to allow
> specific limit overrides).
> 
> Tested by setting up an NBD server with export 'foo', then invoking:
> $ ./qemu-io
> qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo
> qemu-io> d 0 15M
> qemu-io> w -z 0 15M
> 
> Pre-patch, the server never sees the discard (it was silently
> eaten by the block layer); post-patch it is passed across the
> wire.  Likewise, pre-patch the write is always passed with
> NBD_WRITE (with 15M of zeroes on the wire), while post-patch
> it can utilize NBD_WRITE_ZEROES (for less traffic).
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/blkdebug.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 61 insertions(+)
> 
> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index 0a47977..d45826d 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c

[...]

> @@ -522,6 +528,59 @@ static BlockAIOCB *blkdebug_aio_flush(BlockDriverState *bs,
>  }
> 
> 
> +static int coroutine_fn blkdebug_co_pwrite_zeroes(BlockDriverState *bs,
> +                                                  int64_t offset, int count,
> +                                                  BdrvRequestFlags flags)
> +{
> +    uint32_t align = MAX(bs->bl.request_alignment,
> +                         bs->bl.pwrite_zeroes_alignment);
> +
> +    /* Regardless of whether the lower layer has a finer granularity,
> +     * we want to treat any unaligned request as unsupported, and

Why?

Assuming there is a reason:

Reviewed-by: Max Reitz <mreitz@redhat.com>

> +     * check that the block layer never hands us anything crossing an
> +     * alignment boundary.  */
> +    if (count < align) {
> +        return -ENOTSUP;
> +    }

[...]



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
  2016-11-17 22:26   ` Max Reitz
@ 2016-11-17 23:01     ` Eric Blake
  2016-11-17 23:03       ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Blake @ 2016-11-17 23:01 UTC (permalink / raw)
  To: Max Reitz, qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Stefan Hajnoczi, Fam Zheng

[-- Attachment #1: Type: text/plain, Size: 1956 bytes --]

On 11/17/2016 04:26 PM, Max Reitz wrote:
> On 17.11.2016 21:13, Eric Blake wrote:
>> Discard is advisory, so rounding the requests to alignment
>> boundaries is never semantically wrong from the data that
>> the guest sees.  But at least the Dell Equallogic iSCSI SANs
>> has an interesting property that its advertised discard
>> alignment is 15M, yet documents that discarding a sequence
>> of 1M slices will eventually result in the 15M page being
>> marked as discarded, and it is possible to observe which
>> pages have been discarded.
>>

>>      max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
>>                                     align);
>> -    assert(max_pdiscard);
>> +    assert(max_pdiscard >= bs->bl.request_alignment);
>>
>>      while (count > 0) {
>>          int ret;
>> -        int num = MIN(count, max_pdiscard);
>> +        int num = count;
>> +
>> +        if (head) {
>> +            /* Make small requests to get to alignment boundaries. */
>> +            num = MIN(count, align - head);
>> +            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
>> +                num %= bs->bl.request_alignment;
>> +            }
> 
> Could be written as
> 
> num = num % bs->bl.request_alignment ?: num;
> 
> But that's up to you.
> 
> More importantly, is it possible that request_alignment >
> pdiscard_alignment? In that case, align would be request_alignment, head
> would be less than request_alignment but could be more than
> pdiscard_alignment.

pdiscard_alignment can be 0 (no inherent limit); but it if it is
nonzero, it must be at least request_alignment.  The block layer (should
be, if it is not already) enforcing that as part of the
.bdrv_refresh_limits() callback contract; at any rate, it is documented
that way in block_int.h

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 8/9] blkdebug: Add ability to override unmap geometries
  2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 8/9] blkdebug: Add ability to override unmap geometries Eric Blake
@ 2016-11-17 23:02   ` Max Reitz
  2016-11-21 21:11     ` Eric Blake
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2016-11-17 23:02 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, kwolf, pbonzini, Markus Armbruster

[-- Attachment #1: Type: text/plain, Size: 9020 bytes --]

On 17.11.2016 21:14, Eric Blake wrote:
> Make it easier to simulate the Dell Equallogic iSCSI with its

Somehow I feel bad putting company and product names into commit messages...

> unusual 15M preferred and maximum unmap and write zero sizing,
> or to simulate Linux loopback block devices enforcing a small
> max_transfer of 64k, by allowing blkdebug to wrap any other
> device with further restrictions on various alignments.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  qapi/block-core.json | 27 ++++++++++++++-
>  block/blkdebug.c     | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 121 insertions(+), 3 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index c29bef7..26f3e9f 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2068,6 +2068,29 @@
>  # @align:           #optional required alignment for requests in bytes,
>  #                   must be power of 2, or 0 for default
>  #
> +# @max-transfer:    #optional maximum size for I/O transfers in bytes,
> +#                   must be multiple of the larger of @align and and 512
> +#                   (but need not be a power of 2), or 0 for default
> +#                   (since 2.9)
> +#
> +# @opt-write-zero:  #optional preferred alignment for write zero requests
> +#                   in bytes, must be multiple of the larger of @align
> +#                   and 512 (but need not be a power of 2), or 0 for
> +#                   default (since 2.9)
> +#
> +# @max-write-zero:  #optional maximum size for write zero requests in bytes,
> +#                   must be multiple of the larger of @align and 512 (but
> +#                   need not be a power of 2), or 0 for default (since 2.9)
> +#
> +# @opt-discard:     #optional preferred alignment for discard requests
> +#                   in bytes, must be multiple of the larger of @align
> +#                   and 512 (but need not be a power of 2), or 0 for
> +#                   default (since 2.9)
> +#
> +# @max-discard:     #optional maximum size for discard requests in bytes,
> +#                   must be multiple of the larger of @align and 512 (but
> +#                   need not be a power of 2), or 0 for default (since 2.9)
> +#
>  # @inject-error:    #optional array of error injection descriptions
>  #
>  # @set-state:       #optional array of state-change descriptions
> @@ -2077,7 +2100,9 @@
>  { 'struct': 'BlockdevOptionsBlkdebug',
>    'data': { 'image': 'BlockdevRef',
>              '*config': 'str',
> -            '*align': 'int',
> +            '*align': 'int', '*max-transfer': 'int32',
> +            '*opt-write-zero': 'int32', '*max-write-zero': 'int32',
> +            '*opt-discard': 'int32', '*max-discard': 'int32',
>              '*inject-error': ['BlkdebugInjectErrorOptions'],
>              '*set-state': ['BlkdebugSetStateOptions'] } }
> 
> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index d45826d..3ba7a96 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c
> @@ -39,6 +39,11 @@ typedef struct BDRVBlkdebugState {
>      int state;
>      int new_state;
>      int align;
> +    int max_transfer;
> +    int opt_write_zero;
> +    int max_write_zero;
> +    int opt_discard;
> +    int max_discard;
> 
>      /* For blkdebug_refresh_filename() */
>      char *config_file;
> @@ -344,6 +349,31 @@ static QemuOptsList runtime_opts = {
>              .type = QEMU_OPT_SIZE,
>              .help = "Required alignment in bytes",
>          },
> +        {
> +            .name = "max-transfer",
> +            .type = QEMU_OPT_SIZE,
> +            .help = "Maximum transfer size in bytes",
> +        },
> +        {
> +            .name = "opt-write-zero",
> +            .type = QEMU_OPT_SIZE,
> +            .help = "Optimum write zero size in bytes",

s/size/alignment/?

> +        },
> +        {
> +            .name = "max-write-zero",
> +            .type = QEMU_OPT_SIZE,
> +            .help = "Maximum write zero size in bytes",
> +        },
> +        {
> +            .name = "opt-discard",
> +            .type = QEMU_OPT_SIZE,
> +            .help = "Optimum discard size in bytes",

s/size/alignment/?

> +        },
> +        {
> +            .name = "max-discard",
> +            .type = QEMU_OPT_SIZE,
> +            .help = "Maximum discard size in bytes",
> +        },
>          { /* end of list */ }
>      },
>  };
> @@ -354,7 +384,8 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
>      BDRVBlkdebugState *s = bs->opaque;
>      QemuOpts *opts;
>      Error *local_err = NULL;
> -    uint64_t align;
> +    uint64_t align, max_transfer;
> +    uint64_t opt_write_zero, max_write_zero, opt_discard, max_discard;
>      int ret;
> 
>      opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> @@ -389,7 +420,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
>      bs->supported_zero_flags = (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
>          bs->file->bs->supported_zero_flags;
> 
> -    /* Set request alignment */
> +    /* Set alignment overrides */
>      align = qemu_opt_get_size(opts, "align", 0);
>      if (align < INT_MAX && is_power_of_2(align)) {
>          s->align = align;
> @@ -398,6 +429,53 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
>          ret = -EINVAL;
>          goto fail_unref;
>      }
> +    max_transfer = qemu_opt_get_size(opts, "max-transfer", 0);
> +    if (max_transfer < INT_MAX &&
> +        QEMU_IS_ALIGNED(max_transfer, MAX(align, BDRV_SECTOR_SIZE))) {
> +        s->max_transfer = max_transfer;
> +    } else if (max_transfer) {
> +        error_setg(errp, "Invalid argument");

Could you be more specific? Same in all of the error_setg() calls below.

> +        ret = -EINVAL;
> +        goto fail_unref;
> +    }

Also, the way this is formatted seems not intuitive to me. I know it's
the same as it's been done for "align", but normally I'd use the following:

s->value = qemu_opt_get_size(...);
if (s->value is set and invalid) {
    /* error out */
}

Instead of:

value = qemu_opt_get_size(...);
if (value is valid) {
    s->value = value;
} else if (value is set) {
    /* error out */
}

Same for all blocks below.

Finally, my eyes would be really grateful for some empty lines here and
there, at least one between the handling of different options.


Max

> +    opt_write_zero = qemu_opt_get_size(opts, "opt-write-zero", 0);
> +    if (opt_write_zero < INT_MAX &&
> +        QEMU_IS_ALIGNED(opt_write_zero, MAX(align, BDRV_SECTOR_SIZE))) {
> +        s->opt_write_zero = opt_write_zero;
> +    } else if (opt_write_zero) {
> +        error_setg(errp, "Invalid argument");
> +        ret = -EINVAL;
> +        goto fail_unref;
> +    }
> +    max_write_zero = qemu_opt_get_size(opts, "max-write-zero", 0);
> +    if (max_write_zero < INT_MAX &&
> +        QEMU_IS_ALIGNED(max_write_zero, MAX(opt_write_zero,
> +                                         MAX(align, BDRV_SECTOR_SIZE)))) {
> +        s->max_write_zero = max_write_zero;
> +    } else if (max_write_zero) {
> +        error_setg(errp, "Invalid argument");
> +        ret = -EINVAL;
> +        goto fail_unref;
> +    }
> +    opt_discard = qemu_opt_get_size(opts, "opt-discard", 0);
> +    if (opt_discard < INT_MAX &&
> +        QEMU_IS_ALIGNED(opt_discard, MAX(align, BDRV_SECTOR_SIZE))) {
> +        s->opt_discard = opt_discard;
> +    } else if (opt_discard) {
> +        error_setg(errp, "Invalid argument");
> +        ret = -EINVAL;
> +        goto fail_unref;
> +    }
> +    max_discard = qemu_opt_get_size(opts, "max-discard", 0);
> +    if (max_discard < INT_MAX &&
> +        QEMU_IS_ALIGNED(max_discard, MAX(opt_discard,
> +                                         MAX(align, BDRV_SECTOR_SIZE)))) {
> +        s->max_discard = max_discard;
> +    } else if (max_discard) {
> +        error_setg(errp, "Invalid argument");
> +        ret = -EINVAL;
> +        goto fail_unref;
> +    }
> 
>      ret = 0;
>      goto out;
> @@ -807,6 +885,21 @@ static void blkdebug_refresh_limits(BlockDriverState *bs, Error **errp)
>      if (s->align) {
>          bs->bl.request_alignment = s->align;
>      }
> +    if (s->max_transfer) {
> +        bs->bl.max_transfer = s->max_transfer;
> +    }
> +    if (s->opt_write_zero) {
> +        bs->bl.pwrite_zeroes_alignment = s->opt_write_zero;
> +    }
> +    if (s->max_write_zero) {
> +        bs->bl.max_pwrite_zeroes = s->max_write_zero;
> +    }
> +    if (s->opt_discard) {
> +        bs->bl.pdiscard_alignment = s->opt_discard;
> +    }
> +    if (s->max_discard) {
> +        bs->bl.max_pdiscard = s->max_discard;
> +    }
>  }
> 
>  static int blkdebug_reopen_prepare(BDRVReopenState *reopen_state,
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
  2016-11-17 23:01     ` Eric Blake
@ 2016-11-17 23:03       ` Max Reitz
  0 siblings, 0 replies; 47+ messages in thread
From: Max Reitz @ 2016-11-17 23:03 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Stefan Hajnoczi, Fam Zheng

[-- Attachment #1: Type: text/plain, Size: 2010 bytes --]

On 18.11.2016 00:01, Eric Blake wrote:
> On 11/17/2016 04:26 PM, Max Reitz wrote:
>> On 17.11.2016 21:13, Eric Blake wrote:
>>> Discard is advisory, so rounding the requests to alignment
>>> boundaries is never semantically wrong from the data that
>>> the guest sees.  But at least the Dell Equallogic iSCSI SANs
>>> has an interesting property that its advertised discard
>>> alignment is 15M, yet documents that discarding a sequence
>>> of 1M slices will eventually result in the 15M page being
>>> marked as discarded, and it is possible to observe which
>>> pages have been discarded.
>>>
> 
>>>      max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
>>>                                     align);
>>> -    assert(max_pdiscard);
>>> +    assert(max_pdiscard >= bs->bl.request_alignment);
>>>
>>>      while (count > 0) {
>>>          int ret;
>>> -        int num = MIN(count, max_pdiscard);
>>> +        int num = count;
>>> +
>>> +        if (head) {
>>> +            /* Make small requests to get to alignment boundaries. */
>>> +            num = MIN(count, align - head);
>>> +            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
>>> +                num %= bs->bl.request_alignment;
>>> +            }
>>
>> Could be written as
>>
>> num = num % bs->bl.request_alignment ?: num;
>>
>> But that's up to you.
>>
>> More importantly, is it possible that request_alignment >
>> pdiscard_alignment? In that case, align would be request_alignment, head
>> would be less than request_alignment but could be more than
>> pdiscard_alignment.
> 
> pdiscard_alignment can be 0 (no inherent limit); but it if it is
> nonzero, it must be at least request_alignment.  The block layer (should
> be, if it is not already) enforcing that as part of the
> .bdrv_refresh_limits() callback contract; at any rate, it is documented
> that way in block_int.h

Yes, you're right. Thus:

Reviewed-by: Max Reitz <mreitz@redhat.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes
  2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes Eric Blake
@ 2016-11-17 23:19   ` Max Reitz
  2016-11-18  1:19     ` Eric Blake
  2016-11-17 23:42   ` Max Reitz
  1 sibling, 1 reply; 47+ messages in thread
From: Max Reitz @ 2016-11-17 23:19 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, kwolf, pbonzini

[-- Attachment #1: Type: text/plain, Size: 3357 bytes --]

On 17.11.2016 21:14, Eric Blake wrote:
> Use blkdebug's new geometry constraints to emulate setups that
> have caused recent regression fixes: write zeroes asserting
> when running through a loopback block device with max-transfer
> smaller than cluster size, and discard rounding away requests
> that were not aligned to power-of-two boundaries.  Also, add
> coverage that the block layer is honoring max transfer limits.
> 
> For now, a single iotest performs all actions, with the idea
> that we can add future blkdebug constraint test cases in the
> same file; but it can be split into multiple iotests if we find
> reason to run one portion of the test in more setups than what
> are possible in the other.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  tests/qemu-iotests/173     | 115 +++++++++++++++++++++++++++++++++++++++++++++
>  tests/qemu-iotests/173.out |  49 +++++++++++++++++++
>  tests/qemu-iotests/group   |   1 +
>  3 files changed, 165 insertions(+)
>  create mode 100755 tests/qemu-iotests/173
>  create mode 100644 tests/qemu-iotests/173.out
> 
> diff --git a/tests/qemu-iotests/173 b/tests/qemu-iotests/173
> new file mode 100755
> index 0000000..1215759
> --- /dev/null
> +++ b/tests/qemu-iotests/173

[...]

> +echo
> +echo "== write zero with constrained max-transfer =="
> +limits=max-transfer=64k,opt-write-zero=$CLUSTER_SIZE
> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
> +         -c "write -z 8003584 2093056" | _filter_qemu_io
> +
> +# Dell Equallogic iSCSI device is unusual with its 15M page size

Very minor nagging: A simple "Test non-power-of-two write-zero/discard
alignment" might get the point across better.

> +echo
> +echo "== non-power-of-2 write zeroes =="
> +
> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
> +         -c "write -z 32M 32M" | _filter_qemu_io
> +
> +echo
> +echo "== non-power-of-2 discard =="
> +
> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
> +         -c "discard 80000000 30M" | _filter_qemu_io
> +
> +echo
> +echo "== verify image content =="
> +
> +function verify_io()
> +{
> +    if ($QEMU_IMG info -f "$IMGFMT" "$TEST_IMG" |
> +	    grep "compat: 0.10" > /dev/null); then
> +        # For v2 images, discarded clusters are read from the backing file
> +        discarded=11
> +    else
> +        # Discarded clusters are zeroed for v3 or later
> +        discarded=0
> +    fi

This is fine since you've already done the work to support compat=0.10,
but I think we've had v3 long enough that you could have just put
compat=0.10 into _unsupported_imgopts.

Reviewed-by: Max Reitz <mreitz@redhat.com>

> +
> +    echo read -P 22 0 1000
> +    echo read -P 33 1000 128k
> +    echo read -P 22 132072 7871512
> +    echo read -P 0 8003584 2093056
> +    echo read -P 22 10096640 23457792
> +    echo read -P 0 32M 32M
> +    echo read -P 22 64M 13M
> +    echo read -P $discarded 77M 29M
> +    echo read -P 22 106M 22M
> +}
> +
> +verify_io | $QEMU_IO "$TEST_IMG" | _filter_qemu_io
> +
> +_check_test_img
> +
> +# success, all done
> +echo "*** done"
> +status=0


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 7/9] blkdebug: Add pass-through write_zero and discard support
  2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 7/9] blkdebug: Add pass-through write_zero and discard support Eric Blake
  2016-11-17 22:47   ` Max Reitz
@ 2016-11-17 23:27   ` Max Reitz
  2016-11-18  1:17     ` Eric Blake
  1 sibling, 1 reply; 47+ messages in thread
From: Max Reitz @ 2016-11-17 23:27 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, kwolf, pbonzini

[-- Attachment #1: Type: text/plain, Size: 1799 bytes --]

On 17.11.2016 21:14, Eric Blake wrote:
> In order to test the effects of artificial geometry constraints
> on operations like write zero or discard, we first need blkdebug
> to manage these actions.  Ideally, it would be nice to let these
> operations also react to injected errors like read/write/flush,
> but it is not trivial to turn bdrv_aio error injection (where
> we return BlockAIOCB*) into bdrv_co (where we return int), not
> to mention the fact that I don't want to conflict with Kevin's
> concurrent work on refactoring away from bdrv_aio.  So for now,
> the operations merely have a TODO comment for adding error
> injection.
> 
> However, one thing we CAN test is the contract promised by the
> block layer; namely, if a device has specified limits on
> alignment or maximum size, then those limits must be obeyed (for
> now, the blkdebug driver merely inherits limits from whatever it
> is wrapping, but the next patch will further enhance it to allow
> specific limit overrides).
> 
> Tested by setting up an NBD server with export 'foo', then invoking:
> $ ./qemu-io
> qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo
> qemu-io> d 0 15M
> qemu-io> w -z 0 15M
> 
> Pre-patch, the server never sees the discard (it was silently
> eaten by the block layer); post-patch it is passed across the
> wire.  Likewise, pre-patch the write is always passed with
> NBD_WRITE (with 15M of zeroes on the wire), while post-patch
> it can utilize NBD_WRITE_ZEROES (for less traffic).
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/blkdebug.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 61 insertions(+)

I probably have to take my R-b back, since this patch breaks iotest 98.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes
  2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes Eric Blake
  2016-11-17 23:19   ` Max Reitz
@ 2016-11-17 23:42   ` Max Reitz
  2016-11-18  1:28     ` Eric Blake
  1 sibling, 1 reply; 47+ messages in thread
From: Max Reitz @ 2016-11-17 23:42 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, kwolf, pbonzini

[-- Attachment #1: Type: text/plain, Size: 2468 bytes --]

On 17.11.2016 21:14, Eric Blake wrote:
> Use blkdebug's new geometry constraints to emulate setups that
> have caused recent regression fixes: write zeroes asserting
> when running through a loopback block device with max-transfer
> smaller than cluster size, and discard rounding away requests
> that were not aligned to power-of-two boundaries.  Also, add
> coverage that the block layer is honoring max transfer limits.
> 
> For now, a single iotest performs all actions, with the idea
> that we can add future blkdebug constraint test cases in the
> same file; but it can be split into multiple iotests if we find
> reason to run one portion of the test in more setups than what
> are possible in the other.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  tests/qemu-iotests/173     | 115 +++++++++++++++++++++++++++++++++++++++++++++
>  tests/qemu-iotests/173.out |  49 +++++++++++++++++++
>  tests/qemu-iotests/group   |   1 +
>  3 files changed, 165 insertions(+)
>  create mode 100755 tests/qemu-iotests/173
>  create mode 100644 tests/qemu-iotests/173.out
> 
> diff --git a/tests/qemu-iotests/173 b/tests/qemu-iotests/173
> new file mode 100755
> index 0000000..1215759
> --- /dev/null
> +++ b/tests/qemu-iotests/173

[...]

> +# Dell Equallogic iSCSI device is unusual with its 15M page size
> +echo
> +echo "== non-power-of-2 write zeroes =="
> +
> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
> +         -c "write -z 32M 32M" | _filter_qemu_io
> +
> +echo
> +echo "== non-power-of-2 discard =="
> +
> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
> +         -c "discard 80000000 30M" | _filter_qemu_io

Question: What does this test has to do with iscsi?

The first case just tests that we fall back to writing the head and tail
as full zeroes when the driver returns -ENOTSUP.

The second test, as far as I can see, just gives some discard request to
blkdebug (split up into head, mid and tail), which blkdebug just passes
on (because 80000000 is a multiple of 512). qcow2 then discards part of
that and drops the head and tail of the request it receives (but head
and tail are now calculated based on qcow2's 64k limit).

What does that have to do with said iscsi device, though?

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers Eric Blake
  2016-11-17 22:26   ` Max Reitz
@ 2016-11-17 23:44   ` Max Reitz
  2016-11-18  1:13     ` Eric Blake
  2016-11-22 14:03   ` Kevin Wolf
  2 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2016-11-17 23:44 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Stefan Hajnoczi, Fam Zheng

[-- Attachment #1: Type: text/plain, Size: 1820 bytes --]

On 17.11.2016 21:13, Eric Blake wrote:
> Discard is advisory, so rounding the requests to alignment
> boundaries is never semantically wrong from the data that
> the guest sees.  But at least the Dell Equallogic iSCSI SANs
> has an interesting property that its advertised discard
> alignment is 15M, yet documents that discarding a sequence
> of 1M slices will eventually result in the 15M page being
> marked as discarded, and it is possible to observe which
> pages have been discarded.
> 
> Between commits 9f1963b and b8d0a980, we converted the block
> layer to a byte-based interface that ultimately ignores any
> unaligned head or tail based on the driver's advertised
> discard granularity, which means that qemu 2.7 refuses to
> pass any discard request smaller than 15M down to the Dell
> Equallogic hardware.  This is a slight regression in behavior
> compared to earlier qemu, where a guest executing discards
> in power-of-2 chunks used to be able to get every page
> discarded, but is now left with various pages still allocated
> because the guest requests did not align with the hardware's
> 15M pages.
> 
> Since the SCSI specification says nothing about a minimum
> discard granularity, and only documents the preferred
> alignment, it is best if the block layer gives the driver
> every bit of information about discard requests, rather than
> rounding it to alignment boundaries early.

Is this series supposed to address this issue? Because if so, I fail to
see where it does. If the device advertises 15 MB as the discard
granularity, then the iscsi driver will still drop all discard requests
that are not aligned to 15 MB boundaries, no?

The only difference is that it's now the iscsi driver that drops the
request instead of the generic block layer.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
  2016-11-17 23:44   ` Max Reitz
@ 2016-11-18  1:13     ` Eric Blake
  2016-11-19 22:05       ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Blake @ 2016-11-18  1:13 UTC (permalink / raw)
  To: Max Reitz, qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Stefan Hajnoczi,
	Fam Zheng, Peter Lieven

[-- Attachment #1: Type: text/plain, Size: 1886 bytes --]

On 11/17/2016 05:44 PM, Max Reitz wrote:
>>
>> Since the SCSI specification says nothing about a minimum
>> discard granularity, and only documents the preferred
>> alignment, it is best if the block layer gives the driver
>> every bit of information about discard requests, rather than
>> rounding it to alignment boundaries early.
> 
> Is this series supposed to address this issue? Because if so, I fail to
> see where it does. If the device advertises 15 MB as the discard
> granularity, then the iscsi driver will still drop all discard requests
> that are not aligned to 15 MB boundaries, no?
> 

I don't have access to the device in question, so I'm hoping Peter
chimes in (oops, how'd I miss him on original CC?).  Here's all the more
he said on v1:
https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg02223.html

My gut feel, however, is that the iscsi code is NOT rounding (the qcow2
code rounded, but that's different); the regression happened in 2.7
because the block layer also started rounding, and this patch gets the
block layer rounding out of the way. If nothing changed in the iscsi
code in the meantime, then the iscsi code should now (once again) be
discarding all sizes, regardless of the 15M advertisement.

Meanwhile, I _did_ test this patch with blkdebug (see 9/9), as well as
on a slightly modified NBD client that forced 15M alignments, and for
those cases, it definitely made the difference on whether all bytes were
passed (spread across several calls), vs. just the aligned bytes in the
middle of a request larger than 15M.

> The only difference is that it's now the iscsi driver that drops the
> request instead of the generic block layer.

If the iscsi driver was ever dropping it in the first place.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 7/9] blkdebug: Add pass-through write_zero and discard support
  2016-11-17 23:27   ` Max Reitz
@ 2016-11-18  1:17     ` Eric Blake
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-18  1:17 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: qemu-block, kwolf, pbonzini

[-- Attachment #1: Type: text/plain, Size: 2288 bytes --]

On 11/17/2016 05:27 PM, Max Reitz wrote:
> On 17.11.2016 21:14, Eric Blake wrote:
>> In order to test the effects of artificial geometry constraints
>> on operations like write zero or discard, we first need blkdebug
>> to manage these actions.  Ideally, it would be nice to let these
>> operations also react to injected errors like read/write/flush,
>> but it is not trivial to turn bdrv_aio error injection (where
>> we return BlockAIOCB*) into bdrv_co (where we return int), not
>> to mention the fact that I don't want to conflict with Kevin's
>> concurrent work on refactoring away from bdrv_aio.  So for now,
>> the operations merely have a TODO comment for adding error
>> injection.
>>
>> However, one thing we CAN test is the contract promised by the
>> block layer; namely, if a device has specified limits on
>> alignment or maximum size, then those limits must be obeyed (for
>> now, the blkdebug driver merely inherits limits from whatever it
>> is wrapping, but the next patch will further enhance it to allow
>> specific limit overrides).
>>
>> Tested by setting up an NBD server with export 'foo', then invoking:
>> $ ./qemu-io
>> qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo
>> qemu-io> d 0 15M
>> qemu-io> w -z 0 15M
>>
>> Pre-patch, the server never sees the discard (it was silently
>> eaten by the block layer); post-patch it is passed across the
>> wire.  Likewise, pre-patch the write is always passed with
>> NBD_WRITE (with 15M of zeroes on the wire), while post-patch
>> it can utilize NBD_WRITE_ZEROES (for less traffic).
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>>  block/blkdebug.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 61 insertions(+)
> 
> I probably have to take my R-b back, since this patch breaks iotest 98.

Huh, I guess I tweaked things since the last time I reran full tests;
but I indeed see a failure in './check -qcow2 98'.  I'll investigate
tomorrow when I'm not as tired.

At least this patch is not 2.8 material, so respinning it doesn't affect
the other patches that should go in before the release.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes
  2016-11-17 23:19   ` Max Reitz
@ 2016-11-18  1:19     ` Eric Blake
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-18  1:19 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: qemu-block, kwolf, pbonzini

[-- Attachment #1: Type: text/plain, Size: 1200 bytes --]

On 11/17/2016 05:19 PM, Max Reitz wrote:
> On 17.11.2016 21:14, Eric Blake wrote:
>> Use blkdebug's new geometry constraints to emulate setups that
>> have caused recent regression fixes: write zeroes asserting
>> when running through a loopback block device with max-transfer
>> smaller than cluster size, and discard rounding away requests
>> that were not aligned to power-of-two boundaries.  Also, add
>> coverage that the block layer is honoring max transfer limits.
>>

>> +function verify_io()
>> +{
>> +    if ($QEMU_IMG info -f "$IMGFMT" "$TEST_IMG" |
>> +	    grep "compat: 0.10" > /dev/null); then
>> +        # For v2 images, discarded clusters are read from the backing file
>> +        discarded=11
>> +    else
>> +        # Discarded clusters are zeroed for v3 or later
>> +        discarded=0
>> +    fi
> 
> This is fine since you've already done the work to support compat=0.10,
> but I think we've had v3 long enough that you could have just put
> compat=0.10 into _unsupported_imgopts.

Copy-and-paste from 46, so it wasn't really that hard.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes
  2016-11-17 23:42   ` Max Reitz
@ 2016-11-18  1:28     ` Eric Blake
  2016-11-19 21:45       ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Blake @ 2016-11-18  1:28 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: qemu-block, kwolf, pbonzini

[-- Attachment #1: Type: text/plain, Size: 3815 bytes --]

On 11/17/2016 05:42 PM, Max Reitz wrote:
> On 17.11.2016 21:14, Eric Blake wrote:
>> Use blkdebug's new geometry constraints to emulate setups that
>> have caused recent regression fixes: write zeroes asserting
>> when running through a loopback block device with max-transfer
>> smaller than cluster size, and discard rounding away requests
>> that were not aligned to power-of-two boundaries.  Also, add
>> coverage that the block layer is honoring max transfer limits.
>>
>> For now, a single iotest performs all actions, with the idea
>> that we can add future blkdebug constraint test cases in the
>> same file; but it can be split into multiple iotests if we find
>> reason to run one portion of the test in more setups than what
>> are possible in the other.
>>

> 
>> +# Dell Equallogic iSCSI device is unusual with its 15M page size
>> +echo
>> +echo "== non-power-of-2 write zeroes =="
>> +
>> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
>> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
>> +         -c "write -z 32M 32M" | _filter_qemu_io
>> +
>> +echo
>> +echo "== non-power-of-2 discard =="
>> +
>> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
>> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
>> +         -c "discard 80000000 30M" | _filter_qemu_io
> 
> Question: What does this test has to do with iscsi?
> 
> The first case just tests that we fall back to writing the head and tail
> as full zeroes when the driver returns -ENOTSUP.

The first one isn't all that interesting, so much as making sure we
don't regress. I couldn't make it fail, pre-patch.  The real test is the
second one...

> 
> The second test, as far as I can see, just gives some discard request to
> blkdebug (split up into head, mid and tail), which blkdebug just passes
> on (because 80000000 is a multiple of 512). qcow2 then discards part of
> that and drops the head and tail of the request it receives (but head
> and tail are now calculated based on qcow2's 64k limit).

Thanks to the opt-discard=15M, the blkdebug layer is forcing the block
layer to break it into head/middle/tail on 15M boundaries, but throwing
away the head and tail without giving blkdebug a chance, so it only
zeroed 90-105M.  Then, with the block layer fixed to pass the head on
through anyways, but without patch 2/9, the qcow2 code was seeing that
the start offset was not cluster-aligned ($TEST_IMG has 1M clusters),
and with patch 4/9 that was making qcow2 return -ENOTSUP, and still
ignoring everything up to 90M.  It took all of 2, 4, and 5 before the
discard finally affected the range 77M-90M (since 80000000 is just
before 77M).

Maybe I should tweak the number to not be a multiple of 512, to doubly
make sure that the double-alignment code in io.c 5/9 is doing the right
job (I didn't even check that 80000000 is indeed a multiple of 512).

You're right that it doesn't tickle iscsi code, so much as tickling the
block layer code that was previously throwing away the unaligned
portions rather than passing them on through anyways.  The test was
inspired because of the iscsi regression, but I was able to rework it
into something that reproducibly failed even without iscsi in the mix,
until the block layer was fixed.  But if that means cleaning up the
comment on respin, I'm fine with that.

> 
> What does that have to do with said iscsi device, though?

Only that the unusual choice of 15M alignment and maximum discard sizing
was chosen to match that particular iscsi device, because that was the
device where the regression was first reported.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/9] nbd: Allow unmap and fua during write zeroes
  2016-11-17 21:10   ` Max Reitz
  2016-11-17 21:14     ` Eric Blake
@ 2016-11-18 14:12     ` Paolo Bonzini
  1 sibling, 0 replies; 47+ messages in thread
From: Paolo Bonzini @ 2016-11-18 14:12 UTC (permalink / raw)
  To: Max Reitz, Eric Blake, qemu-devel; +Cc: qemu-block, kwolf, qemu-stable

[-- Attachment #1: Type: text/plain, Size: 912 bytes --]



On 17/11/2016 22:10, Max Reitz wrote:
> On 17.11.2016 21:13, Eric Blake wrote:
>> Commit fa778fff wired up support to send the NBD_CMD_WRITE_ZEROES,
>> but forgot to inform the block layer that FUA unmapping of zeroes is
>> supported.  Without BDRV_REQ_MAY_UNMAP listed as a supported flag,
>> the block layer will always insist on the NBD layer passing
>> NBD_CMD_FLAG_NO_HOLE, resulting in the server always allocating
>> things even when it was desired to let the server punch holes.
>> Similarly, failing to set BDRV_REQ_FUA means that the client may
>> send unnecessary NBD_CMD_FLUSH when it could have instead used the
>> NBD_CMD_FLAG_FUA bit.
>>
>> CC: qemu-stable@nongnu.org
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>>  block/nbd-client.c | 4 ++++
>>  1 file changed, 4 insertions(+)
> 
> Reviewed-by: Max Reitz <mreitz@redhat.com>

I'll take this one.

Paolo


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 4/9] block: Return -ENOTSUP rather than assert on unaligned discards
  2016-11-17 22:01   ` Max Reitz
@ 2016-11-18 22:48     ` Eric Blake
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-18 22:48 UTC (permalink / raw)
  To: Max Reitz, qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Ronnie Sahlberg,
	Peter Lieven, Hitoshi Mitake, Liu Yuan, Jeff Cody,
	open list:Sheepdog

[-- Attachment #1: Type: text/plain, Size: 5215 bytes --]

On 11/17/2016 04:01 PM, Max Reitz wrote:
> On 17.11.2016 21:13, Eric Blake wrote:
>> Right now, the block layer rounds discard requests, so that
>> individual drivers are able to assert that discard requests
>> will never be unaligned.  But there are some ISCSI devices
>> that track and coalesce multiple unaligned requests, turning it
>> into an actual discard if the requests eventually cover an
>> entire page, which implies that it is better to always pass
>> discard requests as low down the stack as possible.
>>
>> In isolation, this patch has no semantic effect, since the
>> block layer currently never passes an unaligned request through.
>> But the block layer already has code that silently ignores
>> drivers that return -ENOTSUP for a discard request that cannot
>> be honored (as well as drivers that return 0 even when nothing
>> was done).  But the next patch will update the block layer to
>> fragment discard requests, so that clients are guaranteed that
>> they are either dealing with an unaligned head or tail, or an
>> aligned core, making it similar to the block layer semantics of
>> write zero fragmentation.
>>

>> +++ b/block/iscsi.c
>> @@ -1083,7 +1083,9 @@ coroutine_fn iscsi_co_pdiscard(BlockDriverState *bs, int64_t offset, int count)
>>      struct IscsiTask iTask;
>>      struct unmap_list list;
>>
>> -    assert(is_byte_request_lun_aligned(offset, count, iscsilun));
>> +    if (!is_byte_request_lun_aligned(offset, count, iscsilun)) {
>> +        return -ENOTSUP;
>> +    }
>>
>>      if (!iscsilun->lbp.lbpu) {
>>          /* UNMAP is not supported by the target */
> 
> Next line is:
> 
>>          return 0;
> 
> Hmm... -ENOTSUP would be the obvious return value here, too. That might
> interfere with your next patch, though.

Shouldn't interfere. I guess no one value is better than the other; I
can respin to pick a consistent value (I'd lean towards -ENOTSUP) if you
think it is worth it; but I'd rather get this into 2.8 without worrying
about it.

> 
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index e22f6dc..7cfcd84 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -2491,6 +2491,11 @@ static coroutine_fn int qcow2_co_pdiscard(BlockDriverState *bs,
>>      int ret;
>>      BDRVQcow2State *s = bs->opaque;
>>
>> +    if (!QEMU_IS_ALIGNED(offset | count, s->cluster_size)) {
> 
> Ha! I like "offset | count".

It only works because we know that qcow2 guarantees that s->cluster_size
is a power of 2 (it does not work at the block layer, where the
bs->bl.pdiscard_align need not be a power of 2).

> 
>> +        assert(count < s->cluster_size);
> 
> Maybe add a comment for this assertion? E.g. "The block layer will only
> generate unaligned discard requests that are smaller than the alignment".

Sure, if the maintainer wants a respin.

> 
>> +        return -ENOTSUP;
>> +    }
>> +
>>      qemu_co_mutex_lock(&s->lock);
>>      ret = qcow2_discard_clusters(bs, offset, count >> BDRV_SECTOR_BITS,
>>                                   QCOW2_DISCARD_REQUEST, false);
>> diff --git a/block/sheepdog.c b/block/sheepdog.c
>> index 1fb9173..4c9af89 100644
>> --- a/block/sheepdog.c
>> +++ b/block/sheepdog.c
>> @@ -2829,8 +2829,9 @@ static coroutine_fn int sd_co_pdiscard(BlockDriverState *bs, int64_t offset,
>>      iov.iov_len = sizeof(zero);
>>      discard_iov.iov = &iov;
>>      discard_iov.niov = 1;
>> -    assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
>> -    assert((count & (BDRV_SECTOR_SIZE - 1)) == 0);
>> +    if (!QEMU_IS_ALIGNED(offset | count, BDRV_SECTOR_SIZE)) {

Again, works because this is a power of 2.

>> +        return -ENOTSUP;
>> +    }
> 
> Out of interest: Where does sheepdog tell the block layer that requests
> have to be aligned to this value?

It's Magic !  Don't tell anyone that I told you :)

Actually, it's because sheepdog still uses .bdrv_co_readv
(sector-based), and not .bdrv_co_preadv (byte-based), so the block layer
automatically sets bs->bl.request_alignment to BDRV_SECTOR_SIZE for
sheepdog and all other old-school drivers.  The block layer then treats
bs->bl.request_alignment as the very minimum that can be changed in the
image at a time, so it only makes sense that sheepdog can't react to a
discard request aligned below those limits.

This code is weakening from an assertion to an early error return, and
then 5/9 is what starts calling the code even with something aligned
smaller than a sector.

Someday the sheepdog driver may be relaxed to implement byte-based
callbacks, and then we may want to delete the early error return of
-ENOTSUP for requests smaller than 512; but that depends on whether
sheepdog uses bytes or sectors over the wire.

> 
> With this patch, it doesn't matter though, it only did before, so:
> 
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> 
>>      acb = sd_aio_setup(bs, &discard_iov, offset >> BDRV_SECTOR_BITS,
>>                         count >> BDRV_SECTOR_BITS);
>>      acb->aiocb_type = AIOCB_DISCARD_OBJ;
>>
> 
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 7/9] blkdebug: Add pass-through write_zero and discard support
  2016-11-17 22:47   ` Max Reitz
@ 2016-11-18 23:08     ` Eric Blake
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-18 23:08 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: qemu-block, kwolf, pbonzini

[-- Attachment #1: Type: text/plain, Size: 3024 bytes --]

On 11/17/2016 04:47 PM, Max Reitz wrote:
> On 17.11.2016 21:14, Eric Blake wrote:
>> In order to test the effects of artificial geometry constraints
>> on operations like write zero or discard, we first need blkdebug
>> to manage these actions.  Ideally, it would be nice to let these
>> operations also react to injected errors like read/write/flush,
>> but it is not trivial to turn bdrv_aio error injection (where
>> we return BlockAIOCB*) into bdrv_co (where we return int), not
>> to mention the fact that I don't want to conflict with Kevin's
>> concurrent work on refactoring away from bdrv_aio.  So for now,
>> the operations merely have a TODO comment for adding error
>> injection.
>>
>> However, one thing we CAN test is the contract promised by the
>> block layer; namely, if a device has specified limits on
>> alignment or maximum size, then those limits must be obeyed (for
>> now, the blkdebug driver merely inherits limits from whatever it
>> is wrapping, but the next patch will further enhance it to allow
>> specific limit overrides).
>>
>> Tested by setting up an NBD server with export 'foo', then invoking:
>> $ ./qemu-io
>> qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo
>> qemu-io> d 0 15M
>> qemu-io> w -z 0 15M
>>
>> Pre-patch, the server never sees the discard (it was silently
>> eaten by the block layer); post-patch it is passed across the
>> wire.  Likewise, pre-patch the write is always passed with
>> NBD_WRITE (with 15M of zeroes on the wire), while post-patch
>> it can utilize NBD_WRITE_ZEROES (for less traffic).
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>>  block/blkdebug.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 61 insertions(+)
>>
>> diff --git a/block/blkdebug.c b/block/blkdebug.c
>> index 0a47977..d45826d 100644
>> --- a/block/blkdebug.c
>> +++ b/block/blkdebug.c
> 
> [...]
> 
>> @@ -522,6 +528,59 @@ static BlockAIOCB *blkdebug_aio_flush(BlockDriverState *bs,
>>  }
>>
>>
>> +static int coroutine_fn blkdebug_co_pwrite_zeroes(BlockDriverState *bs,
>> +                                                  int64_t offset, int count,
>> +                                                  BdrvRequestFlags flags)
>> +{
>> +    uint32_t align = MAX(bs->bl.request_alignment,
>> +                         bs->bl.pwrite_zeroes_alignment);
>> +
>> +    /* Regardless of whether the lower layer has a finer granularity,
>> +     * we want to treat any unaligned request as unsupported, and
> 
> Why?

Hmm, at the moment, I'm having a hard time coming up with a strong
reason why I did that. I'll retest without it, and see if it still picks
up the regression fixed by 3/5; if so I'll drop it as part of the respin
(since I still have the iotest to fix); if not, I'll have a good reason
why and include it in the commit message.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes
  2016-11-18  1:28     ` Eric Blake
@ 2016-11-19 21:45       ` Max Reitz
  2016-11-19 22:17         ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2016-11-19 21:45 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, kwolf, pbonzini

[-- Attachment #1: Type: text/plain, Size: 4615 bytes --]

On 18.11.2016 02:28, Eric Blake wrote:
> On 11/17/2016 05:42 PM, Max Reitz wrote:
>> On 17.11.2016 21:14, Eric Blake wrote:
>>> Use blkdebug's new geometry constraints to emulate setups that
>>> have caused recent regression fixes: write zeroes asserting
>>> when running through a loopback block device with max-transfer
>>> smaller than cluster size, and discard rounding away requests
>>> that were not aligned to power-of-two boundaries.  Also, add
>>> coverage that the block layer is honoring max transfer limits.
>>>
>>> For now, a single iotest performs all actions, with the idea
>>> that we can add future blkdebug constraint test cases in the
>>> same file; but it can be split into multiple iotests if we find
>>> reason to run one portion of the test in more setups than what
>>> are possible in the other.
>>>
> 
>>
>>> +# Dell Equallogic iSCSI device is unusual with its 15M page size
>>> +echo
>>> +echo "== non-power-of-2 write zeroes =="
>>> +
>>> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
>>> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
>>> +         -c "write -z 32M 32M" | _filter_qemu_io
>>> +
>>> +echo
>>> +echo "== non-power-of-2 discard =="
>>> +
>>> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
>>> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
>>> +         -c "discard 80000000 30M" | _filter_qemu_io
>>
>> Question: What does this test has to do with iscsi?
>>
>> The first case just tests that we fall back to writing the head and tail
>> as full zeroes when the driver returns -ENOTSUP.
> 
> The first one isn't all that interesting, so much as making sure we
> don't regress. I couldn't make it fail, pre-patch.  The real test is the
> second one...
> 
>>
>> The second test, as far as I can see, just gives some discard request to
>> blkdebug (split up into head, mid and tail), which blkdebug just passes
>> on (because 80000000 is a multiple of 512). qcow2 then discards part of
>> that and drops the head and tail of the request it receives (but head
>> and tail are now calculated based on qcow2's 64k limit).
> 
> Thanks to the opt-discard=15M, the blkdebug layer is forcing the block
> layer to break it into head/middle/tail on 15M boundaries, but throwing
> away the head and tail without giving blkdebug a chance, so it only
> zeroed 90-105M.  Then, with the block layer fixed to pass the head on
> through anyways, but without patch 2/9, the qcow2 code was seeing that
> the start offset was not cluster-aligned ($TEST_IMG has 1M clusters),
> and with patch 4/9 that was making qcow2 return -ENOTSUP, and still
> ignoring everything up to 90M.  It took all of 2, 4, and 5 before the
> discard finally affected the range 77M-90M (since 80000000 is just
> before 77M).

OK, thank you for the explanation, but again, I don't know how that is
related to the iscsi case. It's a good test and you should keep it but
you should probably change the comment about the iSCSI device because as
far as I can see, this has nothing to do with it.

In said iSCSI case, the block driver limiting the alignment wouldn't be
the format block driver or blkdebug, but the protocol driver (i.e.
iscsi). As I said in my reply to patch 5, though, I don't think that
your series changes behavior in that case.

Old behavior: The block layer cuts off head and tail from the discard
request before sending it to the iscsi driver. That driver then asserts
that the request is aligned and proceeds.

New behavior: The block layer sends head and tail separately from the
rest. The iscsi driver discards head and tail, though, because they are
not aligned, and thus again only sends the aligned part of the request
to the device.

> Maybe I should tweak the number to not be a multiple of 512, to doubly
> make sure that the double-alignment code in io.c 5/9 is doing the right
> job (I didn't even check that 80000000 is indeed a multiple of 512).

I was a bit surprised myself. :-)

> You're right that it doesn't tickle iscsi code, so much as tickling the
> block layer code that was previously throwing away the unaligned
> portions rather than passing them on through anyways.  The test was
> inspired because of the iscsi regression, but I was able to rework it
> into something that reproducibly failed even without iscsi in the mix,
> until the block layer was fixed.  But if that means cleaning up the
> comment on respin, I'm fine with that.

Yes, that would be very much fine with me.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
  2016-11-18  1:13     ` Eric Blake
@ 2016-11-19 22:05       ` Max Reitz
  2016-11-21 13:39         ` Peter Lieven
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2016-11-19 22:05 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Stefan Hajnoczi,
	Fam Zheng, Peter Lieven

[-- Attachment #1: Type: text/plain, Size: 3368 bytes --]

On 18.11.2016 02:13, Eric Blake wrote:
> On 11/17/2016 05:44 PM, Max Reitz wrote:
>>>
>>> Since the SCSI specification says nothing about a minimum
>>> discard granularity, and only documents the preferred
>>> alignment, it is best if the block layer gives the driver
>>> every bit of information about discard requests, rather than
>>> rounding it to alignment boundaries early.
>>
>> Is this series supposed to address this issue? Because if so, I fail to
>> see where it does. If the device advertises 15 MB as the discard
>> granularity, then the iscsi driver will still drop all discard requests
>> that are not aligned to 15 MB boundaries, no?
>>
> 
> I don't have access to the device in question, so I'm hoping Peter
> chimes in (oops, how'd I miss him on original CC?).  Here's all the more
> he said on v1:
> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg02223.html
> 
> My gut feel, however, is that the iscsi code is NOT rounding

Why are you relying on your gut feel when you can simply look into the
code in question? As a matter of fact, I know that you have looked at
the very piece of code in question because you touch it in patch 4.

Now that I looked at it again, though, I see my mistake, though: I
assumed that is_byte_request_lun_aligned() would check whether the
request is aligned to the advertised discard granularity. Of course, it
does not, though, it only checks whether it's aligned to the device's
block_size.

So with the device in question, block_size is something like, say, 1 MB
and the pdiscard_alignment is 15 MB. With your series, the block layer
will first split off the head of an unaligned discard request (with the
rest being aligned to the pdiscard_alignment) and then again the head
off that head (with regards to the request_alignment).

The iscsi driver will discard the head of the head (which isn't aligned
to the request_alignment), but pass the part of the head that is aligned
to request_alignment and of course pass everything that's aligned to
pdiscard_alignment, too.

(And the same then happens for the tail.)

OK, now I see.

>                                                              (the qcow2
> code rounded, but that's different); the regression happened in 2.7
> because the block layer also started rounding, and this patch gets the
> block layer rounding out of the way. If nothing changed in the iscsi
> code in the meantime, then the iscsi code should now (once again) be
> discarding all sizes, regardless of the 15M advertisement.

Well, all sizes that are aligned to the request_alignment.

> Meanwhile, I _did_ test this patch with blkdebug (see 9/9), as well as
> on a slightly modified NBD client that forced 15M alignments, and for
> those cases, it definitely made the difference on whether all bytes were
> passed (spread across several calls), vs. just the aligned bytes in the
> middle of a request larger than 15M.
> 
>> The only difference is that it's now the iscsi driver that drops the
>> request instead of the generic block layer.
> 
> If the iscsi driver was ever dropping it in the first place.

It wasn't dropping them, it was asserting that it was dropped; yes, my
mistake for thinking that is_byte_request_lun_aligned() would check
whether the request is aligned to pdiscard_alignment.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes
  2016-11-19 21:45       ` Max Reitz
@ 2016-11-19 22:17         ` Max Reitz
  2016-11-21 11:38           ` Kevin Wolf
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2016-11-19 22:17 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, kwolf, pbonzini

[-- Attachment #1: Type: text/plain, Size: 3663 bytes --]

On 19.11.2016 22:45, Max Reitz wrote:
> On 18.11.2016 02:28, Eric Blake wrote:
>> On 11/17/2016 05:42 PM, Max Reitz wrote:
>>> On 17.11.2016 21:14, Eric Blake wrote:
>>>> Use blkdebug's new geometry constraints to emulate setups that
>>>> have caused recent regression fixes: write zeroes asserting
>>>> when running through a loopback block device with max-transfer
>>>> smaller than cluster size, and discard rounding away requests
>>>> that were not aligned to power-of-two boundaries.  Also, add
>>>> coverage that the block layer is honoring max transfer limits.
>>>>
>>>> For now, a single iotest performs all actions, with the idea
>>>> that we can add future blkdebug constraint test cases in the
>>>> same file; but it can be split into multiple iotests if we find
>>>> reason to run one portion of the test in more setups than what
>>>> are possible in the other.
>>>>
>>
>>>
>>>> +# Dell Equallogic iSCSI device is unusual with its 15M page size
>>>> +echo
>>>> +echo "== non-power-of-2 write zeroes =="
>>>> +
>>>> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
>>>> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
>>>> +         -c "write -z 32M 32M" | _filter_qemu_io
>>>> +
>>>> +echo
>>>> +echo "== non-power-of-2 discard =="
>>>> +
>>>> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
>>>> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
>>>> +         -c "discard 80000000 30M" | _filter_qemu_io
>>>
>>> Question: What does this test has to do with iscsi?
>>>
>>> The first case just tests that we fall back to writing the head and tail
>>> as full zeroes when the driver returns -ENOTSUP.
>>
>> The first one isn't all that interesting, so much as making sure we
>> don't regress. I couldn't make it fail, pre-patch.  The real test is the
>> second one...
>>
>>>
>>> The second test, as far as I can see, just gives some discard request to
>>> blkdebug (split up into head, mid and tail), which blkdebug just passes
>>> on (because 80000000 is a multiple of 512). qcow2 then discards part of
>>> that and drops the head and tail of the request it receives (but head
>>> and tail are now calculated based on qcow2's 64k limit).
>>
>> Thanks to the opt-discard=15M, the blkdebug layer is forcing the block
>> layer to break it into head/middle/tail on 15M boundaries, but throwing
>> away the head and tail without giving blkdebug a chance, so it only
>> zeroed 90-105M.  Then, with the block layer fixed to pass the head on
>> through anyways, but without patch 2/9, the qcow2 code was seeing that
>> the start offset was not cluster-aligned ($TEST_IMG has 1M clusters),
>> and with patch 4/9 that was making qcow2 return -ENOTSUP, and still
>> ignoring everything up to 90M.  It took all of 2, 4, and 5 before the
>> discard finally affected the range 77M-90M (since 80000000 is just
>> before 77M).
> 
> OK, thank you for the explanation, but again, I don't know how that is
> related to the iscsi case.

And now with my last response to patch 5, I do know. So because the
"align" option isn't set, the request_alignment defaults to 512.
blkdebug will accept requests and pass them on even if they're not
aligned to pdiscard_alignment, so that's similar to how iscsi only drops
discard requests that are not aligned to request_alignment but passes
everything on regardless of whether it's aligned to pdiscard_alignment.

I'd be fine with either adjusting the "Dell..." comment to be something
entirely iscsi-unrelated, or with a description of the complete picture
here.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 480 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes
  2016-11-19 22:17         ` Max Reitz
@ 2016-11-21 11:38           ` Kevin Wolf
  2016-11-21 16:16             ` Eric Blake
  0 siblings, 1 reply; 47+ messages in thread
From: Kevin Wolf @ 2016-11-21 11:38 UTC (permalink / raw)
  To: Max Reitz; +Cc: Eric Blake, qemu-devel, qemu-block, pbonzini

[-- Attachment #1: Type: text/plain, Size: 3858 bytes --]

Am 19.11.2016 um 23:17 hat Max Reitz geschrieben:
> On 19.11.2016 22:45, Max Reitz wrote:
> > On 18.11.2016 02:28, Eric Blake wrote:
> >> On 11/17/2016 05:42 PM, Max Reitz wrote:
> >>> On 17.11.2016 21:14, Eric Blake wrote:
> >>>> Use blkdebug's new geometry constraints to emulate setups that
> >>>> have caused recent regression fixes: write zeroes asserting
> >>>> when running through a loopback block device with max-transfer
> >>>> smaller than cluster size, and discard rounding away requests
> >>>> that were not aligned to power-of-two boundaries.  Also, add
> >>>> coverage that the block layer is honoring max transfer limits.
> >>>>
> >>>> For now, a single iotest performs all actions, with the idea
> >>>> that we can add future blkdebug constraint test cases in the
> >>>> same file; but it can be split into multiple iotests if we find
> >>>> reason to run one portion of the test in more setups than what
> >>>> are possible in the other.
> >>>>
> >>
> >>>
> >>>> +# Dell Equallogic iSCSI device is unusual with its 15M page size
> >>>> +echo
> >>>> +echo "== non-power-of-2 write zeroes =="
> >>>> +
> >>>> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
> >>>> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
> >>>> +         -c "write -z 32M 32M" | _filter_qemu_io
> >>>> +
> >>>> +echo
> >>>> +echo "== non-power-of-2 discard =="
> >>>> +
> >>>> +limits=opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
> >>>> +$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
> >>>> +         -c "discard 80000000 30M" | _filter_qemu_io
> >>>
> >>> Question: What does this test has to do with iscsi?
> >>>
> >>> The first case just tests that we fall back to writing the head and tail
> >>> as full zeroes when the driver returns -ENOTSUP.
> >>
> >> The first one isn't all that interesting, so much as making sure we
> >> don't regress. I couldn't make it fail, pre-patch.  The real test is the
> >> second one...
> >>
> >>>
> >>> The second test, as far as I can see, just gives some discard request to
> >>> blkdebug (split up into head, mid and tail), which blkdebug just passes
> >>> on (because 80000000 is a multiple of 512). qcow2 then discards part of
> >>> that and drops the head and tail of the request it receives (but head
> >>> and tail are now calculated based on qcow2's 64k limit).
> >>
> >> Thanks to the opt-discard=15M, the blkdebug layer is forcing the block
> >> layer to break it into head/middle/tail on 15M boundaries, but throwing
> >> away the head and tail without giving blkdebug a chance, so it only
> >> zeroed 90-105M.  Then, with the block layer fixed to pass the head on
> >> through anyways, but without patch 2/9, the qcow2 code was seeing that
> >> the start offset was not cluster-aligned ($TEST_IMG has 1M clusters),
> >> and with patch 4/9 that was making qcow2 return -ENOTSUP, and still
> >> ignoring everything up to 90M.  It took all of 2, 4, and 5 before the
> >> discard finally affected the range 77M-90M (since 80000000 is just
> >> before 77M).
> > 
> > OK, thank you for the explanation, but again, I don't know how that is
> > related to the iscsi case.
> 
> And now with my last response to patch 5, I do know. So because the
> "align" option isn't set, the request_alignment defaults to 512.
> blkdebug will accept requests and pass them on even if they're not
> aligned to pdiscard_alignment, so that's similar to how iscsi only drops
> discard requests that are not aligned to request_alignment but passes
> everything on regardless of whether it's aligned to pdiscard_alignment.

If the 512 byte alignment is important, please make it explicit. I have
a patch to make blkdebug work with byte alignment and then the default
will be 1.

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
  2016-11-19 22:05       ` Max Reitz
@ 2016-11-21 13:39         ` Peter Lieven
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Lieven @ 2016-11-21 13:39 UTC (permalink / raw)
  To: Max Reitz, Eric Blake, qemu-devel
  Cc: qemu-block, kwolf, pbonzini, qemu-stable, Stefan Hajnoczi, Fam Zheng

Am 19.11.2016 um 23:05 schrieb Max Reitz:
> On 18.11.2016 02:13, Eric Blake wrote:
>> On 11/17/2016 05:44 PM, Max Reitz wrote:
>>>> Since the SCSI specification says nothing about a minimum
>>>> discard granularity, and only documents the preferred
>>>> alignment, it is best if the block layer gives the driver
>>>> every bit of information about discard requests, rather than
>>>> rounding it to alignment boundaries early.
>>> Is this series supposed to address this issue? Because if so, I fail to
>>> see where it does. If the device advertises 15 MB as the discard
>>> granularity, then the iscsi driver will still drop all discard requests
>>> that are not aligned to 15 MB boundaries, no?
>>>
>> I don't have access to the device in question, so I'm hoping Peter
>> chimes in (oops, how'd I miss him on original CC?).  Here's all the more
>> he said on v1:
>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg02223.html
>>
>> My gut feel, however, is that the iscsi code is NOT rounding
> Why are you relying on your gut feel when you can simply look into the
> code in question? As a matter of fact, I know that you have looked at
> the very piece of code in question because you touch it in patch 4.
>
> Now that I looked at it again, though, I see my mistake, though: I
> assumed that is_byte_request_lun_aligned() would check whether the
> request is aligned to the advertised discard granularity. Of course, it
> does not, though, it only checks whether it's aligned to the device's
> block_size.
>
> So with the device in question, block_size is something like, say, 1 MB
> and the pdiscard_alignment is 15 MB. With your series, the block layer
> will first split off the head of an unaligned discard request (with the
> rest being aligned to the pdiscard_alignment) and then again the head
> off that head (with regards to the request_alignment).
>
> The iscsi driver will discard the head of the head (which isn't aligned
> to the request_alignment), but pass the part of the head that is aligned
> to request_alignment and of course pass everything that's aligned to
> pdiscard_alignment, too.
>
> (And the same then happens for the tail.)
>
> OK, now I see.
>
>>                                                               (the qcow2
>> code rounded, but that's different); the regression happened in 2.7
>> because the block layer also started rounding, and this patch gets the
>> block layer rounding out of the way. If nothing changed in the iscsi
>> code in the meantime, then the iscsi code should now (once again) be
>> discarding all sizes, regardless of the 15M advertisement.
> Well, all sizes that are aligned to the request_alignment.
>
>> Meanwhile, I _did_ test this patch with blkdebug (see 9/9), as well as
>> on a slightly modified NBD client that forced 15M alignments, and for
>> those cases, it definitely made the difference on whether all bytes were
>> passed (spread across several calls), vs. just the aligned bytes in the
>> middle of a request larger than 15M.
>>
>>> The only difference is that it's now the iscsi driver that drops the
>>> request instead of the generic block layer.
>> If the iscsi driver was ever dropping it in the first place.
> It wasn't dropping them, it was asserting that it was dropped; yes, my
> mistake for thinking that is_byte_request_lun_aligned() would check
> whether the request is aligned to pdiscard_alignment.

Sorry for not responding earlier. I was ooo some days.

The iSCSI driver indeed checked only for alignment to block_size and
was passing all other discard requests which the device was handling
or just dropping.

The version now seems correct.

Thanks,
Peter

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes
  2016-11-21 11:38           ` Kevin Wolf
@ 2016-11-21 16:16             ` Eric Blake
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-21 16:16 UTC (permalink / raw)
  To: Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block, pbonzini

[-- Attachment #1: Type: text/plain, Size: 1108 bytes --]

On 11/21/2016 05:38 AM, Kevin Wolf wrote:
>> And now with my last response to patch 5, I do know. So because the
>> "align" option isn't set, the request_alignment defaults to 512.
>> blkdebug will accept requests and pass them on even if they're not
>> aligned to pdiscard_alignment, so that's similar to how iscsi only drops
>> discard requests that are not aligned to request_alignment but passes
>> everything on regardless of whether it's aligned to pdiscard_alignment.
> 
> If the 512 byte alignment is important, please make it explicit. I have
> a patch to make blkdebug work with byte alignment and then the default
> will be 1.

In fact, I just rebased my patches on top of your blkdebug cleanup
(since it was visible on your remove-aio-em branch), even though it
hasn't been posted to the list yet, and doing that let me fix the
regression on iotest 98 that was pointed out by Max on 7/9.  So I'll
wait to post v3 until your byte-based blkdebug patch has hit the list.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 8/9] blkdebug: Add ability to override unmap geometries
  2016-11-17 23:02   ` Max Reitz
@ 2016-11-21 21:11     ` Eric Blake
  2016-11-21 21:29       ` Eric Blake
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Blake @ 2016-11-21 21:11 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: qemu-block, kwolf, pbonzini, Markus Armbruster

[-- Attachment #1: Type: text/plain, Size: 2013 bytes --]

On 11/17/2016 05:02 PM, Max Reitz wrote:
> On 17.11.2016 21:14, Eric Blake wrote:
>> Make it easier to simulate the Dell Equallogic iSCSI with its
> 
> Somehow I feel bad putting company and product names into commit messages...

Not the first time I've done it - see commit b8d0a980.

Keeping it in the commit message is one thing (so I probably won't
change this one), but having it in the iotest is another (so I probably
will rework the comments in 9/9 to avoid the specific mention).

> 
>> unusual 15M preferred and maximum unmap and write zero sizing,
>> or to simulate Linux loopback block devices enforcing a small
>> max_transfer of 64k, by allowing blkdebug to wrap any other
>> device with further restrictions on various alignments.
>>

>> +        {
>> +            .name = "opt-write-zero",
>> +            .type = QEMU_OPT_SIZE,
>> +            .help = "Optimum write zero size in bytes",
> 
> s/size/alignment/?

Yes, will fix.

>> @@ -398,6 +429,53 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
>>          ret = -EINVAL;
>>          goto fail_unref;
>>      }
>> +    max_transfer = qemu_opt_get_size(opts, "max-transfer", 0);
>> +    if (max_transfer < INT_MAX &&
>> +        QEMU_IS_ALIGNED(max_transfer, MAX(align, BDRV_SECTOR_SIZE))) {
>> +        s->max_transfer = max_transfer;
>> +    } else if (max_transfer) {
>> +        error_setg(errp, "Invalid argument");
> 
> Could you be more specific? Same in all of the error_setg() calls below.
> 
>> +        ret = -EINVAL;
>> +        goto fail_unref;
>> +    }
> 
> Also, the way this is formatted seems not intuitive to me. I know it's
> the same as it's been done for "align", but normally I'd use the following:
> 
> s->value = qemu_opt_get_size(...);
> if (s->value is set and invalid) {
>     /* error out */
> }

I'll see what I can do.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 8/9] blkdebug: Add ability to override unmap geometries
  2016-11-21 21:11     ` Eric Blake
@ 2016-11-21 21:29       ` Eric Blake
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-21 21:29 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: kwolf, pbonzini, Markus Armbruster, qemu-block

[-- Attachment #1: Type: text/plain, Size: 1470 bytes --]

On 11/21/2016 03:11 PM, Eric Blake wrote:

>>> @@ -398,6 +429,53 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
>>>          ret = -EINVAL;
>>>          goto fail_unref;
>>>      }
>>> +    max_transfer = qemu_opt_get_size(opts, "max-transfer", 0);
>>> +    if (max_transfer < INT_MAX &&
>>> +        QEMU_IS_ALIGNED(max_transfer, MAX(align, BDRV_SECTOR_SIZE))) {
>>> +        s->max_transfer = max_transfer;
>>> +    } else if (max_transfer) {
>>> +        error_setg(errp, "Invalid argument");
>>
>> Could you be more specific? Same in all of the error_setg() calls below.
>>
>>> +        ret = -EINVAL;
>>> +        goto fail_unref;
>>> +    }
>>
>> Also, the way this is formatted seems not intuitive to me. I know it's
>> the same as it's been done for "align", but normally I'd use the following:
>>
>> s->value = qemu_opt_get_size(...);
>> if (s->value is set and invalid) {
>>     /* error out */
>> }
> 
> I'll see what I can do.

Unfortunately, part of the problem is type casting.  qemu_opt_get_size()
returns a 64-bit number, but s->align is 'int'. You can't detect
wraparound unless you store into a temporary and check bounds prior to
assigning to the narrower type.  I guess I could always change the
struct to store 64-bit values that have been validated to fit within 32
bits.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/9] block: Let write zeroes fallback work even with small max_transfer
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v3 3/9] block: Let write zeroes fallback work even with small max_transfer Eric Blake
  2016-11-17 21:40   ` Max Reitz
@ 2016-11-22 13:16   ` Kevin Wolf
  2016-11-22 13:22     ` Eric Blake
  1 sibling, 1 reply; 47+ messages in thread
From: Kevin Wolf @ 2016-11-22 13:16 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-devel, qemu-block, pbonzini, qemu-stable, Denis V . Lunev,
	Stefan Hajnoczi, Fam Zheng, Max Reitz

Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> an unaligned request never crosses a cluster boundary.  But
> in the rewrite, the new code assumed that at most one iteration
> would be needed to get to an alignment boundary.
> 
> However, it is easy to trigger an assertion failure: the Linux
> kernel limits loopback devices to advertise a max_transfer of
> only 64k.  Any operation that requires falling back to writes
> rather than more efficient zeroing must obey max_transfer during
> that fallback, which means an unaligned head may require multiple
> iterations of the write fallbacks before reaching the aligned
> boundaries, when layering a format with clusters larger than 64k
> atop the protocol of file access to a loopback device.
> 
> Test case:
> 
> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
> $ losetup /dev/loop2 /path/to/file
> $ qemu-io -f qcow2 /dev/loop2
> qemu-io> w 7m 1k
> qemu-io> w -z 8003584 2093056
> 
> In fairness to Denis (as the original listed author of the culprit
> commit), the faulty logic for at most one iteration is probably all
> my fault in reworking his idea.  But the solution is to restore what
> was in place prior to that commit: when dealing with an unaligned
> head or tail, iterate as many times as necessary while fragmenting
> the operation at max_transfer boundaries.
> 
> Reported-by: Ed Swierk <eswierk@skyportsystems.com>
> CC: qemu-stable@nongnu.org
> CC: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/io.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index aa532a5..085ac34 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1214,6 +1214,8 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>      int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>      int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
>                          bs->bl.request_alignment);
> +    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> +                                    MAX_WRITE_ZEROES_BOUNCE_BUFFER);
> 
>      assert(alignment % bs->bl.request_alignment == 0);
>      head = offset % alignment;
> @@ -1229,9 +1231,12 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>           * boundaries.
>           */
>          if (head) {
> -            /* Make a small request up to the first aligned sector.  */
> -            num = MIN(count, alignment - head);
> -            head = 0;
> +            /* Make a small request up to the first aligned sector. For
> +             * convenience, limit this request to max_transfer even if
> +             * we don't need to fall back to writes.  */
> +            num = MIN(MIN(count, max_transfer), alignment - head);
> +            head = (head + num) % alignment;
> +            assert(num < max_write_zeroes);
>          } else if (tail && num > alignment) {
>              /* Shorten the request to the last aligned sector.  */
>              num -= tail;
>
> @@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
> 
>          if (ret == -ENOTSUP) {
>              /* Fall back to bounce buffer if write zeroes is unsupported */
> -            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> -                                            MAX_WRITE_ZEROES_BOUNCE_BUFFER);
>              BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;

Why do we even still bother with max_transfer in this function when we
could just call bdrv_aligned_pwritev() and use its fragmentation code?

Of course, when bdrv_co_do_pwrite_zeroes() was written, your
fragmentation code didn't exist yet, but today I think it would make
more sense to use a single centralised version of it instead of
reimplementing it here.

This doesn't make your fix less correct, but if we did things this way,
the fix wouldn't even be needed because a single iteration (in this
loop) would indeed always be enough.

Kevin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/9] block: Let write zeroes fallback work even with small max_transfer
  2016-11-22 13:16   ` Kevin Wolf
@ 2016-11-22 13:22     ` Eric Blake
  2016-11-22 13:30       ` Kevin Wolf
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Blake @ 2016-11-22 13:22 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-devel, qemu-block, pbonzini, qemu-stable, Denis V . Lunev,
	Stefan Hajnoczi, Fam Zheng, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 1865 bytes --]

On 11/22/2016 07:16 AM, Kevin Wolf wrote:
> Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
>> Commit 443668ca rewrote the write_zeroes logic to guarantee that
>> an unaligned request never crosses a cluster boundary.  But
>> in the rewrite, the new code assumed that at most one iteration
>> would be needed to get to an alignment boundary.
>>

>> @@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>
>>          if (ret == -ENOTSUP) {
>>              /* Fall back to bounce buffer if write zeroes is unsupported */
>> -            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
>> -                                            MAX_WRITE_ZEROES_BOUNCE_BUFFER);
>>              BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
> 
> Why do we even still bother with max_transfer in this function when we
> could just call bdrv_aligned_pwritev() and use its fragmentation code?

Hmm. bdrv_aligned_pwritev() asserts that its arguments are already
aligned, but for the head and tail, they might not be.  I agree that for
the bulk of the body, it may help, but it would take more thought on
refactoring if we want to have fragmentation at only one spot.

> 
> Of course, when bdrv_co_do_pwrite_zeroes() was written, your
> fragmentation code didn't exist yet, but today I think it would make
> more sense to use a single centralised version of it instead of
> reimplementing it here.
> 
> This doesn't make your fix less correct, but if we did things this way,
> the fix wouldn't even be needed because a single iteration (in this
> loop) would indeed always be enough.

Can I request to defer such refactoring to 2.9, while getting this patch
as-is into 2.8?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/9] block: Let write zeroes fallback work even with small max_transfer
  2016-11-22 13:22     ` Eric Blake
@ 2016-11-22 13:30       ` Kevin Wolf
  0 siblings, 0 replies; 47+ messages in thread
From: Kevin Wolf @ 2016-11-22 13:30 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-devel, qemu-block, pbonzini, qemu-stable, Denis V . Lunev,
	Stefan Hajnoczi, Fam Zheng, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 2204 bytes --]

Am 22.11.2016 um 14:22 hat Eric Blake geschrieben:
> On 11/22/2016 07:16 AM, Kevin Wolf wrote:
> > Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
> >> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> >> an unaligned request never crosses a cluster boundary.  But
> >> in the rewrite, the new code assumed that at most one iteration
> >> would be needed to get to an alignment boundary.
> >>
> 
> >> @@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
> >>
> >>          if (ret == -ENOTSUP) {
> >>              /* Fall back to bounce buffer if write zeroes is unsupported */
> >> -            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> >> -                                            MAX_WRITE_ZEROES_BOUNCE_BUFFER);
> >>              BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
> > 
> > Why do we even still bother with max_transfer in this function when we
> > could just call bdrv_aligned_pwritev() and use its fragmentation code?
> 
> Hmm. bdrv_aligned_pwritev() asserts that its arguments are already
> aligned, but for the head and tail, they might not be.  I agree that for
> the bulk of the body, it may help, but it would take more thought on
> refactoring if we want to have fragmentation at only one spot.

Right, it should be more like bdrv_co_pwritev() then, but something that
uses the logic in bdrv_aligned_pwritev().

Using bdrv_co_pwritev() would mean that it's tracked as another request,
but I don't think that's a problem. Otherwise we'd have to factor that
part out.

> > Of course, when bdrv_co_do_pwrite_zeroes() was written, your
> > fragmentation code didn't exist yet, but today I think it would make
> > more sense to use a single centralised version of it instead of
> > reimplementing it here.
> > 
> > This doesn't make your fix less correct, but if we did things this way,
> > the fix wouldn't even be needed because a single iteration (in this
> > loop) would indeed always be enough.
> 
> Can I request to defer such refactoring to 2.9, while getting this patch
> as-is into 2.8?

Yes, the refactoring is definitely for 2.9.

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
  2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers Eric Blake
  2016-11-17 22:26   ` Max Reitz
  2016-11-17 23:44   ` Max Reitz
@ 2016-11-22 14:03   ` Kevin Wolf
  2016-11-22 14:13     ` Eric Blake
  2 siblings, 1 reply; 47+ messages in thread
From: Kevin Wolf @ 2016-11-22 14:03 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-devel, qemu-block, pbonzini, qemu-stable, Stefan Hajnoczi,
	Fam Zheng, Max Reitz

Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
> Discard is advisory, so rounding the requests to alignment
> boundaries is never semantically wrong from the data that
> the guest sees.  But at least the Dell Equallogic iSCSI SANs
> has an interesting property that its advertised discard
> alignment is 15M, yet documents that discarding a sequence
> of 1M slices will eventually result in the 15M page being
> marked as discarded, and it is possible to observe which
> pages have been discarded.
> 
> Between commits 9f1963b and b8d0a980, we converted the block
> layer to a byte-based interface that ultimately ignores any
> unaligned head or tail based on the driver's advertised
> discard granularity, which means that qemu 2.7 refuses to
> pass any discard request smaller than 15M down to the Dell
> Equallogic hardware.  This is a slight regression in behavior
> compared to earlier qemu, where a guest executing discards
> in power-of-2 chunks used to be able to get every page
> discarded, but is now left with various pages still allocated
> because the guest requests did not align with the hardware's
> 15M pages.
> 
> Since the SCSI specification says nothing about a minimum
> discard granularity, and only documents the preferred
> alignment, it is best if the block layer gives the driver
> every bit of information about discard requests, rather than
> rounding it to alignment boundaries early.
> 
> Rework the block layer discard algorithm to mirror the write
> zero algorithm: always peel off any unaligned head or tail
> and manage that in isolation, then do the bulk of the request
> on an aligned boundary.  The fallback when the driver returns
> -ENOTSUP for an unaligned request is to silently ignore that
> portion of the discard request; but for devices that can pass
> the partial request all the way down to hardware, this can
> result in the hardware coalescing requests and discarding
> aligned pages after all.
> 
> Reported by: Peter Lieven <pl@kamp.de>
> CC: qemu-stable@nongnu.org
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> v2: Split at more boundaries.
> ---
>  block/io.c | 45 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 32 insertions(+), 13 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 085ac34..4f00562 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
>  {
>      BdrvTrackedRequest req;
>      int max_pdiscard, ret;
> -    int head, align;
> +    int head, tail, align;
> 
>      if (!bs->drv) {
>          return -ENOMEDIUM;
> @@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
>          return 0;
>      }
> 
> -    /* Discard is advisory, so ignore any unaligned head or tail */
> +    /* Discard is advisory, but some devices track and coalesce
> +     * unaligned requests, so we must pass everything down rather than
> +     * round here.  Still, most devices will just silently ignore
> +     * unaligned requests (by returning -ENOTSUP), so we must fragment
> +     * the request accordingly.  */
>      align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
>      assert(align % bs->bl.request_alignment == 0);
>      head = offset % align;
> -    if (head) {
> -        head = MIN(count, align - head);
> -        count -= head;
> -        offset += head;
> -    }
> -    count = QEMU_ALIGN_DOWN(count, align);
> -    if (!count) {
> -        return 0;
> -    }
> +    tail = (offset + count) % align;
> 
>      bdrv_inc_in_flight(bs);
>      tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD);
> @@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
> 
>      max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
>                                     align);
> -    assert(max_pdiscard);
> +    assert(max_pdiscard >= bs->bl.request_alignment);
> 
>      while (count > 0) {
>          int ret;
> -        int num = MIN(count, max_pdiscard);
> +        int num = count;
> +
> +        if (head) {
> +            /* Make small requests to get to alignment boundaries. */
> +            num = MIN(count, align - head);
> +            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
> +                num %= bs->bl.request_alignment;
> +            }
> +            head = (head + num) % align;
> +            assert(num < max_pdiscard);
> +        } else if (tail) {
> +            if (num > align) {
> +                /* Shorten the request to the last aligned cluster.  */
> +                num -= tail;
> +            } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) &&
> +                       tail > bs->bl.request_alignment) {
> +                tail %= bs->bl.request_alignment;
> +                num -= tail;
> +            }
> +        }

Hm, from the commit message I expected splitting requests in three
(head, bulk, tail), but actually we can end up splitting it in five?

Just to check whether I got this right, let me try an example: Let's
assume request alignment 512, pdiscard alignment 64k, and we get a
discard request with offset 510, length 130k. This algorithm makes the
following calls to the driver:

1. pdiscard offset=510, len=2               | new count = 130k - 2
2. pdiscard offset=512, len=(64k - 512)     | new count = 66k + 510
3. pdiscard offset=64k, len=64k             | new count = 2558
4. pdiscard offset=128k, len=2048           | new count = 510
5. pdiscard offset=130k, len=510            | new count = 0

Correct?

If so, is this really necessary or even helpful? I see that the iscsi
driver throws away requests 1 and 5 and needs the split because
otherwise it would disregard the areas covered by 2 and 4, too. But why
can't or shouldn't the iscsi driver do this rounding itself when it gets
combined 1+2 and 4+5 requests?

Kevin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
  2016-11-22 14:03   ` Kevin Wolf
@ 2016-11-22 14:13     ` Eric Blake
  2016-11-22 14:56       ` Eric Blake
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Blake @ 2016-11-22 14:13 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-devel, qemu-block, pbonzini, qemu-stable, Stefan Hajnoczi,
	Fam Zheng, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 2761 bytes --]

On 11/22/2016 08:03 AM, Kevin Wolf wrote:
> Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
>> Discard is advisory, so rounding the requests to alignment
>> boundaries is never semantically wrong from the data that
>> the guest sees.  But at least the Dell Equallogic iSCSI SANs
>> has an interesting property that its advertised discard
>> alignment is 15M, yet documents that discarding a sequence
>> of 1M slices will eventually result in the 15M page being
>> marked as discarded, and it is possible to observe which
>> pages have been discarded.
>>

>>
>> Rework the block layer discard algorithm to mirror the write
>> zero algorithm: always peel off any unaligned head or tail
>> and manage that in isolation, then do the bulk of the request
>> on an aligned boundary.  The fallback when the driver returns
>> -ENOTSUP for an unaligned request is to silently ignore that
>> portion of the discard request; but for devices that can pass
>> the partial request all the way down to hardware, this can
>> result in the hardware coalescing requests and discarding
>> aligned pages after all.
>>

> 
> Hm, from the commit message I expected splitting requests in three
> (head, bulk, tail), but actually we can end up splitting it in five?

Correct; so maybe I need to improve the commit message.  The goal in
multiple splits was to make it easier for drivers to not have to worry
about re-aligning things (a request is either sub-sector, sub-page but
sector-aligned, or page-aligned).

> 
> Just to check whether I got this right, let me try an example: Let's
> assume request alignment 512, pdiscard alignment 64k, and we get a
> discard request with offset 510, length 130k. This algorithm makes the
> following calls to the driver:
> 
> 1. pdiscard offset=510, len=2               | new count = 130k - 2
> 2. pdiscard offset=512, len=(64k - 512)     | new count = 66k + 510
> 3. pdiscard offset=64k, len=64k             | new count = 2558
> 4. pdiscard offset=128k, len=2048           | new count = 510
> 5. pdiscard offset=130k, len=510            | new count = 0
> 
> Correct?
> 

Yes.

> If so, is this really necessary or even helpful? I see that the iscsi
> driver throws away requests 1 and 5 and needs the split because
> otherwise it would disregard the areas covered by 2 and 4, too. But why
> can't or shouldn't the iscsi driver do this rounding itself when it gets
> combined 1+2 and 4+5 requests?

Because then every driver has to implement rounding; I thought that
having centralized rounding code was easier to maintain overall than
having every driver reimplement it.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
  2016-11-22 14:13     ` Eric Blake
@ 2016-11-22 14:56       ` Eric Blake
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Blake @ 2016-11-22 14:56 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, qemu-block, qemu-devel, qemu-stable, Stefan Hajnoczi,
	pbonzini, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 2134 bytes --]

On 11/22/2016 08:13 AM, Eric Blake wrote:

>> Hm, from the commit message I expected splitting requests in three
>> (head, bulk, tail), but actually we can end up splitting it in five?
> 
> Correct; so maybe I need to improve the commit message.  The goal in
> multiple splits was to make it easier for drivers to not have to worry
> about re-aligning things (a request is either sub-sector, sub-page but
> sector-aligned, or page-aligned).
> 
>>
>> Just to check whether I got this right, let me try an example: Let's
>> assume request alignment 512, pdiscard alignment 64k, and we get a
>> discard request with offset 510, length 130k. This algorithm makes the
>> following calls to the driver:
>>
>> 1. pdiscard offset=510, len=2               | new count = 130k - 2
>> 2. pdiscard offset=512, len=(64k - 512)     | new count = 66k + 510
>> 3. pdiscard offset=64k, len=64k             | new count = 2558
>> 4. pdiscard offset=128k, len=2048           | new count = 510
>> 5. pdiscard offset=130k, len=510            | new count = 0
>>
>> Correct?
>>
> 
> Yes.

To further clarify: for write zeroes, we have been doing 1+2 and 4+5
together, because if the driver returns -ENOTSUP, we then do a normal
write, and normal writes already take care of fragmenting things back
into alignment boundaries (1 and 5 are handled by read-modify-write, 2
and 4 are aligned to write request boundaries), so the writes happen no
matter what even if the write-zero at request boundaries might have been
more efficient.  With discard, there is no fallback to any other
operation, so we HAVE to present as many fragments as possible to the
driver, while still making it easy for the driver to distinguish between
aligned and unaligned requests and not having to do extra rounding.

But you made me realize that if I make write_zeroes do the same 5-piece
fragmentation, then it might be easier to use bdrv_aligned_pwritev
instead of manually reimplementing max_transfer logic in the write fallback.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests
  2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
                   ` (8 preceding siblings ...)
  2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes Eric Blake
@ 2016-11-22 16:05 ` Kevin Wolf
  9 siblings, 0 replies; 47+ messages in thread
From: Kevin Wolf @ 2016-11-22 16:05 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block, pbonzini

Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
> The first 5 patches are candidates for inclusion in 2.8, since
> they fix regressions and/or bugs exposed by the second half.
> The last 4 patches are a new feature in blkdebug, and as such,
> probably belong better in 2.9.  However, I wrote the patches by
> swapping the order (last four patches first, and proving that the
> new iotest fails, then the first five patches fix the failures),
> which should give us some better assurance that the first half is
> safe for inclusion during hard freeze.
> 
> Patches 3-5 have been on the list previously (hence calling this
> series v2), although as two separate threads; patch 5 has been
> improved since its last posting.  Patches 1-2 and 6-9 are new.

Thanks, applied patches 2-5 to the block branch.

Kevin

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2016-11-22 16:05 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 1/9] nbd: Allow unmap and fua during write zeroes Eric Blake
2016-11-17 21:10   ` Max Reitz
2016-11-17 21:14     ` Eric Blake
2016-11-18 14:12     ` Paolo Bonzini
2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 2/9] qcow2: Inform block layer about discard boundaries Eric Blake
2016-11-17 21:24   ` Max Reitz
2016-11-17 20:13 ` [Qemu-devel] [PATCH v3 3/9] block: Let write zeroes fallback work even with small max_transfer Eric Blake
2016-11-17 21:40   ` Max Reitz
2016-11-22 13:16   ` Kevin Wolf
2016-11-22 13:22     ` Eric Blake
2016-11-22 13:30       ` Kevin Wolf
2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 4/9] block: Return -ENOTSUP rather than assert on unaligned discards Eric Blake
2016-11-17 22:01   ` Max Reitz
2016-11-18 22:48     ` Eric Blake
2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers Eric Blake
2016-11-17 22:26   ` Max Reitz
2016-11-17 23:01     ` Eric Blake
2016-11-17 23:03       ` Max Reitz
2016-11-17 23:44   ` Max Reitz
2016-11-18  1:13     ` Eric Blake
2016-11-19 22:05       ` Max Reitz
2016-11-21 13:39         ` Peter Lieven
2016-11-22 14:03   ` Kevin Wolf
2016-11-22 14:13     ` Eric Blake
2016-11-22 14:56       ` Eric Blake
2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 6/9] blkdebug: Sanity check block layer guarantees Eric Blake
2016-11-17 22:36   ` Max Reitz
2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 7/9] blkdebug: Add pass-through write_zero and discard support Eric Blake
2016-11-17 22:47   ` Max Reitz
2016-11-18 23:08     ` Eric Blake
2016-11-17 23:27   ` Max Reitz
2016-11-18  1:17     ` Eric Blake
2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 8/9] blkdebug: Add ability to override unmap geometries Eric Blake
2016-11-17 23:02   ` Max Reitz
2016-11-21 21:11     ` Eric Blake
2016-11-21 21:29       ` Eric Blake
2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes Eric Blake
2016-11-17 23:19   ` Max Reitz
2016-11-18  1:19     ` Eric Blake
2016-11-17 23:42   ` Max Reitz
2016-11-18  1:28     ` Eric Blake
2016-11-19 21:45       ` Max Reitz
2016-11-19 22:17         ` Max Reitz
2016-11-21 11:38           ` Kevin Wolf
2016-11-21 16:16             ` Eric Blake
2016-11-22 16:05 ` [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Kevin Wolf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.