qemu-devel.nongnu.org archive mirror
* [PATCH 0/5] Obey NBD spec regarding block size bounds on reply
@ 2021-02-18 20:15 Eric Blake
  2021-02-18 20:15 ` [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation Eric Blake
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Eric Blake @ 2021-02-18 20:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: vsementsov, qemu-block

We have some known long-standing compliance bugs where our NBD server
does not always honor its advertised minimum block size in replies to
client requests, when an image with a large block size is backed by
another image with a smaller block size (for example, an encrypted
qcow2 image has a minimum block size of 512, but can be backed by a
file whose size is not a multiple of 512).  Fragmenting our reply to
NBD_CMD_READ
or NBD_CMD_BLOCK_STATUS to something smaller than our advertised
minimum block size can confuse a client (in fact, qemu 3.1 would
abort() on such messages, although we patched the client to be
tolerant of non-compliant servers for qemu 4.0).  Thankfully, most
day-to-day uses of NBD don't run into these cases.

Back in 2019, I did propose a fix for the server[1], but it was
incomplete at the time because I couldn't write a reliable iotest
(using blkdebug failed, because as a filter it blocked access to the
dirty bitmap), and the patches were too close to the 4.0
release for a corner case that did not occur frequently in practice,
so it moved to my back burner.  But now that we have fixed the ability
to see through the blkdebug filter, and have in the meantime added the
qemu:allocation-depth context that also suffers from the same problem,
I have finally updated this series to a state that I'm happy with.

[1] https://lists.gnu.org/archive/html/qemu-devel/2019-04/msg00589.html

Patch 5 is merely to aid in testing, by disabling the client
workaround that has been present since qemu 4.0 (since the revert is
no longer trivial).

Eric Blake (5):
  iotests: Update 241 to expose backing layer fragmentation
  block: Fix BDRV_BLOCK_RAW status to honor alignment
  nbd/server: Avoid unaligned read/block_status from backing
  nbd/server: Avoid unaligned dirty-bitmap status
  do not apply: Revert "nbd-client: Work around server BLOCK_STATUS
    misalignment at EOF"

 block/coroutines.h         |   2 +
 include/block/block.h      |   2 +
 block/io.c                 | 210 ++++++++++++++++++++++++++++++++++---
 block/nbd.c                |  30 +-----
 block/quorum.c             |   7 +-
 nbd/server.c               |  42 ++++++--
 tests/qemu-iotests/221     |  13 +++
 tests/qemu-iotests/221.out |   9 ++
 tests/qemu-iotests/241     |  58 +++++++++-
 tests/qemu-iotests/241.out |  24 ++++-
 10 files changed, 337 insertions(+), 60 deletions(-)

-- 
2.30.1




* [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation
  2021-02-18 20:15 [PATCH 0/5] Obey NBD spec regarding block size bounds on reply Eric Blake
@ 2021-02-18 20:15 ` Eric Blake
  2021-02-25 13:50   ` Vladimir Sementsov-Ogievskiy
  2021-02-18 20:15 ` [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment Eric Blake
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Eric Blake @ 2021-02-18 20:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, vsementsov, qemu-block, Max Reitz

Previous commits (such as 6e280648, 75d34eb9) have mentioned that our
NBD server still sends unaligned fragments when an active layer with
large advertised minimum block size is backed by another layer with a
smaller block size. Expand the test to actually cover this scenario,
by using two different approaches: qcow2 encryption (which forces
512-byte alignment) with an unaligned raw backing file, and blkdebug
with a 4k alignment.

The encryption test passes with the desired results, but only because
the client side works around the server's non-compliance; if you
repeat the test manually with tracing turned on, you will see the
server sending a status for 1000 bytes of data then 1048 bytes of
hole, which is not aligned. But reverting commit 737d3f5244 shows that
it is indeed the client working around the bug in the server.

Meanwhile, the blkdebug test gives incorrect results: remember, when
using x-dirty-bitmap with qemu-img map as a way to sniff alternative
metadata contexts, the meanings of "data" and "zero" are determined by
that context.  Our client workaround is assuming that the fragmented
replies can be merged according to base:allocation rules, but those
rules do not work for other contexts (merging dirty and clean bitmap
should produce dirty; merging allocated and unallocated should produce
allocated; see the FIXME for more about the decoded values we expect).

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/241     | 60 ++++++++++++++++++++++++++++++++++----
 tests/qemu-iotests/241.out | 21 +++++++++++++
 2 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/241 b/tests/qemu-iotests/241
index c962c8b6075d..5217af82dc65 100755
--- a/tests/qemu-iotests/241
+++ b/tests/qemu-iotests/241
@@ -3,7 +3,7 @@
 #
 # Test qemu-nbd vs. unaligned images
 #
-# Copyright (C) 2018-2019 Red Hat, Inc.
+# Copyright (C) 2018-2021 Red Hat, Inc.
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -27,7 +27,7 @@ status=1 # failure is the default!
 _cleanup()
 {
     _cleanup_test_img
-    rm -f "$TEST_DIR/server.log"
+    rm -f "$TEST_DIR/server.log" "$TEST_IMG_FILE.mid" "$TEST_IMG_FILE.qcow2"
     nbd_server_stop
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
@@ -37,7 +37,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.filter
 . ./common.nbd

-_supported_fmt raw
+_supported_fmt raw # although the test also requires use of qcow2
 _supported_proto nbd
 _supported_os Linux
 _require_command QEMU_NBD
@@ -89,11 +89,61 @@ $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
 $QEMU_IO -c map "$TEST_IMG"
 nbd_server_stop

-# Not tested yet: we also want to ensure that qemu as NBD client does
+echo
+echo "=== Encrypted qcow2 file backed by unaligned raw image ==="
+echo
+
+# Enabling encryption in qcow2 forces 512-alignment
+SECRET=secret,id=sec0,data=12345
+$QEMU_IMG create -f qcow2 -b "$TEST_IMG_FILE" -F raw --object "$SECRET" \
+  -o encrypt.format=luks,encrypt.key-secret=sec0,encrypt.iter-time=10 \
+  "$TEST_IMG_FILE.qcow2" 2k | _filter_img_create
+nbd_server_start_unix_socket --object "$SECRET" --image-opts \
+  driver=qcow2,file.filename="$TEST_IMG_FILE.qcow2",encrypt.key-secret=sec0
+
+$QEMU_NBD_PROG --list -k $nbd_unix_socket | grep '\(size\|min\)'
+$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IO -c map "$TEST_IMG"
+nbd_server_stop
+
+echo
+echo "=== Use blkdebug for larger alignment than backing layer ==="
+echo
+
+$QEMU_IMG create -f qcow2 -o cluster_size=1024 \
+  "$TEST_IMG_FILE" 2k | _filter_img_create
+$QEMU_IMG create -f qcow2 -b "$TEST_IMG_FILE" -F qcow2 -o cluster_size=512 \
+  "$TEST_IMG_FILE.mid" | _filter_img_create
+$QEMU_IMG bitmap --add -g 512 --enable "$TEST_IMG_FILE.mid" b0
+$QEMU_IO -f qcow2 -c 'w 512 512' "$TEST_IMG_FILE.mid" | _filter_qemu_io
+$QEMU_IMG create -f qcow2 -b "$TEST_IMG_FILE.mid" -F qcow2 \
+  "$TEST_IMG_FILE.qcow2" 4k | _filter_img_create
+FILE=image.file.filename="$TEST_IMG_FILE.qcow2"
+nbd_server_start_unix_socket -B b0 -A --image-opts \
+  driver=blkdebug,align=4096,image.driver=qcow2,image.file.driver=file,"$FILE"
+
+TEST_IMG="driver=nbd,server.type=unix,server.path=$nbd_unix_socket"
+$QEMU_NBD_PROG --list -k $nbd_unix_socket | grep '\(size\|min\)'
+# FIXME: this should report a single 4k block of "data":false which translates
+# to the dirty bitmap being set for at least part of the region; "data":true
+# is wrong unless the entire 4k is clean.
+$QEMU_IMG map --output=json --image-opts \
+	  "$TEST_IMG",x-dirty-bitmap=qemu:dirty-bitmap:b0 | _filter_qemu_img_map
+
+# FIXME: this should report a single 4k block of "zero":true,"data":true,
+# meaning allocated from the backing chain.  Using "zero":false,"data":false
+# (allocated in active layer) or "zero":false,"data":true (entire region
+# unallocated) is wrong.
+$QEMU_IMG map --output=json --image-opts \
+  "$TEST_IMG",x-dirty-bitmap=qemu:allocation-depth | _filter_qemu_img_map
+nbd_server_stop
+
+# Not tested here: we also want to ensure that qemu as NBD client does
 # not access beyond the end of a server's advertised unaligned size:
 #  nbdkit -U - memory size=513 --run 'qemu-io -f raw -c "r 512 512" $nbd'
 # However, since qemu as server always rounds up to a sector alignment,
-# we would have to use nbdkit to provoke the current client failures.
+# we would have to use nbdkit to demonstrate this scenario (see
+# commit 9cf638508c for more information).

 # success, all done
 echo '*** done'
diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
index 75f9f465e522..67aaeed34f50 100644
--- a/tests/qemu-iotests/241.out
+++ b/tests/qemu-iotests/241.out
@@ -25,4 +25,25 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' and probing guessed
 [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
 { "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
 1 KiB (0x400) bytes     allocated at offset 0 bytes (0x0)
+
+=== Encrypted qcow2 file backed by unaligned raw image ===
+
+Formatting 'TEST_DIR/t.IMGFMT.qcow2', fmt=qcow2 size=2048 backing_file=TEST_DIR/t.IMGFMT backing_fmt=IMGFMT
+  size:  2048
+  min block: 512
+[{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 1024, "length": 1024, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
+2 KiB (0x800) bytes     allocated at offset 0 bytes (0x0)
+
+=== Use blkdebug for larger alignment than backing layer ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=qcow2 size=2048
+Formatting 'TEST_DIR/t.IMGFMT.mid', fmt=qcow2 size=2048 backing_file=TEST_DIR/t.IMGFMT backing_fmt=qcow2
+wrote 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Formatting 'TEST_DIR/t.IMGFMT.qcow2', fmt=qcow2 size=4096 backing_file=TEST_DIR/t.IMGFMT.mid backing_fmt=qcow2
+  size:  4096
+  min block: 4096
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
 *** done
-- 
2.30.1




* [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment
  2021-02-18 20:15 [PATCH 0/5] Obey NBD spec regarding block size bounds on reply Eric Blake
  2021-02-18 20:15 ` [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation Eric Blake
@ 2021-02-18 20:15 ` Eric Blake
  2021-02-25 14:55   ` Vladimir Sementsov-Ogievskiy
  2021-02-18 20:15 ` [PATCH 3/5] nbd/server: Avoid unaligned read/block_status from backing Eric Blake
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Eric Blake @ 2021-02-18 20:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, vsementsov, qemu-block, Max Reitz,
	Stefan Hajnoczi

Previous patches mentioned how the blkdebug filter driver demonstrates
a bug present in our NBD server (for example, commit b0245d64); the
bug is also present with the raw format driver when probing
occurs. Basically, if we specify a particular alignment > 1, but defer
the actual block status to the underlying file, and the underlying
file has a smaller alignment, then the use of BDRV_BLOCK_RAW to defer
to the underlying file can end up with status split at an alignment
unacceptable to our layer.  Many callers don't care, but NBD does - it
is a violation of the NBD protocol to send status or read results
split on an unaligned boundary (in 737d3f5244, we taught our 4.0
client to be tolerant of such violations because the problem was even
more pronounced with qemu 3.1 as server; but we still need to fix our
server for the sake of other stricter clients).

This patch lays the groundwork - it adjusts bdrv_block_status to
ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
the deferral is either truncated down to an aligned boundary, or
multiple sub-aligned requests are coalesced into a single
representative answer (using an implicit hole beyond EOF as
needed). Iotest 241 exposes the effect (when format probing occurs,
we don't want block status to subdivide the initial sector, nor, by
extension, any other sector). Similarly, iotest 221 is a good
candidate to amend to specifically track the effects; a change to a
hole at EOF is not visible if an upper layer does not want alignment
smaller than 512. However, this patch alone is not a complete fix - it
is still possible to have an active layer with large alignment
constraints backed by another layer with smaller constraints; so the
next patch will complete the task.

In particular, the next patch will introduce some mutual recursion
(bdrv_co_common_block_status_above will call this new function rather
than directly into bdrv_co_block_status), so some conditions added
here (such as a NULL pointer for map or handling a length-0 request)
are not reachable until that point.

There is one interesting corner case: prior to this patch, ALLOCATED
was never returned without either DATA or ZERO also set. But now, if
we have two subregions where the first reports status 0 (not
allocated), and the second reports ZERO|ALLOCATED but not DATA
(preallocated, read sees zero but underlying file has indeterminate
contents), then we can end up with a result of just
ALLOCATED. However, looking at callers of bdrv_is_allocated does not
find any problem with this new return value. What's more, this
situation doesn't really happen until the next patch adds support for
getting aligned status from backing files (as the use of aligned
status in this patch tends to be limited to just the protocol child of
a format driver, yet protocol drivers tend to report fully allocated,
and only format drivers have unallocated clusters that defer to a
backing file in the first place).

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/io.c                 | 142 +++++++++++++++++++++++++++++++++++--
 tests/qemu-iotests/221     |  13 ++++
 tests/qemu-iotests/221.out |   9 +++
 tests/qemu-iotests/241.out |   3 +-
 4 files changed, 161 insertions(+), 6 deletions(-)

diff --git a/block/io.c b/block/io.c
index ca2dca30070e..4bca775c96b4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2325,6 +2325,132 @@ int bdrv_flush_all(void)
     return result;
 }

+static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
+                                             bool want_zero,
+                                             int64_t offset, int64_t bytes,
+                                             int64_t *pnum, int64_t *map,
+                                             BlockDriverState **file);
+
+/*
+ * Returns an aligned allocation status of the specified disk region.
+ *
+ * Wrapper around bdrv_co_block_status() which requires the initial
+ * @offset and @count to be aligned to @align (must be power of 2),
+ * and guarantees the resulting @pnum will also be aligned; this may
+ * require piecing together multiple sub-aligned queries into an
+ * appropriate coalesced answer, as follows:
+ *
+ * - BDRV_BLOCK_DATA is set if the flag is set for at least one subregion
+ * - BDRV_BLOCK_ZERO is set only if the flag is set for all subregions
+ * - BDRV_BLOCK_OFFSET_VALID is set only if all subregions are contiguous
+ *   from the same file (@map and @file are then from the first subregion)
+ * - BDRV_BLOCK_ALLOCATED is set if the flag is set for at least one subregion
+ * - BDRV_BLOCK_EOF is set if the last subregion queried set it (any
+ *   remaining bytes to reach alignment are treated as an implicit hole)
+ * - BDRV_BLOCK_RAW is never set
+ */
+static int coroutine_fn bdrv_co_block_status_aligned(BlockDriverState *bs,
+                                                     uint32_t align,
+                                                     bool want_zero,
+                                                     int64_t offset,
+                                                     int64_t bytes,
+                                                     int64_t *pnum,
+                                                     int64_t *map,
+                                                     BlockDriverState **file)
+{
+    int ret;
+
+    assert(is_power_of_2(align) && QEMU_IS_ALIGNED(offset | bytes, align));
+    ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
+    if (ret < 0) {
+        return ret;
+    }
+    /* 0-length return only possible for 0-length query or beyond EOF */
+    if (!*pnum) {
+        assert(!bytes || ret & BDRV_BLOCK_EOF);
+        return ret;
+    }
+    assert(!(ret & BDRV_BLOCK_RAW));
+
+    /*
+     * If initial query ended at EOF, round up to align: the post-EOF
+     * tail is an implicit hole, but our rules say we can treat that
+     * like the initial subregion.
+     */
+    if (ret & BDRV_BLOCK_EOF) {
+        *pnum = QEMU_ALIGN_UP(*pnum, align);
+        assert(*pnum <= bytes);
+        return ret;
+    }
+
+    /*
+     * If result is unaligned but not at EOF, it's easier to return
+     * the aligned subset and then compute the coalesced version over
+     * just align bytes.
+     */
+    if (*pnum >= align) {
+        *pnum = QEMU_ALIGN_DOWN(*pnum, align);
+        return ret;
+    }
+
+    /*
+     * If we got here, we have to merge status for multiple
+     * subregions. We can't detect if offsets are contiguous unless
+     * map and file are non-NULL.
+     */
+    if (!map || !file) {
+        ret &= ~BDRV_BLOCK_OFFSET_VALID;
+    }
+    while (*pnum < align) {
+        int ret2;
+        int64_t pnum2;
+        int64_t map2;
+        BlockDriverState *file2;
+
+        ret2 = bdrv_co_block_status(bs, want_zero, offset + *pnum,
+                                    align - *pnum, &pnum2, &map2, &file2);
+        if (ret2 < 0) {
+            return ret2;
+        }
+        assert(!(ret2 & BDRV_BLOCK_RAW));
+        /*
+         * A 0-length answer here is a bug - we should not be querying
+         * beyond EOF. Our rules allow any further bytes in implicit
+         * hole past EOF to have same treatment as the subregion just
+         * before EOF.
+         */
+        assert(pnum2 && pnum2 <= align - *pnum);
+        if (ret2 & BDRV_BLOCK_EOF) {
+            ret |= BDRV_BLOCK_EOF;
+            pnum2 = align - *pnum;
+        }
+
+        /* Now merge the status */
+        if (ret2 & BDRV_BLOCK_DATA) {
+            ret |= BDRV_BLOCK_DATA;
+        }
+        if (!(ret2 & BDRV_BLOCK_ZERO)) {
+            ret &= ~BDRV_BLOCK_ZERO;
+        }
+        if ((ret & BDRV_BLOCK_OFFSET_VALID) &&
+            (!(ret2 & BDRV_BLOCK_OFFSET_VALID) ||
+             *map + *pnum != map2 || *file != file2)) {
+            ret &= ~BDRV_BLOCK_OFFSET_VALID;
+            if (map) {
+                *map = 0;
+            }
+            if (file) {
+                *file = NULL;
+            }
+        }
+        if (ret2 & BDRV_BLOCK_ALLOCATED) {
+            ret |= BDRV_BLOCK_ALLOCATED;
+        }
+        *pnum += pnum2;
+    }
+    return ret;
+}
+
 /*
  * Returns the allocation status of the specified sectors.
  * Drivers not implementing the functionality are assumed to not support
@@ -2438,7 +2564,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
      */
     assert(*pnum && QEMU_IS_ALIGNED(*pnum, align) &&
            align > offset - aligned_offset);
-    if (ret & BDRV_BLOCK_RECURSE) {
+    if (ret & BDRV_BLOCK_RAW) {
+        assert(local_file);
+        ret = bdrv_co_block_status_aligned(local_file, align, want_zero,
+                                           local_map, *pnum, pnum, &local_map,
+                                           &local_file);
+        if (ret < 0) {
+            goto out;
+        }
+        assert(!(ret & BDRV_BLOCK_RAW));
+        ret |= BDRV_BLOCK_RAW;
+    } else if (ret & BDRV_BLOCK_RECURSE) {
         assert(ret & BDRV_BLOCK_DATA);
         assert(ret & BDRV_BLOCK_OFFSET_VALID);
         assert(!(ret & BDRV_BLOCK_ZERO));
@@ -2453,9 +2589,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     }

     if (ret & BDRV_BLOCK_RAW) {
-        assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
-        ret = bdrv_co_block_status(local_file, want_zero, local_map,
-                                   *pnum, pnum, &local_map, &local_file);
+        ret &= ~BDRV_BLOCK_RAW;
         goto out;
     }

diff --git a/tests/qemu-iotests/221 b/tests/qemu-iotests/221
index c463fd4b113e..6a15e0160b24 100755
--- a/tests/qemu-iotests/221
+++ b/tests/qemu-iotests/221
@@ -46,6 +46,12 @@ echo
 echo "=== Check mapping of unaligned raw image ==="
 echo

+# Note that when we enable format probing by omitting -f, the raw
+# layer forces 512-byte alignment and the bytes past EOF take on the
+# same status as the rest of the sector; otherwise, we can see the
+# implicit hole visible past EOF thanks to the block layer rounding
+# sizes up.
+
 _make_test_img 65537 # qemu-img create rounds size up

 # file-posix allocates the first block of any images when it is created;
@@ -55,15 +61,22 @@ _make_test_img 65537 # qemu-img create rounds size up
 $QEMU_IO -c 'discard 0 65537' "$TEST_IMG" | _filter_qemu_io

 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map
+echo

 truncate --size=65537 "$TEST_IMG" # so we resize it and check again
 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map
+echo

 $QEMU_IO -c 'w 65536 1' "$TEST_IMG" | _filter_qemu_io # writing also rounds up
 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map
+echo

 truncate --size=65537 "$TEST_IMG" # so we resize it and check again
 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map

 # success, all done
 echo '*** done'
diff --git a/tests/qemu-iotests/221.out b/tests/qemu-iotests/221.out
index 93846c7dabb6..d22b5e00d4f8 100644
--- a/tests/qemu-iotests/221.out
+++ b/tests/qemu-iotests/221.out
@@ -7,11 +7,20 @@ discard 65537/65537 bytes at offset 0
 64.001 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 [{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 [{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+
+[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+
 wrote 1/1 bytes at offset 65536
 1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+{ "start": 65536, "length": 512, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
+[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
 { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+
+[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+{ "start": 65536, "length": 512, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
 [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
 { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
index 67aaeed34f50..56d3796cf3ac 100644
--- a/tests/qemu-iotests/241.out
+++ b/tests/qemu-iotests/241.out
@@ -22,8 +22,7 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' and probing guessed

   size:  1024
   min block: 1
-[{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
-{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
+[{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
 1 KiB (0x400) bytes     allocated at offset 0 bytes (0x0)

 === Encrypted qcow2 file backed by unaligned raw image ===
-- 
2.30.1




* [PATCH 3/5] nbd/server: Avoid unaligned read/block_status from backing
  2021-02-18 20:15 [PATCH 0/5] Obey NBD spec regarding block size bounds on reply Eric Blake
  2021-02-18 20:15 ` [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation Eric Blake
  2021-02-18 20:15 ` [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment Eric Blake
@ 2021-02-18 20:15 ` Eric Blake
  2021-02-18 20:15 ` [PATCH 4/5] nbd/server: Avoid unaligned dirty-bitmap status Eric Blake
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Eric Blake @ 2021-02-18 20:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Fam Zheng, vsementsov, Alberto Garcia, qemu-block,
	Max Reitz, Stefan Hajnoczi

The NBD server code used bdrv_block_status_above() to determine where
to fragment structured read and block status replies, and similarly
used bdrv_is_allocated_above() for the qemu:allocation-depth context.
However, the protocol can only advertise the active layer's minimum
block size; if the active layer is backed by another file with smaller
alignment, then we can end up leaking unaligned results back through
to the client, in violation of the spec.

Fix this by exposing a new bdrv_block_status_aligned() function around
the recently added internal bdrv_co_block_status_aligned(), to
guarantee
that all block status answers from backing layers are rounded up to
the alignment of the current layer.  Note that the underlying function
still requires aligned boundaries, but the public function copes with
unaligned inputs.

The portion of iotest 241 using an encrypted qcow2 file does not
change in output, but running it manually with traces shows the
improved behavior; furthermore, reverting 737d3f5244 but leaving this
patch in place lets the test pass (whereas before the test would fail
because the client had to work around the server's non-compliance).
Meanwhile, the portion running with blkdebug shows that
qemu:allocation-depth now shows the desired output.

Note that while this fixes NBD_CMD_READ and NBD_CMD_BLOCK_STATUS for
base:allocation and qemu:allocation-depth (because those rely on
bdrv_block_status*), we still have a compliance problem with
NBD_CMD_BLOCK_STATUS for qemu:dirty-bitmap:NN when visiting a bitmap
created at a smaller granularity than what we advertised. That will be
addressed in the next patch.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/coroutines.h         |  2 ++
 include/block/block.h      |  2 ++
 block/io.c                 | 68 +++++++++++++++++++++++++++++++++-----
 block/quorum.c             |  7 ++--
 nbd/server.c               | 12 +++----
 tests/qemu-iotests/241     |  9 +++--
 tests/qemu-iotests/241.out |  2 +-
 7 files changed, 77 insertions(+), 25 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 4cfb4946e65e..1c0d761c669e 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -41,6 +41,7 @@ bdrv_pwritev(BdrvChild *child, int64_t offset, unsigned int bytes,
 int coroutine_fn
 bdrv_co_common_block_status_above(BlockDriverState *bs,
                                   BlockDriverState *base,
+                                  uint32_t align,
                                   bool include_base,
                                   bool want_zero,
                                   int64_t offset,
@@ -52,6 +53,7 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
 int generated_co_wrapper
 bdrv_common_block_status_above(BlockDriverState *bs,
                                BlockDriverState *base,
+                               uint32_t align,
                                bool include_base,
                                bool want_zero,
                                int64_t offset,
diff --git a/include/block/block.h b/include/block/block.h
index b3f6e509d49d..fcfd3514701e 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -517,6 +517,8 @@ int bdrv_block_status(BlockDriverState *bs, int64_t offset,
 int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
                             int64_t offset, int64_t bytes, int64_t *pnum,
                             int64_t *map, BlockDriverState **file);
+int bdrv_block_status_aligned(BlockDriverState *bs, int64_t offset,
+                              int64_t bytes, int64_t *pnum);
 int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
                       int64_t *pnum);
 int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
diff --git a/block/io.c b/block/io.c
index 4bca775c96b4..d239282b4763 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2656,6 +2656,7 @@ early_out:
 int coroutine_fn
 bdrv_co_common_block_status_above(BlockDriverState *bs,
                                   BlockDriverState *base,
+                                  uint32_t align,
                                   bool include_base,
                                   bool want_zero,
                                   int64_t offset,
@@ -2698,8 +2699,8 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
     for (p = bdrv_filter_or_cow_bs(bs); include_base || p != base;
          p = bdrv_filter_or_cow_bs(p))
     {
-        ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
-                                   file);
+        ret = bdrv_co_block_status_aligned(p, align, want_zero, offset, bytes,
+                                           pnum, map, file);
         ++*depth;
         if (ret < 0) {
             return ret;
@@ -2758,8 +2759,8 @@ int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
                             int64_t offset, int64_t bytes, int64_t *pnum,
                             int64_t *map, BlockDriverState **file)
 {
-    return bdrv_common_block_status_above(bs, base, false, true, offset, bytes,
-                                          pnum, map, file, NULL);
+    return bdrv_common_block_status_above(bs, base, 1, false, true, offset,
+                                          bytes, pnum, map, file, NULL);
 }

 int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
@@ -2786,7 +2787,7 @@ int coroutine_fn bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset,
         return 1;
     }

-    ret = bdrv_common_block_status_above(bs, NULL, false, false, offset,
+    ret = bdrv_common_block_status_above(bs, NULL, 1, false, false, offset,
                                          bytes, &pnum, NULL, NULL, NULL);

     if (ret < 0) {
@@ -2796,13 +2797,47 @@ int coroutine_fn bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset,
     return (pnum == bytes) && (ret & BDRV_BLOCK_ZERO);
 }

+/*
+ * Similar to bdrv_block_status_above(bs, NULL, ...), but ensures that
+ * the answer matches the minimum alignment of bs (smaller alignments
+ * in layers above will not leak through to the active layer). It is
+ * assumed that callers do not care about the resulting mapping of
+ * offsets to an underlying BDS.
+ */
+int bdrv_block_status_aligned(BlockDriverState *bs, int64_t offset,
+                              int64_t bytes, int64_t *pnum)
+{
+    /* Widen the request to aligned boundaries */
+    int64_t aligned_offset, aligned_bytes;
+    uint32_t align = bs->bl.request_alignment;
+    int ret;
+
+    assert(pnum);
+    aligned_offset = QEMU_ALIGN_DOWN(offset, align);
+    aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
+    ret = bdrv_common_block_status_above(bs, NULL, align, false, true,
+                                         aligned_offset, aligned_bytes,
+                                         pnum, NULL, NULL, NULL);
+    if (ret < 0) {
+        *pnum = 0;
+        return ret;
+    }
+    assert(*pnum && QEMU_IS_ALIGNED(*pnum, align) &&
+           align > offset - aligned_offset);
+    *pnum -= offset - aligned_offset;
+    if (*pnum > bytes) {
+        *pnum = bytes;
+    }
+    return ret;
+}
+
 int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
                                    int64_t bytes, int64_t *pnum)
 {
     int ret;
     int64_t dummy;

-    ret = bdrv_common_block_status_above(bs, bs, true, false, offset,
+    ret = bdrv_common_block_status_above(bs, bs, 1, true, false, offset,
                                          bytes, pnum ? pnum : &dummy, NULL,
                                          NULL, NULL);
     if (ret < 0) {
@@ -2833,13 +2868,28 @@ int bdrv_is_allocated_above(BlockDriverState *top,
                             bool include_base, int64_t offset,
                             int64_t bytes, int64_t *pnum)
 {
+    /* Widen the request to aligned boundaries */
+    int64_t aligned_offset, aligned_bytes;
+    uint32_t align = top->bl.request_alignment;
     int depth;
-    int ret = bdrv_common_block_status_above(top, base, include_base, false,
-                                             offset, bytes, pnum, NULL, NULL,
-                                             &depth);
+    int ret;
+
+    assert(pnum);
+    aligned_offset = QEMU_ALIGN_DOWN(offset, align);
+    aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
+    ret = bdrv_common_block_status_above(top, base, align, include_base, false,
+                                         aligned_offset, aligned_bytes, pnum,
+                                         NULL, NULL, &depth);
     if (ret < 0) {
+        *pnum = 0;
         return ret;
     }
+    assert(*pnum && QEMU_IS_ALIGNED(*pnum, align) &&
+           align > offset - aligned_offset);
+    *pnum -= offset - aligned_offset;
+    if (*pnum > bytes) {
+        *pnum = bytes;
+    }

     if (ret & BDRV_BLOCK_ALLOCATED) {
         return depth;
diff --git a/block/quorum.c b/block/quorum.c
index 0bd75450de97..feea9ad8fa87 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1230,9 +1230,10 @@ static int coroutine_fn quorum_co_block_status(BlockDriverState *bs,

     for (i = 0; i < s->num_children; i++) {
         int64_t bytes;
-        ret = bdrv_co_common_block_status_above(s->children[i]->bs, NULL, false,
-                                                want_zero, offset, count,
-                                                &bytes, NULL, NULL, NULL);
+        ret = bdrv_co_common_block_status_above(s->children[i]->bs, NULL, 1,
+                                                false, want_zero, offset,
+                                                count, &bytes, NULL, NULL,
+                                                NULL);
         if (ret < 0) {
             quorum_report_bad(QUORUM_OP_TYPE_READ, offset, count,
                               s->children[i]->bs->node_name, ret);
diff --git a/nbd/server.c b/nbd/server.c
index 7229f487d296..40847276ca64 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1923,7 +1923,7 @@ static int coroutine_fn nbd_co_send_structured_error(NBDClient *client,
 }

 /* Do a sparse read and send the structured reply to the client.
- * Returns -errno if sending fails. bdrv_block_status_above() failure is
+ * Returns -errno if sending fails. bdrv_block_status_aligned() failure is
  * reported to the client, at which point this function succeeds.
  */
 static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
@@ -1939,10 +1939,9 @@ static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,

     while (progress < size) {
         int64_t pnum;
-        int status = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
-                                             offset + progress,
-                                             size - progress, &pnum, NULL,
-                                             NULL);
+        int status = bdrv_block_status_aligned(blk_bs(exp->common.blk),
+                                               offset + progress,
+                                               size - progress, &pnum);
         bool final;

         if (status < 0) {
@@ -2080,8 +2079,7 @@ static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
     while (bytes) {
         uint32_t flags;
         int64_t num;
-        int ret = bdrv_block_status_above(bs, NULL, offset, bytes, &num,
-                                          NULL, NULL);
+        int ret = bdrv_block_status_aligned(bs, offset, bytes, &num);

         if (ret < 0) {
             return ret;
diff --git a/tests/qemu-iotests/241 b/tests/qemu-iotests/241
index 5217af82dc65..49e2bc09e5bc 100755
--- a/tests/qemu-iotests/241
+++ b/tests/qemu-iotests/241
@@ -129,11 +129,10 @@ $QEMU_NBD_PROG --list -k $nbd_unix_socket | grep '\(size\|min\)'
 # is wrong unless the entire 4k is clean.
 $QEMU_IMG map --output=json --image-opts \
 	  "$TEST_IMG",x-dirty-bitmap=qemu:dirty-bitmap:b0 | _filter_qemu_img_map
-
-# FIXME: this should report a single 4k block of "zero":true,"data":true,
-# meaning allocated from the backing chain.  Using "zero":false,"data":false
-# (allocated in active layer) or "zero":false,"data":true (entire region
-# unallocated) is wrong.
+# Reports a single 4k block of "zero":true,"data":true, meaning allocated
+# from the backing chain.  Reporting "zero":false,"data":false would be wrong
+# (nothing is allocated in the active layer), as would reporting
+# "zero":false,"data":true (the entire region is not unallocated).
 $QEMU_IMG map --output=json --image-opts \
   "$TEST_IMG",x-dirty-bitmap=qemu:allocation-depth | _filter_qemu_img_map
 nbd_server_stop
diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
index 56d3796cf3ac..12a899ba9181 100644
--- a/tests/qemu-iotests/241.out
+++ b/tests/qemu-iotests/241.out
@@ -44,5 +44,5 @@ Formatting 'TEST_DIR/t.IMGFMT.qcow2', fmt=qcow2 size=4096 backing_file=TEST_DIR/
   size:  4096
   min block: 4096
 [{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
-[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
 *** done
-- 
2.30.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/5] nbd/server: Avoid unaligned dirty-bitmap status
  2021-02-18 20:15 [PATCH 0/5] Obey NBD spec regarding block size bounds on reply Eric Blake
                   ` (2 preceding siblings ...)
  2021-02-18 20:15 ` [PATCH 3/5] nbd/server: Avoid unaligned read/block_status from backing Eric Blake
@ 2021-02-18 20:15 ` Eric Blake
  2021-02-18 20:15 ` [PATCH 5/5] do not apply: Revert "nbd-client: Work around server BLOCK_STATUS misalignment at EOF" Eric Blake
  2021-02-18 20:33 ` [PATCH 0/5] Obey NBD spec regarding block size bounds on reply no-reply
  5 siblings, 0 replies; 16+ messages in thread
From: Eric Blake @ 2021-02-18 20:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, vsementsov, qemu-block, Max Reitz

The NBD spec requires that responses to NBD_CMD_BLOCK_STATUS be
aligned to the server's advertised min_block alignment, if the client
agreed to abide by alignments. In general, since dirty bitmap
granularity cannot go below 512, and it is hard to provoke qcow2 to go
above an alignment of 512, this is not an issue. But now that the
block layer can see through filters, it is possible to use blkdebug to
show a scenario where the server is provoked into supplying a
non-compliant reply, as shown in iotest 241.

Thus, it is time to fix the dirty bitmap code to round requests to
aligned boundaries.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 nbd/server.c               | 30 ++++++++++++++++++++++++++----
 tests/qemu-iotests/241     |  5 ++---
 tests/qemu-iotests/241.out |  2 +-
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 40847276ca64..31462abaee63 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1,5 +1,5 @@
 /*
- *  Copyright (C) 2016-2020 Red Hat, Inc.
+ *  Copyright (C) 2016-2021 Red Hat, Inc.
  *  Copyright (C) 2005  Anthony Liguori <anthony@codemonkey.ws>
  *
  *  Network Block Device Server Side
@@ -2175,7 +2175,8 @@ static int nbd_co_send_block_status(NBDClient *client, uint64_t handle,
 }

 /* Populate @ea from a dirty bitmap. */
-static void bitmap_to_extents(BdrvDirtyBitmap *bitmap,
+static void bitmap_to_extents(uint32_t align,
+                              BdrvDirtyBitmap *bitmap,
                               uint64_t offset, uint64_t length,
                               NBDExtentArray *es)
 {
@@ -2186,10 +2187,23 @@ static void bitmap_to_extents(BdrvDirtyBitmap *bitmap,
     bdrv_dirty_bitmap_lock(bitmap);

     for (start = offset;
-         bdrv_dirty_bitmap_next_dirty_area(bitmap, start, end, INT32_MAX,
+         bdrv_dirty_bitmap_next_dirty_area(bitmap, start, end,
+                                           INT32_MAX - align + 1,
                                            &dirty_start, &dirty_count);
          start = dirty_start + dirty_count)
     {
+        /*
+         * Round unaligned bits: any transition mid-alignment makes
+         * that entire aligned region appear dirty.
+         */
+        assert(QEMU_IS_ALIGNED(start, align));
+        if (!QEMU_IS_ALIGNED(dirty_start, align)) {
+            dirty_count += dirty_start - QEMU_ALIGN_DOWN(dirty_start, align);
+            dirty_start = QEMU_ALIGN_DOWN(dirty_start, align);
+        }
+        if (!QEMU_IS_ALIGNED(dirty_count, align)) {
+            dirty_count = QEMU_ALIGN_UP(dirty_count, align);
+        }
         if ((nbd_extent_array_add(es, dirty_start - start, 0) < 0) ||
             (nbd_extent_array_add(es, dirty_count, NBD_STATE_DIRTY) < 0))
         {
@@ -2214,7 +2228,15 @@ static int nbd_co_send_bitmap(NBDClient *client, uint64_t handle,
     unsigned int nb_extents = dont_fragment ? 1 : NBD_MAX_BLOCK_STATUS_EXTENTS;
     g_autoptr(NBDExtentArray) ea = nbd_extent_array_new(nb_extents);

-    bitmap_to_extents(bitmap, offset, length, ea);
+    /* Easiest to just refuse to answer an unaligned query */
+    if (client->check_align &&
+        !QEMU_IS_ALIGNED(offset | length, client->check_align)) {
+        return nbd_co_send_structured_error(client, handle, -EINVAL,
+                                            "unaligned dirty bitmap request",
+                                            errp);
+    }
+
+    bitmap_to_extents(client->check_align ?: 1, bitmap, offset, length, ea);

     return nbd_co_send_extents(client, handle, ea, last, context_id, errp);
 }
diff --git a/tests/qemu-iotests/241 b/tests/qemu-iotests/241
index 49e2bc09e5bc..b4d2e1934d66 100755
--- a/tests/qemu-iotests/241
+++ b/tests/qemu-iotests/241
@@ -124,9 +124,8 @@ nbd_server_start_unix_socket -B b0 -A --image-opts \

 TEST_IMG="driver=nbd,server.type=unix,server.path=$nbd_unix_socket"
 $QEMU_NBD_PROG --list -k $nbd_unix_socket | grep '\(size\|min\)'
-# FIXME: this should report a single 4k block of "data":false which translates
-# to the dirty bitmap being set for at least part of the region; "data":true
-# is wrong unless the entire 4k is clean.
+# Reports a single 4k block of "data":false, meaning dirty.  Reporting
+# "data":true would be wrong (the entire region is not clean).
 $QEMU_IMG map --output=json --image-opts \
 	  "$TEST_IMG",x-dirty-bitmap=qemu:dirty-bitmap:b0 | _filter_qemu_img_map
 # Reports a single 4k block of "zero":true,"data":true, meaning allocated
diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
index 12a899ba9181..2368b71054d3 100644
--- a/tests/qemu-iotests/241.out
+++ b/tests/qemu-iotests/241.out
@@ -43,6 +43,6 @@ wrote 512/512 bytes at offset 512
 Formatting 'TEST_DIR/t.IMGFMT.qcow2', fmt=qcow2 size=4096 backing_file=TEST_DIR/t.IMGFMT.mid backing_fmt=qcow2
   size:  4096
   min block: 4096
-[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": false}]
 [{ "start": 0, "length": 4096, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
 *** done
-- 
2.30.1




* [PATCH 5/5] do not apply: Revert "nbd-client: Work around server BLOCK_STATUS misalignment at EOF"
  2021-02-18 20:15 [PATCH 0/5] Obey NBD spec regarding block size bounds on reply Eric Blake
                   ` (3 preceding siblings ...)
  2021-02-18 20:15 ` [PATCH 4/5] nbd/server: Avoid unaligned dirty-bitmap status Eric Blake
@ 2021-02-18 20:15 ` Eric Blake
  2021-02-18 20:33 ` [PATCH 0/5] Obey NBD spec regarding block size bounds on reply no-reply
  5 siblings, 0 replies; 16+ messages in thread
From: Eric Blake @ 2021-02-18 20:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, vsementsov, qemu-block, Max Reitz

This reverts commit 737d3f524481bb2ef68d3eba1caa636ff143e16a.

This is intended only for testing purposes: if you apply this without
the rest of the series, iotest 241 no longer benefits from the client
side working around server non-compliance.

---
 block/nbd.c | 30 ++++--------------------------
 1 file changed, 4 insertions(+), 26 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index c26dc5a54f52..34c91f68e150 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -937,36 +937,14 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
     extent->length = payload_advance32(&payload);
     extent->flags = payload_advance32(&payload);

-    if (extent->length == 0) {
+    if (extent->length == 0 ||
+        (s->info.min_block && !QEMU_IS_ALIGNED(extent->length,
+                                               s->info.min_block))) {
         error_setg(errp, "Protocol error: server sent status chunk with "
-                   "zero length");
+                   "invalid length");
         return -EINVAL;
     }

-    /*
-     * A server sending unaligned block status is in violation of the
-     * protocol, but as qemu-nbd 3.1 is such a server (at least for
-     * POSIX files that are not a multiple of 512 bytes, since qemu
-     * rounds files up to 512-byte multiples but lseek(SEEK_HOLE)
-     * still sees an implicit hole beyond the real EOF), it's nicer to
-     * work around the misbehaving server. If the request included
-     * more than the final unaligned block, truncate it back to an
-     * aligned result; if the request was only the final block, round
-     * up to the full block and change the status to fully-allocated
-     * (always a safe status, even if it loses information).
-     */
-    if (s->info.min_block && !QEMU_IS_ALIGNED(extent->length,
-                                                   s->info.min_block)) {
-        trace_nbd_parse_blockstatus_compliance("extent length is unaligned");
-        if (extent->length > s->info.min_block) {
-            extent->length = QEMU_ALIGN_DOWN(extent->length,
-                                             s->info.min_block);
-        } else {
-            extent->length = s->info.min_block;
-            extent->flags = 0;
-        }
-    }
-
     /*
      * We used NBD_CMD_FLAG_REQ_ONE, so the server should not have
      * sent us any more than one extent, nor should it have included
-- 
2.30.1




* Re: [PATCH 0/5] Obey NBD spec regarding block size bounds on reply
  2021-02-18 20:15 [PATCH 0/5] Obey NBD spec regarding block size bounds on reply Eric Blake
                   ` (4 preceding siblings ...)
  2021-02-18 20:15 ` [PATCH 5/5] do not apply: Revert "nbd-client: Work around server BLOCK_STATUS misalignment at EOF" Eric Blake
@ 2021-02-18 20:33 ` no-reply
  2021-02-18 20:40   ` Eric Blake
  5 siblings, 1 reply; 16+ messages in thread
From: no-reply @ 2021-02-18 20:33 UTC (permalink / raw)
  To: eblake; +Cc: vsementsov, qemu-devel, qemu-block

Patchew URL: https://patchew.org/QEMU/20210218201528.127099-1-eblake@redhat.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20210218201528.127099-1-eblake@redhat.com
Subject: [PATCH 0/5] Obey NBD spec regarding block size bounds on reply

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/20210218201528.127099-1-eblake@redhat.com -> patchew/20210218201528.127099-1-eblake@redhat.com
Switched to a new branch 'test'
7be641f do not apply: Revert "nbd-client: Work around server BLOCK_STATUS misalignment at EOF"
93caf2c nbd/server: Avoid unaligned dirty-bitmap status
f9b0b64 nbd/server: Avoid unaligned read/block_status from backing
c990e1b block: Fix BDRV_BLOCK_RAW status to honor alignment
247619c iotests: Update 241 to expose backing layer fragmentation

=== OUTPUT BEGIN ===
1/5 Checking commit 247619cd55cb (iotests: Update 241 to expose backing layer fragmentation)
2/5 Checking commit c990e1bc847b (block: Fix BDRV_BLOCK_RAW status to honor alignment)
3/5 Checking commit f9b0b646002f (nbd/server: Avoid unaligned read/block_status from backing)
4/5 Checking commit 93caf2c5d549 (nbd/server: Avoid unaligned dirty-bitmap status)
5/5 Checking commit 7be641fea748 (do not apply: Revert "nbd-client: Work around server BLOCK_STATUS misalignment at EOF")
ERROR: Missing Signed-off-by: line(s)

total: 1 errors, 0 warnings, 41 lines checked

Patch 5/5 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20210218201528.127099-1-eblake@redhat.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com


* Re: [PATCH 0/5] Obey NBD spec regarding block size bounds on reply
  2021-02-18 20:33 ` [PATCH 0/5] Obey NBD spec regarding block size bounds on reply no-reply
@ 2021-02-18 20:40   ` Eric Blake
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Blake @ 2021-02-18 20:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: vsementsov, qemu-block

On 2/18/21 2:33 PM, no-reply@patchew.org wrote:
> Patchew URL: https://patchew.org/QEMU/20210218201528.127099-1-eblake@redhat.com/
> 
> 
> 
> Hi,
> 
> This series seems to have some coding style problems. See output below for
> more information:
> 

> 5/5 Checking commit 7be641fea748 (do not apply: Revert "nbd-client: Work around server BLOCK_STATUS misalignment at EOF")
> ERROR: Missing Signed-off-by: line(s)

Intentional since patch 5 is not part of the final series. (actually,
you can apply patch 5 first to see failures, then 1-4 to see how the
failures are fixed even without the interference of client-side
workarounds).  But at the end of the day, once 1-4 are upstream, we want
to keep the client-side workarounds for communication with older qemu or
with other NBD servers that have a similar compliance problem as what
gets fixed here.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation
  2021-02-18 20:15 ` [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation Eric Blake
@ 2021-02-25 13:50   ` Vladimir Sementsov-Ogievskiy
  2021-02-25 14:57     ` Vladimir Sementsov-Ogievskiy
  2021-02-25 15:46     ` Eric Blake
  0 siblings, 2 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-25 13:50 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, Kevin Wolf, Max Reitz

18.02.2021 23:15, Eric Blake wrote:
> Previous commits (such as 6e280648, 75d34eb9) have mentioned that our
> NBD server still sends unaligned fragments when an active layer with
> large advertised minimum block size is backed by another layer with a
> smaller block size. Expand the test to actually cover these scenarios,
> by using two different approaches: qcow2 encryption (which forces
> 512-byte alignment) with an unaligned raw backing file, and blkdebug
> with a 4k alignment.
> 
> The encryption test passes with the desired results, but only because
> the client side works around the server's non-compliance; if you
> repeat the test manually with tracing turned on, you will see the
> server sending a status for 1000 bytes of data then 1048 bytes of
> hole, which is not aligned. But reverting commit 737d3f5244 shows that
> it is indeed the client working around the bug in the server.
> 
> Meanwhile, the blkdebug test gives incorrect results: remember, when
> using x-dirty-bitmap with qemu-img map as a way to sniff alternative
> metadata contexts, the meanings of "data" and "zero" are determined by

How tired I am of this abuse :) It seems the sheer number of comments
about it in code and commit messages is worth creating a more intuitive
interface.. Do you have an idea in mind?

> that context.  Our client workaround is assuming that the fragmented
> replies can be merged according to base:allocation rules, but those
> rules do not work for other contexts (merging dirty and clean bitmap
> should produce dirty; merging allocated and unallocated should produce
> allocated; see the FIXME for more about the decoded values we expect).

You could instead keep the test output correct (without FIXME marks) but
add the test to the "disabled" group, then drop it from that group once
it is fixed.

> 
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

> ---
>   tests/qemu-iotests/241     | 60 ++++++++++++++++++++++++++++++++++----
>   tests/qemu-iotests/241.out | 21 +++++++++++++
>   2 files changed, 76 insertions(+), 5 deletions(-)
> 
> diff --git a/tests/qemu-iotests/241 b/tests/qemu-iotests/241
> index c962c8b6075d..5217af82dc65 100755
> --- a/tests/qemu-iotests/241
> +++ b/tests/qemu-iotests/241
> @@ -3,7 +3,7 @@
>   #
>   # Test qemu-nbd vs. unaligned images
>   #
> -# Copyright (C) 2018-2019 Red Hat, Inc.
> +# Copyright (C) 2018-2021 Red Hat, Inc.
>   #
>   # This program is free software; you can redistribute it and/or modify
>   # it under the terms of the GNU General Public License as published by
> @@ -27,7 +27,7 @@ status=1 # failure is the default!
>   _cleanup()
>   {
>       _cleanup_test_img
> -    rm -f "$TEST_DIR/server.log"
> +    rm -f "$TEST_DIR/server.log" "$TEST_IMG_FILE.mid" "$TEST_IMG_FILE.qcow2"
>       nbd_server_stop
>   }
>   trap "_cleanup; exit \$status" 0 1 2 3 15
> @@ -37,7 +37,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
>   . ./common.filter
>   . ./common.nbd
> 
> -_supported_fmt raw
> +_supported_fmt raw # although the test also requires use of qcow2
>   _supported_proto nbd
>   _supported_os Linux
>   _require_command QEMU_NBD
> @@ -89,11 +89,61 @@ $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
>   $QEMU_IO -c map "$TEST_IMG"
>   nbd_server_stop
> 
> -# Not tested yet: we also want to ensure that qemu as NBD client does
> +echo
> +echo "=== Encrypted qcow2 file backed by unaligned raw image ==="
> +echo
> +
> +# Enabling encryption in qcow2 forces 512-alignment
> +SECRET=secret,id=sec0,data=12345
> +$QEMU_IMG create -f qcow2 -b "$TEST_IMG_FILE" -F raw --object "$SECRET" \
> +  -o encrypt.format=luks,encrypt.key-secret=sec0,encrypt.iter-time=10 \
> +  "$TEST_IMG_FILE.qcow2" 2k | _filter_img_create
> +nbd_server_start_unix_socket --object "$SECRET" --image-opts \
> +  driver=qcow2,file.filename="$TEST_IMG_FILE.qcow2",encrypt.key-secret=sec0
> +
> +$QEMU_NBD_PROG --list -k $nbd_unix_socket | grep '\(size\|min\)'
> +$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
> +$QEMU_IO -c map "$TEST_IMG"
> +nbd_server_stop
> +
> +echo
> +echo "=== Use blkdebug for larger alignment than backing layer ==="
> +echo
> +
> +$QEMU_IMG create -f qcow2 -o cluster_size=1024 \
> +  "$TEST_IMG_FILE" 2k | _filter_img_create
> +$QEMU_IMG create -f qcow2 -b "$TEST_IMG_FILE" -F qcow2 -o cluster_size=512 \
> +  "$TEST_IMG_FILE.mid" | _filter_img_create
> +$QEMU_IMG bitmap --add -g 512 --enable "$TEST_IMG_FILE.mid" b0
> +$QEMU_IO -f qcow2 -c 'w 512 512' "$TEST_IMG_FILE.mid" | _filter_qemu_io
> +$QEMU_IMG create -f qcow2 -b "$TEST_IMG_FILE.mid" -F qcow2 \
> +  "$TEST_IMG_FILE.qcow2" 4k | _filter_img_create
> +FILE=image.file.filename="$TEST_IMG_FILE.qcow2"
> +nbd_server_start_unix_socket -B b0 -A --image-opts \
> +  driver=blkdebug,align=4096,image.driver=qcow2,image.file.driver=file,"$FILE"
> +
> +TEST_IMG="driver=nbd,server.type=unix,server.path=$nbd_unix_socket"
> +$QEMU_NBD_PROG --list -k $nbd_unix_socket | grep '\(size\|min\)'
> +# FIXME: this should report a single 4k block of "data":false which translates
> +# to the dirty bitmap being set for at least part of the region; "data":true
> +# is wrong unless the entire 4k is clean.
> +$QEMU_IMG map --output=json --image-opts \
> +	  "$TEST_IMG",x-dirty-bitmap=qemu:dirty-bitmap:b0 | _filter_qemu_img_map
> +
> +# FIXME: this should report a single 4k block of "zero":true,"data":true,
> +# meaning allocated from the backing chain.  Using "zero":false,"data":false
> +# (allocated in active layer) or "zero":false,"data":true (entire region
> +# unallocated) is wrong.
> +$QEMU_IMG map --output=json --image-opts \
> +  "$TEST_IMG",x-dirty-bitmap=qemu:allocation-depth | _filter_qemu_img_map
> +nbd_server_stop
> +
> +# Not tested here: we also want to ensure that qemu as NBD client does
>   # not access beyond the end of a server's advertised unaligned size:
>   #  nbdkit -U - memory size=513 --run 'qemu-io -f raw -c "r 512 512" $nbd'
>   # However, since qemu as server always rounds up to a sector alignment,
> -# we would have to use nbdkit to provoke the current client failures.
> +# we would have to use nbdkit to demonstrate this scenario (see
> +# commit 9cf638508c for more information).
> 
>   # success, all done
>   echo '*** done'
> diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
> index 75f9f465e522..67aaeed34f50 100644
> --- a/tests/qemu-iotests/241.out
> +++ b/tests/qemu-iotests/241.out
> @@ -25,4 +25,25 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' and probing guessed
>   [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
>   { "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
>   1 KiB (0x400) bytes     allocated at offset 0 bytes (0x0)
> +
> +=== Encrypted qcow2 file backed by unaligned raw image ===
> +
> +Formatting 'TEST_DIR/t.IMGFMT.qcow2', fmt=qcow2 size=2048 backing_file=TEST_DIR/t.IMGFMT backing_fmt=IMGFMT
> +  size:  2048
> +  min block: 512
> +[{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 1024, "length": 1024, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
> +2 KiB (0x800) bytes     allocated at offset 0 bytes (0x0)
> +
> +=== Use blkdebug for larger alignment than backing layer ===
> +
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=qcow2 size=2048
> +Formatting 'TEST_DIR/t.IMGFMT.mid', fmt=qcow2 size=2048 backing_file=TEST_DIR/t.IMGFMT backing_fmt=qcow2
> +wrote 512/512 bytes at offset 512
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +Formatting 'TEST_DIR/t.IMGFMT.qcow2', fmt=qcow2 size=4096 backing_file=TEST_DIR/t.IMGFMT.mid backing_fmt=qcow2
> +  size:  4096
> +  min block: 4096
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
>   *** done
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment
  2021-02-18 20:15 ` [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment Eric Blake
@ 2021-02-25 14:55   ` Vladimir Sementsov-Ogievskiy
  2021-02-25 16:03     ` Eric Blake
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-25 14:55 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz

18.02.2021 23:15, Eric Blake wrote:
> Previous patches mentioned how the blkdebug filter driver demonstrates
> a bug present in our NBD server (for example, commit b0245d64); the
> bug is also present with the raw format driver when probing
> occurs. Basically, if we specify a particular alignment > 1, but defer
> the actual block status to the underlying file, and the underlying
> file has a smaller alignment, then the use of BDRV_BLOCK_RAW to defer
> to the underlying file can end up with status split at an alignment
> unacceptable to our layer.  Many callers don't care, but NBD does - it
> is a violation of the NBD protocol to send status or read results
> split on an unaligned boundary (in 737d3f5244, we taught our 4.0
> client to be tolerant of such violations because the problem was even
> more pronounced with qemu 3.1 as server; but we still need to fix our
> server for the sake of other stricter clients).
> 
> This patch lays the groundwork - it adjusts bdrv_block_status to
> ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
> the deferral is either truncated down to an aligned boundary, or
> multiple sub-aligned requests are coalesced into a single
> representative answer (using an implicit hole beyond EOF as
> needed). Iotest 241 exposes the effect (when format probing occurred,
> we don't want block status to subdivide the initial sector, and thus
> not any other sector either). Similarly, iotest 221 is a good
> candidate to amend to specifically track the effects; a change to a
> hole at EOF is not visible if an upper layer does not want alignment
> smaller than 512. However, this patch alone is not a complete fix - it
> is still possible to have an active layer with large alignment
> constraints backed by another layer with smaller constraints; so the
> next patch will complete the task.
> 
> In particular, the next patch will introduce some mutual recursion
> (bdrv_co_common_block_status_above will call this new function rather
> than directly into bdrv_co_block_status), so some conditions added
> here (such as a NULL pointer for map or handling a length-0 request)
> are not reachable until that point.
> 
> There is one interesting corner case: prior to this patch, ALLOCATED
> was never returned without either DATA or ZERO also set. But now, if
> we have two subregions where the first reports status 0 (not
> allocated), and the second reports ZERO|ALLOCATED but not DATA
> (preallocated, read sees zero but underlying file has indeterminate
> contents), then we can end up with a result of just
> ALLOCATED. However, looking at callers of bdrv_is_allocated does not
> find any problem with this new return value. What's more, this
> situation doesn't really happen until the next patch adds support for
> getting aligned status from backing files (as the use of aligned
> status in this patch tends to be limited to just the protocol child of
> a format driver, yet protocol drivers tend to report fully allocated,
> and only format drivers have unallocated clusters that defer to a
> backing file in the first place).
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>   block/io.c                 | 142 +++++++++++++++++++++++++++++++++++--
>   tests/qemu-iotests/221     |  13 ++++
>   tests/qemu-iotests/221.out |   9 +++
>   tests/qemu-iotests/241.out |   3 +-
>   4 files changed, 161 insertions(+), 6 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index ca2dca30070e..4bca775c96b4 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2325,6 +2325,132 @@ int bdrv_flush_all(void)
>       return result;
>   }
> 
> +static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
> +                                             bool want_zero,
> +                                             int64_t offset, int64_t bytes,
> +                                             int64_t *pnum, int64_t *map,
> +                                             BlockDriverState **file);
> +
> +/*
> + * Returns an aligned allocation status of the specified disk region.
> + *
> + * Wrapper around bdrv_co_block_status() which requires the initial
> + * @offset and @count to be aligned to @align (must be power of 2),
> + * and guarantees the resulting @pnum will also be aligned; this may
> + * require piecing together multiple sub-aligned queries into an
> + * appropriate coalesced answer, as follows:
> + *
> + * - BDRV_BLOCK_DATA is set if the flag is set for at least one subregion
> + * - BDRV_BLOCK_ZERO is set only if the flag is set for all subregions
> + * - BDRV_BLOCK_OFFSET_VALID is set only if all subregions are contiguous
> + *   from the same file (@map and @file are then from the first subregion)
> + * - BDRV_BLOCK_ALLOCATED is set if the flag is set for at least one subregion

Hmm about this..

We already have a mess around ALLOCATED:

  [1] for format drivers it means that "the read is handled by this layer, not by backing", i.e. data (or zero) is placed exactly on that layer of the backing chain

  [2] for protocol drivers it's up to the driver, which may always report ALLOCATED (being more compatible with format drivers) or may sometimes return UNALLOCATED to show that the data is not allocated in the FS..

And this way, bdrv_co_block_status_aligned() is compatible with protocol drivers but not with format drivers (you can't combine the "go-to-backing" information of several subregions, because for some scenarios it's safer to consider the whole region ALLOCATED and for others it's safer to consider it UNALLOCATED).

For example, for a stream target it's safer to consider the block UNALLOCATED and do an extra copy-on-read operation. And for a stream base it's safer to consider the block ALLOCATED (and again do extra copying, rather than missing something significant).


I think, to avoid increasing the mess, we should first split [1] from [2] somehow..
Assume we change it to BDRV_BLOCK_PROTOCOL_ALLOCATED and BDRV_BLOCK_GO_TO_BACKING.

Then, for BDRV_BLOCK_PROTOCOL_ALLOCATED we can probably just report BDRV_BLOCK_PROTOCOL_ALLOCATED if at least one of the extents reports it. (Probably we don't need BDRV_BLOCK_PROTOCOL_ALLOCATED at all and can drop this logic.)

But for BDRV_BLOCK_GO_TO_BACKING we'll have to also add BDRV_BLOCK_GO_TO_BACKING_VALID and report

  * BDRV_BLOCK_GO_TO_BACKING | BDRV_BLOCK_GO_TO_BACKING_VALID if all extents report BDRV_BLOCK_GO_TO_BACKING
  
  * BDRV_BLOCK_GO_TO_BACKING_VALID alone if all extents report no BDRV_BLOCK_GO_TO_BACKING

  * <nothing> if some extents report BDRV_BLOCK_GO_TO_BACKING but others not.


Hmm.. And I think that there is a problem in the NBD protocol. Actually, with the allocation-depth context we started to report internal layers of the backing chain. And if we have layers with different request-alignment, it's not correct to report allocation-depth "aligned" to the top image's request-alignment.. So, for allocation-depth to work correctly, we should extend the NBD protocol to allow unaligned chunks in the BLOCK_STATUS reply.

So, finally:

1. making bitmap export extents aligned is a separate simple thing - that's OK

2. making base:allocation aligned is possible thanks to the good NBD_STATE_HOLE definition. So for it, it's correct to combine BDRV_BLOCK_ALLOCATED as you do, even keeping format layers in mind. We call block_status_above for the whole chain. ALLOCATED=0 only if all format layers refer to backing and the bottom protocol driver (if it exists) reports "unallocated in the FS", which corresponds to:

"If an extent is marked with NBD_STATE_HOLE at that context, this means that the given extent is not allocated in the backend storage, and that writing to the extent MAY result in the NBD_ENOSPC error"

    And this way, I'd prefer to fix exactly the base:allocation context handling in the nbd server, without touching the generic block_status code, which already has enough mess.

3. fixing qemu:allocation-depth, I think, is impossible without allowing unaligned chunks in the NBD spec (really, why should we restrict all possible metadata contexts so hard?). Or, if we are still going to align allocation-depth results to the top layer's request-alignment, we should change the allocation-depth specification so that it's not the "actual allocation depth" but something >= the actual allocation depth of all subchunks... And that becomes useless.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation
  2021-02-25 13:50   ` Vladimir Sementsov-Ogievskiy
@ 2021-02-25 14:57     ` Vladimir Sementsov-Ogievskiy
  2021-02-25 15:52       ` Eric Blake
  2021-02-25 15:46     ` Eric Blake
  1 sibling, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-25 14:57 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, Kevin Wolf, Max Reitz

25.02.2021 16:50, Vladimir Sementsov-Ogievskiy wrote:
> 18.02.2021 23:15, Eric Blake wrote:
>> Previous commits (such as 6e280648, 75d34eb9) have mentioned that our
>> NBD server still sends unaligned fragments when an active layer with
>> large advertised minimum block size is backed by another layer with a
>> smaller block size. Expand the test to actually cover these scenarios,
>> by using two different approaches: qcow2 encryption (which forces
>> 512-byte alignment) with an unaligned raw backing file, and blkdebug
>> with a 4k alignment.
>>
>> The encryption test passes with the desired results, but only because
>> the client side works around the server's non-compliance; if you
>> repeat the test manually with tracing turned on, you will see the
>> server sending a status for 1000 bytes of data then 1048 bytes of
>> hole, which is not aligned. But reverting commit 737d3f5244 shows that
>> it is indeed the client working around the bug in the server.
>>
>> Meanwhile, the blkdebug test gives incorrect results: remember, when
>> using x-dirty-bitmap with qemu-img map as a way to sniff alternative
>> metadata contexts, the meanings of "data" and "zero" are determined by
> 
> How tired I am of this abuse :) It seems the total amount of comments
> about it in code and commit messages would justify creating a more
> intuitive interface.. Do you have an idea in mind?
> 
>> that context.  Our client workaround is assuming that the fragmented
>> replies can be merged according to base:allocation rules, but those
>> rules do not work for other contexts (merging dirty and clean bitmap
>> should produce dirty; merging allocated and unallocated should produce
>> allocated; see the FIXME for more about the decoded values we expect).
> 
> You could instead keep the test output correct (without FIXME marks) but
> add the test to "disabled" group and drop it from the group when fixed.
> 
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 

Now I don't think that aligning the qemu:allocation-depth information is the correct thing to do.


-- 
Best regards,
Vladimir



* Re: [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation
  2021-02-25 13:50   ` Vladimir Sementsov-Ogievskiy
  2021-02-25 14:57     ` Vladimir Sementsov-Ogievskiy
@ 2021-02-25 15:46     ` Eric Blake
  1 sibling, 0 replies; 16+ messages in thread
From: Eric Blake @ 2021-02-25 15:46 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz

On 2/25/21 7:50 AM, Vladimir Sementsov-Ogievskiy wrote:
> 18.02.2021 23:15, Eric Blake wrote:
>> Previous commits (such as 6e280648, 75d34eb9) have mentioned that our
>> NBD server still sends unaligned fragments when an active layer with
>> large advertised minimum block size is backed by another layer with a
>> smaller block size. Expand the test to actually cover these scenarios,
>> by using two different approaches: qcow2 encryption (which forces
>> 512-byte alignment) with an unaligned raw backing file, and blkdebug
>> with a 4k alignment.
>>
>> The encryption test passes with the desired results, but only because
>> the client side works around the server's non-compliance; if you
>> repeat the test manually with tracing turned on, you will see the
>> server sending a status for 1000 bytes of data then 1048 bytes of
>> hole, which is not aligned. But reverting commit 737d3f5244 shows that
>> it is indeed the client working around the bug in the server.
>>
>> Meanwhile, the blkdebug test gives incorrect results: remember, when
>> using x-dirty-bitmap with qemu-img map as a way to sniff alternative
>> metadata contexts, the meanings of "data" and "zero" are determined by
> 
> How tired I am of this abuse :) It seems the total amount of comments
> about it in code and commit messages would justify creating a more
> intuitive interface.. Do you have an idea in mind?

Yes: 'nbdinfo' as part of the libnbd project ;)

Sadly, libnbd is not available on all our common porting targets yet,
and nbdinfo is less than a year old (so even distros that have libnbd
1.0 are too old).

> 
>> that context.  Our client workaround is assuming that the fragmented
>> replies can be merged according to base:allocation rules, but those
>> rules do not work for other contexts (merging dirty and clean bitmap
>> should produce dirty; merging allocated and unallocated should produce
>> allocated; see the FIXME for more about the decoded values we expect).
> 
> You could instead keep the test output correct (without FIXME marks) but
> add the test to "disabled" group and drop it from the group when fixed.

Either way, it's fixed by the end of the series.

> 
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 

Thanks!

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation
  2021-02-25 14:57     ` Vladimir Sementsov-Ogievskiy
@ 2021-02-25 15:52       ` Eric Blake
  2021-02-25 16:04         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Blake @ 2021-02-25 15:52 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz

On 2/25/21 8:57 AM, Vladimir Sementsov-Ogievskiy wrote:
> 25.02.2021 16:50, Vladimir Sementsov-Ogievskiy wrote:
>> 18.02.2021 23:15, Eric Blake wrote:
>>> Previous commits (such as 6e280648, 75d34eb9) have mentioned that our
>>> NBD server still sends unaligned fragments when an active layer with
>>> large advertised minimum block size is backed by another layer with a
>>> smaller block size. Expand the test to actually cover these scenarios,
>>> by using two different approaches: qcow2 encryption (which forces
>>> 512-byte alignment) with an unaligned raw backing file, and blkdebug
>>> with a 4k alignment.
>>>

> 
> Now I don't think that aligning qemu:allocation-depth information is a
> correct thing to do.

Why not?  First, it's very rare that you'd have a qcow2 image with
mandated 4k minimum block size, backed by another qcow2 image with 512
block size (blkdebug made it possible to expose the bug, but I could not
find a way in common day-to-day usage), so we really aren't impacting
normal users.  Second, from the perspective of copying backing chains
over NBD, what difference does it make if we have the backing chain:

A (granularity 512) <- B (granularity 512) <- C (granularity 4k)

with the allocation pattern:

A: -A-A-A-A-A-A-A-A
B: --BB--BB--BB--BB
C: --------CCCCCCCC

and report the allocation depth as:

   2222222211111111

instead of

   0322032211111111

The former may be imprecise, but it obeys our bounds, and in all
reality, if all we have access to is 4k chunks, any decisions we make
about how to handle that 4k block should be based on the fact that at
least some of the data was allocated in our backing file, and not
treating the entire 4k as unallocated merely because the first 512 bytes
are neither in A nor B.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment
  2021-02-25 14:55   ` Vladimir Sementsov-Ogievskiy
@ 2021-02-25 16:03     ` Eric Blake
  2021-02-25 16:23       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Blake @ 2021-02-25 16:03 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On 2/25/21 8:55 AM, Vladimir Sementsov-Ogievskiy wrote:
> 18.02.2021 23:15, Eric Blake wrote:
>> Previous patches mentioned how the blkdebug filter driver demonstrates
>> a bug present in our NBD server (for example, commit b0245d64); the
>> bug is also present with the raw format driver when probing
>> occurs. Basically, if we specify a particular alignment > 1, but defer
>> the actual block status to the underlying file, and the underlying
>> file has a smaller alignment, then the use of BDRV_BLOCK_RAW to defer
>> to the underlying file can end up with status split at an alignment
>> unacceptable to our layer.  Many callers don't care, but NBD does - it
>> is a violation of the NBD protocol to send status or read results
>> split on an unaligned boundary (in 737d3f5244, we taught our 4.0
>> client to be tolerant of such violations because the problem was even
>> more pronounced with qemu 3.1 as server; but we still need to fix our
>> server for the sake of other stricter clients).
>>

>> + * - BDRV_BLOCK_ALLOCATED is set if the flag is set for at least one
>> subregion
> 
> Hmm about this..
> 
> We already have mess around ALLOCATED:
> 
>  [1] for format drivers it means that "read is handled by this layer,
> not by backing", i.e. data (or zero) is placed exactly on that layer of
> backing-chain

If we're reading at a given granularity, then the 4k read is satisfied
at this layer even if portions of the read came from lower layers.  So
the logic works here.

> 
>  [2] for protocol drivers it's up to the driver, which may always report
> ALLOCATED (being more compatible with format drivers) or it may
> sometimes return UNALLOCATED to show that data is not allocated in FS..

We've been moving away from this particular overload.  What's more, most
protocol drivers that set it at all set it for every single byte,
because protocol layers don't have a notion of a backing file; which
means that if it is set at all, it will be set for every byte anyway, so
the logic works here.

> 
> And this way, bdrv_co_block_status_aligned() is compatible with protocol
> drivers but not with format drivers (as you can't combine
> "go-to-backing" information of several flags, as for some scenarios it's
> safer to consider the whole region ALLOCATED and for another it's safer
> to consider it UNALLOCATED.
> 
> For example for stream target it's safer to consider target block
> UNALLOCATED and do extra copy-on-read operation. And for stream base
> it's safer to consider block ALLOCATED (and again do extra copying, not
> missing something significant).
> 
> 
> I think, to avoid increasing of the mess, we should first split [1] from
> [2] somehow..
> Assume we change it to BDRV_BLOCK_PROTOCOL_ALLOCATED and
> BDRV_BLOCK_GO_TO_BACKING.

Maybe it is indeed worth splitting out two different flags to fully
distinguish between the two overloaded meanings, but that seems like an
independent patch to this series.

> 
> Then, for BDRV_BLOCK_PROTOCOL_ALLOCATED we probably can just report
> BDRV_BLOCK_PROTOCOL_ALLOCATED if at least one of extents reports
> BDRV_BLOCK_PROTOCOL_ALLOCATED. (probably we don't need
> BDRV_BLOCK_PROTOCOL_ALLOCATED at all and can drop this logic)
> 
> But for BDRV_BLOCK_GO_TO_BACKING we'll have to also add
> BDRV_BLOCK_GO_TO_BACKING_VALID and report
> 
>  * BDRV_BLOCK_GO_TO_BACKING | BDRV_BLOCK_GO_TO_BACKING_VALID if all
> extents report BDRV_BLOCK_GO_TO_BACKING
>  
>  * BDRV_BLOCK_GO_TO_BACKING if all extents report no
> BDRV_BLOCK_GO_TO_BACKING
> 
>  * <nothing> if some extents report BDRV_BLOCK_GO_TO_BACKING but others
> not.
> 
> 
> Hmm.. And, I think that there is a problem is in NBD protocol. Actually,
> with allocation-depth context we started to report internal layers of
> backing chain. And if we have layers with different request-alignment
> it's not correct to report allocation-depth "aligned" to top image
> request-alignment.. So, for allocation-depth to work correctly we should
> extend NBD protocol to allow unaligned chunks in BLOCK_STATUS report.

The NBD protocol says that base:allocation must obey allocation rules.
If we want to declare that "because qemu:allocation-depth is an
extension, we choose to NOT obey allocation rules, and if your client
connects to our extension, it MUST be prepared for what would normally
be non-compliant responses to NBD_CMD_BLOCK_STATUS", then we are free to
do so (it is our extension, after all).  Particularly since the only way
I could actually trigger it was with blkdebug (most format layers
support byte-level access, even when layered on top of a protocol layer
with a 512 or 4k minimum byte access).

So if you think it is better for me to respin the patch to fix ONLY
base:allocation bugs, but not qemu:allocation-depth, and instead merely
document the issue there, I could be persuaded to do so.

> 
> So, finally:
> 
> 1. making bitmap export extents aligned is a separate simple thing -
> that's OK
> 
> 2. making base:allocation aligned is possible due to good NBD_STATE_HOLE
> definition. So for it it's correct to combine BDRV_BLOCK_ALLOCATED as
> you do even keeping in mind format layers. We call block_status_above
> for the whole chain. ALLOCATED=0 only if all format layers refer to
> backing and bottom protocol driver(if exists) reports "unallocated in
> FS" so that correspond to
> 
> "If an extent is marked with NBD_STATE_HOLE at that context, this means
> that the given extent is not allocated in the backend storage, and that
> writing to the extent MAY result in the NBD_ENOSPC error"
> 
>    And this way, I'd prefer to fix exactly base:allocation context
> handling in nbd-server not touching generic block_status which already
> has enough mess.
> 
> 3. fixing qemu:allocation-depth I think is impossible without allowing
> unaligned chunks in NBD spec (really, why we should restrict all
> possible metadata contexts so hard?) Or, if we are still going to align
> allocation-depth results to top layer request-alignment we should change
> allocation-depth specification so that that's not "actual allocation
> depth" but something >= than actual allocation depth of all subchunks...
> And that becomes useless.
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation
  2021-02-25 15:52       ` Eric Blake
@ 2021-02-25 16:04         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-25 16:04 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: qemu-block, Kevin Wolf, Max Reitz

25.02.2021 18:52, Eric Blake wrote:
> On 2/25/21 8:57 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 25.02.2021 16:50, Vladimir Sementsov-Ogievskiy wrote:
>>> 18.02.2021 23:15, Eric Blake wrote:
>>>> Previous commits (such as 6e280648, 75d34eb9) have mentioned that our
>>>> NBD server still sends unaligned fragments when an active layer with
>>>> large advertised minimum block size is backed by another layer with a
>>>> smaller block size. Expand the test to actually cover these scenarios,
>>>> by using two different approaches: qcow2 encryption (which forces
>>>> 512-byte alignment) with an unaligned raw backing file, and blkdebug
>>>> with a 4k alignment.
>>>>
> 
>>
>> Now I don't think that aligning qemu:allocation-depth information is a
>> correct thing to do.
> 
> Why not?  First, it's very rare that you'd have a qcow2 image with
> mandated 4k minimum block size, backed by another qcow2 image with 512
> block size (blkdebug made it possible to expose the bug, but I could not
> find a way in common day-to-day usage), so we really aren't impacting
> normal users.  Second, from the perspective of copying backing chains
> over NBD, what difference does it make if we have the backing chain:
> 
> A (granularity 512) <- B (granularity 512) <- C (granularity 4k)
> 
> with the allocation pattern:
> 
> A: -A-A-A-A-A-A-A-A
> B: --BB--BB--BB--BB
> C: --------CCCCCCCC
> 
> and report the allocation depth as:
> 
>     2222222211111111
> 
> instead of
> 
>     0322032211111111
> 
> The former may be imprecise, but it obeys our bounds, and in all
> reality, if all we have access to is 4k chunks, any decisions we make
> about how to handle that 4k block should be based on the fact that at
> least some of the data was allocated in our backing file, and not
> treating the entire 4k as unallocated merely because the first 512 bytes
> are neither in A nor B.
> 

I'm not sure about the NBD client, but in qemu block jobs the decision may differ from task to task, as I mentioned in my answer on [2/5].
For example, block-stream will skip chunks allocated in the top layer, because there is nothing to do - the data is already in the top. But if we imagine that the top may return ALLOCATED for something that is not ALLOCATED, the stream logic is broken.. Probably that's a bad example.

I agree that this is a rare case anyway and we probably shouldn't care too much. But we should at least describe it in the allocation-depth specification.


-- 
Best regards,
Vladimir



* Re: [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment
  2021-02-25 16:03     ` Eric Blake
@ 2021-02-25 16:23       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-25 16:23 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz

25.02.2021 19:03, Eric Blake wrote:
> On 2/25/21 8:55 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 18.02.2021 23:15, Eric Blake wrote:
>>> Previous patches mentioned how the blkdebug filter driver demonstrates
>>> a bug present in our NBD server (for example, commit b0245d64); the
>>> bug is also present with the raw format driver when probing
>>> occurs. Basically, if we specify a particular alignment > 1, but defer
>>> the actual block status to the underlying file, and the underlying
>>> file has a smaller alignment, then the use of BDRV_BLOCK_RAW to defer
>>> to the underlying file can end up with status split at an alignment
>>> unacceptable to our layer.  Many callers don't care, but NBD does - it
>>> is a violation of the NBD protocol to send status or read results
>>> split on an unaligned boundary (in 737d3f5244, we taught our 4.0
>>> client to be tolerant of such violations because the problem was even
>>> more pronounced with qemu 3.1 as server; but we still need to fix our
>>> server for the sake of other stricter clients).
>>>
> 
>>> + * - BDRV_BLOCK_ALLOCATED is set if the flag is set for at least one
>>> subregion
>>
>> Hmm about this..
>>
>> We already have mess around ALLOCATED:
>>
>>   [1] for format drivers it means that "read is handled by this layer,
>> not by backing", i.e. data (or zero) is placed exactly on that layer of
>> backing-chain
> 
> If we're reading at a given granularity, then the 4k read is satisfied
> at this layer even if portions of the read came from lower layers.  So
> the logic works here.

Hmm.. I can't agree. That way we could say that everything is satisfied at this layer: even if there is no data in it, we read from this layer and it somehow takes the data from lower layers.

It's all about terminology, of course, and we can use the same terms for different things. For me, ALLOCATED for a format layer works as follows:

The whole layer is split into clusters. Each cluster is either ALLOCATED (the format layer produces the read data itself, without touching the backing child) or UNALLOCATED (the format layer just calls read() on the backing child at the same offset).

And before your patch, a block_status request never combined clusters with different ALLOCATED status.

The ALLOCATED status of blocks at some layer of the backing chain is significant for block jobs, and if we have several sequential chunks with different ALLOCATED status, we can't just consider all of them ALLOCATED, because in some scenarios that will lead to data loss.

> 
>>
>>   [2] for protocol drivers it's up to the driver, which may always report
>> ALLOCATED (being more compatible with format drivers) or it may
>> sometimes return UNALLOCATED to show that data is not allocated in FS..
> 
> We've been moving away from this particular overload.  What's more, most
> protocol drivers that set it at all set it for every single byte,
> because protocol layers don't have a notion of a backing file; which
> means that if it is set at all, it will be set for every byte anyway, so
> the logic works here.
> 
>>
>> And this way, bdrv_co_block_status_aligned() is compatible with protocol
>> drivers but not with format drivers (as you can't combine
>> "go-to-backing" information of several flags, as for some scenarios it's
>> safer to consider the whole region ALLOCATED and for another it's safer
>> to consider it UNALLOCATED.
>>
>> For example for stream target it's safer to consider target block
>> UNALLOCATED and do extra copy-on-read operation. And for stream base
>> it's safer to consider block ALLOCATED (and again do extra copying, not
>> missing something significant).
>>
>>
>> I think, to avoid increasing of the mess, we should first split [1] from
>> [2] somehow..
>> Assume we change it to BDRV_BLOCK_PROTOCOL_ALLOCATED and
>> BDRV_BLOCK_GO_TO_BACKING.
> 
> Maybe it is indeed worth splitting out two different flags to fully
> distinguish between the two overloaded meanings, but that seems like an
> independent patch to this series.
> 
>>
>> Then, for BDRV_BLOCK_PROTOCOL_ALLOCATED we probably can just report
>> BDRV_BLOCK_PROTOCOL_ALLOCATED if at least one of extents reports
>> BDRV_BLOCK_PROTOCOL_ALLOCATED. (probably we don't need
>> BDRV_BLOCK_PROTOCOL_ALLOCATED at all and can drop this logic)
>>
>> But for BDRV_BLOCK_GO_TO_BACKING we'll have to also add
>> BDRV_BLOCK_GO_TO_BACKING_VALID and report
>>
>>   * BDRV_BLOCK_GO_TO_BACKING | BDRV_BLOCK_GO_TO_BACKING_VALID if all
>> extents report BDRV_BLOCK_GO_TO_BACKING
>>   
>>   * BDRV_BLOCK_GO_TO_BACKING if all extents report no
>> BDRV_BLOCK_GO_TO_BACKING
>>
>>   * <nothing> if some extents report BDRV_BLOCK_GO_TO_BACKING but others
>> not.
>>
>>
>> Hmm.. And, I think that there is a problem is in NBD protocol. Actually,
>> with allocation-depth context we started to report internal layers of
>> backing chain. And if we have layers with different request-alignment
>> it's not correct to report allocation-depth "aligned" to top image
>> request-alignment.. So, for allocation-depth to work correctly we should
>> extend NBD protocol to allow unaligned chunks in BLOCK_STATUS report.
> 
> The NBD protocol says that base:allocation must obey allocation rules.
> If we want to declare that "because qemu:allocation-depth is an
> extension, we choose to NOT obey allocation rules, and if your client
> connects to our extension, it MUST be prepared for what would normally
> be non-compliant responses to NBD_CMD_BLOCK_STATUS", then we are free to
> do so (it is our extension, after all).  Particularly since the only way
> I could actually trigger it was with blkdebug (most format layers
> support byte-level access, even when layered on top of a protocol layer
> with a 512 or 4k minimum byte access).

Hmm, the NBD spec says in the description of NBD_CMD_BLOCK_STATUS:

The server SHOULD use descriptor lengths that are ..., and MUST use descriptor lengths that are an integer multiple of any advertised minimum block size.

> 
> So if you think it is better for me to respin the patch to fix ONLY
> base:allocation bugs, but not qemu:allocation-depth, and instead merely
> document the issue there, I could be persuaded to do so.
> 
>>
>> So, finally:
>>
>> 1. making bitmap export extents aligned is a separate simple thing -
>> that's OK
>>
>> 2. making base:allocation aligned is possible due to good NBD_STATE_HOLE
>> definition. So for it it's correct to combine BDRV_BLOCK_ALLOCATED as
>> you do even keeping in mind format layers. We call block_status_above
>> for the whole chain. ALLOCATED=0 only if all format layers refer to
>> backing and bottom protocol driver(if exists) reports "unallocated in
>> FS" so that correspond to
>>
>> "If an extent is marked with NBD_STATE_HOLE at that context, this means
>> that the given extent is not allocated in the backend storage, and that
>> writing to the extent MAY result in the NBD_ENOSPC error"
>>
>>     And this way, I'd prefer to fix exactly base:allocation context
>> handling in nbd-server not touching generic block_status which already
>> has enough mess.
>>
>> 3. fixing qemu:allocation-depth I think is impossible without allowing
>> unaligned chunks in NBD spec (really, why we should restrict all
>> possible metadata contexts so hard?) Or, if we are still going to align
>> allocation-depth results to top layer request-alignment we should change
>> allocation-depth specification so that that's not "actual allocation
>> depth" but something >= than actual allocation depth of all subchunks...
>> And that becomes useless.
>>
> 


-- 
Best regards,
Vladimir



end of thread, other threads:[~2021-02-25 16:25 UTC | newest]

Thread overview: 16+ messages
2021-02-18 20:15 [PATCH 0/5] Obey NBD spec regarding block size bounds on reply Eric Blake
2021-02-18 20:15 ` [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation Eric Blake
2021-02-25 13:50   ` Vladimir Sementsov-Ogievskiy
2021-02-25 14:57     ` Vladimir Sementsov-Ogievskiy
2021-02-25 15:52       ` Eric Blake
2021-02-25 16:04         ` Vladimir Sementsov-Ogievskiy
2021-02-25 15:46     ` Eric Blake
2021-02-18 20:15 ` [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment Eric Blake
2021-02-25 14:55   ` Vladimir Sementsov-Ogievskiy
2021-02-25 16:03     ` Eric Blake
2021-02-25 16:23       ` Vladimir Sementsov-Ogievskiy
2021-02-18 20:15 ` [PATCH 3/5] nbd/server: Avoid unaligned read/block_status from backing Eric Blake
2021-02-18 20:15 ` [PATCH 4/5] nbd/server: Avoid unaligned dirty-bitmap status Eric Blake
2021-02-18 20:15 ` [PATCH 5/5] do not apply: Revert "nbd-client: Work around server BLOCK_STATUS misalignment at EOF" Eric Blake
2021-02-18 20:33 ` [PATCH 0/5] Obey NBD spec regarding block size bounds on reply no-reply
2021-02-18 20:40   ` Eric Blake
