* [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above
@ 2020-05-19 19:54 Vladimir Sementsov-Ogievskiy
  2020-05-19 19:54 ` [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above Vladimir Sementsov-Ogievskiy
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 19:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

Hi all!

v2:
01: wording, grammar, keep comment
02-03: add Kevin's r-bs
05: test-output rebased on compression type qcow2 extension

=====

I wanted to understand what the real difference is between bdrv_block_status_above
and bdrv_is_allocated_above. IMHO, bdrv_is_allocated_above should work through
bdrv_block_status_above.

And I found the problem: bdrv_is_allocated_above considers space after EOF as
UNALLOCATED for intermediate nodes.

UNALLOCATED is not about allocation at the fs level, but about whether we should go to
the backing file or not. Reporting after-EOF space as UNALLOCATED seems incorrect to me,
because with a short backing file a read returns zeroes after EOF instead of going
further down the backing chain.
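
The read behaviour described above can be sketched with a toy model. This is
not QEMU code: the dict layout and the read_unit() helper are assumptions made
up for illustration, and offsets are counted in whole MiB for brevity.

```python
# Toy model (not QEMU code): an image is a dict with a virtual size and a
# map of allocated offsets. Offsets and sizes are in MiB.
def read_unit(chain, offset):
    """Return the value a guest sees at `offset`; chain[0] is the top image."""
    for image in chain:
        if offset >= image["size"]:
            # Past EOF of this image: zeroes are produced at this level and
            # the lookup does not continue further down the chain.
            return 0
        if offset in image["data"]:
            return image["data"][offset]
        # Unallocated in this image: defer to its backing image.
    return 0  # unallocated everywhere, no more backing images


base = {"size": 2, "data": {0: 1, 1: 1}}  # 2M, filled with 0x1
mid = {"size": 1, "data": {}}             # 1M, empty, backed by base
top = {"size": 2, "data": {}}             # 2M, empty, backed by mid

# Offset 0 falls through to base; offset 1 is past EOF of mid, so it reads
# as zero even though base holds data there.
print([read_unit([top, mid, base], o) for o in range(2)])
```

In this model the short image is the level that "answers" the read for the
region past its EOF, which is the sense of "allocated" used in this series.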

This leads to the following effect:

./qemu-img create -f qcow2 base.qcow2 2M
./qemu-io -c "write -P 0x1 0 2M" base.qcow2

./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M

Region 1M..2M is shadowed by the short middle image, so the guest sees zeroes:
./qemu-io -c "read -P 0 1M 1M" top.qcow2
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)

But after the commit, the guest-visible state has changed, which seems wrong to me:
./qemu-img commit top.qcow2 -b mid.qcow2

./qemu-io -c "read -P 0 1M 1M" mid.qcow2
Pattern verification failed at offset 1048576, 1048576 bytes
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)

./qemu-io -c "read -P 1 1M 1M" mid.qcow2
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)

=====

bdrv_block_allocated_above behaves strangely too:

With want_zero=true, it may report unallocated zeroes because of short backing files, which
are actually "allocated" from the POV of the backing chain. But I see that this may influence
only qemu-img compare, and I don't see how it could trigger a bug.

With want_zero=false, it may make no progress because of a short backing file. Moreover, it may
report EOF in the middle! But want_zero=false is used only in bdrv_is_allocated, which considers
only the top layer, so it seems OK.

Vladimir Sementsov-Ogievskiy (5):
  block/io: fix bdrv_co_block_status_above
  block/io: bdrv_common_block_status_above: support include_base
  block/io: bdrv_common_block_status_above: support bs == base
  block/io: fix bdrv_is_allocated_above
  iotests: add commit top->base cases to 274

 block/io.c                 | 104 ++++++++++++++++++-------------------
 tests/qemu-iotests/154.out |   4 +-
 tests/qemu-iotests/274     |  20 +++++++
 tests/qemu-iotests/274.out |  65 +++++++++++++++++++++++
 4 files changed, 139 insertions(+), 54 deletions(-)

-- 
2.21.0




* [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above
  2020-05-19 19:54 [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
@ 2020-05-19 19:54 ` Vladimir Sementsov-Ogievskiy
  2020-05-19 20:41   ` Eric Blake
  2020-05-19 19:54 ` [PATCH v2 2/5] block/io: bdrv_common_block_status_above: support include_base Vladimir Sementsov-Ogievskiy
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 19:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

bdrv_co_block_status_above has several problems with handling short
backing files:

1. With want_zero=true, it may return ret with BDRV_BLOCK_ZERO but
without the BDRV_BLOCK_ALLOCATED flag, when actually the short backing
file which produces these after-EOF zeros is inside the requested
backing sequence.

2. With want_zero=false, it may return pnum=0 prior to the actual EOF,
because of the EOF of a short backing file.

Fix these things, making the logic about short backing files clearer.

Note that the 154 output changed, because now bdrv_block_status_above
doesn't merge unallocated zeros with zeros after EOF (which are actually
"allocated" from the POV of a read from the backing-chain top) and
is_zero() just doesn't understand that the whole head or tail is zero.
We could update is_zero to call bdrv_block_status_above several times,
or add a flag to bdrv_block_status_above saying that we are not
interested in the ALLOCATED flag, so that ranges with different
ALLOCATED status may be merged, but actually it seems that we'd better
not care about this corner case.
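
The intended loop logic can be sketched as a simplified model. This is not the
C code from block/io.c: the flag values, the (size, allocated-set) layer
representation and the layer_status() helper are assumptions for illustration
only, and the model ignores want_zero, map and file.

```python
ZERO, ALLOCATED, EOF = 1, 2, 4  # stand-ins for the BDRV_BLOCK_* flags


def layer_status(layer, offset, nbytes):
    """Status of one layer alone; a layer is (size, set of allocated offsets)."""
    size, allocated = layer
    if offset >= size:
        return EOF, 0
    first = offset in allocated
    n = 0
    while n < nbytes and offset + n < size and ((offset + n) in allocated) == first:
        n += 1
    flags = ALLOCATED if first else 0
    if offset + n == size:
        flags |= EOF
    return flags, n


def status_above(chain, offset, nbytes):
    """Walk the chain top-first, mirroring the control flow of the patch."""
    first = True
    flags, pnum = 0, 0
    for layer in chain:
        flags, pnum = layer_status(layer, offset, nbytes)
        if pnum == 0:
            if first:
                return flags, pnum  # real EOF of the top node
            # Short backing layer: zeroes are produced at this level, so
            # report the remaining range as zero and allocated.
            assert flags & EOF
            return ZERO | ALLOCATED, nbytes
        if flags & ALLOCATED:
            return flags, pnum  # found the node that provides the data
        # Unallocated here; only the first pnum units are known to be so.
        assert pnum <= nbytes
        nbytes = pnum
        first = False
    return flags, pnum


top, mid, base = (2, set()), (1, set()), (2, {0, 1})
print(status_above([top, mid, base], 1, 1))  # region past EOF of mid
print(status_above([top, mid, base], 0, 2))  # truncated to 1 unit, from base
```

In the model, a request that starts past the EOF of the short middle layer
comes back as ZERO | ALLOCATED with pnum equal to the requested length, and
pnum == 0 is returned only when the top node itself is at EOF.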

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c                 | 38 +++++++++++++++++++++++++++++---------
 tests/qemu-iotests/154.out |  4 ++--
 2 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/block/io.c b/block/io.c
index 121ce17a49..db990e812b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2461,25 +2461,45 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
         ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
                                    file);
         if (ret < 0) {
-            break;
+            return ret;
         }
-        if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+        if (*pnum == 0) {
+            if (first) {
+                return ret;
+            }
+
             /*
-             * Reading beyond the end of the file continues to read
-             * zeroes, but we can only widen the result to the
-             * unallocated length we learned from an earlier
-             * iteration.
+             * Reads from bs for the selected region will return zeroes,
+             * produced because the current level is short. We should consider
+             * it as allocated.
+             *
+             * TODO: Should we report p as file here?
              */
+            assert(ret & BDRV_BLOCK_EOF);
             *pnum = bytes;
+            return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
         }
-        if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
-            break;
+        if (ret & BDRV_BLOCK_ALLOCATED) {
+            /* We've found the node and the status, we must return. */
+
+            if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+                /*
+                 * This level is also responsible for reads after EOF inside
+                 * the unallocated region in the previous level.
+                 */
+                *pnum = bytes;
+            }
+
+            return ret;
         }
+
         /* [offset, pnum] unallocated on this layer, which could be only
          * the first part of [offset, bytes].  */
-        bytes = MIN(bytes, *pnum);
+        assert(*pnum <= bytes);
+        bytes = *pnum;
         first = false;
     }
+
     return ret;
 }
 
diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
index fa3673317f..a203dfcadd 100644
--- a/tests/qemu-iotests/154.out
+++ b/tests/qemu-iotests/154.out
@@ -310,13 +310,13 @@ wrote 512/512 bytes at offset 134217728
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 2048/2048 bytes allocated at offset 128 MiB
 [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
-{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
 wrote 512/512 bytes at offset 134219264
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 2048/2048 bytes allocated at offset 128 MiB
 [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
-{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
 wrote 1024/1024 bytes at offset 134218240
 1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-- 
2.21.0




* [PATCH v2 2/5] block/io: bdrv_common_block_status_above: support include_base
  2020-05-19 19:54 [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
  2020-05-19 19:54 ` [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above Vladimir Sementsov-Ogievskiy
@ 2020-05-19 19:54 ` Vladimir Sementsov-Ogievskiy
  2020-05-19 19:54 ` [PATCH v2 3/5] block/io: bdrv_common_block_status_above: support bs == base Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 19:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

In order to reuse bdrv_common_block_status_above in
bdrv_is_allocated_above, let's support an include_base parameter.
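
As a rough illustration of which nodes the walk examines with and without
include_base, here is a small sketch. The visited_nodes() helper and the
string node names are hypothetical, and the sketch assumes every layer is
unallocated so that the walk is never cut short by an early return.

```python
def visited_nodes(chain, bs, base, include_base):
    """Return the nodes examined when walking from bs toward base.

    `chain` is ordered top-first. The walk assumes each layer reports the
    range as unallocated, so it always continues to the next layer.
    """
    assert include_base or bs != base
    visited = []
    i = chain.index(bs)
    while i < len(chain) and (include_base or chain[i] != base):
        visited.append(chain[i])
        if chain[i] == base:
            break  # base itself was examined; stop here
        i += 1
    return visited


chain = ["top", "mid", "base"]
print(visited_nodes(chain, "top", "base", include_base=False))
print(visited_nodes(chain, "top", "base", include_base=True))
# Same shape as the bdrv_is_allocated() call after this patch: only bs itself.
print(visited_nodes(chain, "top", "top", include_base=True))
```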

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index db990e812b..cdc0e6663e 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2223,6 +2223,7 @@ int bdrv_flush_all(void)
 typedef struct BdrvCoBlockStatusData {
     BlockDriverState *bs;
     BlockDriverState *base;
+    bool include_base;
     bool want_zero;
     int64_t offset;
     int64_t bytes;
@@ -2445,6 +2446,7 @@ early_out:
 
 static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
                                                    BlockDriverState *base,
+                                                   bool include_base,
                                                    bool want_zero,
                                                    int64_t offset,
                                                    int64_t bytes,
@@ -2456,8 +2458,8 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
     int ret = 0;
     bool first = true;
 
-    assert(bs != base);
-    for (p = bs; p != base; p = backing_bs(p)) {
+    assert(include_base || bs != base);
+    for (p = bs; include_base || p != base; p = backing_bs(p)) {
         ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
                                    file);
         if (ret < 0) {
@@ -2495,6 +2497,11 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
 
         /* [offset, pnum] unallocated on this layer, which could be only
          * the first part of [offset, bytes].  */
+
+        if (p == base) {
+            break;
+        }
+
         assert(*pnum <= bytes);
         bytes = *pnum;
         first = false;
@@ -2509,7 +2516,7 @@ static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
     BdrvCoBlockStatusData *data = opaque;
 
     data->ret = bdrv_co_block_status_above(data->bs, data->base,
-                                           data->want_zero,
+                                           data->include_base, data->want_zero,
                                            data->offset, data->bytes,
                                            data->pnum, data->map, data->file);
     data->done = true;
@@ -2523,6 +2530,7 @@ static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
  */
 static int bdrv_common_block_status_above(BlockDriverState *bs,
                                           BlockDriverState *base,
+                                          bool include_base,
                                           bool want_zero, int64_t offset,
                                           int64_t bytes, int64_t *pnum,
                                           int64_t *map,
@@ -2532,6 +2540,7 @@ static int bdrv_common_block_status_above(BlockDriverState *bs,
     BdrvCoBlockStatusData data = {
         .bs = bs,
         .base = base,
+        .include_base = include_base,
         .want_zero = want_zero,
         .offset = offset,
         .bytes = bytes,
@@ -2556,7 +2565,7 @@ int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
                             int64_t offset, int64_t bytes, int64_t *pnum,
                             int64_t *map, BlockDriverState **file)
 {
-    return bdrv_common_block_status_above(bs, base, true, offset, bytes,
+    return bdrv_common_block_status_above(bs, base, false, true, offset, bytes,
                                           pnum, map, file);
 }
 
@@ -2573,7 +2582,7 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
     int ret;
     int64_t dummy;
 
-    ret = bdrv_common_block_status_above(bs, backing_bs(bs), false, offset,
+    ret = bdrv_common_block_status_above(bs, bs, true, false, offset,
                                          bytes, pnum ? pnum : &dummy, NULL,
                                          NULL);
     if (ret < 0) {
-- 
2.21.0




* [PATCH v2 3/5] block/io: bdrv_common_block_status_above: support bs == base
  2020-05-19 19:54 [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
  2020-05-19 19:54 ` [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above Vladimir Sementsov-Ogievskiy
  2020-05-19 19:54 ` [PATCH v2 2/5] block/io: bdrv_common_block_status_above: support include_base Vladimir Sementsov-Ogievskiy
@ 2020-05-19 19:54 ` Vladimir Sementsov-Ogievskiy
  2020-05-19 19:55 ` [PATCH v2 4/5] block/io: fix bdrv_is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 19:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

We are going to reuse bdrv_common_block_status_above in
bdrv_is_allocated_above. bdrv_is_allocated_above may be called with
include_base == false and still bs == base (for example, from
img_rebase()).

So, support this corner case.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index cdc0e6663e..df44e89b7d 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2458,7 +2458,11 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
     int ret = 0;
     bool first = true;
 
-    assert(include_base || bs != base);
+    if (!include_base && bs == base) {
+        *pnum = bytes;
+        return 0;
+    }
+
     for (p = bs; include_base || p != base; p = backing_bs(p)) {
         ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
                                    file);
-- 
2.21.0




* [PATCH v2 4/5] block/io: fix bdrv_is_allocated_above
  2020-05-19 19:54 [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2020-05-19 19:54 ` [PATCH v2 3/5] block/io: bdrv_common_block_status_above: support bs == base Vladimir Sementsov-Ogievskiy
@ 2020-05-19 19:55 ` Vladimir Sementsov-Ogievskiy
  2020-05-19 20:45   ` Eric Blake
  2020-05-19 19:55 ` [PATCH v2 5/5] iotests: add commit top->base cases to 274 Vladimir Sementsov-Ogievskiy
  2020-05-19 20:21 ` [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above Eric Blake
  5 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 19:55 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

bdrv_is_allocated_above wrongly handles short backing files: it reports
after-EOF space as UNALLOCATED, which is wrong, as on read the data is
generated at the level of the short backing file (if all overlays have
an unallocated area at that place).

Reusing bdrv_common_block_status_above fixes the issue and unifies the
code paths.
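
The shape of the new wrapper can be sketched as follows. This is only a model:
the ALLOCATED bit value is a stand-in, and status_fn is a hypothetical
callable returning a (ret, pnum) pair in place of the real
bdrv_common_block_status_above call.

```python
ALLOCATED = 0x10  # stand-in bit, chosen for illustration only


def is_allocated_above(status_fn):
    """Reduce a block-status result to 1/0, passing negative errors through."""
    ret, pnum = status_fn()
    if ret < 0:
        return ret, pnum  # error code from the status query
    # Equivalent of !!(ret & BDRV_BLOCK_ALLOCATED) in the C code.
    return int(bool(ret & ALLOCATED)), pnum


print(is_allocated_above(lambda: (ALLOCATED | 0x02, 5)))
print(is_allocated_above(lambda: (0, 3)))
```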

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 43 +++++--------------------------------------
 1 file changed, 5 insertions(+), 38 deletions(-)

diff --git a/block/io.c b/block/io.c
index df44e89b7d..61f0930626 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2610,52 +2610,19 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
  * at 'offset + *pnum' may return the same allocation status (in other
  * words, the result is not necessarily the maximum possible range);
  * but 'pnum' will only be 0 when end of file is reached.
- *
  */
 int bdrv_is_allocated_above(BlockDriverState *top,
                             BlockDriverState *base,
                             bool include_base, int64_t offset,
                             int64_t bytes, int64_t *pnum)
 {
-    BlockDriverState *intermediate;
-    int ret;
-    int64_t n = bytes;
-
-    assert(base || !include_base);
-
-    intermediate = top;
-    while (include_base || intermediate != base) {
-        int64_t pnum_inter;
-        int64_t size_inter;
-
-        assert(intermediate);
-        ret = bdrv_is_allocated(intermediate, offset, bytes, &pnum_inter);
-        if (ret < 0) {
-            return ret;
-        }
-        if (ret) {
-            *pnum = pnum_inter;
-            return 1;
-        }
-
-        size_inter = bdrv_getlength(intermediate);
-        if (size_inter < 0) {
-            return size_inter;
-        }
-        if (n > pnum_inter &&
-            (intermediate == top || offset + pnum_inter < size_inter)) {
-            n = pnum_inter;
-        }
-
-        if (intermediate == base) {
-            break;
-        }
-
-        intermediate = backing_bs(intermediate);
+    int ret = bdrv_common_block_status_above(top, base, include_base, false,
+                                             offset, bytes, pnum, NULL, NULL);
+    if (ret < 0) {
+        return ret;
     }
 
-    *pnum = n;
-    return 0;
+    return !!(ret & BDRV_BLOCK_ALLOCATED);
 }
 
 typedef struct BdrvVmstateCo {
-- 
2.21.0




* [PATCH v2 5/5] iotests: add commit top->base cases to 274
  2020-05-19 19:54 [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2020-05-19 19:55 ` [PATCH v2 4/5] block/io: fix bdrv_is_allocated_above Vladimir Sementsov-Ogievskiy
@ 2020-05-19 19:55 ` Vladimir Sementsov-Ogievskiy
  2020-05-19 21:13   ` Eric Blake
  2020-05-19 20:21 ` [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above Eric Blake
  5 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 19:55 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

These cases are fixed by previous patches around block_status and
is_allocated.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 tests/qemu-iotests/274     | 20 ++++++++++++
 tests/qemu-iotests/274.out | 65 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+)

diff --git a/tests/qemu-iotests/274 b/tests/qemu-iotests/274
index 5d1bf34dff..e910455f13 100755
--- a/tests/qemu-iotests/274
+++ b/tests/qemu-iotests/274
@@ -115,6 +115,26 @@ with iotests.FilePath('base') as base, \
     iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, mid)
     iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), mid)
 
+    iotests.log('=== Testing qemu-img commit (top -> base) ===')
+
+    create_chain()
+    iotests.qemu_img_log('commit', '-b', base, top)
+    iotests.img_info_log(base)
+    iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
+    iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), base)
+
+    iotests.log('=== Testing QMP active commit (top -> base) ===')
+
+    create_chain()
+    with create_vm() as vm:
+        vm.launch()
+        vm.qmp_log('block-commit', device='top', base_node='base',
+                   job_id='job0', auto_dismiss=False)
+        vm.run_job('job0', wait=5)
+
+    iotests.img_info_log(mid)
+    iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
+    iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), base)
 
     iotests.log('== Resize tests ==')
 
diff --git a/tests/qemu-iotests/274.out b/tests/qemu-iotests/274.out
index d24ff681af..9806dea8b6 100644
--- a/tests/qemu-iotests/274.out
+++ b/tests/qemu-iotests/274.out
@@ -129,6 +129,71 @@ read 1048576/1048576 bytes at offset 0
 read 1048576/1048576 bytes at offset 1048576
 1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+=== Testing qemu-img commit (top -> base) ===
+Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=2097152 cluster_size=65536 lazy_refcounts=off refcount_bits=16 compression_type=zlib
+
+Formatting 'TEST_DIR/PID-mid', fmt=qcow2 size=1048576 backing_file=TEST_DIR/PID-base cluster_size=65536 lazy_refcounts=off refcount_bits=16 compression_type=zlib
+
+Formatting 'TEST_DIR/PID-top', fmt=qcow2 size=2097152 backing_file=TEST_DIR/PID-mid cluster_size=65536 lazy_refcounts=off refcount_bits=16 compression_type=zlib
+
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Image committed.
+
+image: TEST_IMG
+file format: IMGFMT
+virtual size: 2 MiB (2097152 bytes)
+cluster_size: 65536
+Format specific information:
+    compat: 1.1
+    compression type: zlib
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+=== Testing QMP active commit (top -> base) ===
+Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=2097152 cluster_size=65536 lazy_refcounts=off refcount_bits=16 compression_type=zlib
+
+Formatting 'TEST_DIR/PID-mid', fmt=qcow2 size=1048576 backing_file=TEST_DIR/PID-base cluster_size=65536 lazy_refcounts=off refcount_bits=16 compression_type=zlib
+
+Formatting 'TEST_DIR/PID-top', fmt=qcow2 size=2097152 backing_file=TEST_DIR/PID-mid cluster_size=65536 lazy_refcounts=off refcount_bits=16 compression_type=zlib
+
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+{"execute": "block-commit", "arguments": {"auto-dismiss": false, "base-node": "base", "device": "top", "job-id": "job0"}}
+{"return": {}}
+{"execute": "job-complete", "arguments": {"id": "job0"}}
+{"return": {}}
+{"data": {"device": "job0", "len": 1048576, "offset": 1048576, "speed": 0, "type": "commit"}, "event": "BLOCK_JOB_READY", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"data": {"device": "job0", "len": 1048576, "offset": 1048576, "speed": 0, "type": "commit"}, "event": "BLOCK_JOB_COMPLETED", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"execute": "job-dismiss", "arguments": {"id": "job0"}}
+{"return": {}}
+image: TEST_IMG
+file format: IMGFMT
+virtual size: 1 MiB (1048576 bytes)
+cluster_size: 65536
+backing file: TEST_DIR/PID-base
+Format specific information:
+    compat: 1.1
+    compression type: zlib
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
 == Resize tests ==
 === preallocation=off ===
 Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=6442450944 cluster_size=65536 lazy_refcounts=off refcount_bits=16 compression_type=zlib
-- 
2.21.0




* Re: [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above
  2020-05-19 19:54 [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (4 preceding siblings ...)
  2020-05-19 19:55 ` [PATCH v2 5/5] iotests: add commit top->base cases to 274 Vladimir Sementsov-Ogievskiy
@ 2020-05-19 20:21 ` Eric Blake
  2020-05-19 20:28   ` Vladimir Sementsov-Ogievskiy
  5 siblings, 1 reply; 16+ messages in thread
From: Eric Blake @ 2020-05-19 20:21 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

On 5/19/20 2:54 PM, Vladimir Sementsov-Ogievskiy wrote:

> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)

This no longer happens as of commit bf03dede47 and friends.  As such, 
how much of this series is still needed for other reasons?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above
  2020-05-19 20:21 ` [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above Eric Blake
@ 2020-05-19 20:28   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 20:28 UTC (permalink / raw)
  To: Eric Blake, qemu-block; +Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

19.05.2020 23:21, Eric Blake wrote:
> On 5/19/20 2:54 PM, Vladimir Sementsov-Ogievskiy wrote:
> 
>> This leads to the following effect:
>>
>> ./qemu-img create -f qcow2 base.qcow2 2M
>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>
>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>
>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>
>> But after commit guest visible state is changed, which seems wrong for me:
>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>
>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>> Pattern verification failed at offset 1048576, 1048576 bytes
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> This no longer happens as of commit bf03dede47 and friends.  As such, how much of this series is still needed for other reasons?
> 

Oops, sorry. I blindly copied the cover letter of v1 and forgot that it describes another thing. The test above is unrelated now. The whole series is still valid; it fixes another problem (see 04 and the new test cases in 05).

-- 
Best regards,
Vladimir



* Re: [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above
  2020-05-19 19:54 ` [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above Vladimir Sementsov-Ogievskiy
@ 2020-05-19 20:41   ` Eric Blake
  2020-05-19 21:13     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Blake @ 2020-05-19 20:41 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

On 5/19/20 2:54 PM, Vladimir Sementsov-Ogievskiy wrote:
> bdrv_co_block_status_above has several problems with handling short
> backing files:
> 
> 1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
> without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
> which produces these after-EOF zeros is inside requested backing
> sequence.

That's intentional.  That portion of the guest-visible data reads as 
zero (BDRV_BLOCK_ZERO set) but was NOT read from the top layer, but 
rather synthesized by the block layer because it derived from the 
backing file but was beyond EOF of that backing layer 
(BDRV_BLOCK_ALLOCATED is clear).

> 
> 2. With want_zero=false, it may return pnum=0 prior to actual EOF,
> because of EOF of short backing file.

Do you have a reproducer for this?  In my experience, this is not 
possible.  Generally, if you request status that overlaps EOF of the 
backing, you get a response truncated to the end of the backing, and you 
are then likely to follow up with a subsequent status request starting 
from the underlying EOF which then sees the desired unallocated zeroes:

back     xxxx
top      yy------
request    ^^^^^^
response   ^^
request      ^^^^
response     ^^^^

> 
> Fix these things, making logic about short backing files clearer.
> 
> Note that 154 output changed, because now bdrv_block_status_above don't

doesn't

> merge unallocated zeros with zeros after EOF (which are actually
> "allocated" in POV of read from backing-chain top) and is_zero() just
> don't understand that the whole head or tail is zero. We may update
> is_zero to call bdrv_block_status_above several times, or add flag to
> bdrv_block_status_above that we are not interested in ALLOCATED flag,
> so ranges with different ALLOCATED status may be merged, but actually,
> it seems that we'd better don't care about this corner case.

This actually sounds like an avoidable regression.  :(

I argue that if we did not explicitly write data/zero clusters in the 
tail of the top layer, then those clusters are not allocated from the 
POV of reading from the backing-chain top.  Yes, we know what their 
contents will be, but we also know what the contents of unallocated 
clusters will be when there is no backing file at all - basically, after 
your other patch series to drop unallocated_blocks_are_zero:
https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg05429.html
then we know that only format drivers that can support backing files 
even care what allocation means, and 'allocated' strictly means that the 
data comes from the top layer rather than from a backing (whether 
directly from the backing, or synthesized as zero by the block layer 
because it was beyond EOF of the backing).

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/io.c                 | 38 +++++++++++++++++++++++++++++---------
>   tests/qemu-iotests/154.out |  4 ++--
>   2 files changed, 31 insertions(+), 11 deletions(-)
> 

I'm already not a fan of this patch - it adds lines rather than removes, 
and seems to add a regression.

> diff --git a/block/io.c b/block/io.c
> index 121ce17a49..db990e812b 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2461,25 +2461,45 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
>           ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
>                                      file);
>           if (ret < 0) {
> -            break;
> +            return ret;
>           }
> -        if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
> +        if (*pnum == 0) {
> +            if (first) {
> +                return ret;
> +            }
> +
>               /*
> -             * Reading beyond the end of the file continues to read
> -             * zeroes, but we can only widen the result to the
> -             * unallocated length we learned from an earlier
> -             * iteration.
> +             * Reads from bs for the selected region will return zeroes,
> +             * produced because the current level is short. We should consider
> +             * it as allocated.

Why?  If we replaced the backing file with something longer (qemu-img 
rebase -u), we would WANT to read from the backing file.  The only 
reason we read zero is because the block layer synthesized it _while_ 
deferring to the backing layer, not because it was directly allocated in 
the top layer.

> +             *
> +             * TODO: Should we report p as file here?

No. Reporting 'file' only makes sense if you can point to an offset 
within that file that would read the guest-visible data in question - 
but when the data is synthesized, there is no such offset.

>                */
> +            assert(ret & BDRV_BLOCK_EOF);
>               *pnum = bytes;
> +            return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
>           }
> -        if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
> -            break;
> +        if (ret & BDRV_BLOCK_ALLOCATED) {
> +            /* We've found the node and the status, we must return. */
> +
> +            if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
> +                /*
> +                 * This level is also responsible for reads after EOF inside
> +                 * the unallocated region in the previous level.
> +                 */
> +                *pnum = bytes;
> +            }
> +
> +            return ret;
>           }
> +
>           /* [offset, pnum] unallocated on this layer, which could be only
>            * the first part of [offset, bytes].  */
> -        bytes = MIN(bytes, *pnum);
> +        assert(*pnum <= bytes);
> +        bytes = *pnum;
>           first = false;
>       }
> +
>       return ret;
>   }
>   
> diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
> index fa3673317f..a203dfcadd 100644
> --- a/tests/qemu-iotests/154.out
> +++ b/tests/qemu-iotests/154.out
> @@ -310,13 +310,13 @@ wrote 512/512 bytes at offset 134217728
>   512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   2048/2048 bytes allocated at offset 128 MiB
>   [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
> -{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]

The fact that we no longer see zeroes in the tail of the file makes me 
think this patch is wrong.

>   Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
>   wrote 512/512 bytes at offset 134219264
>   512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   2048/2048 bytes allocated at offset 128 MiB
>   [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
> -{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
>   Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
>   wrote 1024/1024 bytes at offset 134218240
>   1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH v2 4/5] block/io: fix bdrv_is_allocated_above
  2020-05-19 19:55 ` [PATCH v2 4/5] block/io: fix bdrv_is_allocated_above Vladimir Sementsov-Ogievskiy
@ 2020-05-19 20:45   ` Eric Blake
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Blake @ 2020-05-19 20:45 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

On 5/19/20 2:55 PM, Vladimir Sementsov-Ogievskiy wrote:
> bdrv_is_allocated_above wrongly handles short backing files: it reports
> after-EOF space as UNALLOCATED which is wrong,

You haven't convinced me of that claim.

> as on read the data is
> generated on the level of short backing file (if all overlays has
> unallocated area at that place).
> 
> Reusing bdrv_common_block_status_above fixes the issue and unifies code
> path.

Unifying the code path is admirable, but I'm not sure we have the 
semantics right, yet.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/io.c | 43 +++++--------------------------------------
>   1 file changed, 5 insertions(+), 38 deletions(-)
> 
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH v2 5/5] iotests: add commit top->base cases to 274
  2020-05-19 19:55 ` [PATCH v2 5/5] iotests: add commit top->base cases to 274 Vladimir Sementsov-Ogievskiy
@ 2020-05-19 21:13   ` Eric Blake
  2020-05-19 21:25     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Blake @ 2020-05-19 21:13 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

On 5/19/20 2:55 PM, Vladimir Sementsov-Ogievskiy wrote:
> These cases are fixed by previous patches around block_status and
> is_allocated.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   tests/qemu-iotests/274     | 20 ++++++++++++
>   tests/qemu-iotests/274.out | 65 ++++++++++++++++++++++++++++++++++++++
>   2 files changed, 85 insertions(+)

Okay, so this test fails when applied in isolation without the rest of 
your series.

> 
> diff --git a/tests/qemu-iotests/274 b/tests/qemu-iotests/274
> index 5d1bf34dff..e910455f13 100755
> --- a/tests/qemu-iotests/274
> +++ b/tests/qemu-iotests/274
> @@ -115,6 +115,26 @@ with iotests.FilePath('base') as base, \
>       iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, mid)
>       iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), mid)
>   
> +    iotests.log('=== Testing qemu-img commit (top -> base) ===')
> +
> +    create_chain()
> +    iotests.qemu_img_log('commit', '-b', base, top)
> +    iotests.img_info_log(base)
> +    iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
> +    iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), base)

So if I understand it, we are going from:

base    11111111
mid     ----
top     --------
guest   11110000

and we want to go to:

base    11110000

except that we are not properly writing the zeroes into base, because we 
grabbed the wrong status, ending up with:

base    11111111

The status of top from 1M onwards is unallocated, and if we were to 
commit to just mid, Kevin's truncate fixes solve that (we now zero out 
the tail of mid as part of resizing it to be large enough).  But you are 
instead skipping mid, and committing all the way to base.  So we need 
_something_ that can tell qemu-img commit that even though the region 
1m-2m is unallocated in top, we must behave as though the status of mid 
reports it as allocated (because when reading beyond EOF in mid, we DO 
read zero).  Since the data is allocated not in top, but acts as though 
it was allocated in mid, which is above base, then the commit operation 
has to do something to preserve that allocation.
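The scenario above can be replayed with a small toy model. This is only a sketch in Python, not QEMU code: layers are strings, '-' marks an unallocated cell, and the names guest_read, commit_to_base and eof_is_allocated are made up for illustration. It assumes that a commit copies into base only those cells that some layer above base is reported as providing.

```python
# Toy model, not QEMU code. chain[0] is top, chain[-1] is base.
# '-' means "unallocated here, defer to the backing layer".

def guest_read(chain, i):
    """Read cell i through the chain, starting at the top layer."""
    for layer in chain:
        if i >= len(layer):
            return "0"          # beyond EOF of this layer: synthesized zero
        if layer[i] != "-":
            return layer[i]     # cell is allocated in this layer
    return "0"                  # unallocated in every layer


def commit_to_base(chain, size, eof_is_allocated):
    """Copy into base every cell that a layer above base provides.

    eof_is_allocated selects how a short intermediate layer is treated:
    False skips past it (cell looks unallocated above base),
    True stops there and writes the synthesized zero into base.
    """
    *above, base = chain
    result = list(base)
    for i in range(size):
        for layer in above:
            past_eof = i >= len(layer)
            if past_eof and eof_is_allocated:
                result[i] = "0"
                break
            if not past_eof and layer[i] != "-":
                result[i] = layer[i]
                break
    return "".join(result)
```

With chain = ["--------", "----", "11111111"], the guest view is 11110000. Committing with eof_is_allocated=False leaves base as 11111111 (the guest-visible zeroes are lost), while eof_is_allocated=True yields 11110000.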

Okay, you've convinced me we have a bug.  However, I'm still not sold 
that patches 1 and 4 are quite the right fix.  Going back to the 
original setup, unpatched qemu.git head reports:

$ ./qemu-img map --output=json top.qcow2
[{ "start": 0, "length": 1048576, "depth": 2, "zero": false, "data": 
true, "offset": 327680},
{ "start": 1048576, "length": 1048576, "depth": 0, "zero": true, "data": 
false}]

I think what we really want is:

[{ "start": 0, "length": 1048576, "depth": 2, "zero": false, "data": 
true, "offset": 327680},
{ "start": 1048576, "length": 1048576, "depth": 1, "zero": true, "data": 
false}]

because then we would be _accurately_ reporting that the zeroes that we 
read from 1m-2m come _because_ we read from mid (beyond EOF), which is 
different from our current answer that the zeroes come from top (they 
don't, because top deferred to mid).  If we fix up qemu-img map output 
to correctly report zeroes beyond EOF from the correct layer, will that 
also fix up the bug we are seeing in qemu-img commit?
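As a rough illustration of the depth reporting proposed here, the following Python sketch attributes each cell to the layer where the search for data ends. It is a toy model only: resolve and map_extents are hypothetical names, one cell stands for 1M, and nothing here reflects how qemu-img map is actually implemented.

```python
# Toy model, not QEMU code. chain[0] is top (depth 0).
# '-' means "unallocated here, defer to the backing layer".

def resolve(chain, i):
    """Return (depth, zero) for cell i."""
    for depth, layer in enumerate(chain):
        if i >= len(layer):
            return depth, True          # zero synthesized at this short layer
        if layer[i] != "-":
            return depth, layer[i] == "0"
    return len(chain) - 1, True         # unallocated everywhere reads as zero


def map_extents(chain, size):
    """Merge adjacent cells with the same (depth, zero) into extents."""
    extents = []
    for i in range(size):
        depth, zero = resolve(chain, i)
        if extents and extents[-1][2:] == (depth, zero):
            start, length, _, _ = extents[-1]
            extents[-1] = (start, length + 1, depth, zero)
        else:
            extents.append((i, 1, depth, zero))
    return extents
```

For chain = ["--", "-", "11"] this yields (0, 1, 2, False) followed by (1, 1, 1, True): the zero tail is attributed to depth 1 (mid), not depth 0 (top).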

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above
  2020-05-19 20:41   ` Eric Blake
@ 2020-05-19 21:13     ` Vladimir Sementsov-Ogievskiy
  2020-05-19 21:48       ` Eric Blake
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 21:13 UTC (permalink / raw)
  To: Eric Blake, qemu-block; +Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

19.05.2020 23:41, Eric Blake wrote:
> On 5/19/20 2:54 PM, Vladimir Sementsov-Ogievskiy wrote:
>> bdrv_co_block_status_above has several problems with handling short
>> backing files:
>>
>> 1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
>> without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
>> which produces these after-EOF zeros is inside requested backing
>> sequence.
> 
> That's intentional.  That portion of the guest-visible data reads as zero (BDRV_BLOCK_ZERO set) but was NOT read from the top layer, but rather synthesized by the block layer because it derived from the backing file but was beyond EOF of that backing layer (BDRV_BLOCK_ALLOCATED is clear).

Not in top, yes. But it is _inside_ the requested base..top part of the backing chain, so it should be considered ALLOCATED, as we should not go to any further backing.

Assume the following chain:

top    aa--
middle bb
base   xxxx

(so, middle is short)

block_status(top, 2) should return ZERO without ALLOCATED: yes, it's ZERO, and yes, it comes from another layer.

block_status_above(top, base, 2) should return ZERO with ALLOCATED: it's ZERO, and it's produced inside the requested backing-chain region (specifically, because the middle node is short). We must report ALLOCATED to show that we are not going to read from base.
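A toy Python model of the two calls may make the distinction easier to see. This is a sketch of the semantics argued for here, not the code in block/io.c: the flag values are arbitrary, the signatures block_status(chain, depth, i) and block_status_above(chain, base_depth, i) are invented for the example, and ZERO hints from a short base are deliberately ignored.

```python
# Toy model, not QEMU code. chain[0] is top; '-' means unallocated.
ZERO, DATA, ALLOCATED = 1, 2, 4     # arbitrary flag values for this sketch


def block_status(chain, depth, i):
    """Status of cell i as seen by one layer (it may peek at its backing)."""
    layer = chain[depth]
    if layer[i] != "-":
        return DATA | ALLOCATED
    backing = chain[depth + 1] if depth + 1 < len(chain) else None
    if backing is None or i >= len(backing):
        return ZERO                 # reads as zero, but not allocated here
    return 0


def block_status_above(chain, base_depth, i):
    """Status of cell i for layers 0 .. base_depth - 1 (base excluded)."""
    for depth in range(base_depth):
        if i >= len(chain[depth]):
            # A short layer inside the requested range produces the zeroes,
            # so the search ends here and base is never read.
            return ZERO | ALLOCATED
        status = block_status(chain, depth, i)
        if status & ALLOCATED:
            return status
    return 0                        # nothing above base provides this cell
```

For chain = ["aa--", "bb", "xxxx"], block_status(chain, 0, 2) is ZERO alone, while block_status_above(chain, 2, 2) is ZERO | ALLOCATED.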

> 
>>
>> 2. With want_zero=false, it may return pnum=0 prior to actual EOF,
>> because of EOF of short backing file.
> 
> Do you have a reproducer for this?

No, I don't have one, but it seems possible at least with want_zero=false. I'll think of it tomorrow, too tired now.

> In my experience, this is not possible.  Generally, if you request status that overlaps EOF of the backing, you get a response truncated to the end of the backing, and you are then likely to follow up with a subsequent status request starting from the underlying EOF which then sees the desired unallocated zeroes:
> 
> back     xxxx
> top      yy------
> request    ^^^^^^
> response   ^^
> request      ^^^^
> response     ^^^^
> 
>>
>> Fix these things, making logic about short backing files clearer.
>>
>> Note that 154 output changed, because now bdrv_block_status_above don't
> 
> doesn't
> 
>> merge unallocated zeros with zeros after EOF (which are actually
>> "allocated" in POV of read from backing-chain top) and is_zero() just
>> don't understand that the whole head or tail is zero. We may update
>> is_zero to call bdrv_block_status_above several times, or add flag to
>> bdrv_block_status_above that we are not interested in ALLOCATED flag,
>> so ranges with different ALLOCATED status may be merged, but actually,
>> it seems that we'd better don't care about this corner case.
> 
> This actually sounds like an avoidable regression.  :(

I don't see a real problem in it. But it doesn't seem hard to avoid, so I will try.

> 
> I argue that if we did not explicitly write data/zero clusters in the tail of the top layer, then those clusters are not allocated from the POV of reading from the backing-chain top.  Yes, we know what their contents will be, but we also know what the contents of unallocated clusters will be when there is no backing file at all - basically, after your other patch series to drop unallocated_blocks_are_zero:
> https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg05429.html
> then we know that only format drivers that can support backing files even care what allocation means, and 'allocated' strictly means that the data comes from the top layer rather than from a backing (whether directly from the backing, or synthesized as zero by the block layer because it was beyond EOF of the backing).

I agree about allocated in top, returned by block_status. But this patch is for allocated_above, and the ALLOCATED status is not about top, but about a set of nodes from base (not inclusive) to top.

> 
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/io.c                 | 38 +++++++++++++++++++++++++++++---------
>>   tests/qemu-iotests/154.out |  4 ++--
>>   2 files changed, 31 insertions(+), 11 deletions(-)
>>
> 
> I'm already not a fan of this patch - it adds lines rather than removes, and seems to add a regression.
> 
>> diff --git a/block/io.c b/block/io.c
>> index 121ce17a49..db990e812b 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -2461,25 +2461,45 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
>>           ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
>>                                      file);
>>           if (ret < 0) {
>> -            break;
>> +            return ret;
>>           }
>> -        if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
>> +        if (*pnum == 0) {
>> +            if (first) {
>> +                return ret;
>> +            }
>> +
>>               /*
>> -             * Reading beyond the end of the file continues to read
>> -             * zeroes, but we can only widen the result to the
>> -             * unallocated length we learned from an earlier
>> -             * iteration.
>> +             * Reads from bs for the selected region will return zeroes,
>> +             * produced because the current level is short. We should consider
>> +             * it as allocated.
> 
> Why?  If we replaced the backing file to something longer (qemu-img rebase -u), we would WANT to read from the backing file.  The only reason we read zero is because the block layer synthesized it _while_ deferring to the backing layer, not because it was directly allocated in the top layer.

No, if we replace the backing file of the current layer, nothing will change, as _this_ layer is short, not its backing. Or which backing file do you mean? If you mean the current bs, then replacing it doesn't make sense in this context, as block_status_above was asked about the current bs (as part of the base..top range), not some other one.

> 
>> +             *
>> +             * TODO: Should we report p as file here?
> 
> No. Reporting 'file' only makes sense if you can point to an offset within that file that would read the guest-visible data in question - but when the data is synthesized, there is no such offset.

I don't know. It still adds some information about which level is responsible for these ZEROES. Kevin argued that it makes sense.

> 
>>                */
>> +            assert(ret & BDRV_BLOCK_EOF);
>>               *pnum = bytes;
>> +            return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
>>           }
>> -        if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
>> -            break;
>> +        if (ret & BDRV_BLOCK_ALLOCATED) {
>> +            /* We've found the node and the status, we must return. */
>> +
>> +            if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
>> +                /*
>> +                 * This level is also responsible for reads after EOF inside
>> +                 * the unallocated region in the previous level.
>> +                 */
>> +                *pnum = bytes;
>> +            }
>> +
>> +            return ret;
>>           }
>> +
>>           /* [offset, pnum] unallocated on this layer, which could be only
>>            * the first part of [offset, bytes].  */
>> -        bytes = MIN(bytes, *pnum);
>> +        assert(*pnum <= bytes);
>> +        bytes = *pnum;
>>           first = false;
>>       }
>> +
>>       return ret;
>>   }
>> diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
>> index fa3673317f..a203dfcadd 100644
>> --- a/tests/qemu-iotests/154.out
>> +++ b/tests/qemu-iotests/154.out
>> @@ -310,13 +310,13 @@ wrote 512/512 bytes at offset 134217728
>>   512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>   2048/2048 bytes allocated at offset 128 MiB
>>   [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
>> -{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
>> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
> 
> The fact that we no longer see zeroes in the tail of the file makes me think this patch is wrong.
> 
>>   Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
>>   wrote 512/512 bytes at offset 134219264
>>   512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>   2048/2048 bytes allocated at offset 128 MiB
>>   [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
>> -{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
>> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
>>   Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
>>   wrote 1024/1024 bytes at offset 134218240
>>   1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>
> 


-- 
Best regards,
Vladimir



* Re: [PATCH v2 5/5] iotests: add commit top->base cases to 274
  2020-05-19 21:13   ` Eric Blake
@ 2020-05-19 21:25     ` Vladimir Sementsov-Ogievskiy
  2020-05-19 21:49       ` Eric Blake
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 21:25 UTC (permalink / raw)
  To: Eric Blake, qemu-block; +Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

20.05.2020 00:13, Eric Blake wrote:
> On 5/19/20 2:55 PM, Vladimir Sementsov-Ogievskiy wrote:
>> These cases are fixed by previous patches around block_status and
>> is_allocated.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   tests/qemu-iotests/274     | 20 ++++++++++++
>>   tests/qemu-iotests/274.out | 65 ++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 85 insertions(+)
> 
> Okay, so this test fails when applied in isolation without the rest of your series.
> 
>>
>> diff --git a/tests/qemu-iotests/274 b/tests/qemu-iotests/274
>> index 5d1bf34dff..e910455f13 100755
>> --- a/tests/qemu-iotests/274
>> +++ b/tests/qemu-iotests/274
>> @@ -115,6 +115,26 @@ with iotests.FilePath('base') as base, \
>>       iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, mid)
>>       iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), mid)
>> +    iotests.log('=== Testing qemu-img commit (top -> base) ===')
>> +
>> +    create_chain()
>> +    iotests.qemu_img_log('commit', '-b', base, top)
>> +    iotests.img_info_log(base)
>> +    iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
>> +    iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), base)
> 
> So if I understand it, we are going from:
> 
> base    11111111
> mid     ----
> top     --------
> guest   11110000
> 
> and we want to go to:
> 
> base    11110000
> 
> except that we are not properly writing the zeroes into base, because we grabbed the wrong status, ending up with:
> 
> base    11111111
> 
> The status of top from 1M onwards is unallocated, and if we were to commit to just mid, Kevin's truncate fixes solve that (we now zero out the tail of mid as part of resizing it to be large enough).  But you are instead skipping mid, and committing all the way to base.  So we need _something_ that can tell qemu-img commit that even though the region 1m-2m is unallocated in top, we must behave as though the status of mid reports it as allocated (because when reading beyond EOF in mid, we DO read zero).  Since the data is allocated not in top, but acts as though it was allocated in mid, which is above base, then the commit operation has to do something to preserve that allocation.
> 
> Okay, you've convinced me we have a bug.  However, I'm still not sold that patches 1 and 4 are quite the right fix.  Going back to the original setup, unpatched qemu.git head reports:
> 
> $ ./qemu-img map --output=json top.qcow2
> [{ "start": 0, "length": 1048576, "depth": 2, "zero": false, "data": true, "offset": 327680},
> { "start": 1048576, "length": 1048576, "depth": 0, "zero": true, "data": false}]
> 
> I think what we really want is:
> 
> [{ "start": 0, "length": 1048576, "depth": 2, "zero": false, "data": true, "offset": 327680},
> { "start": 1048576, "length": 1048576, "depth": 1, "zero": true, "data": false}]
> 
> because then we would be _accurately_ reporting that the zeroes that we read from 1m-2m come _because_ we read from mid (beyond EOF), which is different from our current answer that the zeroes come from top (they don't, because top deferred to mid). 

Right. This is exactly the logic that this series brings to block_status_above and is_allocated_above.

> If we fix up qemu-img map output to correctly report zeroes beyond EOF from the correct layer, will that also fix up the bug we are seeing in qemu-img commit?
> 

No, it will not fix it, because img_map has its own implementation of block_status_above: the get_block_status function in qemu-img.c, which walks the backing chain by itself and is used only in img_map (not in img_convert). But you are right that it should be fixed too.

-- 
Best regards,
Vladimir



* Re: [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above
  2020-05-19 21:13     ` Vladimir Sementsov-Ogievskiy
@ 2020-05-19 21:48       ` Eric Blake
  2020-05-20  6:16         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Blake @ 2020-05-19 21:48 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

On 5/19/20 4:13 PM, Vladimir Sementsov-Ogievskiy wrote:
> 19.05.2020 23:41, Eric Blake wrote:
>> On 5/19/20 2:54 PM, Vladimir Sementsov-Ogievskiy wrote:
>>> bdrv_co_block_status_above has several problems with handling short
>>> backing files:
>>>
>>> 1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
>>> without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
>>> which produces these after-EOF zeros is inside requested backing
>>> sequence.
>>
>> That's intentional.  That portion of the guest-visible data reads as 
>> zero (BDRV_BLOCK_ZERO set) but was NOT read from the top layer, but 
>> rather synthesized by the block layer because it derived from the 
>> backing file but was beyond EOF of that backing layer 
>> (BDRV_BLOCK_ALLOCATED is clear).
> 
> Not in top yes. But _inside_ the requested base..top backing-chain-part. 
> So it should be considered ALLOCATED, as we should not go to further 
> backing.

Yes, I think I figured that out by patch 5.

> 
> Assume the following chain:
> 
> top    aa--
> middle bb
> base   xxxx
> 
> (so, middle is short)
> 
> block_status(top, 2) should return ZERO without ALLOCATED, as yes it's 
> ZERO and yes, it's from another layer
> 
> block_status_above(top, base, 2) should return ZERO with ALLOCATED, as 
> it's ZERO, and it's produced inside requested backing-chain-region, 
> actually, it's produced because of short middle node. We must report 
> ALLOCATED to show that we are not going to read from base.

Yes, that matches my intuition.  allocated_above says "where in the 
chain did we get the data, since it did not come from top", and the 
correct answer is "we got it from middle, due to synthesizing zero 
beyond EOF".  Okay, with that understanding in place, maybe this patch 
is right.  But I'll have to revisit it tomorrow with a fresh mind (it's 
too late in the day for me to be sure that I'm getting it all straight 
right now).

> 
>>
>>>
>>> 2. With want_zero=false, it may return pnum=0 prior to actual EOF,
>>> because of EOF of short backing file.
>>
>> Do you have a reproducer for this?
> 
> No, I don't have one, but it seems possible at least with 
> want_zero=false. I'll think of it tomorrow, too tired now.
> 
>> In my experience, this is not possible.  Generally, if you request 
>> status that overlaps EOF of the backing, you get a response truncated 
>> to the end of the backing, and you are then likely to follow up with a 
>> subsequent status request starting from the underlying EOF which then 
>> sees the desired unallocated zeroes:
>>
>> back     xxxx
>> top      yy------
>> request    ^^^^^^
>> response   ^^
>> request      ^^^^
>> response     ^^^^

If we can come up with a reproducer where allocated_above returns 
pnum=0, that would indeed prove my initial hesitation wrong (perhaps by:

back    xxxxxxxx
mid1    xxxxxx
mid2    xxxx
mid3    xxxxxx
top     xxxxxxxx

for various different start and base points within the chain?)
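The request/response pattern in the quoted diagram can be sketched as a toy Python loop. This is not QEMU code: status and walk are hypothetical names, the model has only one top layer and one backing length, and it merely illustrates why a caller that advances by pnum depends on pnum being nonzero before the real EOF.

```python
# Toy model, not QEMU code. '-' in top means "defer to backing".

def status(top, backing_len, offset, nbytes):
    """Return (pnum, kind) for the longest uniform run starting at offset."""
    end = min(offset + nbytes, len(top))

    def kind(i):
        if top[i] != "-":
            return "data"
        # Unallocated in top: comes from backing, or is zero past its EOF.
        return "backing" if i < backing_len else "zero"

    first = kind(offset)
    pnum = 0
    while offset + pnum < end and kind(offset + pnum) == first:
        pnum += 1
    return pnum, first


def walk(top, backing_len, offset):
    """Issue status requests until the end of top, advancing by pnum."""
    extents = []
    while offset < len(top):
        pnum, kind = status(top, backing_len, offset, len(top) - offset)
        # A zero-length answer before EOF would stall this loop forever.
        assert pnum > 0
        extents.append((offset, pnum, kind))
        offset += pnum
    return extents
```

With top = "yy------" and a backing of length 4, walk(top, 4, 2) gives (2, 2, "backing") and then (4, 4, "zero"), matching the two request/response pairs in the diagram.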

>>
>>>
>>> Fix these things, making logic about short backing files clearer.
>>>
>>> Note that 154 output changed, because now bdrv_block_status_above don't
>>
>> doesn't
>>
>>> merge unallocated zeros with zeros after EOF (which are actually
>>> "allocated" in POV of read from backing-chain top) and is_zero() just
>>> don't understand that the whole head or tail is zero. We may update
>>> is_zero to call bdrv_block_status_above several times, or add flag to
>>> bdrv_block_status_above that we are not interested in ALLOCATED flag,
>>> so ranges with different ALLOCATED status may be merged, but actually,
>>> it seems that we'd better don't care about this corner case.
>>
>> This actually sounds like an avoidable regression.  :(
> 
> I don't see real problem in it. But it seems not hard to avoid it, so I 
> will try to.

I guess my real reasoning is: "I spent a lot of time trying to tweak 
that test to not lose the fact that the tail of the image reads as 
zero", because it looks weird if we later resize the image but still 
have a glitch in the middle reporting one non-zero cluster out of a 
larger range all because of the shenanigans that occurred around the 
tail prior to resizing.

>>> +++ b/block/io.c
>>> @@ -2461,25 +2461,45 @@ static int coroutine_fn 
>>> bdrv_co_block_status_above(BlockDriverState *bs,
>>>           ret = bdrv_co_block_status(p, want_zero, offset, bytes, 
>>> pnum, map,
>>>                                      file);
>>>           if (ret < 0) {
>>> -            break;
>>> +            return ret;
>>>           }
>>> -        if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
>>> +        if (*pnum == 0) {
>>> +            if (first) {
>>> +                return ret;
>>> +            }
>>> +
>>>               /*
>>> -             * Reading beyond the end of the file continues to read
>>> -             * zeroes, but we can only widen the result to the
>>> -             * unallocated length we learned from an earlier
>>> -             * iteration.
>>> +             * Reads from bs for the selected region will return 
>>> zeroes,
>>> +             * produced because the current level is short. We 
>>> should consider
>>> +             * it as allocated.
>>
>> Why?  If we replaced the backing file to something longer (qemu-img 
>> rebase -u), we would WANT to read from the backing file.  The only 
>> reason we read zero is because the block layer synthesized it _while_ 
>> deferring to the backing layer, not because it was directly allocated 
>> in the top layer.
> 
> No, if we replace backing file of the current layer, nothing will 
> change, as _this_ layer is short, not the backing. Or which backing file 
> do you mean? If you mean current bs, than replacing it doesn't make 
> sense in the context, as block_status_above requested the current bs (as 
> part of base..top range), not the other one.

Maybe it's just the comment wording that needs help.  After reading 
through patch 5, it looks like my problem is now coming up with a 
comment to the effect of "the top layer deferred to this layer, and 
because this layer is short, any zeroes that we synthesize beyond EOF 
behave as if they were allocated at this layer".

> 
>>
>>> +             *
>>> +             * TODO: Should we report p as file here?
>>
>> No. Reporting 'file' only makes sense if you can point to an offset 
>> within that file that would read the guest-visible data in question - 
>> but when the data is synthesized, there is no such offset.
> 
> I don't know. It still adds some information about which level is 
> responsible for these ZEROES. Kevin argued that it make sense.

It took me a while, but I'm coming around to it: my initial read was 
assuming that you were reporting that the tail was being claimed as 
allocated by top; but in reality, you are fixing things to claim it as 
being allocated by mid.  The former is wrong (top did not allocate, it 
deferred to mid); but the latter does indeed make sense (reading from 
mid ended up synthesizing, which means that our hunt for the data ends 
at mid and we never traverse deeper, regardless of whether base may also 
have data).  But now it's a question of whether the code matches that 
textual description, and I'm a bit too fried to answer that question 
properly today :)

>>> +++ b/tests/qemu-iotests/154.out
>>> @@ -310,13 +310,13 @@ wrote 512/512 bytes at offset 134217728
>>>   512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>>   2048/2048 bytes allocated at offset 128 MiB
>>>   [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, 
>>> "data": false},
>>> -{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, 
>>> "data": false}]
>>> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, 
>>> "data": true, "offset": OFFSET}]
>>
>> The fact that we no longer see zeroes in the tail of the file makes me 
>> think this patch is wrong.

So, if we can avoid that minor regression, and still otherwise report 
zeroes as allocated from mid, then I think we'll be on the right track.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH v2 5/5] iotests: add commit top->base cases to 274
  2020-05-19 21:25     ` Vladimir Sementsov-Ogievskiy
@ 2020-05-19 21:49       ` Eric Blake
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Blake @ 2020-05-19 21:49 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

On 5/19/20 4:25 PM, Vladimir Sementsov-Ogievskiy wrote:

>> $ ./qemu-img map --output=json top.qcow2
>> [{ "start": 0, "length": 1048576, "depth": 2, "zero": false, "data": 
>> true, "offset": 327680},
>> { "start": 1048576, "length": 1048576, "depth": 0, "zero": true, 
>> "data": false}]
>>
>> I think what we really want is:
>>
>> [{ "start": 0, "length": 1048576, "depth": 2, "zero": false, "data": 
>> true, "offset": 327680},
>> { "start": 1048576, "length": 1048576, "depth": 1, "zero": true, 
>> "data": false}]
>>
>> because then we would be _accurately_ reporting that the zeroes that 
>> we read from 1m-2m come _because_ we read from mid (beyond EOF), which 
>> is different from our current answer that the zeroes come from top 
>> (they don't, because top deferred to mid). 
> 
> Right. This is exactly the logic which I bring to block_status_above and 
> is_allocated_above by this series
> 
>> If we fix up qemu-img map output to correctly report zeroes beyond EOF 
>> from the correct layer, will that also fix up the bug we are seeing in 
>> qemu-img commit?
>>
> 
> No it will not fix it, because img_map has own implementation of 
> block_status_above - get_block_status function in qemu-img.c, which goes 
> through backing chain by itself, and is used only in img_map (not in 
> img_convert). But you are right that it should be fixed too.

You are in a maze of twisty passages, all alike ;)

[Hope neither of us is eaten by a grue by the time we get this series in]
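
A minimal sketch of the depth attribution discussed above, as a toy Python model rather than QEMU code. The Layer class and map_depth helper are hypothetical names, sizes are counted in 1 MiB units, and attributing ranges that are unallocated everywhere to the deepest layer is an assumption of the model:

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """One image in the backing chain (toy model, not a QEMU structure)."""
    name: str
    size: int                               # virtual size in 1 MiB units
    data: set = field(default_factory=set)  # units allocated in this layer


def map_depth(chain, unit):
    """Return (depth, kind) for one unit; chain[0] is the top image."""
    if unit >= chain[0].size:
        raise ValueError("unit is beyond EOF of the top image")
    for depth, layer in enumerate(chain):
        if unit >= layer.size:
            # The layer above deferred to this one, but this layer is
            # short: zeroes are synthesized here and the walk stops.
            return depth, "zero"
        if unit in layer.data:
            return depth, "data"
    # Unallocated in every layer: reads as zero. Attributing it to the
    # deepest layer is a modelling assumption.
    return len(chain) - 1, "zero"


# The chain from the cover letter: 2M base full of data, 1M mid, 2M top.
chain = [Layer("top", 2), Layer("mid", 1), Layer("base", 2, {0, 1})]
for unit in range(2):
    print(unit, map_depth(chain, unit))
```

For that chain the model attributes 0..1M to depth 2 as data and 1M..2M to depth 1 as zeroes, which is the "what we really want" shape quoted above.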

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



* Re: [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above
  2020-05-19 21:48       ` Eric Blake
@ 2020-05-20  6:16         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-20  6:16 UTC (permalink / raw)
  To: Eric Blake, qemu-block; +Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

20.05.2020 00:48, Eric Blake wrote:
> On 5/19/20 4:13 PM, Vladimir Sementsov-Ogievskiy wrote:
>> 19.05.2020 23:41, Eric Blake wrote:
>>> On 5/19/20 2:54 PM, Vladimir Sementsov-Ogievskiy wrote:
>>>> bdrv_co_block_status_above has several problems with handling short
>>>> backing files:
>>>>
>>>> 1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
>>>> without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
>>>> which produces these after-EOF zeros is inside requested backing
>>>> sequence.
>>>
>>> That's intentional.  That portion of the guest-visible data reads as zero (BDRV_BLOCK_ZERO set) but was NOT read from the top layer, but rather synthesized by the block layer because it derived from the backing file but was beyond EOF of that backing layer (BDRV_BLOCK_ALLOCATED is clear).
>>
>> Not in top yes. But _inside_ the requested base..top backing-chain-part. So it should be considered ALLOCATED, as we should not go to further backing.
> 
> Yes, I think I figured that out by patch 5.
> 
>>
>> Assume the following chain:
>>
>> top    aa--
>> middle bb
>> base   xxxx
>>
>> (so, middle is short)
>>
>> block_status(top, 2) should return ZERO without ALLOCATED, as yes it's ZERO and yes, it's from another layer
>>
>> block_status_above(top, base, 2) should return ZERO with ALLOCATED, as it's ZERO, and it's produced inside requested backing-chain-region, actually, it's produced because of short middle node. We must report ALLOCATED to show that we are not going to read from base.
> 
> Yes, that matches my intuition.  allocated_above says "where in the chain did we get the data, since it did not come from top", and the correct answer is "we got it from middle, due to synthesizing zero beyond EOF".  Okay, with that understanding in place, maybe this patch is right.  But I'll have to revisit it tomorrow on a fresh mind (it's too late in the day for me to be sure that I'm getting it all straight right now).
> 
>>
>>>
>>>>
>>>> 2. With want_zero=false, it may return pnum=0 prior to actual EOF,
>>>> because of EOF of short backing file.
>>>
>>> Do you have a reproducer for this?
>>
>> No, I don't have one, but it seems possible at least with want_zero=false. I'll think of it tomorrow, too tired now.
>>
>>> In my experience, this is not possible.  Generally, if you request status that overlaps EOF of the backing, you get a response truncated to the end of the backing, and you are then likely to follow up with a subsequent status request starting from the underlying EOF which then sees the desired unallocated zeroes:
>>>
>>> back     xxxx
>>> top      yy------
>>> request    ^^^^^^
>>> response   ^^
>>> request      ^^^^
>>> response     ^^^^
> 
> If we can come up with a reproducer where allocated_above returns pnum=0, that would indeed prove my initial hesitation wrong (perhaps by:
> 
> back    xxxxxxxx
> mid1    xxxxxx
> mid2    xxxx
> mid3    xxxxxx
> top     xxxxxxxx
> 
> for various different start and base points within the chain?)

It seems we just don't have any users of bdrv_co_block_status_above with base points within the chain: base is always NULL or the backing of top. So I don't think we have a reproducer.
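
The semantics agreed on for the quoted top/middle/base example can be sketched with a toy model. This is Python, not the block/io.c code: Node, Flag, block_status and block_status_above here are simplified, hypothetical stand-ins named after the real functions and BDRV_BLOCK_* flags, and offsets are single units rather than byte ranges:

```python
import enum


class Flag(enum.Flag):
    NONE = 0
    DATA = enum.auto()
    ZERO = enum.auto()
    ALLOCATED = enum.auto()


class Node:
    """Toy image: a size, a set of allocated units, an optional backing."""
    def __init__(self, name, size, allocated, backing=None):
        self.name = name
        self.size = size
        self.allocated = set(allocated)
        self.backing = backing


def block_status(node, offset):
    """Status of one unit as seen by a single node."""
    if offset in node.allocated:
        return Flag.DATA | Flag.ALLOCATED
    if node.backing is None or offset >= node.backing.size:
        # Reads as zero, but the zero does not come from this node.
        return Flag.ZERO
    return Flag.NONE


def block_status_above(top, base, offset):
    """Status of one unit within the chain part top..base (base excluded)."""
    ret = Flag.NONE
    node = top
    while node is not base:
        if node is not top and offset >= node.size:
            # A short intermediate node ends the walk: the zeroes are
            # produced inside the requested range, so report ALLOCATED.
            return Flag.ZERO | Flag.ALLOCATED
        ret = block_status(node, offset)
        if Flag.ALLOCATED in ret:
            return ret
        node = node.backing
    return ret


# top aa-- / middle bb / base xxxx
base = Node("base", 4, {0, 1, 2, 3})
middle = Node("middle", 2, {0, 1}, backing=base)
top = Node("top", 4, {0, 1}, backing=middle)

print(block_status(top, 2))              # ZERO without ALLOCATED
print(block_status_above(top, base, 2))  # ZERO with ALLOCATED
```

In this model the ALLOCATED flag at offset 2 is what tells the caller not to go on reading from base.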

> 
>>>
>>>>
>>>> Fix these things, making logic about short backing files clearer.
>>>>
>>>> Note that 154 output changed, because now bdrv_block_status_above don't
>>>
>>> doesn't
>>>
>>>> merge unallocated zeros with zeros after EOF (which are actually
>>>> "allocated" in POV of read from backing-chain top) and is_zero() just
>>>> don't understand that the whole head or tail is zero. We may update
>>>> is_zero to call bdrv_block_status_above several times, or add flag to
>>>> bdrv_block_status_above that we are not interested in ALLOCATED flag,
>>>> so ranges with different ALLOCATED status may be merged, but actually,
>>>> it seems that we'd better don't care about this corner case.
>>>
>>> This actually sounds like an avoidable regression.  :(
>>
>> I don't see real problem in it. But it seems not hard to avoid it, so I will try to.
> 
> I guess my real reasoning is: "I spent a lot of time trying to tweak that test to not lose the fact that the tail of the image reads as zero", because it looks weird if we later resize the image but still have a glitch in the middle reporting one non-zero cluster out of a larger range all because of the shenanigans that occurred around the tail prior to resizing.
> 
>>>> +++ b/block/io.c
>>>> @@ -2461,25 +2461,45 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
>>>>           ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
>>>>                                      file);
>>>>           if (ret < 0) {
>>>> -            break;
>>>> +            return ret;
>>>>           }
>>>> -        if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
>>>> +        if (*pnum == 0) {
>>>> +            if (first) {
>>>> +                return ret;
>>>> +            }
>>>> +
>>>>               /*
>>>> -             * Reading beyond the end of the file continues to read
>>>> -             * zeroes, but we can only widen the result to the
>>>> -             * unallocated length we learned from an earlier
>>>> -             * iteration.
>>>> +             * Reads from bs for the selected region will return zeroes,
>>>> +             * produced because the current level is short. We should consider
>>>> +             * it as allocated.
>>>
>>> Why?  If we replaced the backing file to something longer (qemu-img rebase -u), we would WANT to read from the backing file.  The only reason we read zero is because the block layer synthesized it _while_ deferring to the backing layer, not because it was directly allocated in the top layer.
>>
>> No, if we replace backing file of the current layer, nothing will change, as _this_ layer is short, not the backing. Or which backing file do you mean? If you mean current bs, than replacing it doesn't make sense in the context, as block_status_above requested the current bs (as part of base..top range), not the other one.
> 
> Maybe it's just the comment wording that needs help.  After reading through patch 5, it looks like my problem is now coming up with a comment to the effect of "the top layer deferred to this layer, and because this layer is short, any zeroes that we synthesize beyond EOF behave as if they were allocated at this layer".
> 
>>
>>>
>>>> +             *
>>>> +             * TODO: Should we report p as file here?
>>>
>>> No. Reporting 'file' only makes sense if you can point to an offset within that file that would read the guest-visible data in question - but when the data is synthesized, there is no such offset.
>>
>> I don't know. It still adds some information about which level is responsible for these ZEROES. Kevin argued that it make sense.
> 
> It took me a while, but I'm coming around to it: my initial read was assuming that you were reporting that the tail was being claimed as allocated by top; but in reality, you are fixing things to claim it as being allocated by mid.  The former is wrong (top did not allocate, it deferred to mid); but the latter does indeed make sense (reading from mid ended up synthesizing, which means that our hunt for the data ends at mid and we never traverse deeper, regardless of whether base may also have data).  But now it's a question of whether the code matches that textual description, and I'm a bit too fried to answer that question properly today :)
> 
>>>> +++ b/tests/qemu-iotests/154.out
>>>> @@ -310,13 +310,13 @@ wrote 512/512 bytes at offset 134217728
>>>>   512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>>>   2048/2048 bytes allocated at offset 128 MiB
>>>>   [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
>>>> -{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
>>>> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
>>>
>>> The fact that we no longer see zeroes in the tail of the file makes me think this patch is wrong.
> 
> So, if we can avoid that minor regression, and still otherwise report zeroes as allocated from mid, then I think we'll be on the right track.
> 


-- 
Best regards,
Vladimir


Thread overview: 16+ messages
2020-05-19 19:54 [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
2020-05-19 19:54 ` [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above Vladimir Sementsov-Ogievskiy
2020-05-19 20:41   ` Eric Blake
2020-05-19 21:13     ` Vladimir Sementsov-Ogievskiy
2020-05-19 21:48       ` Eric Blake
2020-05-20  6:16         ` Vladimir Sementsov-Ogievskiy
2020-05-19 19:54 ` [PATCH v2 2/5] block/io: bdrv_common_block_status_above: support include_base Vladimir Sementsov-Ogievskiy
2020-05-19 19:54 ` [PATCH v2 3/5] block/io: bdrv_common_block_status_above: support bs == base Vladimir Sementsov-Ogievskiy
2020-05-19 19:55 ` [PATCH v2 4/5] block/io: fix bdrv_is_allocated_above Vladimir Sementsov-Ogievskiy
2020-05-19 20:45   ` Eric Blake
2020-05-19 19:55 ` [PATCH v2 5/5] iotests: add commit top->base cases to 274 Vladimir Sementsov-Ogievskiy
2020-05-19 21:13   ` Eric Blake
2020-05-19 21:25     ` Vladimir Sementsov-Ogievskiy
2020-05-19 21:49       ` Eric Blake
2020-05-19 20:21 ` [PATCH v2 0/5] fix & merge block_status_above and is_allocated_above Eric Blake
2020-05-19 20:28   ` Vladimir Sementsov-Ogievskiy