qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] fix & merge block_status_above and is_allocated_above
@ 2019-11-16 16:34 Vladimir Sementsov-Ogievskiy
  2019-11-16 16:34 ` [PATCH 1/4] block/io: fix bdrv_co_block_status_above Vladimir Sementsov-Ogievskiy
                   ` (10 more replies)
  0 siblings, 11 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-16 16:34 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

Hi all!

I wanted to understand, what is the real difference between bdrv_block_status_above
and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
bdrv_block_status_above..

And I found the problem: bdrv_is_allocated_above considers space after EOF as
UNALLOCATED for intermediate nodes..

UNALLOCATED is not about allocation at fs level, but about should we go to backing or
not.. And it seems incorrect for me, as in case of short backing file, we'll read
zeroes after EOF, instead of going further by backing chain.

This leads to the following effect:

./qemu-img create -f qcow2 base.qcow2 2M
./qemu-io -c "write -P 0x1 0 2M" base.qcow2

./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M

Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
./qemu-io -c "read -P 0 1M 1M" top.qcow2
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)

But after commit guest visible state is changed, which seems wrong for me:
./qemu-img commit top.qcow2 -b mid.qcow2

./qemu-io -c "read -P 0 1M 1M" mid.qcow2
Pattern verification failed at offset 1048576, 1048576 bytes
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)

./qemu-io -c "read -P 1 1M 1M" mid.qcow2
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)


I don't know, is it a real bug, as I don't know, do we support backing file larger than
its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
to other problems.

=====

Hmm, bdrv_block_allocated_above behaves strange too:

with want_zero=true, it may report unallocated zeroes because of short backing files, which
are actually "allocated" in POV of backing chains. But I see this may influence only
qemu-img compare, and I don't see can it trigger some bug..

with want_zero=false, it may do no progress because of short backing file. Moreover it may
report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
onlyt top layer, so it seems OK. 

=====

So, I propose these series, still I'm not sure is there a real bug.

Vladimir Sementsov-Ogievskiy (4):
  block/io: fix bdrv_co_block_status_above
  block/io: bdrv_common_block_status_above: support include_base
  block/io: bdrv_common_block_status_above: support bs == base
  block/io: fix bdrv_is_allocated_above

 block/io.c                 | 104 ++++++++++++++++++-------------------
 tests/qemu-iotests/154.out |   4 +-
 2 files changed, 53 insertions(+), 55 deletions(-)

-- 
2.21.0



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/4] block/io: fix bdrv_co_block_status_above
  2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
@ 2019-11-16 16:34 ` Vladimir Sementsov-Ogievskiy
  2019-11-25 16:00   ` Kevin Wolf
  2019-11-16 16:34 ` [PATCH 2/4] block/io: bdrv_common_block_status_above: support include_base Vladimir Sementsov-Ogievskiy
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-16 16:34 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

bdrv_co_block_status_above has several problems with handling short
backing files:

1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
which produces these after-EOF zeros is inside requested backing
sequesnce.

2. With want_zeros=false, it will just stop inside requested region, if
we have unallocated region in top node when underlying backing is
short.

Fix these things, making logic about short backing files clearer.

Note that 154 output changed, because now bdrv_block_status_above don't
merge unallocated zeros with zeros after EOF (which are actually
"allocated" in POV of read from backing-chain top) and is_zero() just
don't understand that the whole head or tail is zero. We may update
is_zero to call bdrv_block_status_above several times, or add flag to
bdrv_block_status_above that we are not interested in ALLOCATED flag,
so ranges with different ALLOCATED status may be merged, but actually,
it seems that we'd better don't care about this corner case.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c                 | 41 ++++++++++++++++++++++++++++----------
 tests/qemu-iotests/154.out |  4 ++--
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/block/io.c b/block/io.c
index f75777f5ea..4d7fa99bd2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2434,25 +2434,44 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
         ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
                                    file);
         if (ret < 0) {
-            break;
+            return ret;
         }
-        if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+        if (*pnum == 0) {
+            if (first) {
+                return ret;
+            }
+
             /*
-             * Reading beyond the end of the file continues to read
-             * zeroes, but we can only widen the result to the
-             * unallocated length we learned from an earlier
-             * iteration.
+             * Reads from bs for selected region will return zeroes, produced
+             * because current level is short. We should consider it as
+             * allocated.
+             *
+             * TODO: Should we report p as file here?
              */
+            assert(ret & BDRV_BLOCK_EOF);
             *pnum = bytes;
+            return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
         }
-        if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
-            break;
+        if (ret & BDRV_BLOCK_ALLOCATED) {
+            /* We've found the node and the status, we must return. */
+
+            if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+                /*
+                 * This level also responsible for reads after EOF inside
+                 * unallocated region in previous level.
+                 */
+                *pnum = bytes;
+            }
+
+            return ret;
         }
-        /* [offset, pnum] unallocated on this layer, which could be only
-         * the first part of [offset, bytes].  */
-        bytes = MIN(bytes, *pnum);
+
+        /* Proceed to backing */
+        assert(*pnum <= bytes);
+        bytes = *pnum;
         first = false;
     }
+
     return ret;
 }
 
diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
index fa3673317f..a203dfcadd 100644
--- a/tests/qemu-iotests/154.out
+++ b/tests/qemu-iotests/154.out
@@ -310,13 +310,13 @@ wrote 512/512 bytes at offset 134217728
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 2048/2048 bytes allocated at offset 128 MiB
 [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
-{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
 wrote 512/512 bytes at offset 134219264
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 2048/2048 bytes allocated at offset 128 MiB
 [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
-{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
 wrote 1024/1024 bytes at offset 134218240
 1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/4] block/io: bdrv_common_block_status_above: support include_base
  2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
  2019-11-16 16:34 ` [PATCH 1/4] block/io: fix bdrv_co_block_status_above Vladimir Sementsov-Ogievskiy
@ 2019-11-16 16:34 ` Vladimir Sementsov-Ogievskiy
  2019-11-25 16:19   ` Kevin Wolf
  2019-11-16 16:34 ` [PATCH 3/4] block/io: bdrv_common_block_status_above: support bs == base Vladimir Sementsov-Ogievskiy
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-16 16:34 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

In order to reuse bdrv_common_block_status_above in
bdrv_is_allocated_above, let's support include_base parameter.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index 4d7fa99bd2..df3ecf2430 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2196,6 +2196,7 @@ int bdrv_flush_all(void)
 typedef struct BdrvCoBlockStatusData {
     BlockDriverState *bs;
     BlockDriverState *base;
+    bool include_base;
     bool want_zero;
     int64_t offset;
     int64_t bytes;
@@ -2418,6 +2419,7 @@ early_out:
 
 static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
                                                    BlockDriverState *base,
+                                                   bool include_base,
                                                    bool want_zero,
                                                    int64_t offset,
                                                    int64_t bytes,
@@ -2429,8 +2431,8 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
     int ret = 0;
     bool first = true;
 
-    assert(bs != base);
-    for (p = bs; p != base; p = backing_bs(p)) {
+    assert(include_base || bs != base);
+    for (p = bs; include_base || p != base; p = backing_bs(p)) {
         ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
                                    file);
         if (ret < 0) {
@@ -2466,6 +2468,10 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
             return ret;
         }
 
+        if (p == base) {
+            break;
+        }
+
         /* Proceed to backing */
         assert(*pnum <= bytes);
         bytes = *pnum;
@@ -2481,7 +2487,7 @@ static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
     BdrvCoBlockStatusData *data = opaque;
 
     data->ret = bdrv_co_block_status_above(data->bs, data->base,
-                                           data->want_zero,
+                                           data->include_base, data->want_zero,
                                            data->offset, data->bytes,
                                            data->pnum, data->map, data->file);
     data->done = true;
@@ -2495,6 +2501,7 @@ static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
  */
 static int bdrv_common_block_status_above(BlockDriverState *bs,
                                           BlockDriverState *base,
+                                          bool include_base,
                                           bool want_zero, int64_t offset,
                                           int64_t bytes, int64_t *pnum,
                                           int64_t *map,
@@ -2504,6 +2511,7 @@ static int bdrv_common_block_status_above(BlockDriverState *bs,
     BdrvCoBlockStatusData data = {
         .bs = bs,
         .base = base,
+        .include_base = include_base,
         .want_zero = want_zero,
         .offset = offset,
         .bytes = bytes,
@@ -2528,7 +2536,7 @@ int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
                             int64_t offset, int64_t bytes, int64_t *pnum,
                             int64_t *map, BlockDriverState **file)
 {
-    return bdrv_common_block_status_above(bs, base, true, offset, bytes,
+    return bdrv_common_block_status_above(bs, base, false, true, offset, bytes,
                                           pnum, map, file);
 }
 
@@ -2545,7 +2553,7 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
     int ret;
     int64_t dummy;
 
-    ret = bdrv_common_block_status_above(bs, backing_bs(bs), false, offset,
+    ret = bdrv_common_block_status_above(bs, bs, true, false, offset,
                                          bytes, pnum ? pnum : &dummy, NULL,
                                          NULL);
     if (ret < 0) {
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/4] block/io: bdrv_common_block_status_above: support bs == base
  2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
  2019-11-16 16:34 ` [PATCH 1/4] block/io: fix bdrv_co_block_status_above Vladimir Sementsov-Ogievskiy
  2019-11-16 16:34 ` [PATCH 2/4] block/io: bdrv_common_block_status_above: support include_base Vladimir Sementsov-Ogievskiy
@ 2019-11-16 16:34 ` Vladimir Sementsov-Ogievskiy
  2019-11-25 16:23   ` Kevin Wolf
  2019-11-16 16:34 ` [PATCH 4/4] block/io: fix bdrv_is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-16 16:34 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

We are going to reuse bdrv_common_block_status_above in
bdrv_is_allocated_above. bdrv_is_allocated_above may be called with
include_base == false and still bs == base (for ex. from img_rebase()).

So, support this corner case.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index df3ecf2430..f05b2e8ecc 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2431,7 +2431,11 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
     int ret = 0;
     bool first = true;
 
-    assert(include_base || bs != base);
+    if (!include_base && bs == base) {
+        *pnum = bytes;
+        return 0;
+    }
+
     for (p = bs; include_base || p != base; p = backing_bs(p)) {
         ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
                                    file);
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 4/4] block/io: fix bdrv_is_allocated_above
  2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2019-11-16 16:34 ` [PATCH 3/4] block/io: bdrv_common_block_status_above: support bs == base Vladimir Sementsov-Ogievskiy
@ 2019-11-16 16:34 ` Vladimir Sementsov-Ogievskiy
  2019-11-19 10:22 ` [PATCH 0/4] fix & merge block_status_above and is_allocated_above Max Reitz
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-16 16:34 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

bdrv_is_allocated_above wrongly handles short backing files: it reports
after-EOF space as UNALLOCATED which is wrong, as on read the data is
generated on the level of short backing file (if all overlays has
unallocated area at that place).

Reusing bdrv_common_block_status_above fixes the issue and unifies code
path.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 43 +++++--------------------------------------
 1 file changed, 5 insertions(+), 38 deletions(-)

diff --git a/block/io.c b/block/io.c
index f05b2e8ecc..6946120587 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2581,52 +2581,19 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
  * at 'offset + *pnum' may return the same allocation status (in other
  * words, the result is not necessarily the maximum possible range);
  * but 'pnum' will only be 0 when end of file is reached.
- *
  */
 int bdrv_is_allocated_above(BlockDriverState *top,
                             BlockDriverState *base,
                             bool include_base, int64_t offset,
                             int64_t bytes, int64_t *pnum)
 {
-    BlockDriverState *intermediate;
-    int ret;
-    int64_t n = bytes;
-
-    assert(base || !include_base);
-
-    intermediate = top;
-    while (include_base || intermediate != base) {
-        int64_t pnum_inter;
-        int64_t size_inter;
-
-        assert(intermediate);
-        ret = bdrv_is_allocated(intermediate, offset, bytes, &pnum_inter);
-        if (ret < 0) {
-            return ret;
-        }
-        if (ret) {
-            *pnum = pnum_inter;
-            return 1;
-        }
-
-        size_inter = bdrv_getlength(intermediate);
-        if (size_inter < 0) {
-            return size_inter;
-        }
-        if (n > pnum_inter &&
-            (intermediate == top || offset + pnum_inter < size_inter)) {
-            n = pnum_inter;
-        }
-
-        if (intermediate == base) {
-            break;
-        }
-
-        intermediate = backing_bs(intermediate);
+    int ret = bdrv_common_block_status_above(top, base, include_base, false,
+                                             offset, bytes, pnum, NULL, NULL);
+    if (ret < 0) {
+        return ret;
     }
 
-    *pnum = n;
-    return 0;
+    return !!(ret & BDRV_BLOCK_ALLOCATED);
 }
 
 typedef struct BdrvVmstateCo {
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2019-11-16 16:34 ` [PATCH 4/4] block/io: fix bdrv_is_allocated_above Vladimir Sementsov-Ogievskiy
@ 2019-11-19 10:22 ` Max Reitz
  2019-11-19 12:02   ` Denis V. Lunev
  2019-11-19 12:05 ` Kevin Wolf
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Max Reitz @ 2019-11-19 10:22 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, stefanha, den


[-- Attachment #1.1: Type: text/plain, Size: 3646 bytes --]

On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> I wanted to understand, what is the real difference between bdrv_block_status_above
> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> zeroes after EOF, instead of going further by backing chain.

Should we, though?  It absolutely makes sense to me to consider post-EOF
space as unallocated because, well, it is as unallocated as it gets.

So from my POV it would make more sense to fall back to the backing file
for post-EOF reads.

OTOH, I don’t know whether changing that behavior would qualify as a
possible security issue now, because maybe someone has sensitive
information in the tail of some disk and then truncated the overlay so
as to hide it?  But honestly, that seems ridiculous and I can’t imagine
people to do that.  (It would work only for the tail, and why not just
write zeroes there, which works everywhere?)  So in practice I don’t
believe that to be a problem.

Max

> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing file larger than
> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> to other problems.
> 
> =====
> 
> Hmm, bdrv_block_allocated_above behaves strange too:
> 
> with want_zero=true, it may report unallocated zeroes because of short backing files, which
> are actually "allocated" in POV of backing chains. But I see this may influence only
> qemu-img compare, and I don't see can it trigger some bug..
> 
> with want_zero=false, it may do no progress because of short backing file. Moreover it may
> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
> onlyt top layer, so it seems OK. 
> 
> =====
> 
> So, I propose these series, still I'm not sure is there a real bug.
> 
> Vladimir Sementsov-Ogievskiy (4):
>   block/io: fix bdrv_co_block_status_above
>   block/io: bdrv_common_block_status_above: support include_base
>   block/io: bdrv_common_block_status_above: support bs == base
>   block/io: fix bdrv_is_allocated_above
> 
>  block/io.c                 | 104 ++++++++++++++++++-------------------
>  tests/qemu-iotests/154.out |   4 +-
>  2 files changed, 53 insertions(+), 55 deletions(-)
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-19 10:22 ` [PATCH 0/4] fix & merge block_status_above and is_allocated_above Max Reitz
@ 2019-11-19 12:02   ` Denis V. Lunev
  2019-11-19 12:12     ` Vladimir Sementsov-Ogievskiy
  2019-11-19 12:20     ` Max Reitz
  0 siblings, 2 replies; 35+ messages in thread
From: Denis V. Lunev @ 2019-11-19 12:02 UTC (permalink / raw)
  To: Max Reitz, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, stefanha

On 11/19/19 1:22 PM, Max Reitz wrote:
> On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all!
>>
>> I wanted to understand, what is the real difference between bdrv_block_status_above
>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>> bdrv_block_status_above..
>>
>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>> UNALLOCATED for intermediate nodes..
>>
>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>> zeroes after EOF, instead of going further by backing chain.
> Should we, though?  It absolutely makes sense to me to consider post-EOF
> space as unallocated because, well, it is as unallocated as it gets.
>
> So from my POV it would make more sense to fall back to the backing file
> for post-EOF reads.
>
> OTOH, I don’t know whether changing that behavior would qualify as a
> possible security issue now, because maybe someone has sensitive
> information in the tail of some disk and then truncated the overlay so
> as to hide it?  But honestly, that seems ridiculous and I can’t imagine
> people to do that.  (It would work only for the tail, and why not just
> write zeroes there, which works everywhere?)  So in practice I don’t
> believe that to be a problem.
>
> Max

That seems to be wrong from my POW. Once we get block device truncated,
it exposed that tail to the guest with all zeroes.

Let us assume that we have virtual disk of length L. We create new top
delta of
length X (less then L) and new top delta after with length Y (more than L),
like the following:

[.........................] Y
[........] X
[...................] L

Once the guest creates FS  on state Y it relies on the fact that data from X
to Y is all zeroes.

Any operations with backing chain must keep guest content to be tha same,
i.e. if we commit from Y to L, virtual disk content should be preserved,
i.e.
read as all zero even if there is some data in L from X to L.

If we commit from X to Y, the range from X to L should remain all zeroes.

This is especially valid for backups, which can not be changed and are
validated by the software from time to time.

Does this makes sense?

Den


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (4 preceding siblings ...)
  2019-11-19 10:22 ` [PATCH 0/4] fix & merge block_status_above and is_allocated_above Max Reitz
@ 2019-11-19 12:05 ` Kevin Wolf
  2019-11-19 12:17   ` Vladimir Sementsov-Ogievskiy
  2019-11-19 14:54 ` Kevin Wolf
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Kevin Wolf @ 2019-11-19 12:05 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Hi all!
> 
> I wanted to understand, what is the real difference between
> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
> bdrv_is_allocated_above should work through bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after
> EOF as UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we
> go to backing or not.. And it seems incorrect for me, as in case of
> short backing file, we'll read zeroes after EOF, instead of going
> further by backing chain.

We actually have documentation what it means:

 * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
 *                       layer rather than any backing, set by block layer

Say we have a short overlay, like this:

base.qcow2:     AAAAAAAA
overlay.qcow2:  BBBB

Then of course the content of block 5 (the one after EOF of
overlay.qcow2) is still determined by overlay.qcow2, which can be easily
verified by reading it from overlay.qcow2 (produces zeros) and from
base.qcow2 (produces As).

So the correct result when querying the block status of block 5 on
overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.

Interestingly, we already fixed the opposite case (large overlay over
short backing file) in commit e88ae2264d9 from May 2014 according to the
same logic.

> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing
> file larger than its parent. Still, I'm not sure that this behavior of
> bdrv_is_allocated_above don't lead to other problems.

I agree it's a bug.

Your fix doesn't look right to me, though. You leave the buggy behaviour
of bdrv_co_block_status() as it is and then add four patches to work
around it in some (but not all) callers of it.

All that it should take to fix this is making the bs->backing check
independent from want_zero and let it set ALLOCATED. What I expected
would be something like the below patch.

But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
qemu-io shows that the range is now considered allocated), so probably
there is still a separate bug in bdrv_is_allocated_above().

And I think we'll want an iotests case for both cases (short overlay,
short backing file).

Kevin


diff --git a/block/io.c b/block/io.c
index f75777f5ea..5eafcff01a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
 
     if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
         ret |= BDRV_BLOCK_ALLOCATED;
-    } else if (want_zero) {
-        if (bdrv_unallocated_blocks_are_zero(bs)) {
-            ret |= BDRV_BLOCK_ZERO;
-        } else if (bs->backing) {
-            BlockDriverState *bs2 = bs->backing->bs;
-            int64_t size2 = bdrv_getlength(bs2);
-
-            if (size2 >= 0 && offset >= size2) {
+    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
+        ret |= BDRV_BLOCK_ZERO;
+    } else if (bs->backing) {
+        BlockDriverState *bs2 = bs->backing->bs;
+        int64_t size2 = bdrv_getlength(bs2);
+
+        if (size2 >= 0 && offset >= size2) {
+            if (want_zero) {
                 ret |= BDRV_BLOCK_ZERO;
             }
+            ret |= BDRV_BLOCK_ALLOCATED;
         }
     }
 



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-19 12:02   ` Denis V. Lunev
@ 2019-11-19 12:12     ` Vladimir Sementsov-Ogievskiy
  2019-11-19 12:20     ` Max Reitz
  1 sibling, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-19 12:12 UTC (permalink / raw)
  To: Denis Lunev, Max Reitz, qemu-block; +Cc: kwolf, fam, qemu-devel, stefanha

19.11.2019 15:02, Denis Lunev wrote:
> On 11/19/19 1:22 PM, Max Reitz wrote:
>> On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>> bdrv_block_status_above..
>>>
>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>> UNALLOCATED for intermediate nodes..
>>>
>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>> zeroes after EOF, instead of going further by backing chain.
>> Should we, though?  It absolutely makes sense to me to consider post-EOF
>> space as unallocated because, well, it is as unallocated as it gets.
>>
>> So from my POV it would make more sense to fall back to the backing file
>> for post-EOF reads.
>>
>> OTOH, I don’t know whether changing that behavior would qualify as a
>> possible security issue now, because maybe someone has sensitive
>> information in the tail of some disk and then truncated the overlay so
>> as to hide it?  But honestly, that seems ridiculous and I can’t imagine
>> people to do that.  (It would work only for the tail, and why not just
>> write zeroes there, which works everywhere?)  So in practice I don’t
>> believe that to be a problem.
>>
>> Max
> 
> That seems to be wrong from my POW. Once we get block device truncated,
> it exposed that tail to the guest with all zeroes.
> 
> Let us assume that we have virtual disk of length L. We create new top
> delta of
> length X (less then L) and new top delta after with length Y (more than L),
> like the following:
> 
> [.........................] Y
> [........] X
> [...................] L
> 
> Once the guest creates FS  on state Y it relies on the fact that data from X
> to Y is all zeroes.
> 
> Any operations with backing chain must keep guest content to be tha same,
> i.e. if we commit from Y to L, virtual disk content should be preserved,
> i.e.
> read as all zero even if there is some data in L from X to L.
> 
> If we commit from X to Y, the range from X to L should remain all zeroes.
> 
> This is especially valid for backups, which can not be changed and are
> validated by the software from time to time.
> 
> Does this makes sense?
> 
> Den
> 

Yes, if we consider space after EOF as unallocated, incremental backups are broken,
consider the following sequence:

starting with disk of size L
full backup -> state BASE
shrink disk to size X
incremental backup to empty qcow2 of size X, with backing to BASE -> state INC1
expand disk to size Y
incremental backup to empty qcow2 of size Y, with backing to INC1 -> state INC2

Now, if we read from backup INC2, we'll see data from BASE in range [X, L], which
we must not see.


INC2 [.........................] Y
INC1 [........] X
BASE [...................] L


Also, this example shows, that these series fixes real use-case: merge of some
incremental backups (by commit).

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-19 12:05 ` Kevin Wolf
@ 2019-11-19 12:17   ` Vladimir Sementsov-Ogievskiy
  2019-11-19 12:32     ` Vladimir Sementsov-Ogievskiy
  2019-11-19 14:21     ` Kevin Wolf
  0 siblings, 2 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-19 12:17 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

19.11.2019 15:05, Kevin Wolf wrote:
> Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> Hi all!
>>
>> I wanted to understand, what is the real difference between
>> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
>> bdrv_is_allocated_above should work through bdrv_block_status_above..
>>
>> And I found the problem: bdrv_is_allocated_above considers space after
>> EOF as UNALLOCATED for intermediate nodes..
>>
>> UNALLOCATED is not about allocation at fs level, but about should we
>> go to backing or not.. And it seems incorrect for me, as in case of
>> short backing file, we'll read zeroes after EOF, instead of going
>> further by backing chain.
> 
> We actually have documentation what it means:
> 
>   * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>   *                       layer rather than any backing, set by block layer
> 
> Say we have a short overlay, like this:
> 
> base.qcow2:     AAAAAAAA
> overlay.qcow2:  BBBB
> 
> Then of course the content of block 5 (the one after EOF of
> overlay.qcow2) is still determined by overlay.qcow2, which can be easily
> verified by reading it from overlay.qcow2 (produces zeros) and from
> base.qcow2 (produces As).
> 
> So the correct result when querying the block status of block 5 on
> overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.
> 
> Interestingly, we already fixed the opposite case (large overlay over
> short backing file) in commit e88ae2264d9 from May 2014 according to the
> same logic.
> 
>> This leads to the following effect:
>>
>> ./qemu-img create -f qcow2 base.qcow2 2M
>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>
>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>
>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>
>> But after commit guest visible state is changed, which seems wrong for me:
>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>
>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>> Pattern verification failed at offset 1048576, 1048576 bytes
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>
>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>
>>
>> I don't know, is it a real bug, as I don't know, do we support backing
>> file larger than its parent. Still, I'm not sure that this behavior of
>> bdrv_is_allocated_above don't lead to other problems.
> 
> I agree it's a bug.
> 
> Your fix doesn't look right to me, though. You leave the buggy behaviour
> of bdrv_co_block_status() as it is and then add four patches to work
> around it in some (but not all) callers of it.
> 
> All that it should take to fix this is making the bs->backing check
> independent from want_zero and let it set ALLOCATED. What I expected
> would be something like the below patch.
> 
> But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
> qemu-io shows that the range is now considered allocated), so probably
> there is still a separate bug in bdrv_is_allocated_above().
> 
> And I think we'll want an iotests case for both cases (short overlay,
> short backing file).
> 
> Kevin
> 
> 
> diff --git a/block/io.c b/block/io.c
> index f75777f5ea..5eafcff01a 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>   
>       if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
>           ret |= BDRV_BLOCK_ALLOCATED;
> -    } else if (want_zero) {
> -        if (bdrv_unallocated_blocks_are_zero(bs)) {
> -            ret |= BDRV_BLOCK_ZERO;
> -        } else if (bs->backing) {
> -            BlockDriverState *bs2 = bs->backing->bs;
> -            int64_t size2 = bdrv_getlength(bs2);
> -
> -            if (size2 >= 0 && offset >= size2) {
> +    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
> +        ret |= BDRV_BLOCK_ZERO;
> +    } else if (bs->backing) {
> +        BlockDriverState *bs2 = bs->backing->bs;
> +        int64_t size2 = bdrv_getlength(bs2);
> +
> +        if (size2 >= 0 && offset >= size2) {
> +            if (want_zero) {
>                   ret |= BDRV_BLOCK_ZERO;
>               }
> +            ret |= BDRV_BLOCK_ALLOCATED;

No, I think it's wrong, as it's not allocated at bs, but at bs->backing->bs.


So, bdrv_co_block_status works correct, what we can change about it, is not
to return pnum=0 if requested range after eof, but return pnum=bytes, together
with BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF.

And actual problem is in bdrv_block_status_above and bdrv_is_allocated_above, which
I'm fixing.



-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-19 12:02   ` Denis V. Lunev
  2019-11-19 12:12     ` Vladimir Sementsov-Ogievskiy
@ 2019-11-19 12:20     ` Max Reitz
  2019-11-19 12:30       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 35+ messages in thread
From: Max Reitz @ 2019-11-19 12:20 UTC (permalink / raw)
  To: Denis V. Lunev, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, stefanha


[-- Attachment #1.1: Type: text/plain, Size: 2663 bytes --]

On 19.11.19 13:02, Denis V. Lunev wrote:
> On 11/19/19 1:22 PM, Max Reitz wrote:
>> On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>> bdrv_block_status_above..
>>>
>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>> UNALLOCATED for intermediate nodes..
>>>
>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>> zeroes after EOF, instead of going further by backing chain.
>> Should we, though?  It absolutely makes sense to me to consider post-EOF
>> space as unallocated because, well, it is as unallocated as it gets.
>>
>> So from my POV it would make more sense to fall back to the backing file
>> for post-EOF reads.
>>
>> OTOH, I don’t know whether changing that behavior would qualify as a
>> possible security issue now, because maybe someone has sensitive
>> information in the tail of some disk and then truncated the overlay so
>> as to hide it?  But honestly, that seems ridiculous and I can’t imagine
>> people to do that.  (It would work only for the tail, and why not just
>> write zeroes there, which works everywhere?)  So in practice I don’t
>> believe that to be a problem.
>>
>> Max
> 
> That seems to be wrong from my POW. Once we get block device truncated,
> it exposed that tail to the guest with all zeroes.
> 
> Let us assume that we have virtual disk of length L. We create new top
> delta of
> length X (less then L) and new top delta after with length Y (more than L),
> like the following:
> 
> [.........................] Y
> [........] X
> [...................] L
> 
> Once the guest creates FS  on state Y it relies on the fact that data from X
> to Y is all zeroes.
> 
> Any operations with backing chain must keep guest content to be tha same,
> i.e. if we commit from Y to L, virtual disk content should be preserved,
> i.e.
> read as all zero even if there is some data in L from X to L.
> 
> If we commit from X to Y, the range from X to L should remain all zeroes.
> 
> This is especially valid for backups, which can not be changed and are
> validated by the software from time to time.
> 
> Does this makes sense?

All right then.  But then there’s the case of commit not shrinking the
backing file, so the guest content won’t be the same if you commit a
short overlay into a longer backing file.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-19 12:20     ` Max Reitz
@ 2019-11-19 12:30       ` Vladimir Sementsov-Ogievskiy
  2019-11-19 13:28         ` Kevin Wolf
  0 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-19 12:30 UTC (permalink / raw)
  To: Max Reitz, Denis Lunev, qemu-block; +Cc: kwolf, fam, qemu-devel, stefanha

19.11.2019 15:20, Max Reitz wrote:
> On 19.11.19 13:02, Denis V. Lunev wrote:
>> On 11/19/19 1:22 PM, Max Reitz wrote:
>>> On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>>> bdrv_block_status_above..
>>>>
>>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>>> UNALLOCATED for intermediate nodes..
>>>>
>>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>>> zeroes after EOF, instead of going further by backing chain.
>>> Should we, though?  It absolutely makes sense to me to consider post-EOF
>>> space as unallocated because, well, it is as unallocated as it gets.
>>>
>>> So from my POV it would make more sense to fall back to the backing file
>>> for post-EOF reads.
>>>
>>> OTOH, I don’t know whether changing that behavior would qualify as a
>>> possible security issue now, because maybe someone has sensitive
>>> information in the tail of some disk and then truncated the overlay so
>>> as to hide it?  But honestly, that seems ridiculous and I can’t imagine
>>> people to do that.  (It would work only for the tail, and why not just
>>> write zeroes there, which works everywhere?)  So in practice I don’t
>>> believe that to be a problem.
>>>
>>> Max
>>
>> That seems to be wrong from my POW. Once we get block device truncated,
>> it exposed that tail to the guest with all zeroes.
>>
>> Let us assume that we have virtual disk of length L. We create new top
>> delta of
>> length X (less then L) and new top delta after with length Y (more than L),
>> like the following:
>>
>> [.........................] Y
>> [........] X
>> [...................] L
>>
>> Once the guest creates FS  on state Y it relies on the fact that data from X
>> to Y is all zeroes.
>>
>> Any operations with backing chain must keep guest content to be tha same,
>> i.e. if we commit from Y to L, virtual disk content should be preserved,
>> i.e.
>> read as all zero even if there is some data in L from X to L.
>>
>> If we commit from X to Y, the range from X to L should remain all zeroes.
>>
>> This is especially valid for backups, which can not be changed and are
>> validated by the software from time to time.
>>
>> Does this makes sense?
> 
> All right then.  But then there’s the case of commit not shrinking the
> backing file, so the guest content won’t be the same if you commit a
> short overlay into a longer backing file.
> 

Hmm. Isn't commit target truncated to source before operation?


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-19 12:17   ` Vladimir Sementsov-Ogievskiy
@ 2019-11-19 12:32     ` Vladimir Sementsov-Ogievskiy
  2019-11-19 12:34       ` Vladimir Sementsov-Ogievskiy
  2019-11-19 14:21     ` Kevin Wolf
  1 sibling, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-19 12:32 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

19.11.2019 15:17, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2019 15:05, Kevin Wolf wrote:
>> Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>> Hi all!
>>>
>>> I wanted to understand, what is the real difference between
>>> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
>>> bdrv_is_allocated_above should work through bdrv_block_status_above..
>>>
>>> And I found the problem: bdrv_is_allocated_above considers space after
>>> EOF as UNALLOCATED for intermediate nodes..
>>>
>>> UNALLOCATED is not about allocation at fs level, but about should we
>>> go to backing or not.. And it seems incorrect for me, as in case of
>>> short backing file, we'll read zeroes after EOF, instead of going
>>> further by backing chain.
>>
>> We actually have documentation what it means:
>>
>>   * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>   *                       layer rather than any backing, set by block layer
>>
>> Say we have a short overlay, like this:
>>
>> base.qcow2:     AAAAAAAA
>> overlay.qcow2:  BBBB
>>
>> Then of course the content of block 5 (the one after EOF of
>> overlay.qcow2) is still determined by overlay.qcow2, which can be easily
>> verified by reading it from overlay.qcow2 (produces zeros) and from
>> base.qcow2 (produces As).
>>
>> So the correct result when querying the block status of block 5 on
>> overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.
>>
>> Interestingly, we already fixed the opposite case (large overlay over
>> short backing file) in commit e88ae2264d9 from May 2014 according to the
>> same logic.
>>
>>> This leads to the following effect:
>>>
>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>
>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>
>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>
>>> But after commit guest visible state is changed, which seems wrong for me:
>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>
>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>
>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>
>>>
>>> I don't know, is it a real bug, as I don't know, do we support backing
>>> file larger than its parent. Still, I'm not sure that this behavior of
>>> bdrv_is_allocated_above don't lead to other problems.
>>
>> I agree it's a bug.
>>
>> Your fix doesn't look right to me, though. You leave the buggy behaviour
>> of bdrv_co_block_status() as it is and then add four patches to work
>> around it in some (but not all) callers of it.
>>
>> All that it should take to fix this is making the bs->backing check
>> independent from want_zero and let it set ALLOCATED. What I expected
>> would be something like the below patch.
>>
>> But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
>> qemu-io shows that the range is now considered allocated), so probably
>> there is still a separate bug in bdrv_is_allocated_above().
>>
>> And I think we'll want an iotests case for both cases (short overlay,
>> short backing file).
>>
>> Kevin
>>
>>
>> diff --git a/block/io.c b/block/io.c
>> index f75777f5ea..5eafcff01a 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>>       if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
>>           ret |= BDRV_BLOCK_ALLOCATED;
>> -    } else if (want_zero) {
>> -        if (bdrv_unallocated_blocks_are_zero(bs)) {
>> -            ret |= BDRV_BLOCK_ZERO;
>> -        } else if (bs->backing) {
>> -            BlockDriverState *bs2 = bs->backing->bs;
>> -            int64_t size2 = bdrv_getlength(bs2);
>> -
>> -            if (size2 >= 0 && offset >= size2) {
>> +    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
>> +        ret |= BDRV_BLOCK_ZERO;
>> +    } else if (bs->backing) {
>> +        BlockDriverState *bs2 = bs->backing->bs;
>> +        int64_t size2 = bdrv_getlength(bs2);
>> +
>> +        if (size2 >= 0 && offset >= size2) {
>> +            if (want_zero) {
>>                   ret |= BDRV_BLOCK_ZERO;
>>               }
>> +            ret |= BDRV_BLOCK_ALLOCATED;
> 
> No, I think it's wrong, as it's not allocated at bs, but at bs->backing->bs.
> 
> 
> So, bdrv_co_block_status works correct, what we can change about it, is not
> to return pnum=0 if requested range after eof, but return pnum=bytes, together
> with BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF.

But this will break users which rely on pnum=0.

> 
> And actual problem is in bdrv_block_status_above and bdrv_is_allocated_above, which
> I'm fixing.
> 
> 
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-19 12:32     ` Vladimir Sementsov-Ogievskiy
@ 2019-11-19 12:34       ` Vladimir Sementsov-Ogievskiy
  2019-11-19 12:49         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-19 12:34 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

19.11.2019 15:32, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2019 15:17, Vladimir Sementsov-Ogievskiy wrote:
>> 19.11.2019 15:05, Kevin Wolf wrote:
>>> Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> Hi all!
>>>>
>>>> I wanted to understand, what is the real difference between
>>>> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
>>>> bdrv_is_allocated_above should work through bdrv_block_status_above..
>>>>
>>>> And I found the problem: bdrv_is_allocated_above considers space after
>>>> EOF as UNALLOCATED for intermediate nodes..
>>>>
>>>> UNALLOCATED is not about allocation at fs level, but about should we
>>>> go to backing or not.. And it seems incorrect for me, as in case of
>>>> short backing file, we'll read zeroes after EOF, instead of going
>>>> further by backing chain.
>>>
>>> We actually have documentation what it means:
>>>
>>>   * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>>   *                       layer rather than any backing, set by block layer
>>>
>>> Say we have a short overlay, like this:
>>>
>>> base.qcow2:     AAAAAAAA
>>> overlay.qcow2:  BBBB
>>>
>>> Then of course the content of block 5 (the one after EOF of
>>> overlay.qcow2) is still determined by overlay.qcow2, which can be easily
>>> verified by reading it from overlay.qcow2 (produces zeros) and from
>>> base.qcow2 (produces As).
>>>
>>> So the correct result when querying the block status of block 5 on
>>> overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.
>>>
>>> Interestingly, we already fixed the opposite case (large overlay over
>>> short backing file) in commit e88ae2264d9 from May 2014 according to the
>>> same logic.
>>>
>>>> This leads to the following effect:
>>>>
>>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>>
>>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>>
>>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>>
>>>> But after commit guest visible state is changed, which seems wrong for me:
>>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>>
>>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>>
>>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>>
>>>>
>>>> I don't know, is it a real bug, as I don't know, do we support backing
>>>> file larger than its parent. Still, I'm not sure that this behavior of
>>>> bdrv_is_allocated_above don't lead to other problems.
>>>
>>> I agree it's a bug.
>>>
>>> Your fix doesn't look right to me, though. You leave the buggy behaviour
>>> of bdrv_co_block_status() as it is and then add four patches to work
>>> around it in some (but not all) callers of it.
>>>
>>> All that it should take to fix this is making the bs->backing check
>>> independent from want_zero and let it set ALLOCATED. What I expected
>>> would be something like the below patch.
>>>
>>> But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
>>> qemu-io shows that the range is now considered allocated), so probably
>>> there is still a separate bug in bdrv_is_allocated_above().
>>>
>>> And I think we'll want an iotests case for both cases (short overlay,
>>> short backing file).
>>>
>>> Kevin
>>>
>>>
>>> diff --git a/block/io.c b/block/io.c
>>> index f75777f5ea..5eafcff01a 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>>>       if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
>>>           ret |= BDRV_BLOCK_ALLOCATED;
>>> -    } else if (want_zero) {
>>> -        if (bdrv_unallocated_blocks_are_zero(bs)) {
>>> -            ret |= BDRV_BLOCK_ZERO;
>>> -        } else if (bs->backing) {
>>> -            BlockDriverState *bs2 = bs->backing->bs;
>>> -            int64_t size2 = bdrv_getlength(bs2);
>>> -
>>> -            if (size2 >= 0 && offset >= size2) {
>>> +    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
>>> +        ret |= BDRV_BLOCK_ZERO;
>>> +    } else if (bs->backing) {
>>> +        BlockDriverState *bs2 = bs->backing->bs;
>>> +        int64_t size2 = bdrv_getlength(bs2);
>>> +
>>> +        if (size2 >= 0 && offset >= size2) {
>>> +            if (want_zero) {
>>>                   ret |= BDRV_BLOCK_ZERO;
>>>               }
>>> +            ret |= BDRV_BLOCK_ALLOCATED;
>>
>> No, I think it's wrong, as it's not allocated at bs, but at bs->backing->bs.
>>
>>
>> So, bdrv_co_block_status works correct, what we can change about it, is not
>> to return pnum=0 if requested range after eof, but return pnum=bytes, together
>> with BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF.
> 
> But this will break users which rely on pnum=0.

Possibly, we may add flag to bdrv_co_block_status, to behave in such manner, and
set it to true when call from bdrv_block_status_above/bdrv_is_allocated_above.

> 
>>
>> And actual problem is in bdrv_block_status_above and bdrv_is_allocated_above, which
>> I'm fixing.
>>
>>
>>
> 
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-19 12:34       ` Vladimir Sementsov-Ogievskiy
@ 2019-11-19 12:49         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-19 12:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

19.11.2019 15:34, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2019 15:32, Vladimir Sementsov-Ogievskiy wrote:
>> 19.11.2019 15:17, Vladimir Sementsov-Ogievskiy wrote:
>>> 19.11.2019 15:05, Kevin Wolf wrote:
>>>> Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>> Hi all!
>>>>>
>>>>> I wanted to understand, what is the real difference between
>>>>> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
>>>>> bdrv_is_allocated_above should work through bdrv_block_status_above..
>>>>>
>>>>> And I found the problem: bdrv_is_allocated_above considers space after
>>>>> EOF as UNALLOCATED for intermediate nodes..
>>>>>
>>>>> UNALLOCATED is not about allocation at fs level, but about should we
>>>>> go to backing or not.. And it seems incorrect for me, as in case of
>>>>> short backing file, we'll read zeroes after EOF, instead of going
>>>>> further by backing chain.
>>>>
>>>> We actually have documentation what it means:
>>>>
>>>>   * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>>>   *                       layer rather than any backing, set by block layer
>>>>
>>>> Say we have a short overlay, like this:
>>>>
>>>> base.qcow2:     AAAAAAAA
>>>> overlay.qcow2:  BBBB
>>>>
>>>> Then of course the content of block 5 (the one after EOF of
>>>> overlay.qcow2) is still determined by overlay.qcow2, which can be easily
>>>> verified by reading it from overlay.qcow2 (produces zeros) and from
>>>> base.qcow2 (produces As).
>>>>
>>>> So the correct result when querying the block status of block 5 on
>>>> overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.
>>>>
>>>> Interestingly, we already fixed the opposite case (large overlay over
>>>> short backing file) in commit e88ae2264d9 from May 2014 according to the
>>>> same logic.
>>>>
>>>>> This leads to the following effect:
>>>>>
>>>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>>>
>>>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>>>
>>>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>>>
>>>>> But after commit guest visible state is changed, which seems wrong for me:
>>>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>>>
>>>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>>>
>>>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>>>
>>>>>
>>>>> I don't know, is it a real bug, as I don't know, do we support backing
>>>>> file larger than its parent. Still, I'm not sure that this behavior of
>>>>> bdrv_is_allocated_above don't lead to other problems.
>>>>
>>>> I agree it's a bug.
>>>>
>>>> Your fix doesn't look right to me, though. You leave the buggy behaviour
>>>> of bdrv_co_block_status() as it is and then add four patches to work
>>>> around it in some (but not all) callers of it.

The only caller is bdrv_co_block_status_above.

>>>>
>>>> All that it should take to fix this is making the bs->backing check
>>>> independent from want_zero and let it set ALLOCATED. What I expected
>>>> would be something like the below patch.
>>>>
>>>> But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
>>>> qemu-io shows that the range is now considered allocated), so probably
>>>> there is still a separate bug in bdrv_is_allocated_above().
>>>>
>>>> And I think we'll want an iotests case for both cases (short overlay,
>>>> short backing file).
>>>>
>>>> Kevin
>>>>
>>>>
>>>> diff --git a/block/io.c b/block/io.c
>>>> index f75777f5ea..5eafcff01a 100644
>>>> --- a/block/io.c
>>>> +++ b/block/io.c
>>>> @@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>>>>       if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
>>>>           ret |= BDRV_BLOCK_ALLOCATED;
>>>> -    } else if (want_zero) {
>>>> -        if (bdrv_unallocated_blocks_are_zero(bs)) {
>>>> -            ret |= BDRV_BLOCK_ZERO;
>>>> -        } else if (bs->backing) {
>>>> -            BlockDriverState *bs2 = bs->backing->bs;
>>>> -            int64_t size2 = bdrv_getlength(bs2);
>>>> -
>>>> -            if (size2 >= 0 && offset >= size2) {
>>>> +    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
>>>> +        ret |= BDRV_BLOCK_ZERO;
>>>> +    } else if (bs->backing) {
>>>> +        BlockDriverState *bs2 = bs->backing->bs;
>>>> +        int64_t size2 = bdrv_getlength(bs2);
>>>> +
>>>> +        if (size2 >= 0 && offset >= size2) {
>>>> +            if (want_zero) {
>>>>                   ret |= BDRV_BLOCK_ZERO;
>>>>               }
>>>> +            ret |= BDRV_BLOCK_ALLOCATED;
>>>
>>> No, I think it's wrong, as it's not allocated at bs, but at bs->backing->bs.
>>>
>>>
>>> So, bdrv_co_block_status works correct, what we can change about it, is not
>>> to return pnum=0 if requested range after eof, but return pnum=bytes, together
>>> with BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF.
>>
>> But this will break users which rely on pnum=0.
> 
> Possibly, we may add flag to bdrv_co_block_status, to behave in such manner, and
> set it to true when call from bdrv_block_status_above/bdrv_is_allocated_above.
> 
>>
>>>
>>> And actual problem is in bdrv_block_status_above and bdrv_is_allocated_above, which
>>> I'm fixing.
>>>
>>>
>>>
>>
>>
> 
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-19 12:30       ` Vladimir Sementsov-Ogievskiy
@ 2019-11-19 13:28         ` Kevin Wolf
  0 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2019-11-19 13:28 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, Denis Lunev, qemu-block, qemu-devel, Max Reitz, stefanha

Am 19.11.2019 um 13:30 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 19.11.2019 15:20, Max Reitz wrote:
> > On 19.11.19 13:02, Denis V. Lunev wrote:
> >> On 11/19/19 1:22 PM, Max Reitz wrote:
> >>> On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
> >>>> Hi all!
> >>>>
> >>>> I wanted to understand, what is the real difference between bdrv_block_status_above
> >>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> >>>> bdrv_block_status_above..
> >>>>
> >>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> >>>> UNALLOCATED for intermediate nodes..
> >>>>
> >>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> >>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> >>>> zeroes after EOF, instead of going further by backing chain.
> >>> Should we, though?  It absolutely makes sense to me to consider post-EOF
> >>> space as unallocated because, well, it is as unallocated as it gets.
> >>>
> >>> So from my POV it would make more sense to fall back to the backing file
> >>> for post-EOF reads.
> >>>
> >>> OTOH, I don’t know whether changing that behavior would qualify as a
> >>> possible security issue now, because maybe someone has sensitive
> >>> information in the tail of some disk and then truncated the overlay so
> >>> as to hide it?  But honestly, that seems ridiculous and I can’t imagine
> >>> people to do that.  (It would work only for the tail, and why not just
> >>> write zeroes there, which works everywhere?)  So in practice I don’t
> >>> believe that to be a problem.
> >>>
> >>> Max
> >>
> >> That seems to be wrong from my POW. Once we get block device truncated,
> >> it exposed that tail to the guest with all zeroes.
> >>
> >> Let us assume that we have virtual disk of length L. We create new top
> >> delta of
> >> length X (less then L) and new top delta after with length Y (more than L),
> >> like the following:
> >>
> >> [.........................] Y
> >> [........] X
> >> [...................] L
> >>
> >> Once the guest creates FS  on state Y it relies on the fact that data from X
> >> to Y is all zeroes.
> >>
> >> Any operations with backing chain must keep guest content to be tha same,
> >> i.e. if we commit from Y to L, virtual disk content should be preserved,
> >> i.e.
> >> read as all zero even if there is some data in L from X to L.
> >>
> >> If we commit from X to Y, the range from X to L should remain all zeroes.
> >>
> >> This is especially valid for backups, which can not be changed and are
> >> validated by the software from time to time.
> >>
> >> Does this makes sense?
> > 
> > All right then.  But then there’s the case of commit not shrinking the
> > backing file, so the guest content won’t be the same if you commit a
> > short overlay into a longer backing file.
> 
> Hmm. Isn't commit target truncated to source before operation?

Only if the target is smaller than the source.

Maybe we should change that, because I don't think it's expected that a
guest sees a larger disk, where old data reappears, after resizing
(shrinking) the active layer and then commiting it to the backing file.

Kevin



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-19 12:17   ` Vladimir Sementsov-Ogievskiy
  2019-11-19 12:32     ` Vladimir Sementsov-Ogievskiy
@ 2019-11-19 14:21     ` Kevin Wolf
  1 sibling, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2019-11-19 14:21 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

Am 19.11.2019 um 13:17 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 19.11.2019 15:05, Kevin Wolf wrote:
> > Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> Hi all!
> >>
> >> I wanted to understand, what is the real difference between
> >> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
> >> bdrv_is_allocated_above should work through bdrv_block_status_above..
> >>
> >> And I found the problem: bdrv_is_allocated_above considers space after
> >> EOF as UNALLOCATED for intermediate nodes..
> >>
> >> UNALLOCATED is not about allocation at fs level, but about should we
> >> go to backing or not.. And it seems incorrect for me, as in case of
> >> short backing file, we'll read zeroes after EOF, instead of going
> >> further by backing chain.
> > 
> > We actually have documentation what it means:
> > 
> >   * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
> >   *                       layer rather than any backing, set by block layer
> > 
> > Say we have a short overlay, like this:
> > 
> > base.qcow2:     AAAAAAAA
> > overlay.qcow2:  BBBB
> > 
> > Then of course the content of block 5 (the one after EOF of
> > overlay.qcow2) is still determined by overlay.qcow2, which can be easily
> > verified by reading it from overlay.qcow2 (produces zeros) and from
> > base.qcow2 (produces As).
> > 
> > So the correct result when querying the block status of block 5 on
> > overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.

Do you agree with this assessment?

> > Interestingly, we already fixed the opposite case (large overlay over
> > short backing file) in commit e88ae2264d9 from May 2014 according to the
> > same logic.
> > 
> >> This leads to the following effect:
> >>
> >> ./qemu-img create -f qcow2 base.qcow2 2M
> >> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> >>
> >> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> >> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> >>
> >> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> >> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> >> read 1048576/1048576 bytes at offset 1048576
> >> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> >>
> >> But after commit guest visible state is changed, which seems wrong for me:
> >> ./qemu-img commit top.qcow2 -b mid.qcow2
> >>
> >> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> >> Pattern verification failed at offset 1048576, 1048576 bytes
> >> read 1048576/1048576 bytes at offset 1048576
> >> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> >>
> >> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> >> read 1048576/1048576 bytes at offset 1048576
> >> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> >>
> >>
> >> I don't know, is it a real bug, as I don't know, do we support backing
> >> file larger than its parent. Still, I'm not sure that this behavior of
> >> bdrv_is_allocated_above don't lead to other problems.
> > 
> > I agree it's a bug.
> > 
> > Your fix doesn't look right to me, though. You leave the buggy behaviour
> > of bdrv_co_block_status() as it is and then add four patches to work
> > around it in some (but not all) callers of it.
> > 
> > All that it should take to fix this is making the bs->backing check
> > independent from want_zero and let it set ALLOCATED. What I expected
> > would be something like the below patch.
> > 
> > But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
> > qemu-io shows that the range is now considered allocated), so probably
> > there is still a separate bug in bdrv_is_allocated_above().
> > 
> > And I think we'll want an iotests case for both cases (short overlay,
> > short backing file).
> > 
> > Kevin
> > 
> > 
> > diff --git a/block/io.c b/block/io.c
> > index f75777f5ea..5eafcff01a 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
> >   
> >       if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
> >           ret |= BDRV_BLOCK_ALLOCATED;
> > -    } else if (want_zero) {
> > -        if (bdrv_unallocated_blocks_are_zero(bs)) {
> > -            ret |= BDRV_BLOCK_ZERO;
> > -        } else if (bs->backing) {
> > -            BlockDriverState *bs2 = bs->backing->bs;
> > -            int64_t size2 = bdrv_getlength(bs2);
> > -
> > -            if (size2 >= 0 && offset >= size2) {
> > +    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
> > +        ret |= BDRV_BLOCK_ZERO;
> > +    } else if (bs->backing) {
> > +        BlockDriverState *bs2 = bs->backing->bs;
> > +        int64_t size2 = bdrv_getlength(bs2);
> > +
> > +        if (size2 >= 0 && offset >= size2) {
> > +            if (want_zero) {
> >                   ret |= BDRV_BLOCK_ZERO;
> >               }
> > +            ret |= BDRV_BLOCK_ALLOCATED;
> 
> No, I think it's wrong, as it's not allocated at bs, but at
> bs->backing->bs.

This code implements the logic that I explained above. In my example,
the zeros are produced by overlay.qcow2 (because we're reading after its
EOF), not by base.qcow2.

If it were allocated at base.qcow2, you wouldn't read zeros, but the
data from (the longer) base.qcow2.

The code I posted isn't quite correct yet, but I think that the core of
the problem is already present in bdrv_co_block_status() and not only in
its callers.

Kevin



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (5 preceding siblings ...)
  2019-11-19 12:05 ` Kevin Wolf
@ 2019-11-19 14:54 ` Kevin Wolf
  2019-11-19 16:58 ` Stefan Hajnoczi
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2019-11-19 14:54 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Hi all!
> 
> I wanted to understand, what is the real difference between bdrv_block_status_above
> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> zeroes after EOF, instead of going further by backing chain.
> 
> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing
> file larger than its parent. Still, I'm not sure that this behavior of
> bdrv_is_allocated_above don't lead to other problems.

Actually, this specific problem is completely unrelated to how the block
status functions deal with short backing files because they are only
ever called for backing files of the same length as their overlay.

The problem is that the commit job grows the backing file first without
making sure that the clusters in the new part read as zeros. After this,
the damage is done and bdrv_is_allocated_above() returns correctly that
the blocks are unallocated both in top.qcow2 and in mid.qcow2.

So the simple fix for 4.2 would be the following. Maybe we can find a
way to optimise it later (though probably it's not worth it because
short backing files are an uncommon case anyway).

Kevin


diff --git a/block/commit.c b/block/commit.c
index 23c90b3b91..a0c4f51caf 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -159,6 +159,11 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
         if (ret) {
             goto out;
         }
+        ret = blk_co_pwrite_zeroes(s->base, base_len, len - base_len,
+                                   BDRV_REQ_MAY_UNMAP);
+        if (ret < 0) {
+            goto out;
+        }
     }
 
     buf = blk_blockalign(s->top, COMMIT_BUFFER_SIZE);
diff --git a/block/mirror.c b/block/mirror.c
index f0f2d9dff1..2a34f2fad6 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -883,6 +883,12 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
             if (ret < 0) {
                 goto immediate_exit;
             }
+            ret = blk_co_pwrite_zeroes(s->target, base_length,
+                                       s->bdev_length - base_length,
+                                       BDRV_REQ_MAY_UNMAP);
+            if (ret < 0) {
+                goto immediate_exit;
+            }
         }
     }
 



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (6 preceding siblings ...)
  2019-11-19 14:54 ` Kevin Wolf
@ 2019-11-19 16:58 ` Stefan Hajnoczi
  2019-11-19 17:11   ` Vladimir Sementsov-Ogievskiy
  2019-11-20 10:20 ` Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Stefan Hajnoczi @ 2019-11-19 16:58 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: kwolf, fam, qemu-block, qemu-devel, mreitz, den

[-- Attachment #1: Type: text/plain, Size: 2371 bytes --]

On Sat, Nov 16, 2019 at 07:34:06PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> I wanted to understand, what is the real difference between bdrv_block_status_above
> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> zeroes after EOF, instead of going further by backing chain.
> 
> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing file larger than
> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> to other problems.

It seems correct that a backing file limits the virtual disk size of its
backing chain.

The "qemu-img commit" behavior seems counter-intuitive at first, but the
problem there is that the top-most image file is larger than its backing
file - not that the backing chain has differing sizes.  Committing
should not lose data, mid.qcow2 will be grown and then you get the
result you've observed.

I consider this a little weird but not a bug.  Does anyone else have
opinions?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-19 16:58 ` Stefan Hajnoczi
@ 2019-11-19 17:11   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-19 17:11 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, fam, Denis Lunev, qemu-block, qemu-devel, mreitz

19.11.2019 19:58, Stefan Hajnoczi wrote:
> On Sat, Nov 16, 2019 at 07:34:06PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all!
>>
>> I wanted to understand, what is the real difference between bdrv_block_status_above
>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>> bdrv_block_status_above..
>>
>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>> UNALLOCATED for intermediate nodes..
>>
>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>> zeroes after EOF, instead of going further by backing chain.
>>
>> This leads to the following effect:
>>
>> ./qemu-img create -f qcow2 base.qcow2 2M
>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>
>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>
>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>
>> But after commit guest visible state is changed, which seems wrong for me:
>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>
>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>> Pattern verification failed at offset 1048576, 1048576 bytes
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>
>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>
>>
>> I don't know, is it a real bug, as I don't know, do we support backing file larger than
>> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
>> to other problems.
> 
> It seems correct that a backing file limits the virtual disk size of its
> backing chain.
> 
> The "qemu-img commit" behavior seems counter-intuitive at first, but the
> problem there is that the top-most image file is larger than its backing
> file - not that the backing chain has differing sizes.  Committing
> should not lose data, mid.qcow2 will be grown and then you get the
> result you've observed.
> 
> I consider this a little weird but not a bug.  Does anyone else have
> opinions?
> 

I thing, that the fact that read and block_status treats space after EOF
in different ways is a bug anyway.

Also, block_status behavior don't correspond to our definition of
BDRV_BLOCK_ALLOCATED (see Kevin's response)

Also, as I explained in parallel branch, such situation is possible for example
when we interleave incremental backup creation with vm disk resize.

So, then, we have incremental backup chain with different sizes like in
example. Assume we want to merge some sequential incremental backups into
one (aka, remove incremental backup), it may be done with help of commit.
And we get final image, where we see data from base backup instead of
zeros. This is illegal, vm should not get access to this data.

Possibly, similar thing is possible with snapshots.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (7 preceding siblings ...)
  2019-11-19 16:58 ` Stefan Hajnoczi
@ 2019-11-20 10:20 ` Vladimir Sementsov-Ogievskiy
  2019-11-20 11:44   ` Kevin Wolf
  2019-11-20 16:24 ` [PATCH 5/4] iotests: add commit top->base cases to 274 Vladimir Sementsov-Ogievskiy
  2019-11-25 10:08 ` [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
  10 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-20 10:20 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, Denis Lunev, qemu-devel, mreitz, stefanha

16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> I wanted to understand, what is the real difference between bdrv_block_status_above
> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> zeroes after EOF, instead of going further by backing chain.
> 
> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing file larger than
> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> to other problems.
> 
> =====
> 
> Hmm, bdrv_block_allocated_above behaves strange too:
> 
> with want_zero=true, it may report unallocated zeroes because of short backing files, which
> are actually "allocated" in POV of backing chains. But I see this may influence only
> qemu-img compare, and I don't see can it trigger some bug..
> 
> with want_zero=false, it may do no progress because of short backing file. Moreover it may
> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
> onlyt top layer, so it seems OK.
> 
> =====
> 
> So, I propose these series, still I'm not sure is there a real bug.
> 
> Vladimir Sementsov-Ogievskiy (4):
>    block/io: fix bdrv_co_block_status_above
>    block/io: bdrv_common_block_status_above: support include_base
>    block/io: bdrv_common_block_status_above: support bs == base
>    block/io: fix bdrv_is_allocated_above
> 
>   block/io.c                 | 104 ++++++++++++++++++-------------------
>   tests/qemu-iotests/154.out |   4 +-
>   2 files changed, 53 insertions(+), 55 deletions(-)
> 


Interesting that the problem illustrated here is not fixed by the series, it's actually
relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
to unallocated qcow2 clusters, which I think should be fixed too.

To illustrate the problem fixed by the series, we should commit to base:

# ./qemu-img commit top.qcow2 -b base.qcow2
Image committed.
# ./qemu-io -c "read -P 0 1M 1M" base.qcow2
Pattern verification failed at offset 1048576, 1048576 bytes
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)

=======

Hmm, but how to fix the problem about truncate? I think truncate must not make
underlying backing available for read.. Discard operation doesn't do it.

So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark new clusters
as ZERO?

To improve it, we can add "zero bit" to L1 table entry..

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-20 10:20 ` Vladimir Sementsov-Ogievskiy
@ 2019-11-20 11:44   ` Kevin Wolf
  2019-11-20 12:04     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Kevin Wolf @ 2019-11-20 11:44 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
> > Hi all!
> > 
> > I wanted to understand, what is the real difference between bdrv_block_status_above
> > and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> > bdrv_block_status_above..
> > 
> > And I found the problem: bdrv_is_allocated_above considers space after EOF as
> > UNALLOCATED for intermediate nodes..
> > 
> > UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> > not.. And it seems incorrect for me, as in case of short backing file, we'll read
> > zeroes after EOF, instead of going further by backing chain.
> > 
> > This leads to the following effect:
> > 
> > ./qemu-img create -f qcow2 base.qcow2 2M
> > ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> > 
> > ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> > ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> > 
> > Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> > ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> > read 1048576/1048576 bytes at offset 1048576
> > 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> > 
> > But after commit guest visible state is changed, which seems wrong for me:
> > ./qemu-img commit top.qcow2 -b mid.qcow2
> > 
> > ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> > Pattern verification failed at offset 1048576, 1048576 bytes
> > read 1048576/1048576 bytes at offset 1048576
> > 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> > 
> > ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> > read 1048576/1048576 bytes at offset 1048576
> > 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> > 
> > 
> > I don't know, is it a real bug, as I don't know, do we support backing file larger than
> > its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> > to other problems.
> > 
> > =====
> > 
> > Hmm, bdrv_block_allocated_above behaves strange too:
> > 
> > with want_zero=true, it may report unallocated zeroes because of short backing files, which
> > are actually "allocated" in POV of backing chains. But I see this may influence only
> > qemu-img compare, and I don't see can it trigger some bug..
> > 
> > with want_zero=false, it may do no progress because of short backing file. Moreover it may
> > report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
> > onlyt top layer, so it seems OK.
> > 
> > =====
> > 
> > So, I propose these series, still I'm not sure is there a real bug.
> > 
> > Vladimir Sementsov-Ogievskiy (4):
> >    block/io: fix bdrv_co_block_status_above
> >    block/io: bdrv_common_block_status_above: support include_base
> >    block/io: bdrv_common_block_status_above: support bs == base
> >    block/io: fix bdrv_is_allocated_above
> > 
> >   block/io.c                 | 104 ++++++++++++++++++-------------------
> >   tests/qemu-iotests/154.out |   4 +-
> >   2 files changed, 53 insertions(+), 55 deletions(-)
> > 
> 
> 
> Interesting that the problem illustrated here is not fixed by the series, it's actually
> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
> to unallocated qcow2 clusters, which I think should be fixed too.

Yes, this is what I posted yesterday. (With a suggested quick fix, but
it turns out it was not quite correct, see below.)

> To illustrate the problem fixed by the series, we should commit to base:
> 
> # ./qemu-img commit top.qcow2 -b base.qcow2
> Image committed.
> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)

Ok, I'll try that later.

> Hmm, but how to fix the problem about truncate? I think truncate must
> not make underlying backing available for read.. Discard operation
> doesn't do it.
> 
> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark
> new clusters as ZERO?

Yes, we need to write zeroes to the new area if the backing file covers
it. We need to do this not only in mirror/commit/bdrv_commit(), but in
fact for all truncate operations: Berto mentioned on IRC yesterday that
you can get into the same situation with 'block_resize' monitor
commands.

So I tried to fix this yesterday, and I thought that I had a fix, when I
noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter.
So I'll still need to fix this. Other than that, I suppose the following
fix should work (but is probably a bit too invasive for -rc3).

Kevin

diff --git a/block/io.c b/block/io.c
index f75777f5ea..4118bf0118 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
         goto out;
     }

+    /*
+     * If the image has a backing file that is large enough that it would
+     * provide data for the new area, we cannot leave it unallocated because
+     * then the backing file content would become visible. Instead, zero-fill
+     * the area where backing file and new area overlap.
+     */
+    if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) {
+        int64_t backing_len;
+
+        backing_len = bdrv_getlength(backing_bs(bs));
+        if (backing_len < 0) {
+            ret = backing_len;
+            goto out;
+        }
+
+        if (backing_len > old_size) {
+            /* FIXME bytes parameter is 32 bits */
+            ret = bdrv_co_do_zero_pwritev(child, old_size,
+                                          MIN(new_bytes, backing_len - old_size),
+                                          BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP, &req);
+            if (ret < 0) {
+                goto out;
+            }
+        }
+    }
+
     ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Could not refresh total sector count");



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-20 11:44   ` Kevin Wolf
@ 2019-11-20 12:04     ` Vladimir Sementsov-Ogievskiy
  2019-11-20 13:30       ` Kevin Wolf
  2019-11-20 13:37       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 2 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-20 12:04 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

20.11.2019 14:44, Kevin Wolf wrote:
> Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>> bdrv_block_status_above..
>>>
>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>> UNALLOCATED for intermediate nodes..
>>>
>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>> zeroes after EOF, instead of going further by backing chain.
>>>
>>> This leads to the following effect:
>>>
>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>
>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>
>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>
>>> But after commit guest visible state is changed, which seems wrong for me:
>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>
>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>
>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>
>>>
>>> I don't know, is it a real bug, as I don't know, do we support backing file larger than
>>> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
>>> to other problems.
>>>
>>> =====
>>>
>>> Hmm, bdrv_block_allocated_above behaves strange too:
>>>
>>> with want_zero=true, it may report unallocated zeroes because of short backing files, which
>>> are actually "allocated" in POV of backing chains. But I see this may influence only
>>> qemu-img compare, and I don't see can it trigger some bug..
>>>
>>> with want_zero=false, it may do no progress because of short backing file. Moreover it may
>>> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
>>> onlyt top layer, so it seems OK.
>>>
>>> =====
>>>
>>> So, I propose these series, still I'm not sure is there a real bug.
>>>
>>> Vladimir Sementsov-Ogievskiy (4):
>>>     block/io: fix bdrv_co_block_status_above
>>>     block/io: bdrv_common_block_status_above: support include_base
>>>     block/io: bdrv_common_block_status_above: support bs == base
>>>     block/io: fix bdrv_is_allocated_above
>>>
>>>    block/io.c                 | 104 ++++++++++++++++++-------------------
>>>    tests/qemu-iotests/154.out |   4 +-
>>>    2 files changed, 53 insertions(+), 55 deletions(-)
>>>
>>
>>
>> Interesting that the problem illustrated here is not fixed by the series, it's actually
>> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
>> to unallocated qcow2 clusters, which I think should be fixed too.
> 
> Yes, this is what I posted yesterday. (With a suggested quick fix, but
> it turns out it was not quite correct, see below.)
> 
>> To illustrate the problem fixed by the series, we should commit to base:
>>
>> # ./qemu-img commit top.qcow2 -b base.qcow2
>> Image committed.
>> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2
>> Pattern verification failed at offset 1048576, 1048576 bytes
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)
> 
> Ok, I'll try that later.
> 
>> Hmm, but how to fix the problem about truncate? I think truncate must
>> not make underlying backing available for read.. Discard operation
>> doesn't do it.
>>
>> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark
>> new clusters as ZERO?
> 
> Yes, we need to write zeroes to the new area if the backing file covers
> it. We need to do this not only in mirror/commit/bdrv_commit(), but in
> fact for all truncate operations: Berto mentioned on IRC yesterday that
> you can get into the same situation with 'block_resize' monitor
> commands.
> 
> So I tried to fix this yesterday, and I thought that I had a fix, when I
> noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter.
> So I'll still need to fix this. Other than that, I suppose the following
> fix should work (but is probably a bit too invasive for -rc3).
> 
> Kevin
> 
> diff --git a/block/io.c b/block/io.c
> index f75777f5ea..4118bf0118 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
>           goto out;
>       }
> 
> +    /*
> +     * If the image has a backing file that is large enough that it would
> +     * provide data for the new area, we cannot leave it unallocated because
> +     * then the backing file content would become visible. Instead, zero-fill
> +     * the area where backing file and new area overlap.
> +     */
> +    if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) {
> +        int64_t backing_len;
> +
> +        backing_len = bdrv_getlength(backing_bs(bs));
> +        if (backing_len < 0) {
> +            ret = backing_len;
> +            goto out;
> +        }
> +
> +        if (backing_len > old_size) {
> +            /* FIXME bytes parameter is 32 bits */
> +            ret = bdrv_co_do_zero_pwritev(child, old_size,
> +                                          MIN(new_bytes, backing_len - old_size),
> +                                          BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP, &req);
> +            if (ret < 0) {
> +                goto out;
> +            }
> +        }
> +    }
> +
>       ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
>       if (ret < 0) {
>           error_setg_errno(errp, -ret, "Could not refresh total sector count");
> 

I'm not sure that it is safe enough: we may not have opened backing at the moment, but
still it may exist and managed by user.

PREALLOC_MODE_OFF is documented as
# @off: no preallocation

- not very descriptive, but I think it's nothing about making backing file available
through new clusters.

I think PREALLOC_MODE_OFF should always make new clusters "BDRV_BLOCK_ALLOCATED". If
for some scenarios (are they exist at all?) we need to preallocate cluster in manner
that backing file would be visible through them, we'd better add another preallocation
mode which will directly document this behaviour, like PREALLOC_MODE_BACKING.

So, I'd consider PREALLOC_MODE_OFF as something that must not create UNALLOCATED (in POV
of backing chains) clusters, and should be fixed in all formats.. Or as a quick fix may
we may write zeros from bdrv_co_truncate, but independently of backing file existence
and length.

===

Also I think it's a wrong thing at all that qcow2 new file is transparent by default..
It should be transparent only when we create snapshots and incremental backups. But when
we create new disk for new vm it should be zeroed (and extending L1 table entry spec by
"zero bit" may help)

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-20 12:04     ` Vladimir Sementsov-Ogievskiy
@ 2019-11-20 13:30       ` Kevin Wolf
  2019-11-20 13:51         ` Vladimir Sementsov-Ogievskiy
  2019-11-20 13:37       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 35+ messages in thread
From: Kevin Wolf @ 2019-11-20 13:30 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

Am 20.11.2019 um 13:04 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 20.11.2019 14:44, Kevin Wolf wrote:
> > Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
> >>> Hi all!
> >>>
> >>> I wanted to understand, what is the real difference between bdrv_block_status_above
> >>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> >>> bdrv_block_status_above..
> >>>
> >>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> >>> UNALLOCATED for intermediate nodes..
> >>>
> >>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> >>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> >>> zeroes after EOF, instead of going further by backing chain.
> >>>
> >>> This leads to the following effect:
> >>>
> >>> ./qemu-img create -f qcow2 base.qcow2 2M
> >>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> >>>
> >>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> >>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> >>>
> >>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> >>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> >>> read 1048576/1048576 bytes at offset 1048576
> >>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> >>>
> >>> But after commit guest visible state is changed, which seems wrong for me:
> >>> ./qemu-img commit top.qcow2 -b mid.qcow2
> >>>
> >>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> >>> Pattern verification failed at offset 1048576, 1048576 bytes
> >>> read 1048576/1048576 bytes at offset 1048576
> >>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> >>>
> >>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> >>> read 1048576/1048576 bytes at offset 1048576
> >>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> >>>
> >>>
> >>> I don't know, is it a real bug, as I don't know, do we support backing file larger than
> >>> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> >>> to other problems.
> >>>
> >>> =====
> >>>
> >>> Hmm, bdrv_block_allocated_above behaves strange too:
> >>>
> >>> with want_zero=true, it may report unallocated zeroes because of short backing files, which
> >>> are actually "allocated" in POV of backing chains. But I see this may influence only
> >>> qemu-img compare, and I don't see can it trigger some bug..
> >>>
> >>> with want_zero=false, it may do no progress because of short backing file. Moreover it may
> >>> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
> >>> onlyt top layer, so it seems OK.
> >>>
> >>> =====
> >>>
> >>> So, I propose these series, still I'm not sure is there a real bug.
> >>>
> >>> Vladimir Sementsov-Ogievskiy (4):
> >>>     block/io: fix bdrv_co_block_status_above
> >>>     block/io: bdrv_common_block_status_above: support include_base
> >>>     block/io: bdrv_common_block_status_above: support bs == base
> >>>     block/io: fix bdrv_is_allocated_above
> >>>
> >>>    block/io.c                 | 104 ++++++++++++++++++-------------------
> >>>    tests/qemu-iotests/154.out |   4 +-
> >>>    2 files changed, 53 insertions(+), 55 deletions(-)
> >>>
> >>
> >>
> >> Interesting that the problem illustrated here is not fixed by the series, it's actually
> >> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
> >> to unallocated qcow2 clusters, which I think should be fixed too.
> > 
> > Yes, this is what I posted yesterday. (With a suggested quick fix, but
> > it turns out it was not quite correct, see below.)
> > 
> >> To illustrate the problem fixed by the series, we should commit to base:
> >>
> >> # ./qemu-img commit top.qcow2 -b base.qcow2
> >> Image committed.
> >> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2
> >> Pattern verification failed at offset 1048576, 1048576 bytes
> >> read 1048576/1048576 bytes at offset 1048576
> >> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)
> > 
> > Ok, I'll try that later.
> > 
> >> Hmm, but how to fix the problem about truncate? I think truncate must
> >> not make underlying backing available for read.. Discard operation
> >> doesn't do it.
> >>
> >> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark
> >> new clusters as ZERO?
> > 
> > Yes, we need to write zeroes to the new area if the backing file covers
> > it. We need to do this not only in mirror/commit/bdrv_commit(), but in
> > fact for all truncate operations: Berto mentioned on IRC yesterday that
> > you can get into the same situation with 'block_resize' monitor
> > commands.
> > 
> > So I tried to fix this yesterday, and I thought that I had a fix, when I
> > noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter.
> > So I'll still need to fix this. Other than that, I suppose the following
> > fix should work (but is probably a bit too invasive for -rc3).
> > 
> > Kevin
> > 
> > diff --git a/block/io.c b/block/io.c
> > index f75777f5ea..4118bf0118 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
> >           goto out;
> >       }
> > 
> > +    /*
> > +     * If the image has a backing file that is large enough that it would
> > +     * provide data for the new area, we cannot leave it unallocated because
> > +     * then the backing file content would become visible. Instead, zero-fill
> > +     * the area where backing file and new area overlap.
> > +     */
> > +    if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) {
> > +        int64_t backing_len;
> > +
> > +        backing_len = bdrv_getlength(backing_bs(bs));
> > +        if (backing_len < 0) {
> > +            ret = backing_len;
> > +            goto out;
> > +        }
> > +
> > +        if (backing_len > old_size) {
> > +            /* FIXME bytes parameter is 32 bits */
> > +            ret = bdrv_co_do_zero_pwritev(child, old_size,
> > +                                          MIN(new_bytes, backing_len - old_size),
> > +                                          BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP, &req);
> > +            if (ret < 0) {
> > +                goto out;
> > +            }
> > +        }
> > +    }
> > +
> >       ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
> >       if (ret < 0) {
> >           error_setg_errno(errp, -ret, "Could not refresh total sector count");
> > 
> 
> I'm not sure that it is safe enough: we may not have opened backing at
> the moment, but still it may exist and managed by user.

But then I think it's the user's responsibility to make sure that the
backing file still makes sense when they reattach it. You can't tell
QEMU that there is no backing file and then expect QEMU to take care of
your secret backing file.

Of course, we could unconditionally zero out the new area without
looking at the backing file, but I'm not sure if that is wanted.

> PREALLOC_MODE_OFF is documented as
> # @off: no preallocation
> 
> - not very descriptive, but I think it's nothing about making backing file available
> through new clusters.
> 
> I think PREALLOC_MODE_OFF should always make new clusters "BDRV_BLOCK_ALLOCATED". If
> for some scenarios (are they exist at all?) we need to preallocate cluster in manner
> that backing file would be visible through them, we'd better add another preallocation
> mode which will directly document this behaviour, like PREALLOC_MODE_BACKING.
> 
> So, I'd consider PREALLOC_MODE_OFF as something that must not create UNALLOCATED (in POV
> of backing chains) clusters, and should be fixed in all formats.. Or as a quick fix may
> we may write zeros from bdrv_co_truncate, but independently of backing file existence
> and length.

No, that would mean that all images must be preallocated because
BDRV_BLOCK_ALLOCATED isn't supposed to be set for sparse blocks.

Essentially, the mode that you're envisioning for it is the same as
PREALLOC_MODE_METADATA today (except that it would have to be supported
by every driver).

Obviously, that's wrong for PREALLOC_MODE_OFF.

> ===
> 
> Also I think it's a wrong thing at all that qcow2 new file is transparent by default..
> It should be transparent only when we create snapshots and incremental backups. But when
> we create new disk for new vm it should be zeroed (and extending L1 table entry spec by
> "zero bit" may help)

Why would a qcow2 file even have a backing file when you don't want to
ever access the backing file? Create your image without a backing file
and you get the behaviour you want.

Kevin



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-20 12:04     ` Vladimir Sementsov-Ogievskiy
  2019-11-20 13:30       ` Kevin Wolf
@ 2019-11-20 13:37       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-20 13:37 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

20.11.2019 15:04, Vladimir Sementsov-Ogievskiy wrote:
> 20.11.2019 14:44, Kevin Wolf wrote:
>> Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>>> bdrv_block_status_above..
>>>>
>>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>>> UNALLOCATED for intermediate nodes..
>>>>
>>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>>> zeroes after EOF, instead of going further by backing chain.
>>>>
>>>> This leads to the following effect:
>>>>
>>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>>
>>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>>
>>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>>
>>>> But after commit guest visible state is changed, which seems wrong for me:
>>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>>
>>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>>
>>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>>
>>>>
>>>> I don't know, is it a real bug, as I don't know, do we support backing file larger than
>>>> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
>>>> to other problems.
>>>>
>>>> =====
>>>>
>>>> Hmm, bdrv_block_allocated_above behaves strange too:
>>>>
>>>> with want_zero=true, it may report unallocated zeroes because of short backing files, which
>>>> are actually "allocated" in POV of backing chains. But I see this may influence only
>>>> qemu-img compare, and I don't see can it trigger some bug..
>>>>
>>>> with want_zero=false, it may do no progress because of short backing file. Moreover it may
>>>> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
>>>> onlyt top layer, so it seems OK.
>>>>
>>>> =====
>>>>
>>>> So, I propose these series, still I'm not sure is there a real bug.
>>>>
>>>> Vladimir Sementsov-Ogievskiy (4):
>>>>     block/io: fix bdrv_co_block_status_above
>>>>     block/io: bdrv_common_block_status_above: support include_base
>>>>     block/io: bdrv_common_block_status_above: support bs == base
>>>>     block/io: fix bdrv_is_allocated_above
>>>>
>>>>    block/io.c                 | 104 ++++++++++++++++++-------------------
>>>>    tests/qemu-iotests/154.out |   4 +-
>>>>    2 files changed, 53 insertions(+), 55 deletions(-)
>>>>
>>>
>>>
>>> Interesting that the problem illustrated here is not fixed by the series, it's actually
>>> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
>>> to unallocated qcow2 clusters, which I think should be fixed too.
>>
>> Yes, this is what I posted yesterday. (With a suggested quick fix, but
>> it turns out it was not quite correct, see below.)
>>
>>> To illustrate the problem fixed by the series, we should commit to base:
>>>
>>> # ./qemu-img commit top.qcow2 -b base.qcow2
>>> Image committed.
>>> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2
>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)
>>
>> Ok, I'll try that later.
>>
>>> Hmm, but how to fix the problem about truncate? I think truncate must
>>> not make underlying backing available for read.. Discard operation
>>> doesn't do it.
>>>
>>> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark
>>> new clusters as ZERO?
>>
>> Yes, we need to write zeroes to the new area if the backing file covers
>> it. We need to do this not only in mirror/commit/bdrv_commit(), but in
>> fact for all truncate operations: Berto mentioned on IRC yesterday that
>> you can get into the same situation with 'block_resize' monitor
>> commands.
>>
>> So I tried to fix this yesterday, and I thought that I had a fix, when I
>> noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter.
>> So I'll still need to fix this. Other than that, I suppose the following
>> fix should work (but is probably a bit too invasive for -rc3).
>>
>> Kevin
>>
>> diff --git a/block/io.c b/block/io.c
>> index f75777f5ea..4118bf0118 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
>>           goto out;
>>       }
>>
>> +    /*
>> +     * If the image has a backing file that is large enough that it would
>> +     * provide data for the new area, we cannot leave it unallocated because
>> +     * then the backing file content would become visible. Instead, zero-fill
>> +     * the area where backing file and new area overlap.
>> +     */
>> +    if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) {
>> +        int64_t backing_len;
>> +
>> +        backing_len = bdrv_getlength(backing_bs(bs));
>> +        if (backing_len < 0) {
>> +            ret = backing_len;
>> +            goto out;
>> +        }
>> +
>> +        if (backing_len > old_size) {
>> +            /* FIXME bytes parameter is 32 bits */
>> +            ret = bdrv_co_do_zero_pwritev(child, old_size,
>> +                                          MIN(new_bytes, backing_len - old_size),
>> +                                          BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP, &req);
>> +            if (ret < 0) {
>> +                goto out;
>> +            }
>> +        }
>> +    }
>> +
>>       ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
>>       if (ret < 0) {
>>           error_setg_errno(errp, -ret, "Could not refresh total sector count");
>>
> 
> I'm not sure that it is safe enough: we may not have opened backing at the moment, but
> still it may exist and managed by user.
> 
> PREALLOC_MODE_OFF is documented as
> # @off: no preallocation
> 
> - not very descriptive, but I think it's nothing about making backing file available
> through new clusters.
> 
> I think PREALLOC_MODE_OFF should always make new clusters "BDRV_BLOCK_ALLOCATED". If
> for some scenarios (are they exist at all?) we need to preallocate cluster in manner
> that backing file would be visible through them, we'd better add another preallocation
> mode which will directly document this behaviour, like PREALLOC_MODE_BACKING.
> 
> So, I'd consider PREALLOC_MODE_OFF as something that must not create UNALLOCATED (in POV
> of backing chains) clusters, and should be fixed in all formats.. Or as a quick fix may
> we may write zeros from bdrv_co_truncate, but independently of backing file existence
> and length.
> 
> ===

Or visa-versa, if we can't change current qemu-img-create default, which means preallocation="off"
  === UNALLOCATED transaprent image, we may improve preallocation:"off" specification, mentioning
backing files, and add preallocation:"zero" mode, to be used in truncate.

> 
> Also I think it's a wrong thing at all that qcow2 new file is transparent by default..
> It should be transparent only when we create snapshots and incremental backups. But when
> we create new disk for new vm it should be zeroed (and extending L1 table entry spec by
> "zero bit" may help)
> 

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-20 13:30       ` Kevin Wolf
@ 2019-11-20 13:51         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-20 13:51 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

20.11.2019 16:30, Kevin Wolf wrote:
> Am 20.11.2019 um 13:04 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 20.11.2019 14:44, Kevin Wolf wrote:
>>> Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Hi all!
>>>>>
>>>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>>>> bdrv_block_status_above..
>>>>>
>>>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>>>> UNALLOCATED for intermediate nodes..
>>>>>
>>>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>>>> zeroes after EOF, instead of going further by backing chain.
>>>>>
>>>>> This leads to the following effect:
>>>>>
>>>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>>>
>>>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>>>
>>>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>>>
>>>>> But after commit guest visible state is changed, which seems wrong for me:
>>>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>>>
>>>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>>>
>>>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>>>
>>>>>
>>>>> I don't know, is it a real bug, as I don't know, do we support backing file larger than
>>>>> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
>>>>> to other problems.
>>>>>
>>>>> =====
>>>>>
>>>>> Hmm, bdrv_block_allocated_above behaves strange too:
>>>>>
>>>>> with want_zero=true, it may report unallocated zeroes because of short backing files, which
>>>>> are actually "allocated" in POV of backing chains. But I see this may influence only
>>>>> qemu-img compare, and I don't see can it trigger some bug..
>>>>>
>>>>> with want_zero=false, it may do no progress because of short backing file. Moreover it may
>>>>> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
>>>>> onlyt top layer, so it seems OK.
>>>>>
>>>>> =====
>>>>>
>>>>> So, I propose these series, still I'm not sure is there a real bug.
>>>>>
>>>>> Vladimir Sementsov-Ogievskiy (4):
>>>>>      block/io: fix bdrv_co_block_status_above
>>>>>      block/io: bdrv_common_block_status_above: support include_base
>>>>>      block/io: bdrv_common_block_status_above: support bs == base
>>>>>      block/io: fix bdrv_is_allocated_above
>>>>>
>>>>>     block/io.c                 | 104 ++++++++++++++++++-------------------
>>>>>     tests/qemu-iotests/154.out |   4 +-
>>>>>     2 files changed, 53 insertions(+), 55 deletions(-)
>>>>>
>>>>
>>>>
>>>> Interesting that the problem illustrated here is not fixed by the series, it's actually
>>>> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
>>>> to unallocated qcow2 clusters, which I think should be fixed too.
>>>
>>> Yes, this is what I posted yesterday. (With a suggested quick fix, but
>>> it turns out it was not quite correct, see below.)
>>>
>>>> To illustrate the problem fixed by the series, we should commit to base:
>>>>
>>>> # ./qemu-img commit top.qcow2 -b base.qcow2
>>>> Image committed.
>>>> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2
>>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)
>>>
>>> Ok, I'll try that later.
>>>
>>>> Hmm, but how to fix the problem about truncate? I think truncate must
>>>> not make underlying backing available for read.. Discard operation
>>>> doesn't do it.
>>>>
>>>> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark
>>>> new clusters as ZERO?
>>>
>>> Yes, we need to write zeroes to the new area if the backing file covers
>>> it. We need to do this not only in mirror/commit/bdrv_commit(), but in
>>> fact for all truncate operations: Berto mentioned on IRC yesterday that
>>> you can get into the same situation with 'block_resize' monitor
>>> commands.
>>>
>>> So I tried to fix this yesterday, and I thought that I had a fix, when I
>>> noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter.
>>> So I'll still need to fix this. Other than that, I suppose the following
>>> fix should work (but is probably a bit too invasive for -rc3).
>>>
>>> Kevin
>>>
>>> diff --git a/block/io.c b/block/io.c
>>> index f75777f5ea..4118bf0118 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
>>>            goto out;
>>>        }
>>>
>>> +    /*
>>> +     * If the image has a backing file that is large enough that it would
>>> +     * provide data for the new area, we cannot leave it unallocated because
>>> +     * then the backing file content would become visible. Instead, zero-fill
>>> +     * the area where backing file and new area overlap.
>>> +     */
>>> +    if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) {
>>> +        int64_t backing_len;
>>> +
>>> +        backing_len = bdrv_getlength(backing_bs(bs));
>>> +        if (backing_len < 0) {
>>> +            ret = backing_len;
>>> +            goto out;
>>> +        }
>>> +
>>> +        if (backing_len > old_size) {
>>> +            /* FIXME bytes parameter is 32 bits */
>>> +            ret = bdrv_co_do_zero_pwritev(child, old_size,
>>> +                                          MIN(new_bytes, backing_len - old_size),
>>> +                                          BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP, &req);
>>> +            if (ret < 0) {
>>> +                goto out;
>>> +            }
>>> +        }
>>> +    }
>>> +
>>>        ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
>>>        if (ret < 0) {
>>>            error_setg_errno(errp, -ret, "Could not refresh total sector count");
>>>
>>
>> I'm not sure that it is safe enough: we may not have opened backing at
>> the moment, but still it may exist and managed by user.
> 
> But then I think it's the user's responsibility to make sure that the
> backing file still makes sense when they reattach it. You can't tell
> QEMU that there is no backing file and then expect QEMU to take care of
> your secret backing file.
> 
> Of course, we could unconditionally zero out the new area without
> looking at the backing file, but I'm not sure if that is wanted.
> 
>> PREALLOC_MODE_OFF is documented as
>> # @off: no preallocation
>>
>> - not very descriptive, but I think it's nothing about making backing file available
>> through new clusters.
>>
>> I think PREALLOC_MODE_OFF should always make new clusters "BDRV_BLOCK_ALLOCATED". If
>> for some scenarios (are they exist at all?) we need to preallocate cluster in manner
>> that backing file would be visible through them, we'd better add another preallocation
>> mode which will directly document this behaviour, like PREALLOC_MODE_BACKING.
>>
>> So, I'd consider PREALLOC_MODE_OFF as something that must not create UNALLOCATED (in POV
>> of backing chains) clusters, and should be fixed in all formats.. Or as a quick fix may
>> we may write zeros from bdrv_co_truncate, but independently of backing file existence
>> and length.
> 
> No, that would mean that all images must be preallocated because
> BDRV_BLOCK_ALLOCATED isn't supposed to be set for sparse blocks.

Hmm. what is sparse blocks and how it relates to backing chain?

> 
> Essentially, the mode that you're envisioning for it is the same as
> PREALLOC_MODE_METADATA today (except that it would have to be supported
> by every driver).

As I understand, PREALLOC_MODE_METADATA preallocates DATA clusters, not ZERO.

Also note, that PREALLOC_MODE_OFF actually allocates some metadata: L1 table,
to mark new clusters as UNALLOCATED.

> 
> Obviously, that's wrong for PREALLOC_MODE_OFF.
> 
>> ===
>>
>> Also I think it's a wrong thing at all that qcow2 new file is transparent by default..
>> It should be transparent only when we create snapshots and incremental backups. But when
>> we create new disk for new vm it should be zeroed (and extending L1 table entry spec by
>> "zero bit" may help)
> 
> Why would a qcow2 file even have a backing file when you don't want to
> ever access the backing file? Create your image without a backing file
> and you get the behaviour you want.
> 
> Kevin
> 

Anyway, I think we should at least document that "off" mode make backing file visible through new
clusters. And with your patch, we should document that "off" mode correctly handles current
backing file (which may be backing file defined by qcow2 image, or may be set by some other
mechanism), and otherwise new clusters will be transparent.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 5/4] iotests: add commit top->base cases to 274
  2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (8 preceding siblings ...)
  2019-11-20 10:20 ` Vladimir Sementsov-Ogievskiy
@ 2019-11-20 16:24 ` Vladimir Sementsov-Ogievskiy
  2019-11-25 10:08 ` [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
  10 siblings, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-20 16:24 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

These cases are fixed by previous patches around block_status and
is_allocated.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 tests/qemu-iotests/274     | 20 ++++++++++++
 tests/qemu-iotests/274.out | 63 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 83 insertions(+)

diff --git a/tests/qemu-iotests/274 b/tests/qemu-iotests/274
index f3b71e2859..492fc7b251 100755
--- a/tests/qemu-iotests/274
+++ b/tests/qemu-iotests/274
@@ -115,6 +115,26 @@ with iotests.FilePath('base') as base, \
     iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, mid)
     iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), mid)
 
+    iotests.log('=== Testing qemu-img commit (top -> base) ===')
+
+    create_chain()
+    iotests.qemu_img_log('commit', '-b', base, top)
+    iotests.img_info_log(base)
+    iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
+    iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), base)
+
+    iotests.log('=== Testing QMP active commit (top -> base) ===')
+
+    create_chain()
+    with create_vm() as vm:
+        vm.launch()
+        vm.qmp_log('block-commit', device='top', base_node='base',
+                   job_id='job0', auto_dismiss=False)
+        vm.run_job('job0', wait=5)
+
+    iotests.img_info_log(mid)
+    iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
+    iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), base)
 
     iotests.log('== Resize tests ==')
 
diff --git a/tests/qemu-iotests/274.out b/tests/qemu-iotests/274.out
index def0ac7d27..8f59a4a123 100644
--- a/tests/qemu-iotests/274.out
+++ b/tests/qemu-iotests/274.out
@@ -126,6 +126,69 @@ read 1048576/1048576 bytes at offset 0
 read 1048576/1048576 bytes at offset 1048576
 1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+=== Testing qemu-img commit (top -> base) ===
+Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=2097152 cluster_size=65536 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-mid', fmt=qcow2 size=1048576 backing_file=TEST_DIR/PID-base cluster_size=65536 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-top', fmt=qcow2 size=2097152 backing_file=TEST_DIR/PID-mid cluster_size=65536 lazy_refcounts=off refcount_bits=16
+
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Image committed.
+
+image: TEST_IMG
+file format: IMGFMT
+virtual size: 2 MiB (2097152 bytes)
+cluster_size: 65536
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+=== Testing QMP active commit (top -> base) ===
+Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=2097152 cluster_size=65536 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-mid', fmt=qcow2 size=1048576 backing_file=TEST_DIR/PID-base cluster_size=65536 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-top', fmt=qcow2 size=2097152 backing_file=TEST_DIR/PID-mid cluster_size=65536 lazy_refcounts=off refcount_bits=16
+
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+{"execute": "block-commit", "arguments": {"auto-dismiss": false, "base-node": "base", "device": "top", "job-id": "job0"}}
+{"return": {}}
+{"execute": "job-complete", "arguments": {"id": "job0"}}
+{"return": {}}
+{"data": {"device": "job0", "len": 1048576, "offset": 1048576, "speed": 0, "type": "commit"}, "event": "BLOCK_JOB_READY", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"data": {"device": "job0", "len": 1048576, "offset": 1048576, "speed": 0, "type": "commit"}, "event": "BLOCK_JOB_COMPLETED", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"execute": "job-dismiss", "arguments": {"id": "job0"}}
+{"return": {}}
+image: TEST_IMG
+file format: IMGFMT
+virtual size: 1 MiB (1048576 bytes)
+cluster_size: 65536
+backing file: TEST_DIR/PID-base
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
 == Resize tests ==
 Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=6442450944 cluster_size=65536 lazy_refcounts=off refcount_bits=16
 
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
                   ` (9 preceding siblings ...)
  2019-11-20 16:24 ` [PATCH 5/4] iotests: add commit top->base cases to 274 Vladimir Sementsov-Ogievskiy
@ 2019-11-25 10:08 ` Vladimir Sementsov-Ogievskiy
  2019-11-25 15:46   ` Kevin Wolf
  10 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-25 10:08 UTC (permalink / raw)
  To: qemu-block
  Cc: kwolf, fam, Alberto Garcia, Denis Lunev, qemu-devel, mreitz, stefanha

Ping?

Hi! Why so silent? Postpone this to 5.0? This is fixing the same problem with block
commit, like Kevin's series, just commit not to mid but to base..

16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> I wanted to understand, what is the real difference between bdrv_block_status_above
> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> zeroes after EOF, instead of going further by backing chain.
> 
> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing file larger than
> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> to other problems.
> 
> =====
> 
> Hmm, bdrv_block_allocated_above behaves strange too:
> 
> with want_zero=true, it may report unallocated zeroes because of short backing files, which
> are actually "allocated" in POV of backing chains. But I see this may influence only
> qemu-img compare, and I don't see can it trigger some bug..
> 
> with want_zero=false, it may do no progress because of short backing file. Moreover it may
> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
> onlyt top layer, so it seems OK.
> 
> =====
> 
> So, I propose these series, still I'm not sure is there a real bug.
> 
> Vladimir Sementsov-Ogievskiy (4):
>    block/io: fix bdrv_co_block_status_above
>    block/io: bdrv_common_block_status_above: support include_base
>    block/io: bdrv_common_block_status_above: support bs == base
>    block/io: fix bdrv_is_allocated_above
> 
>   block/io.c                 | 104 ++++++++++++++++++-------------------
>   tests/qemu-iotests/154.out |   4 +-
>   2 files changed, 53 insertions(+), 55 deletions(-)
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-25 10:08 ` [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
@ 2019-11-25 15:46   ` Kevin Wolf
  2019-11-26  7:27     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Kevin Wolf @ 2019-11-25 15:46 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, Alberto Garcia, Denis Lunev, qemu-block, qemu-devel, mreitz,
	stefanha

Am 25.11.2019 um 11:08 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Ping?
> 
> Hi! Why so silent? Postpone this to 5.0? This is fixing the same
> problem with block commit, like Kevin's series, just commit not to mid
> but to base..

To be honest, I think by now we've found so many problems around short
backing files, each with a non-trivial fix, that I don't think we can
have a reasonably complete fix in -rc3 without risking breaking
everything. None of the problems are new, in fact I think they have
existed since day one of resize/commit, and nobody has reported problems
before, so they can't be hitting a large number of users.

So, reluctantly, I have to admit that both series and whatever we'll add
on top are probably better kept for 5.0 (and 4.2.1) rather than added
very late into 4.2.

Kevin



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block/io: fix bdrv_co_block_status_above
  2019-11-16 16:34 ` [PATCH 1/4] block/io: fix bdrv_co_block_status_above Vladimir Sementsov-Ogievskiy
@ 2019-11-25 16:00   ` Kevin Wolf
  2019-11-26  7:26     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Kevin Wolf @ 2019-11-25 16:00 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
> bdrv_co_block_status_above has several problems with handling short
> backing files:
> 
> 1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
> without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
> which produces these after-EOF zeros is inside requested backing
> sequesnce.

s/sequesnce/sequence/

> 
> 2. With want_zeros=false, it will just stop inside requested region, if
> we have unallocated region in top node when underlying backing is
> short.

I honestly don't understand this one. Can you rephrase/explain in more
detail what you mean by "stop inside [the] requested region"?

> Fix these things, making logic about short backing files clearer.
> 
> Note that 154 output changed, because now bdrv_block_status_above don't
> merge unallocated zeros with zeros after EOF (which are actually
> "allocated" in POV of read from backing-chain top) and is_zero() just
> don't understand that the whole head or tail is zero. We may update
> is_zero to call bdrv_block_status_above several times, or add flag to
> bdrv_block_status_above that we are not interested in ALLOCATED flag,
> so ranges with different ALLOCATED status may be merged, but actually,
> it seems that we'd better don't care about this corner case.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/io.c                 | 41 ++++++++++++++++++++++++++++----------
>  tests/qemu-iotests/154.out |  4 ++--
>  2 files changed, 32 insertions(+), 13 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index f75777f5ea..4d7fa99bd2 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2434,25 +2434,44 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
>          ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
>                                     file);
>          if (ret < 0) {
> -            break;
> +            return ret;
>          }
> -        if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
> +        if (*pnum == 0) {
> +            if (first) {
> +                return ret;
> +            }
> +
>              /*
> -             * Reading beyond the end of the file continues to read
> -             * zeroes, but we can only widen the result to the
> -             * unallocated length we learned from an earlier
> -             * iteration.
> +             * Reads from bs for selected region will return zeroes, produced
> +             * because current level is short. We should consider it as
> +             * allocated.

"the selected region"
"the current level"

> +             * TODO: Should we report p as file here?

I think that would make sense.

>               */
> +            assert(ret & BDRV_BLOCK_EOF);

Can this assertion be moved above the if (first)?

>              *pnum = bytes;
> +            return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
>          }
> -        if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
> -            break;
> +        if (ret & BDRV_BLOCK_ALLOCATED) {
> +            /* We've found the node and the status, we must return. */
> +
> +            if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
> +                /*
> +                 * This level also responsible for reads after EOF inside
> +                 * unallocated region in previous level.

"is also responsible"
"the unallocated region in the previous level"

> +                 */
> +                *pnum = bytes;
> +            }
> +
> +            return ret;
>          }
> -        /* [offset, pnum] unallocated on this layer, which could be only
> -         * the first part of [offset, bytes].  */

Any reason for deleting this comment? I think it's still valid.

> -        bytes = MIN(bytes, *pnum);
> +
> +        /* Proceed to backing */
> +        assert(*pnum <= bytes);
> +        bytes = *pnum;
>          first = false;
>      }
> +
>      return ret;
>  }

Kevin



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/4] block/io: bdrv_common_block_status_above: support include_base
  2019-11-16 16:34 ` [PATCH 2/4] block/io: bdrv_common_block_status_above: support include_base Vladimir Sementsov-Ogievskiy
@ 2019-11-25 16:19   ` Kevin Wolf
  0 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2019-11-25 16:19 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
> In order to reuse bdrv_common_block_status_above in
> bdrv_is_allocated_above, let's support include_base parameter.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/4] block/io: bdrv_common_block_status_above: support bs == base
  2019-11-16 16:34 ` [PATCH 3/4] block/io: bdrv_common_block_status_above: support bs == base Vladimir Sementsov-Ogievskiy
@ 2019-11-25 16:23   ` Kevin Wolf
  0 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2019-11-25 16:23 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
> We are going to reuse bdrv_common_block_status_above in
> bdrv_is_allocated_above. bdrv_is_allocated_above may be called with
> include_base == false and still bs == base (for ex. from img_rebase()).
> 
> So, support this corner case.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Worth a comment in the code why we need this? Either way:

Reviewed-by: Kevin Wolf <kwolf@redhat.com>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block/io: fix bdrv_co_block_status_above
  2019-11-25 16:00   ` Kevin Wolf
@ 2019-11-26  7:26     ` Vladimir Sementsov-Ogievskiy
  2019-11-26 14:20       ` Kevin Wolf
  0 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-26  7:26 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

25.11.2019 19:00, Kevin Wolf wrote:
> Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> bdrv_co_block_status_above has several problems with handling short
>> backing files:
>>
>> 1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
>> without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
>> which produces these after-EOF zeros is inside requested backing
>> sequesnce.
> 
> s/sequesnce/sequence/
> 
>>
>> 2. With want_zeros=false, it will just stop inside requested region, if
>> we have unallocated region in top node when underlying backing is
>> short.
> 
> I honestly don't understand this one. Can you rephrase/explain in more
> detail what you mean by "stop inside [the] requested region"?

Hmm, yes, bad description. I mean, it may return pnum=0 prior to actual EOF,
because of EOF of short backing file.

> 
>> Fix these things, making logic about short backing files clearer.
>>
>> Note that 154 output changed, because now bdrv_block_status_above don't
>> merge unallocated zeros with zeros after EOF (which are actually
>> "allocated" in POV of read from backing-chain top) and is_zero() just
>> don't understand that the whole head or tail is zero. We may update
>> is_zero to call bdrv_block_status_above several times, or add flag to
>> bdrv_block_status_above that we are not interested in ALLOCATED flag,
>> so ranges with different ALLOCATED status may be merged, but actually,
>> it seems that we'd better don't care about this corner case.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/io.c                 | 41 ++++++++++++++++++++++++++++----------
>>   tests/qemu-iotests/154.out |  4 ++--
>>   2 files changed, 32 insertions(+), 13 deletions(-)
>>
>> diff --git a/block/io.c b/block/io.c
>> index f75777f5ea..4d7fa99bd2 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -2434,25 +2434,44 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
>>           ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
>>                                      file);
>>           if (ret < 0) {
>> -            break;
>> +            return ret;
>>           }
>> -        if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
>> +        if (*pnum == 0) {
>> +            if (first) {
>> +                return ret;
>> +            }
>> +
>>               /*
>> -             * Reading beyond the end of the file continues to read
>> -             * zeroes, but we can only widen the result to the
>> -             * unallocated length we learned from an earlier
>> -             * iteration.
>> +             * Reads from bs for selected region will return zeroes, produced
>> +             * because current level is short. We should consider it as
>> +             * allocated.
> 
> "the selected region"
> "the current level"
> 
>> +             * TODO: Should we report p as file here?
> 
> I think that would make sense.
> 
>>                */
>> +            assert(ret & BDRV_BLOCK_EOF);
> 
> Can this assertion be moved above the if (first)?

it may correspond to requested bytes==0.. But we can check it separately
before for loop and move this assertion.

> 
>>               *pnum = bytes;
>> +            return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
>>           }
>> -        if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
>> -            break;
>> +        if (ret & BDRV_BLOCK_ALLOCATED) {
>> +            /* We've found the node and the status, we must return. */
>> +
>> +            if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
>> +                /*
>> +                 * This level also responsible for reads after EOF inside
>> +                 * unallocated region in previous level.
> 
> "is also responsible"
> "the unallocated region in the previous level"
> 
>> +                 */
>> +                *pnum = bytes;
>> +            }
>> +
>> +            return ret;
>>           }
>> -        /* [offset, pnum] unallocated on this layer, which could be only
>> -         * the first part of [offset, bytes].  */
> 
> Any reason for deleting this comment? I think it's still valid.

Hmm, I decided that it's obvious and shorter comment is enough.
I can leave it, of course.

> 
>> -        bytes = MIN(bytes, *pnum);
>> +
>> +        /* Proceed to backing */
>> +        assert(*pnum <= bytes);
>> +        bytes = *pnum;
>>           first = false;
>>       }
>> +
>>       return ret;
>>   }
> 
> Kevin
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above
  2019-11-25 15:46   ` Kevin Wolf
@ 2019-11-26  7:27     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-26  7:27 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: fam, Alberto Garcia, Denis Lunev, qemu-block, qemu-devel, mreitz,
	stefanha

25.11.2019 18:46, Kevin Wolf wrote:
> Am 25.11.2019 um 11:08 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> Ping?
>>
>> Hi! Why so silent? Postpone this to 5.0? This is fixing the same
>> problem with block commit, like Kevin's series, just commit not to mid
>> but to base..
> 
> To be honest, I think by now we've found so many problems around short
> backing files, each with a non-trivial fix, that I don't think we can
> have a reasonably complete fix in -rc3 without risking breaking
> everything. None of the problems are new, in fact I think they have
> existed since day one of resize/commit, and nobody has reported problems
> before, so they can't be hitting a large number of users.
> 
> So, reluctantly, I have to admit that both series and whatever we'll add
> on top are probably better kept for 5.0 (and 4.2.1) rather than added
> very late into 4.2.
> 

OK, I agree.


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] block/io: fix bdrv_co_block_status_above
  2019-11-26  7:26     ` Vladimir Sementsov-Ogievskiy
@ 2019-11-26 14:20       ` Kevin Wolf
  0 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2019-11-26 14:20 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, Denis Lunev, qemu-block, qemu-devel, mreitz, stefanha

Am 26.11.2019 um 08:26 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 25.11.2019 19:00, Kevin Wolf wrote:
> > Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> bdrv_co_block_status_above has several problems with handling short
> >> backing files:
> >>
> >> 1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
> >> without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
> >> which produces these after-EOF zeros is inside requested backing
> >> sequesnce.
> > 
> > s/sequesnce/sequence/
> > 
> >>
> >> 2. With want_zeros=false, it will just stop inside requested region, if
> >> we have unallocated region in top node when underlying backing is
> >> short.
> > 
> > I honestly don't understand this one. Can you rephrase/explain in more
> > detail what you mean by "stop inside [the] requested region"?
> 
> Hmm, yes, bad description. I mean, it may return pnum=0 prior to actual EOF,
> because of EOF of short backing file.

Ah, yes, that's true. Definitely mention pnum=0 in the comment, this
explanation is much clearer.

> >> Fix these things, making logic about short backing files clearer.
> >>
> >> Note that 154 output changed, because now bdrv_block_status_above don't
> >> merge unallocated zeros with zeros after EOF (which are actually
> >> "allocated" in POV of read from backing-chain top) and is_zero() just
> >> don't understand that the whole head or tail is zero. We may update
> >> is_zero to call bdrv_block_status_above several times, or add flag to
> >> bdrv_block_status_above that we are not interested in ALLOCATED flag,
> >> so ranges with different ALLOCATED status may be merged, but actually,
> >> it seems that we'd better don't care about this corner case.
> >>
> >> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >> ---
> >>   block/io.c                 | 41 ++++++++++++++++++++++++++++----------
> >>   tests/qemu-iotests/154.out |  4 ++--
> >>   2 files changed, 32 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/block/io.c b/block/io.c
> >> index f75777f5ea..4d7fa99bd2 100644
> >> --- a/block/io.c
> >> +++ b/block/io.c
> >> @@ -2434,25 +2434,44 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
> >>           ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
> >>                                      file);
> >>           if (ret < 0) {
> >> -            break;
> >> +            return ret;
> >>           }
> >> -        if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
> >> +        if (*pnum == 0) {
> >> +            if (first) {
> >> +                return ret;
> >> +            }
> >> +
> >>               /*
> >> -             * Reading beyond the end of the file continues to read
> >> -             * zeroes, but we can only widen the result to the
> >> -             * unallocated length we learned from an earlier
> >> -             * iteration.
> >> +             * Reads from bs for selected region will return zeroes, produced
> >> +             * because current level is short. We should consider it as
> >> +             * allocated.
> > 
> > "the selected region"
> > "the current level"
> > 
> >> +             * TODO: Should we report p as file here?
> > 
> > I think that would make sense.
> > 
> >>                */
> >> +            assert(ret & BDRV_BLOCK_EOF);
> > 
> > Can this assertion be moved above the if (first)?
> 
> it may correspond to requested bytes==0.. But we can check it separately
> before for loop and move this assertion.

Ah, right. Don't bother then, it's fine either way.

Kevin



^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2019-11-26 14:23 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-16 16:34 [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
2019-11-16 16:34 ` [PATCH 1/4] block/io: fix bdrv_co_block_status_above Vladimir Sementsov-Ogievskiy
2019-11-25 16:00   ` Kevin Wolf
2019-11-26  7:26     ` Vladimir Sementsov-Ogievskiy
2019-11-26 14:20       ` Kevin Wolf
2019-11-16 16:34 ` [PATCH 2/4] block/io: bdrv_common_block_status_above: support include_base Vladimir Sementsov-Ogievskiy
2019-11-25 16:19   ` Kevin Wolf
2019-11-16 16:34 ` [PATCH 3/4] block/io: bdrv_common_block_status_above: support bs == base Vladimir Sementsov-Ogievskiy
2019-11-25 16:23   ` Kevin Wolf
2019-11-16 16:34 ` [PATCH 4/4] block/io: fix bdrv_is_allocated_above Vladimir Sementsov-Ogievskiy
2019-11-19 10:22 ` [PATCH 0/4] fix & merge block_status_above and is_allocated_above Max Reitz
2019-11-19 12:02   ` Denis V. Lunev
2019-11-19 12:12     ` Vladimir Sementsov-Ogievskiy
2019-11-19 12:20     ` Max Reitz
2019-11-19 12:30       ` Vladimir Sementsov-Ogievskiy
2019-11-19 13:28         ` Kevin Wolf
2019-11-19 12:05 ` Kevin Wolf
2019-11-19 12:17   ` Vladimir Sementsov-Ogievskiy
2019-11-19 12:32     ` Vladimir Sementsov-Ogievskiy
2019-11-19 12:34       ` Vladimir Sementsov-Ogievskiy
2019-11-19 12:49         ` Vladimir Sementsov-Ogievskiy
2019-11-19 14:21     ` Kevin Wolf
2019-11-19 14:54 ` Kevin Wolf
2019-11-19 16:58 ` Stefan Hajnoczi
2019-11-19 17:11   ` Vladimir Sementsov-Ogievskiy
2019-11-20 10:20 ` Vladimir Sementsov-Ogievskiy
2019-11-20 11:44   ` Kevin Wolf
2019-11-20 12:04     ` Vladimir Sementsov-Ogievskiy
2019-11-20 13:30       ` Kevin Wolf
2019-11-20 13:51         ` Vladimir Sementsov-Ogievskiy
2019-11-20 13:37       ` Vladimir Sementsov-Ogievskiy
2019-11-20 16:24 ` [PATCH 5/4] iotests: add commit top->base cases to 274 Vladimir Sementsov-Ogievskiy
2019-11-25 10:08 ` [PATCH 0/4] fix & merge block_status_above and is_allocated_above Vladimir Sementsov-Ogievskiy
2019-11-25 15:46   ` Kevin Wolf
2019-11-26  7:27     ` Vladimir Sementsov-Ogievskiy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).