* [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements
@ 2017-05-19  9:34 Anton Nefedov
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk Anton Nefedov
                   ` (13 more replies)
  0 siblings, 14 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Anton Nefedov

This patch series addresses a few performance problems of the qcow2 format:

  1. non-cluster-aligned write requests (to unallocated clusters) currently
    pad the data with explicit zeroes if there is no backing data. This can
    be avoided: the whole clusters are preallocated and zeroed in a single
    efficient write_zeroes() operation, which also provides better host file
    continuity

  2. moreover, the efficient write_zeroes() operation can be used to
    preallocate space megabytes ahead, which gives a noticeable improvement
    on some storage types (e.g. distributed storage, where the space
    allocation operation is expensive)

  3. preallocating/zeroing the clusters in advance makes it possible to
    enable simultaneous writes to the same unallocated cluster, which is
    beneficial for parallel sequential write operations that are not
    cluster-aligned

Performance test results are added to the commit messages (see patches 3 and 12)
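
As a rough illustration of point 1 (a hypothetical standalone sketch, not
code from the series): a non-cluster-aligned guest write leaves "head" and
"tail" areas in the first and last touched clusters, which today are filled
with explicit zero-padding writes and with this series are covered by one
whole-cluster write_zeroes() instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: size of the head and tail padding areas for a
 * guest write of @bytes at @offset, with clusters of @cluster_size. */
typedef struct CowPad {
    uint64_t head;  /* bytes before the payload in the first cluster */
    uint64_t tail;  /* bytes after the payload in the last cluster */
} CowPad;

static CowPad cow_pad(uint64_t offset, uint64_t bytes, uint64_t cluster_size)
{
    CowPad p;
    uint64_t end = offset + bytes;

    p.head = offset % cluster_size;
    p.tail = (cluster_size - end % cluster_size) % cluster_size;
    return p;
}
```

When both values are nonzero, a naive implementation issues three host
writes per guest write; zeroing the whole cluster range once avoids two of
them.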

Anton Nefedov (9):
  qcow2: is_zero_sectors(): return true if area is outside of backing
    file
  qcow2: do not COW the empty areas
  qcow2: set inactive flag
  qcow2: handle_prealloc(): find out if area zeroed by earlier
    preallocation
  qcow2: fix misleading comment about L2 linking
  qcow2-cluster: slightly refactor handle_dependencies()
  qcow2-cluster: make handle_dependencies() logic easier to follow
  qcow2: allow concurrent unaligned writes to the same clusters
  iotest 046: test simultaneous cluster write error case

Denis V. Lunev (3):
  qcow2: alloc space for COW in one chunk
  qcow2: preallocation at image expand
  qcow2: truncate preallocated space

Pavel Butsykin (1):
  qcow2: check space leak at the end of the image

 block/qcow2-cache.c                |   3 +
 block/qcow2-cluster.c              | 216 +++++++++++++++++++++++-----
 block/qcow2-refcount.c             |  21 +++
 block/qcow2.c                      | 286 ++++++++++++++++++++++++++++++++++++-
 block/qcow2.h                      |  26 ++++
 tests/qemu-iotests/026.out         | 104 ++++++++++----
 tests/qemu-iotests/026.out.nocache | 104 ++++++++++----
 tests/qemu-iotests/029.out         |   5 +-
 tests/qemu-iotests/046             |  38 ++++-
 tests/qemu-iotests/046.out         |  23 +++
 tests/qemu-iotests/060             |   2 +-
 tests/qemu-iotests/060.out         |  13 +-
 tests/qemu-iotests/061.out         |   5 +-
 tests/qemu-iotests/066             |   2 +-
 tests/qemu-iotests/066.out         |   9 +-
 tests/qemu-iotests/098.out         |   7 +-
 tests/qemu-iotests/108.out         |   5 +-
 tests/qemu-iotests/112.out         |   5 +-
 tests/qemu-iotests/154.out         |   4 +-
 19 files changed, 769 insertions(+), 109 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-22 19:00   ` Eric Blake
  2017-05-26  8:11   ` Kevin Wolf
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 02/13] qcow2: is_zero_sectors(): return true if area is outside of backing file Anton Nefedov
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Denis V. Lunev, Anton Nefedov

From: "Denis V. Lunev" <den@openvz.org>

Currently each single write operation can result in 3 write operations
if the guest offsets are not cluster-aligned: one write is performed for
the real payload and two for the COW-ed areas. Thus the data possibly lies
non-contiguously on the host filesystem, which significantly reduces
subsequent sequential read performance.

The patch allocates the space in the file with cluster granularity,
ensuring
  1. better host offset locality
  2. fewer space allocation operations
     (which can be expensive on distributed storage)

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/qcow2.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index a8d61f0..2e6a0ec 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1575,6 +1575,32 @@ fail:
     return ret;
 }
 
+static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
+{
+    BDRVQcow2State *s = bs->opaque;
+    BlockDriverState *file = bs->file->bs;
+    QCowL2Meta *m;
+    int ret;
+
+    for (m = l2meta; m != NULL; m = m->next) {
+        uint64_t bytes = m->nb_clusters << s->cluster_bits;
+
+        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
+            continue;
+        }
+
+        /* try to alloc host space in one chunk for better locality */
+        ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0);
+
+        if (ret != 0) {
+            continue;
+        }
+
+        file->total_sectors = MAX(file->total_sectors,
+                                  (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
+    }
+}
+
 static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
         if (ret < 0) {
             goto fail;
         }
-
         qemu_co_mutex_unlock(&s->lock);
+
+        if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) {
+            handle_alloc_space(bs, l2meta);
+        }
+
         BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
         trace_qcow2_writev_data(qemu_coroutine_self(),
                                 cluster_offset + offset_in_cluster);
-- 
2.7.4


* [Qemu-devel] [PATCH v1 02/13] qcow2: is_zero_sectors(): return true if area is outside of backing file
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-22 19:12   ` Eric Blake
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 03/13] qcow2: do not COW the empty areas Anton Nefedov
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Anton Nefedov

In such a case, bdrv_get_block_status() returns 0 and sets *nr to 0.

iotest 154 is updated accordingly: the write-zeroes tail alignment can now
be detected as zeroes, so pwrite_zeroes succeeds
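
The clamping the patch introduces can be sketched as follows (a simplified,
hypothetical helper, not the QEMU function itself): a range that starts at
or beyond the end of the backing file has zero readable sectors and
therefore reads as zeroes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: clamp a sector count to the device size,
 * returning 0 when @start itself is at or past the end. */
static uint32_t clamp_sectors(int64_t start, uint32_t count,
                              int64_t total_sectors)
{
    if (start + count > total_sectors) {
        count = start < total_sectors ? (uint32_t)(total_sectors - start) : 0;
    }
    return count;
}
```

The old code computed `total_sectors - start` unconditionally, which goes
negative for a start past EOF; the guard avoids that.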

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/qcow2.c              | 6 ++++--
 tests/qemu-iotests/154.out | 4 ++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 2e6a0ec..b885dfc 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2482,7 +2482,7 @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
     int64_t res;
 
     if (start + count > bs->total_sectors) {
-        count = bs->total_sectors - start;
+        count = start < bs->total_sectors ? bs->total_sectors - start : 0;
     }
 
     if (!count) {
@@ -2490,7 +2490,9 @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
     }
     res = bdrv_get_block_status_above(bs, NULL, start, count,
                                       &nr, &file);
-    return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
+    return res >= 0
+        && (((res & BDRV_BLOCK_ZERO) && nr == count)
+            || nr == 0);
 }
 
 static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
index d8485ee..259340e 100644
--- a/tests/qemu-iotests/154.out
+++ b/tests/qemu-iotests/154.out
@@ -322,7 +322,7 @@ wrote 1024/1024 bytes at offset 134218240
 1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 2048/2048 bytes allocated at offset 128 MiB
 [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
-{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
 wrote 2048/2048 bytes at offset 134217728
 2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
@@ -348,7 +348,7 @@ wrote 1024/1024 bytes at offset 134218240
 1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 2048/2048 bytes allocated at offset 128 MiB
 [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
-{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
 wrote 2048/2048 bytes at offset 134217728
 2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-- 
2.7.4


* [Qemu-devel] [PATCH v1 03/13] qcow2: do not COW the empty areas
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk Anton Nefedov
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 02/13] qcow2: is_zero_sectors(): return true if area is outside of backing file Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-22 19:24   ` Eric Blake
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 04/13] qcow2: preallocation at image expand Anton Nefedov
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Anton Nefedov, Denis V . Lunev

If the COW area of a newly allocated cluster is all zeroes, there is no
reason to write zero sectors in perform_cow() again, now that whole
clusters are zeroed out in single chunks by handle_alloc_space().

Introduce the QCowL2Meta field "reduced", since the existing fields
(offset and nb_bytes) still have to keep other write requests from
writing to the area simultaneously

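The resulting skip condition can be sketched like this (a hypothetical
reduced struct; the real Qcow2COWRegion also carries an offset): a region
is skipped when it is empty or already known to be zeroes, while its
bounds remain valid for dependency tracking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of a COW region and the perform_cow() guard. */
typedef struct Region {
    int nb_bytes;   /* number of bytes to copy */
    bool reduced;   /* region is known to be zeroes, no COW required */
} Region;

static bool region_needs_cow(const Region *r)
{
    return r->nb_bytes != 0 && !r->reduced;
}
```
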
iotest 060:
a write to the discarded cluster does not trigger COW anymore, so break
on the write_aio event instead; this still works for the test (but the
write does not fail anymore, so the reference output is updated)

iotest 066:
cluster-alignment areas that were not really COW-ed are now detected as
zeroes, hence the initial write has to be exactly the same size for the
maps to match

performance tests: ===

qemu-io,
  results in seconds to complete (less is better)
  random write 4k to empty image, no backing
    HDD
      64k cluster
        128M over 128M image:   160 -> 160 ( x1  )
        128M over   2G image:    86 ->  84 ( x1  )
        128M over   8G image:    40 ->  29 ( x1.4 )
      1M cluster
         32M over   8G image:    58 ->  23 ( x2.5 )

    SSD
      64k cluster
          2G over   2G image:    71 ->  38 (  x1.9 )
        512M over   8G image:    85 ->   8 ( x10.6 )
      1M cluster
        128M over  32G image:   314 ->   2 ( x157  )

  - the improvement grows with the cluster size
  - the first data portions written to a fresh image benefit the most
  (more chance to hit an unallocated cluster)
  - the SSD improvement is close to the IO length reduction rate
  (e.g. writing only 4k instead of 64k gives a theoretical x16
  and a practical x10 improvement)

fio tests over xfs, empty image (cluster 64k), no backing,

  first megabytes of random writes:
    randwrite 4k, size=8g:

      HDD (io_size=128m) :  730 ->  1050 IOPS ( x1.45)
      SSD (io_size=512m) : 1500 ->  7000 IOPS ( x4.7 )

  random writes io_size==image_size:
    randwrite 4k, size=2g io_size=2g:
                   HDD   : 200 IOPS (no difference)
                   SSD   : 7500 ->  9500 IOPS ( x1.3 )

  sequential write:
    seqwrite 4k, size=4g, iodepth=4
                   SSD   : 7000 -> 18000 IOPS ( x2.6 )

  - the numbers are similar to the qemu-io tests, with slightly less
  improvement (damped by the fs?)

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c      |  4 +++-
 block/qcow2.c              | 23 +++++++++++++++++++++++
 block/qcow2.h              |  4 ++++
 tests/qemu-iotests/060     |  2 +-
 tests/qemu-iotests/060.out |  3 ++-
 tests/qemu-iotests/066     |  2 +-
 tests/qemu-iotests/066.out |  4 ++--
 7 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 347d94b..cf18dee 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -758,7 +758,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m, Qcow2COWRegion *r)
     BDRVQcow2State *s = bs->opaque;
     int ret;
 
-    if (r->nb_bytes == 0) {
+    if (r->nb_bytes == 0 || r->reduced) {
         return 0;
     }
 
@@ -1267,10 +1267,12 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
         .cow_start = {
             .offset     = 0,
             .nb_bytes   = offset_into_cluster(s, guest_offset),
+            .reduced    = false,
         },
         .cow_end = {
             .offset     = nb_bytes,
             .nb_bytes   = avail_bytes - nb_bytes,
+            .reduced    = false,
         },
     };
     qemu_co_queue_init(&(*m)->dependent_requests);
diff --git a/block/qcow2.c b/block/qcow2.c
index b885dfc..b438f22 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -64,6 +64,9 @@ typedef struct {
 #define  QCOW2_EXT_MAGIC_BACKING_FORMAT 0xE2792ACA
 #define  QCOW2_EXT_MAGIC_FEATURE_TABLE 0x6803f857
 
+static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
+                            uint32_t count);
+
 static int qcow2_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
     const QCowHeader *cow_header = (const void *)buf;
@@ -1575,6 +1578,25 @@ fail:
     return ret;
 }
 
+static void handle_cow_reduce(BlockDriverState *bs, QCowL2Meta *m)
+{
+    if (bs->encrypted) {
+        return;
+    }
+    if (!m->cow_start.reduced && m->cow_start.nb_bytes != 0 &&
+        is_zero_sectors(bs,
+                        (m->offset + m->cow_start.offset) >> BDRV_SECTOR_BITS,
+                        m->cow_start.nb_bytes >> BDRV_SECTOR_BITS)) {
+        m->cow_start.reduced = true;
+    }
+    if (!m->cow_end.reduced && m->cow_end.nb_bytes != 0 &&
+        is_zero_sectors(bs,
+                        (m->offset + m->cow_end.offset) >> BDRV_SECTOR_BITS,
+                        m->cow_end.nb_bytes >> BDRV_SECTOR_BITS)) {
+        m->cow_end.reduced = true;
+    }
+}
+
 static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
 {
     BDRVQcow2State *s = bs->opaque;
@@ -1598,6 +1620,7 @@ static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
 
         file->total_sectors = MAX(file->total_sectors,
                                   (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
+        handle_cow_reduce(bs, m);
     }
 }
 
diff --git a/block/qcow2.h b/block/qcow2.h
index 1801dc3..ba15c08 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -305,6 +305,10 @@ typedef struct Qcow2COWRegion {
 
     /** Number of bytes to copy */
     int         nb_bytes;
+
+    /** The region is filled with zeroes and does not require COW
+     */
+    bool        reduced;
 } Qcow2COWRegion;
 
 /**
diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index 8e95c45..3a0f096 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -160,7 +160,7 @@ poke_file "$TEST_IMG" '131084' "\x00\x00" # 0x2000c
 # any unallocated cluster, leading to an attempt to overwrite the second L2
 # table. Finally, resume the COW write and see it fail (but not crash).
 echo "open -o file.driver=blkdebug $TEST_IMG
-break cow_read 0
+break write_aio 0
 aio_write 0k 1k
 wait_break 0
 write 64k 64k
diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index 9e8f5b9..ea29a32 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -107,7 +107,8 @@ qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps
 blkdebug: Suspended request '0'
 write failed: Input/output error
 blkdebug: Resuming request '0'
-aio_write failed: No medium found
+wrote 1024/1024 bytes at offset 0
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
 === Testing unallocated image header ===
 
diff --git a/tests/qemu-iotests/066 b/tests/qemu-iotests/066
index 8638217..3c216a1 100755
--- a/tests/qemu-iotests/066
+++ b/tests/qemu-iotests/066
@@ -71,7 +71,7 @@ echo
 _make_test_img $IMG_SIZE
 
 # Create data clusters (not aligned to an L2 table)
-$QEMU_IO -c 'write -P 42 1M 256k' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -P 42 $(((1024 + 32) * 1024)) 192k" "$TEST_IMG" | _filter_qemu_io
 orig_map=$($QEMU_IMG map --output=json "$TEST_IMG")
 
 # Convert the data clusters to preallocated zero clusters
diff --git a/tests/qemu-iotests/066.out b/tests/qemu-iotests/066.out
index 3d9da9b..093431e 100644
--- a/tests/qemu-iotests/066.out
+++ b/tests/qemu-iotests/066.out
@@ -19,8 +19,8 @@ Offset          Length          Mapped to       File
 === Writing to preallocated zero clusters ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67109376
-wrote 262144/262144 bytes at offset 1048576
-256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 196608/196608 bytes at offset 1081344
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 262144/262144 bytes at offset 1048576
 256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 196608/196608 bytes at offset 1081344
-- 
2.7.4


* [Qemu-devel] [PATCH v1 04/13] qcow2: preallocation at image expand
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (2 preceding siblings ...)
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 03/13] qcow2: do not COW the empty areas Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-22 19:29   ` Eric Blake
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 05/13] qcow2: set inactive flag Anton Nefedov
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Denis V. Lunev, Anton Nefedov

From: "Denis V. Lunev" <den@openvz.org>

This patch adds image preallocation at expand to provide better locality
of the QCOW2 image file and to optimize this procedure for some
distributed storages where it is slow.

Image expand requests have to be suspended until the preallocation is
performed, which is done via a special QCowL2Meta.
This meta is invisible to the handle_dependencies() code.
This is the main reason for also calling preallocation before a metadata
write: it might intersect with a preallocation triggered by another IO
request, and then it has to yield
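
The intersection test a request uses to decide whether it must wait for an
in-flight preallocation can be sketched as a standalone helper (hypothetical
code; in the patch this is the QLIST_FOREACH loop in handle_prealloc()),
using half-open byte ranges [start, end):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: do the half-open ranges [start, end) and
 * [old_start, old_end) overlap? Touching ranges do not count. */
static bool ranges_overlap(uint64_t start, uint64_t end,
                           uint64_t old_start, uint64_t old_end)
{
    return !(end <= old_start || start >= old_end);
}
```

When this returns true for any in-flight allocation, the request queues on
that allocation's dependent_requests and rechecks the whole list afterwards.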

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/qcow2-cache.c    |   3 +
 block/qcow2-cluster.c  |   5 ++
 block/qcow2-refcount.c |  14 +++++
 block/qcow2.c          | 151 +++++++++++++++++++++++++++++++++++++++++++++++++
 block/qcow2.h          |   5 ++
 5 files changed, 178 insertions(+)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 1d25147..aa9da5f 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -204,6 +204,9 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i)
         return ret;
     }
 
+    /* check and preallocate extra space if touching a fresh metadata cluster */
+    qcow2_handle_prealloc(bs, c->entries[i].offset, s->cluster_size);
+
     if (c == s->refcount_block_cache) {
         BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE_PART);
     } else if (c == s->l2_table_cache) {
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index cf18dee..a4b6d40 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -108,6 +108,9 @@ int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
         goto fail;
     }
 
+    qcow2_handle_prealloc(bs, new_l1_table_offset,
+                          QEMU_ALIGN_UP(new_l1_size2, s->cluster_size));
+
     BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_WRITE_TABLE);
     for(i = 0; i < s->l1_size; i++)
         new_l1_table[i] = cpu_to_be64(new_l1_table[i]);
@@ -1820,6 +1823,8 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
                 goto fail;
             }
 
+            qcow2_handle_prealloc(bs, offset, s->cluster_size);
+
             ret = bdrv_pwrite_zeroes(bs->file, offset, s->cluster_size, 0);
             if (ret < 0) {
                 if (cluster_type == QCOW2_CLUSTER_ZERO_PLAIN) {
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 7c06061..873a1d2 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -547,6 +547,8 @@ static int alloc_refcount_block(BlockDriverState *bs,
     }
 
     /* Write refcount blocks to disk */
+    qcow2_handle_prealloc(bs, meta_offset, blocks_clusters * s->cluster_size);
+
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE_BLOCKS);
     ret = bdrv_pwrite_sync(bs->file, meta_offset, new_blocks,
         blocks_clusters * s->cluster_size);
@@ -561,6 +563,10 @@ static int alloc_refcount_block(BlockDriverState *bs,
         cpu_to_be64s(&new_table[i]);
     }
 
+    qcow2_handle_prealloc(bs, table_offset,
+                          QEMU_ALIGN_UP(table_size * sizeof(uint64_t),
+                                        s->cluster_size));
+
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE_TABLE);
     ret = bdrv_pwrite_sync(bs->file, table_offset, new_table,
         table_size * sizeof(uint64_t));
@@ -2104,6 +2110,8 @@ write_refblocks:
             goto fail;
         }
 
+        qcow2_handle_prealloc(bs, refblock_offset, s->cluster_size);
+
         /* The size of *refcount_table is always cluster-aligned, therefore the
          * write operation will not overflow */
         on_disk_refblock = (void *)((char *) *refcount_table +
@@ -2158,6 +2166,8 @@ write_refblocks:
     }
 
     assert(reftable_size < INT_MAX / sizeof(uint64_t));
+    qcow2_handle_prealloc(bs, reftable_offset,
+                          reftable_size * sizeof(uint64_t));
     ret = bdrv_pwrite(bs->file, reftable_offset, on_disk_reftable,
                       reftable_size * sizeof(uint64_t));
     if (ret < 0) {
@@ -2845,6 +2855,10 @@ int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
         cpu_to_be64s(&new_reftable[i]);
     }
 
+    qcow2_handle_prealloc(bs, new_reftable_offset,
+                          QEMU_ALIGN_UP(new_reftable_size * sizeof(uint64_t),
+                                        s->cluster_size));
+
     ret = bdrv_pwrite(bs->file, new_reftable_offset, new_reftable,
                       new_reftable_size * sizeof(uint64_t));
 
diff --git a/block/qcow2.c b/block/qcow2.c
index b438f22..6e7ce96 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -464,6 +464,11 @@ static QemuOptsList qcow2_runtime_opts = {
             .type = QEMU_OPT_NUMBER,
             .help = "Clean unused cache entries after this time (in seconds)",
         },
+        {
+            .name = QCOW2_OPT_PREALLOC_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "Preallocation amount at image expand",
+        },
         { /* end of list */ }
     },
 };
@@ -754,6 +759,13 @@ static int qcow2_update_options_prepare(BlockDriverState *bs,
     r->discard_passthrough[QCOW2_DISCARD_OTHER] =
         qemu_opt_get_bool(opts, QCOW2_OPT_DISCARD_OTHER, false);
 
+    s->prealloc_size =
+        ROUND_UP(qemu_opt_get_size_del(opts, QCOW2_OPT_PREALLOC_SIZE, 0),
+                 s->cluster_size);
+    if (s->prealloc_size && bs->file->bs->drv->bdrv_co_pwrite_zeroes == NULL) {
+        s->prealloc_size = 0;
+    }
+
     ret = 0;
 fail:
     qemu_opts_del(opts);
@@ -1597,6 +1609,135 @@ static void handle_cow_reduce(BlockDriverState *bs, QCowL2Meta *m)
     }
 }
 
+/*
+ * Checks that the host space area specified by @m is not being preallocated
+ * at the moment, and does co_queue_wait() if it is.
+ * If the specified area is not allocated yet, allocates it + prealloc_size
+ * bytes ahead.
+ *
+ * Returns
+ *   true if the space is allocated and contains zeroes
+ */
+static bool coroutine_fn handle_prealloc(BlockDriverState *bs,
+                                         const QCowL2Meta *m)
+{
+    BDRVQcow2State *s = bs->opaque;
+    BlockDriverState *file = bs->file->bs;
+    QCowL2Meta *old, *meta;
+    uint64_t start = m->alloc_offset;
+    uint64_t end = start + (m->nb_clusters << s->cluster_bits);
+    uint64_t nbytes;
+    int err;
+
+    assert(offset_into_cluster(s, start) == 0);
+
+restart:
+    /* check that the request is not overlapped with any
+       currently running preallocations */
+    QLIST_FOREACH(old, &s->cluster_allocs, next_in_flight) {
+        uint64_t old_start, old_end;
+
+        old_start = old->alloc_offset;
+        old_end = old_start + (old->nb_clusters << s->cluster_bits);
+
+        if (old == m || end <= old_start || start >= old_end) {
+            /* No intersection */
+            continue;
+        }
+
+        qemu_co_queue_wait(&old->dependent_requests, NULL);
+        goto restart;
+    }
+
+    if (end <= bdrv_getlength(file)) {
+        /* No need to care, file size will not be changed */
+        return false;
+    }
+
+    meta = g_alloca(sizeof(*meta));
+    *meta = (QCowL2Meta) {
+        /* this meta is invisible for handle_dependencies() */
+        .alloc_offset   = bdrv_getlength(file),
+        .nb_clusters    = size_to_clusters(s, start +
+                (m->nb_clusters << s->cluster_bits) +
+                s->prealloc_size - bdrv_getlength(file)),
+    };
+    qemu_co_queue_init(&meta->dependent_requests);
+    QLIST_INSERT_HEAD(&s->cluster_allocs, meta, next_in_flight);
+
+    nbytes = meta->nb_clusters << s->cluster_bits;
+
+    /* try to alloc host space in one chunk for better locality */
+    err = file->drv->bdrv_co_pwrite_zeroes(file, meta->alloc_offset, nbytes, 0);
+
+    QLIST_REMOVE(meta, next_in_flight);
+    qemu_co_queue_restart_all(&meta->dependent_requests);
+
+    if (err == 0) {
+        file->total_sectors =
+            MAX(file->total_sectors,
+                (meta->alloc_offset + nbytes) / BDRV_SECTOR_SIZE);
+        return start >= meta->alloc_offset;
+    }
+    return false;
+}
+
+typedef struct {
+    BlockDriverState *bs;
+    uint64_t offset;
+    uint64_t size;
+    int ret;
+} PreallocCo;
+
+static void coroutine_fn handle_prealloc_co_entry(void* opaque)
+{
+    PreallocCo *prco = opaque;
+    BDRVQcow2State *s = prco->bs->opaque;
+    QCowL2Meta meta = {
+        /* this meta is invisible for handle_dependencies() */
+        .alloc_offset = prco->offset,
+        .nb_clusters  = size_to_clusters(s, prco->size)
+    };
+    handle_prealloc(prco->bs, &meta);
+    prco->ret = 0;
+}
+
+/*
+ * Context(coroutine)-independent interface around handle_prealloc(), see
+ * its description.
+ * Must be called on a first write on the newly allocated cluster(s).
+ * @offset and @size must be cluster_aligned
+ */
+void qcow2_handle_prealloc(BlockDriverState *bs, uint64_t offset, uint64_t size)
+{
+    BDRVQcow2State *s = bs->opaque;
+    PreallocCo prco = {
+        .bs     = bs,
+        .offset = offset,
+        .size   = size,
+        .ret    = -EAGAIN
+    };
+
+    assert(offset_into_cluster(s, offset) == 0);
+    assert(offset_into_cluster(s, size) == 0);
+
+    if (s->prealloc_size == 0 ||
+        bs->file->bs->drv->bdrv_co_pwrite_zeroes == NULL) {
+        return;
+    }
+
+    if (qemu_in_coroutine()) {
+        handle_prealloc_co_entry(&prco);
+    } else {
+        AioContext *aio_context = bdrv_get_aio_context(bs);
+        Coroutine *co = qemu_coroutine_create(handle_prealloc_co_entry, &prco);
+        qemu_coroutine_enter(co);
+        while (prco.ret == -EAGAIN) {
+            aio_poll(aio_context, true);
+        }
+    }
+}
+
 static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
 {
     BDRVQcow2State *s = bs->opaque;
@@ -1607,6 +1748,11 @@ static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
     for (m = l2meta; m != NULL; m = m->next) {
         uint64_t bytes = m->nb_clusters << s->cluster_bits;
 
+        if (s->prealloc_size != 0 && handle_prealloc(bs, m)) {
+            handle_cow_reduce(bs, m);
+            continue;
+        }
+
         if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
             continue;
         }
@@ -2725,6 +2871,11 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, uint64_t offset,
         goto fail;
     }
 
+    qcow2_handle_prealloc(bs, start_of_cluster(s, cluster_offset),
+                          QEMU_ALIGN_UP(
+                              offset_into_cluster(s, cluster_offset) + out_len,
+                              s->cluster_size));
+
     iov = (struct iovec) {
         .iov_base   = out_buf,
         .iov_len    = out_len,
diff --git a/block/qcow2.h b/block/qcow2.h
index ba15c08..a0d222d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -97,6 +97,7 @@
 #define QCOW2_OPT_L2_CACHE_SIZE "l2-cache-size"
 #define QCOW2_OPT_REFCOUNT_CACHE_SIZE "refcount-cache-size"
 #define QCOW2_OPT_CACHE_CLEAN_INTERVAL "cache-clean-interval"
+#define QCOW2_OPT_PREALLOC_SIZE "prealloc-size"
 
 typedef struct QCowHeader {
     uint32_t magic;
@@ -294,6 +295,8 @@ typedef struct BDRVQcow2State {
      * override) */
     char *image_backing_file;
     char *image_backing_format;
+
+    uint64_t prealloc_size;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
@@ -493,6 +496,8 @@ int qcow2_mark_dirty(BlockDriverState *bs);
 int qcow2_mark_corrupt(BlockDriverState *bs);
 int qcow2_mark_consistent(BlockDriverState *bs);
 int qcow2_update_header(BlockDriverState *bs);
+void qcow2_handle_prealloc(BlockDriverState *bs,
+                           uint64_t offset, uint64_t size);
 
 void qcow2_signal_corruption(BlockDriverState *bs, bool fatal, int64_t offset,
                              int64_t size, const char *message_format, ...)
-- 
2.7.4


* [Qemu-devel] [PATCH v1 05/13] qcow2: set inactive flag
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (3 preceding siblings ...)
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 04/13] qcow2: preallocation at image expand Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-26  8:11   ` Kevin Wolf
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 06/13] qcow2: truncate preallocated space Anton Nefedov
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Anton Nefedov

The BDRVQcow2State and BlockDriverState flags have to be kept in sync

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/qcow2.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 6e7ce96..07c1706 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1939,6 +1939,7 @@ static int qcow2_inactivate(BlockDriverState *bs)
 
     if (result == 0) {
         qcow2_mark_clean(bs);
+        s->flags |= BDRV_O_INACTIVE;
     }
 
     return result;
-- 
2.7.4


* [Qemu-devel] [PATCH v1 06/13] qcow2: truncate preallocated space
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (4 preceding siblings ...)
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 05/13] qcow2: set inactive flag Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 07/13] qcow2: check space leak at the end of the image Anton Nefedov
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Denis V. Lunev, Anton Nefedov

From: "Denis V. Lunev" <den@openvz.org>

The preallocated but unused space at the end of the image file can be
truncated when the image is closed. This becomes possible once the end of
data and metadata in the qcow2 image is tracked.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/qcow2-cluster.c  | 9 +++++++++
 block/qcow2-refcount.c | 7 +++++++
 block/qcow2.c          | 8 ++++++++
 block/qcow2.h          | 3 +++
 4 files changed, 27 insertions(+)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index a4b6d40..b2879b9 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1957,3 +1957,12 @@ fail:
     g_free(l1_table);
     return ret;
 }
+
+void qcow2_update_data_end(BlockDriverState *bs, uint64_t off)
+{
+    BDRVQcow2State *s = bs->opaque;
+
+    if (s->data_end < off) {
+        s->data_end = off;
+    }
+}
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 873a1d2..8156466 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -744,6 +744,9 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
             ret = alloc_refcount_block(bs, cluster_index, &refcount_block);
             if (ret < 0) {
                 goto fail;
+            } else {
+                qcow2_update_data_end(bs, s->refcount_table_offset +
+                        s->refcount_table_size * sizeof(uint64_t));
             }
         }
         old_table_index = table_index;
@@ -865,6 +868,8 @@ retry:
         s->free_cluster_index - 1 > (INT64_MAX >> s->cluster_bits))
     {
         return -EFBIG;
+    } else {
+        qcow2_update_data_end(bs, s->free_cluster_index << s->cluster_bits);
     }
 
 #ifdef DEBUG_ALLOC2
@@ -929,6 +934,8 @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
 
     if (ret < 0) {
         return ret;
+    } else {
+        qcow2_update_data_end(bs, offset + (nb_clusters << s->cluster_bits));
     }
 
     return i;
diff --git a/block/qcow2.c b/block/qcow2.c
index 07c1706..7b4359b 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1192,6 +1192,8 @@ static int qcow2_do_open(BlockDriverState *bs, QDict *options, int flags,
         }
     }
 
+    s->data_end = bdrv_getlength(bs->file->bs);
+
 #ifdef DEBUG_ALLOC
     {
         BdrvCheckResult result = {0};
@@ -1948,12 +1950,18 @@ static int qcow2_inactivate(BlockDriverState *bs)
 static void qcow2_close(BlockDriverState *bs)
 {
     BDRVQcow2State *s = bs->opaque;
+
     qemu_vfree(s->l1_table);
     /* else pre-write overlap checks in cache_destroy may crash */
     s->l1_table = NULL;
 
     if (!(s->flags & BDRV_O_INACTIVE)) {
         qcow2_inactivate(bs);
+
+        /* truncate preallocated space */
+        if (!bs->read_only && s->data_end < bdrv_getlength(bs->file->bs)) {
+            bdrv_truncate(bs->file, s->data_end, NULL);
+        }
     }
 
     cache_clean_timer_del(bs);
diff --git a/block/qcow2.h b/block/qcow2.h
index a0d222d..e28c54a 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -297,6 +297,7 @@ typedef struct BDRVQcow2State {
     char *image_backing_format;
 
     uint64_t prealloc_size;
+    uint64_t data_end;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
@@ -607,4 +608,6 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
     void **table);
 void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
 
+void qcow2_update_data_end(BlockDriverState *bs, uint64_t off);
+
 #endif
-- 
2.7.4


* [Qemu-devel] [PATCH v1 07/13] qcow2: check space leak at the end of the image
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (5 preceding siblings ...)
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 06/13] qcow2: truncate preallocated space Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 08/13] qcow2: handle_prealloc(): find out if area zeroed by earlier preallocation Anton Nefedov
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: den, kwolf, mreitz, Pavel Butsykin, Denis V . Lunev, Anton Nefedov

From: Pavel Butsykin <pbutsykin@virtuozzo.com>

Space preallocated in the image may remain unused after a crash. This patch
adds the ability for qcow2_check to identify and fix such a leak, so that
the extra space is not kept allocated on the host.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/qcow2.c                      |  31 +++++++++++
 tests/qemu-iotests/026.out         | 104 ++++++++++++++++++++++++++++---------
 tests/qemu-iotests/026.out.nocache | 104 ++++++++++++++++++++++++++++---------
 tests/qemu-iotests/029.out         |   5 +-
 tests/qemu-iotests/060.out         |  10 +++-
 tests/qemu-iotests/061.out         |   5 +-
 tests/qemu-iotests/066.out         |   5 +-
 tests/qemu-iotests/098.out         |   7 ++-
 tests/qemu-iotests/108.out         |   5 +-
 tests/qemu-iotests/112.out         |   5 +-
 10 files changed, 225 insertions(+), 56 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 7b4359b..503f0dc 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -322,6 +322,32 @@ int qcow2_mark_consistent(BlockDriverState *bs)
     return 0;
 }
 
+static int qcow2_check_extra_preallocation(BlockDriverState *bs,
+    BdrvCheckResult *res, BdrvCheckMode fix)
+{
+    BDRVQcow2State *s = bs->opaque;
+    uint64_t img_size = bdrv_getlength(bs->file->bs);
+
+    if (res->image_end_offset < img_size) {
+        uint64_t count =
+            DIV_ROUND_UP(img_size - res->image_end_offset, s->cluster_size);
+        fprintf(stderr, "%s space leaked at the end of the image %" PRIu64 "\n",
+                fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR",
+                img_size - res->image_end_offset);
+        res->leaks += count;
+        if (fix & BDRV_FIX_LEAKS) {
+            int ret = bdrv_truncate(bs->file, res->image_end_offset, NULL);
+            if (ret < 0) {
+                res->check_errors++;
+                return ret;
+            }
+            res->leaks_fixed += count;
+        }
+    }
+
+    return 0;
+}
+
 static int qcow2_check(BlockDriverState *bs, BdrvCheckResult *result,
                        BdrvCheckMode fix)
 {
@@ -330,6 +356,11 @@ static int qcow2_check(BlockDriverState *bs, BdrvCheckResult *result,
         return ret;
     }
 
+    ret = qcow2_check_extra_preallocation(bs, result, fix);
+    if (ret < 0) {
+        return ret;
+    }
+
     if (fix && result->check_errors == 0 && result->corruptions == 0) {
         ret = qcow2_mark_clean(bs);
         if (ret < 0) {
diff --git a/tests/qemu-iotests/026.out b/tests/qemu-iotests/026.out
index 86a50a2..e8cf348 100644
--- a/tests/qemu-iotests/026.out
+++ b/tests/qemu-iotests/026.out
@@ -5,7 +5,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 5; imm: off; once: on; write
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 5; imm: off; once: on; write -b
@@ -33,7 +36,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 28; imm: off; once: on; write -b
@@ -181,7 +187,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 5; imm: off; once: on; write
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 5; imm: off; once: on; write -b
@@ -207,7 +216,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 28; imm: off; once: on; write -b
@@ -468,20 +480,27 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 394240
+
+770 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 5120
 
-55 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -489,19 +508,26 @@ Event: refblock_alloc_hookup; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 265728
 
-251 leaked clusters were found on the image.
+770 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 394240
+
+770 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: off; write
@@ -520,20 +546,27 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -541,27 +574,35 @@ Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -569,27 +610,35 @@ Event: refblock_alloc_write_table; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -597,8 +646,9 @@ Event: refblock_alloc_switch_table; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 
 === L1 growth tests ===
@@ -651,7 +701,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 5; imm: off; once: on
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 98304
+
+96 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 5; imm: off; once: off
@@ -665,7 +718,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 28; imm: off; once: on
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 98304
+
+96 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 28; imm: off; once: off
diff --git a/tests/qemu-iotests/026.out.nocache b/tests/qemu-iotests/026.out.nocache
index ea2e166..820ce0f 100644
--- a/tests/qemu-iotests/026.out.nocache
+++ b/tests/qemu-iotests/026.out.nocache
@@ -5,7 +5,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 5; imm: off; once: on; write 
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l1_update; errno: 5; imm: off; once: on; write -b
@@ -33,7 +36,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l1_update; errno: 28; imm: off; once: on; write -b
@@ -189,7 +195,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 5; imm: off; once: on; write 
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l2_alloc_write; errno: 5; imm: off; once: on; write -b
@@ -215,7 +224,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l2_alloc_write; errno: 28; imm: off; once: on; write -b
@@ -476,20 +488,27 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 394240
+
+770 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: off; write 
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 5120
 
-55 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -497,19 +516,26 @@ Event: refblock_alloc_hookup; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 265728
 
-251 leaked clusters were found on the image.
+770 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 394240
+
+770 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: off; write 
@@ -528,20 +554,27 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: off; write 
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -549,27 +582,35 @@ Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: off; write 
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -577,27 +618,35 @@ Event: refblock_alloc_write_table; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: off; write 
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -605,8 +654,9 @@ Event: refblock_alloc_switch_table; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 
 === L1 growth tests ===
@@ -659,7 +709,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 5; imm: off; once: on
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 98304
+
+96 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l1_grow_activate_table; errno: 5; imm: off; once: off
@@ -673,7 +726,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 28; imm: off; once: on
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 98304
+
+96 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l1_grow_activate_table; errno: 28; imm: off; once: off
diff --git a/tests/qemu-iotests/029.out b/tests/qemu-iotests/029.out
index 5bc93e0..7176e38 100644
--- a/tests/qemu-iotests/029.out
+++ b/tests/qemu-iotests/029.out
@@ -6,7 +6,10 @@ is smaller than the current L1 table.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 wrote 4096/4096 bytes at offset 0
 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-No errors were found on the image.
+ERROR space leaked at the end of the image 65536
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=16777216
 wrote 4194304/4194304 bytes at offset 0
 4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index ea29a32..7936938 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -77,20 +77,26 @@ Leaked cluster 9 refcount=1 reference=0
 Repairing cluster 4 refcount=1 reference=2
 Repairing cluster 9 refcount=1 reference=0
 Repairing OFLAG_COPIED data cluster: l2_entry=8000000000040000 refcount=2
+ERROR space leaked at the end of the image 65536
 The following inconsistencies were found and repaired:
 
     1 leaked clusters
     2 corruptions
 
 Double checking the fixed image now...
-No errors were found on the image.
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 incompatible_features     0x0
 wrote 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 incompatible_features     0x0
 read 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-No errors were found on the image.
+ERROR space leaked at the end of the image 131072
+
+2 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 read 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index a431b7f..69fb929 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -328,7 +328,10 @@ wrote 131072/131072 bytes at offset 0
 No errors were found on the image.
 read 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-No errors were found on the image.
+ERROR space leaked at the end of the image 196608
+
+3 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 read 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
diff --git a/tests/qemu-iotests/066.out b/tests/qemu-iotests/066.out
index 093431e..81ef795 100644
--- a/tests/qemu-iotests/066.out
+++ b/tests/qemu-iotests/066.out
@@ -13,7 +13,10 @@ discard 67109376/67109376 bytes at offset 0
 64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 read 67109376/67109376 bytes at offset 0
 64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-No errors were found on the image.
+ERROR space leaked at the end of the image 327680
+
+5 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Offset          Length          Mapped to       File
 
 === Writing to preallocated zero clusters ===
diff --git a/tests/qemu-iotests/098.out b/tests/qemu-iotests/098.out
index 7634d0e..733249f 100644
--- a/tests/qemu-iotests/098.out
+++ b/tests/qemu-iotests/098.out
@@ -20,7 +20,10 @@ Leaked cluster 4 refcount=1 reference=0
 Leaked cluster 5 refcount=1 reference=0
 Repairing cluster 4 refcount=1 reference=0
 Repairing cluster 5 refcount=1 reference=0
-No errors were found on the image.
+ERROR space leaked at the end of the image 131072
+
+2 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 
 === reftable_update ===
 
@@ -34,6 +37,7 @@ ERROR cluster 1 refcount=0 reference=1
 ERROR cluster 3 refcount=0 reference=1
 Rebuilding refcount structure
 Repairing cluster 1 refcount=1 reference=0
+Repairing space leaked at the end of the image 65536
 No errors were found on the image.
 
 === refblock_alloc ===
@@ -48,5 +52,6 @@ ERROR cluster 1 refcount=0 reference=1
 ERROR cluster 3 refcount=0 reference=1
 Rebuilding refcount structure
 Repairing cluster 1 refcount=1 reference=0
+Repairing space leaked at the end of the image 65536
 No errors were found on the image.
 *** done
diff --git a/tests/qemu-iotests/108.out b/tests/qemu-iotests/108.out
index 75bab8d..d178b1e 100644
--- a/tests/qemu-iotests/108.out
+++ b/tests/qemu-iotests/108.out
@@ -68,13 +68,16 @@ Rebuilding refcount structure
 Repairing cluster 1 refcount=1 reference=0
 Repairing cluster 2 refcount=1 reference=0
 Repairing cluster 16 refcount=1 reference=0
+ERROR space leaked at the end of the image 720896
 The following inconsistencies were found and repaired:
 
     0 leaked clusters
     2 corruptions
 
 Double checking the fixed image now...
-No errors were found on the image.
+
+11 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 
 --- Signed overflow after the refblock ---
 
diff --git a/tests/qemu-iotests/112.out b/tests/qemu-iotests/112.out
index 81b04d1..7386be8 100644
--- a/tests/qemu-iotests/112.out
+++ b/tests/qemu-iotests/112.out
@@ -93,7 +93,10 @@ refcount bits: 1
 
 === Amend from refcount_bits=1 to refcount_bits=64 ===
 
-No errors were found on the image.
+ERROR space leaked at the end of the image 131072
+
+2 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 refcount bits: 64
 
 === Amend to compat=0.10 ===
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH v1 08/13] qcow2: handle_prealloc(): find out if area zeroed by earlier preallocation
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (6 preceding siblings ...)
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 07/13] qcow2: check space leak at the end of the image Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 09/13] qcow2: fix misleading comment about L2 linking Anton Nefedov
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Anton Nefedov, Denis V . Lunev

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c | 2 ++
 block/qcow2.c         | 8 +++++++-
 block/qcow2.h         | 4 ++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index b2879b9..25210cd 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1143,6 +1143,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
     uint64_t *host_offset, uint64_t *bytes, QCowL2Meta **m)
 {
     BDRVQcow2State *s = bs->opaque;
+    const uint64_t old_data_end = s->data_end;
     int l2_index;
     uint64_t *l2_table;
     uint64_t entry;
@@ -1264,6 +1265,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
         .alloc_offset   = alloc_cluster_offset,
         .offset         = start_of_cluster(s, guest_offset),
         .nb_clusters    = nb_clusters,
+        .clusters_are_trailing = alloc_cluster_offset >= old_data_end,
 
         .keep_old_clusters  = keep_old_clusters,
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 503f0dc..97a66a0 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1684,7 +1684,13 @@ restart:
 
     if (end <= bdrv_getlength(file)) {
         /* No need to care, file size will not be changed */
-        return false;
+
+        /* It is safe to assume that the area is zeroed if it
+         * was allocated at the end of data (s->data_end).
+         * In this case, the only way for the file length to be bigger is
+         * that the area was preallocated by another request.
+         */
+        return m->clusters_are_trailing;
     }
 
     meta = g_alloca(sizeof(*meta));
diff --git a/block/qcow2.h b/block/qcow2.h
index e28c54a..2fd8510 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -333,6 +333,10 @@ typedef struct QCowL2Meta
     /** Do not free the old clusters */
     bool keep_old_clusters;
 
+    /** True if the area is allocated after the end of the data area
+     *  (i.e. >= s->data_end), which means that it is zeroed */
+    bool clusters_are_trailing;
+
     /**
      * Requests that overlap with this allocation and wait to be restarted
      * when the allocating request has completed.
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH v1 09/13] qcow2: fix misleading comment about L2 linking
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (7 preceding siblings ...)
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 08/13] qcow2: handle_prealloc(): find out if area zeroed by earlier preallocation Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 10/13] qcow2-cluster: slightly refactor handle_dependencies() Anton Nefedov
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Anton Nefedov, Denis V . Lunev

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 25210cd..4204db9 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -827,12 +827,10 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
 
     assert(l2_index + m->nb_clusters <= s->l2_size);
     for (i = 0; i < m->nb_clusters; i++) {
-        /* if two concurrent writes happen to the same unallocated cluster
-         * each write allocates separate cluster and writes data concurrently.
-         * The first one to complete updates l2 table with pointer to its
-         * cluster the second one has to do RMW (which is done above by
-         * perform_cow()), update l2 table with its cluster pointer and free
-         * old cluster. This is what this loop does */
+        /* handle_dependencies() protects against normal cluster allocation
+         * collisions; still, the L2 entry might be non-zero in case of zero or
+         * compressed cluster reuse, or when writing over a snapshot
+         */
         if (l2_table[l2_index + i] != 0) {
             old_cluster[j++] = l2_table[l2_index + i];
         }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH v1 10/13] qcow2-cluster: slightly refactor handle_dependencies()
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (8 preceding siblings ...)
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 09/13] qcow2: fix misleading comment about L2 linking Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 11/13] qcow2-cluster: make handle_dependencies() logic easier to follow Anton Nefedov
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Anton Nefedov, Denis V . Lunev

  - assert the alignment on return if the allocation has to stop
    (at the start of a running allocation)
  - make use of const specifiers for local variables
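The alignment invariant asserted by the first bullet can be sketched as
follows. This is a minimal stand-in, not the QEMU code: the power-of-two
cluster size and the helper names are assumptions mirroring qcow2's
start_of_cluster().

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for qcow2's start_of_cluster(): round an offset down
 * to its cluster boundary, assuming cluster_size is a power of two. */
static uint64_t start_of_cluster(uint64_t cluster_size, uint64_t offset)
{
    return offset & ~(cluster_size - 1);
}

/* The asserted invariant: when allocation stops at the start of a running
 * allocation, the stop point coincides with a cluster boundary. */
static int is_cluster_aligned(uint64_t cluster_size, uint64_t offset)
{
    return offset == start_of_cluster(cluster_size, offset);
}
```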

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 4204db9..03d6f7e 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -900,15 +900,15 @@ out:
  * Check if there already is an AIO write request in flight which allocates
  * the same cluster. In this case we need to wait until the previous
  * request has completed and updated the L2 table accordingly.
- *
  * Returns:
  *   0       if there was no dependency. *cur_bytes indicates the number of
  *           bytes from guest_offset that can be read before the next
- *           dependency must be processed (or the request is complete)
+ *           dependency must be processed (or the request is complete).
+ *           *m is not modified
  *
- *   -EAGAIN if we had to wait for another request, previously gathered
- *           information on cluster allocation may be invalid now. The caller
- *           must start over anyway, so consider *cur_bytes undefined.
+ *   -EAGAIN if we had to wait for another request. The caller
+ *           must start over, so consider *cur_bytes undefined.
+ *           *m is not modified
  */
 static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
     uint64_t *cur_bytes, QCowL2Meta **m)
@@ -919,10 +919,10 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
 
     QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight) {
 
-        uint64_t start = guest_offset;
-        uint64_t end = start + bytes;
-        uint64_t old_start = l2meta_cow_start(old_alloc);
-        uint64_t old_end = l2meta_cow_end(old_alloc);
+        const uint64_t start = guest_offset;
+        const uint64_t end = start + bytes;
+        const uint64_t old_start = l2meta_cow_start(old_alloc);
+        const uint64_t old_end = l2meta_cow_end(old_alloc);
 
         if (end <= old_start || start >= old_end) {
             /* No intersection */
@@ -939,6 +939,8 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
              * and deal with requests depending on them before starting to
              * gather new ones. Not worth the trouble. */
             if (bytes == 0 && *m) {
+                /* start must be cluster aligned at this point */
+                assert(start == start_of_cluster(s, start));
                 *cur_bytes = 0;
                 return 0;
             }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH v1 11/13] qcow2-cluster: make handle_dependencies() logic easier to follow
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (9 preceding siblings ...)
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 10/13] qcow2-cluster: slightly refactor handle_dependencies() Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-22 19:37   ` Eric Blake
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 12/13] qcow2: allow concurrent unaligned writes to the same clusters Anton Nefedov
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Anton Nefedov, Denis V . Lunev

Avoid complicated nested conditions; return or continue as soon as possible instead.
The logic is unchanged.
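As a rough sketch of the flattened control flow: each mutually exclusive
case ends with an early continue or return instead of nesting in else
branches. The interval representation and names below are simplified
stand-ins for the real walk over s->cluster_allocs, not the actual QEMU
code.

```c
#include <stdint.h>

typedef struct {
    uint64_t start, end;   /* guest range covered by an in-flight allocation */
} Interval;

/* Returns 0 on success (possibly shrinking *bytes so the request stops
 * before a running allocation), or -11 (standing in for -EAGAIN) when the
 * caller would have to wait and start over. */
static int check_dependencies(const Interval *allocs, int n,
                              uint64_t start, uint64_t *bytes, int have_meta)
{
    for (int i = 0; i < n; i++) {
        uint64_t end = start + *bytes;
        if (end <= allocs[i].start || start >= allocs[i].end) {
            continue;                              /* no intersection */
        }
        if (start < allocs[i].start) {
            *bytes = allocs[i].start - start;      /* stop at running alloc */
            continue;                              /* keep checking others */
        }
        if (have_meta) {
            *bytes = 0;                            /* stop gathering */
            return 0;
        }
        return -11;                                /* would wait: -EAGAIN */
    }
    return 0;
}
```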

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c | 45 ++++++++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 03d6f7e..c0974e8 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -926,32 +926,31 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
 
         if (end <= old_start || start >= old_end) {
             /* No intersection */
-        } else {
-            if (start < old_start) {
-                /* Stop at the start of a running allocation */
-                bytes = old_start - start;
-            } else {
-                bytes = 0;
-            }
+            continue;
+        }
 
-            /* Stop if already an l2meta exists. After yielding, it wouldn't
-             * be valid any more, so we'd have to clean up the old L2Metas
-             * and deal with requests depending on them before starting to
-             * gather new ones. Not worth the trouble. */
-            if (bytes == 0 && *m) {
-                /* start must be cluster aligned at this point */
-                assert(start == start_of_cluster(s, start));
-                *cur_bytes = 0;
-                return 0;
-            }
+        if (start < old_start) {
+            /* Stop at the start of a running allocation */
+            bytes = old_start - start;
+            /* ..if there is no other conflict, keep checking */
+            continue;
+        }
 
-            if (bytes == 0) {
-                /* Wait for the dependency to complete. We need to recheck
-                 * the free/allocated clusters when we continue. */
-                qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
-                return -EAGAIN;
-            }
+        /* Stop if already an l2meta exists. After yielding, it wouldn't
+         * be valid any more, so we'd have to clean up the old L2Metas
+         * and deal with requests depending on them before starting to
+         * gather new ones. Not worth the trouble. */
+        if (*m) {
+            /* start must be cluster aligned at this point */
+            assert(start == start_of_cluster(s, start));
+            *cur_bytes = 0;
+            return 0;
         }
+
+        /* Wait for the dependency to complete. We need to recheck
+         * the free/allocated clusters when we continue. */
+        qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
+        return -EAGAIN;
     }
 
     /* Make sure that existing clusters and new allocations are only used up to
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH v1 12/13] qcow2: allow concurrent unaligned writes to the same clusters
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (10 preceding siblings ...)
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 11/13] qcow2-cluster: make handle_dependencies() logic easier to follow Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 13/13] iotest 046: test simultaneous cluster write error case Anton Nefedov
  2017-05-23 14:35 ` [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Eric Blake
  13 siblings, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Anton Nefedov, Denis V . Lunev

If the COW area of a write request to an unallocated cluster is empty,
concurrent write requests can be allowed with a little extra
synchronization, so they don't have to wait until L2 is updated.

Let qcow2_cluster.c::handle_dependencies() do most of the job:
  if there is an in-flight request to the same cluster,
  and the current request wants to write in its COW area,
  and that COW area is marked empty,
  - steal the allocated offset and write concurrently. Let the original
    request update L2 later, at its own pace.

This gives an improvement for parallel misaligned writes to
unallocated clusters with no backing data:

HDD fio over xfs iodepth=4:
  seqwrite 4k:   18400 -> 22800 IOPS ( x1.24 )
  seqwrite 68k:   1600 ->  2300 IOPS ( x1.44 )
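The conflict test described above can be sketched as follows. This is a
simplified model of the checks added to handle_dependencies(); the struct
and field names are stand-ins, not the actual QCowL2Meta layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of an in-flight allocation: its data area plus flags
 * saying whether its head/tail COW areas were found empty ("reduced"). */
typedef struct {
    uint64_t data_start, data_end;   /* old request's data area */
    bool cow_start_reduced;          /* head COW area known to be zeroes */
    bool cow_end_reduced;            /* tail COW area known to be zeroes */
} OldAlloc;

/* A new write [start, end) must wait unless it avoids the old request's
 * data area and only touches COW areas the old request gave up on;
 * otherwise it may proceed concurrently ("piggyback"). */
static bool must_wait(const OldAlloc *old, uint64_t start, uint64_t end)
{
    bool in_data = end > old->data_start && start < old->data_end;
    bool in_cow_head = start < old->data_start && !old->cow_start_reduced;
    bool in_cow_tail = start > old->data_start && !old->cow_end_reduced;
    return in_data || in_cow_head || in_cow_tail;
}
```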

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c | 169 +++++++++++++++++++++++++++++++++++++++++++-------
 block/qcow2.c         |  28 ++++++++-
 block/qcow2.h         |  12 +++-
 3 files changed, 181 insertions(+), 28 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index c0974e8..7cffdd4 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -898,20 +898,32 @@ out:
 
 /*
  * Check if there already is an AIO write request in flight which allocates
- * the same cluster. In this case we need to wait until the previous
- * request has completed and updated the L2 table accordingly.
+ * the same cluster.
+ * In this case, check if that request has explicitly allowed writing
+ * in its COW area(s).
+ *   If yes - fill the meta to point to the same cluster.
+ *   If no  - we need to wait until the previous request has completed and
+ *            updated the L2 table accordingly or
+ *            has allowed writing in its COW area(s).
  * Returns:
  *   0       if there was no dependency. *cur_bytes indicates the number of
  *           bytes from guest_offset that can be read before the next
  *           dependency must be processed (or the request is complete).
- *           *m is not modified
+ *           *m, *host_offset are not modified
+ *
+ *   1       if there is a dependency but it is possible to write concurrently
+ *           *m is filled accordingly,
+ *           *cur_bytes may have decreased and describes
+ *             the length of the area that can be written to,
+ *           *host_offset contains the starting host image offset to write to
  *
  *   -EAGAIN if we had to wait for another request. The caller
- *           must start over, so consider *cur_bytes undefined.
+ *           must start over, so consider *cur_bytes and *host_offset undefined.
  *           *m is not modified
  */
 static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
-    uint64_t *cur_bytes, QCowL2Meta **m)
+                               uint64_t *host_offset, uint64_t *cur_bytes,
+                               QCowL2Meta **m)
 {
     BDRVQcow2State *s = bs->opaque;
     QCowL2Meta *old_alloc;
@@ -924,7 +936,7 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
         const uint64_t old_start = l2meta_cow_start(old_alloc);
         const uint64_t old_end = l2meta_cow_end(old_alloc);
 
-        if (end <= old_start || start >= old_end) {
+        if (end <= old_start || start >= old_end || old_alloc->piggybacked) {
             /* No intersection */
             continue;
         }
@@ -936,21 +948,95 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
             continue;
         }
 
-        /* Stop if already an l2meta exists. After yielding, it wouldn't
-         * be valid any more, so we'd have to clean up the old L2Metas
-         * and deal with requests depending on them before starting to
-         * gather new ones. Not worth the trouble. */
-        if (*m) {
+        /* offsets of the cluster we're intersecting in */
+        const uint64_t cluster_start = start_of_cluster(s, start);
+        const uint64_t cluster_end = cluster_start + s->cluster_size;
+
+        const uint64_t old_data_start = old_start
+            + old_alloc->cow_start.nb_bytes;
+        const uint64_t old_data_end = old_alloc->offset
+            + old_alloc->cow_end.offset;
+
+        const bool conflict_in_data_area =
+            end > old_data_start && start < old_data_end;
+        const bool conflict_in_old_cow_start =
+            /* 1). new write request area is before the old */
+            start < old_data_start
+            && /* 2). old request did not allow writing in its cow area */
+            !old_alloc->cow_start.reduced;
+        const bool conflict_in_old_cow_end =
+            /* 1). new write request area is after the old */
+            start > old_data_start
+            && /* 2). old request did not allow writing in its cow area */
+            !old_alloc->cow_end.reduced;
+
+        if (conflict_in_data_area ||
+            conflict_in_old_cow_start || conflict_in_old_cow_end) {
+
+            /* Stop if already an l2meta exists. After yielding, it wouldn't
+             * be valid any more, so we'd have to clean up the old L2Metas
+             * and deal with requests depending on them before starting to
+             * gather new ones. Not worth the trouble. */
+            if (*m) {
+                /* start must be cluster aligned at this point */
+                assert(start == cluster_start);
+                *cur_bytes = 0;
+                return 0;
+            }
+
+            /* Wait for the dependency to complete. We need to recheck
+             * the free/allocated clusters when we continue. */
+            qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
+            return -EAGAIN;
+        }
+
+        /* allocations do conflict, but the competitor kindly allowed us
+         * to write concurrently (our data area only, not the whole cluster!)
+         * Inter alia, this means we must not touch the COW areas */
+
+        if (*host_offset) {
             /* start must be cluster aligned at this point */
-            assert(start == start_of_cluster(s, start));
-            *cur_bytes = 0;
-            return 0;
+            assert(start == cluster_start);
+            if ((old_alloc->alloc_offset + (start - old_start))
+                != *host_offset) {
+                /* can't extend contiguous allocation */
+                *cur_bytes = 0;
+                return 0;
+            }
         }
 
-        /* Wait for the dependency to complete. We need to recheck
-         * the free/allocated clusters when we continue. */
-        qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
-        return -EAGAIN;
+        QCowL2Meta *old_m = *m;
+        *m = g_malloc0(sizeof(**m));
+
+        **m = (QCowL2Meta) {
+            .next           = old_m,
+
+            .alloc_offset   = old_alloc->alloc_offset
+                              + (cluster_start - old_start),
+            .offset         = old_alloc->offset
+                              + (cluster_start - old_start),
+            .nb_clusters    = 1,
+            .piggybacked    = true,
+            .clusters_are_trailing = false,
+
+            /* reduced COW areas. see above */
+            .cow_start = {
+                .offset   = 0,
+                .nb_bytes = start - cluster_start,
+                .reduced  = true,
+            },
+            .cow_end = {
+                .offset   = MIN(end - cluster_start, s->cluster_size),
+                .nb_bytes = end < cluster_end ? cluster_end - end : 0,
+                .reduced  = true,
+            },
+        };
+        qemu_co_queue_init(&(*m)->dependent_requests);
+        QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
+
+        *host_offset = old_alloc->alloc_offset + (start - old_start);
+        *cur_bytes = MIN(*cur_bytes, cluster_end - start);
+        return 1;
     }
 
     /* Make sure that existing clusters and new allocations are only used up to
@@ -1264,6 +1350,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
         .alloc_offset   = alloc_cluster_offset,
         .offset         = start_of_cluster(s, guest_offset),
         .nb_clusters    = nb_clusters,
+        .piggybacked    = false,
         .clusters_are_trailing = alloc_cluster_offset >= old_data_end,
 
         .keep_old_clusters  = keep_old_clusters,
@@ -1364,13 +1451,12 @@ again:
          *         for contiguous clusters (the situation could have changed
          *         while we were sleeping)
          *
-         *      c) TODO: Request starts in the same cluster as the in-flight
-         *         allocation ends. Shorten the COW of the in-fight allocation,
-         *         set cluster_offset to write to the same cluster and set up
-         *         the right synchronisation between the in-flight request and
-         *         the new one.
+         *      c) Overlap with another request's writeable COW area. Use
+         *         the stolen offset (and let the original request update L2
+         *         when it pleases)
+         *
          */
-        ret = handle_dependencies(bs, start, &cur_bytes, m);
+        ret = handle_dependencies(bs, start, &cluster_offset, &cur_bytes, m);
         if (ret == -EAGAIN) {
             /* Currently handle_dependencies() doesn't yield if we already had
              * an allocation. If it did, we would have to clean up the L2Meta
@@ -1379,6 +1465,8 @@ again:
             goto again;
         } else if (ret < 0) {
             return ret;
+        } else if (ret) {
+            continue;
         } else if (cur_bytes == 0) {
             break;
         } else {
@@ -1967,3 +2055,36 @@ void qcow2_update_data_end(BlockDriverState *bs, uint64_t off)
         s->data_end = off;
     }
 }
+
+/*
+ * For each @m, wait for its dependency request to finish and check for its
+ * success, i.e. that the L2 table is updated as expected.
+ */
+int qcow2_wait_l2table_update(BlockDriverState *bs, const QCowL2Meta *m)
+{
+    BDRVQcow2State *s = bs->opaque;
+    QCowL2Meta *old_alloc;
+    uint64_t alloc_offset;
+    unsigned int bytes;
+    int ret;
+
+    for (; m != NULL; m = m->next) {
+        assert(m->piggybacked);
+        QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight) {
+            uint64_t a_off;
+            a_off = old_alloc->alloc_offset + (m->offset - old_alloc->offset);
+            if (!old_alloc->piggybacked && m->offset >= old_alloc->offset &&
+                a_off == m->alloc_offset) {
+
+                qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
+                break;
+            }
+        }
+        ret = qcow2_get_cluster_offset(bs, m->offset, &bytes, &alloc_offset);
+        if (ret != QCOW2_CLUSTER_NORMAL ||
+            alloc_offset != m->alloc_offset) {
+            return -1;
+        }
+    }
+    return 0;
+}
diff --git a/block/qcow2.c b/block/qcow2.c
index 97a66a0..0f28a4b 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1625,6 +1625,8 @@ fail:
 
 static void handle_cow_reduce(BlockDriverState *bs, QCowL2Meta *m)
 {
+    bool trimmed = false;
+
     if (bs->encrypted) {
         return;
     }
@@ -1633,12 +1635,19 @@ static void handle_cow_reduce(BlockDriverState *bs, QCowL2Meta *m)
                         (m->offset + m->cow_start.offset) >> BDRV_SECTOR_BITS,
                         m->cow_start.nb_bytes >> BDRV_SECTOR_BITS)) {
         m->cow_start.reduced = true;
+        trimmed = true;
     }
     if (!m->cow_end.reduced && m->cow_end.nb_bytes != 0 &&
         is_zero_sectors(bs,
                         (m->offset + m->cow_end.offset) >> BDRV_SECTOR_BITS,
                         m->cow_end.nb_bytes >> BDRV_SECTOR_BITS)) {
         m->cow_end.reduced = true;
+        trimmed = true;
+    }
+    /* The request is trimmed. Let's try to start dependent
+       ones, maybe we will be lucky */
+    if (trimmed) {
+        qemu_co_queue_restart_all(&m->dependent_requests);
     }
 }
 
@@ -1787,6 +1796,10 @@ static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
     for (m = l2meta; m != NULL; m = m->next) {
         uint64_t bytes = m->nb_clusters << s->cluster_bits;
 
+        if (m->piggybacked) {
+            continue;
+        }
+
         if (s->prealloc_size != 0 && handle_prealloc(bs, m)) {
             handle_cow_reduce(bs, m);
             continue;
@@ -1910,9 +1923,18 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
         while (l2meta != NULL) {
             QCowL2Meta *next;
 
-            ret = qcow2_alloc_cluster_link_l2(bs, l2meta);
-            if (ret < 0) {
-                goto fail;
+            if (!l2meta->piggybacked) {
+                ret = qcow2_alloc_cluster_link_l2(bs, l2meta);
+                if (ret < 0) {
+                    goto fail;
+                }
+            } else {
+                ret = qcow2_wait_l2table_update(bs, l2meta);
+                if (ret < 0) {
+                    /* dependency request failed, return general EIO */
+                    ret = -EIO;
+                    goto fail;
+                }
             }
 
             /* Take the request off the list of running requests */
diff --git a/block/qcow2.h b/block/qcow2.h
index 2fd8510..5947045 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -310,7 +310,8 @@ typedef struct Qcow2COWRegion {
     /** Number of bytes to copy */
     int         nb_bytes;
 
-    /** The region is filled with zeroes and does not require COW
+    /** The region does not require COW
+     *  (either filled with zeroes or busy with another request)
      */
     bool        reduced;
 } Qcow2COWRegion;
@@ -338,6 +339,13 @@ typedef struct QCowL2Meta
     bool clusters_are_trailing;
 
     /**
+     * True if the described clusters are being allocated by
+     * another concurrent request, so this one must not actually update L2
+     * or perform COW, but only write its data
+     */
+    bool piggybacked;
+
+    /**
      * Requests that overlap with this allocation and wait to be restarted
      * when the allocating request has completed.
      */
@@ -575,6 +583,8 @@ int qcow2_expand_zero_clusters(BlockDriverState *bs,
                                BlockDriverAmendStatusCB *status_cb,
                                void *cb_opaque);
 
+int qcow2_wait_l2table_update(BlockDriverState *bs, const QCowL2Meta *m);
+
 /* qcow2-snapshot.c functions */
 int qcow2_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info);
 int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH v1 13/13] iotest 046: test simultaneous cluster write error case
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (11 preceding siblings ...)
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 12/13] qcow2: allow concurrent unaligned writes to the same clusters Anton Nefedov
@ 2017-05-19  9:34 ` Anton Nefedov
  2017-05-23 14:35 ` [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Eric Blake
  13 siblings, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-19  9:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, Anton Nefedov

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 tests/qemu-iotests/046     | 38 +++++++++++++++++++++++++++++++++++++-
 tests/qemu-iotests/046.out | 23 +++++++++++++++++++++++
 2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/046 b/tests/qemu-iotests/046
index f2ebecf..c210b55 100755
--- a/tests/qemu-iotests/046
+++ b/tests/qemu-iotests/046
@@ -29,7 +29,8 @@ status=1	# failure is the default!
 
 _cleanup()
 {
-	_cleanup_test_img
+    _cleanup_test_img
+    rm "$TEST_DIR/blkdebug.conf"
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
@@ -188,6 +189,37 @@ overlay_io | $QEMU_IO blkdebug::"$TEST_IMG" | _filter_qemu_io |\
 	sed -e 's/bytes at offset [0-9]*/bytes at offset XXX/g'
 
 echo
+echo "== Concurrency error case =="
+
+# 1. The 1st request allocates the cluster; stop it before it writes and updates L2
+# 2. The 2nd request, starting in the same cluster, must complete its write and
+#    then wait for the 1st to update L2
+# 3. Resume the 1st request to make it fail (injected error)
+# 4. The 2nd request must wake up and fail as well;
+#    one cluster will end up leaked
+cat > "$TEST_DIR/blkdebug.conf" <<EOF
+[inject-error]
+event = "write_aio"
+sector = "0xb00"
+EOF
+function error_io()
+{
+cat <<EOF
+discard 0x120000 0x20000
+break write_aio A
+aio_write -P 180 0x120000 0x200
+wait_break A
+aio_write -P 181 0x120200 0x10000
+resume A
+EOF
+}
+
+error_io | $QEMU_IO "blkdebug:$TEST_DIR/blkdebug.conf:$TEST_IMG" | _filter_qemu_io |\
+	sed -e 's/bytes at offset [0-9]*/bytes at offset XXX/g'
+
+_check_test_img -r leaks
+
+echo
 echo "== Verify image content =="
 
 function verify_io()
@@ -259,6 +291,10 @@ function verify_io()
     # Undefined content for 0x10c000 0x8000
     echo read -P 160 0x114000 0x8000
     echo read -P 17  0x11c000 0x4000
+
+    echo read -P ${discarded:-18} 0x120000 0x10000
+    echo read -P 181 0x130000 0x200
+    echo read -P ${discarded:-19} 0x130200 0xfe00
 }
 
 verify_io | $QEMU_IO "$TEST_IMG" | _filter_qemu_io
diff --git a/tests/qemu-iotests/046.out b/tests/qemu-iotests/046.out
index ca2c740..a065102 100644
--- a/tests/qemu-iotests/046.out
+++ b/tests/qemu-iotests/046.out
@@ -140,6 +140,23 @@ wrote 32768/32768 bytes at offset XXX
 wrote 98304/98304 bytes at offset XXX
 96 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+== Concurrency error case ==
+discard 131072/131072 bytes at offset XXX
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Suspended request 'A'
+blkdebug: Resuming request 'A'
+aio_write failed: Input/output error
+aio_write failed: Input/output error
+Leaked cluster 22 refcount=1 reference=0
+Repairing cluster 22 refcount=1 reference=0
+The following inconsistencies were found and repaired:
+
+    1 leaked clusters
+    0 corruptions
+
+Double checking the fixed image now...
+No errors were found on the image.
+
 == Verify image content ==
 read 65536/65536 bytes at offset 0
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
@@ -235,5 +252,11 @@ read 32768/32768 bytes at offset 1130496
 32 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 read 16384/16384 bytes at offset 1163264
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 1179648
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 1245184
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65024/65024 bytes at offset 1245696
+63.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 No errors were found on the image.
 *** done
-- 
2.7.4


* Re: [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk Anton Nefedov
@ 2017-05-22 19:00   ` Eric Blake
  2017-05-23  8:28     ` Anton Nefedov
  2017-05-23  9:13     ` Denis V. Lunev
  2017-05-26  8:11   ` Kevin Wolf
  1 sibling, 2 replies; 36+ messages in thread
From: Eric Blake @ 2017-05-22 19:00 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: kwolf, Denis V. Lunev, den, mreitz


On 05/19/2017 04:34 AM, Anton Nefedov wrote:
> From: "Denis V. Lunev" <den@openvz.org>
> 
> Currently each single write operation can result in 3 write operations
> if guest offsets are not cluster aligned. One write is performed for the
> real payload and two for COW-ed areas. Thus the data possibly lays
> non-contiguously on the host filesystem. This will reduce further
> sequential read performance significantly.
> 
> The patch allocates the space in the file with cluster granularity,
> ensuring
>   1. better host offset locality
>   2. less space allocation operations
>      (which can be expensive on distributed storages)

s/storages/storage/

> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>  block/qcow2.c | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 

> diff --git a/block/qcow2.c b/block/qcow2.c
> index a8d61f0..2e6a0ec 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1575,6 +1575,32 @@ fail:
>      return ret;
>  }
>  
> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    BlockDriverState *file = bs->file->bs;
> +    QCowL2Meta *m;
> +    int ret;
> +
> +    for (m = l2meta; m != NULL; m = m->next) {
> +        uint64_t bytes = m->nb_clusters << s->cluster_bits;
> +
> +        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
> +            continue;
> +        }
> +
> +        /* try to alloc host space in one chunk for better locality */
> +        ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0);

Are we guaranteed that this is a fast operation?  (That is, it either
results in a hole or an error, and doesn't waste time tediously writing
actual zeroes)

> +
> +        if (ret != 0) {
> +            continue;
> +        }

Supposing we are using a file system that doesn't support holes, then
ret will not be zero, and we ended up not allocating anything after all.
 Is that a problem that we are just blindly continuing the loop as our
reaction to the error?

/reads further

I guess not - you aren't reacting to any error call, but merely using
the side effect that an allocation happened for speed when it worked,
and ignoring failure (you get the old behavior of the write() now
causing the allocation) when it didn't.

> +
> +        file->total_sectors = MAX(file->total_sectors,
> +                                  (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
> +    }
> +}
> +
>  static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>                                           uint64_t bytes, QEMUIOVector *qiov,
>                                           int flags)
> @@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>          if (ret < 0) {
>              goto fail;
>          }
> -
>          qemu_co_mutex_unlock(&s->lock);
> +
> +        if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) {
> +            handle_alloc_space(bs, l2meta);
> +        }

Is it really a good idea to be modifying the underlying protocol image
outside of the mutex?

At any rate, it looks like your patch is doing a best-effort write
zeroes as an attempt to trigger consecutive allocation of the entire
cluster in the underlying protocol right after a cluster has been
allocated at the qcow2 format layer.  Which means there are more
syscalls now than there were previously, but now when we do three
write() calls at offsets B, A, C, those three calls are into file space
that was allocated earlier by the write zeroes, rather than fresh calls
into unallocated space that is likely to trigger up to three disjoint
allocations.

As a discussion point, wouldn't we achieve the same effect of less
fragmentation if we instead collect our data into a bounce buffer, and
only then do a single write() (or more likely, a writev() where the iov
is set up to reconstruct a single buffer on the syscall, but where the
source data is still at different offsets)?  We'd be avoiding the extra
syscalls of pre-allocating the cluster, and while our write() call is
still causing allocations, at least it is now one cluster-aligned
write() rather than three sub-cluster out-of-order write()s.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [PATCH v1 02/13] qcow2: is_zero_sectors(): return true if area is outside of backing file
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 02/13] qcow2: is_zero_sectors(): return true if area is outside of backing file Anton Nefedov
@ 2017-05-22 19:12   ` Eric Blake
  2017-05-22 19:14     ` Eric Blake
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Blake @ 2017-05-22 19:12 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: kwolf, den, mreitz


On 05/19/2017 04:34 AM, Anton Nefedov wrote:
> in such case, bdrv_get_block_status() shall return 0, *nr == 0

Sounds better as s/shall/will/

> 
> iotest 154 updated accordingly: write-zeroes tail alignment can be detected
> as zeroes now, so pwrite_zeroes succeeds
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>  block/qcow2.c              | 6 ++++--
>  tests/qemu-iotests/154.out | 4 ++--
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 

You'll want to also patch test 154 proper to remove these two TODO comments:

# TODO: Note that this forces an allocation, because we aren't yet able to
# quickly detect that reads beyond EOF of the backing file are always zero


> diff --git a/block/qcow2.c b/block/qcow2.c
> index 2e6a0ec..b885dfc 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2482,7 +2482,7 @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
>      int64_t res;
>  
>      if (start + count > bs->total_sectors) {
> -        count = bs->total_sectors - start;
> +        count = start < bs->total_sectors ? bs->total_sectors - start : 0;
>      }
>  
>      if (!count) {
> @@ -2490,7 +2490,9 @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
>      }
>      res = bdrv_get_block_status_above(bs, NULL, start, count,
>                                        &nr, &file);
> -    return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
> +    return res >= 0
> +        && (((res & BDRV_BLOCK_ZERO) && nr == count)
> +            || nr == 0);

The logic makes sense to me, although the formatting is unusual (we tend
to split && and || with the operator at the tail of the previous line,
not the head of the new line).

This quick check may make me revisit whether it is worth my RFC
series about adding BDRV_BLOCK_EOF for more quickly tracking reads
beyond EOF (my solution in that series was able to make the same change
to test 154, but by doing it at the block layer instead of the qcow2.c
code).  https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg01127.html


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [PATCH v1 02/13] qcow2: is_zero_sectors(): return true if area is outside of backing file
  2017-05-22 19:12   ` Eric Blake
@ 2017-05-22 19:14     ` Eric Blake
  2017-05-23  8:35       ` Anton Nefedov
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Blake @ 2017-05-22 19:14 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: kwolf, den, mreitz


On 05/22/2017 02:12 PM, Eric Blake wrote:

>> +++ b/block/qcow2.c
>> @@ -2482,7 +2482,7 @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
>>      int64_t res;
>>  
>>      if (start + count > bs->total_sectors) {
>> -        count = bs->total_sectors - start;
>> +        count = start < bs->total_sectors ? bs->total_sectors - start : 0;
>>      }
>>  
>>      if (!count) {
>> @@ -2490,7 +2490,9 @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
>>      }
>>      res = bdrv_get_block_status_above(bs, NULL, start, count,
>>                                        &nr, &file);
>> -    return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
>> +    return res >= 0
>> +        && (((res & BDRV_BLOCK_ZERO) && nr == count)
>> +            || nr == 0);
> 
> The logic makes sense to me, although the formatting is unusual (we tend
> to split && and || with the operator at the tail of the previous line,
> not the head of the new line).
> 
> This quick check may make me revisit whether it is worth my RFC
> series about adding BDRV_BLOCK_EOF for more quickly tracking reads
> beyond EOF (my solution in that series was able to make the same change
> to test 154, but by doing it at the block layer instead of the qcow2.c
> code).  https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg01127.html

Actually, re-reading my RFC - I was able to change 6 instances in test
154, while your tweak only affected 2 instances (you still left four
instances that were allocating).  So my approach may still be more
optimal in the long run.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [PATCH v1 03/13] qcow2: do not COW the empty areas
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 03/13] qcow2: do not COW the empty areas Anton Nefedov
@ 2017-05-22 19:24   ` Eric Blake
  2017-05-23  8:31     ` Anton Nefedov
  2017-05-23  9:15     ` Denis V. Lunev
  0 siblings, 2 replies; 36+ messages in thread
From: Eric Blake @ 2017-05-22 19:24 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: kwolf, Denis V . Lunev, den, mreitz


On 05/19/2017 04:34 AM, Anton Nefedov wrote:
> If COW area of the newly allocated cluster is zeroes, there is no reason
> to write zero sectors in perform_cow() again now as whole clusters are
> zeroed out in single chunks by handle_alloc_space().

But that's only true if you can guarantee that handle_alloc_space()
succeeded at ensuring the cluster reads as zeroes.  If you silently
ignore errors (which is what patch 1/13 does), you risk assuming that
the cluster reads as zeroes when in reality it does not, and then you
have corrupted data.

The idea of avoiding a COW of areas that read as zero at the source when
the destination also already reads as zeroes makes sense, but I'm not
convinced that this patch is safe as written.

> 
> Introduce QCowL2Meta field "reduced", since the existing fields
> (offset and nb_bytes) still have to keep other write requests from
> simultaneously writing in the area
> 
> iotest 060:
> write to the discarded cluster does not trigger COW anymore.
> so, break on write_aio event instead, will work for the test
> (but write won't fail anymore, so update reference output)
> 
> iotest 066:
> cluster-alignment areas that were not really COWed are now detected
> as zeroes, hence the initial write has to be exactly the same size for
> the maps to match
> 
> performance tests: ===
> 
> qemu-io,
>   results in seconds to complete (less is better)
>   random write 4k to empty image, no backing
>     HDD
>       64k cluster
>         128M over 128M image:   160 -> 160 ( x1  )
>         128M over   2G image:    86 ->  84 ( x1  )
>         128M over   8G image:    40 ->  29 ( x1.4 )
>       1M cluster
>          32M over   8G image:    58 ->  23 ( x2.5 )
> 
>     SSD
>       64k cluster
>           2G over   2G image:    71 ->  38 (  x1.9 )
>         512M over   8G image:    85 ->   8 ( x10.6 )
>       1M cluster
>         128M over  32G image:   314 ->   2 ( x157  )

At any rate, the benchmark numbers show that there is merit to pursuing
the idea of reducing I/O when partial cluster writes can avoid writing
COW'd zeroes on either side of the data.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [PATCH v1 04/13] qcow2: preallocation at image expand
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 04/13] qcow2: preallocation at image expand Anton Nefedov
@ 2017-05-22 19:29   ` Eric Blake
  2017-05-24 16:57     ` Anton Nefedov
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Blake @ 2017-05-22 19:29 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: kwolf, Denis V. Lunev, den, mreitz


On 05/19/2017 04:34 AM, Anton Nefedov wrote:
> From: "Denis V. Lunev" <den@openvz.org>
> 
> This patch adds image preallocation at expand to provide better locality
> of QCOW2 image file and optimize this procedure for some distributed
> storages where this procedure is slow.
> 
> Image expand requests have to be suspended until the allocation is
> performed which is done via special QCowL2Meta.
> This meta is invisible to handle_dependencies() code.
> This is the main reason for also calling preallocation before metadata
> write: it might intersect with preallocation triggered by another IO,
> and has to yield

How does this interact with Max's work on preallocated truncate?
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg00267.html

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [PATCH v1 11/13] qcow2-cluster: make handle_dependencies() logic easier to follow
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 11/13] qcow2-cluster: make handle_dependencies() logic easier to follow Anton Nefedov
@ 2017-05-22 19:37   ` Eric Blake
  0 siblings, 0 replies; 36+ messages in thread
From: Eric Blake @ 2017-05-22 19:37 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: kwolf, Denis V . Lunev, den, mreitz


On 05/19/2017 04:34 AM, Anton Nefedov wrote:
> Avoid complicated nested conditions; return or continue asap instead.
> The logic is not changed.
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> ---
>  block/qcow2-cluster.c | 45 ++++++++++++++++++++++-----------------------
>  1 file changed, 22 insertions(+), 23 deletions(-)
> 

>  
> -            /* Stop if already an l2meta exists. After yielding, it wouldn't

As long as you're touching this,


> -            }
> +        /* Stop if already an l2meta exists. After yielding, it wouldn't

fix the grammar: s/already an l2meta/an l2meta already/

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk
  2017-05-22 19:00   ` Eric Blake
@ 2017-05-23  8:28     ` Anton Nefedov
  2017-05-23  9:13     ` Denis V. Lunev
  1 sibling, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-23  8:28 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: kwolf, Denis V. Lunev, den, mreitz



On 05/22/2017 10:00 PM, Eric Blake wrote:
> On 05/19/2017 04:34 AM, Anton Nefedov wrote:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> Currently each single write operation can result in 3 write operations
>> if guest offsets are not cluster aligned. One write is performed for the
>> real payload and two for COW-ed areas. Thus the data possibly lays
>> non-contiguously on the host filesystem. This will reduce further
>> sequential read performance significantly.
>>
>> The patch allocates the space in the file with cluster granularity,
>> ensuring
>>   1. better host offset locality
>>   2. less space allocation operations
>>      (which can be expensive on distributed storages)
>
> s/storages/storage/
>

Done.

>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> ---
>>  block/qcow2.c | 32 +++++++++++++++++++++++++++++++-
>>  1 file changed, 31 insertions(+), 1 deletion(-)
>>
>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index a8d61f0..2e6a0ec 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -1575,6 +1575,32 @@ fail:
>>      return ret;
>>  }
>>
>> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    BlockDriverState *file = bs->file->bs;
>> +    QCowL2Meta *m;
>> +    int ret;
>> +
>> +    for (m = l2meta; m != NULL; m = m->next) {
>> +        uint64_t bytes = m->nb_clusters << s->cluster_bits;
>> +
>> +        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
>> +            continue;
>> +        }
>> +
>> +        /* try to alloc host space in one chunk for better locality */
>> +        ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0);
>
> Are we guaranteed that this is a fast operation?  (That is, it either
> results in a hole or an error, and doesn't waste time tediously writing
> actual zeroes)
>

well, block_int.h reads:

/*
 * Efficiently zero a region of the disk image. Typically an image format
 * would use a compact metadata representation to implement this. This
 * function pointer may be NULL or return -ENOSUP and .bdrv_co_writev()
 * will be called instead.
 */
int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs,
    int64_t offset, int count, BdrvRequestFlags flags);


(and that's why the driver function is used directly, bypassing the 
'safe' bdrv interface that would try to write zeroes no matter the cost)

As far as I have checked, the drivers mostly follow this idea.

>> +
>> +        if (ret != 0) {
>> +            continue;
>> +        }
>
> Supposing we are using a file system that doesn't support holes, then
> ret will not be zero, and we ended up not allocating anything after all.
>  Is that a problem that we are just blindly continuing the loop as our
> reaction to the error?
>
> /reads further
>
> I guess not - you aren't reacting to any error call, but merely using
> the side effect that an allocation happened for speed when it worked,
> and ignoring failure (you get the old behavior of the write() now
> causing the allocation) when it didn't.
>

yes, exactly

>> +
>> +        file->total_sectors = MAX(file->total_sectors,
>> +                                  (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
>> +    }
>> +}
>> +
>>  static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>                                           uint64_t bytes, QEMUIOVector *qiov,
>>                                           int flags)
>> @@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>          if (ret < 0) {
>>              goto fail;
>>          }
>> -
>>          qemu_co_mutex_unlock(&s->lock);
>> +
>> +        if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) {
>> +            handle_alloc_space(bs, l2meta);
>> +        }
>
> Is it really a good idea to be modifying the underlying protocol image
> outside of the mutex?
>

As far as I understand, qcow2 usually modifies the underlying image
outside of the mutex; I guess it's only the qcow2 metadata that we
wouldn't want to touch unlocked.

> At any rate, it looks like your patch is doing a best-effort write
> zeroes as an attempt to trigger consecutive allocation of the entire
> cluster in the underlying protocol right after a cluster has been
> allocated at the qcow2 format layer.  Which means there are more
> syscalls now than there were previously, but now when we do three
> write() calls at offsets B, A, C, those three calls are into file space
> that was allocated earlier by the write zeroes, rather than fresh calls
> into unallocated space that is likely to trigger up to three disjoint
> allocations.
>
> As a discussion point, wouldn't we achieve the same effect of less
> fragmentation if we instead collect our data into a bounce buffer, and
> only then do a single write() (or more likely, a writev() where the iov
> is set up to reconstruct a single buffer on the syscall, but where the
> source data is still at different offsets)?  We'd be avoiding the extra
> syscalls of pre-allocating the cluster, and while our write() call is
> still causing allocations, at least it is now one cluster-aligned
> write() rather than three sub-cluster out-of-order write()s.
>

I think yes, we would achieve the same effect of less fragmentation;
but pre-zeroing also makes the following patch possible (skipping COW
when there is no backing data).

I have follow-up patches which merge initial data and COW padding into a
single writev(). After those it should become reasonable to skip
cluster pre-zeroing (for cases when there is backing data).

/Anton


* Re: [Qemu-devel] [PATCH v1 03/13] qcow2: do not COW the empty areas
  2017-05-22 19:24   ` Eric Blake
@ 2017-05-23  8:31     ` Anton Nefedov
  2017-05-23  9:15     ` Denis V. Lunev
  1 sibling, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-23  8:31 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: kwolf, Denis V . Lunev, den, mreitz



On 05/22/2017 10:24 PM, Eric Blake wrote:
> On 05/19/2017 04:34 AM, Anton Nefedov wrote:
>> If COW area of the newly allocated cluster is zeroes, there is no reason
>> to write zero sectors in perform_cow() again now as whole clusters are
>> zeroed out in single chunks by handle_alloc_space().
>
> But that's only true if you can guarantee that handle_alloc_space()
> succeeded at ensuring the cluster reads as zeroes.  If you silently
> ignore errors (which is what patch 1/13 does), you risk assuming that
> the cluster reads as zeroes when in reality it does not, and then you
> have corrupted data.
>

Sure; COW is only skipped if pwrite_zeroes() from patch 1/13
succeeds

> The idea of avoiding a COW of areas that read as zero at the source when
> the destination also already reads as zeroes makes sense, but I'm not
> convinced that this patch is safe as written.
>

/Anton


* Re: [Qemu-devel] [PATCH v1 02/13] qcow2: is_zero_sectors(): return true if area is outside of backing file
  2017-05-22 19:14     ` Eric Blake
@ 2017-05-23  8:35       ` Anton Nefedov
  0 siblings, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-23  8:35 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: kwolf, den, mreitz



On 05/22/2017 10:14 PM, Eric Blake wrote:
> On 05/22/2017 02:12 PM, Eric Blake wrote:
>
>>> +++ b/block/qcow2.c
>>> @@ -2482,7 +2482,7 @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
>>>      int64_t res;
>>>
>>>      if (start + count > bs->total_sectors) {
>>> -        count = bs->total_sectors - start;
>>> +        count = start < bs->total_sectors ? bs->total_sectors - start : 0;
>>>      }
>>>
>>>      if (!count) {
>>> @@ -2490,7 +2490,9 @@ static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
>>>      }
>>>      res = bdrv_get_block_status_above(bs, NULL, start, count,
>>>                                        &nr, &file);
>>> -    return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
>>> +    return res >= 0
>>> +        && (((res & BDRV_BLOCK_ZERO) && nr == count)
>>> +            || nr == 0);
>>
>> The logic makes sense to me, although the formatting is unusual (we tend
>> to split && and || with the operator at the tail of the previous line,
>> not the head of the new line).
>>
>> This quick check may make me revisit whether it is worth my RFC
>> series about adding BDRV_BLOCK_EOF for more quickly tracking reads
>> beyond EOF (my solution in that series was able to make the same change
>> to test 154, but by doing it at the block layer instead of the qcow2.c
>> code).  https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg01127.html
>
> Actually, re-reading my RFC - I was able to change 6 instances in test
> 154, while your tweak only affected 2 instances (you still left four
> instances that were allocating).  So my approach may still be more
> optimal in the long run.
>

Yes, looks like your approach is more general; let's drop this one then

/Anton


* Re: [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk
  2017-05-22 19:00   ` Eric Blake
  2017-05-23  8:28     ` Anton Nefedov
@ 2017-05-23  9:13     ` Denis V. Lunev
  1 sibling, 0 replies; 36+ messages in thread
From: Denis V. Lunev @ 2017-05-23  9:13 UTC (permalink / raw)
  To: Eric Blake, Anton Nefedov, qemu-devel; +Cc: kwolf, den, mreitz

On 05/22/2017 10:00 PM, Eric Blake wrote:
> On 05/19/2017 04:34 AM, Anton Nefedov wrote:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> Currently each single write operation can result in 3 write operations
>> if guest offsets are not cluster aligned. One write is performed for the
>> real payload and two for COW-ed areas. Thus the data possibly lays
>> non-contiguously on the host filesystem. This will reduce further
>> sequential read performance significantly.
>>
>> The patch allocates the space in the file with cluster granularity,
>> ensuring
>>   1. better host offset locality
>>   2. less space allocation operations
>>      (which can be expensive on distributed storages)
> s/storages/storage/
>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> ---
>>  block/qcow2.c | 32 +++++++++++++++++++++++++++++++-
>>  1 file changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index a8d61f0..2e6a0ec 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -1575,6 +1575,32 @@ fail:
>>      return ret;
>>  }
>>  
>> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    BlockDriverState *file = bs->file->bs;
>> +    QCowL2Meta *m;
>> +    int ret;
>> +
>> +    for (m = l2meta; m != NULL; m = m->next) {
>> +        uint64_t bytes = m->nb_clusters << s->cluster_bits;
>> +
>> +        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
>> +            continue;
>> +        }
>> +
>> +        /* try to alloc host space in one chunk for better locality */
>> +        ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0);
> Are we guaranteed that this is a fast operation?  (That is, it either
> results in a hole or an error, and doesn't waste time tediously writing
> actual zeroes)
>
This is why we call the driver directly. We expect that the fallback to
actually writing zeroes happens only in the generic function.

>> +
>> +        if (ret != 0) {
>> +            continue;
>> +        }
> Supposing we are using a file system that doesn't support holes, then
> ret will not be zero, and we ended up not allocating anything after all.
>  Is that a problem that we are just blindly continuing the loop as our
> reaction to the error?
>
> /reads further
>
> I guess not - you aren't reacting to any error call, but merely using
> the side effect that an allocation happened for speed when it worked,
> and ignoring failure (you get the old behavior of the write() now
> causing the allocation) when it didn't.
good point

>> +
>> +        file->total_sectors = MAX(file->total_sectors,
>> +                                  (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
>> +    }
>> +}
>> +
>>  static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>                                           uint64_t bytes, QEMUIOVector *qiov,
>>                                           int flags)
>> @@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>          if (ret < 0) {
>>              goto fail;
>>          }
>> -
>>          qemu_co_mutex_unlock(&s->lock);
>> +
>> +        if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) {
>> +            handle_alloc_space(bs, l2meta);
>> +        }
> Is it really a good idea to be modifying the underlying protocol image
> outside of the mutex?
yes! This is by design, not accidental. The area is protected by the
meta, and thus writes to the not-yet-allocated areas are safe.

> At any rate, it looks like your patch is doing a best-effort write
> zeroes as an attempt to trigger consecutive allocation of the entire
> cluster in the underlying protocol right after a cluster has been
> allocated at the qcow2 format layer.  Which means there are more
> syscalls now than there were previously, but now when we do three
> write() calls at offsets B, A, C, those three calls are into file space
> that was allocated earlier by the write zeroes, rather than fresh calls
> into unallocated space that is likely to trigger up to three disjoint
> allocations.
>
> As a discussion point, wouldn't we achieve the same effect of less
> fragmentation if we instead collect our data into a bounce buffer, and
> only then do a single write() (or more likely, a writev() where the iov
> is set up to reconstruct a single buffer on the syscall, but where the
> source data is still at different offsets)?  We'd be avoiding the extra
> syscalls of pre-allocating the cluster, and while our write() call is
> still causing allocations, at least it is now one cluster-aligned
> write() rather than three sub-cluster out-of-order write()s.
>
If the cluster size becomes bigger, the difference will be much more
significant. Actually, we have already done that in the original patchset,
but those changes are a bit controversial in terms of performance: they
work much better on SSD and worse on HDD.

Maybe we have done something wrong, but only simple things that are
unquestionable in terms of performance are included here.

Den

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v1 03/13] qcow2: do not COW the empty areas
  2017-05-22 19:24   ` Eric Blake
  2017-05-23  8:31     ` Anton Nefedov
@ 2017-05-23  9:15     ` Denis V. Lunev
  1 sibling, 0 replies; 36+ messages in thread
From: Denis V. Lunev @ 2017-05-23  9:15 UTC (permalink / raw)
  To: Eric Blake, Anton Nefedov, qemu-devel; +Cc: kwolf, den, mreitz

On 05/22/2017 10:24 PM, Eric Blake wrote:
> On 05/19/2017 04:34 AM, Anton Nefedov wrote:
>> If COW area of the newly allocated cluster is zeroes, there is no reason
>> to write zero sectors in perform_cow() again now as whole clusters are
>> zeroed out in single chunks by handle_alloc_space().
> But that's only true if you can guarantee that handle_alloc_space()
> succeeded at ensuring the cluster reads as zeroes.  If you silently
> ignore errors (which is what patch 1/13 does), you risk assuming that
> the cluster reads as zeroes when in reality it does not, and then you
> have corrupted data.
>
> The idea of avoiding a COW of areas that read as zero at the source when
> the destination also already reads as zeroes makes sense, but I'm not
> convinced that this patch is safe as written.
We will recheck the error path. OK.

>> Introduce QCowL2Meta field "reduced", since the existing fields
>> (offset and nb_bytes) still has to keep other write requests from
>> simultaneous writing in the area
>>
>> iotest 060:
>> write to the discarded cluster does not trigger COW anymore.
>> so, break on write_aio event instead, will work for the test
>> (but write won't fail anymore, so update reference output)
>>
>> iotest 066:
>> cluster-alignment areas that were not really COWed are now detected
>> as zeroes, hence the initial write has to be exactly the same size for
>> the maps to match
>>
>> performance tests: ===
>>
>> qemu-io,
>>   results in seconds to complete (less is better)
>>   random write 4k to empty image, no backing
>>     HDD
>>       64k cluster
>>         128M over 128M image:   160 -> 160 ( x1  )
>>         128M over   2G image:    86 ->  84 ( x1  )
>>         128M over   8G image:    40 ->  29 ( x1.4 )
>>       1M cluster
>>          32M over   8G image:    58 ->  23 ( x2.5 )
>>
>>     SSD
>>       64k cluster
>>           2G over   2G image:    71 ->  38 (  x1.9 )
>>         512M over   8G image:    85 ->   8 ( x10.6 )
>>       1M cluster
>>         128M over  32G image:   314 ->   2 ( x157  )
> At any rate, the benchmark numbers show that there is merit to pursuing
> the idea of reducing I/O when partial cluster writes can avoid writing
> COW'd zeroes on either side of the data.
>
yes! This is exactly the point, and with this approach we would also
allow sequential writes that are not cluster-aligned, which is very good.

Den


* Re: [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements
  2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (12 preceding siblings ...)
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 13/13] iotest 046: test simultaneous cluster write error case Anton Nefedov
@ 2017-05-23 14:35 ` Eric Blake
  2017-05-23 14:51   ` Denis V. Lunev
  13 siblings, 1 reply; 36+ messages in thread
From: Eric Blake @ 2017-05-23 14:35 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: kwolf, den, mreitz, Alberto Garcia


On 05/19/2017 04:34 AM, Anton Nefedov wrote:
> This pull request is to address a few performance problems of qcow2 format:
> 
>   1. non cluster-aligned write requests (to unallocated clusters) explicitly
>     pad data with zeroes if there is no backing data. This can be avoided
>     and the whole clusters are preallocated and zeroed in a single
>     efficient write_zeroes() operation, also providing better host file
>     continuity
> 
>   2. moreover, efficient write_zeroes() operation can be used to preallocate
>     space megabytes ahead which gives noticeable improvement on some storage
>     types (e.g. distributed storages where space allocation operation is
>     expensive)
> 
>   3. preallocating/zeroing the clusters in advance makes possible to enable
>     simultaneous writes to the same unallocated cluster, which is beneficial
>     for parallel sequential write operations which are not cluster-aligned
> 
> Performance test results are added to commit messages (see patch 3, 12)

And now Berto has posted parallel patches. I'm not sure which ones to
focus on, or if you can work it out between you on the best approach
forward...

https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg05236.html

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements
  2017-05-23 14:35 ` [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Eric Blake
@ 2017-05-23 14:51   ` Denis V. Lunev
  0 siblings, 0 replies; 36+ messages in thread
From: Denis V. Lunev @ 2017-05-23 14:51 UTC (permalink / raw)
  To: Eric Blake, Anton Nefedov, qemu-devel; +Cc: kwolf, mreitz, Alberto Garcia

On 05/23/2017 05:35 PM, Eric Blake wrote:
> On 05/19/2017 04:34 AM, Anton Nefedov wrote:
>> This pull request is to address a few performance problems of qcow2 format:
>>
>>   1. non cluster-aligned write requests (to unallocated clusters) explicitly
>>     pad data with zeroes if there is no backing data. This can be avoided
>>     and the whole clusters are preallocated and zeroed in a single
>>     efficient write_zeroes() operation, also providing better host file
>>     continuity
>>
>>   2. moreover, efficient write_zeroes() operation can be used to preallocate
>>     space megabytes ahead which gives noticeable improvement on some storage
>>     types (e.g. distributed storages where space allocation operation is
>>     expensive)
>>
>>   3. preallocating/zeroing the clusters in advance makes possible to enable
>>     simultaneous writes to the same unallocated cluster, which is beneficial
>>     for parallel sequential write operations which are not cluster-aligned
>>
>> Performance test results are added to commit messages (see patch 3, 12)
> And now Berto has posted parallel patches. I'm not sure which ones to
> focus on, or if you can work it out between you on the best approach
> forward...
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg05236.html
>
I have seen those patches and will comment.

Yes, they intersect. He is concentrating on performing the operation with
a minimal number of reads/writes. We have made an attempt to minimize the
amount of I/O in the expand case.

Frankly speaking, Berto's patches are not that affected. We definitely
should merge I/O if write_zeroes is impossible. Thus both series could
be merged in parallel.

What I want to note is that there is another big problem in the QCOW2
layer. We have a 64k cluster by default. A guest request can be 4-5 MB in
total. If the image is fragmented, as it most likely will be if it is old,
we will have to read/write the data sequentially. Thus in an ideal world
we should switch to a coroutine pool, and the same approach would be best
for COW too.

Please note that this would also very positively influence VM downtime on
migration, as bdrv_drain_all() is one of the most important time hogs.

Den


* Re: [Qemu-devel] [PATCH v1 04/13] qcow2: preallocation at image expand
  2017-05-22 19:29   ` Eric Blake
@ 2017-05-24 16:57     ` Anton Nefedov
  0 siblings, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-24 16:57 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: kwolf, Denis V. Lunev, mreitz



On 05/22/2017 10:29 PM, Eric Blake wrote:
> On 05/19/2017 04:34 AM, Anton Nefedov wrote:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> This patch adds image preallocation at expand to provide better locality
>> of QCOW2 image file and optimize this procedure for some distributed
>> storages where this procedure is slow.
>>
>> Image expand requests have to be suspended until the allocation is
>> performed which is done via special QCowL2Meta.
>> This meta is invisible to handle_dependencies() code.
>> This is the main reason for also calling preallocation before metadata
>> write: it might intersect with preallocation triggered by another IO,
>> and has to yield
>
> How does this interact with Max's work on preallocated truncate?
> https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg00267.html
>

Seems like Max's patchset makes it possible to manually (with qemu-img)
allocate a contiguous guest address space in advance?

Preallocation in this series is a little bit different:
- it's in-flight, and it fallocates (using drv->write_zeroes) space
ahead in the underlying image as qcow2 writes beyond EOF.
- it doesn't link L2, so it leaves the image 'sparse' at the qcow2 level.

If you ask me, these 2 patchsets are not contradictory, and probably
won't even merge-conflict :)

/Anton


* Re: [Qemu-devel] [PATCH v1 05/13] qcow2: set inactive flag
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 05/13] qcow2: set inactive flag Anton Nefedov
@ 2017-05-26  8:11   ` Kevin Wolf
  2017-05-31 16:56     ` Anton Nefedov
  0 siblings, 1 reply; 36+ messages in thread
From: Kevin Wolf @ 2017-05-26  8:11 UTC (permalink / raw)
  To: Anton Nefedov; +Cc: qemu-devel, den, mreitz

Am 19.05.2017 um 11:34 hat Anton Nefedov geschrieben:
> Qcow2State and BlockDriverState flags have to be in sync
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>  block/qcow2.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 6e7ce96..07c1706 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1939,6 +1939,7 @@ static int qcow2_inactivate(BlockDriverState *bs)
>  
>      if (result == 0) {
>          qcow2_mark_clean(bs);
> +        s->flags |= BDRV_O_INACTIVE;
>      }

Good catch.

But can't we simply use bs->open_flags and completely get rid of
s->flags?

Kevin


* Re: [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk
  2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk Anton Nefedov
  2017-05-22 19:00   ` Eric Blake
@ 2017-05-26  8:11   ` Kevin Wolf
  2017-05-26  8:57     ` Denis V. Lunev
  2017-05-26 10:57     ` Denis V. Lunev
  1 sibling, 2 replies; 36+ messages in thread
From: Kevin Wolf @ 2017-05-26  8:11 UTC (permalink / raw)
  To: Anton Nefedov; +Cc: qemu-devel, den, mreitz, Denis V. Lunev

Am 19.05.2017 um 11:34 hat Anton Nefedov geschrieben:
> From: "Denis V. Lunev" <den@openvz.org>
> 
> Currently each single write operation can result in 3 write operations
> if guest offsets are not cluster aligned. One write is performed for the
> real payload and two for COW-ed areas. Thus the data possibly lays
> non-contiguously on the host filesystem. This will reduce further
> sequential read performance significantly.
> 
> The patch allocates the space in the file with cluster granularity,
> ensuring
>   1. better host offset locality
>   2. less space allocation operations
>      (which can be expensive on distributed storages)
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>

I don't think this is the right approach. You end up with two
write operations: One write_zeroes and then one data write. If the
backend actually supports efficient zero writes, write_zeroes won't
necessarily allocate space, but writing data will definitely split
the already existing allocation. If anything, we need a new
bdrv_allocate() or something that would call fallocate() instead of
abusing write_zeroes.

It seems much clearer to me that simply unifying the three write
requests into a single one is an improvement. And it's easy to do, I
even had a patch once to do this. The reason that I didn't send it was
that it seemed to conflict with the data cache approach

>  block/qcow2.c | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index a8d61f0..2e6a0ec 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1575,6 +1575,32 @@ fail:
>      return ret;
>  }
>  
> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    BlockDriverState *file = bs->file->bs;
> +    QCowL2Meta *m;
> +    int ret;
> +
> +    for (m = l2meta; m != NULL; m = m->next) {
> +        uint64_t bytes = m->nb_clusters << s->cluster_bits;
> +
> +        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
> +            continue;
> +        }
> +
> +        /* try to alloc host space in one chunk for better locality */
> +        ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0);

No. This is what you bypass:

* All sanity checks that the block layer does

* bdrv_inc/dec_in_flight(), which is required for drain to work
  correctly. Not doing this will cause crashes.

* tracked_request_begin/end(), mark_request_serialising() and
  wait_serialising_requests(), which are required for serialising
  requests to work correctly

* Ensuring correct request alignment for file. This means that e.g.
  qcow2 with cluster size 512 on a host with a 4k native disk will
  break.

* blkdebug events

* before_write_notifiers. Not calling these will cause corrupted backups
  if someone backups file.

* Dirty bitmap updates

* Updating write_gen, wr_highest_offset and total_sectors

* Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are
  respected

And these are just the obvious things. I'm sure I missed some.

> +        if (ret != 0) {
> +            continue;
> +        }
> +
> +        file->total_sectors = MAX(file->total_sectors,
> +                                  (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);

You only compensate for part of a single item in the list above, by
duplicating code with block/io.c. This is not how to do things.

As I said above, I think you don't really want write_zeroes anyway, but
if you wanted a write_zeroes "but only if it's efficient" (which I'm not
sure is a good thing to want), then a better way might be introducing a
new request flag.

> +    }
> +}
> +
>  static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>                                           uint64_t bytes, QEMUIOVector *qiov,
>                                           int flags)

Kevin


* Re: [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk
  2017-05-26  8:11   ` Kevin Wolf
@ 2017-05-26  8:57     ` Denis V. Lunev
  2017-05-26 10:09       ` Anton Nefedov
  2017-05-26 11:16       ` Kevin Wolf
  2017-05-26 10:57     ` Denis V. Lunev
  1 sibling, 2 replies; 36+ messages in thread
From: Denis V. Lunev @ 2017-05-26  8:57 UTC (permalink / raw)
  To: Kevin Wolf, Anton Nefedov; +Cc: qemu-devel, mreitz, Denis V. Lunev

On 05/26/2017 11:11 AM, Kevin Wolf wrote:
> Am 19.05.2017 um 11:34 hat Anton Nefedov geschrieben:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> Currently each single write operation can result in 3 write operations
>> if guest offsets are not cluster aligned. One write is performed for the
>> real payload and two for COW-ed areas. Thus the data possibly lays
>> non-contiguously on the host filesystem. This will reduce further
>> sequential read performance significantly.
>>
>> The patch allocates the space in the file with cluster granularity,
>> ensuring
>>   1. better host offset locality
>>   2. less space allocation operations
>>      (which can be expensive on distributed storages)
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> I don't think this is the right approach. You end up with two
> write operations: One write_zeroes and then one data write. If the
> backend actually supports efficient zero writes, write_zeroes won't
> necessarily allocate space, but writing data will definitely split
> the already existing allocation. If anything, we need a new
> bdrv_allocate() or something that would call fallocate() instead of
> abusing write_zeroes.
great idea. Very nice then.

> It seems much clearer to me that simply unifying the three write
> requests into a single one is an improvement. And it's easy to do, I
> even had a patch once to do this. The reason that I didn't send it was
> that it seemed to conflict with the data cache approach
These changes help a lot with 2 patterns:
- writing 4Kb into the middle of the block. The bigger the block size,
  the more we gain
- a sequential write which is not sector-aligned and comes in 2 requests:
  we perform the allocation, which should be significantly faster than the
  actual write, and after that both parts of the write can be executed in
  parallel.
In my opinion both cases are frequent and important.

But OK, the code should be improved, you are absolutely right here.

>>  block/qcow2.c | 32 +++++++++++++++++++++++++++++++-
>>  1 file changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index a8d61f0..2e6a0ec 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -1575,6 +1575,32 @@ fail:
>>      return ret;
>>  }
>>  
>> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    BlockDriverState *file = bs->file->bs;
>> +    QCowL2Meta *m;
>> +    int ret;
>> +
>> +    for (m = l2meta; m != NULL; m = m->next) {
>> +        uint64_t bytes = m->nb_clusters << s->cluster_bits;
>> +
>> +        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
>> +            continue;
>> +        }
>> +
>> +        /* try to alloc host space in one chunk for better locality */
>> +        ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0);
> No. This is what you bypass:
>
> * All sanity checks that the block layer does
>
> * bdrv_inc/dec_in_flight(), which is required for drain to work
>   correctly. Not doing this will cause crashes.
>
> * tracked_request_begin/end(), mark_request_serialising() and
>   wait_serialising_requests(), which are required for serialising
>   requests to work correctly
>
> * Ensuring correct request alignment for file. This means that e.g.
>   qcow2 with cluster size 512 on a host with a 4k native disk will
>   break.
>
> * blkdebug events
>
> * before_write_notifiers. Not calling these will cause corrupted backups
>   if someone backups file.
>
> * Dirty bitmap updates
>
> * Updating write_gen, wr_highest_offset and total_sectors
>
> * Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are
>   respected
>
> And these are just the obvious things. I'm sure I missed some.
>

You seem right. I have not thought about that from this angle.

>> +        if (ret != 0) {
>> +            continue;
>> +        }
>> +
>> +        file->total_sectors = MAX(file->total_sectors,
>> +                                  (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
> You only compensate for part of a single item in the list above, by
> duplicating code with block/io.c. This is not how to do things.
>
> As I said above, I think you don't really want write_zeroes anyway, but
> if you wanted a write_zeroes "but only if it's efficient" (which I'm not
> sure is a good thing to want), then a better way might be introducing a
> new request flag.
>
>> +    }
>> +}
>> +
>>  static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>                                           uint64_t bytes, QEMUIOVector *qiov,
>>                                           int flags)
> Kevin

Thank you for review and ideas ;)

Den


* Re: [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk
  2017-05-26  8:57     ` Denis V. Lunev
@ 2017-05-26 10:09       ` Anton Nefedov
  2017-05-26 11:16       ` Kevin Wolf
  1 sibling, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-26 10:09 UTC (permalink / raw)
  To: Denis V. Lunev, Kevin Wolf; +Cc: qemu-devel, mreitz, Denis V. Lunev



On 05/26/2017 11:57 AM, Denis V. Lunev wrote:
> On 05/26/2017 11:11 AM, Kevin Wolf wrote:
>> Am 19.05.2017 um 11:34 hat Anton Nefedov geschrieben:
>>> From: "Denis V. Lunev" <den@openvz.org>
>>>
>>> Currently each single write operation can result in 3 write operations
>>> if guest offsets are not cluster aligned. One write is performed for the
>>> real payload and two for COW-ed areas. Thus the data possibly lays
>>> non-contiguously on the host filesystem. This will reduce further
>>> sequential read performance significantly.
>>>
>>> The patch allocates the space in the file with cluster granularity,
>>> ensuring
>>>    1. better host offset locality
>>>    2. less space allocation operations
>>>       (which can be expensive on distributed storages)
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> I don't think this is the right approach. You end up with two
>> write operations: One write_zeroes and then one data write. If the
>> backend actually supports efficient zero writes, write_zeroes won't
>> necessarily allocate space, but writing data will definitely split
>> the already existing allocation. If anything, we need a new
>> bdrv_allocate() or something that would call fallocate() instead of
>> abusing write_zeroes.
> great idea. Very nice then.
> 

hi Kevin, thanks a lot for your comments.

The driver's callback is commented as:

    /*
     * Efficiently zero a region of the disk image.  Typically an image format
     * would use a compact metadata representation to implement this.  This
     * function pointer may be NULL or return -ENOSUP and .bdrv_co_writev()
     * will be called instead.
     */
    int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs,
        int64_t offset, int count, BdrvRequestFlags flags);


So it looks like one may expect either efficiency or ENOTSUP from that
one? But not necessarily from the safe "facade" bdrv_co_pwrite_zeroes(),
which falls back to pwritev()

/*
 * Efficiently zero a region of the disk image.  Note that this is a regular
 * I/O request like read or write and should have a reasonable size.  This
 * function is not suitable for zeroing the entire image in a single request
 * because it may allocate memory for the entire region.
 */
int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
                                       int count, BdrvRequestFlags flags);


(maybe we should have an even more explicit comment here)


Then let's indeed create a new bdrv_allocate(), which will result in
bdrv_co_pwrite_zeroes(flags|=ZERO_EFFICIENTLY), so we'll reuse most of
the code and employ the same driver callback, but return ENOTSUP if it
fails?


>> It seems much clearer to me that simply unifying the three write
>> requests into a single one is an improvement. And it's easy to do, I
>> even had a patch once to do this. The reason that I didn't send it was
>> that it seemed to conflict with the data cache approach
> These changes help a lot on 2 patterns:
> - writing 4Kb into the middle of the block. The bigger the block size,
>    more we gain
> - sequential write, which is not sector aligned and comes in 2 requests:
>    we perform allocation, which should be significantly faster than actual
>    write and after that both parts of the write can be executed in parallel.
> At my opinion both cases are frequent and important.
> 
> But OK, the code should be improved, you are absolutely right here.
> 
>>>   block/qcow2.c | 32 +++++++++++++++++++++++++++++++-
>>>   1 file changed, 31 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index a8d61f0..2e6a0ec 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -1575,6 +1575,32 @@ fail:
>>>       return ret;
>>>   }
>>>   
>>> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
>>> +{
>>> +    BDRVQcow2State *s = bs->opaque;
>>> +    BlockDriverState *file = bs->file->bs;
>>> +    QCowL2Meta *m;
>>> +    int ret;
>>> +
>>> +    for (m = l2meta; m != NULL; m = m->next) {
>>> +        uint64_t bytes = m->nb_clusters << s->cluster_bits;
>>> +
>>> +        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
>>> +            continue;
>>> +        }
>>> +
>>> +        /* try to alloc host space in one chunk for better locality */
>>> +        ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0);
>> No. This is what you bypass:
>>
>> * All sanity checks that the block layer does
>>
>> * bdrv_inc/dec_in_flight(), which is required for drain to work
>>    correctly. Not doing this will cause crashes.
>>
>> * tracked_request_begin/end(), mark_request_serialising() and
>>    wait_serialising_requests(), which are required for serialising
>>    requests to work correctly
>>
>> * Ensuring correct request alignment for file. This means that e.g.
>>    qcow2 with cluster size 512 on a host with a 4k native disk will
>>    break.
>>
>> * blkdebug events
>>
>> * before_write_notifiers. Not calling these will cause corrupted backups
>>    if someone backups file.
>>
>> * Dirty bitmap updates
>>
>> * Updating write_gen, wr_highest_offset and total_sectors
>>
>> * Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are
>>    respected
>>
>> And these are just the obvious things. I'm sure I missed some.
>>
> 
> You seem right. I have not thought about that from this angle.
> 
>>> +        if (ret != 0) {
>>> +            continue;
>>> +        }
>>> +
>>> +        file->total_sectors = MAX(file->total_sectors,
>>> +                                  (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
>> You only compensate for part of a single item in the list above, by
>> duplicating code with block/io.c. This is not how to do things.
>>
>> As I said above, I think you don't really want write_zeroes anyway, but
>> if you wanted a write_zeroes "but only if it's efficient" (which I'm not
>> sure is a good thing to want), then a better way might be introducing a
>> new request flag.
>>
>>> +    }
>>> +}
>>> +
>>>   static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>>                                            uint64_t bytes, QEMUIOVector *qiov,
>>>                                            int flags)
>> Kevin
> 
> Thank you for review and ideas ;)
> 
> Den
> 
/Anton


* Re: [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk
  2017-05-26  8:11   ` Kevin Wolf
  2017-05-26  8:57     ` Denis V. Lunev
@ 2017-05-26 10:57     ` Denis V. Lunev
  2017-05-26 11:32       ` Kevin Wolf
  1 sibling, 1 reply; 36+ messages in thread
From: Denis V. Lunev @ 2017-05-26 10:57 UTC (permalink / raw)
  To: Kevin Wolf, Anton Nefedov; +Cc: qemu-devel, mreitz, Denis V. Lunev

On 05/26/2017 11:11 AM, Kevin Wolf wrote:
>
> No. This is what you bypass:
Some analysis for the record.

> * All sanity checks that the block layer does
>
> * bdrv_inc/dec_in_flight(), which is required for drain to work
>   correctly. Not doing this will cause crashes.
Should not be a problem. We are inside the request, which is accounted
for in the parent. Drain waits for the parent first and only after that
switches to the child.

> * tracked_request_begin/end(), mark_request_serialising() and
>   wait_serialising_requests(), which are required for serialising
>   requests to work correctly
Should not cause a problem, as we are protected by the meta and this
area is not exposed to read operations until fully complete. But yes,
not that good.
 
> * Ensuring correct request alignment for file. This means that e.g.
>   qcow2 with cluster size 512 on a host with a 4k native disk will
>   break.
We have already checked that the clusters are aligned.

> * blkdebug events
yes
> * before_write_notifiers. Not calling these will cause corrupted backups
>   if someone backups file.
No, as there is no user-visible data here; we are writing zeroes where
there are zeroes from the guest's point of view.

> * Dirty bitmap updates
Same: the data is not changed from the guest's point of view.
It is even good that we do not mark these areas as dirty.

> * Updating write_gen, wr_highest_offset and total_sectors
write_gen is not a problem; it will be set on the actual write.
total_sectors is changed manually. I agree that this is not
really elegant.

> * Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are
>   respected
yes
> And these are just the obvious things. I'm sure I missed some.
Anyway, this is a good list of things to take into account within
bdrv_allocate. But it is important that, from the guest's point of
view, the content of the COW areas is not changed, and thus we can
take some important shortcuts.

Den

* Re: [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk
  2017-05-26  8:57     ` Denis V. Lunev
  2017-05-26 10:09       ` Anton Nefedov
@ 2017-05-26 11:16       ` Kevin Wolf
  1 sibling, 0 replies; 36+ messages in thread
From: Kevin Wolf @ 2017-05-26 11:16 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Anton Nefedov, qemu-devel, mreitz, Denis V. Lunev

Am 26.05.2017 um 10:57 hat Denis V. Lunev geschrieben:
> On 05/26/2017 11:11 AM, Kevin Wolf wrote:
> > It seems much clearer to me that simply unifying the three write
> > requests into a single one is an improvement. And it's easy to do, I
> > even had a patch once to do this. The reason that I didn't send it was
> > that it seemed to conflict with the data cache approach
> These changes help a lot with two patterns:
> - writing 4KB into the middle of a block: the bigger the block size,
>   the more we gain
> - a sequential write which is not sector-aligned and comes in two
>   requests: we perform the allocation, which should be significantly
>   faster than the actual write, and after that both parts of the
>   write can be executed in parallel.
> In my opinion, both cases are frequent and important.

Both cases are helped somewhat by reducing the number of I/O requests.
So as a first step, I really think we need to get Berto's patches ready
to be merged.

As for this series, there is one important restriction that you don't
mention: Both cases only benefit from the changes when there is no
backing file (or the backing file is sparse at the given cluster). The
main reason why people use qcow2 is for snapshots, so qcow2 without or
with an empty backing file is actually not my primary concern.

To be honest, I am a bit concerned by the additional complexity brought
in by this series (which at first sight also seemed tacked on rather
than fully integrated), but I'll have to study it in more detail.


Maybe another thing to keep in mind is that the sequential write case
can completely avoid COW in theory. I wonder whether we should make
another attempt at this. Previously, it was the complexity of my own
prototypes that stopped me from actually merging them.

Kevin

* Re: [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk
  2017-05-26 10:57     ` Denis V. Lunev
@ 2017-05-26 11:32       ` Kevin Wolf
  0 siblings, 0 replies; 36+ messages in thread
From: Kevin Wolf @ 2017-05-26 11:32 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Anton Nefedov, qemu-devel, mreitz, Denis V. Lunev

Am 26.05.2017 um 12:57 hat Denis V. Lunev geschrieben:
> On 05/26/2017 11:11 AM, Kevin Wolf wrote:
> >
> > No. This is what you bypass:
> some analysis for the record.
> [...]

> anyway, this is a good list of things to take into account
> within bdrv_allocate. But it is important, that from the
> guest point of view the content of COW areas is not
> changed and thus we can have some important shortcuts.

Yes, not everything in this list is an actual problem for this specific
case. Some things are, though, and if you have to make an analysis like
this, that's already a sign that something is wrong.

The real point I wanted to make is that the bdrv_*() wrappers do a whole
lot of useful things (half of which everyone, including me, would
probably forget when listing them without looking at the code) - and on
top of this, when making changes to the block layer, we have an
expectation that every request consistently goes through the block/io.c
functions. So not wanting one of these useful things in a special case
isn't a good enough reason to completely bypass them.

If we really need to avoid some part of a function, we can always add
another flag or something that makes that part optional.

Kevin

* Re: [Qemu-devel] [PATCH v1 05/13] qcow2: set inactive flag
  2017-05-26  8:11   ` Kevin Wolf
@ 2017-05-31 16:56     ` Anton Nefedov
  0 siblings, 0 replies; 36+ messages in thread
From: Anton Nefedov @ 2017-05-31 16:56 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, den, mreitz

On 05/26/2017 11:11 AM, Kevin Wolf wrote:
> Am 19.05.2017 um 11:34 hat Anton Nefedov geschrieben:
>> Qcow2State and BlockDriverState flags have to be in sync
>>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> ---
>>   block/qcow2.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 6e7ce96..07c1706 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -1939,6 +1939,7 @@ static int qcow2_inactivate(BlockDriverState *bs)
>>   
>>       if (result == 0) {
>>           qcow2_mark_clean(bs);
>> +        s->flags |= BDRV_O_INACTIVE;
>>       }
> 
> Good catch.
> 
> But can't we simply use bs->open_flags and completely get rid of
> s->flags?
> 
> Kevin
> 

The problem is that qcow2_close() and qcow2_inactivate() have to
consider the flags that the image was opened with, but bs->open_flags
(at least in the invalidate_cache() case) already contains the new
flags at that point.

/Anton

end of thread, other threads:[~2017-05-31 16:56 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-19  9:34 [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Anton Nefedov
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk Anton Nefedov
2017-05-22 19:00   ` Eric Blake
2017-05-23  8:28     ` Anton Nefedov
2017-05-23  9:13     ` Denis V. Lunev
2017-05-26  8:11   ` Kevin Wolf
2017-05-26  8:57     ` Denis V. Lunev
2017-05-26 10:09       ` Anton Nefedov
2017-05-26 11:16       ` Kevin Wolf
2017-05-26 10:57     ` Denis V. Lunev
2017-05-26 11:32       ` Kevin Wolf
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 02/13] qcow2: is_zero_sectors(): return true if area is outside of backing file Anton Nefedov
2017-05-22 19:12   ` Eric Blake
2017-05-22 19:14     ` Eric Blake
2017-05-23  8:35       ` Anton Nefedov
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 03/13] qcow2: do not COW the empty areas Anton Nefedov
2017-05-22 19:24   ` Eric Blake
2017-05-23  8:31     ` Anton Nefedov
2017-05-23  9:15     ` Denis V. Lunev
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 04/13] qcow2: preallocation at image expand Anton Nefedov
2017-05-22 19:29   ` Eric Blake
2017-05-24 16:57     ` Anton Nefedov
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 05/13] qcow2: set inactive flag Anton Nefedov
2017-05-26  8:11   ` Kevin Wolf
2017-05-31 16:56     ` Anton Nefedov
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 06/13] qcow2: truncate preallocated space Anton Nefedov
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 07/13] qcow2: check space leak at the end of the image Anton Nefedov
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 08/13] qcow2: handle_prealloc(): find out if area zeroed by earlier preallocation Anton Nefedov
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 09/13] qcow2: fix misleading comment about L2 linking Anton Nefedov
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 10/13] qcow2-cluster: slightly refactor handle_dependencies() Anton Nefedov
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 11/13] qcow2-cluster: make handle_dependencies() logic easier to follow Anton Nefedov
2017-05-22 19:37   ` Eric Blake
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 12/13] qcow2: allow concurrent unaligned writes to the same clusters Anton Nefedov
2017-05-19  9:34 ` [Qemu-devel] [PATCH v1 13/13] iotest 046: test simultaneous cluster write error case Anton Nefedov
2017-05-23 14:35 ` [Qemu-devel] [PATCH v1 00/13] qcow2: space preallocation and COW improvements Eric Blake
2017-05-23 14:51   ` Denis V. Lunev
