All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements
@ 2017-06-01 15:14 Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 01/15] block: introduce BDRV_REQ_ALLOCATE flag Anton Nefedov
                   ` (15 more replies)
  0 siblings, 16 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov

Changes in v2:
  - introduce new BDRV flag for write_zeroes()
      instead of using driver callback directly.
    Skipped introducing new functions like bdrv_co_pallocate() for now:
      1. it seems ok to keep calling this write_zeroes() as zeroes
      are expected;
      2. most of the code can be reused now anyway, so changes to
      write_zeroes() path are not significant
      3. write_zeroes() alignment and max-request limits can also be reused

    As a possible alternative we can have bdrv_co_pallocate() which can
    switch to pwrite_zeroes(,flags|=BDRV_REQ_ALLOCATE) early.

========

This pull request is to address a few performance problems of qcow2 format:

  1. non cluster-aligned write requests (to unallocated clusters) explicitly
    pad data with zeroes if there is no backing data. This can be avoided
    and the whole clusters are preallocated and zeroed in a single
    efficient write_zeroes() operation, also providing better host file
    continuity

  2. moreover, efficient write_zeroes() operation can be used to preallocate
    space megabytes ahead which gives noticeable improvement on some storage
    types (e.g. distributed storages where space allocation operation is
    expensive)

  3. preallocating/zeroing the clusters in advance makes possible to enable
    simultaneous writes to the same unallocated cluster, which is beneficial
    for parallel sequential write operations which are not cluster-aligned

Performance test results are added to commit messages (see patch 3, 12)

Anton Nefedov (11):
  block: introduce BDRV_REQ_ALLOCATE flag
  file-posix: support BDRV_REQ_ALLOCATE
  blkdebug: support BDRV_REQ_ALLOCATE
  qcow2: do not COW the empty areas
  qcow2: set inactive flag
  qcow2: handle_prealloc(): find out if area zeroed by earlier
    preallocation
  qcow2: fix misleading comment about L2 linking
  qcow2-cluster: slightly refactor handle_dependencies()
  qcow2-cluster: make handle_dependencies() logic easier to follow
  qcow2: allow concurrent unaligned writes to the same clusters
  iotest 046: test simultaneous cluster write error case

Denis V. Lunev (3):
  qcow2: alloc space for COW in one chunk
  qcow2: preallocation at image expand
  qcow2: truncate preallocated space

Pavel Butsykin (1):
  qcow2: check space leak at the end of the image

 block/blkdebug.c                   |   3 +-
 block/file-posix.c                 |   9 +-
 block/io.c                         |  19 ++-
 block/qcow2-cache.c                |   3 +
 block/qcow2-cluster.c              | 218 +++++++++++++++++++++++------
 block/qcow2-refcount.c             |  21 +++
 block/qcow2.c                      | 273 ++++++++++++++++++++++++++++++++++++-
 block/qcow2.h                      |  26 ++++
 block/trace-events                 |   1 +
 include/block/block.h              |   6 +-
 tests/qemu-iotests/026.out         | 104 ++++++++++----
 tests/qemu-iotests/026.out.nocache | 104 ++++++++++----
 tests/qemu-iotests/029.out         |   5 +-
 tests/qemu-iotests/046             |  38 +++++-
 tests/qemu-iotests/046.out         |  23 ++++
 tests/qemu-iotests/060             |   2 +-
 tests/qemu-iotests/060.out         |  13 +-
 tests/qemu-iotests/061.out         |   5 +-
 tests/qemu-iotests/066             |   2 +-
 tests/qemu-iotests/066.out         |   9 +-
 tests/qemu-iotests/098.out         |   7 +-
 tests/qemu-iotests/108.out         |   5 +-
 tests/qemu-iotests/112.out         |   5 +-
 23 files changed, 789 insertions(+), 112 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 01/15] block: introduce BDRV_REQ_ALLOCATE flag
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 19:07   ` Eric Blake
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 02/15] file-posix: support BDRV_REQ_ALLOCATE Anton Nefedov
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov

The flag is supposed to indicate that the region of the disk image has
to be sufficiently allocated so it reads as zeroes. The call with the flag
set has to return -ENOTSUP if allocation cannot be done efficiently
(i.e. without falling back to writing actual buffers)

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/io.c            | 19 ++++++++++++++++---
 block/trace-events    |  1 +
 include/block/block.h |  6 +++++-
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/block/io.c b/block/io.c
index ed31810..d47efa9 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1272,7 +1272,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
             assert(!bs->supported_zero_flags);
         }
 
-        if (ret == -ENOTSUP) {
+        if (ret == -ENOTSUP && !(flags & BDRV_REQ_ALLOCATE)) {
             /* Fall back to bounce buffer if write zeroes is unsupported */
             BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
 
@@ -1355,8 +1355,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
     ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
 
     if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
-        !(flags & BDRV_REQ_ZERO_WRITE) && drv->bdrv_co_pwrite_zeroes &&
-        qemu_iovec_is_zero(qiov)) {
+        !(flags & BDRV_REQ_ZERO_WRITE) && !(flags & BDRV_REQ_ALLOCATE) &&
+        drv->bdrv_co_pwrite_zeroes && qemu_iovec_is_zero(qiov)) {
         flags |= BDRV_REQ_ZERO_WRITE;
         if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
             flags |= BDRV_REQ_MAY_UNMAP;
@@ -1436,6 +1436,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
 
     assert(flags & BDRV_REQ_ZERO_WRITE);
     if (head_padding_bytes || tail_padding_bytes) {
+        if (flags & BDRV_REQ_ALLOCATE) {
+            return -ENOTSUP;
+        }
         buf = qemu_blockalign(bs, align);
         iov = (struct iovec) {
             .iov_base   = buf,
@@ -1534,6 +1537,11 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
         return ret;
     }
 
+    if (qiov && flags & BDRV_REQ_ALLOCATE) {
+        /* allocation request with qiov provided doesn't make much sense */
+        return -ENOTSUP;
+    }
+
     bdrv_inc_in_flight(bs);
     /*
      * Align write if necessary by performing a read-modify-write cycle.
@@ -1665,6 +1673,11 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
 {
     trace_bdrv_co_pwrite_zeroes(child->bs, offset, count, flags);
 
+    if (flags & BDRV_REQ_MAY_UNMAP && flags & BDRV_REQ_ALLOCATE) {
+        /* nonsense */
+        return -ENOTSUP;
+    }
+
     if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
         flags &= ~BDRV_REQ_MAY_UNMAP;
     }
diff --git a/block/trace-events b/block/trace-events
index 9a71c7f..a15c2cc 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -15,6 +15,7 @@ bdrv_aio_writev(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs
 bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_pwrite_zeroes(void *bs, int64_t offset, int count, int flags) "bs %p offset %"PRId64" count %d flags %#x"
+bdrv_co_allocate(void *bs, int64_t offset, int count) "bs %p offset %"PRId64" count %d"
 bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t cluster_offset, unsigned int cluster_bytes) "bs %p offset %"PRId64" bytes %u cluster_offset %"PRId64" cluster_bytes %u"
 
 # block/stream.c
diff --git a/include/block/block.h b/include/block/block.h
index 9b355e9..53a357c 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -65,9 +65,13 @@ typedef enum {
     BDRV_REQ_NO_SERIALISING     = 0x8,
     BDRV_REQ_FUA                = 0x10,
     BDRV_REQ_WRITE_COMPRESSED   = 0x20,
+    /* BDRV_REQ_ALLOCATE is used to indicate that the driver is to
+     * efficiently allocate the space so it reads as zeroes or return an error
+     */
+    BDRV_REQ_ALLOCATE           = 0x40,
 
     /* Mask of valid flags */
-    BDRV_REQ_MASK               = 0x3f,
+    BDRV_REQ_MASK               = 0x7f,
 } BdrvRequestFlags;
 
 typedef struct BlockSizes {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 02/15] file-posix: support BDRV_REQ_ALLOCATE
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 01/15] block: introduce BDRV_REQ_ALLOCATE flag Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 19:49   ` Eric Blake
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 03/15] blkdebug: " Anton Nefedov
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov

Current write_zeroes implementation is good enough to satisfy this flag too

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/file-posix.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index de2d3a2..117bbee 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -527,7 +527,6 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 
     s->has_discard = true;
     s->has_write_zeroes = true;
-    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
     if ((bs->open_flags & BDRV_O_NOCACHE) != 0) {
         s->needs_alignment = true;
     }
@@ -577,6 +576,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
 #endif
 
+    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
+    if (s->has_write_zeroes || s->has_fallocate) {
+        bs->supported_zero_flags |= BDRV_REQ_ALLOCATE;
+    }
+
     ret = 0;
 fail:
     if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
@@ -1390,6 +1394,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
     }
 #endif
 
+    if (!s->has_fallocate) {
+        aiocb->bs->supported_zero_flags &= ~BDRV_REQ_ALLOCATE;
+    }
     return -ENOTSUP;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 03/15] blkdebug: support BDRV_REQ_ALLOCATE
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 01/15] block: introduce BDRV_REQ_ALLOCATE flag Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 02/15] file-posix: support BDRV_REQ_ALLOCATE Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 19:50   ` Eric Blake
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 04/15] qcow2: alloc space for COW in one chunk Anton Nefedov
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov

Support the flag if the underlying BDS supports it

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/blkdebug.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index a5196e8..8b1401b 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -415,7 +415,8 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
 
     bs->supported_write_flags = BDRV_REQ_FUA &
         bs->file->bs->supported_write_flags;
-    bs->supported_zero_flags = (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+    bs->supported_zero_flags =
+        (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
         bs->file->bs->supported_zero_flags;
     ret = -EINVAL;
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 04/15] qcow2: alloc space for COW in one chunk
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (2 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 03/15] blkdebug: " Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-05 14:28   ` Eric Blake
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 05/15] qcow2: do not COW the empty areas Anton Nefedov
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Denis V. Lunev, Anton Nefedov

From: "Denis V. Lunev" <den@openvz.org>

Currently each single write operation can result in 3 write operations
if guest offsets are not cluster aligned. One write is performed for the
real payload and two for COW-ed areas. Thus the data possibly lays
non-contiguously on the host filesystem. This will reduce further
sequential read performance significantly.

The patch allocates the space in the file with cluster granularity,
ensuring
  1. better host offset locality
  2. less space allocation operations
     (which can be expensive on distributed storage)

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/qcow2.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index b3ba5da..cd5efba 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1575,6 +1575,24 @@ fail:
     return ret;
 }
 
+static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
+{
+    BDRVQcow2State *s = bs->opaque;
+    QCowL2Meta *m;
+
+    for (m = l2meta; m != NULL; m = m->next) {
+        uint64_t bytes = m->nb_clusters << s->cluster_bits;
+
+        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
+            continue;
+        }
+
+        /* try to alloc host space in one chunk for better locality */
+        bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset, bytes,
+                              BDRV_REQ_ALLOCATE);
+    }
+}
+
 static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -1656,8 +1674,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
         if (ret < 0) {
             goto fail;
         }
-
         qemu_co_mutex_unlock(&s->lock);
+
+        if (bs->file->bs->supported_zero_flags & BDRV_REQ_ALLOCATE) {
+            handle_alloc_space(bs, l2meta);
+        }
+
         BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
         trace_qcow2_writev_data(qemu_coroutine_self(),
                                 cluster_offset + offset_in_cluster);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 05/15] qcow2: do not COW the empty areas
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (3 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 04/15] qcow2: alloc space for COW in one chunk Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-05 14:40   ` Eric Blake
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 06/15] qcow2: preallocation at image expand Anton Nefedov
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov, Denis V . Lunev

If COW area of the newly allocated cluster is zeroes, there is no reason
to write zero sectors in perform_cow() again now as whole clusters are
zeroed out in single chunks by handle_alloc_space().

Introduce QCowL2Meta field "reduced", since the existing fields
(offset and nb_bytes) still has to keep other write requests from
simultaneous writing in the area

iotest 060:
write to the discarded cluster does not trigger COW anymore.
so, break on write_aio event instead, will work for the test
(but write won't fail anymore, so update reference output)

iotest 066:
cluster-alignment areas that were not really COWed are now detected
as zeroes, hence the initial write has to be exactly the same size for
the maps to match

performance tests: ===

qemu-io,
  results in seconds to complete (less is better)
  random write 4k to empty image, no backing
    HDD
      64k cluster
        128M over 128M image:   160 -> 160 ( x1  )
        128M over   2G image:    86 ->  84 ( x1  )
        128M over   8G image:    40 ->  29 ( x1.4 )
      1M cluster
         32M over   8G image:    58 ->  23 ( x2.5 )

    SSD
      64k cluster
          2G over   2G image:    71 ->  38 (  x1.9 )
        512M over   8G image:    85 ->   8 ( x10.6 )
      1M cluster
        128M over  32G image:   314 ->   2 ( x157  )

  - improvement grows bigger the bigger the cluster size,
  - first data portions to the fresh image benefit the most
  (more chance to hit an unallocated cluster)
  - SSD improvement is close to the IO length reduction rate
  (e.g. writing only 4k instead of 64k) gives theoretical x16
  and practical x10 improvement)

fio tests over xfs, empty image (cluster 64k), no backing,

  first megabytes of random writes:
    randwrite 4k, size=8g:

      HDD (io_size=128m) :  730 ->  1050 IOPS ( x1.45)
      SSD (io_size=512m) : 1500 ->  7000 IOPS ( x4.7 )

  random writes io_size==image_size:
    randwrite 4k, size=2g io_size=2g:
                   HDD   : 200 IOPS (no difference)
                   SSD   : 7500 ->  9500 IOPS ( x1.3 )

  sequential write:
    seqwrite 4k, size=4g, iodepth=4
                   SSD   : 7000 -> 18000 IOPS ( x2.6 )

  - numbers are similar to qemu-io tests, slightly less improvement
  (damped by fs?)

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c      |  4 +++-
 block/qcow2.c              | 32 ++++++++++++++++++++++++++++++--
 block/qcow2.h              |  4 ++++
 tests/qemu-iotests/060     |  2 +-
 tests/qemu-iotests/060.out |  3 ++-
 tests/qemu-iotests/066     |  2 +-
 tests/qemu-iotests/066.out |  4 ++--
 7 files changed, 43 insertions(+), 8 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index d779ea1..ba8ce76 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -758,7 +758,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m, Qcow2COWRegion *r)
     BDRVQcow2State *s = bs->opaque;
     int ret;
 
-    if (r->nb_bytes == 0) {
+    if (r->nb_bytes == 0 || r->reduced) {
         return 0;
     }
 
@@ -1267,10 +1267,12 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
         .cow_start = {
             .offset     = 0,
             .nb_bytes   = offset_into_cluster(s, guest_offset),
+            .reduced    = false,
         },
         .cow_end = {
             .offset     = nb_bytes,
             .nb_bytes   = avail_bytes - nb_bytes,
+            .reduced    = false,
         },
     };
     qemu_co_queue_init(&(*m)->dependent_requests);
diff --git a/block/qcow2.c b/block/qcow2.c
index cd5efba..067bb87 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -64,6 +64,9 @@ typedef struct {
 #define  QCOW2_EXT_MAGIC_BACKING_FORMAT 0xE2792ACA
 #define  QCOW2_EXT_MAGIC_FEATURE_TABLE 0x6803f857
 
+static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
+                            uint32_t count);
+
 static int qcow2_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
     const QCowHeader *cow_header = (const void *)buf;
@@ -1575,10 +1578,30 @@ fail:
     return ret;
 }
 
+static void handle_cow_reduce(BlockDriverState *bs, QCowL2Meta *m)
+{
+    if (bs->encrypted) {
+        return;
+    }
+    if (!m->cow_start.reduced && m->cow_start.nb_bytes != 0 &&
+        is_zero_sectors(bs,
+                        (m->offset + m->cow_start.offset) >> BDRV_SECTOR_BITS,
+                        m->cow_start.nb_bytes >> BDRV_SECTOR_BITS)) {
+        m->cow_start.reduced = true;
+    }
+    if (!m->cow_end.reduced && m->cow_end.nb_bytes != 0 &&
+        is_zero_sectors(bs,
+                        (m->offset + m->cow_end.offset) >> BDRV_SECTOR_BITS,
+                        m->cow_end.nb_bytes >> BDRV_SECTOR_BITS)) {
+        m->cow_end.reduced = true;
+    }
+}
+
 static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
 {
     BDRVQcow2State *s = bs->opaque;
     QCowL2Meta *m;
+    int ret;
 
     for (m = l2meta; m != NULL; m = m->next) {
         uint64_t bytes = m->nb_clusters << s->cluster_bits;
@@ -1588,8 +1611,13 @@ static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
         }
 
         /* try to alloc host space in one chunk for better locality */
-        bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset, bytes,
-                              BDRV_REQ_ALLOCATE);
+        ret = bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset, bytes,
+                                    BDRV_REQ_ALLOCATE);
+        if (ret != 0) {
+            continue;
+        }
+
+        handle_cow_reduce(bs, m);
     }
 }
 
diff --git a/block/qcow2.h b/block/qcow2.h
index 1801dc3..ba15c08 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -305,6 +305,10 @@ typedef struct Qcow2COWRegion {
 
     /** Number of bytes to copy */
     int         nb_bytes;
+
+    /** The region is filled with zeroes and does not require COW
+     */
+    bool        reduced;
 } Qcow2COWRegion;
 
 /**
diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index 8e95c45..3a0f096 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -160,7 +160,7 @@ poke_file "$TEST_IMG" '131084' "\x00\x00" # 0x2000c
 # any unallocated cluster, leading to an attempt to overwrite the second L2
 # table. Finally, resume the COW write and see it fail (but not crash).
 echo "open -o file.driver=blkdebug $TEST_IMG
-break cow_read 0
+break write_aio 0
 aio_write 0k 1k
 wait_break 0
 write 64k 64k
diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index 3bc1461..d057aa1 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -107,7 +107,8 @@ qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps
 blkdebug: Suspended request '0'
 write failed: Input/output error
 blkdebug: Resuming request '0'
-aio_write failed: No medium found
+wrote 1024/1024 bytes at offset 0
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
 === Testing unallocated image header ===
 
diff --git a/tests/qemu-iotests/066 b/tests/qemu-iotests/066
index 8638217..3c216a1 100755
--- a/tests/qemu-iotests/066
+++ b/tests/qemu-iotests/066
@@ -71,7 +71,7 @@ echo
 _make_test_img $IMG_SIZE
 
 # Create data clusters (not aligned to an L2 table)
-$QEMU_IO -c 'write -P 42 1M 256k' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -P 42 $(((1024 + 32) * 1024)) 192k" "$TEST_IMG" | _filter_qemu_io
 orig_map=$($QEMU_IMG map --output=json "$TEST_IMG")
 
 # Convert the data clusters to preallocated zero clusters
diff --git a/tests/qemu-iotests/066.out b/tests/qemu-iotests/066.out
index 3d9da9b..093431e 100644
--- a/tests/qemu-iotests/066.out
+++ b/tests/qemu-iotests/066.out
@@ -19,8 +19,8 @@ Offset          Length          Mapped to       File
 === Writing to preallocated zero clusters ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67109376
-wrote 262144/262144 bytes at offset 1048576
-256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 196608/196608 bytes at offset 1081344
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 262144/262144 bytes at offset 1048576
 256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 196608/196608 bytes at offset 1081344
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 06/15] qcow2: preallocation at image expand
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (4 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 05/15] qcow2: do not COW the empty areas Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 07/15] qcow2: set inactive flag Anton Nefedov
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Denis V. Lunev, Anton Nefedov

From: "Denis V. Lunev" <den@openvz.org>

This patch adds image preallocation at expand to provide better locality
of QCOW2 image file and optimize this procedure for some distributed
storages where this procedure is slow.

Image expand requests have to be suspended until the allocation is
performed which is done via special QCowL2Meta.
This meta is invisible to handle_dependencies() code.
This is the main reason for also calling preallocation before metadata
write: it might intersect with preallocation triggered by another IO,
and has to yield

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/qcow2-cache.c    |   3 +
 block/qcow2-cluster.c  |   5 ++
 block/qcow2-refcount.c |  14 +++++
 block/qcow2.c          | 147 +++++++++++++++++++++++++++++++++++++++++++++++++
 block/qcow2.h          |   5 ++
 5 files changed, 174 insertions(+)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 1d25147..aa9da5f 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -204,6 +204,9 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i)
         return ret;
     }
 
+    /* check and preallocate extra space if touching a fresh metadata cluster */
+    qcow2_handle_prealloc(bs, c->entries[i].offset, s->cluster_size);
+
     if (c == s->refcount_block_cache) {
         BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE_PART);
     } else if (c == s->l2_table_cache) {
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index ba8ce76..88dd555 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -108,6 +108,9 @@ int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
         goto fail;
     }
 
+    qcow2_handle_prealloc(bs, new_l1_table_offset,
+                          QEMU_ALIGN_UP(new_l1_size2, s->cluster_size));
+
     BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_WRITE_TABLE);
     for(i = 0; i < s->l1_size; i++)
         new_l1_table[i] = cpu_to_be64(new_l1_table[i]);
@@ -1821,6 +1824,8 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
                 goto fail;
             }
 
+            qcow2_handle_prealloc(bs, offset, s->cluster_size);
+
             ret = bdrv_pwrite_zeroes(bs->file, offset, s->cluster_size, 0);
             if (ret < 0) {
                 if (cluster_type == QCOW2_CLUSTER_ZERO_PLAIN) {
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 7c06061..873a1d2 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -547,6 +547,8 @@ static int alloc_refcount_block(BlockDriverState *bs,
     }
 
     /* Write refcount blocks to disk */
+    qcow2_handle_prealloc(bs, meta_offset, blocks_clusters * s->cluster_size);
+
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE_BLOCKS);
     ret = bdrv_pwrite_sync(bs->file, meta_offset, new_blocks,
         blocks_clusters * s->cluster_size);
@@ -561,6 +563,10 @@ static int alloc_refcount_block(BlockDriverState *bs,
         cpu_to_be64s(&new_table[i]);
     }
 
+    qcow2_handle_prealloc(bs, table_offset,
+                          QEMU_ALIGN_UP(table_size * sizeof(uint64_t),
+                                        s->cluster_size));
+
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE_TABLE);
     ret = bdrv_pwrite_sync(bs->file, table_offset, new_table,
         table_size * sizeof(uint64_t));
@@ -2104,6 +2110,8 @@ write_refblocks:
             goto fail;
         }
 
+        qcow2_handle_prealloc(bs, refblock_offset, s->cluster_size);
+
         /* The size of *refcount_table is always cluster-aligned, therefore the
          * write operation will not overflow */
         on_disk_refblock = (void *)((char *) *refcount_table +
@@ -2158,6 +2166,8 @@ write_refblocks:
     }
 
     assert(reftable_size < INT_MAX / sizeof(uint64_t));
+    qcow2_handle_prealloc(bs, reftable_offset,
+                          reftable_size * sizeof(uint64_t));
     ret = bdrv_pwrite(bs->file, reftable_offset, on_disk_reftable,
                       reftable_size * sizeof(uint64_t));
     if (ret < 0) {
@@ -2845,6 +2855,10 @@ int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
         cpu_to_be64s(&new_reftable[i]);
     }
 
+    qcow2_handle_prealloc(bs, new_reftable_offset,
+                          QEMU_ALIGN_UP(new_reftable_size * sizeof(uint64_t),
+                                        s->cluster_size));
+
     ret = bdrv_pwrite(bs->file, new_reftable_offset, new_reftable,
                       new_reftable_size * sizeof(uint64_t));
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 067bb87..c6fb714 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -464,6 +464,11 @@ static QemuOptsList qcow2_runtime_opts = {
             .type = QEMU_OPT_NUMBER,
             .help = "Clean unused cache entries after this time (in seconds)",
         },
+        {
+            .name = QCOW2_OPT_PREALLOC_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "Preallocation amount at image expand",
+        },
         { /* end of list */ }
     },
 };
@@ -754,6 +759,15 @@ static int qcow2_update_options_prepare(BlockDriverState *bs,
     r->discard_passthrough[QCOW2_DISCARD_OTHER] =
         qemu_opt_get_bool(opts, QCOW2_OPT_DISCARD_OTHER, false);
 
+    s->prealloc_size =
+        ROUND_UP(qemu_opt_get_size_del(opts, QCOW2_OPT_PREALLOC_SIZE, 0),
+                 s->cluster_size);
+    if (s->prealloc_size &&
+        !(bs->file->bs->supported_zero_flags & BDRV_REQ_ALLOCATE))
+    {
+        s->prealloc_size = 0;
+    }
+
     ret = 0;
 fail:
     qemu_opts_del(opts);
@@ -1597,6 +1611,129 @@ static void handle_cow_reduce(BlockDriverState *bs, QCowL2Meta *m)
     }
 }
 
+/*
+ * Checks that the host space area specified by @m is not being preallocated
+ * at the moment, and does co_queue_wait() if it is.
+ * If the specified area is not allocated yet, allocates it + prealloc_size
+ * bytes ahead.
+ *
+ * Returns
+ *   true if the space is allocated and contains zeroes
+ */
+static bool coroutine_fn handle_prealloc(BlockDriverState *bs,
+                                         const QCowL2Meta *m)
+{
+    BDRVQcow2State *s = bs->opaque;
+    BlockDriverState *file = bs->file->bs;
+    QCowL2Meta *old, *meta;
+    uint64_t start = m->alloc_offset;
+    uint64_t end = start + (m->nb_clusters << s->cluster_bits);
+    uint64_t nbytes;
+    int ret;
+
+    assert(offset_into_cluster(s, start) == 0);
+
+restart:
+    /* check that the request is not overlapped with any
+       currently running preallocations */
+    QLIST_FOREACH(old, &s->cluster_allocs, next_in_flight) {
+        uint64_t old_start, old_end;
+
+        old_start = old->alloc_offset;
+        old_end = old_start + (old->nb_clusters << s->cluster_bits);
+
+        if (old == m || end <= old_start || start >= old_end) {
+            /* No intersection */
+            continue;
+        }
+
+        qemu_co_queue_wait(&old->dependent_requests, NULL);
+        goto restart;
+    }
+
+    if (end <= bdrv_getlength(file)) {
+        /* No need to care, file size will not be changed */
+        return false;
+    }
+
+    meta = g_alloca(sizeof(*meta));
+    *meta = (QCowL2Meta) {
+        /* this meta is invisible for handle_dependencies() */
+        .alloc_offset   = bdrv_getlength(file),
+        .nb_clusters    = size_to_clusters(s, start +
+                (m->nb_clusters << s->cluster_bits) +
+                s->prealloc_size - bdrv_getlength(file)),
+    };
+    qemu_co_queue_init(&meta->dependent_requests);
+    QLIST_INSERT_HEAD(&s->cluster_allocs, meta, next_in_flight);
+
+    nbytes = meta->nb_clusters << s->cluster_bits;
+
+    /* try to alloc host space in one chunk for better locality */
+    ret = bdrv_co_pwrite_zeroes(bs->file, meta->alloc_offset, nbytes,
+                                BDRV_REQ_ALLOCATE);
+
+    QLIST_REMOVE(meta, next_in_flight);
+    qemu_co_queue_restart_all(&meta->dependent_requests);
+
+    return (ret == 0) ? start >= meta->alloc_offset : false;
+}
+
+typedef struct {
+    BlockDriverState *bs;
+    uint64_t offset;
+    uint64_t size;
+    int ret;
+} PreallocCo;
+
+static void coroutine_fn handle_prealloc_co_entry(void* opaque)
+{
+    PreallocCo *prco = opaque;
+    BDRVQcow2State *s = prco->bs->opaque;
+    QCowL2Meta meta = {
+        /* this meta is invisible for handle_dependencies() */
+        .alloc_offset = prco->offset,
+        .nb_clusters  = size_to_clusters(s, prco->size)
+    };
+    handle_prealloc(prco->bs, &meta);
+    prco->ret = 0;
+}
+
+/*
+ * Context(coroutine)-independent interface around handle_prealloc(), see
+ * its description.
+ * Must be called on a first write on the newly allocated cluster(s).
+ * @offset and @size must be cluster_aligned
+ */
+void qcow2_handle_prealloc(BlockDriverState *bs, uint64_t offset, uint64_t size)
+{
+    BDRVQcow2State *s = bs->opaque;
+    PreallocCo prco = {
+        .bs     = bs,
+        .offset = offset,
+        .size   = size,
+        .ret    = -EAGAIN
+    };
+
+    assert(offset_into_cluster(s, offset) == 0);
+    assert(offset_into_cluster(s, size) == 0);
+
+    if (s->prealloc_size == 0) {
+        return;
+    }
+
+    if (qemu_in_coroutine()) {
+        handle_prealloc_co_entry(&prco);
+    } else {
+        AioContext *aio_context = bdrv_get_aio_context(bs);
+        Coroutine *co = qemu_coroutine_create(handle_prealloc_co_entry, &prco);
+        qemu_coroutine_enter(co);
+        while (prco.ret == -EAGAIN) {
+            aio_poll(aio_context, true);
+        }
+    }
+}
+
 static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
 {
     BDRVQcow2State *s = bs->opaque;
@@ -1606,6 +1743,11 @@ static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
     for (m = l2meta; m != NULL; m = m->next) {
         uint64_t bytes = m->nb_clusters << s->cluster_bits;
 
+        if (s->prealloc_size != 0 && handle_prealloc(bs, m)) {
+            handle_cow_reduce(bs, m);
+            continue;
+        }
+
         if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
             continue;
         }
@@ -2720,6 +2862,11 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, uint64_t offset,
         goto fail;
     }
 
+    qcow2_handle_prealloc(bs, start_of_cluster(s, cluster_offset),
+                          QEMU_ALIGN_UP(
+                              offset_into_cluster(s, cluster_offset) + out_len,
+                              s->cluster_size));
+
     iov = (struct iovec) {
         .iov_base   = out_buf,
         .iov_len    = out_len,
diff --git a/block/qcow2.h b/block/qcow2.h
index ba15c08..a0d222d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -97,6 +97,7 @@
 #define QCOW2_OPT_L2_CACHE_SIZE "l2-cache-size"
 #define QCOW2_OPT_REFCOUNT_CACHE_SIZE "refcount-cache-size"
 #define QCOW2_OPT_CACHE_CLEAN_INTERVAL "cache-clean-interval"
+#define QCOW2_OPT_PREALLOC_SIZE "prealloc-size"
 
 typedef struct QCowHeader {
     uint32_t magic;
@@ -294,6 +295,8 @@ typedef struct BDRVQcow2State {
      * override) */
     char *image_backing_file;
     char *image_backing_format;
+
+    uint64_t prealloc_size;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
@@ -493,6 +496,8 @@ int qcow2_mark_dirty(BlockDriverState *bs);
 int qcow2_mark_corrupt(BlockDriverState *bs);
 int qcow2_mark_consistent(BlockDriverState *bs);
 int qcow2_update_header(BlockDriverState *bs);
+void qcow2_handle_prealloc(BlockDriverState *bs,
+                           uint64_t offset, uint64_t size);
 
 void qcow2_signal_corruption(BlockDriverState *bs, bool fatal, int64_t offset,
                              int64_t size, const char *message_format, ...)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 07/15] qcow2: set inactive flag
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (5 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 06/15] qcow2: preallocation at image expand Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 08/15] qcow2: truncate preallocated space Anton Nefedov
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov

Qcow2State and BlockDriverState flags have to be in sync

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/qcow2.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index c6fb714..b090833 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1932,6 +1932,7 @@ static int qcow2_inactivate(BlockDriverState *bs)
 
     if (result == 0) {
         qcow2_mark_clean(bs);
+        s->flags |= BDRV_O_INACTIVE;
     }
 
     return result;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 08/15] qcow2: truncate preallocated space
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (6 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 07/15] qcow2: set inactive flag Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 09/15] qcow2: check space leak at the end of the image Anton Nefedov
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Denis V. Lunev, Anton Nefedov

From: "Denis V. Lunev" <den@openvz.org>

This could be done after calculation of the end of data and metadata in
the qcow2 image.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/qcow2-cluster.c  | 9 +++++++++
 block/qcow2-refcount.c | 7 +++++++
 block/qcow2.c          | 8 ++++++++
 block/qcow2.h          | 3 +++
 4 files changed, 27 insertions(+)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 88dd555..c39e825 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1958,3 +1958,12 @@ fail:
     g_free(l1_table);
     return ret;
 }
+
+void qcow2_update_data_end(BlockDriverState *bs, uint64_t off)
+{
+    BDRVQcow2State *s = bs->opaque;
+
+    if (s->data_end < off) {
+        s->data_end = off;
+    }
+}
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 873a1d2..8156466 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -744,6 +744,9 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
             ret = alloc_refcount_block(bs, cluster_index, &refcount_block);
             if (ret < 0) {
                 goto fail;
+            } else {
+                qcow2_update_data_end(bs, s->refcount_table_offset +
+                        s->refcount_table_size * sizeof(uint64_t));
             }
         }
         old_table_index = table_index;
@@ -865,6 +868,8 @@ retry:
         s->free_cluster_index - 1 > (INT64_MAX >> s->cluster_bits))
     {
         return -EFBIG;
+    } else {
+        qcow2_update_data_end(bs, s->free_cluster_index << s->cluster_bits);
     }
 
 #ifdef DEBUG_ALLOC2
@@ -929,6 +934,8 @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
 
     if (ret < 0) {
         return ret;
+    } else {
+        qcow2_update_data_end(bs, offset + (nb_clusters << s->cluster_bits));
     }
 
     return i;
diff --git a/block/qcow2.c b/block/qcow2.c
index b090833..33e5455 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1194,6 +1194,8 @@ static int qcow2_do_open(BlockDriverState *bs, QDict *options, int flags,
         }
     }
 
+    s->data_end = bdrv_getlength(bs->file->bs);
+
 #ifdef DEBUG_ALLOC
     {
         BdrvCheckResult result = {0};
@@ -1941,12 +1943,18 @@ static int qcow2_inactivate(BlockDriverState *bs)
 static void qcow2_close(BlockDriverState *bs)
 {
     BDRVQcow2State *s = bs->opaque;
+
     qemu_vfree(s->l1_table);
     /* else pre-write overlap checks in cache_destroy may crash */
     s->l1_table = NULL;
 
     if (!(s->flags & BDRV_O_INACTIVE)) {
         qcow2_inactivate(bs);
+
+        /* truncate preallocated space */
+        if (!bs->read_only && s->data_end < bdrv_getlength(bs->file->bs)) {
+            bdrv_truncate(bs->file, s->data_end, NULL);
+        }
     }
 
     cache_clean_timer_del(bs);
diff --git a/block/qcow2.h b/block/qcow2.h
index a0d222d..e28c54a 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -297,6 +297,7 @@ typedef struct BDRVQcow2State {
     char *image_backing_format;
 
     uint64_t prealloc_size;
+    uint64_t data_end;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
@@ -607,4 +608,6 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
     void **table);
 void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
 
+void qcow2_update_data_end(BlockDriverState *bs, uint64_t off);
+
 #endif
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 09/15] qcow2: check space leak at the end of the image
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (7 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 08/15] qcow2: truncate preallocated space Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 10/15] qcow2: handle_prealloc(): find out if area zeroed by earlier preallocation Anton Nefedov
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: den, kwolf, mreitz, eblake, Pavel Butsykin, Denis V . Lunev,
	Anton Nefedov

From: Pavel Butsykin <pbutsykin@virtuozzo.com>

Preallocating memory in the image may remain unused after the fall, for
the qcow2_check adds the ability to identify and fix it, so as not to
store extra memory on the host.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 block/qcow2.c                      |  31 +++++++++++
 tests/qemu-iotests/026.out         | 104 ++++++++++++++++++++++++++++---------
 tests/qemu-iotests/026.out.nocache | 104 ++++++++++++++++++++++++++++---------
 tests/qemu-iotests/029.out         |   5 +-
 tests/qemu-iotests/060.out         |  10 +++-
 tests/qemu-iotests/061.out         |   5 +-
 tests/qemu-iotests/066.out         |   5 +-
 tests/qemu-iotests/098.out         |   7 ++-
 tests/qemu-iotests/108.out         |   5 +-
 tests/qemu-iotests/112.out         |   5 +-
 10 files changed, 225 insertions(+), 56 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 33e5455..809102a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -322,6 +322,32 @@ int qcow2_mark_consistent(BlockDriverState *bs)
     return 0;
 }
 
+static int qcow2_check_extra_preallocation(BlockDriverState *bs,
+    BdrvCheckResult *res, BdrvCheckMode fix)
+{
+    BDRVQcow2State *s = bs->opaque;
+    uint64_t img_size = bdrv_getlength(bs->file->bs);
+
+    if (res->image_end_offset < img_size) {
+        uint64_t count =
+            DIV_ROUND_UP(img_size - res->image_end_offset, s->cluster_size);
+        fprintf(stderr, "%s space leaked at the end of the image %jd\n",
+                fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR",
+                img_size - res->image_end_offset);
+        res->leaks += count;
+        if (fix & BDRV_FIX_LEAKS) {
+            int ret = bdrv_truncate(bs->file, res->image_end_offset, NULL);
+            if (ret < 0) {
+                res->check_errors++;
+                return ret;
+            }
+            res->leaks_fixed += count;
+        }
+    }
+
+    return 0;
+}
+
 static int qcow2_check(BlockDriverState *bs, BdrvCheckResult *result,
                        BdrvCheckMode fix)
 {
@@ -330,6 +356,11 @@ static int qcow2_check(BlockDriverState *bs, BdrvCheckResult *result,
         return ret;
     }
 
+    ret = qcow2_check_extra_preallocation(bs, result, fix);
+    if (ret < 0) {
+        return ret;
+    }
+
     if (fix && result->check_errors == 0 && result->corruptions == 0) {
         ret = qcow2_mark_clean(bs);
         if (ret < 0) {
diff --git a/tests/qemu-iotests/026.out b/tests/qemu-iotests/026.out
index 86a50a2..e8cf348 100644
--- a/tests/qemu-iotests/026.out
+++ b/tests/qemu-iotests/026.out
@@ -5,7 +5,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 5; imm: off; once: on; write
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 5; imm: off; once: on; write -b
@@ -33,7 +36,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 28; imm: off; once: on; write -b
@@ -181,7 +187,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 5; imm: off; once: on; write
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 5; imm: off; once: on; write -b
@@ -207,7 +216,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 28; imm: off; once: on; write -b
@@ -468,20 +480,27 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 394240
+
+770 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 5120
 
-55 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -489,19 +508,26 @@ Event: refblock_alloc_hookup; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 265728
 
-251 leaked clusters were found on the image.
+770 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 394240
+
+770 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: off; write
@@ -520,20 +546,27 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -541,27 +574,35 @@ Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -569,27 +610,35 @@ Event: refblock_alloc_write_table; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: on; write
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -597,8 +646,9 @@ Event: refblock_alloc_switch_table; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 
 === L1 growth tests ===
@@ -651,7 +701,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 5; imm: off; once: on
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 98304
+
+96 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 5; imm: off; once: off
@@ -665,7 +718,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 28; imm: off; once: on
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 98304
+
+96 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 28; imm: off; once: off
diff --git a/tests/qemu-iotests/026.out.nocache b/tests/qemu-iotests/026.out.nocache
index ea2e166..820ce0f 100644
--- a/tests/qemu-iotests/026.out.nocache
+++ b/tests/qemu-iotests/026.out.nocache
@@ -5,7 +5,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 5; imm: off; once: on; write 
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l1_update; errno: 5; imm: off; once: on; write -b
@@ -33,7 +36,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l1_update; errno: 28; imm: off; once: on; write -b
@@ -189,7 +195,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 5; imm: off; once: on; write 
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l2_alloc_write; errno: 5; imm: off; once: on; write -b
@@ -215,7 +224,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l2_alloc_write; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 1024
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l2_alloc_write; errno: 28; imm: off; once: on; write -b
@@ -476,20 +488,27 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 394240
+
+770 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: off; write 
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 5120
 
-55 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -497,19 +516,26 @@ Event: refblock_alloc_hookup; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 265728
 
-251 leaked clusters were found on the image.
+770 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 394240
+
+770 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: off; write 
@@ -528,20 +554,27 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: off; write 
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -549,27 +582,35 @@ Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: off; write 
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -577,27 +618,35 @@ Event: refblock_alloc_write_table; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: on; write 
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: on; write -b
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 33280
+
+65 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: off; write 
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 27648
 
-11 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -605,8 +654,9 @@ Event: refblock_alloc_switch_table; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
+ERROR space leaked at the end of the image 21504
 
-23 leaked clusters were found on the image.
+65 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 
 === L1 growth tests ===
@@ -659,7 +709,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 5; imm: off; once: on
 write failed: Input/output error
-No errors were found on the image.
+ERROR space leaked at the end of the image 98304
+
+96 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l1_grow_activate_table; errno: 5; imm: off; once: off
@@ -673,7 +726,10 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_grow_activate_table; errno: 28; imm: off; once: on
 write failed: No space left on device
-No errors were found on the image.
+ERROR space leaked at the end of the image 98304
+
+96 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
 Event: l1_grow_activate_table; errno: 28; imm: off; once: off
diff --git a/tests/qemu-iotests/029.out b/tests/qemu-iotests/029.out
index 5bc93e0..7176e38 100644
--- a/tests/qemu-iotests/029.out
+++ b/tests/qemu-iotests/029.out
@@ -6,7 +6,10 @@ is smaller than the current L1 table.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 wrote 4096/4096 bytes at offset 0
 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-No errors were found on the image.
+ERROR space leaked at the end of the image 65536
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=16777216
 wrote 4194304/4194304 bytes at offset 0
 4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index d057aa1..3d4e5b3 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -77,20 +77,26 @@ Leaked cluster 9 refcount=1 reference=0
 Repairing cluster 4 refcount=1 reference=2
 Repairing cluster 9 refcount=1 reference=0
 Repairing OFLAG_COPIED data cluster: l2_entry=8000000000040000 refcount=2
+ERROR space leaked at the end of the image 65536
 The following inconsistencies were found and repaired:
 
     1 leaked clusters
     2 corruptions
 
 Double checking the fixed image now...
-No errors were found on the image.
+
+1 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 incompatible_features     0x0
 wrote 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 incompatible_features     0x0
 read 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-No errors were found on the image.
+ERROR space leaked at the end of the image 131072
+
+2 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 read 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index a431b7f..69fb929 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -328,7 +328,10 @@ wrote 131072/131072 bytes at offset 0
 No errors were found on the image.
 read 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-No errors were found on the image.
+ERROR space leaked at the end of the image 196608
+
+3 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 read 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
diff --git a/tests/qemu-iotests/066.out b/tests/qemu-iotests/066.out
index 093431e..81ef795 100644
--- a/tests/qemu-iotests/066.out
+++ b/tests/qemu-iotests/066.out
@@ -13,7 +13,10 @@ discard 67109376/67109376 bytes at offset 0
 64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 read 67109376/67109376 bytes at offset 0
 64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-No errors were found on the image.
+ERROR space leaked at the end of the image 327680
+
+5 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Offset          Length          Mapped to       File
 
 === Writing to preallocated zero clusters ===
diff --git a/tests/qemu-iotests/098.out b/tests/qemu-iotests/098.out
index 7634d0e..733249f 100644
--- a/tests/qemu-iotests/098.out
+++ b/tests/qemu-iotests/098.out
@@ -20,7 +20,10 @@ Leaked cluster 4 refcount=1 reference=0
 Leaked cluster 5 refcount=1 reference=0
 Repairing cluster 4 refcount=1 reference=0
 Repairing cluster 5 refcount=1 reference=0
-No errors were found on the image.
+ERROR space leaked at the end of the image 131072
+
+2 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 
 === reftable_update ===
 
@@ -34,6 +37,7 @@ ERROR cluster 1 refcount=0 reference=1
 ERROR cluster 3 refcount=0 reference=1
 Rebuilding refcount structure
 Repairing cluster 1 refcount=1 reference=0
+Repairing space leaked at the end of the image 65536
 No errors were found on the image.
 
 === refblock_alloc ===
@@ -48,5 +52,6 @@ ERROR cluster 1 refcount=0 reference=1
 ERROR cluster 3 refcount=0 reference=1
 Rebuilding refcount structure
 Repairing cluster 1 refcount=1 reference=0
+Repairing space leaked at the end of the image 65536
 No errors were found on the image.
 *** done
diff --git a/tests/qemu-iotests/108.out b/tests/qemu-iotests/108.out
index 75bab8d..d178b1e 100644
--- a/tests/qemu-iotests/108.out
+++ b/tests/qemu-iotests/108.out
@@ -68,13 +68,16 @@ Rebuilding refcount structure
 Repairing cluster 1 refcount=1 reference=0
 Repairing cluster 2 refcount=1 reference=0
 Repairing cluster 16 refcount=1 reference=0
+ERROR space leaked at the end of the image 720896
 The following inconsistencies were found and repaired:
 
     0 leaked clusters
     2 corruptions
 
 Double checking the fixed image now...
-No errors were found on the image.
+
+11 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 
 --- Signed overflow after the refblock ---
 
diff --git a/tests/qemu-iotests/112.out b/tests/qemu-iotests/112.out
index 81b04d1..7386be8 100644
--- a/tests/qemu-iotests/112.out
+++ b/tests/qemu-iotests/112.out
@@ -93,7 +93,10 @@ refcount bits: 1
 
 === Amend from refcount_bits=1 to refcount_bits=64 ===
 
-No errors were found on the image.
+ERROR space leaked at the end of the image 131072
+
+2 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 refcount bits: 64
 
 === Amend to compat=0.10 ===
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 10/15] qcow2: handle_prealloc(): find out if area zeroed by earlier preallocation
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (8 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 09/15] qcow2: check space leak at the end of the image Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 11/15] qcow2: fix misleading comment about L2 linking Anton Nefedov
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov, Denis V . Lunev

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c | 2 ++
 block/qcow2.c         | 8 +++++++-
 block/qcow2.h         | 4 ++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index c39e825..ed65961 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1143,6 +1143,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
     uint64_t *host_offset, uint64_t *bytes, QCowL2Meta **m)
 {
     BDRVQcow2State *s = bs->opaque;
+    const uint64_t old_data_end = s->data_end;
     int l2_index;
     uint64_t *l2_table;
     uint64_t entry;
@@ -1264,6 +1265,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
         .alloc_offset   = alloc_cluster_offset,
         .offset         = start_of_cluster(s, guest_offset),
         .nb_clusters    = nb_clusters,
+        .clusters_are_trailing = alloc_cluster_offset >= old_data_end,
 
         .keep_old_clusters  = keep_old_clusters,
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 809102a..92d0af6 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1686,7 +1686,13 @@ restart:
 
     if (end <= bdrv_getlength(file)) {
         /* No need to care, file size will not be changed */
-        return false;
+
+        /* We're safe to assume that the area is zeroes if the area
+         * was allocated at the end of data (s->data_end).
+         * In this case, the only way for file length to be bigger is that
+         * the area was preallocated by another request.
+         */
+        return m->clusters_are_trailing;
     }
 
     meta = g_alloca(sizeof(*meta));
diff --git a/block/qcow2.h b/block/qcow2.h
index e28c54a..2fd8510 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -333,6 +333,10 @@ typedef struct QCowL2Meta
     /** Do not free the old clusters */
     bool keep_old_clusters;
 
+    /** True if the area is allocated after the end of data area
+     *  (i.e. >= s->data_end), which means that it is zeroed */
+    bool clusters_are_trailing;
+
     /**
      * Requests that overlap with this allocation and wait to be restarted
      * when the allocating request has completed.
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 11/15] qcow2: fix misleading comment about L2 linking
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (9 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 10/15] qcow2: handle_prealloc(): find out if area zeroed by earlier preallocation Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 12/15] qcow2-cluster: slightly refactor handle_dependencies() Anton Nefedov
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov, Denis V . Lunev

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index ed65961..3dafd19 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -827,12 +827,10 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
 
     assert(l2_index + m->nb_clusters <= s->l2_size);
     for (i = 0; i < m->nb_clusters; i++) {
-        /* if two concurrent writes happen to the same unallocated cluster
-         * each write allocates separate cluster and writes data concurrently.
-         * The first one to complete updates l2 table with pointer to its
-         * cluster the second one has to do RMW (which is done above by
-         * perform_cow()), update l2 table with its cluster pointer and free
-         * old cluster. This is what this loop does */
+        /* handle_dependencies() protects from normal cluster allocation
+         * collision; still L2 entry might be !0 in case of zero or compressed
+         * cluster reusage or writing over the snapshot
+         */
         if (l2_table[l2_index + i] != 0) {
             old_cluster[j++] = l2_table[l2_index + i];
         }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 12/15] qcow2-cluster: slightly refactor handle_dependencies()
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (10 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 11/15] qcow2: fix misleading comment about L2 linking Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 13/15] qcow2-cluster: make handle_dependencies() logic easier to follow Anton Nefedov
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov, Denis V . Lunev

  - assert the alignment on return if the allocation has to stop
    (at the start of a running allocation)
  - make use of const specifiers for local variables

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 3dafd19..95e8862 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -900,15 +900,15 @@ out:
  * Check if there already is an AIO write request in flight which allocates
  * the same cluster. In this case we need to wait until the previous
  * request has completed and updated the L2 table accordingly.
- *
  * Returns:
  *   0       if there was no dependency. *cur_bytes indicates the number of
  *           bytes from guest_offset that can be read before the next
- *           dependency must be processed (or the request is complete)
+ *           dependency must be processed (or the request is complete).
+ *           *m is not modified
  *
- *   -EAGAIN if we had to wait for another request, previously gathered
- *           information on cluster allocation may be invalid now. The caller
- *           must start over anyway, so consider *cur_bytes undefined.
+ *   -EAGAIN if we had to wait for another request. The caller
+ *           must start over, so consider *cur_bytes undefined.
+ *           *m is not modified
  */
 static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
     uint64_t *cur_bytes, QCowL2Meta **m)
@@ -919,10 +919,10 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
 
     QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight) {
 
-        uint64_t start = guest_offset;
-        uint64_t end = start + bytes;
-        uint64_t old_start = l2meta_cow_start(old_alloc);
-        uint64_t old_end = l2meta_cow_end(old_alloc);
+        const uint64_t start = guest_offset;
+        const uint64_t end = start + bytes;
+        const uint64_t old_start = l2meta_cow_start(old_alloc);
+        const uint64_t old_end = l2meta_cow_end(old_alloc);
 
         if (end <= old_start || start >= old_end) {
             /* No intersection */
@@ -939,6 +939,8 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
              * and deal with requests depending on them before starting to
              * gather new ones. Not worth the trouble. */
             if (bytes == 0 && *m) {
+                /* start must be cluster aligned at this point */
+                assert(start == start_of_cluster(s, start));
                 *cur_bytes = 0;
                 return 0;
             }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 13/15] qcow2-cluster: make handle_dependencies() logic easier to follow
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (11 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 12/15] qcow2-cluster: slightly refactor handle_dependencies() Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 14/15] qcow2: allow concurrent unaligned writes to the same clusters Anton Nefedov
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov, Denis V . Lunev

Avoid complicated nested conditions; return or continue asap instead.
The logic is not changed.

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c | 45 ++++++++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 95e8862..2fa549d 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -926,32 +926,31 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
 
         if (end <= old_start || start >= old_end) {
             /* No intersection */
-        } else {
-            if (start < old_start) {
-                /* Stop at the start of a running allocation */
-                bytes = old_start - start;
-            } else {
-                bytes = 0;
-            }
+            continue;
+        }
 
-            /* Stop if already an l2meta exists. After yielding, it wouldn't
-             * be valid any more, so we'd have to clean up the old L2Metas
-             * and deal with requests depending on them before starting to
-             * gather new ones. Not worth the trouble. */
-            if (bytes == 0 && *m) {
-                /* start must be cluster aligned at this point */
-                assert(start == start_of_cluster(s, start));
-                *cur_bytes = 0;
-                return 0;
-            }
+        if (start < old_start) {
+            /* Stop at the start of a running allocation */
+            bytes = old_start - start;
+            /* ..if there is no other conflict, keep checking */
+            continue;
+        }
 
-            if (bytes == 0) {
-                /* Wait for the dependency to complete. We need to recheck
-                 * the free/allocated clusters when we continue. */
-                qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
-                return -EAGAIN;
-            }
+        /* Stop if an l2meta already exists. After yielding, it wouldn't
+         * be valid any more, so we'd have to clean up the old L2Metas
+         * and deal with requests depending on them before starting to
+         * gather new ones. Not worth the trouble. */
+        if (*m) {
+            /* start must be cluster aligned at this point */
+            assert(start == start_of_cluster(s, start));
+            *cur_bytes = 0;
+            return 0;
         }
+
+        /* Wait for the dependency to complete. We need to recheck
+         * the free/allocated clusters when we continue. */
+        qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
+        return -EAGAIN;
     }
 
     /* Make sure that existing clusters and new allocations are only used up to
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 14/15] qcow2: allow concurrent unaligned writes to the same clusters
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (12 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 13/15] qcow2-cluster: make handle_dependencies() logic easier to follow Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 15/15] iotest 046: test simultaneous cluster write error case Anton Nefedov
  2017-06-01 18:41 ` [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements John Snow
  15 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov, Denis V . Lunev

If COW area of a write request to unallocated cluster is empty,
concurrent write requests can be allowed with a little bit of
extra synchronization; so they don't have to wait until L2 is filled.

Let qcow2_cluster.c::handle_dependencies() do the most of the job:
  if there is an in-flight request to the same cluster,
  and the current request wants to write in its COW area,
  and its COW area is marked empty,
  - steal the allocated offset and write concurrently. Let the original
    request update L2 later when it likes.

This gives an improvement for parallel misaligned writes to
unallocated clusters with no backing data:

HDD fio over xfs iodepth=4:
  seqwrite 4k:   18400 -> 22800 IOPS ( x1.24 )
  seqwrite 68k:   1600 ->  2300 IOPS ( x1.44 )

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2-cluster.c | 169 +++++++++++++++++++++++++++++++++++++++++++-------
 block/qcow2.c         |  28 ++++++++-
 block/qcow2.h         |  12 +++-
 3 files changed, 181 insertions(+), 28 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 2fa549d..03b2e6c 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -898,20 +898,32 @@ out:
 
 /*
  * Check if there already is an AIO write request in flight which allocates
- * the same cluster. In this case we need to wait until the previous
- * request has completed and updated the L2 table accordingly.
+ * the same cluster.
+ * In this case, check if that request has explicitly allowed to write
+ * in its COW area(s).
+ *   If yes - fill the meta to point to the same cluster.
+ *   If no  - we need to wait until the previous request has completed and
+ *            updated the L2 table accordingly or
+ *            has allowed writing in its COW area(s).
  * Returns:
  *   0       if there was no dependency. *cur_bytes indicates the number of
  *           bytes from guest_offset that can be read before the next
  *           dependency must be processed (or the request is complete).
- *           *m is not modified
+ *           *m, *host_offset are not modified
+ *
+ *   1       if there is a dependency but it is possible to write concurrently
+ *           *m is filled accordingly,
+ *           *cur_bytes may have decreased and describes
+ *             the length of the area that can be written to,
+ *           *host_offset contains the starting host image offset to write to
  *
  *   -EAGAIN if we had to wait for another request. The caller
- *           must start over, so consider *cur_bytes undefined.
+ *           must start over, so consider *cur_bytes and *host_offset undefined.
  *           *m is not modified
  */
 static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
-    uint64_t *cur_bytes, QCowL2Meta **m)
+                               uint64_t *host_offset, uint64_t *cur_bytes,
+                               QCowL2Meta **m)
 {
     BDRVQcow2State *s = bs->opaque;
     QCowL2Meta *old_alloc;
@@ -924,7 +936,7 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
         const uint64_t old_start = l2meta_cow_start(old_alloc);
         const uint64_t old_end = l2meta_cow_end(old_alloc);
 
-        if (end <= old_start || start >= old_end) {
+        if (end <= old_start || start >= old_end || old_alloc->piggybacked) {
             /* No intersection */
             continue;
         }
@@ -936,21 +948,95 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
             continue;
         }
 
-        /* Stop if an l2meta already exists. After yielding, it wouldn't
-         * be valid any more, so we'd have to clean up the old L2Metas
-         * and deal with requests depending on them before starting to
-         * gather new ones. Not worth the trouble. */
-        if (*m) {
+        /* offsets of the cluster we're intersecting in */
+        const uint64_t cluster_start = start_of_cluster(s, start);
+        const uint64_t cluster_end = cluster_start + s->cluster_size;
+
+        const uint64_t old_data_start = old_start
+            + old_alloc->cow_start.nb_bytes;
+        const uint64_t old_data_end = old_alloc->offset
+            + old_alloc->cow_end.offset;
+
+        const bool conflict_in_data_area =
+            end > old_data_start && start < old_data_end;
+        const bool conflict_in_old_cow_start =
+            /* 1). new write request area is before the old */
+            start < old_data_start
+            && /* 2). old request did not allow writing in its cow area */
+            !old_alloc->cow_start.reduced;
+        const bool conflict_in_old_cow_end =
+            /* 1). new write request area is after the old */
+            start > old_data_start
+            && /* 2). old request did not allow writing in its cow area */
+            !old_alloc->cow_end.reduced;
+
+        if (conflict_in_data_area ||
+            conflict_in_old_cow_start || conflict_in_old_cow_end) {
+
+            /* Stop if an l2meta already exists. After yielding, it wouldn't
+             * be valid any more, so we'd have to clean up the old L2Metas
+             * and deal with requests depending on them before starting to
+             * gather new ones. Not worth the trouble. */
+            if (*m) {
+                /* start must be cluster aligned at this point */
+                assert(start == cluster_start);
+                *cur_bytes = 0;
+                return 0;
+            }
+
+            /* Wait for the dependency to complete. We need to recheck
+             * the free/allocated clusters when we continue. */
+            qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
+            return -EAGAIN;
+        }
+
+        /* allocations do conflict, but the competitor kindly allowed us
+         * to write concurrently (our data area only, not the whole cluster!)
+         * Inter alia, this means we must not touch the COW areas */
+
+        if (*host_offset) {
             /* start must be cluster aligned at this point */
-            assert(start == start_of_cluster(s, start));
-            *cur_bytes = 0;
-            return 0;
+            assert(start == cluster_start);
+            if ((old_alloc->alloc_offset + (start - old_start))
+                != *host_offset) {
+                /* can't extend contiguous allocation */
+                *cur_bytes = 0;
+                return 0;
+            }
         }
 
-        /* Wait for the dependency to complete. We need to recheck
-         * the free/allocated clusters when we continue. */
-        qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
-        return -EAGAIN;
+        QCowL2Meta *old_m = *m;
+        *m = g_malloc0(sizeof(**m));
+
+        **m = (QCowL2Meta) {
+            .next           = old_m,
+
+            .alloc_offset   = old_alloc->alloc_offset
+                              + (cluster_start - old_start),
+            .offset         = old_alloc->offset
+                              + (cluster_start - old_start),
+            .nb_clusters    = 1,
+            .piggybacked    = true,
+            .clusters_are_trailing = false,
+
+            /* reduced COW areas. see above */
+            .cow_start = {
+                .offset   = 0,
+                .nb_bytes = start - cluster_start,
+                .reduced  = true,
+            },
+            .cow_end = {
+                .offset   = MIN(end - cluster_start, s->cluster_size),
+                .nb_bytes = end < cluster_end ? cluster_end - end : 0,
+                .reduced  = true,
+            },
+        };
+        qemu_co_queue_init(&(*m)->dependent_requests);
+        QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
+
+        *host_offset = old_alloc->alloc_offset + (start - old_start);
+        *cur_bytes = MIN(*cur_bytes, cluster_end - start);
+        return 1;
     }
 
     /* Make sure that existing clusters and new allocations are only used up to
@@ -1264,6 +1350,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
         .alloc_offset   = alloc_cluster_offset,
         .offset         = start_of_cluster(s, guest_offset),
         .nb_clusters    = nb_clusters,
+        .piggybacked    = false,
         .clusters_are_trailing = alloc_cluster_offset >= old_data_end,
 
         .keep_old_clusters  = keep_old_clusters,
@@ -1364,13 +1451,12 @@ again:
          *         for contiguous clusters (the situation could have changed
          *         while we were sleeping)
          *
-         *      c) TODO: Request starts in the same cluster as the in-flight
-         *         allocation ends. Shorten the COW of the in-fight allocation,
-         *         set cluster_offset to write to the same cluster and set up
-         *         the right synchronisation between the in-flight request and
-         *         the new one.
+         *      c) Overlap with another request's writeable COW area. Use
+         *         the stolen offset (and let the original request update L2
+         *         when it pleases)
+         *
          */
-        ret = handle_dependencies(bs, start, &cur_bytes, m);
+        ret = handle_dependencies(bs, start, &cluster_offset, &cur_bytes, m);
         if (ret == -EAGAIN) {
             /* Currently handle_dependencies() doesn't yield if we already had
              * an allocation. If it did, we would have to clean up the L2Meta
@@ -1379,6 +1465,8 @@ again:
             goto again;
         } else if (ret < 0) {
             return ret;
+        } else if (ret) {
+            continue;
         } else if (cur_bytes == 0) {
             break;
         } else {
@@ -1968,3 +2056,36 @@ void qcow2_update_data_end(BlockDriverState *bs, uint64_t off)
         s->data_end = off;
     }
 }
+
+/*
+ * For each @m, wait for its dependency request to finish and check for its
+ * success, i.e. that L2 table is updated as expected.
+ */
+int qcow2_wait_l2table_update(BlockDriverState *bs, const QCowL2Meta *m)
+{
+    BDRVQcow2State *s = bs->opaque;
+    QCowL2Meta *old_alloc;
+    uint64_t alloc_offset;
+    unsigned int bytes;
+    int ret;
+
+    for (; m != NULL; m = m->next) {
+        assert(m->piggybacked);
+        QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight) {
+            uint64_t a_off;
+            a_off = old_alloc->alloc_offset + (m->offset - old_alloc->offset);
+            if (!old_alloc->piggybacked && m->offset >= old_alloc->offset &&
+                a_off == m->alloc_offset) {
+
+                qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
+                break;
+            }
+        }
+        ret = qcow2_get_cluster_offset(bs, m->offset, &bytes, &alloc_offset);
+        if (ret != QCOW2_CLUSTER_NORMAL ||
+            alloc_offset != m->alloc_offset) {
+            return -1;
+        }
+    }
+    return 0;
+}
diff --git a/block/qcow2.c b/block/qcow2.c
index 92d0af6..f02c5e6 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1627,6 +1627,8 @@ fail:
 
 static void handle_cow_reduce(BlockDriverState *bs, QCowL2Meta *m)
 {
+    bool trimmed = false;
+
     if (bs->encrypted) {
         return;
     }
@@ -1635,12 +1637,19 @@ static void handle_cow_reduce(BlockDriverState *bs, QCowL2Meta *m)
                         (m->offset + m->cow_start.offset) >> BDRV_SECTOR_BITS,
                         m->cow_start.nb_bytes >> BDRV_SECTOR_BITS)) {
         m->cow_start.reduced = true;
+        trimmed = true;
     }
     if (!m->cow_end.reduced && m->cow_end.nb_bytes != 0 &&
         is_zero_sectors(bs,
                         (m->offset + m->cow_end.offset) >> BDRV_SECTOR_BITS,
                         m->cow_end.nb_bytes >> BDRV_SECTOR_BITS)) {
         m->cow_end.reduced = true;
+        trimmed = true;
+    }
+    /* The request is trimmed. Let's try to start dependent
+       ones, may be we will be lucky */
+    if (trimmed) {
+        qemu_co_queue_restart_all(&m->dependent_requests);
     }
 }
 
@@ -1782,6 +1791,10 @@ static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
     for (m = l2meta; m != NULL; m = m->next) {
         uint64_t bytes = m->nb_clusters << s->cluster_bits;
 
+        if (m->piggybacked) {
+            continue;
+        }
+
         if (s->prealloc_size != 0 && handle_prealloc(bs, m)) {
             handle_cow_reduce(bs, m);
             continue;
@@ -1903,9 +1916,18 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
         while (l2meta != NULL) {
             QCowL2Meta *next;
 
-            ret = qcow2_alloc_cluster_link_l2(bs, l2meta);
-            if (ret < 0) {
-                goto fail;
+            if (!l2meta->piggybacked) {
+                ret = qcow2_alloc_cluster_link_l2(bs, l2meta);
+                if (ret < 0) {
+                    goto fail;
+                }
+            } else {
+                ret = qcow2_wait_l2table_update(bs, l2meta);
+                if (ret < 0) {
+                    /* dependency request failed, return general EIO */
+                    ret = -EIO;
+                    goto fail;
+                }
             }
 
             /* Take the request off the list of running requests */
diff --git a/block/qcow2.h b/block/qcow2.h
index 2fd8510..5947045 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -310,7 +310,8 @@ typedef struct Qcow2COWRegion {
     /** Number of bytes to copy */
     int         nb_bytes;
 
-    /** The region is filled with zeroes and does not require COW
+    /** The region does not require COW
+     *  (either filled with zeroes or busy with other request)
      */
     bool        reduced;
 } Qcow2COWRegion;
@@ -338,6 +339,13 @@ typedef struct QCowL2Meta
     bool clusters_are_trailing;
 
     /**
+     * True if the described clusters are being allocated by
+     * the other concurrent request; so this one must not actually update L2
+     * or COW but only write its data
+     */
+    bool piggybacked;
+
+    /**
      * Requests that overlap with this allocation and wait to be restarted
      * when the allocating request has completed.
      */
@@ -575,6 +583,8 @@ int qcow2_expand_zero_clusters(BlockDriverState *bs,
                                BlockDriverAmendStatusCB *status_cb,
                                void *cb_opaque);
 
+int qcow2_wait_l2table_update(BlockDriverState *bs, const QCowL2Meta *m);
+
 /* qcow2-snapshot.c functions */
 int qcow2_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info);
 int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH v2 15/15] iotest 046: test simultaneous cluster write error case
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (13 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 14/15] qcow2: allow concurrent unaligned writes to the same clusters Anton Nefedov
@ 2017-06-01 15:14 ` Anton Nefedov
  2017-06-01 18:41 ` [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements John Snow
  15 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-01 15:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: den, kwolf, mreitz, eblake, Anton Nefedov

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 tests/qemu-iotests/046     | 38 +++++++++++++++++++++++++++++++++++++-
 tests/qemu-iotests/046.out | 23 +++++++++++++++++++++++
 2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/046 b/tests/qemu-iotests/046
index f2ebecf..c210b55 100755
--- a/tests/qemu-iotests/046
+++ b/tests/qemu-iotests/046
@@ -29,7 +29,8 @@ status=1	# failure is the default!
 
 _cleanup()
 {
-	_cleanup_test_img
+    _cleanup_test_img
+    rm "$TEST_DIR/blkdebug.conf"
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
@@ -188,6 +189,37 @@ overlay_io | $QEMU_IO blkdebug::"$TEST_IMG" | _filter_qemu_io |\
 	sed -e 's/bytes at offset [0-9]*/bytes at offset XXX/g'
 
 echo
+echo "== Concurrency error case =="
+
+# 1. 1st request allocated the cluster, stop before it writes and updates L2
+# 2. 2nd request starts at the same cluster must complete write and start
+#    waiting for the 1st to update L2
+# 3. Resume the 1st request to make it fail (injected error)
+# 4. 2nd request must wake and fail as well
+#    1 cluster will end up leaked
+cat > "$TEST_DIR/blkdebug.conf" <<EOF
+[inject-error]
+event = "write_aio"
+sector = "0xb00"
+EOF
+function error_io()
+{
+cat <<EOF
+discard 0x120000 0x20000
+break write_aio A
+aio_write -P 180 0x120000 0x200
+wait_break A
+aio_write -P 181 0x120200 0x10000
+resume A
+EOF
+}
+
+error_io | $QEMU_IO "blkdebug:$TEST_DIR/blkdebug.conf:$TEST_IMG" | _filter_qemu_io |\
+	sed -e 's/bytes at offset [0-9]*/bytes at offset XXX/g'
+
+_check_test_img -r leaks
+
+echo
 echo "== Verify image content =="
 
 function verify_io()
@@ -259,6 +291,10 @@ function verify_io()
     # Undefined content for 0x10c000 0x8000
     echo read -P 160 0x114000 0x8000
     echo read -P 17  0x11c000 0x4000
+
+    echo read -P ${discarded:-18} 0x120000 0x10000
+    echo read -P 181 0x130000 0x200
+    echo read -P ${discarded:-19} 0x130200 0xfe00
 }
 
 verify_io | $QEMU_IO "$TEST_IMG" | _filter_qemu_io
diff --git a/tests/qemu-iotests/046.out b/tests/qemu-iotests/046.out
index ca2c740..a065102 100644
--- a/tests/qemu-iotests/046.out
+++ b/tests/qemu-iotests/046.out
@@ -140,6 +140,23 @@ wrote 32768/32768 bytes at offset XXX
 wrote 98304/98304 bytes at offset XXX
 96 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+== Concurrency error case ==
+discard 131072/131072 bytes at offset XXX
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Suspended request 'A'
+blkdebug: Resuming request 'A'
+aio_write failed: Input/output error
+aio_write failed: Input/output error
+Leaked cluster 22 refcount=1 reference=0
+Repairing cluster 22 refcount=1 reference=0
+The following inconsistencies were found and repaired:
+
+    1 leaked clusters
+    0 corruptions
+
+Double checking the fixed image now...
+No errors were found on the image.
+
 == Verify image content ==
 read 65536/65536 bytes at offset 0
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
@@ -235,5 +252,11 @@ read 32768/32768 bytes at offset 1130496
 32 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 read 16384/16384 bytes at offset 1163264
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 1179648
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 1245184
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65024/65024 bytes at offset 1245696
+63.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 No errors were found on the image.
 *** done
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements
  2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
                   ` (14 preceding siblings ...)
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 15/15] iotest 046: test simultaneous cluster write error case Anton Nefedov
@ 2017-06-01 18:41 ` John Snow
  15 siblings, 0 replies; 26+ messages in thread
From: John Snow @ 2017-06-01 18:41 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: kwolf, den, mreitz, Qemu-block

Missing qemu-block@, CCing.

On 06/01/2017 11:14 AM, Anton Nefedov wrote:
> Changes in v2:
>   - introduce new BDRV flag for write_zeroes()
>       instead of using driver callback directly.
>     Skipped introducing new functions like bdrv_co_pallocate() for now:
>       1. it seems ok to keep calling this write_zeroes() as zeroes
>       are expected;
>       2. most of the code can be reused now anyway, so changes to
>       write_zeroes() path are not significant
>       3. write_zeroes() alignment and max-request limits can also be reused
> 
>     As a possible alternative we can have bdrv_co_pallocate() which can
>     switch to pwrite_zeroes(,flags|=BDRV_REQ_ALLOCATE) early.
> 
> ========
> 
> This pull request is to address a few performance problems of qcow2 format:
> 
>   1. non cluster-aligned write requests (to unallocated clusters) explicitly
>     pad data with zeroes if there is no backing data. This can be avoided
>     and the whole clusters are preallocated and zeroed in a single
>     efficient write_zeroes() operation, also providing better host file
>     continuity
> 
>   2. moreover, efficient write_zeroes() operation can be used to preallocate
>     space megabytes ahead which gives noticeable improvement on some storage
>     types (e.g. distributed storages where space allocation operation is
>     expensive)
> 
>   3. preallocating/zeroing the clusters in advance makes possible to enable
>     simultaneous writes to the same unallocated cluster, which is beneficial
>     for parallel sequential write operations which are not cluster-aligned
> 
> Performance test results are added to commit messages (see patch 3, 12)
> 
> Anton Nefedov (11):
>   block: introduce BDRV_REQ_ALLOCATE flag
>   file-posix: support BDRV_REQ_ALLOCATE
>   blkdebug: support BDRV_REQ_ALLOCATE
>   qcow2: do not COW the empty areas
>   qcow2: set inactive flag
>   qcow2: handle_prealloc(): find out if area zeroed by earlier
>     preallocation
>   qcow2: fix misleading comment about L2 linking
>   qcow2-cluster: slightly refactor handle_dependencies()
>   qcow2-cluster: make handle_dependencies() logic easier to follow
>   qcow2: allow concurrent unaligned writes to the same clusters
>   iotest 046: test simultaneous cluster write error case
> 
> Denis V. Lunev (3):
>   qcow2: alloc space for COW in one chunk
>   qcow2: preallocation at image expand
>   qcow2: truncate preallocated space
> 
> Pavel Butsykin (1):
>   qcow2: check space leak at the end of the image
> 
>  block/blkdebug.c                   |   3 +-
>  block/file-posix.c                 |   9 +-
>  block/io.c                         |  19 ++-
>  block/qcow2-cache.c                |   3 +
>  block/qcow2-cluster.c              | 218 +++++++++++++++++++++++------
>  block/qcow2-refcount.c             |  21 +++
>  block/qcow2.c                      | 273 ++++++++++++++++++++++++++++++++++++-
>  block/qcow2.h                      |  26 ++++
>  block/trace-events                 |   1 +
>  include/block/block.h              |   6 +-
>  tests/qemu-iotests/026.out         | 104 ++++++++++----
>  tests/qemu-iotests/026.out.nocache | 104 ++++++++++----
>  tests/qemu-iotests/029.out         |   5 +-
>  tests/qemu-iotests/046             |  38 +++++-
>  tests/qemu-iotests/046.out         |  23 ++++
>  tests/qemu-iotests/060             |   2 +-
>  tests/qemu-iotests/060.out         |  13 +-
>  tests/qemu-iotests/061.out         |   5 +-
>  tests/qemu-iotests/066             |   2 +-
>  tests/qemu-iotests/066.out         |   9 +-
>  tests/qemu-iotests/098.out         |   7 +-
>  tests/qemu-iotests/108.out         |   5 +-
>  tests/qemu-iotests/112.out         |   5 +-
>  23 files changed, 789 insertions(+), 112 deletions(-)
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 01/15] block: introduce BDRV_REQ_ALLOCATE flag
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 01/15] block: introduce BDRV_REQ_ALLOCATE flag Anton Nefedov
@ 2017-06-01 19:07   ` Eric Blake
  2017-06-02 13:08     ` Anton Nefedov
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Blake @ 2017-06-01 19:07 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: den, kwolf, mreitz

[-- Attachment #1: Type: text/plain, Size: 6298 bytes --]

On 06/01/2017 10:14 AM, Anton Nefedov wrote:
> The flag is supposed to indicate that the region of the disk image has
> to be sufficiently allocated so it reads as zeroes. The call with the flag
> set has to return -ENOTSUP if allocation cannot be done efficiently
> (i.e. without falling back to writing actual buffers)
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>  block/io.c            | 19 ++++++++++++++++---
>  block/trace-events    |  1 +
>  include/block/block.h |  6 +++++-
>  3 files changed, 22 insertions(+), 4 deletions(-)

You may want to 'git config diff.orderFile /path/to/file' (with a
suitably populated file) so that .h files come first in your diffs, as
that can aid reviewers.  At one point, there was a thread about adding
such a file to qemu.git proper for everyone to share, although it seems
to have stalled.

> 
> diff --git a/block/io.c b/block/io.c
> index ed31810..d47efa9 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1272,7 +1272,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>              assert(!bs->supported_zero_flags);
>          }
>  
> -        if (ret == -ENOTSUP) {
> +        if (ret == -ENOTSUP && !(flags & BDRV_REQ_ALLOCATE)) {

I'd feel MUCH better if you first fixed the conditional just above this
point to ensure that if the caller requests BDRV_REQ_ALLOCATE that we do
not call bdrv->bdrv_co_pwrite_zeroes() unless bs->supported_zero_flags
also mentions this bit.

Remember, the existing semantics of .bdrv_co_pwrite_zeroes() merely
state that we must return -ENOTSUP unless we can guarantee that we read
back as zeroes, but puts no timing constraints on it.  A driver that has
not been retrofitted to understand the BDRV_REQ_ALLOCATE flag will
therefore risk taking too long.  Using bs->supported_zero_flags as your
gate is what will let you avoid calling into a driver that has not been
audited for fitting the new contract.

>              /* Fall back to bounce buffer if write zeroes is unsupported */
>              BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>  
> @@ -1355,8 +1355,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>  
>      if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
> -        !(flags & BDRV_REQ_ZERO_WRITE) && drv->bdrv_co_pwrite_zeroes &&
> -        qemu_iovec_is_zero(qiov)) {
> +        !(flags & BDRV_REQ_ZERO_WRITE) && !(flags & BDRV_REQ_ALLOCATE) &&
> +        drv->bdrv_co_pwrite_zeroes && qemu_iovec_is_zero(qiov)) {
>          flags |= BDRV_REQ_ZERO_WRITE;
>          if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
>              flags |= BDRV_REQ_MAY_UNMAP;
> @@ -1436,6 +1436,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
>  
>      assert(flags & BDRV_REQ_ZERO_WRITE);
>      if (head_padding_bytes || tail_padding_bytes) {
> +        if (flags & BDRV_REQ_ALLOCATE) {
> +            return -ENOTSUP;
> +        }

Can we assert that BDRV_REQ_ALLOCATE will only be supplied by a caller
that is already using aligned values?  Or is that too strict?

>          buf = qemu_blockalign(bs, align);
>          iov = (struct iovec) {
>              .iov_base   = buf,
> @@ -1534,6 +1537,11 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
>          return ret;
>      }
>  
> +    if (qiov && flags & BDRV_REQ_ALLOCATE) {
> +        /* allocation request with qiov provided doesn't make much sense */
> +        return -ENOTSUP;

Should this be an assertion (bug in the program for mixing things that
don't make sense) rather than just a runtime error return?

> +    }
> +
>      bdrv_inc_in_flight(bs);
>      /*
>       * Align write if necessary by performing a read-modify-write cycle.
> @@ -1665,6 +1673,11 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
>  {
>      trace_bdrv_co_pwrite_zeroes(child->bs, offset, count, flags);
>  
> +    if (flags & BDRV_REQ_MAY_UNMAP && flags & BDRV_REQ_ALLOCATE) {
> +        /* nonsense */
> +        return -ENOTSUP;
> +    }

Ditto.

> +
>      if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
>          flags &= ~BDRV_REQ_MAY_UNMAP;
>      }
> diff --git a/block/trace-events b/block/trace-events
> index 9a71c7f..a15c2cc 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -15,6 +15,7 @@ bdrv_aio_writev(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs
>  bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
>  bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
>  bdrv_co_pwrite_zeroes(void *bs, int64_t offset, int count, int flags) "bs %p offset %"PRId64" count %d flags %#x"
> +bdrv_co_allocate(void *bs, int64_t offset, int count) "bs %p offset %"PRId64" count %d"
>  bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t cluster_offset, unsigned int cluster_bytes) "bs %p offset %"PRId64" bytes %u cluster_offset %"PRId64" cluster_bytes %u"
>  
>  # block/stream.c
> diff --git a/include/block/block.h b/include/block/block.h
> index 9b355e9..53a357c 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -65,9 +65,13 @@ typedef enum {
>      BDRV_REQ_NO_SERIALISING     = 0x8,
>      BDRV_REQ_FUA                = 0x10,
>      BDRV_REQ_WRITE_COMPRESSED   = 0x20,
> +    /* BDRV_REQ_ALLOCATE is used to indicate that the driver is to
> +     * efficiently allocate the space so it reads as zeroes or return an error
> +     */
> +    BDRV_REQ_ALLOCATE           = 0x40,

Doesn't match how the other flags are documented, but any documentation
is better than none.

Missing mention of the new flag in the documentation for
supported_zero_flags.

>  
>      /* Mask of valid flags */
> -    BDRV_REQ_MASK               = 0x3f,
> +    BDRV_REQ_MASK               = 0x7f,
>  } BdrvRequestFlags;
>  
>  typedef struct BlockSizes {
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/15] file-posix: support BDRV_REQ_ALLOCATE
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 02/15] file-posix: support BDRV_REQ_ALLOCATE Anton Nefedov
@ 2017-06-01 19:49   ` Eric Blake
  2017-06-01 19:54     ` Eric Blake
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Blake @ 2017-06-01 19:49 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: den, kwolf, mreitz

[-- Attachment #1: Type: text/plain, Size: 701 bytes --]

On 06/01/2017 10:14 AM, Anton Nefedov wrote:
> Current write_zeroes implementation is good enough to satisfy this flag too
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>  block/file-posix.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)

Are we sure that fallocate() is always fast, or are there some file
systems where it is no faster than manually writing zeroes?  I'm worried
that blindly claiming BDRV_REQ_ALLOCATE may fail if we encounter a libc
or kernel-based fallback that takes a slow patch on our behalf.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 03/15] blkdebug: support BDRV_REQ_ALLOCATE
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 03/15] blkdebug: " Anton Nefedov
@ 2017-06-01 19:50   ` Eric Blake
  2017-06-02 13:13     ` Anton Nefedov
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Blake @ 2017-06-01 19:50 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: den, kwolf, mreitz

[-- Attachment #1: Type: text/plain, Size: 1074 bytes --]

On 06/01/2017 10:14 AM, Anton Nefedov wrote:
> Support the flag if the underlying BDS supports it
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>  block/blkdebug.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Shouldn't other passthrough drivers (like raw-format.c) make this change
as well?

> 
> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index a5196e8..8b1401b 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c
> @@ -415,7 +415,8 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
>  
>      bs->supported_write_flags = BDRV_REQ_FUA &
>          bs->file->bs->supported_write_flags;
> -    bs->supported_zero_flags = (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
> +    bs->supported_zero_flags =
> +        (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
>          bs->file->bs->supported_zero_flags;
>      ret = -EINVAL;
>  
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/15] file-posix: support BDRV_REQ_ALLOCATE
  2017-06-01 19:49   ` Eric Blake
@ 2017-06-01 19:54     ` Eric Blake
  2017-06-02 14:34       ` Anton Nefedov
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Blake @ 2017-06-01 19:54 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: kwolf, den, mreitz

[-- Attachment #1: Type: text/plain, Size: 955 bytes --]

On 06/01/2017 02:49 PM, Eric Blake wrote:
> On 06/01/2017 10:14 AM, Anton Nefedov wrote:
>> Current write_zeroes implementation is good enough to satisfy this flag too
>>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> ---
>>  block/file-posix.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> Are we sure that fallocate() is always fast, or are there some file
> systems where it is no faster than manually writing zeroes?  I'm worried
> that blindly claiming BDRV_REQ_ALLOCATE may fail if we encounter a libc

not so much fail as in "break the guest", but fail as in "take far more
time than we were expecting, pessimising our behavior to worse than if
we had not tried the allocation at all"

> or kernel-based fallback that takes a slow patch on our behalf.
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 01/15] block: introduce BDRV_REQ_ALLOCATE flag
  2017-06-01 19:07   ` Eric Blake
@ 2017-06-02 13:08     ` Anton Nefedov
  0 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-02 13:08 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: den, kwolf, mreitz, qemu-block



On 06/01/2017 10:07 PM, Eric Blake wrote:
> On 06/01/2017 10:14 AM, Anton Nefedov wrote:
>> The flag is supposed to indicate that the region of the disk image has
>> to be sufficiently allocated so it reads as zeroes. The call with the flag
>> set has to return -ENOTSUP if allocation cannot be done efficiently
>> (i.e. without falling back to writing actual buffers)
>>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> ---
>>   block/io.c            | 19 ++++++++++++++++---
>>   block/trace-events    |  1 +
>>   include/block/block.h |  6 +++++-
>>   3 files changed, 22 insertions(+), 4 deletions(-)
> 
> You may want to 'git config diff.orderFile /path/to/file' (with a
> suitably populated file) so that .h files come first in your diffs, as
> that can aid reviewers.  At one point, there was a thread about adding
> such a file to qemu.git proper for everyone to share, although it seems
> to have stalled.
> 

Thanks, will do

>>
>> diff --git a/block/io.c b/block/io.c
>> index ed31810..d47efa9 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -1272,7 +1272,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>               assert(!bs->supported_zero_flags);
>>           }
>>   
>> -        if (ret == -ENOTSUP) {
>> +        if (ret == -ENOTSUP && !(flags & BDRV_REQ_ALLOCATE)) {
> 
> I'd feel MUCH better if you first fixed the conditional just above this
> point to ensure that if the caller requests BDRV_REQ_ALLOCATE that we do
> not call bdrv->bdrv_co_pwrite_zeroes() unless bs->supported_zero_flags
> also mentions this bit.
> 
> Remember, the existing semantics of .bdrv_co_pwrite_zeroes() merely
> state that we must return -ENOTSUP unless we can guarantee that we read
> back as zeroes, but puts no timing constraints on it.  A driver that has
> not been retrofitted to understand the BDRV_REQ_ALLOCATE flag will
> therefore risk taking too long.  Using bs->supported_zero_flags as your
> gate is what will let you avoid calling into a driver that has not been
> audited for fitting the new contract.
> 

Absolutely; I have even added that check but must have lost that at some
point.
Meant to add that much earlier though, to bdrv_co_pwrite_zeroes()

>>               /* Fall back to bounce buffer if write zeroes is unsupported */
>>               BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>>   
>> @@ -1355,8 +1355,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
>>       ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>>   
>>       if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
>> -        !(flags & BDRV_REQ_ZERO_WRITE) && drv->bdrv_co_pwrite_zeroes &&
>> -        qemu_iovec_is_zero(qiov)) {
>> +        !(flags & BDRV_REQ_ZERO_WRITE) && !(flags & BDRV_REQ_ALLOCATE) &&
>> +        drv->bdrv_co_pwrite_zeroes && qemu_iovec_is_zero(qiov)) {
>>           flags |= BDRV_REQ_ZERO_WRITE;
>>           if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
>>               flags |= BDRV_REQ_MAY_UNMAP;
>> @@ -1436,6 +1436,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
>>   
>>       assert(flags & BDRV_REQ_ZERO_WRITE);
>>       if (head_padding_bytes || tail_padding_bytes) {
>> +        if (flags & BDRV_REQ_ALLOCATE) {
>> +            return -ENOTSUP;
>> +        }
> 
> Can we assert that BDRV_REQ_ALLOCATE will only be supplied by a caller
> that is already using aligned values?  Or is that too strict?
> 

as I understand the top driver should not care much about the child
driver alignment preferences? that's what the common bdrv_* interface is
there for

>>           buf = qemu_blockalign(bs, align);
>>           iov = (struct iovec) {
>>               .iov_base   = buf,
>> @@ -1534,6 +1537,11 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
>>           return ret;
>>       }
>>   
>> +    if (qiov && flags & BDRV_REQ_ALLOCATE) {
>> +        /* allocation request with qiov provided doesn't make much sense */
>> +        return -ENOTSUP;
> 
> Should this be an assertion (bug in the program for mixing things that
> don't make sense) rather than just a runtime error return?
> 

incline to agree here

>> +    }
>> +
>>       bdrv_inc_in_flight(bs);
>>       /*
>>        * Align write if necessary by performing a read-modify-write cycle.
>> @@ -1665,6 +1673,11 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
>>   {
>>       trace_bdrv_co_pwrite_zeroes(child->bs, offset, count, flags);
>>   
>> +    if (flags & BDRV_REQ_MAY_UNMAP && flags & BDRV_REQ_ALLOCATE) {
>> +        /* nonsense */
>> +        return -ENOTSUP;
>> +    }
> 
> Ditto.
> 

yep

>> +
>>       if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
>>           flags &= ~BDRV_REQ_MAY_UNMAP;
>>       }
>> diff --git a/block/trace-events b/block/trace-events
>> index 9a71c7f..a15c2cc 100644
>> --- a/block/trace-events
>> +++ b/block/trace-events
>> @@ -15,6 +15,7 @@ bdrv_aio_writev(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs
>>   bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
>>   bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
>>   bdrv_co_pwrite_zeroes(void *bs, int64_t offset, int count, int flags) "bs %p offset %"PRId64" count %d flags %#x"
>> +bdrv_co_allocate(void *bs, int64_t offset, int count) "bs %p offset %"PRId64" count %d"
>>   bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t cluster_offset, unsigned int cluster_bytes) "bs %p offset %"PRId64" bytes %u cluster_offset %"PRId64" cluster_bytes %u"
>>   
>>   # block/stream.c
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 9b355e9..53a357c 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
>> @@ -65,9 +65,13 @@ typedef enum {
>>       BDRV_REQ_NO_SERIALISING     = 0x8,
>>       BDRV_REQ_FUA                = 0x10,
>>       BDRV_REQ_WRITE_COMPRESSED   = 0x20,
>> +    /* BDRV_REQ_ALLOCATE is used to indicate that the driver is to
>> +     * efficiently allocate the space so it reads as zeroes or return an error
>> +     */
>> +    BDRV_REQ_ALLOCATE           = 0x40,
> 
> Doesn't match how the other flags are documented, but any documentation
> is better than none.
> 

Will fix

> Missing mention of the new flag in the documentation for
> supported_zero_flags.
> 

Done.

>>   
>>       /* Mask of valid flags */
>> -    BDRV_REQ_MASK               = 0x3f,
>> +    BDRV_REQ_MASK               = 0x7f,
>>   } BdrvRequestFlags;
>>   
>>   typedef struct BlockSizes {
>>
> 
/Anton

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 03/15] blkdebug: support BDRV_REQ_ALLOCATE
  2017-06-01 19:50   ` Eric Blake
@ 2017-06-02 13:13     ` Anton Nefedov
  0 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-02 13:13 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: den, kwolf, mreitz, qemu-block



On 06/01/2017 10:50 PM, Eric Blake wrote:
> On 06/01/2017 10:14 AM, Anton Nefedov wrote:
>> Support the flag if the underlying BDS supports it
>>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> ---
>>   block/blkdebug.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Shouldn't other passthrough drivers (like raw-format.c) make this change
> as well?
> 

Right.

Wonder why they even enumerate those instead of just

bs->supported_zero_flags = bs->file->bs->supported_zero_flags;

>>
>> diff --git a/block/blkdebug.c b/block/blkdebug.c
>> index a5196e8..8b1401b 100644
>> --- a/block/blkdebug.c
>> +++ b/block/blkdebug.c
>> @@ -415,7 +415,8 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
>>   
>>       bs->supported_write_flags = BDRV_REQ_FUA &
>>           bs->file->bs->supported_write_flags;
>> -    bs->supported_zero_flags = (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
>> +    bs->supported_zero_flags =
>> +        (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
>>           bs->file->bs->supported_zero_flags;
>>       ret = -EINVAL;
>>   
>>
> 

/Anton

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/15] file-posix: support BDRV_REQ_ALLOCATE
  2017-06-01 19:54     ` Eric Blake
@ 2017-06-02 14:34       ` Anton Nefedov
  0 siblings, 0 replies; 26+ messages in thread
From: Anton Nefedov @ 2017-06-02 14:34 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: kwolf, den, mreitz, qemu-block


On 06/01/2017 10:54 PM, Eric Blake wrote:
> On 06/01/2017 02:49 PM, Eric Blake wrote:
>> On 06/01/2017 10:14 AM, Anton Nefedov wrote:
>>> Current write_zeroes implementation is good enough to satisfy this flag too
>>>
>>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>>> ---
>>>   block/file-posix.c | 9 ++++++++-
>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> Are we sure that fallocate() is always fast, or are there some file
>> systems where it is no faster than manually writing zeroes?  I'm worried
>> that blindly claiming BDRV_REQ_ALLOCATE may fail if we encounter a libc
> 
> not so much fail as in "break the guest", but fail as in "take far more
> time than we were expecting, pessimising our behavior to worse than if
> we had not tried the allocation at all"
> 
>> or kernel-based fallback that takes a slow patch on our behalf.
>>
> 

I would expect such filesystems to not support fallocate.

Though I must admit I can't see anywhere in the documentation that it 
MUST be strictly faster than writing zeroes; it would look very strange
to me if there were a slowpath fallback somewhere past the libc.

/Anton

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/15] qcow2: alloc space for COW in one chunk
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 04/15] qcow2: alloc space for COW in one chunk Anton Nefedov
@ 2017-06-05 14:28   ` Eric Blake
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Blake @ 2017-06-05 14:28 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: den, kwolf, mreitz, Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 2913 bytes --]

On 06/01/2017 10:14 AM, Anton Nefedov wrote:
> From: "Denis V. Lunev" <den@openvz.org>
> 
> Currently each single write operation can result in 3 write operations

Maybe:

Currently each single guest write operation can result in up to 3 host
write operations

> if guest offsets are not cluster aligned. One write is performed for the
> real payload and two for COW-ed areas. Thus the data possibly lays
> non-contiguously on the host filesystem. This will reduce further
> sequential read performance significantly.
> 
> The patch allocates the space in the file with cluster granularity,
> ensuring
>   1. better host offset locality
>   2. less space allocation operations
>      (which can be expensive on distributed storage)
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>  block/qcow2.c | 24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index b3ba5da..cd5efba 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1575,6 +1575,24 @@ fail:
>      return ret;
>  }
>  
> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    QCowL2Meta *m;
> +
> +    for (m = l2meta; m != NULL; m = m->next) {
> +        uint64_t bytes = m->nb_clusters << s->cluster_bits;
> +
> +        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
> +            continue;
> +        }
> +
> +        /* try to alloc host space in one chunk for better locality */
> +        bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset, bytes,
> +                              BDRV_REQ_ALLOCATE);

Is it worth a trace point, to easily track when we are using
handle_alloc_space()?  Or do existing trace points in
bdrv_co_pwrite_zeroes already sufficiently serve the purpose?

> +    }
> +}
> +
>  static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>                                           uint64_t bytes, QEMUIOVector *qiov,
>                                           int flags)
> @@ -1656,8 +1674,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>          if (ret < 0) {
>              goto fail;
>          }
> -
>          qemu_co_mutex_unlock(&s->lock);
> +
> +        if (bs->file->bs->supported_zero_flags & BDRV_REQ_ALLOCATE) {
> +            handle_alloc_space(bs, l2meta);
> +        }
> +

This certainly looks better than v1!

>          BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
>          trace_qcow2_writev_data(qemu_coroutine_self(),
>                                  cluster_offset + offset_in_cluster);
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 05/15] qcow2: do not COW the empty areas
  2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 05/15] qcow2: do not COW the empty areas Anton Nefedov
@ 2017-06-05 14:40   ` Eric Blake
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Blake @ 2017-06-05 14:40 UTC (permalink / raw)
  To: Anton Nefedov, qemu-devel; +Cc: den, kwolf, mreitz, Denis V . Lunev

[-- Attachment #1: Type: text/plain, Size: 2401 bytes --]

On 06/01/2017 10:14 AM, Anton Nefedov wrote:
> If COW area of the newly allocated cluster is zeroes, there is no reason
> to write zero sectors in perform_cow() again now as whole clusters are
> zeroed out in single chunks by handle_alloc_space().
> 
> Introduce QCowL2Meta field "reduced", since the existing fields
> (offset and nb_bytes) still has to keep other write requests from
> simultaneous writing in the area
> 
> iotest 060:
> write to the discarded cluster does not trigger COW anymore.
> so, break on write_aio event instead, will work for the test
> (but write won't fail anymore, so update reference output)
> 
> iotest 066:
> cluster-alignment areas that were not really COWed are now detected
> as zeroes, hence the initial write has to be exactly the same size for
> the maps to match

> +++ b/block/qcow2.c
> @@ -64,6 +64,9 @@ typedef struct {
>  #define  QCOW2_EXT_MAGIC_BACKING_FORMAT 0xE2792ACA
>  #define  QCOW2_EXT_MAGIC_FEATURE_TABLE 0x6803f857
>  
> +static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
> +                            uint32_t count);
> +

I've already posted patches that redo this function to be byte-based;
I'm not sure which of our series will land first, but it is something to
keep in mind.

I'm also not a big fan of forward declarations of leaf static functions
(they are essential for mutually recursive functions, but when there is
no recursion, the better solution is to just hoist the implementation of
the function to occur where it is first needed, rather than using a
forward declaration).

> @@ -1588,8 +1611,13 @@ static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
>          }
>  
>          /* try to alloc host space in one chunk for better locality */
> -        bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset, bytes,
> -                              BDRV_REQ_ALLOCATE);
> +        ret = bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset, bytes,
> +                                    BDRV_REQ_ALLOCATE);
> +        if (ret != 0) {
> +            continue;
> +        }
> +
> +        handle_cow_reduce(bs, m);

Does this look any better as:

if (bdrv_co_pwrite_zeroes(...) == 0) {
    handle_cow_reduce(bs, m);
}

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-06-05 14:40 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-01 15:14 [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 01/15] block: introduce BDRV_REQ_ALLOCATE flag Anton Nefedov
2017-06-01 19:07   ` Eric Blake
2017-06-02 13:08     ` Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 02/15] file-posix: support BDRV_REQ_ALLOCATE Anton Nefedov
2017-06-01 19:49   ` Eric Blake
2017-06-01 19:54     ` Eric Blake
2017-06-02 14:34       ` Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 03/15] blkdebug: " Anton Nefedov
2017-06-01 19:50   ` Eric Blake
2017-06-02 13:13     ` Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 04/15] qcow2: alloc space for COW in one chunk Anton Nefedov
2017-06-05 14:28   ` Eric Blake
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 05/15] qcow2: do not COW the empty areas Anton Nefedov
2017-06-05 14:40   ` Eric Blake
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 06/15] qcow2: preallocation at image expand Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 07/15] qcow2: set inactive flag Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 08/15] qcow2: truncate preallocated space Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 09/15] qcow2: check space leak at the end of the image Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 10/15] qcow2: handle_prealloc(): find out if area zeroed by earlier preallocation Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 11/15] qcow2: fix misleading comment about L2 linking Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 12/15] qcow2-cluster: slightly refactor handle_dependencies() Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 13/15] qcow2-cluster: make handle_dependencies() logic easier to follow Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 14/15] qcow2: allow concurrent unaligned writes to the same clusters Anton Nefedov
2017-06-01 15:14 ` [Qemu-devel] [PATCH v2 15/15] iotest 046: test simultaneous cluster write error case Anton Nefedov
2017-06-01 18:41 ` [Qemu-devel] [PATCH v2 00/15] qcow2: space preallocation and COW improvements John Snow

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.