* [Qemu-devel] [PATCH v14 0/1] qcow2: cluster space preallocation
@ 2019-05-16 14:27 Anton Nefedov
  2019-05-16 14:27 ` [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
  2019-05-24 13:56 ` [Qemu-devel] [PATCH v14 0/1] qcow2: cluster space preallocation Max Reitz
  0 siblings, 2 replies; 9+ messages in thread
From: Anton Nefedov @ 2019-05-16 14:27 UTC (permalink / raw)
  To: qemu-block
  Cc: kwolf, vsementsov, berto, den, qemu-devel, mreitz, Anton Nefedov

..apparently v13 ended up in a weird base64 that would not easily git-am.
Resending.

----

hi,

this used to be a large 10-patch series, but now thanks to Kevin's
BDRV_REQ_NO_FALLBACK series
(http://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg06389.html)
the core block layer functionality is already in place: the existing flag
can be reused.

A few accompanying patches were also dropped from the series as those can
be processed separately.

So,

new in v13:
   - patch 1 (was patch 9) rebased to use s->data_file pointer
   - comments fixed/added
   - other patches dropped ;)

v12: http://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg02761.html
v11: http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg04342.html
v10: http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg00121.html

----

This series starts to improve a few performance aspects of the qcow2
format:

  1. non-cluster-aligned write requests (to unallocated clusters) explicitly
     pad the data with zeroes if there is no backing data.
     The resulting increase in the number of ops and the potential cluster
     fragmentation (in the host file) is already addressed by:
       ee22a9d qcow2: Merge the writing of the COW regions with the guest data
     However, in the case of zero COW regions, that can be avoided altogether:
     the whole cluster is preallocated and zeroed in a single efficient
     write_zeroes() operation.

  2. moreover, the efficient write_zeroes() operation can be used to
     preallocate space megabytes (a configurable number) ahead, which gives a
     noticeable improvement on some storage types (e.g. distributed storage)
     where the space allocation operation might be expensive.
     (Not included in this patchset since v6.)

  3. this will also make it possible to enable simultaneous writes to the same
     unallocated cluster after the space has been allocated and zeroed but
     before the first data is written and the cluster is linked to L2.
     (Not included in this patchset.)

Efficient write_zeroes usually implies that the blocks are not actually
written to but only reserved and marked as zeroed by the storage.
Existing bdrv_write_zeroes() falls back to writing zero buffers if
write_zeroes is not supported by the driver, so BDRV_REQ_NO_FALLBACK is used.
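
For illustration, the gist of the write-path change in the patch below can
be sketched as follows (a condensed editorial sketch, not a literal
excerpt; error handling and the per-request loop are omitted):

  /* only attempt this when the data file can reserve-and-zero the range
   * without writing buffers, and the image is not encrypted */
  if ((s->data_file->bs->supported_zero_flags & BDRV_REQ_NO_FALLBACK) &&
      !bs->encrypted && is_zero_cow(bs, m)) {
      /* zero the whole allocated cluster range in one efficient operation */
      ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
                                  m->nb_clusters * s->cluster_size,
                                  BDRV_REQ_NO_FALLBACK);
      if (ret == 0) {
          m->skip_cow = true;   /* no zero-buffer COW needed later */
      }
  }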

simple perf test:

  qemu-img create -f qcow2 test.img 4G && \
  qemu-img bench -c $((1024*1024)) -f qcow2 -n -s 4k -t none -w test.img

test results (seconds):

    +------+----------+----------+------+
    | file |  before  |  after   | gain |
    +------+----------+----------+------+
    | ssd  |   61.153 |   36.313 |  41% |
    | hdd  |  112.676 |  122.056 |  -8% |
    +------+----------+----------+------+
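
The gain column is the relative change in runtime:

  gain = (before - after) / before
  ssd:  (61.153 - 36.313) / 61.153   ~=  0.41  ->  41%
  hdd: (112.676 - 122.056) / 112.676 ~= -0.08  ->  -8%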

Anton Nefedov (1):
  qcow2: skip writing zero buffers to empty COW areas

 qapi/block-core.json       |  4 +-
 block/qcow2.h              |  6 +++
 block/qcow2-cluster.c      |  2 +-
 block/qcow2.c              | 93 +++++++++++++++++++++++++++++++++++++-
 block/trace-events         |  1 +
 tests/qemu-iotests/060     |  7 ++-
 tests/qemu-iotests/060.out |  5 +-
 7 files changed, 112 insertions(+), 6 deletions(-)

-- 
2.17.1




* [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas
  2019-05-16 14:27 [Qemu-devel] [PATCH v14 0/1] qcow2: cluster space preallocation Anton Nefedov
@ 2019-05-16 14:27 ` Anton Nefedov
  2019-05-16 14:51   ` Vladimir Sementsov-Ogievskiy
                     ` (2 more replies)
  2019-05-24 13:56 ` [Qemu-devel] [PATCH v14 0/1] qcow2: cluster space preallocation Max Reitz
  1 sibling, 3 replies; 9+ messages in thread
From: Anton Nefedov @ 2019-05-16 14:27 UTC (permalink / raw)
  To: qemu-block
  Cc: kwolf, vsementsov, berto, den, qemu-devel, mreitz, Anton Nefedov

If COW areas of the newly allocated clusters are zeroes on the backing
image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be
used on the whole cluster instead of writing explicit zero buffers later
in perform_cow().

iotest 060:
write to the discarded cluster does not trigger COW anymore.
Use a backing image instead.

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
---
 qapi/block-core.json       |  4 +-
 block/qcow2.h              |  6 +++
 block/qcow2-cluster.c      |  2 +-
 block/qcow2.c              | 93 +++++++++++++++++++++++++++++++++++++-
 block/trace-events         |  1 +
 tests/qemu-iotests/060     |  7 ++-
 tests/qemu-iotests/060.out |  5 +-
 7 files changed, 112 insertions(+), 6 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7ccbfff9d0..3e4042be7f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3215,6 +3215,8 @@
 #
 # @cor_write: a write due to copy-on-read (since 2.11)
 #
+# @cluster_alloc_space: an allocation of file space for a cluster (since 4.1)
+#
 # Since: 2.9
 ##
 { 'enum': 'BlkdebugEvent', 'prefix': 'BLKDBG',
@@ -3233,7 +3235,7 @@
             'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
             'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
             'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
-            'cor_write'] }
+            'cor_write', 'cluster_alloc_space'] }
 
 ##
 # @BlkdebugInjectErrorOptions:
diff --git a/block/qcow2.h b/block/qcow2.h
index e62508d1ce..07b18a733b 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -398,6 +398,12 @@ typedef struct QCowL2Meta
      */
     Qcow2COWRegion cow_end;
 
+    /*
+     * Indicates that COW regions are already handled and do not require
+     * any more processing.
+     */
+    bool skip_cow;
+
     /**
      * The I/O vector with the data from the actual guest write request.
      * If non-NULL, this is meant to be merged together with the data
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 974a4e8656..79d4651603 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -832,7 +832,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     assert(start->offset + start->nb_bytes <= end->offset);
     assert(!m->data_qiov || m->data_qiov->size == data_bytes);
 
-    if (start->nb_bytes == 0 && end->nb_bytes == 0) {
+    if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
         return 0;
     }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 8e024007db..e6b1293ddf 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2120,6 +2120,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
             continue;
         }
 
+        /* If COW regions are handled already, skip this too */
+        if (m->skip_cow) {
+            continue;
+        }
+
         /* The data (middle) region must be immediately after the
          * start region */
         if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
@@ -2145,6 +2150,80 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
     return false;
 }
 
+static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+    int64_t nr;
+    return !bytes ||
+        (!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
+}
+
+static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
+{
+    /*
+     * This check is designed for optimization shortcut so it must be
+     * efficient.
+     * Instead of is_zero(), use is_unallocated() as it is faster (but not
+     * as accurate and can result in false negatives).
+     */
+    return is_unallocated(bs, m->offset + m->cow_start.offset,
+                          m->cow_start.nb_bytes) &&
+           is_unallocated(bs, m->offset + m->cow_end.offset,
+                          m->cow_end.nb_bytes);
+}
+
+static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
+{
+    BDRVQcow2State *s = bs->opaque;
+    QCowL2Meta *m;
+
+    if (!(s->data_file->bs->supported_zero_flags & BDRV_REQ_NO_FALLBACK)) {
+        return 0;
+    }
+
+    if (bs->encrypted) {
+        return 0;
+    }
+
+    for (m = l2meta; m != NULL; m = m->next) {
+        int ret;
+
+        if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
+            continue;
+        }
+
+        if (!is_zero_cow(bs, m)) {
+            continue;
+        }
+
+        /*
+         * instead of writing zero COW buffers,
+         * efficiently zero out the whole clusters
+         */
+
+        ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
+                                            m->nb_clusters * s->cluster_size,
+                                            true);
+        if (ret < 0) {
+            return ret;
+        }
+
+        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
+        ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
+                                    m->nb_clusters * s->cluster_size,
+                                    BDRV_REQ_NO_FALLBACK);
+        if (ret < 0) {
+            if (ret != -ENOTSUP && ret != -EAGAIN) {
+                return ret;
+            }
+            continue;
+        }
+
+        trace_qcow2_skip_cow(qemu_coroutine_self(), m->offset, m->nb_clusters);
+        m->skip_cow = true;
+    }
+    return 0;
+}
+
 static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -2225,24 +2304,34 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
             goto fail;
         }
 
+        qemu_co_mutex_unlock(&s->lock);
+
+        /* Try to efficiently initialize the physical space with zeroes */
+        ret = handle_alloc_space(bs, l2meta);
+        if (ret < 0) {
+            qemu_co_mutex_lock(&s->lock);
+            goto fail;
+        }
+
         /* If we need to do COW, check if it's possible to merge the
          * writing of the guest data together with that of the COW regions.
          * If it's not possible (or not necessary) then write the
          * guest data now. */
         if (!merge_cow(offset, cur_bytes, &hd_qiov, l2meta)) {
-            qemu_co_mutex_unlock(&s->lock);
             BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
             trace_qcow2_writev_data(qemu_coroutine_self(),
                                     cluster_offset + offset_in_cluster);
             ret = bdrv_co_pwritev(s->data_file,
                                   cluster_offset + offset_in_cluster,
                                   cur_bytes, &hd_qiov, 0);
-            qemu_co_mutex_lock(&s->lock);
             if (ret < 0) {
+                qemu_co_mutex_lock(&s->lock);
                 goto fail;
             }
         }
 
+        qemu_co_mutex_lock(&s->lock);
+
         ret = qcow2_handle_l2meta(bs, &l2meta, true);
         if (ret) {
             goto fail;
diff --git a/block/trace-events b/block/trace-events
index 79ccd8d824..1e0653ce6d 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -68,6 +68,7 @@ qcow2_writev_done_part(void *co, int cur_bytes) "co %p cur_bytes %d"
 qcow2_writev_data(void *co, uint64_t offset) "co %p offset 0x%" PRIx64
 qcow2_pwrite_zeroes_start_req(void *co, int64_t offset, int count) "co %p offset 0x%" PRIx64 " count %d"
 qcow2_pwrite_zeroes(void *co, int64_t offset, int count) "co %p offset 0x%" PRIx64 " count %d"
+qcow2_skip_cow(void *co, uint64_t offset, int nb_clusters) "co %p offset 0x%" PRIx64 " nb_clusters %d"
 
 # qcow2-cluster.c
 qcow2_alloc_clusters_offset(void *co, uint64_t offset, int bytes) "co %p offset 0x%" PRIx64 " bytes %d"
diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index 89e911400c..b91d8321bb 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -150,10 +150,15 @@ $QEMU_IO -c "$OPEN_RO" -c "read -P 1 0 512" | _filter_qemu_io
 echo
 echo "=== Testing overlap while COW is in flight ==="
 echo
+BACKING_IMG=$TEST_IMG.base
+TEST_IMG=$BACKING_IMG _make_test_img 1G
+
+$QEMU_IO -c 'write 0k 64k' "$BACKING_IMG" | _filter_qemu_io
+
 # compat=0.10 is required in order to make the following discard actually
 # unallocate the sector rather than make it a zero sector - we want COW, after
 # all.
-IMGOPTS='compat=0.10' _make_test_img 1G
+IMGOPTS='compat=0.10' _make_test_img -b "$BACKING_IMG" 1G
 # Write two clusters, the second one enforces creation of an L2 table after
 # the first data cluster.
 $QEMU_IO -c 'write 0k 64k' -c 'write 512M 64k' "$TEST_IMG" | _filter_qemu_io
diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index e42bf8c5a9..0f6b0658a1 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -97,7 +97,10 @@ read 512/512 bytes at offset 0
 
 === Testing overlap while COW is in flight ===
 
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=1073741824
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 backing_file=TEST_DIR/t.IMGFMT.base
 wrote 65536/65536 bytes at offset 0
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 65536/65536 bytes at offset 536870912
-- 
2.17.1




* Re: [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas
  2019-05-16 14:27 ` [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
@ 2019-05-16 14:51   ` Vladimir Sementsov-Ogievskiy
  2019-05-22 11:33   ` Alberto Garcia
  2019-05-22 20:40   ` Max Reitz
  2 siblings, 0 replies; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-05-16 14:51 UTC (permalink / raw)
  To: Anton Nefedov, qemu-block; +Cc: kwolf, berto, Denis Lunev, qemu-devel, mreitz

16.05.2019 17:27, Anton Nefedov wrote:
> If COW areas of the newly allocated clusters are zeroes on the backing
> image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be
> used on the whole cluster instead of writing explicit zero buffers later
> in perform_cow().
> 
> iotest 060:
> write to the discarded cluster does not trigger COW anymore.
> Use a backing image instead.
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>



-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas
  2019-05-16 14:27 ` [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
  2019-05-16 14:51   ` Vladimir Sementsov-Ogievskiy
@ 2019-05-22 11:33   ` Alberto Garcia
  2019-05-22 20:40   ` Max Reitz
  2 siblings, 0 replies; 9+ messages in thread
From: Alberto Garcia @ 2019-05-22 11:33 UTC (permalink / raw)
  To: Anton Nefedov, qemu-block
  Cc: kwolf, vsementsov, den, qemu-devel, mreitz, Anton Nefedov

On Thu 16 May 2019 04:27:49 PM CEST, Anton Nefedov wrote:
> If COW areas of the newly allocated clusters are zeroes on the backing
> image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be
> used on the whole cluster instead of writing explicit zero buffers later
> in perform_cow().
>
> iotest 060:
> write to the discarded cluster does not trigger COW anymore.
> Use a backing image instead.
>
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto



* Re: [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas
  2019-05-16 14:27 ` [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
  2019-05-16 14:51   ` Vladimir Sementsov-Ogievskiy
  2019-05-22 11:33   ` Alberto Garcia
@ 2019-05-22 20:40   ` Max Reitz
  2019-05-23 12:45     ` Anton Nefedov
  2 siblings, 1 reply; 9+ messages in thread
From: Max Reitz @ 2019-05-22 20:40 UTC (permalink / raw)
  To: Anton Nefedov, qemu-block; +Cc: kwolf, vsementsov, berto, den, qemu-devel

On 16.05.19 16:27, Anton Nefedov wrote:
> If COW areas of the newly allocated clusters are zeroes on the backing
> image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be
> used on the whole cluster instead of writing explicit zero buffers later
> in perform_cow().
> 
> iotest 060:
> write to the discarded cluster does not trigger COW anymore.
> Use a backing image instead.
> 
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
>  qapi/block-core.json       |  4 +-
>  block/qcow2.h              |  6 +++
>  block/qcow2-cluster.c      |  2 +-
>  block/qcow2.c              | 93 +++++++++++++++++++++++++++++++++++++-
>  block/trace-events         |  1 +
>  tests/qemu-iotests/060     |  7 ++-
>  tests/qemu-iotests/060.out |  5 +-
>  7 files changed, 112 insertions(+), 6 deletions(-)

[...]

> diff --git a/block/qcow2.c b/block/qcow2.c
> index 8e024007db..e6b1293ddf 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c

[...]

> @@ -2145,6 +2150,80 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>      return false;
>  }
>  
> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
> +{
> +    int64_t nr;
> +    return !bytes ||
> +        (!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);

It's a pity that this bdrv_is_allocated() throws away BDRV_BLOCK_ZERO
information.  If something in the backing chain has explicit zero
clusters there, we could get the information for free, but this will
consider it allocated.

Wouldn't it make sense to make bdrv_co_block_status_above() public and
then use that (with want_zero == false)?

(But that can be done later, too, of course.)

> +}
> +
> +static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
> +{
> +    /*
> +     * This check is designed for optimization shortcut so it must be
> +     * efficient.
> +     * Instead of is_zero(), use is_unallocated() as it is faster (but not
> +     * as accurate and can result in false negatives).
> +     */
> +    return is_unallocated(bs, m->offset + m->cow_start.offset,
> +                          m->cow_start.nb_bytes) &&
> +           is_unallocated(bs, m->offset + m->cow_end.offset,
> +                          m->cow_end.nb_bytes);
> +}
> +
> +static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    QCowL2Meta *m;
> +
> +    if (!(s->data_file->bs->supported_zero_flags & BDRV_REQ_NO_FALLBACK)) {
> +        return 0;
> +    }
> +
> +    if (bs->encrypted) {
> +        return 0;
> +    }
> +
> +    for (m = l2meta; m != NULL; m = m->next) {
> +        int ret;
> +
> +        if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
> +            continue;
> +        }
> +
> +        if (!is_zero_cow(bs, m)) {
> +            continue;
> +        }
> +
> +        /*
> +         * instead of writing zero COW buffers,
> +         * efficiently zero out the whole clusters
> +         */
> +
> +        ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
> +                                            m->nb_clusters * s->cluster_size,
> +                                            true);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
> +        ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
> +                                    m->nb_clusters * s->cluster_size,
> +                                    BDRV_REQ_NO_FALLBACK);

Mostly I'm writing this mail because of this.  Zeroing the whole range
seems inefficient to me for very large requests, and the commit message
doesn't say anything about the reasoning behind unconditionally zeroing
everything.

Benchmarking suggests that in most cases it is about equal to zeroing head
and tail separately.  But if I pre-fill an image with random data, it
is much slower:

$ qemu-img create -f qcow2 foo.qcow2 10G
$ dd if=/dev/urandom of=foo.qcow2 bs=1M seek=1 count=4096
$ sync

Then doing large unaligned requests on it:

$ ./qemu-img bench -w -t none -s $((400 * 1048576)) \
  -S $((401 * 1048576)) -o 32768 -c 10 -d 2 -f qcow2 foo.qcow2

When zeroing the whole range, this usually takes around 25 s for me;
just zeroing head and tail takes around 14 s. Without this patch, it
takes around 14 s, too.

On the other hand, when doing smaller requests on a single cluster
(which is what this patch is for):

$ ./qemu-img bench -w -t none -s 4096 -S 65536 -o 32768 \
  -f qcow2 foo.qcow2

This takes around 26 s before this patch, 12 s with it, and around 30 to
40 s when zeroing head and tail separately.


Now, such large requests surely are the exception, as is allocating
blocks in an area of the image that already contains data.  However, I
just want to double-check that zeroing the whole range unconditionally is
done on purpose.  Maybe it makes sense to split head and tail if they
are more than, I don't know, 16 MB apart?  But the "I don't know"
is the problem, of course.  Is there a way to make a good default?
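
For reference, the "zero head and tail separately" variant benchmarked
above could look roughly like this (an editorial sketch, not code from
this series; offsets follow the QCowL2Meta fields used in the patch,
error handling and alignment to the storage's zeroing granularity are
ignored):

  int ret = 0;

  /* zero only the COW head and tail instead of the whole cluster range */
  if (m->cow_start.nb_bytes) {
      ret = bdrv_co_pwrite_zeroes(s->data_file,
                                  m->alloc_offset + m->cow_start.offset,
                                  m->cow_start.nb_bytes,
                                  BDRV_REQ_NO_FALLBACK);
  }
  if (ret == 0 && m->cow_end.nb_bytes) {
      ret = bdrv_co_pwrite_zeroes(s->data_file,
                                  m->alloc_offset + m->cow_end.offset,
                                  m->cow_end.nb_bytes,
                                  BDRV_REQ_NO_FALLBACK);
  }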


(Note that I wrote a lot, but it's not like I'm making a good point to
stop this patch.  I just want to have asked about this before I take it.)

Max

> +        if (ret < 0) {
> +            if (ret != -ENOTSUP && ret != -EAGAIN) {
> +                return ret;
> +            }
> +            continue;
> +        }
> +
> +        trace_qcow2_skip_cow(qemu_coroutine_self(), m->offset, m->nb_clusters);
> +        m->skip_cow = true;
> +    }
> +    return 0;
> +}
> +
>  static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>                                           uint64_t bytes, QEMUIOVector *qiov,
>                                           int flags)



* Re: [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas
  2019-05-22 20:40   ` Max Reitz
@ 2019-05-23 12:45     ` Anton Nefedov
  0 siblings, 0 replies; 9+ messages in thread
From: Anton Nefedov @ 2019-05-23 12:45 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: kwolf, Vladimir Sementsov-Ogievskiy, berto, Denis Lunev, qemu-devel

On 22/5/2019 11:40 PM, Max Reitz wrote:
> On 16.05.19 16:27, Anton Nefedov wrote:
>> If COW areas of the newly allocated clusters are zeroes on the backing
>> image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be
>> used on the whole cluster instead of writing explicit zero buffers later
>> in perform_cow().
>>
>> iotest 060:
>> write to the discarded cluster does not trigger COW anymore.
>> Use a backing image instead.
>>
>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>> ---
>>   qapi/block-core.json       |  4 +-
>>   block/qcow2.h              |  6 +++
>>   block/qcow2-cluster.c      |  2 +-
>>   block/qcow2.c              | 93 +++++++++++++++++++++++++++++++++++++-
>>   block/trace-events         |  1 +
>>   tests/qemu-iotests/060     |  7 ++-
>>   tests/qemu-iotests/060.out |  5 +-
>>   7 files changed, 112 insertions(+), 6 deletions(-)
> 
> [...]
> 
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 8e024007db..e6b1293ddf 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
> 
> [...]
> 
>> @@ -2145,6 +2150,80 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>       return false;
>>   }
>>   
>> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
>> +{
>> +    int64_t nr;
>> +    return !bytes ||
>> +        (!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
> 
> It's a pity that this bdrv_is_allocated() throws away BDRV_BLOCK_ZERO
> information.  If something in the backing chain has explicit zero
> clusters there, we could get the information for free, but this will
> consider it allocated.
> 
> Wouldn't it make sense to make bdrv_co_block_status_above() public and
> then use that (with want_zero == false)?
> 
> (But that can be done later, too, of course.)
> 

or something like bdrv_has_non_zero_data(), with the argument
want_zero (maybe call it check_file in this case).

Yes, I'd rather implement it separately.

>> +}
>> +
>> +static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
>> +{
>> +    /*
>> +     * This check is designed for optimization shortcut so it must be
>> +     * efficient.
>> +     * Instead of is_zero(), use is_unallocated() as it is faster (but not
>> +     * as accurate and can result in false negatives).
>> +     */
>> +    return is_unallocated(bs, m->offset + m->cow_start.offset,
>> +                          m->cow_start.nb_bytes) &&
>> +           is_unallocated(bs, m->offset + m->cow_end.offset,
>> +                          m->cow_end.nb_bytes);
>> +}
>> +
>> +static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    QCowL2Meta *m;
>> +
>> +    if (!(s->data_file->bs->supported_zero_flags & BDRV_REQ_NO_FALLBACK)) {
>> +        return 0;
>> +    }
>> +
>> +    if (bs->encrypted) {
>> +        return 0;
>> +    }
>> +
>> +    for (m = l2meta; m != NULL; m = m->next) {
>> +        int ret;
>> +
>> +        if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
>> +            continue;
>> +        }
>> +
>> +        if (!is_zero_cow(bs, m)) {
>> +            continue;
>> +        }
>> +
>> +        /*
>> +         * instead of writing zero COW buffers,
>> +         * efficiently zero out the whole clusters
>> +         */
>> +
>> +        ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
>> +                                            m->nb_clusters * s->cluster_size,
>> +                                            true);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
>> +        ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
>> +                                    m->nb_clusters * s->cluster_size,
>> +                                    BDRV_REQ_NO_FALLBACK);
> 
> Mostly I'm writing this mail because of this.  Zeroing the whole range
> seems inefficient to me for very large requests, and the commit message
> doesn't say anything about the reasoning behind unconditionally zeroing
> everything.
> 
> Benchmarking looks like in most cases it is about equal to zeroing head
> and tail separately.  But if I pre-filll an image with random data, it
> is much slower:
> 
> $ qemu-img create -f qcow2 foo.qcow2 10G
> $ dd if=/dev/urandom of=foo.qcow2 bs=1M seek=1 count=4096
> $ sync
> 
> Then doing large unaligned requests on it:
> 
> $ ./qemu-img bench -w -t none -s $((400 * 1048576)) \
>    -S $((401 * 1048576)) -o 32768 -c 10 -d 2 -f qcow2 foo.qcow2
> 
> When zeroing the whole range, this usually takes around 25 s for me;
> just zeroing head and tail takes around 14 s. Without this patch, it
> takes around 14 s, too.
> 
> On the other hand, when doing smaller requests on a single cluster
> (which is what this patch is for):
> 
> $ ./qemu-img bench -w -t none -s 4096 -S 65536 -o 32768 \
>    -f qcow2 foo.qcow2
> 
> This takes around 26 s before this patch, 12 s with it, and like 30 to
> 40 when zeroing head and tail separately.
> 
> 
> Now, such large requests surely are the exception, as is allocating
> blocks in an area of the image that already contains data.  However, I
> just want to ask back that zeroing the whole range unconditionally is
> done on purpose.  Maybe it makes sense to split head and tail if they
> are like more than, I don't know, 16 MB apart?  But the "I don't know"
> is the problem of course.  Is there a way to make a good default?
> 

It's indeed tricky to find a solution that performs best universally.
I assumed that writes are often shorter than a cluster, and the bigger
the cluster size, the bigger the gain. I traced for a while and did not
even see a write larger than 1 MB from a freshly installed CentOS VM.

Zeroing head and tail separately has the extra penalty of being separate
I/O ops (which is probably why it loses to, or merely matches, the other
variants in your measurements).

So I'd say it's either unconditional, or, if we want to protect the rare
cases from degradation too, we might fall back to writing zero buffers,
e.g. when the zeroes_length/data_length ratio is too small.
Again, no idea what "too small" is.
Probably there is no other way to find out than to measure with different
borderline ratios.
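
A rough sketch of such a guard inside handle_alloc_space() (an editorial
illustration only; the 1/8 threshold is an arbitrary placeholder, not a
measured value):

  uint64_t zero_bytes = m->cow_start.nb_bytes + m->cow_end.nb_bytes;
  uint64_t data_bytes = m->nb_clusters * s->cluster_size - zero_bytes;

  /* if the zeroed part is small compared to the data being written,
   * skip the whole-cluster zeroing and keep the plain write path */
  if (zero_bytes * 8 < data_bytes) {
      continue;
  }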

/Anton



* Re: [Qemu-devel] [PATCH v14 0/1] qcow2: cluster space preallocation
  2019-05-16 14:27 [Qemu-devel] [PATCH v14 0/1] qcow2: cluster space preallocation Anton Nefedov
  2019-05-16 14:27 ` [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas Anton Nefedov
@ 2019-05-24 13:56 ` Max Reitz
  2019-05-26 15:01   ` Alberto Garcia
  1 sibling, 1 reply; 9+ messages in thread
From: Max Reitz @ 2019-05-24 13:56 UTC (permalink / raw)
  To: Anton Nefedov, qemu-block; +Cc: kwolf, vsementsov, berto, den, qemu-devel

On 16.05.19 16:27, Anton Nefedov wrote:
> ..apparently v13 ended up in a weird base64 that would not easily git-am.
> Resending.
> 
> ----
> 
> hi,
> 
> this used to be a large 10-patches series, but now thanks to Kevin's
> BDRV_REQ_NO_FALLBACK series
> (http://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg06389.html)
> the core block layer functionality is already in place: the existing flag
> can be reused.
> 
> A few accompanying patches were also dropped from the series as those can
> be processed separately.
> 
> So,
> 
> new in v13:
>    - patch 1 (was patch 9) rebased to use s->data_file pointer
>    - comments fixed/added
>    - other patches dropped ;)
> 
> v12: http://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg02761.html
> v11: http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg04342.html
> v10: http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg00121.html
> 
> ----
> 
> This pull request is to start to improve a few performance points of
> qcow2 format:
> 
>   1. non cluster-aligned write requests (to unallocated clusters) explicitly
>      pad data with zeroes if there is no backing data.
>      Resulting increase in ops number and potential cluster fragmentation
>      (on the host file) is already solved by:
>        ee22a9d qcow2: Merge the writing of the COW regions with the guest data
>      However, in case of zero COW regions, that can be avoided at all
>      but the whole clusters are preallocated and zeroed in a single
>      efficient write_zeroes() operation
> 
>   2. moreover, efficient write_zeroes() operation can be used to preallocate
>      space megabytes (*configurable number) ahead which gives noticeable
>      improvement on some storage types (e.g. distributed storage)
>      where the space allocation operation might be expensive)
>      (Not included in this patchset since v6).
> 
>   3. this will also allow to enable simultaneous writes to the same unallocated
>      cluster after the space has been allocated & zeroed but before
>      the first data is written and the cluster is linked to L2.
>      (Not included in this patchset).
> 
> Efficient write_zeroes usually implies that the blocks are not actually
> written to but only reserved and marked as zeroed by the storage.
> Existing bdrv_write_zeroes() falls back to writing zero buffers if
> write_zeroes is not supported by the driver, so BDRV_REQ_NO_FALLBACK is used.
> 
> simple perf test:
> 
>   qemu-img create -f qcow2 test.img 4G && \
>   qemu-img bench -c $((1024*1024)) -f qcow2 -n -s 4k -t none -w test.img
> 
> test results (seconds):
> 
>     +------+----------+----------+------+
>     | file |  before  |  after   | gain |
>     +------+----------+----------+------+
>     | ssd  |   61.153 |   36.313 |  41% |
>     | hdd  |  112.676 |  122.056 |  -8% |
>     +------+----------+----------+------+

I’ve done a few more tests, and I’ve seen more slowdown on an HDD.
(Like 30 % when doing 64 kB requests that are not aligned to clusters.)
 On the other hand, the SSD gain is generally in the same ballpark (38 %
when issuing the same kind of requests.)

On the basis that:

(1) I believe that SSD performance is more important than HDD performance,

(2) I can’t think of a simple way to automatically decide whether the
new or the old codepath is more efficient[1], and

(3) users probably would not use a switch if we introduced one.

I have applied this patch to my block branch:

https://git.xanclic.moe/XanClic/qemu/commits/branch/block

Thanks!


[1] Hm.  We can probably investigate whether the file is stored on a
rotational medium or not.  Is there a fundamental reason why this patch
seems to degrade performance on an HDD but improves it on an SSD?  If
so, we can probably make a choice based on that.
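
A self-contained sketch of such a check on Linux (an editorial
illustration, not part of this series; it assumes the file's st_dev maps
directly to a block device that has a queue/ directory in sysfs -- true
for whole disks and device-mapper volumes, while a partition would need
an extra hop to its parent disk):

  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/sysmacros.h>

  /* 1 = rotational (HDD), 0 = non-rotational (SSD etc.), -1 = unknown */
  static int file_on_rotational_disk(const char *path)
  {
      struct stat st;
      char sysfs[128];
      FILE *f;
      int rotational = -1;

      if (stat(path, &st) < 0) {
          return -1;
      }
      snprintf(sysfs, sizeof(sysfs),
               "/sys/dev/block/%u:%u/queue/rotational",
               major(st.st_dev), minor(st.st_dev));
      f = fopen(sysfs, "r");
      if (f) {
          if (fscanf(f, "%d", &rotational) != 1) {
              rotational = -1;
          }
          fclose(f);
      }
      return rotational;
  }

  int main(int argc, char **argv)
  {
      printf("%d\n", argc > 1 ? file_on_rotational_disk(argv[1]) : -1);
      return 0;
  }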

Max



* Re: [Qemu-devel] [PATCH v14 0/1] qcow2: cluster space preallocation
  2019-05-24 13:56 ` [Qemu-devel] [PATCH v14 0/1] qcow2: cluster space preallocation Max Reitz
@ 2019-05-26 15:01   ` Alberto Garcia
  2019-05-27 12:14     ` Max Reitz
  0 siblings, 1 reply; 9+ messages in thread
From: Alberto Garcia @ 2019-05-26 15:01 UTC (permalink / raw)
  To: Max Reitz, Anton Nefedov, qemu-block; +Cc: kwolf, vsementsov, qemu-devel, den

On Fri 24 May 2019 03:56:21 PM CEST, Max Reitz <mreitz@redhat.com> wrote:
>>     +------+----------+----------+------+
>>     | file |  before  |  after   | gain |
>>     +------+----------+----------+------+
>>     | ssd  |   61.153 |   36.313 |  41% |
>>     | hdd  |  112.676 |  122.056 |  -8% |
>>     +------+----------+----------+------+
>
> I’ve done a few more tests, and I’ve seen more slowdown on an HDD.
> (Like 30 % when doing 64 kB requests that are not aligned to
> clusters.)  On the other hand, the SSD gain is generally in the same
> ballpark (38 % when issuing the same kind of requests.)
  [...]
> [1] Hm.  We can probably investigate whether the file is stored on a
> rotational medium or not.  Is there a fundamental reason why this
> patch seems to degrade performance on an HDD but improves it on an
> SSD?  If so, we can probably make a choice based on that.

This is when writing to an unallocated cluster with no existing data on
the backing image, right? Then it's probably because you need 2
operations (write zeros + write data) instead of just one.

Berto



* Re: [Qemu-devel] [PATCH v14 0/1] qcow2: cluster space preallocation
  2019-05-26 15:01   ` Alberto Garcia
@ 2019-05-27 12:14     ` Max Reitz
  0 siblings, 0 replies; 9+ messages in thread
From: Max Reitz @ 2019-05-27 12:14 UTC (permalink / raw)
  To: Alberto Garcia, Anton Nefedov, qemu-block
  Cc: kwolf, vsementsov, qemu-devel, den

On 26.05.19 17:01, Alberto Garcia wrote:
> On Fri 24 May 2019 03:56:21 PM CEST, Max Reitz <mreitz@redhat.com> wrote:
>>>     +------+----------+----------+------+
>>>     | file |  before  |  after   | gain |
>>>     +------+----------+----------+------+
>>>     | ssd  |   61.153 |   36.313 |  41% |
>>>     | hdd  |  112.676 |  122.056 |  -8% |
>>>     +------+----------+----------+------+
>>
>> I’ve done a few more tests, and I’ve seen more slowdown on an HDD.
>> (Like 30 % when doing 64 kB requests that are not aligned to
>> clusters.)  On the other hand, the SSD gain is generally in the same
>> ballpark (38 % when issuing the same kind of requests.)
>   [...]
>> [1] Hm.  We can probably investigate whether the file is stored on a
>> rotational medium or not.  Is there a fundamental reason why this
>> patch seems to degrade performance on an HDD but improves it on an
>> SSD?  If so, we can probably make a choice based on that.
> 
> This is when writing to an unallocated cluster with no existing data on
> the backing image, right? Then it's probably because you need 2
> operations (write zeros + write data) instead of just one.

Hm, yes.  I didn’t test writing tail and head separately, which should
be even worse.

Max


