* [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io
@ 2018-08-07 17:43 Vladimir Sementsov-Ogievskiy
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 1/7] qcow2: move qemu_co_mutex_lock below decryption procedure Vladimir Sementsov-Ogievskiy
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-08-07 17:43 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, vsementsov, den

Hi all!

Here is an asynchronous scheme for handling fragmented qcow2
reads and writes. Both the qcow2 read and write functions loop through
sequential portions of data; this series aims to parallelize those
loop iterations.
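
The idea can be sketched outside of qemu: instead of awaiting each
portion inline, dispatch it to one of a bounded pool of concurrent
workers. Below is a minimal Python/asyncio model of that scheme
(hypothetical names and a simulated read; the series itself uses qemu
coroutines and a hand-rolled worker list, not asyncio):

```python
import asyncio

MAX_WORKERS = 64  # mirrors QCOW2_MAX_WORKERS from the series

async def read_chunk(offset, length):
    # Stand-in for one bdrv_co_preadv call on a contiguous portion.
    await asyncio.sleep(0)  # simulate I/O
    return (offset, length)

async def preadv(offset, total, chunk):
    # Bound concurrency, as the worker pool in the patches does.
    sem = asyncio.Semaphore(MAX_WORKERS)

    async def bounded(off, ln):
        async with sem:
            return await read_chunk(off, ln)

    # Instead of awaiting each portion before starting the next,
    # start them all and let up to MAX_WORKERS run concurrently.
    tasks = []
    while total:
        ln = min(chunk, total)
        tasks.append(asyncio.create_task(bounded(offset, ln)))
        offset += ln
        total -= ln
    return await asyncio.gather(*tasks)

results = asyncio.run(preadv(0, 256 * 1024, 64 * 1024))
print(results)
```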

It improves performance for fragmented qcow2 images. I've tested it
as follows:

I have four 4G qcow2 images (with the default 64k cluster size) on my SSD:
t-seq.qcow2 - written sequentially
t-reverse.qcow2 - filled by writing 64k portions from the end to the start
t-rand.qcow2 - filled by writing 64k portions (aligned) in random order
t-part-rand.qcow2 - filled by shuffling the order of 64k writes within each 1m region
(see the image-generation source code at the end for details)

and the test (sequential I/O in 1m chunks):

test write:
    for t in /ssd/t-*; \
        do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
        ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w $t; \
    done

test read (same command, just drop the -w parameter):
    for t in /ssd/t-*; \
        do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
        ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
    done

short info about parameters:
  -w - do writes (otherwise do reads)
  -c - number of blocks
  -s - block size
  -t none - disable cache
  -n - native aio
  -d 1 - don't use parallel requests provided by qemu-img bench itself

results:
    +-----------+-----------+----------+-----------+----------+
    |   file    | wr before | wr after | rd before | rd after |
    +-----------+-----------+----------+-----------+----------+
    | seq       |     8.605 |    8.636 |     9.043 |    9.010 |
    | reverse   |     9.934 |    8.654 |    17.162 |    8.662 |
    | rand      |     9.983 |    8.687 |    19.775 |    9.010 |
    | part-rand |     9.871 |    8.650 |    14.241 |    8.669 |
    +-----------+-----------+----------+-----------+----------+

The performance gain is clear, especially for reads.
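
To quantify that, the implied speedup factors (before/after, taken from
the table above; qemu-img bench reports elapsed seconds, so lower is
better) can be computed directly:

```python
# read times in seconds, copied from the "rd before" / "rd after" columns
rd = {
    "seq":       (9.043, 9.010),
    "reverse":   (17.162, 8.662),
    "rand":      (19.775, 9.010),
    "part-rand": (14.241, 8.669),
}

# reverse and rand reads come out roughly 2x faster
for name, (before, after) in rd.items():
    print("{:<9} {:.2f}x".format(name, before / after))
```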

how images are generated:

 === gen-writes file ===
    #!/usr/bin/env python
    import random
    import sys

    size = 4 * 1024 * 1024 * 1024
    block = 64 * 1024
    block2 = 1024 * 1024

    arg = sys.argv[1]

    if arg in ('rand', 'reverse', 'seq'):
        writes = list(range(0, size, block))

    if arg == 'rand':
        random.shuffle(writes)
    elif arg == 'reverse':
        writes.reverse()
    elif arg == 'part-rand':
        writes = []
        for off in range(0, size, block2):
            wr = list(range(off, off + block2, block))
            random.shuffle(wr)
            writes.extend(wr)
    elif arg != 'seq':
        sys.exit(1)

    for w in writes:
        print('write -P 0xff {} {}'.format(w, block))

    print('q')


 === gen-test-images.sh file ===
    #!/bin/bash

    IMG_PATH=/ssd

    for name in seq reverse rand part-rand; do
        IMG=$IMG_PATH/t-$name.qcow2
        echo creating $IMG ...
        rm -f $IMG
        qemu-img create -f qcow2 $IMG 4G
        gen-writes $name | qemu-io $IMG
    done

Denis V. Lunev (1):
  qcow2: move qemu_co_mutex_lock below decryption procedure

Vladimir Sementsov-Ogievskiy (6):
  qcow2: bdrv_co_pwritev: move encryption code out of lock
  qcow2: split out reading normal clusters from qcow2_co_preadv
  qcow2: async scheme for qcow2_co_preadv
  qcow2: refactor qcow2_co_pwritev: split out qcow2_co_do_pwritev
  qcow2: refactor qcow2_co_pwritev locals scope
  qcow2: async scheme for qcow2_co_pwritev

 block/qcow2.c                      | 506 +++++++++++++++++++++++++++++--------
 tests/qemu-iotests/026.out         |  18 +-
 tests/qemu-iotests/026.out.nocache |  20 +-
 3 files changed, 415 insertions(+), 129 deletions(-)

-- 
2.11.1

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [PATCH 1/7] qcow2: move qemu_co_mutex_lock below decryption procedure
  2018-08-07 17:43 [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Vladimir Sementsov-Ogievskiy
@ 2018-08-07 17:43 ` Vladimir Sementsov-Ogievskiy
       [not found]   ` <8e1cc18c-307f-99b1-5892-713ebd17a15f@redhat.com>
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 2/7] qcow2: bdrv_co_pwritev: move encryption code out of lock Vladimir Sementsov-Ogievskiy
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-08-07 17:43 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, vsementsov, den

From: "Denis V. Lunev" <den@openvz.org>

We do not work with shared state in the decryption code, so this
operation is safe outside the lock. On the other hand, this
significantly reduces the scope of the lock.

Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 block/qcow2.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index ec9e6238a0..41def07c67 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1880,9 +1880,11 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
             break;
 
         case QCOW2_CLUSTER_NORMAL:
+            qemu_co_mutex_unlock(&s->lock);
+
             if ((cluster_offset & 511) != 0) {
                 ret = -EIO;
-                goto fail;
+                goto fail_nolock;
             }
 
             if (bs->encrypted) {
@@ -1899,7 +1901,7 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
                                             * s->cluster_size);
                     if (cluster_data == NULL) {
                         ret = -ENOMEM;
-                        goto fail;
+                        goto fail_nolock;
                     }
                 }
 
@@ -1909,13 +1911,11 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
             }
 
             BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
-            qemu_co_mutex_unlock(&s->lock);
             ret = bdrv_co_preadv(bs->file,
                                  cluster_offset + offset_in_cluster,
                                  cur_bytes, &hd_qiov, 0);
-            qemu_co_mutex_lock(&s->lock);
             if (ret < 0) {
-                goto fail;
+                goto fail_nolock;
             }
             if (bs->encrypted) {
                 assert(s->crypto);
@@ -1929,10 +1929,11 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
                                           cur_bytes,
                                           NULL) < 0) {
                     ret = -EIO;
-                    goto fail;
+                    goto fail_nolock;
                 }
                 qemu_iovec_from_buf(qiov, bytes_done, cluster_data, cur_bytes);
             }
+            qemu_co_mutex_lock(&s->lock);
             break;
 
         default:
@@ -1950,6 +1951,7 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
 fail:
     qemu_co_mutex_unlock(&s->lock);
 
+fail_nolock:
     qemu_iovec_destroy(&hd_qiov);
     qemu_vfree(cluster_data);
 
-- 
2.11.1


* [Qemu-devel] [PATCH 2/7] qcow2: bdrv_co_pwritev: move encryption code out of lock
  2018-08-07 17:43 [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Vladimir Sementsov-Ogievskiy
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 1/7] qcow2: move qemu_co_mutex_lock below decryption procedure Vladimir Sementsov-Ogievskiy
@ 2018-08-07 17:43 ` Vladimir Sementsov-Ogievskiy
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv Vladimir Sementsov-Ogievskiy
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-08-07 17:43 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, vsementsov, den

The encryption code does not need to run under the lock.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 41def07c67..65e3ca00e2 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2039,6 +2039,15 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
 
         assert((cluster_offset & 511) == 0);
 
+        ret = qcow2_pre_write_overlap_check(bs, 0,
+                                            cluster_offset + offset_in_cluster,
+                                            cur_bytes);
+        if (ret < 0) {
+            goto fail;
+        }
+
+        qemu_co_mutex_unlock(&s->lock);
+
         qemu_iovec_reset(&hd_qiov);
         qemu_iovec_concat(&hd_qiov, qiov, bytes_done, cur_bytes);
 
@@ -2072,30 +2081,25 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
             qemu_iovec_add(&hd_qiov, cluster_data, cur_bytes);
         }
 
-        ret = qcow2_pre_write_overlap_check(bs, 0,
-                cluster_offset + offset_in_cluster, cur_bytes);
-        if (ret < 0) {
-            goto fail;
-        }
-
         /* If we need to do COW, check if it's possible to merge the
          * writing of the guest data together with that of the COW regions.
          * If it's not possible (or not necessary) then write the
          * guest data now. */
         if (!merge_cow(offset, cur_bytes, &hd_qiov, l2meta)) {
-            qemu_co_mutex_unlock(&s->lock);
             BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
             trace_qcow2_writev_data(qemu_coroutine_self(),
                                     cluster_offset + offset_in_cluster);
             ret = bdrv_co_pwritev(bs->file,
                                   cluster_offset + offset_in_cluster,
                                   cur_bytes, &hd_qiov, 0);
-            qemu_co_mutex_lock(&s->lock);
             if (ret < 0) {
+                qemu_co_mutex_lock(&s->lock);
                 goto fail;
             }
         }
 
+        qemu_co_mutex_lock(&s->lock);
+
         ret = qcow2_handle_l2meta(bs, &l2meta, true);
         if (ret) {
             goto fail;
-- 
2.11.1


* [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv
  2018-08-07 17:43 [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Vladimir Sementsov-Ogievskiy
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 1/7] qcow2: move qemu_co_mutex_lock below decryption procedure Vladimir Sementsov-Ogievskiy
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 2/7] qcow2: bdrv_co_pwritev: move encryption code out of lock Vladimir Sementsov-Ogievskiy
@ 2018-08-07 17:43 ` Vladimir Sementsov-Ogievskiy
       [not found]   ` <6e19aaeb-8acc-beb9-5ece-9ae6101637a9@redhat.com>
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 4/7] qcow2: async scheme for qcow2_co_preadv Vladimir Sementsov-Ogievskiy
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-08-07 17:43 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, vsementsov, den

Memory allocation may become less efficient for the encrypted case.
That is the price we pay for the upcoming asynchronous scheme.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 64 insertions(+), 50 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 65e3ca00e2..5e7f2ee318 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1808,6 +1808,67 @@ out:
     return ret;
 }
 
+static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
+                                               uint64_t file_cluster_offset,
+                                               uint64_t offset,
+                                               uint64_t bytes,
+                                               QEMUIOVector *qiov,
+                                               uint64_t qiov_offset)
+{
+    int ret;
+    BDRVQcow2State *s = bs->opaque;
+    void *crypt_buf = NULL;
+    QEMUIOVector hd_qiov;
+    int offset_in_cluster = offset_into_cluster(s, offset);
+
+    if ((file_cluster_offset & 511) != 0) {
+        return -EIO;
+    }
+
+    qemu_iovec_init(&hd_qiov, qiov->niov);
+
+    if (bs->encrypted) {
+        assert(s->crypto);
+        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
+
+        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
+        qemu_iovec_add(&hd_qiov, crypt_buf, bytes);
+    } else {
+        qemu_iovec_concat(&hd_qiov, qiov, qiov_offset, bytes);
+    }
+
+    BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
+    ret = bdrv_co_preadv(bs->file,
+                         file_cluster_offset + offset_in_cluster,
+                         bytes, &hd_qiov, 0);
+    if (ret < 0) {
+        goto out;
+    }
+
+    if (bs->encrypted) {
+        assert(s->crypto);
+        assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+        assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+        if (qcrypto_block_decrypt(s->crypto,
+                                  (s->crypt_physical_offset ?
+                                   file_cluster_offset + offset_in_cluster :
+                                   offset),
+                                  crypt_buf,
+                                  bytes,
+                                  NULL) < 0) {
+            ret = -EIO;
+            goto out;
+        }
+        qemu_iovec_from_buf(qiov, qiov_offset, crypt_buf, bytes);
+    }
+
+out:
+    qemu_vfree(crypt_buf);
+    qemu_iovec_destroy(&hd_qiov);
+
+    return ret;
+}
+
 static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
                                         uint64_t bytes, QEMUIOVector *qiov,
                                         int flags)
@@ -1819,7 +1880,6 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
     uint64_t cluster_offset = 0;
     uint64_t bytes_done = 0;
     QEMUIOVector hd_qiov;
-    uint8_t *cluster_data = NULL;
 
     qemu_iovec_init(&hd_qiov, qiov->niov);
 
@@ -1882,57 +1942,12 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
         case QCOW2_CLUSTER_NORMAL:
             qemu_co_mutex_unlock(&s->lock);
 
-            if ((cluster_offset & 511) != 0) {
-                ret = -EIO;
-                goto fail_nolock;
-            }
-
-            if (bs->encrypted) {
-                assert(s->crypto);
-
-                /*
-                 * For encrypted images, read everything into a temporary
-                 * contiguous buffer on which the AES functions can work.
-                 */
-                if (!cluster_data) {
-                    cluster_data =
-                        qemu_try_blockalign(bs->file->bs,
-                                            QCOW_MAX_CRYPT_CLUSTERS
-                                            * s->cluster_size);
-                    if (cluster_data == NULL) {
-                        ret = -ENOMEM;
-                        goto fail_nolock;
-                    }
-                }
-
-                assert(cur_bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
-                qemu_iovec_reset(&hd_qiov);
-                qemu_iovec_add(&hd_qiov, cluster_data, cur_bytes);
-            }
-
-            BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
-            ret = bdrv_co_preadv(bs->file,
-                                 cluster_offset + offset_in_cluster,
-                                 cur_bytes, &hd_qiov, 0);
+            ret = qcow2_co_preadv_normal(bs, cluster_offset,
+                                         offset, cur_bytes, qiov, bytes_done);
             if (ret < 0) {
                 goto fail_nolock;
             }
-            if (bs->encrypted) {
-                assert(s->crypto);
-                assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
-                assert((cur_bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
-                if (qcrypto_block_decrypt(s->crypto,
-                                          (s->crypt_physical_offset ?
-                                           cluster_offset + offset_in_cluster :
-                                           offset),
-                                          cluster_data,
-                                          cur_bytes,
-                                          NULL) < 0) {
-                    ret = -EIO;
-                    goto fail_nolock;
-                }
-                qemu_iovec_from_buf(qiov, bytes_done, cluster_data, cur_bytes);
-            }
+
             qemu_co_mutex_lock(&s->lock);
             break;
 
@@ -1953,7 +1968,6 @@ fail:
 
 fail_nolock:
     qemu_iovec_destroy(&hd_qiov);
-    qemu_vfree(cluster_data);
 
     return ret;
 }
-- 
2.11.1


* [Qemu-devel] [PATCH 4/7] qcow2: async scheme for qcow2_co_preadv
  2018-08-07 17:43 [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv Vladimir Sementsov-Ogievskiy
@ 2018-08-07 17:43 ` Vladimir Sementsov-Ogievskiy
       [not found]   ` <08a610aa-9c78-1c83-5e48-b93080aac87b@redhat.com>
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 5/7] qcow2: refactor qcow2_co_pwritev: split out qcow2_co_do_pwritev Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-08-07 17:43 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, vsementsov, den

Start several asynchronous requests instead of reading chunk by chunk.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.c | 208 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 204 insertions(+), 4 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 5e7f2ee318..a0df8d4e50 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1869,6 +1869,197 @@ out:
     return ret;
 }
 
+typedef struct Qcow2WorkerTask {
+    uint64_t file_cluster_offset;
+    uint64_t offset;
+    uint64_t bytes;
+    uint64_t bytes_done;
+} Qcow2WorkerTask;
+
+typedef int (*Qcow2DoWorkFunc)(BlockDriverState *bs, QEMUIOVector *qiov,
+                               Qcow2WorkerTask *task);
+
+typedef struct Qcow2RWState {
+    BlockDriverState *bs;
+    QEMUIOVector *qiov;
+    uint64_t bytes;
+    int ret;
+    bool waiting_one;
+    bool waiting_all;
+    bool finalize;
+    Coroutine *co;
+    QSIMPLEQ_HEAD(, Qcow2Worker) free_workers;
+    QSIMPLEQ_HEAD(, Qcow2Worker) busy_workers;
+    int online_workers;
+    Qcow2DoWorkFunc do_work_func;
+} Qcow2RWState;
+
+typedef struct Qcow2Worker {
+    Qcow2RWState *rws;
+    Coroutine *co;
+    Qcow2WorkerTask task;
+    bool busy;
+    QSIMPLEQ_ENTRY(Qcow2Worker) entry;
+} Qcow2Worker;
+#define QCOW2_MAX_WORKERS 64
+
+static coroutine_fn void qcow2_rw_worker(void *opaque);
+static Qcow2Worker *qcow2_new_worker(Qcow2RWState *rws)
+{
+    Qcow2Worker *w = g_new0(Qcow2Worker, 1);
+    w->rws = rws;
+    w->co = qemu_coroutine_create(qcow2_rw_worker, w);
+
+    return w;
+}
+
+static void qcow2_free_worker(Qcow2Worker *w)
+{
+    g_free(w);
+}
+
+static coroutine_fn void qcow2_rw_worker(void *opaque)
+{
+    Qcow2Worker *w = opaque;
+    Qcow2RWState *rws = w->rws;
+
+    rws->online_workers++;
+
+    while (!rws->finalize) {
+        int ret = rws->do_work_func(rws->bs, rws->qiov, &w->task);
+        if (ret < 0 && rws->ret == 0) {
+            rws->ret = ret;
+        }
+
+        if (rws->waiting_all || rws->ret < 0) {
+            break;
+        }
+
+        w->busy = false;
+        QSIMPLEQ_REMOVE(&rws->busy_workers, w, Qcow2Worker, entry);
+        QSIMPLEQ_INSERT_TAIL(&rws->free_workers, w, entry);
+        if (rws->waiting_one) {
+            rws->waiting_one = false;
+            /* we must unset it here, to prevent queuing rws->co in several
+             * workers (it may happen if other worker already waits us on mutex,
+             * workers (it may happen if another worker already waits on the
+             * mutex, so it would be entered after our yield and before rws->co enters)
+             * TODO: rethink this comment, as here (and in other places in the
+             * file) we moved from qemu_coroutine_add_next to aio_co_wake.
+             */
+            aio_co_wake(rws->co);
+        }
+
+        qemu_coroutine_yield();
+    }
+
+    if (w->busy) {
+        w->busy = false;
+        QSIMPLEQ_REMOVE(&rws->busy_workers, w, Qcow2Worker, entry);
+    }
+    qcow2_free_worker(w);
+    rws->online_workers--;
+
+    if (rws->waiting_all && rws->online_workers == 0) {
+        aio_co_wake(rws->co);
+    }
+}
+
+static coroutine_fn void qcow2_rws_add_task(Qcow2RWState *rws,
+                                            uint64_t file_cluster_offset,
+                                            uint64_t offset,
+                                            uint64_t bytes,
+                                            uint64_t bytes_done)
+{
+    Qcow2Worker *w;
+
+    assert(rws->co == qemu_coroutine_self());
+
+    if (bytes_done == 0 && bytes == rws->bytes) {
+        Qcow2WorkerTask task = {
+            .file_cluster_offset = file_cluster_offset,
+            .offset = offset,
+            .bytes = bytes,
+            .bytes_done = bytes_done
+        };
+        rws->ret = rws->do_work_func(rws->bs, rws->qiov, &task);
+        return;
+    }
+
+    if (!QSIMPLEQ_EMPTY(&rws->free_workers)) {
+        w = QSIMPLEQ_FIRST(&rws->free_workers);
+        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
+    } else if (rws->online_workers < QCOW2_MAX_WORKERS) {
+        w = qcow2_new_worker(rws);
+    } else {
+        rws->waiting_one = true;
+        qemu_coroutine_yield();
+        assert(!rws->waiting_one); /* already unset by worker */
+
+        w = QSIMPLEQ_FIRST(&rws->free_workers);
+        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
+    }
+    w->busy = true;
+    QSIMPLEQ_INSERT_TAIL(&rws->busy_workers, w, entry);
+
+    w->task.file_cluster_offset = file_cluster_offset;
+    w->task.offset = offset;
+    w->task.bytes = bytes;
+    w->task.bytes_done = bytes_done;
+
+    qemu_coroutine_enter(w->co);
+}
+
+static void qcow2_init_rws(Qcow2RWState *rws, BlockDriverState *bs,
+                           QEMUIOVector *qiov, uint64_t bytes,
+                           Qcow2DoWorkFunc do_work_func)
+{
+    memset(rws, 0, sizeof(*rws));
+    rws->bs = bs;
+    rws->qiov = qiov;
+    rws->bytes = bytes;
+    rws->co = qemu_coroutine_self();
+    rws->do_work_func = do_work_func;
+    QSIMPLEQ_INIT(&rws->free_workers);
+    QSIMPLEQ_INIT(&rws->busy_workers);
+}
+
+static void qcow2_finalize_rws(Qcow2RWState *rws)
+{
+    assert(rws->co == qemu_coroutine_self());
+
+    /* kill waiting workers */
+    rws->finalize = true;
+    while (!QSIMPLEQ_EMPTY(&rws->free_workers)) {
+        Qcow2Worker *w = QSIMPLEQ_FIRST(&rws->free_workers);
+        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
+        qemu_coroutine_enter(w->co);
+    }
+
+    /* wait others */
+    if (rws->online_workers > 0) {
+        rws->waiting_all = true;
+        qemu_coroutine_yield();
+        rws->waiting_all = false;
+    }
+
+    assert(rws->online_workers == 0);
+    assert(QSIMPLEQ_EMPTY(&rws->free_workers));
+    assert(QSIMPLEQ_EMPTY(&rws->busy_workers));
+}
+
+static coroutine_fn int qcow2_co_preadv_normal_task(BlockDriverState *bs,
+                                                    QEMUIOVector *qiov,
+                                                    Qcow2WorkerTask *task)
+{
+    return qcow2_co_preadv_normal(bs,
+                                  task->file_cluster_offset,
+                                  task->offset,
+                                  task->bytes,
+                                  qiov,
+                                  task->bytes_done);
+}
+
 static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
                                         uint64_t bytes, QEMUIOVector *qiov,
                                         int flags)
@@ -1880,12 +2071,15 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
     uint64_t cluster_offset = 0;
     uint64_t bytes_done = 0;
     QEMUIOVector hd_qiov;
+    Qcow2RWState rws = {0};
+
+    qcow2_init_rws(&rws, bs, qiov, bytes, qcow2_co_preadv_normal_task);
 
     qemu_iovec_init(&hd_qiov, qiov->niov);
 
     qemu_co_mutex_lock(&s->lock);
 
-    while (bytes != 0) {
+    while (bytes != 0 && rws.ret == 0) {
 
         /* prepare next request */
         cur_bytes = MIN(bytes, INT_MAX);
@@ -1942,9 +2136,10 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
         case QCOW2_CLUSTER_NORMAL:
             qemu_co_mutex_unlock(&s->lock);
 
-            ret = qcow2_co_preadv_normal(bs, cluster_offset,
-                                         offset, cur_bytes, qiov, bytes_done);
-            if (ret < 0) {
+            qcow2_rws_add_task(&rws, cluster_offset, offset, cur_bytes,
+                               bytes_done);
+            if (rws.ret < 0) {
+                ret = rws.ret;
                 goto fail_nolock;
             }
 
@@ -1967,6 +2162,11 @@ fail:
     qemu_co_mutex_unlock(&s->lock);
 
 fail_nolock:
+    qcow2_finalize_rws(&rws);
+    if (ret == 0) {
+        ret = rws.ret;
+    }
+
     qemu_iovec_destroy(&hd_qiov);
 
     return ret;
-- 
2.11.1


* [Qemu-devel] [PATCH 5/7] qcow2: refactor qcow2_co_pwritev: split out qcow2_co_do_pwritev
  2018-08-07 17:43 [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 4/7] qcow2: async scheme for qcow2_co_preadv Vladimir Sementsov-Ogievskiy
@ 2018-08-07 17:43 ` Vladimir Sementsov-Ogievskiy
       [not found]   ` <5c871ce7-2cab-f897-0b06-cbc05b9ffe97@redhat.com>
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 6/7] qcow2: refactor qcow2_co_pwritev locals scope Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-08-07 17:43 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, vsementsov, den

Split out a block that will be reused in the asynchronous scheme.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.c | 138 ++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 86 insertions(+), 52 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index a0df8d4e50..4d669432d1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2210,6 +2210,85 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
     return false;
 }
 
+/* qcow2_co_do_pwritev
+ * Called with s->lock unlocked
+ * hd_qiov - temp qiov for any use. It is initialized empty and
+ *           supports adding up to qiov->niov + 2 elements
+ * l2meta  - if not NULL, qcow2_co_do_pwritev() will consume it. Caller must not
+ *           use it somehow after qcow2_co_do_pwritev() call
+ */
+static coroutine_fn int qcow2_co_do_pwritev(BlockDriverState *bs,
+                                            uint64_t file_cluster_offset,
+                                            uint64_t offset,
+                                            uint64_t bytes,
+                                            QEMUIOVector *qiov,
+                                            uint64_t qiov_offset,
+                                            QCowL2Meta *l2meta)
+{
+    int ret;
+    BDRVQcow2State *s = bs->opaque;
+    void *crypt_buf = NULL;
+    QEMUIOVector hd_qiov;
+    int offset_in_cluster = offset_into_cluster(s, offset);
+
+    qemu_iovec_reset(&hd_qiov);
+    qemu_iovec_init(&hd_qiov, qiov->niov);
+
+    if (bs->encrypted) {
+        assert(s->crypto);
+        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
+        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
+        qemu_iovec_to_buf(qiov, qiov_offset, crypt_buf, bytes);
+
+        if (qcrypto_block_encrypt(s->crypto,
+                                  (s->crypt_physical_offset ?
+                                   file_cluster_offset + offset_in_cluster :
+                                   offset),
+                                  crypt_buf,
+                                  bytes, NULL) < 0) {
+            ret = -EIO;
+            goto fail;
+        }
+
+        qemu_iovec_add(&hd_qiov, crypt_buf, bytes);
+    } else {
+        qemu_iovec_concat(&hd_qiov, qiov, qiov_offset, bytes);
+    }
+
+    /* If we need to do COW, check if it's possible to merge the
+     * writing of the guest data together with that of the COW regions.
+     * If it's not possible (or not necessary) then write the
+     * guest data now. */
+    if (!merge_cow(offset, bytes, &hd_qiov, l2meta)) {
+        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
+        trace_qcow2_writev_data(qemu_coroutine_self(),
+                                file_cluster_offset + offset_in_cluster);
+        ret = bdrv_co_pwritev(bs->file,
+                              file_cluster_offset + offset_in_cluster,
+                              bytes, &hd_qiov, 0);
+        if (ret < 0) {
+            qemu_co_mutex_lock(&s->lock);
+            goto fail;
+        }
+    }
+
+    qemu_co_mutex_lock(&s->lock);
+
+    ret = qcow2_handle_l2meta(bs, &l2meta, true);
+    if (ret) {
+        goto fail;
+    }
+
+fail:
+    qcow2_handle_l2meta(bs, &l2meta, false);
+    qemu_co_mutex_unlock(&s->lock);
+
+    qemu_vfree(crypt_buf);
+    qemu_iovec_destroy(&hd_qiov);
+
+    return ret;
+}
+
 static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -2262,63 +2341,16 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
 
         qemu_co_mutex_unlock(&s->lock);
 
-        qemu_iovec_reset(&hd_qiov);
-        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, cur_bytes);
 
-        if (bs->encrypted) {
-            assert(s->crypto);
-            if (!cluster_data) {
-                cluster_data = qemu_try_blockalign(bs->file->bs,
-                                                   QCOW_MAX_CRYPT_CLUSTERS
-                                                   * s->cluster_size);
-                if (cluster_data == NULL) {
-                    ret = -ENOMEM;
-                    goto fail;
-                }
-            }
-
-            assert(hd_qiov.size <=
-                   QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
-            qemu_iovec_to_buf(&hd_qiov, 0, cluster_data, hd_qiov.size);
-
-            if (qcrypto_block_encrypt(s->crypto,
-                                      (s->crypt_physical_offset ?
-                                       cluster_offset + offset_in_cluster :
-                                       offset),
-                                      cluster_data,
-                                      cur_bytes, NULL) < 0) {
-                ret = -EIO;
-                goto fail;
-            }
-
-            qemu_iovec_reset(&hd_qiov);
-            qemu_iovec_add(&hd_qiov, cluster_data, cur_bytes);
-        }
-
-        /* If we need to do COW, check if it's possible to merge the
-         * writing of the guest data together with that of the COW regions.
-         * If it's not possible (or not necessary) then write the
-         * guest data now. */
-        if (!merge_cow(offset, cur_bytes, &hd_qiov, l2meta)) {
-            BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
-            trace_qcow2_writev_data(qemu_coroutine_self(),
-                                    cluster_offset + offset_in_cluster);
-            ret = bdrv_co_pwritev(bs->file,
-                                  cluster_offset + offset_in_cluster,
-                                  cur_bytes, &hd_qiov, 0);
-            if (ret < 0) {
-                qemu_co_mutex_lock(&s->lock);
-                goto fail;
-            }
+        ret = qcow2_co_do_pwritev(bs, cluster_offset, offset, cur_bytes,
+                                  qiov, bytes_done, l2meta);
+        l2meta = NULL; /* l2meta is consumed by qcow2_co_do_pwritev() */
+        if (ret < 0) {
+            goto fail_nometa;
         }
 
         qemu_co_mutex_lock(&s->lock);
 
-        ret = qcow2_handle_l2meta(bs, &l2meta, true);
-        if (ret) {
-            goto fail;
-        }
-
         bytes -= cur_bytes;
         offset += cur_bytes;
         bytes_done += cur_bytes;
@@ -2331,6 +2363,8 @@ fail:
 
     qemu_co_mutex_unlock(&s->lock);
 
+fail_nometa:
+
     qemu_iovec_destroy(&hd_qiov);
     qemu_vfree(cluster_data);
     trace_qcow2_writev_done_req(qemu_coroutine_self(), ret);
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Qemu-devel] [PATCH 6/7] qcow2: refactor qcow2_co_pwritev locals scope
  2018-08-07 17:43 [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Vladimir Sementsov-Ogievskiy
                   ` (4 preceding siblings ...)
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 5/7] qcow2: refactor qcow2_co_pwritev: split out qcow2_co_do_pwritev Vladimir Sementsov-Ogievskiy
@ 2018-08-07 17:43 ` Vladimir Sementsov-Ogievskiy
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 7/7] qcow2: async scheme for qcow2_co_pwritev Vladimir Sementsov-Ogievskiy
  2018-08-16  0:51 ` [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Max Reitz
  7 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-08-07 17:43 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, vsementsov, den

Move the local variables that are only used within an individual loop
iteration into the while block.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 4d669432d1..88b3fb8080 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2294,10 +2294,7 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                          int flags)
 {
     BDRVQcow2State *s = bs->opaque;
-    int offset_in_cluster;
     int ret;
-    unsigned int cur_bytes; /* number of sectors in current iteration */
-    uint64_t cluster_offset;
     QEMUIOVector hd_qiov;
     uint64_t bytes_done = 0;
     uint8_t *cluster_data = NULL;
@@ -2312,12 +2309,14 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
     qemu_co_mutex_lock(&s->lock);
 
     while (bytes != 0) {
+        int offset_in_cluster = offset_into_cluster(s, offset);
+        unsigned int cur_bytes = MIN(bytes, INT_MAX); /* number of bytes in
+                                                         current iteration */
+        uint64_t cluster_offset;
 
         l2meta = NULL;
 
         trace_qcow2_writev_start_part(qemu_coroutine_self());
-        offset_in_cluster = offset_into_cluster(s, offset);
-        cur_bytes = MIN(bytes, INT_MAX);
         if (bs->encrypted) {
             cur_bytes = MIN(cur_bytes,
                             QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size
-- 
2.11.1


* [Qemu-devel] [PATCH 7/7] qcow2: async scheme for qcow2_co_pwritev
  2018-08-07 17:43 [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Vladimir Sementsov-Ogievskiy
                   ` (5 preceding siblings ...)
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 6/7] qcow2: refactor qcow2_co_pwritev locals scope Vladimir Sementsov-Ogievskiy
@ 2018-08-07 17:43 ` Vladimir Sementsov-Ogievskiy
       [not found]   ` <1c8299bf-0b31-82a7-c7c4-5069581f2d94@redhat.com>
  2018-08-16  0:51 ` [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Max Reitz
  7 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-08-07 17:43 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, vsementsov, den

Start several asynchronous requests instead of writing chunk by chunk.

The output of iotest 026 changes because the error path behaves
differently with asynchronous I/O.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.c                      | 47 ++++++++++++++++++++++++++++++--------
 tests/qemu-iotests/026.out         | 18 ++++++++-------
 tests/qemu-iotests/026.out.nocache | 20 ++++++++--------
 3 files changed, 59 insertions(+), 26 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 88b3fb8080..c7501adfc6 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1874,6 +1874,7 @@ typedef struct Qcow2WorkerTask {
     uint64_t offset;
     uint64_t bytes;
     uint64_t bytes_done;
+    QCowL2Meta *l2meta;
 } Qcow2WorkerTask;
 
 typedef int (*Qcow2DoWorkFunc)(BlockDriverState *bs, QEMUIOVector *qiov,
@@ -1965,11 +1966,16 @@ static coroutine_fn void qcow2_rw_worker(void *opaque)
     }
 }
 
+/* qcow2_rws_add_task
+ * l2meta is an opaque pointer for do_work_func; how it is interpreted
+ * depends on the do_work_func specified in qcow2_init_rws.
+ */
 static coroutine_fn void qcow2_rws_add_task(Qcow2RWState *rws,
                                             uint64_t file_cluster_offset,
                                             uint64_t offset,
                                             uint64_t bytes,
-                                            uint64_t bytes_done)
+                                            uint64_t bytes_done,
+                                            QCowL2Meta *l2meta)
 {
     Qcow2Worker *w;
 
@@ -1980,7 +1986,8 @@ static coroutine_fn void qcow2_rws_add_task(Qcow2RWState *rws,
             .file_cluster_offset = file_cluster_offset,
             .offset = offset,
             .bytes = bytes,
-            .bytes_done = bytes_done
+            .bytes_done = bytes_done,
+            .l2meta = l2meta
         };
         rws->ret = rws->do_work_func(rws->bs, rws->qiov, &task);
         return;
@@ -2006,6 +2013,7 @@ static coroutine_fn void qcow2_rws_add_task(Qcow2RWState *rws,
     w->task.offset = offset;
     w->task.bytes = bytes;
     w->task.bytes_done = bytes_done;
+    w->task.l2meta = l2meta;
 
     qemu_coroutine_enter(w->co);
 }
@@ -2137,7 +2145,7 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
             qemu_co_mutex_unlock(&s->lock);
 
             qcow2_rws_add_task(&rws, cluster_offset, offset, cur_bytes,
-                               bytes_done);
+                               bytes_done, NULL);
             if (rws.ret < 0) {
                 ret = rws.ret;
                 goto fail_nolock;
@@ -2289,6 +2297,19 @@ fail:
     return ret;
 }
 
+static coroutine_fn int qcow2_co_do_pwritev_task(BlockDriverState *bs,
+                                                 QEMUIOVector *qiov,
+                                                 Qcow2WorkerTask *task)
+{
+    return qcow2_co_do_pwritev(bs,
+                               task->file_cluster_offset,
+                               task->offset,
+                               task->bytes,
+                               qiov,
+                               task->bytes_done,
+                               task->l2meta);
+}
+
 static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -2299,16 +2320,19 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
     uint64_t bytes_done = 0;
     uint8_t *cluster_data = NULL;
     QCowL2Meta *l2meta = NULL;
+    Qcow2RWState rws = {0};
 
     trace_qcow2_writev_start_req(qemu_coroutine_self(), offset, bytes);
 
+    qcow2_init_rws(&rws, bs, qiov, bytes, qcow2_co_do_pwritev_task);
+
     qemu_iovec_init(&hd_qiov, qiov->niov);
 
     s->cluster_cache_offset = -1; /* disable compressed cache */
 
     qemu_co_mutex_lock(&s->lock);
 
-    while (bytes != 0) {
+    while (bytes != 0 && rws.ret == 0) {
         int offset_in_cluster = offset_into_cluster(s, offset);
         unsigned int cur_bytes = MIN(bytes, INT_MAX); /* number of bytes in
                                                          current iteration */
@@ -2340,11 +2364,11 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
 
         qemu_co_mutex_unlock(&s->lock);
 
-
-        ret = qcow2_co_do_pwritev(bs, cluster_offset, offset, cur_bytes,
-                                  qiov, bytes_done, l2meta);
-        l2meta = NULL; /* l2meta is consumed by qcow2_co_do_pwritev() */
-        if (ret < 0) {
+        qcow2_rws_add_task(&rws, cluster_offset, offset, cur_bytes, bytes_done,
+                           l2meta);
+        l2meta = NULL; /* l2meta is consumed */
+        if (rws.ret < 0) {
+            ret = rws.ret;
             goto fail_nometa;
         }
 
@@ -2362,6 +2386,11 @@ fail:
 
     qemu_co_mutex_unlock(&s->lock);
 
+    qcow2_finalize_rws(&rws);
+    if (ret == 0) {
+        ret = rws.ret;
+    }
+
 fail_nometa:
 
     qemu_iovec_destroy(&hd_qiov);
diff --git a/tests/qemu-iotests/026.out b/tests/qemu-iotests/026.out
index dd10a82b51..8dd19f1b53 100644
--- a/tests/qemu-iotests/026.out
+++ b/tests/qemu-iotests/026.out
@@ -481,7 +481,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-55 leaked clusters were found on the image.
+119 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -508,7 +508,9 @@ Event: refblock_alloc_write; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
-No errors were found on the image.
+
+64 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: off; write -b
@@ -533,7 +535,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-10 leaked clusters were found on the image.
+74 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -542,7 +544,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-23 leaked clusters were found on the image.
+87 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -561,7 +563,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-10 leaked clusters were found on the image.
+74 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -570,7 +572,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-23 leaked clusters were found on the image.
+87 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -589,7 +591,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-10 leaked clusters were found on the image.
+74 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -598,7 +600,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-23 leaked clusters were found on the image.
+87 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 
 === L1 growth tests ===
diff --git a/tests/qemu-iotests/026.out.nocache b/tests/qemu-iotests/026.out.nocache
index 1ca6cda15c..10218c1aa6 100644
--- a/tests/qemu-iotests/026.out.nocache
+++ b/tests/qemu-iotests/026.out.nocache
@@ -489,7 +489,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-55 leaked clusters were found on the image.
+119 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -516,8 +516,10 @@ Event: refblock_alloc_write; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
-No errors were found on the image.
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
+
+64 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_alloc_write; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
@@ -541,7 +543,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-10 leaked clusters were found on the image.
+74 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -550,7 +552,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-23 leaked clusters were found on the image.
+87 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -569,7 +571,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-10 leaked clusters were found on the image.
+74 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -578,7 +580,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-23 leaked clusters were found on the image.
+87 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -597,7 +599,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-10 leaked clusters were found on the image.
+74 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 
 
@@ -606,7 +608,7 @@ Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-23 leaked clusters were found on the image.
+87 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 
 === L1 growth tests ===
-- 
2.11.1


* Re: [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io
  2018-08-07 17:43 [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Vladimir Sementsov-Ogievskiy
                   ` (6 preceding siblings ...)
  2018-08-07 17:43 ` [Qemu-devel] [PATCH 7/7] qcow2: async scheme for qcow2_co_pwritev Vladimir Sementsov-Ogievskiy
@ 2018-08-16  0:51 ` Max Reitz
  2018-08-16 13:58   ` Vladimir Sementsov-Ogievskiy
  7 siblings, 1 reply; 30+ messages in thread
From: Max Reitz @ 2018-08-16  0:51 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, den


On 2018-08-07 19:43, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> Here is an asynchronous scheme for handling fragmented qcow2
> reads and writes. Both qcow2 read and write functions loops through
> sequential portions of data. The series aim it to parallelize these
> loops iterations.
> 
> It improves performance for fragmented qcow2 images, I've tested it
> as follows:
> 
> I have four 4G qcow2 images (with default 64k block size) on my ssd disk:
> t-seq.qcow2 - sequentially written qcow2 image
> t-reverse.qcow2 - filled by writing 64k portions from end to the start
> t-rand.qcow2 - filled by writing 64k portions (aligned) in random order
> t-part-rand.qcow2 - filled by shuffling order of 64k writes in 1m clusters
> (see source code of image generation in the end for details)
> 
> and the test (sequential io by 1mb chunks):
> 
> test write:
>     for t in /ssd/t-*; \
>         do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
>         ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w $t; \
>     done
> 
> test read (same, just drop -w parameter):
>     for t in /ssd/t-*; \
>         do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
>         ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
>     done
> 
> short info about parameters:
>   -w - do writes (otherwise do reads)
>   -c - count of blocks
>   -s - block size
>   -t none - disable cache
>   -n - native aio
>   -d 1 - don't use parallel requests provided by qemu-img bench itself

Hm, actually, why not?  And how does a guest behave?

If parallel requests on an SSD perform better, wouldn't a guest issue
parallel requests to the virtual device and thus to qcow2 anyway?

(I suppose the global qcow2 lock could be an issue here, but then your
benchmark should work even without -d 1.)

Max




* Re: [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io
  2018-08-16  0:51 ` [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Max Reitz
@ 2018-08-16 13:58   ` Vladimir Sementsov-Ogievskiy
  2018-08-17 19:34     ` Max Reitz
  0 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-08-16 13:58 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, den

16.08.2018 03:51, Max Reitz wrote:
> On 2018-08-07 19:43, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all!
>>
>> Here is an asynchronous scheme for handling fragmented qcow2
>> reads and writes. Both qcow2 read and write functions loops through
>> sequential portions of data. The series aim it to parallelize these
>> loops iterations.
>>
>> It improves performance for fragmented qcow2 images, I've tested it
>> as follows:
>>
>> I have four 4G qcow2 images (with default 64k block size) on my ssd disk:
>> t-seq.qcow2 - sequentially written qcow2 image
>> t-reverse.qcow2 - filled by writing 64k portions from end to the start
>> t-rand.qcow2 - filled by writing 64k portions (aligned) in random order
>> t-part-rand.qcow2 - filled by shuffling order of 64k writes in 1m clusters
>> (see source code of image generation in the end for details)
>>
>> and the test (sequential io by 1mb chunks):
>>
>> test write:
>>      for t in /ssd/t-*; \
>>          do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
>>          ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w $t; \
>>      done
>>
>> test read (same, just drop -w parameter):
>>      for t in /ssd/t-*; \
>>          do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
>>          ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
>>      done
>>
>> short info about parameters:
>>    -w - do writes (otherwise do reads)
>>    -c - count of blocks
>>    -s - block size
>>    -t none - disable cache
>>    -n - native aio
>>    -d 1 - don't use parallel requests provided by qemu-img bench itself
> Hm, actually, why not?  And how does a guest behave?
>
> If parallel requests on an SSD perform better, wouldn't a guest issue
> parallel requests to the virtual device and thus to qcow2 anyway?

The guest knows nothing about qcow2 fragmentation, so this kind of
"asynchronization" can be done only at the qcow2 level.
However, if the guest does async I/O and sends a lot of parallel
requests, it behaves like qemu-img without the -d 1 option, and in
that case parallel loop iterations in qcow2 matter less. Still, I
think async parallel requests are better in general than sequential
ones, because if the device has some unused capacity for parallelism,
it will be utilized. We already use this approach in mirror and
qemu-img convert. In Virtuozzo we also have a backup improved by
parallelizing its request loop. I think it would be good to have some
general code for such things in the future.

>
> (I suppose the global qcow2 lock could be an issue here, but then your
> benchmark should work even without -d 1.)
>
> Max
>


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io
  2018-08-16 13:58   ` Vladimir Sementsov-Ogievskiy
@ 2018-08-17 19:34     ` Max Reitz
  2018-08-17 19:43       ` Denis V. Lunev
  2018-08-20 16:33       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 2 replies; 30+ messages in thread
From: Max Reitz @ 2018-08-17 19:34 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, den


On 2018-08-16 15:58, Vladimir Sementsov-Ogievskiy wrote:
> 16.08.2018 03:51, Max Reitz wrote:
>> On 2018-08-07 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> Here is an asynchronous scheme for handling fragmented qcow2
>>> reads and writes. Both qcow2 read and write functions loops through
>>> sequential portions of data. The series aim it to parallelize these
>>> loops iterations.
>>>
>>> It improves performance for fragmented qcow2 images, I've tested it
>>> as follows:
>>>
>>> I have four 4G qcow2 images (with default 64k block size) on my ssd disk:
>>> t-seq.qcow2 - sequentially written qcow2 image
>>> t-reverse.qcow2 - filled by writing 64k portions from end to the start
>>> t-rand.qcow2 - filled by writing 64k portions (aligned) in random order
>>> t-part-rand.qcow2 - filled by shuffling order of 64k writes in 1m clusters
>>> (see source code of image generation in the end for details)
>>>
>>> and the test (sequential io by 1mb chunks):
>>>
>>> test write:
>>>     for t in /ssd/t-*; \
>>>         do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
>>>         ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w $t; \
>>>     done
>>>
>>> test read (same, just drop -w parameter):
>>>     for t in /ssd/t-*; \
>>>         do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
>>>         ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
>>>     done
>>>
>>> short info about parameters:
>>>   -w - do writes (otherwise do reads)
>>>   -c - count of blocks
>>>   -s - block size
>>>   -t none - disable cache
>>>   -n - native aio
>>>   -d 1 - don't use parallel requests provided by qemu-img bench itself
>> Hm, actually, why not?  And how does a guest behave?
>>
>> If parallel requests on an SSD perform better, wouldn't a guest issue
>> parallel requests to the virtual device and thus to qcow2 anyway?
> 
> The guest knows nothing about qcow2 fragmentation, so this kind of
> "asynchronization" can be done only at the qcow2 level.

Hm, yes.  I'm sorry, but without having looked closer at the series
(which is why I'm sorry in advance), I would suspect that the
performance improvement comes from us being able to send parallel
requests to an SSD.

So if you send large requests to an SSD, you may either send them in
parallel or sequentially, it doesn't matter.  But for small requests,
it's better to send them in parallel so the SSD always has requests in
its queue.

I would think this is where the performance improvement comes from.  But
I would also think that a guest OS knows this and it would also send
many requests in parallel so the virtual block device never runs out of
requests.

> However, if the guest does async I/O and sends a lot of parallel
> requests, it behaves like qemu-img without the -d 1 option, and in
> that case parallel loop iterations in qcow2 matter less. Still, I
> think async parallel requests are better in general than sequential
> ones, because if the device has some unused capacity for parallelism,
> it will be utilized.

I agree that it probably doesn't make things worse performance-wise, but
it's always added complexity (see the diffstat), which is why I'm just
routinely asking how useful it is in practice. :-)

Anyway, I suspect there are indeed cases where a guest doesn't send many
requests in parallel but it makes sense for the qcow2 driver to
parallelize it.  That would be mainly when the guest reads seemingly
sequential data that is then fragmented in the qcow2 file.  So basically
what your benchmark is testing. :-)

Then, the guest could assume that there is no sense in parallelizing it
because the latency from the device is large enough, whereas in qemu
itself we always run dry and wait for different parts of the single
large request to finish.  So, yes, in that case, parallelization that's
internal to qcow2 would make sense.

Now another question is, does this negatively impact devices where
seeking is slow, i.e. HDDs?  Unfortunately I'm not home right now, so I
don't have access to an HDD to test myself...

> We already
> use this approach in mirror and qemu-img convert.

Indeed, but here you could always argue that this is just what guests
do, so we should, too.

> In Virtuozzo we also have
> a backup improved by parallelizing its request loop. I think it would
> be good to have some general code for such things in the future.
Well, those are different things, I'd think.  Parallelization in
mirror/backup/convert is useful not just because of qcow2 issues, but
also because you have a volume to read from and a volume to write to, so
that's where parallelization gives you some pipelining.  And it gives
you buffers for latency spikes, I guess.

Max




* Re: [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io
  2018-08-17 19:34     ` Max Reitz
@ 2018-08-17 19:43       ` Denis V. Lunev
  2018-08-20 16:33       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 30+ messages in thread
From: Denis V. Lunev @ 2018-08-17 19:43 UTC (permalink / raw)
  To: Max Reitz, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf

On 08/17/2018 10:34 PM, Max Reitz wrote:
> On 2018-08-16 15:58, Vladimir Sementsov-Ogievskiy wrote:
>> 16.08.2018 03:51, Max Reitz wrote:
>>> On 2018-08-07 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> Here is an asynchronous scheme for handling fragmented qcow2
>>>> reads and writes. Both qcow2 read and write functions loops through
>>>> sequential portions of data. The series aim it to parallelize these
>>>> loops iterations.
>>>>
>>>> It improves performance for fragmented qcow2 images, I've tested it
>>>> as follows:
>>>>
>>>> I have four 4G qcow2 images (with default 64k block size) on my ssd disk:
>>>> t-seq.qcow2 - sequentially written qcow2 image
>>>> t-reverse.qcow2 - filled by writing 64k portions from end to the start
>>>> t-rand.qcow2 - filled by writing 64k portions (aligned) in random order
>>>> t-part-rand.qcow2 - filled by shuffling order of 64k writes in 1m clusters
>>>> (see source code of image generation in the end for details)
>>>>
>>>> and the test (sequential io by 1mb chunks):
>>>>
>>>> test write:
>>>>     for t in /ssd/t-*; \
>>>>         do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
>>>>         ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w $t; \
>>>>     done
>>>>
>>>> test read (same, just drop -w parameter):
>>>>     for t in /ssd/t-*; \
>>>>         do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
>>>>         ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
>>>>     done
>>>>
>>>> short info about parameters:
>>>>   -w - do writes (otherwise do reads)
>>>>   -c - count of blocks
>>>>   -s - block size
>>>>   -t none - disable cache
>>>>   -n - native aio
>>>>   -d 1 - don't use parallel requests provided by qemu-img bench itself
>>> Hm, actually, why not?  And how does a guest behave?
>>>
>>> If parallel requests on an SSD perform better, wouldn't a guest issue
>>> parallel requests to the virtual device and thus to qcow2 anyway?
>> The guest knows nothing about qcow2 fragmentation, so this kind of
>> "asynchronization" can be done only at the qcow2 level.
> Hm, yes.  I'm sorry, but without having looked closer at the series
> (which is why I'm sorry in advance), I would suspect that the
> performance improvement comes from us being able to send parallel
> requests to an SSD.
>
> So if you send large requests to an SSD, you may either send them in
> parallel or sequentially, it doesn't matter.  But for small requests,
> it's better to send them in parallel so the SSD always has requests in
> its queue.
>
> I would think this is where the performance improvement comes from.  But
> I would also think that a guest OS knows this and it would also send
> many requests in parallel so the virtual block device never runs out of
> requests.
>
>> However, if the guest does async I/O and sends a lot of parallel
>> requests, it behaves like qemu-img without the -d 1 option, and in
>> that case parallel loop iterations in qcow2 matter less. Still, I
>> think async parallel requests are better in general than sequential
>> ones, because if the device has some unused capacity for parallelism,
>> it will be utilized.
> I agree that it probably doesn't make things worse performance-wise, but
> it's always added complexity (see the diffstat), which is why I'm just
> routinely asking how useful it is in practice. :-)
>
> Anyway, I suspect there are indeed cases where a guest doesn't send many
> requests in parallel but it makes sense for the qcow2 driver to
> parallelize it.  That would be mainly when the guest reads seemingly
> sequential data that is then fragmented in the qcow2 file.  So basically
> what your benchmark is testing. :-)
>
> Then, the guest could assume that there is no sense in parallelizing it
> because the latency from the device is large enough, whereas in qemu
> itself we always run dry and wait for different parts of the single
> large request to finish.  So, yes, in that case, parallelization that's
> internal to qcow2 would make sense.
>
> Now another question is, does this negatively impact devices where
> seeking is slow, i.e. HDDs?  Unfortunately I'm not home right now, so I
> don't have access to an HDD to test myself...
There are different situations and different load patterns. For example, there
are situations when the guest executes sequential reads in a single thread.
This looks obvious and dumb, but it is certainly possible in real life. There
is also the observation that Windows guests prefer long requests: it is not
unusual to observe 4 MB requests in a pipeline.

Thus for such a load on a scattered file the performance difference should be
very big, even on an SSD, because without this change the SSD will starve for
requests.

Here we are speaking in terms of latency, which will definitely be higher in
the sequential case.

Den

>> We already use this approach in mirror and qemu-img convert.
> Indeed, but here you could always argue that this is just what guests
> do, so we should, too.
>
>> In Virtuozzo we have backup improved by a parallelized request loop too. I
>> think it would be good to have some general code for such things in the
>> future.
> Well, those are different things, I'd think.  Parallelization in
> mirror/backup/convert is useful not just because of qcow2 issues, but
> also because you have a volume to read from and a volume to write to, so
> that's where parallelization gives you some pipelining.  And it gives
> you buffers for latency spikes, I guess.
>
> Max
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io
  2018-08-17 19:34     ` Max Reitz
  2018-08-17 19:43       ` Denis V. Lunev
@ 2018-08-20 16:33       ` Vladimir Sementsov-Ogievskiy
  2018-08-20 16:39         ` Max Reitz
  1 sibling, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-08-20 16:33 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, den

17.08.2018 22:34, Max Reitz wrote:
> On 2018-08-16 15:58, Vladimir Sementsov-Ogievskiy wrote:
>> 16.08.2018 03:51, Max Reitz wrote:
>>> On 2018-08-07 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> Here is an asynchronous scheme for handling fragmented qcow2
>>>> reads and writes. Both the qcow2 read and write functions loop through
>>>> sequential portions of data. The series aims to parallelize these
>>>> loop iterations.
>>>>
>>>> It improves performance for fragmented qcow2 images; I've tested it
>>>> as follows:
>>>>
>>>> I have four 4G qcow2 images (with default 64k block size) on my ssd disk:
>>>> t-seq.qcow2 - sequentially written qcow2 image
>>>> t-reverse.qcow2 - filled by writing 64k portions from end to the start
>>>> t-rand.qcow2 - filled by writing 64k portions (aligned) in random order
>>>> t-part-rand.qcow2 - filled by shuffling order of 64k writes in 1m clusters
>>>> (see source code of image generation in the end for details)
>>>>
>>>> and the test (sequential io by 1mb chunks):
>>>>
>>>> test write:
>>>>      for t in /ssd/t-*; \
>>>>          do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
>>>>          ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w $t; \
>>>>      done
>>>>
>>>> test read (same, just drop -w parameter):
>>>>      for t in /ssd/t-*; \
>>>>          do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t  ===; \
>>>>          ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
>>>>      done
>>>>
>>>> short info about parameters:
>>>>    -w - do writes (otherwise do reads)
>>>>    -c - count of blocks
>>>>    -s - block size
>>>>    -t none - disable cache
>>>>    -n - native aio
>>>>    -d 1 - don't use parallel requests provided by qemu-img bench itself
>>> Hm, actually, why not?  And how does a guest behave?
>>>
>>> If parallel requests on an SSD perform better, wouldn't a guest issue
>>> parallel requests to the virtual device and thus to qcow2 anyway?
>> Guest knows nothing about qcow2 fragmentation, so this kind of
>> "asynchronization" could be done only at qcow2 level.
> Hm, yes.  I'm sorry, but without having looked closer at the series
> (which is why I'm sorry in advance), I would suspect that the
> performance improvement comes from us being able to send parallel
> requests to an SSD.
>
> So if you send large requests to an SSD, you may either send them in
> parallel or sequentially, it doesn't matter.  But for small requests,
> it's better to send them in parallel so the SSD always has requests in
> its queue.
>
> I would think this is where the performance improvement comes from.  But
> I would also think that a guest OS knows this and it would also send
> many requests in parallel so the virtual block device never runs out of
> requests.
>
>> However, if the guest does async I/O and sends a lot of parallel requests,
>> it behaves like qemu-img without the -d 1 option, and in that case parallel
>> loop iterations in qcow2 don't make as much sense. However, I think that
>> async parallel requests are in general better than sequential ones, because
>> if the device has some unused opportunity for parallelization, it will be
>> utilized.
> I agree that it probably doesn't make things worse performance-wise, but
> it's always added complexity (see the diffstat), which is why I'm just
> routinely asking how useful it is in practice. :-)
>
> Anyway, I suspect there are indeed cases where a guest doesn't send many
> requests in parallel but it makes sense for the qcow2 driver to
> parallelize it.  That would be mainly when the guest reads seemingly
> sequential data that is then fragmented in the qcow2 file.  So basically
> what your benchmark is testing. :-)
>
> Then, the guest could assume that there is no sense in parallelizing it
> because the latency from the device is large enough, whereas in qemu
> itself we always run dry and wait for different parts of the single
> large request to finish.  So, yes, in that case, parallelization that's
> internal to qcow2 would make sense.
>
> Now another question is, does this negatively impact devices where
> seeking is slow, i.e. HDDs?  Unfortunately I'm not home right now, so I
> don't have access to an HDD to test myself...


hdd:

+-----------+-----------+----------+-----------+----------+
|   file    | wr before | wr after | rd before | rd after |
+-----------+-----------+----------+-----------+----------+
| seq       |    39.821 |   40.513 |    38.600 |   38.916 |
| reverse   |    60.320 |   57.902 |    98.223 |  111.717 |
| rand      |   614.826 |  580.452 |   672.600 |  465.120 |
| part-rand |    52.311 |   52.450 |    37.663 |   37.989 |
+-----------+-----------+----------+-----------+----------+

Hmm, a 10% degradation in the "reverse" read case, strange magic. However,
the reverse pattern is nearly impossible in practice.


>
>> We already use this approach in mirror and qemu-img convert.
> Indeed, but here you could always argue that this is just what guests
> do, so we should, too.
>
>> In Virtuozzo we have backup improved by a parallelized request loop too. I
>> think it would be good to have some general code for such things in the
>> future.
> Well, those are different things, I'd think.  Parallelization in
> mirror/backup/convert is useful not just because of qcow2 issues, but
> also because you have a volume to read from and a volume to write to, so
> that's where parallelization gives you some pipelining.  And it gives
> you buffers for latency spikes, I guess.
>
> Max
>


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io
  2018-08-20 16:33       ` Vladimir Sementsov-Ogievskiy
@ 2018-08-20 16:39         ` Max Reitz
  0 siblings, 0 replies; 30+ messages in thread
From: Max Reitz @ 2018-08-20 16:39 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, den


On 2018-08-20 18:33, Vladimir Sementsov-Ogievskiy wrote:
> 17.08.2018 22:34, Max Reitz wrote:
>> On 2018-08-16 15:58, Vladimir Sementsov-Ogievskiy wrote:
>>> 16.08.2018 03:51, Max Reitz wrote:
>>>> On 2018-08-07 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Hi all!
>>>>>
>>>>> Here is an asynchronous scheme for handling fragmented qcow2
>>>>> reads and writes. Both the qcow2 read and write functions loop through
>>>>> sequential portions of data. The series aims to parallelize these
>>>>> loop iterations.
>>>>>
>>>>> It improves performance for fragmented qcow2 images; I've tested it
>>>>> as follows:
>>>>>
>>>>> I have four 4G qcow2 images (with default 64k block size) on my ssd
>>>>> disk:
>>>>> t-seq.qcow2 - sequentially written qcow2 image
>>>>> t-reverse.qcow2 - filled by writing 64k portions from end to the start
>>>>> t-rand.qcow2 - filled by writing 64k portions (aligned) in random
>>>>> order
>>>>> t-part-rand.qcow2 - filled by shuffling order of 64k writes in 1m
>>>>> clusters
>>>>> (see source code of image generation in the end for details)
>>>>>
>>>>> and the test (sequential io by 1mb chunks):
>>>>>
>>>>> test write:
>>>>>      for t in /ssd/t-*; \
>>>>>          do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t 
>>>>> ===; \
>>>>>          ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w
>>>>> $t; \
>>>>>      done
>>>>>
>>>>> test read (same, just drop -w parameter):
>>>>>      for t in /ssd/t-*; \
>>>>>          do sync; echo 1 > /proc/sys/vm/drop_caches; echo ===  $t 
>>>>> ===; \
>>>>>          ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
>>>>>      done
>>>>>
>>>>> short info about parameters:
>>>>>    -w - do writes (otherwise do reads)
>>>>>    -c - count of blocks
>>>>>    -s - block size
>>>>>    -t none - disable cache
>>>>>    -n - native aio
>>>>>    -d 1 - don't use parallel requests provided by qemu-img bench
>>>>> itself
>>>> Hm, actually, why not?  And how does a guest behave?
>>>>
>>>> If parallel requests on an SSD perform better, wouldn't a guest issue
>>>> parallel requests to the virtual device and thus to qcow2 anyway?
>>> Guest knows nothing about qcow2 fragmentation, so this kind of
>>> "asynchronization" could be done only at qcow2 level.
>> Hm, yes.  I'm sorry, but without having looked closer at the series
>> (which is why I'm sorry in advance), I would suspect that the
>> performance improvement comes from us being able to send parallel
>> requests to an SSD.
>>
>> So if you send large requests to an SSD, you may either send them in
>> parallel or sequentially, it doesn't matter.  But for small requests,
>> it's better to send them in parallel so the SSD always has requests in
>> its queue.
>>
>> I would think this is where the performance improvement comes from.  But
>> I would also think that a guest OS knows this and it would also send
>> many requests in parallel so the virtual block device never runs out of
>> requests.
>>
>>> However, if the guest does async I/O and sends a lot of parallel requests,
>>> it behaves like qemu-img without the -d 1 option, and in that case parallel
>>> loop iterations in qcow2 don't make as much sense. However, I think that
>>> async parallel requests are in general better than sequential ones, because
>>> if the device has some unused opportunity for parallelization, it will be
>>> utilized.
>> I agree that it probably doesn't make things worse performance-wise, but
>> it's always added complexity (see the diffstat), which is why I'm just
>> routinely asking how useful it is in practice. :-)
>>
>> Anyway, I suspect there are indeed cases where a guest doesn't send many
>> requests in parallel but it makes sense for the qcow2 driver to
>> parallelize it.  That would be mainly when the guest reads seemingly
>> sequential data that is then fragmented in the qcow2 file.  So basically
>> what your benchmark is testing. :-)
>>
>> Then, the guest could assume that there is no sense in parallelizing it
>> because the latency from the device is large enough, whereas in qemu
>> itself we always run dry and wait for different parts of the single
>> large request to finish.  So, yes, in that case, parallelization that's
>> internal to qcow2 would make sense.
>>
>> Now another question is, does this negatively impact devices where
>> seeking is slow, i.e. HDDs?  Unfortunately I'm not home right now, so I
>> don't have access to an HDD to test myself...
> 
> 
> hdd:
> 
> +-----------+-----------+----------+-----------+----------+
> |   file    | wr before | wr after | rd before | rd after |
> +-----------+-----------+----------+-----------+----------+
> | seq       |    39.821 |   40.513 |    38.600 |   38.916 |
> | reverse   |    60.320 |   57.902 |    98.223 |  111.717 |
> | rand      |   614.826 |  580.452 |   672.600 |  465.120 |
> | part-rand |    52.311 |   52.450 |    37.663 |   37.989 |
> +-----------+-----------+----------+-----------+----------+
> 
> Hmm, a 10% degradation in the "reverse" read case, strange magic. However,
> the reverse pattern is nearly impossible in practice.

I tend to agree.  It's faster for random, and that's what matters more.

(Distinguishing between the cases in qcow2 seems like not so good of an
idea, and making it user-configurable is probably pointless because
noone will change the default.)

Max




* Re: [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv
       [not found]   ` <6e19aaeb-8acc-beb9-5ece-9ae6101637a9@redhat.com>
@ 2018-10-01 15:14     ` Vladimir Sementsov-Ogievskiy
  2018-10-01 15:39       ` Max Reitz
  2018-11-01 12:17     ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-10-01 15:14 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, den

27.09.2018 20:35, Max Reitz wrote:
> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>> Memory allocation may become less efficient for encrypted case. It's a
>> payment for further asynchronous scheme.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
>>   1 file changed, 64 insertions(+), 50 deletions(-)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 65e3ca00e2..5e7f2ee318 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -1808,6 +1808,67 @@ out:
>>       return ret;
>>   }
>>   
>> +static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
>> +                                               uint64_t file_cluster_offset,
>> +                                               uint64_t offset,
>> +                                               uint64_t bytes,
>> +                                               QEMUIOVector *qiov,
>> +                                               uint64_t qiov_offset)
>> +{
>> +    int ret;
>> +    BDRVQcow2State *s = bs->opaque;
>> +    void *crypt_buf = NULL;
>> +    QEMUIOVector hd_qiov;
>> +    int offset_in_cluster = offset_into_cluster(s, offset);
>> +
>> +    if ((file_cluster_offset & 511) != 0) {
>> +        return -EIO;
>> +    }
>> +
>> +    qemu_iovec_init(&hd_qiov, qiov->niov);
> So you're not just re-allocating the bounce buffer for every single
> call, but also this I/O vector.  Hm.
>
>> +    if (bs->encrypted) {
>> +        assert(s->crypto);
>> +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>> +
>> +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
> 1. Why did you remove the comment?
>
> 2. The check for whether crypt_buf was successfully allocated is missing.
>
> 3. Do you have any benchmarks for encrypted images?  Having to allocate
> the bounce buffer for potentially every single cluster sounds rather bad
> to me.

Hmm, no excuses. 1 - will fix; 2 - will fix; 3 - will fix, or at least
test the performance.

>
>> +        qemu_iovec_add(&hd_qiov, crypt_buf, bytes);
>> +    } else {
>> +        qemu_iovec_concat(&hd_qiov, qiov, qiov_offset, bytes);
> qcow2_co_preadv() continues to do this as well.  That's superfluous now.

hd_qiov is local here.. or what do you mean?

>
>> +    }
>> +
>> +    BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
>> +    ret = bdrv_co_preadv(bs->file,
>> +                         file_cluster_offset + offset_in_cluster,
>> +                         bytes, &hd_qiov, 0);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    if (bs->encrypted) {
>> +        assert(s->crypto);
>> +        assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
>> +        assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
>> +        if (qcrypto_block_decrypt(s->crypto,
>> +                                  (s->crypt_physical_offset ?
>> +                                   file_cluster_offset + offset_in_cluster :
>> +                                   offset),
>> +                                  crypt_buf,
>> +                                  bytes,
>> +                                  NULL) < 0) {
> What's the reason against decrypting this in-place in qiov so we could
> save the bounce buffer?  We allow offsets in clusters, so why can't we
> just call this function once per involved I/O vector entry?

Isn't it unsafe to do decryption in guest buffers? (The guest could observe
the data mid-decryption, or modify the buffer while we decrypt it.)

>
> Max
>
>> +            ret = -EIO;
>> +            goto out;
>> +        }
>> +        qemu_iovec_from_buf(qiov, qiov_offset, crypt_buf, bytes);
>> +    }
>> +
>> +out:
>> +    qemu_vfree(crypt_buf);
>> +    qemu_iovec_destroy(&hd_qiov);
>> +
>> +    return ret;
>> +}
>> +
>>   static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>                                           uint64_t bytes, QEMUIOVector *qiov,
>>                                           int flags)
>> @@ -1819,7 +1880,6 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>       uint64_t cluster_offset = 0;
>>       uint64_t bytes_done = 0;
>>       QEMUIOVector hd_qiov;
>> -    uint8_t *cluster_data = NULL;
>>   
>>       qemu_iovec_init(&hd_qiov, qiov->niov);
>>   
>> @@ -1882,57 +1942,12 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>           case QCOW2_CLUSTER_NORMAL:
>>               qemu_co_mutex_unlock(&s->lock);
>>   
>> -            if ((cluster_offset & 511) != 0) {
>> -                ret = -EIO;
>> -                goto fail_nolock;
>> -            }
>> -
>> -            if (bs->encrypted) {
>> -                assert(s->crypto);
>> -
>> -                /*
>> -                 * For encrypted images, read everything into a temporary
>> -                 * contiguous buffer on which the AES functions can work.
>> -                 */
>> -                if (!cluster_data) {
>> -                    cluster_data =
>> -                        qemu_try_blockalign(bs->file->bs,
>> -                                            QCOW_MAX_CRYPT_CLUSTERS
>> -                                            * s->cluster_size);
>> -                    if (cluster_data == NULL) {
>> -                        ret = -ENOMEM;
>> -                        goto fail_nolock;
>> -                    }
>> -                }
>> -
>> -                assert(cur_bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>> -                qemu_iovec_reset(&hd_qiov);
>> -                qemu_iovec_add(&hd_qiov, cluster_data, cur_bytes);
>> -            }
>> -
>> -            BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
>> -            ret = bdrv_co_preadv(bs->file,
>> -                                 cluster_offset + offset_in_cluster,
>> -                                 cur_bytes, &hd_qiov, 0);
>> +            ret = qcow2_co_preadv_normal(bs, cluster_offset,
>> +                                         offset, cur_bytes, qiov, bytes_done);
>>               if (ret < 0) {
>>                   goto fail_nolock;
>>               }
>> -            if (bs->encrypted) {
>> -                assert(s->crypto);
>> -                assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
>> -                assert((cur_bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
>> -                if (qcrypto_block_decrypt(s->crypto,
>> -                                          (s->crypt_physical_offset ?
>> -                                           cluster_offset + offset_in_cluster :
>> -                                           offset),
>> -                                          cluster_data,
>> -                                          cur_bytes,
>> -                                          NULL) < 0) {
>> -                    ret = -EIO;
>> -                    goto fail_nolock;
>> -                }
>> -                qemu_iovec_from_buf(qiov, bytes_done, cluster_data, cur_bytes);
>> -            }
>> +
>>               qemu_co_mutex_lock(&s->lock);
>>               break;
>>   
>> @@ -1953,7 +1968,6 @@ fail:
>>   
>>   fail_nolock:
>>       qemu_iovec_destroy(&hd_qiov);
>> -    qemu_vfree(cluster_data);
>>   
>>       return ret;
>>   }
>>
>


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH 4/7] qcow2: async scheme for qcow2_co_preadv
       [not found]   ` <08a610aa-9c78-1c83-5e48-b93080aac87b@redhat.com>
@ 2018-10-01 15:33     ` Vladimir Sementsov-Ogievskiy
  2018-10-01 15:49       ` Max Reitz
  0 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-10-01 15:33 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, den

27.09.2018 21:35, Max Reitz wrote:
> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>> Start several async requests instead of read chunk by chunk.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/qcow2.c | 208 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 204 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 5e7f2ee318..a0df8d4e50 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -1869,6 +1869,197 @@ out:
>>       return ret;
>>   }
>>   
>> +typedef struct Qcow2WorkerTask {
>> +    uint64_t file_cluster_offset;
>> +    uint64_t offset;
>> +    uint64_t bytes;
>> +    uint64_t bytes_done;
>> +} Qcow2WorkerTask;
> Why don't you make this a union of request-specific structs?

ok, will try

>
>> +
>> +typedef int (*Qcow2DoWorkFunc)(BlockDriverState *bs, QEMUIOVector *qiov,
>> +                               Qcow2WorkerTask *task);
>> +
>> +typedef struct Qcow2RWState {
>> +    BlockDriverState *bs;
>> +    QEMUIOVector *qiov;
>> +    uint64_t bytes;
> Maybe make it total_bytes so it doesn't conflict with the value in
> Qcow2WorkerTask?

ok

>
>> +    int ret;
>> +    bool waiting_one;
>> +    bool waiting_all;
>> +    bool finalize;
>> +    Coroutine *co;
>> +    QSIMPLEQ_HEAD(, Qcow2Worker) free_workers;
>> +    QSIMPLEQ_HEAD(, Qcow2Worker) busy_workers;
>> +    int online_workers;
>> +    Qcow2DoWorkFunc do_work_func;
>> +} Qcow2RWState;
>> +
>> +typedef struct Qcow2Worker {
>> +    Qcow2RWState *rws;
>> +    Coroutine *co;
>> +    Qcow2WorkerTask task;
>> +    bool busy;
>> +    QSIMPLEQ_ENTRY(Qcow2Worker) entry;
>> +} Qcow2Worker;
>> +#define QCOW2_MAX_WORKERS 64
> That's really a bit hidden here.  I think it should go into the header.
>
> Also I'm not quite sure about the number.  In other places we've always
> used 16.
>
> (With the encryption code always allocating a new bounce buffer, this
> can mean quite a bit of memory usage.)

No doubts.

>
>> +
>> +static coroutine_fn void qcow2_rw_worker(void *opaque);
>> +static Qcow2Worker *qcow2_new_worker(Qcow2RWState *rws)
>> +{
>> +    Qcow2Worker *w = g_new0(Qcow2Worker, 1);
>> +    w->rws = rws;
>> +    w->co = qemu_coroutine_create(qcow2_rw_worker, w);
>> +
>> +    return w;
>> +}
>> +
>> +static void qcow2_free_worker(Qcow2Worker *w)
>> +{
>> +    g_free(w);
>> +}
>> +
>> +static coroutine_fn void qcow2_rw_worker(void *opaque)
>> +{
>> +    Qcow2Worker *w = opaque;
>> +    Qcow2RWState *rws = w->rws;
>> +
>> +    rws->online_workers++;
>> +
>> +    while (!rws->finalize) {
>> +        int ret = rws->do_work_func(rws->bs, rws->qiov, &w->task);
>> +        if (ret < 0 && rws->ret == 0) {
>> +            rws->ret = ret;
>> +        }
>> +
>> +        if (rws->waiting_all || rws->ret < 0) {
>> +            break;
>> +        }
>> +
>> +        w->busy = false;
>> +        QSIMPLEQ_REMOVE(&rws->busy_workers, w, Qcow2Worker, entry);
>> +        QSIMPLEQ_INSERT_TAIL(&rws->free_workers, w, entry);
>> +        if (rws->waiting_one) {
>> +            rws->waiting_one = false;
>> +            /* we must unset it here, to prevent queuing rws->co in several
>> +             * workers (it may happen if other worker already waits us on mutex,
>> +             * so it will be entered after our yield and before rws->co enter)
>> +             *
>> +             * TODO: rethink this comment, as here (and in other places in the
>> +             * file) we moved from qemu_coroutine_add_next to aio_co_wake.
>> +             */
>> +            aio_co_wake(rws->co);
>> +        }
>> +
>> +        qemu_coroutine_yield();
>> +    }
>> +
>> +    if (w->busy) {
>> +        w->busy = false;
>> +        QSIMPLEQ_REMOVE(&rws->busy_workers, w, Qcow2Worker, entry);
>> +    }
>> +    qcow2_free_worker(w);
>> +    rws->online_workers--;
>> +
>> +    if (rws->waiting_all && rws->online_workers == 0) {
>> +        aio_co_wake(rws->co);
>> +    }
>> +}
>> +
>> +static coroutine_fn void qcow2_rws_add_task(Qcow2RWState *rws,
>> +                                            uint64_t file_cluster_offset,
>> +                                            uint64_t offset,
>> +                                            uint64_t bytes,
>> +                                            uint64_t bytes_done)
> I'd propose just taking a const Qcow2WorkerTask * here.  (Makes even
> more sense if you make it a union.)

ok, I'll try this way

>
>> +{
>> +    Qcow2Worker *w;
>> +
>> +    assert(rws->co == qemu_coroutine_self());
>> +
>> +    if (bytes_done == 0 && bytes == rws->bytes) {
>> +        Qcow2WorkerTask task = {
>> +            .file_cluster_offset = file_cluster_offset,
>> +            .offset = offset,
>> +            .bytes = bytes,
>> +            .bytes_done = bytes_done
>> +        };
>> +        rws->ret = rws->do_work_func(rws->bs, rws->qiov, &task);
> (If so, you'd just pass the pointer along here)
>
>> +        return;
>> +    }
> I like this fast path, but I think it deserves a small comment.  (That
> is a fast path and bypasses the whole worker infrastructure.)
>
>> +
>> +    if (!QSIMPLEQ_EMPTY(&rws->free_workers)) {
>> +        w = QSIMPLEQ_FIRST(&rws->free_workers);
>> +        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
>> +    } else if (rws->online_workers < QCOW2_MAX_WORKERS) {
>> +        w = qcow2_new_worker(rws);
>> +    } else {
>> +        rws->waiting_one = true;
>> +        qemu_coroutine_yield();
>> +        assert(!rws->waiting_one); /* already unset by worker */
> Sometimes I hate coroutines.  OK.  So, how does the yield ensure that
> any worker is scheduled?  Doesn't yield just give control to the parent?

Hm, I don't follow. We are sure that all workers are busy: there are no free
workers, and the second condition isn't satisfied either, so we can't create
one more. So we give control to the parent, and only a worker can wake us up.

>
> Right now I think it would be clearer to me if you'd just wake all busy
> coroutines (looping over them) until one has settled.

But all workers are busy, so we should not touch them (they may have yielded
inside an I/O operation).

>
> This would also save you the aio_co_wake() in the worker itself, as
> they'd just have to yield in all cases.
>
>> +
>> +        w = QSIMPLEQ_FIRST(&rws->free_workers);
>> +        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
>> +    }
>> +    w->busy = true;
>> +    QSIMPLEQ_INSERT_TAIL(&rws->busy_workers, w, entry);
>> +
>> +    w->task.file_cluster_offset = file_cluster_offset;
>> +    w->task.offset = offset;
>> +    w->task.bytes = bytes;
>> +    w->task.bytes_done = bytes_done;
> (And you'd copy it with w->task = *task here)
>
>> +
>> +    qemu_coroutine_enter(w->co);
>> +}
>> +
>> +static void qcow2_init_rws(Qcow2RWState *rws, BlockDriverState *bs,
>> +                           QEMUIOVector *qiov, uint64_t bytes,
>> +                           Qcow2DoWorkFunc do_work_func)
>> +{
>> +    memset(rws, 0, sizeof(*rws));
>> +    rws->bs = bs;
>> +    rws->qiov = qiov;
>> +    rws->bytes = bytes;
>> +    rws->co = qemu_coroutine_self();
>> +    rws->do_work_func = do_work_func;
> Maybe you'd like to use
>
> *rws = (Qcow2RWState) {
>      .bs = bs,
>      ...
> };
>
> Then you could save yourself the memset().

ok

>
>> +    QSIMPLEQ_INIT(&rws->free_workers);
>> +    QSIMPLEQ_INIT(&rws->busy_workers);
>> +}
>> +
>> +static void qcow2_finalize_rws(Qcow2RWState *rws)
>> +{
>> +    assert(rws->co == qemu_coroutine_self());
>> +
>> +    /* kill waiting workers */
>> +    rws->finalize = true;
>> +    while (!QSIMPLEQ_EMPTY(&rws->free_workers)) {
>> +        Qcow2Worker *w = QSIMPLEQ_FIRST(&rws->free_workers);
>> +        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
>> +        qemu_coroutine_enter(w->co);
>> +    }
>> +
>> +    /* wait others */
>> +    if (rws->online_workers > 0) {
>> +        rws->waiting_all = true;
>> +        qemu_coroutine_yield();
>> +        rws->waiting_all = false;
> Why don't you enter the busy workers here?  (And keep doing so until
> online_workers is 0.)  That way, you could save yourself the other
> aio_co_wake() in qcow2_rw_worker().

We shouldn't enter busy workers, as they may have yielded inside an I/O
operation. That operation should be allowed to complete.

>
> (My problem with those aio_co_wake()s is that they make the control flow
> rather hard to follow, in my opinion.  I'd prefer it if only these
> "control" functions woke up the worker coroutines, and not have it also
> the other way round in some cases.)
>
>> +    }
>> +
>> +    assert(rws->online_workers == 0);
>> +    assert(QSIMPLEQ_EMPTY(&rws->free_workers));
>> +    assert(QSIMPLEQ_EMPTY(&rws->busy_workers));
>> +}
>> +
>> +static coroutine_fn int qcow2_co_preadv_normal_task(BlockDriverState *bs,
>> +                                                    QEMUIOVector *qiov,
>> +                                                    Qcow2WorkerTask *task)
>> +{
>> +    return qcow2_co_preadv_normal(bs,
>> +                                  task->file_cluster_offset,
>> +                                  task->offset,
>> +                                  task->bytes,
>> +                                  qiov,
>> +                                  task->bytes_done);
>> +}
>> +
>>   static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>                                           uint64_t bytes, QEMUIOVector *qiov,
>>                                           int flags)
>> @@ -1880,12 +2071,15 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>       uint64_t cluster_offset = 0;
>>       uint64_t bytes_done = 0;
>>       QEMUIOVector hd_qiov;
>> +    Qcow2RWState rws = {0};
>> +
>> +    qcow2_init_rws(&rws, bs, qiov, bytes, qcow2_co_preadv_normal_task);
> I think it's a bit of a shame to initialize all of this in case we're
> just going to go into the fast path anyway.  But off the top of my head
> I can't come up with a much better solution.
>
> (Apart from adding more fields to Qcow2WorkerTask, which, to be honest,
> wouldn't be much better.  Because then we initialize that bigger object
> on the stack somewhere, as opposed to the static Qcow2RWState, so we
> probably don't gain anything.)
>
> Max
>
>>   
>>       qemu_iovec_init(&hd_qiov, qiov->niov);
>>   
>>       qemu_co_mutex_lock(&s->lock);
>>   
>> -    while (bytes != 0) {
>> +    while (bytes != 0 && rws.ret == 0) {
>>   
>>           /* prepare next request */
>>           cur_bytes = MIN(bytes, INT_MAX);
>> @@ -1942,9 +2136,10 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>           case QCOW2_CLUSTER_NORMAL:
>>               qemu_co_mutex_unlock(&s->lock);
>>   
>> -            ret = qcow2_co_preadv_normal(bs, cluster_offset,
>> -                                         offset, cur_bytes, qiov, bytes_done);
>> -            if (ret < 0) {
>> +            qcow2_rws_add_task(&rws, cluster_offset, offset, cur_bytes,
>> +                               bytes_done);
>> +            if (rws.ret < 0) {
>> +                ret = rws.ret;
>>                   goto fail_nolock;
>>               }
>>   
>> @@ -1967,6 +2162,11 @@ fail:
>>       qemu_co_mutex_unlock(&s->lock);
>>   
>>   fail_nolock:
>> +    qcow2_finalize_rws(&rws);
>> +    if (ret == 0) {
>> +        ret = rws.ret;
>> +    }
>> +
>>       qemu_iovec_destroy(&hd_qiov);
>>   
>>       return ret;
>>
>


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv
  2018-10-01 15:14     ` Vladimir Sementsov-Ogievskiy
@ 2018-10-01 15:39       ` Max Reitz
  2018-10-01 16:00         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 30+ messages in thread
From: Max Reitz @ 2018-10-01 15:39 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, den


On 01.10.18 17:14, Vladimir Sementsov-Ogievskiy wrote:
> 27.09.2018 20:35, Max Reitz wrote:
>> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>> Memory allocation may become less efficient for encrypted case. It's a
>>> payment for further asynchronous scheme.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>  block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
>>>  1 file changed, 64 insertions(+), 50 deletions(-)
>>>
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index 65e3ca00e2..5e7f2ee318 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -1808,6 +1808,67 @@ out:
>>>      return ret;
>>>  }
>>>  
>>> +static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
>>> +                                               uint64_t file_cluster_offset,
>>> +                                               uint64_t offset,
>>> +                                               uint64_t bytes,
>>> +                                               QEMUIOVector *qiov,
>>> +                                               uint64_t qiov_offset)
>>> +{
>>> +    int ret;
>>> +    BDRVQcow2State *s = bs->opaque;
>>> +    void *crypt_buf = NULL;
>>> +    QEMUIOVector hd_qiov;
>>> +    int offset_in_cluster = offset_into_cluster(s, offset);
>>> +
>>> +    if ((file_cluster_offset & 511) != 0) {
>>> +        return -EIO;
>>> +    }
>>> +
>>> +    qemu_iovec_init(&hd_qiov, qiov->niov);
>> So you're not just re-allocating the bounce buffer for every single
>> call, but also this I/O vector.  Hm.
>>
>>> +    if (bs->encrypted) {
>>> +        assert(s->crypto);
>>> +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>>> +
>>> +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
>> 1. Why did you remove the comment?
>>
>> 2. The check for whether crypt_buf was successfully allocated is missing.
>>
>> 3. Do you have any benchmarks for encrypted images?  Having to allocate
>> the bounce buffer for potentially every single cluster sounds rather bad
>> to me.
> 
> Hmm, no excuses. 1- Will fix, 2 - will fix, 3 - will fix or at least
> test the performance.
> 
>>> +        qemu_iovec_add(&hd_qiov, crypt_buf, bytes);
>>> +    } else {
>>> +        qemu_iovec_concat(&hd_qiov, qiov, qiov_offset, bytes);
>> qcow2_co_preadv() continues to do this as well.  That's superfluous now.
> 
> hd_qiov is local here.. or what do you mean?

qcow2_co_preadv() continues to have its own hd_qiov and appends qiov to
it (just like here), but it's unused for normal clusters.  So it doesn't
have to do that for normal clusters.

>>> +    }
>>> +
>>> +    BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
>>> +    ret = bdrv_co_preadv(bs->file,
>>> +                         file_cluster_offset + offset_in_cluster,
>>> +                         bytes, &hd_qiov, 0);
>>> +    if (ret < 0) {
>>> +        goto out;
>>> +    }
>>> +
>>> +    if (bs->encrypted) {
>>> +        assert(s->crypto);
>>> +        assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
>>> +        assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
>>> +        if (qcrypto_block_decrypt(s->crypto,
>>> +                                  (s->crypt_physical_offset ?
>>> +                                   file_cluster_offset + offset_in_cluster :
>>> +                                   offset),
>>> +                                  crypt_buf,
>>> +                                  bytes,
>>> +                                  NULL) < 0) {
>> What's the reason against decrypting this in-place in qiov so we could
>> save the bounce buffer?  We allow offsets in clusters, so why can't we
>> just call this function once per involved I/O vector entry?
> 
> Isn't it unsafe to do decryption in guest buffers?

I don't know.  Why would it be?  Because...

>> Max
>>
>>> +            ret = -EIO;
>>> +            goto out;
>>> +        }
>>> +        qemu_iovec_from_buf(qiov, qiov_offset, crypt_buf, bytes);

...we're putting the decrypted content there anyway.

The only issue I see is that the guest may see encrypted content for a
short duration.  Hm.  Maybe we don't want that.  In which case it
probably has to stay as it is.

Max




* Re: [Qemu-devel] [PATCH 5/7] qcow2: refactor qcow2_co_pwritev: split out qcow2_co_do_pwritev
       [not found]   ` <5c871ce7-2cab-f897-0b06-cbc05b9ffe97@redhat.com>
@ 2018-10-01 15:43     ` Vladimir Sementsov-Ogievskiy
  2018-10-01 15:50       ` Max Reitz
  0 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-10-01 15:43 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, den

28.09.2018 17:23, Max Reitz wrote:
> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>> Split out block which will be reused in async scheme.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/qcow2.c | 138 ++++++++++++++++++++++++++++++++++++----------------------
>>   1 file changed, 86 insertions(+), 52 deletions(-)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index a0df8d4e50..4d669432d1 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -2210,6 +2210,85 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>       return false;
>>   }
>>   
>> +/* qcow2_co_do_pwritev
>> + * Called without s->lock unlocked
>> + * hd_qiov - temp qiov for any use. It is initialized so it is empty and
>> + *           support adding up to qiov->niov + 2 elements
>> + * l2meta  - if not NULL, qcow2_co_do_pwritev() will consume it. Caller must not
>> + *           use it somehow after qcow2_co_do_pwritev() call
>> + */
>> +static coroutine_fn int qcow2_co_do_pwritev(BlockDriverState *bs,
>> +                                            uint64_t file_cluster_offset,
>> +                                            uint64_t offset,
>> +                                            uint64_t bytes,
>> +                                            QEMUIOVector *qiov,
>> +                                            uint64_t qiov_offset,
>> +                                            QCowL2Meta *l2meta)
>> +{
>> +    int ret;
>> +    BDRVQcow2State *s = bs->opaque;
>> +    void *crypt_buf = NULL;
>> +    QEMUIOVector hd_qiov;
>> +    int offset_in_cluster = offset_into_cluster(s, offset);
>> +
>> +    qemu_iovec_reset(&hd_qiov);
> This shouldn't be here.
>
>> +    qemu_iovec_init(&hd_qiov, qiov->niov);
>> +
>> +    if (bs->encrypted) {
>> +        assert(s->crypto);
>> +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>> +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
>> +        qemu_iovec_to_buf(qiov, qiov_offset, crypt_buf, bytes);
>> +
>> +        if (qcrypto_block_encrypt(s->crypto,
>> +                                  (s->crypt_physical_offset ?
>> +                                   file_cluster_offset + offset_in_cluster :
>> +                                   offset),
>> +                                  crypt_buf,
>> +                                  bytes, NULL) < 0) {
> Same question as in the read case: Can't we make do without the bounce
> buffer?

I think we should not modify guest buffers.

>
>> +            ret = -EIO;
>> +            goto fail;
>> +        }
>> +
>> +        qemu_iovec_add(&hd_qiov, crypt_buf, bytes);
>> +    } else {
>> +        qemu_iovec_concat(&hd_qiov, qiov, qiov_offset, bytes);
>> +    }
>> +
>> +    /* If we need to do COW, check if it's possible to merge the
>> +     * writing of the guest data together with that of the COW regions.
>> +     * If it's not possible (or not necessary) then write the
>> +     * guest data now. */
>> +    if (!merge_cow(offset, bytes, &hd_qiov, l2meta)) {
>> +        BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
>> +        trace_qcow2_writev_data(qemu_coroutine_self(),
>> +                                file_cluster_offset + offset_in_cluster);
>> +        ret = bdrv_co_pwritev(bs->file,
>> +                              file_cluster_offset + offset_in_cluster,
>> +                              bytes, &hd_qiov, 0);
>> +        if (ret < 0) {
>> +            qemu_co_mutex_lock(&s->lock);
>> +            goto fail;
>> +        }
>> +    }
>> +
>> +    qemu_co_mutex_lock(&s->lock);
> It looks a bit weird to only do some of the lock handling in this
> function.  I don't quite like it, but as long as you keep it around the
> qcow2_handle_l2meta(), I won't object.
>
> A problem is that currently you don't keep it around
> qcow2_handle_l2meta().  Before going to the fail label, one has to
> manually lock the mutex (which you don't do in one of the above error
> paths, but that's fallout from patch 2).  Maybe it would be nicer with a
> separate label that does the lock.
>
> Like:
>
>      ...
>      ret = qcow2_handle_l2meta(bs, &l2meta, true);
>      goto done_locked;
>
> fail_unlocked:
>      qemu_co_mutex_lock(&s->lock)
> done_locked:
>      qcow2_handle_l2meta(bs, &l2meta, false);
>      ...

ok
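Written out (sketch only, with the label names from the suggestion above,
not compiled), the tail of qcow2_co_do_pwritev() would become:

```c
    ...
    qemu_co_mutex_lock(&s->lock);
    ret = qcow2_handle_l2meta(bs, &l2meta, true);
    goto done_locked;

fail_unlocked:
    /* every error path above jumps here with s->lock NOT held */
    qemu_co_mutex_lock(&s->lock);
done_locked:
    /* on error l2meta is dropped without being applied; on success it
     * was already consumed above, so this is a no-op */
    qcow2_handle_l2meta(bs, &l2meta, false);
    qemu_co_mutex_unlock(&s->lock);

    qemu_vfree(crypt_buf);
    qemu_iovec_destroy(&hd_qiov);
    return ret;
```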

>
>> +
>> +    ret = qcow2_handle_l2meta(bs, &l2meta, true);
>> +    if (ret) {
>> +        goto fail;
>> +    }
>> +
>> +fail:
>> +    qcow2_handle_l2meta(bs, &l2meta, false);
>> +    qemu_co_mutex_unlock(&s->lock);
>> +
>> +    qemu_vfree(crypt_buf);
>> +    qemu_iovec_destroy(&hd_qiov);
>> +
>> +    return ret;
>> +}
>> +
>>   static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>                                            uint64_t bytes, QEMUIOVector *qiov,
>>                                            int flags)
>> @@ -2262,63 +2341,16 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>   
>>           qemu_co_mutex_unlock(&s->lock);
>>   
>> -        qemu_iovec_reset(&hd_qiov);
>> -        qemu_iovec_concat(&hd_qiov, qiov, bytes_done, cur_bytes);
>>   
>> -        if (bs->encrypted) {
>> -            assert(s->crypto);
>> -            if (!cluster_data) {
>> -                cluster_data = qemu_try_blockalign(bs->file->bs,
>> -                                                   QCOW_MAX_CRYPT_CLUSTERS
>> -                                                   * s->cluster_size);
>> -                if (cluster_data == NULL) {
>> -                    ret = -ENOMEM;
>> -                    goto fail;
>> -                }
>> -            }
>> -
>> -            assert(hd_qiov.size <=
>> -                   QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>> -            qemu_iovec_to_buf(&hd_qiov, 0, cluster_data, hd_qiov.size);
>> -
>> -            if (qcrypto_block_encrypt(s->crypto,
>> -                                      (s->crypt_physical_offset ?
>> -                                       cluster_offset + offset_in_cluster :
>> -                                       offset),
>> -                                      cluster_data,
>> -                                      cur_bytes, NULL) < 0) {
>> -                ret = -EIO;
>> -                goto fail;
>> -            }
>> -
>> -            qemu_iovec_reset(&hd_qiov);
>> -            qemu_iovec_add(&hd_qiov, cluster_data, cur_bytes);
>> -        }
>> -
>> -        /* If we need to do COW, check if it's possible to merge the
>> -         * writing of the guest data together with that of the COW regions.
>> -         * If it's not possible (or not necessary) then write the
>> -         * guest data now. */
>> -        if (!merge_cow(offset, cur_bytes, &hd_qiov, l2meta)) {
>> -            BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
>> -            trace_qcow2_writev_data(qemu_coroutine_self(),
>> -                                    cluster_offset + offset_in_cluster);
>> -            ret = bdrv_co_pwritev(bs->file,
>> -                                  cluster_offset + offset_in_cluster,
>> -                                  cur_bytes, &hd_qiov, 0);
>> -            if (ret < 0) {
>> -                qemu_co_mutex_lock(&s->lock);
>> -                goto fail;
>> -            }
>> +        ret = qcow2_co_do_pwritev(bs, cluster_offset, offset, cur_bytes,
>> +                                  qiov, bytes_done, l2meta);
>> +        l2meta = NULL; /* l2meta is consumed by qcow2_co_do_pwritev() */
>> +        if (ret < 0) {
>> +            goto fail_nometa;
>>           }
>>   
>>           qemu_co_mutex_lock(&s->lock);
>>   
>> -        ret = qcow2_handle_l2meta(bs, &l2meta, true);
>> -        if (ret) {
>> -            goto fail;
>> -        }
>> -
>>           bytes -= cur_bytes;
>>           offset += cur_bytes;
>>           bytes_done += cur_bytes;
>> @@ -2331,6 +2363,8 @@ fail:
>>   
>>       qemu_co_mutex_unlock(&s->lock);
>>   
>> +fail_nometa:
>> +
> I'd remove the blank line (because we usually don't have blank lines
> below labels).
>
>>       qemu_iovec_destroy(&hd_qiov);
>>       qemu_vfree(cluster_data);
> cluster_data and hd_qiov are unused now.
>
> Max
>
>>       trace_qcow2_writev_done_req(qemu_coroutine_self(), ret);
>>
>


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH 7/7] qcow2: async scheme for qcow2_co_pwritev
       [not found]   ` <1c8299bf-0b31-82a7-c7c4-5069581f2d94@redhat.com>
@ 2018-10-01 15:46     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-10-01 15:46 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, den

28.09.2018 17:35, Max Reitz wrote:
> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>> Start several async requests instead of read chunk by chunk.
>>
>> Iotest 026 output is changed, as because of async io error path has
>> changed.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/qcow2.c                      | 47 ++++++++++++++++++++++++++++++--------
>>   tests/qemu-iotests/026.out         | 18 ++++++++-------
>>   tests/qemu-iotests/026.out.nocache | 20 ++++++++--------
>>   3 files changed, 59 insertions(+), 26 deletions(-)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 88b3fb8080..c7501adfc6 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -1874,6 +1874,7 @@ typedef struct Qcow2WorkerTask {
>>       uint64_t offset;
>>       uint64_t bytes;
>>       uint64_t bytes_done;
>> +    QCowL2Meta *l2meta;
>>   } Qcow2WorkerTask;
>>   
>>   typedef int (*Qcow2DoWorkFunc)(BlockDriverState *bs, QEMUIOVector *qiov,
>> @@ -1965,11 +1966,16 @@ static coroutine_fn void qcow2_rw_worker(void *opaque)
>>       }
>>   }
>>   
>> +/* qcow2_rws_add_task
>> +* l2meta - opaque pointer for do_work_func, so behavior depends on
>> +*          do_work_func, specified in qcow2_init_rws
>> +*/
> I think this would work better if Qcow2WorkerTask was indeed a union.
>
>>   static coroutine_fn void qcow2_rws_add_task(Qcow2RWState *rws,
>>                                               uint64_t file_cluster_offset,
>>                                               uint64_t offset,
>>                                               uint64_t bytes,
>> -                                            uint64_t bytes_done)
>> +                                            uint64_t bytes_done,
>> +                                            QCowL2Meta *l2meta)
>>   {
>>       Qcow2Worker *w;
>>   
>> @@ -1980,7 +1986,8 @@ static coroutine_fn void qcow2_rws_add_task(Qcow2RWState *rws,
>>               .file_cluster_offset = file_cluster_offset,
>>               .offset = offset,
>>               .bytes = bytes,
>> -            .bytes_done = bytes_done
>> +            .bytes_done = bytes_done,
>> +            .l2meta = l2meta
>>           };
>>           rws->ret = rws->do_work_func(rws->bs, rws->qiov, &task);
>>           return;
>> @@ -2006,6 +2013,7 @@ static coroutine_fn void qcow2_rws_add_task(Qcow2RWState *rws,
>>       w->task.offset = offset;
>>       w->task.bytes = bytes;
>>       w->task.bytes_done = bytes_done;
>> +    w->task.l2meta = l2meta;
>>   
>>       qemu_coroutine_enter(w->co);
>>   }
>> @@ -2137,7 +2145,7 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>               qemu_co_mutex_unlock(&s->lock);
>>   
>>               qcow2_rws_add_task(&rws, cluster_offset, offset, cur_bytes,
>> -                               bytes_done);
>> +                               bytes_done, NULL);
>>               if (rws.ret < 0) {
>>                   ret = rws.ret;
>>                   goto fail_nolock;
>> @@ -2289,6 +2297,19 @@ fail:
>>       return ret;
>>   }
>>   
>> +static coroutine_fn int qcow2_co_do_pwritev_task(BlockDriverState *bs,
>> +                                                 QEMUIOVector *qiov,
>> +                                                 Qcow2WorkerTask *task)
>> +{
>> +    return qcow2_co_do_pwritev(bs,
>> +                               task->file_cluster_offset,
>> +                               task->offset,
>> +                               task->bytes,
>> +                               qiov,
>> +                               task->bytes_done,
>> +                               task->l2meta);
>> +}
>> +
>>   static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>                                            uint64_t bytes, QEMUIOVector *qiov,
>>                                            int flags)
>> @@ -2299,16 +2320,19 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>       uint64_t bytes_done = 0;
>>       uint8_t *cluster_data = NULL;
>>       QCowL2Meta *l2meta = NULL;
>> +    Qcow2RWState rws = {0};
>>   
>>       trace_qcow2_writev_start_req(qemu_coroutine_self(), offset, bytes);
>>   
>> +    qcow2_init_rws(&rws, bs, qiov, bytes, qcow2_co_do_pwritev_task);
>> +
>>       qemu_iovec_init(&hd_qiov, qiov->niov);
>>   
>>       s->cluster_cache_offset = -1; /* disable compressed cache */
>>   
>>       qemu_co_mutex_lock(&s->lock);
>>   
>> -    while (bytes != 0) {
>> +    while (bytes != 0 && rws.ret == 0) {
>>           int offset_in_cluster = offset_into_cluster(s, offset);
>>           unsigned int cur_bytes = MIN(bytes, INT_MAX); /* number of sectors in
>>                                                            current iteration */
>> @@ -2340,11 +2364,11 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>   
>>           qemu_co_mutex_unlock(&s->lock);
>>   
>> -
>> -        ret = qcow2_co_do_pwritev(bs, cluster_offset, offset, cur_bytes,
>> -                                  qiov, bytes_done, l2meta);
>> -        l2meta = NULL; /* l2meta is consumed by qcow2_co_do_pwritev() */
>> -        if (ret < 0) {
>> +        qcow2_rws_add_task(&rws, cluster_offset, offset, cur_bytes, bytes_done,
>> +                           l2meta);
>> +        l2meta = NULL; /* l2meta is consumed */
>> +        if (rws.ret < 0) {
>> +            ret = rws.ret;
>>               goto fail_nometa;
>>           }
>>   
>> @@ -2362,6 +2386,11 @@ fail:
>>   
>>       qemu_co_mutex_unlock(&s->lock);
>>   
>> +    qcow2_finalize_rws(&rws);
>> +    if (ret == 0) {
>> +        ret = rws.ret;
>> +    }
>> +
>>   fail_nometa:
> Shouldn't we finalize below the fail_nometa label?

Yes, you're right, it's a bug.

>
> (And as a minor thing, if the "ret = rws.ret" assignment was there as
> well, you could drop the same assignment before the "goto fail_nometa".)
>
> Max
>
>>   
>>       qemu_iovec_destroy(&hd_qiov);


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH 4/7] qcow2: async scheme for qcow2_co_preadv
  2018-10-01 15:33     ` Vladimir Sementsov-Ogievskiy
@ 2018-10-01 15:49       ` Max Reitz
  2018-10-01 16:17         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 30+ messages in thread
From: Max Reitz @ 2018-10-01 15:49 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, den


On 01.10.18 17:33, Vladimir Sementsov-Ogievskiy wrote:
> 27.09.2018 21:35, Max Reitz wrote:
>> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>> Start several async requests instead of read chunk by chunk.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   block/qcow2.c | 208
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>>   1 file changed, 204 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index 5e7f2ee318..a0df8d4e50 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -1869,6 +1869,197 @@ out:
>>>       return ret;
>>>   }
>>>   +typedef struct Qcow2WorkerTask {
>>> +    uint64_t file_cluster_offset;
>>> +    uint64_t offset;
>>> +    uint64_t bytes;
>>> +    uint64_t bytes_done;
>>> +} Qcow2WorkerTask;
>> Why don't you make this a union of request-specific structs?
> 
> ok, will try
> 
>>
>>> +
>>> +typedef int (*Qcow2DoWorkFunc)(BlockDriverState *bs, QEMUIOVector
>>> *qiov,
>>> +                               Qcow2WorkerTask *task);
>>> +
>>> +typedef struct Qcow2RWState {
>>> +    BlockDriverState *bs;
>>> +    QEMUIOVector *qiov;
>>> +    uint64_t bytes;
>> Maybe make it total_bytes so it doesn't conflict with the value in
>> Qcow2WorkerTask?
> 
> ok
> 
>>
>>> +    int ret;
>>> +    bool waiting_one;
>>> +    bool waiting_all;
>>> +    bool finalize;
>>> +    Coroutine *co;
>>> +    QSIMPLEQ_HEAD(, Qcow2Worker) free_workers;
>>> +    QSIMPLEQ_HEAD(, Qcow2Worker) busy_workers;
>>> +    int online_workers;
>>> +    Qcow2DoWorkFunc do_work_func;
>>> +} Qcow2RWState;
>>> +
>>> +typedef struct Qcow2Worker {
>>> +    Qcow2RWState *rws;
>>> +    Coroutine *co;
>>> +    Qcow2WorkerTask task;
>>> +    bool busy;
>>> +    QSIMPLEQ_ENTRY(Qcow2Worker) entry;
>>> +} Qcow2Worker;
>>> +#define QCOW2_MAX_WORKERS 64
>> That's really a bit hidden here.  I think it should go into the header.
>>
>> Also I'm not quite sure about the number.  In other places we've always
>> used 16.
>>
>> (With the encryption code always allocating a new bounce buffer, this
>> can mean quite a bit of memory usage.)
> 
> No doubts.
> 
>>
>>> +
>>> +static coroutine_fn void qcow2_rw_worker(void *opaque);
>>> +static Qcow2Worker *qcow2_new_worker(Qcow2RWState *rws)
>>> +{
>>> +    Qcow2Worker *w = g_new0(Qcow2Worker, 1);
>>> +    w->rws = rws;
>>> +    w->co = qemu_coroutine_create(qcow2_rw_worker, w);
>>> +
>>> +    return w;
>>> +}
>>> +
>>> +static void qcow2_free_worker(Qcow2Worker *w)
>>> +{
>>> +    g_free(w);
>>> +}
>>> +
>>> +static coroutine_fn void qcow2_rw_worker(void *opaque)
>>> +{
>>> +    Qcow2Worker *w = opaque;
>>> +    Qcow2RWState *rws = w->rws;
>>> +
>>> +    rws->online_workers++;
>>> +
>>> +    while (!rws->finalize) {
>>> +        int ret = rws->do_work_func(rws->bs, rws->qiov, &w->task);
>>> +        if (ret < 0 && rws->ret == 0) {
>>> +            rws->ret = ret;
>>> +        }
>>> +
>>> +        if (rws->waiting_all || rws->ret < 0) {
>>> +            break;
>>> +        }
>>> +
>>> +        w->busy = false;
>>> +        QSIMPLEQ_REMOVE(&rws->busy_workers, w, Qcow2Worker, entry);
>>> +        QSIMPLEQ_INSERT_TAIL(&rws->free_workers, w, entry);
>>> +        if (rws->waiting_one) {
>>> +            rws->waiting_one = false;
>>> +            /* we must unset it here, to prevent queuing rws->co in
>>> several
>>> +             * workers (it may happen if other worker already waits
>>> us on mutex,
>>> +             * so it will be entered after our yield and before
>>> rws->co enter)
>>> +             *
>>> +             * TODO: rethink this comment, as here (and in other
>>> places in the
>>> +             * file) we moved from qemu_coroutine_add_next to
>>> aio_co_wake.
>>> +             */
>>> +            aio_co_wake(rws->co);
>>> +        }
>>> +
>>> +        qemu_coroutine_yield();
>>> +    }
>>> +
>>> +    if (w->busy) {
>>> +        w->busy = false;
>>> +        QSIMPLEQ_REMOVE(&rws->busy_workers, w, Qcow2Worker, entry);
>>> +    }
>>> +    qcow2_free_worker(w);
>>> +    rws->online_workers--;
>>> +
>>> +    if (rws->waiting_all && rws->online_workers == 0) {
>>> +        aio_co_wake(rws->co);
>>> +    }
>>> +}
>>> +
>>> +static coroutine_fn void qcow2_rws_add_task(Qcow2RWState *rws,
>>> +                                            uint64_t
>>> file_cluster_offset,
>>> +                                            uint64_t offset,
>>> +                                            uint64_t bytes,
>>> +                                            uint64_t bytes_done)
>> I'd propose just taking a const Qcow2WorkerTask * here.  (Makes even
>> more sense if you make it a union.)
> 
> ok, I'll try this way
> 
>>
>>> +{
>>> +    Qcow2Worker *w;
>>> +
>>> +    assert(rws->co == qemu_coroutine_self());
>>> +
>>> +    if (bytes_done == 0 && bytes == rws->bytes) {
>>> +        Qcow2WorkerTask task = {
>>> +            .file_cluster_offset = file_cluster_offset,
>>> +            .offset = offset,
>>> +            .bytes = bytes,
>>> +            .bytes_done = bytes_done
>>> +        };
>>> +        rws->ret = rws->do_work_func(rws->bs, rws->qiov, &task);
>> (If so, you'd just pass the pointer along here)
>>
>>> +        return;
>>> +    }
>> I like this fast path, but I think it deserves a small comment.  (That
>> is a fast path and bypasses the whole worker infrastructure.)
>>
>>> +
>>> +    if (!QSIMPLEQ_EMPTY(&rws->free_workers)) {
>>> +        w = QSIMPLEQ_FIRST(&rws->free_workers);
>>> +        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
>>> +    } else if (rws->online_workers < QCOW2_MAX_WORKERS) {
>>> +        w = qcow2_new_worker(rws);
>>> +    } else {
>>> +        rws->waiting_one = true;
>>> +        qemu_coroutine_yield();
>>> +        assert(!rws->waiting_one); /* already unset by worker */
>> Sometimes I hate coroutines.  OK.  So, how does the yield ensure that
>> any worker is scheduled?  Doesn't yield just give control to the parent?
> 
> hm. I don't follow. All workers are busy - we sure, because there no
> free workers,
> and we can't create one more, second condition isn't satisfied too.
> So, we give control to the parent. And only worker can wake us up.

Ah, I see.  And then something at the bottom just continues to ppoll()
or whatever.

>> Right now I think it would be clearer to me if you'd just wake all busy
>> coroutines (looping over them) until one has settled.
> 
> but all workers are busy, we should not touch them (they may be yielded
> in io operation)..

I would have assumed that if they are yielded in an I/O operation, they
could handle spurious wakeups.  But I'm very likely wrong.

>> This would also save you the aio_co_wake() in the worker itself, as
>> they'd just have to yield in all cases.
>>
>>> +
>>> +        w = QSIMPLEQ_FIRST(&rws->free_workers);
>>> +        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
>>> +    }
>>> +    w->busy = true;
>>> +    QSIMPLEQ_INSERT_TAIL(&rws->busy_workers, w, entry);
>>> +
>>> +    w->task.file_cluster_offset = file_cluster_offset;
>>> +    w->task.offset = offset;
>>> +    w->task.bytes = bytes;
>>> +    w->task.bytes_done = bytes_done;
>> (And you'd copy it with w->task = *task here)
>>
>>> +
>>> +    qemu_coroutine_enter(w->co);
>>> +}
>>> +
>>> +static void qcow2_init_rws(Qcow2RWState *rws, BlockDriverState *bs,
>>> +                           QEMUIOVector *qiov, uint64_t bytes,
>>> +                           Qcow2DoWorkFunc do_work_func)
>>> +{
>>> +    memset(rws, 0, sizeof(*rws));
>>> +    rws->bs = bs;
>>> +    rws->qiov = qiov;
>>> +    rws->bytes = bytes;
>>> +    rws->co = qemu_coroutine_self();
>>> +    rws->do_work_func = do_work_func;
>> Maybe you'd like to use
>>
>> *rws = (Qcow2RWState) {
>>      .bs = bs,
>>      ...
>> };
>>
>> Then you could save yourself the memset().
> 
> ok
> 
>>
>>> +    QSIMPLEQ_INIT(&rws->free_workers);
>>> +    QSIMPLEQ_INIT(&rws->busy_workers);
>>> +}
>>> +
>>> +static void qcow2_finalize_rws(Qcow2RWState *rws)
>>> +{
>>> +    assert(rws->co == qemu_coroutine_self());
>>> +
>>> +    /* kill waiting workers */
>>> +    rws->finalize = true;
>>> +    while (!QSIMPLEQ_EMPTY(&rws->free_workers)) {
>>> +        Qcow2Worker *w = QSIMPLEQ_FIRST(&rws->free_workers);
>>> +        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
>>> +        qemu_coroutine_enter(w->co);
>>> +    }
>>> +
>>> +    /* wait others */
>>> +    if (rws->online_workers > 0) {
>>> +        rws->waiting_all = true;
>>> +        qemu_coroutine_yield();
>>> +        rws->waiting_all = false;
>> Why don't you enter the busy workers here?  (And keep doing so until
>> online_workers is 0.)  That way, you could save yourself the other
>> aio_co_wake() in qcow2_rw_worker().
> 
> We shouldn't enter busy workers, as they may yielded on io operation.
> The operation should complete.

Yes.

I think my misunderstanding was that I like to assume that everything
that yields checks whether I/O is done by itself, whereas in reality
that's probably usually done with some central polling and those
yielding coroutines assume they only wake up when that polling assures
them the I/O is done.

So I have no objections to the control flow now.
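(A minimal sketch of the contract we were talking past each other on,
hypothetical names, not real QEMU code: the yield is paired with exactly
one wakeup from the completion callback, with no guard against spurious
entries.)

```c
static void coroutine_fn do_io(Request *req)
{
    submit_async(req);       /* completion callback will aio_co_wake(us) */
    qemu_coroutine_yield();  /* a single wakeup means "I/O is done" */
    /* req->ret is only valid because nobody else entered us early:
     * there is no while (!req->done) loop around the yield */
}
```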

Max




* Re: [Qemu-devel] [PATCH 5/7] qcow2: refactor qcow2_co_pwritev: split out qcow2_co_do_pwritev
  2018-10-01 15:43     ` Vladimir Sementsov-Ogievskiy
@ 2018-10-01 15:50       ` Max Reitz
  0 siblings, 0 replies; 30+ messages in thread
From: Max Reitz @ 2018-10-01 15:50 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, den

[-- Attachment #1: Type: text/plain, Size: 3281 bytes --]

On 01.10.18 17:43, Vladimir Sementsov-Ogievskiy wrote:
> 28.09.2018 17:23, Max Reitz wrote:
>> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>> Split out block which will be reused in async scheme.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   block/qcow2.c | 138 ++++++++++++++++++++++++++++++++++++----------------------
>>>   1 file changed, 86 insertions(+), 52 deletions(-)
>>>
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index a0df8d4e50..4d669432d1 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -2210,6 +2210,85 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>>       return false;
>>>   }
>>>   +/* qcow2_co_do_pwritev
>>> + * Called with s->lock unlocked
>>> + * hd_qiov - temp qiov for any use. It is initialized so it is empty and
>>> + *           supports adding up to qiov->niov + 2 elements
>>> + * l2meta  - if not NULL, qcow2_co_do_pwritev() will consume it. Caller
>>> + *           must not use it in any way after the qcow2_co_do_pwritev() call
>>> + */
>>> +static coroutine_fn int qcow2_co_do_pwritev(BlockDriverState *bs,
>>> +                                            uint64_t file_cluster_offset,
>>> +                                            uint64_t offset,
>>> +                                            uint64_t bytes,
>>> +                                            QEMUIOVector *qiov,
>>> +                                            uint64_t qiov_offset,
>>> +                                            QCowL2Meta *l2meta)
>>> +{
>>> +    int ret;
>>> +    BDRVQcow2State *s = bs->opaque;
>>> +    void *crypt_buf = NULL;
>>> +    QEMUIOVector hd_qiov;
>>> +    int offset_in_cluster = offset_into_cluster(s, offset);
>>> +
>>> +    qemu_iovec_reset(&hd_qiov);
>> This shouldn't be here.
>>
>>> +    qemu_iovec_init(&hd_qiov, qiov->niov);
>>> +
>>> +    if (bs->encrypted) {
>>> +        assert(s->crypto);
>>> +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>>> +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
>>> +        qemu_iovec_to_buf(qiov, qiov_offset, crypt_buf, bytes);
>>> +
>>> +        if (qcrypto_block_encrypt(s->crypto,
>>> +                                  (s->crypt_physical_offset ?
>>> +                                   file_cluster_offset + offset_in_cluster :
>>> +                                   offset),
>>> +                                  crypt_buf,
>>> +                                  bytes, NULL) < 0) {
>> Same question as in the read case: Can't we make do without the bounce
>> buffer?
> 
> I think we should not modify guest buffers.

Hmm, yes, agreed. O:-)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] qcow2: move qemu_co_mutex_lock below decryption procedure
       [not found]     ` <43277786-b6b9-e18c-b0ca-064ff7c9c0c9@redhat.com>
@ 2018-10-01 15:56       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-10-01 15:56 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, den

27.09.2018 20:02, Max Reitz wrote:
> On 27.09.18 18:58, Max Reitz wrote:
>> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>> From: "Denis V. Lunev" <den@openvz.org>
>>>
>>> We are not working with a shared state data in the decruption code and
> (*decryption)
>
>>> thus this operation is safe. On the other hand this significantly
>>> reduces the scope of the lock.
>> Sure, but does it have any effect?  This is a coroutine lock and I don't
>> see any yield points besides the bdrv_co_preadv() that was executed
>> outside of the lock anyway.
> I think it makes sense to move the qemu_co_mutex_lock() down below the
> decryption, as the commit title suggests.  Because then we can decrypt
> while another coroutine that uses the lock is blocking.
>
> But I don't see the point of moving the qemu_co_mutex_unlock() up.
> (That is non-trivial movement because it means I'd have to find out
> whether anything that could modify the cluster offset is serialized by
> the generic block layer.  Or, well, not really, because there is no
> yield point, so it doesn't actually matter.  But it doesn't give us any
> benefit either, I think.)
>
> Max

It doesn't bring any performance benefit, but it allows splitting the
normal-cluster-read path out into a separate function.

Hmm, in the future we can do decryption in threads (like the compress
threads), and that should improve performance.
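A rough stand-alone sketch of that idea (this is not QEMU code: fake_crypt() is a trivial XOR placeholder standing in for qcrypto_block_decrypt(), and all names here are hypothetical):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define CLUSTER_SIZE 4096
#define NUM_TASKS    8
#define NUM_THREADS  4

static uint8_t clusters[NUM_TASKS][CLUSTER_SIZE];
static int next_task;                 /* index of the next unclaimed cluster */
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;

/* placeholder for qcrypto_block_decrypt(); XOR is its own inverse */
static void fake_crypt(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        buf[i] ^= 0x5a;
    }
}

static void *decrypt_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&task_lock);
        if (next_task >= NUM_TASKS) {
            pthread_mutex_unlock(&task_lock);
            return NULL;              /* no clusters left to process */
        }
        uint8_t *buf = clusters[next_task++];
        pthread_mutex_unlock(&task_lock);

        fake_crypt(buf, CLUSTER_SIZE);    /* CPU-heavy work outside the lock */
    }
}

/* "Encrypt" every cluster, decrypt them on NUM_THREADS workers in
 * parallel, and return 0 if every cluster round-tripped back to its
 * plaintext pattern. */
int run_parallel_decrypt(void)
{
    pthread_t th[NUM_THREADS];

    next_task = 0;
    for (int i = 0; i < NUM_TASKS; i++) {
        memset(clusters[i], i, CLUSTER_SIZE);
        fake_crypt(clusters[i], CLUSTER_SIZE);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&th[i], NULL, decrypt_worker, NULL);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(th[i], NULL);
    }
    for (int i = 0; i < NUM_TASKS; i++) {
        for (int j = 0; j < CLUSTER_SIZE; j++) {
            if (clusters[i][j] != (uint8_t)i) {
                return -1;
            }
        }
    }
    return 0;
}
```

A real version would also have to decide who owns each task's bounce buffer, which is exactly the allocation question that comes up later in the thread.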

>
>> Max
>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> ---
>>>   block/qcow2.c | 14 ++++++++------
>>>   1 file changed, 8 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index ec9e6238a0..41def07c67 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -1880,9 +1880,11 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>>               break;
>>>   
>>>           case QCOW2_CLUSTER_NORMAL:
>>> +            qemu_co_mutex_unlock(&s->lock);
>>> +
>>>               if ((cluster_offset & 511) != 0) {
>>>                   ret = -EIO;
>>> -                goto fail;
>>> +                goto fail_nolock;
>>>               }
>>>   
>>>               if (bs->encrypted) {
>>> @@ -1899,7 +1901,7 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>>                                               * s->cluster_size);
>>>                       if (cluster_data == NULL) {
>>>                           ret = -ENOMEM;
>>> -                        goto fail;
>>> +                        goto fail_nolock;
>>>                       }
>>>                   }
>>>   
>>> @@ -1909,13 +1911,11 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>>               }
>>>   
>>>               BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
>>> -            qemu_co_mutex_unlock(&s->lock);
>>>               ret = bdrv_co_preadv(bs->file,
>>>                                    cluster_offset + offset_in_cluster,
>>>                                    cur_bytes, &hd_qiov, 0);
>>> -            qemu_co_mutex_lock(&s->lock);
>>>               if (ret < 0) {
>>> -                goto fail;
>>> +                goto fail_nolock;
>>>               }
>>>               if (bs->encrypted) {
>>>                   assert(s->crypto);
>>> @@ -1929,10 +1929,11 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>>                                             cur_bytes,
>>>                                             NULL) < 0) {
>>>                       ret = -EIO;
>>> -                    goto fail;
>>> +                    goto fail_nolock;
>>>                   }
>>>                   qemu_iovec_from_buf(qiov, bytes_done, cluster_data, cur_bytes);
>>>               }
>>> +            qemu_co_mutex_lock(&s->lock);
>>>               break;
>>>   
>>>           default:
>>> @@ -1950,6 +1951,7 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
>>>   fail:
>>>       qemu_co_mutex_unlock(&s->lock);
>>>   
>>> +fail_nolock:
>>>       qemu_iovec_destroy(&hd_qiov);
>>>       qemu_vfree(cluster_data);
>>>   
>>>
>>
>


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv
  2018-10-01 15:39       ` Max Reitz
@ 2018-10-01 16:00         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-10-01 16:00 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, den

01.10.2018 18:39, Max Reitz wrote:
> On 01.10.18 17:14, Vladimir Sementsov-Ogievskiy wrote:
>> 27.09.2018 20:35, Max Reitz wrote:
>>> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>>> Memory allocation may become less efficient for encrypted case. It's a
>>>> payment for further asynchronous scheme.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> ---
>>>>   block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
>>>>   1 file changed, 64 insertions(+), 50 deletions(-)
>>>>
>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index 65e3ca00e2..5e7f2ee318 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -1808,6 +1808,67 @@ out:
>>>>       return ret;
>>>>   }
>>>>   
>>>> +static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
>>>> +                                               uint64_t file_cluster_offset,
>>>> +                                               uint64_t offset,
>>>> +                                               uint64_t bytes,
>>>> +                                               QEMUIOVector *qiov,
>>>> +                                               uint64_t qiov_offset)
>>>> +{
>>>> +    int ret;
>>>> +    BDRVQcow2State *s = bs->opaque;
>>>> +    void *crypt_buf = NULL;
>>>> +    QEMUIOVector hd_qiov;
>>>> +    int offset_in_cluster = offset_into_cluster(s, offset);
>>>> +
>>>> +    if ((file_cluster_offset & 511) != 0) {
>>>> +        return -EIO;
>>>> +    }
>>>> +
>>>> +    qemu_iovec_init(&hd_qiov, qiov->niov);
>>> So you're not just re-allocating the bounce buffer for every single
>>> call, but also this I/O vector.  Hm.
>>>
>>>> +    if (bs->encrypted) {
>>>> +        assert(s->crypto);
>>>> +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>>>> +
>>>> +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
>>> 1. Why did you remove the comment?
>>>
>>> 2. The check for whether crypt_buf was successfully allocated is missing.
>>>
>>> 3. Do you have any benchmarks for encrypted images?  Having to allocate
>>> the bounce buffer for potentially every single cluster sounds rather bad
>>> to me.
>> Hmm, no excuses. 1- Will fix, 2 - will fix, 3 - will fix or at least
>> test the performance.
>>
>>>> +        qemu_iovec_add(&hd_qiov, crypt_buf, bytes);
>>>> +    } else {
>>>> +        qemu_iovec_concat(&hd_qiov, qiov, qiov_offset, bytes);
>>> qcow2_co_preadv() continues to do this as well.  That's superfluous now.
>> hd_qiov is local here.. or what do you mean?
> qcow2_co_preadv() continues to have its own hd_qiov and appends qiov to
> it (just like here), but it's unused for normal clusters.  So it doesn't
> have to do that for normal clusters.

Ah, yes, I understand. I'll think about how to handle it properly.

>
>>>> +    }
>>>> +
>>>> +    BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
>>>> +    ret = bdrv_co_preadv(bs->file,
>>>> +                         file_cluster_offset + offset_in_cluster,
>>>> +                         bytes, &hd_qiov, 0);
>>>> +    if (ret < 0) {
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    if (bs->encrypted) {
>>>> +        assert(s->crypto);
>>>> +        assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
>>>> +        assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
>>>> +        if (qcrypto_block_decrypt(s->crypto,
>>>> +                                  (s->crypt_physical_offset ?
>>>> +                                   file_cluster_offset + offset_in_cluster :
>>>> +                                   offset),
>>>> +                                  crypt_buf,
>>>> +                                  bytes,
>>>> +                                  NULL) < 0) {
>>> What's the reason against decrypting this in-place in qiov so we could
>>> save the bounce buffer?  We allow offsets in clusters, so why can't we
>>> just call this function once per involved I/O vector entry?
>> Isn't it unsafe to do decryption in guest buffers?
> I don't know.  Why would it be?  Because...
>
>>> Max
>>>
>>>> +            ret = -EIO;
>>>> +            goto out;
>>>> +        }
>>>> +        qemu_iovec_from_buf(qiov, qiov_offset, crypt_buf, bytes);
> ...we're putting the decrypted content there anyway.
>
> The only issue I see is that the guest may see encrypted content for a
> short duration.  Hm.  Maybe we don't want that.  In which case it
> probably has to stay as it is.

Yes, I mean exactly that.

>
> Max
>


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Qemu-devel] [PATCH 4/7] qcow2: async scheme for qcow2_co_preadv
  2018-10-01 15:49       ` Max Reitz
@ 2018-10-01 16:17         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-10-01 16:17 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, den

01.10.2018 18:49, Max Reitz wrote:
> On 01.10.18 17:33, Vladimir Sementsov-Ogievskiy wrote:
>> 27.09.2018 21:35, Max Reitz wrote:
>>> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>>> Start several async requests instead of read chunk by chunk.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> ---
>>>>    block/qcow2.c | 208 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>>>    1 file changed, 204 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index 5e7f2ee318..a0df8d4e50 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -1869,6 +1869,197 @@ out:
>>>>        return ret;
>>>>    }
>>>>    +typedef struct Qcow2WorkerTask {
>>>> +    uint64_t file_cluster_offset;
>>>> +    uint64_t offset;
>>>> +    uint64_t bytes;
>>>> +    uint64_t bytes_done;
>>>> +} Qcow2WorkerTask;
>>> Why don't you make this a union of request-specific structs?
>> ok, will try
>>
>>>> +
>>>> +typedef int (*Qcow2DoWorkFunc)(BlockDriverState *bs, QEMUIOVector *qiov,
>>>> +                               Qcow2WorkerTask *task);
>>>> +
>>>> +typedef struct Qcow2RWState {
>>>> +    BlockDriverState *bs;
>>>> +    QEMUIOVector *qiov;
>>>> +    uint64_t bytes;
>>> Maybe make it total_bytes so it doesn't conflict with the value in
>>> Qcow2WorkerTask?
>> ok
>>
>>>> +    int ret;
>>>> +    bool waiting_one;
>>>> +    bool waiting_all;
>>>> +    bool finalize;
>>>> +    Coroutine *co;
>>>> +    QSIMPLEQ_HEAD(, Qcow2Worker) free_workers;
>>>> +    QSIMPLEQ_HEAD(, Qcow2Worker) busy_workers;
>>>> +    int online_workers;
>>>> +    Qcow2DoWorkFunc do_work_func;
>>>> +} Qcow2RWState;
>>>> +
>>>> +typedef struct Qcow2Worker {
>>>> +    Qcow2RWState *rws;
>>>> +    Coroutine *co;
>>>> +    Qcow2WorkerTask task;
>>>> +    bool busy;
>>>> +    QSIMPLEQ_ENTRY(Qcow2Worker) entry;
>>>> +} Qcow2Worker;
>>>> +#define QCOW2_MAX_WORKERS 64
>>> That's really a bit hidden here.  I think it should go into the header.
>>>
>>> Also I'm not quite sure about the number.  In other places we've always
>>> used 16.
>>>
>>> (With the encryption code always allocating a new bounce buffer, this
>>> can mean quite a bit of memory usage.)
>> No doubts.
>>
>>>> +
>>>> +static coroutine_fn void qcow2_rw_worker(void *opaque);
>>>> +static Qcow2Worker *qcow2_new_worker(Qcow2RWState *rws)
>>>> +{
>>>> +    Qcow2Worker *w = g_new0(Qcow2Worker, 1);
>>>> +    w->rws = rws;
>>>> +    w->co = qemu_coroutine_create(qcow2_rw_worker, w);
>>>> +
>>>> +    return w;
>>>> +}
>>>> +
>>>> +static void qcow2_free_worker(Qcow2Worker *w)
>>>> +{
>>>> +    g_free(w);
>>>> +}
>>>> +
>>>> +static coroutine_fn void qcow2_rw_worker(void *opaque)
>>>> +{
>>>> +    Qcow2Worker *w = opaque;
>>>> +    Qcow2RWState *rws = w->rws;
>>>> +
>>>> +    rws->online_workers++;
>>>> +
>>>> +    while (!rws->finalize) {
>>>> +        int ret = rws->do_work_func(rws->bs, rws->qiov, &w->task);
>>>> +        if (ret < 0 && rws->ret == 0) {
>>>> +            rws->ret = ret;
>>>> +        }
>>>> +
>>>> +        if (rws->waiting_all || rws->ret < 0) {
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        w->busy = false;
>>>> +        QSIMPLEQ_REMOVE(&rws->busy_workers, w, Qcow2Worker, entry);
>>>> +        QSIMPLEQ_INSERT_TAIL(&rws->free_workers, w, entry);
>>>> +        if (rws->waiting_one) {
>>>> +            rws->waiting_one = false;
>>>> +            /* we must unset it here, to prevent queuing rws->co in several
>>>> +             * workers (it may happen if other worker already waits us on mutex,
>>>> +             * so it will be entered after our yield and before rws->co enter)
>>>> +             *
>>>> +             * TODO: rethink this comment, as here (and in other places in the
>>>> +             * file) we moved from qemu_coroutine_add_next to aio_co_wake.
>>>> +             */
>>>> +            aio_co_wake(rws->co);
>>>> +        }
>>>> +
>>>> +        qemu_coroutine_yield();
>>>> +    }
>>>> +
>>>> +    if (w->busy) {
>>>> +        w->busy = false;
>>>> +        QSIMPLEQ_REMOVE(&rws->busy_workers, w, Qcow2Worker, entry);
>>>> +    }
>>>> +    qcow2_free_worker(w);
>>>> +    rws->online_workers--;
>>>> +
>>>> +    if (rws->waiting_all && rws->online_workers == 0) {
>>>> +        aio_co_wake(rws->co);
>>>> +    }
>>>> +}
>>>> +
>>>> +static coroutine_fn void qcow2_rws_add_task(Qcow2RWState *rws,
>>>> +                                            uint64_t file_cluster_offset,
>>>> +                                            uint64_t offset,
>>>> +                                            uint64_t bytes,
>>>> +                                            uint64_t bytes_done)
>>> I'd propose just taking a const Qcow2WorkerTask * here.  (Makes even
>>> more sense if you make it a union.)
>> ok, I'll try this way
>>
>>>> +{
>>>> +    Qcow2Worker *w;
>>>> +
>>>> +    assert(rws->co == qemu_coroutine_self());
>>>> +
>>>> +    if (bytes_done == 0 && bytes == rws->bytes) {
>>>> +        Qcow2WorkerTask task = {
>>>> +            .file_cluster_offset = file_cluster_offset,
>>>> +            .offset = offset,
>>>> +            .bytes = bytes,
>>>> +            .bytes_done = bytes_done
>>>> +        };
>>>> +        rws->ret = rws->do_work_func(rws->bs, rws->qiov, &task);
>>> (If so, you'd just pass the pointer along here)
>>>
>>>> +        return;
>>>> +    }
>>> I like this fast path, but I think it deserves a small comment.  (That
>>> is a fast path and bypasses the whole worker infrastructure.)
>>>
>>>> +
>>>> +    if (!QSIMPLEQ_EMPTY(&rws->free_workers)) {
>>>> +        w = QSIMPLEQ_FIRST(&rws->free_workers);
>>>> +        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
>>>> +    } else if (rws->online_workers < QCOW2_MAX_WORKERS) {
>>>> +        w = qcow2_new_worker(rws);
>>>> +    } else {
>>>> +        rws->waiting_one = true;
>>>> +        qemu_coroutine_yield();
>>>> +        assert(!rws->waiting_one); /* already unset by worker */
>>> Sometimes I hate coroutines.  OK.  So, how does the yield ensure that
>>> any worker is scheduled?  Doesn't yield just give control to the parent?
>> Hm, I don't follow. All workers are busy; we are sure of that, because
>> there are no free workers and we can't create one more (the second
>> condition isn't satisfied either). So we give control to the parent,
>> and only a worker can wake us up.
> Ah, I see.  And then something at the bottom just continues to ppoll()
> or whatever.
>
>>> Right now I think it would be clearer to me if you'd just wake all busy
>>> coroutines (looping over them) until one has settled.
>> But all workers are busy; we should not touch them (they may have
>> yielded in an I/O operation).
> I would have assumed that if they are yielded in an I/O operation, they
> could handle spurious wakeups.  But I'm very likely wrong.

Hm, as I understand it, they don't. At least laio_co_submit() (and
therefore the file-posix _co_ rw operations) doesn't support this: it
will just return EINPROGRESS, and at some point the real AIO operation
will complete and enter the same coroutine, which may already have
finished or be yielded at another point.

The only function I know of that supports third-party enters is
co_aio_sleep_ns(); it just removes its timer on wake-up after the yield.
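The contrast can be sketched in pseudocode (paraphrased, not actual QEMU code; the function names follow the QEMU sources from memory, so treat the details as approximate):

```
/* co_aio_sleep_ns(): safe to enter from a third party.
 * The coroutine cleans up its own wake-up source after the yield,
 * so an early enter merely cancels the timer ahead of time. */
co_aio_sleep_ns():
    arm timer whose callback does aio_co_wake(self)
    qemu_coroutine_yield()
    timer_del()            /* runs on ANY wake-up, timely or spurious */

/* linux-aio read/write: NOT safe to enter from a third party.
 * Only the completion path may re-enter; a spurious enter would
 * resume the request while the kernel still owns its memory. */
laio_co_submit():
    io_submit(...)
    qemu_coroutine_yield() /* woken solely by the AIO completion handler */
    return result
```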

>
>>> This would also save you the aio_co_wake() in the worker itself, as
>>> they'd just have to yield in all cases.
>>>
>>>> +
>>>> +        w = QSIMPLEQ_FIRST(&rws->free_workers);
>>>> +        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
>>>> +    }
>>>> +    w->busy = true;
>>>> +    QSIMPLEQ_INSERT_TAIL(&rws->busy_workers, w, entry);
>>>> +
>>>> +    w->task.file_cluster_offset = file_cluster_offset;
>>>> +    w->task.offset = offset;
>>>> +    w->task.bytes = bytes;
>>>> +    w->task.bytes_done = bytes_done;
>>> (And you'd copy it with w->task = *task here)
>>>
>>>> +
>>>> +    qemu_coroutine_enter(w->co);
>>>> +}
>>>> +
>>>> +static void qcow2_init_rws(Qcow2RWState *rws, BlockDriverState *bs,
>>>> +                           QEMUIOVector *qiov, uint64_t bytes,
>>>> +                           Qcow2DoWorkFunc do_work_func)
>>>> +{
>>>> +    memset(rws, 0, sizeof(*rws));
>>>> +    rws->bs = bs;
>>>> +    rws->qiov = qiov;
>>>> +    rws->bytes = bytes;
>>>> +    rws->co = qemu_coroutine_self();
>>>> +    rws->do_work_func = do_work_func;
>>> Maybe you'd like to use
>>>
>>> *rws = (Qcow2RWState) {
>>>       .bs = bs,
>>>       ...
>>> };
>>>
>>> Then you could save yourself the memset().
>> ok
>>
>>>> +    QSIMPLEQ_INIT(&rws->free_workers);
>>>> +    QSIMPLEQ_INIT(&rws->busy_workers);
>>>> +}
>>>> +
>>>> +static void qcow2_finalize_rws(Qcow2RWState *rws)
>>>> +{
>>>> +    assert(rws->co == qemu_coroutine_self());
>>>> +
>>>> +    /* kill waiting workers */
>>>> +    rws->finalize = true;
>>>> +    while (!QSIMPLEQ_EMPTY(&rws->free_workers)) {
>>>> +        Qcow2Worker *w = QSIMPLEQ_FIRST(&rws->free_workers);
>>>> +        QSIMPLEQ_REMOVE_HEAD(&rws->free_workers, entry);
>>>> +        qemu_coroutine_enter(w->co);
>>>> +    }
>>>> +
>>>> +    /* wait others */
>>>> +    if (rws->online_workers > 0) {
>>>> +        rws->waiting_all = true;
>>>> +        qemu_coroutine_yield();
>>>> +        rws->waiting_all = false;
>>> Why don't you enter the busy workers here?  (And keep doing so until
>>> online_workers is 0.)  That way, you could save yourself the other
>>> aio_co_wake() in qcow2_rw_worker().
>> We shouldn't enter busy workers, as they may have yielded on an I/O
>> operation. The operation should complete.
> Yes.
>
> I think my misunderstanding was that I like to assume that everything
> that yields checks whether I/O is done by itself, whereas in reality
> that's probably usually done with some central polling and those
> yielding coroutines assume they only wake up when that polling assures
> them the I/O is done.
>
> So I have no objections to the control flow now.
>
> Max
>


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv
       [not found]   ` <6e19aaeb-8acc-beb9-5ece-9ae6101637a9@redhat.com>
  2018-10-01 15:14     ` Vladimir Sementsov-Ogievskiy
@ 2018-11-01 12:17     ` Vladimir Sementsov-Ogievskiy
  2018-11-07 13:51       ` Max Reitz
  2018-11-07 18:16       ` Kevin Wolf
  1 sibling, 2 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-11-01 12:17 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, Denis Lunev

[-- Attachment #1: Type: text/plain, Size: 9391 bytes --]

27.09.2018 20:35, Max Reitz wrote:

On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:


Memory allocation may become less efficient for encrypted case. It's a
payment for further asynchronous scheme.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 64 insertions(+), 50 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 65e3ca00e2..5e7f2ee318 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1808,6 +1808,67 @@ out:
     return ret;
 }

+static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
+                                               uint64_t file_cluster_offset,
+                                               uint64_t offset,
+                                               uint64_t bytes,
+                                               QEMUIOVector *qiov,
+                                               uint64_t qiov_offset)
+{
+    int ret;
+    BDRVQcow2State *s = bs->opaque;
+    void *crypt_buf = NULL;
+    QEMUIOVector hd_qiov;
+    int offset_in_cluster = offset_into_cluster(s, offset);
+
+    if ((file_cluster_offset & 511) != 0) {
+        return -EIO;
+    }
+
+    qemu_iovec_init(&hd_qiov, qiov->niov);


So you're not just re-allocating the bounce buffer for every single
call, but also this I/O vector.  Hm.



+    if (bs->encrypted) {
+        assert(s->crypto);
+        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
+
+        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);


1. Why did you remove the comment?

2. The check for whether crypt_buf was successfully allocated is missing.

3. Do you have any benchmarks for encrypted images?  Having to allocate
the bounce buffer for potentially every single cluster sounds rather bad
to me.

Hmm, about this.

Now we have compress threads to do several compress operations in parallel, and I plan to do the same thing for decompression, encryption and decryption. So we'll definitely need several cluster-size buffers for all these operations. How many? The first thought is just as many as the maximum number of threads (or twice as many, to have in and out buffers for compression). But that's wrong; consider, for example, encryption on write:

1. get a free thread for encryption (this may yield, if the maximum thread number is reached and all threads are busy, until one becomes free)
2. get buffer (per thread)
3. thread handles encryption
a. free thread here?
4. write encrypted data to underlying file
b. free thread here?

Of course, we want to free the thread as soon as possible, in "a.", not in "b.". And this means that we should free the buffer in "a." too, so we should copy the data to a locally allocated buffer or, better, just use a local buffer from the beginning, so that the thread doesn't own its own buffer.

So, what are the options we have?

1. the simplest thing: just allocate buffers for each request locally, as is already done for compression.
pros: simple
cons: allocating/freeing every time may hurt performance, as you noticed

2. keep a pool of such buffers, so when buffer freed, it's just queued to the list of free buffers
pros:
   reduce number of real alloc/free calls
   we can limit the pool maximum size
   we can allocate new buffers on demand, not the whole pool at start
cons:
   hmm, so it looks like a custom allocator. Is it really needed? Is it really faster than just using the system allocator and calling alloc/free every time we need a buffer?

3. shouldn't we just use g_slice_alloc() instead of inventing it in "2."?

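For what it's worth, option "2." is small enough to sketch as stand-alone C. This is a hypothetical illustration (all names are mine, not QEMU's): essentially list_alloc() from the attached bench.c, plus a cap on how many buffers stay cached:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>

/* A capped free-list pool of fixed-size, aligned buffers (option "2."). */
typedef struct BufNode {
    struct BufNode *next;
} BufNode;

typedef struct BufPool {
    BufNode *free_list;   /* buffers ready for reuse */
    size_t buf_size;      /* must be >= sizeof(BufNode); 64K in qcow2 terms */
    size_t cached;        /* buffers currently sitting in free_list */
    size_t max_cached;    /* cap: beyond this, buffers are really freed */
} BufPool;

void *buf_pool_get(BufPool *p)
{
    if (p->free_list) {
        BufNode *n = p->free_list;      /* reuse a cached buffer */
        p->free_list = n->next;
        p->cached--;
        return n;
    }
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, p->buf_size) != 0) {
        return NULL;                    /* allocation failure */
    }
    return buf;
}

void buf_pool_put(BufPool *p, void *buf)
{
    if (p->cached >= p->max_cached) {
        free(buf);                      /* pool full: return to the system */
        return;
    }
    BufNode *n = buf;                   /* store the link in the buffer itself */
    n->next = p->free_list;
    p->free_list = n;
    p->cached++;
}

void buf_pool_fini(BufPool *p)
{
    while (p->free_list) {
        BufNode *n = p->free_list;
        p->free_list = n->next;
        free(n);
    }
    p->cached = 0;
}
```

Whether something like this beats plain posix_memalign() in practice is exactly what the numbers below try to measure.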
So, I've decided to do some tests, and here are the results (memcpy is for 32K; everything else allocates 64K in a loop; list_alloc is a simple implementation of "2."):

nothing: 200000000 it / 0.301361 sec = 663655696.202532 it/s
memcpy: 2000000 it / 1.633015 sec = 1224728.554970 it/s
g_malloc: 200000000 it / 5.346707 sec = 37406202.540530 it/s
g_slice_alloc: 200000000 it / 7.812036 sec = 25601520.402792 it/s
list_alloc: 200000000 it / 5.432771 sec = 36813626.268630 it/s
posix_memalign: 20000000 it / 1.025543 sec = 19501864.376084 it/s
aligned_alloc: 20000000 it / 1.035639 sec = 19311752.149739 it/s

So, you see that yes, list_alloc is twice as fast as posix_memalign, but on the other hand, a simple memcpy is a lot slower than either (16 times slower than posix_memalign), and that is without counting real disk I/O, which will be slower still. So, should we care about the allocation at all? What do you think?
The test is attached.





+        qemu_iovec_add(&hd_qiov, crypt_buf, bytes);
+    } else {
+        qemu_iovec_concat(&hd_qiov, qiov, qiov_offset, bytes);


qcow2_co_preadv() continues to do this as well.  That's superfluous now.



+    }
+
+    BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
+    ret = bdrv_co_preadv(bs->file,
+                         file_cluster_offset + offset_in_cluster,
+                         bytes, &hd_qiov, 0);
+    if (ret < 0) {
+        goto out;
+    }
+
+    if (bs->encrypted) {
+        assert(s->crypto);
+        assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+        assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+        if (qcrypto_block_decrypt(s->crypto,
+                                  (s->crypt_physical_offset ?
+                                   file_cluster_offset + offset_in_cluster :
+                                   offset),
+                                  crypt_buf,
+                                  bytes,
+                                  NULL) < 0) {


What's the reason against decrypting this in-place in qiov so we could
save the bounce buffer?  We allow offsets in clusters, so why can't we
just call this function once per involved I/O vector entry?

Max



+            ret = -EIO;
+            goto out;
+        }
+        qemu_iovec_from_buf(qiov, qiov_offset, crypt_buf, bytes);
+    }
+
+out:
+    qemu_vfree(crypt_buf);
+    qemu_iovec_destroy(&hd_qiov);
+
+    return ret;
+}
+
 static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
                                         uint64_t bytes, QEMUIOVector *qiov,
                                         int flags)
@@ -1819,7 +1880,6 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
     uint64_t cluster_offset = 0;
     uint64_t bytes_done = 0;
     QEMUIOVector hd_qiov;
-    uint8_t *cluster_data = NULL;

     qemu_iovec_init(&hd_qiov, qiov->niov);

@@ -1882,57 +1942,12 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
         case QCOW2_CLUSTER_NORMAL:
             qemu_co_mutex_unlock(&s->lock);

-            if ((cluster_offset & 511) != 0) {
-                ret = -EIO;
-                goto fail_nolock;
-            }
-
-            if (bs->encrypted) {
-                assert(s->crypto);
-
-                /*
-                 * For encrypted images, read everything into a temporary
-                 * contiguous buffer on which the AES functions can work.
-                 */
-                if (!cluster_data) {
-                    cluster_data =
-                        qemu_try_blockalign(bs->file->bs,
-                                            QCOW_MAX_CRYPT_CLUSTERS
-                                            * s->cluster_size);
-                    if (cluster_data == NULL) {
-                        ret = -ENOMEM;
-                        goto fail_nolock;
-                    }
-                }
-
-                assert(cur_bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
-                qemu_iovec_reset(&hd_qiov);
-                qemu_iovec_add(&hd_qiov, cluster_data, cur_bytes);
-            }
-
-            BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
-            ret = bdrv_co_preadv(bs->file,
-                                 cluster_offset + offset_in_cluster,
-                                 cur_bytes, &hd_qiov, 0);
+            ret = qcow2_co_preadv_normal(bs, cluster_offset,
+                                         offset, cur_bytes, qiov, bytes_done);
             if (ret < 0) {
                 goto fail_nolock;
             }
-            if (bs->encrypted) {
-                assert(s->crypto);
-                assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
-                assert((cur_bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
-                if (qcrypto_block_decrypt(s->crypto,
-                                          (s->crypt_physical_offset ?
-                                           cluster_offset + offset_in_cluster :
-                                           offset),
-                                          cluster_data,
-                                          cur_bytes,
-                                          NULL) < 0) {
-                    ret = -EIO;
-                    goto fail_nolock;
-                }
-                qemu_iovec_from_buf(qiov, bytes_done, cluster_data, cur_bytes);
-            }
+
             qemu_co_mutex_lock(&s->lock);
             break;

@@ -1953,7 +1968,6 @@ fail:

 fail_nolock:
     qemu_iovec_destroy(&hd_qiov);
-    qemu_vfree(cluster_data);

     return ret;
 }

--
Best regards,
Vladimir

[-- Attachment #2: bench.c --]
[-- Type: text/x-csrc; name="bench.c", Size: 2279 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>

#include <glib.h>
#include <gmodule.h>

/* build: gcc -O2 bench.c $(pkg-config --cflags --libs glib-2.0) -o bench */
double get_time(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + t.tv_usec*1e-6;
}

double START, T;
int COUNT, COUNTER;

#define start(name, count) \
    printf("%s: %d it", name, count); \
    fflush(stdout); \
    START = get_time(); \
    COUNT = count; \
    COUNTER = count; \
    while (COUNTER--) {

#define end() \
    } \
    T = get_time() - START; \
    printf(" / %f sec = %f it/s\n", T, COUNT / T);

GSList *free_blocks = NULL;
int block_size = 64 * 1024;

void *list_alloc() {
    void *res;

    if (free_blocks) {
        res = free_blocks->data;
        free_blocks = g_slist_delete_link(free_blocks, free_blocks);
        return res;
    }

    posix_memalign(&res, 4096, 64 * 1024);
    return res;
}

void list_free(void *ptr) {
    free_blocks = g_slist_prepend(free_blocks, ptr);
}

void list_finalize() {
    while (free_blocks) {
        void *p = free_blocks->data;
        free_blocks = g_slist_delete_link(free_blocks, free_blocks);
        free(p);
    }
}

int main(int argc, char *argv[]) {
    int count = argc > 1 ? atoi(argv[1]) : 1000000;
    int loc[64 * 1024 / 4];
    int loc2[64 * 1024 / 4];
    int *p;

    /* touch memory so the empty loop is not optimized away */
    start("nothing", count)
        loc[0] = 1 + loc[1];
    end()

    start("memcpy", count / 100)
        loc[0] = 1 + loc[1];
        memcpy(loc, loc2, sizeof(loc) / 2);
    end()

    start("g_malloc", count)
        p = g_malloc(64 * 1024);
        p[0] = 1 + p[1];
        g_free(p);
    end()

    start("g_slice_alloc", count)
        p = g_slice_alloc(64 * 1024);
        p[0] = 1 + p[1];
        g_slice_free1(64 * 1024, p);
    end()

    start("list_alloc", count)
        p = list_alloc();
        p[0] = 1 + p[1];
        list_free(p);
    end()
    list_finalize();

    start("posix_memalign", count / 10)
        posix_memalign((void **)&p, 4096, 64 * 1024);
        p[0] = 1 + p[1];
        free(p);
    end()

    start("aligned_alloc", count / 10)
        p = aligned_alloc(4096, 64 * 1024);
        p[0] = 1 + p[1];
        free(p);
    end()

    return 0;
}

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv
  2018-11-01 12:17     ` Vladimir Sementsov-Ogievskiy
@ 2018-11-07 13:51       ` Max Reitz
  2018-11-07 18:16       ` Kevin Wolf
  1 sibling, 0 replies; 30+ messages in thread
From: Max Reitz @ 2018-11-07 13:51 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, Denis Lunev


On 01.11.18 13:17, Vladimir Sementsov-Ogievskiy wrote:
> 27.09.2018 20:35, Max Reitz wrote:
>> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>> Memory allocation may become less efficient for encrypted case. It's a
>>> payment for further asynchronous scheme.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>  block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
>>>  1 file changed, 64 insertions(+), 50 deletions(-)
>>>
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index 65e3ca00e2..5e7f2ee318 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -1808,6 +1808,67 @@ out:
>>>      return ret;
>>>  }
>>>  
>>> +static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
>>> +                                               uint64_t file_cluster_offset,
>>> +                                               uint64_t offset,
>>> +                                               uint64_t bytes,
>>> +                                               QEMUIOVector *qiov,
>>> +                                               uint64_t qiov_offset)
>>> +{
>>> +    int ret;
>>> +    BDRVQcow2State *s = bs->opaque;
>>> +    void *crypt_buf = NULL;
>>> +    QEMUIOVector hd_qiov;
>>> +    int offset_in_cluster = offset_into_cluster(s, offset);
>>> +
>>> +    if ((file_cluster_offset & 511) != 0) {
>>> +        return -EIO;
>>> +    }
>>> +
>>> +    qemu_iovec_init(&hd_qiov, qiov->niov);
>> So you're not just re-allocating the bounce buffer for every single
>> call, but also this I/O vector.  Hm.
>>
>>> +    if (bs->encrypted) {
>>> +        assert(s->crypto);
>>> +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>>> +
>>> +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
>> 1. Why did you remove the comment?
>>
>> 2. The check for whether crypt_buf was successfully allocated is missing.
>>
>> 3. Do you have any benchmarks for encrypted images?  Having to allocate
>> the bounce buffer for potentially every single cluster sounds rather bad
>> to me.
> 
> Hmm, about this.
> 
> Now we have compress threads to do several compress operations in
> parallel. And I plan to do the same thing for decompression, encryption
> and decryption. So, we'll definitely need several cluster-size buffers
> to do all these operations. How many? The first thought is just as many
> as maximum number of threads (or twice as many, to have in and out
> buffers for compression). But it's wrong, consider for example
> encryption on write:
> 
> 1. get free thread for encryption (may be yield if maximum thread number
> achieved, waiting until all threads are busy)
> 2. get buffer (per thread)
> 3. thread handles encryption
> a. free thread here?
> 4. write encrypted data to underlying file
> b. free thread here?
> 
> Of course, we want free thread as soon as possible, in "a." not in "b.".
> And this mean, that we should free buffer in "a." too, so we should copy
> data to locally allocated buffer, or, better, we just use local buffer
> from the beginning, and thread don't own it's own buffer..
> 
> So, what are the options we have?
> 
> 1. the most simple thing: just allocate buffers for each request
> locally, like it is already done for compression.
> pros: simple
> cons: allocate/free every time may influence performance, as you noticed
> 
> 2. keep a pool of such buffers, so when buffer freed, it's just queued
> to the list of free buffers
> pros:
>    reduce number of real alloc/free calls
>    we can limit the pool maximum size
>    we can allocate new buffers on demand, not the whole pool at start
> cons:
>    hmm, so, it looks like custom allocator. Is it really needed? Is it
> really faster than just use system allocator and call alloc/free every
> time we need a buffer?
> 
> 3. should not we use g_slice_alloc instead of inventing it in "2."
> 
> So, I've decided to do some tests, and here are results (memcpy is for
> 32K, all other things allocates 64K in a loop, list_alloc is a simple
> realization of "2.":
> 
> nothing: 200000000 it / 0.301361 sec = 663655696.202532 it/s
> memcpy: 2000000 it / 1.633015 sec = 1224728.554970 it/s
> g_malloc: 200000000 it / 5.346707 sec = 37406202.540530 it/s
> g_slice_alloc: 200000000 it / 7.812036 sec = 25601520.402792 it/s
> list_alloc: 200000000 it / 5.432771 sec = 36813626.268630 it/s
> posix_memalign: 20000000 it / 1.025543 sec = 19501864.376084 it/s
> aligned_alloc: 20000000 it / 1.035639 sec = 19311752.149739 it/s
> 
> So, you see that yes, list_alloc is twice as fast as posix_memalign, but
> on the other hand, simple memcpy is a lot slower (16 times slower) (and
> I don't say about real disk IO which will be even more slower), so,
> should we care about allocation, what do you thing?
> test is attached.

Agreed.  Thanks for testing.

I am a bit worried about the maximum memory usage still.  With 16
workers, having bounce buffers can mean up to 32 MB of memory usage with
the default cluster size (and 1 GB with 2 MB clusters).  Then again, it
should be easy to limit memory usage even without a custom pool (just
some variable that says how much memory there is left).  Also, it'd be
OK to do that on top.

Max



* Re: [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv
  2018-11-01 12:17     ` Vladimir Sementsov-Ogievskiy
  2018-11-07 13:51       ` Max Reitz
@ 2018-11-07 18:16       ` Kevin Wolf
  2018-11-08 10:02         ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 30+ messages in thread
From: Kevin Wolf @ 2018-11-07 18:16 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Max Reitz, qemu-devel, qemu-block, Denis Lunev

(Broken quoting in text/plain again)

Am 01.11.2018 um 13:17 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 27.09.2018 20:35, Max Reitz wrote:
> 
>     On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
> 
>         Memory allocation may become less efficient for encrypted case. It's a
>         payment for further asynchronous scheme.
> 
>         Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>         ---
>          block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
>          1 file changed, 64 insertions(+), 50 deletions(-)
> 
>         diff --git a/block/qcow2.c b/block/qcow2.c
>         index 65e3ca00e2..5e7f2ee318 100644
>         --- a/block/qcow2.c
>         +++ b/block/qcow2.c
>         @@ -1808,6 +1808,67 @@ out:
>              return ret;
>          }
> 
>         +static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
>         +                                               uint64_t file_cluster_offset,
>         +                                               uint64_t offset,
>         +                                               uint64_t bytes,
>         +                                               QEMUIOVector *qiov,
>         +                                               uint64_t qiov_offset)
>         +{
>         +    int ret;
>         +    BDRVQcow2State *s = bs->opaque;
>         +    void *crypt_buf = NULL;
>         +    QEMUIOVector hd_qiov;
>         +    int offset_in_cluster = offset_into_cluster(s, offset);
>         +
>         +    if ((file_cluster_offset & 511) != 0) {
>         +        return -EIO;
>         +    }
>         +
>         +    qemu_iovec_init(&hd_qiov, qiov->niov);
> 
>     So you're not just re-allocating the bounce buffer for every single
>     call, but also this I/O vector.  Hm.

And this one is actually at least a little more serious, I think.

Decryption and decompression (including copying between the original
qiov and the bounce buffer) are already very slow, so allocating a
bounce buffer or not shouldn't make any noticable difference.

But I'd like to keep allocations out of the fast path as much as we can.

>         +    if (bs->encrypted) {
>         +        assert(s->crypto);
>         +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>         +
>         +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
> 
>     1. Why did you remove the comment?
> 
>     2. The check for whether crypt_buf was successfully allocated is missing.
> 
>     3. Do you have any benchmarks for encrypted images?  Having to allocate
>     the bounce buffer for potentially every single cluster sounds rather bad
>     to me.

Actually, benchmarks for normal, fully allocated images would be a
little more interesting. Though I'm not sure if qcow2 actually performs
well enough that we'll see any difference even there (when we measured
memory allocation overhead etc. for raw images in the context of
comparing coroutines to AIO callback styles, the difference was already
hard to see).

Kevin


* Re: [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv
  2018-11-07 18:16       ` Kevin Wolf
@ 2018-11-08 10:02         ` Vladimir Sementsov-Ogievskiy
  2018-11-08 10:33           ` Kevin Wolf
  0 siblings, 1 reply; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-11-08 10:02 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Max Reitz, qemu-devel, qemu-block, Denis Lunev

07.11.2018 21:16, Kevin Wolf wrote:
> (Broken quoting in text/plain again)
>
> Am 01.11.2018 um 13:17 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 27.09.2018 20:35, Max Reitz wrote:
>>
>>      On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>
>>          Memory allocation may become less efficient for encrypted case. It's a
>>          payment for further asynchronous scheme.
>>
>>          Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>          ---
>>           block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
>>           1 file changed, 64 insertions(+), 50 deletions(-)
>>
>>          diff --git a/block/qcow2.c b/block/qcow2.c
>>          index 65e3ca00e2..5e7f2ee318 100644
>>          --- a/block/qcow2.c
>>          +++ b/block/qcow2.c
>>          @@ -1808,6 +1808,67 @@ out:
>>               return ret;
>>           }
>>
>>          +static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
>>          +                                               uint64_t file_cluster_offset,
>>          +                                               uint64_t offset,
>>          +                                               uint64_t bytes,
>>          +                                               QEMUIOVector *qiov,
>>          +                                               uint64_t qiov_offset)
>>          +{
>>          +    int ret;
>>          +    BDRVQcow2State *s = bs->opaque;
>>          +    void *crypt_buf = NULL;
>>          +    QEMUIOVector hd_qiov;
>>          +    int offset_in_cluster = offset_into_cluster(s, offset);
>>          +
>>          +    if ((file_cluster_offset & 511) != 0) {
>>          +        return -EIO;
>>          +    }
>>          +
>>          +    qemu_iovec_init(&hd_qiov, qiov->niov);
>>
>>      So you're not just re-allocating the bounce buffer for every single
>>      call, but also this I/O vector.  Hm.
> And this one is actually at least a little more serious, I think.
>
> Decryption and decompression (including copying between the original
> qiov and the bounce buffer) are already very slow, so allocating a
> bounce buffer or not shouldn't make any noticable difference.
>
> But I'd like to keep allocations out of the fast path as much as we can.
>
>>          +    if (bs->encrypted) {
>>          +        assert(s->crypto);
>>          +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>>          +
>>          +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
>>
>>      1. Why did you remove the comment?
>>
>>      2. The check for whether crypt_buf was successfully allocated is missing.
>>
>>      3. Do you have any benchmarks for encrypted images?  Having to allocate
>>      the bounce buffer for potentially every single cluster sounds rather bad
>>      to me.
> Actually, benchmarks for normal, fully allocated images would be a
> little more interesting. Though I'm not sure if qcow2 actually performs
> well enough that we'll see any difference even there (when we measured
> memory allocation overhead etc. for raw images in the context of
> comparing coroutines to AIO callback styes, the difference was already
> hard to see).

Ok, I'll benchmark it and/or improve the fast path (you mean the code path
where we skip coroutine creation and handle the whole request at once,
yes?).

Hmm, all this stuff with hd_qiov in many places is due to the fact that we
don't have a qiov_offset parameter in .bdrv_co_preadv. Is it a bad idea to
add it?

Anyway, I now think it's better to start with the decompress-threads
series, as compress threads are already merged, then bring encryption to
threads too, and then return to this series. This way we will remove some
questions about allocation, and maybe it would be simpler and more
consistent to bring more things to coroutines, not only normal clusters.

>
> Kevin


-- 
Best regards,
Vladimir



* Re: [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv
  2018-11-08 10:02         ` Vladimir Sementsov-Ogievskiy
@ 2018-11-08 10:33           ` Kevin Wolf
  2018-11-08 12:36             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 30+ messages in thread
From: Kevin Wolf @ 2018-11-08 10:33 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Max Reitz, qemu-devel, qemu-block, Denis Lunev

Am 08.11.2018 um 11:02 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 07.11.2018 21:16, Kevin Wolf wrote:
> > (Broken quoting in text/plain again)
> >
> > Am 01.11.2018 um 13:17 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> 27.09.2018 20:35, Max Reitz wrote:
> >>
> >>      On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
> >>
> >>          Memory allocation may become less efficient for encrypted case. It's a
> >>          payment for further asynchronous scheme.
> >>
> >>          Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>          ---
> >>           block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
> >>           1 file changed, 64 insertions(+), 50 deletions(-)
> >>
> >>          diff --git a/block/qcow2.c b/block/qcow2.c
> >>          index 65e3ca00e2..5e7f2ee318 100644
> >>          --- a/block/qcow2.c
> >>          +++ b/block/qcow2.c
> >>          @@ -1808,6 +1808,67 @@ out:
> >>               return ret;
> >>           }
> >>
> >>          +static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
> >>          +                                               uint64_t file_cluster_offset,
> >>          +                                               uint64_t offset,
> >>          +                                               uint64_t bytes,
> >>          +                                               QEMUIOVector *qiov,
> >>          +                                               uint64_t qiov_offset)
> >>          +{
> >>          +    int ret;
> >>          +    BDRVQcow2State *s = bs->opaque;
> >>          +    void *crypt_buf = NULL;
> >>          +    QEMUIOVector hd_qiov;
> >>          +    int offset_in_cluster = offset_into_cluster(s, offset);
> >>          +
> >>          +    if ((file_cluster_offset & 511) != 0) {
> >>          +        return -EIO;
> >>          +    }
> >>          +
> >>          +    qemu_iovec_init(&hd_qiov, qiov->niov);
> >>
> >>      So you're not just re-allocating the bounce buffer for every single
> >>      call, but also this I/O vector.  Hm.
> > And this one is actually at least a little more serious, I think.
> >
> > Decryption and decompression (including copying between the original
> > qiov and the bounce buffer) are already very slow, so allocating a
> > bounce buffer or not shouldn't make any noticable difference.
> >
> > But I'd like to keep allocations out of the fast path as much as we can.
> >
> >>          +    if (bs->encrypted) {
> >>          +        assert(s->crypto);
> >>          +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
> >>          +
> >>          +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
> >>
> >>      1. Why did you remove the comment?
> >>
> >>      2. The check for whether crypt_buf was successfully allocated is missing.
> >>
> >>      3. Do you have any benchmarks for encrypted images?  Having to allocate
> >>      the bounce buffer for potentially every single cluster sounds rather bad
> >>      to me.
> > Actually, benchmarks for normal, fully allocated images would be a
> > little more interesting. Though I'm not sure if qcow2 actually performs
> > well enough that we'll see any difference even there (when we measured
> > memory allocation overhead etc. for raw images in the context of
> > comparing coroutines to AIO callback styles, the difference was already
> > hard to see).
> 
> Ok, I'll benchmark it and/or improve fast path (you mean code path, 
> where we skip coroutines creation, to handle the whole request at once, 
> yes?).

I had not so much the specific code path in mind, but rather the case of
reading unallocated clusters or reading/writing an already allocated area
of normal or zero clusters (no compression, no encryption, no other fancy
stuff). These are the cases that qcow2 images will hit the most in
practice.

> Hmm, all this stuff with hd_qiov's in many places is due to we don't
> have qiov_offset parameter in .bdrv_co_preadv.. Is it a bad idea to
> add it?

I actually thought the same yesterday, though I'm not completely sure
yet about it. It makes the interfaces a bit uglier, but it could indeed
save us some work in the I/O path, so unless we find bigger reasons
against it, maybe we should do that.

> Anyway, I now think, it's better to start with decompress-threads 
> series, as compress-threads are already merged, then bring encryption to 
> threads too, and then return to these series. This way will remove some 
> questions about allocation, and maybe it would be simpler and more
> consistent to bring more things to coroutines, not only normal clusters.

Okay, we can do that.

Kevin


* Re: [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv
  2018-11-08 10:33           ` Kevin Wolf
@ 2018-11-08 12:36             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-11-08 12:36 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Max Reitz, qemu-devel, qemu-block, Denis Lunev

08.11.2018 13:33, Kevin Wolf wrote:
> Am 08.11.2018 um 11:02 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 07.11.2018 21:16, Kevin Wolf wrote:
>>> (Broken quoting in text/plain again)
>>>
>>> Am 01.11.2018 um 13:17 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 27.09.2018 20:35, Max Reitz wrote:
>>>>
>>>>       On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>>>
>>>>           Memory allocation may become less efficient for encrypted case. It's a
>>>>           payment for further asynchronous scheme.
>>>>
>>>>           Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>           ---
>>>>            block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
>>>>            1 file changed, 64 insertions(+), 50 deletions(-)
>>>>
>>>>           diff --git a/block/qcow2.c b/block/qcow2.c
>>>>           index 65e3ca00e2..5e7f2ee318 100644
>>>>           --- a/block/qcow2.c
>>>>           +++ b/block/qcow2.c
>>>>           @@ -1808,6 +1808,67 @@ out:
>>>>                return ret;
>>>>            }
>>>>
>>>>           +static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
>>>>           +                                               uint64_t file_cluster_offset,
>>>>           +                                               uint64_t offset,
>>>>           +                                               uint64_t bytes,
>>>>           +                                               QEMUIOVector *qiov,
>>>>           +                                               uint64_t qiov_offset)
>>>>           +{
>>>>           +    int ret;
>>>>           +    BDRVQcow2State *s = bs->opaque;
>>>>           +    void *crypt_buf = NULL;
>>>>           +    QEMUIOVector hd_qiov;
>>>>           +    int offset_in_cluster = offset_into_cluster(s, offset);
>>>>           +
>>>>           +    if ((file_cluster_offset & 511) != 0) {
>>>>           +        return -EIO;
>>>>           +    }
>>>>           +
>>>>           +    qemu_iovec_init(&hd_qiov, qiov->niov);
>>>>
>>>>       So you're not just re-allocating the bounce buffer for every single
>>>>       call, but also this I/O vector.  Hm.
>>> And this one is actually at least a little more serious, I think.
>>>
>>> Decryption and decompression (including copying between the original
>>> qiov and the bounce buffer) are already very slow, so allocating a
>>> bounce buffer or not shouldn't make any noticable difference.
>>>
>>> But I'd like to keep allocations out of the fast path as much as we can.
>>>
>>>>           +    if (bs->encrypted) {
>>>>           +        assert(s->crypto);
>>>>           +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>>>>           +
>>>>           +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
>>>>
>>>>       1. Why did you remove the comment?
>>>>
>>>>       2. The check for whether crypt_buf was successfully allocated is missing.
>>>>
>>>>       3. Do you have any benchmarks for encrypted images?  Having to allocate
>>>>       the bounce buffer for potentially every single cluster sounds rather bad
>>>>       to me.
>>> Actually, benchmarks for normal, fully allocated images would be a
>>> little more interesting. Though I'm not sure if qcow2 actually performs
>>> well enough that we'll see any difference even there (when we measured
>>> memory allocation overhead etc. for raw images in the context of
> >>> comparing coroutines to AIO callback styles, the difference was already
>>> hard to see).
>>
>> Ok, I'll benchmark it and/or improve fast path (you mean code path,
>> where we skip coroutines creation, to handle the whole request at once,
>> yes?).
> 
> I had less the specific code path in mind, but the case of reading
> unallocated clusters or reading/writing an already allocated area of
> normal or zero clusters (no compression, no encryption, no other fancy
> stuff). These are the cases that qcow2 images will hit the most in
> practice.

There are already some benchmarks in the thread started from the cover 
letter of the series.

> 
> >> Hmm, all this stuff with hd_qiov's in many places is due to we don't
>> have qiov_offset parameter in .bdrv_co_preadv.. Is it a bad idea to
>> add it?
> 
> I actually thought the same yesterday, though I'm not completely sure
> yet about it. It makes the interfaces a bit uglier, but it could indeed
> save us some work in the I/O path, so unless we find bigger reasons
> against it, maybe we should do that.
> 
>> Anyway, I now think, it's better to start with decompress-threads
>> series, as compress-threads are already merged, then bring encryption to
>> threads too, and then return to these series. This way will remove some
> >> questions about allocation, and maybe it would be simpler and more
>> consistent to bring more things to coroutines, not only normal clusters.
> 
> Okay, we can do that.
> 
> Kevin
> 


-- 
Best regards,
Vladimir


end of thread, other threads:[~2018-11-08 12:36 UTC | newest]

Thread overview: 30+ messages
2018-08-07 17:43 [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Vladimir Sementsov-Ogievskiy
2018-08-07 17:43 ` [Qemu-devel] [PATCH 1/7] qcow2: move qemu_co_mutex_lock below decryption procedure Vladimir Sementsov-Ogievskiy
     [not found]   ` <8e1cc18c-307f-99b1-5892-713ebd17a15f@redhat.com>
     [not found]     ` <43277786-b6b9-e18c-b0ca-064ff7c9c0c9@redhat.com>
2018-10-01 15:56       ` Vladimir Sementsov-Ogievskiy
2018-08-07 17:43 ` [Qemu-devel] [PATCH 2/7] qcow2: bdrv_co_pwritev: move encryption code out of lock Vladimir Sementsov-Ogievskiy
2018-08-07 17:43 ` [Qemu-devel] [PATCH 3/7] qcow2: split out reading normal clusters from qcow2_co_preadv Vladimir Sementsov-Ogievskiy
     [not found]   ` <6e19aaeb-8acc-beb9-5ece-9ae6101637a9@redhat.com>
2018-10-01 15:14     ` Vladimir Sementsov-Ogievskiy
2018-10-01 15:39       ` Max Reitz
2018-10-01 16:00         ` Vladimir Sementsov-Ogievskiy
2018-11-01 12:17     ` Vladimir Sementsov-Ogievskiy
2018-11-07 13:51       ` Max Reitz
2018-11-07 18:16       ` Kevin Wolf
2018-11-08 10:02         ` Vladimir Sementsov-Ogievskiy
2018-11-08 10:33           ` Kevin Wolf
2018-11-08 12:36             ` Vladimir Sementsov-Ogievskiy
2018-08-07 17:43 ` [Qemu-devel] [PATCH 4/7] qcow2: async scheme for qcow2_co_preadv Vladimir Sementsov-Ogievskiy
     [not found]   ` <08a610aa-9c78-1c83-5e48-b93080aac87b@redhat.com>
2018-10-01 15:33     ` Vladimir Sementsov-Ogievskiy
2018-10-01 15:49       ` Max Reitz
2018-10-01 16:17         ` Vladimir Sementsov-Ogievskiy
2018-08-07 17:43 ` [Qemu-devel] [PATCH 5/7] qcow2: refactor qcow2_co_pwritev: split out qcow2_co_do_pwritev Vladimir Sementsov-Ogievskiy
     [not found]   ` <5c871ce7-2cab-f897-0b06-cbc05b9ffe97@redhat.com>
2018-10-01 15:43     ` Vladimir Sementsov-Ogievskiy
2018-10-01 15:50       ` Max Reitz
2018-08-07 17:43 ` [Qemu-devel] [PATCH 6/7] qcow2: refactor qcow2_co_pwritev locals scope Vladimir Sementsov-Ogievskiy
2018-08-07 17:43 ` [Qemu-devel] [PATCH 7/7] qcow2: async scheme for qcow2_co_pwritev Vladimir Sementsov-Ogievskiy
     [not found]   ` <1c8299bf-0b31-82a7-c7c4-5069581f2d94@redhat.com>
2018-10-01 15:46     ` Vladimir Sementsov-Ogievskiy
2018-08-16  0:51 ` [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io Max Reitz
2018-08-16 13:58   ` Vladimir Sementsov-Ogievskiy
2018-08-17 19:34     ` Max Reitz
2018-08-17 19:43       ` Denis V. Lunev
2018-08-20 16:33       ` Vladimir Sementsov-Ogievskiy
2018-08-20 16:39         ` Max Reitz
