* [PATCH v3 0/4] block: seriously improve savevm performance
@ 2020-06-11 17:11 Denis V. Lunev
  2020-06-11 17:11 ` [PATCH 1/4] migration/savevm: respect qemu_fclose() error code in save_snapshot() Denis V. Lunev
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-11 17:11 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V . Lunev

This series does two standard things:
- it creates an intermediate buffer for all writes from the QEMU migration
  code to the QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, the migration code is fantastically inefficient (by observation):
buffers are not aligned and are sent in arbitrary pieces, often less than
100 bytes per chunk, which results in read-modify-write operations when
non-cached I/O is used. It should also be noted that all operations are
performed into unallocated image blocks, which also suffer due to partial
writes to such new clusters.

This patch series is an implementation of the idea discussed in the RFC
posted by Denis Plotnikov:
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
Results with this series over NVMe are better than with the original code:
                original     rfc    this
cached:          1.79s      2.38s   1.27s
non-cached:      3.29s      1.31s   0.81s
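
To show the idea outside of QEMU, below is a trivial standalone C sketch.
All names, sizes and the output file in it are made up for illustration
only; the real implementation in patch 4 sits on top of the block layer
and submits full buffers through an AioTaskPool instead of a synchronous
pwrite().

/*
 * Standalone illustration only, not QEMU code: accumulate many small,
 * unaligned "migration chunks" into one large aligned buffer and write
 * the output file in whole buffers, instead of pushing every tiny chunk
 * straight to the image.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BATCH_SIZE (1024 * 1024)    /* mirrors MAX(cluster size, 1 MiB) */

typedef struct {
    int fd;
    uint8_t *buf;                   /* aligned staging buffer */
    size_t filled;                  /* bytes accumulated so far */
    off_t offset;                   /* file offset for the next flush */
} Batcher;

static int batch_flush(Batcher *b)
{
    /* QEMU submits this asynchronously via an AioTaskPool; a plain
     * pwrite() stands in for that here. */
    if (b->filled != 0 &&
        pwrite(b->fd, b->buf, b->filled, b->offset) != (ssize_t)b->filled) {
        return -1;
    }
    b->offset += b->filled;
    b->filled = 0;
    return 0;
}

static int batch_write(Batcher *b, const void *data, size_t len)
{
    const uint8_t *p = data;

    while (len > 0) {
        size_t room = BATCH_SIZE - b->filled;
        size_t to_copy = len < room ? len : room;

        memcpy(b->buf + b->filled, p, to_copy);
        b->filled += to_copy;
        p += to_copy;
        len -= to_copy;

        /* Only full, aligned buffers ever reach the file. */
        if (b->filled == BATCH_SIZE && batch_flush(b) < 0) {
            return -1;
        }
    }
    return 0;
}

int main(void)
{
    Batcher b = { .offset = 0 };
    char chunk[100];                /* the stream often sends <100 bytes */
    int i;

    memset(chunk, 0xab, sizeof(chunk));
    b.fd = open("vmstate.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (b.fd < 0 || posix_memalign((void **)&b.buf, 4096, BATCH_SIZE)) {
        fprintf(stderr, "setup failed\n");
        return 1;
    }
    for (i = 0; i < 100000; i++) {
        if (batch_write(&b, chunk, sizeof(chunk)) < 0) {
            perror("write");
            return 1;
        }
    }
    if (batch_flush(&b) < 0) {      /* counterpart of bdrv_flush_vmstate() */
        perror("flush");
        return 1;
    }
    free(b.buf);
    close(b.fd);
    return 0;
}

Gathering the stream into large, aligned buffers is what removes the
read-modify-write penalty described above.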

Changes from v2:
- code moved from QCOW2 level to generic block level
- created the bdrv_flush_vmstate helper to fix tests 022 and 029
- added recursion into bs->file in bdrv_co_flush_vmstate (fixes test 267)
- fixed the blk_save_vmstate helper
- fixed coroutine wait as Vladimir suggested, with waiting fixes from me

Changes from v1:
- patchew warning fixed
- fixed validation that only 1 waiter is allowed in patch 1

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>





* [PATCH 1/4] migration/savevm: respect qemu_fclose() error code in save_snapshot()
  2020-06-11 17:11 [PATCH v3 0/4] block: seriously improve savevm performance Denis V. Lunev
@ 2020-06-11 17:11 ` Denis V. Lunev
  2020-06-15  7:39   ` Vladimir Sementsov-Ogievskiy
  2020-06-15 12:03   ` Dr. David Alan Gilbert
  2020-06-11 17:11 ` [PATCH 2/4] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-11 17:11 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V. Lunev

qemu_fclose() can return an error, e.g. if bdrv_co_flush() fails.

This validation will become more important once we start waiting for the
asynchronous I/O operations started from bdrv_write_vmstate(), which is
coming soon.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 migration/savevm.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index c00a6807d9..0ff5bb40ed 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2628,7 +2628,7 @@ int save_snapshot(const char *name, Error **errp)
 {
     BlockDriverState *bs, *bs1;
     QEMUSnapshotInfo sn1, *sn = &sn1, old_sn1, *old_sn = &old_sn1;
-    int ret = -1;
+    int ret = -1, ret2;
     QEMUFile *f;
     int saved_vm_running;
     uint64_t vm_state_size;
@@ -2712,10 +2712,14 @@ int save_snapshot(const char *name, Error **errp)
     }
     ret = qemu_savevm_state(f, errp);
     vm_state_size = qemu_ftell(f);
-    qemu_fclose(f);
+    ret2 = qemu_fclose(f);
     if (ret < 0) {
         goto the_end;
     }
+    if (ret2 < 0) {
+        ret = ret2;
+        goto the_end;
+    }
 
     /* The bdrv_all_create_snapshot() call that follows acquires the AioContext
      * for itself.  BDRV_POLL_WHILE() does not support nested locking because
-- 
2.17.1




* [PATCH 2/4] block/aio_task: allow start/wait task from any coroutine
  2020-06-11 17:11 [PATCH v3 0/4] block: seriously improve savevm performance Denis V. Lunev
  2020-06-11 17:11 ` [PATCH 1/4] migration/savevm: respect qemu_fclose() error code in save_snapshot() Denis V. Lunev
@ 2020-06-11 17:11 ` Denis V. Lunev
  2020-06-15  7:47   ` Vladimir Sementsov-Ogievskiy
  2020-06-11 17:11 ` [PATCH 3/4] block, migration: add bdrv_flush_vmstate helper Denis V. Lunev
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-11 17:11 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V . Lunev

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Currently, the aio task pool assumes that there is a main coroutine, which
creates tasks and waits for them. Let's remove this restriction by using a
CoQueue. The code becomes clearer and the interface more obvious.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/aio_task.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index 88989fa248..cf62e5c58b 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -27,11 +27,10 @@
 #include "block/aio_task.h"
 
 struct AioTaskPool {
-    Coroutine *main_co;
     int status;
     int max_busy_tasks;
     int busy_tasks;
-    bool waiting;
+    CoQueue waiters;
 };
 
 static void coroutine_fn aio_task_co(void *opaque)
@@ -52,31 +51,23 @@ static void coroutine_fn aio_task_co(void *opaque)
 
     g_free(task);
 
-    if (pool->waiting) {
-        pool->waiting = false;
-        aio_co_wake(pool->main_co);
-    }
+    qemu_co_queue_restart_all(&pool->waiters);
 }
 
 void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
 {
     assert(pool->busy_tasks > 0);
-    assert(qemu_coroutine_self() == pool->main_co);
 
-    pool->waiting = true;
-    qemu_coroutine_yield();
+    qemu_co_queue_wait(&pool->waiters, NULL);
 
-    assert(!pool->waiting);
     assert(pool->busy_tasks < pool->max_busy_tasks);
 }
 
 void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool)
 {
-    if (pool->busy_tasks < pool->max_busy_tasks) {
-        return;
+    while (pool->busy_tasks >= pool->max_busy_tasks) {
+        aio_task_pool_wait_one(pool);
     }
-
-    aio_task_pool_wait_one(pool);
 }
 
 void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool)
@@ -98,8 +89,8 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int max_busy_tasks)
 {
     AioTaskPool *pool = g_new0(AioTaskPool, 1);
 
-    pool->main_co = qemu_coroutine_self();
     pool->max_busy_tasks = max_busy_tasks;
+    qemu_co_queue_init(&pool->waiters);
 
     return pool;
 }
-- 
2.17.1




* [PATCH 3/4] block, migration: add bdrv_flush_vmstate helper
  2020-06-11 17:11 [PATCH v3 0/4] block: seriously improve savevm performance Denis V. Lunev
  2020-06-11 17:11 ` [PATCH 1/4] migration/savevm: respect qemu_fclose() error code in save_snapshot() Denis V. Lunev
  2020-06-11 17:11 ` [PATCH 2/4] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
@ 2020-06-11 17:11 ` Denis V. Lunev
  2020-06-15  7:56   ` Vladimir Sementsov-Ogievskiy
  2020-06-11 17:11 ` [PATCH 4/4] block/io: improve savevm performance Denis V. Lunev
  2020-06-15 12:17 ` [PATCH v3 0/4] block: seriously " Dr. David Alan Gilbert
  4 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-11 17:11 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V. Lunev

Right now bdrv_fclose() is just calling bdrv_flush().

The problem is that the migration code works inefficiently in block layer
terms and frequently issues writes of very small pieces of improperly
aligned data. The block layer is capable of working this way, but it is
very slow.

This patch is a preparation for introducing an intermediate buffer at the
block driver state level. It is beneficial to separate the conventional
bdrv_flush() from the closing of the QEMU file by the migration code.

The patch also performs the bdrv_flush_vmstate() operation inside the
synchronous blk_save_vmstate() operation. This helper is used from
qemu-io only.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/block-backend.c |  6 +++++-
 block/io.c            | 39 +++++++++++++++++++++++++++++++++++++++
 include/block/block.h |  1 +
 migration/savevm.c    |  3 +++
 4 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 9342a475cb..2107ace699 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2177,7 +2177,7 @@ int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
 int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
                      int64_t pos, int size)
 {
-    int ret;
+    int ret, ret2;
 
     if (!blk_is_available(blk)) {
         return -ENOMEDIUM;
@@ -2187,6 +2187,10 @@ int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
     if (ret < 0) {
         return ret;
     }
+    ret2 = bdrv_flush_vmstate(blk_bs(blk));
+    if (ret2 < 0) {
+        return ret2;
+    }
 
     if (ret == size && !blk->enable_write_cache) {
         ret = bdrv_flush(blk_bs(blk));
diff --git a/block/io.c b/block/io.c
index 121ce17a49..fbf352f39d 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2725,6 +2725,45 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
     return bdrv_rw_vmstate(bs, qiov, pos, true);
 }
 
+
+typedef struct FlushVMStateCo {
+    BlockDriverState *bs;
+    int ret;
+} FlushVMStateCo;
+
+static int coroutine_fn bdrv_co_flush_vmstate(BlockDriverState *bs)
+{
+    return 0;
+}
+
+static void coroutine_fn bdrv_flush_vmstate_co_entry(void *opaque)
+{
+    FlushVMStateCo *rwco = opaque;
+
+    rwco->ret = bdrv_co_flush_vmstate(rwco->bs);
+    aio_wait_kick();
+}
+
+int bdrv_flush_vmstate(BlockDriverState *bs)
+{
+    Coroutine *co;
+    FlushVMStateCo flush_co = {
+        .bs = bs,
+        .ret = NOT_DONE,
+    };
+
+    if (qemu_in_coroutine()) {
+        /* Fast-path if already in coroutine context */
+        bdrv_flush_vmstate_co_entry(&flush_co);
+    } else {
+        co = qemu_coroutine_create(bdrv_flush_vmstate_co_entry, &flush_co);
+        bdrv_coroutine_enter(bs, co);
+        BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);
+    }
+
+    return flush_co.ret;
+}
+
 /**************************************************************/
 /* async I/Os */
 
diff --git a/include/block/block.h b/include/block/block.h
index 25e299605e..024525b87d 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -572,6 +572,7 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
 
 int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
                       int64_t pos, int size);
+int bdrv_flush_vmstate(BlockDriverState *bs);
 
 void bdrv_img_create(const char *filename, const char *fmt,
                      const char *base_filename, const char *base_fmt,
diff --git a/migration/savevm.c b/migration/savevm.c
index 0ff5bb40ed..9698c909d7 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -150,6 +150,9 @@ static ssize_t block_get_buffer(void *opaque, uint8_t *buf, int64_t pos,
 
 static int bdrv_fclose(void *opaque, Error **errp)
 {
+    int err = bdrv_flush_vmstate(opaque);
+    if (err < 0)
+        return err;
     return bdrv_flush(opaque);
 }
 
-- 
2.17.1




* [PATCH 4/4] block/io: improve savevm performance
  2020-06-11 17:11 [PATCH v3 0/4] block: seriously improve savevm performance Denis V. Lunev
                   ` (2 preceding siblings ...)
  2020-06-11 17:11 ` [PATCH 3/4] block, migration: add bdrv_flush_vmstate helper Denis V. Lunev
@ 2020-06-11 17:11 ` Denis V. Lunev
  2020-06-15  9:25   ` Vladimir Sementsov-Ogievskiy
  2020-06-15 12:17 ` [PATCH v3 0/4] block: seriously " Dr. David Alan Gilbert
  4 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-11 17:11 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V. Lunev

This patch does two standard things:
- it creates an intermediate buffer for all writes from the QEMU migration
  code to the block driver,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

Thus bdrv_vmstate_write() becomes asynchronous. Completion of all pending
operations is performed in the newly introduced bdrv_flush_vmstate().

In general, the migration code is fantastically inefficient (by observation):
buffers are not aligned and are sent in arbitrary pieces, often less than
100 bytes per chunk, which results in read-modify-write operations if the
target file descriptor is opened with O_DIRECT. It should also be noted
that all operations are performed into unallocated image blocks, which
also suffer due to partial writes to such new clusters even on cached
file descriptors.

Snapshot creation time (2 GB Fedora-31 VM running over NVMe storage):
                original     fixed
cached:          1.79s       1.27s
non-cached:      3.29s       0.81s

The difference over HDD would be more significant :)

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/io.c                | 121 +++++++++++++++++++++++++++++++++++++-
 include/block/block_int.h |   8 +++
 2 files changed, 127 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index fbf352f39d..698f1eef76 100644
--- a/block/io.c
+++ b/block/io.c
@@ -26,6 +26,7 @@
 #include "trace.h"
 #include "sysemu/block-backend.h"
 #include "block/aio-wait.h"
+#include "block/aio_task.h"
 #include "block/blockjob.h"
 #include "block/blockjob_int.h"
 #include "block/block_int.h"
@@ -2633,6 +2634,102 @@ typedef struct BdrvVmstateCo {
     int                 ret;
 } BdrvVmstateCo;
 
+typedef struct BdrvVMStateTask {
+    AioTask task;
+
+    BlockDriverState *bs;
+    int64_t offset;
+    void *buf;
+    size_t bytes;
+} BdrvVMStateTask;
+
+typedef struct BdrvSaveVMState {
+    AioTaskPool *pool;
+    BdrvVMStateTask *t;
+} BdrvSaveVMState;
+
+
+static coroutine_fn int bdrv_co_vmstate_save_task_entry(AioTask *task)
+{
+    int err = 0;
+    BdrvVMStateTask *t = container_of(task, BdrvVMStateTask, task);
+
+    if (t->bytes != 0) {
+        QEMUIOVector local_qiov;
+        qemu_iovec_init_buf(&local_qiov, t->buf, t->bytes);
+
+        bdrv_inc_in_flight(t->bs);
+        err = t->bs->drv->bdrv_save_vmstate(t->bs, &local_qiov, t->offset);
+        bdrv_dec_in_flight(t->bs);
+    }
+
+    qemu_vfree(t->buf);
+    return err;
+}
+
+static BdrvVMStateTask *bdrv_vmstate_task_create(BlockDriverState *bs,
+                                                 int64_t pos, size_t size)
+{
+    BdrvVMStateTask *t = g_new(BdrvVMStateTask, 1);
+
+    *t = (BdrvVMStateTask) {
+        .task.func = bdrv_co_vmstate_save_task_entry,
+        .buf = qemu_blockalign(bs, size),
+        .offset = pos,
+        .bs = bs,
+    };
+
+    return t;
+}
+
+static int bdrv_co_do_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
+                                   int64_t pos)
+{
+    BdrvSaveVMState *state = bs->savevm_state;
+    BdrvVMStateTask *t;
+    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);
+    size_t to_copy;
+    size_t off;
+
+    if (state == NULL) {
+        state = g_new(BdrvSaveVMState, 1);
+        *state = (BdrvSaveVMState) {
+            .pool = aio_task_pool_new(BDRV_VMSTATE_WORKERS_MAX),
+            .t = bdrv_vmstate_task_create(bs, pos, buf_size),
+        };
+
+        bs->savevm_state = state;
+    }
+
+    if (aio_task_pool_status(state->pool) < 0) {
+        return aio_task_pool_status(state->pool);
+    }
+
+    t = state->t;
+    if (t->offset + t->bytes != pos) {
+        /* Normally this branch is not reachable from migration */
+        return bs->drv->bdrv_save_vmstate(bs, qiov, pos);
+    }
+
+    off = 0;
+    while (1) {
+        to_copy = MIN(qiov->size - off, buf_size - t->bytes);
+        qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
+        t->bytes += to_copy;
+        if (t->bytes < buf_size) {
+            return qiov->size;
+        }
+
+        aio_task_pool_start_task(state->pool, &t->task);
+
+        pos += to_copy;
+        off += to_copy;
+        state->t = t = bdrv_vmstate_task_create(bs, pos, buf_size);
+    }
+
+    return qiov->size;
+}
+
 static int coroutine_fn
 bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                    bool is_read)
@@ -2648,7 +2745,7 @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
         if (is_read) {
             ret = drv->bdrv_load_vmstate(bs, qiov, pos);
         } else {
-            ret = drv->bdrv_save_vmstate(bs, qiov, pos);
+            ret = bdrv_co_do_save_vmstate(bs, qiov, pos);
         }
     } else if (bs->file) {
         ret = bdrv_co_rw_vmstate(bs->file->bs, qiov, pos, is_read);
@@ -2733,7 +2830,27 @@ typedef struct FlushVMStateCo {
 
 static int coroutine_fn bdrv_co_flush_vmstate(BlockDriverState *bs)
 {
-    return 0;
+    int err;
+    BdrvSaveVMState *state = bs->savevm_state;
+
+    if (bs->drv->bdrv_save_vmstate == NULL && bs->file != NULL) {
+        return bdrv_co_flush_vmstate(bs->file->bs);
+    }
+    if (state == NULL) {
+        return 0;
+    }
+
+    aio_task_pool_start_task(state->pool, &state->t->task);
+
+    aio_task_pool_wait_all(state->pool);
+    err = aio_task_pool_status(state->pool);
+
+    aio_task_pool_free(state->pool);
+    g_free(state);
+
+    bs->savevm_state = NULL;
+
+    return err;
 }
 
 static void coroutine_fn bdrv_flush_vmstate_co_entry(void *opaque)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 791de6a59c..f90f0e8b6a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -61,6 +61,8 @@
 
 #define BLOCK_PROBE_BUF_SIZE        512
 
+#define BDRV_VMSTATE_WORKERS_MAX    8
+
 enum BdrvTrackedRequestType {
     BDRV_TRACKED_READ,
     BDRV_TRACKED_WRITE,
@@ -784,6 +786,9 @@ struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+
+typedef struct BdrvSaveVMState BdrvSaveVMState;
+
 /*
  * Note: the function bdrv_append() copies and swaps contents of
  * BlockDriverStates, so if you add new fields to this struct, please
@@ -947,6 +952,9 @@ struct BlockDriverState {
 
     /* BdrvChild links to this node may never be frozen */
     bool never_freeze;
+
+    /* Intermediate buffer for VM state saving from snapshot creation code */
+    BdrvSaveVMState *savevm_state;
 };
 
 struct BlockBackendRootState {
-- 
2.17.1




* Re: [PATCH 1/4] migration/savevm: respect qemu_fclose() error code in save_snapshot()
  2020-06-11 17:11 ` [PATCH 1/4] migration/savevm: respect qemu_fclose() error code in save_snapshot() Denis V. Lunev
@ 2020-06-15  7:39   ` Vladimir Sementsov-Ogievskiy
  2020-06-15 12:03   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-15  7:39 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

11.06.2020 20:11, Denis V. Lunev wrote:
> qemu_fclose() could return error, f.e. if bdrv_co_flush() will return
> the error.
> 
> This validation will become more important once we will start waiting of
> asynchronous IO operations, started from bdrv_write_vmstate(), which are
> coming soon.
> 
> Signed-off-by: Denis V. Lunev<den@openvz.org>
> CC: Kevin Wolf<kwolf@redhat.com>
> CC: Max Reitz<mreitz@redhat.com>
> CC: Stefan Hajnoczi<stefanha@redhat.com>
> CC: Fam Zheng<fam@euphon.net>
> CC: Juan Quintela<quintela@redhat.com>
> CC: "Dr. David Alan Gilbert"<dgilbert@redhat.com>
> CC: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
> CC: Denis Plotnikov<dplotnikov@virtuozzo.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir



* Re: [PATCH 2/4] block/aio_task: allow start/wait task from any coroutine
  2020-06-11 17:11 ` [PATCH 2/4] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
@ 2020-06-15  7:47   ` Vladimir Sementsov-Ogievskiy
  2020-06-15  9:34     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-15  7:47 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

11.06.2020 20:11, Denis V. Lunev wrote:
> From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> Currently, aio task pool assumes that there is a main coroutine, which
> creates tasks and wait for them. Let's remove the restriction by using
> CoQueue. Code becomes clearer, interface more obvious.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <fam@euphon.net>
> CC: Juan Quintela <quintela@redhat.com>
> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
>   block/aio_task.c | 21 ++++++---------------
>   1 file changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/block/aio_task.c b/block/aio_task.c
> index 88989fa248..cf62e5c58b 100644
> --- a/block/aio_task.c
> +++ b/block/aio_task.c
> @@ -27,11 +27,10 @@
>   #include "block/aio_task.h"
>   
>   struct AioTaskPool {
> -    Coroutine *main_co;
>       int status;
>       int max_busy_tasks;
>       int busy_tasks;
> -    bool waiting;
> +    CoQueue waiters;
>   };
>   
>   static void coroutine_fn aio_task_co(void *opaque)
> @@ -52,31 +51,23 @@ static void coroutine_fn aio_task_co(void *opaque)
>   
>       g_free(task);
>   
> -    if (pool->waiting) {
> -        pool->waiting = false;
> -        aio_co_wake(pool->main_co);
> -    }
> +    qemu_co_queue_restart_all(&pool->waiters);
>   }
>   
>   void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
>   {
>       assert(pool->busy_tasks > 0);
> -    assert(qemu_coroutine_self() == pool->main_co);
>   
> -    pool->waiting = true;
> -    qemu_coroutine_yield();
> +    qemu_co_queue_wait(&pool->waiters, NULL);
>   
> -    assert(!pool->waiting);
>       assert(pool->busy_tasks < pool->max_busy_tasks);

As we wake up several coroutines now, I'm afraid this assertion may start to fire.
Also, aio_task_pool_wait_one() becomes useless as a public API (still, it's used only locally, so we can make it static).

I'll send an updated patch after reviewing the rest of the series.

>   }
>   
>   void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool)
>   {
> -    if (pool->busy_tasks < pool->max_busy_tasks) {
> -        return;
> +    while (pool->busy_tasks >= pool->max_busy_tasks) {
> +        aio_task_pool_wait_one(pool);
>       }
> -
> -    aio_task_pool_wait_one(pool);
>   }
>   
>   void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool)
> @@ -98,8 +89,8 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int max_busy_tasks)
>   {
>       AioTaskPool *pool = g_new0(AioTaskPool, 1);
>   
> -    pool->main_co = qemu_coroutine_self();
>       pool->max_busy_tasks = max_busy_tasks;
> +    qemu_co_queue_init(&pool->waiters);
>   
>       return pool;
>   }
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 3/4] block, migration: add bdrv_flush_vmstate helper
  2020-06-11 17:11 ` [PATCH 3/4] block, migration: add bdrv_flush_vmstate helper Denis V. Lunev
@ 2020-06-15  7:56   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-15  7:56 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

11.06.2020 20:11, Denis V. Lunev wrote:
> Right now bdrv_fclose() is just calling bdrv_flush().
> 
> The problem is that migration code is working inefficently from black
> layer terms and are frequently called for very small pieces of not
> properly aligned data. Block layer is capable to work this way, but
> this is very slow.
> 
> This patch is a preparation for the introduction of the intermediate
> buffer at block driver state. It would be beneficial to separate
> conventional bdrv_flush() from closing QEMU file from migration code.
> 
> The patch also forces bdrv_flush_vmstate() operation inside
> synchronous blk_save_vmstate() operation. This helper is used from
> qemu-io only.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <fam@euphon.net>
> CC: Juan Quintela <quintela@redhat.com>
> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
>   block/block-backend.c |  6 +++++-
>   block/io.c            | 39 +++++++++++++++++++++++++++++++++++++++
>   include/block/block.h |  1 +
>   migration/savevm.c    |  3 +++
>   4 files changed, 48 insertions(+), 1 deletion(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 9342a475cb..2107ace699 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -2177,7 +2177,7 @@ int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
>   int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
>                        int64_t pos, int size)
>   {
> -    int ret;
> +    int ret, ret2;
>   
>       if (!blk_is_available(blk)) {
>           return -ENOMEDIUM;
> @@ -2187,6 +2187,10 @@ int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
>       if (ret < 0) {
>           return ret;
>       }
> +    ret2 = bdrv_flush_vmstate(blk_bs(blk));
> +    if (ret2 < 0) {
> +        return ret;
> +    }
>   
>       if (ret == size && !blk->enable_write_cache) {
>           ret = bdrv_flush(blk_bs(blk));
> diff --git a/block/io.c b/block/io.c
> index 121ce17a49..fbf352f39d 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2725,6 +2725,45 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
>       return bdrv_rw_vmstate(bs, qiov, pos, true);
>   }
>   
> +
> +typedef struct FlushVMStateCo {
> +    BlockDriverState *bs;
> +    int ret;
> +} FlushVMStateCo;
> +
> +static int coroutine_fn bdrv_co_flush_vmstate(BlockDriverState *bs)
> +{
> +    return 0;
> +}
> +
> +static void coroutine_fn bdrv_flush_vmstate_co_entry(void *opaque)
> +{
> +    FlushVMStateCo *rwco = opaque;
> +
> +    rwco->ret = bdrv_co_flush_vmstate(rwco->bs);
> +    aio_wait_kick();
> +}
> +
> +int bdrv_flush_vmstate(BlockDriverState *bs)
> +{
> +    Coroutine *co;
> +    FlushVMStateCo flush_co = {
> +        .bs = bs,
> +        .ret = NOT_DONE,
> +    };
> +
> +    if (qemu_in_coroutine()) {
> +        /* Fast-path if already in coroutine context */
> +        bdrv_flush_vmstate_co_entry(&flush_co);
> +    } else {
> +        co = qemu_coroutine_create(bdrv_flush_vmstate_co_entry, &flush_co);
> +        bdrv_coroutine_enter(bs, co);
> +        BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);
> +    }
> +
> +    return flush_co.ret;
> +}

In block/io.c, since 7d2410cea154bf, these coroutine wrappers use bdrv_run_co() instead.
I hope it's an intermediate state and my series with auto-generated wrappers will be applied; still, for now we should use the bdrv_run_co()-based approach for consistency.

Otherwise, the patch looks OK to me; it just adds a new interface that does nothing for now. Still, it would be good to add a comment in block.h that bdrv_flush_vmstate() is needed after bdrv_save_vmstate().

> +
>   /**************************************************************/
>   /* async I/Os */
>   
> diff --git a/include/block/block.h b/include/block/block.h
> index 25e299605e..024525b87d 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -572,6 +572,7 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
>   
>   int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
>                         int64_t pos, int size);
> +int bdrv_flush_vmstate(BlockDriverState *bs);
>   
>   void bdrv_img_create(const char *filename, const char *fmt,
>                        const char *base_filename, const char *base_fmt,
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 0ff5bb40ed..9698c909d7 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -150,6 +150,9 @@ static ssize_t block_get_buffer(void *opaque, uint8_t *buf, int64_t pos,
>   
>   static int bdrv_fclose(void *opaque, Error **errp)
>   {
> +    int err = bdrv_flush_vmstate(opaque);
> +    if (err < 0)
> +        return err;
>       return bdrv_flush(opaque);
>   }
>   
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 4/4] block/io: improve savevm performance
  2020-06-11 17:11 ` [PATCH 4/4] block/io: improve savevm performance Denis V. Lunev
@ 2020-06-15  9:25   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-15  9:25 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

11.06.2020 20:11, Denis V. Lunev wrote:
> This patch does 2 standard basic things:
> - it creates intermediate buffer for all writes from QEMU migration code
>    to block driver,
> - this buffer is sent to disk asynchronously, allowing several writes to
>    run in parallel.
> 
> Thus bdrv_vmstate_write() is becoming asynchronous. All pending operations
> completion are performed in newly invented bdrv_flush_vmstate().
> 
> In general, migration code is fantastically inefficent (by observation),
> buffers are not aligned and sent with arbitrary pieces, a lot of time
> less than 100 bytes at a chunk, which results in read-modify-write
> operations if target file descriptor is opened with O_DIRECT. It should
> also be noted that all operations are performed into unallocated image
> blocks, which also suffer due to partial writes to such new clusters
> even on cached file descriptors.
> 
> Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
>                  original     fixed
> cached:          1.79s       1.27s
> non-cached:      3.29s       0.81s
> 
> The difference over HDD would be more significant :)
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <fam@euphon.net>
> CC: Juan Quintela <quintela@redhat.com>
> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
>   block/io.c                | 121 +++++++++++++++++++++++++++++++++++++-
>   include/block/block_int.h |   8 +++
>   2 files changed, 127 insertions(+), 2 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index fbf352f39d..698f1eef76 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -26,6 +26,7 @@
>   #include "trace.h"
>   #include "sysemu/block-backend.h"
>   #include "block/aio-wait.h"
> +#include "block/aio_task.h"
>   #include "block/blockjob.h"
>   #include "block/blockjob_int.h"
>   #include "block/block_int.h"
> @@ -2633,6 +2634,102 @@ typedef struct BdrvVmstateCo {
>       int                 ret;
>   } BdrvVmstateCo;
>   
> +typedef struct BdrvVMStateTask {
> +    AioTask task;
> +
> +    BlockDriverState *bs;
> +    int64_t offset;
> +    void *buf;
> +    size_t bytes;
> +} BdrvVMStateTask;
> +
> +typedef struct BdrvSaveVMState {
> +    AioTaskPool *pool;
> +    BdrvVMStateTask *t;
> +} BdrvSaveVMState;
> +
> +
> +static coroutine_fn int bdrv_co_vmstate_save_task_entry(AioTask *task)
> +{
> +    int err = 0;
> +    BdrvVMStateTask *t = container_of(task, BdrvVMStateTask, task);
> +
> +    if (t->bytes != 0) {
> +        QEMUIOVector local_qiov;
> +        qemu_iovec_init_buf(&local_qiov, t->buf, t->bytes);

Consider the dedicated one-liner for this case:

QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, t->buf, t->bytes);

> +
> +        bdrv_inc_in_flight(t->bs);
> +        err = t->bs->drv->bdrv_save_vmstate(t->bs, &local_qiov, t->offset);
> +        bdrv_dec_in_flight(t->bs);
> +    }
> +
> +    qemu_vfree(t->buf);
> +    return err;
> +}
> +
> +static BdrvVMStateTask *bdrv_vmstate_task_create(BlockDriverState *bs,
> +                                                 int64_t pos, size_t size)
> +{
> +    BdrvVMStateTask *t = g_new(BdrvVMStateTask, 1);
> +
> +    *t = (BdrvVMStateTask) {
> +        .task.func = bdrv_co_vmstate_save_task_entry,
> +        .buf = qemu_blockalign(bs, size),
> +        .offset = pos,
> +        .bs = bs,
> +    };
> +
> +    return t;
> +}
> +
> +static int bdrv_co_do_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
> +                                   int64_t pos)
> +{
> +    BdrvSaveVMState *state = bs->savevm_state;
> +    BdrvVMStateTask *t;
> +    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);
> +    size_t to_copy;
> +    size_t off;
> +
> +    if (state == NULL) {
> +        state = g_new(BdrvSaveVMState, 1);
> +        *state = (BdrvSaveVMState) {
> +            .pool = aio_task_pool_new(BDRV_VMSTATE_WORKERS_MAX),
> +            .t = bdrv_vmstate_task_create(bs, pos, buf_size),
> +        };
> +
> +        bs->savevm_state = state;
> +    }
> +
> +    if (aio_task_pool_status(state->pool) < 0) {
> +        return aio_task_pool_status(state->pool);

So, on failure, we still need a flush to clean up the state. I think it's better to clean it
up immediately here (goto fail, etc.).

Aha, actually in blk_save_vmstate() you don't call bdrv_flush_vmstate() if bdrv_save_vmstate()
failed.

> +    }
> +
> +    t = state->t;
> +    if (t->offset + t->bytes != pos) {
> +        /* Normally this branch is not reachable from migration */

Aha, like a cache-miss. OK

> +        return bs->drv->bdrv_save_vmstate(bs, qiov, pos);
> +    }
> +
> +    off = 0;
> +    while (1) {
> +        to_copy = MIN(qiov->size - off, buf_size - t->bytes);
> +        qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
> +        t->bytes += to_copy;
> +        if (t->bytes < buf_size) {
> +            return qiov->size;
> +        }
> +
> +        aio_task_pool_start_task(state->pool, &t->task);
> +
> +        pos += to_copy;
> +        off += to_copy;
> +        state->t = t = bdrv_vmstate_task_create(bs, pos, buf_size);
> +    }
> +
> +    return qiov->size;

This is actually unreachable, so I'd drop it or use "break" in the loop instead of return.

> +}
> +
>   static int coroutine_fn
>   bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
>                      bool is_read)
> @@ -2648,7 +2745,7 @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
>           if (is_read) {
>               ret = drv->bdrv_load_vmstate(bs, qiov, pos);
>           } else {
> -            ret = drv->bdrv_save_vmstate(bs, qiov, pos);
> +            ret = bdrv_co_do_save_vmstate(bs, qiov, pos);
>           }
>       } else if (bs->file) {
>           ret = bdrv_co_rw_vmstate(bs->file->bs, qiov, pos, is_read);
> @@ -2733,7 +2830,27 @@ typedef struct FlushVMStateCo {
>   
>   static int coroutine_fn bdrv_co_flush_vmstate(BlockDriverState *bs)
>   {
> -    return 0;
> +    int err;
> +    BdrvSaveVMState *state = bs->savevm_state;
> +
> +    if (bs->drv->bdrv_save_vmstate == NULL && bs->file != NULL) {
> +        return bdrv_co_flush_vmstate(bs->file->bs);
> +    }
> +    if (state == NULL) {
> +        return 0;
> +    }
> +
> +    aio_task_pool_start_task(state->pool, &state->t->task);

We probably shouldn't do this if the pool status is already bad, but it shouldn't hurt as is.

> +
> +    aio_task_pool_wait_all(state->pool);
> +    err = aio_task_pool_status(state->pool);
> +
> +    aio_task_pool_free(state->pool);
> +    g_free(state);
> +
> +    bs->savevm_state = NULL;
> +
> +    return err;
>   }
>   
>   static void coroutine_fn bdrv_flush_vmstate_co_entry(void *opaque)
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 791de6a59c..f90f0e8b6a 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -61,6 +61,8 @@
>   
>   #define BLOCK_PROBE_BUF_SIZE        512
>   
> +#define BDRV_VMSTATE_WORKERS_MAX    8
> +
>   enum BdrvTrackedRequestType {
>       BDRV_TRACKED_READ,
>       BDRV_TRACKED_WRITE,
> @@ -784,6 +786,9 @@ struct BdrvChild {
>       QLIST_ENTRY(BdrvChild) next_parent;
>   };
>   
> +
> +typedef struct BdrvSaveVMState BdrvSaveVMState;
> +
>   /*
>    * Note: the function bdrv_append() copies and swaps contents of
>    * BlockDriverStates, so if you add new fields to this struct, please
> @@ -947,6 +952,9 @@ struct BlockDriverState {
>   
>       /* BdrvChild links to this node may never be frozen */
>       bool never_freeze;
> +
> +    /* Intermediate buffer for VM state saving from snapshot creation code */
> +    BdrvSaveVMState *savevm_state;
>   };
>   
>   struct BlockBackendRootState {
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 2/4] block/aio_task: allow start/wait task from any coroutine
  2020-06-15  7:47   ` Vladimir Sementsov-Ogievskiy
@ 2020-06-15  9:34     ` Vladimir Sementsov-Ogievskiy
  2020-06-16 14:49       ` Denis V. Lunev
  0 siblings, 1 reply; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-15  9:34 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

15.06.2020 10:47, Vladimir Sementsov-Ogievskiy wrote:
> 11.06.2020 20:11, Denis V. Lunev wrote:
>> From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>
>> Currently, aio task pool assumes that there is a main coroutine, which
>> creates tasks and wait for them. Let's remove the restriction by using
>> CoQueue. Code becomes clearer, interface more obvious.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <fam@euphon.net>
>> CC: Juan Quintela <quintela@redhat.com>
>> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
>> ---
>>   block/aio_task.c | 21 ++++++---------------
>>   1 file changed, 6 insertions(+), 15 deletions(-)
>>
>> diff --git a/block/aio_task.c b/block/aio_task.c
>> index 88989fa248..cf62e5c58b 100644
>> --- a/block/aio_task.c
>> +++ b/block/aio_task.c
>> @@ -27,11 +27,10 @@
>>   #include "block/aio_task.h"
>>   struct AioTaskPool {
>> -    Coroutine *main_co;
>>       int status;
>>       int max_busy_tasks;
>>       int busy_tasks;
>> -    bool waiting;
>> +    CoQueue waiters;
>>   };
>>   static void coroutine_fn aio_task_co(void *opaque)
>> @@ -52,31 +51,23 @@ static void coroutine_fn aio_task_co(void *opaque)
>>       g_free(task);
>> -    if (pool->waiting) {
>> -        pool->waiting = false;
>> -        aio_co_wake(pool->main_co);
>> -    }
>> +    qemu_co_queue_restart_all(&pool->waiters);
>>   }
>>   void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
>>   {
>>       assert(pool->busy_tasks > 0);
>> -    assert(qemu_coroutine_self() == pool->main_co);
>> -    pool->waiting = true;
>> -    qemu_coroutine_yield();
>> +    qemu_co_queue_wait(&pool->waiters, NULL);
>> -    assert(!pool->waiting);
>>       assert(pool->busy_tasks < pool->max_busy_tasks);
> 
> As we wake up several coroutines now, I'm afraid this assertion may start to fire.
> And aio_task_pool_wait_one() becomes useless as a public API (still, it's used only locally, so we can make it static).
> 
> I'll send updated patch after reviewing the rest of the series.
> 

Hm, OK, we have two kinds of waiters: those waiting for a slot and those waiting for all tasks to finish. So either we need two queues, or we do what this patch does (one queue, but wake up all waiters so that each can check whether its condition is satisfied).

I'm OK with this patch with the following squashed in:

diff --git a/include/block/aio_task.h b/include/block/aio_task.h
index 50bc1e1817..50b1c036c5 100644
--- a/include/block/aio_task.h
+++ b/include/block/aio_task.h
@@ -48,7 +48,6 @@ bool aio_task_pool_empty(AioTaskPool *pool);
  void coroutine_fn aio_task_pool_start_task(AioTaskPool *pool, AioTask *task);
  
  void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool);
-void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool);
  void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool);
  
  #endif /* BLOCK_AIO_TASK_H */
diff --git a/block/aio_task.c b/block/aio_task.c
index cf62e5c58b..7ba15ff41f 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -54,26 +54,17 @@ static void coroutine_fn aio_task_co(void *opaque)
      qemu_co_queue_restart_all(&pool->waiters);
  }
  
-void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
-{
-    assert(pool->busy_tasks > 0);
-
-    qemu_co_queue_wait(&pool->waiters, NULL);
-
-    assert(pool->busy_tasks < pool->max_busy_tasks);
-}
-
  void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool)
  {
      while (pool->busy_tasks >= pool->max_busy_tasks) {
-        aio_task_pool_wait_one(pool);
+        qemu_co_queue_wait(&pool->waiters, NULL);
      }
  }
  
  void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool)
  {
      while (pool->busy_tasks > 0) {
-        aio_task_pool_wait_one(pool);
+        qemu_co_queue_wait(&pool->waiters, NULL);
      }
  }
  



-- 
Best regards,
Vladimir



* Re: [PATCH 1/4] migration/savevm: respect qemu_fclose() error code in save_snapshot()
  2020-06-11 17:11 ` [PATCH 1/4] migration/savevm: respect qemu_fclose() error code in save_snapshot() Denis V. Lunev
  2020-06-15  7:39   ` Vladimir Sementsov-Ogievskiy
@ 2020-06-15 12:03   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 15+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-15 12:03 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	Juan Quintela, qemu-devel, Max Reitz, Denis Plotnikov,
	Stefan Hajnoczi

* Denis V. Lunev (den@openvz.org) wrote:
> qemu_fclose() could return error, f.e. if bdrv_co_flush() will return
> the error.
> 
> This validation will become more important once we will start waiting of
> asynchronous IO operations, started from bdrv_write_vmstate(), which are
> coming soon.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <fam@euphon.net>
> CC: Juan Quintela <quintela@redhat.com>
> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>

We check the return value in very few other places; I think in the
migration case we do flushes and assume that if the flushes work we
were OK; then most of the closes happen on error paths or after points
we think we're done.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/savevm.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index c00a6807d9..0ff5bb40ed 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2628,7 +2628,7 @@ int save_snapshot(const char *name, Error **errp)
>  {
>      BlockDriverState *bs, *bs1;
>      QEMUSnapshotInfo sn1, *sn = &sn1, old_sn1, *old_sn = &old_sn1;
> -    int ret = -1;
> +    int ret = -1, ret2;
>      QEMUFile *f;
>      int saved_vm_running;
>      uint64_t vm_state_size;
> @@ -2712,10 +2712,14 @@ int save_snapshot(const char *name, Error **errp)
>      }
>      ret = qemu_savevm_state(f, errp);
>      vm_state_size = qemu_ftell(f);
> -    qemu_fclose(f);
> +    ret2 = qemu_fclose(f);
>      if (ret < 0) {
>          goto the_end;
>      }
> +    if (ret2 < 0) {
> +        ret = ret2;
> +        goto the_end;
> +    }
>  
>      /* The bdrv_all_create_snapshot() call that follows acquires the AioContext
>       * for itself.  BDRV_POLL_WHILE() does not support nested locking because
> -- 
> 2.17.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v3 0/4] block: seriously improve savevm performance
  2020-06-11 17:11 [PATCH v3 0/4] block: seriously improve savevm performance Denis V. Lunev
                   ` (3 preceding siblings ...)
  2020-06-11 17:11 ` [PATCH 4/4] block/io: improve savevm performance Denis V. Lunev
@ 2020-06-15 12:17 ` Dr. David Alan Gilbert
  2020-06-15 12:36   ` Denis V. Lunev
  4 siblings, 1 reply; 15+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-15 12:17 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	Juan Quintela, qemu-devel, Max Reitz, Denis Plotnikov,
	Stefan Hajnoczi

* Denis V. Lunev (den@openvz.org) wrote:
> This series do standard basic things:
> - it creates intermediate buffer for all writes from QEMU migration code
>   to QCOW2 image,
> - this buffer is sent to disk asynchronously, allowing several writes to
>   run in parallel.
> 
> In general, migration code is fantastically inefficent (by observation),
> buffers are not aligned and sent with arbitrary pieces, a lot of time
> less than 100 bytes at a chunk, which results in read-modify-write
> operations with non-cached operations. It should also be noted that all
> operations are performed into unallocated image blocks, which also suffer
> due to partial writes to such new clusters.

It surprises me a little that you're not benefiting from the buffer
internal to qemu-file.c

Dave

> This patch series is an implementation of idea discussed in the RFC
> posted by Denis Plotnikov
> https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
> Results with this series over NVME are better than original code
>                 original     rfc    this
> cached:          1.79s      2.38s   1.27s
> non-cached:      3.29s      1.31s   0.81s
> 
> Changes from v2:
> - code moved from QCOW2 level to generic block level
> - created bdrv_flush_vmstate helper to fix 022, 029 tests
> - added recursive for bs->file in bdrv_co_flush_vmstate (fix 267)
> - fixed blk_save_vmstate helper
> - fixed coroutine wait as Vladimir suggested with waiting fixes from me
> 
> Changes from v1:
> - patchew warning fixed
> - fixed validation that only 1 waiter is allowed in patch 1
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <fam@euphon.net>
> CC: Juan Quintela <quintela@redhat.com>
> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v3 0/4] block: seriously improve savevm performance
  2020-06-15 12:17 ` [PATCH v3 0/4] block: seriously " Dr. David Alan Gilbert
@ 2020-06-15 12:36   ` Denis V. Lunev
  2020-06-15 12:49     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-15 12:36 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	Juan Quintela, qemu-devel, Max Reitz, Denis Plotnikov,
	Stefan Hajnoczi

On 6/15/20 3:17 PM, Dr. David Alan Gilbert wrote:
> * Denis V. Lunev (den@openvz.org) wrote:
>> This series do standard basic things:
>> - it creates intermediate buffer for all writes from QEMU migration code
>>   to QCOW2 image,
>> - this buffer is sent to disk asynchronously, allowing several writes to
>>   run in parallel.
>>
>> In general, migration code is fantastically inefficent (by observation),
>> buffers are not aligned and sent with arbitrary pieces, a lot of time
>> less than 100 bytes at a chunk, which results in read-modify-write
>> operations with non-cached operations. It should also be noted that all
>> operations are performed into unallocated image blocks, which also suffer
>> due to partial writes to such new clusters.
> It surprises me a little that you're not benefiting from the buffer
> internal to qemu-file.c
>
> Dave
There are a lot of problems with this buffer:

Flushes to the block driver state are performed at arbitrary places,
pushing
  a) small I/O,
  b) I/O that is aligned to neither
       1) page size nor
       2) cluster size.
It should also be noted that the buffer in QEMUFile is quite small and
all I/O operations with it are synchronous. I/O, like Ethernet, wants
good queues.

The difference is shown in the results table.

Den



* Re: [PATCH v3 0/4] block: seriously improve savevm performance
  2020-06-15 12:36   ` Denis V. Lunev
@ 2020-06-15 12:49     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 15+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-15 12:49 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	Juan Quintela, qemu-devel, Max Reitz, Denis Plotnikov,
	Stefan Hajnoczi

* Denis V. Lunev (den@openvz.org) wrote:
> On 6/15/20 3:17 PM, Dr. David Alan Gilbert wrote:
> > * Denis V. Lunev (den@openvz.org) wrote:
> >> This series do standard basic things:
> >> - it creates intermediate buffer for all writes from QEMU migration code
> >>   to QCOW2 image,
> >> - this buffer is sent to disk asynchronously, allowing several writes to
> >>   run in parallel.
> >>
> >> In general, migration code is fantastically inefficent (by observation),
> >> buffers are not aligned and sent with arbitrary pieces, a lot of time
> >> less than 100 bytes at a chunk, which results in read-modify-write
> >> operations with non-cached operations. It should also be noted that all
> >> operations are performed into unallocated image blocks, which also suffer
> >> due to partial writes to such new clusters.
> > It surprises me a little that you're not benefiting from the buffer
> > internal to qemu-file.c
> >
> > Dave
> There are a lot of problems with this buffer:
> 
> Flushes to block driver state are performed in the abstract places,
> pushing
>   a) small IO
>   b) non-aligned IO both to
>        1) page size
>        2) cluster size
> It should also be noted that buffer in QEMU file is quite small and
> all IO operations with it are synchronous. IO, like ethernet, wants
> good queues.

Yeah, for Ethernet we immediately get the kernel's buffer, so it's not too
bad; and I guess the async page writes are easier as well.

Dave

> The difference is on the table.
> 
> Den
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH 2/4] block/aio_task: allow start/wait task from any coroutine
  2020-06-15  9:34     ` Vladimir Sementsov-Ogievskiy
@ 2020-06-16 14:49       ` Denis V. Lunev
  0 siblings, 0 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-16 14:49 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

On 6/15/20 12:34 PM, Vladimir Sementsov-Ogievskiy wrote:
> 15.06.2020 10:47, Vladimir Sementsov-Ogievskiy wrote:
>> 11.06.2020 20:11, Denis V. Lunev wrote:
>>> From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>
>>> Currently, aio task pool assumes that there is a main coroutine, which
>>> creates tasks and wait for them. Let's remove the restriction by using
>>> CoQueue. Code becomes clearer, interface more obvious.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Max Reitz <mreitz@redhat.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> CC: Fam Zheng <fam@euphon.net>
>>> CC: Juan Quintela <quintela@redhat.com>
>>> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
>>> ---
>>>   block/aio_task.c | 21 ++++++---------------
>>>   1 file changed, 6 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/block/aio_task.c b/block/aio_task.c
>>> index 88989fa248..cf62e5c58b 100644
>>> --- a/block/aio_task.c
>>> +++ b/block/aio_task.c
>>> @@ -27,11 +27,10 @@
>>>   #include "block/aio_task.h"
>>>   struct AioTaskPool {
>>> -    Coroutine *main_co;
>>>       int status;
>>>       int max_busy_tasks;
>>>       int busy_tasks;
>>> -    bool waiting;
>>> +    CoQueue waiters;
>>>   };
>>>   static void coroutine_fn aio_task_co(void *opaque)
>>> @@ -52,31 +51,23 @@ static void coroutine_fn aio_task_co(void *opaque)
>>>       g_free(task);
>>> -    if (pool->waiting) {
>>> -        pool->waiting = false;
>>> -        aio_co_wake(pool->main_co);
>>> -    }
>>> +    qemu_co_queue_restart_all(&pool->waiters);
>>>   }
>>>   void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
>>>   {
>>>       assert(pool->busy_tasks > 0);
>>> -    assert(qemu_coroutine_self() == pool->main_co);
>>> -    pool->waiting = true;
>>> -    qemu_coroutine_yield();
>>> +    qemu_co_queue_wait(&pool->waiters, NULL);
>>> -    assert(!pool->waiting);
>>>       assert(pool->busy_tasks < pool->max_busy_tasks);
>>
>> As we wake up several coroutines now, I'm afraid this assertion may
>> start to fire.
>> And aio_task_pool_wait_one() becomes useless as a public API (still,
>> it's used only locally, so we can make it static).
>>
>> I'll send updated patch after reviewing the rest of the series.
>>
>
> Hm, OK, we have two kinds of waiters: waiting for a slot and for all
> tasks to finish. So, either we need two queues, or do like this patch
> (one queue, but wake-up all waiters, for them to check does their
> condition satisfied or not).
>
> I'm OK with this patch with the following squashed-in:
>
> diff --git a/include/block/aio_task.h b/include/block/aio_task.h
> index 50bc1e1817..50b1c036c5 100644
> --- a/include/block/aio_task.h
> +++ b/include/block/aio_task.h
> @@ -48,7 +48,6 @@ bool aio_task_pool_empty(AioTaskPool *pool);
>  void coroutine_fn aio_task_pool_start_task(AioTaskPool *pool, AioTask
> *task);
>  
>  void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool);
> -void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool);
>  void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool);
>  
>  #endif /* BLOCK_AIO_TASK_H */
> diff --git a/block/aio_task.c b/block/aio_task.c
> index cf62e5c58b..7ba15ff41f 100644
> --- a/block/aio_task.c
> +++ b/block/aio_task.c
> @@ -54,26 +54,17 @@ static void coroutine_fn aio_task_co(void *opaque)
>      qemu_co_queue_restart_all(&pool->waiters);
>  }
>  
> -void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
> -{
> -    assert(pool->busy_tasks > 0);
> -
> -    qemu_co_queue_wait(&pool->waiters, NULL);
> -
> -    assert(pool->busy_tasks < pool->max_busy_tasks);
> -}
> -
>  void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool)
>  {
>      while (pool->busy_tasks >= pool->max_busy_tasks) {
> -        aio_task_pool_wait_one(pool);
> +        qemu_co_queue_wait(&pool->waiters, NULL);
>      }
>  }
>  
>  void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool)
>  {
>      while (pool->busy_tasks > 0) {
> -        aio_task_pool_wait_one(pool);
> +        qemu_co_queue_wait(&pool->waiters, NULL);
>      }
>  }
>  
>
>
>
I'd rather make this a separate patch.

Den



Thread overview: 15+ messages
2020-06-11 17:11 [PATCH v3 0/4] block: seriously improve savevm performance Denis V. Lunev
2020-06-11 17:11 ` [PATCH 1/4] migration/savevm: respect qemu_fclose() error code in save_snapshot() Denis V. Lunev
2020-06-15  7:39   ` Vladimir Sementsov-Ogievskiy
2020-06-15 12:03   ` Dr. David Alan Gilbert
2020-06-11 17:11 ` [PATCH 2/4] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
2020-06-15  7:47   ` Vladimir Sementsov-Ogievskiy
2020-06-15  9:34     ` Vladimir Sementsov-Ogievskiy
2020-06-16 14:49       ` Denis V. Lunev
2020-06-11 17:11 ` [PATCH 3/4] block, migration: add bdrv_flush_vmstate helper Denis V. Lunev
2020-06-15  7:56   ` Vladimir Sementsov-Ogievskiy
2020-06-11 17:11 ` [PATCH 4/4] block/io: improve savevm performance Denis V. Lunev
2020-06-15  9:25   ` Vladimir Sementsov-Ogievskiy
2020-06-15 12:17 ` [PATCH v3 0/4] block: seriously " Dr. David Alan Gilbert
2020-06-15 12:36   ` Denis V. Lunev
2020-06-15 12:49     ` Dr. David Alan Gilbert
