* [PATCH v8 0/6] block: seriously improve savevm/loadvm performance
@ 2020-07-09 13:26 Denis V. Lunev
  2020-07-09 13:26 ` [PATCH 1/6] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-07-09 13:26 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V . Lunev

This series does two basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, the migration code is fantastically inefficient (by observation):
buffers are not aligned and are sent in arbitrary pieces, often less than
100 bytes per chunk, which results in read-modify-write operations when the
image is accessed without caching. It should also be noted that all
operations are performed into unallocated image blocks, which additionally
suffer due to partial writes to such new clusters.
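As an illustration of that read-modify-write cost (illustration only, not
part of the series; the 4 KiB request alignment and the write_unaligned()
helper below are assumptions made for the example), a tiny unaligned write
to a non-cached file cannot be issued directly, so the enclosing aligned
block has to be read, patched and written back:

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLK 4096   /* assumed request alignment for non-cached I/O */

/* What a sub-100-byte unaligned write effectively costs the block layer
 * (assumes the write does not cross a block boundary). */
static ssize_t write_unaligned(int fd, const void *buf, size_t len, off_t off)
{
    char block[BLK] __attribute__((aligned(BLK)));
    off_t start = off & ~(off_t)(BLK - 1);

    memset(block, 0, BLK);
    if (pread(fd, block, BLK, start) < 0) {   /* read the whole block */
        return -1;
    }
    memcpy(block + (off - start), buf, len);  /* patch a few bytes of it */
    return pwrite(fd, block, BLK, start);     /* write the whole block back */
}

The intermediate buffer in this series avoids exactly this pattern by
accumulating the fragments and submitting aligned, cluster-sized chunks.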

This patch series is an implementation of the idea discussed in the RFC
posted by Denis Plotnikov:
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
Results with this series over NVMe are better than with the original code:
                original     rfc    this
cached:          1.79s      2.38s   1.27s
non-cached:      3.29s      1.31s   0.81s

Changes from v7:
- dropped lock from LoadVMState
- fixed assert in last patch
- dropped patch 1 as queued

Changes from v6:
- blk_load_vmstate kludges added (patchew problem fixed)

Changes from v5:
- loadvm optimizations added, with Vladimir's comments addressed

Changes from v4:
- added patch 4 with blk_save_vmstate() cleanup
- added R-By
- bdrv_flush_vmstate -> bdrv_finalize_vmstate
- fixed return code of bdrv_co_do_save_vmstate
- fixed typos in comments (Eric, thanks!)
- fixed patchew warnings

Changes from v3:
- rebased to master
- added patch 3 which removes aio_task_pool_wait_one()
- added R-By to patch 1
- patch 4 is rewritten via bdrv_run_co
- error path in blk_save_vmstate() is rewritten to call bdrv_flush_vmstate
  unconditionally
- added some comments
- fixed initialization in bdrv_co_vmstate_save_task_entry as suggested

Changes from v2:
- code moved from QCOW2 level to generic block level
- created bdrv_flush_vmstate helper to fix 022, 029 tests
- added recursion into bs->file in bdrv_co_flush_vmstate (fixes 267)
- fixed blk_save_vmstate helper
- fixed coroutine wait as Vladimir suggested with waiting fixes from me

Changes from v1:
- patchew warning fixed
- fixed validation that only 1 waiter is allowed in patch 1

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>




^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/6] block/aio_task: allow start/wait task from any coroutine
  2020-07-09 13:26 [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
@ 2020-07-09 13:26 ` Denis V. Lunev
  2020-07-09 13:26 ` [PATCH 2/6] block/aio_task: drop aio_task_pool_wait_one() helper Denis V. Lunev
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-07-09 13:26 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V . Lunev

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Currently, the aio task pool assumes that there is a single main coroutine,
which creates tasks and waits for them. Let's remove this restriction by
using a CoQueue. The code becomes clearer and the interface more obvious.
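As a minimal illustration of what this buys (a sketch, not part of the
patch; drain_from_any_coroutine() is a hypothetical caller), any coroutine
may now sleep on pool->waiters and be woken by qemu_co_queue_restart_all(),
whereas the old code asserted that only the creating coroutine could wait:

#include "qemu/osdep.h"
#include "block/aio_task.h"

/*
 * Hypothetical caller: drains the pool from a coroutine other than the
 * one that called aio_task_pool_new().  Before this patch the
 * assert(qemu_coroutine_self() == pool->main_co) would have fired.
 */
static void coroutine_fn drain_from_any_coroutine(AioTaskPool *pool)
{
    aio_task_pool_wait_all(pool);   /* waits on pool->waiters while busy */
}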

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/aio_task.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index 88989fa248..cf62e5c58b 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -27,11 +27,10 @@
 #include "block/aio_task.h"
 
 struct AioTaskPool {
-    Coroutine *main_co;
     int status;
     int max_busy_tasks;
     int busy_tasks;
-    bool waiting;
+    CoQueue waiters;
 };
 
 static void coroutine_fn aio_task_co(void *opaque)
@@ -52,31 +51,23 @@ static void coroutine_fn aio_task_co(void *opaque)
 
     g_free(task);
 
-    if (pool->waiting) {
-        pool->waiting = false;
-        aio_co_wake(pool->main_co);
-    }
+    qemu_co_queue_restart_all(&pool->waiters);
 }
 
 void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
 {
     assert(pool->busy_tasks > 0);
-    assert(qemu_coroutine_self() == pool->main_co);
 
-    pool->waiting = true;
-    qemu_coroutine_yield();
+    qemu_co_queue_wait(&pool->waiters, NULL);
 
-    assert(!pool->waiting);
     assert(pool->busy_tasks < pool->max_busy_tasks);
 }
 
 void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool)
 {
-    if (pool->busy_tasks < pool->max_busy_tasks) {
-        return;
+    while (pool->busy_tasks >= pool->max_busy_tasks) {
+        aio_task_pool_wait_one(pool);
     }
-
-    aio_task_pool_wait_one(pool);
 }
 
 void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool)
@@ -98,8 +89,8 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int max_busy_tasks)
 {
     AioTaskPool *pool = g_new0(AioTaskPool, 1);
 
-    pool->main_co = qemu_coroutine_self();
     pool->max_busy_tasks = max_busy_tasks;
+    qemu_co_queue_init(&pool->waiters);
 
     return pool;
 }
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/6] block/aio_task: drop aio_task_pool_wait_one() helper
  2020-07-09 13:26 [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
  2020-07-09 13:26 ` [PATCH 1/6] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
@ 2020-07-09 13:26 ` Denis V. Lunev
  2020-07-09 13:26 ` [PATCH 3/6] block/block-backend: remove always true check from blk_save_vmstate Denis V. Lunev
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-07-09 13:26 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi, Denis V. Lunev

It is not used outside the module.

There are really only two kinds of waiters:
- those waiting for a free slot, and
- those waiting for all tasks to finish.
This patch limits the external API to these two types.
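For reference, the intended external usage pattern now looks roughly like
this (a sketch under stated assumptions: submit_one_task() is a
hypothetical caller-provided helper around aio_task_pool_start_task()):

#include "qemu/osdep.h"
#include "block/aio_task.h"

/* Hypothetical helper that builds an AioTask and starts it in the pool */
void coroutine_fn submit_one_task(AioTaskPool *pool, int idx);

static int coroutine_fn run_all_tasks(AioTaskPool *pool, int n)
{
    for (int i = 0; i < n; i++) {
        /* throttle submissions to at most max_busy_tasks in flight */
        aio_task_pool_wait_slot(pool);
        submit_one_task(pool, i);
    }

    /* drain every started task before inspecting the overall status */
    aio_task_pool_wait_all(pool);
    return aio_task_pool_status(pool);
}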

Signed-off-by: Denis V. Lunev <den@openvz.org>
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/aio_task.c         | 13 ++-----------
 include/block/aio_task.h |  1 -
 2 files changed, 2 insertions(+), 12 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index cf62e5c58b..7ba15ff41f 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -54,26 +54,17 @@ static void coroutine_fn aio_task_co(void *opaque)
     qemu_co_queue_restart_all(&pool->waiters);
 }
 
-void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
-{
-    assert(pool->busy_tasks > 0);
-
-    qemu_co_queue_wait(&pool->waiters, NULL);
-
-    assert(pool->busy_tasks < pool->max_busy_tasks);
-}
-
 void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool)
 {
     while (pool->busy_tasks >= pool->max_busy_tasks) {
-        aio_task_pool_wait_one(pool);
+        qemu_co_queue_wait(&pool->waiters, NULL);
     }
 }
 
 void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool)
 {
     while (pool->busy_tasks > 0) {
-        aio_task_pool_wait_one(pool);
+        qemu_co_queue_wait(&pool->waiters, NULL);
     }
 }
 
diff --git a/include/block/aio_task.h b/include/block/aio_task.h
index 50bc1e1817..50b1c036c5 100644
--- a/include/block/aio_task.h
+++ b/include/block/aio_task.h
@@ -48,7 +48,6 @@ bool aio_task_pool_empty(AioTaskPool *pool);
 void coroutine_fn aio_task_pool_start_task(AioTaskPool *pool, AioTask *task);
 
 void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool);
-void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool);
 void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool);
 
 #endif /* BLOCK_AIO_TASK_H */
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/6] block/block-backend: remove always true check from blk_save_vmstate
  2020-07-09 13:26 [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
  2020-07-09 13:26 ` [PATCH 1/6] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
  2020-07-09 13:26 ` [PATCH 2/6] block/aio_task: drop aio_task_pool_wait_one() helper Denis V. Lunev
@ 2020-07-09 13:26 ` Denis V. Lunev
  2020-07-09 13:26 ` [PATCH 4/6] block, migration: add bdrv_finalize_vmstate helper Denis V. Lunev
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-07-09 13:26 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi, Denis V. Lunev

bdrv_save_vmstate() returns either a negative value on error or the full
size on success. Thus, after the preceding error check, the 'ret == size'
part of this condition is always true and can be dropped.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Suggested-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/block-backend.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 6936b25c83..1c6e53bbde 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2188,7 +2188,7 @@ int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
         return ret;
     }
 
-    if (ret == size && !blk->enable_write_cache) {
+    if (!blk->enable_write_cache) {
         ret = bdrv_flush(blk_bs(blk));
     }
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/6] block, migration: add bdrv_finalize_vmstate helper
  2020-07-09 13:26 [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
                   ` (2 preceding siblings ...)
  2020-07-09 13:26 ` [PATCH 3/6] block/block-backend: remove always true check from blk_save_vmstate Denis V. Lunev
@ 2020-07-09 13:26 ` Denis V. Lunev
  2020-08-27 12:58   ` Daniel P. Berrangé
  2020-07-09 13:26 ` [PATCH 5/6] block/io: improve savevm performance Denis V. Lunev
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-07-09 13:26 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi, Denis V. Lunev

Right now bdrv_fclose() is just calling bdrv_flush().

The problem is that the migration code works inefficiently from the block
layer's point of view: it frequently issues writes of very small pieces of
unaligned data. The block layer is capable of working this way, but it is
very slow.

This patch is a preparation for the introduction of an intermediate buffer
in the block driver state. It is beneficial to separate the conventional
bdrv_flush() from closing the QEMU file in the migration code.

The patch also invokes bdrv_finalize_vmstate() inside the synchronous
blk_save_vmstate() operation. This helper is used from qemu-io only.
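The resulting calling convention can be summarized with a short sketch
(illustration only, mirroring what the patch does in blk_save_vmstate();
save_vmstate_example() is a hypothetical caller): bdrv_finalize_vmstate()
must be called once saving is done, even if a write failed, so that any
buffered state is pushed out and its error is not lost:

#include "qemu/osdep.h"
#include "block/block.h"

/* Hypothetical caller showing the save + finalize contract */
static int save_vmstate_example(BlockDriverState *bs, const uint8_t *buf,
                                int64_t pos, int size)
{
    int ret = bdrv_save_vmstate(bs, buf, pos, size);
    int ret2 = bdrv_finalize_vmstate(bs);   /* finalize unconditionally */

    if (ret < 0) {
        return ret;
    }
    return ret2 < 0 ? ret2 : ret;   /* on success ret is the size written */
}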

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/block-backend.c |  6 +++++-
 block/io.c            | 15 +++++++++++++++
 include/block/block.h |  5 +++++
 migration/savevm.c    |  4 ++++
 4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 1c6e53bbde..5bb11c8e01 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2177,16 +2177,20 @@ int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
 int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
                      int64_t pos, int size)
 {
-    int ret;
+    int ret, ret2;
 
     if (!blk_is_available(blk)) {
         return -ENOMEDIUM;
     }
 
     ret = bdrv_save_vmstate(blk_bs(blk), buf, pos, size);
+    ret2 = bdrv_finalize_vmstate(blk_bs(blk));
     if (ret < 0) {
         return ret;
     }
+    if (ret2 < 0) {
+        return ret2;
+    }
 
     if (!blk->enable_write_cache) {
         ret = bdrv_flush(blk_bs(blk));
diff --git a/block/io.c b/block/io.c
index b6564e34c5..be1fac0be8 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2724,6 +2724,21 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
     return bdrv_rw_vmstate(bs, qiov, pos, true);
 }
 
+static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
+{
+    return 0;
+}
+
+static int coroutine_fn bdrv_finalize_vmstate_co_entry(void *opaque)
+{
+    return bdrv_co_finalize_vmstate(opaque);
+}
+
+int bdrv_finalize_vmstate(BlockDriverState *bs)
+{
+    return bdrv_run_co(bs, bdrv_finalize_vmstate_co_entry, bs);
+}
+
 /**************************************************************/
 /* async I/Os */
 
diff --git a/include/block/block.h b/include/block/block.h
index bca3bb831c..124eb8d422 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -567,6 +567,11 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
 
 int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
                       int64_t pos, int size);
+/*
+ * bdrv_finalize_vmstate() is mandatory to commit vmstate changes if
+ * bdrv_save_vmstate() was ever called.
+ */
+int bdrv_finalize_vmstate(BlockDriverState *bs);
 
 void bdrv_img_create(const char *filename, const char *fmt,
                      const char *base_filename, const char *base_fmt,
diff --git a/migration/savevm.c b/migration/savevm.c
index 45c9dd9d8a..d8a94e312c 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -150,6 +150,10 @@ static ssize_t block_get_buffer(void *opaque, uint8_t *buf, int64_t pos,
 
 static int bdrv_fclose(void *opaque, Error **errp)
 {
+    int err = bdrv_finalize_vmstate(opaque);
+    if (err < 0) {
+        return err;
+    }
     return bdrv_flush(opaque);
 }
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/6] block/io: improve savevm performance
  2020-07-09 13:26 [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
                   ` (3 preceding siblings ...)
  2020-07-09 13:26 ` [PATCH 4/6] block, migration: add bdrv_finalize_vmstate helper Denis V. Lunev
@ 2020-07-09 13:26 ` Denis V. Lunev
  2020-07-09 13:26 ` [PATCH 6/6] block/io: improve loadvm performance Denis V. Lunev
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-07-09 13:26 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi, Denis V. Lunev

This patch does two basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to block driver,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

Thus bdrv_vmstate_write() becomes asynchronous. Completion of all pending
operations is performed in the newly introduced bdrv_finalize_vmstate().

In general, the migration code is fantastically inefficient (by observation):
buffers are not aligned and are sent in arbitrary pieces, often less than
100 bytes per chunk, which results in read-modify-write operations if the
target file descriptor is opened with O_DIRECT. It should also be noted
that all operations are performed into unallocated image blocks, which also
suffer due to partial writes to such new clusters, even on cached file
descriptors.

Snapshot creation time (2 GB Fedora 31 VM running over NVMe storage):
                original     fixed
cached:          1.79s       1.27s
non-cached:      3.29s       0.81s

The difference over HDD would be more significant :)

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/io.c                | 126 +++++++++++++++++++++++++++++++++++++-
 include/block/block_int.h |   8 +++
 2 files changed, 132 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index be1fac0be8..061d3239b9 100644
--- a/block/io.c
+++ b/block/io.c
@@ -26,6 +26,7 @@
 #include "trace.h"
 #include "sysemu/block-backend.h"
 #include "block/aio-wait.h"
+#include "block/aio_task.h"
 #include "block/blockjob.h"
 #include "block/blockjob_int.h"
 #include "block/block_int.h"
@@ -33,6 +34,7 @@
 #include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "qemu/units.h"
 #include "sysemu/replay.h"
 
 /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
@@ -2640,6 +2642,103 @@ typedef struct BdrvVmstateCo {
     bool                is_read;
 } BdrvVmstateCo;
 
+typedef struct BdrvVMStateTask {
+    AioTask task;
+
+    BlockDriverState *bs;
+    int64_t offset;
+    void *buf;
+    size_t bytes;
+} BdrvVMStateTask;
+
+typedef struct BdrvSaveVMState {
+    AioTaskPool *pool;
+    BdrvVMStateTask *t;
+} BdrvSaveVMState;
+
+
+static coroutine_fn int bdrv_co_vmstate_save_task_entry(AioTask *task)
+{
+    int err = 0;
+    BdrvVMStateTask *t = container_of(task, BdrvVMStateTask, task);
+
+    if (t->bytes != 0) {
+        QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, t->buf, t->bytes);
+
+        bdrv_inc_in_flight(t->bs);
+        err = t->bs->drv->bdrv_save_vmstate(t->bs, &qiov, t->offset);
+        bdrv_dec_in_flight(t->bs);
+    }
+
+    qemu_vfree(t->buf);
+    return err;
+}
+
+static BdrvVMStateTask *bdrv_vmstate_task_create(BlockDriverState *bs,
+                                                 int64_t pos, size_t size)
+{
+    BdrvVMStateTask *t = g_new(BdrvVMStateTask, 1);
+
+    *t = (BdrvVMStateTask) {
+        .task.func = bdrv_co_vmstate_save_task_entry,
+        .buf = qemu_blockalign(bs, size),
+        .offset = pos,
+        .bs = bs,
+    };
+
+    return t;
+}
+
+static int bdrv_co_do_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
+                                   int64_t pos)
+{
+    BdrvSaveVMState *state = bs->savevm_state;
+    BdrvVMStateTask *t;
+    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);
+    size_t to_copy, off;
+
+    if (state == NULL) {
+        state = g_new(BdrvSaveVMState, 1);
+        *state = (BdrvSaveVMState) {
+            .pool = aio_task_pool_new(BDRV_VMSTATE_WORKERS_MAX),
+            .t = bdrv_vmstate_task_create(bs, pos, buf_size),
+        };
+
+        bs->savevm_state = state;
+    }
+
+    if (aio_task_pool_status(state->pool) < 0) {
+        /*
+         * The operation as a whole is unsuccessful. Prohibit all further
+         * operations. If we clean here, new useless ops will come again.
+         * Thus we rely on caller for cleanup here.
+         */
+        return aio_task_pool_status(state->pool);
+    }
+
+    t = state->t;
+    if (t->offset + t->bytes != pos) {
+        /* Normally this branch is not reachable from migration */
+        return bs->drv->bdrv_save_vmstate(bs, qiov, pos);
+    }
+
+    off = 0;
+    while (1) {
+        to_copy = MIN(qiov->size - off, buf_size - t->bytes);
+        qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
+        t->bytes += to_copy;
+        if (t->bytes < buf_size) {
+            return 0;
+        }
+
+        aio_task_pool_start_task(state->pool, &t->task);
+
+        pos += to_copy;
+        off += to_copy;
+        state->t = t = bdrv_vmstate_task_create(bs, pos, buf_size);
+    }
+}
+
 static int coroutine_fn
 bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                    bool is_read)
@@ -2655,7 +2754,7 @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
         if (is_read) {
             ret = drv->bdrv_load_vmstate(bs, qiov, pos);
         } else {
-            ret = drv->bdrv_save_vmstate(bs, qiov, pos);
+            ret = bdrv_co_do_save_vmstate(bs, qiov, pos);
         }
     } else if (bs->file) {
         ret = bdrv_co_rw_vmstate(bs->file->bs, qiov, pos, is_read);
@@ -2726,7 +2825,30 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
 
 static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
 {
-    return 0;
+    int err;
+    BdrvSaveVMState *state = bs->savevm_state;
+
+    if (bs->drv->bdrv_save_vmstate == NULL && bs->file != NULL) {
+        return bdrv_co_finalize_vmstate(bs->file->bs);
+    }
+    if (state == NULL) {
+        return 0;
+    }
+
+    if (aio_task_pool_status(state->pool) >= 0) {
+        /* We are on success path, commit last chunk if possible */
+        aio_task_pool_start_task(state->pool, &state->t->task);
+    }
+
+    aio_task_pool_wait_all(state->pool);
+    err = aio_task_pool_status(state->pool);
+
+    aio_task_pool_free(state->pool);
+    g_free(state);
+
+    bs->savevm_state = NULL;
+
+    return err;
 }
 
 static int coroutine_fn bdrv_finalize_vmstate_co_entry(void *opaque)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 3d6cf88592..e975ac4c21 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -61,6 +61,8 @@
 
 #define BLOCK_PROBE_BUF_SIZE        512
 
+#define BDRV_VMSTATE_WORKERS_MAX    8
+
 enum BdrvTrackedRequestType {
     BDRV_TRACKED_READ,
     BDRV_TRACKED_WRITE,
@@ -808,6 +810,9 @@ struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+
+typedef struct BdrvSaveVMState BdrvSaveVMState;
+
 /*
  * Note: the function bdrv_append() copies and swaps contents of
  * BlockDriverStates, so if you add new fields to this struct, please
@@ -971,6 +976,9 @@ struct BlockDriverState {
 
     /* BdrvChild links to this node may never be frozen */
     bool never_freeze;
+
+    /* Intermediate buffer for VM state saving from snapshot creation code */
+    BdrvSaveVMState *savevm_state;
 };
 
 struct BlockBackendRootState {
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 6/6] block/io: improve loadvm performance
  2020-07-09 13:26 [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
                   ` (4 preceding siblings ...)
  2020-07-09 13:26 ` [PATCH 5/6] block/io: improve savevm performance Denis V. Lunev
@ 2020-07-09 13:26 ` Denis V. Lunev
  2020-07-10 14:40   ` Vladimir Sementsov-Ogievskiy
  2020-08-20  7:42 ` [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
  2020-09-14 13:13 ` Stefan Hajnoczi
  7 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-07-09 13:26 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Max Reitz, Denis Plotnikov, Stefan Hajnoczi,
	Denis V. Lunev

This patch creates an intermediate buffer for reading from the block driver
state and performs read-ahead into this buffer. Snapshot code performs its
reads sequentially, thus we know which offsets will be required and when
they are no longer needed.
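The sequential-access assumption can be spelled out as a sketch
(illustration only; read_vmstate_stream() and the 64 KiB read size are
hypothetical, the real consumer is the snapshot loading code): positions
passed to bdrv_load_vmstate() only ever grow, so the block layer can
prefetch the next chunks and recycle everything behind the current offset:

#include "qemu/osdep.h"
#include "block/block.h"

/* Hypothetical consumer: offsets are strictly increasing */
static int read_vmstate_stream(BlockDriverState *bs, uint8_t *buf,
                               int64_t total)
{
    int64_t pos = 0;

    while (pos < total) {
        int chunk = MIN(total - pos, 64 * 1024);
        int ret = bdrv_load_vmstate(bs, buf + pos, pos, chunk);
        if (ret < 0) {
            return ret;
        }
        pos += chunk;                       /* never seeks backwards */
    }
    return bdrv_finalize_vmstate(bs);       /* free the read-ahead buffers */
}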

Results are fantastic. Switch-to-snapshot times for a 2 GB Fedora 31 VM
over NVMe storage are the following:
                original     fixed
cached:          1.84s       1.16s
non-cached:     12.74s       1.27s

The difference over HDD would be more significant :)

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/block-backend.c     |  12 +-
 block/io.c                | 239 +++++++++++++++++++++++++++++++++++++-
 include/block/block_int.h |   3 +
 3 files changed, 250 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 5bb11c8e01..09773b3e37 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2201,11 +2201,21 @@ int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
 
 int blk_load_vmstate(BlockBackend *blk, uint8_t *buf, int64_t pos, int size)
 {
+    int ret, ret2;
+
     if (!blk_is_available(blk)) {
         return -ENOMEDIUM;
     }
 
-    return bdrv_load_vmstate(blk_bs(blk), buf, pos, size);
+    ret = bdrv_load_vmstate(blk_bs(blk), buf, pos, size);
+    ret2 = bdrv_finalize_vmstate(blk_bs(blk));
+    if (ret < 0) {
+        return ret;
+    }
+    if (ret2 < 0) {
+        return ret2;
+    }
+    return ret;
 }
 
 int blk_probe_blocksizes(BlockBackend *blk, BlockSizes *bsz)
diff --git a/block/io.c b/block/io.c
index 061d3239b9..510122fbc4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2739,6 +2739,194 @@ static int bdrv_co_do_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
     }
 }
 
+
+typedef struct BdrvLoadVMChunk {
+    void *buf;
+    uint64_t offset;
+    ssize_t bytes;
+
+    QLIST_ENTRY(BdrvLoadVMChunk) list;
+} BdrvLoadVMChunk;
+
+typedef struct BdrvLoadVMState {
+    AioTaskPool *pool;
+
+    int64_t offset;
+    int64_t last_loaded;
+
+    int chunk_count;
+    QLIST_HEAD(, BdrvLoadVMChunk) chunks;
+    QLIST_HEAD(, BdrvLoadVMChunk) loading;
+    CoQueue waiters;
+} BdrvLoadVMState;
+
+typedef struct BdrvLoadVMStateTask {
+    AioTask task;
+
+    BlockDriverState *bs;
+    BdrvLoadVMChunk *chunk;
+} BdrvLoadVMStateTask;
+
+static BdrvLoadVMChunk *bdrv_co_find_loadvmstate_chunk(int64_t pos,
+                                                       BdrvLoadVMChunk *c)
+{
+    for (; c != NULL; c = QLIST_NEXT(c, list)) {
+        if (c->offset <= pos && c->offset + c->bytes > pos) {
+            return c;
+        }
+    }
+
+    return NULL;
+}
+
+static void bdrv_free_loadvm_chunk(BdrvLoadVMChunk *c)
+{
+    qemu_vfree(c->buf);
+    g_free(c);
+}
+
+static coroutine_fn int bdrv_co_vmstate_load_task_entry(AioTask *task)
+{
+    int err = 0;
+    BdrvLoadVMStateTask *t = container_of(task, BdrvLoadVMStateTask, task);
+    BdrvLoadVMChunk *c = t->chunk;
+    BdrvLoadVMState *state = t->bs->loadvm_state;
+    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, c->buf, c->bytes);
+
+    bdrv_inc_in_flight(t->bs);
+    err = t->bs->drv->bdrv_load_vmstate(t->bs, &qiov, c->offset);
+    bdrv_dec_in_flight(t->bs);
+
+    QLIST_REMOVE(c, list);
+    if (err == 0) {
+        QLIST_INSERT_HEAD(&state->chunks, c, list);
+    } else {
+        bdrv_free_loadvm_chunk(c);
+    }
+    qemu_co_queue_restart_all(&state->waiters);
+
+    return err;
+}
+
+
+static void bdrv_co_loadvmstate_next(BlockDriverState *bs, BdrvLoadVMChunk *c)
+{
+    BdrvLoadVMStateTask *t = g_new(BdrvLoadVMStateTask, 1);
+    BdrvLoadVMState *state = bs->loadvm_state;
+
+    c->offset = state->last_loaded;
+
+    *t = (BdrvLoadVMStateTask) {
+        .task.func = bdrv_co_vmstate_load_task_entry,
+        .bs = bs,
+        .chunk = c,
+    };
+
+    QLIST_INSERT_HEAD(&state->loading, t->chunk, list);
+    state->chunk_count++;
+    state->last_loaded += c->bytes;
+
+    aio_task_pool_start_task(state->pool, &t->task);
+}
+
+
+static void bdrv_co_loadvmstate_start(BlockDriverState *bs)
+{
+    int i;
+    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);
+
+    for (i = 0; i < BDRV_VMSTATE_WORKERS_MAX; i++) {
+        BdrvLoadVMChunk *c = g_new0(BdrvLoadVMChunk, 1);
+
+        c->buf = qemu_blockalign(bs, buf_size);
+        c->bytes = buf_size;
+
+        bdrv_co_loadvmstate_next(bs, c);
+    }
+}
+
+static int bdrv_co_do_load_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
+                                   int64_t pos)
+{
+    BdrvLoadVMState *state = bs->loadvm_state;
+    BdrvLoadVMChunk *c;
+    size_t off;
+    int64_t start_pos = pos;
+
+    if (state == NULL) {
+        if (pos != 0) {
+            goto slow_path;
+        }
+
+        state = g_new(BdrvLoadVMState, 1);
+        *state = (BdrvLoadVMState) {
+            .pool = aio_task_pool_new(BDRV_VMSTATE_WORKERS_MAX),
+            .chunks = QLIST_HEAD_INITIALIZER(state->chunks),
+            .loading = QLIST_HEAD_INITIALIZER(state->loading),
+        };
+        qemu_co_queue_init(&state->waiters);
+
+        bs->loadvm_state = state;
+
+        bdrv_co_loadvmstate_start(bs);
+    }
+
+    if (state->offset != pos) {
+        goto slow_path;
+    }
+
+    off = 0;
+
+    while (off < qiov->size && aio_task_pool_status(state->pool) == 0) {
+        c = bdrv_co_find_loadvmstate_chunk(pos, QLIST_FIRST(&state->chunks));
+        if (c != NULL) {
+            ssize_t chunk_off = pos - c->offset;
+            ssize_t to_copy = MIN(qiov->size - off, c->bytes - chunk_off);
+
+            qemu_iovec_from_buf(qiov, off, c->buf + chunk_off, to_copy);
+
+            off += to_copy;
+            pos += to_copy;
+
+            if (pos == c->offset + c->bytes) {
+                state->chunk_count--;
+                /* End of buffer, discard it from the list */
+                QLIST_REMOVE(c, list);
+
+                /*
+                 * Start loading next chunk. The slot in the pool should
+                 * always be free for this purpose at the moment.
+                 *
+                 * There is a problem with the end of the stream. This code
+                 * starts to read the data beyond the end of the saved state.
+                 * The code at the low level should be ready for such behavior but
+                 * we will have unnecessary BDRV_VMSTATE_WORKERS_MAX chunks
+                 * fully zeroed. This is not good, but acceptable.
+                 */
+                bdrv_co_loadvmstate_next(bs, c);
+            }
+
+            state->offset += to_copy;
+            continue;
+        }
+
+        c = bdrv_co_find_loadvmstate_chunk(pos, QLIST_FIRST(&state->loading));
+        if (c != NULL) {
+            qemu_co_queue_wait(&state->waiters, NULL);
+            continue;
+        }
+
+slow_path:
+        /*
+         * This should not happen normally. This point could be reached only
+         * if we have had some parallel requests. Fallback to slow load.
+         */
+        return bs->drv->bdrv_load_vmstate(bs, qiov, start_pos);
+    }
+
+    return aio_task_pool_status(state->pool);
+}
+
 static int coroutine_fn
 bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                    bool is_read)
@@ -2752,7 +2940,7 @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
         ret = -ENOMEDIUM;
     } else if (drv->bdrv_load_vmstate) {
         if (is_read) {
-            ret = drv->bdrv_load_vmstate(bs, qiov, pos);
+            ret = bdrv_co_do_load_vmstate(bs, qiov, pos);
         } else {
             ret = bdrv_co_do_save_vmstate(bs, qiov, pos);
         }
@@ -2823,13 +3011,13 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
     return bdrv_rw_vmstate(bs, qiov, pos, true);
 }
 
-static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
+static int coroutine_fn bdrv_co_finalize_save_vmstate(BlockDriverState *bs)
 {
     int err;
     BdrvSaveVMState *state = bs->savevm_state;
 
     if (bs->drv->bdrv_save_vmstate == NULL && bs->file != NULL) {
-        return bdrv_co_finalize_vmstate(bs->file->bs);
+        return bdrv_co_finalize_save_vmstate(bs->file->bs);
     }
     if (state == NULL) {
         return 0;
@@ -2851,6 +3039,51 @@ static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
     return err;
 }
 
+static int coroutine_fn bdrv_co_finalize_load_vmstate(BlockDriverState *bs)
+{
+    int err;
+    BdrvLoadVMState *state = bs->loadvm_state;
+    BdrvLoadVMChunk *c, *tmp;
+
+    if (bs->drv->bdrv_load_vmstate == NULL && bs->file != NULL) {
+        return bdrv_co_finalize_load_vmstate(bs->file->bs);
+    }
+    if (state == NULL) {
+        return 0;
+    }
+
+    aio_task_pool_wait_all(state->pool);
+    err = aio_task_pool_status(state->pool);
+    aio_task_pool_free(state->pool);
+
+    /* this list must be empty as all tasks are committed */
+    assert(QLIST_EMPTY(&state->loading));
+
+    QLIST_FOREACH_SAFE(c, &state->chunks, list, tmp) {
+        QLIST_REMOVE(c, list);
+        bdrv_free_loadvm_chunk(c);
+    }
+
+    g_free(state);
+
+    bs->loadvm_state = NULL;
+
+    return err;
+}
+
+static int coroutine_fn bdrv_co_finalize_vmstate(BlockDriverState *bs)
+{
+    int err1 = bdrv_co_finalize_save_vmstate(bs);
+    int err2 = bdrv_co_finalize_load_vmstate(bs);
+    if (err1 < 0) {
+        return err1;
+    }
+    if (err2 < 0) {
+        return err2;
+    }
+    return 0;
+}
+
 static int coroutine_fn bdrv_finalize_vmstate_co_entry(void *opaque)
 {
     return bdrv_co_finalize_vmstate(opaque);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index e975ac4c21..b98d44f61b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -812,6 +812,7 @@ struct BdrvChild {
 
 
 typedef struct BdrvSaveVMState BdrvSaveVMState;
+typedef struct BdrvLoadVMState BdrvLoadVMState;
 
 /*
  * Note: the function bdrv_append() copies and swaps contents of
@@ -979,6 +980,8 @@ struct BlockDriverState {
 
     /* Intermediate buffer for VM state saving from snapshot creation code */
     BdrvSaveVMState *savevm_state;
+    /* Intermediate buffer for VM state loading */
+    BdrvLoadVMState *loadvm_state;
 };
 
 struct BlockBackendRootState {
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 6/6] block/io: improve loadvm performance
  2020-07-09 13:26 ` [PATCH 6/6] block/io: improve loadvm performance Denis V. Lunev
@ 2020-07-10 14:40   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-10 14:40 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Max Reitz, Denis Plotnikov,
	Stefan Hajnoczi

09.07.2020 16:26, Denis V. Lunev wrote:
> This patch creates intermediate buffer for reading from block driver
> state and performs read-ahead to this buffer. Snapshot code performs
> reads sequentially and thus we know what offsets will be required
> and when they will become not needed.
> 
> Results are fantastic. Switch to snapshot times of 2GB Fedora 31 VM
> over NVME storage are the following:
>                  original     fixed
> cached:          1.84s       1.16s
> non-cached:     12.74s       1.27s
> 
> The difference over HDD would be more significant:)
> 
> Signed-off-by: Denis V. Lunev<den@openvz.org>
> CC: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
> CC: Kevin Wolf<kwolf@redhat.com>
> CC: Max Reitz<mreitz@redhat.com>
> CC: Stefan Hajnoczi<stefanha@redhat.com>
> CC: Fam Zheng<fam@euphon.net>
> CC: Juan Quintela<quintela@redhat.com>
> CC: Denis Plotnikov<dplotnikov@virtuozzo.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 0/6] block: seriously improve savevm/loadvm performance
  2020-07-09 13:26 [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
                   ` (5 preceding siblings ...)
  2020-07-09 13:26 ` [PATCH 6/6] block/io: improve loadvm performance Denis V. Lunev
@ 2020-08-20  7:42 ` Denis V. Lunev
  2020-08-27 12:41   ` Denis V. Lunev
  2020-09-14 13:13 ` Stefan Hajnoczi
  7 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-08-20  7:42 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi

On 7/9/20 4:26 PM, Denis V. Lunev wrote:
> This series do standard basic things:
> - it creates intermediate buffer for all writes from QEMU migration code
>   to QCOW2 image,
> - this buffer is sent to disk asynchronously, allowing several writes to
>   run in parallel.
>
> In general, migration code is fantastically inefficent (by observation),
> buffers are not aligned and sent with arbitrary pieces, a lot of time
> less than 100 bytes at a chunk, which results in read-modify-write
> operations with non-cached operations. It should also be noted that all
> operations are performed into unallocated image blocks, which also suffer
> due to partial writes to such new clusters.
>
> This patch series is an implementation of idea discussed in the RFC
> posted by Denis Plotnikov
> https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
> Results with this series over NVME are better than original code
>                 original     rfc    this
> cached:          1.79s      2.38s   1.27s
> non-cached:      3.29s      1.31s   0.81s
>
> Changes from v7:
> - dropped lock from LoadVMState
> - fixed assert in last patch
> - dropped patch 1 as queued
>
> Changes from v6:
> - blk_load_vmstate kludges added (patchew problem fixed)
>
> Changes from v5:
> - loadvm optimizations added with Vladimir comments included
>
> Changes from v4:
> - added patch 4 with blk_save_vmstate() cleanup
> - added R-By
> - bdrv_flush_vmstate -> bdrv_finalize_vmstate
> - fixed return code of bdrv_co_do_save_vmstate
> - fixed typos in comments (Eric, thanks!)
> - fixed patchew warnings
>
> Changes from v3:
> - rebased to master
> - added patch 3 which removes aio_task_pool_wait_one()
> - added R-By to patch 1
> - patch 4 is rewritten via bdrv_run_co
> - error path in blk_save_vmstate() is rewritten to call bdrv_flush_vmstate
>   unconditionally
> - added some comments
> - fixes initialization in bdrv_co_vmstate_save_task_entry as suggested
>
> Changes from v2:
> - code moved from QCOW2 level to generic block level
> - created bdrv_flush_vmstate helper to fix 022, 029 tests
> - added recursive for bs->file in bdrv_co_flush_vmstate (fix 267)
> - fixed blk_save_vmstate helper
> - fixed coroutine wait as Vladimir suggested with waiting fixes from me
>
> Changes from v1:
> - patchew warning fixed
> - fixed validation that only 1 waiter is allowed in patch 1
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <fam@euphon.net>
> CC: Juan Quintela <quintela@redhat.com>
> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
>
>
ping


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 0/6] block: seriously improve savevm/loadvm performance
  2020-08-20  7:42 ` [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
@ 2020-08-27 12:41   ` Denis V. Lunev
  0 siblings, 0 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-08-27 12:41 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi

On 8/20/20 10:42 AM, Denis V. Lunev wrote:
> On 7/9/20 4:26 PM, Denis V. Lunev wrote:
>> This series do standard basic things:
>> - it creates intermediate buffer for all writes from QEMU migration code
>>   to QCOW2 image,
>> - this buffer is sent to disk asynchronously, allowing several writes to
>>   run in parallel.
>>
>> In general, migration code is fantastically inefficent (by observation),
>> buffers are not aligned and sent with arbitrary pieces, a lot of time
>> less than 100 bytes at a chunk, which results in read-modify-write
>> operations with non-cached operations. It should also be noted that all
>> operations are performed into unallocated image blocks, which also suffer
>> due to partial writes to such new clusters.
>>
>> This patch series is an implementation of idea discussed in the RFC
>> posted by Denis Plotnikov
>> https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
>> Results with this series over NVME are better than original code
>>                 original     rfc    this
>> cached:          1.79s      2.38s   1.27s
>> non-cached:      3.29s      1.31s   0.81s
>>
>> Changes from v7:
>> - dropped lock from LoadVMState
>> - fixed assert in last patch
>> - dropped patch 1 as queued
>>
>> Changes from v6:
>> - blk_load_vmstate kludges added (patchew problem fixed)
>>
>> Changes from v5:
>> - loadvm optimizations added with Vladimir comments included
>>
>> Changes from v4:
>> - added patch 4 with blk_save_vmstate() cleanup
>> - added R-By
>> - bdrv_flush_vmstate -> bdrv_finalize_vmstate
>> - fixed return code of bdrv_co_do_save_vmstate
>> - fixed typos in comments (Eric, thanks!)
>> - fixed patchew warnings
>>
>> Changes from v3:
>> - rebased to master
>> - added patch 3 which removes aio_task_pool_wait_one()
>> - added R-By to patch 1
>> - patch 4 is rewritten via bdrv_run_co
>> - error path in blk_save_vmstate() is rewritten to call bdrv_flush_vmstate
>>   unconditionally
>> - added some comments
>> - fixes initialization in bdrv_co_vmstate_save_task_entry as suggested
>>
>> Changes from v2:
>> - code moved from QCOW2 level to generic block level
>> - created bdrv_flush_vmstate helper to fix 022, 029 tests
>> - added recursive for bs->file in bdrv_co_flush_vmstate (fix 267)
>> - fixed blk_save_vmstate helper
>> - fixed coroutine wait as Vladimir suggested with waiting fixes from me
>>
>> Changes from v1:
>> - patchew warning fixed
>> - fixed validation that only 1 waiter is allowed in patch 1
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <fam@euphon.net>
>> CC: Juan Quintela <quintela@redhat.com>
>> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
>>
>>
> ping
ping V2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/6] block, migration: add bdrv_finalize_vmstate helper
  2020-07-09 13:26 ` [PATCH 4/6] block, migration: add bdrv_finalize_vmstate helper Denis V. Lunev
@ 2020-08-27 12:58   ` Daniel P. Berrangé
  2020-08-27 13:02     ` Denis V. Lunev
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel P. Berrangé @ 2020-08-27 12:58 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, qemu-block, Juan Quintela,
	Dr. David Alan Gilbert, qemu-devel, Denis Plotnikov,
	Stefan Hajnoczi, Max Reitz

On Thu, Jul 09, 2020 at 04:26:42PM +0300, Denis V. Lunev wrote:
> Right now bdrv_fclose() is just calling bdrv_flush().
> 
> The problem is that migration code is working inefficiently from block
> layer terms and are frequently called for very small pieces of
> unaligned data. Block layer is capable to work this way, but this is very
> slow.
> 
> This patch is a preparation for the introduction of the intermediate
> buffer at block driver state. It would be beneficial to separate
> conventional bdrv_flush() from closing QEMU file from migration code.
> 
> The patch also forces bdrv_finalize_vmstate() operation inside
> synchronous blk_save_vmstate() operation. This helper is used from
> qemu-io only.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <fam@euphon.net>
> CC: Juan Quintela <quintela@redhat.com>
> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
>  block/block-backend.c |  6 +++++-
>  block/io.c            | 15 +++++++++++++++
>  include/block/block.h |  5 +++++
>  migration/savevm.c    |  4 ++++
>  4 files changed, 29 insertions(+), 1 deletion(-)

> diff --git a/migration/savevm.c b/migration/savevm.c
> index 45c9dd9d8a..d8a94e312c 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -150,6 +150,10 @@ static ssize_t block_get_buffer(void *opaque, uint8_t *buf, int64_t pos,
>  
>  static int bdrv_fclose(void *opaque, Error **errp)
>  {
> +    int err = bdrv_finalize_vmstate(opaque);
> +    if (err < 0) {
> +        return err;

This is returning an error without having populated 'errp', which means
the caller will be missing the error diagnosis

> +    }
>      return bdrv_flush(opaque);
>  }

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/6] block, migration: add bdrv_finalize_vmstate helper
  2020-08-27 12:58   ` Daniel P. Berrangé
@ 2020-08-27 13:02     ` Denis V. Lunev
  2020-08-27 13:06       ` Daniel P. Berrangé
  0 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-08-27 13:02 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Kevin Wolf, Fam Zheng, qemu-block, Juan Quintela,
	Dr. David Alan Gilbert, qemu-devel, Denis Plotnikov,
	Stefan Hajnoczi, Max Reitz

On 8/27/20 3:58 PM, Daniel P. Berrangé wrote:
> On Thu, Jul 09, 2020 at 04:26:42PM +0300, Denis V. Lunev wrote:
>> Right now bdrv_fclose() is just calling bdrv_flush().
>>
>> The problem is that migration code is working inefficiently from block
>> layer terms and are frequently called for very small pieces of
>> unaligned data. Block layer is capable to work this way, but this is very
>> slow.
>>
>> This patch is a preparation for the introduction of the intermediate
>> buffer at block driver state. It would be beneficial to separate
>> conventional bdrv_flush() from closing QEMU file from migration code.
>>
>> The patch also forces bdrv_finalize_vmstate() operation inside
>> synchronous blk_save_vmstate() operation. This helper is used from
>> qemu-io only.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <fam@euphon.net>
>> CC: Juan Quintela <quintela@redhat.com>
>> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
>> ---
>>  block/block-backend.c |  6 +++++-
>>  block/io.c            | 15 +++++++++++++++
>>  include/block/block.h |  5 +++++
>>  migration/savevm.c    |  4 ++++
>>  4 files changed, 29 insertions(+), 1 deletion(-)
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 45c9dd9d8a..d8a94e312c 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -150,6 +150,10 @@ static ssize_t block_get_buffer(void *opaque, uint8_t *buf, int64_t pos,
>>  
>>  static int bdrv_fclose(void *opaque, Error **errp)
>>  {
>> +    int err = bdrv_finalize_vmstate(opaque);
>> +    if (err < 0) {
>> +        return err;
> This is returning an error without having populating 'errp' which means
> the caller will be missing error diagnosis

but this behaves exactly like the branch below:
bdrv_flush() can return an error too, and errp
is not filled in that case either.

Den


>> +    }
>>      return bdrv_flush(opaque);
>>  }
> Regards,
> Daniel



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/6] block, migration: add bdrv_finalize_vmstate helper
  2020-08-27 13:02     ` Denis V. Lunev
@ 2020-08-27 13:06       ` Daniel P. Berrangé
  2020-08-28  6:13         ` Markus Armbruster
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel P. Berrangé @ 2020-08-27 13:06 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, qemu-block, Juan Quintela,
	Dr. David Alan Gilbert, qemu-devel, Denis Plotnikov,
	Stefan Hajnoczi, Max Reitz

On Thu, Aug 27, 2020 at 04:02:38PM +0300, Denis V. Lunev wrote:
> On 8/27/20 3:58 PM, Daniel P. Berrangé wrote:
> > On Thu, Jul 09, 2020 at 04:26:42PM +0300, Denis V. Lunev wrote:
> >> Right now bdrv_fclose() is just calling bdrv_flush().
> >>
> >> The problem is that migration code is working inefficiently from block
> >> layer terms and are frequently called for very small pieces of
> >> unaligned data. Block layer is capable to work this way, but this is very
> >> slow.
> >>
> >> This patch is a preparation for the introduction of the intermediate
> >> buffer at block driver state. It would be beneficial to separate
> >> conventional bdrv_flush() from closing QEMU file from migration code.
> >>
> >> The patch also forces bdrv_finalize_vmstate() operation inside
> >> synchronous blk_save_vmstate() operation. This helper is used from
> >> qemu-io only.
> >>
> >> Signed-off-by: Denis V. Lunev <den@openvz.org>
> >> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >> CC: Kevin Wolf <kwolf@redhat.com>
> >> CC: Max Reitz <mreitz@redhat.com>
> >> CC: Stefan Hajnoczi <stefanha@redhat.com>
> >> CC: Fam Zheng <fam@euphon.net>
> >> CC: Juan Quintela <quintela@redhat.com>
> >> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
> >> ---
> >>  block/block-backend.c |  6 +++++-
> >>  block/io.c            | 15 +++++++++++++++
> >>  include/block/block.h |  5 +++++
> >>  migration/savevm.c    |  4 ++++
> >>  4 files changed, 29 insertions(+), 1 deletion(-)
> >> diff --git a/migration/savevm.c b/migration/savevm.c
> >> index 45c9dd9d8a..d8a94e312c 100644
> >> --- a/migration/savevm.c
> >> +++ b/migration/savevm.c
> >> @@ -150,6 +150,10 @@ static ssize_t block_get_buffer(void *opaque, uint8_t *buf, int64_t pos,
> >>  
> >>  static int bdrv_fclose(void *opaque, Error **errp)
> >>  {
> >> +    int err = bdrv_finalize_vmstate(opaque);
> >> +    if (err < 0) {
> >> +        return err;
> > This is returning an error without having populating 'errp' which means
> > the caller will be missing error diagnosis
> 
> but this behaves exactly like the branch below,
> bdrv_flush() could return error too and errp
> is not filled in the same way.

Doh, it seems the only caller passes NULL for the errp too,
so it is a redundant parameter. So nothing wrong with your
patch after all.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/6] block, migration: add bdrv_finalize_vmstate helper
  2020-08-27 13:06       ` Daniel P. Berrangé
@ 2020-08-28  6:13         ` Markus Armbruster
  0 siblings, 0 replies; 15+ messages in thread
From: Markus Armbruster @ 2020-08-28  6:13 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Kevin Wolf, Fam Zheng, qemu-block, Juan Quintela,
	Dr. David Alan Gilbert, qemu-devel, Denis Plotnikov,
	Stefan Hajnoczi, Denis V. Lunev, Max Reitz

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Thu, Aug 27, 2020 at 04:02:38PM +0300, Denis V. Lunev wrote:
>> On 8/27/20 3:58 PM, Daniel P. Berrangé wrote:
>> > On Thu, Jul 09, 2020 at 04:26:42PM +0300, Denis V. Lunev wrote:
>> >> Right now bdrv_fclose() is just calling bdrv_flush().
>> >>
>> >> The problem is that migration code is working inefficiently from block
>> >> layer terms and are frequently called for very small pieces of
>> >> unaligned data. Block layer is capable to work this way, but this is very
>> >> slow.
>> >>
>> >> This patch is a preparation for the introduction of the intermediate
>> >> buffer at block driver state. It would be beneficial to separate
>> >> conventional bdrv_flush() from closing QEMU file from migration code.
>> >>
>> >> The patch also forces bdrv_finalize_vmstate() operation inside
>> >> synchronous blk_save_vmstate() operation. This helper is used from
>> >> qemu-io only.
>> >>
>> >> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> >> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> >> CC: Kevin Wolf <kwolf@redhat.com>
>> >> CC: Max Reitz <mreitz@redhat.com>
>> >> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> >> CC: Fam Zheng <fam@euphon.net>
>> >> CC: Juan Quintela <quintela@redhat.com>
>> >> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> >> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
>> >> ---
>> >>  block/block-backend.c |  6 +++++-
>> >>  block/io.c            | 15 +++++++++++++++
>> >>  include/block/block.h |  5 +++++
>> >>  migration/savevm.c    |  4 ++++
>> >>  4 files changed, 29 insertions(+), 1 deletion(-)
>> >> diff --git a/migration/savevm.c b/migration/savevm.c
>> >> index 45c9dd9d8a..d8a94e312c 100644
>> >> --- a/migration/savevm.c
>> >> +++ b/migration/savevm.c
>> >> @@ -150,6 +150,10 @@ static ssize_t block_get_buffer(void *opaque, uint8_t *buf, int64_t pos,
>> >>  
>> >>  static int bdrv_fclose(void *opaque, Error **errp)
>> >>  {
>> >> +    int err = bdrv_finalize_vmstate(opaque);
>> >> +    if (err < 0) {
>> >> +        return err;
>> > This is returning an error without having populating 'errp' which means
>> > the caller will be missing error diagnosis
>> 
>> but this behaves exactly like the branch below,
>> bdrv_flush() could return error too and errp
>> is not filled in the same way.
>
> Doh, it seems the only caller passes NULL for the errp too,
> so it is a redundant parameter. So nothing wrong with your
> patch after all.

Not setting an error on failure is plainly wrong.

If it works because all callers pass NULL, then the obvious fix is to
drop the @errp parameter.

I agree it's not this patch's fault.  It needs fixing anyway.  If you
have to respin for some other reason, including a fix would be nice.
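For completeness, a sketch of the follow-up both reviewers point at (an
assumption about a possible respin, not a posted patch): either drop the
unused @errp parameter, or populate it via error_setg_errno() from
qapi/error.h. Keeping errp, it could look like:

static int bdrv_fclose(void *opaque, Error **errp)
{
    int err = bdrv_finalize_vmstate(opaque);
    if (err < 0) {
        error_setg_errno(errp, -err, "Failed to finalize VM state");
        return err;
    }

    err = bdrv_flush(opaque);
    if (err < 0) {
        error_setg_errno(errp, -err, "Failed to flush VM state to disk");
    }
    return err;
}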



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 0/6] block: seriously improve savevm/loadvm performance
  2020-07-09 13:26 [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
                   ` (6 preceding siblings ...)
  2020-08-20  7:42 ` [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
@ 2020-09-14 13:13 ` Stefan Hajnoczi
  7 siblings, 0 replies; 15+ messages in thread
From: Stefan Hajnoczi @ 2020-09-14 13:13 UTC (permalink / raw)
  To: Kevin Wolf, Max Reitz
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	Juan Quintela, qemu-devel, Max Reitz, Denis Plotnikov,
	Denis V. Lunev, Dr. David Alan Gilbert


On Thu, Jul 09, 2020 at 04:26:38PM +0300, Denis V. Lunev wrote:
> This series do standard basic things:
> - it creates intermediate buffer for all writes from QEMU migration code
>   to QCOW2 image,
> - this buffer is sent to disk asynchronously, allowing several writes to
>   run in parallel.
> 
> In general, migration code is fantastically inefficent (by observation),
> buffers are not aligned and sent with arbitrary pieces, a lot of time
> less than 100 bytes at a chunk, which results in read-modify-write
> operations with non-cached operations. It should also be noted that all
> operations are performed into unallocated image blocks, which also suffer
> due to partial writes to such new clusters.
> 
> This patch series is an implementation of idea discussed in the RFC
> posted by Denis Plotnikov
> https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
> Results with this series over NVME are better than original code
>                 original     rfc    this
> cached:          1.79s      2.38s   1.27s
> non-cached:      3.29s      1.31s   0.81s

Kevin and Max: Is this going through one of your trees?

Stefan


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread

Thread overview: 15+ messages
2020-07-09 13:26 [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
2020-07-09 13:26 ` [PATCH 1/6] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
2020-07-09 13:26 ` [PATCH 2/6] block/aio_task: drop aio_task_pool_wait_one() helper Denis V. Lunev
2020-07-09 13:26 ` [PATCH 3/6] block/block-backend: remove always true check from blk_save_vmstate Denis V. Lunev
2020-07-09 13:26 ` [PATCH 4/6] block, migration: add bdrv_finalize_vmstate helper Denis V. Lunev
2020-08-27 12:58   ` Daniel P. Berrangé
2020-08-27 13:02     ` Denis V. Lunev
2020-08-27 13:06       ` Daniel P. Berrangé
2020-08-28  6:13         ` Markus Armbruster
2020-07-09 13:26 ` [PATCH 5/6] block/io: improve savevm performance Denis V. Lunev
2020-07-09 13:26 ` [PATCH 6/6] block/io: improve loadvm performance Denis V. Lunev
2020-07-10 14:40   ` Vladimir Sementsov-Ogievskiy
2020-08-20  7:42 ` [PATCH v8 0/6] block: seriously improve savevm/loadvm performance Denis V. Lunev
2020-08-27 12:41   ` Denis V. Lunev
2020-09-14 13:13 ` Stefan Hajnoczi
