* [PATCH v4 0/4] block: seriously improve savevm performance
@ 2020-06-16 16:20 Denis V. Lunev
  2020-06-16 16:20 ` [PATCH 1/5] migration/savevm: respect qemu_fclose() error code in save_snapshot() Denis V. Lunev
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-16 16:20 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V . Lunev

This series does a couple of standard, basic things:
- it creates an intermediate buffer for all writes from the QEMU migration
  code to the QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, the migration code is fantastically inefficient (by observation):
buffers are not aligned and are sent in arbitrary pieces, much of the time
less than 100 bytes at a chunk, which results in read-modify-write
operations when the image is accessed without caching. It should also be
noted that all writes go into unallocated image blocks, which additionally
suffer from partial writes to such new clusters.
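For illustration only, here is a tiny standalone sketch of the buffering idea
(not code from this series; the wbuf type and helpers are made up). Tiny
chunks are accumulated in one large buffer and the file only ever sees
full-buffer writes; patch 5 does the same inside the block layer and, in
addition, pushes the filled buffers through an AioTaskPool so several of
them are in flight at once:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define WBUF_SIZE (1024 * 1024)             /* 1 MiB, as in patch 5 */

struct wbuf {
    int fd;
    off_t pos;                              /* file offset of buf[0] */
    size_t bytes;                           /* valid bytes in buf */
    unsigned char *buf;
};

static int wbuf_flush(struct wbuf *w)
{
    if (w->bytes &&
        pwrite(w->fd, w->buf, w->bytes, w->pos) != (ssize_t)w->bytes) {
        return -1;
    }
    w->pos += w->bytes;
    w->bytes = 0;
    return 0;
}

/* Append a small chunk; touch the disk only when the buffer fills up. */
static int wbuf_write(struct wbuf *w, const void *data, size_t len)
{
    while (len > 0) {
        size_t n = WBUF_SIZE - w->bytes;

        if (n > len) {
            n = len;
        }
        memcpy(w->buf + w->bytes, data, n);
        w->bytes += n;
        data = (const char *)data + n;
        len -= n;
        if (w->bytes == WBUF_SIZE && wbuf_flush(w) < 0) {
            return -1;
        }
    }
    return 0;
}

int main(void)
{
    struct wbuf w = { .buf = malloc(WBUF_SIZE) };
    char chunk[100] = "vmstate bytes";      /* typical tiny migration chunk */
    int i;

    w.fd = open("vmstate.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (w.fd < 0 || w.buf == NULL) {
        return 1;
    }
    for (i = 0; i < 100000; i++) {
        if (wbuf_write(&w, chunk, sizeof(chunk)) < 0) {
            return 1;
        }
    }
    if (wbuf_flush(&w) < 0) {               /* commit the partially filled tail */
        return 1;
    }
    close(w.fd);
    free(w.buf);
    return 0;
}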

This patch series is an implementation of the idea discussed in the RFC
posted by Denis Plotnikov:
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
Results with this series over NVMe are better than with the original code:
                original     rfc    this
cached:          1.79s      2.38s   1.27s
non-cached:      3.29s      1.31s   0.81s

Changes from v3:
- rebased to master
- added patch 3 which removes aio_task_pool_wait_one()
- added R-By to patch 1
- patch 4 is rewritten via bdrv_run_co
- error path in blk_save_vmstate() is rewritten to call bdrv_flush_vmstate
  unconditionally
- added some comments
- fixed initialization in bdrv_co_vmstate_save_task_entry as suggested

Changes from v2:
- code moved from QCOW2 level to generic block level
- created bdrv_flush_vmstate helper to fix 022, 029 tests
- added recursion into bs->file in bdrv_co_flush_vmstate (fixes 267)
- fixed blk_save_vmstate helper
- fixed coroutine wait as Vladimir suggested with waiting fixes from me

Changes from v1:
- patchew warning fixed
- fixed validation that only 1 waiter is allowed in patch 1

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/5] migration/savevm: respect qemu_fclose() error code in save_snapshot()
  2020-06-16 16:20 [PATCH v4 0/4] block: seriously improve savevm performance Denis V. Lunev
@ 2020-06-16 16:20 ` Denis V. Lunev
  2020-06-16 16:20 ` [PATCH 2/5] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-16 16:20 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Max Reitz, Denis Plotnikov,
	Stefan Hajnoczi, Denis V. Lunev

qemu_fclose() can return an error, e.g. if bdrv_co_flush() returns an
error.

This validation will become more important once we start waiting for the
asynchronous I/O operations started from bdrv_write_vmstate(), which is
coming soon.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 migration/savevm.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index b979ea6e7f..da3dead4e9 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2628,7 +2628,7 @@ int save_snapshot(const char *name, Error **errp)
 {
     BlockDriverState *bs, *bs1;
     QEMUSnapshotInfo sn1, *sn = &sn1, old_sn1, *old_sn = &old_sn1;
-    int ret = -1;
+    int ret = -1, ret2;
     QEMUFile *f;
     int saved_vm_running;
     uint64_t vm_state_size;
@@ -2712,10 +2712,14 @@ int save_snapshot(const char *name, Error **errp)
     }
     ret = qemu_savevm_state(f, errp);
     vm_state_size = qemu_ftell(f);
-    qemu_fclose(f);
+    ret2 = qemu_fclose(f);
     if (ret < 0) {
         goto the_end;
     }
+    if (ret2 < 0) {
+        ret = ret2;
+        goto the_end;
+    }
 
     /* The bdrv_all_create_snapshot() call that follows acquires the AioContext
      * for itself.  BDRV_POLL_WHILE() does not support nested locking because
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/5] block/aio_task: allow start/wait task from any coroutine
  2020-06-16 16:20 [PATCH v4 0/4] block: seriously improve savevm performance Denis V. Lunev
  2020-06-16 16:20 ` [PATCH 1/5] migration/savevm: respect qemu_fclose() error code in save_snapshot() Denis V. Lunev
@ 2020-06-16 16:20 ` Denis V. Lunev
  2020-06-18  9:41   ` Vladimir Sementsov-Ogievskiy
  2020-06-16 16:20 ` [PATCH 3/5] block/aio_task: drop aio_task_pool_wait_one() helper Denis V. Lunev
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-16 16:20 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V . Lunev

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Currently, aio task pool assumes that there is a main coroutine, which
creates tasks and wait for them. Let's remove the restriction by using
CoQueue. Code becomes clearer, interface more obvious.
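A usage sketch of what becomes possible (the MyTask type and feed_pool()
are made up; only the aio_task_* calls are the real API):

#include "qemu/osdep.h"
#include "block/aio_task.h"

typedef struct MyTask {
    AioTask task;
    int seq;
} MyTask;

static int coroutine_fn my_task_entry(AioTask *task)
{
    MyTask *t = container_of(task, MyTask, task);

    /* ... do the actual I/O for chunk t->seq here ... */
    (void)t;
    return 0;       /* the pool g_free()s the task once it has finished */
}

/* May now be run from any coroutine, not only the pool's creator. */
static void coroutine_fn feed_pool(AioTaskPool *pool, int first, int count)
{
    int i;

    for (i = 0; i < count; i++) {
        MyTask *t = g_new0(MyTask, 1);

        t->task.func = my_task_entry;
        t->seq = first + i;
        aio_task_pool_start_task(pool, &t->task);
    }
    aio_task_pool_wait_all(pool);   /* all waiters share the CoQueue now */
}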

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/aio_task.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index 88989fa248..cf62e5c58b 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -27,11 +27,10 @@
 #include "block/aio_task.h"
 
 struct AioTaskPool {
-    Coroutine *main_co;
     int status;
     int max_busy_tasks;
     int busy_tasks;
-    bool waiting;
+    CoQueue waiters;
 };
 
 static void coroutine_fn aio_task_co(void *opaque)
@@ -52,31 +51,23 @@ static void coroutine_fn aio_task_co(void *opaque)
 
     g_free(task);
 
-    if (pool->waiting) {
-        pool->waiting = false;
-        aio_co_wake(pool->main_co);
-    }
+    qemu_co_queue_restart_all(&pool->waiters);
 }
 
 void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
 {
     assert(pool->busy_tasks > 0);
-    assert(qemu_coroutine_self() == pool->main_co);
 
-    pool->waiting = true;
-    qemu_coroutine_yield();
+    qemu_co_queue_wait(&pool->waiters, NULL);
 
-    assert(!pool->waiting);
     assert(pool->busy_tasks < pool->max_busy_tasks);
 }
 
 void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool)
 {
-    if (pool->busy_tasks < pool->max_busy_tasks) {
-        return;
+    while (pool->busy_tasks >= pool->max_busy_tasks) {
+        aio_task_pool_wait_one(pool);
     }
-
-    aio_task_pool_wait_one(pool);
 }
 
 void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool)
@@ -98,8 +89,8 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int max_busy_tasks)
 {
     AioTaskPool *pool = g_new0(AioTaskPool, 1);
 
-    pool->main_co = qemu_coroutine_self();
     pool->max_busy_tasks = max_busy_tasks;
+    qemu_co_queue_init(&pool->waiters);
 
     return pool;
 }
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/5] block/aio_task: drop aio_task_pool_wait_one() helper
  2020-06-16 16:20 [PATCH v4 0/4] block: seriously improve savevm performance Denis V. Lunev
  2020-06-16 16:20 ` [PATCH 1/5] migration/savevm: respect qemu_fclose() error code in save_snapshot() Denis V. Lunev
  2020-06-16 16:20 ` [PATCH 2/5] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
@ 2020-06-16 16:20 ` Denis V. Lunev
  2020-06-18  9:41   ` Vladimir Sementsov-Ogievskiy
  2020-06-16 16:20 ` [PATCH 4/5] block, migration: add bdrv_flush_vmstate helper Denis V. Lunev
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-16 16:20 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V. Lunev

It is not used outside the module.

Actually there are 2 kind of waiters:
- for a slot and
- for all tasks to finish
This patch limits external API to listed types.
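The two remaining entry points map onto the two waiter kinds like this
(sketch with a hypothetical caller; only the aio_task_pool_* calls are
from the real API):

#include "qemu/osdep.h"
#include "block/aio_task.h"

static void coroutine_fn submit_all(AioTaskPool *pool,
                                    bool (*more_chunks)(void *opaque),
                                    AioTask *(*next_chunk)(void *opaque),
                                    void *opaque)
{
    while (more_chunks(opaque)) {
        /* Waiter for a slot: never keep more than max_busy_tasks
         * chunks prepared and queued at the same time. */
        aio_task_pool_wait_slot(pool);
        aio_task_pool_start_task(pool, next_chunk(opaque));
    }

    /* Waiter for all tasks to finish: drain everything that was started. */
    aio_task_pool_wait_all(pool);
}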

Signed-off-by: Denis V. Lunev <den@openvz.org>
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/aio_task.c         | 13 ++-----------
 include/block/aio_task.h |  1 -
 2 files changed, 2 insertions(+), 12 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index cf62e5c58b..7ba15ff41f 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -54,26 +54,17 @@ static void coroutine_fn aio_task_co(void *opaque)
     qemu_co_queue_restart_all(&pool->waiters);
 }
 
-void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
-{
-    assert(pool->busy_tasks > 0);
-
-    qemu_co_queue_wait(&pool->waiters, NULL);
-
-    assert(pool->busy_tasks < pool->max_busy_tasks);
-}
-
 void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool)
 {
     while (pool->busy_tasks >= pool->max_busy_tasks) {
-        aio_task_pool_wait_one(pool);
+        qemu_co_queue_wait(&pool->waiters, NULL);
     }
 }
 
 void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool)
 {
     while (pool->busy_tasks > 0) {
-        aio_task_pool_wait_one(pool);
+        qemu_co_queue_wait(&pool->waiters, NULL);
     }
 }
 
diff --git a/include/block/aio_task.h b/include/block/aio_task.h
index 50bc1e1817..50b1c036c5 100644
--- a/include/block/aio_task.h
+++ b/include/block/aio_task.h
@@ -48,7 +48,6 @@ bool aio_task_pool_empty(AioTaskPool *pool);
 void coroutine_fn aio_task_pool_start_task(AioTaskPool *pool, AioTask *task);
 
 void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool);
-void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool);
 void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool);
 
 #endif /* BLOCK_AIO_TASK_H */
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/5] block, migration: add bdrv_flush_vmstate helper
  2020-06-16 16:20 [PATCH v4 0/4] block: seriously improve savevm performance Denis V. Lunev
                   ` (2 preceding siblings ...)
  2020-06-16 16:20 ` [PATCH 3/5] block/aio_task: drop aio_task_pool_wait_one() helper Denis V. Lunev
@ 2020-06-16 16:20 ` Denis V. Lunev
  2020-06-16 21:11   ` Eric Blake
  2020-06-18 10:02   ` Vladimir Sementsov-Ogievskiy
  2020-06-16 16:20 ` [PATCH 5/5] block/io: improve savevm performance Denis V. Lunev
  2020-06-16 20:28 ` [PATCH v4 0/4] block: seriously " no-reply
  5 siblings, 2 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-16 16:20 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V. Lunev

Right now bdrv_fclose() is just calling bdrv_flush().

The problem is that migration code is working inefficently from black
layer terms and are frequently called for very small pieces of not
properly aligned data. Block layer is capable to work this way, but
this is very slow.

This patch is a preparation for the introduction of the intermediate
buffer at block driver state. It would be beneficial to separate
conventional bdrv_flush() from closing QEMU file from migration code.

The patch also forces bdrv_flush_vmstate() operation inside
synchronous blk_save_vmstate() operation. This helper is used from
qemu-io only.
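The resulting calling convention, sketched with a made-up caller (the real
users are migration's bdrv_fclose() and the blk_save_vmstate() helper used
by qemu-io, both in the hunks below): any sequence of bdrv_save_vmstate()
calls must be followed by a bdrv_flush_vmstate() on the same node, even on
the error path, so that pending state is always torn down.

#include "qemu/osdep.h"
#include "block/block.h"

static int save_vmstate_blob(BlockDriverState *bs,
                             const uint8_t *buf, int64_t pos, int size)
{
    int ret = bdrv_save_vmstate(bs, buf, pos, size);
    int ret2 = bdrv_flush_vmstate(bs);      /* run unconditionally */

    if (ret < 0) {
        return ret;
    }
    return ret2 < 0 ? ret2 : ret;
}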

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/block-backend.c |  6 +++++-
 block/io.c            | 15 +++++++++++++++
 include/block/block.h |  3 +++
 migration/savevm.c    |  3 +++
 4 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 6936b25c83..3afa0ff7d5 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2177,16 +2177,20 @@ int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
 int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
                      int64_t pos, int size)
 {
-    int ret;
+    int ret, ret2;
 
     if (!blk_is_available(blk)) {
         return -ENOMEDIUM;
     }
 
     ret = bdrv_save_vmstate(blk_bs(blk), buf, pos, size);
+    ret2 = bdrv_flush_vmstate(blk_bs(blk));
     if (ret < 0) {
         return ret;
     }
+    if (ret2 < 0) {
+        return ret2;
+    }
 
     if (ret == size && !blk->enable_write_cache) {
         ret = bdrv_flush(blk_bs(blk));
diff --git a/block/io.c b/block/io.c
index df8f2a98d4..8718df4ea8 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2724,6 +2724,21 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
     return bdrv_rw_vmstate(bs, qiov, pos, true);
 }
 
+static int coroutine_fn bdrv_co_flush_vmstate(BlockDriverState *bs)
+{
+    return 0;
+}
+
+static int coroutine_fn bdrv_flush_vmstate_co_entry(void *opaque)
+{
+    return bdrv_co_flush_vmstate(opaque);
+}
+
+int bdrv_flush_vmstate(BlockDriverState *bs)
+{
+    return bdrv_run_co(bs, bdrv_flush_vmstate_co_entry, bs);
+}
+
 /**************************************************************/
 /* async I/Os */
 
diff --git a/include/block/block.h b/include/block/block.h
index 25e299605e..e532bcffc8 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -572,6 +572,9 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
 
 int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
                       int64_t pos, int size);
+/* bdrv_flush_vmstate() is mandatory to commit vmstate changes if
+   bdrv_save_vmstate() was ever called */
+int bdrv_flush_vmstate(BlockDriverState *bs);
 
 void bdrv_img_create(const char *filename, const char *fmt,
                      const char *base_filename, const char *base_fmt,
diff --git a/migration/savevm.c b/migration/savevm.c
index da3dead4e9..d984ce7aa1 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -150,6 +150,9 @@ static ssize_t block_get_buffer(void *opaque, uint8_t *buf, int64_t pos,
 
 static int bdrv_fclose(void *opaque, Error **errp)
 {
+    int err = bdrv_flush_vmstate(opaque);
+    if (err < 0)
+        return err;
     return bdrv_flush(opaque);
 }
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/5] block/io: improve savevm performance
  2020-06-16 16:20 [PATCH v4 0/4] block: seriously improve savevm performance Denis V. Lunev
                   ` (3 preceding siblings ...)
  2020-06-16 16:20 ` [PATCH 4/5] block, migration: add bdrv_flush_vmstate helper Denis V. Lunev
@ 2020-06-16 16:20 ` Denis V. Lunev
  2020-06-18 10:56   ` Vladimir Sementsov-Ogievskiy
  2020-06-16 20:28 ` [PATCH v4 0/4] block: seriously " no-reply
  5 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-16 16:20 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi, Denis V. Lunev

This patch does 2 standard basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to block driver,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

Thus bdrv_vmstate_write() is becoming asynchronous. All pending operations
completion are performed in newly invented bdrv_flush_vmstate().

In general, migration code is fantastically inefficent (by observation),
buffers are not aligned and sent with arbitrary pieces, a lot of time
less than 100 bytes at a chunk, which results in read-modify-write
operations if target file descriptor is opened with O_DIRECT. It should
also be noted that all operations are performed into unallocated image
blocks, which also suffer due to partial writes to such new clusters
even on cached file descriptors.

Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
                original     fixed
cached:          1.79s       1.27s
non-cached:      3.29s       0.81s

The difference over HDD would be more significant :)

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Juan Quintela <quintela@redhat.com>
CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/io.c                | 123 +++++++++++++++++++++++++++++++++++++-
 include/block/block_int.h |   8 +++
 2 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index 8718df4ea8..1979098c02 100644
--- a/block/io.c
+++ b/block/io.c
@@ -26,6 +26,7 @@
 #include "trace.h"
 #include "sysemu/block-backend.h"
 #include "block/aio-wait.h"
+#include "block/aio_task.h"
 #include "block/blockjob.h"
 #include "block/blockjob_int.h"
 #include "block/block_int.h"
@@ -33,6 +34,7 @@
 #include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "qemu/units.h"
 #include "sysemu/replay.h"
 
 /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
@@ -2640,6 +2642,100 @@ typedef struct BdrvVmstateCo {
     bool                is_read;
 } BdrvVmstateCo;
 
+typedef struct BdrvVMStateTask {
+    AioTask task;
+
+    BlockDriverState *bs;
+    int64_t offset;
+    void *buf;
+    size_t bytes;
+} BdrvVMStateTask;
+
+typedef struct BdrvSaveVMState {
+    AioTaskPool *pool;
+    BdrvVMStateTask *t;
+} BdrvSaveVMState;
+
+
+static coroutine_fn int bdrv_co_vmstate_save_task_entry(AioTask *task)
+{
+    int err = 0;
+    BdrvVMStateTask *t = container_of(task, BdrvVMStateTask, task);
+
+    if (t->bytes != 0) {
+        QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, t->buf, t->bytes);
+
+        bdrv_inc_in_flight(t->bs);
+        err = t->bs->drv->bdrv_save_vmstate(t->bs, &qiov, t->offset);
+        bdrv_dec_in_flight(t->bs);
+    }
+
+    qemu_vfree(t->buf);
+    return err;
+}
+
+static BdrvVMStateTask *bdrv_vmstate_task_create(BlockDriverState *bs,
+                                                 int64_t pos, size_t size)
+{
+    BdrvVMStateTask *t = g_new(BdrvVMStateTask, 1);
+
+    *t = (BdrvVMStateTask) {
+        .task.func = bdrv_co_vmstate_save_task_entry,
+        .buf = qemu_blockalign(bs, size),
+        .offset = pos,
+        .bs = bs,
+    };
+
+    return t;
+}
+
+static int bdrv_co_do_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
+                                   int64_t pos)
+{
+    BdrvSaveVMState *state = bs->savevm_state;
+    BdrvVMStateTask *t;
+    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);
+    size_t to_copy, off;
+
+    if (state == NULL) {
+        state = g_new(BdrvSaveVMState, 1);
+        *state = (BdrvSaveVMState) {
+            .pool = aio_task_pool_new(BDRV_VMSTATE_WORKERS_MAX),
+            .t = bdrv_vmstate_task_create(bs, pos, buf_size),
+        };
+
+        bs->savevm_state = state;
+    }
+
+    if (aio_task_pool_status(state->pool) < 0) {
+        /* Caller is responsible for cleanup. We should block all further
+         * save operations for this exact state */
+        return aio_task_pool_status(state->pool);
+    }
+
+    t = state->t;
+    if (t->offset + t->bytes != pos) {
+        /* Normally this branch is not reachable from migration */
+        return bs->drv->bdrv_save_vmstate(bs, qiov, pos);
+    }
+
+    off = 0;
+    while (1) {
+        to_copy = MIN(qiov->size - off, buf_size - t->bytes);
+        qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
+        t->bytes += to_copy;
+        if (t->bytes < buf_size) {
+            return qiov->size;
+        }
+
+        aio_task_pool_start_task(state->pool, &t->task);
+
+        pos += to_copy;
+        off += to_copy;
+        state->t = t = bdrv_vmstate_task_create(bs, pos, buf_size);
+    }
+}
+
 static int coroutine_fn
 bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                    bool is_read)
@@ -2655,7 +2751,7 @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
         if (is_read) {
             ret = drv->bdrv_load_vmstate(bs, qiov, pos);
         } else {
-            ret = drv->bdrv_save_vmstate(bs, qiov, pos);
+            ret = bdrv_co_do_save_vmstate(bs, qiov, pos);
         }
     } else if (bs->file) {
         ret = bdrv_co_rw_vmstate(bs->file->bs, qiov, pos, is_read);
@@ -2726,7 +2822,30 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
 
 static int coroutine_fn bdrv_co_flush_vmstate(BlockDriverState *bs)
 {
-    return 0;
+    int err;
+    BdrvSaveVMState *state = bs->savevm_state;
+
+    if (bs->drv->bdrv_save_vmstate == NULL && bs->file != NULL) {
+        return bdrv_co_flush_vmstate(bs->file->bs);
+    }
+    if (state == NULL) {
+        return 0;
+    }
+
+    if (aio_task_pool_status(state->pool) >= 0) {
+        /* We are on success path, commit last chunk if possible */
+        aio_task_pool_start_task(state->pool, &state->t->task);
+    }
+
+    aio_task_pool_wait_all(state->pool);
+    err = aio_task_pool_status(state->pool);
+
+    aio_task_pool_free(state->pool);
+    g_free(state);
+
+    bs->savevm_state = NULL;
+
+    return err;
 }
 
 static int coroutine_fn bdrv_flush_vmstate_co_entry(void *opaque)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 791de6a59c..f90f0e8b6a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -61,6 +61,8 @@
 
 #define BLOCK_PROBE_BUF_SIZE        512
 
+#define BDRV_VMSTATE_WORKERS_MAX    8
+
 enum BdrvTrackedRequestType {
     BDRV_TRACKED_READ,
     BDRV_TRACKED_WRITE,
@@ -784,6 +786,9 @@ struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+
+typedef struct BdrvSaveVMState BdrvSaveVMState;
+
 /*
  * Note: the function bdrv_append() copies and swaps contents of
  * BlockDriverStates, so if you add new fields to this struct, please
@@ -947,6 +952,9 @@ struct BlockDriverState {
 
     /* BdrvChild links to this node may never be frozen */
     bool never_freeze;
+
+    /* Intermediate buffer for VM state saving from snapshot creation code */
+    BdrvSaveVMState *savevm_state;
 };
 
 struct BlockBackendRootState {
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/4] block: seriously improve savevm performance
  2020-06-16 16:20 [PATCH v4 0/4] block: seriously improve savevm performance Denis V. Lunev
                   ` (4 preceding siblings ...)
  2020-06-16 16:20 ` [PATCH 5/5] block/io: improve savevm performance Denis V. Lunev
@ 2020-06-16 20:28 ` no-reply
  5 siblings, 0 replies; 15+ messages in thread
From: no-reply @ 2020-06-16 20:28 UTC (permalink / raw)
  To: den
  Cc: kwolf, fam, vsementsov, qemu-block, quintela, qemu-devel,
	dgilbert, dplotnikov, stefanha, den, mreitz

Patchew URL: https://patchew.org/QEMU/20200616162035.29857-1-den@openvz.org/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Subject: [PATCH v4 0/4] block: seriously improve savevm performance
Type: series
Message-id: 20200616162035.29857-1-den@openvz.org

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
551826e block/io: improve savevm performance
81a93fd block, migration: add bdrv_flush_vmstate helper
05e3395 block/aio_task: drop aio_task_pool_wait_one() helper
61e0ce3 block/aio_task: allow start/wait task from any coroutine
d04a716 migration/savevm: respect qemu_fclose() error code in save_snapshot()

=== OUTPUT BEGIN ===
1/5 Checking commit d04a7160d2c0 (migration/savevm: respect qemu_fclose() error code in save_snapshot())
2/5 Checking commit 61e0ce32d320 (block/aio_task: allow start/wait task from any coroutine)
3/5 Checking commit 05e339500af8 (block/aio_task: drop aio_task_pool_wait_one() helper)
4/5 Checking commit 81a93fdf74a9 (block, migration: add bdrv_flush_vmstate helper)
WARNING: Block comments use a leading /* on a separate line
#93: FILE: include/block/block.h:575:
+/* bdrv_flush_vmstate() is mandatory to commit vmstate changes if

WARNING: Block comments use * on subsequent lines
#94: FILE: include/block/block.h:576:
+/* bdrv_flush_vmstate() is mandatory to commit vmstate changes if
+   bdrv_save_vmstate() was ever called */

WARNING: Block comments use a trailing */ on a separate line
#94: FILE: include/block/block.h:576:
+   bdrv_save_vmstate() was ever called */

ERROR: braces {} are necessary for all arms of this statement
#108: FILE: migration/savevm.c:154:
+    if (err < 0)
[...]

total: 1 errors, 3 warnings, 60 lines checked

Patch 4/5 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

5/5 Checking commit 551826e4ae66 (block/io: improve savevm performance)
WARNING: Block comments use a leading /* on a separate line
#132: FILE: block/io.c:2711:
+        /* Caller is responsible for cleanup. We should block all further

WARNING: Block comments use a trailing */ on a separate line
#133: FILE: block/io.c:2712:
+         * save operations for this exact state */

total: 0 errors, 2 warnings, 179 lines checked

Patch 5/5 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20200616162035.29857-1-den@openvz.org/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] block, migration: add bdrv_flush_vmstate helper
  2020-06-16 16:20 ` [PATCH 4/5] block, migration: add bdrv_flush_vmstate helper Denis V. Lunev
@ 2020-06-16 21:11   ` Eric Blake
  2020-06-16 21:29     ` Denis V. Lunev
  2020-06-18 10:02   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Blake @ 2020-06-16 21:11 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi

On 6/16/20 11:20 AM, Denis V. Lunev wrote:
> Right now bdrv_fclose() is just calling bdrv_flush().
> 
> The problem is that migration code is working inefficently from black

inefficiently, block

> layer terms and are frequently called for very small pieces of not
> properly aligned data. Block layer is capable to work this way, but

s/not properly aligned/unaligned/

> this is very slow.
> 
> This patch is a preparation for the introduction of the intermediate
> buffer at block driver state. It would be beneficial to separate
> conventional bdrv_flush() from closing QEMU file from migration code.
> 
> The patch also forces bdrv_flush_vmstate() operation inside
> synchronous blk_save_vmstate() operation. This helper is used from
> qemu-io only.
> 

> +++ b/block/block-backend.c
> @@ -2177,16 +2177,20 @@ int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
>   int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
>                        int64_t pos, int size)
>   {
> -    int ret;
> +    int ret, ret2;
>   
>       if (!blk_is_available(blk)) {
>           return -ENOMEDIUM;
>       }
>   
>       ret = bdrv_save_vmstate(blk_bs(blk), buf, pos, size);
> +    ret2 = bdrv_flush_vmstate(blk_bs(blk));

Do you really want to be attempting bdrv_flush_vmstate() even after 
bdrv_save_vmstate() failed?  Better might be...

>       if (ret < 0) {
>           return ret;
>       }

...attempting it here, at which point it looks like the only reason you 
need ret2 is to preserve ret long enough...

> +    if (ret2 < 0) {
> +        return ret2;
> +    }
>   
>       if (ret == size && !blk->enable_write_cache) {

...for this check.  But a quick look at bdrv_save_vmstate() says this 
check is dead: the function can only return negative error or exactly 
size, and we already filtered out negative error above.
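Rendered as code, the reordering suggested above would look roughly like
this (just a sketch of the suggestion with a made-up function name, not a
posted patch; Den explains below why he prefers to keep the flush
unconditional):

#include "qemu/osdep.h"
#include "sysemu/block-backend.h"
#include "block/block.h"

static int save_then_flush(BlockBackend *blk, const uint8_t *buf,
                           int64_t pos, int size)
{
    int ret = bdrv_save_vmstate(blk_bs(blk), buf, pos, size);

    if (ret < 0) {
        return ret;
    }
    /* Only reached on success, so one return-code variable is enough. */
    return bdrv_flush_vmstate(blk_bs(blk));
}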


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] block, migration: add bdrv_flush_vmstate helper
  2020-06-16 21:11   ` Eric Blake
@ 2020-06-16 21:29     ` Denis V. Lunev
  2020-06-18 10:03       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-16 21:29 UTC (permalink / raw)
  To: Eric Blake, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Juan Quintela, Dr. David Alan Gilbert, Max Reitz,
	Denis Plotnikov, Stefan Hajnoczi

On 6/17/20 12:11 AM, Eric Blake wrote:
> On 6/16/20 11:20 AM, Denis V. Lunev wrote:
>> Right now bdrv_fclose() is just calling bdrv_flush().
>>
>> The problem is that migration code is working inefficently from black
>
> inefficiently, block
>
>> layer terms and are frequently called for very small pieces of not
>> properly aligned data. Block layer is capable to work this way, but
>
> s/not properly aligned/unaligned/
>
>> this is very slow.
>>
>> This patch is a preparation for the introduction of the intermediate
>> buffer at block driver state. It would be beneficial to separate
>> conventional bdrv_flush() from closing QEMU file from migration code.
>>
>> The patch also forces bdrv_flush_vmstate() operation inside
>> synchronous blk_save_vmstate() operation. This helper is used from
>> qemu-io only.
>>
>
>> +++ b/block/block-backend.c
>> @@ -2177,16 +2177,20 @@ int blk_truncate(BlockBackend *blk, int64_t
>> offset, bool exact,
>>   int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
>>                        int64_t pos, int size)
>>   {
>> -    int ret;
>> +    int ret, ret2;
>>         if (!blk_is_available(blk)) {
>>           return -ENOMEDIUM;
>>       }
>>         ret = bdrv_save_vmstate(blk_bs(blk), buf, pos, size);
>> +    ret2 = bdrv_flush_vmstate(blk_bs(blk));
>
> Do you really want to be attempting bdrv_flush_vmstate() even after
> bdrv_save_vmstate() failed?  Better might be...
>
>>       if (ret < 0) {
>>           return ret;
>>       }
>
> ...attempting it here, at which point it looks like the only reason
> you need ret2 is to preserve ret long enough...
no, I would like to be sure that intermediate state is always cleared at
the end.
In the next patch I am going to put intermediate buffer to
BlockDriverState and thus it has to be removed in any case.

May be flush would be a bad name... clean or finalize?

>
>> +    if (ret2 < 0) {
>> +        return ret2;
>> +    }
>>         if (ret == size && !blk->enable_write_cache) {
>
> ...for this check.  But a quick look at bdrv_save_vmstate() says this
> check is dead: the function can only return negative error or exactly
> size, and we already filtered out negative error above.
>
>
Hmm, you are right, good point: this check is dead.
Worth removing :) I'll add this as a separate patch.

Den


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/5] block/aio_task: allow start/wait task from any coroutine
  2020-06-16 16:20 ` [PATCH 2/5] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
@ 2020-06-18  9:41   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-18  9:41 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

16.06.2020 19:20, Denis V. Lunev wrote:
> From: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
> 
> Currently, aio task pool assumes that there is a main coroutine, which
> creates tasks and wait for them. Let's remove the restriction by using
> CoQueue. Code becomes clearer, interface more obvious.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
> Signed-off-by: Denis V. Lunev<den@openvz.org>

if it is applicable:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] block/aio_task: drop aio_task_pool_wait_one() helper
  2020-06-16 16:20 ` [PATCH 3/5] block/aio_task: drop aio_task_pool_wait_one() helper Denis V. Lunev
@ 2020-06-18  9:41   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-18  9:41 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

16.06.2020 19:20, Denis V. Lunev wrote:
> It is not used outside the module.
> 
> Actually there are 2 kind of waiters:
> - for a slot and
> - for all tasks to finish
> This patch limits external API to listed types.
> 
> Signed-off-by: Denis V. Lunev<den@openvz.org>
> Suggested-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] block, migration: add bdrv_flush_vmstate helper
  2020-06-16 16:20 ` [PATCH 4/5] block, migration: add bdrv_flush_vmstate helper Denis V. Lunev
  2020-06-16 21:11   ` Eric Blake
@ 2020-06-18 10:02   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-18 10:02 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

16.06.2020 19:20, Denis V. Lunev wrote:
> Right now bdrv_fclose() is just calling bdrv_flush().
> 
> The problem is that migration code is working inefficently from black
> layer terms and are frequently called for very small pieces of not
> properly aligned data. Block layer is capable to work this way, but
> this is very slow.
> 
> This patch is a preparation for the introduction of the intermediate
> buffer at block driver state. It would be beneficial to separate
> conventional bdrv_flush() from closing QEMU file from migration code.
> 
> The patch also forces bdrv_flush_vmstate() operation inside
> synchronous blk_save_vmstate() operation. This helper is used from
> qemu-io only.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <fam@euphon.net>
> CC: Juan Quintela <quintela@redhat.com>
> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
>   block/block-backend.c |  6 +++++-
>   block/io.c            | 15 +++++++++++++++
>   include/block/block.h |  3 +++
>   migration/savevm.c    |  3 +++
>   4 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 6936b25c83..3afa0ff7d5 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -2177,16 +2177,20 @@ int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
>   int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
>                        int64_t pos, int size)
>   {
> -    int ret;
> +    int ret, ret2;
>   
>       if (!blk_is_available(blk)) {
>           return -ENOMEDIUM;
>       }
>   
>       ret = bdrv_save_vmstate(blk_bs(blk), buf, pos, size);
> +    ret2 = bdrv_flush_vmstate(blk_bs(blk));
>       if (ret < 0) {
>           return ret;
>       }
> +    if (ret2 < 0) {
> +        return ret2;
> +    }
>   
>       if (ret == size && !blk->enable_write_cache) {
>           ret = bdrv_flush(blk_bs(blk));
> diff --git a/block/io.c b/block/io.c
> index df8f2a98d4..8718df4ea8 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2724,6 +2724,21 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
>       return bdrv_rw_vmstate(bs, qiov, pos, true);
>   }
>   
> +static int coroutine_fn bdrv_co_flush_vmstate(BlockDriverState *bs)
> +{
> +    return 0;
> +}
> +
> +static int coroutine_fn bdrv_flush_vmstate_co_entry(void *opaque)
> +{
> +    return bdrv_co_flush_vmstate(opaque);
> +}
> +
> +int bdrv_flush_vmstate(BlockDriverState *bs)
> +{
> +    return bdrv_run_co(bs, bdrv_flush_vmstate_co_entry, bs);
> +}
> +
>   /**************************************************************/
>   /* async I/Os */
>   
> diff --git a/include/block/block.h b/include/block/block.h
> index 25e299605e..e532bcffc8 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -572,6 +572,9 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
>   
>   int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
>                         int64_t pos, int size);
> +/* bdrv_flush_vmstate() is mandatory to commit vmstate changes if
> +   bdrv_save_vmstate() was ever called */
> +int bdrv_flush_vmstate(BlockDriverState *bs);
>   
>   void bdrv_img_create(const char *filename, const char *fmt,
>                        const char *base_filename, const char *base_fmt,
> diff --git a/migration/savevm.c b/migration/savevm.c
> index da3dead4e9..d984ce7aa1 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -150,6 +150,9 @@ static ssize_t block_get_buffer(void *opaque, uint8_t *buf, int64_t pos,
>   
>   static int bdrv_fclose(void *opaque, Error **errp)
>   {
> +    int err = bdrv_flush_vmstate(opaque);
> +    if (err < 0)
> +        return err;

Qemu style wants braces anyway..

>       return bdrv_flush(opaque);
>   }
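The brace-compliant form of the hunk quoted above would simply be (a sketch
of the obvious style fixup, not a posted change):

static int bdrv_fclose(void *opaque, Error **errp)
{
    int err = bdrv_flush_vmstate(opaque);

    if (err < 0) {
        return err;
    }
    return bdrv_flush(opaque);
}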

I'm not sure that we want this on the read path, nor why we already do a normal flush on the read path. Anyway, it should be a no-op on the read path and won't hurt.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] block, migration: add bdrv_flush_vmstate helper
  2020-06-16 21:29     ` Denis V. Lunev
@ 2020-06-18 10:03       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-18 10:03 UTC (permalink / raw)
  To: Denis V. Lunev, Eric Blake, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

17.06.2020 00:29, Denis V. Lunev wrote:
>>>        if (ret < 0) {
>>>            return ret;
>>>        }
>> ...attempting it here, at which point it looks like the only reason
>> you need ret2 is to preserve ret long enough...
> no, I would like to be sure that intermediate state is always cleared at
> the end.
> In the next patch I am going to put intermediate buffer to
> BlockDriverState and thus it has to be removed in any case.
> 
> May be flush would be a bad name... clean or finalize?
> 

finalize sounds good for me

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/5] block/io: improve savevm performance
  2020-06-16 16:20 ` [PATCH 5/5] block/io: improve savevm performance Denis V. Lunev
@ 2020-06-18 10:56   ` Vladimir Sementsov-Ogievskiy
  2020-06-18 11:07     ` Denis V. Lunev
  0 siblings, 1 reply; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-18 10:56 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

16.06.2020 19:20, Denis V. Lunev wrote:
> This patch does 2 standard basic things:
> - it creates intermediate buffer for all writes from QEMU migration code
>    to block driver,
> - this buffer is sent to disk asynchronously, allowing several writes to
>    run in parallel.
> 
> Thus bdrv_vmstate_write() is becoming asynchronous. All pending operations
> completion are performed in newly invented bdrv_flush_vmstate().
> 
> In general, migration code is fantastically inefficent (by observation),
> buffers are not aligned and sent with arbitrary pieces, a lot of time
> less than 100 bytes at a chunk, which results in read-modify-write
> operations if target file descriptor is opened with O_DIRECT. It should
> also be noted that all operations are performed into unallocated image
> blocks, which also suffer due to partial writes to such new clusters
> even on cached file descriptors.
> 
> Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
>                  original     fixed
> cached:          1.79s       1.27s
> non-cached:      3.29s       0.81s
> 
> The difference over HDD would be more significant :)
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <fam@euphon.net>
> CC: Juan Quintela <quintela@redhat.com>
> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
>   block/io.c                | 123 +++++++++++++++++++++++++++++++++++++-
>   include/block/block_int.h |   8 +++
>   2 files changed, 129 insertions(+), 2 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 8718df4ea8..1979098c02 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -26,6 +26,7 @@
>   #include "trace.h"
>   #include "sysemu/block-backend.h"
>   #include "block/aio-wait.h"
> +#include "block/aio_task.h"
>   #include "block/blockjob.h"
>   #include "block/blockjob_int.h"
>   #include "block/block_int.h"
> @@ -33,6 +34,7 @@
>   #include "qapi/error.h"
>   #include "qemu/error-report.h"
>   #include "qemu/main-loop.h"
> +#include "qemu/units.h"
>   #include "sysemu/replay.h"
>   
>   /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
> @@ -2640,6 +2642,100 @@ typedef struct BdrvVmstateCo {
>       bool                is_read;
>   } BdrvVmstateCo;
>   
> +typedef struct BdrvVMStateTask {
> +    AioTask task;
> +
> +    BlockDriverState *bs;
> +    int64_t offset;
> +    void *buf;
> +    size_t bytes;
> +} BdrvVMStateTask;
> +
> +typedef struct BdrvSaveVMState {
> +    AioTaskPool *pool;
> +    BdrvVMStateTask *t;
> +} BdrvSaveVMState;
> +
> +
> +static coroutine_fn int bdrv_co_vmstate_save_task_entry(AioTask *task)
> +{
> +    int err = 0;
> +    BdrvVMStateTask *t = container_of(task, BdrvVMStateTask, task);
> +
> +    if (t->bytes != 0) {
> +        QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, t->buf, t->bytes);
> +
> +        bdrv_inc_in_flight(t->bs);
> +        err = t->bs->drv->bdrv_save_vmstate(t->bs, &qiov, t->offset);
> +        bdrv_dec_in_flight(t->bs);
> +    }
> +
> +    qemu_vfree(t->buf);
> +    return err;
> +}
> +
> +static BdrvVMStateTask *bdrv_vmstate_task_create(BlockDriverState *bs,
> +                                                 int64_t pos, size_t size)
> +{
> +    BdrvVMStateTask *t = g_new(BdrvVMStateTask, 1);
> +
> +    *t = (BdrvVMStateTask) {
> +        .task.func = bdrv_co_vmstate_save_task_entry,
> +        .buf = qemu_blockalign(bs, size),
> +        .offset = pos,
> +        .bs = bs,
> +    };
> +
> +    return t;
> +}
> +
> +static int bdrv_co_do_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
> +                                   int64_t pos)
> +{
> +    BdrvSaveVMState *state = bs->savevm_state;
> +    BdrvVMStateTask *t;
> +    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);
> +    size_t to_copy, off;
> +
> +    if (state == NULL) {
> +        state = g_new(BdrvSaveVMState, 1);
> +        *state = (BdrvSaveVMState) {
> +            .pool = aio_task_pool_new(BDRV_VMSTATE_WORKERS_MAX),
> +            .t = bdrv_vmstate_task_create(bs, pos, buf_size),
> +        };
> +
> +        bs->savevm_state = state;
> +    }
> +
> +    if (aio_task_pool_status(state->pool) < 0) {
> +        /* Caller is responsible for cleanup. We should block all further
> +         * save operations for this exact state */
> +        return aio_task_pool_status(state->pool);
> +    }
> +
> +    t = state->t;
> +    if (t->offset + t->bytes != pos) {
> +        /* Normally this branch is not reachable from migration */
> +        return bs->drv->bdrv_save_vmstate(bs, qiov, pos);
> +    }
> +
> +    off = 0;
> +    while (1) {

"while (aio_task_pool_status(state->pool) == 0)" + "return aio_task_pool_status(state->pool)" after loop will improve interactivity on failure path, but shouldn't be necessary.

> +        to_copy = MIN(qiov->size - off, buf_size - t->bytes);
> +        qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
> +        t->bytes += to_copy;
> +        if (t->bytes < buf_size) {
> +            return qiov->size;

Intersting: we are substituting .bdrv_save_vmstate by this function. There are two existing now:

qcow2_save_vmstate, which doesn't care to return qiov->size and sd_save_vmstate which does it.

So, it seems safe to return qiov->size now, but I'm sure that it's actually unused and should be
refactored to return 0 on success everywhere.

> +        }
> +
> +        aio_task_pool_start_task(state->pool, &t->task);
> +
> +        pos += to_copy;
> +        off += to_copy;
> +        state->t = t = bdrv_vmstate_task_create(bs, pos, buf_size);
> +    }
> +}
> +
>   static int coroutine_fn
>   bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
>                      bool is_read)
> @@ -2655,7 +2751,7 @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
>           if (is_read) {
>               ret = drv->bdrv_load_vmstate(bs, qiov, pos);
>           } else {
> -            ret = drv->bdrv_save_vmstate(bs, qiov, pos);
> +            ret = bdrv_co_do_save_vmstate(bs, qiov, pos);
>           }
>       } else if (bs->file) {
>           ret = bdrv_co_rw_vmstate(bs->file->bs, qiov, pos, is_read);
> @@ -2726,7 +2822,30 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
>   
>   static int coroutine_fn bdrv_co_flush_vmstate(BlockDriverState *bs)
>   {
> -    return 0;
> +    int err;

nitpicking: you use "err" to store return codes, but everywhere in this file (and mostly in block/*) "ret" is used for it.

> +    BdrvSaveVMState *state = bs->savevm_state;
> +
> +    if (bs->drv->bdrv_save_vmstate == NULL && bs->file != NULL) {
> +        return bdrv_co_flush_vmstate(bs->file->bs);
> +    }
> +    if (state == NULL) {
> +        return 0;
> +    }
> +
> +    if (aio_task_pool_status(state->pool) >= 0) {
> +        /* We are on success path, commit last chunk if possible */
> +        aio_task_pool_start_task(state->pool, &state->t->task);
> +    }
> +
> +    aio_task_pool_wait_all(state->pool);
> +    err = aio_task_pool_status(state->pool);
> +
> +    aio_task_pool_free(state->pool);
> +    g_free(state);
> +
> +    bs->savevm_state = NULL;
> +
> +    return err;
>   }
>   
>   static int coroutine_fn bdrv_flush_vmstate_co_entry(void *opaque)
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 791de6a59c..f90f0e8b6a 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -61,6 +61,8 @@
>   
>   #define BLOCK_PROBE_BUF_SIZE        512
>   
> +#define BDRV_VMSTATE_WORKERS_MAX    8
> +
>   enum BdrvTrackedRequestType {
>       BDRV_TRACKED_READ,
>       BDRV_TRACKED_WRITE,
> @@ -784,6 +786,9 @@ struct BdrvChild {
>       QLIST_ENTRY(BdrvChild) next_parent;
>   };
>   
> +
> +typedef struct BdrvSaveVMState BdrvSaveVMState;
> +
>   /*
>    * Note: the function bdrv_append() copies and swaps contents of
>    * BlockDriverStates, so if you add new fields to this struct, please
> @@ -947,6 +952,9 @@ struct BlockDriverState {
>   
>       /* BdrvChild links to this node may never be frozen */
>       bool never_freeze;
> +
> +    /* Intermediate buffer for VM state saving from snapshot creation code */
> +    BdrvSaveVMState *savevm_state;
>   };
>   
>   struct BlockBackendRootState {
> 

Minor improvements and further refactoring may be done separately; I'm OK with it as is:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/5] block/io: improve savevm performance
  2020-06-18 10:56   ` Vladimir Sementsov-Ogievskiy
@ 2020-06-18 11:07     ` Denis V. Lunev
  0 siblings, 0 replies; 15+ messages in thread
From: Denis V. Lunev @ 2020-06-18 11:07 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Juan Quintela, Dr. David Alan Gilbert,
	Max Reitz, Denis Plotnikov, Stefan Hajnoczi

On 6/18/20 1:56 PM, Vladimir Sementsov-Ogievskiy wrote:
> 16.06.2020 19:20, Denis V. Lunev wrote:
>> This patch does 2 standard basic things:
>> - it creates intermediate buffer for all writes from QEMU migration code
>>    to block driver,
>> - this buffer is sent to disk asynchronously, allowing several writes to
>>    run in parallel.
>>
>> Thus bdrv_vmstate_write() is becoming asynchronous. All pending
>> operations
>> completion are performed in newly invented bdrv_flush_vmstate().
>>
>> In general, migration code is fantastically inefficent (by observation),
>> buffers are not aligned and sent with arbitrary pieces, a lot of time
>> less than 100 bytes at a chunk, which results in read-modify-write
>> operations if target file descriptor is opened with O_DIRECT. It should
>> also be noted that all operations are performed into unallocated image
>> blocks, which also suffer due to partial writes to such new clusters
>> even on cached file descriptors.
>>
>> Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
>>                  original     fixed
>> cached:          1.79s       1.27s
>> non-cached:      3.29s       0.81s
>>
>> The difference over HDD would be more significant :)
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <fam@euphon.net>
>> CC: Juan Quintela <quintela@redhat.com>
>> CC: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
>> ---
>>   block/io.c                | 123 +++++++++++++++++++++++++++++++++++++-
>>   include/block/block_int.h |   8 +++
>>   2 files changed, 129 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/io.c b/block/io.c
>> index 8718df4ea8..1979098c02 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -26,6 +26,7 @@
>>   #include "trace.h"
>>   #include "sysemu/block-backend.h"
>>   #include "block/aio-wait.h"
>> +#include "block/aio_task.h"
>>   #include "block/blockjob.h"
>>   #include "block/blockjob_int.h"
>>   #include "block/block_int.h"
>> @@ -33,6 +34,7 @@
>>   #include "qapi/error.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/main-loop.h"
>> +#include "qemu/units.h"
>>   #include "sysemu/replay.h"
>>     /* Maximum bounce buffer for copy-on-read and write zeroes, in
>> bytes */
>> @@ -2640,6 +2642,100 @@ typedef struct BdrvVmstateCo {
>>       bool                is_read;
>>   } BdrvVmstateCo;
>>   +typedef struct BdrvVMStateTask {
>> +    AioTask task;
>> +
>> +    BlockDriverState *bs;
>> +    int64_t offset;
>> +    void *buf;
>> +    size_t bytes;
>> +} BdrvVMStateTask;
>> +
>> +typedef struct BdrvSaveVMState {
>> +    AioTaskPool *pool;
>> +    BdrvVMStateTask *t;
>> +} BdrvSaveVMState;
>> +
>> +
>> +static coroutine_fn int bdrv_co_vmstate_save_task_entry(AioTask *task)
>> +{
>> +    int err = 0;
>> +    BdrvVMStateTask *t = container_of(task, BdrvVMStateTask, task);
>> +
>> +    if (t->bytes != 0) {
>> +        QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, t->buf,
>> t->bytes);
>> +
>> +        bdrv_inc_in_flight(t->bs);
>> +        err = t->bs->drv->bdrv_save_vmstate(t->bs, &qiov, t->offset);
>> +        bdrv_dec_in_flight(t->bs);
>> +    }
>> +
>> +    qemu_vfree(t->buf);
>> +    return err;
>> +}
>> +
>> +static BdrvVMStateTask *bdrv_vmstate_task_create(BlockDriverState *bs,
>> +                                                 int64_t pos, size_t
>> size)
>> +{
>> +    BdrvVMStateTask *t = g_new(BdrvVMStateTask, 1);
>> +
>> +    *t = (BdrvVMStateTask) {
>> +        .task.func = bdrv_co_vmstate_save_task_entry,
>> +        .buf = qemu_blockalign(bs, size),
>> +        .offset = pos,
>> +        .bs = bs,
>> +    };
>> +
>> +    return t;
>> +}
>> +
>> +static int bdrv_co_do_save_vmstate(BlockDriverState *bs,
>> QEMUIOVector *qiov,
>> +                                   int64_t pos)
>> +{
>> +    BdrvSaveVMState *state = bs->savevm_state;
>> +    BdrvVMStateTask *t;
>> +    size_t buf_size = MAX(bdrv_get_cluster_size(bs), 1 * MiB);
>> +    size_t to_copy, off;
>> +
>> +    if (state == NULL) {
>> +        state = g_new(BdrvSaveVMState, 1);
>> +        *state = (BdrvSaveVMState) {
>> +            .pool = aio_task_pool_new(BDRV_VMSTATE_WORKERS_MAX),
>> +            .t = bdrv_vmstate_task_create(bs, pos, buf_size),
>> +        };
>> +
>> +        bs->savevm_state = state;
>> +    }
>> +
>> +    if (aio_task_pool_status(state->pool) < 0) {
>> +        /* Caller is responsible for cleanup. We should block all
>> further
>> +         * save operations for this exact state */
>> +        return aio_task_pool_status(state->pool);
>> +    }
>> +
>> +    t = state->t;
>> +    if (t->offset + t->bytes != pos) {
>> +        /* Normally this branch is not reachable from migration */
>> +        return bs->drv->bdrv_save_vmstate(bs, qiov, pos);
>> +    }
>> +
>> +    off = 0;
>> +    while (1) {
>
> "while (aio_task_pool_status(state->pool) == 0)" + "return
> aio_task_pool_status(state->pool)" after loop will improve
> interactivity on failure path, but shouldn't be necessary.
>
>> +        to_copy = MIN(qiov->size - off, buf_size - t->bytes);
>> +        qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
>> +        t->bytes += to_copy;
>> +        if (t->bytes < buf_size) {
>> +            return qiov->size;
>
> Intersting: we are substituting .bdrv_save_vmstate by this function.
> There are two existing now:
>
> qcow2_save_vmstate, which doesn't care to return qiov->size and
> sd_save_vmstate which does it.
>
> So, it seems safe to return qiov->size now, but I'm sure that it's
> actually unused and should be
> refactored to return 0 on success everywhere.

This looks like my mistake. I have spent some time on the error codes
while working through Eric's suggestions and believe that 0 should be
returned now.

Will fix in next revision.
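For reference, the minimal form of that change in bdrv_co_do_save_vmstate()
(a sketch; any caller still comparing the result against the size would need
the matching cleanup Vladimir mentions):

        if (t->bytes < buf_size) {
            return 0;   /* was: return qiov->size; */
        }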

Den


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-06-18 11:08 UTC | newest]

Thread overview: 15+ messages
2020-06-16 16:20 [PATCH v4 0/4] block: seriously improve savevm performance Denis V. Lunev
2020-06-16 16:20 ` [PATCH 1/5] migration/savevm: respect qemu_fclose() error code in save_snapshot() Denis V. Lunev
2020-06-16 16:20 ` [PATCH 2/5] block/aio_task: allow start/wait task from any coroutine Denis V. Lunev
2020-06-18  9:41   ` Vladimir Sementsov-Ogievskiy
2020-06-16 16:20 ` [PATCH 3/5] block/aio_task: drop aio_task_pool_wait_one() helper Denis V. Lunev
2020-06-18  9:41   ` Vladimir Sementsov-Ogievskiy
2020-06-16 16:20 ` [PATCH 4/5] block, migration: add bdrv_flush_vmstate helper Denis V. Lunev
2020-06-16 21:11   ` Eric Blake
2020-06-16 21:29     ` Denis V. Lunev
2020-06-18 10:03       ` Vladimir Sementsov-Ogievskiy
2020-06-18 10:02   ` Vladimir Sementsov-Ogievskiy
2020-06-16 16:20 ` [PATCH 5/5] block/io: improve savevm performance Denis V. Lunev
2020-06-18 10:56   ` Vladimir Sementsov-Ogievskiy
2020-06-18 11:07     ` Denis V. Lunev
2020-06-16 20:28 ` [PATCH v4 0/4] block: seriously " no-reply
