* [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
@ 2023-03-30 18:03 Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 01/26] migration: Add support for 'file:' uri for source migration Fabiano Rosas
                   ` (27 more replies)
  0 siblings, 28 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

Hi folks,

I'm continuing the work done last year to add a new migration stream
format that can be used to migrate large guests to a single file in a
performant way.

This is an early RFC with the previous code + my additions to support
multifd and direct IO. Let me know what you think!

Here are the reference links for previous discussions:

https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html

The series has 4 main parts:

1) File migration: A new "file:" migration URI. So "file:mig" does the
   same as "exec:cat > mig". Patches 1-4 implement this;

2) Fixed-ram format: A new format for the migration stream. Puts guest
   pages at their relative offsets in the migration file. This bounds
   the file size even in the worst case of RAM utilization, because
   every page has a fixed offset in the migration file, and
   (potentially) saves us time because we can write pages
   independently in parallel. It also gives alignment guarantees so we
   can use O_DIRECT. Patches 5-13 implement this;

With patches 1-13, these two features can be used with:

(qemu) migrate_set_capability fixed-ram on
(qemu) migrate[_incoming] file:mig

--> new in this series:

3) MultiFD support: This is about making use of the parallelism
   allowed by the new format. We just need the threading and page
   queuing infrastructure that is already in place for
   multifd. Patches 14-24 implement this;

(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_parameter multifd-channels 4
(qemu) migrate_set_parameter max-bandwidth 0
(qemu) migrate[_incoming] file:mig

4) Add a new "direct-io" parameter and enable O_DIRECT for the
   properly aligned segments of the migration (mostly ram). Patch 25
   implements this;

(qemu) migrate_set_parameter direct-io on

Thanks! Some data below:
=====

Outgoing migration to file. NVMe disk. XFS filesystem.

- Single migration runs of a stopped 32G guest with ~90% RAM usage. Guest
  running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
  10m -v`:

migration type  | MB/s | pages/s |  ms
----------------+------+---------+------
savevm io_uring |  434 |  102294 | 71473
file:           | 3017 |  855862 | 10301
fixed-ram       | 1982 |  330686 | 15637
----------------+------+---------+------
fixed-ram + multifd + O_DIRECT
         2 ch.  | 5565 | 1500882 |  5576
         4 ch.  | 5735 | 1991549 |  5412
         8 ch.  | 5650 | 1769650 |  5489
        16 ch.  | 6071 | 1832407 |  5114
        32 ch.  | 6147 | 1809588 |  5050
        64 ch.  | 6344 | 1841728 |  4895
       128 ch.  | 6120 | 1915669 |  5085
----------------+------+---------+------

- Average of 10 migration runs of guestperf.py --mem 32 --cpus 4:

migration type | #ch. | MB/s | ms
---------------+------+------+-----
fixed-ram +    |    2 | 4132 | 8388
multifd        |    4 | 4273 | 8082
               |    8 | 4094 | 8441
               |   16 | 4204 | 8217
               |   32 | 4048 | 8528
               |   64 | 3861 | 8946
               |  128 | 3777 | 9147
---------------+------+------+-----
fixed-ram +    |    2 | 6031 | 5754
multifd +      |    4 | 6377 | 5421
O_DIRECT       |    8 | 6386 | 5416
               |   16 | 6321 | 5466
               |   32 | 5911 | 5321
               |   64 | 6375 | 5433
               |  128 | 6400 | 5412
---------------+------+------+-----

Fabiano Rosas (13):
  migration: Add completion tracepoint
  migration/multifd: Remove direct "socket" references
  migration/multifd: Allow multifd without packets
  migration/multifd: Add outgoing QIOChannelFile support
  migration/multifd: Add incoming QIOChannelFile support
  migration/multifd: Add pages to the receiving side
  io: Add a pwritev/preadv version that takes a discontiguous iovec
  migration/ram: Add a wrapper for fixed-ram shadow bitmap
  migration/multifd: Support outgoing fixed-ram stream format
  migration/multifd: Support incoming fixed-ram stream format
  tests/qtest: Add a multifd + fixed-ram migration test
  migration: Add direct-io parameter
  tests/migration/guestperf: Add file, fixed-ram and direct-io support

Nikolay Borisov (13):
  migration: Add support for 'file:' uri for source migration
  migration: Add support for 'file:' uri for incoming migration
  tests/qtest: migration: Add migrate_incoming_qmp helper
  tests/qtest: migration-test: Add tests for file-based migration
  migration: Initial support of fixed-ram feature for
    analyze-migration.py
  io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file
  io: Add generic pwritev/preadv interface
  io: implement io_pwritev/preadv for QIOChannelFile
  migration/qemu-file: add utility methods for working with seekable
    channels
  migration/ram: Introduce 'fixed-ram' migration stream capability
  migration: Refactor precopy ram loading code
  migration: Add support for 'fixed-ram' migration restore
  tests/qtest: migration-test: Add tests for fixed-ram file-based
    migration

 docs/devel/migration.rst              |  38 +++
 include/exec/ramblock.h               |   8 +
 include/io/channel-file.h             |   1 +
 include/io/channel.h                  | 133 ++++++++++
 include/migration/qemu-file-types.h   |   2 +
 include/qemu/osdep.h                  |   2 +
 io/channel-file.c                     |  60 +++++
 io/channel.c                          | 140 +++++++++++
 migration/file.c                      | 130 ++++++++++
 migration/file.h                      |  14 ++
 migration/meson.build                 |   1 +
 migration/migration-hmp-cmds.c        |   9 +
 migration/migration.c                 | 108 +++++++-
 migration/migration.h                 |  11 +-
 migration/multifd.c                   | 327 ++++++++++++++++++++----
 migration/multifd.h                   |  13 +
 migration/qemu-file.c                 |  80 ++++++
 migration/qemu-file.h                 |   4 +
 migration/ram.c                       | 349 ++++++++++++++++++++------
 migration/ram.h                       |   1 +
 migration/savevm.c                    |  23 +-
 migration/trace-events                |   1 +
 qapi/migration.json                   |  19 +-
 scripts/analyze-migration.py          |  51 +++-
 tests/migration/guestperf/engine.py   |  38 ++-
 tests/migration/guestperf/scenario.py |  14 +-
 tests/migration/guestperf/shell.py    |  18 +-
 tests/qtest/migration-helpers.c       |  19 ++
 tests/qtest/migration-helpers.h       |   4 +
 tests/qtest/migration-test.c          |  73 ++++++
 util/osdep.c                          |   9 +
 31 files changed, 1546 insertions(+), 154 deletions(-)
 create mode 100644 migration/file.c
 create mode 100644 migration/file.h

-- 
2.35.3




* [RFC PATCH v1 01/26] migration: Add support for 'file:' uri for source migration
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 02/26] migration: Add support for 'file:' uri for incoming migration Fabiano Rosas
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

Implement support for a "file:" uri so that a migration can be initiated
directly to a file from QEMU.

Unlike other migration protocol backends, the 'file' protocol cannot
honour non-blocking mode. POSIX file/block storage will always report
ready to read/write, regardless of how slow the underlying storage
will be at servicing the request.

For outgoing migration this limitation is not a serious problem as
the migration data transfer always happens in a dedicated thread.
It may, however, result in delays in honouring a request to cancel
the migration operation.
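
As a usage sketch (the file path here is illustrative), an outgoing
migration to a file can be started from the HMP monitor or via QMP:

    (qemu) migrate file:/path/to/vm.mig

    { "execute": "migrate",
      "arguments": { "uri": "file:/path/to/vm.mig" } }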

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/file.c      | 21 +++++++++++++++++++++
 migration/file.h      |  9 +++++++++
 migration/meson.build |  1 +
 migration/migration.c |  3 +++
 4 files changed, 34 insertions(+)
 create mode 100644 migration/file.c
 create mode 100644 migration/file.h

diff --git a/migration/file.c b/migration/file.c
new file mode 100644
index 0000000000..36d6178c75
--- /dev/null
+++ b/migration/file.c
@@ -0,0 +1,21 @@
+#include "qemu/osdep.h"
+#include "channel.h"
+#include "io/channel-file.h"
+#include "file.h"
+#include "qemu/error-report.h"
+
+
+void file_start_outgoing_migration(MigrationState *s, const char *fname, Error **errp)
+{
+    QIOChannelFile *ioc;
+
+    ioc = qio_channel_file_new_path(fname, O_CREAT | O_TRUNC | O_WRONLY, 0660, errp);
+    if (!ioc) {
+        error_report("Error creating a channel");
+        return;
+    }
+
+    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-file-outgoing");
+    migration_channel_connect(s, QIO_CHANNEL(ioc), NULL, NULL);
+    object_unref(OBJECT(ioc));
+}
diff --git a/migration/file.h b/migration/file.h
new file mode 100644
index 0000000000..d476eb1157
--- /dev/null
+++ b/migration/file.h
@@ -0,0 +1,9 @@
+#ifndef QEMU_MIGRATION_FILE_H
+#define QEMU_MIGRATION_FILE_H
+
+void file_start_outgoing_migration(MigrationState *s,
+                                   const char *filename,
+                                   Error **errp);
+
+#endif
+
diff --git a/migration/meson.build b/migration/meson.build
index 0d1bb9f96e..6c02298c70 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -17,6 +17,7 @@ softmmu_ss.add(files(
   'colo.c',
   'exec.c',
   'fd.c',
+  'file.c',
   'global_state.c',
   'migration-hmp-cmds.c',
   'migration.c',
diff --git a/migration/migration.c b/migration/migration.c
index ae2025d9d8..58ff0cb7c7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -20,6 +20,7 @@
 #include "migration/blocker.h"
 #include "exec.h"
 #include "fd.h"
+#include "file.h"
 #include "socket.h"
 #include "sysemu/runstate.h"
 #include "sysemu/sysemu.h"
@@ -2523,6 +2524,8 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
         exec_start_outgoing_migration(s, p, &local_err);
     } else if (strstart(uri, "fd:", &p)) {
         fd_start_outgoing_migration(s, p, &local_err);
+    } else if (strstart(uri, "file:", &p)) {
+        file_start_outgoing_migration(s, p, &local_err);
     } else {
         if (!(has_resume && resume)) {
             yank_unregister_instance(MIGRATION_YANK_INSTANCE);
-- 
2.35.3




* [RFC PATCH v1 02/26] migration: Add support for 'file:' uri for incoming migration
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 01/26] migration: Add support for 'file:' uri for source migration Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 03/26] tests/qtest: migration: Add migrate_incoming_qmp helper Fabiano Rosas
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

This is a counterpart to the 'file:' uri support for source migration;
now a file can also serve as the source of an incoming migration.

Unlike other migration protocol backends, the 'file' protocol cannot
honour non-blocking mode. POSIX file/block storage will always report
ready to read/write, regardless of how slow the underlying storage
will be at servicing the request.

For incoming migration this limitation may result in the main event
loop not being fully responsive while loading the VM state. This
won't impact the VM since it is not running at this phase, however,
it may impact management applications.
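
As a usage sketch (the file path is illustrative): start the
destination with "-incoming defer" and, once the source has finished
writing the file, issue:

    (qemu) migrate_incoming file:/path/to/vm.mig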

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 docs/devel/migration.rst |  2 ++
 migration/file.c         | 15 +++++++++++++++
 migration/file.h         |  1 +
 migration/migration.c    |  2 ++
 4 files changed, 20 insertions(+)

diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index 6f65c23b47..1080211f8e 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -39,6 +39,8 @@ over any transport.
 - exec migration: do the migration using the stdin/stdout through a process.
 - fd migration: do the migration using a file descriptor that is
   passed to QEMU.  QEMU doesn't care how this file descriptor is opened.
+- file migration: do the migration using a file that is passed by name
+  to QEMU.
 
 In addition, support is included for migration using RDMA, which
 transports the page data using ``RDMA``, where the hardware takes care of
diff --git a/migration/file.c b/migration/file.c
index 36d6178c75..ab4e12926c 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -19,3 +19,18 @@ void file_start_outgoing_migration(MigrationState *s, const char *fname, Error *
     migration_channel_connect(s, QIO_CHANNEL(ioc), NULL, NULL);
     object_unref(OBJECT(ioc));
 }
+
+void file_start_incoming_migration(const char *fname, Error **errp)
+{
+    QIOChannelFile *ioc;
+
+    ioc = qio_channel_file_new_path(fname, O_RDONLY, 0, errp);
+    if (!ioc) {
+        error_report("Error creating a channel");
+        return;
+    }
+
+    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-file-incoming");
+    migration_channel_process_incoming(QIO_CHANNEL(ioc));
+    object_unref(OBJECT(ioc));
+}
diff --git a/migration/file.h b/migration/file.h
index d476eb1157..cdbd291322 100644
--- a/migration/file.h
+++ b/migration/file.h
@@ -5,5 +5,6 @@ void file_start_outgoing_migration(MigrationState *s,
                                    const char *filename,
                                    Error **errp);
 
+void file_start_incoming_migration(const char *fname, Error **errp);
 #endif
 
diff --git a/migration/migration.c b/migration/migration.c
index 58ff0cb7c7..5408d87453 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -527,6 +527,8 @@ static void qemu_start_incoming_migration(const char *uri, Error **errp)
         exec_start_incoming_migration(p, errp);
     } else if (strstart(uri, "fd:", &p)) {
         fd_start_incoming_migration(p, errp);
+    } else if (strstart(uri, "file:", &p)) {
+        file_start_incoming_migration(p, errp);
     } else {
         error_setg(errp, "unknown migration protocol: %s", uri);
     }
-- 
2.35.3




* [RFC PATCH v1 03/26] tests/qtest: migration: Add migrate_incoming_qmp helper
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 01/26] migration: Add support for 'file:' uri for source migration Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 02/26] migration: Add support for 'file:' uri for incoming migration Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 04/26] tests/qtest: migration-test: Add tests for file-based migration Fabiano Rosas
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov, Thomas Huth, Laurent Vivier,
	Paolo Bonzini

From: Nikolay Borisov <nborisov@suse.com>

File-based migration requires the target to initiate its migration after
the source has finished writing out the data in the file. Currently
there's no easy way to initiate 'migrate-incoming'; allow this by
introducing a migrate_incoming_qmp helper, similar to migrate_qmp.
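
A usage sketch mirroring how the next patch calls it from a test (the
file path is illustrative; the target was started with "-incoming
defer"):

    /* kick off loading the VM state once the source has finished
     * writing the migration file */
    migrate_incoming_qmp(to, "file:/tmp/migfile", "{}");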

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-helpers.c | 19 +++++++++++++++++++
 tests/qtest/migration-helpers.h |  4 ++++
 2 files changed, 23 insertions(+)

diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index f6f3c6680f..8161495c27 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -130,6 +130,25 @@ void migrate_qmp(QTestState *who, const char *uri, const char *fmt, ...)
     qobject_unref(rsp);
 }
 
+
+void migrate_incoming_qmp(QTestState *who, const char *uri, const char *fmt, ...)
+{
+    va_list ap;
+    QDict *args, *rsp;
+
+    va_start(ap, fmt);
+    args = qdict_from_vjsonf_nofail(fmt, ap);
+    va_end(ap);
+
+    g_assert(!qdict_haskey(args, "uri"));
+    qdict_put_str(args, "uri", uri);
+
+    rsp = qtest_qmp(who, "{ 'execute': 'migrate-incoming', 'arguments': %p}", args);
+
+    g_assert(qdict_haskey(rsp, "return"));
+    qobject_unref(rsp);
+}
+
 /*
  * Note: caller is responsible to free the returned object via
  * qobject_unref() after use
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index a188b62787..53ddeaebb7 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -31,6 +31,10 @@ QDict *qmp_command(QTestState *who, const char *command, ...);
 G_GNUC_PRINTF(3, 4)
 void migrate_qmp(QTestState *who, const char *uri, const char *fmt, ...);
 
+G_GNUC_PRINTF(3, 4)
+void migrate_incoming_qmp(QTestState *who, const char *uri,
+                          const char *fmt, ...);
+
 QDict *migrate_query(QTestState *who);
 QDict *migrate_query_not_failed(QTestState *who);
 
-- 
2.35.3




* [RFC PATCH v1 04/26] tests/qtest: migration-test: Add tests for file-based migration
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (2 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 03/26] tests/qtest: migration: Add migrate_incoming_qmp helper Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 05/26] migration: Initial support of fixed-ram feature for analyze-migration.py Fabiano Rosas
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov, Thomas Huth, Laurent Vivier,
	Paolo Bonzini

From: Nikolay Borisov <nborisov@suse.com>

Add basic tests for file-based migration.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
(farosas) fix segfault when connect_uri is not set
---
 tests/qtest/migration-test.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 3b615b0da9..13e5cdd5a4 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -748,6 +748,7 @@ static void test_migrate_end(QTestState *from, QTestState *to, bool test_dest)
     cleanup("migsocket");
     cleanup("src_serial");
     cleanup("dest_serial");
+    cleanup("migfile");
 }
 
 #ifdef CONFIG_GNUTLS
@@ -1371,6 +1372,14 @@ static void test_precopy_common(MigrateCommon *args)
          * hanging forever if migration didn't converge */
         wait_for_migration_complete(from);
 
+        /*
+         * For file based migration the target must begin its migration after
+         * the source has finished
+         */
+        if (args->connect_uri && strstr(args->connect_uri, "file:")) {
+            migrate_incoming_qmp(to, args->connect_uri, "{}");
+        }
+
         if (!got_stop) {
             qtest_qmp_eventwait(from, "STOP");
         }
@@ -1524,6 +1533,17 @@ static void test_precopy_unix_xbzrle(void)
     test_precopy_common(&args);
 }
 
+static void test_precopy_file_stream_ram(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:%s/migfile", tmpfs);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+    };
+
+    test_precopy_common(&args);
+}
+
 static void test_precopy_tcp_plain(void)
 {
     MigrateCommon args = {
@@ -2515,6 +2535,10 @@ int main(int argc, char **argv)
     qtest_add_func("/migration/bad_dest", test_baddest);
     qtest_add_func("/migration/precopy/unix/plain", test_precopy_unix_plain);
     qtest_add_func("/migration/precopy/unix/xbzrle", test_precopy_unix_xbzrle);
+
+    qtest_add_func("/migration/precopy/file/stream-ram",
+                   test_precopy_file_stream_ram);
+
 #ifdef CONFIG_GNUTLS
     qtest_add_func("/migration/precopy/unix/tls/psk",
                    test_precopy_unix_tls_psk);
-- 
2.35.3




* [RFC PATCH v1 05/26] migration: Initial support of fixed-ram feature for analyze-migration.py
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (3 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 04/26] tests/qtest: migration-test: Add tests for file-based migration Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 06/26] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file Fabiano Rosas
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov, John Snow, Cleber Rosa

From: Nikolay Borisov <nborisov@suse.com>

In order to allow the analyze-migration.py script to work with
migration streams that have the 'fixed-ram' capability, it needs
access to the stream's configuration object. This commit enables that
by making the migration JSON writer part of the MigrationState struct,
allowing the configuration object to be serialized to JSON.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c        |  1 +
 migration/savevm.c           | 18 ++++++++++---
 scripts/analyze-migration.py | 51 +++++++++++++++++++++++++++++++++---
 3 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 5408d87453..177fb0de0f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2260,6 +2260,7 @@ void migrate_init(MigrationState *s)
     error_free(s->error);
     s->error = NULL;
     s->hostname = NULL;
+    s->vmdesc = NULL;
 
     migrate_set_state(&s->state, MIGRATION_STATUS_NONE, MIGRATION_STATUS_SETUP);
 
diff --git a/migration/savevm.c b/migration/savevm.c
index aa54a67fda..92102c1fe5 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1206,13 +1206,25 @@ void qemu_savevm_non_migratable_list(strList **reasons)
 
 void qemu_savevm_state_header(QEMUFile *f)
 {
+    MigrationState *s = migrate_get_current();
+
+    s->vmdesc = json_writer_new(false);
+
     trace_savevm_state_header();
     qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
     qemu_put_be32(f, QEMU_VM_FILE_VERSION);
 
-    if (migrate_get_current()->send_configuration) {
+    if (s->send_configuration) {
         qemu_put_byte(f, QEMU_VM_CONFIGURATION);
-        vmstate_save_state(f, &vmstate_configuration, &savevm_state, 0);
+        /*
+         * This starts the main json object and is paired with the
+         * json_writer_end_object in
+         * qemu_savevm_state_complete_precopy_non_iterable
+         */
+        json_writer_start_object(s->vmdesc, NULL);
+        json_writer_start_object(s->vmdesc, "configuration");
+        vmstate_save_state(f, &vmstate_configuration, &savevm_state, s->vmdesc);
+        json_writer_end_object(s->vmdesc);
     }
 }
 
@@ -1237,8 +1249,6 @@ void qemu_savevm_state_setup(QEMUFile *f)
     Error *local_err = NULL;
     int ret;
 
-    ms->vmdesc = json_writer_new(false);
-    json_writer_start_object(ms->vmdesc, NULL);
     json_writer_int64(ms->vmdesc, "page_size", qemu_target_page_size());
     json_writer_start_array(ms->vmdesc, "devices");
 
diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py
index b82a1b0c58..05af9efd2f 100755
--- a/scripts/analyze-migration.py
+++ b/scripts/analyze-migration.py
@@ -23,7 +23,7 @@
 import collections
 import struct
 import sys
-
+import math
 
 def mkdir_p(path):
     try:
@@ -119,11 +119,16 @@ def __init__(self, file, version_id, ramargs, section_key):
         self.file = file
         self.section_key = section_key
         self.TARGET_PAGE_SIZE = ramargs['page_size']
+        self.TARGET_PAGE_BITS = math.log2(self.TARGET_PAGE_SIZE)
         self.dump_memory = ramargs['dump_memory']
         self.write_memory = ramargs['write_memory']
+        self.fixed_ram = ramargs['fixed-ram']
         self.sizeinfo = collections.OrderedDict()
+        self.bitmap_offset = collections.OrderedDict()
+        self.pages_offset = collections.OrderedDict()
         self.data = collections.OrderedDict()
         self.data['section sizes'] = self.sizeinfo
+        self.ram_read = False
         self.name = ''
         if self.write_memory:
             self.files = { }
@@ -140,7 +145,13 @@ def __str__(self):
     def getDict(self):
         return self.data
 
+    def write_or_dump_fixed_ram(self):
+        pass
+
     def read(self):
+        if self.fixed_ram and self.ram_read:
+            return
+
         # Read all RAM sections
         while True:
             addr = self.file.read64()
@@ -167,7 +178,26 @@ def read(self):
                         f.truncate(0)
                         f.truncate(len)
                         self.files[self.name] = f
+
+                    if self.fixed_ram:
+                        bitmap_len = self.file.read32()
+                        # skip the pages_offset which we don't need
+                        offset = self.file.tell() + 8
+                        self.bitmap_offset[self.name] = offset
+                        offset = ((offset + bitmap_len + self.TARGET_PAGE_SIZE - 1) //
+                                  self.TARGET_PAGE_SIZE) * self.TARGET_PAGE_SIZE
+                        self.pages_offset[self.name] = offset
+                        self.file.file.seek(offset + len)
+
                 flags &= ~self.RAM_SAVE_FLAG_MEM_SIZE
+                if self.fixed_ram:
+                    self.ram_read = True
+                # now we should rewind to the ram page offset of the first
+                # ram section
+                if self.fixed_ram:
+                    if self.write_memory or self.dump_memory:
+                        self.write_or_dump_fixed_ram()
+                        return
 
             if flags & self.RAM_SAVE_FLAG_COMPRESS:
                 if flags & self.RAM_SAVE_FLAG_CONTINUE:
@@ -208,7 +238,7 @@ def read(self):
 
             # End of RAM section
             if flags & self.RAM_SAVE_FLAG_EOS:
-                break
+                return
 
             if flags != 0:
                 raise Exception("Unknown RAM flags: %x" % flags)
@@ -521,6 +551,7 @@ def read(self, desc_only = False, dump_memory = False, write_memory = False):
         ramargs['page_size'] = self.vmsd_desc['page_size']
         ramargs['dump_memory'] = dump_memory
         ramargs['write_memory'] = write_memory
+        ramargs['fixed-ram'] = False
         self.section_classes[('ram',0)][1] = ramargs
 
         while True:
@@ -528,8 +559,20 @@ def read(self, desc_only = False, dump_memory = False, write_memory = False):
             if section_type == self.QEMU_VM_EOF:
                 break
             elif section_type == self.QEMU_VM_CONFIGURATION:
-                section = ConfigurationSection(file)
-                section.read()
+                config_desc = self.vmsd_desc.get('configuration')
+                if config_desc is not None:
+                    config = VMSDSection(file, 1, config_desc, 'configuration')
+                    config.read()
+                    caps = config.data.get("configuration/capabilities")
+                    if caps is not None:
+                        caps = caps.data["capabilities"]
+                        if type(caps) != list:
+                            caps = [caps]
+                        for i in caps:
+                            # chomp out string length
+                            cap = i.data[1:].decode("utf8")
+                            if cap == "fixed-ram":
+                                ramargs['fixed-ram'] = True
             elif section_type == self.QEMU_VM_SECTION_START or section_type == self.QEMU_VM_SECTION_FULL:
                 section_id = file.read32()
                 name = file.readstr()
-- 
2.35.3




* [RFC PATCH v1 06/26] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (4 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 05/26] migration: Initial support of fixed-ram feature for analyze-migration.py Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 07/26] io: Add generic pwritev/preadv interface Fabiano Rosas
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

Add a generic QIOChannel feature SEEKABLE which will be used by the
qemu_file* APIs. For the time being this will only be implemented for
file channels.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 include/io/channel.h | 1 +
 io/channel-file.c    | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 153fbd2904..29461dda41 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -44,6 +44,7 @@ enum QIOChannelFeature {
     QIO_CHANNEL_FEATURE_LISTEN,
     QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
     QIO_CHANNEL_FEATURE_READ_MSG_PEEK,
+    QIO_CHANNEL_FEATURE_SEEKABLE,
 };
 
 
diff --git a/io/channel-file.c b/io/channel-file.c
index d76663e6ae..a0268232da 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -35,6 +35,10 @@ qio_channel_file_new_fd(int fd)
 
     ioc->fd = fd;
 
+    if (lseek(fd, 0, SEEK_CUR) != (off_t)-1) {
+        qio_channel_set_feature(QIO_CHANNEL(ioc), QIO_CHANNEL_FEATURE_SEEKABLE);
+    }
+
     trace_qio_channel_file_new_fd(ioc, fd);
 
     return ioc;
@@ -59,6 +63,10 @@ qio_channel_file_new_path(const char *path,
         return NULL;
     }
 
+    if (lseek(ioc->fd, 0, SEEK_CUR) != (off_t)-1) {
+        qio_channel_set_feature(QIO_CHANNEL(ioc), QIO_CHANNEL_FEATURE_SEEKABLE);
+    }
+
     trace_qio_channel_file_new_path(ioc, path, flags, mode, ioc->fd);
 
     return ioc;
-- 
2.35.3




* [RFC PATCH v1 07/26] io: Add generic pwritev/preadv interface
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (5 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 06/26] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 08/26] io: implement io_pwritev/preadv for QIOChannelFile Fabiano Rosas
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

Introduce basic pwritev/preadv support in the generic channel layer.
A specific implementation will follow for the file channel, as this is
required in order to support migration streams with a fixed location
for each RAM page.
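
A rough caller-side sketch (error handling trimmed; 'buf', 'len' and
'pos' are illustrative): positioned I/O is gated on the SEEKABLE
feature added in the previous patch, otherwise callers fall back to
the usual sequential path:

    if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SEEKABLE)) {
        /* place 'len' bytes from 'buf' at absolute offset 'pos' */
        qio_channel_pwritev(ioc, buf, len, pos, errp);
    } else {
        /* sequential write at the current position */
        qio_channel_write_all(ioc, buf, len, errp);
    }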

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/io/channel.h | 82 ++++++++++++++++++++++++++++++++++++++++++++
 io/channel.c         | 58 +++++++++++++++++++++++++++++++
 2 files changed, 140 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 29461dda41..28bce7ef17 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -129,6 +129,16 @@ struct QIOChannelClass {
                            Error **errp);
 
     /* Optional callbacks */
+    ssize_t (*io_pwritev)(QIOChannel *ioc,
+                          const struct iovec *iov,
+                          size_t niov,
+                          off_t offset,
+                          Error **errp);
+    ssize_t (*io_preadv)(QIOChannel *ioc,
+                         const struct iovec *iov,
+                         size_t niov,
+                         off_t offset,
+                         Error **errp);
     int (*io_shutdown)(QIOChannel *ioc,
                        QIOChannelShutdown how,
                        Error **errp);
@@ -511,6 +521,78 @@ int qio_channel_set_blocking(QIOChannel *ioc,
 int qio_channel_close(QIOChannel *ioc,
                       Error **errp);
 
+/**
+ * qio_channel_pwritev_full
+ * @ioc: the channel object
+ * @iov: the array of memory regions to write data from
+ * @niov: the length of the @iov array
+ * @offset: offset in the channel where writes should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error. To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ * Behaves as qio_channel_writev_full, apart from not supporting
+ * sending of file handles as well as beginning the write at the
+ * passed @offset
+ *
+ */
+ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
+                                 size_t niov, off_t offset, Error **errp);
+
+/**
+ * qio_channel_pwritev
+ * @ioc: the channel object
+ * @buf: the memory region to write data from
+ * @buflen: the number of bytes in @buf
+ * @offset: offset in the channel where writes should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error. To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ */
+ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
+                            off_t offset, Error **errp);
+
+/**
+ * qio_channel_preadv_full
+ * @ioc: the channel object
+ * @iov: the array of memory regions to read data into
+ * @niov: the length of the @iov array
+ * @offset: offset in the channel where reads should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error.  To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ * Behaves as qio_channel_readv_full, apart from not supporting
+ * receiving of file handles as well as beginning the read at the
+ * passed @offset
+ *
+ */
+ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
+                                size_t niov, off_t offset, Error **errp);
+
+/**
+ * qio_channel_preadv
+ * @ioc: the channel object
+ * @buf: the memory region to read data into
+ * @buflen: the number of bytes in @buf
+ * @offset: offset in the channel where reads should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error.  To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ */
+ssize_t qio_channel_preadv(QIOChannel *ioc, char *buf, size_t buflen,
+                           off_t offset, Error **errp);
+
 /**
  * qio_channel_shutdown:
  * @ioc: the channel object
diff --git a/io/channel.c b/io/channel.c
index a8c7f11649..312445b3aa 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -445,6 +445,64 @@ GSource *qio_channel_add_watch_source(QIOChannel *ioc,
 }
 
 
+ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
+                                 size_t niov, off_t offset, Error **errp)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (!klass->io_pwritev) {
+        error_setg(errp, "Channel does not support pwritev");
+        return -1;
+    }
+
+    if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SEEKABLE)) {
+        error_setg_errno(errp, EINVAL, "Requested channel is not seekable");
+        return -1;
+    }
+
+    return klass->io_pwritev(ioc, iov, niov, offset, errp);
+}
+
+ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
+                            off_t offset, Error **errp)
+{
+    struct iovec iov = {
+        .iov_base = buf,
+        .iov_len = buflen
+    };
+
+    return qio_channel_pwritev_full(ioc, &iov, 1, offset, errp);
+}
+
+ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
+                                size_t niov, off_t offset, Error **errp)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (!klass->io_preadv) {
+        error_setg(errp, "Channel does not support preadv");
+        return -1;
+    }
+
+    if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SEEKABLE)) {
+        error_setg_errno(errp, EINVAL, "Requested channel is not seekable");
+        return -1;
+    }
+
+    return klass->io_preadv(ioc, iov, niov, offset, errp);
+}
+
+ssize_t qio_channel_preadv(QIOChannel *ioc, char *buf, size_t buflen,
+                           off_t offset, Error **errp)
+{
+    struct iovec iov = {
+        .iov_base = buf,
+        .iov_len = buflen
+    };
+
+    return qio_channel_preadv_full(ioc, &iov, 1, offset, errp);
+}
+
 int qio_channel_shutdown(QIOChannel *ioc,
                          QIOChannelShutdown how,
                          Error **errp)
-- 
2.35.3




* [RFC PATCH v1 08/26] io: implement io_pwritev/preadv for QIOChannelFile
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (6 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 07/26] io: Add generic pwritev/preadv interface Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 09/26] migration/qemu-file: add utility methods for working with seekable channels Fabiano Rosas
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

The upcoming 'fixed-ram' feature will require qemu to write data to
(and restore from) specific offsets of the migration file.

Add a minimal implementation of pwritev/preadv and expose them via the
io_pwritev and io_preadv interfaces.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 io/channel-file.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/io/channel-file.c b/io/channel-file.c
index a0268232da..a3d2f0bcf9 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -145,6 +145,56 @@ static ssize_t qio_channel_file_writev(QIOChannel *ioc,
     return ret;
 }
 
+static ssize_t qio_channel_file_preadv(QIOChannel *ioc,
+                                       const struct iovec *iov,
+                                       size_t niov,
+                                       off_t offset,
+                                       Error **errp)
+{
+    QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+    ssize_t ret;
+
+ retry:
+    ret = preadv(fioc->fd, iov, niov, offset);
+    if (ret < 0) {
+        if (errno == EAGAIN) {
+            return QIO_CHANNEL_ERR_BLOCK;
+        }
+        if (errno == EINTR) {
+            goto retry;
+        }
+
+        error_setg_errno(errp, errno, "Unable to read from file");
+        return -1;
+    }
+
+    return ret;
+}
+
+static ssize_t qio_channel_file_pwritev(QIOChannel *ioc,
+                                        const struct iovec *iov,
+                                        size_t niov,
+                                        off_t offset,
+                                        Error **errp)
+{
+    QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+    ssize_t ret;
+
+ retry:
+    ret = pwritev(fioc->fd, iov, niov, offset);
+    if (ret <= 0) {
+        if (errno == EAGAIN) {
+            return QIO_CHANNEL_ERR_BLOCK;
+        }
+        if (errno == EINTR) {
+            goto retry;
+        }
+        error_setg_errno(errp, errno, "Unable to write to file");
+        return -1;
+    }
+    return ret;
+}
+
 static int qio_channel_file_set_blocking(QIOChannel *ioc,
                                          bool enabled,
                                          Error **errp)
@@ -227,6 +277,8 @@ static void qio_channel_file_class_init(ObjectClass *klass,
     ioc_klass->io_writev = qio_channel_file_writev;
     ioc_klass->io_readv = qio_channel_file_readv;
     ioc_klass->io_set_blocking = qio_channel_file_set_blocking;
+    ioc_klass->io_pwritev = qio_channel_file_pwritev;
+    ioc_klass->io_preadv = qio_channel_file_preadv;
     ioc_klass->io_seek = qio_channel_file_seek;
     ioc_klass->io_close = qio_channel_file_close;
     ioc_klass->io_create_watch = qio_channel_file_create_watch;
-- 
2.35.3




* [RFC PATCH v1 09/26] migration/qemu-file: add utility methods for working with seekable channels
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (7 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 08/26] io: implement io_pwritev/preadv for QIOChannelFile Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability Fabiano Rosas
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

Add utility methods that will be needed when implementing the
'fixed-ram' migration capability:

qemu_file_is_seekable
qemu_put_buffer_at
qemu_get_buffer_at
qemu_set_offset
qemu_get_offset
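
A rough sketch of how the RAM code later in this series (patch 10) is
expected to use the new helpers, placing one page directly at its
pre-computed spot in the file (names taken from that patch):

    qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
                       block->pages_offset + offset);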

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
fixed total_transferred accounting

restructured to use qio_channel_file_preadv instead of the _full
variant
---
 include/migration/qemu-file-types.h |  2 +
 migration/qemu-file.c               | 80 +++++++++++++++++++++++++++++
 migration/qemu-file.h               |  4 ++
 3 files changed, 86 insertions(+)

diff --git a/include/migration/qemu-file-types.h b/include/migration/qemu-file-types.h
index 2867e3da84..eb0325ee86 100644
--- a/include/migration/qemu-file-types.h
+++ b/include/migration/qemu-file-types.h
@@ -50,6 +50,8 @@ unsigned int qemu_get_be16(QEMUFile *f);
 unsigned int qemu_get_be32(QEMUFile *f);
 uint64_t qemu_get_be64(QEMUFile *f);
 
+bool qemu_file_is_seekable(QEMUFile *f);
+
 static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
 {
     qemu_put_be64(f, *pv);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 102ab3b439..a1f7dbb3d9 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -30,6 +30,7 @@
 #include "qemu-file.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "io/channel-file.h"
 
 #define IO_BUF_SIZE 32768
 #define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64)
@@ -281,6 +282,10 @@ static void qemu_iovec_release_ram(QEMUFile *f)
     memset(f->may_free, 0, sizeof(f->may_free));
 }
 
+bool qemu_file_is_seekable(QEMUFile *f)
+{
+    return qio_channel_has_feature(f->ioc, QIO_CHANNEL_FEATURE_SEEKABLE);
+}
 
 /**
  * Flushes QEMUFile buffer
@@ -559,6 +564,81 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, size_t size)
     }
 }
 
+void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen, off_t pos)
+{
+    Error *err = NULL;
+
+    if (f->last_error) {
+        return;
+    }
+
+    qemu_fflush(f);
+    qio_channel_pwritev(f->ioc, (char *)buf, buflen, pos, &err);
+
+    if (err) {
+        qemu_file_set_error_obj(f, -EIO, err);
+    } else {
+        f->total_transferred += buflen;
+    }
+
+    return;
+}
+
+
+size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen, off_t pos)
+{
+    Error *err = NULL;
+    ssize_t ret;
+
+    if (f->last_error) {
+        return 0;
+    }
+
+    ret = qio_channel_preadv(f->ioc, (char *)buf, buflen, pos, &err);
+    if (ret == -1 || err) {
+        goto error;
+    }
+
+    return (size_t)ret;
+
+ error:
+    qemu_file_set_error_obj(f, -EIO, err);
+    return 0;
+}
+
+void qemu_set_offset(QEMUFile *f, off_t off, int whence)
+{
+    Error *err = NULL;
+    off_t ret;
+
+    qemu_fflush(f);
+
+    if (!qemu_file_is_writable(f)) {
+        f->buf_index = 0;
+        f->buf_size = 0;
+    }
+
+    ret = qio_channel_io_seek(f->ioc, off, whence, &err);
+    if (ret == (off_t)-1) {
+        qemu_file_set_error_obj(f, -EIO, err);
+    }
+}
+
+off_t qemu_get_offset(QEMUFile *f)
+{
+    Error *err = NULL;
+    off_t ret;
+
+    qemu_fflush(f);
+
+    ret = qio_channel_io_seek(f->ioc, 0, SEEK_CUR, &err);
+    if (ret == (off_t)-1) {
+        qemu_file_set_error_obj(f, -EIO, err);
+    }
+    return ret;
+}
+
+
 void qemu_put_byte(QEMUFile *f, int v)
 {
     if (f->last_error) {
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 9d0155a2a1..350273b441 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -149,6 +149,10 @@ QEMUFile *qemu_file_get_return_path(QEMUFile *f);
 void qemu_fflush(QEMUFile *f);
 void qemu_file_set_blocking(QEMUFile *f, bool block);
 int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
+void qemu_set_offset(QEMUFile *f, off_t off, int whence);
+off_t qemu_get_offset(QEMUFile *f);
+void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen, off_t pos);
+size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen, off_t pos);
 
 void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
 void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
-- 
2.35.3




* [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (8 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 09/26] migration/qemu-file: add utility methods for working with seekable channels Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 22:01   ` Peter Xu
  2023-03-31  5:50   ` Markus Armbruster
  2023-03-30 18:03 ` [RFC PATCH v1 11/26] migration: Refactor precopy ram loading code Fabiano Rosas
                   ` (17 subsequent siblings)
  27 siblings, 2 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster

From: Nikolay Borisov <nborisov@suse.com>

Implement the 'fixed-ram' feature. The core of the feature is to ensure
that each RAM page of the migration stream has a specific offset in the
resulting migration file. The reasons why we'd want such behavior are
twofold:

 - When doing a 'fixed-ram' migration the resulting file will have a
   bounded size, since pages which are dirtied multiple times will
   always go to a fixed location in the file, rather than constantly
   being added to a sequential stream. This eliminates cases where a vm
   with, say, 1G of ram can result in a migration file that's 10s of
   GBs, provided that the workload constantly redirties memory.

 - It paves the way to implement DIO-enabled save/restore of the
   migration stream as the pages are ensured to be written at aligned
   offsets.

The feature requires changing the stream format. First, a bitmap is
introduced which tracks which pages have been written (i.e. are
dirtied) during migration; it is subsequently written to the
resulting file, again at a fixed location for every RAMBlock. Zero
pages are ignored as they'd be zero on the destination as well. With
the changed format, data would look like the following:

|name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|

* pc - refers to the page_size/mr->addr members, so newly added members
begin from "bitmap_size".

This layout is initialized during ram_save_setup, so instead of having
a sequential stream of pages that follow the RAMBlock headers, the
dirty pages for a RAMBlock follow its header. Since all pages have a
fixed location, RAM_SAVE_FLAG_EOS is no longer generated on every
migration iteration; there is effectively a single RAM_SAVE_FLAG_EOS
right at the end.
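
As a worked example of the resulting layout (numbers illustrative,
assuming 4k target pages): a dirty page at offset 0x2000 within a
RAMBlock whose pages_offset is 0x100000 is written at file offset
0x100000 + 0x2000 = 0x102000, and bit 0x2000 >> TARGET_PAGE_BITS = 2
is set in that block's shadow bitmap. A clean or zero page simply
leaves its slot in the file untouched.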

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 docs/devel/migration.rst | 36 +++++++++++++++
 include/exec/ramblock.h  |  8 ++++
 migration/migration.c    | 51 +++++++++++++++++++++-
 migration/migration.h    |  1 +
 migration/ram.c          | 94 +++++++++++++++++++++++++++++++++-------
 migration/savevm.c       |  1 +
 qapi/migration.json      |  2 +-
 7 files changed, 176 insertions(+), 17 deletions(-)

diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index 1080211f8e..84112d7f3f 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -568,6 +568,42 @@ Others (especially either older devices or system devices which for
 some reason don't have a bus concept) make use of the ``instance id``
 for otherwise identically named devices.
 
+Fixed-ram format
+----------------
+
+When the ``fixed-ram`` capability is enabled, a slightly different
+stream format is used for the RAM section. Instead of having a
+sequential stream of pages that follow the RAMBlock headers, the dirty
+pages for a RAMBlock follow its header. This ensures that each RAM
+page has a fixed offset in the resulting migration stream.
+
+  - RAMBlock 1
+
+    - ID string length
+    - ID string
+    - Used size
+    - Shadow bitmap size
+    - Pages offset in migration stream*
+
+  - Shadow bitmap
+  - Sequence of pages for RAMBlock 1 (* offset points here)
+
+  - RAMBlock 2
+
+    - ID string length
+    - ID string
+    - Used size
+    - Shadow bitmap size
+    - Pages offset in migration stream*
+
+  - Shadow bitmap
+  - Sequence of pages for RAMBlock 2 (* offset points here)
+
+The ``fixed-ram`` capability can be enabled in both source and
+destination with:
+
+    ``migrate_set_capability fixed-ram on``
+
 Return path
 -----------
 
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index adc03df59c..4360c772c2 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -43,6 +43,14 @@ struct RAMBlock {
     size_t page_size;
     /* dirty bitmap used during migration */
     unsigned long *bmap;
+    /* shadow dirty bitmap used when migrating to a file */
+    unsigned long *shadow_bmap;
+    /*
+     * offset in the file pages belonging to this ramblock are saved,
+     * used only during migration to a file.
+     */
+    off_t bitmap_offset;
+    uint64_t pages_offset;
     /* bitmap of already received pages in postcopy */
     unsigned long *receivedmap;
 
diff --git a/migration/migration.c b/migration/migration.c
index 177fb0de0f..29630523e2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -168,7 +168,8 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
     MIGRATION_CAPABILITY_XBZRLE,
     MIGRATION_CAPABILITY_X_COLO,
     MIGRATION_CAPABILITY_VALIDATE_UUID,
-    MIGRATION_CAPABILITY_ZERO_COPY_SEND);
+    MIGRATION_CAPABILITY_ZERO_COPY_SEND,
+    MIGRATION_CAPABILITY_FIXED_RAM);
 
 /* When we add fault tolerance, we could have several
    migrations at once.  For now we don't need to add
@@ -1341,6 +1342,28 @@ static bool migrate_caps_check(bool *cap_list,
     }
 #endif
 
+    if (cap_list[MIGRATION_CAPABILITY_FIXED_RAM]) {
+        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
+            error_setg(errp, "Directly mapped memory incompatible with multifd");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
+            error_setg(errp, "Directly mapped memory incompatible with xbzrle");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+            error_setg(errp, "Directly mapped memory incompatible with compression");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
+            error_setg(errp, "Directly mapped memory incompatible with postcopy ram");
+            return false;
+        }
+    }
+
     if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
         /* This check is reasonably expensive, so only when it's being
          * set the first time, also it's only the destination that needs
@@ -2736,6 +2759,11 @@ MultiFDCompression migrate_multifd_compression(void)
     return s->parameters.multifd_compression;
 }
 
+int migrate_fixed_ram(void)
+{
+    return migrate_get_current()->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
+}
+
 int migrate_multifd_zlib_level(void)
 {
     MigrationState *s;
@@ -4324,6 +4352,20 @@ fail:
     return NULL;
 }
 
+static int migrate_check_fixed_ram(MigrationState *s, Error **errp)
+{
+    if (!s->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM]) {
+        return 0;
+    }
+
+    if (!qemu_file_is_seekable(s->to_dst_file)) {
+        error_setg(errp, "Directly mapped memory requires a seekable transport");
+        return -1;
+    }
+
+    return 0;
+}
+
 void migrate_fd_connect(MigrationState *s, Error *error_in)
 {
     Error *local_err = NULL;
@@ -4390,6 +4432,12 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
         }
     }
 
+    if (migrate_check_fixed_ram(s, &local_err) < 0) {
+        migrate_fd_cleanup(s);
+        migrate_fd_error(s, local_err);
+        return;
+    }
+
     if (resume) {
         /* Wakeup the main migration thread to do the recovery */
         migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
@@ -4519,6 +4567,7 @@ static Property migration_properties[] = {
     DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
 
     /* Migration capabilities */
+    DEFINE_PROP_MIG_CAP("x-fixed-ram", MIGRATION_CAPABILITY_FIXED_RAM),
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
     DEFINE_PROP_MIG_CAP("x-rdma-pin-all", MIGRATION_CAPABILITY_RDMA_PIN_ALL),
     DEFINE_PROP_MIG_CAP("x-auto-converge", MIGRATION_CAPABILITY_AUTO_CONVERGE),
diff --git a/migration/migration.h b/migration/migration.h
index 2da2f8a164..8cf3caecfe 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -416,6 +416,7 @@ bool migrate_zero_blocks(void);
 bool migrate_dirty_bitmaps(void);
 bool migrate_ignore_shared(void);
 bool migrate_validate_uuid(void);
+int migrate_fixed_ram(void);
 
 bool migrate_auto_converge(void);
 bool migrate_use_multifd(void);
diff --git a/migration/ram.c b/migration/ram.c
index 96e8a19a58..56f0f782c8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1310,9 +1310,14 @@ static int save_zero_page_to_file(PageSearchStatus *pss,
     int len = 0;
 
     if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
-        len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
-        qemu_put_byte(file, 0);
-        len += 1;
+        if (migrate_fixed_ram()) {
+            /* for zero pages we don't need to do anything */
+            len = 1;
+        } else {
+            len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
+            qemu_put_byte(file, 0);
+            len += 1;
+        }
         ram_release_page(block->idstr, offset);
     }
     return len;
@@ -1394,14 +1399,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
 {
     QEMUFile *file = pss->pss_channel;
 
-    ram_transferred_add(save_page_header(pss, block,
-                                         offset | RAM_SAVE_FLAG_PAGE));
-    if (async) {
-        qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
-                              migrate_release_ram() &&
-                              migration_in_postcopy());
+    if (migrate_fixed_ram()) {
+        qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
+                           block->pages_offset + offset);
+        set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
     } else {
-        qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
+        ram_transferred_add(save_page_header(pss, block,
+                                             offset | RAM_SAVE_FLAG_PAGE));
+        if (async) {
+            qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
+                                  migrate_release_ram() &&
+                                  migration_in_postcopy());
+        } else {
+            qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
+        }
     }
     ram_transferred_add(TARGET_PAGE_SIZE);
     stat64_add(&ram_atomic_counters.normal, 1);
@@ -2731,6 +2742,8 @@ static void ram_save_cleanup(void *opaque)
         block->clear_bmap = NULL;
         g_free(block->bmap);
         block->bmap = NULL;
+        g_free(block->shadow_bmap);
+        block->shadow_bmap = NULL;
     }
 
     xbzrle_cleanup();
@@ -3098,6 +3111,7 @@ static void ram_list_init_bitmaps(void)
              */
             block->bmap = bitmap_new(pages);
             bitmap_set(block->bmap, 0, pages);
+            block->shadow_bmap = bitmap_new(block->used_length >> TARGET_PAGE_BITS);
             block->clear_bmap_shift = shift;
             block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
         }
@@ -3287,6 +3301,33 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
             if (migrate_ignore_shared()) {
                 qemu_put_be64(f, block->mr->addr);
             }
+
+            if (migrate_fixed_ram()) {
+                long num_pages = block->used_length >> TARGET_PAGE_BITS;
+                long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
+
+                /* Needed for external programs (think analyze-migration.py) */
+                qemu_put_be32(f, bitmap_size);
+
+                /*
+                 * The bitmap starts after pages_offset, so add 8 to
+                 * account for the pages_offset size.
+                 */
+                block->bitmap_offset = qemu_get_offset(f) + 8;
+
+                /*
+                 * Make pages_offset aligned to 1 MiB to account for
+                 * migration file movement between filesystems with
+                 * possibly different alignment restrictions when
+                 * using O_DIRECT.
+                 */
+                block->pages_offset = ROUND_UP(block->bitmap_offset +
+                                               bitmap_size, 0x100000);
+                qemu_put_be64(f, block->pages_offset);
+
+                /* Now prepare offset for next ramblock */
+                qemu_set_offset(f, block->pages_offset + block->used_length, SEEK_SET);
+            }
         }
     }
 
@@ -3306,6 +3347,18 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     return 0;
 }
 
+static void ram_save_shadow_bmap(QEMUFile *f)
+{
+    RAMBlock *block;
+
+    RAMBLOCK_FOREACH_MIGRATABLE(block) {
+        long num_pages = block->used_length >> TARGET_PAGE_BITS;
+        long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
+        qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
+                           block->bitmap_offset);
+    }
+}
+
 /**
  * ram_save_iterate: iterative stage for migration
  *
@@ -3413,9 +3466,15 @@ out:
             return ret;
         }
 
-        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
-        qemu_fflush(f);
-        ram_transferred_add(8);
+        /*
+         * For fixed ram we don't want to pollute the migration stream with
+         * EOS flags.
+         */
+        if (!migrate_fixed_ram()) {
+            qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+            qemu_fflush(f);
+            ram_transferred_add(8);
+        }
 
         ret = qemu_file_get_error(f);
     }
@@ -3461,6 +3520,9 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
             pages = ram_find_and_save_block(rs);
             /* no more blocks to sent */
             if (pages == 0) {
+                if (migrate_fixed_ram()) {
+                    ram_save_shadow_bmap(f);
+                }
                 break;
             }
             if (pages < 0) {
@@ -3483,8 +3545,10 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         return ret;
     }
 
-    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
-    qemu_fflush(f);
+    if (!migrate_fixed_ram()) {
+        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+        qemu_fflush(f);
+    }
 
     return 0;
 }
diff --git a/migration/savevm.c b/migration/savevm.c
index 92102c1fe5..1f1bc19224 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -241,6 +241,7 @@ static bool should_validate_capability(int capability)
     /* Validate only new capabilities to keep compatibility. */
     switch (capability) {
     case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
+    case MIGRATION_CAPABILITY_FIXED_RAM:
         return true;
     default:
         return false;
diff --git a/qapi/migration.json b/qapi/migration.json
index c84fa10e86..22eea58ce3 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -485,7 +485,7 @@
 ##
 { 'enum': 'MigrationCapability',
   'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
-           'compress', 'events', 'postcopy-ram',
+           'compress', 'events', 'postcopy-ram', 'fixed-ram',
            { 'name': 'x-colo', 'features': [ 'unstable' ] },
            'release-ram',
            'block', 'return-path', 'pause-before-switchover', 'multifd',
-- 
2.35.3



* [RFC PATCH v1 11/26] migration: Refactor precopy ram loading code
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (9 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 12/26] migration: Add support for 'fixed-ram' migration restore Fabiano Rosas
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

To facilitate the implementation of the 'fixed-ram' migration restore,
factor out the code responsible for parsing the ramblock headers. This
also makes ram_load_precopy easier to comprehend.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/ram.c | 142 +++++++++++++++++++++++++++---------------------
 1 file changed, 80 insertions(+), 62 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 56f0f782c8..5c085d6154 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4319,6 +4319,83 @@ void colo_flush_ram_cache(void)
     trace_colo_flush_ram_cache_end();
 }
 
+static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
+{
+    int ret = 0;
+    /* ADVISE is earlier, it shows the source has the postcopy capability on */
+    bool postcopy_advised = migration_incoming_postcopy_advised();
+
+    assert(block);
+
+    if (!qemu_ram_is_migratable(block)) {
+        error_report("block %s should not be migrated !", block->idstr);
+        ret = -EINVAL;
+    }
+
+    if (length != block->used_length) {
+        Error *local_err = NULL;
+
+        ret = qemu_ram_resize(block, length, &local_err);
+        if (local_err) {
+            error_report_err(local_err);
+        }
+    }
+    /* For postcopy we need to check hugepage sizes match */
+    if (postcopy_advised && migrate_postcopy_ram() &&
+        block->page_size != qemu_host_page_size) {
+        uint64_t remote_page_size = qemu_get_be64(f);
+        if (remote_page_size != block->page_size) {
+            error_report("Mismatched RAM page size %s "
+                         "(local) %zd != %" PRId64, block->idstr,
+                         block->page_size, remote_page_size);
+            ret = -EINVAL;
+        }
+    }
+    if (migrate_ignore_shared()) {
+        hwaddr addr = qemu_get_be64(f);
+        if (ramblock_is_ignored(block) &&
+            block->mr->addr != addr) {
+            error_report("Mismatched GPAs for block %s "
+                         "%" PRId64 "!= %" PRId64, block->idstr,
+                         (uint64_t)addr,
+                         (uint64_t)block->mr->addr);
+            ret = -EINVAL;
+        }
+    }
+    ram_control_load_hook(f, RAM_CONTROL_BLOCK_REG, block->idstr);
+
+    return ret;
+}
+
+static int parse_ramblocks(QEMUFile *f, ram_addr_t total_ram_bytes)
+{
+    int ret = 0;
+
+    /* Synchronize RAM block list */
+    while (!ret && total_ram_bytes) {
+        char id[256];
+        RAMBlock *block;
+        ram_addr_t length;
+        int len = qemu_get_byte(f);
+
+        qemu_get_buffer(f, (uint8_t *)id, len);
+        id[len] = 0;
+        length = qemu_get_be64(f);
+
+        block = qemu_ram_block_by_name(id);
+        if (block) {
+            ret = parse_ramblock(f, block, length);
+        } else {
+            error_report("Unknown ramblock \"%s\", cannot accept "
+                         "migration", id);
+            ret = -EINVAL;
+        }
+        total_ram_bytes -= length;
+    }
+
+    return ret;
+}
+
 /**
  * ram_load_precopy: load pages in precopy case
  *
@@ -4333,14 +4410,13 @@ static int ram_load_precopy(QEMUFile *f)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
     int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0;
-    /* ADVISE is earlier, it shows the source has the postcopy capability on */
-    bool postcopy_advised = migration_incoming_postcopy_advised();
+
     if (!migrate_use_compression()) {
         invalid_flags |= RAM_SAVE_FLAG_COMPRESS_PAGE;
     }
 
     while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
-        ram_addr_t addr, total_ram_bytes;
+        ram_addr_t addr;
         void *host = NULL, *host_bak = NULL;
         uint8_t ch;
 
@@ -4411,65 +4487,7 @@ static int ram_load_precopy(QEMUFile *f)
 
         switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
         case RAM_SAVE_FLAG_MEM_SIZE:
-            /* Synchronize RAM block list */
-            total_ram_bytes = addr;
-            while (!ret && total_ram_bytes) {
-                RAMBlock *block;
-                char id[256];
-                ram_addr_t length;
-
-                len = qemu_get_byte(f);
-                qemu_get_buffer(f, (uint8_t *)id, len);
-                id[len] = 0;
-                length = qemu_get_be64(f);
-
-                block = qemu_ram_block_by_name(id);
-                if (block && !qemu_ram_is_migratable(block)) {
-                    error_report("block %s should not be migrated !", id);
-                    ret = -EINVAL;
-                } else if (block) {
-                    if (length != block->used_length) {
-                        Error *local_err = NULL;
-
-                        ret = qemu_ram_resize(block, length,
-                                              &local_err);
-                        if (local_err) {
-                            error_report_err(local_err);
-                        }
-                    }
-                    /* For postcopy we need to check hugepage sizes match */
-                    if (postcopy_advised && migrate_postcopy_ram() &&
-                        block->page_size != qemu_host_page_size) {
-                        uint64_t remote_page_size = qemu_get_be64(f);
-                        if (remote_page_size != block->page_size) {
-                            error_report("Mismatched RAM page size %s "
-                                         "(local) %zd != %" PRId64,
-                                         id, block->page_size,
-                                         remote_page_size);
-                            ret = -EINVAL;
-                        }
-                    }
-                    if (migrate_ignore_shared()) {
-                        hwaddr addr = qemu_get_be64(f);
-                        if (ramblock_is_ignored(block) &&
-                            block->mr->addr != addr) {
-                            error_report("Mismatched GPAs for block %s "
-                                         "%" PRId64 "!= %" PRId64,
-                                         id, (uint64_t)addr,
-                                         (uint64_t)block->mr->addr);
-                            ret = -EINVAL;
-                        }
-                    }
-                    ram_control_load_hook(f, RAM_CONTROL_BLOCK_REG,
-                                          block->idstr);
-                } else {
-                    error_report("Unknown ramblock \"%s\", cannot "
-                                 "accept migration", id);
-                    ret = -EINVAL;
-                }
-
-                total_ram_bytes -= length;
-            }
+            ret = parse_ramblocks(f, addr);
             break;
 
         case RAM_SAVE_FLAG_ZERO:
-- 
2.35.3



* [RFC PATCH v1 12/26] migration: Add support for 'fixed-ram' migration restore
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (10 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 11/26] migration: Refactor precopy ram loading code Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 13/26] tests/qtest: migration-test: Add tests for fixed-ram file-based migration Fabiano Rosas
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

Add the necessary code to parse the format changes for the 'fixed-ram'
capability.

One of the more notable changes in behavior is that in the 'fixed-ram'
case ram pages are restored in one go rather than constantly looping
through the migration stream.

Also, due to idiosyncrasies of the format, I have added the
'ram_migrated' flag, since it was easier to simply return directly from
->load_state rather than introduce more conditionals around the code to
prevent ->load_state from being called multiple times (from
qemu_loadvm_section_start_full/qemu_loadvm_section_part_end, i.e. from
multiple QEMU_VM_SECTION_(PART|END) flags).
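
For reference, here is a minimal standalone sketch (not part of the
patch, plain POSIX reads rather than the QEMUFile API) of the
per-ramblock header that parse_ramblocks_fixed_ram() below walks,
assuming none of the optional per-block fields (ignore-shared GPA,
postcopy page size) are present:

  #include <endian.h>
  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Hypothetical helper, error handling omitted. */
  static void dump_one_ramblock_header(int fd)
  {
      uint8_t idstr_len;
      char idstr[256 + 1];
      uint64_t used_length, pages_offset;
      uint32_t bitmap_size;

      read(fd, &idstr_len, 1);          /* length of the block name */
      read(fd, idstr, idstr_len);
      idstr[idstr_len] = 0;
      read(fd, &used_length, 8);        /* big-endian on the wire */
      read(fd, &bitmap_size, 4);        /* shadow bitmap size in bytes */
      read(fd, &pages_offset, 8);       /* 1 MiB aligned start of the pages */

      printf("%s: %" PRIu64 " bytes, bitmap %" PRIu32 " bytes, pages at 0x%" PRIx64 "\n",
             idstr, be64toh(used_length), be32toh(bitmap_size),
             be64toh(pages_offset));

      /* The shadow bitmap follows right here; a page at offset 'off'
       * within the block is stored at pages_offset + off in the file. */
  }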

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.h |   2 +
 migration/ram.c       | 105 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index 8cf3caecfe..84be34587f 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -96,6 +96,8 @@ struct MigrationIncomingState {
     bool           have_listen_thread;
     QemuThread     listen_thread;
 
+    bool ram_migrated;
+
     /* For the kernel to send us notifications */
     int       userfault_fd;
     /* To notify the fault_thread to wake, e.g., when need to quit */
diff --git a/migration/ram.c b/migration/ram.c
index 5c085d6154..1666ce6d5f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4396,6 +4396,100 @@ static int parse_ramblocks(QEMUFile *f, ram_addr_t total_ram_bytes)
     return ret;
 }
 
+static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
+                                    long num_pages, unsigned long *bitmap)
+{
+    unsigned long set_bit_idx, clear_bit_idx;
+    unsigned long len;
+    ram_addr_t offset;
+    void *host;
+    size_t read, completed, read_len;
+
+    for (set_bit_idx = find_first_bit(bitmap, num_pages);
+         set_bit_idx < num_pages;
+         set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
+
+        clear_bit_idx = find_next_zero_bit(bitmap, num_pages, set_bit_idx + 1);
+
+        len = TARGET_PAGE_SIZE * (clear_bit_idx - set_bit_idx);
+        offset = set_bit_idx << TARGET_PAGE_BITS;
+
+        for (read = 0, completed = 0; completed < len; offset += read) {
+            host = host_from_ram_block_offset(block, offset);
+            read_len = MIN(len, TARGET_PAGE_SIZE);
+
+            read = qemu_get_buffer_at(f, host, read_len,
+                                      block->pages_offset + offset);
+            completed += read;
+        }
+    }
+}
+
+static int parse_ramblocks_fixed_ram(QEMUFile *f)
+{
+    int ret = 0;
+
+    while (!ret) {
+        char id[256];
+        RAMBlock *block;
+        ram_addr_t length;
+        long num_pages, bitmap_size;
+        int len = qemu_get_byte(f);
+        g_autofree unsigned long *dirty_bitmap = NULL;
+
+        qemu_get_buffer(f, (uint8_t *)id, len);
+        id[len] = 0;
+        length = qemu_get_be64(f);
+
+        block = qemu_ram_block_by_name(id);
+        if (block) {
+            ret = parse_ramblock(f, block, length);
+            if (ret < 0) {
+                return ret;
+            }
+        } else {
+            error_report("Unknown ramblock \"%s\", cannot accept "
+                         "migration", id);
+            ret = -EINVAL;
+            continue;
+        }
+
+        /* 1. read the bitmap size */
+        num_pages = length >> TARGET_PAGE_BITS;
+        bitmap_size = qemu_get_be32(f);
+
+        assert(bitmap_size == BITS_TO_LONGS(num_pages) * sizeof(unsigned long));
+
+        block->pages_offset = qemu_get_be64(f);
+
+        /* 2. read the actual bitmap */
+        dirty_bitmap = g_malloc0(bitmap_size);
+        if (qemu_get_buffer(f, (uint8_t *)dirty_bitmap, bitmap_size) != bitmap_size) {
+            error_report("Error parsing dirty bitmap");
+            return -EINVAL;
+        }
+
+        read_ramblock_fixed_ram(f, block, num_pages, dirty_bitmap);
+
+        /* Skip pages array */
+        qemu_set_offset(f, block->pages_offset + length, SEEK_SET);
+
+        /* Check if this is the last ramblock */
+        if (qemu_get_be64(f) == RAM_SAVE_FLAG_EOS) {
+            ret = 1;
+        } else {
+            /*
+             * If not, adjust the internal file index to account for the
+             * previous 64 bit read
+             */
+            qemu_file_skip(f, -8);
+            ret = 0;
+        }
+    }
+
+    return ret;
+}
+
 /**
  * ram_load_precopy: load pages in precopy case
  *
@@ -4415,7 +4509,7 @@ static int ram_load_precopy(QEMUFile *f)
         invalid_flags |= RAM_SAVE_FLAG_COMPRESS_PAGE;
     }
 
-    while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
+    while (!ret && !(flags & RAM_SAVE_FLAG_EOS) && !mis->ram_migrated) {
         ram_addr_t addr;
         void *host = NULL, *host_bak = NULL;
         uint8_t ch;
@@ -4487,7 +4581,14 @@ static int ram_load_precopy(QEMUFile *f)
 
         switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
         case RAM_SAVE_FLAG_MEM_SIZE:
-            ret = parse_ramblocks(f, addr);
+            if (migrate_fixed_ram()) {
+                ret = parse_ramblocks_fixed_ram(f);
+                if (ret == 1) {
+                    mis->ram_migrated = true;
+                }
+            } else {
+                ret = parse_ramblocks(f, addr);
+            }
             break;
 
         case RAM_SAVE_FLAG_ZERO:
-- 
2.35.3



* [RFC PATCH v1 13/26] tests/qtest: migration-test: Add tests for fixed-ram file-based migration
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (11 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 12/26] migration: Add support for 'fixed-ram' migration restore Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 14/26] migration: Add completion tracepoint Fabiano Rosas
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov, Thomas Huth, Laurent Vivier,
	Paolo Bonzini

From: Nikolay Borisov <nborisov@suse.com>

Add basic tests for 'fixed-ram' migration.
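
For reference, one way to run just the new test on an x86-64 build
would be something like the following (binary paths are illustrative;
qtest prefixes test paths with the target architecture):

  $ QTEST_QEMU_BINARY=./qemu-system-x86_64 \
      ./tests/qtest/migration-test -p /x86_64/migration/precopy/file/fixed-ram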

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration-test.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 13e5cdd5a4..84b4c761ad 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1544,6 +1544,26 @@ static void test_precopy_file_stream_ram(void)
     test_precopy_common(&args);
 }
 
+static void *migrate_fixed_ram_start(QTestState *from, QTestState *to)
+{
+    migrate_set_capability(from, "fixed-ram", true);
+    migrate_set_capability(to, "fixed-ram", true);
+
+    return NULL;
+}
+
+static void test_precopy_file_fixed_ram(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:%s/migfile", tmpfs);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = migrate_fixed_ram_start,
+    };
+
+    test_precopy_common(&args);
+}
+
 static void test_precopy_tcp_plain(void)
 {
     MigrateCommon args = {
@@ -2538,6 +2558,8 @@ int main(int argc, char **argv)
 
     qtest_add_func("/migration/precopy/file/stream-ram",
                    test_precopy_file_stream_ram);
+    qtest_add_func("/migration/precopy/file/fixed-ram",
+                   test_precopy_file_fixed_ram);
 
 #ifdef CONFIG_GNUTLS
     qtest_add_func("/migration/precopy/unix/tls/psk",
-- 
2.35.3



* [RFC PATCH v1 14/26] migration: Add completion tracepoint
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (12 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 13/26] tests/qtest: migration-test: Add tests for fixed-ram file-based migration Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 15/26] migration/multifd: Remove direct "socket" references Fabiano Rosas
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

Add a completion tracepoint that provides basic stats for
debugging. It displays throughput (MB/s and pages/s) and total time (ms).

Usage:
  $QEMU ... -trace migration_status

Output:
  migration_status 1506 MB/s, 436725 pages/s, 8698 ms

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c  | 6 +++---
 migration/migration.h  | 4 +++-
 migration/savevm.c     | 4 ++++
 migration/trace-events | 1 +
 4 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 29630523e2..17b26c1808 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3811,7 +3811,7 @@ static uint64_t migration_total_bytes(MigrationState *s)
         ram_counters.multifd_bytes;
 }
 
-static void migration_calculate_complete(MigrationState *s)
+void migration_calculate_complete(MigrationState *s)
 {
     uint64_t bytes = migration_total_bytes(s);
     int64_t end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
@@ -3843,8 +3843,7 @@ static void update_iteration_initial_status(MigrationState *s)
     s->iteration_initial_pages = ram_get_total_transferred_pages();
 }
 
-static void migration_update_counters(MigrationState *s,
-                                      int64_t current_time)
+void migration_update_counters(MigrationState *s, int64_t current_time)
 {
     uint64_t transferred, transferred_pages, time_spent;
     uint64_t current_bytes; /* bytes transferred since the beginning */
@@ -3941,6 +3940,7 @@ static void migration_iteration_finish(MigrationState *s)
     case MIGRATION_STATUS_COMPLETED:
         migration_calculate_complete(s);
         runstate_set(RUN_STATE_POSTMIGRATE);
+        trace_migration_status((int)s->mbps / 8, (int)s->pages_per_second, s->total_time);
         break;
     case MIGRATION_STATUS_COLO:
         if (!migrate_colo_enabled()) {
diff --git a/migration/migration.h b/migration/migration.h
index 84be34587f..01c8201cfa 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -387,7 +387,9 @@ struct MigrationState {
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
-
+void migration_calculate_complete(MigrationState *s);
+void migration_update_counters(MigrationState *s,
+                               int64_t current_time);
 void migration_fd_process_incoming(QEMUFile *f, Error **errp);
 void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
 void migration_incoming_process(void);
diff --git a/migration/savevm.c b/migration/savevm.c
index 1f1bc19224..b369d11b19 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1638,6 +1638,7 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
     qemu_mutex_lock_iothread();
 
     while (qemu_file_get_error(f) == 0) {
+        migration_update_counters(ms, qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
         if (qemu_savevm_state_iterate(f, false) > 0) {
             break;
         }
@@ -1660,6 +1661,9 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
     }
     migrate_set_state(&ms->state, MIGRATION_STATUS_SETUP, status);
 
+    migration_calculate_complete(ms);
+    trace_migration_status((int)ms->mbps / 8, (int)ms->pages_per_second, ms->total_time);
+
     /* f is outer parameter, it should not stay in global migration state after
      * this function finished */
     ms->to_dst_file = NULL;
diff --git a/migration/trace-events b/migration/trace-events
index 92161eeac5..23e4dad1ec 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -165,6 +165,7 @@ migration_return_path_end_after(int rp_error) "%d"
 migration_thread_after_loop(void) ""
 migration_thread_file_err(void) ""
 migration_thread_setup_complete(void) ""
+migration_status(int mbps, int pages_per_second, int64_t total_time) "%d MB/s, %d pages/s, %ld ms"
 open_return_path_on_source(void) ""
 open_return_path_on_source_continue(void) ""
 postcopy_start(void) ""
-- 
2.35.3



* [RFC PATCH v1 15/26] migration/multifd: Remove direct "socket" references
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (13 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 14/26] migration: Add completion tracepoint Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 16/26] migration/multifd: Allow multifd without packets Fabiano Rosas
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

We're about to enable support for other transports in multifd, so
remove direct references to sockets.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/multifd.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index cbc0dfe39b..e613d85e24 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -512,6 +512,11 @@ static void multifd_send_terminate_threads(Error *err)
     }
 }
 
+static int multifd_send_channel_destroy(QIOChannel *send)
+{
+    return socket_send_channel_destroy(send);
+}
+
 void multifd_save_cleanup(void)
 {
     int i;
@@ -534,7 +539,7 @@ void multifd_save_cleanup(void)
         if (p->registered_yank) {
             migration_ioc_unregister_yank(p->c);
         }
-        socket_send_channel_destroy(p->c);
+        multifd_send_channel_destroy(p->c);
         p->c = NULL;
         qemu_mutex_destroy(&p->mutex);
         qemu_sem_destroy(&p->sem);
@@ -889,20 +894,25 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
 static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
 {
     MultiFDSendParams *p = opaque;
-    QIOChannel *sioc = QIO_CHANNEL(qio_task_get_source(task));
+    QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
     Error *local_err = NULL;
 
     trace_multifd_new_send_channel_async(p->id);
     if (!qio_task_propagate_error(task, &local_err)) {
-        p->c = QIO_CHANNEL(sioc);
+        p->c = QIO_CHANNEL(ioc);
         qio_channel_set_delay(p->c, false);
         p->running = true;
-        if (multifd_channel_connect(p, sioc, local_err)) {
+        if (multifd_channel_connect(p, ioc, local_err)) {
             return;
         }
     }
 
-    multifd_new_send_channel_cleanup(p, sioc, local_err);
+    multifd_new_send_channel_cleanup(p, ioc, local_err);
+}
+
+static void multifd_new_send_channel_create(gpointer opaque)
+{
+    socket_send_channel_create(multifd_new_send_channel_async, opaque);
 }
 
 int multifd_save_setup(Error **errp)
@@ -951,7 +961,7 @@ int multifd_save_setup(Error **errp)
             p->write_flags = 0;
         }
 
-        socket_send_channel_create(multifd_new_send_channel_async, p);
+        multifd_new_send_channel_create(p);
     }
 
     for (i = 0; i < thread_count; i++) {
-- 
2.35.3



* [RFC PATCH v1 16/26] migration/multifd: Allow multifd without packets
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (14 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 15/26] migration/multifd: Remove direct "socket" references Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 17/26] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

For the upcoming 'fixed-ram' migration stream format, we cannot use
multifd packets because each write into the ramblock section of the
migration file is expected to contain only guest pages. They are
written at their respective offsets relative to the ramblock section
header.

There is no space for the packet information and the expected gains
from the new approach come partly from being able to write the pages
sequentially without extraneous data in between.
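
As a minimal illustration (the helper name is made up, not the patch's
API), the file position of a guest page under fixed-ram is a pure
function of the block's pages_offset and the page's offset within the
block, which is what lets each channel write its pages independently:

  #include <stdint.h>

  /* Illustration only: where a guest page lands in the migration file. */
  static inline uint64_t fixed_ram_file_offset(uint64_t pages_offset,
                                               uint64_t offset_in_block)
  {
      return pages_offset + offset_in_block;
  }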

The new format also doesn't need the packets: all necessary information
can be taken from the standard migration headers, with some (future)
changes to the multifd code.

Use the presence of the fixed-ram capability to decide whether to send
packets. For now this has no effect as fixed-ram cannot yet be enabled
with multifd.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c |   5 ++
 migration/migration.h |   2 +-
 migration/multifd.c   | 119 ++++++++++++++++++++++++++----------------
 3 files changed, 80 insertions(+), 46 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 17b26c1808..c647fbffa6 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2764,6 +2764,11 @@ int migrate_fixed_ram(void)
     return migrate_get_current()->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
 }
 
+bool migrate_multifd_use_packets(void)
+{
+    return !migrate_fixed_ram();
+}
+
 int migrate_multifd_zlib_level(void)
 {
     MigrationState *s;
diff --git a/migration/migration.h b/migration/migration.h
index 01c8201cfa..d7a014ce57 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -421,7 +421,7 @@ bool migrate_dirty_bitmaps(void);
 bool migrate_ignore_shared(void);
 bool migrate_validate_uuid(void);
 int migrate_fixed_ram(void);
-
+bool migrate_multifd_use_packets(void);
 bool migrate_auto_converge(void);
 bool migrate_use_multifd(void);
 bool migrate_pause_before_switchover(void);
diff --git a/migration/multifd.c b/migration/multifd.c
index e613d85e24..9f6b2787ed 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -659,18 +659,22 @@ static void *multifd_send_thread(void *opaque)
     Error *local_err = NULL;
     int ret = 0;
     bool use_zero_copy_send = migrate_use_zero_copy_send();
+    bool use_packets = migrate_multifd_use_packets();
 
     thread = MigrationThreadAdd(p->name, qemu_get_thread_id());
 
     trace_multifd_send_thread_start(p->id);
     rcu_register_thread();
 
-    if (multifd_send_initial_packet(p, &local_err) < 0) {
-        ret = -1;
-        goto out;
+    if (use_packets) {
+        if (multifd_send_initial_packet(p, &local_err) < 0) {
+            ret = -1;
+            goto out;
+        }
+
+        /* initial packet */
+        p->num_packets = 1;
     }
-    /* initial packet */
-    p->num_packets = 1;
 
     while (true) {
         qemu_sem_wait(&p->sem);
@@ -681,11 +685,10 @@ static void *multifd_send_thread(void *opaque)
         qemu_mutex_lock(&p->mutex);
 
         if (p->pending_job) {
-            uint64_t packet_num = p->packet_num;
             uint32_t flags;
             p->normal_num = 0;
 
-            if (use_zero_copy_send) {
+            if (!use_packets || use_zero_copy_send) {
                 p->iovs_num = 0;
             } else {
                 p->iovs_num = 1;
@@ -703,16 +706,20 @@ static void *multifd_send_thread(void *opaque)
                     break;
                 }
             }
-            multifd_send_fill_packet(p);
+
+            if (use_packets) {
+                multifd_send_fill_packet(p);
+                p->num_packets++;
+            }
+
             flags = p->flags;
             p->flags = 0;
-            p->num_packets++;
             p->total_normal_pages += p->normal_num;
             p->pages->num = 0;
             p->pages->block = NULL;
             qemu_mutex_unlock(&p->mutex);
 
-            trace_multifd_send(p->id, packet_num, p->normal_num, flags,
+            trace_multifd_send(p->id, p->packet_num, p->normal_num, flags,
                                p->next_packet_size);
 
             if (use_zero_copy_send) {
@@ -722,7 +729,7 @@ static void *multifd_send_thread(void *opaque)
                 if (ret != 0) {
                     break;
                 }
-            } else {
+            } else if (use_packets) {
                 /* Send header using the same writev call */
                 p->iov[0].iov_len = p->packet_len;
                 p->iov[0].iov_base = p->packet;
@@ -919,6 +926,7 @@ int multifd_save_setup(Error **errp)
 {
     int thread_count;
     uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
+    bool use_packets = migrate_multifd_use_packets();
     uint8_t i;
 
     if (!migrate_use_multifd()) {
@@ -943,14 +951,20 @@ int multifd_save_setup(Error **errp)
         p->pending_job = 0;
         p->id = i;
         p->pages = multifd_pages_init(page_count);
-        p->packet_len = sizeof(MultiFDPacket_t)
-                      + sizeof(uint64_t) * page_count;
-        p->packet = g_malloc0(p->packet_len);
-        p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
-        p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+
+        if (use_packets) {
+            p->packet_len = sizeof(MultiFDPacket_t)
+                          + sizeof(uint64_t) * page_count;
+            p->packet = g_malloc0(p->packet_len);
+            p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
+            p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+
+            /* We need one extra place for the packet header */
+            p->iov = g_new0(struct iovec, page_count + 1);
+        } else {
+            p->iov = g_new0(struct iovec, page_count);
+        }
         p->name = g_strdup_printf("multifdsend_%d", i);
-        /* We need one extra place for the packet header */
-        p->iov = g_new0(struct iovec, page_count + 1);
         p->normal = g_new0(ram_addr_t, page_count);
         p->page_size = qemu_target_page_size();
         p->page_count = page_count;
@@ -1082,7 +1096,7 @@ void multifd_recv_sync_main(void)
 {
     int i;
 
-    if (!migrate_use_multifd()) {
+    if (!migrate_use_multifd() || !migrate_multifd_use_packets()) {
         return;
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
@@ -1109,6 +1123,7 @@ static void *multifd_recv_thread(void *opaque)
 {
     MultiFDRecvParams *p = opaque;
     Error *local_err = NULL;
+    bool use_packets = migrate_multifd_use_packets();
     int ret;
 
     trace_multifd_recv_thread_start(p->id);
@@ -1121,17 +1136,20 @@ static void *multifd_recv_thread(void *opaque)
             break;
         }
 
-        ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
-                                       p->packet_len, &local_err);
-        if (ret == 0 || ret == -1) {   /* 0: EOF  -1: Error */
-            break;
-        }
+        if (use_packets) {
+            ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
+                                           p->packet_len, &local_err);
+            if (ret == 0 || ret == -1) {   /* 0: EOF  -1: Error */
+                break;
+            }
 
-        qemu_mutex_lock(&p->mutex);
-        ret = multifd_recv_unfill_packet(p, &local_err);
-        if (ret) {
-            qemu_mutex_unlock(&p->mutex);
-            break;
+            qemu_mutex_lock(&p->mutex);
+            ret = multifd_recv_unfill_packet(p, &local_err);
+            if (ret) {
+                qemu_mutex_unlock(&p->mutex);
+                break;
+            }
+            p->num_packets++;
         }
 
         flags = p->flags;
@@ -1139,7 +1157,7 @@ static void *multifd_recv_thread(void *opaque)
         p->flags &= ~MULTIFD_FLAG_SYNC;
         trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
                            p->next_packet_size);
-        p->num_packets++;
+
         p->total_normal_pages += p->normal_num;
         qemu_mutex_unlock(&p->mutex);
 
@@ -1174,6 +1192,7 @@ int multifd_load_setup(Error **errp)
 {
     int thread_count;
     uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
+    bool use_packets = migrate_multifd_use_packets();
     uint8_t i;
 
     /*
@@ -1198,9 +1217,12 @@ int multifd_load_setup(Error **errp)
         qemu_sem_init(&p->sem_sync, 0);
         p->quit = false;
         p->id = i;
-        p->packet_len = sizeof(MultiFDPacket_t)
-                      + sizeof(uint64_t) * page_count;
-        p->packet = g_malloc0(p->packet_len);
+
+        if (use_packets) {
+            p->packet_len = sizeof(MultiFDPacket_t)
+                + sizeof(uint64_t) * page_count;
+            p->packet = g_malloc0(p->packet_len);
+        }
         p->name = g_strdup_printf("multifdrecv_%d", i);
         p->iov = g_new0(struct iovec, page_count);
         p->normal = g_new0(ram_addr_t, page_count);
@@ -1246,18 +1268,26 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
 {
     MultiFDRecvParams *p;
     Error *local_err = NULL;
-    int id;
+    bool use_packets = migrate_multifd_use_packets();
+    int id, num_packets = 0;
 
-    id = multifd_recv_initial_packet(ioc, &local_err);
-    if (id < 0) {
-        multifd_recv_terminate_threads(local_err);
-        error_propagate_prepend(errp, local_err,
-                                "failed to receive packet"
-                                " via multifd channel %d: ",
-                                qatomic_read(&multifd_recv_state->count));
-        return;
+    if (use_packets) {
+        id = multifd_recv_initial_packet(ioc, &local_err);
+        if (id < 0) {
+            multifd_recv_terminate_threads(local_err);
+            error_propagate_prepend(errp, local_err,
+                                    "failed to receive packet"
+                                    " via multifd channel %d: ",
+                                    qatomic_read(&multifd_recv_state->count));
+            return;
+        }
+        trace_multifd_recv_new_channel(id);
+
+        /* initial packet */
+        num_packets = 1;
+    } else {
+        id = 0;
     }
-    trace_multifd_recv_new_channel(id);
 
     p = &multifd_recv_state->params[id];
     if (p->c != NULL) {
@@ -1268,9 +1298,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
         return;
     }
     p->c = ioc;
+    p->num_packets = num_packets;
     object_ref(OBJECT(ioc));
-    /* initial packet */
-    p->num_packets = 1;
 
     p->running = true;
     qemu_thread_create(&p->thread, p->name, multifd_recv_thread, p,
-- 
2.35.3



* [RFC PATCH v1 17/26] migration/multifd: Add outgoing QIOChannelFile support
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (15 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 16/26] migration/multifd: Allow multifd without packets Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 18/26] migration/multifd: Add incoming " Fabiano Rosas
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

Allow multifd to open file-backed channels. This will be used when
enabling the fixed-ram migration stream format, which expects a
seekable transport.

The QIOChannel read and write methods will use the preadv/pwritev
variants, which don't update the file offset at each call, so we can
reuse the fd without re-opening it for every channel.
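
The property relied on can be shown with a small standalone sketch
(plain POSIX, not the QIOChannel API): pwritev() takes an explicit
file offset and never advances the offset shared by the duplicated
fds, so every channel can reuse the same fd concurrently:

  #include <sys/types.h>
  #include <sys/uio.h>

  /* Write one buffer at an absolute position in the migration file. */
  static ssize_t write_buf_at(int fd, void *buf, size_t len, off_t file_off)
  {
      struct iovec iov = { .iov_base = buf, .iov_len = len };

      /* Does not move the file offset shared across channels. */
      return pwritev(fd, &iov, 1, file_off);
  }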

Note that this is just setup code and multifd cannot yet make use of
the file channels.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/io/channel-file.h |  1 +
 migration/file.c          | 63 +++++++++++++++++++++++++++++++++++++--
 migration/file.h          |  6 +++-
 migration/migration.c     | 11 ++++++-
 migration/migration.h     |  1 +
 migration/multifd.c       | 14 +++++++--
 6 files changed, 89 insertions(+), 7 deletions(-)

diff --git a/include/io/channel-file.h b/include/io/channel-file.h
index 50e8eb1138..85b6c34a72 100644
--- a/include/io/channel-file.h
+++ b/include/io/channel-file.h
@@ -22,6 +22,7 @@
 #define QIO_CHANNEL_FILE_H
 
 #include "io/channel.h"
+#include "io/task.h"
 #include "qom/object.h"
 
 #define TYPE_QIO_CHANNEL_FILE "qio-channel-file"
diff --git a/migration/file.c b/migration/file.c
index ab4e12926c..f674cd1bdb 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -1,20 +1,77 @@
 #include "qemu/osdep.h"
-#include "channel.h"
 #include "io/channel-file.h"
 #include "file.h"
 #include "qemu/error-report.h"
 
+static struct FileOutgoingArgs {
+    char *fname;
+    int flags;
+    int mode;
+} outgoing_args;
+
+static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
+{
+    /* noop */
+}
+
+static void file_migration_cancel(Error *errp)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                      MIGRATION_STATUS_FAILED);
+    migration_cancel(errp);
+}
+
+int file_send_channel_destroy(QIOChannel *ioc)
+{
+    if (ioc) {
+        qio_channel_close(ioc, NULL);
+        object_unref(OBJECT(ioc));
+    }
+    g_free(outgoing_args.fname);
+    outgoing_args.fname = NULL;
+
+    return 0;
+}
+
+void file_send_channel_create(QIOTaskFunc f, void *data)
+{
+    QIOChannelFile *ioc;
+    QIOTask *task;
+    Error *errp = NULL;
+
+    ioc = qio_channel_file_new_path(outgoing_args.fname,
+                                    outgoing_args.flags,
+                                    outgoing_args.mode, &errp);
+    if (!ioc) {
+        file_migration_cancel(errp);
+        return;
+    }
+
+    task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
+    qio_task_run_in_thread(task, qio_channel_file_connect_worker,
+                           (gpointer)data, NULL, NULL);
+}
 
 void file_start_outgoing_migration(MigrationState *s, const char *fname, Error **errp)
 {
     QIOChannelFile *ioc;
+    int flags = O_CREAT | O_TRUNC | O_WRONLY;
+    mode_t mode = 0660;
 
-    ioc = qio_channel_file_new_path(fname, O_CREAT | O_TRUNC | O_WRONLY, 0660, errp);
+    ioc = qio_channel_file_new_path(fname, flags, mode, errp);
     if (!ioc) {
-        error_report("Error creating a channel");
+        error_report("Error creating migration outgoing channel");
         return;
     }
 
+    outgoing_args.fname = g_strdup(fname);
+    outgoing_args.flags = flags;
+    outgoing_args.mode = mode;
+
     qio_channel_set_name(QIO_CHANNEL(ioc), "migration-file-outgoing");
     migration_channel_connect(s, QIO_CHANNEL(ioc), NULL, NULL);
     object_unref(OBJECT(ioc));
diff --git a/migration/file.h b/migration/file.h
index cdbd291322..5e27ca6afd 100644
--- a/migration/file.h
+++ b/migration/file.h
@@ -1,10 +1,14 @@
 #ifndef QEMU_MIGRATION_FILE_H
 #define QEMU_MIGRATION_FILE_H
 
+#include "io/task.h"
+#include "channel.h"
+
 void file_start_outgoing_migration(MigrationState *s,
                                    const char *filename,
                                    Error **errp);
 
 void file_start_incoming_migration(const char *fname, Error **errp);
+void file_send_channel_create(QIOTaskFunc f, void *data);
+int file_send_channel_destroy(QIOChannel *ioc);
 #endif
-
diff --git a/migration/migration.c b/migration/migration.c
index c647fbffa6..6594c2f404 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -194,7 +194,7 @@ static bool migration_needs_multiple_sockets(void)
 static bool uri_supports_multi_channels(const char *uri)
 {
     return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
-           strstart(uri, "vsock:", NULL);
+           strstart(uri, "vsock:", NULL) || strstart(uri, "file:", NULL);
 }
 
 static bool
@@ -2740,6 +2740,15 @@ bool migrate_pause_before_switchover(void)
         MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER];
 }
 
+bool migrate_to_file(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return qemu_file_is_seekable(s->to_dst_file);
+}
+
 int migrate_multifd_channels(void)
 {
     MigrationState *s;
diff --git a/migration/migration.h b/migration/migration.h
index d7a014ce57..8459201958 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -425,6 +425,7 @@ bool migrate_multifd_use_packets(void);
 bool migrate_auto_converge(void);
 bool migrate_use_multifd(void);
 bool migrate_pause_before_switchover(void);
+bool migrate_to_file(void);
 int migrate_multifd_channels(void);
 MultiFDCompression migrate_multifd_compression(void);
 int migrate_multifd_zlib_level(void);
diff --git a/migration/multifd.c b/migration/multifd.c
index 9f6b2787ed..50bd9b32eb 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -17,6 +17,7 @@
 #include "exec/ramblock.h"
 #include "qemu/error-report.h"
 #include "qapi/error.h"
+#include "file.h"
 #include "ram.h"
 #include "migration.h"
 #include "socket.h"
@@ -27,6 +28,7 @@
 #include "threadinfo.h"
 
 #include "qemu/yank.h"
+#include "io/channel-file.h"
 #include "io/channel-socket.h"
 #include "yank_functions.h"
 
@@ -514,7 +516,11 @@ static void multifd_send_terminate_threads(Error *err)
 
 static int multifd_send_channel_destroy(QIOChannel *send)
 {
-    return socket_send_channel_destroy(send);
+    if (migrate_to_file()) {
+        return file_send_channel_destroy(send);
+    } else {
+        return socket_send_channel_destroy(send);
+    }
 }
 
 void multifd_save_cleanup(void)
@@ -919,7 +925,11 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
 
 static void multifd_new_send_channel_create(gpointer opaque)
 {
-    socket_send_channel_create(multifd_new_send_channel_async, opaque);
+    if (migrate_to_file()) {
+        file_send_channel_create(multifd_new_send_channel_async, opaque);
+    } else {
+        socket_send_channel_create(multifd_new_send_channel_async, opaque);
+    }
 }
 
 int multifd_save_setup(Error **errp)
-- 
2.35.3



* [RFC PATCH v1 18/26] migration/multifd: Add incoming QIOChannelFile support
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (16 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 17/26] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 19/26] migration/multifd: Add pages to the receiving side Fabiano Rosas
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

On the receiving side we don't need to differentiate between the main
channel and the multifd threads, so whichever channel is set up first
gets to be the main one. And since there are no packets, use the atomic
channel count to index into the params array.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/file.c      | 38 +++++++++++++++++++++++++++++++++-----
 migration/migration.c |  2 ++
 migration/multifd.c   |  7 ++++++-
 migration/multifd.h   |  1 +
 4 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/migration/file.c b/migration/file.c
index f674cd1bdb..6f40894488 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -2,6 +2,7 @@
 #include "io/channel-file.h"
 #include "file.h"
 #include "qemu/error-report.h"
+#include "migration.h"
 
 static struct FileOutgoingArgs {
     char *fname;
@@ -77,17 +78,44 @@ void file_start_outgoing_migration(MigrationState *s, const char *fname, Error *
     object_unref(OBJECT(ioc));
 }
 
+static void file_process_migration_incoming(QIOTask *task, gpointer opaque)
+{
+    QIOChannelFile *ioc = opaque;
+
+    migration_channel_process_incoming(QIO_CHANNEL(ioc));
+    object_unref(OBJECT(ioc));
+}
+
 void file_start_incoming_migration(const char *fname, Error **errp)
 {
     QIOChannelFile *ioc;
+    QIOTask *task;
+    int channels = 1;
+    int i = 0, fd;
 
     ioc = qio_channel_file_new_path(fname, O_RDONLY, 0, errp);
     if (!ioc) {
-        error_report("Error creating a channel");
+        goto out;
+    }
+
+    if (migrate_use_multifd()) {
+        channels += migrate_multifd_channels();
+    }
+
+    fd = ioc->fd;
+
+    do {
+        qio_channel_set_name(QIO_CHANNEL(ioc), "migration-file-incoming");
+        task = qio_task_new(OBJECT(ioc), file_process_migration_incoming,
+                            (gpointer)ioc, NULL);
+
+        qio_task_run_in_thread(task, qio_channel_file_connect_worker,
+                               (gpointer)ioc, NULL, NULL);
+    } while (++i < channels && (ioc = qio_channel_file_new_fd(fd)));
+
+out:
+    if (!ioc) {
+        error_report("Error creating migration incoming channel");
         return;
     }
-
-    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-file-incoming");
-    migration_channel_process_incoming(QIO_CHANNEL(ioc));
-    object_unref(OBJECT(ioc));
 }
diff --git a/migration/migration.c b/migration/migration.c
index 6594c2f404..258709aee1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -794,6 +794,8 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
         }
 
         default_channel = (channel_magic == cpu_to_be32(QEMU_VM_FILE_MAGIC));
+    } else if (migrate_use_multifd() && migrate_fixed_ram()) {
+        default_channel = multifd_recv_first_channel();
     } else {
         default_channel = !mis->from_src_file;
     }
diff --git a/migration/multifd.c b/migration/multifd.c
index 50bd9b32eb..1332b426ce 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1254,6 +1254,11 @@ int multifd_load_setup(Error **errp)
     return 0;
 }
 
+bool multifd_recv_first_channel(void)
+{
+    return !multifd_recv_state;
+}
+
 bool multifd_recv_all_channels_created(void)
 {
     int thread_count = migrate_multifd_channels();
@@ -1296,7 +1301,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
         /* initial packet */
         num_packets = 1;
     } else {
-        id = 0;
+        id = qatomic_read(&multifd_recv_state->count);
     }
 
     p = &multifd_recv_state->params[id];
diff --git a/migration/multifd.h b/migration/multifd.h
index 7cfc265148..354150ff55 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -18,6 +18,7 @@ void multifd_save_cleanup(void);
 int multifd_load_setup(Error **errp);
 void multifd_load_cleanup(void);
 void multifd_load_shutdown(void);
+bool multifd_recv_first_channel(void);
 bool multifd_recv_all_channels_created(void);
 void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
-- 
2.35.3



* [RFC PATCH v1 19/26] migration/multifd: Add pages to the receiving side
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (17 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 18/26] migration/multifd: Add incoming " Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 20/26] io: Add a pwritev/preadv version that takes a discontiguous iovec Fabiano Rosas
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

Currently multifd does not need to have knowledge of pages on the
receiving side because all the information needed is within the
packets that come in the stream.

We're about to add support for fixed-ram migration, which cannot use
packets because it expects the ramblock section in the migration file
to contain only the guest page data.

Add a MultiFDPages_t pointer to multifd_recv_state and use the pages
similarly to what we already do on the sending side. The pages are used
to transfer data between the ram migration code in the main migration
thread and the multifd receiving threads.
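
As a rough generic sketch of the handoff this enables (pthread-based,
names made up, not the QEMU API): the producer fills a batch of page
offsets and, once it is full, trades it for an idle worker's drained
batch and wakes that worker, which mirrors the
multifd_recv_queue_page()/multifd_recv_pages() pair below:

  #include <pthread.h>
  #include <stddef.h>

  typedef struct {
      size_t offsets[128];
      size_t num;
  } Batch;

  typedef struct {
      pthread_mutex_t lock;
      pthread_cond_t cond;
      Batch *batch;     /* batch currently owned by this worker */
      int pending;      /* set when the worker has a full batch queued */
  } Worker;

  /* Producer side: give 'full' to the worker, get its drained batch back. */
  static Batch *hand_off_batch(Worker *w, Batch *full)
  {
      Batch *empty;

      pthread_mutex_lock(&w->lock);
      empty = w->batch;
      w->batch = full;
      w->pending = 1;
      pthread_cond_signal(&w->cond);
      pthread_mutex_unlock(&w->lock);

      empty->num = 0;
      return empty;
  }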

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/multifd.c | 98 +++++++++++++++++++++++++++++++++++++++++++++
 migration/multifd.h | 12 ++++++
 2 files changed, 110 insertions(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index 1332b426ce..20ef665218 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1004,6 +1004,8 @@ int multifd_save_setup(Error **errp)
 
 struct {
     MultiFDRecvParams *params;
+    /* array of pages to receive */
+    MultiFDPages_t *pages;
     /* number of created threads */
     int count;
     /* syncs main thread and channels */
@@ -1014,6 +1016,66 @@ struct {
     MultiFDMethods *ops;
 } *multifd_recv_state;
 
+static int multifd_recv_pages(QEMUFile *f)
+{
+    int i;
+    static int next_recv_channel;
+    MultiFDRecvParams *p = NULL;
+    MultiFDPages_t *pages = multifd_recv_state->pages;
+
+    /*
+     * next_channel can remain from a previous migration that was
+     * using more channels, so ensure it doesn't overflow if the
+     * limit is lower now.
+     */
+    next_recv_channel %= migrate_multifd_channels();
+    for (i = next_recv_channel;; i = (i + 1) % migrate_multifd_channels()) {
+        p = &multifd_recv_state->params[i];
+
+        qemu_mutex_lock(&p->mutex);
+        if (p->quit) {
+            error_report("%s: channel %d has already quit!", __func__, i);
+            qemu_mutex_unlock(&p->mutex);
+            return -1;
+        }
+        if (!p->pending_job) {
+            p->pending_job++;
+            next_recv_channel = (i + 1) % migrate_multifd_channels();
+            break;
+        }
+        qemu_mutex_unlock(&p->mutex);
+    }
+
+    multifd_recv_state->pages = p->pages;
+    p->pages = pages;
+    qemu_mutex_unlock(&p->mutex);
+    qemu_sem_post(&p->sem);
+
+    return 1;
+}
+
+int multifd_recv_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset)
+{
+    MultiFDPages_t *pages = multifd_recv_state->pages;
+
+    if (!pages->block) {
+        pages->block = block;
+    }
+
+    pages->offset[pages->num] = offset;
+    pages->num++;
+
+    if (pages->num < pages->allocated) {
+        return 1;
+    }
+
+    if (multifd_recv_pages(f) < 0) {
+        return -1;
+    }
+
+    return 1;
+}
+
 static void multifd_recv_terminate_threads(Error *err)
 {
     int i;
@@ -1035,6 +1097,7 @@ static void multifd_recv_terminate_threads(Error *err)
 
         qemu_mutex_lock(&p->mutex);
         p->quit = true;
+        qemu_sem_post(&p->sem);
         /*
          * We could arrive here for two reasons:
          *  - normal quit, i.e. everything went fine, just finished
@@ -1083,9 +1146,12 @@ void multifd_load_cleanup(void)
         object_unref(OBJECT(p->c));
         p->c = NULL;
         qemu_mutex_destroy(&p->mutex);
+        qemu_sem_destroy(&p->sem);
         qemu_sem_destroy(&p->sem_sync);
         g_free(p->name);
         p->name = NULL;
+        multifd_pages_clear(p->pages);
+        p->pages = NULL;
         p->packet_len = 0;
         g_free(p->packet);
         p->packet = NULL;
@@ -1098,6 +1164,8 @@ void multifd_load_cleanup(void)
     qemu_sem_destroy(&multifd_recv_state->sem_sync);
     g_free(multifd_recv_state->params);
     multifd_recv_state->params = NULL;
+    multifd_pages_clear(multifd_recv_state->pages);
+    multifd_recv_state->pages = NULL;
     g_free(multifd_recv_state);
     multifd_recv_state = NULL;
 }
@@ -1160,6 +1228,25 @@ static void *multifd_recv_thread(void *opaque)
                 break;
             }
             p->num_packets++;
+        } else {
+            /*
+             * No packets, so we need to wait for the vmstate code to
+             * queue pages.
+             */
+            qemu_sem_wait(&p->sem);
+            qemu_mutex_lock(&p->mutex);
+            if (!p->pending_job) {
+                qemu_mutex_unlock(&p->mutex);
+                break;
+            }
+
+            for (int i = 0; i < p->pages->num; i++) {
+                p->normal[p->normal_num] = p->pages->offset[i];
+                p->normal_num++;
+            }
+
+            p->pages->num = 0;
+            p->host = p->pages->block->host;
         }
 
         flags = p->flags;
@@ -1182,6 +1269,13 @@ static void *multifd_recv_thread(void *opaque)
             qemu_sem_post(&multifd_recv_state->sem_sync);
             qemu_sem_wait(&p->sem_sync);
         }
+
+        if (!use_packets) {
+            qemu_mutex_lock(&p->mutex);
+            p->pending_job--;
+            p->pages->block = NULL;
+            qemu_mutex_unlock(&p->mutex);
+        }
     }
 
     if (local_err) {
@@ -1216,6 +1310,7 @@ int multifd_load_setup(Error **errp)
     thread_count = migrate_multifd_channels();
     multifd_recv_state = g_malloc0(sizeof(*multifd_recv_state));
     multifd_recv_state->params = g_new0(MultiFDRecvParams, thread_count);
+    multifd_recv_state->pages = multifd_pages_init(page_count);
     qatomic_set(&multifd_recv_state->count, 0);
     qemu_sem_init(&multifd_recv_state->sem_sync, 0);
     multifd_recv_state->ops = multifd_ops[migrate_multifd_compression()];
@@ -1224,9 +1319,12 @@ int multifd_load_setup(Error **errp)
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
         qemu_mutex_init(&p->mutex);
+        qemu_sem_init(&p->sem, 0);
         qemu_sem_init(&p->sem_sync, 0);
         p->quit = false;
+        p->pending_job = 0;
         p->id = i;
+        p->pages = multifd_pages_init(page_count);
 
         if (use_packets) {
             p->packet_len = sizeof(MultiFDPacket_t)
diff --git a/migration/multifd.h b/migration/multifd.h
index 354150ff55..2f008217c3 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -24,6 +24,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
 int multifd_send_sync_main(QEMUFile *f);
 int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
+int multifd_recv_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
 
 /* Multifd Compression flags */
 #define MULTIFD_FLAG_SYNC (1 << 0)
@@ -153,7 +154,11 @@ typedef struct {
     uint32_t page_size;
     /* number of pages in a full packet */
     uint32_t page_count;
+    /* multifd flags for receiving ram */
+    int read_flags;
 
+    /* sem where to wait for more work */
+    QemuSemaphore sem;
     /* syncs main thread and channels */
     QemuSemaphore sem_sync;
 
@@ -167,6 +172,13 @@ typedef struct {
     uint32_t flags;
     /* global number of generated multifd packets */
     uint64_t packet_num;
+    int pending_job;
+    /* array of pages to send.
+     * The owner of 'pages' depends on the 'pending_job' value:
+     * pending_job == 0 -> migration_thread can use it.
+     * pending_job != 0 -> multifd_channel can use it.
+     */
+    MultiFDPages_t *pages;
 
     /* thread local variables. No locking required */
 
-- 
2.35.3




* [RFC PATCH v1 20/26] io: Add a pwritev/preadv version that takes a discontiguous iovec
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (18 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 19/26] migration/multifd: Add pages to the receiving side Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 21/26] migration/ram: Add a wrapper for fixed-ram shadow bitmap Fabiano Rosas
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

For the upcoming fixed-ram migration support with multifd, we need
to be able to accept an iovec array with non-contiguous data.

Add pwritev and preadv versions that split the array into contiguous
segments before writing or reading. With that, the ram code can
continue to add pages in any order and the multifd code can continue
to send large arrays for reading and writing.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
Since iovs can be non-contiguous, we'd need a separate array on the
side to carry an extra file offset for each of them, so I'm relying on
the fact that the iovs are all within the same host page and passing
in an encoded offset that takes the host page into account.
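
As a side note, the slicing rule is easy to see in a standalone
example (illustrative code only, not the series' API; split_slices()
and the sample addresses are invented): adjacent iov elements are
grouped into one slice, and each slice is issued at the offset plus
the host address of its first element.

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

static void split_slices(const struct iovec *iov, size_t niov, off_t offset)
{
    size_t slice_idx = 0, slice_num = 1;

    for (size_t i = 0; i < niov; i++, slice_num++) {
        uintptr_t base = (uintptr_t)iov[i].iov_base;

        /* keep growing the slice while the next element is adjacent */
        if (i != niov - 1 &&
            base + iov[i].iov_len == (uintptr_t)iov[i + 1].iov_base) {
            continue;
        }

        printf("slice of %zu iov(s) at file offset 0x%llx\n", slice_num,
               (unsigned long long)(offset + (uintptr_t)iov[slice_idx].iov_base));

        slice_idx += slice_num;
        slice_num = 0;
    }
}

int main(void)
{
    static char buf[4 * 4096];
    struct iovec iov[3] = {
        { .iov_base = buf,            .iov_len = 4096 }, /* contiguous with */
        { .iov_base = buf + 4096,     .iov_len = 4096 }, /* the previous one */
        { .iov_base = buf + 3 * 4096, .iov_len = 4096 }, /* gap: new slice */
    };
    /* like write_base in the series: pages_offset - host, pages_offset=0x100000 */
    off_t write_base = 0x100000 - (off_t)(uintptr_t)buf;

    split_slices(iov, 3, write_base);   /* prints 0x100000 and 0x103000 */
    return 0;
}

Because each call gets a single file offset, this is what lets the
existing pwritev/preadv channel functions be reused for a sparse set
of pages.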
---
 include/io/channel.h | 50 +++++++++++++++++++++++++++
 io/channel.c         | 82 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 28bce7ef17..7c4d2432cf 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -33,8 +33,10 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass,
 #define QIO_CHANNEL_ERR_BLOCK -2
 
 #define QIO_CHANNEL_WRITE_FLAG_ZERO_COPY 0x1
+#define QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET 0x2
 
 #define QIO_CHANNEL_READ_FLAG_MSG_PEEK 0x1
+#define QIO_CHANNEL_READ_FLAG_WITH_OFFSET 0x2
 
 typedef enum QIOChannelFeature QIOChannelFeature;
 
@@ -541,6 +543,30 @@ int qio_channel_close(QIOChannel *ioc,
 ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
                                  size_t niov, off_t offset, Error **errp);
 
+/**
+ * qio_channel_write_full_all:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to write data from
+ * @niov: the length of the @iov array
+ * @offset: the iovec offset in the file where to write the data
+ * @fds: an array of file handles to send
+ * @nfds: number of file handles in @fds
+ * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
+ * @errp: pointer to a NULL-initialized error object
+ *
+ *
+ * Selects between a writev or pwritev channel writer function.
+ *
+ * If QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET is passed in flags, pwritev is
+ * used and @offset is expected to be a meaningful value, @fds and
+ * @nfds are ignored; otherwise uses writev and @offset is ignored.
+ *
+ * Returns: 0 if all bytes were written, or -1 on error
+ */
+int qio_channel_write_full_all(QIOChannel *ioc, const struct iovec *iov,
+                               size_t niov, off_t offset, int *fds, size_t nfds,
+                               int flags, Error **errp);
+
 /**
  * qio_channel_pwritev
  * @ioc: the channel object
@@ -577,6 +603,30 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
 ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
                                 size_t niov, off_t offset, Error **errp);
 
+/**
+ * qio_channel_read_full_all:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to read data to
+ * @niov: the length of the @iov array
+ * @offset: the iovec offset in the file from where to read the data
+ * @fds: an array of file handles to send
+ * @nfds: number of file handles in @fds
+ * @flags: read flags (QIO_CHANNEL_READ_FLAG_*)
+ * @errp: pointer to a NULL-initialized error object
+ *
+ *
+ * Selects between a readv or preadv channel reader function.
+ *
+ * If QIO_CHANNEL_READ_FLAG_WITH_OFFSET is passed in flags, preadv is
+ * used and @offset is expected to be a meaningful value, @fds and
+ * @nfds are ignored; otherwise uses readv and @offset is ignored.
+ *
+ * Returns: 0 if all bytes were read, or -1 on error
+ */
+int qio_channel_read_full_all(QIOChannel *ioc, const struct iovec *iov,
+                              size_t niov, off_t offset,
+                              int flags, Error **errp);
+
 /**
  * qio_channel_preadv
  * @ioc: the channel object
diff --git a/io/channel.c b/io/channel.c
index 312445b3aa..64b26040c2 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -463,6 +463,76 @@ ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
     return klass->io_pwritev(ioc, iov, niov, offset, errp);
 }
 
+static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
+                                                 const struct iovec *iov,
+                                                 size_t niov, off_t offset,
+                                                 bool is_write, Error **errp)
+{
+    ssize_t ret;
+    int i, slice_idx, slice_num;
+    uint64_t base, next, file_offset;
+    size_t len;
+
+    slice_idx = 0;
+    slice_num = 1;
+
+    /*
+     * If the iov array doesn't have contiguous elements, we need to
+     * split it in slices because we only have one (file) 'offset' for
+     * the whole iov. Do this here so callers don't need to break the
+     * iov array themselves.
+     */
+    for (i = 0; i < niov; i++, slice_num++) {
+        base = (uint64_t) iov[i].iov_base;
+
+        if (i != niov - 1) {
+            len = iov[i].iov_len;
+            next = (uint64_t) iov[i + 1].iov_base;
+
+            if (base + len == next) {
+                continue;
+            }
+        }
+
+        /*
+         * Use the offset of the first element of the segment that
+         * we're sending.
+         */
+        file_offset = offset + (uint64_t) iov[slice_idx].iov_base;
+
+        if (is_write) {
+            ret = qio_channel_pwritev_full(ioc, &iov[slice_idx], slice_num,
+                                           file_offset, errp);
+        } else {
+            ret = qio_channel_preadv_full(ioc, &iov[slice_idx], slice_num,
+                                          file_offset, errp);
+        }
+
+        if (ret < 0) {
+            break;
+        }
+
+        slice_idx += slice_num;
+        slice_num = 0;
+    }
+
+    return (ret < 0) ? -1 : 0;
+}
+
+int qio_channel_write_full_all(QIOChannel *ioc,
+                                const struct iovec *iov,
+                                size_t niov, off_t offset,
+                                int *fds, size_t nfds,
+                                int flags, Error **errp)
+{
+    if (flags & QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET) {
+        return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
+                                                     offset, true, errp);
+    }
+
+    return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0, flags, errp);
+}
+
 ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
                             off_t offset, Error **errp)
 {
@@ -492,6 +562,18 @@ ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
     return klass->io_preadv(ioc, iov, niov, offset, errp);
 }
 
+int qio_channel_read_full_all(QIOChannel *ioc, const struct iovec *iov,
+                              size_t niov, off_t offset,
+                              int flags, Error **errp)
+{
+    if (flags & QIO_CHANNEL_READ_FLAG_WITH_OFFSET) {
+        return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
+                                                     offset, false, errp);
+    }
+
+    return qio_channel_readv_full_all(ioc, iov, niov, NULL, NULL, errp);
+}
+
 ssize_t qio_channel_preadv(QIOChannel *ioc, char *buf, size_t buflen,
                            off_t offset, Error **errp)
 {
-- 
2.35.3




* [RFC PATCH v1 21/26] migration/ram: Add a wrapper for fixed-ram shadow bitmap
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (19 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 20/26] io: Add a pwritev/preadv version that takes a discontiguous iovec Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 22/26] migration/multifd: Support outgoing fixed-ram stream format Fabiano Rosas
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

We'll need to set the shadow_bmap bits from outside ram.c soon, and
TARGET_PAGE_BITS is poisoned there, so add a wrapper for it.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/ram.c | 9 +++++++++
 migration/ram.h | 1 +
 2 files changed, 10 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 1666ce6d5f..e9b28c16da 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3359,6 +3359,15 @@ static void ram_save_shadow_bmap(QEMUFile *f)
     }
 }
 
+void ramblock_set_shadow_bmap(RAMBlock *block, ram_addr_t offset, bool set)
+{
+    if (set) {
+        set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
+    } else {
+        clear_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
+    }
+}
+
 /**
  * ram_save_iterate: iterative stage for migration
  *
diff --git a/migration/ram.h b/migration/ram.h
index 81cbb0947c..8d8258fee1 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -98,6 +98,7 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
 bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
 void postcopy_preempt_shutdown_file(MigrationState *s);
 void *postcopy_preempt_thread(void *opaque);
+void ramblock_set_shadow_bmap(RAMBlock *block, ram_addr_t offset, bool set);
 
 /* ram cache */
 int colo_init_ram_cache(void);
-- 
2.35.3




* [RFC PATCH v1 22/26] migration/multifd: Support outgoing fixed-ram stream format
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (20 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 21/26] migration/ram: Add a wrapper for fixed-ram shadow bitmap Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 23/26] migration/multifd: Support incoming " Fabiano Rosas
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

The new fixed-ram stream format uses a file transport and puts ram
pages in the migration file at their respective offsets. Writing can
be done in parallel by using the pwritev system call, which takes
iovecs and an offset.

Add support for enabling the new format along with multifd to make use
of the threading and page handling already in place.

This requires multifd to stop sending headers and leave the stream
format to the fixed-ram code. When it comes time to write the data, we
need to call a version of qio_channel_write that can take an offset.

Usage on HMP is:

(qemu) stop
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_parameter max-bandwidth 0
(qemu) migrate_set_parameter multifd-channels 8
(qemu) migrate file:migfile
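
For reference, a tiny standalone example (made-up addresses, not QEMU
code) of the offset arithmetic the write path uses: with
write_base = pages_offset - host, the file offset passed to pwritev
for any iov ends up being pages_offset plus the page's offset within
the ramblock.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t pages_offset = 0x100000;        /* ramblock payload in the file */
    uint64_t host         = 0x7f0000000000;  /* ramblock mapping in memory */
    uint64_t iov_base     = host + 5 * 4096; /* 6th guest page of the block */

    uint64_t write_base = pages_offset - host;

    /* prints "file offset = 0x105000" */
    printf("file offset = 0x%" PRIx64 "\n", write_base + iov_base);
    return 0;
}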

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c |  5 -----
 migration/multifd.c   | 51 +++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 258709aee1..77d24a5114 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1345,11 +1345,6 @@ static bool migrate_caps_check(bool *cap_list,
 #endif
 
     if (cap_list[MIGRATION_CAPABILITY_FIXED_RAM]) {
-        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
-            error_setg(errp, "Directly mapped memory incompatible with multifd");
-            return false;
-        }
-
         if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
             error_setg(errp, "Directly mapped memory incompatible with xbzrle");
             return false;
diff --git a/migration/multifd.c b/migration/multifd.c
index 20ef665218..cc70b20ff7 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -256,6 +256,19 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
     g_free(pages);
 }
 
+static void multifd_set_file_bitmap(MultiFDSendParams *p, bool set)
+{
+    MultiFDPages_t *pages = p->pages;
+
+    if (!pages->block) {
+        return;
+    }
+
+    for (int i = 0; i < p->normal_num; i++) {
+        ramblock_set_shadow_bmap(pages->block, pages->offset[i], set);
+    }
+}
+
 static void multifd_send_fill_packet(MultiFDSendParams *p)
 {
     MultiFDPacket_t *packet = p->packet;
@@ -608,6 +621,17 @@ int multifd_send_sync_main(QEMUFile *f)
         }
     }
 
+    if (!migrate_multifd_use_packets()) {
+        for (i = 0; i < migrate_multifd_channels(); i++) {
+            MultiFDSendParams *p = &multifd_send_state->params[i];
+
+            qemu_sem_post(&p->sem);
+            continue;
+        }
+
+        return 0;
+    }
+
     /*
      * When using zero-copy, it's necessary to flush the pages before any of
      * the pages can be sent again, so we'll make sure the new version of the
@@ -692,6 +716,8 @@ static void *multifd_send_thread(void *opaque)
 
         if (p->pending_job) {
             uint32_t flags;
+            uint64_t write_base;
+
             p->normal_num = 0;
 
             if (!use_packets || use_zero_copy_send) {
@@ -716,6 +742,16 @@ static void *multifd_send_thread(void *opaque)
             if (use_packets) {
                 multifd_send_fill_packet(p);
                 p->num_packets++;
+                write_base = 0;
+            } else {
+                multifd_set_file_bitmap(p, true);
+
+                /*
+                 * If we subtract the host address now, we don't need to
+                 * pass it into qio_channel_write_full_all() below.
+                 */
+                write_base = p->pages->block->pages_offset -
+                    (uint64_t)p->pages->block->host;
             }
 
             flags = p->flags;
@@ -741,8 +777,9 @@ static void *multifd_send_thread(void *opaque)
                 p->iov[0].iov_base = p->packet;
             }
 
-            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
-                                              0, p->write_flags, &local_err);
+            ret = qio_channel_write_full_all(p->c, p->iov, p->iovs_num,
+                                             write_base, NULL, 0,
+                                             p->write_flags, &local_err);
             if (ret != 0) {
                 break;
             }
@@ -758,6 +795,13 @@ static void *multifd_send_thread(void *opaque)
         } else if (p->quit) {
             qemu_mutex_unlock(&p->mutex);
             break;
+        } else if (!use_packets) {
+            /*
+             * When migrating to a file there's not need for a SYNC
+             * packet, the channels are ready right away.
+             */
+            qemu_sem_post(&multifd_send_state->channels_ready);
+            qemu_mutex_unlock(&p->mutex);
         } else {
             qemu_mutex_unlock(&p->mutex);
             /* sometimes there are spurious wakeups */
@@ -767,6 +811,7 @@ static void *multifd_send_thread(void *opaque)
 out:
     if (local_err) {
         trace_multifd_send_error(p->id);
+        multifd_set_file_bitmap(p, false);
         multifd_send_terminate_threads(local_err);
         error_free(local_err);
     }
@@ -981,6 +1026,8 @@ int multifd_save_setup(Error **errp)
 
         if (migrate_use_zero_copy_send()) {
             p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
+        } else if (!use_packets) {
+            p->write_flags |= QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET;
         } else {
             p->write_flags = 0;
         }
-- 
2.35.3




* [RFC PATCH v1 23/26] migration/multifd: Support incoming fixed-ram stream format
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (21 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 22/26] migration/multifd: Support outgoing fixed-ram stream format Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 24/26] tests/qtest: Add a multifd + fixed-ram migration test Fabiano Rosas
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

For the incoming fixed-ram migration we need to read the ramblock
headers, get the pages bitmap and send the host address of each
non-zero page to the multifd channel thread so the data can be read
from the file into it.

To read from the migration file we need a preadv function that can
read into the iovs in segments of contiguous pages because (as in the
writing case) the file offset applies to the entire iovec.
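
A rough standalone sketch of that flow (invented names; queue_page()
stands in for multifd_recv_queue_page() and PAGE_SHIFT for the target
page bits): walk the received bitmap and queue the offset of every
page that is present in the file.

#include <inttypes.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

static void queue_page(uint64_t offset)
{
    /* the series hands this offset to one of the multifd channels */
    printf("queue page at ramblock offset 0x%" PRIx64 "\n", offset);
}

int main(void)
{
    /* pages 0 and 3, plus the first page of the second bitmap word */
    unsigned long bmap[2] = { (1UL << 0) | (1UL << 3), 1UL << 0 };
    uint64_t nr_pages = 2 * BITS_PER_LONG;

    for (uint64_t page = 0; page < nr_pages; page++) {
        if (bmap[page / BITS_PER_LONG] & (1UL << (page % BITS_PER_LONG))) {
            queue_page(page << PAGE_SHIFT);
        }
    }
    return 0;
}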

Usage on HMP is:

(qemu) migrate_set_capability multifd on
(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_parameter max-bandwidth 0
(qemu) migrate_set_parameter multifd-channels 8
(qemu) migrate_incoming file:migfile
(qemu) info status
(qemu) c

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/multifd.c | 26 ++++++++++++++++++++++++--
 migration/ram.c     |  9 +++++++--
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index cc70b20ff7..36b5aedb16 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -141,6 +141,7 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
 static int nocomp_recv_pages(MultiFDRecvParams *p, Error **errp)
 {
     uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
+    uint64_t read_base = 0;
 
     if (flags != MULTIFD_FLAG_NOCOMP) {
         error_setg(errp, "multifd %u: flags received %x flags expected %x",
@@ -151,7 +152,13 @@ static int nocomp_recv_pages(MultiFDRecvParams *p, Error **errp)
         p->iov[i].iov_base = p->host + p->normal[i];
         p->iov[i].iov_len = p->page_size;
     }
-    return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
+
+    if (migrate_fixed_ram()) {
+        read_base = p->pages->block->pages_offset - (uint64_t) p->host;
+    }
+
+    return qio_channel_read_full_all(p->c, p->iov, p->normal_num, read_base,
+                                     p->read_flags, errp);
 }
 
 static MultiFDMethods multifd_nocomp_ops = {
@@ -1221,9 +1228,21 @@ void multifd_recv_sync_main(void)
 {
     int i;
 
-    if (!migrate_use_multifd() || !migrate_multifd_use_packets()) {
+    if (!migrate_use_multifd()) {
         return;
     }
+
+    if (!migrate_multifd_use_packets()) {
+        for (i = 0; i < migrate_multifd_channels(); i++) {
+            MultiFDRecvParams *p = &multifd_recv_state->params[i];
+
+            qemu_sem_post(&p->sem);
+            continue;
+        }
+
+        return;
+    }
+
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
@@ -1256,6 +1275,7 @@ static void *multifd_recv_thread(void *opaque)
 
     while (true) {
         uint32_t flags;
+        p->normal_num = 0;
 
         if (p->quit) {
             break;
@@ -1377,6 +1397,8 @@ int multifd_load_setup(Error **errp)
             p->packet_len = sizeof(MultiFDPacket_t)
                 + sizeof(uint64_t) * page_count;
             p->packet = g_malloc0(p->packet_len);
+        } else {
+            p->read_flags |= QIO_CHANNEL_READ_FLAG_WITH_OFFSET;
         }
         p->name = g_strdup_printf("multifdrecv_%d", i);
         p->iov = g_new0(struct iovec, page_count);
diff --git a/migration/ram.c b/migration/ram.c
index e9b28c16da..180e8e0d94 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4427,8 +4427,13 @@ static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
             host = host_from_ram_block_offset(block, offset);
             read_len = MIN(len, TARGET_PAGE_SIZE);
 
-            read = qemu_get_buffer_at(f, host, read_len,
-                                      block->pages_offset + offset);
+            if (migrate_use_multifd()) {
+                multifd_recv_queue_page(f, block, offset);
+                read = read_len;
+            } else {
+                read = qemu_get_buffer_at(f, host, read_len,
+                                          block->pages_offset + offset);
+            }
             completed += read;
         }
     }
-- 
2.35.3




* [RFC PATCH v1 24/26] tests/qtest: Add a multifd + fixed-ram migration test
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (22 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 23/26] migration/multifd: Support incoming " Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 25/26] migration: Add direct-io parameter Fabiano Rosas
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Thomas Huth, Laurent Vivier, Paolo Bonzini

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration-test.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 84b4c761ad..2e0911996d 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1564,6 +1564,31 @@ static void test_precopy_file_fixed_ram(void)
     test_precopy_common(&args);
 }
 
+static void *migrate_multifd_fixed_ram_start(QTestState *from, QTestState *to)
+{
+    migrate_fixed_ram_start(from, to);
+
+    migrate_set_parameter_int(from, "multifd-channels", 4);
+    migrate_set_parameter_int(to, "multifd-channels", 4);
+
+    migrate_set_capability(from, "multifd", true);
+    migrate_set_capability(to, "multifd", true);
+
+    return NULL;
+}
+
+static void test_multifd_file_fixed_ram(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:%s/migfile", tmpfs);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = migrate_multifd_fixed_ram_start,
+    };
+
+    test_precopy_common(&args);
+}
+
 static void test_precopy_tcp_plain(void)
 {
     MigrateCommon args = {
@@ -2560,6 +2585,8 @@ int main(int argc, char **argv)
                    test_precopy_file_stream_ram);
     qtest_add_func("/migration/precopy/file/fixed-ram",
                    test_precopy_file_fixed_ram);
+    qtest_add_func("/migration/multifd/file/fixed-ram",
+                   test_multifd_file_fixed_ram);
 
 #ifdef CONFIG_GNUTLS
     qtest_add_func("/migration/precopy/unix/tls/psk",
-- 
2.35.3




* [RFC PATCH v1 25/26] migration: Add direct-io parameter
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (23 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 24/26] tests/qtest: Add a multifd + fixed-ram migration test Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 18:03 ` [RFC PATCH v1 26/26] tests/migration/guestperf: Add file, fixed-ram and direct-io support Fabiano Rosas
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Eric Blake, Markus Armbruster

Add the direct-io migration parameter that tells the migration code to
use O_DIRECT when opening the migration stream file whenever possible.

This is currently only used for the secondary channels of fixed-ram
migration, which can guarantee that writes are page-aligned.

However, the parameter could be made to affect other types of
file-based migration in the future.
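
To make the constraint concrete, here is a small standalone
illustration (not QEMU code; the file name and values are arbitrary)
of what O_DIRECT requires and what the fixed-ram secondary channels
can provide: an aligned buffer, an aligned length and an aligned file
offset.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int flags = O_WRONLY | O_CREAT;
    void *buf;
    int fd;

#ifdef O_DIRECT
    flags |= O_DIRECT;   /* qemu_has_direct_io() in this series checks this */
#endif

    if (posix_memalign(&buf, page, page)) {
        return 1;
    }
    memset(buf, 0xaa, page);

    fd = open("migfile", flags, 0660);
    if (fd < 0) {
        perror("open");  /* e.g. filesystems that don't support O_DIRECT */
        return 1;
    }

    /* buffer, length and file offset are all page aligned */
    if (pwrite(fd, buf, page, 4 * page) != page) {
        perror("pwrite");
    }

    close(fd);
    free(buf);
    return 0;
}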

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/qemu/osdep.h           |  2 ++
 migration/file.c               | 13 +++++++++++--
 migration/migration-hmp-cmds.c |  9 +++++++++
 migration/migration.c          | 32 ++++++++++++++++++++++++++++++++
 migration/migration.h          |  1 +
 qapi/migration.json            | 17 ++++++++++++++---
 util/osdep.c                   |  9 +++++++++
 7 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 9eff0be95b..19c1d5f999 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -570,6 +570,8 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive);
 bool qemu_has_ofd_lock(void);
 #endif
 
+bool qemu_has_direct_io(void);
+
 #if defined(__HAIKU__) && defined(__i386__)
 #define FMT_pid "%ld"
 #elif defined(WIN64)
diff --git a/migration/file.c b/migration/file.c
index 6f40894488..1a40da097d 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -43,9 +43,18 @@ void file_send_channel_create(QIOTaskFunc f, void *data)
     QIOChannelFile *ioc;
     QIOTask *task;
     Error *errp = NULL;
+    int flags = outgoing_args.flags;
 
-    ioc = qio_channel_file_new_path(outgoing_args.fname,
-                                    outgoing_args.flags,
+    if (migrate_use_direct_io() && qemu_has_direct_io()) {
+        /*
+         * Enable O_DIRECT for the secondary channels. These are used
+         * for sending ram pages and writes should be guaranteed to be
+         * aligned to at least page size.
+         */
+        flags |= O_DIRECT;
+    }
+
+    ioc = qio_channel_file_new_path(outgoing_args.fname, flags,
                                     outgoing_args.mode, &errp);
     if (!ioc) {
         file_migration_cancel(errp);
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 72519ea99f..c9a8d84714 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -344,6 +344,11 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
                 }
             }
         }
+        if (params->has_direct_io) {
+            monitor_printf(mon, "%s: %s\n",
+                           MigrationParameter_str(MIGRATION_PARAMETER_DIRECT_IO),
+                           params->direct_io ? "on" : "off");
+        }
     }
 
     qapi_free_MigrationParameters(params);
@@ -600,6 +605,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         error_setg(&err, "The block-bitmap-mapping parameter can only be set "
                    "through QMP");
         break;
+    case MIGRATION_PARAMETER_DIRECT_IO:
+        p->has_direct_io = true;
+        visit_type_bool(v, param, &p->direct_io, &err);
+        break;
     default:
         assert(0);
     }
diff --git a/migration/migration.c b/migration/migration.c
index 77d24a5114..65798171e4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1022,6 +1022,11 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
                        s->parameters.block_bitmap_mapping);
     }
 
+    if (s->parameters.has_direct_io) {
+        params->has_direct_io = true;
+        params->direct_io = s->parameters.direct_io;
+    }
+
     return params;
 }
 
@@ -1782,6 +1787,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
         dest->has_block_bitmap_mapping = true;
         dest->block_bitmap_mapping = params->block_bitmap_mapping;
     }
+
+    if (params->has_direct_io) {
+        dest->direct_io = params->direct_io;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1904,6 +1913,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
             QAPI_CLONE(BitmapMigrationNodeAliasList,
                        params->block_bitmap_mapping);
     }
+
+    if (params->has_direct_io) {
+        s->parameters.direct_io = params->direct_io;
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
@@ -2885,6 +2898,24 @@ bool migrate_postcopy_preempt(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT];
 }
 
+bool migrate_use_direct_io(void)
+{
+    MigrationState *s;
+
+    /* For now O_DIRECT is only supported with fixed-ram */
+    if (!migrate_fixed_ram()) {
+        return false;
+    }
+
+    s = migrate_get_current();
+
+    if (s->parameters.has_direct_io) {
+        return s->parameters.direct_io;
+    }
+
+    return false;
+}
+
 /* migration thread support */
 /*
  * Something bad happened to the RP stream, mark an error
@@ -4666,6 +4697,7 @@ static void migration_instance_init(Object *obj)
     params->has_announce_max = true;
     params->has_announce_rounds = true;
     params->has_announce_step = true;
+    params->has_direct_io = qemu_has_direct_io();
 
     qemu_sem_init(&ms->postcopy_pause_sem, 0);
     qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
diff --git a/migration/migration.h b/migration/migration.h
index 8459201958..e0c9c78570 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -422,6 +422,7 @@ bool migrate_ignore_shared(void);
 bool migrate_validate_uuid(void);
 int migrate_fixed_ram(void);
 bool migrate_multifd_use_packets(void);
+bool migrate_use_direct_io(void);
 bool migrate_auto_converge(void);
 bool migrate_use_multifd(void);
 bool migrate_pause_before_switchover(void);
diff --git a/qapi/migration.json b/qapi/migration.json
index 22eea58ce3..2190d98ded 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -776,6 +776,9 @@
 #                        block device name if there is one, and to their node name
 #                        otherwise. (Since 5.2)
 #
+# @direct-io: Open migration files with O_DIRECT when possible. Not
+#             all migration transports support this. (since 8.1)
+#
 # Features:
 # @unstable: Member @x-checkpoint-delay is experimental.
 #
@@ -796,7 +799,7 @@
            'xbzrle-cache-size', 'max-postcopy-bandwidth',
            'max-cpu-throttle', 'multifd-compression',
            'multifd-zlib-level' ,'multifd-zstd-level',
-           'block-bitmap-mapping' ] }
+           'block-bitmap-mapping', 'direct-io' ] }
 
 ##
 # @MigrateSetParameters:
@@ -941,6 +944,9 @@
 #                        block device name if there is one, and to their node name
 #                        otherwise. (Since 5.2)
 #
+# @direct-io: Open migration files with O_DIRECT when possible. Not
+#             all migration transports support this. (since 8.1)
+#
 # Features:
 # @unstable: Member @x-checkpoint-delay is experimental.
 #
@@ -976,7 +982,8 @@
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
-            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
+            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
+            '*direct-io': 'bool' } }
 
 ##
 # @migrate-set-parameters:
@@ -1141,6 +1148,9 @@
 #                        block device name if there is one, and to their node name
 #                        otherwise. (Since 5.2)
 #
+# @direct-io: Open migration files with O_DIRECT when possible. Not
+#             all migration transports support this. (since 8.1)
+#
 # Features:
 # @unstable: Member @x-checkpoint-delay is experimental.
 #
@@ -1174,7 +1184,8 @@
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
-            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
+            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
+            '*direct-io': 'bool' } }
 
 ##
 # @query-migrate-parameters:
diff --git a/util/osdep.c b/util/osdep.c
index e996c4744a..d0227a60ab 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -277,6 +277,15 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive)
 }
 #endif
 
+bool qemu_has_direct_io(void)
+{
+#ifdef O_DIRECT
+    return true;
+#else
+    return false;
+#endif
+}
+
 static int qemu_open_cloexec(const char *name, int flags, mode_t mode)
 {
     int ret;
-- 
2.35.3




* [RFC PATCH v1 26/26] tests/migration/guestperf: Add file, fixed-ram and direct-io support
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (24 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 25/26] migration: Add direct-io parameter Fabiano Rosas
@ 2023-03-30 18:03 ` Fabiano Rosas
  2023-03-30 21:41 ` [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Peter Xu
  2023-04-03  7:38 ` David Hildenbrand
  27 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-30 18:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

Add support for the new migration features:
- 'file' transport;
- 'fixed-ram' stream format capability;
- 'direct-io' parameter;

Usage:
$ ./guestperf.py --binary <path/to/qemu> --initrd <path/to/initrd-stress.img> \
                 --transport file --dst-file migfile --multifd --fixed-ram \
		 --multifd-channels 4 --output fixed-ram.json  --verbose

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/migration/guestperf/engine.py   | 38 +++++++++++++++++++++++++--
 tests/migration/guestperf/scenario.py | 14 ++++++++--
 tests/migration/guestperf/shell.py    | 18 +++++++++++--
 3 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/tests/migration/guestperf/engine.py b/tests/migration/guestperf/engine.py
index e69d16a62c..a465336184 100644
--- a/tests/migration/guestperf/engine.py
+++ b/tests/migration/guestperf/engine.py
@@ -35,10 +35,11 @@
 class Engine(object):
 
     def __init__(self, binary, dst_host, kernel, initrd, transport="tcp",
-                 sleep=15, verbose=False, debug=False):
+                 sleep=15, verbose=False, debug=False, dst_file="/tmp/migfile"):
 
         self._binary = binary # Path to QEMU binary
         self._dst_host = dst_host # Hostname of target host
+        self._dst_file = dst_file # Path to file (for file transport)
         self._kernel = kernel # Path to kernel image
         self._initrd = initrd # Path to stress initrd
         self._transport = transport # 'unix' or 'tcp' or 'rdma'
@@ -203,6 +204,23 @@ def _migrate(self, hardware, scenario, src, dst, connect_uri):
             resp = dst.command("migrate-set-parameters",
                                multifd_channels=scenario._multifd_channels)
 
+        if scenario._fixed_ram:
+            resp = src.command("migrate-set-capabilities",
+                               capabilities = [
+                                   { "capability": "fixed-ram",
+                                     "state": True }
+                               ])
+            resp = dst.command("migrate-set-capabilities",
+                               capabilities = [
+                                   { "capability": "fixed-ram",
+                                     "state": True }
+                               ])
+
+        if scenario._direct_io:
+            resp = src.command("migrate-set-parameters",
+                               direct_io=scenario._direct_io)
+
+
         resp = src.command("migrate", uri=connect_uri)
 
         post_copy = False
@@ -233,6 +251,11 @@ def _migrate(self, hardware, scenario, src, dst, connect_uri):
                     progress_history.append(progress)
 
                 if progress._status == "completed":
+                    if connect_uri[0:5] == "file:":
+                        if self._verbose:
+                            print("Migrating incoming")
+                        dst.command("migrate-incoming", uri=connect_uri)
+
                     if self._verbose:
                         print("Sleeping %d seconds for final guest workload run" % self._sleep)
                     sleep_secs = self._sleep
@@ -357,7 +380,11 @@ def _get_dst_args(self, hardware, uri):
         if self._dst_host != "localhost":
             tunnelled = True
         argv = self._get_common_args(hardware, tunnelled)
-        return argv + ["-incoming", uri]
+
+        incoming = ["-incoming", uri]
+        if uri[0:5] == "file:":
+            incoming = ["-incoming", "defer"]
+        return argv + incoming
 
     @staticmethod
     def _get_common_wrapper(cpu_bind, mem_bind):
@@ -417,6 +444,10 @@ def run(self, hardware, scenario, result_dir=os.getcwd()):
                 os.remove(monaddr)
             except:
                 pass
+        elif self._transport == "file":
+            if self._dst_host != "localhost":
+                raise Exception("Use unix migration transport for non-local host")
+            uri = "file:%s" % self._dst_file
 
         if self._dst_host != "localhost":
             dstmonaddr = ("localhost", 9001)
@@ -453,6 +484,9 @@ def run(self, hardware, scenario, result_dir=os.getcwd()):
             if self._dst_host == "localhost" and os.path.exists(dstmonaddr):
                 os.remove(dstmonaddr)
 
+            if uri[0:5] == "file:" and os.path.exists(uri[5:]):
+                os.remove(uri[5:])
+
             if self._verbose:
                 print("Finished migration")
 
diff --git a/tests/migration/guestperf/scenario.py b/tests/migration/guestperf/scenario.py
index de70d9b2f5..29b6af41ac 100644
--- a/tests/migration/guestperf/scenario.py
+++ b/tests/migration/guestperf/scenario.py
@@ -30,7 +30,8 @@ def __init__(self, name,
                  auto_converge=False, auto_converge_step=10,
                  compression_mt=False, compression_mt_threads=1,
                  compression_xbzrle=False, compression_xbzrle_cache=10,
-                 multifd=False, multifd_channels=2):
+                 multifd=False, multifd_channels=2,
+                 fixed_ram=False, direct_io=False):
 
         self._name = name
 
@@ -60,6 +61,11 @@ def __init__(self, name,
         self._multifd = multifd
         self._multifd_channels = multifd_channels
 
+        self._fixed_ram = fixed_ram
+
+        self._direct_io = direct_io
+
+
     def serialize(self):
         return {
             "name": self._name,
@@ -79,6 +85,8 @@ def serialize(self):
             "compression_xbzrle_cache": self._compression_xbzrle_cache,
             "multifd": self._multifd,
             "multifd_channels": self._multifd_channels,
+            "fixed_ram": self._fixed_ram,
+            "direct_io": self._direct_io,
         }
 
     @classmethod
@@ -100,4 +108,6 @@ def deserialize(cls, data):
             data["compression_xbzrle"],
             data["compression_xbzrle_cache"],
             data["multifd"],
-            data["multifd_channels"])
+            data["multifd_channels"],
+            data["fixed_ram"],
+            data["direct_io"])
diff --git a/tests/migration/guestperf/shell.py b/tests/migration/guestperf/shell.py
index 8a809e3dda..0cb402adce 100644
--- a/tests/migration/guestperf/shell.py
+++ b/tests/migration/guestperf/shell.py
@@ -48,6 +48,7 @@ def __init__(self):
         parser.add_argument("--kernel", dest="kernel", default="/boot/vmlinuz-%s" % platform.release())
         parser.add_argument("--initrd", dest="initrd", default="tests/migration/initrd-stress.img")
         parser.add_argument("--transport", dest="transport", default="unix")
+        parser.add_argument("--dst-file", dest="dst_file")
 
 
         # Hardware args
@@ -71,7 +72,8 @@ def get_engine(self, args):
                       transport=args.transport,
                       sleep=args.sleep,
                       debug=args.debug,
-                      verbose=args.verbose)
+                      verbose=args.verbose,
+                      dst_file=args.dst_file)
 
     def get_hardware(self, args):
         def split_map(value):
@@ -127,6 +129,13 @@ def __init__(self):
         parser.add_argument("--multifd-channels", dest="multifd_channels",
                             default=2, type=int)
 
+        parser.add_argument("--fixed-ram", dest="fixed_ram", default=False,
+                            action="store_true")
+
+        parser.add_argument("--direct-io", dest="direct_io", default=False,
+                            action="store_true")
+
+
     def get_scenario(self, args):
         return Scenario(name="perfreport",
                         downtime=args.downtime,
@@ -150,7 +159,12 @@ def get_scenario(self, args):
                         compression_xbzrle_cache=args.compression_xbzrle_cache,
 
                         multifd=args.multifd,
-                        multifd_channels=args.multifd_channels)
+                        multifd_channels=args.multifd_channels,
+
+                        fixed_ram=args.fixed_ram,
+
+                        direct_io=args.direct_io)
+
 
     def run(self, argv):
         args = self._parser.parse_args(argv)
-- 
2.35.3




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (25 preceding siblings ...)
  2023-03-30 18:03 ` [RFC PATCH v1 26/26] tests/migration/guestperf: Add file, fixed-ram and direct-io support Fabiano Rosas
@ 2023-03-30 21:41 ` Peter Xu
  2023-03-31 14:37   ` Fabiano Rosas
  2023-04-03  7:38 ` David Hildenbrand
  27 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2023-03-30 21:41 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

On Thu, Mar 30, 2023 at 03:03:10PM -0300, Fabiano Rosas wrote:
> Hi folks,

Hi,

> 
> I'm continuing the work done last year to add a new format of
> migration stream that can be used to migrate large guests to a single
> file in a performant way.
> 
> This is an early RFC with the previous code + my additions to support
> multifd and direct IO. Let me know what you think!
> 
> Here are the reference links for previous discussions:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
> 
> The series has 4 main parts:
> 
> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>    same as "exec:cat > mig". Patches 1-4 implement this;
> 
> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>    pages at their relative offsets in the migration file. This saves
>    space on the worst case of RAM utilization because every page has a
>    fixed offset in the migration file and (potentially) saves us time
>    because we could write pages independently in parallel. It also
>    gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>    implement this;
> 
> With patches 1-13 these two^ can be used with:
> 
> (qemu) migrate_set_capability fixed-ram on
> (qemu) migrate[_incoming] file:mig

Have you considered enabling the new fixed-ram format with postcopy when
loading?

Due to the linear offsetting of pages, I think it can achieve super fast VM
loads thanks to O(1) lookup of pages and local page fault resolution.

> 
> --> new in this series:
> 
> 3) MultiFD support: This is about making use of the parallelism
>    allowed by the new format. We just need the threading and page
>    queuing infrastructure that is already in place for
>    multifd. Patches 14-24 implement this;
> 
> (qemu) migrate_set_capability fixed-ram on
> (qemu) migrate_set_capability multifd on
> (qemu) migrate_set_parameter multifd-channels 4
> (qemu) migrate_set_parameter max-bandwith 0
> (qemu) migrate[_incoming] file:mig
> 
> 4) Add a new "direct_io" parameter and enable O_DIRECT for the
>    properly aligned segments of the migration (mostly ram). Patch 25.
> 
> (qemu) migrate_set_parameter direct-io on
> 
> Thanks! Some data below:
> =====
> 
> Outgoing migration to file. NVMe disk. XFS filesystem.
> 
> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
>   10m -v`:
> 
> migration type  | MB/s | pages/s |  ms
> ----------------+------+---------+------
> savevm io_uring |  434 |  102294 | 71473

So I assume this is the non-live migration scenario.  Could you explain
what io_uring means here?

> file:           | 3017 |  855862 | 10301
> fixed-ram       | 1982 |  330686 | 15637
> ----------------+------+---------+------
> fixed-ram + multifd + O_DIRECT
>          2 ch.  | 5565 | 1500882 |  5576
>          4 ch.  | 5735 | 1991549 |  5412
>          8 ch.  | 5650 | 1769650 |  5489
>         16 ch.  | 6071 | 1832407 |  5114
>         32 ch.  | 6147 | 1809588 |  5050
>         64 ch.  | 6344 | 1841728 |  4895
>        128 ch.  | 6120 | 1915669 |  5085
> ----------------+------+---------+------

Thanks,

-- 
Peter Xu




* Re: [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability
  2023-03-30 18:03 ` [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability Fabiano Rosas
@ 2023-03-30 22:01   ` Peter Xu
  2023-03-31  7:56     ` Daniel P. Berrangé
  2023-03-31 15:05     ` Fabiano Rosas
  2023-03-31  5:50   ` Markus Armbruster
  1 sibling, 2 replies; 65+ messages in thread
From: Peter Xu @ 2023-03-30 22:01 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov, Paolo Bonzini, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster

On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
> 
> Implement 'fixed-ram' feature. The core of the feature is to ensure that
> each ram page of the migration stream has a specific offset in the
> resulting migration stream. The reason why we'd want such behavior are
> two fold:
> 
>  - When doing a 'fixed-ram' migration the resulting file will have a
>    bounded size, since pages which are dirtied multiple times will
>    always go to a fixed location in the file, rather than constantly
>    being added to a sequential stream. This eliminates cases where a vm
>    with, say, 1G of ram can result in a migration file that's 10s of
>    GBs, provided that the workload constantly redirties memory.
> 
>  - It paves the way to implement DIO-enabled save/restore of the
>    migration stream as the pages are ensured to be written at aligned
>    offsets.
> 
> The feature requires changing the stream format. First, a bitmap is
> introduced which tracks which pages have been written (i.e are
> dirtied) during migration and subsequently it's being written in the
> resulting file, again at a fixed location for every ramblock. Zero
> pages are ignored as they'd be zero in the destination migration as
> well. With the changed format data would look like the following:
> 
> |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|

What happens with huge pages?  Would page size matter here?

I would assume it's fine if it uses a constant (small) page size, assuming
that matches the granularity at which qemu tracks dirty pages (which IIUC
is the host page size, not the guest's).

But I haven't given it any further thought yet; maybe it would be
worthwhile in all cases to record the page size here to be explicit, or the
meaning of the bitmap may not be clear (and then the bitmap_size will be a
field just for sanity checking too).

If postcopy might be an option, we'd want the page size to be the host page
size because then looking up the bitmap will be straightforward, deciding
whether we should copy over the page (UFFDIO_COPY) or fill it in with zeros
(UFFDIO_ZEROPAGE).

> 
> * pc - refers to the page_size/mr->addr members, so newly added members
> begin from "bitmap_size".

Could you elaborate more on what the pc is?

I also didn't see this *pc in the migration.rst update below.

> 
> This layout is initialized during ram_save_setup so instead of having a
> sequential stream of pages that follow the ramblock headers the dirty
> pages for a ramblock follow its header. Since all pages have a fixed
> location RAM_SAVE_FLAG_EOS is no longer generated on every migration
> iteration but there is effectively a single RAM_SAVE_FLAG_EOS right at
> the end.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  docs/devel/migration.rst | 36 +++++++++++++++
>  include/exec/ramblock.h  |  8 ++++
>  migration/migration.c    | 51 +++++++++++++++++++++-
>  migration/migration.h    |  1 +
>  migration/ram.c          | 94 +++++++++++++++++++++++++++++++++-------
>  migration/savevm.c       |  1 +
>  qapi/migration.json      |  2 +-
>  7 files changed, 176 insertions(+), 17 deletions(-)
> 
> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> index 1080211f8e..84112d7f3f 100644
> --- a/docs/devel/migration.rst
> +++ b/docs/devel/migration.rst
> @@ -568,6 +568,42 @@ Others (especially either older devices or system devices which for
>  some reason don't have a bus concept) make use of the ``instance id``
>  for otherwise identically named devices.
>  
> +Fixed-ram format
> +----------------
> +
> +When the ``fixed-ram`` capability is enabled, a slightly different
> +stream format is used for the RAM section. Instead of having a
> +sequential stream of pages that follow the RAMBlock headers, the dirty
> +pages for a RAMBlock follow its header. This ensures that each RAM
> +page has a fixed offset in the resulting migration stream.
> +
> +  - RAMBlock 1
> +
> +    - ID string length
> +    - ID string
> +    - Used size
> +    - Shadow bitmap size
> +    - Pages offset in migration stream*
> +
> +  - Shadow bitmap
> +  - Sequence of pages for RAMBlock 1 (* offset points here)
> +
> +  - RAMBlock 2
> +
> +    - ID string length
> +    - ID string
> +    - Used size
> +    - Shadow bitmap size
> +    - Pages offset in migration stream*
> +
> +  - Shadow bitmap
> +  - Sequence of pages for RAMBlock 2 (* offset points here)
> +
> +The ``fixed-ram`` capaility can be enabled in both source and
> +destination with:
> +
> +    ``migrate_set_capability fixed-ram on``
> +
>  Return path
>  -----------
>  
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index adc03df59c..4360c772c2 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -43,6 +43,14 @@ struct RAMBlock {
>      size_t page_size;
>      /* dirty bitmap used during migration */
>      unsigned long *bmap;
> +    /* shadow dirty bitmap used when migrating to a file */
> +    unsigned long *shadow_bmap;
> +    /*
> +     * offset in the file pages belonging to this ramblock are saved,
> +     * used only during migration to a file.
> +     */
> +    off_t bitmap_offset;
> +    uint64_t pages_offset;
>      /* bitmap of already received pages in postcopy */
>      unsigned long *receivedmap;
>  
> diff --git a/migration/migration.c b/migration/migration.c
> index 177fb0de0f..29630523e2 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -168,7 +168,8 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
>      MIGRATION_CAPABILITY_XBZRLE,
>      MIGRATION_CAPABILITY_X_COLO,
>      MIGRATION_CAPABILITY_VALIDATE_UUID,
> -    MIGRATION_CAPABILITY_ZERO_COPY_SEND);
> +    MIGRATION_CAPABILITY_ZERO_COPY_SEND,
> +    MIGRATION_CAPABILITY_FIXED_RAM);
>  
>  /* When we add fault tolerance, we could have several
>     migrations at once.  For now we don't need to add
> @@ -1341,6 +1342,28 @@ static bool migrate_caps_check(bool *cap_list,
>      }
>  #endif
>  
> +    if (cap_list[MIGRATION_CAPABILITY_FIXED_RAM]) {
> +        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
> +            error_setg(errp, "Directly mapped memory incompatible with multifd");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
> +            error_setg(errp, "Directly mapped memory incompatible with xbzrle");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
> +            error_setg(errp, "Directly mapped memory incompatible with compression");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
> +            error_setg(errp, "Directly mapped memory incompatible with postcopy ram");
> +            return false;
> +        }
> +    }
> +
>      if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
>          /* This check is reasonably expensive, so only when it's being
>           * set the first time, also it's only the destination that needs
> @@ -2736,6 +2759,11 @@ MultiFDCompression migrate_multifd_compression(void)
>      return s->parameters.multifd_compression;
>  }
>  
> +int migrate_fixed_ram(void)
> +{
> +    return migrate_get_current()->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
> +}
> +
>  int migrate_multifd_zlib_level(void)
>  {
>      MigrationState *s;
> @@ -4324,6 +4352,20 @@ fail:
>      return NULL;
>  }
>  
> +static int migrate_check_fixed_ram(MigrationState *s, Error **errp)
> +{
> +    if (!s->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM]) {
> +        return 0;
> +    }
> +
> +    if (!qemu_file_is_seekable(s->to_dst_file)) {
> +        error_setg(errp, "Directly mapped memory requires a seekable transport");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
>  void migrate_fd_connect(MigrationState *s, Error *error_in)
>  {
>      Error *local_err = NULL;
> @@ -4390,6 +4432,12 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>          }
>      }
>  
> +    if (migrate_check_fixed_ram(s, &local_err) < 0) {

This check might be too late afaict; the QMP cmd "migrate" could have already
succeeded.

Can we do an early check in / close to qmp_migrate()?  The idea is to fail
right at the QMP migrate command.
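
Something along these lines is what I have in mind -- a rough sketch only,
not from the patch, names approximate, assuming the usual migration.c
helpers (strstart, error_setg):

    /*
     * Sketch only: an early validation helper that qmp_migrate() could
     * call before any setup work, so the QMP command itself fails when
     * fixed-ram cannot work with the given URI.
     */
    static bool migrate_fixed_ram_check_uri(MigrationState *s,
                                            const char *uri, Error **errp)
    {
        if (!s->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM]) {
            return true;
        }
        /* Only the file: transport gives us a seekable QEMUFile. */
        if (!strstart(uri, "file:", NULL)) {
            error_setg(errp, "fixed-ram requires a seekable transport");
            return false;
        }
        return true;
    }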

> +        migrate_fd_cleanup(s);
> +        migrate_fd_error(s, local_err);
> +        return;
> +    }
> +
>      if (resume) {
>          /* Wakeup the main migration thread to do the recovery */
>          migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
> @@ -4519,6 +4567,7 @@ static Property migration_properties[] = {
>      DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
>  
>      /* Migration capabilities */
> +    DEFINE_PROP_MIG_CAP("x-fixed-ram", MIGRATION_CAPABILITY_FIXED_RAM),
>      DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
>      DEFINE_PROP_MIG_CAP("x-rdma-pin-all", MIGRATION_CAPABILITY_RDMA_PIN_ALL),
>      DEFINE_PROP_MIG_CAP("x-auto-converge", MIGRATION_CAPABILITY_AUTO_CONVERGE),
> diff --git a/migration/migration.h b/migration/migration.h
> index 2da2f8a164..8cf3caecfe 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -416,6 +416,7 @@ bool migrate_zero_blocks(void);
>  bool migrate_dirty_bitmaps(void);
>  bool migrate_ignore_shared(void);
>  bool migrate_validate_uuid(void);
> +int migrate_fixed_ram(void);
>  
>  bool migrate_auto_converge(void);
>  bool migrate_use_multifd(void);
> diff --git a/migration/ram.c b/migration/ram.c
> index 96e8a19a58..56f0f782c8 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1310,9 +1310,14 @@ static int save_zero_page_to_file(PageSearchStatus *pss,
>      int len = 0;
>  
>      if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
> -        len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
> -        qemu_put_byte(file, 0);
> -        len += 1;
> +        if (migrate_fixed_ram()) {
> +            /* for zero pages we don't need to do anything */
> +            len = 1;

I think you wanted to increase the "duplicated" counter, but this will also
increase ram-transferred, even if only by 1 byte.

Perhaps just pass a pointer to report the bytes, and return true/false for
whether to increase the counter (to keep everything accurate)?
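
As a rough illustration of what I mean (signature and surrounding details
approximate, not from the patch):

    /*
     * Sketch only: report the bytes written via a pointer and return
     * whether the page was a zero page, so the callers can bump the
     * "duplicated" counter without inflating ram-transferred when
     * fixed-ram writes nothing to the stream.
     */
    static bool save_zero_page_to_file(PageSearchStatus *pss, RAMBlock *block,
                                       ram_addr_t offset, size_t *bytes)
    {
        uint8_t *p = block->host + offset;

        *bytes = 0;
        if (!buffer_is_zero(p, TARGET_PAGE_SIZE)) {
            return false;
        }
        if (!migrate_fixed_ram()) {
            *bytes = save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
            qemu_put_byte(pss->pss_channel, 0);
            *bytes += 1;
        }
        ram_release_page(block->idstr, offset);
        return true;
    }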

> +        } else {
> +            len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
> +            qemu_put_byte(file, 0);
> +            len += 1;
> +        }
>          ram_release_page(block->idstr, offset);
>      }
>      return len;
> @@ -1394,14 +1399,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
>  {
>      QEMUFile *file = pss->pss_channel;
>  
> -    ram_transferred_add(save_page_header(pss, block,
> -                                         offset | RAM_SAVE_FLAG_PAGE));
> -    if (async) {
> -        qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> -                              migrate_release_ram() &&
> -                              migration_in_postcopy());
> +    if (migrate_fixed_ram()) {
> +        qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
> +                           block->pages_offset + offset);
> +        set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
>      } else {
> -        qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> +        ram_transferred_add(save_page_header(pss, block,
> +                                             offset | RAM_SAVE_FLAG_PAGE));
> +        if (async) {
> +            qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> +                                  migrate_release_ram() &&
> +                                  migration_in_postcopy());
> +        } else {
> +            qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> +        }
>      }
>      ram_transferred_add(TARGET_PAGE_SIZE);
>      stat64_add(&ram_atomic_counters.normal, 1);
> @@ -2731,6 +2742,8 @@ static void ram_save_cleanup(void *opaque)
>          block->clear_bmap = NULL;
>          g_free(block->bmap);
>          block->bmap = NULL;
> +        g_free(block->shadow_bmap);
> +        block->shadow_bmap = NULL;
>      }
>  
>      xbzrle_cleanup();
> @@ -3098,6 +3111,7 @@ static void ram_list_init_bitmaps(void)
>               */
>              block->bmap = bitmap_new(pages);
>              bitmap_set(block->bmap, 0, pages);
> +            block->shadow_bmap = bitmap_new(block->used_length >> TARGET_PAGE_BITS);
>              block->clear_bmap_shift = shift;
>              block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
>          }
> @@ -3287,6 +3301,33 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>              if (migrate_ignore_shared()) {
>                  qemu_put_be64(f, block->mr->addr);
>              }
> +
> +            if (migrate_fixed_ram()) {
> +                long num_pages = block->used_length >> TARGET_PAGE_BITS;
> +                long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> +
> +                /* Needed for external programs (think analyze-migration.py) */
> +                qemu_put_be32(f, bitmap_size);
> +
> +                /*
> +                 * The bitmap starts after pages_offset, so add 8 to
> +                 * account for the pages_offset size.
> +                 */
> +                block->bitmap_offset = qemu_get_offset(f) + 8;
> +
> +                /*
> +                 * Make pages_offset aligned to 1 MiB to account for
> +                 * migration file movement between filesystems with
> +                 * possibly different alignment restrictions when
> +                 * using O_DIRECT.
> +                 */
> +                block->pages_offset = ROUND_UP(block->bitmap_offset +
> +                                               bitmap_size, 0x100000);
> +                qemu_put_be64(f, block->pages_offset);
> +
> +                /* Now prepare offset for next ramblock */
> +                qemu_set_offset(f, block->pages_offset + block->used_length, SEEK_SET);
> +            }
>          }
>      }
>  
> @@ -3306,6 +3347,18 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>      return 0;
>  }
>  
> +static void ram_save_shadow_bmap(QEMUFile *f)
> +{
> +    RAMBlock *block;
> +
> +    RAMBLOCK_FOREACH_MIGRATABLE(block) {
> +        long num_pages = block->used_length >> TARGET_PAGE_BITS;
> +        long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> +        qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
> +                           block->bitmap_offset);
> +    }
> +}
> +
>  /**
>   * ram_save_iterate: iterative stage for migration
>   *
> @@ -3413,9 +3466,15 @@ out:
>              return ret;
>          }
>  
> -        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> -        qemu_fflush(f);
> -        ram_transferred_add(8);
> +        /*
> +         * For fixed ram we don't want to pollute the migration stream with
> +         * EOS flags.
> +         */
> +        if (!migrate_fixed_ram()) {
> +            qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +            qemu_fflush(f);
> +            ram_transferred_add(8);
> +        }
>  
>          ret = qemu_file_get_error(f);
>      }
> @@ -3461,6 +3520,9 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>              pages = ram_find_and_save_block(rs);
>              /* no more blocks to sent */
>              if (pages == 0) {
> +                if (migrate_fixed_ram()) {
> +                    ram_save_shadow_bmap(f);
> +                }
>                  break;
>              }
>              if (pages < 0) {
> @@ -3483,8 +3545,10 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>          return ret;
>      }
>  
> -    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> -    qemu_fflush(f);
> +    if (!migrate_fixed_ram()) {
> +        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +        qemu_fflush(f);
> +    }
>  
>      return 0;
>  }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 92102c1fe5..1f1bc19224 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -241,6 +241,7 @@ static bool should_validate_capability(int capability)
>      /* Validate only new capabilities to keep compatibility. */
>      switch (capability) {
>      case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
> +    case MIGRATION_CAPABILITY_FIXED_RAM:
>          return true;
>      default:
>          return false;
> diff --git a/qapi/migration.json b/qapi/migration.json
> index c84fa10e86..22eea58ce3 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -485,7 +485,7 @@
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> -           'compress', 'events', 'postcopy-ram',
> +           'compress', 'events', 'postcopy-ram', 'fixed-ram',
>             { 'name': 'x-colo', 'features': [ 'unstable' ] },
>             'release-ram',
>             'block', 'return-path', 'pause-before-switchover', 'multifd',
> -- 
> 2.35.3
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability
  2023-03-30 18:03 ` [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability Fabiano Rosas
  2023-03-30 22:01   ` Peter Xu
@ 2023-03-31  5:50   ` Markus Armbruster
  1 sibling, 0 replies; 65+ messages in thread
From: Markus Armbruster @ 2023-03-31  5:50 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Eric Blake

Fabiano Rosas <farosas@suse.de> writes:

> From: Nikolay Borisov <nborisov@suse.com>
>
> Implement 'fixed-ram' feature. The core of the feature is to ensure that
> each ram page of the migration stream has a specific offset in the
> resulting migration stream. The reason why we'd want such behavior are
> two fold:
>
>  - When doing a 'fixed-ram' migration the resulting file will have a
>    bounded size, since pages which are dirtied multiple times will
>    always go to a fixed location in the file, rather than constantly
>    being added to a sequential stream. This eliminates cases where a vm
>    with, say, 1G of ram can result in a migration file that's 10s of
>    GBs, provided that the workload constantly redirties memory.
>
>  - It paves the way to implement DIO-enabled save/restore of the
>    migration stream as the pages are ensured to be written at aligned
>    offsets.
>
> The feature requires changing the stream format. First, a bitmap is
> introduced which tracks which pages have been written (i.e are
> dirtied) during migration and subsequently it's being written in the
> resulting file, again at a fixed location for every ramblock. Zero
> pages are ignored as they'd be zero in the destination migration as
> well. With the changed format data would look like the following:
>
> |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
>
> * pc - refers to the page_size/mr->addr members, so newly added members
> begin from "bitmap_size".
>
> This layout is initialized during ram_save_setup so instead of having a
> sequential stream of pages that follow the ramblock headers the dirty
> pages for a ramblock follow its header. Since all pages have a fixed
> location RAM_SAVE_FLAG_EOS is no longer generated on every migration
> iteration but there is effectively a single RAM_SAVE_FLAG_EOS right at
> the end.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

[...]

> diff --git a/qapi/migration.json b/qapi/migration.json
> index c84fa10e86..22eea58ce3 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -485,7 +485,7 @@
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> -           'compress', 'events', 'postcopy-ram',
> +           'compress', 'events', 'postcopy-ram', 'fixed-ram',
>             { 'name': 'x-colo', 'features': [ 'unstable' ] },
>             'release-ram',
>             'block', 'return-path', 'pause-before-switchover', 'multifd',

Doc comment update is missing.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability
  2023-03-30 22:01   ` Peter Xu
@ 2023-03-31  7:56     ` Daniel P. Berrangé
  2023-03-31 14:39       ` Peter Xu
  2023-03-31 15:05     ` Fabiano Rosas
  1 sibling, 1 reply; 65+ messages in thread
From: Daniel P. Berrangé @ 2023-03-31  7:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster

On Thu, Mar 30, 2023 at 06:01:51PM -0400, Peter Xu wrote:
> On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> > From: Nikolay Borisov <nborisov@suse.com>
> > 
> > Implement 'fixed-ram' feature. The core of the feature is to ensure that
> > each ram page of the migration stream has a specific offset in the
> > resulting migration stream. The reason why we'd want such behavior are
> > two fold:
> > 
> >  - When doing a 'fixed-ram' migration the resulting file will have a
> >    bounded size, since pages which are dirtied multiple times will
> >    always go to a fixed location in the file, rather than constantly
> >    being added to a sequential stream. This eliminates cases where a vm
> >    with, say, 1G of ram can result in a migration file that's 10s of
> >    GBs, provided that the workload constantly redirties memory.
> > 
> >  - It paves the way to implement DIO-enabled save/restore of the
> >    migration stream as the pages are ensured to be written at aligned
> >    offsets.
> > 
> > The feature requires changing the stream format. First, a bitmap is
> > introduced which tracks which pages have been written (i.e are
> > dirtied) during migration and subsequently it's being written in the
> > resulting file, again at a fixed location for every ramblock. Zero
> > pages are ignored as they'd be zero in the destination migration as
> > well. With the changed format data would look like the following:
> > 
> > |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
> 
> What happens with huge pages?  Would page size matter here?
> 
> I would assume it's fine it uses a constant (small) page size, assuming
> that should match with the granule that qemu tracks dirty (which IIUC is
> the host page size not guest's).
> 
> But I didn't yet pay any further thoughts on that, maybe it would be
> worthwhile in all cases to record page sizes here to be explicit or the
> meaning of bitmap may not be clear (and then the bitmap_size will be a
> field just for sanity check too).

I think recording the page sizes is an anti-feature in this case.

The migration format / state needs to reflect the guest ABI, but
we need to be free to have a different backend config behind that
on either side of the save/restore.

IOW, if I start a QEMU with 2 GB of RAM, I should be free to use
small pages initially and after restore use 2 x 1 GB hugepages,
or vice versa.

The important thing with the pages that are saved into the file
is that they are a 1:1 mapping of guest RAM regions to file offsets.
IOW, the 2 GB of guest RAM is always a contiguous 2 GB region
in the file.

If the src VM used 1 GB pages, we would be writing a full 2 GB
of data assuming both pages were dirty.

If the src VM used 4k pages, we would be writing some subset of
the 2 GB of data, and the rest would be unwritten.

Either way, when reading back the data we restore it into either
1 GB pages or 4k pages, because any places that were unwritten
originally will read back as zeros.
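
Put another way (just an illustration, mirroring the offsets this patch
records in the RAMBlock header), the file position of a guest page is a
pure function of its offset within the RAMBlock, independent of the
backing page size:

    /*
     * Illustration only: where a dirty page lands in the migration file
     * under fixed-ram.  pages_offset is the 1 MiB-aligned start that
     * ram_save_setup() writes into the RAMBlock header; the dirty bit
     * is always tracked at TARGET_PAGE_SIZE granularity.
     */
    static uint64_t fixed_ram_file_offset(const RAMBlock *block,
                                          ram_addr_t offset_in_block)
    {
        return block->pages_offset + offset_in_block;
    }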

> If postcopy might be an option, we'd want the page size to be the host page
> size because then looking up the bitmap will be straightforward, deciding
> whether we should copy over page (UFFDIO_COPY) or fill in with zeros
> (UFFDIO_ZEROPAGE).

This format is only intended for the case where we are migrating to
a random-access medium, aka a file, because the fixed RAM mappings
to disk mean that we need to seek back to the original location to
re-write pages that get dirtied. It isn't suitable for a live
migration stream, and thus postcopy is inherently out of scope.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-30 21:41 ` [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Peter Xu
@ 2023-03-31 14:37   ` Fabiano Rosas
  2023-03-31 14:52     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-31 14:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

Peter Xu <peterx@redhat.com> writes:

> On Thu, Mar 30, 2023 at 03:03:10PM -0300, Fabiano Rosas wrote:
>> Hi folks,
>
> Hi,
>
>> 
>> I'm continuing the work done last year to add a new format of
>> migration stream that can be used to migrate large guests to a single
>> file in a performant way.
>> 
>> This is an early RFC with the previous code + my additions to support
>> multifd and direct IO. Let me know what you think!
>> 
>> Here are the reference links for previous discussions:
>> 
>> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
>> 
>> The series has 4 main parts:
>> 
>> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>>    same as "exec:cat > mig". Patches 1-4 implement this;
>> 
>> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>>    pages at their relative offsets in the migration file. This saves
>>    space on the worst case of RAM utilization because every page has a
>>    fixed offset in the migration file and (potentially) saves us time
>>    because we could write pages independently in parallel. It also
>>    gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>>    implement this;
>> 
>> With patches 1-13 these two^ can be used with:
>> 
>> (qemu) migrate_set_capability fixed-ram on
>> (qemu) migrate[_incoming] file:mig
>
> Have you considered enabling the new fixed-ram format with postcopy when
> loading?
>
> Due to the linear offseting of pages, I think it can achieve super fast vm
> loads due to O(1) lookup of pages and local page fault resolutions.
>

I don't think we have looked that much at the loading side yet. Good to
know that it has potential to be faster. I'll look into it. Thanks for
the suggestion.

>> 
>> --> new in this series:
>> 
>> 3) MultiFD support: This is about making use of the parallelism
>>    allowed by the new format. We just need the threading and page
>>    queuing infrastructure that is already in place for
>>    multifd. Patches 14-24 implement this;
>> 
>> (qemu) migrate_set_capability fixed-ram on
>> (qemu) migrate_set_capability multifd on
>> (qemu) migrate_set_parameter multifd-channels 4
>> (qemu) migrate_set_parameter max-bandwith 0
>> (qemu) migrate[_incoming] file:mig
>> 
>> 4) Add a new "direct_io" parameter and enable O_DIRECT for the
>>    properly aligned segments of the migration (mostly ram). Patch 25.
>> 
>> (qemu) migrate_set_parameter direct-io on
>> 
>> Thanks! Some data below:
>> =====
>> 
>> Outgoing migration to file. NVMe disk. XFS filesystem.
>> 
>> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
>>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
>>   10m -v`:
>> 
>> migration type  | MB/s | pages/s |  ms
>> ----------------+------+---------+------
>> savevm io_uring |  434 |  102294 | 71473
>
> So I assume this is the non-live migration scenario.  Could you explain
what io_uring means here?
>

This table is all non-live migration. This particular line is a snapshot
(hmp_savevm->save_snapshot). I thought it could be relevant because it
is another way by which we write RAM into disk.

The io_uring is noise; I was initially under the impression that the
block device aio configuration affected this scenario.

>> file:           | 3017 |  855862 | 10301
>> fixed-ram       | 1982 |  330686 | 15637
>> ----------------+------+---------+------
>> fixed-ram + multifd + O_DIRECT
>>          2 ch.  | 5565 | 1500882 |  5576
>>          4 ch.  | 5735 | 1991549 |  5412
>>          8 ch.  | 5650 | 1769650 |  5489
>>         16 ch.  | 6071 | 1832407 |  5114
>>         32 ch.  | 6147 | 1809588 |  5050
>>         64 ch.  | 6344 | 1841728 |  4895
>>        128 ch.  | 6120 | 1915669 |  5085
>> ----------------+------+---------+------
>
> Thanks,


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability
  2023-03-31  7:56     ` Daniel P. Berrangé
@ 2023-03-31 14:39       ` Peter Xu
  2023-03-31 15:34         ` Daniel P. Berrangé
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2023-03-31 14:39 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster

On Fri, Mar 31, 2023 at 08:56:01AM +0100, Daniel P. Berrangé wrote:
> On Thu, Mar 30, 2023 at 06:01:51PM -0400, Peter Xu wrote:
> > On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> > > From: Nikolay Borisov <nborisov@suse.com>
> > > 
> > > Implement 'fixed-ram' feature. The core of the feature is to ensure that
> > > each ram page of the migration stream has a specific offset in the
> > > resulting migration stream. The reason why we'd want such behavior are
> > > two fold:
> > > 
> > >  - When doing a 'fixed-ram' migration the resulting file will have a
> > >    bounded size, since pages which are dirtied multiple times will
> > >    always go to a fixed location in the file, rather than constantly
> > >    being added to a sequential stream. This eliminates cases where a vm
> > >    with, say, 1G of ram can result in a migration file that's 10s of
> > >    GBs, provided that the workload constantly redirties memory.
> > > 
> > >  - It paves the way to implement DIO-enabled save/restore of the
> > >    migration stream as the pages are ensured to be written at aligned
> > >    offsets.
> > > 
> > > The feature requires changing the stream format. First, a bitmap is
> > > introduced which tracks which pages have been written (i.e are
> > > dirtied) during migration and subsequently it's being written in the
> > > resulting file, again at a fixed location for every ramblock. Zero
> > > pages are ignored as they'd be zero in the destination migration as
> > > well. With the changed format data would look like the following:
> > > 
> > > |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
> > 
> > What happens with huge pages?  Would page size matter here?
> > 
> > I would assume it's fine it uses a constant (small) page size, assuming
> > that should match with the granule that qemu tracks dirty (which IIUC is
> > the host page size not guest's).
> > 
> > But I didn't yet pay any further thoughts on that, maybe it would be
> > worthwhile in all cases to record page sizes here to be explicit or the
> > meaning of bitmap may not be clear (and then the bitmap_size will be a
> > field just for sanity check too).
> 
> I think recording the page sizes is an anti-feature in this case.
> 
> The migration format / state needs to reflect the guest ABI, but
> we need to be free to have different backend config behind that
> either side of the save/restore.
> 
> IOW, if I start a QEMU with 2 GB of RAM, I should be free to use
> small pages initially and after restore use 2 x 1 GB hugepages,
> or vica-verca.
> 
> The important thing with the pages that are saved into the file
> is that they are a 1:1 mapping guest RAM regions to file offsets.
> IOW, the 2 GB of guest RAM is always a contiguous 2 GB region
> in the file.
> 
> If the src VM used 1 GB pages, we would be writing a full 2 GB
> of data assuming both pages were dirty.
> 
> If the src VM used 4k pages, we would be writing some subset of
> the 2 GB of data, and the rest would be unwritten.
> 
> Either way, when reading back the data we restore it into either
> 1 GB pages of 4k pages, beause any places there were unwritten
> orignally  will read back as zeros.

I think the page size information is already there, because there's a bitmap
embedded in the format, at least in the current proposal, and the bitmap can
only be defined with a page size provided in some form.

Here I agree the backend can change before/after a migration (live or
not).  The question, though, is whether the page size matters in the snapshot
layout rather than in what the loaded QEMU instance will use as the backend.

> 
> > If postcopy might be an option, we'd want the page size to be the host page
> > size because then looking up the bitmap will be straightforward, deciding
> > whether we should copy over page (UFFDIO_COPY) or fill in with zeros
> > (UFFDIO_ZEROPAGE).
> 
> This format is only intended for the case where we are migrating to
> a random-access medium, aka a file, because the fixed RAM mappings
> to disk mean that we need to seek back to the original location to
> re-write pages that get dirtied. It isn't suitable for a live
> migration stream, and thus postcopy is inherantly out of scope.

Yes, I've commented also in the cover letter, but I can expand a bit.

I mean supporting postcopy only when loading, but not when saving.

Saving to file definitely cannot work with postcopy because there's no dest
qemu running.

Loading from file, OTOH, can work together with postcopy.

Right now AFAICT the current approach is to precopy-load the whole guest image
in the supported snapshot format (if I can call it just a snapshot).

What I want to say is that we can consider supporting postcopy on loading:
we start an "empty" QEMU dest node, and when any page fault is triggered we
resolve it using userfault by looking up the snapshot file, rather than
sending a request back to the source.  I mention that because there are
two major benefits, which I touched on quickly in reply to the cover letter,
but I can also expand here:

  - Firstly, the snapshot format stores pages at linear offsets, which
    means that when we find a page missing we can look it up in the
    snapshot image in O(1) time.

  - Secondly, we don't need to let the page go through the wire, nor do
    we need to send a request to the src qemu or anyone.  What we need
    here is simply to test the bit in the snapshot bitmap, then:

    - If it is copied, do UFFDIO_COPY to resolve page faults,
    - If it is not copied, do UFFDIO_ZEROPAGE (e.g., if not hugetlb,
      hugetlb can use a fake UFFDIO_COPY)

So this is a perfect testing ground for using postcopy in a very efficient
way against a file snapshot.
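
To make it concrete, the fault resolution could look roughly like this --
a hypothetical sketch, not part of this series; it assumes the dest has
mmap()ed the snapshot's pages area, has the per-RAMBlock bitmap loaded,
and includes <sys/ioctl.h> and <linux/userfaultfd.h>:

    /*
     * Sketch only: resolve a userfault at fault_addr inside `block`,
     * using the bitmap read back from the snapshot (reusing the
     * shadow_bmap field here purely for illustration) plus an mmap()ed
     * view of the snapshot's pages area (snap_pages).
     */
    static int resolve_fault_from_snapshot(int uffd, RAMBlock *block,
                                           void *fault_addr, void *snap_pages)
    {
        uint64_t off = (uintptr_t)fault_addr - (uintptr_t)block->host;
        uint64_t start = off & ~((uint64_t)TARGET_PAGE_SIZE - 1);

        if (test_bit(off >> TARGET_PAGE_BITS, block->shadow_bmap)) {
            /* Page exists in the snapshot: copy it in. */
            struct uffdio_copy copy = {
                .dst = (uintptr_t)block->host + start,
                .src = (uintptr_t)snap_pages + start,
                .len = TARGET_PAGE_SIZE,
            };
            return ioctl(uffd, UFFDIO_COPY, &copy);
        } else {
            /* Never written: it reads back as zeros. */
            struct uffdio_zeropage zero = {
                .range = { .start = (uintptr_t)block->host + start,
                           .len = TARGET_PAGE_SIZE },
            };
            return ioctl(uffd, UFFDIO_ZEROPAGE, &zero);
        }
    }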

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-31 14:37   ` Fabiano Rosas
@ 2023-03-31 14:52     ` Peter Xu
  2023-03-31 15:30       ` Fabiano Rosas
  2023-03-31 15:46       ` Daniel P. Berrangé
  0 siblings, 2 replies; 65+ messages in thread
From: Peter Xu @ 2023-03-31 14:52 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> >> 
> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> >>   10m -v`:
> >> 
> >> migration type  | MB/s | pages/s |  ms
> >> ----------------+------+---------+------
> >> savevm io_uring |  434 |  102294 | 71473
> >
> > So I assume this is the non-live migration scenario.  Could you explain
> > what does io_uring mean here?
> >
> 
> This table is all non-live migration. This particular line is a snapshot
> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> is another way by which we write RAM into disk.

I see, so if it's all non-live that explains it; I was curious about the
relationship between this feature and the live snapshot that QEMU also
supports.

I also don't immediately see why savevm would be much slower; do you have an
answer?  Maybe it's somewhere and I just overlooked it.

IIUC this is the "vm suspend" case, so there's the extra benefit of knowing
that "we can stop the VM".  It smells slightly weird to build this on top of
"migrate" from that pov, rather than "savevm", though.  Any thoughts on
this aspect (on why not build this on top of "savevm")?

Thanks,

> 
> The io_uring is noise, I was initially under the impression that the
> block device aio configuration affected this scenario.
> 
> >> file:           | 3017 |  855862 | 10301
> >> fixed-ram       | 1982 |  330686 | 15637
> >> ----------------+------+---------+------
> >> fixed-ram + multifd + O_DIRECT
> >>          2 ch.  | 5565 | 1500882 |  5576
> >>          4 ch.  | 5735 | 1991549 |  5412
> >>          8 ch.  | 5650 | 1769650 |  5489
> >>         16 ch.  | 6071 | 1832407 |  5114
> >>         32 ch.  | 6147 | 1809588 |  5050
> >>         64 ch.  | 6344 | 1841728 |  4895
> >>        128 ch.  | 6120 | 1915669 |  5085
> >> ----------------+------+---------+------
> >
> > Thanks,
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability
  2023-03-30 22:01   ` Peter Xu
  2023-03-31  7:56     ` Daniel P. Berrangé
@ 2023-03-31 15:05     ` Fabiano Rosas
  1 sibling, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-31 15:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela, Nikolay Borisov, Paolo Bonzini, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster

Peter Xu <peterx@redhat.com> writes:

>> 
>> * pc - refers to the page_size/mr->addr members, so newly added members
>> begin from "bitmap_size".
>
> Could you elaborate more on what's the pc?
>
> I also didn't see this *pc in below migration.rst update.
>

Yeah, you need to be looking at the code to figure that one out. That
was intended to reference some postcopy data that is (already) inserted
into the stream. Literally this:

    if (migrate_postcopy_ram() && block->page_size !=
                                  qemu_host_page_size) {
        qemu_put_be64(f, block->page_size);
    }
    if (migrate_ignore_shared()) {
        qemu_put_be64(f, block->mr->addr);
    }

It has nothing to do with this patch. I need to rewrite that part of the
commit message a bit.

>> 
>> This layout is initialized during ram_save_setup so instead of having a
>> sequential stream of pages that follow the ramblock headers the dirty
>> pages for a ramblock follow its header. Since all pages have a fixed
>> location RAM_SAVE_FLAG_EOS is no longer generated on every migration
>> iteration but there is effectively a single RAM_SAVE_FLAG_EOS right at
>> the end.
>> 
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>

...

>> @@ -4390,6 +4432,12 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>          }
>>      }
>>  
>> +    if (migrate_check_fixed_ram(s, &local_err) < 0) {
>
> This check might be too late afaict, QMP cmd "migrate" could have already
> succeeded.
>
> Can we do an early check in / close to qmp_migrate()?  The idea is we fail
> at the QMP migrate command there.
>

Yes, some of it depends on the QEMUFile being known but I can at least
move part of the verification earlier.

>> +        migrate_fd_cleanup(s);
>> +        migrate_fd_error(s, local_err);
>> +        return;
>> +    }
>> +
>>      if (resume) {
>>          /* Wakeup the main migration thread to do the recovery */
>>          migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
>> @@ -4519,6 +4567,7 @@ static Property migration_properties[] = {
>>      DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
>>  
>>      /* Migration capabilities */
>> +    DEFINE_PROP_MIG_CAP("x-fixed-ram", MIGRATION_CAPABILITY_FIXED_RAM),
>>      DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
>>      DEFINE_PROP_MIG_CAP("x-rdma-pin-all", MIGRATION_CAPABILITY_RDMA_PIN_ALL),
>>      DEFINE_PROP_MIG_CAP("x-auto-converge", MIGRATION_CAPABILITY_AUTO_CONVERGE),
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 2da2f8a164..8cf3caecfe 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -416,6 +416,7 @@ bool migrate_zero_blocks(void);
>>  bool migrate_dirty_bitmaps(void);
>>  bool migrate_ignore_shared(void);
>>  bool migrate_validate_uuid(void);
>> +int migrate_fixed_ram(void);
>>  
>>  bool migrate_auto_converge(void);
>>  bool migrate_use_multifd(void);
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 96e8a19a58..56f0f782c8 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1310,9 +1310,14 @@ static int save_zero_page_to_file(PageSearchStatus *pss,
>>      int len = 0;
>>  
>>      if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
>> -        len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
>> -        qemu_put_byte(file, 0);
>> -        len += 1;
>> +        if (migrate_fixed_ram()) {
>> +            /* for zero pages we don't need to do anything */
>> +            len = 1;
>
> I think you wanted to increase the "duplicated" counter, but this will also
> increase ram-transferred even though only 1 byte.
>

Ah, well spotted, that is indeed incorrect.

> Perhaps just pass a pointer to keep the bytes, and return true/false to
> increase the counter (to make everything accurate)?
>

Ok

>> +        } else {
>> +            len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
>> +            qemu_put_byte(file, 0);
>> +            len += 1;
>> +        }
>>          ram_release_page(block->idstr, offset);
>>      }
>>      return len;


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-31 14:52     ` Peter Xu
@ 2023-03-31 15:30       ` Fabiano Rosas
  2023-03-31 15:55         ` Peter Xu
  2023-03-31 15:46       ` Daniel P. Berrangé
  1 sibling, 1 reply; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-31 15:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

Peter Xu <peterx@redhat.com> writes:

> On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
>> >> Outgoing migration to file. NVMe disk. XFS filesystem.
>> >> 
>> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
>> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
>> >>   10m -v`:
>> >> 
>> >> migration type  | MB/s | pages/s |  ms
>> >> ----------------+------+---------+------
>> >> savevm io_uring |  434 |  102294 | 71473
>> >
>> > So I assume this is the non-live migration scenario.  Could you explain
>> > what does io_uring mean here?
>> >
>> 
>> This table is all non-live migration. This particular line is a snapshot
>> (hmp_savevm->save_snapshot). I thought it could be relevant because it
>> is another way by which we write RAM into disk.
>
> I see, so if all non-live that explains, because I was curious what's the
> relationship between this feature and the live snapshot that QEMU also
> supports.
>
> I also don't immediately see why savevm will be much slower, do you have an
> answer?  Maybe it's somewhere but I just overlooked..
>

I don't have a concrete answer. I could take a jab and maybe blame the
extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
of bandwidth limits?

> IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> "we can stop the VM".  It smells slightly weird to build this on top of
> "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> this aspect (on why not building this on top of "savevm")?
>

I share the same perception. I have done initial experiments with
savevm, but I decided to carry on the work that was already started by
others because my understanding of the problem was still incomplete.

One point that has been raised is that the fixed-ram format alone does
not bring that many performance improvements. So we'll need
multi-threading and direct-io on top of it. Re-using multifd
infrastructure seems like it could be a good idea.

> Thanks,
>
>> 
>> The io_uring is noise, I was initially under the impression that the
>> block device aio configuration affected this scenario.
>> 
>> >> file:           | 3017 |  855862 | 10301
>> >> fixed-ram       | 1982 |  330686 | 15637
>> >> ----------------+------+---------+------
>> >> fixed-ram + multifd + O_DIRECT
>> >>          2 ch.  | 5565 | 1500882 |  5576
>> >>          4 ch.  | 5735 | 1991549 |  5412
>> >>          8 ch.  | 5650 | 1769650 |  5489
>> >>         16 ch.  | 6071 | 1832407 |  5114
>> >>         32 ch.  | 6147 | 1809588 |  5050
>> >>         64 ch.  | 6344 | 1841728 |  4895
>> >>        128 ch.  | 6120 | 1915669 |  5085
>> >> ----------------+------+---------+------
>> >
>> > Thanks,
>> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability
  2023-03-31 14:39       ` Peter Xu
@ 2023-03-31 15:34         ` Daniel P. Berrangé
  2023-03-31 16:13           ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Daniel P. Berrangé @ 2023-03-31 15:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster

On Fri, Mar 31, 2023 at 10:39:23AM -0400, Peter Xu wrote:
> On Fri, Mar 31, 2023 at 08:56:01AM +0100, Daniel P. Berrangé wrote:
> > On Thu, Mar 30, 2023 at 06:01:51PM -0400, Peter Xu wrote:
> > > On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> > > > From: Nikolay Borisov <nborisov@suse.com>
> > > > 
> > > > Implement 'fixed-ram' feature. The core of the feature is to ensure that
> > > > each ram page of the migration stream has a specific offset in the
> > > > resulting migration stream. The reason why we'd want such behavior are
> > > > two fold:
> > > > 
> > > >  - When doing a 'fixed-ram' migration the resulting file will have a
> > > >    bounded size, since pages which are dirtied multiple times will
> > > >    always go to a fixed location in the file, rather than constantly
> > > >    being added to a sequential stream. This eliminates cases where a vm
> > > >    with, say, 1G of ram can result in a migration file that's 10s of
> > > >    GBs, provided that the workload constantly redirties memory.
> > > > 
> > > >  - It paves the way to implement DIO-enabled save/restore of the
> > > >    migration stream as the pages are ensured to be written at aligned
> > > >    offsets.
> > > > 
> > > > The feature requires changing the stream format. First, a bitmap is
> > > > introduced which tracks which pages have been written (i.e are
> > > > dirtied) during migration and subsequently it's being written in the
> > > > resulting file, again at a fixed location for every ramblock. Zero
> > > > pages are ignored as they'd be zero in the destination migration as
> > > > well. With the changed format data would look like the following:
> > > > 
> > > > |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
> > > 
> > > What happens with huge pages?  Would page size matter here?
> > > 
> > > I would assume it's fine it uses a constant (small) page size, assuming
> > > that should match with the granule that qemu tracks dirty (which IIUC is
> > > the host page size not guest's).
> > > 
> > > But I didn't yet pay any further thoughts on that, maybe it would be
> > > worthwhile in all cases to record page sizes here to be explicit or the
> > > meaning of bitmap may not be clear (and then the bitmap_size will be a
> > > field just for sanity check too).
> > 
> > I think recording the page sizes is an anti-feature in this case.
> > 
> > The migration format / state needs to reflect the guest ABI, but
> > we need to be free to have different backend config behind that
> > either side of the save/restore.
> > 
> > IOW, if I start a QEMU with 2 GB of RAM, I should be free to use
> > small pages initially and after restore use 2 x 1 GB hugepages,
> > or vica-verca.
> > 
> > The important thing with the pages that are saved into the file
> > is that they are a 1:1 mapping guest RAM regions to file offsets.
> > IOW, the 2 GB of guest RAM is always a contiguous 2 GB region
> > in the file.
> > 
> > If the src VM used 1 GB pages, we would be writing a full 2 GB
> > of data assuming both pages were dirty.
> > 
> > If the src VM used 4k pages, we would be writing some subset of
> > the 2 GB of data, and the rest would be unwritten.
> > 
> > Either way, when reading back the data we restore it into either
> > 1 GB pages of 4k pages, beause any places there were unwritten
> > orignally  will read back as zeros.
> 
> I think there's already the page size information, because there's a bitmap
> embeded in the format at least in the current proposal, and the bitmap can
> only be defined with a page size provided in some form.
> 
> Here I agree the backend can change before/after a migration (live or
> not).  Though the question is whether page size matters in the snapshot
> layout rather than what the loaded QEMU instance will use as backend.

IIUC, the page size information merely sets a constraint on the granularity
of unwritten (sparse) regions in the file. If we didn't want to express
the page size directly in the file format we would need explicit start/end
offsets for each written block. This is less convenient than just having
a bitmap, so I think it's ok to use the page-size bitmap.

> > > If postcopy might be an option, we'd want the page size to be the host page
> > > size because then looking up the bitmap will be straightforward, deciding
> > > whether we should copy over page (UFFDIO_COPY) or fill in with zeros
> > > (UFFDIO_ZEROPAGE).
> > 
> > This format is only intended for the case where we are migrating to
> > a random-access medium, aka a file, because the fixed RAM mappings
> > to disk mean that we need to seek back to the original location to
> > re-write pages that get dirtied. It isn't suitable for a live
> > migration stream, and thus postcopy is inherantly out of scope.
> 
> Yes, I've commented also in the cover letter, but I can expand a bit.
> 
> I mean support postcopy only when loading, but not when saving.
> 
> Saving to file definitely cannot work with postcopy because there's no dest
> qemu running.
> 
> Loading from file, OTOH, can work together with postcopy.

Ahh, I see what you mean.

> Right now AFAICT current approach is precopy loading the whole guest image
> with the supported snapshot format (if I can call it just a snapshot).
> 
> What I want to say is we can consider supporting postcopy on loading in
> that we start an "empty" QEMU dest node, when any page fault triggered we
> do it using userfault and lookup the snapshot file instead rather than
> sending a request back to the source.  I mentioned that because there'll be
> two major benefits which I mentioned in reply to the cover letter quickly,
> but I can also extend here:
> 
>   - Firstly, the snapshot format is ideally storing pages in linear
>     offsets, it means when we know some page missing we can use O(1) time
>     looking it up from the snapshot image.
> 
>   - Secondly, we don't need to let the page go through the wires, neither
>     do we need to send a request to src qemu or anyone.  What we need here
>     is simply test the bit on the snapshot bitmap, then:
> 
>     - If it is copied, do UFFDIO_COPY to resolve page faults,
>     - If it is not copied, do UFFDIO_ZEROPAGE (e.g., if not hugetlb,
>       hugetlb can use a fake UFFDIO_COPY)
> 
> So this is a perfect testing ground for using postcopy in a very efficient
> way against a file snapshot.

Yes, that's a nice unexpected benefit of this fixed-ram file format.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-31 14:52     ` Peter Xu
  2023-03-31 15:30       ` Fabiano Rosas
@ 2023-03-31 15:46       ` Daniel P. Berrangé
  1 sibling, 0 replies; 65+ messages in thread
From: Daniel P. Berrangé @ 2023-03-31 15:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela

On Fri, Mar 31, 2023 at 10:52:09AM -0400, Peter Xu wrote:
> On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> > >> Outgoing migration to file. NVMe disk. XFS filesystem.
> > >> 
> > >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> > >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> > >>   10m -v`:
> > >> 
> > >> migration type  | MB/s | pages/s |  ms
> > >> ----------------+------+---------+------
> > >> savevm io_uring |  434 |  102294 | 71473
> > >
> > > So I assume this is the non-live migration scenario.  Could you explain
> > > what does io_uring mean here?
> > >
> > 
> > This table is all non-live migration. This particular line is a snapshot
> > (hmp_savevm->save_snapshot). I thought it could be relevant because it
> > is another way by which we write RAM into disk.
> 
> I see, so if all non-live that explains, because I was curious what's the
> relationship between this feature and the live snapshot that QEMU also
> supports.
> 
> I also don't immediately see why savevm will be much slower, do you have an
> answer?  Maybe it's somewhere but I just overlooked..
> 
> IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> "we can stop the VM".  It smells slightly weird to build this on top of
> "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> this aspect (on why not building this on top of "savevm")?

Currently savevm covers saving memory, device state and disk snapshots
into the VM's disks, which basically means it only works
with qcow2.

Libvirt's save logic only cares about saving memory and device
state, and supports saving guests regardless of what storage is
used, saving it externally, apart from the disks.

This is only possible with 'migrate' today and so 'savevm' isn't
useful for this task from libvirt's POV.

In the past it has been suggested that the 'savevm' command as a concept
is actually redundant, and that we could in fact layer it on top of a
combination of migration and block snapshot APIs, e.g. if we had a
'blockdev:' migration protocol for saving the vmstate.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-31 15:30       ` Fabiano Rosas
@ 2023-03-31 15:55         ` Peter Xu
  2023-03-31 16:10           ` Daniel P. Berrangé
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2023-03-31 15:55 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> >> >> 
> >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> >> >>   10m -v`:
> >> >> 
> >> >> migration type  | MB/s | pages/s |  ms
> >> >> ----------------+------+---------+------
> >> >> savevm io_uring |  434 |  102294 | 71473
> >> >
> >> > So I assume this is the non-live migration scenario.  Could you explain
> >> > what does io_uring mean here?
> >> >
> >> 
> >> This table is all non-live migration. This particular line is a snapshot
> >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> >> is another way by which we write RAM into disk.
> >
> > I see, so if all non-live that explains, because I was curious what's the
> > relationship between this feature and the live snapshot that QEMU also
> > supports.
> >
> > I also don't immediately see why savevm will be much slower, do you have an
> > answer?  Maybe it's somewhere but I just overlooked..
> >
> 
> I don't have a concrete answer. I could take a jab and maybe blame the
> extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> of bandwidth limits?

IMHO it would be great if this could be investigated and the reasons provided
in the next cover letter.

> 
> > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> > "we can stop the VM".  It smells slightly weird to build this on top of
> > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> > this aspect (on why not building this on top of "savevm")?
> >
> 
> I share the same perception. I have done initial experiments with
> savevm, but I decided to carry on the work that was already started by
> others because my understanding of the problem was yet incomplete.
> 
> One point that has been raised is that the fixed-ram format alone does
> not bring that many performance improvements. So we'll need
> multi-threading and direct-io on top of it. Re-using multifd
> infrastructure seems like it could be a good idea.

The thing is, IMHO concurrency is not as hard if the VM is stopped and when
we're 100% sure locally where each page will go.

IOW, I think multifd provides a lot of features that may not really be
useful for this effort, while using it may already mean paying the overhead
of supporting those features.

For example, a major benefit of multifd is that it allows pages to be sent
out of order, so it indexes each page with a header.  I didn't read the
follow-up patches, but I assume that's not needed in this effort.

What I understand so far with fixed-ram is that we dump the whole ramblock
memory into a chunk at an offset of a file.  Couldn't that concurrency be
achieved easily by creating a bunch of threads dumping together during
the savevm, each passed a different range of guest ram & file offsets?
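
Roughly along these lines -- a sketch only, plain pthreads + pwrite(), all
names made up, no multifd infrastructure involved:

    #include <errno.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Sketch: dump one guest RAM range at its fixed file offset. */
    struct dump_job {
        int fd;               /* migration file */
        const uint8_t *host;  /* start of the RAM range in QEMU memory */
        uint64_t file_offset; /* fixed offset of that range in the file */
        size_t len;
    };

    static void *dump_worker(void *opaque)
    {
        struct dump_job *job = opaque;
        size_t done = 0;

        while (done < job->len) {
            ssize_t n = pwrite(job->fd, job->host + done, job->len - done,
                               job->file_offset + done);
            if (n <= 0) {
                return (void *)(intptr_t)(n < 0 ? -errno : -EIO);
            }
            done += n;
        }
        return NULL;
    }

One pthread_create() per RAM range, then join them all; since every writer
has its own fixed offset there is no shared stream position and no page
headers to serialize on.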

It's very possible that I overlooked a lot of things, but IMHO my point is
it would always be great to have a small section in the cover letter
discussing the pros and cons of the decision to use the "migrate" infra
rather than "savevm", because it still goes against the intuition of at
least some reviewers (like me..).  What I worry about is that this could be
implemented more efficiently and with fewer LOCs in savevm (and perhaps
also benefit normal savevm, so current savevm users could benefit from it
too), but we didn't do so because the project simply started out using QMP
migrate.  Any investigation to figure more of this out would be greatly
helpful.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-31 15:55         ` Peter Xu
@ 2023-03-31 16:10           ` Daniel P. Berrangé
  2023-03-31 16:27             ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Daniel P. Berrangé @ 2023-03-31 16:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela

On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
> On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> > >> >> 
> > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> > >> >>   10m -v`:
> > >> >> 
> > >> >> migration type  | MB/s | pages/s |  ms
> > >> >> ----------------+------+---------+------
> > >> >> savevm io_uring |  434 |  102294 | 71473
> > >> >
> > >> > So I assume this is the non-live migration scenario.  Could you explain
> > >> > what does io_uring mean here?
> > >> >
> > >> 
> > >> This table is all non-live migration. This particular line is a snapshot
> > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> > >> is another way by which we write RAM into disk.
> > >
> > > I see, so if all non-live that explains, because I was curious what's the
> > > relationship between this feature and the live snapshot that QEMU also
> > > supports.
> > >
> > > I also don't immediately see why savevm will be much slower, do you have an
> > > answer?  Maybe it's somewhere but I just overlooked..
> > >
> > 
> > I don't have a concrete answer. I could take a jab and maybe blame the
> > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> > of bandwidth limits?
> 
> IMHO it would be great if this can be investigated and reasons provided in
> the next cover letter.
> 
> > 
> > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> > > "we can stop the VM".  It smells slightly weird to build this on top of
> > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> > > this aspect (on why not building this on top of "savevm")?
> > >
> > 
> > I share the same perception. I have done initial experiments with
> > savevm, but I decided to carry on the work that was already started by
> > others because my understanding of the problem was yet incomplete.
> > 
> > One point that has been raised is that the fixed-ram format alone does
> > not bring that many performance improvements. So we'll need
> > multi-threading and direct-io on top of it. Re-using multifd
> > infrastructure seems like it could be a good idea.
> 
> The thing is IMHO concurrency is not as hard if VM stopped, and when we're
> 100% sure locally on where the page will go.

We shouldn't assume the VM is stopped though. When saving to the file
the VM may still be active. The fixed-ram format lets us re-write the
same memory location on disk multiple times in this case, thus avoiding
growth of the file size.

> IOW, I think multifd provides a lot of features that may not really be
> useful for this effort, meanwhile using those features may need to already
> pay for the overhead to support those features.
> 
> For example, a major benefit of multifd is it allows pages sent out of
> order, so it indexes the page as a header.  I didn't read the follow up
> patches, but I assume that's not needed in this effort.
> 
> What I understand so far with fixes-ram is we dump the whole ramblock
> memory into a chunk at offset of a file.  Can concurrency of that
> achievable easily by creating a bunch of threads dumping altogether during
> the savevm, with different offsets of guest ram & file passed over?

I feel like the migration code is already insanely complicated and
the many threads involved have caused no end of subtle bugs. 

It was Juan, I believe, who expressed a desire to entirely remove the
non-multifd code in the future, in order to reduce the maintenance burden.
IOW, ideally we would be pushing mgmt apps towards always using multifd,
even if they only ask it to create a single thread.

That would in turn suggest against creating new concurrency
mechanisms on top of non-multifd code, both to avoid adding yet
more complexity and also because it would make it harder to later
delete the non-multifd code.

On the libvirt side wrt fixed-ram, we could just use multifd
exclusively, as there should be no downside to it even for a
single FD.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability
  2023-03-31 15:34         ` Daniel P. Berrangé
@ 2023-03-31 16:13           ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2023-03-31 16:13 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster

On Fri, Mar 31, 2023 at 04:34:57PM +0100, Daniel P. Berrangé wrote:
> On Fri, Mar 31, 2023 at 10:39:23AM -0400, Peter Xu wrote:
> > On Fri, Mar 31, 2023 at 08:56:01AM +0100, Daniel P. Berrangé wrote:
> > > On Thu, Mar 30, 2023 at 06:01:51PM -0400, Peter Xu wrote:
> > > > On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> > > > > From: Nikolay Borisov <nborisov@suse.com>
> > > > > 
> > > > > Implement 'fixed-ram' feature. The core of the feature is to ensure that
> > > > > each ram page of the migration stream has a specific offset in the
> > > > > resulting migration stream. The reason why we'd want such behavior are
> > > > > two fold:
> > > > > 
> > > > >  - When doing a 'fixed-ram' migration the resulting file will have a
> > > > >    bounded size, since pages which are dirtied multiple times will
> > > > >    always go to a fixed location in the file, rather than constantly
> > > > >    being added to a sequential stream. This eliminates cases where a vm
> > > > >    with, say, 1G of ram can result in a migration file that's 10s of
> > > > >    GBs, provided that the workload constantly redirties memory.
> > > > > 
> > > > >  - It paves the way to implement DIO-enabled save/restore of the
> > > > >    migration stream as the pages are ensured to be written at aligned
> > > > >    offsets.
> > > > > 
> > > > > The feature requires changing the stream format. First, a bitmap is
> > > > > introduced which tracks which pages have been written (i.e are
> > > > > dirtied) during migration and subsequently it's being written in the
> > > > > resulting file, again at a fixed location for every ramblock. Zero
> > > > > pages are ignored as they'd be zero in the destination migration as
> > > > > well. With the changed format data would look like the following:
> > > > > 
> > > > > |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
> > > > 
> > > > What happens with huge pages?  Would page size matter here?
> > > > 
> > > > I would assume it's fine it uses a constant (small) page size, assuming
> > > > that should match with the granule that qemu tracks dirty (which IIUC is
> > > > the host page size not guest's).
> > > > 
> > > > But I didn't yet pay any further thoughts on that, maybe it would be
> > > > worthwhile in all cases to record page sizes here to be explicit or the
> > > > meaning of bitmap may not be clear (and then the bitmap_size will be a
> > > > field just for sanity check too).
> > > 
> > > I think recording the page sizes is an anti-feature in this case.
> > > 
> > > The migration format / state needs to reflect the guest ABI, but
> > > we need to be free to have different backend config behind that
> > > either side of the save/restore.
> > > 
> > > IOW, if I start a QEMU with 2 GB of RAM, I should be free to use
> > > small pages initially and after restore use 2 x 1 GB hugepages,
> > > or vica-verca.
> > > 
> > > The important thing with the pages that are saved into the file
> > > is that they are a 1:1 mapping guest RAM regions to file offsets.
> > > IOW, the 2 GB of guest RAM is always a contiguous 2 GB region
> > > in the file.
> > > 
> > > If the src VM used 1 GB pages, we would be writing a full 2 GB
> > > of data assuming both pages were dirty.
> > > 
> > > If the src VM used 4k pages, we would be writing some subset of
> > > the 2 GB of data, and the rest would be unwritten.
> > > 
> > > Either way, when reading back the data we restore it into either
> > > 1 GB pages of 4k pages, beause any places there were unwritten
> > > orignally  will read back as zeros.
> > 
> > I think there's already the page size information, because there's a bitmap
> > embeded in the format at least in the current proposal, and the bitmap can
> > only be defined with a page size provided in some form.
> > 
> > Here I agree the backend can change before/after a migration (live or
> > not).  Though the question is whether page size matters in the snapshot
> > layout rather than what the loaded QEMU instance will use as backend.
> 
> IIUC, the page size information merely sets a constraint on the granularity
> of unwritten (sparse) regions in the file. If we didn't want to express
> page size directly in the file format we would need explicit start/end
> offsets for each written block. This is less convenient that just having
> a bitmap, so I think its ok to use the page size bitmap

I'm perfectly fine with having the bitmap.  The original question was about
whether we should also store the page_size in the same header along with
the bitmap.

Currently I think the page size can be implied by the system configuration
(e.g. arch, cpu setup) and also by the size of the bitmap.  So I'm
wondering whether it'd be cleaner to replace the bitmap size with the page
size (so one can calculate the bitmap size from the page size), or to just
keep both of them for sanity.

Besides, since we seem to be defining a new header format to be stored on
disk, maybe it'll be worthwhile to leave some space for future extensions
of the image?

So the image format could start with a version field (perhaps also with
fields explaining what it contains).  Then if someday we want to extend the
image, the new QEMU binary will still be able to load the old image even if
the format changes.  Or vice versa, the old QEMU binary would be able to
identify that it's loading a newer image it doesn't really understand, and
properly notify the user rather than fail with weird loading errors.
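
For instance, something along these lines - field names and sizes are
purely illustrative, not a proposal for the actual layout in the series:

#include <stdint.h>

/* Hypothetical versioned header for a fixed-ram image; illustration only. */
struct fixed_ram_image_header {
    uint32_t magic;          /* identifies a fixed-ram migration file */
    uint32_t version;        /* bumped on incompatible format changes */
    uint64_t page_size;      /* granularity the dirty bitmap refers to */
    uint64_t bitmap_offset;  /* where the per-ramblock bitmap starts */
    uint64_t pages_offset;   /* where page data starts, aligned for O_DIRECT */
    /* new fields appended here; an old binary rejects a newer version,
     * a new binary keeps loading old versions */
};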

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-31 16:10           ` Daniel P. Berrangé
@ 2023-03-31 16:27             ` Peter Xu
  2023-03-31 18:18               ` Fabiano Rosas
  2023-04-18 16:58               ` Daniel P. Berrangé
  0 siblings, 2 replies; 65+ messages in thread
From: Peter Xu @ 2023-03-31 16:27 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela

On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
> On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
> > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> > > Peter Xu <peterx@redhat.com> writes:
> > > 
> > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> > > >> >> 
> > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> > > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> > > >> >>   10m -v`:
> > > >> >> 
> > > >> >> migration type  | MB/s | pages/s |  ms
> > > >> >> ----------------+------+---------+------
> > > >> >> savevm io_uring |  434 |  102294 | 71473
> > > >> >
> > > >> > So I assume this is the non-live migration scenario.  Could you explain
> > > >> > what does io_uring mean here?
> > > >> >
> > > >> 
> > > >> This table is all non-live migration. This particular line is a snapshot
> > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> > > >> is another way by which we write RAM into disk.
> > > >
> > > > I see, so if all non-live that explains, because I was curious what's the
> > > > relationship between this feature and the live snapshot that QEMU also
> > > > supports.
> > > >
> > > > I also don't immediately see why savevm will be much slower, do you have an
> > > > answer?  Maybe it's somewhere but I just overlooked..
> > > >
> > > 
> > > I don't have a concrete answer. I could take a jab and maybe blame the
> > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> > > of bandwidth limits?
> > 
> > IMHO it would be great if this can be investigated and reasons provided in
> > the next cover letter.
> > 
> > > 
> > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> > > > "we can stop the VM".  It smells slightly weird to build this on top of
> > > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> > > > this aspect (on why not building this on top of "savevm")?
> > > >
> > > 
> > > I share the same perception. I have done initial experiments with
> > > savevm, but I decided to carry on the work that was already started by
> > > others because my understanding of the problem was yet incomplete.
> > > 
> > > One point that has been raised is that the fixed-ram format alone does
> > > not bring that many performance improvements. So we'll need
> > > multi-threading and direct-io on top of it. Re-using multifd
> > > infrastructure seems like it could be a good idea.
> > 
> > The thing is IMHO concurrency is not as hard if VM stopped, and when we're
> > 100% sure locally on where the page will go.
> 
> We shouldn't assume the VM is stopped though. When saving to the file
> the VM may still be active. The fixed-ram format lets us re-write the
> same memory location on disk multiple times in this case, thus avoiding
> growth of the file size.

Before discussing reusing multifd below, I now have a major confusion about
the use case of the feature..

The question is whether we would like to stop the VM after fixed-ram
migration completes.  I'm asking because:

  1. If it will stop, then it looks like a "VM suspend" to me. If so, could
     anyone help explain why we don't stop the VM first then migrate?
     Because it avoids copying single pages multiple times, no fiddling
     with dirty tracking at all - we just don't ever track anything.  In
     short, we'll stop the VM anyway, then why not stop it slightly
     earlier?

  2. If it will not stop, then it's "VM live snapshot" to me.  We already
     have that, don't we?  That's more efficient because it'll wr-protect
     all guest pages, any write triggers a CoW, and we only copy the guest
     pages once and for all.

Either way, there's no need to copy any page more than once.  Did I perhaps
miss something very important?

I would guess it's option (1) above, because it seems we don't snapshot the
disk alongside.  But I am really not sure now..

> 
> > IOW, I think multifd provides a lot of features that may not really be
> > useful for this effort, meanwhile using those features may need to already
> > pay for the overhead to support those features.
> > 
> > For example, a major benefit of multifd is it allows pages sent out of
> > order, so it indexes the page as a header.  I didn't read the follow up
> > patches, but I assume that's not needed in this effort.
> > 
> > What I understand so far with fixes-ram is we dump the whole ramblock
> > memory into a chunk at offset of a file.  Can concurrency of that
> > achievable easily by creating a bunch of threads dumping altogether during
> > the savevm, with different offsets of guest ram & file passed over?
> 
> I feel like the migration code is already insanely complicated and
> the many threads involved have caused no end of subtle bugs. 
> 
> It was Juan I believe who expressed a desire to entirely remove
> non-multifd code in the future, in order to reduce the maint burden.
> IOW, ideally we would be pushing mgmt apps towards always using
> multifd at all times, even if they only ask it to create 1 single
> thread.
> 
> That would in turn suggest against creating new concurrency
> mechanisms on top of non-multifd code, both to avoid adding yet
> more complexity and also because it would make it harder to later
> delete the non-multifd code.
> 
> On the libvirt side wrt fixed-ram, we could just use multifd
> exclusively, as there should be no downside to it even for a
> single FD.
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-31 16:27             ` Peter Xu
@ 2023-03-31 18:18               ` Fabiano Rosas
  2023-03-31 21:52                 ` Peter Xu
  2023-04-18 16:58               ` Daniel P. Berrangé
  1 sibling, 1 reply; 65+ messages in thread
From: Fabiano Rosas @ 2023-03-31 18:18 UTC (permalink / raw)
  To: Peter Xu, Daniel P. Berrangé
  Cc: qemu-devel, Claudio Fontana, jfehlig, dfaggioli, dgilbert, Juan Quintela

Peter Xu <peterx@redhat.com> writes:

> On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
>> On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
>> > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
>> > > Peter Xu <peterx@redhat.com> writes:
>> > > 
>> > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
>> > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
>> > > >> >> 
>> > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
>> > > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
>> > > >> >>   10m -v`:
>> > > >> >> 
>> > > >> >> migration type  | MB/s | pages/s |  ms
>> > > >> >> ----------------+------+---------+------
>> > > >> >> savevm io_uring |  434 |  102294 | 71473
>> > > >> >
>> > > >> > So I assume this is the non-live migration scenario.  Could you explain
>> > > >> > what does io_uring mean here?
>> > > >> >
>> > > >> 
>> > > >> This table is all non-live migration. This particular line is a snapshot
>> > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
>> > > >> is another way by which we write RAM into disk.
>> > > >
>> > > > I see, so if all non-live that explains, because I was curious what's the
>> > > > relationship between this feature and the live snapshot that QEMU also
>> > > > supports.
>> > > >
>> > > > I also don't immediately see why savevm will be much slower, do you have an
>> > > > answer?  Maybe it's somewhere but I just overlooked..
>> > > >
>> > > 
>> > > I don't have a concrete answer. I could take a jab and maybe blame the
>> > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
>> > > of bandwidth limits?
>> > 
>> > IMHO it would be great if this can be investigated and reasons provided in
>> > the next cover letter.
>> > 
>> > > 
>> > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
>> > > > "we can stop the VM".  It smells slightly weird to build this on top of
>> > > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
>> > > > this aspect (on why not building this on top of "savevm")?
>> > > >
>> > > 
>> > > I share the same perception. I have done initial experiments with
>> > > savevm, but I decided to carry on the work that was already started by
>> > > others because my understanding of the problem was yet incomplete.
>> > > 
>> > > One point that has been raised is that the fixed-ram format alone does
>> > > not bring that many performance improvements. So we'll need
>> > > multi-threading and direct-io on top of it. Re-using multifd
>> > > infrastructure seems like it could be a good idea.
>> > 
>> > The thing is IMHO concurrency is not as hard if VM stopped, and when we're
>> > 100% sure locally on where the page will go.
>> 
>> We shouldn't assume the VM is stopped though. When saving to the file
>> the VM may still be active. The fixed-ram format lets us re-write the
>> same memory location on disk multiple times in this case, thus avoiding
>> growth of the file size.
>
> Before discussing on reusing multifd below, now I have a major confusing on
> the use case of the feature..
>
> The question is whether we would like to stop the VM after fixed-ram
> migration completes.  I'm asking because:
>

We would.

>   1. If it will stop, then it looks like a "VM suspend" to me. If so, could
>      anyone help explain why we don't stop the VM first then migrate?
>      Because it avoids copying single pages multiple times, no fiddling
>      with dirty tracking at all - we just don't ever track anything.  In
>      short, we'll stop the VM anyway, then why not stop it slightly
>      earlier?
>

Looking at the previous discussions I don't see explicit mentions of a
requirement either way (stop before or stop after). I agree it makes
more sense to stop the guest first and then migrate without having to
deal with dirty pages.

I presume libvirt just migrates without altering the guest run state so
we implemented this to work in both scenarios. But even then, it seems
QEMU could store the current VM state, stop it, migrate and restore the
state on the destination.

I might be missing context here since I wasn't around when this work
started. Someone correct me if I'm wrong please.

>   2. If it will not stop, then it's "VM live snapshot" to me.  We have
>      that, aren't we?  That's more efficient because it'll wr-protect all
>      guest pages, any write triggers a CoW and we only copy the guest pages
>      once and for all.
>
> Either way to go, there's no need to copy any page more than once.  Did I
> miss anything perhaps very important?
>
> I would guess it's option (1) above, because it seems we don't snapshot the
> disk alongside.  But I am really not sure now..
>


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-31 18:18               ` Fabiano Rosas
@ 2023-03-31 21:52                 ` Peter Xu
  2023-04-03  7:47                   ` Claudio Fontana
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2023-03-31 21:52 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Daniel P. Berrangé,
	qemu-devel, Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Juan Quintela

On Fri, Mar 31, 2023 at 03:18:37PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
> >> On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
> >> > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> >> > > Peter Xu <peterx@redhat.com> writes:
> >> > > 
> >> > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> >> > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> >> > > >> >> 
> >> > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> >> > > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> >> > > >> >>   10m -v`:
> >> > > >> >> 
> >> > > >> >> migration type  | MB/s | pages/s |  ms
> >> > > >> >> ----------------+------+---------+------
> >> > > >> >> savevm io_uring |  434 |  102294 | 71473
> >> > > >> >
> >> > > >> > So I assume this is the non-live migration scenario.  Could you explain
> >> > > >> > what does io_uring mean here?
> >> > > >> >
> >> > > >> 
> >> > > >> This table is all non-live migration. This particular line is a snapshot
> >> > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> >> > > >> is another way by which we write RAM into disk.
> >> > > >
> >> > > > I see, so if all non-live that explains, because I was curious what's the
> >> > > > relationship between this feature and the live snapshot that QEMU also
> >> > > > supports.
> >> > > >
> >> > > > I also don't immediately see why savevm will be much slower, do you have an
> >> > > > answer?  Maybe it's somewhere but I just overlooked..
> >> > > >
> >> > > 
> >> > > I don't have a concrete answer. I could take a jab and maybe blame the
> >> > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> >> > > of bandwidth limits?
> >> > 
> >> > IMHO it would be great if this can be investigated and reasons provided in
> >> > the next cover letter.
> >> > 
> >> > > 
> >> > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> >> > > > "we can stop the VM".  It smells slightly weird to build this on top of
> >> > > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> >> > > > this aspect (on why not building this on top of "savevm")?
> >> > > >
> >> > > 
> >> > > I share the same perception. I have done initial experiments with
> >> > > savevm, but I decided to carry on the work that was already started by
> >> > > others because my understanding of the problem was yet incomplete.
> >> > > 
> >> > > One point that has been raised is that the fixed-ram format alone does
> >> > > not bring that many performance improvements. So we'll need
> >> > > multi-threading and direct-io on top of it. Re-using multifd
> >> > > infrastructure seems like it could be a good idea.
> >> > 
> >> > The thing is IMHO concurrency is not as hard if VM stopped, and when we're
> >> > 100% sure locally on where the page will go.
> >> 
> >> We shouldn't assume the VM is stopped though. When saving to the file
> >> the VM may still be active. The fixed-ram format lets us re-write the
> >> same memory location on disk multiple times in this case, thus avoiding
> >> growth of the file size.
> >
> > Before discussing on reusing multifd below, now I have a major confusing on
> > the use case of the feature..
> >
> > The question is whether we would like to stop the VM after fixed-ram
> > migration completes.  I'm asking because:
> >
> 
> We would.
> 
> >   1. If it will stop, then it looks like a "VM suspend" to me. If so, could
> >      anyone help explain why we don't stop the VM first then migrate?
> >      Because it avoids copying single pages multiple times, no fiddling
> >      with dirty tracking at all - we just don't ever track anything.  In
> >      short, we'll stop the VM anyway, then why not stop it slightly
> >      earlier?
> >
> 
> Looking at the previous discussions I don't see explicit mentions of a
> requirement either way (stop before or stop after). I agree it makes
> more sense to stop the guest first and then migrate without having to
> deal with dirty pages.
> 
> I presume libvirt just migrates without altering the guest run state so
> we implemented this to work in both scenarios. But even then, it seems
> QEMU could store the current VM state, stop it, migrate and restore the
> state on the destination.

Yes, I can understand that having a unified interface for libvirt would be
great in this case.  So I am personally not against reusing the QMP command
"migrate" if that would help in any way from libvirt's PoV.

However, this is an important question to be answered with certainty before
building more things on top.  IOW, even if we reuse QMP migrate, we could
consider a totally different implementation (e.g. not reusing the migration
thread model).

As I mentioned above, it seems ideal to always stop the VM, so the stop
could be part of the command (unlike normal QMP migrate); then it gets
closer to save_snapshot(), which has the vm_stop().  We should make sure
that when the user uses the new command it always does that, because that's
the most performant option (compared to enabling dirty tracking and
migrating live).
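
To illustrate the ordering I mean, a pseudo-code sketch - vm_stop() and
vm_start() are the existing helpers, while fixed_ram_save() and
vmstate_save_devices() are made-up names standing in for whatever the real
save path ends up being:

/* Pseudo-code sketch only; fixed_ram_save() and vmstate_save_devices()
 * are invented names, not existing QEMU functions. */
static int save_to_file_stopped(const char *filename)
{
    int ret;

    /* Stop vcpus first: no dirty tracking, each page is copied exactly once. */
    vm_stop(RUN_STATE_SAVE_VM);

    ret = fixed_ram_save(filename);           /* RAM at fixed file offsets */
    if (ret == 0) {
        ret = vmstate_save_devices(filename); /* device state after the RAM */
    }

    if (ret < 0) {
        /* Resume only on failure; on success the VM stays stopped,
         * matching the suspend-style use case. */
        vm_start();
    }
    return ret;
}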

> 
> I might be missing context here since I wasn't around when this work
> started. Someone correct me if I'm wrong please.

Yes, it would be great if someone can help clarify.

Thanks,

> 
> >   2. If it will not stop, then it's "VM live snapshot" to me.  We have
> >      that, aren't we?  That's more efficient because it'll wr-protect all
> >      guest pages, any write triggers a CoW and we only copy the guest pages
> >      once and for all.
> >
> > Either way to go, there's no need to copy any page more than once.  Did I
> > miss anything perhaps very important?
> >
> > I would guess it's option (1) above, because it seems we don't snapshot the
> > disk alongside.  But I am really not sure now..
> >
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (26 preceding siblings ...)
  2023-03-30 21:41 ` [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Peter Xu
@ 2023-04-03  7:38 ` David Hildenbrand
  2023-04-03 14:41   ` Fabiano Rosas
  27 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2023-04-03  7:38 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

On 30.03.23 20:03, Fabiano Rosas wrote:
> Hi folks,
> 
> I'm continuing the work done last year to add a new format of
> migration stream that can be used to migrate large guests to a single
> file in a performant way.
> 
> This is an early RFC with the previous code + my additions to support
> multifd and direct IO. Let me know what you think!
> 
> Here are the reference links for previous discussions:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
> 
> The series has 4 main parts:
> 
> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>     same as "exec:cat > mig". Patches 1-4 implement this;
> 
> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>     pages at their relative offsets in the migration file. This saves
>     space on the worst case of RAM utilization because every page has a
>     fixed offset in the migration file and (potentially) saves us time
>     because we could write pages independently in parallel. It also
>     gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>     implement this;
> 
> With patches 1-13 these two^ can be used with:
> 
> (qemu) migrate_set_capability fixed-ram on
> (qemu) migrate[_incoming] file:mig

There are some use cases (especially virtio-mem, but also virtio-balloon 
with free-page-hinting) where we end up having very sparse guest RAM. We 
don't want to have such "memory without meaning" in the migration stream 
nor restore it on the destination.

Would that still be supported with the new format? For example, have a 
sparse VM savefile and remember which ranges actually contain reasonable 
data?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-31 21:52                 ` Peter Xu
@ 2023-04-03  7:47                   ` Claudio Fontana
  2023-04-03 19:26                     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Claudio Fontana @ 2023-04-03  7:47 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Daniel P. Berrangé,
	qemu-devel, jfehlig, dfaggioli, dgilbert, Juan Quintela

On 3/31/23 23:52, Peter Xu wrote:
> On Fri, Mar 31, 2023 at 03:18:37PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>>> On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
>>>> On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
>>>>> On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
>>>>>> Peter Xu <peterx@redhat.com> writes:
>>>>>>
>>>>>>> On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
>>>>>>>>>> Outgoing migration to file. NVMe disk. XFS filesystem.
>>>>>>>>>>
>>>>>>>>>> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
>>>>>>>>>>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
>>>>>>>>>>   10m -v`:
>>>>>>>>>>
>>>>>>>>>> migration type  | MB/s | pages/s |  ms
>>>>>>>>>> ----------------+------+---------+------
>>>>>>>>>> savevm io_uring |  434 |  102294 | 71473
>>>>>>>>>
>>>>>>>>> So I assume this is the non-live migration scenario.  Could you explain
>>>>>>>>> what does io_uring mean here?
>>>>>>>>>
>>>>>>>>
>>>>>>>> This table is all non-live migration. This particular line is a snapshot
>>>>>>>> (hmp_savevm->save_snapshot). I thought it could be relevant because it
>>>>>>>> is another way by which we write RAM into disk.
>>>>>>>
>>>>>>> I see, so if all non-live that explains, because I was curious what's the
>>>>>>> relationship between this feature and the live snapshot that QEMU also
>>>>>>> supports.
>>>>>>>
>>>>>>> I also don't immediately see why savevm will be much slower, do you have an
>>>>>>> answer?  Maybe it's somewhere but I just overlooked..
>>>>>>>
>>>>>>
>>>>>> I don't have a concrete answer. I could take a jab and maybe blame the
>>>>>> extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
>>>>>> of bandwidth limits?
>>>>>
>>>>> IMHO it would be great if this can be investigated and reasons provided in
>>>>> the next cover letter.
>>>>>
>>>>>>
>>>>>>> IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
>>>>>>> "we can stop the VM".  It smells slightly weird to build this on top of
>>>>>>> "migrate" from that pov, rather than "savevm", though.  Any thoughts on
>>>>>>> this aspect (on why not building this on top of "savevm")?
>>>>>>>
>>>>>>
>>>>>> I share the same perception. I have done initial experiments with
>>>>>> savevm, but I decided to carry on the work that was already started by
>>>>>> others because my understanding of the problem was yet incomplete.
>>>>>>
>>>>>> One point that has been raised is that the fixed-ram format alone does
>>>>>> not bring that many performance improvements. So we'll need
>>>>>> multi-threading and direct-io on top of it. Re-using multifd
>>>>>> infrastructure seems like it could be a good idea.
>>>>>
>>>>> The thing is IMHO concurrency is not as hard if VM stopped, and when we're
>>>>> 100% sure locally on where the page will go.
>>>>
>>>> We shouldn't assume the VM is stopped though. When saving to the file
>>>> the VM may still be active. The fixed-ram format lets us re-write the
>>>> same memory location on disk multiple times in this case, thus avoiding
>>>> growth of the file size.
>>>
>>> Before discussing on reusing multifd below, now I have a major confusing on
>>> the use case of the feature..
>>>
>>> The question is whether we would like to stop the VM after fixed-ram
>>> migration completes.  I'm asking because:
>>>
>>
>> We would.
>>
>>>   1. If it will stop, then it looks like a "VM suspend" to me. If so, could
>>>      anyone help explain why we don't stop the VM first then migrate?
>>>      Because it avoids copying single pages multiple times, no fiddling
>>>      with dirty tracking at all - we just don't ever track anything.  In
>>>      short, we'll stop the VM anyway, then why not stop it slightly
>>>      earlier?
>>>
>>
>> Looking at the previous discussions I don't see explicit mentions of a
>> requirement either way (stop before or stop after). I agree it makes
>> more sense to stop the guest first and then migrate without having to
>> deal with dirty pages.
>>
>> I presume libvirt just migrates without altering the guest run state so
>> we implemented this to work in both scenarios. But even then, it seems
>> QEMU could store the current VM state, stop it, migrate and restore the
>> state on the destination.
> 
> Yes, I can understand having a unified interface for libvirt would be great
> in this case.  So I am personally not against reusing qmp command "migrate"
> if that would help in any case from libvirt pov.
> 
> However this is an important question to be answered very sure before
> building more things on top.  IOW, even if reusing QMP migrate, we could
> consider a totally different impl (e.g. don't reuse migration thread model).
> 
> As I mentioned above it seems just ideal we always stop the VM so it could
> be part of the command (unlike normal QMP migrate), then it's getting more
> like save_snapshot() as there's the vm_stop().  We should make sure when
> the user uses the new cmd it'll always do that because that's the most
> performant (comparing to enabling dirty tracking and live migrate).
> 
>>
>> I might be missing context here since I wasn't around when this work
>> started. Someone correct me if I'm wrong please.


Hi, I'm not sure whether what is asked for here is context in terms of the previous upstream discussions or in terms of the specific requirement we are trying to bring upstream.

In terms of the specific requirement we are trying to bring upstream, we need to get libvirt+QEMU VM save and restore functionality to be able to transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.
When an event trigger happens, the VM needs to be quickly paused and saved to disk safely, including datasync, and another VM needs to be restored, also in ~5 secs.
For our specific requirement, the VM is never running when its data (mostly consisting of RAM) is saved.

I understand that the need to also handle the "live" case comes from upstream discussions about solving the "general case", where someone might want to do this for "live" VMs, but if helpful I want to highlight that it is not part of the specific requirement we are trying to address,
and it will not be in the future either for this specific case, as the whole point of the trigger is to replace the running VM with another VM, so it cannot be kept running.

The reason we are using "migrate" here likely stems from the fact that existing libvirt code currently uses QMP migrate to implement the save and restore commands.
And in my personal view, reusing the existing building blocks (migration, multifd) would be preferable, to avoid having to maintain two separate ways to do the same thing.

That said, it could be done in a different way if the performance can keep up; I'm just thinking of reducing the overall effort and the maintenance surface.

Ciao,

Claudio

> 
> Yes, it would be great if someone can help clarify.
> 
> Thanks,
> 
>>
>>>   2. If it will not stop, then it's "VM live snapshot" to me.  We have
>>>      that, aren't we?  That's more efficient because it'll wr-protect all
>>>      guest pages, any write triggers a CoW and we only copy the guest pages
>>>      once and for all.
>>>
>>> Either way to go, there's no need to copy any page more than once.  Did I
>>> miss anything perhaps very important?
>>>
>>> I would guess it's option (1) above, because it seems we don't snapshot the
>>> disk alongside.  But I am really not sure now..
>>>
>>
> 



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-03  7:38 ` David Hildenbrand
@ 2023-04-03 14:41   ` Fabiano Rosas
  2023-04-03 16:24     ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Fabiano Rosas @ 2023-04-03 14:41 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

David Hildenbrand <david@redhat.com> writes:

> On 30.03.23 20:03, Fabiano Rosas wrote:
>> Hi folks,
>> 
>> I'm continuing the work done last year to add a new format of
>> migration stream that can be used to migrate large guests to a single
>> file in a performant way.
>> 
>> This is an early RFC with the previous code + my additions to support
>> multifd and direct IO. Let me know what you think!
>> 
>> Here are the reference links for previous discussions:
>> 
>> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
>> 
>> The series has 4 main parts:
>> 
>> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>>     same as "exec:cat > mig". Patches 1-4 implement this;
>> 
>> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>>     pages at their relative offsets in the migration file. This saves
>>     space on the worst case of RAM utilization because every page has a
>>     fixed offset in the migration file and (potentially) saves us time
>>     because we could write pages independently in parallel. It also
>>     gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>>     implement this;
>> 
>> With patches 1-13 these two^ can be used with:
>> 
>> (qemu) migrate_set_capability fixed-ram on
>> (qemu) migrate[_incoming] file:mig
>
> There are some use cases (especially virtio-mem, but also virtio-balloon 
> with free-page-hinting) where we end up having very sparse guest RAM. We 
> don't want to have such "memory without meaning" in the migration stream 
> nor restore it on the destination.
>

Is that what is currently defined by ramblock_page_is_discarded ->
virtio_mem_rdm_is_populated ?

> Would that still be supported with the new format? For example, have a 
> sparse VM savefile and remember which ranges actually contain reasonable 
> data?

We do ignore zero pages, so I don't think it would be an issue to have
another criterion for ignoring pages. It seems that if we do enable
postcopy load w/ fixed-ram, that would already be handled in
postcopy_request_page.
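
Roughly, the check when populating the fixed-ram bitmap could look
something like the sketch below - ramblock_page_is_discarded() is the
helper mentioned above, the other calls mirror the existing zero-page
handling, and the function name and exact signatures are just illustrative:

/* Sketch: should this page be written to the file and set in the
 * fixed-ram bitmap?  Discarded (virtio-mem) ranges and zero pages are
 * skipped; exact helpers may differ in the final patches. */
static bool fixed_ram_page_needs_saving(RAMBlock *rb, ram_addr_t offset)
{
    if (ramblock_page_is_discarded(rb, offset)) {
        return false;  /* "memory without meaning": neither save nor restore */
    }
    if (buffer_is_zero(ramblock_ptr(rb, offset), qemu_target_page_size())) {
        return false;  /* reads back as zeros from the sparse file anyway */
    }
    return true;
}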


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-03 14:41   ` Fabiano Rosas
@ 2023-04-03 16:24     ` David Hildenbrand
  2023-04-03 16:36       ` Fabiano Rosas
  0 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2023-04-03 16:24 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

On 03.04.23 16:41, Fabiano Rosas wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 30.03.23 20:03, Fabiano Rosas wrote:
>>> Hi folks,
>>>
>>> I'm continuing the work done last year to add a new format of
>>> migration stream that can be used to migrate large guests to a single
>>> file in a performant way.
>>>
>>> This is an early RFC with the previous code + my additions to support
>>> multifd and direct IO. Let me know what you think!
>>>
>>> Here are the reference links for previous discussions:
>>>
>>> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
>>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
>>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
>>>
>>> The series has 4 main parts:
>>>
>>> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>>>      same as "exec:cat > mig". Patches 1-4 implement this;
>>>
>>> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>>>      pages at their relative offsets in the migration file. This saves
>>>      space on the worst case of RAM utilization because every page has a
>>>      fixed offset in the migration file and (potentially) saves us time
>>>      because we could write pages independently in parallel. It also
>>>      gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>>>      implement this;
>>>
>>> With patches 1-13 these two^ can be used with:
>>>
>>> (qemu) migrate_set_capability fixed-ram on
>>> (qemu) migrate[_incoming] file:mig
>>
>> There are some use cases (especially virtio-mem, but also virtio-balloon
>> with free-page-hinting) where we end up having very sparse guest RAM. We
>> don't want to have such "memory without meaning" in the migration stream
>> nor restore it on the destination.
>>
> 
> Is that what is currently defined by ramblock_page_is_discarded ->
> virtio_mem_rdm_is_populated ?

For virtio-mem, yes. For virtio-balloon we communicate that information 
via qemu_guest_free_page_hint().

> 
>> Would that still be supported with the new format? For example, have a
>> sparse VM savefile and remember which ranges actually contain reasonable
>> data?
> 
> We do ignore zero pages, so I don't think it would be an issue to have
> another criteria for ignoring pages. It seems if we do enable postcopy
> load w/ fixed-ram that would be already handled in postcopy_request_page.

Ok, good. Just to note that we do have migration of sparse RAM blocks
working, and if fixed-ram were incompatible we'd have to fence it.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-03 16:24     ` David Hildenbrand
@ 2023-04-03 16:36       ` Fabiano Rosas
  0 siblings, 0 replies; 65+ messages in thread
From: Fabiano Rosas @ 2023-04-03 16:36 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Claudio Fontana, jfehlig, dfaggioli, dgilbert,
	Daniel P . Berrangé,
	Juan Quintela

David Hildenbrand <david@redhat.com> writes:

> On 03.04.23 16:41, Fabiano Rosas wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>>> On 30.03.23 20:03, Fabiano Rosas wrote:
>>>> Hi folks,
>>>>
>>>> I'm continuing the work done last year to add a new format of
>>>> migration stream that can be used to migrate large guests to a single
>>>> file in a performant way.
>>>>
>>>> This is an early RFC with the previous code + my additions to support
>>>> multifd and direct IO. Let me know what you think!
>>>>
>>>> Here are the reference links for previous discussions:
>>>>
>>>> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
>>>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
>>>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
>>>>
>>>> The series has 4 main parts:
>>>>
>>>> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>>>>      same as "exec:cat > mig". Patches 1-4 implement this;
>>>>
>>>> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>>>>      pages at their relative offsets in the migration file. This saves
>>>>      space on the worst case of RAM utilization because every page has a
>>>>      fixed offset in the migration file and (potentially) saves us time
>>>>      because we could write pages independently in parallel. It also
>>>>      gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>>>>      implement this;
>>>>
>>>> With patches 1-13 these two^ can be used with:
>>>>
>>>> (qemu) migrate_set_capability fixed-ram on
>>>> (qemu) migrate[_incoming] file:mig
>>>
>>> There are some use cases (especially virtio-mem, but also virtio-balloon
>>> with free-page-hinting) where we end up having very sparse guest RAM. We
>>> don't want to have such "memory without meaning" in the migration stream
>>> nor restore it on the destination.
>>>
>> 
>> Is that what is currently defined by ramblock_page_is_discarded ->
>> virtio_mem_rdm_is_populated ?
>
> For virtio-mem, yes. For virtio-balloon we communicate that information 
> via qemu_guest_free_page_hint().
>
>> 
>>> Would that still be supported with the new format? For example, have a
>>> sparse VM savefile and remember which ranges actually contain reasonable
>>> data?
>> 
>> We do ignore zero pages, so I don't think it would be an issue to have
>> another criteria for ignoring pages. It seems if we do enable postcopy
>> load w/ fixed-ram that would be already handled in postcopy_request_page.
>
> Ok, good. Just to note that we do have migration of sparse RAM blocks 
> working and if fixed-ram would be incompatible we'd have to fence it.

Yep, thanks for the heads-up. I'll keep that in mind.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-03  7:47                   ` Claudio Fontana
@ 2023-04-03 19:26                     ` Peter Xu
  2023-04-04  8:00                       ` Claudio Fontana
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2023-04-03 19:26 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: Fabiano Rosas, Daniel P. Berrangé,
	qemu-devel, jfehlig, dfaggioli, dgilbert, Juan Quintela

Hi, Claudio,

Thanks for the context.

On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
> Hi, not sure if what is asked here is context in terms of the previous
> upstream discussions or our specific requirement we are trying to bring
> upstream.
>
> In terms of the specific requirement we are trying to bring upstream, we
> need to get libvirt+QEMU VM save and restore functionality to be able to
> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
> event trigger happens, the VM needs to be quickly paused and saved to
> disk safely, including datasync, and another VM needs to be restored,
> also in ~5 secs.  For our specific requirement, the VM is never running
> when its data (mostly consisting of RAM) is saved.
>
> I understand that the need to handle also the "live" case comes from
> upstream discussions about solving the "general case", where someone
> might want to do this for "live" VMs, but if helpful I want to highlight
> that it is not part of the specific requirement we are trying to address,
> and for this specific case won't also in the future, as the whole point
> of the trigger is to replace the running VM with another VM, so it cannot
> be kept running.

From what I read so far, that scenario suits exactly what a live snapshot
would do with current QEMU - that at least should involve a snapshot of the
disks being used, or I can't see how it can be live.  So it looks like a
separate request.

> The reason we are using "migrate" here likely stems from the fact that
> existing libvirt code currently uses QMP migrate to implement the save
> and restore commands.  And in my personal view, I think that reusing the
> existing building blocks (migration, multifd) would be preferable, to
> avoid having to maintain two separate ways to do the same thing.  That
> said, it could be done in a different way, if the performance can keep
> up. Just thinking of reducing the overall effort and also maintenance
> surface.

I would vaguely guess the performance can not only keep up but be better
than what the current solution would provide, due to the possibility of (1)
batch handling of contiguous guest pages, and (2) no dirty tracking
overhead at all.

For (2), it's not about wr-protect page faults or vmexits due to PML being
full (because vcpus will be stopped anyway..), it's about enabling the
dirty tracking (which already contains overhead, especially when huge pages
are enabled, to split huge pages in EPT pgtables) and all the bitmap
operations QEMU does during live migration even if the VM is not live.

IMHO reusing multifd may or may not be a good idea here, because it'll of
course also complicate the multifd code, hence make multifd harder to
maintain, and not in a good way, since as I mentioned I don't think this
effort can use much of what multifd provides.

I don't have a strong opinion on the impl (even though I do have a
preference..), but I think at least we should still check on two things:

  - Being crystal clear on the use case above, and double check whether "VM
    stop" should be the default operation at the start of the new cmd - we
    shouldn't assume the user will be aware of doing this, neither should
    we assume the user is aware of the performance implications.

  - Making sure the image layout is well defined, so:

    - It'll be extensible in the future, and,

    - If someone would like to refactor it to not use the migration thread
      model anymore, the image format can hopefully be kept untouched, so
      that it stays compatible with the current approach.

Just my two cents. I think Juan should have the best grasp on this.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-03 19:26                     ` Peter Xu
@ 2023-04-04  8:00                       ` Claudio Fontana
  2023-04-04 14:53                         ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Claudio Fontana @ 2023-04-04  8:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Daniel P. Berrangé,
	qemu-devel, jfehlig, dfaggioli, dgilbert, Juan Quintela

Hi Peter,

On 4/3/23 21:26, Peter Xu wrote:
> Hi, Claudio,
> 
> Thanks for the context.
> 
> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
>> Hi, not sure if what is asked here is context in terms of the previous
>> upstream discussions or our specific requirement we are trying to bring
>> upstream.
>>
>> In terms of the specific requirement we are trying to bring upstream, we
>> need to get libvirt+QEMU VM save and restore functionality to be able to
>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
>> event trigger happens, the VM needs to be quickly paused and saved to
>> disk safely, including datasync, and another VM needs to be restored,
>> also in ~5 secs.  For our specific requirement, the VM is never running
>> when its data (mostly consisting of RAM) is saved.
>>
>> I understand that the need to handle also the "live" case comes from
>> upstream discussions about solving the "general case", where someone
>> might want to do this for "live" VMs, but if helpful I want to highlight
>> that it is not part of the specific requirement we are trying to address,
>> and for this specific case won't also in the future, as the whole point
>> of the trigger is to replace the running VM with another VM, so it cannot
>> be kept running.
> 
> From what I read so far, that scenario suites exactly what live snapshot
> would do with current QEMU - that at least should involve a snapshot on the
> disks being used or I can't see how that can be live.  So it looks like a
> separate request.
> 
>> The reason we are using "migrate" here likely stems from the fact that
>> existing libvirt code currently uses QMP migrate to implement the save
>> and restore commands.  And in my personal view, I think that reusing the
>> existing building blocks (migration, multifd) would be preferable, to
>> avoid having to maintain two separate ways to do the same thing.  That
>> said, it could be done in a different way, if the performance can keep
>> up. Just thinking of reducing the overall effort and also maintenance
>> surface.
> 
> I would vaguely guess the performance can not only keep up but better than
> what the current solution would provide, due to the possibility of (1)
> batch handling of continuous guest pages, and (2) completely no dirty
> tracking overhead.
> 
> For (2), it's not about wr-protect page faults or vmexits due to PML being
> full (because vcpus will be stopped anyway..), it's about enabling the
> dirty tracking (which already contains overhead, especially when huge pages
> are enabled, to split huge pages in EPT pgtables) and all the bitmap
> operations QEMU does during live migration even if the VM is not live.

Something we could profile for; I do not remember it being a really important source of overhead in my previous profiling runs,
but it may be worthwhile to redo the profiling with Fabiano's patchset.

> 
> IMHO reusing multifd may or may not be a good idea here, because it'll of
> course also complicate multifd code, hence makes multifd harder to
> maintain, while not in a good way, because as I mentioned I don't think it
> can use much of what multifd provides.


The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.

Reimplementing the same thing again has the potential to cause bitrot for this use case. Using multiple fds for the transfer is exactly what is needed here,
and in my understanding that is the exact reason multifd exists: to take advantage of high-bandwidth migration channels.

The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
so it seems a very natural extension of multifd to me.
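
Conceptually it is only a question of where the channel comes from, i.e. something
like this (hand-waving, not the actual patch code, just to illustrate; the file
name is only an example):

    Error *local_err = NULL;
    QIOChannelFile *ioc;

    /* open the output file instead of connecting a socket */
    ioc = qio_channel_file_new_path("migfile", O_WRONLY | O_CREAT, 0600,
                                    &local_err);
    /* ...then hand QIO_CHANNEL(ioc) to the multifd send thread setup,
     * where a socket channel would normally be used */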

> 
> I don't have a strong opinion on the impl (even though I do have a
> preference..), but I think at least we should still check on two things:
> 
>   - Being crystal clear on the use case above, and double check whether "VM
>     stop" should be the default operation at the start of the new cmd - we
>     shouldn't assume the user will be aware of doing this, neither should
>     we assume the user is aware of the performance implications.


Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
Probably I missed something there.

> 
>   - Making sure the image layout is well defined, so:
> 
>     - It'll be extensible in the future, and,
> 
>     - If someone would like to refactor it to not use the migration thread
>       model anymore, the image format, hopefully, can be easy to keep
>       untouched so it can be compatible with the current approach.
> 
> Just my two cents. I think Juan should have the best grasp on this.
> 
> Thanks,
> 

Ciao,

Claudio



* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-04  8:00                       ` Claudio Fontana
@ 2023-04-04 14:53                         ` Peter Xu
  2023-04-04 15:10                           ` Claudio Fontana
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2023-04-04 14:53 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: Fabiano Rosas, Daniel P. Berrangé,
	qemu-devel, jfehlig, dfaggioli, dgilbert, Juan Quintela

On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
> Hi Peter,

Hi, Claudio,

> 
> On 4/3/23 21:26, Peter Xu wrote:
> > Hi, Claudio,
> > 
> > Thanks for the context.
> > 
> > On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
> >> Hi, not sure if what is asked here is context in terms of the previous
> >> upstream discussions or our specific requirement we are trying to bring
> >> upstream.
> >>
> >> In terms of the specific requirement we are trying to bring upstream, we
> >> need to get libvirt+QEMU VM save and restore functionality to be able to
> >> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
> >> event trigger happens, the VM needs to be quickly paused and saved to
> >> disk safely, including datasync, and another VM needs to be restored,
> >> also in ~5 secs.  For our specific requirement, the VM is never running
> >> when its data (mostly consisting of RAM) is saved.
> >>
> >> I understand that the need to handle also the "live" case comes from
> >> upstream discussions about solving the "general case", where someone
> >> might want to do this for "live" VMs, but if helpful I want to highlight
> >> that it is not part of the specific requirement we are trying to address,
> >> and for this specific case won't also in the future, as the whole point
> >> of the trigger is to replace the running VM with another VM, so it cannot
> >> be kept running.
> > 
> > From what I read so far, that scenario suites exactly what live snapshot
> > would do with current QEMU - that at least should involve a snapshot on the
> > disks being used or I can't see how that can be live.  So it looks like a
> > separate request.
> > 
> >> The reason we are using "migrate" here likely stems from the fact that
> >> existing libvirt code currently uses QMP migrate to implement the save
> >> and restore commands.  And in my personal view, I think that reusing the
> >> existing building blocks (migration, multifd) would be preferable, to
> >> avoid having to maintain two separate ways to do the same thing.  That
> >> said, it could be done in a different way, if the performance can keep
> >> up. Just thinking of reducing the overall effort and also maintenance
> >> surface.
> > 
> > I would vaguely guess the performance can not only keep up but better than
> > what the current solution would provide, due to the possibility of (1)
> > batch handling of continuous guest pages, and (2) completely no dirty
> > tracking overhead.
> > 
> > For (2), it's not about wr-protect page faults or vmexits due to PML being
> > full (because vcpus will be stopped anyway..), it's about enabling the
> > dirty tracking (which already contains overhead, especially when huge pages
> > are enabled, to split huge pages in EPT pgtables) and all the bitmap
> > operations QEMU does during live migration even if the VM is not live.
> 
> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
> but maybe worthwhile redoing the profiling with Fabiano's patchset.

Yes, I don't know the detailed numbers either; it should depend on the guest
configuration (mem size, mem type, kernel version etc).  It could be less of a
concern compared to the time used elsewhere.  More on this below.

> 
> > 
> > IMHO reusing multifd may or may not be a good idea here, because it'll of
> > course also complicate multifd code, hence makes multifd harder to
> > maintain, while not in a good way, because as I mentioned I don't think it
> > can use much of what multifd provides.
> 
> 
> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
> 
> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
> 
> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
> so it seems a very natural extension of multifd to me.

Yes - since I haven't looked at the multifd patches at all, I don't have a
solid clue on how much it'll affect multifd.  I'll leave that to Juan.

> 
> > 
> > I don't have a strong opinion on the impl (even though I do have a
> > preference..), but I think at least we should still check on two things:
> > 
> >   - Being crystal clear on the use case above, and double check whether "VM
> >     stop" should be the default operation at the start of the new cmd - we
> >     shouldn't assume the user will be aware of doing this, neither should
> >     we assume the user is aware of the performance implications.
> 
> 
> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
> Probably I missed something there.

Yes, then IMHO as mentioned we should make "vm stop" part of the command
procedure if vm was still running when invoked.  Then we can already
optimize dirty logging of above (2) with the current framework. E.g., we
already optimized live snapshot to not enable dirty logging:

        if (!migrate_background_snapshot()) {
            memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
            migration_bitmap_sync_precopy(rs);
        }

Maybe that can also be done for fixed-ram migration, so no matter how much
overhead there will be, that can be avoided.
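
Concretely, the existing check could perhaps be extended like this (untested;
migrate_fixed_ram() here just stands for whatever the capability helper in
this series ends up being called):

        /* skip dirty logging when the guest is saved with vcpus stopped */
        if (!migrate_background_snapshot() && !migrate_fixed_ram()) {
            memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
            migration_bitmap_sync_precopy(rs);
        }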

PS: I think similar optimizations can also be done in ram_save_complete() or
ram_state_pending_exact()... maybe we should move the check into
migration_bitmap_sync_precopy() so the whole sync can be skipped when possible.

Thanks,

> 
> > 
> >   - Making sure the image layout is well defined, so:
> > 
> >     - It'll be extensible in the future, and,
> > 
> >     - If someone would like to refactor it to not use the migration thread
> >       model anymore, the image format, hopefully, can be easy to keep
> >       untouched so it can be compatible with the current approach.
> > 
> > Just my two cents. I think Juan should have the best grasp on this.
> > 
> > Thanks,
> > 
> 
> Ciao,
> 
> Claudio
> 

-- 
Peter Xu




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-04 14:53                         ` Peter Xu
@ 2023-04-04 15:10                           ` Claudio Fontana
  2023-04-04 15:56                             ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Claudio Fontana @ 2023-04-04 15:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Daniel P. Berrangé,
	qemu-devel, jfehlig, dfaggioli, dgilbert, Juan Quintela

On 4/4/23 16:53, Peter Xu wrote:
> On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
>> Hi Peter,
> 
> Hi, Claudio,
> 
>>
>> On 4/3/23 21:26, Peter Xu wrote:
>>> Hi, Claudio,
>>>
>>> Thanks for the context.
>>>
>>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
>>>> Hi, not sure if what is asked here is context in terms of the previous
>>>> upstream discussions or our specific requirement we are trying to bring
>>>> upstream.
>>>>
>>>> In terms of the specific requirement we are trying to bring upstream, we
>>>> need to get libvirt+QEMU VM save and restore functionality to be able to
>>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
>>>> event trigger happens, the VM needs to be quickly paused and saved to
>>>> disk safely, including datasync, and another VM needs to be restored,
>>>> also in ~5 secs.  For our specific requirement, the VM is never running
>>>> when its data (mostly consisting of RAM) is saved.
>>>>
>>>> I understand that the need to handle also the "live" case comes from
>>>> upstream discussions about solving the "general case", where someone
>>>> might want to do this for "live" VMs, but if helpful I want to highlight
>>>> that it is not part of the specific requirement we are trying to address,
>>>> and for this specific case won't also in the future, as the whole point
>>>> of the trigger is to replace the running VM with another VM, so it cannot
>>>> be kept running.
>>>
>>> From what I read so far, that scenario suites exactly what live snapshot
>>> would do with current QEMU - that at least should involve a snapshot on the
>>> disks being used or I can't see how that can be live.  So it looks like a
>>> separate request.
>>>
>>>> The reason we are using "migrate" here likely stems from the fact that
>>>> existing libvirt code currently uses QMP migrate to implement the save
>>>> and restore commands.  And in my personal view, I think that reusing the
>>>> existing building blocks (migration, multifd) would be preferable, to
>>>> avoid having to maintain two separate ways to do the same thing.  That
>>>> said, it could be done in a different way, if the performance can keep
>>>> up. Just thinking of reducing the overall effort and also maintenance
>>>> surface.
>>>
>>> I would vaguely guess the performance can not only keep up but better than
>>> what the current solution would provide, due to the possibility of (1)
>>> batch handling of continuous guest pages, and (2) completely no dirty
>>> tracking overhead.
>>>
>>> For (2), it's not about wr-protect page faults or vmexits due to PML being
>>> full (because vcpus will be stopped anyway..), it's about enabling the
>>> dirty tracking (which already contains overhead, especially when huge pages
>>> are enabled, to split huge pages in EPT pgtables) and all the bitmap
>>> operations QEMU does during live migration even if the VM is not live.
>>
>> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
>> but maybe worthwhile redoing the profiling with Fabiano's patchset.
> 
> Yes I don't know the detailed number either, it should depend on the guest
> configuration (mem size, mem type, kernel version etc).  It could be less a
> concern comparing to the time used elsewhere.  More on this on below.
> 
>>
>>>
>>> IMHO reusing multifd may or may not be a good idea here, because it'll of
>>> course also complicate multifd code, hence makes multifd harder to
>>> maintain, while not in a good way, because as I mentioned I don't think it
>>> can use much of what multifd provides.
>>
>>
>> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
>>
>> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
>> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
>>
>> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
>> so it seems a very natural extension of multifd to me.
> 
> Yes, since I haven't looked at the multifd patches at all so I don't have
> solid clue on how much it'll affect multifd.  I'll leave that to Juan.
> 
>>
>>>
>>> I don't have a strong opinion on the impl (even though I do have a
>>> preference..), but I think at least we should still check on two things:
>>>
>>>   - Being crystal clear on the use case above, and double check whether "VM
>>>     stop" should be the default operation at the start of the new cmd - we
>>>     shouldn't assume the user will be aware of doing this, neither should
>>>     we assume the user is aware of the performance implications.
>>
>>
>> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
>> Probably I missed something there.
> 
> Yes, then IMHO as mentioned we should make "vm stop" part of the command
> procedure if vm was still running when invoked.  Then we can already
> optimize dirty logging of above (2) with the current framework. E.g., we
> already optimized live snapshot to not enable dirty logging:
> 
>         if (!migrate_background_snapshot()) {
>             memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
>             migration_bitmap_sync_precopy(rs);
>         }
> 
> Maybe that can also be done for fixed-ram migration, so no matter how much
> overhead there will be, that can be avoided.

Understood, agree.

Would it make sense to check for something like if (!runstate_is_running())
instead of checking for the specific multifd + fixed-ram feature?

I think from a high level perspective, there should not be dirtying if the vcpus are not running right?
This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running.
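
Something along these lines, purely to illustrate the idea (untested):

        /* if the vcpus are not running, nothing should be dirtying guest RAM */
        if (runstate_is_running()) {
            memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
            migration_bitmap_sync_precopy(rs);
        }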

> 
> PS: I think similar optimizations can be done too in ram_save_complete() or
> ram_state_pending_exact().. maybe we should move the check into
> migration_bitmap_sync_precopy() so it can be skipped as a whole when it can.

makes sense, interesting.

I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used.

this feature was added by:

commit fbd162e629aaf8a7e464af44d2f73d06b26428ad
Author: Yury Kotov <yury-kotov@yandex-team.ru>
Date:   Fri Feb 15 20:45:46 2019 +0300

    migration: Add an ability to ignore shared RAM blocks
    
    If ignore-shared capability is set then skip shared RAMBlocks during the
    RAM migration.
    Also, move qemu_ram_foreach_migratable_block (and rename) to the
    migration code, because it requires access to the migration capabilities.
    
    Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru>
    Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru>
    Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
    Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
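
If I remember correctly, the check itself boils down to something like this
(quoting from memory, so it may be slightly off):

    return !qemu_ram_is_migratable(block) ||
           (migrate_ignore_shared() && qemu_ram_is_shared(block));

so the cost likely comes from how often it gets called rather than from the
check itself.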

Probably not that important, just to mention since we were thinking of possible small optimizations.
I would like to share the complete previous callgrind data, but I cannot find a way to export it in a readable state; I could export the graph as a PDF though, if helpful.

Likely we'd need a new round of measurements with perf...

Ciao,

Claudio

> 
> Thanks,
> 
>>
>>>
>>>   - Making sure the image layout is well defined, so:
>>>
>>>     - It'll be extensible in the future, and,
>>>
>>>     - If someone would like to refactor it to not use the migration thread
>>>       model anymore, the image format, hopefully, can be easy to keep
>>>       untouched so it can be compatible with the current approach.
>>>
>>> Just my two cents. I think Juan should have the best grasp on this.
>>>
>>> Thanks,
>>>
>>
>> Ciao,
>>
>> Claudio
>>
> 




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-04 15:10                           ` Claudio Fontana
@ 2023-04-04 15:56                             ` Peter Xu
  2023-04-06 16:46                               ` Fabiano Rosas
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2023-04-04 15:56 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: Fabiano Rosas, Daniel P. Berrangé,
	qemu-devel, jfehlig, dfaggioli, dgilbert, Juan Quintela

On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote:
> On 4/4/23 16:53, Peter Xu wrote:
> > On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
> >> Hi Peter,
> > 
> > Hi, Claudio,
> > 
> >>
> >> On 4/3/23 21:26, Peter Xu wrote:
> >>> Hi, Claudio,
> >>>
> >>> Thanks for the context.
> >>>
> >>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
> >>>> Hi, not sure if what is asked here is context in terms of the previous
> >>>> upstream discussions or our specific requirement we are trying to bring
> >>>> upstream.
> >>>>
> >>>> In terms of the specific requirement we are trying to bring upstream, we
> >>>> need to get libvirt+QEMU VM save and restore functionality to be able to
> >>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
> >>>> event trigger happens, the VM needs to be quickly paused and saved to
> >>>> disk safely, including datasync, and another VM needs to be restored,
> >>>> also in ~5 secs.  For our specific requirement, the VM is never running
> >>>> when its data (mostly consisting of RAM) is saved.
> >>>>
> >>>> I understand that the need to handle also the "live" case comes from
> >>>> upstream discussions about solving the "general case", where someone
> >>>> might want to do this for "live" VMs, but if helpful I want to highlight
> >>>> that it is not part of the specific requirement we are trying to address,
> >>>> and for this specific case won't also in the future, as the whole point
> >>>> of the trigger is to replace the running VM with another VM, so it cannot
> >>>> be kept running.
> >>>
> >>> From what I read so far, that scenario suites exactly what live snapshot
> >>> would do with current QEMU - that at least should involve a snapshot on the
> >>> disks being used or I can't see how that can be live.  So it looks like a
> >>> separate request.
> >>>
> >>>> The reason we are using "migrate" here likely stems from the fact that
> >>>> existing libvirt code currently uses QMP migrate to implement the save
> >>>> and restore commands.  And in my personal view, I think that reusing the
> >>>> existing building blocks (migration, multifd) would be preferable, to
> >>>> avoid having to maintain two separate ways to do the same thing.  That
> >>>> said, it could be done in a different way, if the performance can keep
> >>>> up. Just thinking of reducing the overall effort and also maintenance
> >>>> surface.
> >>>
> >>> I would vaguely guess the performance can not only keep up but better than
> >>> what the current solution would provide, due to the possibility of (1)
> >>> batch handling of continuous guest pages, and (2) completely no dirty
> >>> tracking overhead.
> >>>
> >>> For (2), it's not about wr-protect page faults or vmexits due to PML being
> >>> full (because vcpus will be stopped anyway..), it's about enabling the
> >>> dirty tracking (which already contains overhead, especially when huge pages
> >>> are enabled, to split huge pages in EPT pgtables) and all the bitmap
> >>> operations QEMU does during live migration even if the VM is not live.
> >>
> >> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
> >> but maybe worthwhile redoing the profiling with Fabiano's patchset.
> > 
> > Yes I don't know the detailed number either, it should depend on the guest
> > configuration (mem size, mem type, kernel version etc).  It could be less a
> > concern comparing to the time used elsewhere.  More on this on below.
> > 
> >>
> >>>
> >>> IMHO reusing multifd may or may not be a good idea here, because it'll of
> >>> course also complicate multifd code, hence makes multifd harder to
> >>> maintain, while not in a good way, because as I mentioned I don't think it
> >>> can use much of what multifd provides.
> >>
> >>
> >> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
> >>
> >> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
> >> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
> >>
> >> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
> >> so it seems a very natural extension of multifd to me.
> > 
> > Yes, since I haven't looked at the multifd patches at all so I don't have
> > solid clue on how much it'll affect multifd.  I'll leave that to Juan.
> > 
> >>
> >>>
> >>> I don't have a strong opinion on the impl (even though I do have a
> >>> preference..), but I think at least we should still check on two things:
> >>>
> >>>   - Being crystal clear on the use case above, and double check whether "VM
> >>>     stop" should be the default operation at the start of the new cmd - we
> >>>     shouldn't assume the user will be aware of doing this, neither should
> >>>     we assume the user is aware of the performance implications.
> >>
> >>
> >> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
> >> Probably I missed something there.
> > 
> > Yes, then IMHO as mentioned we should make "vm stop" part of the command
> > procedure if vm was still running when invoked.  Then we can already
> > optimize dirty logging of above (2) with the current framework. E.g., we
> > already optimized live snapshot to not enable dirty logging:
> > 
> >         if (!migrate_background_snapshot()) {
> >             memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
> >             migration_bitmap_sync_precopy(rs);
> >         }
> > 
> > Maybe that can also be done for fixed-ram migration, so no matter how much
> > overhead there will be, that can be avoided.
> 
> Understood, agree.
> 
> Would it make sense to check for something like if (!runstate_is_running())
> instead of checking for the specific multifd + fixed-ram feature?
> 
> I think from a high level perspective, there should not be dirtying if the vcpus are not running right?
> This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running.

Hmm, I'm not sure.  I think we still allow users to stop/start VMs during
migration?  If so, it's probably not applicable.

And it won't cover live snapshot either - live snapshot always runs with the
VM running, but it doesn't need this kind of dirty tracking.  It actually
needs to track dirty pages, but in a synchronous way to make it efficient
(while KVM dirty tracking is asynchronous, i.e. a vcpu won't be blocked when
it dirties a page).

So here we can make it "if (migrate_needs_async_dirty_tracking())", and have
both live snapshot and fixed-ram migration covered in the helper to opt out
of dirty tracking.
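
A rough sketch of that helper (untested; migrate_fixed_ram() is a placeholder
for the capability helper from this series), to be used in place of the plain
!migrate_background_snapshot() check:

        static bool migrate_needs_async_dirty_tracking(void)
        {
            /*
             * Background snapshot tracks dirty pages synchronously, and
             * fixed-ram saves the VM with the vcpus stopped, so neither
             * needs the asynchronous dirty tracking.
             */
            return !migrate_background_snapshot() && !migrate_fixed_ram();
        }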

One thing worth keeping an eye on here is that if we go that way we need to
make sure the VM won't be started during the fixed-ram migration.  IOW, we
can cancel the fixed-ram migration (in this case more suitably called
"vm suspend") if the user starts the VM during the process.

> 
> > 
> > PS: I think similar optimizations can be done too in ram_save_complete() or
> > ram_state_pending_exact().. maybe we should move the check into
> > migration_bitmap_sync_precopy() so it can be skipped as a whole when it can.
> 
> makes sense, interesting.
> 
> I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used.

Do you mean we can skip dirty tracking when ramblock_is_ignored() for a
ramblock?  I think it's doable but it'll be slightly more involved, because
ignored/shared ramblocks can be used together with private/non-ignored
ramblocks, hence at least it's not applicable globally.

> 
> this feature was added by:
> 
> commit fbd162e629aaf8a7e464af44d2f73d06b26428ad
> Author: Yury Kotov <yury-kotov@yandex-team.ru>
> Date:   Fri Feb 15 20:45:46 2019 +0300
> 
>     migration: Add an ability to ignore shared RAM blocks
>     
>     If ignore-shared capability is set then skip shared RAMBlocks during the
>     RAM migration.
>     Also, move qemu_ram_foreach_migratable_block (and rename) to the
>     migration code, because it requires access to the migration capabilities.
>     
>     Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru>
>     Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru>
>     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>     Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Probably not that important, just to mention since we were thinking of possible small optimizations.
> I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful.
> 
> Likely we'd need a new round of measurements with perf...

Yes, it would be good to know. That said, I think it'll also be fine if
optimizations are done on top, as long as the change is compatible
with the interface being proposed.

Here e.g. "stop the VM within the cmd" is part of the interface, so IMHO it
should be decided before this series gets merged.

Thanks.

> 
> Ciao,
> 
> Claudio
> 
> > 
> > Thanks,
> > 
> >>
> >>>
> >>>   - Making sure the image layout is well defined, so:
> >>>
> >>>     - It'll be extensible in the future, and,
> >>>
> >>>     - If someone would like to refactor it to not use the migration thread
> >>>       model anymore, the image format, hopefully, can be easy to keep
> >>>       untouched so it can be compatible with the current approach.
> >>>
> >>> Just my two cents. I think Juan should have the best grasp on this.
> >>>
> >>> Thanks,
> >>>
> >>
> >> Ciao,
> >>
> >> Claudio
> >>
> > 
> 

-- 
Peter Xu




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-04 15:56                             ` Peter Xu
@ 2023-04-06 16:46                               ` Fabiano Rosas
  2023-04-07 10:36                                 ` Claudio Fontana
  0 siblings, 1 reply; 65+ messages in thread
From: Fabiano Rosas @ 2023-04-06 16:46 UTC (permalink / raw)
  To: Peter Xu, Claudio Fontana
  Cc: Daniel P. Berrangé,
	qemu-devel, jfehlig, dfaggioli, dgilbert, Juan Quintela

Peter Xu <peterx@redhat.com> writes:

> On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote:
>> On 4/4/23 16:53, Peter Xu wrote:
>> > On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
>> >> Hi Peter,
>> > 
>> > Hi, Claudio,
>> > 
>> >>
>> >> On 4/3/23 21:26, Peter Xu wrote:
>> >>> Hi, Claudio,
>> >>>
>> >>> Thanks for the context.
>> >>>
>> >>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
>> >>>> Hi, not sure if what is asked here is context in terms of the previous
>> >>>> upstream discussions or our specific requirement we are trying to bring
>> >>>> upstream.
>> >>>>
>> >>>> In terms of the specific requirement we are trying to bring upstream, we
>> >>>> need to get libvirt+QEMU VM save and restore functionality to be able to
>> >>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
>> >>>> event trigger happens, the VM needs to be quickly paused and saved to
>> >>>> disk safely, including datasync, and another VM needs to be restored,
>> >>>> also in ~5 secs.  For our specific requirement, the VM is never running
>> >>>> when its data (mostly consisting of RAM) is saved.
>> >>>>
>> >>>> I understand that the need to handle also the "live" case comes from
>> >>>> upstream discussions about solving the "general case", where someone
>> >>>> might want to do this for "live" VMs, but if helpful I want to highlight
>> >>>> that it is not part of the specific requirement we are trying to address,
>> >>>> and for this specific case won't also in the future, as the whole point
>> >>>> of the trigger is to replace the running VM with another VM, so it cannot
>> >>>> be kept running.
>> >>>
>> >>> From what I read so far, that scenario suites exactly what live snapshot
>> >>> would do with current QEMU - that at least should involve a snapshot on the
>> >>> disks being used or I can't see how that can be live.  So it looks like a
>> >>> separate request.
>> >>>
>> >>>> The reason we are using "migrate" here likely stems from the fact that
>> >>>> existing libvirt code currently uses QMP migrate to implement the save
>> >>>> and restore commands.  And in my personal view, I think that reusing the
>> >>>> existing building blocks (migration, multifd) would be preferable, to
>> >>>> avoid having to maintain two separate ways to do the same thing.  That
>> >>>> said, it could be done in a different way, if the performance can keep
>> >>>> up. Just thinking of reducing the overall effort and also maintenance
>> >>>> surface.
>> >>>
>> >>> I would vaguely guess the performance can not only keep up but better than
>> >>> what the current solution would provide, due to the possibility of (1)
>> >>> batch handling of continuous guest pages, and (2) completely no dirty
>> >>> tracking overhead.
>> >>>
>> >>> For (2), it's not about wr-protect page faults or vmexits due to PML being
>> >>> full (because vcpus will be stopped anyway..), it's about enabling the
>> >>> dirty tracking (which already contains overhead, especially when huge pages
>> >>> are enabled, to split huge pages in EPT pgtables) and all the bitmap
>> >>> operations QEMU does during live migration even if the VM is not live.
>> >>
>> >> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
>> >> but maybe worthwhile redoing the profiling with Fabiano's patchset.
>> > 
>> > Yes I don't know the detailed number either, it should depend on the guest
>> > configuration (mem size, mem type, kernel version etc).  It could be less a
>> > concern comparing to the time used elsewhere.  More on this on below.
>> > 
>> >>
>> >>>
>> >>> IMHO reusing multifd may or may not be a good idea here, because it'll of
>> >>> course also complicate multifd code, hence makes multifd harder to
>> >>> maintain, while not in a good way, because as I mentioned I don't think it
>> >>> can use much of what multifd provides.
>> >>
>> >>
>> >> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
>> >>
>> >> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
>> >> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
>> >>
>> >> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
>> >> so it seems a very natural extension of multifd to me.
>> > 
>> > Yes, since I haven't looked at the multifd patches at all so I don't have
>> > solid clue on how much it'll affect multifd.  I'll leave that to Juan.
>> > 
>> >>
>> >>>
>> >>> I don't have a strong opinion on the impl (even though I do have a
>> >>> preference..), but I think at least we should still check on two things:
>> >>>
>> >>>   - Being crystal clear on the use case above, and double check whether "VM
>> >>>     stop" should be the default operation at the start of the new cmd - we
>> >>>     shouldn't assume the user will be aware of doing this, neither should
>> >>>     we assume the user is aware of the performance implications.
>> >>
>> >>
>> >> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
>> >> Probably I missed something there.
>> > 
>> > Yes, then IMHO as mentioned we should make "vm stop" part of the command
>> > procedure if vm was still running when invoked.  Then we can already
>> > optimize dirty logging of above (2) with the current framework. E.g., we
>> > already optimized live snapshot to not enable dirty logging:
>> > 
>> >         if (!migrate_background_snapshot()) {
>> >             memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
>> >             migration_bitmap_sync_precopy(rs);
>> >         }
>> > 
>> > Maybe that can also be done for fixed-ram migration, so no matter how much
>> > overhead there will be, that can be avoided.
>> 
>> Understood, agree.
>> 
>> Would it make sense to check for something like if (!runstate_is_running())
>> instead of checking for the specific multifd + fixed-ram feature?
>> 
>> I think from a high level perspective, there should not be dirtying if the vcpus are not running right?
>> This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running.
>
> Hmm I'm not sure.  I think we still allow use to stop/start VMs during
> migration?  If so, probably not applicable.
>
> And it won't cover live snapshot too - live snapshot always run with VM
> running, but it doesn't need to track dirty.  It actually needs to track
> dirty, but in a synchronous way to make it efficient (while kvm dirty
> tracking is asynchronous, aka, vcpu won't be blocked if dirtied).
>
> So here we can make it "if (migrate_needs_async_dirty_tracking())", and
> having both live snapshot and fixed-ram migration covered in the helper to
> opt-out dirty tracking.
>
> One thing worth keeping an eye here is if we go that way we need to make
> sure VM won't be started during the fixed-ram migration.  IOW, we can
> cancel the fixed-ram migration (in this case, more suitable to be called
> "vm suspend") if the user starts the VM during the process.
>
>> 
>> > 
>> > PS: I think similar optimizations can be done too in ram_save_complete() or
>> > ram_state_pending_exact().. maybe we should move the check into
>> > migration_bitmap_sync_precopy() so it can be skipped as a whole when it can.
>> 
>> makes sense, interesting.
>> 
>> I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used.
>
> Do you mean we can skip dirty tracking when ramblock_is_ignored() for a
> ramblock?  I think it's doable but it'll be slightly more involved, because
> ignored/shared ramblocks can be used together with private/non-ignored
> ramblocks, hence at least it's not applicable globally.
>
>> 
>> this feature was added by:
>> 
>> commit fbd162e629aaf8a7e464af44d2f73d06b26428ad
>> Author: Yury Kotov <yury-kotov@yandex-team.ru>
>> Date:   Fri Feb 15 20:45:46 2019 +0300
>> 
>>     migration: Add an ability to ignore shared RAM blocks
>>     
>>     If ignore-shared capability is set then skip shared RAMBlocks during the
>>     RAM migration.
>>     Also, move qemu_ram_foreach_migratable_block (and rename) to the
>>     migration code, because it requires access to the migration capabilities.
>>     
>>     Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru>
>>     Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru>
>>     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>     Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> 
>> Probably not that important, just to mention since we were thinking of possible small optimizations.
>> I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful.
>> 
>> Likely we'd need a new round of measurements with perf...
>
> Yes it would be good to know. Said that, I think it'll also be fine if
> optimizations are done on top, as long as the change will be compatible
> with the interface being proposed.
>
> Here e.g. "stop the VM within the cmd" is part of the interface so IMHO it
> should be decided before this series got merged.
>

Ok, so in summary, the high level requirement says we need to stop the
VM and we've determined that stopping it before the migration is what
probably makes more sense.

Keeping in mind that the design of fixed-ram already supports live
migration, I see three options for the interface so far:

1) Add a new command that does vm_stop + fixed-ram migrate;

2) Arbitrarily declare that fixed-ram is always non-live and hardcode
   that;

3) Add a new migration capability "live migration", ON by default and
   have the management layer set fixed-ram=on, live-migration=off.

I guess this also largely depends on what direction we're going with the
migration code in general. I.e. do we prefer a more isolated
implementation or keep the new feature flexible for future use-cases?

I'll give people time to catch up and in the meantime work on adding the
stop and the safeguards around the user re-starting.
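
For the stop part I'm thinking of something roughly like this (untested, error
handling omitted; migrate_fixed_ram() being the capability helper, and whether
we stop unconditionally depends on which option above we pick):

    if (migrate_fixed_ram() && runstate_is_running()) {
        /* stop the guest before saving, effectively a "vm suspend" */
        vm_stop(RUN_STATE_PAUSED);
    }

plus a check to fail or cancel the migration if the user re-starts the VM
while it is in progress.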

Thanks all for the input so far.



* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-06 16:46                               ` Fabiano Rosas
@ 2023-04-07 10:36                                 ` Claudio Fontana
  2023-04-11 15:48                                   ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Claudio Fontana @ 2023-04-07 10:36 UTC (permalink / raw)
  To: Fabiano Rosas, Peter Xu
  Cc: Daniel P. Berrangé,
	qemu-devel, jfehlig, dfaggioli, dgilbert, Juan Quintela

On 4/6/23 18:46, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote:
>>> On 4/4/23 16:53, Peter Xu wrote:
>>>> On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
>>>>> Hi Peter,
>>>>
>>>> Hi, Claudio,
>>>>
>>>>>
>>>>> On 4/3/23 21:26, Peter Xu wrote:
>>>>>> Hi, Claudio,
>>>>>>
>>>>>> Thanks for the context.
>>>>>>
>>>>>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
>>>>>>> Hi, not sure if what is asked here is context in terms of the previous
>>>>>>> upstream discussions or our specific requirement we are trying to bring
>>>>>>> upstream.
>>>>>>>
>>>>>>> In terms of the specific requirement we are trying to bring upstream, we
>>>>>>> need to get libvirt+QEMU VM save and restore functionality to be able to
>>>>>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
>>>>>>> event trigger happens, the VM needs to be quickly paused and saved to
>>>>>>> disk safely, including datasync, and another VM needs to be restored,
>>>>>>> also in ~5 secs.  For our specific requirement, the VM is never running
>>>>>>> when its data (mostly consisting of RAM) is saved.
>>>>>>>
>>>>>>> I understand that the need to handle also the "live" case comes from
>>>>>>> upstream discussions about solving the "general case", where someone
>>>>>>> might want to do this for "live" VMs, but if helpful I want to highlight
>>>>>>> that it is not part of the specific requirement we are trying to address,
>>>>>>> and for this specific case won't also in the future, as the whole point
>>>>>>> of the trigger is to replace the running VM with another VM, so it cannot
>>>>>>> be kept running.
>>>>>>
>>>>>> From what I read so far, that scenario suites exactly what live snapshot
>>>>>> would do with current QEMU - that at least should involve a snapshot on the
>>>>>> disks being used or I can't see how that can be live.  So it looks like a
>>>>>> separate request.
>>>>>>
>>>>>>> The reason we are using "migrate" here likely stems from the fact that
>>>>>>> existing libvirt code currently uses QMP migrate to implement the save
>>>>>>> and restore commands.  And in my personal view, I think that reusing the
>>>>>>> existing building blocks (migration, multifd) would be preferable, to
>>>>>>> avoid having to maintain two separate ways to do the same thing.  That
>>>>>>> said, it could be done in a different way, if the performance can keep
>>>>>>> up. Just thinking of reducing the overall effort and also maintenance
>>>>>>> surface.
>>>>>>
>>>>>> I would vaguely guess the performance can not only keep up but better than
>>>>>> what the current solution would provide, due to the possibility of (1)
>>>>>> batch handling of continuous guest pages, and (2) completely no dirty
>>>>>> tracking overhead.
>>>>>>
>>>>>> For (2), it's not about wr-protect page faults or vmexits due to PML being
>>>>>> full (because vcpus will be stopped anyway..), it's about enabling the
>>>>>> dirty tracking (which already contains overhead, especially when huge pages
>>>>>> are enabled, to split huge pages in EPT pgtables) and all the bitmap
>>>>>> operations QEMU does during live migration even if the VM is not live.
>>>>>
>>>>> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
>>>>> but maybe worthwhile redoing the profiling with Fabiano's patchset.
>>>>
>>>> Yes I don't know the detailed number either, it should depend on the guest
>>>> configuration (mem size, mem type, kernel version etc).  It could be less a
>>>> concern comparing to the time used elsewhere.  More on this on below.
>>>>
>>>>>
>>>>>>
>>>>>> IMHO reusing multifd may or may not be a good idea here, because it'll of
>>>>>> course also complicate multifd code, hence makes multifd harder to
>>>>>> maintain, while not in a good way, because as I mentioned I don't think it
>>>>>> can use much of what multifd provides.
>>>>>
>>>>>
>>>>> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
>>>>>
>>>>> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
>>>>> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
>>>>>
>>>>> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
>>>>> so it seems a very natural extension of multifd to me.
>>>>
>>>> Yes, since I haven't looked at the multifd patches at all so I don't have
>>>> solid clue on how much it'll affect multifd.  I'll leave that to Juan.
>>>>
>>>>>
>>>>>>
>>>>>> I don't have a strong opinion on the impl (even though I do have a
>>>>>> preference..), but I think at least we should still check on two things:
>>>>>>
>>>>>>   - Being crystal clear on the use case above, and double check whether "VM
>>>>>>     stop" should be the default operation at the start of the new cmd - we
>>>>>>     shouldn't assume the user will be aware of doing this, neither should
>>>>>>     we assume the user is aware of the performance implications.
>>>>>
>>>>>
>>>>> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
>>>>> Probably I missed something there.
>>>>
>>>> Yes, then IMHO as mentioned we should make "vm stop" part of the command
>>>> procedure if vm was still running when invoked.  Then we can already
>>>> optimize dirty logging of above (2) with the current framework. E.g., we
>>>> already optimized live snapshot to not enable dirty logging:
>>>>
>>>>         if (!migrate_background_snapshot()) {
>>>>             memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
>>>>             migration_bitmap_sync_precopy(rs);
>>>>         }
>>>>
>>>> Maybe that can also be done for fixed-ram migration, so no matter how much
>>>> overhead there will be, that can be avoided.
>>>
>>> Understood, agree.
>>>
>>> Would it make sense to check for something like if (!runstate_is_running())
>>> instead of checking for the specific multifd + fixed-ram feature?
>>>
>>> I think from a high level perspective, there should not be dirtying if the vcpus are not running right?
>>> This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running.
>>
>> Hmm I'm not sure.  I think we still allow use to stop/start VMs during
>> migration?  If so, probably not applicable.
>>
>> And it won't cover live snapshot too - live snapshot always run with VM
>> running, but it doesn't need to track dirty.  It actually needs to track
>> dirty, but in a synchronous way to make it efficient (while kvm dirty
>> tracking is asynchronous, aka, vcpu won't be blocked if dirtied).
>>
>> So here we can make it "if (migrate_needs_async_dirty_tracking())", and
>> having both live snapshot and fixed-ram migration covered in the helper to
>> opt-out dirty tracking.
>>
>> One thing worth keeping an eye here is if we go that way we need to make
>> sure VM won't be started during the fixed-ram migration.  IOW, we can
>> cancel the fixed-ram migration (in this case, more suitable to be called
>> "vm suspend") if the user starts the VM during the process.
>>
>>>
>>>>
>>>> PS: I think similar optimizations can be done too in ram_save_complete() or
>>>> ram_state_pending_exact().. maybe we should move the check into
>>>> migration_bitmap_sync_precopy() so it can be skipped as a whole when it can.
>>>
>>> makes sense, interesting.
>>>
>>> I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used.
>>
>> Do you mean we can skip dirty tracking when ramblock_is_ignored() for a
>> ramblock?  I think it's doable but it'll be slightly more involved, because
>> ignored/shared ramblocks can be used together with private/non-ignored
>> ramblocks, hence at least it's not applicable globally.
>>
>>>
>>> this feature was added by:
>>>
>>> commit fbd162e629aaf8a7e464af44d2f73d06b26428ad
>>> Author: Yury Kotov <yury-kotov@yandex-team.ru>
>>> Date:   Fri Feb 15 20:45:46 2019 +0300
>>>
>>>     migration: Add an ability to ignore shared RAM blocks
>>>     
>>>     If ignore-shared capability is set then skip shared RAMBlocks during the
>>>     RAM migration.
>>>     Also, move qemu_ram_foreach_migratable_block (and rename) to the
>>>     migration code, because it requires access to the migration capabilities.
>>>     
>>>     Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru>
>>>     Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru>
>>>     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>     Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>
>>> Probably not that important, just to mention since we were thinking of possible small optimizations.
>>> I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful.
>>>
>>> Likely we'd need a new round of measurements with perf...
>>
>> Yes it would be good to know. Said that, I think it'll also be fine if
>> optimizations are done on top, as long as the change will be compatible
>> with the interface being proposed.
>>
>> Here e.g. "stop the VM within the cmd" is part of the interface so IMHO it
>> should be decided before this series got merged.
>>
> 
> Ok, so in summary, the high level requirement says we need to stop the
> VM and we've determined that stopping it before the migration is what
> probably makes more sense.
> 
> Keeping in mind that the design of fixed-ram already supports live
> migration, I see three options for the interface so far:

(just my opinion here - I might be wrong, and this is not directly a requirement I am presenting)

Maybe there are other reasons to provide the fixed-ram offsets thing beyond the live case? I am unclear on that.

If the live case is a potential requirement for someone else, or there are other reasons for fixed-ram offsets anyway,
I think it would be better to leave the decision of whether or not to stop the VM prior to the transfer to the user, or to the management tools (libvirt ...).

We care about the stop case, but since the proposal already supports live too, I see no really good reason to force the user to stop the VM and impose our own use case when others might find use for "live".

If we want to detect the two cases separately at runtime in the future for a potential additional performance gain, that is a possibility for future work in my view,
but we already know experimentally that the extra overhead of the dirty bitmap tracking is not the real bottleneck, at least in our testing,
even with devices capable of transferring ~6 gigabytes per second.

But again, this is assuming that the live case is compatible and does not make things overly complicated;
otherwise, looking at it purely from the perspective of these business requirements, we don't need it, and we could even scrap live.

> 
> 1) Add a new command that does vm_stop + fixed-ram migrate;
> 
> 2) Arbitrarily declare that fixed-ram is always non-live and hardcode
>    that;
> 
> 3) Add a new migration capability "live migration", ON by default and
>    have the management layer set fixed-ram=on, live-migration=off.

(just a minor point, for the case where this would apply): instead of an additional option, could we not just detect whether we are "live" by checking whether the guest is in a running state?
I suppose we don't allow starting/stopping guests while the migration is running...


> 
> I guess this also largely depends on what direction we're going with the
> migration code in general. I.e. do we prefer a more isolated
> implementation or keep the new feature flexible for future use-cases?

right, looking for the migration experts and maintainers to chime in here :-)

> 
> I'll give people time to catch up and in the meantime work on adding the
> stop and the safeguards around the user re-starting.
> 
> Thanks all for the input so far.

Thanks, and again: all this is just my 2c, truly.

Ciao,

Claudio



* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-07 10:36                                 ` Claudio Fontana
@ 2023-04-11 15:48                                   ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2023-04-11 15:48 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: Fabiano Rosas, Daniel P. Berrangé,
	qemu-devel, jfehlig, dfaggioli, dgilbert, Juan Quintela

On Fri, Apr 07, 2023 at 12:36:24PM +0200, Claudio Fontana wrote:
> On 4/6/23 18:46, Fabiano Rosas wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> >> On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote:
> >>> On 4/4/23 16:53, Peter Xu wrote:
> >>>> On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
> >>>>> Hi Peter,
> >>>>
> >>>> Hi, Claudio,
> >>>>
> >>>>>
> >>>>> On 4/3/23 21:26, Peter Xu wrote:
> >>>>>> Hi, Claudio,
> >>>>>>
> >>>>>> Thanks for the context.
> >>>>>>
> >>>>>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
> >>>>>>> Hi, not sure if what is asked here is context in terms of the previous
> >>>>>>> upstream discussions or our specific requirement we are trying to bring
> >>>>>>> upstream.
> >>>>>>>
> >>>>>>> In terms of the specific requirement we are trying to bring upstream, we
> >>>>>>> need to get libvirt+QEMU VM save and restore functionality to be able to
> >>>>>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
> >>>>>>> event trigger happens, the VM needs to be quickly paused and saved to
> >>>>>>> disk safely, including datasync, and another VM needs to be restored,
> >>>>>>> also in ~5 secs.  For our specific requirement, the VM is never running
> >>>>>>> when its data (mostly consisting of RAM) is saved.
> >>>>>>>
> >>>>>>> I understand that the need to handle also the "live" case comes from
> >>>>>>> upstream discussions about solving the "general case", where someone
> >>>>>>> might want to do this for "live" VMs, but if helpful I want to highlight
> >>>>>>> that it is not part of the specific requirement we are trying to address,
> >>>>>>> and for this specific case won't also in the future, as the whole point
> >>>>>>> of the trigger is to replace the running VM with another VM, so it cannot
> >>>>>>> be kept running.
> >>>>>>
> >>>>>> From what I read so far, that scenario suites exactly what live snapshot
> >>>>>> would do with current QEMU - that at least should involve a snapshot on the
> >>>>>> disks being used or I can't see how that can be live.  So it looks like a
> >>>>>> separate request.
> >>>>>>
> >>>>>>> The reason we are using "migrate" here likely stems from the fact that
> >>>>>>> existing libvirt code currently uses QMP migrate to implement the save
> >>>>>>> and restore commands.  And in my personal view, I think that reusing the
> >>>>>>> existing building blocks (migration, multifd) would be preferable, to
> >>>>>>> avoid having to maintain two separate ways to do the same thing.  That
> >>>>>>> said, it could be done in a different way, if the performance can keep
> >>>>>>> up. Just thinking of reducing the overall effort and also maintenance
> >>>>>>> surface.
> >>>>>>
> >>>>>> I would vaguely guess the performance can not only keep up but better than
> >>>>>> what the current solution would provide, due to the possibility of (1)
> >>>>>> batch handling of continuous guest pages, and (2) completely no dirty
> >>>>>> tracking overhead.
> >>>>>>
> >>>>>> For (2), it's not about wr-protect page faults or vmexits due to PML being
> >>>>>> full (because vcpus will be stopped anyway..), it's about enabling the
> >>>>>> dirty tracking (which already contains overhead, especially when huge pages
> >>>>>> are enabled, to split huge pages in EPT pgtables) and all the bitmap
> >>>>>> operations QEMU does during live migration even if the VM is not live.
> >>>>>
> >>>>> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
> >>>>> but maybe worthwhile redoing the profiling with Fabiano's patchset.
> >>>>
> >>>> Yes I don't know the detailed number either, it should depend on the guest
> >>>> configuration (mem size, mem type, kernel version etc).  It could be less a
> >>>> concern comparing to the time used elsewhere.  More on this on below.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> IMHO reusing multifd may or may not be a good idea here, because it'll of
> >>>>>> course also complicate multifd code, hence makes multifd harder to
> >>>>>> maintain, while not in a good way, because as I mentioned I don't think it
> >>>>>> can use much of what multifd provides.
> >>>>>
> >>>>>
> >>>>> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
> >>>>>
> >>>>> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
> >>>>> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
> >>>>>
> >>>>> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
> >>>>> so it seems a very natural extension of multifd to me.
> >>>>
> >>>> Yes, since I haven't looked at the multifd patches at all so I don't have
> >>>> solid clue on how much it'll affect multifd.  I'll leave that to Juan.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> I don't have a strong opinion on the impl (even though I do have a
> >>>>>> preference..), but I think at least we should still check on two things:
> >>>>>>
> >>>>>>   - Being crystal clear on the use case above, and double check whether "VM
> >>>>>>     stop" should be the default operation at the start of the new cmd - we
> >>>>>>     shouldn't assume the user will be aware of doing this, neither should
> >>>>>>     we assume the user is aware of the performance implications.
> >>>>>
> >>>>>
> >>>>> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
> >>>>> Probably I missed something there.
> >>>>
> >>>> Yes, then IMHO as mentioned we should make "vm stop" part of the command
> >>>> procedure if vm was still running when invoked.  Then we can already
> >>>> optimize dirty logging of above (2) with the current framework. E.g., we
> >>>> already optimized live snapshot to not enable dirty logging:
> >>>>
> >>>>         if (!migrate_background_snapshot()) {
> >>>>             memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
> >>>>             migration_bitmap_sync_precopy(rs);
> >>>>         }
> >>>>
> >>>> Maybe that can also be done for fixed-ram migration, so no matter how much
> >>>> overhead there will be, that can be avoided.
> >>>
> >>> Understood, agree.
> >>>
> >>> Would it make sense to check for something like if (!runstate_is_running())
> >>> instead of checking for the specific multifd + fixed-ram feature?
> >>>
> >>> I think from a high level perspective, there should not be dirtying if the vcpus are not running right?
> >>> This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running.
> >>
> >> Hmm I'm not sure.  I think we still allow users to stop/start VMs during
> >> migration?  If so, probably not applicable.
> >>
> >> And it won't cover live snapshot either - live snapshot always runs with the
> >> VM running, and while it does need to track dirtying, it does so in a
> >> synchronous way to make it efficient (whereas kvm dirty tracking is
> >> asynchronous, aka, the vcpu won't be blocked when it dirties a page).
> >>
> >> So here we can make it "if (migrate_needs_async_dirty_tracking())", and
> >> having both live snapshot and fixed-ram migration covered in the helper to
> >> opt-out dirty tracking.
> >>
> >> One thing worth keeping an eye here is if we go that way we need to make
> >> sure VM won't be started during the fixed-ram migration.  IOW, we can
> >> cancel the fixed-ram migration (in this case, more suitable to be called
> >> "vm suspend") if the user starts the VM during the process.
> >>
> >>>
> >>>>
> >>>> PS: I think similar optimizations can be done too in ram_save_complete() or
> >>>> ram_state_pending_exact().. maybe we should move the check into
> >>>> migration_bitmap_sync_precopy() so it can be skipped as a whole when it can.
> >>>
> >>> makes sense, interesting.
> >>>
> >>> I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used.
> >>
> >> Do you mean we can skip dirty tracking when ramblock_is_ignored() for a
> >> ramblock?  I think it's doable but it'll be slightly more involved, because
> >> ignored/shared ramblocks can be used together with private/non-ignored
> >> ramblocks, hence at least it's not applicable globally.
> >>
> >>>
> >>> this feature was added by:
> >>>
> >>> commit fbd162e629aaf8a7e464af44d2f73d06b26428ad
> >>> Author: Yury Kotov <yury-kotov@yandex-team.ru>
> >>> Date:   Fri Feb 15 20:45:46 2019 +0300
> >>>
> >>>     migration: Add an ability to ignore shared RAM blocks
> >>>     
> >>>     If ignore-shared capability is set then skip shared RAMBlocks during the
> >>>     RAM migration.
> >>>     Also, move qemu_ram_foreach_migratable_block (and rename) to the
> >>>     migration code, because it requires access to the migration capabilities.
> >>>     
> >>>     Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru>
> >>>     Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru>
> >>>     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >>>     Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >>>
> >>> Probably not that important, just to mention since we were thinking of possible small optimizations.
> >>> I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful.
> >>>
> >>> Likely we'd need a new round of measurements with perf...
> >>
> >> Yes it would be good to know. Said that, I think it'll also be fine if
> >> optimizations are done on top, as long as the change will be compatible
> >> with the interface being proposed.
> >>
> >> Here e.g. "stop the VM within the cmd" is part of the interface so IMHO it
> >> should be decided before this series got merged.
> >>
> > 
> > Ok, so in summary, the high level requirement says we need to stop the
> > VM and we've determined that stopping it before the migration is what
> > probably makes more sense.
> > 
> > Keeping in mind that the design of fixed-ram already supports live
> > migration, I see three options for the interface so far:
> 
> (just my opinion here, I might be wrong and is not directly a requirement I am presenting here)
> 
> Maybe there are other reasons to provide the fixed-ram offsets thing beyond the live case? I am unclear on that.
> 
> If the live case is a potential requirement for someone else, or there are other reasons for fixed-ram offsets anyway,
> I think it would be better to leave the decision of whether to stop or not to stop the vm prior to transfer to the user, or to the management tools (libvirt ...)

We'll need someone to stand up and explain the use case.  IMHO we should not
assume something can happen and design the interface around that assumption,
especially if the assumption can have an impact on the design.  Per my own
experience that's the major source of over-engineering.

If we live migrate the VM and then stop it after migration completes, it means
we're "assuming" the ultimate disk image will match the VM image we're going
to create, but I doubt that holds.

Here the whole process actually contains a few steps:

     (1)                  (2)                     (3)               (4)
  VM running --> start live migration --> migration completes --> VM stop
                   (fixed-ram=on)

We have the VM image containing all the VM state (including device and
memory) at step (3), then we can optionally & quickly turn off the VM at
step (4).  IOW, the final disk image contains the state at step (4), not at
step (3).  The question is: could something have changed on the disk, or
could an IO flush have happened, between steps (3) and (4)?

IOW, I think the use case so far can only be justified if it's VM suspend.

> 
> We care about the stop case, but since the proposal already supports live too, there is no real good reason I think to force the user to stop the VM, forcing our own use case when others might find use for "live".
> 
> If we want to detect the two cases at runtime separately in the future for potential additional performance gain, that is a possibility in my view for future work,
> but we know already experimentally that the bits of extra overhead for the dirty bitmap tracking is not the real bottleneck at least in our testing,
> even with devices capable of transferring ~6 gigabytes per second.

Is that device assigned to the guest?  I'm very curious why that wouldn't
make a difference.

It could be that the device is reusing a small buffer of the guest, so even
if it is dirtied very fast the impact is limited to a small range.  Logically,
high dirty loads will definitely make a difference regardless of the dirty
tracking overhead (e.g., besides the tracking overhead, we'll also need to
migrate dirtied pages more than once, which we don't need to do with a live
snapshot).

> 
> But again this is assuming that the live case is compatible and does not make things overly complicated,
> otherwise looking instead at the thing from purely these business requirements perspective we don't need it, and we could even scrap live.
> 
> > 
> > 1) Add a new command that does vm_stop + fixed-ram migrate;
> > 
> > 2) Arbitrarily declare that fixed-ram is always non-live and hardcode
> >    that;
> > 
> > 3) Add a new migration capability "live migration", ON by default and
> >    have the management layer set fixed-ram=on, live-migration=off.
> 
> (just a minor point, for the case where this would apply): instead of an additional option, could we not just detect whether we are "live" or not by just checking whether the guest is in a running state?
> I suppose we don't allow to start/stop guests while the migration is running..
> 
> 
> > 
> > I guess this also largely depends on what direction we're going with the
> > migration code in general. I.e. do we prefer a more isolated
> > implementation or keep the new feature flexible for future use-cases?
> 
> right, looking for the migration experts and maintainers to chime in here :-)
> 
> > 
> > I'll give people time to catch up and in the meantime work on adding the
> > stop and the safeguards around the user re-starting.
> > 
> > Thanks all for the input so far.
> 
> Thanks and as again: all this is just my 2c truly.
> 
> Ciao,
> 
> Claudio
> 

-- 
Peter Xu




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-03-31 16:27             ` Peter Xu
  2023-03-31 18:18               ` Fabiano Rosas
@ 2023-04-18 16:58               ` Daniel P. Berrangé
  2023-04-18 19:26                 ` Peter Xu
  1 sibling, 1 reply; 65+ messages in thread
From: Daniel P. Berrangé @ 2023-04-18 16:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela

On Fri, Mar 31, 2023 at 12:27:48PM -0400, Peter Xu wrote:
> On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
> > On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
> > > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> > > > Peter Xu <peterx@redhat.com> writes:
> > > > 
> > > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> > > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> > > > >> >> 
> > > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> > > > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> > > > >> >>   10m -v`:
> > > > >> >> 
> > > > >> >> migration type  | MB/s | pages/s |  ms
> > > > >> >> ----------------+------+---------+------
> > > > >> >> savevm io_uring |  434 |  102294 | 71473
> > > > >> >
> > > > >> > So I assume this is the non-live migration scenario.  Could you explain
> > > > >> > what does io_uring mean here?
> > > > >> >
> > > > >> 
> > > > >> This table is all non-live migration. This particular line is a snapshot
> > > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> > > > >> is another way by which we write RAM into disk.
> > > > >
> > > > > I see, so if all non-live that explains, because I was curious what's the
> > > > > relationship between this feature and the live snapshot that QEMU also
> > > > > supports.
> > > > >
> > > > > I also don't immediately see why savevm will be much slower, do you have an
> > > > > answer?  Maybe it's somewhere but I just overlooked..
> > > > >
> > > > 
> > > > I don't have a concrete answer. I could take a jab and maybe blame the
> > > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> > > > of bandwidth limits?
> > > 
> > > IMHO it would be great if this can be investigated and reasons provided in
> > > the next cover letter.
> > > 
> > > > 
> > > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> > > > > "we can stop the VM".  It smells slightly weird to build this on top of
> > > > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> > > > > this aspect (on why not building this on top of "savevm")?
> > > > >
> > > > 
> > > > I share the same perception. I have done initial experiments with
> > > > savevm, but I decided to carry on the work that was already started by
> > > > others because my understanding of the problem was yet incomplete.
> > > > 
> > > > One point that has been raised is that the fixed-ram format alone does
> > > > not bring that many performance improvements. So we'll need
> > > > multi-threading and direct-io on top of it. Re-using multifd
> > > > infrastructure seems like it could be a good idea.
> > > 
> > > The thing is IMHO concurrency is not as hard if VM stopped, and when we're
> > > 100% sure locally on where the page will go.
> > 
> > We shouldn't assume the VM is stopped though. When saving to the file
> > the VM may still be active. The fixed-ram format lets us re-write the
> > same memory location on disk multiple times in this case, thus avoiding
> > growth of the file size.
> 
> Before discussing on reusing multifd below, now I have a major confusing on
> the use case of the feature..
> 
> The question is whether we would like to stop the VM after fixed-ram
> migration completes.  I'm asking because:
> 
>   1. If it will stop, then it looks like a "VM suspend" to me. If so, could
>      anyone help explain why we don't stop the VM first then migrate?
>      Because it avoids copying single pages multiple times, no fiddling
>      with dirty tracking at all - we just don't ever track anything.  In
>      short, we'll stop the VM anyway, then why not stop it slightly
>      earlier?
> 
>   2. If it will not stop, then it's "VM live snapshot" to me.  We have
>      that, aren't we?  That's more efficient because it'll wr-protect all
>      guest pages, any write triggers a CoW and we only copy the guest pages
>      once and for all.
> 
> Either way to go, there's no need to copy any page more than once.  Did I
> miss anything perhaps very important?
> 
> I would guess it's option (1) above, because it seems we don't snapshot the
> disk alongside.  But I am really not sure now..

It is both options above.

Libvirt has multiple APIs where it currently uses its migrate-to-file
approach

  * virDomainManagedSave()

    This saves VM state to an libvirt managed file, stops the VM, and the
    file state is auto-restored on next request to start the VM, and the
    file deleted. The VM CPUs are stopped during both save + restore
    phase

  * virDomainSave/virDomainRestore

    The former saves VM state to a file specified by the mgmt app/user.
    A later call to virDomaniRestore starts the VM using that saved
    state. The mgmt app / user can delete the file state, or re-use
    it many times as they desire. The VM CPUs are stopped during both
    save + restore phase

  * virDomainSnapshotXXX

    This family of APIs takes snapshots of the VM disks, optionally
    also including the full VM state to a separate file. The snapshots
    can later be restored. The VM CPUs remain running during the
    save phase, but are stopped during restore phase

All these APIs end up calling the same code inside libvirt that uses
the libvirt-iohelper, together with QEMU migrate:fd driver.

IIUC, Suse's original motivation for the performance improvements was
wrt to the first case of virDomainManagedSave. From the POV of actually
supporting this in libvirt though, we need to cover all the scenarios
there. Thus we need this to work both when CPUs are running and stopped,
and if we didn't use migrate in this case, then we basically just end
up re-inventing migrate again which IMHO is undesirable both from
libvirt's POV and QEMU's POV.
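
For anyone less familiar with those entry points, the save/restore pair looks
roughly like this from a client's point of view (minimal sketch, error
handling omitted, domain name and path made up):

    #include <libvirt/libvirt.h>

    int main(void)
    {
        virConnectPtr conn = virConnectOpen("qemu:///system");
        virDomainPtr dom = virDomainLookupByName(conn, "myguest");

        /* stop the CPUs and write the VM state to the given file */
        virDomainSave(dom, "/var/tmp/myguest.sav");

        /* ... later, start a new instance from that saved state */
        virDomainRestore(conn, "/var/tmp/myguest.sav");

        /* virDomainManagedSave(dom, 0) is the same idea, except that
         * libvirt picks and manages the state file itself */

        virDomainFree(dom);
        virConnectClose(conn);
        return 0;
    }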

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-18 16:58               ` Daniel P. Berrangé
@ 2023-04-18 19:26                 ` Peter Xu
  2023-04-19 17:12                   ` Daniel P. Berrangé
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2023-04-18 19:26 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela

On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> On Fri, Mar 31, 2023 at 12:27:48PM -0400, Peter Xu wrote:
> > On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
> > > On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
> > > > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> > > > > Peter Xu <peterx@redhat.com> writes:
> > > > > 
> > > > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> > > > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> > > > > >> >> 
> > > > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> > > > > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> > > > > >> >>   10m -v`:
> > > > > >> >> 
> > > > > >> >> migration type  | MB/s | pages/s |  ms
> > > > > >> >> ----------------+------+---------+------
> > > > > >> >> savevm io_uring |  434 |  102294 | 71473
> > > > > >> >
> > > > > >> > So I assume this is the non-live migration scenario.  Could you explain
> > > > > >> > what does io_uring mean here?
> > > > > >> >
> > > > > >> 
> > > > > >> This table is all non-live migration. This particular line is a snapshot
> > > > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> > > > > >> is another way by which we write RAM into disk.
> > > > > >
> > > > > > I see, so if all non-live that explains, because I was curious what's the
> > > > > > relationship between this feature and the live snapshot that QEMU also
> > > > > > supports.
> > > > > >
> > > > > > I also don't immediately see why savevm will be much slower, do you have an
> > > > > > answer?  Maybe it's somewhere but I just overlooked..
> > > > > >
> > > > > 
> > > > > I don't have a concrete answer. I could take a jab and maybe blame the
> > > > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> > > > > of bandwidth limits?
> > > > 
> > > > IMHO it would be great if this can be investigated and reasons provided in
> > > > the next cover letter.
> > > > 
> > > > > 
> > > > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> > > > > > "we can stop the VM".  It smells slightly weird to build this on top of
> > > > > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> > > > > > this aspect (on why not building this on top of "savevm")?
> > > > > >
> > > > > 
> > > > > I share the same perception. I have done initial experiments with
> > > > > savevm, but I decided to carry on the work that was already started by
> > > > > others because my understanding of the problem was yet incomplete.
> > > > > 
> > > > > One point that has been raised is that the fixed-ram format alone does
> > > > > not bring that many performance improvements. So we'll need
> > > > > multi-threading and direct-io on top of it. Re-using multifd
> > > > > infrastructure seems like it could be a good idea.
> > > > 
> > > > The thing is IMHO concurrency is not as hard if VM stopped, and when we're
> > > > 100% sure locally on where the page will go.
> > > 
> > > We shouldn't assume the VM is stopped though. When saving to the file
> > > the VM may still be active. The fixed-ram format lets us re-write the
> > > same memory location on disk multiple times in this case, thus avoiding
> > > growth of the file size.
> > 
> > Before discussing on reusing multifd below, now I have a major confusing on
> > the use case of the feature..
> > 
> > The question is whether we would like to stop the VM after fixed-ram
> > migration completes.  I'm asking because:
> > 
> >   1. If it will stop, then it looks like a "VM suspend" to me. If so, could
> >      anyone help explain why we don't stop the VM first then migrate?
> >      Because it avoids copying single pages multiple times, no fiddling
> >      with dirty tracking at all - we just don't ever track anything.  In
> >      short, we'll stop the VM anyway, then why not stop it slightly
> >      earlier?
> > 
> >   2. If it will not stop, then it's "VM live snapshot" to me.  We have
> >      that, aren't we?  That's more efficient because it'll wr-protect all
> >      guest pages, any write triggers a CoW and we only copy the guest pages
> >      once and for all.
> > 
> > Either way to go, there's no need to copy any page more than once.  Did I
> > miss anything perhaps very important?
> > 
> > I would guess it's option (1) above, because it seems we don't snapshot the
> > disk alongside.  But I am really not sure now..
> 
> It is both options above.
> 
> Libvirt has multiple APIs where it currently uses its migrate-to-file
> approach
> 
>   * virDomainManagedSave()
> 
>     This saves VM state to an libvirt managed file, stops the VM, and the
>     file state is auto-restored on next request to start the VM, and the
>     file deleted. The VM CPUs are stopped during both save + restore
>     phase
> 
>   * virDomainSave/virDomainRestore
> 
>     The former saves VM state to a file specified by the mgmt app/user.
>     A later call to virDomaniRestore starts the VM using that saved
>     state. The mgmt app / user can delete the file state, or re-use
>     it many times as they desire. The VM CPUs are stopped during both
>     save + restore phase
> 
>   * virDomainSnapshotXXX
> 
>     This family of APIs takes snapshots of the VM disks, optionally
>     also including the full VM state to a separate file. The snapshots
>     can later be restored. The VM CPUs remain running during the
>     save phase, but are stopped during restore phase

For this one IMHO it'll be good if Libvirt can consider leveraging the new
background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
perhaps any reason why a generic migrate:fd approach is better?

> 
> All these APIs end up calling the same code inside libvirt that uses
> the libvirt-iohelper, together with QEMU migrate:fd driver.
> 
> IIUC, Suse's original motivation for the performance improvements was
> wrt to the first case of virDomainManagedSave. From the POV of actually
> supporting this in libvirt though, we need to cover all the scenarios
> there. Thus we need this to work both when CPUs are running and stopped,
> and if we didn't use migrate in this case, then we basically just end
> up re-inventing migrate again which IMHO is undesirable both from
> libvirt's POV and QEMU's POV.

Just to make sure we're on the same page - I always think it fine to use
the QMP "migrate" command to do this.

Meanwhile, we can also reuse the migration framework if we think that's
still the good way to go (even if I am not 100% sure on this... I still
think _lots_ of the live migration framework is logic trying to take care
of a "live" VM; IOW, that logic will become pure overhead if we reuse the
live migration framework for vm suspend).

However could you help elaborate more on why it must support live mode for
a virDomainManagedSave() request?  As I assume this is the core of the goal.

IMHO virDomainManagedSave() is a good interface design, because it contains
the target goal of what it wants to do (according to above).  To ask in
another way, I'm curious whether virDomainManagedSave() will stop the VM
before triggering the QMP "migrate" to fd: If it doesn't, why not?  If it
does, then why can't we have that assumption also for QEMU?

That assumption is IMHO important for QEMU because non-live VM migration
can avoid tons of overhead that a live migration will need.  I've mentioned
this in the other reply, even if we keep using the migration framework, we
can still optimize other things like dirty tracking.  We probably don't
even need any bitmap at all because we simply scan over all ramblocks.
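
Roughly, in pseudo-code (names are made up for illustration, this is not the
actual patches):

    /*
     * With vcpus stopped and no dirty tracking, a fixed-ram save can walk
     * every ramblock exactly once and write each page at its fixed offset
     * in the output file.
     */
    RAMBlock *block;

    RAMBLOCK_FOREACH_MIGRATABLE(block) {
        for (ram_addr_t off = 0; off < block->used_length;
             off += TARGET_PAGE_SIZE) {
            /* block_file_offset() stands in for the fixed-ram file layout */
            pwrite(fd, block->host + off, TARGET_PAGE_SIZE,
                   block_file_offset(block) + off);
        }
    }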

OTOH, if QEMU supports live mode for a "vm suspend" in the initial design,
not only it doesn't sound right at all from interface level, it means QEMU
will need to keep doing so forever because we need to be compatible with
the old interfaces even on new binaries.  That's why I keep suggesting we
should take "VM turned off" part of the cmd if that's what we're looking
for.

Thanks,

-- 
Peter Xu




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-18 19:26                 ` Peter Xu
@ 2023-04-19 17:12                   ` Daniel P. Berrangé
  2023-04-19 19:07                     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Daniel P. Berrangé @ 2023-04-19 17:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela

On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > approach
> > 
> >   * virDomainManagedSave()
> > 
> >     This saves VM state to an libvirt managed file, stops the VM, and the
> >     file state is auto-restored on next request to start the VM, and the
> >     file deleted. The VM CPUs are stopped during both save + restore
> >     phase
> > 
> >   * virDomainSave/virDomainRestore
> > 
> >     The former saves VM state to a file specified by the mgmt app/user.
> >     A later call to virDomaniRestore starts the VM using that saved
> >     state. The mgmt app / user can delete the file state, or re-use
> >     it many times as they desire. The VM CPUs are stopped during both
> >     save + restore phase
> > 
> >   * virDomainSnapshotXXX
> > 
> >     This family of APIs takes snapshots of the VM disks, optionally
> >     also including the full VM state to a separate file. The snapshots
> >     can later be restored. The VM CPUs remain running during the
> >     save phase, but are stopped during restore phase
> 
> For this one IMHO it'll be good if Libvirt can consider leveraging the new
> background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> perhaps any reason why a generic migrate:fd approach is better?

I'm not sure I fully understand the implications of 'background-snapshot' ?

Based on what the QAPI comment says, it sounds potentially interesting,
as conceptually it would be nicer to have the memory / state snapshot
represent the VM at the point where we started the snapshot operation,
rather than where we finished the snapshot operation.

It would not solve the performance problems that the work in this thread
was intended to address though.  With large VMs (100's of GB of RAM),
saving all the RAM state to disk takes a very long time, regardless of
whether the VM vCPUs are paused or running.

Currently when doing this libvirt has a "libvirt_iohelper" process
that we use so that we can do writes with O_DIRECT set. This avoids
thrashing the host OS's  I/O buffers/cache, and thus negatively
impacting performance of anything else on the host doing I/O. This
can't take advantage of multifd though, and even if extended to do
so, it still imposes extra data copies during the save/restore paths.
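
For context, the core of what the iohelper does is conceptually just the loop
below (simplified sketch, not the actual libvirt code; a real implementation
also has to handle the unaligned tail, e.g. by dropping O_DIRECT for the
final partial block):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)

    /* read the VM state from the migration pipe and write it out while
     * bypassing the host page cache; note the extra copy through 'buf',
     * which is the overhead mentioned above */
    static int copy_direct(int src_fd, const char *path)
    {
        void *buf = NULL;
        ssize_t n;
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);

        if (fd < 0 || posix_memalign(&buf, CHUNK, CHUNK) != 0)
            return -1;

        while ((n = read(src_fd, buf, CHUNK)) > 0) {
            if (write(fd, buf, n) != n)
                return -1;
        }

        free(buf);
        close(fd);
        return n < 0 ? -1 : 0;
    }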


So to speed up the above 3 libvirt APIs, we want QEMU to be able to
directly save/restore mem/vmstate to files, with parallization and
O_DIRECT.


> > All these APIs end up calling the same code inside libvirt that uses
> > the libvirt-iohelper, together with QEMU migrate:fd driver.
> > 
> > IIUC, Suse's original motivation for the performance improvements was
> > wrt to the first case of virDomainManagedSave. From the POV of actually
> > supporting this in libvirt though, we need to cover all the scenarios
> > there. Thus we need this to work both when CPUs are running and stopped,
> > and if we didn't use migrate in this case, then we basically just end
> > up re-inventing migrate again which IMHO is undesirable both from
> > libvirt's POV and QEMU's POV.
> 
> Just to make sure we're on the same page - I always think it fine to use
> the QMP "migrate" command to do this.
> 
> Meanwhile, we can also reuse the migration framework if we think that's
> still the good way to go (even if I am not 100% sure on this... I still
> think _lots_ of the live migration framework is logic trying to take care
> of a "live" VM; IOW, that logic will become pure overhead if we reuse the
> live migration framework for vm suspend).
> 
> However could you help elaborate more on why it must support live mode for
> a virDomainManagedSave() request?  As I assume this is the core of the goal.

No, we've no need for live mode for virDomainManagedSave. Live mode is
needed for virDomainSnapshot* APIs.

The point I'm making is that all three of the above libvirt APIs run exactly
the same migration code in libvirt. The only difference in the APIs is how
the operation gets triggered and whether the CPUs are running or not.

We want the improved performance of having parallel save/restore-to-disk
and use of O_DIRECT to be available to all 3 APIs. To me it doesn't make
sense to provide different impls for these APIs when they all have the
same end goal - it would be extra work on QEMU side and libvirt side alike
to use different solutions for each. 

> IMHO virDomainManagedSave() is a good interface design, because it contains
> the target goal of what it wants to do (according to above).  To ask in
> another way, I'm curious whether virDomainManagedSave() will stop the VM
> before triggering the QMP "migrate" to fd: If it doesn't, why not?  If it
> does, then why can't we have that assumption also for QEMU?
> 
> That assumption is IMHO important for QEMU because non-live VM migration
> can avoid tons of overhead that a live migration will need.  I've mentioned
> this in the other reply, even if we keep using the migration framework, we
> can still optimize other things like dirty tracking.  We probably don't
> even need any bitmap at all because we simply scan over all ramblocks.
> 
> OTOH, if QEMU supports live mode for a "vm suspend" in the initial design,
> not only it doesn't sound right at all from interface level, it means QEMU
> will need to keep doing so forever because we need to be compatible with
> the old interfaces even on new binaries.  That's why I keep suggesting we
> should take "VM turned off" part of the cmd if that's what we're looking
> for.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-19 17:12                   ` Daniel P. Berrangé
@ 2023-04-19 19:07                     ` Peter Xu
  2023-04-20  9:02                       ` Daniel P. Berrangé
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2023-04-19 19:07 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela

On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote:
> On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > > approach
> > > 
> > >   * virDomainManagedSave()
> > > 
> > >     This saves VM state to an libvirt managed file, stops the VM, and the
> > >     file state is auto-restored on next request to start the VM, and the
> > >     file deleted. The VM CPUs are stopped during both save + restore
> > >     phase
> > > 
> > >   * virDomainSave/virDomainRestore
> > > 
> > >     The former saves VM state to a file specified by the mgmt app/user.
> > >     A later call to virDomaniRestore starts the VM using that saved
> > >     state. The mgmt app / user can delete the file state, or re-use
> > >     it many times as they desire. The VM CPUs are stopped during both
> > >     save + restore phase
> > > 
> > >   * virDomainSnapshotXXX
> > > 
> > >     This family of APIs takes snapshots of the VM disks, optionally
> > >     also including the full VM state to a separate file. The snapshots
> > >     can later be restored. The VM CPUs remain running during the
> > >     save phase, but are stopped during restore phase
> > 
> > For this one IMHO it'll be good if Libvirt can consider leveraging the new
> > background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> > perhaps any reason why a generic migrate:fd approach is better?
> 
> I'm not sure I fully understand the implications of 'background-snapshot' ?
> 
> Based on what the QAPI comment says, it sounds potentially interesting,
> as conceptually it would be nicer to have the memory / state snapshot
> represent the VM at the point where we started the snapshot operation,
> rather than where we finished the snapshot operation.
> 
> It would not solve the performance problems that the work in this thread
> was intended to address though.  With large VMs (100's of GB of RAM),
> saving all the RAM state to disk takes a very long time, regardless of
> whether the VM vCPUs are paused or running.

I think it solves the performance problem by only copying each guest
page once, even if the guest is running.

Different from mostly all the rest of "migrate" use cases, background
snapshot does not use the generic dirty tracking at all (for KVM that's
get-dirty-log), instead it uses userfaultfd wr-protects, so that when
taking the snapshot all the guest pages will be protected once.

Then when each page is written, the guest cannot proceed before copying the
snapshot page over first.  After one guest page is unprotected, any write
to it will be with full speed because the follow up writes won't matter for
a snapshot.
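
For reference, the core of that flow is roughly the following (heavily
simplified sketch; the fault-handling thread and all error handling are
omitted):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    static int wr_protect_guest_ram(void *host_addr, uint64_t len)
    {
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = { .api = UFFD_API,
                                  .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
        struct uffdio_register reg = {
            .range = { .start = (uintptr_t)host_addr, .len = len },
            .mode = UFFDIO_REGISTER_MODE_WP,
        };
        struct uffdio_writeprotect wp = {
            .range = { .start = (uintptr_t)host_addr, .len = len },
            .mode = UFFDIO_WRITEPROTECT_MODE_WP,
        };

        ioctl(uffd, UFFDIO_API, &api);
        ioctl(uffd, UFFDIO_REGISTER, &reg);
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp); /* protect all guest RAM once */

        /*
         * A background thread then read()s fault events from 'uffd'; on a
         * write fault it copies the faulting page into the snapshot and
         * clears the protection (mode = 0) for just that page, so the vcpu
         * only blocks until that single copy is done.
         */
        return uffd;
    }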

It guarantees the best efficiency of creating a snapshot with VM running,
afaict.  I sincerely think Libvirt should have someone investigating and
see whether virDomainSnapshotXXX() can be implemented by this cap rather
than the default migration.

I actually thought the Libvirt support was there. I think it must be that
someone posted support for Libvirt but it didn't really land for some
reason.

> 
> Currently when doing this libvirt has a "libvirt_iohelper" process
> that we use so that we can do writes with O_DIRECT set. This avoids
> thrashing the host OS's  I/O buffers/cache, and thus negatively
> impacting performance of anything else on the host doing I/O. This
> can't take advantage of multifd though, and even if extended to do
> so, it still imposes extra data copies during the save/restore paths.
> 
> So to speed up the above 3 libvirt APIs, we want QEMU to be able to
> directly save/restore mem/vmstate to files, with parallization and
> O_DIRECT.

Here IIUC above question can be really important on whether existing
virDomainSnapshotXXX() can (and should) use "background-snapshot" to
implement, because that's the only one that will need to support migration
live (out of 3 use cases).

If virDomainSnapshotXXX() can be implemented differently, I think it'll be
much easier to have both virDomainManagedSave() and virDomainSave() trigger
a migration command that will stop the VM first by whatever way.

It's probably fine if we still want to have CAP_FIXED_RAM as a new
capability describing the file property (so that libvirt will know iohelper
is not needed anymore), it can support live migrating even if it shouldn't
really use it.  But then we could probably have another CAP_SUSPEND which
gives QEMU a hint so QEMU can be smart on this non-live migration.

It's just that AFAIU CAP_FIXED_RAM should just always be set with
CAP_SUSPEND, because it must be a SUSPEND to fixed ram or one should just
use virDomainSnapshotXXX() (or say, live snapshot).

Thanks,

-- 
Peter Xu




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-19 19:07                     ` Peter Xu
@ 2023-04-20  9:02                       ` Daniel P. Berrangé
  2023-04-20 19:19                         ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Daniel P. Berrangé @ 2023-04-20  9:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela

On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote:
> On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote:
> > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > > > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > > > approach
> > > > 
> > > >   * virDomainManagedSave()
> > > > 
> > > >     This saves VM state to an libvirt managed file, stops the VM, and the
> > > >     file state is auto-restored on next request to start the VM, and the
> > > >     file deleted. The VM CPUs are stopped during both save + restore
> > > >     phase
> > > > 
> > > >   * virDomainSave/virDomainRestore
> > > > 
> > > >     The former saves VM state to a file specified by the mgmt app/user.
> > > >     A later call to virDomaniRestore starts the VM using that saved
> > > >     state. The mgmt app / user can delete the file state, or re-use
> > > >     it many times as they desire. The VM CPUs are stopped during both
> > > >     save + restore phase
> > > > 
> > > >   * virDomainSnapshotXXX
> > > > 
> > > >     This family of APIs takes snapshots of the VM disks, optionally
> > > >     also including the full VM state to a separate file. The snapshots
> > > >     can later be restored. The VM CPUs remain running during the
> > > >     save phase, but are stopped during restore phase
> > > 
> > > For this one IMHO it'll be good if Libvirt can consider leveraging the new
> > > background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> > > perhaps any reason why a generic migrate:fd approach is better?
> > 
> > I'm not sure I fully understand the implications of 'background-snapshot' ?
> > 
> > Based on what the QAPI comment says, it sounds potentially interesting,
> > as conceptually it would be nicer to have the memory / state snapshot
> > represent the VM at the point where we started the snapshot operation,
> > rather than where we finished the snapshot operation.
> > 
> > It would not solve the performance problems that the work in this thread
> > was intended to address though.  With large VMs (100's of GB of RAM),
> > saving all the RAM state to disk takes a very long time, regardless of
> > whether the VM vCPUs are paused or running.
> 
> I think it solves the performance problem by only copying each guest
> page once, even if the guest is running.

I think we're talking about different performance problems.

What you describe here is about ensuring the snapshot is of finite size
and completes in linear time, by ensuring each page is written only
once.

What I'm talking about is being able to parallelize the writing of all
RAM, so if a single thread can saturate the storage, using multiple
threads will make the overall process faster, even when we're only
writing each page once.
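
That parallelism is exactly what the fixed offsets make trivial: N threads can
each pwrite() their own slice of guest RAM with no coordination at all, e.g.
one thread per channel spawned with pthread_create().  Illustrative sketch
only (not the actual patches):

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct slice {
        int fd;            /* output file */
        const char *ram;   /* start of the guest RAM block mapping */
        size_t start, len; /* this thread's slice of the block */
        off_t file_base;   /* fixed offset of the block in the file */
    };

    static void *write_slice(void *opaque)
    {
        struct slice *s = opaque;
        size_t done = 0;

        while (done < s->len) {
            ssize_t n = pwrite(s->fd, s->ram + s->start + done,
                               s->len - done,
                               s->file_base + s->start + done);
            if (n <= 0)
                break;  /* a real implementation would report the error */
            done += n;
        }
        return NULL;
    }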

> Different from mostly all the rest of "migrate" use cases, background
> snapshot does not use the generic dirty tracking at all (for KVM that's
> get-dirty-log), instead it uses userfaultfd wr-protects, so that when
> taking the snapshot all the guest pages will be protected once.

Oh, so that means this 'background-snapshot' feature only works on
Linux, and only when permissions allow it. The migration parameter
probably should be marked with 'CONFIG_LINUX' in the QAPI schema
to make it clear this is a non-portable feature.

> It guarantees the best efficiency of creating a snapshot with VM running,
> afaict.  I sincerely think Libvirt should have someone investigating and
> see whether virDomainSnapshotXXX() can be implemented by this cap rather
> than the default migration.

Since the background-snapshot feature is not universally available,
it will only ever be possible to use it as an optional enhancement
with virDomainSnapshotXXX, we'll need the portable impl to be the
default / fallback.

> > Currently when doing this libvirt has a "libvirt_iohelper" process
> > that we use so that we can do writes with O_DIRECT set. This avoids
> > thrashing the host OS's  I/O buffers/cache, and thus negatively
> > impacting performance of anything else on the host doing I/O. This
> > can't take advantage of multifd though, and even if extended to do
> > so, it still imposes extra data copies during the save/restore paths.
> > 
> > So to speed up the above 3 libvirt APIs, we want QEMU to be able to
> > directly save/restore mem/vmstate to files, with parallization and
> > O_DIRECT.
> 
> Here IIUC above question can be really important on whether existing
> virDomainSnapshotXXX() can (and should) use "background-snapshot" to
> implement, because that's the only one that will need to support migration
> live (out of 3 use cases).
> 
> If virDomainSnapshotXXX() can be implemented differently, I think it'll be
> much easier to have both virDomainManagedSave() and virDomainSave() trigger
> a migration command that will stop the VM first by whatever way.
> 
> It's probably fine if we still want to have CAP_FIXED_RAM as a new
> capability describing the file property (so that libvirt will know iohelper
> is not needed anymore), it can support live migrating even if it shouldn't
> really use it.  But then we could probably have another CAP_SUSPEND which
> gives QEMU a hint so QEMU can be smart on this non-live migration.
> 
> It's just that AFAIU CAP_FIXED_RAM should just always be set with
> CAP_SUSPEND, because it must be a SUSPEND to fixed ram or one should just
> use virDomainSnapshotXXX() (or say, live snapshot).

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-20  9:02                       ` Daniel P. Berrangé
@ 2023-04-20 19:19                         ` Peter Xu
  2023-04-21  7:48                           ` Daniel P. Berrangé
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2023-04-20 19:19 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela

On Thu, Apr 20, 2023 at 10:02:43AM +0100, Daniel P. Berrangé wrote:
> On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote:
> > On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> > > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > > > > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > > > > approach
> > > > > 
> > > > >   * virDomainManagedSave()
> > > > > 
> > > > >     This saves VM state to an libvirt managed file, stops the VM, and the
> > > > >     file state is auto-restored on next request to start the VM, and the
> > > > >     file deleted. The VM CPUs are stopped during both save + restore
> > > > >     phase
> > > > > 
> > > > >   * virDomainSave/virDomainRestore
> > > > > 
> > > > >     The former saves VM state to a file specified by the mgmt app/user.
> > > > >     A later call to virDomaniRestore starts the VM using that saved
> > > > >     state. The mgmt app / user can delete the file state, or re-use
> > > > >     it many times as they desire. The VM CPUs are stopped during both
> > > > >     save + restore phase
> > > > > 
> > > > >   * virDomainSnapshotXXX
> > > > > 
> > > > >     This family of APIs takes snapshots of the VM disks, optionally
> > > > >     also including the full VM state to a separate file. The snapshots
> > > > >     can later be restored. The VM CPUs remain running during the
> > > > >     save phase, but are stopped during restore phase
> > > > 
> > > > For this one IMHO it'll be good if Libvirt can consider leveraging the new
> > > > background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> > > > perhaps any reason why a generic migrate:fd approach is better?
> > > 
> > > I'm not sure I fully understand the implications of 'background-snapshot' ?
> > > 
> > > Based on what the QAPI comment says, it sounds potentially interesting,
> > > as conceptually it would be nicer to have the memory / state snapshot
> > > represent the VM at the point where we started the snapshot operation,
> > > rather than where we finished the snapshot operation.
> > > 
> > > It would not solve the performance problems that the work in this thread
> > > was intended to address though.  With large VMs (100's of GB of RAM),
> > > saving all the RAM state to disk takes a very long time, regardless of
> > > whether the VM vCPUs are paused or running.
> > 
> > I think it solves the performance problem by only copying each guest
> > page once, even if the guest is running.
> 
> I think we're talking about different performance problems.
> 
> What you describe here is about ensuring the snapshot is of finite size
> and completes in linear time, by ensuring each page is written only
> once.
> 
> What I'm talking about is being able to parallelize the writing of all
> RAM, so if a single thread can saturate the storage, using multiple
> > threads will make the overall process faster, even when we're only
> writing each page once.

It depends on how much we want it.  Here the live snapshot scenario could
probably leverage the same multi-threading framework as the vm suspend case
because it can assume all the pages are static and saved only once.

But I agree it's at least not there yet.. so we can directly leverage
multifd at least for now.

> 
> > Different from mostly all the rest of "migrate" use cases, background
> > snapshot does not use the generic dirty tracking at all (for KVM that's
> > get-dirty-log), instead it uses userfaultfd wr-protects, so that when
> > taking the snapshot all the guest pages will be protected once.
> 
> Oh, so that means this 'background-snapshot' feature only works on
> Linux, and only when permissions allow it. The migration parameter
> probably should be marked with 'CONFIG_LINUX' in the QAPI schema
> to make it clear this is a non-portable feature.

Indeed, I can have a follow up patch for this.  But it'll be the same as
some other features, like, postcopy (and all its sub-features including
postcopy-blocktime and postcopy-preempt)?

> 
> > It guarantees the best efficiency of creating a snapshot with VM running,
> > afaict.  I sincerely think Libvirt should have someone investigating and
> > see whether virDomainSnapshotXXX() can be implemented by this cap rather
> > than the default migration.
> 
> Since the background-snapshot feature is not universally available,
> it will only ever be possible to use it as an optional enhancement
> with virDomainSnapshotXXX, we'll need the portable impl to be the
> default / fallback.

I am actually curious on how a live snapshot can be implemented correctly
without something like background snapshot.  I raised this question in
another reply here:

https://lore.kernel.org/all/ZDWBSuGDU9IMohEf@x1n/

I was using fixed-ram and vm suspend as example, but I assume it applies to
any live snapshot that is based on current default migration scheme.

For a real live snapshot (not vm suspend), IIUC we have similar challenges.

The problem is when migration completes (snapshot taken) the VM is still
running with a live disk image.  Then how can we take a snapshot exactly at
the same time when we got the guest image mirrored in the vm dump?  What
guarantees that there's no IO changes after VM image created but before we
take a snapshot on the disk image?

In short, it's a question on how libvirt can make sure the VM image and
disk snapshot image be taken at exactly the same time for live.

Thanks,

-- 
Peter Xu




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-20 19:19                         ` Peter Xu
@ 2023-04-21  7:48                           ` Daniel P. Berrangé
  2023-04-21 13:56                             ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Daniel P. Berrangé @ 2023-04-21  7:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela

On Thu, Apr 20, 2023 at 03:19:39PM -0400, Peter Xu wrote:
> On Thu, Apr 20, 2023 at 10:02:43AM +0100, Daniel P. Berrangé wrote:
> > On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote:
> > > On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote:
> > > > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> > > > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > > > > > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > > > > > approach
> > > > > > 
> > > > > >   * virDomainManagedSave()
> > > > > > 
> > > > > >     This saves VM state to an libvirt managed file, stops the VM, and the
> > > > > >     file state is auto-restored on next request to start the VM, and the
> > > > > >     file deleted. The VM CPUs are stopped during both save + restore
> > > > > >     phase
> > > > > > 
> > > > > >   * virDomainSave/virDomainRestore
> > > > > > 
> > > > > >     The former saves VM state to a file specified by the mgmt app/user.
> > > > > >     A later call to virDomaniRestore starts the VM using that saved
> > > > > >     state. The mgmt app / user can delete the file state, or re-use
> > > > > >     it many times as they desire. The VM CPUs are stopped during both
> > > > > >     save + restore phase
> > > > > > 
> > > > > >   * virDomainSnapshotXXX
> > > > > > 
> > > > > >     This family of APIs takes snapshots of the VM disks, optionally
> > > > > >     also including the full VM state to a separate file. The snapshots
> > > > > >     can later be restored. The VM CPUs remain running during the
> > > > > >     save phase, but are stopped during restore phase
> > > > > 
> > > > > For this one IMHO it'll be good if Libvirt can consider leveraging the new
> > > > > background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> > > > > perhaps any reason why a generic migrate:fd approach is better?
> > > > 
> > > > I'm not sure I fully understand the implications of 'background-snapshot' ?
> > > > 
> > > > Based on what the QAPI comment says, it sounds potentially interesting,
> > > > as conceptually it would be nicer to have the memory / state snapshot
> > > > represent the VM at the point where we started the snapshot operation,
> > > > rather than where we finished the snapshot operation.
> > > > 
> > > > It would not solve the performance problems that the work in this thread
> > > > was intended to address though.  With large VMs (100's of GB of RAM),
> > > > saving all the RAM state to disk takes a very long time, regardless of
> > > > whether the VM vCPUs are paused or running.
> > > 
> > > I think it solves the performance problem by only copying each guest
> > > page once, even if the guest is running.
> > 
> > I think we're talking about different performance problems.
> > 
> > What you describe here is about ensuring the snapshot is of finite size
> > and completes in linear time, by ensuring each page is written only
> > once.
> > 
> > What I'm talking about is being able to parallelize the writing of all
> > RAM, so if a single thread can saturate the storage, using multiple
> > threads will make the overall process faster, even when we're only
> > writing each page once.
> 
> It depends on how much we want it.  Here the live snapshot scenario could
> probably leverage the same multi-threading framework as the vm suspend case
> because it can assume all the pages are static and saved only once.
> 
> But I agree it's at least not there yet.. so we can directly leverage
> multifd at least for now.
> 
> > 
> > > Different from mostly all the rest of "migrate" use cases, background
> > > snapshot does not use the generic dirty tracking at all (for KVM that's
> > > get-dirty-log), instead it uses userfaultfd wr-protects, so that when
> > > taking the snapshot all the guest pages will be protected once.
> > 
> > Oh, so that means this 'background-snapshot' feature only works on
> > Linux, and only when permissions allow it. The migration parameter
> > probably should be marked with 'CONFIG_LINUX' in the QAPI schema
> > to make it clear this is a non-portable feature.
> 
> Indeed, I can have a follow up patch for this.  But it'll be the same as
> some other features, like, postcopy (and all its sub-features including
> postcopy-blocktime and postcopy-preempt)?
> 
> > 
> > > It guarantees the best efficiency of creating a snapshot with VM running,
> > > afaict.  I sincerely think Libvirt should have someone investigating and
> > > see whether virDomainSnapshotXXX() can be implemented by this cap rather
> > > than the default migration.
> > 
> > Since the background-snapshot feature is not universally available,
> > it will only ever be possible to use it as an optional enhancement
> > with virDomainSnapshotXXX, we'll need the portable impl to be the
> > default / fallback.
> 
> I am actually curious on how a live snapshot can be implemented correctly
> without something like background snapshot.  I raised this question in
> another reply here:
> 
> https://lore.kernel.org/all/ZDWBSuGDU9IMohEf@x1n/
> 
> I was using fixed-ram and vm suspend as example, but I assume it applies to
> any live snapshot that is based on current default migration scheme.
> 
> For a real live snapshot (not vm suspend), IIUC we have similar challenges.
> 
> The problem is when migration completes (snapshot taken) the VM is still
> running with a live disk image.  Then how can we take a snapshot exactly at
> the same time when we got the guest image mirrored in the vm dump?  What
> guarantees that there's no IO changes after VM image created but before we
> take a snapshot on the disk image?
> 
> In short, it's a question on how libvirt can make sure the VM image and
> disk snapshot image be taken at exactly the same time for live.

It is just a matter of where you have the synchronization point.

With background-snapshot, you have to snapshot the disks at the
start of the migrate operation. Without background-snapshot
you have to snapshot the disks at the end of the migrate
operation. The CPUs are paused at the end of the migrate, so
when the CPUs pause, initiate the storage snapshot in the
background and then let the CPUs resume.
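
For concreteness, a rough sketch of the two orderings in HMP terms. The
device name "drive0", the overlay path and the "file:mig" URI are made up
for the example, and in the first case the mgmt app would in practice pause
the vCPUs briefly around the disk snapshot and the start of the migration
so that the two states coincide:

# with background-snapshot: the saved RAM reflects the start of the
# migration, so take the disk snapshot up front and let the guest run
(qemu) migrate_set_capability background-snapshot on
(qemu) snapshot_blkdev drive0 /tmp/overlay.qcow2 qcow2
(qemu) migrate -d file:mig     # guest keeps running while RAM is written out

# without background-snapshot: the saved RAM reflects the point where
# the vCPUs pause at completion, so take the disk snapshot there
(qemu) migrate file:mig        # returns once migration completes, vCPUs paused
(qemu) snapshot_blkdev drive0 /tmp/overlay.qcow2 qcow2
(qemu) cont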

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram
  2023-04-21  7:48                           ` Daniel P. Berrangé
@ 2023-04-21 13:56                             ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2023-04-21 13:56 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, Claudio Fontana, jfehlig, dfaggioli,
	dgilbert, Juan Quintela

On Fri, Apr 21, 2023 at 08:48:02AM +0100, Daniel P. Berrangé wrote:
> On Thu, Apr 20, 2023 at 03:19:39PM -0400, Peter Xu wrote:
> > On Thu, Apr 20, 2023 at 10:02:43AM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote:
> > > > On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote:
> > > > > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> > > > > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > > > > > > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > > > > > > approach
> > > > > > > 
> > > > > > >   * virDomainManagedSave()
> > > > > > > 
> > > > > > >     This saves VM state to an libvirt managed file, stops the VM, and the
> > > > > > >     file state is auto-restored on next request to start the VM, and the
> > > > > > >     file deleted. The VM CPUs are stopped during both save + restore
> > > > > > >     phase
> > > > > > > 
> > > > > > >   * virDomainSave/virDomainRestore
> > > > > > > 
> > > > > > >     The former saves VM state to a file specified by the mgmt app/user.
> > > > > > >     A later call to virDomainRestore starts the VM using that saved
> > > > > > >     state. The mgmt app / user can delete the file state, or re-use
> > > > > > >     it many times as they desire. The VM CPUs are stopped during both
> > > > > > >     save + restore phase
> > > > > > > 
> > > > > > >   * virDomainSnapshotXXX
> > > > > > > 
> > > > > > >     This family of APIs takes snapshots of the VM disks, optionally
> > > > > > >     also including the full VM state to a separate file. The snapshots
> > > > > > >     can later be restored. The VM CPUs remain running during the
> > > > > > >     save phase, but are stopped during restore phase
> > > > > > 
> > > > > > For this one IMHO it'll be good if Libvirt can consider leveraging the new
> > > > > > background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> > > > > > perhaps any reason why a generic migrate:fd approach is better?
> > > > > 
> > > > > I'm not sure I fully understand the implications of 'background-snapshot'?
> > > > > 
> > > > > Based on what the QAPI comment says, it sounds potentially interesting,
> > > > > as conceptually it would be nicer to have the memory / state snapshot
> > > > > represent the VM at the point where we started the snapshot operation,
> > > > > rather than where we finished the snapshot operation.
> > > > > 
> > > > > It would not solve the performance problems that the work in this thread
> > > > > was intended to address though.  With large VMs (100's of GB of RAM),
> > > > > saving all the RAM state to disk takes a very long time, regardless of
> > > > > whether the VM vCPUs are paused or running.
> > > > 
> > > > I think it solves the performance problem by only copying each guest
> > > > page once, even if the guest is running.
> > > 
> > > I think we're talking about different performance problems.
> > > 
> > > What you describe here is about ensuring the snapshot is of finite size
> > > and completes in linear time, by ensuring each page is written only
> > > once.
> > > 
> > > What I'm talking about is being able to parallelize the writing of all
> > > RAM, so if a single thread can't saturate the storage, using multiple
> > > threads will make the overall process faster, even when we're only
> > > writing each page once.
> > 
> > It depends on how much we want it.  Here the live snapshot scenario could
> > probably leverage the same multi-threading framework as the vm suspend case
> > because it can assume all the pages are static and only saved once.
> > 
> > But I agree it's at least not there yet.. so we can directly leverage
> > multifd at least for now.
> > 
> > > 
> > > > Different from mostly all the rest of "migrate" use cases, background
> > > > snapshot does not use the generic dirty tracking at all (for KVM that's
> > > > get-dirty-log), instead it uses userfaultfd wr-protects, so that when
> > > > taking the snapshot all the guest pages will be protected once.
> > > 
> > > Oh, so that means this 'background-snapshot' feature only works on
> > > Linux, and only when permissions allow it. The migration parameter
> > > probably should be marked with 'CONFIG_LINUX' in the QAPI schema
> > > to make it clear this is a non-portable feature.
> > 
> > Indeed, I can have a follow-up patch for this.  But it'll be the same as
> > some other features, like postcopy (and all its sub-features including
> > postcopy-blocktime and postcopy-preempt)?
> > 
> > > 
> > > > It guarantees the best efficiency of creating a snapshot with the VM running,
> > > > afaict.  I sincerely think Libvirt should have someone investigate whether
> > > > virDomainSnapshotXXX() can be implemented with this cap rather than the
> > > > default migration.
> > > 
> > > Since the background-snapshot feature is not universally available,
> > > it will only ever be possible to use it as an optional enhancement
> > > with virDomainSnapshotXXX; we'll need the portable impl to be the
> > > default / fallback.
> > 
> > I am actually curious how a live snapshot can be implemented correctly
> > without something like background snapshot.  I raised this question in
> > another reply here:
> > 
> > https://lore.kernel.org/all/ZDWBSuGDU9IMohEf@x1n/
> > 
> > I was using fixed-ram and vm suspend as an example, but I assume it applies to
> > any live snapshot that is based on the current default migration scheme.
> > 
> > For a real live snapshot (not vm suspend), IIUC we have similar challenges.
> > 
> > The problem is that when the migration completes (snapshot taken) the VM is
> > still running with a live disk image.  How can we then take a disk snapshot
> > at exactly the point in time when the guest state was captured in the vm
> > dump?  What guarantees that there are no IO changes after the VM image is
> > created but before we take a snapshot of the disk image?
> > 
> > In short, it's a question of how libvirt can make sure the VM image and the
> > disk snapshot are taken at exactly the same point in time for a live snapshot.
> 
> It is just a matter of where you have the synchronization point.
> 
> With background-snapshot, you have to snapshot the disks at the
> start of the migrate operation. Without background-snapshot
> you have to snapshot the disks at the end of the migrate
> operation. The CPUs are paused at the end of the migrate, so
> when the CPUs pause, initiate the storage snapshot in the
> background and then let the CPUs resume.

Ah, indeed.

Thanks.

-- 
Peter Xu




end of thread, other threads:[~2023-04-21 13:57 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-30 18:03 [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 01/26] migration: Add support for 'file:' uri for source migration Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 02/26] migration: Add support for 'file:' uri for incoming migration Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 03/26] tests/qtest: migration: Add migrate_incoming_qmp helper Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 04/26] tests/qtest: migration-test: Add tests for file-based migration Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 05/26] migration: Initial support of fixed-ram feature for analyze-migration.py Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 06/26] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 07/26] io: Add generic pwritev/preadv interface Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 08/26] io: implement io_pwritev/preadv for QIOChannelFile Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 09/26] migration/qemu-file: add utility methods for working with seekable channels Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability Fabiano Rosas
2023-03-30 22:01   ` Peter Xu
2023-03-31  7:56     ` Daniel P. Berrangé
2023-03-31 14:39       ` Peter Xu
2023-03-31 15:34         ` Daniel P. Berrangé
2023-03-31 16:13           ` Peter Xu
2023-03-31 15:05     ` Fabiano Rosas
2023-03-31  5:50   ` Markus Armbruster
2023-03-30 18:03 ` [RFC PATCH v1 11/26] migration: Refactor precopy ram loading code Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 12/26] migration: Add support for 'fixed-ram' migration restore Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 13/26] tests/qtest: migration-test: Add tests for fixed-ram file-based migration Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 14/26] migration: Add completion tracepoint Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 15/26] migration/multifd: Remove direct "socket" references Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 16/26] migration/multifd: Allow multifd without packets Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 17/26] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 18/26] migration/multifd: Add incoming " Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 19/26] migration/multifd: Add pages to the receiving side Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 20/26] io: Add a pwritev/preadv version that takes a discontiguous iovec Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 21/26] migration/ram: Add a wrapper for fixed-ram shadow bitmap Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 22/26] migration/multifd: Support outgoing fixed-ram stream format Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 23/26] migration/multifd: Support incoming " Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 24/26] tests/qtest: Add a multifd + fixed-ram migration test Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 25/26] migration: Add direct-io parameter Fabiano Rosas
2023-03-30 18:03 ` [RFC PATCH v1 26/26] tests/migration/guestperf: Add file, fixed-ram and direct-io support Fabiano Rosas
2023-03-30 21:41 ` [RFC PATCH v1 00/26] migration: File based migration with multifd and fixed-ram Peter Xu
2023-03-31 14:37   ` Fabiano Rosas
2023-03-31 14:52     ` Peter Xu
2023-03-31 15:30       ` Fabiano Rosas
2023-03-31 15:55         ` Peter Xu
2023-03-31 16:10           ` Daniel P. Berrangé
2023-03-31 16:27             ` Peter Xu
2023-03-31 18:18               ` Fabiano Rosas
2023-03-31 21:52                 ` Peter Xu
2023-04-03  7:47                   ` Claudio Fontana
2023-04-03 19:26                     ` Peter Xu
2023-04-04  8:00                       ` Claudio Fontana
2023-04-04 14:53                         ` Peter Xu
2023-04-04 15:10                           ` Claudio Fontana
2023-04-04 15:56                             ` Peter Xu
2023-04-06 16:46                               ` Fabiano Rosas
2023-04-07 10:36                                 ` Claudio Fontana
2023-04-11 15:48                                   ` Peter Xu
2023-04-18 16:58               ` Daniel P. Berrangé
2023-04-18 19:26                 ` Peter Xu
2023-04-19 17:12                   ` Daniel P. Berrangé
2023-04-19 19:07                     ` Peter Xu
2023-04-20  9:02                       ` Daniel P. Berrangé
2023-04-20 19:19                         ` Peter Xu
2023-04-21  7:48                           ` Daniel P. Berrangé
2023-04-21 13:56                             ` Peter Xu
2023-03-31 15:46       ` Daniel P. Berrangé
2023-04-03  7:38 ` David Hildenbrand
2023-04-03 14:41   ` Fabiano Rosas
2023-04-03 16:24     ` David Hildenbrand
2023-04-03 16:36       ` Fabiano Rosas
