* [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram
@ 2023-10-23 20:35 Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 01/29] tests/qtest: migration events Fabiano Rosas
                   ` (28 more replies)
  0 siblings, 29 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

Hi,

Here's the migration-to-file support using fixed offsets for each RAM
page. We're calling it fixed-ram migration.

There are three big pieces in this series:

1) Fixed-ram: The single-threaded (no multifd) implementation of
fixed-ram migration (see the sketch right after this list). This adds:
  - infrastructure for preadv/pwritev;
  - a bitmap to keep track of which pages are written to the file;
  - a migration header containing fixed-ram-specific information;
  - a capability to enable the feature;
  - the /x86/migration/multifd/file/fixed-ram/* tests.
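
A rough sketch of the idea behind the fixed offsets (illustrative
names only, not actual code from the series): each ramblock gets a
region in the file, and a page's position is the region's base plus
the page's offset within the block, so a re-dirtied page simply
overwrites its own slot:

  /*
   * Sketch only: where in the migration file a given page of a
   * ramblock lives. The real header and shadow bitmap layout are
   * defined later in the series.
   */
  static off_t fixed_ram_page_offset(off_t block_base, uint64_t page_index,
                                     size_t page_size)
  {
      /* pages never move; a dirtied page overwrites its own slot */
      return block_base + (off_t)(page_index * page_size);
  }

The shadow bitmap then records which of those slots actually contain
data, so the restore side knows which pages to read back.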

2) Multifd support: Changes to multifd to support fixed-ram. The main
point here is that we don't need the synchronous parts of multifd;
only the page-queuing, multi-threading and IO are of interest. This
adds:
  - the use_packets() option to skip sending packets when doing fixed-ram;

  - the concept of pages to the receiving side. We want to collect the
    pages from the file using multifd as well.

3) Auto-pause capability: A new capability to allow QEMU to pause the
VM when the type of migration selected would already result in a
paused VM at the end of migration (snapshots, file migration). This is
intended to allow QEMU to implement optimizations for the migration
types that don't need the live migration infrastructure.

The feature is opt-in for new migration code in QEMU, but opt-out for
QEMU users. That is, new migration code must declare that it supports
auto-pause, while the management layer needs to turn the capability
off to force a live migration.
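
For example, a management application that wants to force a live
migration to a file could turn the capability off before issuing the
migrate command, roughly like this (the file path is illustrative):

  -> { "execute": "migrate-set-capabilities",
       "arguments": { "capabilities": [
           { "capability": "auto-pause", "state": false } ] } }
  <- { "return": {} }
  -> { "execute": "migrate",
       "arguments": { "uri": "file:/tmp/vm.migrate" } }

With the capability left at its default (enabled), the same "migrate"
command pauses the VM right after setup, as patch 7 does for "file:"
URIs.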

Thanks

v1:
https://lore.kernel.org/r/20230330180336.2791-1-farosas@suse.de

Fabiano Rosas (21):
  tests/qtest: Move QTestMigrationState to libqtest
  tests/qtest: Allow waiting for migration events
  migration: Return the saved state from global_state_store
  migration: Introduce global_state_store_once
  migration: Add auto-pause capability
  migration: Run "file:" migration with a stopped VM
  tests/qtest: File migration auto-pause tests
  migration: fixed-ram: Add URI compatibility check
  migration/ram: Introduce 'fixed-ram' migration capability
  migration/multifd: Allow multifd without packets
  migration/multifd: Add outgoing QIOChannelFile support
  migration/multifd: Add incoming QIOChannelFile support
  migration/multifd: Add pages to the receiving side
  io: Add a pwritev/preadv version that takes a discontiguous iovec
  migration/ram: Add a wrapper for fixed-ram shadow bitmap
  migration/ram: Ignore multifd flush when doing fixed-ram migration
  migration/multifd: Support outgoing fixed-ram stream format
  migration/multifd: Support incoming fixed-ram stream format
  tests/qtest: Add a multifd + fixed-ram migration test
  migration: Add direct-io parameter
  tests/qtest: Add a test for migration with direct-io and multifd

Nikolay Borisov (7):
  io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file
  io: Add generic pwritev/preadv interface
  io: implement io_pwritev/preadv for QIOChannelFile
  migration/qemu-file: add utility methods for working with seekable
    channels
  migration/ram: Add support for 'fixed-ram' outgoing migration
  migration/ram: Add support for 'fixed-ram' migration restore
  tests/qtest: migration-test: Add tests for fixed-ram file-based
    migration

Steve Sistare (1):
  tests/qtest: migration events

 docs/devel/migration.rst            |  14 ++
 include/exec/ramblock.h             |   8 +
 include/io/channel.h                | 133 ++++++++++++
 include/migration/global_state.h    |   3 +-
 include/migration/qemu-file-types.h |   2 +
 include/qemu/bitops.h               |  13 ++
 include/qemu/osdep.h                |   2 +
 io/channel-file.c                   |  60 ++++++
 io/channel.c                        | 140 +++++++++++++
 migration/file.c                    | 114 ++++++++--
 migration/file.h                    |  10 +-
 migration/global_state.c            |  20 +-
 migration/migration-hmp-cmds.c      |  10 +
 migration/migration.c               |  55 ++++-
 migration/multifd.c                 | 313 +++++++++++++++++++++++-----
 migration/multifd.h                 |  14 +-
 migration/options.c                 |  79 +++++++
 migration/options.h                 |   5 +
 migration/qemu-file.c               |  80 +++++++
 migration/qemu-file.h               |   4 +
 migration/ram.c                     | 203 +++++++++++++++++-
 migration/ram.h                     |   1 +
 migration/savevm.c                  |   1 +
 qapi/migration.json                 |  26 ++-
 tests/qtest/libqtest.c              |  14 ++
 tests/qtest/libqtest.h              |  25 +++
 tests/qtest/migration-helpers.c     | 108 +++++++---
 tests/qtest/migration-helpers.h     |  10 +-
 tests/qtest/migration-test.c        | 219 ++++++++++++++-----
 util/osdep.c                        |   9 +
 30 files changed, 1520 insertions(+), 175 deletions(-)

-- 
2.35.3




* [PATCH v2 01/29] tests/qtest: migration events
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-25  9:44   ` Thomas Huth
                     ` (2 more replies)
  2023-10-23 20:35 ` [PATCH v2 02/29] tests/qtest: Move QTestMigrationState to libqtest Fabiano Rosas
                   ` (27 subsequent siblings)
  28 siblings, 3 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Steve Sistare, Thomas Huth, Laurent Vivier,
	Paolo Bonzini

From: Steve Sistare <steven.sistare@oracle.com>

Define a state object to capture events seen by migration tests, to allow
more events to be captured in a subsequent patch, and simplify event
checking in wait_for_migration_pass.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 tests/qtest/migration-helpers.c | 24 ++++-------
 tests/qtest/migration-helpers.h |  8 ++--
 tests/qtest/migration-test.c    | 74 +++++++++++++++------------------
 3 files changed, 46 insertions(+), 60 deletions(-)

diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index 24fb7b3525..fd3b94efa2 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -24,26 +24,16 @@
  */
 #define MIGRATION_STATUS_WAIT_TIMEOUT 120
 
-bool migrate_watch_for_stop(QTestState *who, const char *name,
-                            QDict *event, void *opaque)
-{
-    bool *seen = opaque;
-
-    if (g_str_equal(name, "STOP")) {
-        *seen = true;
-        return true;
-    }
-
-    return false;
-}
-
-bool migrate_watch_for_resume(QTestState *who, const char *name,
+bool migrate_watch_for_events(QTestState *who, const char *name,
                               QDict *event, void *opaque)
 {
-    bool *seen = opaque;
+    QTestMigrationState *state = opaque;
 
-    if (g_str_equal(name, "RESUME")) {
-        *seen = true;
+    if (g_str_equal(name, "STOP")) {
+        state->stop_seen = true;
+        return true;
+    } else if (g_str_equal(name, "RESUME")) {
+        state->resume_seen = true;
         return true;
     }
 
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index e31dc85cc7..c1d4c84995 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -15,9 +15,11 @@
 
 #include "libqtest.h"
 
-bool migrate_watch_for_stop(QTestState *who, const char *name,
-                            QDict *event, void *opaque);
-bool migrate_watch_for_resume(QTestState *who, const char *name,
+typedef struct QTestMigrationState {
+    bool stop_seen, resume_seen;
+} QTestMigrationState;
+
+bool migrate_watch_for_events(QTestState *who, const char *name,
                               QDict *event, void *opaque);
 
 G_GNUC_PRINTF(3, 4)
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 35e0ded9d7..0425d1d527 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -43,8 +43,8 @@
 unsigned start_address;
 unsigned end_address;
 static bool uffd_feature_thread_id;
-static bool got_src_stop;
-static bool got_dst_resume;
+static QTestMigrationState src_state;
+static QTestMigrationState dst_state;
 
 /*
  * An initial 3 MB offset is used as that corresponds
@@ -230,6 +230,13 @@ static void wait_for_serial(const char *side)
     } while (true);
 }
 
+static void wait_for_stop(QTestState *who, QTestMigrationState *state)
+{
+    if (!state->stop_seen) {
+        qtest_qmp_eventwait(who, "STOP");
+    }
+}
+
 /*
  * It's tricky to use qemu's migration event capability with qtest,
  * events suddenly appearing confuse the qmp()/hmp() responses.
@@ -277,21 +284,19 @@ static void read_blocktime(QTestState *who)
     qobject_unref(rsp_return);
 }
 
+/*
+ * Wait for two changes in the migration pass count, but bail if we stop.
+ */
 static void wait_for_migration_pass(QTestState *who)
 {
-    uint64_t initial_pass = get_migration_pass(who);
-    uint64_t pass;
+    uint64_t pass, prev_pass = 0, changes = 0;
 
-    /* Wait for the 1st sync */
-    while (!got_src_stop && !initial_pass) {
-        usleep(1000);
-        initial_pass = get_migration_pass(who);
-    }
-
-    do {
+    while (changes < 2 && !src_state.stop_seen) {
         usleep(1000);
         pass = get_migration_pass(who);
-    } while (pass == initial_pass && !got_src_stop);
+        changes += (pass != prev_pass);
+        prev_pass = pass;
+    }
 }
 
 static void check_guests_ram(QTestState *who)
@@ -617,10 +622,7 @@ static void migrate_postcopy_start(QTestState *from, QTestState *to)
 {
     qtest_qmp_assert_success(from, "{ 'execute': 'migrate-start-postcopy' }");
 
-    if (!got_src_stop) {
-        qtest_qmp_eventwait(from, "STOP");
-    }
-
+    wait_for_stop(from, &src_state);
     qtest_qmp_eventwait(to, "RESUME");
 }
 
@@ -755,8 +757,9 @@ static int test_migrate_start(QTestState **from, QTestState **to,
         }
     }
 
-    got_src_stop = false;
-    got_dst_resume = false;
+    dst_state = (QTestMigrationState) { };
+    src_state = (QTestMigrationState) { };
+
     if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
         memory_size = "150M";
 
@@ -847,8 +850,8 @@ static int test_migrate_start(QTestState **from, QTestState **to,
     if (!args->only_target) {
         *from = qtest_init_with_env(QEMU_ENV_SRC, cmd_source);
         qtest_qmp_set_event_callback(*from,
-                                     migrate_watch_for_stop,
-                                     &got_src_stop);
+                                     migrate_watch_for_events,
+                                     &src_state);
     }
 
     cmd_target = g_strdup_printf("-accel kvm%s -accel tcg "
@@ -868,8 +871,8 @@ static int test_migrate_start(QTestState **from, QTestState **to,
                                  ignore_stderr);
     *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target);
     qtest_qmp_set_event_callback(*to,
-                                 migrate_watch_for_resume,
-                                 &got_dst_resume);
+                                 migrate_watch_for_events,
+                                 &dst_state);
 
     /*
      * Remove shmem file immediately to avoid memory leak in test failed case.
@@ -1619,9 +1622,7 @@ static void test_precopy_common(MigrateCommon *args)
          */
         if (args->result == MIG_TEST_SUCCEED) {
             qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
-            if (!got_src_stop) {
-                qtest_qmp_eventwait(from, "STOP");
-            }
+            wait_for_stop(from, &src_state);
             migrate_ensure_converge(from);
         }
     }
@@ -1667,9 +1668,8 @@ static void test_precopy_common(MigrateCommon *args)
              */
             wait_for_migration_complete(from);
 
-            if (!got_src_stop) {
-                qtest_qmp_eventwait(from, "STOP");
-            }
+            wait_for_stop(from, &src_state);
+
         } else {
             wait_for_migration_complete(from);
             /*
@@ -1682,7 +1682,7 @@ static void test_precopy_common(MigrateCommon *args)
             qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
         }
 
-        if (!got_dst_resume) {
+        if (!dst_state.resume_seen) {
             qtest_qmp_eventwait(to, "RESUME");
         }
 
@@ -1723,9 +1723,7 @@ static void test_file_common(MigrateCommon *args, bool stop_src)
 
     if (stop_src) {
         qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
-        if (!got_src_stop) {
-            qtest_qmp_eventwait(from, "STOP");
-        }
+        wait_for_stop(from, &src_state);
     }
 
     if (args->result == MIG_TEST_QMP_ERROR) {
@@ -1747,7 +1745,7 @@ static void test_file_common(MigrateCommon *args, bool stop_src)
         qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
     }
 
-    if (!got_dst_resume) {
+    if (!dst_state.resume_seen) {
         qtest_qmp_eventwait(to, "RESUME");
     }
 
@@ -1868,9 +1866,7 @@ static void test_ignore_shared(void)
 
     migrate_wait_for_dirty_mem(from, to);
 
-    if (!got_src_stop) {
-        qtest_qmp_eventwait(from, "STOP");
-    }
+    wait_for_stop(from, &src_state);
 
     qtest_qmp_eventwait(to, "RESUME");
 
@@ -2380,7 +2376,7 @@ static void test_migrate_auto_converge(void)
             break;
         }
         usleep(20);
-        g_assert_false(got_src_stop);
+        g_assert_false(src_state.stop_seen);
     } while (true);
     /* The first percentage of throttling should be at least init_pct */
     g_assert_cmpint(percentage, >=, init_pct);
@@ -2719,9 +2715,7 @@ static void test_multifd_tcp_cancel(void)
 
     migrate_ensure_converge(from);
 
-    if (!got_src_stop) {
-        qtest_qmp_eventwait(from, "STOP");
-    }
+    wait_for_stop(from, &src_state);
     qtest_qmp_eventwait(to2, "RESUME");
 
     wait_for_serial("dest_serial");
-- 
2.35.3




* [PATCH v2 02/29] tests/qtest: Move QTestMigrationState to libqtest
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 01/29] tests/qtest: migration events Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-25 10:17   ` Daniel P. Berrangé
  2023-10-23 20:35 ` [PATCH v2 03/29] tests/qtest: Allow waiting for migration events Fabiano Rosas
                   ` (26 subsequent siblings)
  28 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini

Move the QTestMigrationState into QTestState so we don't have to pass
it around to the wait_for_* helpers anymore. Since QTestState is
private to libqtest.c, move the migration state struct to libqtest.h
and add a getter.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/libqtest.c          | 14 ++++++++++
 tests/qtest/libqtest.h          | 23 ++++++++++++++++
 tests/qtest/migration-helpers.c | 18 +++++++++++++
 tests/qtest/migration-helpers.h |  8 +++---
 tests/qtest/migration-test.c    | 47 +++++++++------------------------
 5 files changed, 72 insertions(+), 38 deletions(-)

diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c
index f33a210861..f7e85486dc 100644
--- a/tests/qtest/libqtest.c
+++ b/tests/qtest/libqtest.c
@@ -87,6 +87,7 @@ struct QTestState
     GList *pending_events;
     QTestQMPEventCallback eventCB;
     void *eventData;
+    QTestMigrationState *migration_state;
 };
 
 static GHookList abrt_hooks;
@@ -500,6 +501,8 @@ static QTestState *qtest_init_internal(const char *qemu_bin,
         s->irq_level[i] = false;
     }
 
+    s->migration_state = g_new0(QTestMigrationState, 1);
+
     /*
      * Stopping QEMU for debugging is not supported on Windows.
      *
@@ -601,6 +604,7 @@ void qtest_quit(QTestState *s)
     close(s->fd);
     close(s->qmp_fd);
     g_string_free(s->rx, true);
+    g_free(s->migration_state);
 
     for (GList *it = s->pending_events; it != NULL; it = it->next) {
         qobject_unref((QDict *)it->data);
@@ -854,6 +858,11 @@ void qtest_qmp_set_event_callback(QTestState *s,
     s->eventData = opaque;
 }
 
+void qtest_qmp_set_migration_callback(QTestState *s, QTestQMPEventCallback cb)
+{
+    qtest_qmp_set_event_callback(s, cb, s->migration_state);
+}
+
 QDict *qtest_qmp_event_ref(QTestState *s, const char *event)
 {
     while (s->pending_events) {
@@ -1906,3 +1915,8 @@ bool mkimg(const char *file, const char *fmt, unsigned size_mb)
 
     return ret && !err;
 }
+
+QTestMigrationState *qtest_migration_state(QTestState *s)
+{
+    return s->migration_state;
+}
diff --git a/tests/qtest/libqtest.h b/tests/qtest/libqtest.h
index 6e3d3525bf..0421a1da24 100644
--- a/tests/qtest/libqtest.h
+++ b/tests/qtest/libqtest.h
@@ -23,6 +23,20 @@
 
 typedef struct QTestState QTestState;
 
+struct QTestMigrationState {
+    bool stop_seen;
+    bool resume_seen;
+};
+typedef struct QTestMigrationState QTestMigrationState;
+
+/**
+ * qtest_migration_state:
+ * @s: #QTestState instance to operate on.
+ *
+ * Returns: #QTestMigrationState instance.
+ */
+QTestMigrationState *qtest_migration_state(QTestState *s);
+
 /**
  * qtest_initf:
  * @fmt: Format for creating other arguments to pass to QEMU, formatted
@@ -288,6 +302,15 @@ typedef bool (*QTestQMPEventCallback)(QTestState *s, const char *name,
 void qtest_qmp_set_event_callback(QTestState *s,
                                   QTestQMPEventCallback cb, void *opaque);
 
+/**
+ * qtest_qmp_set_migration_callback:
+ * @s: #QTestState instance to operate on
+ * @cb: callback to invoke for events
+ *
+ * Like qtest_qmp_set_event_callback, but includes migration state events
+ */
+void qtest_qmp_set_migration_callback(QTestState *s, QTestQMPEventCallback cb);
+
 /**
  * qtest_qmp_eventwait:
  * @s: #QTestState instance to operate on.
diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index fd3b94efa2..cffa525c81 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -92,6 +92,24 @@ void migrate_set_capability(QTestState *who, const char *capability,
                              capability, value);
 }
 
+void wait_for_stop(QTestState *who)
+{
+    QTestMigrationState *state = qtest_migration_state(who);
+
+    if (!state->stop_seen) {
+        qtest_qmp_eventwait(who, "STOP");
+    }
+}
+
+void wait_for_resume(QTestState *who)
+{
+    QTestMigrationState *state = qtest_migration_state(who);
+
+    if (!state->resume_seen) {
+        qtest_qmp_eventwait(who, "RESUME");
+    }
+}
+
 void migrate_incoming_qmp(QTestState *to, const char *uri, const char *fmt, ...)
 {
     va_list ap;
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index c1d4c84995..7297f1ff2c 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -15,13 +15,13 @@
 
 #include "libqtest.h"
 
-typedef struct QTestMigrationState {
-    bool stop_seen, resume_seen;
-} QTestMigrationState;
-
 bool migrate_watch_for_events(QTestState *who, const char *name,
                               QDict *event, void *opaque);
 
+
+void wait_for_stop(QTestState *who);
+void wait_for_resume(QTestState *who);
+
 G_GNUC_PRINTF(3, 4)
 void migrate_qmp(QTestState *who, const char *uri, const char *fmt, ...);
 
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 0425d1d527..88e611e98f 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -43,8 +43,6 @@
 unsigned start_address;
 unsigned end_address;
 static bool uffd_feature_thread_id;
-static QTestMigrationState src_state;
-static QTestMigrationState dst_state;
 
 /*
  * An initial 3 MB offset is used as that corresponds
@@ -230,13 +228,6 @@ static void wait_for_serial(const char *side)
     } while (true);
 }
 
-static void wait_for_stop(QTestState *who, QTestMigrationState *state)
-{
-    if (!state->stop_seen) {
-        qtest_qmp_eventwait(who, "STOP");
-    }
-}
-
 /*
  * It's tricky to use qemu's migration event capability with qtest,
  * events suddenly appearing confuse the qmp()/hmp() responses.
@@ -290,8 +281,9 @@ static void read_blocktime(QTestState *who)
 static void wait_for_migration_pass(QTestState *who)
 {
     uint64_t pass, prev_pass = 0, changes = 0;
+    QTestMigrationState *state = qtest_migration_state(who);
 
-    while (changes < 2 && !src_state.stop_seen) {
+    while (changes < 2 && !state->stop_seen) {
         usleep(1000);
         pass = get_migration_pass(who);
         changes += (pass != prev_pass);
@@ -622,7 +614,7 @@ static void migrate_postcopy_start(QTestState *from, QTestState *to)
 {
     qtest_qmp_assert_success(from, "{ 'execute': 'migrate-start-postcopy' }");
 
-    wait_for_stop(from, &src_state);
+    wait_for_stop(from);
     qtest_qmp_eventwait(to, "RESUME");
 }
 
@@ -757,9 +749,6 @@ static int test_migrate_start(QTestState **from, QTestState **to,
         }
     }
 
-    dst_state = (QTestMigrationState) { };
-    src_state = (QTestMigrationState) { };
-
     if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
         memory_size = "150M";
 
@@ -849,9 +838,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
                                  ignore_stderr);
     if (!args->only_target) {
         *from = qtest_init_with_env(QEMU_ENV_SRC, cmd_source);
-        qtest_qmp_set_event_callback(*from,
-                                     migrate_watch_for_events,
-                                     &src_state);
+        qtest_qmp_set_migration_callback(*from, migrate_watch_for_events);
     }
 
     cmd_target = g_strdup_printf("-accel kvm%s -accel tcg "
@@ -870,9 +857,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
                                  args->opts_target ? args->opts_target : "",
                                  ignore_stderr);
     *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target);
-    qtest_qmp_set_event_callback(*to,
-                                 migrate_watch_for_events,
-                                 &dst_state);
+    qtest_qmp_set_migration_callback(*to, migrate_watch_for_events);
 
     /*
      * Remove shmem file immediately to avoid memory leak in test failed case.
@@ -1622,7 +1607,7 @@ static void test_precopy_common(MigrateCommon *args)
          */
         if (args->result == MIG_TEST_SUCCEED) {
             qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
-            wait_for_stop(from, &src_state);
+            wait_for_stop(from);
             migrate_ensure_converge(from);
         }
     }
@@ -1668,7 +1653,7 @@ static void test_precopy_common(MigrateCommon *args)
              */
             wait_for_migration_complete(from);
 
-            wait_for_stop(from, &src_state);
+            wait_for_stop(from);
 
         } else {
             wait_for_migration_complete(from);
@@ -1682,10 +1667,7 @@ static void test_precopy_common(MigrateCommon *args)
             qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
         }
 
-        if (!dst_state.resume_seen) {
-            qtest_qmp_eventwait(to, "RESUME");
-        }
-
+        wait_for_resume(to);
         wait_for_serial("dest_serial");
     }
 
@@ -1723,7 +1705,7 @@ static void test_file_common(MigrateCommon *args, bool stop_src)
 
     if (stop_src) {
         qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
-        wait_for_stop(from, &src_state);
+        wait_for_stop(from);
     }
 
     if (args->result == MIG_TEST_QMP_ERROR) {
@@ -1745,10 +1727,7 @@ static void test_file_common(MigrateCommon *args, bool stop_src)
         qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
     }
 
-    if (!dst_state.resume_seen) {
-        qtest_qmp_eventwait(to, "RESUME");
-    }
-
+    wait_for_resume(to);
     wait_for_serial("dest_serial");
 
 finish:
@@ -1866,7 +1845,7 @@ static void test_ignore_shared(void)
 
     migrate_wait_for_dirty_mem(from, to);
 
-    wait_for_stop(from, &src_state);
+    wait_for_stop(from);
 
     qtest_qmp_eventwait(to, "RESUME");
 
@@ -2376,7 +2355,7 @@ static void test_migrate_auto_converge(void)
             break;
         }
         usleep(20);
-        g_assert_false(src_state.stop_seen);
+        g_assert_false(qtest_migration_state(from)->stop_seen);
     } while (true);
     /* The first percentage of throttling should be at least init_pct */
     g_assert_cmpint(percentage, >=, init_pct);
@@ -2715,7 +2694,7 @@ static void test_multifd_tcp_cancel(void)
 
     migrate_ensure_converge(from);
 
-    wait_for_stop(from, &src_state);
+    wait_for_stop(from);
     qtest_qmp_eventwait(to2, "RESUME");
 
     wait_for_serial("dest_serial");
-- 
2.35.3




* [PATCH v2 03/29] tests/qtest: Allow waiting for migration events
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 01/29] tests/qtest: migration events Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 02/29] tests/qtest: Move QTestMigrationState to libqtest Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 04/29] migration: Return the saved state from global_state_store Fabiano Rosas
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini

Add support for waiting for a migration state change event to
happen. This helps disambiguate the various runstate changes that
happen during the VM's lifecycle.

Specifically, the next couple of patches want to know whether STOP
events happened at the migration start or end. Add the "setup" and
"active" migration states for that purpose.

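A test can then synchronize with the migration lifecycle without
racing against events that may have already been consumed, roughly
like this (sketch using the helpers added below):

  migrate_qmp(from, uri, "{}");
  wait_for_setup(from);   /* returns at once if "setup" was already seen */
  wait_for_active(from);  /* likewise for "active" */
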
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/libqtest.h          |  2 +
 tests/qtest/migration-helpers.c | 66 ++++++++++++++++++++++++++++-----
 tests/qtest/migration-helpers.h |  2 +
 3 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/tests/qtest/libqtest.h b/tests/qtest/libqtest.h
index 0421a1da24..67fc2ae487 100644
--- a/tests/qtest/libqtest.h
+++ b/tests/qtest/libqtest.h
@@ -26,6 +26,8 @@ typedef struct QTestState QTestState;
 struct QTestMigrationState {
     bool stop_seen;
     bool resume_seen;
+    bool setup_seen;
+    bool active_seen;
 };
 typedef struct QTestMigrationState QTestMigrationState;
 
diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index cffa525c81..a3beff8b57 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -34,6 +34,22 @@ bool migrate_watch_for_events(QTestState *who, const char *name,
         return true;
     } else if (g_str_equal(name, "RESUME")) {
         state->resume_seen = true;
+        return true;
+    } else if (g_str_equal(name, "MIGRATION")) {
+        QDict *data;
+        g_assert(qdict_haskey(event, "data"));
+
+        data = qdict_get_qdict(event, "data");
+        g_assert(qdict_haskey(data, "status"));
+
+        if (g_str_equal(qdict_get_str(data, "status"), "setup")) {
+            state->setup_seen = true;
+        } else if (g_str_equal(qdict_get_str(data, "status"), "active")) {
+            state->active_seen = true;
+        } else {
+            return false;
+        }
+
         return true;
     }
 
@@ -110,10 +126,49 @@ void wait_for_resume(QTestState *who)
     }
 }
 
+static void wait_for_migration_state(QTestState *who, const char *state)
+{
+    QDict *rsp, *data;
+
+    for (;;) {
+        rsp = qtest_qmp_eventwait_ref(who, "MIGRATION");
+        g_assert(qdict_haskey(rsp, "data"));
+
+        data = qdict_get_qdict(rsp, "data");
+        g_assert(qdict_haskey(data, "status"));
+
+        if (g_str_equal(qdict_get_str(data, "status"), state)) {
+            break;
+        }
+        qobject_unref(rsp);
+    }
+
+    qobject_unref(rsp);
+    return;
+}
+
+void wait_for_setup(QTestState *who)
+{
+    QTestMigrationState *state = qtest_migration_state(who);
+
+    if (!state->setup_seen) {
+        wait_for_migration_state(who, "setup");
+    }
+}
+
+void wait_for_active(QTestState *who)
+{
+    QTestMigrationState *state = qtest_migration_state(who);
+
+    if (!state->active_seen) {
+        wait_for_migration_state(who, "active");
+    }
+}
+
 void migrate_incoming_qmp(QTestState *to, const char *uri, const char *fmt, ...)
 {
     va_list ap;
-    QDict *args, *rsp, *data;
+    QDict *args, *rsp;
 
     va_start(ap, fmt);
     args = qdict_from_vjsonf_nofail(fmt, ap);
@@ -129,14 +184,7 @@ void migrate_incoming_qmp(QTestState *to, const char *uri, const char *fmt, ...)
     g_assert(qdict_haskey(rsp, "return"));
     qobject_unref(rsp);
 
-    rsp = qtest_qmp_eventwait_ref(to, "MIGRATION");
-    g_assert(qdict_haskey(rsp, "data"));
-
-    data = qdict_get_qdict(rsp, "data");
-    g_assert(qdict_haskey(data, "status"));
-    g_assert_cmpstr(qdict_get_str(data, "status"), ==, "setup");
-
-    qobject_unref(rsp);
+    wait_for_setup(to);
 }
 
 /*
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index 7297f1ff2c..11a93dd48d 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -21,6 +21,8 @@ bool migrate_watch_for_events(QTestState *who, const char *name,
 
 void wait_for_stop(QTestState *who);
 void wait_for_resume(QTestState *who);
+void wait_for_setup(QTestState *who);
+void wait_for_active(QTestState *who);
 
 G_GNUC_PRINTF(3, 4)
 void migrate_qmp(QTestState *who, const char *uri, const char *fmt, ...);
-- 
2.35.3




* [PATCH v2 04/29] migration: Return the saved state from global_state_store
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (2 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 03/29] tests/qtest: Allow waiting for migration events Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-25 10:19   ` Daniel P. Berrangé
  2023-10-23 20:35 ` [PATCH v2 05/29] migration: Introduce global_state_store_once Fabiano Rosas
                   ` (24 subsequent siblings)
  28 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

There is a pattern of calling runstate_get() to store the current
runstate and calling global_state_store() to save the current runstate
for migration. Since global_state_store() also calls runstate_get(),
make it return the runstate instead.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/migration/global_state.h | 2 +-
 migration/global_state.c         | 7 +++++--
 migration/migration.c            | 6 ++----
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/migration/global_state.h b/include/migration/global_state.h
index d7c2cd3216..e268dc1f18 100644
--- a/include/migration/global_state.h
+++ b/include/migration/global_state.h
@@ -16,7 +16,7 @@
 #include "qapi/qapi-types-run-state.h"
 
 void register_global_state(void);
-void global_state_store(void);
+RunState global_state_store(void);
 void global_state_store_running(void);
 bool global_state_received(void);
 RunState global_state_get_runstate(void);
diff --git a/migration/global_state.c b/migration/global_state.c
index 4e2a9d8ec0..d094af6198 100644
--- a/migration/global_state.c
+++ b/migration/global_state.c
@@ -37,9 +37,12 @@ static void global_state_do_store(RunState state)
               state_str, '\0');
 }
 
-void global_state_store(void)
+RunState global_state_store(void)
 {
-    global_state_do_store(runstate_get());
+    RunState r = runstate_get();
+
+    global_state_do_store(r);
+    return r;
 }
 
 void global_state_store_running(void)
diff --git a/migration/migration.c b/migration/migration.c
index 67547eb6a1..0c23117369 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2344,8 +2344,7 @@ static int migration_completion_precopy(MigrationState *s,
     s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
 
-    s->vm_old_state = runstate_get();
-    global_state_store();
+    s->vm_old_state = global_state_store();
 
     ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
     trace_migration_completion_vm_stop(ret);
@@ -3201,9 +3200,8 @@ static void *bg_migration_thread(void *opaque)
      * transition in vm_stop_force_state() we need to wakeup it up.
      */
     qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
-    s->vm_old_state = runstate_get();
+    s->vm_old_state = global_state_store();
 
-    global_state_store();
     /* Forcibly stop VM before saving state of vCPUs and devices */
     if (vm_stop_force_state(RUN_STATE_PAUSED)) {
         goto fail;
-- 
2.35.3




* [PATCH v2 05/29] migration: Introduce global_state_store_once
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (3 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 04/29] migration: Return the saved state from global_state_store Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 06/29] migration: Add auto-pause capability Fabiano Rosas
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

There are some situations during migration when we want to change the
runstate of the VM, but don't actually want the new runstate to be put
on the wire to be restored on the destination VM. In those cases, the
pattern is to use global_state_store() to save the state for migration
before changing it.

One scenario where this happens is when switching the source VM into
the FINISH_MIGRATE state. This state only makes sense on the source
VM. Another situation is when pausing the source VM prior to migration
completion.

We are about to introduce a third scenario, in which the whole
migration should be performed with a paused VM. In this case we want
to save the VM runstate at the very start of the migration, and that
state will be the one restored on the destination regardless of any
runstate changes that happen in between.

To achieve that we need to make sure that the other two calls to
global_state_store() do not overwrite the state that is to be
migrated.

Introduce a version of global_state_store() that only saves the state
if no other state has already been saved.
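
The semantics are: the first call stores the current runstate, and
subsequent calls return the stored state without overwriting it.
Roughly (sketch):

    /* early, VM still running: stores and returns RUN_STATE_RUNNING */
    global_state_store_once();

    /* later, after runstate changes: a state was already stored, so
     * this just returns it and leaves the migrated state untouched */
    s->vm_old_state = global_state_store_once();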

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/migration/global_state.h |  1 +
 migration/global_state.c         | 13 +++++++++++++
 migration/migration.c            |  6 +++---
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/migration/global_state.h b/include/migration/global_state.h
index e268dc1f18..9d3624100e 100644
--- a/include/migration/global_state.h
+++ b/include/migration/global_state.h
@@ -17,6 +17,7 @@
 
 void register_global_state(void);
 RunState global_state_store(void);
+RunState global_state_store_once(void);
 void global_state_store_running(void);
 bool global_state_received(void);
 RunState global_state_get_runstate(void);
diff --git a/migration/global_state.c b/migration/global_state.c
index d094af6198..beb00039d9 100644
--- a/migration/global_state.c
+++ b/migration/global_state.c
@@ -45,6 +45,19 @@ RunState global_state_store(void)
     return r;
 }
 
+RunState global_state_store_once(void)
+{
+    int r;
+    char *runstate = (char *)global_state.runstate;
+
+    r = qapi_enum_parse(&RunState_lookup, runstate, -1, NULL);
+    if (r < 0) {
+        return global_state_store();
+    }
+
+    return r;
+}
+
 void global_state_store_running(void)
 {
     global_state_do_store(RUN_STATE_RUNNING);
diff --git a/migration/migration.c b/migration/migration.c
index 0c23117369..a6efbd837a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2144,7 +2144,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
     trace_postcopy_start_set_run();
 
     qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
-    global_state_store();
+    global_state_store_once();
     ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
     if (ret < 0) {
         goto fail;
@@ -2344,7 +2344,7 @@ static int migration_completion_precopy(MigrationState *s,
     s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
 
-    s->vm_old_state = global_state_store();
+    s->vm_old_state = global_state_store_once();
 
     ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
     trace_migration_completion_vm_stop(ret);
@@ -3200,7 +3200,7 @@ static void *bg_migration_thread(void *opaque)
      * transition in vm_stop_force_state() we need to wakeup it up.
      */
     qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
-    s->vm_old_state = global_state_store();
+    s->vm_old_state = global_state_store_once();
 
     /* Forcibly stop VM before saving state of vCPUs and devices */
     if (vm_stop_force_state(RUN_STATE_PAUSED)) {
-- 
2.35.3




* [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (4 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 05/29] migration: Introduce global_state_store_once Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-24  5:25   ` Markus Armbruster
  2023-10-25  8:48   ` Daniel P. Berrangé
  2023-10-23 20:35 ` [PATCH v2 07/29] migration: Run "file:" migration with a stopped VM Fabiano Rosas
                   ` (22 subsequent siblings)
  28 siblings, 2 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Add a capability that allows the management layer to delegate to QEMU
the decision of whether to pause a VM and perform a non-live
migration. Depending on the type of migration being performed, this
could bring performance benefits.

Note that the capability is enabled by default, but at this moment no
migration scheme makes use of it.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 19 +++++++++++++++++++
 migration/options.c   |  9 +++++++++
 migration/options.h   |  1 +
 qapi/migration.json   |  6 +++++-
 4 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index a6efbd837a..8b0c3b0911 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -124,6 +124,20 @@ migration_channels_and_uri_compatible(const char *uri, Error **errp)
     return true;
 }
 
+static bool migration_should_pause(const char *uri)
+{
+    if (!migrate_auto_pause()) {
+        return false;
+    }
+
+    /*
+     * Return true for migration schemes that benefit from a nonlive
+     * migration.
+     */
+
+    return false;
+}
+
 static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
 {
     uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
@@ -1724,6 +1738,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
         }
     }
 
+    if (migration_should_pause(uri)) {
+        global_state_store();
+        vm_stop_force_state(RUN_STATE_PAUSED);
+    }
+
     if (strstart(uri, "tcp:", &p) ||
         strstart(uri, "unix:", NULL) ||
         strstart(uri, "vsock:", NULL)) {
diff --git a/migration/options.c b/migration/options.c
index 42fb818956..c3def757fe 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -200,6 +200,8 @@ Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-switchover-ack",
                         MIGRATION_CAPABILITY_SWITCHOVER_ACK),
     DEFINE_PROP_MIG_CAP("x-dirty-limit", MIGRATION_CAPABILITY_DIRTY_LIMIT),
+    DEFINE_PROP_BOOL("x-auto-pause", MigrationState,
+                     capabilities[MIGRATION_CAPABILITY_AUTO_PAUSE], true),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -210,6 +212,13 @@ bool migrate_auto_converge(void)
     return s->capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
 }
 
+bool migrate_auto_pause(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->capabilities[MIGRATION_CAPABILITY_AUTO_PAUSE];
+}
+
 bool migrate_background_snapshot(void)
 {
     MigrationState *s = migrate_get_current();
diff --git a/migration/options.h b/migration/options.h
index 237f2d6b4a..d1ba5c9de7 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -24,6 +24,7 @@ extern Property migration_properties[];
 /* capabilities */
 
 bool migrate_auto_converge(void);
+bool migrate_auto_pause(void);
 bool migrate_background_snapshot(void);
 bool migrate_block(void);
 bool migrate_colo(void);
diff --git a/qapi/migration.json b/qapi/migration.json
index db3df12d6c..74f12adc0e 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -523,6 +523,10 @@
 #     and can result in more stable read performance.  Requires KVM
 #     with accelerator property "dirty-ring-size" set.  (Since 8.1)
 #
+# @auto-pause: If enabled, allows QEMU to decide whether to pause the
+#     VM before migration for optimal migration performance.
+#     Enabled by default. (since 8.1)
+#
 # Features:
 #
 # @unstable: Members @x-colo and @x-ignore-shared are experimental.
@@ -539,7 +543,7 @@
            { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
            'validate-uuid', 'background-snapshot',
            'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
-           'dirty-limit'] }
+           'dirty-limit', 'auto-pause'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.35.3




* [PATCH v2 07/29] migration: Run "file:" migration with a stopped VM
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (5 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 06/29] migration: Add auto-pause capability Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 08/29] tests/qtest: File migration auto-pause tests Fabiano Rosas
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

The file migration is asynchronous, so it benefits from being done
with a stopped VM. Allow the file migration to take advantage of the
auto-pause capability.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 8b0c3b0911..692fbc5ad6 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -135,6 +135,10 @@ static bool migration_should_pause(const char *uri)
      * migration.
      */
 
+    if (strstart(uri, "file:", NULL)) {
+        return true;
+    }
+
     return false;
 }
 
-- 
2.35.3




* [PATCH v2 08/29] tests/qtest: File migration auto-pause tests
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (6 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 07/29] migration: Run "file:" migration with a stopped VM Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 09/29] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file Fabiano Rosas
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini

Adapt the file migration tests to take into account the auto-pause
feature.

The test currently has a flag 'stop_src' that is used to know if the
test itself should stop the VM. Add a new flag 'auto_pause' to enable
QEMU to stop the VM instead. The two in combination allow us to
migrate an already stopped VM and check that it is still stopped on
the destination (auto-pause in effect restoring the original state).

By adding more precise tracking of migration state changes, we can
also make sure that auto-pause actually stops the VM right after
qmp_migrate(), as opposed to the vm_stop() that happens at
migration_complete().

When resuming the destination, a similar situation occurs: we use
'stop_src' to have a stopped VM and check that the destination does
not get a "resume" event. The flag combinations are summarized below.
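
In summary (derived from the test changes below), who stops the
source VM and whether the test restarts the destination with 'cont':

  stop_src  auto_pause | source VM stopped by      | destination
  ---------------------+---------------------------+------------------
  true      any        | the test ('stop' command) | test runs 'cont'
  false     true       | QEMU, right after setup   | test runs 'cont'
  false     false      | migration completion      | resumes on its own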

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration-test.c | 41 ++++++++++++++++++++++++++++++------
 1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 88e611e98f..06a7dd3c0a 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1679,7 +1679,7 @@ finish:
     test_migrate_end(from, to, args->result == MIG_TEST_SUCCEED);
 }
 
-static void test_file_common(MigrateCommon *args, bool stop_src)
+static void test_file_common(MigrateCommon *args, bool stop_src, bool auto_pause)
 {
     QTestState *from, *to;
     void *data_hook = NULL;
@@ -1689,6 +1689,13 @@ static void test_file_common(MigrateCommon *args, bool stop_src)
         return;
     }
 
+    migrate_set_capability(from, "events", true);
+    migrate_set_capability(to, "events", true);
+
+    if (!auto_pause) {
+        migrate_set_capability(from, "auto-pause", false);
+    }
+
     /*
      * File migration is never live. We can keep the source VM running
      * during migration, but the destination will not be running
@@ -1712,8 +1719,24 @@ static void test_file_common(MigrateCommon *args, bool stop_src)
         migrate_qmp_fail(from, connect_uri, "{}");
         goto finish;
     }
-
     migrate_qmp(from, connect_uri, "{}");
+
+    wait_for_setup(from);
+
+    /* auto-pause stops the VM right after setup */
+    if (auto_pause && !stop_src) {
+        wait_for_stop(from);
+    }
+
+    wait_for_active(from);
+
+    /*
+     * If the VM is not already stopped by the test or auto-pause,
+     * migration completion will stop it.
+     */
+    if (!stop_src && !auto_pause) {
+        wait_for_stop(from);
+    }
     wait_for_migration_complete(from);
 
     /*
@@ -1721,9 +1744,15 @@ static void test_file_common(MigrateCommon *args, bool stop_src)
      * destination.
      */
     migrate_incoming_qmp(to, connect_uri, "{}");
+    wait_for_active(to);
     wait_for_migration_complete(to);
 
-    if (stop_src) {
+    if (stop_src || auto_pause) {
+        /*
+         * The VM has been paused on source by either the test or
+         * auto-pause, re-start on destination to make sure it won't
+         * crash.
+         */
         qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
     }
 
@@ -1940,7 +1969,7 @@ static void test_precopy_file(void)
         .listen_uri = "defer",
     };
 
-    test_file_common(&args, true);
+    test_file_common(&args, true, true);
 }
 
 static void file_offset_finish_hook(QTestState *from, QTestState *to,
@@ -1984,7 +2013,7 @@ static void test_precopy_file_offset(void)
         .finish_hook = file_offset_finish_hook,
     };
 
-    test_file_common(&args, false);
+    test_file_common(&args, false, true);
 }
 
 static void test_precopy_file_offset_bad(void)
@@ -1998,7 +2027,7 @@ static void test_precopy_file_offset_bad(void)
         .result = MIG_TEST_QMP_ERROR,
     };
 
-    test_file_common(&args, false);
+    test_file_common(&args, false, false);
 }
 
 static void test_precopy_tcp_plain(void)
-- 
2.35.3




* [PATCH v2 09/29] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (7 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 08/29] tests/qtest: File migration auto-pause tests Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 10/29] io: Add generic pwritev/preadv interface Fabiano Rosas
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

Add a generic QIOChannel feature SEEKABLE which would be used by the
qemu_file* apis. For the time being this will be only implemented for
file channels.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 include/io/channel.h | 1 +
 io/channel-file.c    | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 5f9dbaab65..fcb19fd672 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -44,6 +44,7 @@ enum QIOChannelFeature {
     QIO_CHANNEL_FEATURE_LISTEN,
     QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
     QIO_CHANNEL_FEATURE_READ_MSG_PEEK,
+    QIO_CHANNEL_FEATURE_SEEKABLE,
 };
 
 
diff --git a/io/channel-file.c b/io/channel-file.c
index 4a12c61886..f91bf6db1c 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -36,6 +36,10 @@ qio_channel_file_new_fd(int fd)
 
     ioc->fd = fd;
 
+    if (lseek(fd, 0, SEEK_CUR) != (off_t)-1) {
+        qio_channel_set_feature(QIO_CHANNEL(ioc), QIO_CHANNEL_FEATURE_SEEKABLE);
+    }
+
     trace_qio_channel_file_new_fd(ioc, fd);
 
     return ioc;
@@ -60,6 +64,10 @@ qio_channel_file_new_path(const char *path,
         return NULL;
     }
 
+    if (lseek(ioc->fd, 0, SEEK_CUR) != (off_t)-1) {
+        qio_channel_set_feature(QIO_CHANNEL(ioc), QIO_CHANNEL_FEATURE_SEEKABLE);
+    }
+
     trace_qio_channel_file_new_path(ioc, path, flags, mode, ioc->fd);
 
     return ioc;
-- 
2.35.3




* [PATCH v2 10/29] io: Add generic pwritev/preadv interface
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (8 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 09/29] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-24  8:18   ` Daniel P. Berrangé
  2023-10-23 20:35 ` [PATCH v2 11/29] io: implement io_pwritev/preadv for QIOChannelFile Fabiano Rosas
                   ` (18 subsequent siblings)
  28 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

Introduce basic pwritev/preadv support in the generic channel
layer. A concrete implementation will follow for the file channel, as
this is required to support migration streams with a fixed location
for each RAM page.
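
A caller is expected to probe for the SEEKABLE feature and can then
do positioned I/O, along these lines (illustrative sketch, not code
from this patch):

  if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SEEKABLE)) {
      error_setg(errp, "channel is not seekable");
      return -1;
  }

  /* write 'buflen' bytes from 'buf' at absolute offset 'offset' */
  if (qio_channel_pwritev(ioc, buf, buflen, offset, errp) < 0) {
      return -1;
  }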

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/io/channel.h | 82 ++++++++++++++++++++++++++++++++++++++++++++
 io/channel.c         | 58 +++++++++++++++++++++++++++++++
 2 files changed, 140 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index fcb19fd672..a8181d576a 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -131,6 +131,16 @@ struct QIOChannelClass {
                            Error **errp);
 
     /* Optional callbacks */
+    ssize_t (*io_pwritev)(QIOChannel *ioc,
+                          const struct iovec *iov,
+                          size_t niov,
+                          off_t offset,
+                          Error **errp);
+    ssize_t (*io_preadv)(QIOChannel *ioc,
+                         const struct iovec *iov,
+                         size_t niov,
+                         off_t offset,
+                         Error **errp);
     int (*io_shutdown)(QIOChannel *ioc,
                        QIOChannelShutdown how,
                        Error **errp);
@@ -529,6 +539,78 @@ void qio_channel_set_follow_coroutine_ctx(QIOChannel *ioc, bool enabled);
 int qio_channel_close(QIOChannel *ioc,
                       Error **errp);
 
+/**
+ * qio_channel_pwritev_full
+ * @ioc: the channel object
+ * @iov: the array of memory regions to write data from
+ * @niov: the length of the @iov array
+ * @offset: offset in the channel where writes should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error. To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ * Behaves like qio_channel_writev_full, except that it does not
+ * support sending file handles, and the write begins at the
+ * passed @offset.
+ *
+ */
+ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
+                                 size_t niov, off_t offset, Error **errp);
+
+/**
+ * qio_channel_pwritev
+ * @ioc: the channel object
+ * @buf: the memory region to write data from
+ * @buflen: the number of bytes in @buf
+ * @offset: offset in the channel where writes should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error. To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ */
+ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
+                            off_t offset, Error **errp);
+
+/**
+ * qio_channel_preadv_full
+ * @ioc: the channel object
+ * @iov: the array of memory regions to read data into
+ * @niov: the length of the @iov array
+ * @offset: offset in the channel where reads should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error.  To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ * Behaves like qio_channel_readv_full, except that it does not
+ * support receiving file handles, and the read begins at the
+ * passed @offset.
+ *
+ */
+ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
+                                size_t niov, off_t offset, Error **errp);
+
+/**
+ * qio_channel_preadv
+ * @ioc: the channel object
+ * @buf: the memory region to read data into
+ * @buflen: the number of bytes in @buf
+ * @offset: offset in the channel where reads should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error.  To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ */
+ssize_t qio_channel_preadv(QIOChannel *ioc, char *buf, size_t buflen,
+                           off_t offset, Error **errp);
+
 /**
  * qio_channel_shutdown:
  * @ioc: the channel object
diff --git a/io/channel.c b/io/channel.c
index 86c5834510..770d61ea00 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -454,6 +454,64 @@ GSource *qio_channel_add_watch_source(QIOChannel *ioc,
 }
 
 
+ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
+                                 size_t niov, off_t offset, Error **errp)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (!klass->io_pwritev) {
+        error_setg(errp, "Channel does not support pwritev");
+        return -1;
+    }
+
+    if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SEEKABLE)) {
+        error_setg_errno(errp, EINVAL, "Requested channel is not seekable");
+        return -1;
+    }
+
+    return klass->io_pwritev(ioc, iov, niov, offset, errp);
+}
+
+ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
+                            off_t offset, Error **errp)
+{
+    struct iovec iov = {
+        .iov_base = buf,
+        .iov_len = buflen
+    };
+
+    return qio_channel_pwritev_full(ioc, &iov, 1, offset, errp);
+}
+
+ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
+                                size_t niov, off_t offset, Error **errp)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (!klass->io_preadv) {
+        error_setg(errp, "Channel does not support preadv");
+        return -1;
+    }
+
+    if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SEEKABLE)) {
+        error_setg_errno(errp, EINVAL, "Requested channel is not seekable");
+        return -1;
+    }
+
+    return klass->io_preadv(ioc, iov, niov, offset, errp);
+}
+
+ssize_t qio_channel_preadv(QIOChannel *ioc, char *buf, size_t buflen,
+                           off_t offset, Error **errp)
+{
+    struct iovec iov = {
+        .iov_base = buf,
+        .iov_len = buflen
+    };
+
+    return qio_channel_preadv_full(ioc, &iov, 1, offset, errp);
+}
+
 int qio_channel_shutdown(QIOChannel *ioc,
                          QIOChannelShutdown how,
                          Error **errp)
-- 
2.35.3




* [PATCH v2 11/29] io: implement io_pwritev/preadv for QIOChannelFile
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (9 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 10/29] io: Add generic pwritev/preadv interface Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 12/29] migration/qemu-file: add utility methods for working with seekable channels Fabiano Rosas
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

The upcoming 'fixed-ram' feature will require QEMU to write data to
(and restore it from) specific offsets of the migration file.

Add a minimal implementation of pwritev/preadv and expose them via the
io_pwritev and io_preadv interfaces.
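
As an illustration of the intended use (a sketch only; the path,
buffer and flags here are made up, error handling elided), a caller
can now do positioned I/O without disturbing the channel's file
offset:

    QIOChannelFile *fioc;
    Error *err = NULL;
    char buf[4096] = {0};

    fioc = qio_channel_file_new_path("/tmp/state.bin",
                                     O_CREAT | O_WRONLY, 0600, &err);
    if (fioc &&
        qio_channel_has_feature(QIO_CHANNEL(fioc),
                                QIO_CHANNEL_FEATURE_SEEKABLE)) {
        /* write 4k at file offset 1M; the fd offset is left alone */
        qio_channel_pwritev(QIO_CHANNEL(fioc), buf, sizeof(buf),
                            0x100000, &err);
    }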

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 io/channel-file.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/io/channel-file.c b/io/channel-file.c
index f91bf6db1c..731a2fbc68 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -146,6 +146,56 @@ static ssize_t qio_channel_file_writev(QIOChannel *ioc,
     return ret;
 }
 
+static ssize_t qio_channel_file_preadv(QIOChannel *ioc,
+                                       const struct iovec *iov,
+                                       size_t niov,
+                                       off_t offset,
+                                       Error **errp)
+{
+    QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+    ssize_t ret;
+
+ retry:
+    ret = preadv(fioc->fd, iov, niov, offset);
+    if (ret < 0) {
+        if (errno == EAGAIN) {
+            return QIO_CHANNEL_ERR_BLOCK;
+        }
+        if (errno == EINTR) {
+            goto retry;
+        }
+
+        error_setg_errno(errp, errno, "Unable to read from file");
+        return -1;
+    }
+
+    return ret;
+}
+
+static ssize_t qio_channel_file_pwritev(QIOChannel *ioc,
+                                        const struct iovec *iov,
+                                        size_t niov,
+                                        off_t offset,
+                                        Error **errp)
+{
+    QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+    ssize_t ret;
+
+ retry:
+    ret = pwritev(fioc->fd, iov, niov, offset);
+    if (ret <= 0) {
+        if (errno == EAGAIN) {
+            return QIO_CHANNEL_ERR_BLOCK;
+        }
+        if (errno == EINTR) {
+            goto retry;
+        }
+        error_setg_errno(errp, errno, "Unable to write to file");
+        return -1;
+    }
+    return ret;
+}
+
 static int qio_channel_file_set_blocking(QIOChannel *ioc,
                                          bool enabled,
                                          Error **errp)
@@ -231,6 +281,8 @@ static void qio_channel_file_class_init(ObjectClass *klass,
     ioc_klass->io_writev = qio_channel_file_writev;
     ioc_klass->io_readv = qio_channel_file_readv;
     ioc_klass->io_set_blocking = qio_channel_file_set_blocking;
+    ioc_klass->io_pwritev = qio_channel_file_pwritev;
+    ioc_klass->io_preadv = qio_channel_file_preadv;
     ioc_klass->io_seek = qio_channel_file_seek;
     ioc_klass->io_close = qio_channel_file_close;
     ioc_klass->io_create_watch = qio_channel_file_create_watch;
-- 
2.35.3




* [PATCH v2 12/29] migration/qemu-file: add utility methods for working with seekable channels
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (10 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 11/29] io: implement io_pwritev/preadv for QIOChannelFile Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-25 10:22   ` Daniel P. Berrangé
  2023-10-23 20:35 ` [PATCH v2 13/29] migration: fixed-ram: Add URI compatibility check Fabiano Rosas
                   ` (16 subsequent siblings)
  28 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

Add utility methods that will be needed when implementing the
'fixed-ram' migration capability:

qemu_file_is_seekable
qemu_put_buffer_at
qemu_get_buffer_at
qemu_set_offset
qemu_get_offset
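
As a rough sketch of how these are meant to be used together
(illustrative only; 'page' and 'fixed_offset' are hypothetical
variables, not code from this series):

    if (qemu_file_is_seekable(f)) {
        /*
         * Write the page at its fixed location in the file; the
         * stream position is unaffected since this goes through
         * pwritev internally.
         */
        qemu_put_buffer_at(f, page, TARGET_PAGE_SIZE, fixed_offset);
    }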

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
fixed total_transferred accounting

restructured to use qio_channel_file_preadv instead of the _full
variant
---
 include/migration/qemu-file-types.h |  2 +
 migration/qemu-file.c               | 80 +++++++++++++++++++++++++++++
 migration/qemu-file.h               |  4 ++
 3 files changed, 86 insertions(+)

diff --git a/include/migration/qemu-file-types.h b/include/migration/qemu-file-types.h
index 9ba163f333..adec5abc07 100644
--- a/include/migration/qemu-file-types.h
+++ b/include/migration/qemu-file-types.h
@@ -50,6 +50,8 @@ unsigned int qemu_get_be16(QEMUFile *f);
 unsigned int qemu_get_be32(QEMUFile *f);
 uint64_t qemu_get_be64(QEMUFile *f);
 
+bool qemu_file_is_seekable(QEMUFile *f);
+
 static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
 {
     qemu_put_be64(f, *pv);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 3fb25148d1..c8f635aa2b 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -33,6 +33,7 @@
 #include "options.h"
 #include "qapi/error.h"
 #include "rdma.h"
+#include "io/channel-file.h"
 
 #define IO_BUF_SIZE 32768
 #define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64)
@@ -258,6 +259,10 @@ static void qemu_iovec_release_ram(QEMUFile *f)
     memset(f->may_free, 0, sizeof(f->may_free));
 }
 
+bool qemu_file_is_seekable(QEMUFile *f)
+{
+    return qio_channel_has_feature(f->ioc, QIO_CHANNEL_FEATURE_SEEKABLE);
+}
 
 /**
  * Flushes QEMUFile buffer
@@ -460,6 +465,81 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, size_t size)
     }
 }
 
+void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen, off_t pos)
+{
+    Error *err = NULL;
+
+    if (f->last_error) {
+        return;
+    }
+
+    qemu_fflush(f);
+    qio_channel_pwritev(f->ioc, (char *)buf, buflen, pos, &err);
+
+    if (err) {
+        qemu_file_set_error_obj(f, -EIO, err);
+    } else {
+        f->total_transferred += buflen;
+    }
+
+    return;
+}
+
+
+size_t qemu_get_buffer_at(QEMUFile *f, uint8_t *buf, size_t buflen, off_t pos)
+{
+    Error *err = NULL;
+    ssize_t ret;
+
+    if (f->last_error) {
+        return 0;
+    }
+
+    ret = qio_channel_preadv(f->ioc, (char *)buf, buflen, pos, &err);
+    if (ret == -1 || err) {
+        goto error;
+    }
+
+    return (size_t)ret;
+
+ error:
+    qemu_file_set_error_obj(f, -EIO, err);
+    return 0;
+}
+
+void qemu_set_offset(QEMUFile *f, off_t off, int whence)
+{
+    Error *err = NULL;
+    off_t ret;
+
+    qemu_fflush(f);
+
+    if (!qemu_file_is_writable(f)) {
+        f->buf_index = 0;
+        f->buf_size = 0;
+    }
+
+    ret = qio_channel_io_seek(f->ioc, off, whence, &err);
+    if (ret == (off_t)-1) {
+        qemu_file_set_error_obj(f, -EIO, err);
+    }
+}
+
+off_t qemu_get_offset(QEMUFile *f)
+{
+    Error *err = NULL;
+    off_t ret;
+
+    qemu_fflush(f);
+
+    ret = qio_channel_io_seek(f->ioc, 0, SEEK_CUR, &err);
+    if (ret == (off_t)-1) {
+        qemu_file_set_error_obj(f, -EIO, err);
+    }
+    return ret;
+}
+
+
 void qemu_put_byte(QEMUFile *f, int v)
 {
     if (f->last_error) {
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index a29c37b0d0..fef4a35a6a 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -93,6 +93,10 @@ QEMUFile *qemu_file_get_return_path(QEMUFile *f);
 void qemu_fflush(QEMUFile *f);
 void qemu_file_set_blocking(QEMUFile *f, bool block);
 int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
+void qemu_set_offset(QEMUFile *f, off_t off, int whence);
+off_t qemu_get_offset(QEMUFile *f);
+void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen, off_t pos);
+size_t qemu_get_buffer_at(QEMUFile *f, uint8_t *buf, size_t buflen, off_t pos);
 
 QIOChannel *qemu_file_get_ioc(QEMUFile *file);
 
-- 
2.35.3




* [PATCH v2 13/29] migration: fixed-ram: Add URI compatibility check
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (11 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 12/29] migration/qemu-file: add utility methods for working with seekable channels Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-25 10:27   ` Daniel P. Berrangé
  2023-10-31 16:06   ` Peter Xu
  2023-10-23 20:35 ` [PATCH v2 14/29] migration/ram: Introduce 'fixed-ram' migration capability Fabiano Rosas
                   ` (15 subsequent siblings)
  28 siblings, 2 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

The fixed-ram migration format needs a channel that supports seeking
to be able to write each page to an arbitrary offset in the migration
stream.
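
With the capability enabled, selecting a socket transport now fails
upfront instead of producing an unusable stream (illustrative HMP
session):

    (qemu) migrate_set_capability fixed-ram on
    (qemu) migrate tcp:127.0.0.1:4444
    Migration requires seekable transport (e.g. file)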

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 692fbc5ad6..cabb3ad3a5 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -106,22 +106,40 @@ static bool migration_needs_multiple_sockets(void)
     return migrate_multifd() || migrate_postcopy_preempt();
 }
 
+static bool migration_needs_seekable_channel(void)
+{
+    return migrate_fixed_ram();
+}
+
 static bool uri_supports_multi_channels(const char *uri)
 {
     return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
            strstart(uri, "vsock:", NULL);
 }
 
+static bool uri_supports_seeking(const char *uri)
+{
+    return strstart(uri, "file:", NULL);
+}
+
 static bool
 migration_channels_and_uri_compatible(const char *uri, Error **errp)
 {
+    bool compatible = true;
+
+    if (migration_needs_seekable_channel() &&
+        !uri_supports_seeking(uri)) {
+        error_setg(errp, "Migration requires seekable transport (e.g. file)");
+        compatible = false;
+    }
+
     if (migration_needs_multiple_sockets() &&
         !uri_supports_multi_channels(uri)) {
         error_setg(errp, "Migration requires multi-channel URIs (e.g. tcp)");
-        return false;
+        compatible = false;
     }
 
-    return true;
+    return compatible;
 }
 
 static bool migration_should_pause(const char *uri)
-- 
2.35.3




* [PATCH v2 14/29] migration/ram: Introduce 'fixed-ram' migration capability
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (12 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 13/29] migration: fixed-ram: Add URI compatibility check Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-24  5:33   ` Markus Armbruster
  2023-10-23 20:35 ` [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration Fabiano Rosas
                   ` (14 subsequent siblings)
  28 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Add a new migration capability 'fixed-ram'.

The core of the feature is to ensure that each ram page has a specific
offset in the resulting migration stream. The reasons why we'd want
such behavior are twofold:

 - When doing a 'fixed-ram' migration the resulting file will have a
   bounded size, since pages which are dirtied multiple times will
   always go to a fixed location in the file, rather than constantly
   being added to a sequential stream. This eliminates cases where a VM
   with, say, 1G of RAM could produce a migration file tens of GBs in
   size if the workload constantly redirties its memory.

 - It paves the way to implement direct-io (O_DIRECT) enabled
   save/restore of the migration stream, as the pages are guaranteed
   to be written at aligned offsets.

For now, enabling the capability has no effect. The next couple of
patches implement the core functionality.
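
For reference, the capability is set like any other, e.g. via QMP on
both source and destination (illustrative exchange):

    -> { "execute": "migrate-set-capabilities",
         "arguments": { "capabilities": [
           { "capability": "fixed-ram", "state": true } ] } }
    <- { "return": {} }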

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 docs/devel/migration.rst | 14 ++++++++++++++
 migration/options.c      | 37 +++++++++++++++++++++++++++++++++++++
 migration/options.h      |  1 +
 migration/savevm.c       |  1 +
 qapi/migration.json      |  5 ++++-
 5 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index c3e1400c0c..6f898b5dbd 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -566,6 +566,20 @@ Others (especially either older devices or system devices which for
 some reason don't have a bus concept) make use of the ``instance id``
 for otherwise identically named devices.
 
+Fixed-ram format
+----------------
+
+When the ``fixed-ram`` capability is enabled, a slightly different
+stream format is used for the RAM section. Instead of having a
+sequential stream of pages that follow the RAMBlock headers, the dirty
+pages for a RAMBlock follow its header. This ensures that each RAM
+page has a fixed offset in the resulting migration stream.
+
+The ``fixed-ram`` capability can be enabled in both source and
+destination with:
+
+    ``migrate_set_capability fixed-ram on``
+
 Return path
 -----------
 
diff --git a/migration/options.c b/migration/options.c
index c3def757fe..2622d8c483 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -202,6 +202,7 @@ Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-dirty-limit", MIGRATION_CAPABILITY_DIRTY_LIMIT),
     DEFINE_PROP_BOOL("x-auto-pause", MigrationState,
                      capabilities[MIGRATION_CAPABILITY_AUTO_PAUSE], true),
+    DEFINE_PROP_MIG_CAP("x-fixed-ram", MIGRATION_CAPABILITY_FIXED_RAM),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -268,6 +269,16 @@ bool migrate_events(void)
     return s->capabilities[MIGRATION_CAPABILITY_EVENTS];
 }
 
+bool migrate_fixed_ram(void)
+{
+/*
+    MigrationState *s = migrate_get_current();
+
+    return s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
+*/
+    return false;
+}
+
 bool migrate_ignore_shared(void)
 {
     MigrationState *s = migrate_get_current();
@@ -627,6 +638,32 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
         }
     }
 
+    if (new_caps[MIGRATION_CAPABILITY_FIXED_RAM]) {
+        if (new_caps[MIGRATION_CAPABILITY_MULTIFD]) {
+            error_setg(errp,
+                       "Fixed-ram migration is incompatible with multifd");
+            return false;
+        }
+
+        if (new_caps[MIGRATION_CAPABILITY_XBZRLE]) {
+            error_setg(errp,
+                       "Fixed-ram migration is incompatible with xbzrle");
+            return false;
+        }
+
+        if (new_caps[MIGRATION_CAPABILITY_COMPRESS]) {
+            error_setg(errp,
+                       "Fixed-ram migration is incompatible with compression");
+            return false;
+        }
+
+        if (new_caps[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
+            error_setg(errp,
+                       "Fixed-ram migration is incompatible with postcopy ram");
+            return false;
+        }
+    }
+
     return true;
 }
 
diff --git a/migration/options.h b/migration/options.h
index d1ba5c9de7..2a9e0e9e13 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -32,6 +32,7 @@ bool migrate_compress(void);
 bool migrate_dirty_bitmaps(void);
 bool migrate_dirty_limit(void);
 bool migrate_events(void);
+bool migrate_fixed_ram(void);
 bool migrate_ignore_shared(void);
 bool migrate_late_block_activate(void);
 bool migrate_multifd(void);
diff --git a/migration/savevm.c b/migration/savevm.c
index 8622f229e5..54e084122a 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -243,6 +243,7 @@ static bool should_validate_capability(int capability)
     /* Validate only new capabilities to keep compatibility. */
     switch (capability) {
     case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
+    case MIGRATION_CAPABILITY_FIXED_RAM:
         return true;
     default:
         return false;
diff --git a/qapi/migration.json b/qapi/migration.json
index 74f12adc0e..1317dd32ab 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -527,6 +527,9 @@
 #     VM before migration for an optimal migration performance.
 #     Enabled by default. (since 8.1)
 #
+# @fixed-ram: Migrate using fixed offsets for each RAM page. Requires
+#             a seekable transport such as a file.  (since 8.1)
+#
 # Features:
 #
 # @unstable: Members @x-colo and @x-ignore-shared are experimental.
@@ -543,7 +546,7 @@
            { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
            'validate-uuid', 'background-snapshot',
            'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
-           'dirty-limit', 'auto-pause'] }
+           'dirty-limit', 'auto-pause', 'fixed-ram'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.35.3




* [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (13 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 14/29] migration/ram: Introduce 'fixed-ram' migration capability Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-25  9:39   ` Daniel P. Berrangé
  2023-10-31 16:52   ` Peter Xu
  2023-10-23 20:35 ` [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore Fabiano Rosas
                   ` (13 subsequent siblings)
  28 siblings, 2 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé

From: Nikolay Borisov <nborisov@suse.com>

Implement the outgoing migration side for the 'fixed-ram' capability.

A bitmap is introduced to track which pages have been written in the
migration file. Pages are written at a fixed location for every
ramblock. Zero pages are skipped, since the destination's memory is
already zeroed.

The migration stream is altered to put the dirty pages for a ramblock
after its header instead of having a sequential stream of pages that
follow the ramblock headers. Since all pages have a fixed location,
RAM_SAVE_FLAG_EOS is no longer generated on every migration iteration.

Without fixed-ram (current):

ramblock 1 header|ramblock 2 header|...|RAM_SAVE_FLAG_EOS|stream of
 pages (iter 1)|RAM_SAVE_FLAG_EOS|stream of pages (iter 2)|...

With fixed-ram (new):

ramblock 1 header|ramblock 1 fixed-ram header|ramblock 1 pages (fixed
 offsets)|ramblock 2 header|ramblock 2 fixed-ram header|ramblock 2
 pages (fixed offsets)|...|RAM_SAVE_FLAG_EOS

where:
 - ramblock header: the generic information for a ramblock, such as
   idstr, used_len, etc.

 - ramblock fixed-ram header: the new information added by this
   feature: bitmap of pages written, bitmap size and offset of pages
   in the migration file.
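
To make the layout concrete, take a hypothetical 4G ramblock with 4k
target pages: it has 1M pages, so the shadow bitmap occupies
1M / 8 = 128k, and the fixed-ram header itself is 28 bytes (one u32
plus three u64, packed). Since pages_offset is rounded up to a 1M
boundary, a block whose headers land near the start of the file has
its bitmap immediately after the header and its pages area starting
at the next 1M boundary; page N of the block then always lives at
pages_offset + N * 4k, no matter how many times it is re-dirtied and
re-written.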

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/exec/ramblock.h |  8 ++++
 migration/options.c     |  3 --
 migration/ram.c         | 98 ++++++++++++++++++++++++++++++++++++-----
 3 files changed, 96 insertions(+), 13 deletions(-)

diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 69c6a53902..e0e3f16852 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -44,6 +44,14 @@ struct RAMBlock {
     size_t page_size;
     /* dirty bitmap used during migration */
     unsigned long *bmap;
+    /* shadow dirty bitmap used when migrating to a file */
+    unsigned long *shadow_bmap;
+    /*
+     * offset in the file where the pages belonging to this ramblock
+     * are saved, used only during migration to a file.
+     */
+    off_t bitmap_offset;
+    uint64_t pages_offset;
     /* bitmap of already received pages in postcopy */
     unsigned long *receivedmap;
 
diff --git a/migration/options.c b/migration/options.c
index 2622d8c483..9f693d909f 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -271,12 +271,9 @@ bool migrate_events(void)
 
 bool migrate_fixed_ram(void)
 {
-/*
     MigrationState *s = migrate_get_current();
 
     return s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
-*/
-    return false;
 }
 
 bool migrate_ignore_shared(void)
diff --git a/migration/ram.c b/migration/ram.c
index 92769902bb..152a03604f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1157,12 +1157,18 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
         return 0;
     }
 
+    stat64_add(&mig_stats.zero_pages, 1);
+
+    if (migrate_fixed_ram()) {
+        /* zero pages are not transferred with fixed-ram */
+        clear_bit(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
+        return 1;
+    }
+
     len += save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
     qemu_put_byte(file, 0);
     len += 1;
     ram_release_page(pss->block->idstr, offset);
-
-    stat64_add(&mig_stats.zero_pages, 1);
     ram_transferred_add(len);
 
     /*
@@ -1220,14 +1226,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
 {
     QEMUFile *file = pss->pss_channel;
 
-    ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
-                                         offset | RAM_SAVE_FLAG_PAGE));
-    if (async) {
-        qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
-                              migrate_release_ram() &&
-                              migration_in_postcopy());
+    if (migrate_fixed_ram()) {
+        qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
+                           block->pages_offset + offset);
+        set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
     } else {
-        qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
+        ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
+                                             offset | RAM_SAVE_FLAG_PAGE));
+        if (async) {
+            qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
+                                  migrate_release_ram() &&
+                                  migration_in_postcopy());
+        } else {
+            qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
+        }
     }
     ram_transferred_add(TARGET_PAGE_SIZE);
     stat64_add(&mig_stats.normal_pages, 1);
@@ -2475,6 +2487,8 @@ static void ram_save_cleanup(void *opaque)
         block->clear_bmap = NULL;
         g_free(block->bmap);
         block->bmap = NULL;
+        g_free(block->shadow_bmap);
+        block->shadow_bmap = NULL;
     }
 
     xbzrle_cleanup();
@@ -2842,6 +2856,7 @@ static void ram_list_init_bitmaps(void)
              */
             block->bmap = bitmap_new(pages);
             bitmap_set(block->bmap, 0, pages);
+            block->shadow_bmap = bitmap_new(block->used_length >> TARGET_PAGE_BITS);
             block->clear_bmap_shift = shift;
             block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
         }
@@ -2979,6 +2994,44 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
     }
 }
 
+#define FIXED_RAM_HDR_VERSION 1
+struct FixedRamHeader {
+    uint32_t version;
+    uint64_t page_size;
+    uint64_t bitmap_offset;
+    uint64_t pages_offset;
+    /* end of v1 */
+} QEMU_PACKED;
+
+static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
+{
+    g_autofree struct FixedRamHeader *header = NULL;
+    size_t header_size, bitmap_size;
+    long num_pages;
+
+    header = g_new0(struct FixedRamHeader, 1);
+    header_size = sizeof(struct FixedRamHeader);
+
+    num_pages = block->used_length >> TARGET_PAGE_BITS;
+    bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
+
+    /*
+     * Save the file offsets of where the bitmap and the pages should
+     * go as they are written at the end of migration and during the
+     * iterative phase, respectively.
+     */
+    block->bitmap_offset = qemu_get_offset(file) + header_size;
+    block->pages_offset = ROUND_UP(block->bitmap_offset +
+                                   bitmap_size, 0x100000);
+
+    header->version = cpu_to_be32(FIXED_RAM_HDR_VERSION);
+    header->page_size = cpu_to_be64(TARGET_PAGE_SIZE);
+    header->bitmap_offset = cpu_to_be64(block->bitmap_offset);
+    header->pages_offset = cpu_to_be64(block->pages_offset);
+
+    qemu_put_buffer(file, (uint8_t *) header, header_size);
+}
+
 /*
  * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
  * long-running RCU critical section.  When rcu-reclaims in the code
@@ -3028,6 +3081,12 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
             if (migrate_ignore_shared()) {
                 qemu_put_be64(f, block->mr->addr);
             }
+
+            if (migrate_fixed_ram()) {
+                fixed_ram_insert_header(f, block);
+                /* prepare offset for next ramblock */
+                qemu_set_offset(f, block->pages_offset + block->used_length, SEEK_SET);
+            }
         }
     }
 
@@ -3061,6 +3120,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     return 0;
 }
 
+static void ram_save_shadow_bmap(QEMUFile *f)
+{
+    RAMBlock *block;
+
+    RAMBLOCK_FOREACH_MIGRATABLE(block) {
+        long num_pages = block->used_length >> TARGET_PAGE_BITS;
+        long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
+        qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
+                           block->bitmap_offset);
+        /* free it now to catch any thread still sending pages late */
+        g_clear_pointer(&block->shadow_bmap, g_free);
+    }
+}
+
 /**
  * ram_save_iterate: iterative stage for migration
  *
@@ -3179,7 +3252,6 @@ out:
         qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
         qemu_fflush(f);
         ram_transferred_add(8);
-
         ret = qemu_file_get_error(f);
     }
     if (ret < 0) {
@@ -3256,7 +3328,13 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
     if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
         qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
     }
+
+    if (migrate_fixed_ram()) {
+        ram_save_shadow_bmap(f);
+    }
+
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+
     qemu_fflush(f);
 
     return 0;
-- 
2.35.3




* [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (14 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-25  9:43   ` Daniel P. Berrangé
  2023-10-31 19:09   ` Peter Xu
  2023-10-23 20:35 ` [PATCH v2 17/29] tests/qtest: migration-test: Add tests for fixed-ram file-based migration Fabiano Rosas
                   ` (12 subsequent siblings)
  28 siblings, 2 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

From: Nikolay Borisov <nborisov@suse.com>

Add the necessary code to parse the format changes for the 'fixed-ram'
capability.

One of the more notable changes in behavior is that in the 'fixed-ram'
case RAM pages are restored in one go rather than by repeatedly
looping through the migration stream.
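
As a worked example of the restore walk (hypothetical bitmap): if a
ramblock's shadow bitmap reads 0b00111100, find_first_bit returns 2
and find_next_zero_bit returns 6, so pages 2..5 form a single run of
4 * TARGET_PAGE_SIZE bytes that is read starting from
pages_offset + 2 * TARGET_PAGE_SIZE into the corresponding host
addresses, one page per read.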

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
(farosas) reused more of the common code by making the fixed-ram
function take only one ramblock and calling it from inside
parse_ramblock.
---
 migration/ram.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 152a03604f..cea6971ab2 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3032,6 +3032,32 @@ static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
     qemu_put_buffer(file, (uint8_t *) header, header_size);
 }
 
+static int fixed_ram_read_header(QEMUFile *file, struct FixedRamHeader *header)
+{
+    size_t ret, header_size = sizeof(struct FixedRamHeader);
+
+    ret = qemu_get_buffer(file, (uint8_t *)header, header_size);
+    if (ret != header_size) {
+        return -1;
+    }
+
+    /* migration stream is big-endian */
+    be32_to_cpus(&header->version);
+
+    if (header->version > FIXED_RAM_HDR_VERSION) {
+        error_report("Migration fixed-ram capability version mismatch (expected %d, got %d)",
+                     FIXED_RAM_HDR_VERSION, header->version);
+        return -1;
+    }
+
+    be64_to_cpus(&header->page_size);
+    be64_to_cpus(&header->bitmap_offset);
+    be64_to_cpus(&header->pages_offset);
+
+
+    return 0;
+}
+
 /*
  * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
  * long-running RCU critical section.  When rcu-reclaims in the code
@@ -3932,6 +3958,68 @@ void colo_flush_ram_cache(void)
     trace_colo_flush_ram_cache_end();
 }
 
+static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
+                                    long num_pages, unsigned long *bitmap)
+{
+    unsigned long set_bit_idx, clear_bit_idx;
+    unsigned long len;
+    ram_addr_t offset;
+    void *host;
+    size_t read, completed, read_len;
+
+    for (set_bit_idx = find_first_bit(bitmap, num_pages);
+         set_bit_idx < num_pages;
+         set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
+
+        clear_bit_idx = find_next_zero_bit(bitmap, num_pages, set_bit_idx + 1);
+
+        len = TARGET_PAGE_SIZE * (clear_bit_idx - set_bit_idx);
+        offset = set_bit_idx << TARGET_PAGE_BITS;
+
+        for (read = 0, completed = 0; completed < len; offset += read) {
+            host = host_from_ram_block_offset(block, offset);
+            read_len = MIN(len - completed, TARGET_PAGE_SIZE);
+
+            read = qemu_get_buffer_at(f, host, read_len,
+                                      block->pages_offset + offset);
+            completed += read;
+        }
+    }
+}
+
+static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
+{
+    g_autofree unsigned long *bitmap = NULL;
+    struct FixedRamHeader header;
+    size_t bitmap_size;
+    long num_pages;
+    int ret = 0;
+
+    ret = fixed_ram_read_header(f, &header);
+    if (ret < 0) {
+        error_report("Error reading fixed-ram header");
+        return -EINVAL;
+    }
+
+    block->pages_offset = header.pages_offset;
+    num_pages = length / header.page_size;
+    bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
+
+    bitmap = g_malloc0(bitmap_size);
+    if (qemu_get_buffer_at(f, (uint8_t *)bitmap, bitmap_size,
+                           header.bitmap_offset) != bitmap_size) {
+        error_report("Error parsing dirty bitmap");
+        return -EINVAL;
+    }
+
+    read_ramblock_fixed_ram(f, block, num_pages, bitmap);
+
+    /* Skip pages array */
+    qemu_set_offset(f, block->pages_offset + length, SEEK_SET);
+
+    return ret;
+}
+
 static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
 {
     int ret = 0;
@@ -3940,6 +4028,10 @@ static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
 
     assert(block);
 
+    if (migrate_fixed_ram()) {
+        return parse_ramblock_fixed_ram(f, block, length);
+    }
+
     if (!qemu_ram_is_migratable(block)) {
         error_report("block %s should not be migrated !", block->idstr);
         return -EINVAL;
@@ -4142,6 +4234,7 @@ static int ram_load_precopy(QEMUFile *f)
                 migrate_multifd_flush_after_each_section()) {
                 multifd_recv_sync_main();
             }
+
             break;
         case RAM_SAVE_FLAG_HOOK:
             ret = rdma_registration_handle(f);
-- 
2.35.3




* [PATCH v2 17/29] tests/qtest: migration-test: Add tests for fixed-ram file-based migration
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (15 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 18/29] migration/multifd: Allow multifd without packets Fabiano Rosas
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Thomas Huth, Laurent Vivier,
	Paolo Bonzini

From: Nikolay Borisov <nborisov@suse.com>

Add basic tests for 'fixed-ram' migration.
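
The new tests can be run in the usual way, e.g. (illustrative
invocation; the test path prefix depends on the target architecture):

    QTEST_QEMU_BINARY=./qemu-system-x86_64 \
        ./tests/qtest/migration-test -p /x86_64/migration/precopy/file/fixed-ram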

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration-test.c | 39 ++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 06a7dd3c0a..bd4e866e67 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2030,6 +2030,40 @@ static void test_precopy_file_offset_bad(void)
     test_file_common(&args, false, false);
 }
 
+static void *migrate_fixed_ram_start(QTestState *from, QTestState *to)
+{
+    migrate_set_capability(from, "fixed-ram", true);
+    migrate_set_capability(to, "fixed-ram", true);
+
+    return NULL;
+}
+
+static void test_precopy_file_fixed_ram_live(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+                                           FILE_TEST_FILENAME);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = migrate_fixed_ram_start,
+    };
+
+    test_file_common(&args, false, false);
+}
+
+static void test_precopy_file_fixed_ram(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+                                           FILE_TEST_FILENAME);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = migrate_fixed_ram_start,
+    };
+
+    test_file_common(&args, false, true);
+}
+
 static void test_precopy_tcp_plain(void)
 {
     MigrateCommon args = {
@@ -3098,6 +3132,11 @@ int main(int argc, char **argv)
     qtest_add_func("/migration/precopy/file/offset/bad",
                    test_precopy_file_offset_bad);
 
+    qtest_add_func("/migration/precopy/file/fixed-ram",
+                   test_precopy_file_fixed_ram);
+    qtest_add_func("/migration/precopy/file/fixed-ram/live",
+                   test_precopy_file_fixed_ram_live);
+
 #ifdef CONFIG_GNUTLS
     qtest_add_func("/migration/precopy/unix/tls/psk",
                    test_precopy_unix_tls_psk);
-- 
2.35.3




* [PATCH v2 18/29] migration/multifd: Allow multifd without packets
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (16 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 17/29] tests/qtest: migration-test: Add tests for fixed-ram file-based migration Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-23 20:35 ` [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

For the upcoming support for the new 'fixed-ram' migration stream
format, we cannot use multifd packets because each write into the
ramblock section in the migration file is expected to contain only the
guest pages. They are written at their respective offsets relative to
the ramblock section header.

There is no space for the packet information and the expected gains
from the new approach come partly from being able to write the pages
sequentially without extraneous data in between.

The new format also doesn't need the packets and all necessary
information can be taken from the standard migration headers with some
(future) changes to multifd code.

Use the presence of the fixed-ram capability to decide whether to send
packets. For now this has no effect as fixed-ram cannot yet be enabled
with multifd.
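
To illustrate the difference in what each send thread ends up
writing (a sketch of the resulting iovec layout, not literal code):

    with packets:     iov[0]      = MultiFDPacket_t header
                      iov[1..n]   = guest pages
    without packets:  iov[0..n-1] = guest pages only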

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/multifd.c | 119 +++++++++++++++++++++++++++-----------------
 migration/options.c |   5 ++
 migration/options.h |   1 +
 3 files changed, 80 insertions(+), 45 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index e2a45c667a..b912060b32 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -655,18 +655,22 @@ static void *multifd_send_thread(void *opaque)
     Error *local_err = NULL;
     int ret = 0;
     bool use_zero_copy_send = migrate_zero_copy_send();
+    bool use_packets = migrate_multifd_packets();
 
     thread = migration_threads_add(p->name, qemu_get_thread_id());
 
     trace_multifd_send_thread_start(p->id);
     rcu_register_thread();
 
-    if (multifd_send_initial_packet(p, &local_err) < 0) {
-        ret = -1;
-        goto out;
+    if (use_packets) {
+        if (multifd_send_initial_packet(p, &local_err) < 0) {
+            ret = -1;
+            goto out;
+        }
+
+        /* initial packet */
+        p->num_packets = 1;
     }
-    /* initial packet */
-    p->num_packets = 1;
 
     while (true) {
         qemu_sem_post(&multifd_send_state->channels_ready);
@@ -678,11 +682,10 @@ static void *multifd_send_thread(void *opaque)
         qemu_mutex_lock(&p->mutex);
 
         if (p->pending_job) {
-            uint64_t packet_num = p->packet_num;
             uint32_t flags;
             p->normal_num = 0;
 
-            if (use_zero_copy_send) {
+            if (!use_packets || use_zero_copy_send) {
                 p->iovs_num = 0;
             } else {
                 p->iovs_num = 1;
@@ -700,16 +703,20 @@ static void *multifd_send_thread(void *opaque)
                     break;
                 }
             }
-            multifd_send_fill_packet(p);
+
+            if (use_packets) {
+                multifd_send_fill_packet(p);
+                p->num_packets++;
+            }
+
             flags = p->flags;
             p->flags = 0;
-            p->num_packets++;
             p->total_normal_pages += p->normal_num;
             p->pages->num = 0;
             p->pages->block = NULL;
             qemu_mutex_unlock(&p->mutex);
 
-            trace_multifd_send(p->id, packet_num, p->normal_num, flags,
+            trace_multifd_send(p->id, p->packet_num, p->normal_num, flags,
                                p->next_packet_size);
 
             if (use_zero_copy_send) {
@@ -719,7 +726,7 @@ static void *multifd_send_thread(void *opaque)
                 if (ret != 0) {
                     break;
                 }
-            } else {
+            } else if (use_packets) {
                 /* Send header using the same writev call */
                 p->iov[0].iov_len = p->packet_len;
                 p->iov[0].iov_base = p->packet;
@@ -907,6 +914,7 @@ int multifd_save_setup(Error **errp)
 {
     int thread_count;
     uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
+    bool use_packets = migrate_multifd_packets();
     uint8_t i;
 
     if (!migrate_multifd()) {
@@ -931,14 +939,20 @@ int multifd_save_setup(Error **errp)
         p->pending_job = 0;
         p->id = i;
         p->pages = multifd_pages_init(page_count);
-        p->packet_len = sizeof(MultiFDPacket_t)
-                      + sizeof(uint64_t) * page_count;
-        p->packet = g_malloc0(p->packet_len);
-        p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
-        p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+
+        if (use_packets) {
+            p->packet_len = sizeof(MultiFDPacket_t)
+                          + sizeof(uint64_t) * page_count;
+            p->packet = g_malloc0(p->packet_len);
+            p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
+            p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+
+            /* We need one extra place for the packet header */
+            p->iov = g_new0(struct iovec, page_count + 1);
+        } else {
+            p->iov = g_new0(struct iovec, page_count);
+        }
         p->name = g_strdup_printf("multifdsend_%d", i);
-        /* We need one extra place for the packet header */
-        p->iov = g_new0(struct iovec, page_count + 1);
         p->normal = g_new0(ram_addr_t, page_count);
         p->page_size = qemu_target_page_size();
         p->page_count = page_count;
@@ -1070,7 +1084,7 @@ void multifd_recv_sync_main(void)
 {
     int i;
 
-    if (!migrate_multifd()) {
+    if (!migrate_multifd() || !migrate_multifd_packets()) {
         return;
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
@@ -1097,6 +1111,7 @@ static void *multifd_recv_thread(void *opaque)
 {
     MultiFDRecvParams *p = opaque;
     Error *local_err = NULL;
+    bool use_packets = migrate_multifd_packets();
     int ret;
 
     trace_multifd_recv_thread_start(p->id);
@@ -1109,17 +1124,20 @@ static void *multifd_recv_thread(void *opaque)
             break;
         }
 
-        ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
-                                       p->packet_len, &local_err);
-        if (ret == 0 || ret == -1) {   /* 0: EOF  -1: Error */
-            break;
-        }
+        if (use_packets) {
+            ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
+                                           p->packet_len, &local_err);
+            if (ret == 0 || ret == -1) {   /* 0: EOF  -1: Error */
+                break;
+            }
 
-        qemu_mutex_lock(&p->mutex);
-        ret = multifd_recv_unfill_packet(p, &local_err);
-        if (ret) {
-            qemu_mutex_unlock(&p->mutex);
-            break;
+            qemu_mutex_lock(&p->mutex);
+            ret = multifd_recv_unfill_packet(p, &local_err);
+            if (ret) {
+                qemu_mutex_unlock(&p->mutex);
+                break;
+            }
+            p->num_packets++;
         }
 
         flags = p->flags;
@@ -1127,7 +1145,7 @@ static void *multifd_recv_thread(void *opaque)
         p->flags &= ~MULTIFD_FLAG_SYNC;
         trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
                            p->next_packet_size);
-        p->num_packets++;
+
         p->total_normal_pages += p->normal_num;
         qemu_mutex_unlock(&p->mutex);
 
@@ -1162,6 +1180,7 @@ int multifd_load_setup(Error **errp)
 {
     int thread_count;
     uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
+    bool use_packets = migrate_multifd_packets();
     uint8_t i;
 
     /*
@@ -1186,9 +1205,12 @@ int multifd_load_setup(Error **errp)
         qemu_sem_init(&p->sem_sync, 0);
         p->quit = false;
         p->id = i;
-        p->packet_len = sizeof(MultiFDPacket_t)
-                      + sizeof(uint64_t) * page_count;
-        p->packet = g_malloc0(p->packet_len);
+
+        if (use_packets) {
+            p->packet_len = sizeof(MultiFDPacket_t)
+                + sizeof(uint64_t) * page_count;
+            p->packet = g_malloc0(p->packet_len);
+        }
         p->name = g_strdup_printf("multifdrecv_%d", i);
         p->iov = g_new0(struct iovec, page_count);
         p->normal = g_new0(ram_addr_t, page_count);
@@ -1234,18 +1256,26 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
 {
     MultiFDRecvParams *p;
     Error *local_err = NULL;
-    int id;
+    bool use_packets = migrate_multifd_packets();
+    int id, num_packets = 0;
 
-    id = multifd_recv_initial_packet(ioc, &local_err);
-    if (id < 0) {
-        multifd_recv_terminate_threads(local_err);
-        error_propagate_prepend(errp, local_err,
-                                "failed to receive packet"
-                                " via multifd channel %d: ",
-                                qatomic_read(&multifd_recv_state->count));
-        return;
+    if (use_packets) {
+        id = multifd_recv_initial_packet(ioc, &local_err);
+        if (id < 0) {
+            multifd_recv_terminate_threads(local_err);
+            error_propagate_prepend(errp, local_err,
+                                    "failed to receive packet"
+                                    " via multifd channel %d: ",
+                                    qatomic_read(&multifd_recv_state->count));
+            return;
+        }
+        trace_multifd_recv_new_channel(id);
+
+        /* initial packet */
+        num_packets = 1;
+    } else {
+        id = 0;
     }
-    trace_multifd_recv_new_channel(id);
 
     p = &multifd_recv_state->params[id];
     if (p->c != NULL) {
@@ -1256,9 +1286,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
         return;
     }
     p->c = ioc;
+    p->num_packets = num_packets;
     object_ref(OBJECT(ioc));
-    /* initial packet */
-    p->num_packets = 1;
 
     p->running = true;
     qemu_thread_create(&p->thread, p->name, multifd_recv_thread, p,
diff --git a/migration/options.c b/migration/options.c
index 9f693d909f..bb7a2bbe06 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -390,6 +390,11 @@ bool migrate_multifd_flush_after_each_section(void)
     return s->multifd_flush_after_each_section;
 }
 
+bool migrate_multifd_packets(void)
+{
+    return !migrate_fixed_ram();
+}
+
 bool migrate_postcopy(void)
 {
     return migrate_postcopy_ram() || migrate_dirty_bitmaps();
diff --git a/migration/options.h b/migration/options.h
index 2a9e0e9e13..4a3e7e36a8 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -57,6 +57,7 @@ bool migrate_zero_copy_send(void);
  */
 
 bool migrate_multifd_flush_after_each_section(void);
+bool migrate_multifd_packets(void);
 bool migrate_postcopy(void);
 bool migrate_rdma(void);
 bool migrate_tls(void);
-- 
2.35.3




* [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (17 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 18/29] migration/multifd: Allow multifd without packets Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-25  9:52   ` Daniel P. Berrangé
  2023-10-31 20:11   ` Peter Xu
  2023-10-23 20:35 ` [PATCH v2 20/29] migration/multifd: Add incoming " Fabiano Rosas
                   ` (9 subsequent siblings)
  28 siblings, 2 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

Allow multifd to open file-backed channels. This will be used when
enabling the fixed-ram migration stream format which expects a
seekable transport.

The QIOChannel read and write methods will use the preadv/pwritev
versions, which don't update the file offset at each call, so all
channels can do I/O to the same file without a shared file position.

Note that this is just setup code and multifd cannot yet make use of
the file channels.
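
This is safe because pwritev(2) takes the target offset explicitly
and leaves the fd's file position untouched, so channels never race
on a shared offset. Roughly (a sketch with hypothetical offsets, not
code from this series):

    /* channel A, one thread */
    qio_channel_pwritev(ioc_a, page, size, pages_offset + off_a, &err);

    /* channel B, another thread, same file */
    qio_channel_pwritev(ioc_b, page, size, pages_offset + off_b, &err);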

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/file.c      | 64 +++++++++++++++++++++++++++++++++++++++++--
 migration/file.h      | 10 +++++--
 migration/migration.c |  2 +-
 migration/multifd.c   | 14 ++++++++--
 migration/options.c   |  7 +++++
 migration/options.h   |  1 +
 6 files changed, 90 insertions(+), 8 deletions(-)

diff --git a/migration/file.c b/migration/file.c
index cf5b1bf365..93b9b7bf5d 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -17,6 +17,12 @@
 
 #define OFFSET_OPTION ",offset="
 
+static struct FileOutgoingArgs {
+    char *fname;
+    int flags;
+    int mode;
+} outgoing_args;
+
 /* Remove the offset option from @filespec and return it in @offsetp. */
 
 static int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
@@ -36,13 +42,62 @@ static int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
     return 0;
 }
 
+static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
+{
+    /* noop */
+}
+
+static void file_migration_cancel(Error *errp)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                      MIGRATION_STATUS_FAILED);
+    migration_cancel(errp);
+}
+
+int file_send_channel_destroy(QIOChannel *ioc)
+{
+    if (ioc) {
+        qio_channel_close(ioc, NULL);
+        object_unref(OBJECT(ioc));
+    }
+    g_free(outgoing_args.fname);
+    outgoing_args.fname = NULL;
+
+    return 0;
+}
+
+void file_send_channel_create(QIOTaskFunc f, void *data)
+{
+    QIOChannelFile *ioc;
+    QIOTask *task;
+    Error *errp = NULL;
+
+    ioc = qio_channel_file_new_path(outgoing_args.fname,
+                                    outgoing_args.flags & ~(O_CREAT | O_TRUNC),
+                                    outgoing_args.mode, &errp);
+    if (!ioc) {
+        file_migration_cancel(errp);
+        return;
+    }
+
+    task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
+    qio_task_run_in_thread(task, qio_channel_file_connect_worker,
+                           (gpointer)data, NULL, NULL);
+}
+
 void file_start_outgoing_migration(MigrationState *s, const char *filespec,
                                    Error **errp)
 {
-    g_autofree char *filename = g_strdup(filespec);
     g_autoptr(QIOChannelFile) fioc = NULL;
+    g_autofree char *filename = g_strdup(filespec);
     uint64_t offset = 0;
     QIOChannel *ioc;
+    int flags = O_CREAT | O_TRUNC | O_WRONLY;
+    mode_t mode = 0660;
 
     trace_migration_file_outgoing(filename);
 
@@ -50,12 +105,15 @@ void file_start_outgoing_migration(MigrationState *s, const char *filespec,
         return;
     }
 
-    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
-                                     0600, errp);
+    fioc = qio_channel_file_new_path(filename, flags, mode, errp);
     if (!fioc) {
         return;
     }
 
+    outgoing_args.fname = g_strdup(filename);
+    outgoing_args.flags = flags;
+    outgoing_args.mode = mode;
+
     ioc = QIO_CHANNEL(fioc);
     if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
         return;
diff --git a/migration/file.h b/migration/file.h
index 90fa4849e0..10148233c5 100644
--- a/migration/file.h
+++ b/migration/file.h
@@ -7,8 +7,14 @@
 
 #ifndef QEMU_MIGRATION_FILE_H
 #define QEMU_MIGRATION_FILE_H
-void file_start_incoming_migration(const char *filename, Error **errp);
 
-void file_start_outgoing_migration(MigrationState *s, const char *filename,
+#include "io/task.h"
+#include "channel.h"
+
+void file_start_incoming_migration(const char *filespec, Error **errp);
+
+void file_start_outgoing_migration(MigrationState *s, const char *filespec,
                                    Error **errp);
+void file_send_channel_create(QIOTaskFunc f, void *data);
+int file_send_channel_destroy(QIOChannel *ioc);
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index cabb3ad3a5..ba806cea55 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -114,7 +114,7 @@ static bool migration_needs_seekable_channel(void)
 static bool uri_supports_multi_channels(const char *uri)
 {
     return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
-           strstart(uri, "vsock:", NULL);
+           strstart(uri, "vsock:", NULL) || strstart(uri, "file:", NULL);
 }
 
 static bool uri_supports_seeking(const char *uri)
diff --git a/migration/multifd.c b/migration/multifd.c
index b912060b32..75a17ea8ab 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -17,6 +17,7 @@
 #include "exec/ramblock.h"
 #include "qemu/error-report.h"
 #include "qapi/error.h"
+#include "file.h"
 #include "ram.h"
 #include "migration.h"
 #include "migration-stats.h"
@@ -28,6 +29,7 @@
 #include "threadinfo.h"
 #include "options.h"
 #include "qemu/yank.h"
+#include "io/channel-file.h"
 #include "io/channel-socket.h"
 #include "yank_functions.h"
 
@@ -512,7 +514,11 @@ static void multifd_send_terminate_threads(Error *err)
 
 static int multifd_send_channel_destroy(QIOChannel *send)
 {
-    return socket_send_channel_destroy(send);
+    if (migrate_to_file()) {
+        return file_send_channel_destroy(send);
+    } else {
+        return socket_send_channel_destroy(send);
+    }
 }
 
 void multifd_save_cleanup(void)
@@ -907,7 +913,11 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
 
 static void multifd_new_send_channel_create(gpointer opaque)
 {
-    socket_send_channel_create(multifd_new_send_channel_async, opaque);
+    if (migrate_to_file()) {
+        file_send_channel_create(multifd_new_send_channel_async, opaque);
+    } else {
+        socket_send_channel_create(multifd_new_send_channel_async, opaque);
+    }
 }
 
 int multifd_save_setup(Error **errp)
diff --git a/migration/options.c b/migration/options.c
index bb7a2bbe06..469d5d4c50 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -414,6 +414,13 @@ bool migrate_tls(void)
     return s->parameters.tls_creds && *s->parameters.tls_creds;
 }
 
+bool migrate_to_file(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return qemu_file_is_seekable(s->to_dst_file);
+}
+
 typedef enum WriteTrackingSupport {
     WT_SUPPORT_UNKNOWN = 0,
     WT_SUPPORT_ABSENT,
diff --git a/migration/options.h b/migration/options.h
index 4a3e7e36a8..01bba5b928 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -61,6 +61,7 @@ bool migrate_multifd_packets(void);
 bool migrate_postcopy(void);
 bool migrate_rdma(void);
 bool migrate_tls(void);
+bool migrate_to_file(void);
 
 /* capabilities helpers */
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH v2 20/29] migration/multifd: Add incoming QIOChannelFile support
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (18 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
@ 2023-10-23 20:35 ` Fabiano Rosas
  2023-10-25 10:29   ` Daniel P. Berrangé
  2023-10-31 21:28   ` Peter Xu
  2023-10-23 20:36 ` [PATCH v2 21/29] migration/multifd: Add pages to the receiving side Fabiano Rosas
                   ` (8 subsequent siblings)
  28 siblings, 2 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

On the receiving side we don't need to differentiate between the main
channel and the multifd threads, so whichever channel is created first
becomes the main one. And since there are no packets, use the atomic
channel count to index into the params array.
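
In other words, without packets there is no id handshake, so the
destination assigns channel ids in connection order. A condensed
sketch of the resulting logic in multifd_recv_new_channel() (see the
diff below; the use_packets branch is pre-existing code):

    if (use_packets) {
        /* pre-existing path: the id comes in the initial packet */
        id = multifd_recv_initial_packet(ioc, &local_err);
    } else {
        /* no packets: the next free slot is simply the number of
         * channels created so far */
        id = qatomic_read(&multifd_recv_state->count);
    }
    p = &multifd_recv_state->params[id];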

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/file.c      | 39 +++++++++++++++++++++++++++++----------
 migration/migration.c |  2 ++
 migration/multifd.c   |  7 ++++++-
 migration/multifd.h   |  1 +
 4 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/migration/file.c b/migration/file.c
index 93b9b7bf5d..ad75225f43 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -6,13 +6,15 @@
  */
 
 #include "qemu/osdep.h"
-#include "qemu/cutils.h"
 #include "qapi/error.h"
+#include "qemu/cutils.h"
+#include "qemu/error-report.h"
 #include "channel.h"
 #include "file.h"
 #include "migration.h"
 #include "io/channel-file.h"
 #include "io/channel-util.h"
+#include "options.h"
 #include "trace.h"
 
 #define OFFSET_OPTION ",offset="
@@ -136,7 +138,8 @@ void file_start_incoming_migration(const char *filespec, Error **errp)
     g_autofree char *filename = g_strdup(filespec);
     QIOChannelFile *fioc = NULL;
     uint64_t offset = 0;
-    QIOChannel *ioc;
+    int channels = 1;
+    int i = 0, fd;
 
     trace_migration_file_incoming(filename);
 
@@ -146,16 +149,32 @@ void file_start_incoming_migration(const char *filespec, Error **errp)
 
     fioc = qio_channel_file_new_path(filename, O_RDONLY, 0, errp);
     if (!fioc) {
-        return;
+        goto out;
+    }
+
+    if (migrate_multifd()) {
+        channels += migrate_multifd_channels();
     }
 
-    ioc = QIO_CHANNEL(fioc);
-    if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
+    fd = fioc->fd;
+
+    do {
+        QIOChannel *ioc = QIO_CHANNEL(fioc);
+
+        if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
+            return;
+        }
+
+        qio_channel_set_name(ioc, "migration-file-incoming");
+        qio_channel_add_watch_full(ioc, G_IO_IN,
+                                   file_accept_incoming_migration,
+                                   NULL, NULL,
+                                   g_main_context_get_thread_default());
+    } while (++i < channels && (fioc = qio_channel_file_new_fd(fd)));
+
+out:
+    if (!fioc) {
+        error_report("Error creating migration incoming channel");
         return;
     }
-    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-file-incoming");
-    qio_channel_add_watch_full(ioc, G_IO_IN,
-                               file_accept_incoming_migration,
-                               NULL, NULL,
-                               g_main_context_get_thread_default());
 }
diff --git a/migration/migration.c b/migration/migration.c
index ba806cea55..5fa726f6d4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -756,6 +756,8 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
         }
 
         default_channel = (channel_magic == cpu_to_be32(QEMU_VM_FILE_MAGIC));
+    } else if (migrate_multifd() && migrate_fixed_ram()) {
+        default_channel = multifd_recv_first_channel();
     } else {
         default_channel = !mis->from_src_file;
     }
diff --git a/migration/multifd.c b/migration/multifd.c
index 75a17ea8ab..ad51210f13 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1242,6 +1242,11 @@ int multifd_load_setup(Error **errp)
     return 0;
 }
 
+bool multifd_recv_first_channel(void)
+{
+    return !multifd_recv_state;
+}
+
 bool multifd_recv_all_channels_created(void)
 {
     int thread_count = migrate_multifd_channels();
@@ -1284,7 +1289,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
         /* initial packet */
         num_packets = 1;
     } else {
-        id = 0;
+        id = qatomic_read(&multifd_recv_state->count);
     }
 
     p = &multifd_recv_state->params[id];
diff --git a/migration/multifd.h b/migration/multifd.h
index a835643b48..a112ec7ac6 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -18,6 +18,7 @@ void multifd_save_cleanup(void);
 int multifd_load_setup(Error **errp);
 void multifd_load_cleanup(void);
 void multifd_load_shutdown(void);
+bool multifd_recv_first_channel(void);
 bool multifd_recv_all_channels_created(void);
 void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH v2 21/29] migration/multifd: Add pages to the receiving side
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (19 preceding siblings ...)
  2023-10-23 20:35 ` [PATCH v2 20/29] migration/multifd: Add incoming " Fabiano Rosas
@ 2023-10-23 20:36 ` Fabiano Rosas
  2023-10-31 22:10   ` Peter Xu
  2023-10-23 20:36 ` [PATCH v2 22/29] io: Add a pwritev/preadv version that takes a discontiguous iovec Fabiano Rosas
                   ` (7 subsequent siblings)
  28 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

Currently multifd does not need any knowledge of pages on the
receiving side because all the information needed is within the
packets that come in the stream.

We're about to add support for fixed-ram migration, which cannot use
packets because it expects the ramblock section in the migration file
to contain only the guest page data.

Add a pointer to MultiFDPages_t in the multifd_recv_state and use the
pages similarly to what we already do on the sending side. The pages
are used to transfer data between the ram migration code in the main
migration thread and the multifd receiving threads.
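
A condensed pseudo-code sketch of the handoff this implements (names
from the patch below; error handling omitted, and 'full' is a local
variable introduced here only for illustration):

    /* main thread: hand the filled pages array to an idle channel */
    MultiFDPages_t *full = multifd_recv_state->pages;
    qemu_mutex_lock(&p->mutex);
    p->pending_job++;
    multifd_recv_state->pages = p->pages;   /* take the empty array */
    p->pages = full;                        /* give it the full one */
    qemu_mutex_unlock(&p->mutex);
    qemu_sem_post(&p->sem);

    /* recv thread: wait for work, read the pages, then release */
    qemu_sem_wait(&p->sem);
    /* ... read file data into p->pages->block->host + offsets ... */
    qemu_mutex_lock(&p->mutex);
    p->pending_job--;
    qemu_mutex_unlock(&p->mutex);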

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/multifd.c | 107 ++++++++++++++++++++++++++++++++++++++++++++
 migration/multifd.h |  13 +++++-
 2 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index ad51210f13..20e8635740 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -992,6 +992,8 @@ int multifd_save_setup(Error **errp)
 
 struct {
     MultiFDRecvParams *params;
+    /* array of pages to receive */
+    MultiFDPages_t *pages;
     /* number of created threads */
     int count;
     /* syncs main thread and channels */
@@ -1002,6 +1004,75 @@ struct {
     MultiFDMethods *ops;
 } *multifd_recv_state;
 
+static int multifd_recv_pages(QEMUFile *f)
+{
+    int i;
+    static int next_recv_channel;
+    MultiFDRecvParams *p = NULL;
+    MultiFDPages_t *pages = multifd_recv_state->pages;
+
+    /*
+     * next_channel can remain from a previous migration that was
+     * using more channels, so ensure it doesn't overflow if the
+     * limit is lower now.
+     */
+    next_recv_channel %= migrate_multifd_channels();
+    for (i = next_recv_channel;; i = (i + 1) % migrate_multifd_channels()) {
+        p = &multifd_recv_state->params[i];
+
+        qemu_mutex_lock(&p->mutex);
+        if (p->quit) {
+            error_report("%s: channel %d has already quit!", __func__, i);
+            qemu_mutex_unlock(&p->mutex);
+            return -1;
+        }
+        if (!p->pending_job) {
+            p->pending_job++;
+            next_recv_channel = (i + 1) % migrate_multifd_channels();
+            break;
+        }
+        qemu_mutex_unlock(&p->mutex);
+    }
+
+    multifd_recv_state->pages = p->pages;
+    p->pages = pages;
+    qemu_mutex_unlock(&p->mutex);
+    qemu_sem_post(&p->sem);
+
+    return 1;
+}
+
+int multifd_recv_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset)
+{
+    MultiFDPages_t *pages = multifd_recv_state->pages;
+    bool changed = false;
+
+    if (!pages->block) {
+        pages->block = block;
+    }
+
+    if (pages->block == block) {
+        pages->offset[pages->num] = offset;
+        pages->num++;
+
+        if (pages->num < pages->allocated) {
+            return 1;
+        }
+    } else {
+        changed = true;
+    }
+
+    if (multifd_recv_pages(f) < 0) {
+        return -1;
+    }
+
+    if (changed) {
+        return multifd_recv_queue_page(f, block, offset);
+    }
+
+    return 1;
+}
+
 static void multifd_recv_terminate_threads(Error *err)
 {
     int i;
@@ -1023,6 +1094,7 @@ static void multifd_recv_terminate_threads(Error *err)
 
         qemu_mutex_lock(&p->mutex);
         p->quit = true;
+        qemu_sem_post(&p->sem);
         /*
          * We could arrive here for two reasons:
          *  - normal quit, i.e. everything went fine, just finished
@@ -1072,8 +1144,11 @@ void multifd_load_cleanup(void)
         p->c = NULL;
         qemu_mutex_destroy(&p->mutex);
         qemu_sem_destroy(&p->sem_sync);
+        qemu_sem_destroy(&p->sem);
         g_free(p->name);
         p->name = NULL;
+        multifd_pages_clear(p->pages);
+        p->pages = NULL;
         p->packet_len = 0;
         g_free(p->packet);
         p->packet = NULL;
@@ -1086,6 +1161,8 @@ void multifd_load_cleanup(void)
     qemu_sem_destroy(&multifd_recv_state->sem_sync);
     g_free(multifd_recv_state->params);
     multifd_recv_state->params = NULL;
+    multifd_pages_clear(multifd_recv_state->pages);
+    multifd_recv_state->pages = NULL;
     g_free(multifd_recv_state);
     multifd_recv_state = NULL;
 }
@@ -1148,6 +1225,25 @@ static void *multifd_recv_thread(void *opaque)
                 break;
             }
             p->num_packets++;
+        } else {
+            /*
+             * No packets, so we need to wait for the vmstate code to
+             * queue pages.
+             */
+            qemu_sem_wait(&p->sem);
+            qemu_mutex_lock(&p->mutex);
+            if (!p->pending_job) {
+                qemu_mutex_unlock(&p->mutex);
+                break;
+            }
+
+            for (int i = 0; i < p->pages->num; i++) {
+                p->normal[p->normal_num] = p->pages->offset[i];
+                p->normal_num++;
+            }
+
+            p->pages->num = 0;
+            p->host = p->pages->block->host;
         }
 
         flags = p->flags;
@@ -1170,6 +1266,13 @@ static void *multifd_recv_thread(void *opaque)
             qemu_sem_post(&multifd_recv_state->sem_sync);
             qemu_sem_wait(&p->sem_sync);
         }
+
+        if (!use_packets) {
+            qemu_mutex_lock(&p->mutex);
+            p->pending_job--;
+            p->pages->block = NULL;
+            qemu_mutex_unlock(&p->mutex);
+        }
     }
 
     if (local_err) {
@@ -1204,6 +1307,7 @@ int multifd_load_setup(Error **errp)
     thread_count = migrate_multifd_channels();
     multifd_recv_state = g_malloc0(sizeof(*multifd_recv_state));
     multifd_recv_state->params = g_new0(MultiFDRecvParams, thread_count);
+    multifd_recv_state->pages = multifd_pages_init(page_count);
     qatomic_set(&multifd_recv_state->count, 0);
     qemu_sem_init(&multifd_recv_state->sem_sync, 0);
     multifd_recv_state->ops = multifd_ops[migrate_multifd_compression()];
@@ -1213,8 +1317,11 @@ int multifd_load_setup(Error **errp)
 
         qemu_mutex_init(&p->mutex);
         qemu_sem_init(&p->sem_sync, 0);
+        qemu_sem_init(&p->sem, 0);
         p->quit = false;
+        p->pending_job = 0;
         p->id = i;
+        p->pages = multifd_pages_init(page_count);
 
         if (use_packets) {
             p->packet_len = sizeof(MultiFDPacket_t)
diff --git a/migration/multifd.h b/migration/multifd.h
index a112ec7ac6..b571b1e4a2 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -24,6 +24,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
 int multifd_send_sync_main(QEMUFile *f);
 int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
+int multifd_recv_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
 
 /* Multifd Compression flags */
 #define MULTIFD_FLAG_SYNC (1 << 0)
@@ -153,9 +154,13 @@ typedef struct {
     uint32_t page_size;
     /* number of pages in a full packet */
     uint32_t page_count;
+    /* multifd flags for receiving ram */
+    int read_flags;
 
     /* syncs main thread and channels */
     QemuSemaphore sem_sync;
+    /* sem where to wait for more work */
+    QemuSemaphore sem;
 
     /* this mutex protects the following parameters */
     QemuMutex mutex;
@@ -167,6 +172,13 @@ typedef struct {
     uint32_t flags;
     /* global number of generated multifd packets */
     uint64_t packet_num;
+    int pending_job;
+    /* array of pages to sent.
+     * The owner of 'pages' depends of 'pending_job' value:
+     * pending_job == 0 -> migration_thread can use it.
+     * pending_job != 0 -> multifd_channel can use it.
+     */
+    MultiFDPages_t *pages;
 
     /* thread local variables. No locking required */
 
@@ -210,4 +222,3 @@ typedef struct {
 void multifd_register_ops(int method, MultiFDMethods *ops);
 
 #endif
-
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH v2 22/29] io: Add a pwritev/preadv version that takes a discontiguous iovec
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (20 preceding siblings ...)
  2023-10-23 20:36 ` [PATCH v2 21/29] migration/multifd: Add pages to the receiving side Fabiano Rosas
@ 2023-10-23 20:36 ` Fabiano Rosas
  2023-10-24  8:50   ` Daniel P. Berrangé
  2023-10-23 20:36 ` [PATCH v2 23/29] migration/ram: Add a wrapper for fixed-ram shadow bitmap Fabiano Rosas
                   ` (6 subsequent siblings)
  28 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

For the upcoming fixed-ram migration support with multifd, we need to
be able to accept an iovec array with non-contiguous data.

Add pwritev and preadv versions that split the array into contiguous
segments before writing. With that, the ram code can continue to add
pages in any order and the multifd code can continue to send large
arrays for reading and writing.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
Since the iovs can be non-contiguous, we'd need a separate array on
the side to carry an extra file offset for each of them, so I'm
relying on the fact that the iovs all fall within the same host
memory region and passing in an encoded offset that accounts for the
host base address.
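
To illustrate the splitting (a sketch with made-up addresses, not
part of the patch): three single-page iovecs where only the first two
are contiguous result in two pwritev calls:

    struct iovec iov[3] = {
        { (void *)0x1000, 0x1000 },  /* slice 1 starts here     */
        { (void *)0x2000, 0x1000 },  /* contiguous: same slice  */
        { (void *)0x8000, 0x1000 },  /* gap: starts slice 2     */
    };
    /*
     * With an encoded offset 'base', the loop below issues:
     *   pwritev(fd, &iov[0], 2, base + 0x1000);
     *   pwritev(fd, &iov[2], 1, base + 0x8000);
     */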
---
 include/io/channel.h | 50 +++++++++++++++++++++++++++
 io/channel.c         | 82 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index a8181d576a..51a99fb9f6 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -33,8 +33,10 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass,
 #define QIO_CHANNEL_ERR_BLOCK -2
 
 #define QIO_CHANNEL_WRITE_FLAG_ZERO_COPY 0x1
+#define QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET 0x2
 
 #define QIO_CHANNEL_READ_FLAG_MSG_PEEK 0x1
+#define QIO_CHANNEL_READ_FLAG_WITH_OFFSET 0x2
 
 typedef enum QIOChannelFeature QIOChannelFeature;
 
@@ -559,6 +561,30 @@ int qio_channel_close(QIOChannel *ioc,
 ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
                                  size_t niov, off_t offset, Error **errp);
 
+/**
+ * qio_channel_write_full_all:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to write data from
+ * @niov: the length of the @iov array
+ * @offset: the iovec offset in the file where to write the data
+ * @fds: an array of file handles to send
+ * @nfds: number of file handles in @fds
+ * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
+ * @errp: pointer to a NULL-initialized error object
+ *
+ *
+ * Selects between a writev or pwritev channel writer function.
+ *
+ * If QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET is passed in flags, pwritev is
+ * used and @offset is expected to be a meaningful value, @fds and
+ * @nfds are ignored; otherwise uses writev and @offset is ignored.
+ *
+ * Returns: 0 if all bytes were written, or -1 on error
+ */
+int qio_channel_write_full_all(QIOChannel *ioc, const struct iovec *iov,
+                               size_t niov, off_t offset, int *fds, size_t nfds,
+                               int flags, Error **errp);
+
 /**
  * qio_channel_pwritev
  * @ioc: the channel object
@@ -595,6 +621,30 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
 ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
                                 size_t niov, off_t offset, Error **errp);
 
+/**
+ * qio_channel_read_full_all:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to read data to
+ * @niov: the length of the @iov array
+ * @offset: the iovec offset in the file from where to read the data
+ * @fds: an array of file handles to send
+ * @nfds: number of file handles in @fds
+ * @flags: read flags (QIO_CHANNEL_READ_FLAG_*)
+ * @errp: pointer to a NULL-initialized error object
+ *
+ *
+ * Selects between a readv or preadv channel reader function.
+ *
+ * If QIO_CHANNEL_READ_FLAG_WITH_OFFSET is passed in flags, preadv is
+ * used and @offset is expected to be a meaningful value, @fds and
+ * @nfds are ignored; otherwise uses readv and @offset is ignored.
+ *
+ * Returns: 0 if all bytes were read, or -1 on error
+ */
+int qio_channel_read_full_all(QIOChannel *ioc, const struct iovec *iov,
+                              size_t niov, off_t offset,
+                              int flags, Error **errp);
+
 /**
  * qio_channel_preadv
  * @ioc: the channel object
diff --git a/io/channel.c b/io/channel.c
index 770d61ea00..648b68451d 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -472,6 +472,76 @@ ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
     return klass->io_pwritev(ioc, iov, niov, offset, errp);
 }
 
+static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
+                                                 const struct iovec *iov,
+                                                 size_t niov, off_t offset,
+                                                 bool is_write, Error **errp)
+{
+    ssize_t ret;
+    int i, slice_idx, slice_num;
+    uint64_t base, next, file_offset;
+    size_t len;
+
+    slice_idx = 0;
+    slice_num = 1;
+
+    /*
+     * If the iov array doesn't have contiguous elements, we need to
+     * split it in slices because we only have one (file) 'offset' for
+     * the whole iov. Do this here so callers don't need to break the
+     * iov array themselves.
+     */
+    for (i = 0; i < niov; i++, slice_num++) {
+        base = (uint64_t) iov[i].iov_base;
+
+        if (i != niov - 1) {
+            len = iov[i].iov_len;
+            next = (uint64_t) iov[i + 1].iov_base;
+
+            if (base + len == next) {
+                continue;
+            }
+        }
+
+        /*
+         * Use the offset of the first element of the segment that
+         * we're sending.
+         */
+        file_offset = offset + (uint64_t) iov[slice_idx].iov_base;
+
+        if (is_write) {
+            ret = qio_channel_pwritev_full(ioc, &iov[slice_idx], slice_num,
+                                           file_offset, errp);
+        } else {
+            ret = qio_channel_preadv_full(ioc, &iov[slice_idx], slice_num,
+                                          file_offset, errp);
+        }
+
+        if (ret < 0) {
+            break;
+        }
+
+        slice_idx += slice_num;
+        slice_num = 0;
+    }
+
+    return (ret < 0) ? -1 : 0;
+}
+
+int qio_channel_write_full_all(QIOChannel *ioc,
+                                const struct iovec *iov,
+                                size_t niov, off_t offset,
+                                int *fds, size_t nfds,
+                                int flags, Error **errp)
+{
+    if (flags & QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET) {
+        return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
+                                                     offset, true, errp);
+    }
+
+    return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0, flags, errp);
+}
+
 ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
                             off_t offset, Error **errp)
 {
@@ -501,6 +571,18 @@ ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
     return klass->io_preadv(ioc, iov, niov, offset, errp);
 }
 
+int qio_channel_read_full_all(QIOChannel *ioc, const struct iovec *iov,
+                              size_t niov, off_t offset,
+                              int flags, Error **errp)
+{
+    if (flags & QIO_CHANNEL_READ_FLAG_WITH_OFFSET) {
+        return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
+                                                     offset, false, errp);
+    }
+
+    return qio_channel_readv_full_all(ioc, iov, niov, NULL, NULL, errp);
+}
+
 ssize_t qio_channel_preadv(QIOChannel *ioc, char *buf, size_t buflen,
                            off_t offset, Error **errp)
 {
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH v2 23/29] migration/ram: Add a wrapper for fixed-ram shadow bitmap
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (21 preceding siblings ...)
  2023-10-23 20:36 ` [PATCH v2 22/29] io: Add a pwritev/preadv version that takes a discontiguous iovec Fabiano Rosas
@ 2023-10-23 20:36 ` Fabiano Rosas
  2023-11-01 14:29   ` Peter Xu
  2023-10-23 20:36 ` [PATCH v2 24/29] migration/ram: Ignore multifd flush when doing fixed-ram migration Fabiano Rosas
                   ` (5 subsequent siblings)
  28 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

We'll need to set the shadow_bmap bits from outside ram.c soon, and
TARGET_PAGE_BITS is poisoned in common code, so add a wrapper for it.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/ram.c | 5 +++++
 migration/ram.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index cea6971ab2..8e34c1b597 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3160,6 +3160,11 @@ static void ram_save_shadow_bmap(QEMUFile *f)
     }
 }
 
+void ramblock_set_shadow_bmap_atomic(RAMBlock *block, ram_addr_t offset)
+{
+    set_bit_atomic(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
+}
+
 /**
  * ram_save_iterate: iterative stage for migration
  *
diff --git a/migration/ram.h b/migration/ram.h
index 145c915ca7..1acadffb06 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -75,6 +75,7 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
 bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
 void postcopy_preempt_shutdown_file(MigrationState *s);
 void *postcopy_preempt_thread(void *opaque);
+void ramblock_set_shadow_bmap_atomic(RAMBlock *block, ram_addr_t offset);
 
 /* ram cache */
 int colo_init_ram_cache(void);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH v2 24/29] migration/ram: Ignore multifd flush when doing fixed-ram migration
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (22 preceding siblings ...)
  2023-10-23 20:36 ` [PATCH v2 23/29] migration/ram: Add a wrapper for fixed-ram shadow bitmap Fabiano Rosas
@ 2023-10-23 20:36 ` Fabiano Rosas
  2023-10-25  9:09   ` Daniel P. Berrangé
  2023-10-23 20:36 ` [PATCH v2 25/29] migration/multifd: Support outgoing fixed-ram stream format Fabiano Rosas
                   ` (4 subsequent siblings)
  28 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

Some multifd functionality is incompatible with the 'fixed-ram'
migration format.

The MULTIFD_FLUSH flag in particular is not used because with
fixed-ram there is no synchronization between migration source and
destination, so there is no need for a sync packet. In fact, fixed-ram
disables packets in multifd altogether.

Make sure RAM_SAVE_FLAG_MULTIFD_FLUSH is never emitted when fixed-ram
is enabled.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/ram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index 8e34c1b597..3497ed186a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1386,7 +1386,7 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
         pss->page = 0;
         pss->block = QLIST_NEXT_RCU(pss->block, next);
         if (!pss->block) {
-            if (migrate_multifd() &&
+            if (!migrate_fixed_ram() && migrate_multifd() &&
                 !migrate_multifd_flush_after_each_section()) {
                 QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
                 int ret = multifd_send_sync_main(f);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH v2 25/29] migration/multifd: Support outgoing fixed-ram stream format
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (23 preceding siblings ...)
  2023-10-23 20:36 ` [PATCH v2 24/29] migration/ram: Ignore multifd flush when doing fixed-ram migration Fabiano Rosas
@ 2023-10-23 20:36 ` Fabiano Rosas
  2023-10-25  9:23   ` Daniel P. Berrangé
  2023-10-23 20:36 ` [PATCH v2 26/29] migration/multifd: Support incoming " Fabiano Rosas
                   ` (3 subsequent siblings)
  28 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

The new fixed-ram stream format uses a file transport and puts ram
pages in the migration file at their respective offsets. The writes
can be done in parallel by using the pwritev system call, which takes
iovecs and an offset.

Add support for enabling the new format along with multifd, to make
use of the threading and page handling already in place.

This requires multifd to stop sending headers and leave the stream
format to the fixed-ram code. When it comes time to write the data, we
need to call a version of qio_channel_write that can take an offset.

Usage on HMP is:

(qemu) stop
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_parameter max-bandwidth 0
(qemu) migrate_set_parameter multifd-channels 8
(qemu) migrate file:migfile
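
For reference, a sketch of how the final file offset of a page works
out (names from this series; 'off' is a hypothetical page offset
within ramblock 'b'):

    /* computed once per pages array, before writing */
    write_base = b->pages_offset - (uint64_t)b->host;

    /* multifd queues iov_base = b->host + off, so inside
     * qio_channel_pwritev_full() the effective offset becomes
     *   write_base + (uint64_t)iov_base == b->pages_offset + off
     * i.e. every page lands in its fixed slot in the file.
     */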

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/qemu/bitops.h | 13 ++++++++++
 migration/multifd.c   | 55 +++++++++++++++++++++++++++++++++++++++++--
 migration/options.c   |  6 -----
 migration/ram.c       |  2 +-
 4 files changed, 67 insertions(+), 9 deletions(-)

diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
index cb3526d1f4..2c0a2fe751 100644
--- a/include/qemu/bitops.h
+++ b/include/qemu/bitops.h
@@ -67,6 +67,19 @@ static inline void clear_bit(long nr, unsigned long *addr)
     *p &= ~mask;
 }
 
+/**
+ * clear_bit_atomic - Clears a bit in memory atomically
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ */
+static inline void clear_bit_atomic(long nr, unsigned long *addr)
+{
+    unsigned long mask = BIT_MASK(nr);
+    unsigned long *p = addr + BIT_WORD(nr);
+
+    qatomic_and(p, ~mask);
+}
+
 /**
  * change_bit - Toggle a bit in memory
  * @nr: Bit to change
diff --git a/migration/multifd.c b/migration/multifd.c
index 20e8635740..3f95a41ee9 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -260,6 +260,19 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
     g_free(pages);
 }
 
+static void multifd_set_file_bitmap(MultiFDSendParams *p)
+{
+    MultiFDPages_t *pages = p->pages;
+
+    if (!pages->block) {
+        return;
+    }
+
+    for (int i = 0; i < p->normal_num; i++) {
+        ramblock_set_shadow_bmap_atomic(pages->block, pages->offset[i]);
+    }
+}
+
 static void multifd_send_fill_packet(MultiFDSendParams *p)
 {
     MultiFDPacket_t *packet = p->packet;
@@ -606,6 +619,29 @@ int multifd_send_sync_main(QEMUFile *f)
         }
     }
 
+    if (!migrate_multifd_packets()) {
+        /*
+         * There's no sync packet to send. Just make sure the sending
+         * above has finished.
+         */
+        for (i = 0; i < migrate_multifd_channels(); i++) {
+            qemu_sem_wait(&multifd_send_state->channels_ready);
+        }
+
+        /* sanity check and release the channels */
+        for (i = 0; i < migrate_multifd_channels(); i++) {
+            MultiFDSendParams *p = &multifd_send_state->params[i];
+
+            qemu_mutex_lock(&p->mutex);
+            assert(!p->pending_job || p->quit);
+            qemu_mutex_unlock(&p->mutex);
+
+            qemu_sem_post(&p->sem);
+        }
+
+        return 0;
+    }
+
     /*
      * When using zero-copy, it's necessary to flush the pages before any of
      * the pages can be sent again, so we'll make sure the new version of the
@@ -689,6 +725,8 @@ static void *multifd_send_thread(void *opaque)
 
         if (p->pending_job) {
             uint32_t flags;
+            uint64_t write_base;
+
             p->normal_num = 0;
 
             if (!use_packets || use_zero_copy_send) {
@@ -713,6 +751,16 @@ static void *multifd_send_thread(void *opaque)
             if (use_packets) {
                 multifd_send_fill_packet(p);
                 p->num_packets++;
+                write_base = 0;
+            } else {
+                multifd_set_file_bitmap(p);
+
+                /*
+                 * If we subtract the host base address now, we don't
+                 * need to pass it into qio_channel_write_full_all() below.
+                 */
+                write_base = p->pages->block->pages_offset -
+                    (uint64_t)p->pages->block->host;
             }
 
             flags = p->flags;
@@ -738,8 +786,9 @@ static void *multifd_send_thread(void *opaque)
                 p->iov[0].iov_base = p->packet;
             }
 
-            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
-                                              0, p->write_flags, &local_err);
+            ret = qio_channel_write_full_all(p->c, p->iov, p->iovs_num,
+                                             write_base, NULL, 0,
+                                             p->write_flags, &local_err);
             if (ret != 0) {
                 break;
             }
@@ -969,6 +1018,8 @@ int multifd_save_setup(Error **errp)
 
         if (migrate_zero_copy_send()) {
             p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
+        } else if (!use_packets) {
+            p->write_flags |= QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET;
         } else {
             p->write_flags = 0;
         }
diff --git a/migration/options.c b/migration/options.c
index 469d5d4c50..2193d69e71 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -648,12 +648,6 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
     }
 
     if (new_caps[MIGRATION_CAPABILITY_FIXED_RAM]) {
-        if (new_caps[MIGRATION_CAPABILITY_MULTIFD]) {
-            error_setg(errp,
-                       "Fixed-ram migration is incompatible with multifd");
-            return false;
-        }
-
         if (new_caps[MIGRATION_CAPABILITY_XBZRLE]) {
             error_setg(errp,
                        "Fixed-ram migration is incompatible with xbzrle");
diff --git a/migration/ram.c b/migration/ram.c
index 3497ed186a..5c67e30e55 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1161,7 +1161,7 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
 
     if (migrate_fixed_ram()) {
         /* zero pages are not transferred with fixed-ram */
-        clear_bit(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
+        clear_bit_atomic(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
         return 1;
     }
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH v2 26/29] migration/multifd: Support incoming fixed-ram stream format
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (24 preceding siblings ...)
  2023-10-23 20:36 ` [PATCH v2 25/29] migration/multifd: Support outgoing fixed-ram stream format Fabiano Rosas
@ 2023-10-23 20:36 ` Fabiano Rosas
  2023-10-23 20:36 ` [PATCH v2 27/29] tests/qtest: Add a multifd + fixed-ram migration test Fabiano Rosas
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

For the incoming fixed-ram migration we need to read the ramblock
headers, get the pages bitmap and send the host address of each
non-zero page to a multifd channel thread, which then reads the page
data from the file into that address.

To read from the migration file we need a preadv function that can
read into the iovs in segments of contiguous pages because (as in the
writing case) the file offset applies to the entire iovec.
Usage on HMP is:

(qemu) migrate_set_capability multifd on
(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_parameter max-bandwidth 0
(qemu) migrate_set_parameter multifd-channels 8
(qemu) migrate_incoming file:migfile
(qemu) info status
(qemu) c
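
The read-side offset mirrors the write side (a sketch using the names
from this series):

    /* p->host is the ramblock's host base for this pages array */
    read_base = block->pages_offset - (uint64_t)p->host;

    /* each iov_base is p->host + page_offset, so the effective
     * preadv offset is block->pages_offset + page_offset: the
     * page's fixed slot in the migration file.
     */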

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/multifd.c | 13 ++++++++++++-
 migration/ram.c     |  9 +++++++--
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 3f95a41ee9..3b6053ae5a 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -142,6 +142,7 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
 static int nocomp_recv_pages(MultiFDRecvParams *p, Error **errp)
 {
     uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
+    uint64_t read_base = 0;
 
     if (flags != MULTIFD_FLAG_NOCOMP) {
         error_setg(errp, "multifd %u: flags received %x flags expected %x",
@@ -152,7 +153,13 @@ static int nocomp_recv_pages(MultiFDRecvParams *p, Error **errp)
         p->iov[i].iov_base = p->host + p->normal[i];
         p->iov[i].iov_len = p->page_size;
     }
-    return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
+
+    if (migrate_fixed_ram()) {
+        read_base = p->pages->block->pages_offset - (uint64_t) p->host;
+    }
+
+    return qio_channel_read_full_all(p->c, p->iov, p->normal_num, read_base,
+                                     p->read_flags, errp);
 }
 
 static MultiFDMethods multifd_nocomp_ops = {
@@ -1225,6 +1232,7 @@ void multifd_recv_sync_main(void)
     if (!migrate_multifd() || !migrate_multifd_packets()) {
         return;
     }
+
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
@@ -1257,6 +1265,7 @@ static void *multifd_recv_thread(void *opaque)
 
     while (true) {
         uint32_t flags;
+        p->normal_num = 0;
 
         if (p->quit) {
             break;
@@ -1378,6 +1387,8 @@ int multifd_load_setup(Error **errp)
             p->packet_len = sizeof(MultiFDPacket_t)
                 + sizeof(uint64_t) * page_count;
             p->packet = g_malloc0(p->packet_len);
+        } else {
+            p->read_flags |= QIO_CHANNEL_READ_FLAG_WITH_OFFSET;
         }
         p->name = g_strdup_printf("multifdrecv_%d", i);
         p->iov = g_new0(struct iovec, page_count);
diff --git a/migration/ram.c b/migration/ram.c
index 5c67e30e55..9a5ee4767b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3985,8 +3985,13 @@ static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
             host = host_from_ram_block_offset(block, offset);
             read_len = MIN(len, TARGET_PAGE_SIZE);
 
-            read = qemu_get_buffer_at(f, host, read_len,
-                                      block->pages_offset + offset);
+            if (migrate_multifd()) {
+                multifd_recv_queue_page(f, block, offset);
+                read = read_len;
+            } else {
+                read = qemu_get_buffer_at(f, host, read_len,
+                                          block->pages_offset + offset);
+            }
             completed += read;
         }
     }
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH v2 27/29] tests/qtest: Add a multifd + fixed-ram migration test
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (25 preceding siblings ...)
  2023-10-23 20:36 ` [PATCH v2 26/29] migration/multifd: Support incoming " Fabiano Rosas
@ 2023-10-23 20:36 ` Fabiano Rosas
  2023-10-23 20:36 ` [PATCH v2 28/29] migration: Add direct-io parameter Fabiano Rosas
  2023-10-23 20:36 ` [PATCH v2 29/29] tests/qtest: Add a test for migration with direct-io and multifd Fabiano Rosas
  28 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration-test.c | 45 ++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index bd4e866e67..c74c911283 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2064,6 +2064,46 @@ static void test_precopy_file_fixed_ram(void)
     test_file_common(&args, false, true);
 }
 
+static void *migrate_multifd_fixed_ram_start(QTestState *from, QTestState *to)
+{
+    migrate_fixed_ram_start(from, to);
+
+    migrate_set_parameter_int(from, "multifd-channels", 4);
+    migrate_set_parameter_int(to, "multifd-channels", 4);
+
+    migrate_set_capability(from, "multifd", true);
+    migrate_set_capability(to, "multifd", true);
+
+    return NULL;
+}
+
+static void test_multifd_file_fixed_ram_live(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+                                           FILE_TEST_FILENAME);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = migrate_multifd_fixed_ram_start,
+    };
+
+    test_file_common(&args, false, false);
+}
+
+static void test_multifd_file_fixed_ram(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+                                           FILE_TEST_FILENAME);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = migrate_multifd_fixed_ram_start,
+    };
+
+    test_file_common(&args, false, true);
+}
+
+
 static void test_precopy_tcp_plain(void)
 {
     MigrateCommon args = {
@@ -3137,6 +3177,11 @@ int main(int argc, char **argv)
     qtest_add_func("/migration/precopy/file/fixed-ram/live",
                    test_precopy_file_fixed_ram_live);
 
+    qtest_add_func("/migration/multifd/file/fixed-ram",
+                   test_multifd_file_fixed_ram);
+    qtest_add_func("/migration/multifd/file/fixed-ram/live",
+                   test_multifd_file_fixed_ram_live);
+
 #ifdef CONFIG_GNUTLS
     qtest_add_func("/migration/precopy/unix/tls/psk",
                    test_precopy_unix_tls_psk);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (26 preceding siblings ...)
  2023-10-23 20:36 ` [PATCH v2 27/29] tests/qtest: Add a multifd + fixed-ram migration test Fabiano Rosas
@ 2023-10-23 20:36 ` Fabiano Rosas
  2023-10-24  5:41   ` Markus Armbruster
                     ` (2 more replies)
  2023-10-23 20:36 ` [PATCH v2 29/29] tests/qtest: Add a test for migration with direct-io and multifd Fabiano Rosas
  28 siblings, 3 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Add the direct-io migration parameter that tells the migration code to
use O_DIRECT when opening the migration stream file whenever possible.

This is currently only used for the secondary channels of fixed-ram
migration, which can guarantee that writes are page aligned.

However, the parameter could be made to affect other types of
file-based migration in the future.
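
With the QAPI change below, the parameter can be toggled like any
other migration parameter. A usage sketch on HMP, mirroring the
fixed-ram examples earlier in the series (direct-io only takes effect
when fixed-ram is enabled):

(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_parameter direct-io on
(qemu) migrate file:migfile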

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/qemu/osdep.h           |  2 ++
 migration/file.c               | 15 ++++++++++++---
 migration/migration-hmp-cmds.c | 10 ++++++++++
 migration/options.c            | 30 ++++++++++++++++++++++++++++++
 migration/options.h            |  1 +
 qapi/migration.json            | 17 ++++++++++++++---
 util/osdep.c                   |  9 +++++++++
 7 files changed, 78 insertions(+), 6 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 475a1c62ff..ea5d29ab9b 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -597,6 +597,8 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive);
 bool qemu_has_ofd_lock(void);
 #endif
 
+bool qemu_has_direct_io(void);
+
 #if defined(__HAIKU__) && defined(__i386__)
 #define FMT_pid "%ld"
 #elif defined(WIN64)
diff --git a/migration/file.c b/migration/file.c
index ad75225f43..3d3c58ecad 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -11,9 +11,9 @@
 #include "qemu/error-report.h"
 #include "channel.h"
 #include "file.h"
-#include "migration.h"
 #include "io/channel-file.h"
 #include "io/channel-util.h"
+#include "migration.h"
 #include "options.h"
 #include "trace.h"
 
@@ -77,9 +77,18 @@ void file_send_channel_create(QIOTaskFunc f, void *data)
     QIOChannelFile *ioc;
     QIOTask *task;
     Error *errp = NULL;
+    int flags = outgoing_args.flags;
 
-    ioc = qio_channel_file_new_path(outgoing_args.fname,
-                                    outgoing_args.flags,
+    if (migrate_direct_io() && qemu_has_direct_io()) {
+        /*
+         * Enable O_DIRECT for the secondary channels. These are used
+         * for sending ram pages and writes should be guaranteed to be
+         * aligned to at least page size.
+         */
+        flags |= O_DIRECT;
+    }
+
+    ioc = qio_channel_file_new_path(outgoing_args.fname, flags,
                                     outgoing_args.mode, &errp);
     if (!ioc) {
         file_migration_cancel(errp);
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index a82597f18e..eab5ac3588 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -387,6 +387,12 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
         monitor_printf(mon, "%s: %" PRIu64 " MB/s\n",
             MigrationParameter_str(MIGRATION_PARAMETER_VCPU_DIRTY_LIMIT),
             params->vcpu_dirty_limit);
+
+        if (params->has_direct_io) {
+            monitor_printf(mon, "%s: %s\n",
+                           MigrationParameter_str(MIGRATION_PARAMETER_DIRECT_IO),
+                           params->direct_io ? "on" : "off");
+        }
     }
 
     qapi_free_MigrationParameters(params);
@@ -661,6 +667,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_vcpu_dirty_limit = true;
         visit_type_size(v, param, &p->vcpu_dirty_limit, &err);
         break;
+    case MIGRATION_PARAMETER_DIRECT_IO:
+        p->has_direct_io = true;
+        visit_type_bool(v, param, &p->direct_io, &err);
+        break;
     default:
         assert(0);
     }
diff --git a/migration/options.c b/migration/options.c
index 2193d69e71..6d0e3c26ae 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -817,6 +817,22 @@ int migrate_decompress_threads(void)
     return s->parameters.decompress_threads;
 }
 
+bool migrate_direct_io(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    /* For now O_DIRECT is only supported with fixed-ram */
+    if (!s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM]) {
+        return false;
+    }
+
+    if (s->parameters.has_direct_io) {
+        return s->parameters.direct_io;
+    }
+
+    return false;
+}
+
 uint64_t migrate_downtime_limit(void)
 {
     MigrationState *s = migrate_get_current();
@@ -1025,6 +1041,11 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->has_vcpu_dirty_limit = true;
     params->vcpu_dirty_limit = s->parameters.vcpu_dirty_limit;
 
+    if (s->parameters.has_direct_io) {
+        params->has_direct_io = true;
+        params->direct_io = s->parameters.direct_io;
+    }
+
     return params;
 }
 
@@ -1059,6 +1080,7 @@ void migrate_params_init(MigrationParameters *params)
     params->has_announce_step = true;
     params->has_x_vcpu_dirty_limit_period = true;
     params->has_vcpu_dirty_limit = true;
+    params->has_direct_io = qemu_has_direct_io();
 }
 
 /*
@@ -1356,6 +1378,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_vcpu_dirty_limit) {
         dest->vcpu_dirty_limit = params->vcpu_dirty_limit;
     }
+
+    if (params->has_direct_io) {
+        dest->direct_io = params->direct_io;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1486,6 +1512,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_vcpu_dirty_limit) {
         s->parameters.vcpu_dirty_limit = params->vcpu_dirty_limit;
     }
+
+    if (params->has_direct_io) {
+        s->parameters.direct_io = params->direct_io;
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
diff --git a/migration/options.h b/migration/options.h
index 01bba5b928..280f86bed1 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -82,6 +82,7 @@ uint8_t migrate_cpu_throttle_increment(void);
 uint8_t migrate_cpu_throttle_initial(void);
 bool migrate_cpu_throttle_tailslow(void);
 int migrate_decompress_threads(void);
+bool migrate_direct_io(void);
 uint64_t migrate_downtime_limit(void);
 uint8_t migrate_max_cpu_throttle(void);
 uint64_t migrate_max_bandwidth(void);
diff --git a/qapi/migration.json b/qapi/migration.json
index 1317dd32ab..3eb9e2c9b5 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -840,6 +840,9 @@
 # @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
 #     Defaults to 1.  (Since 8.1)
 #
+# @direct-io: Open migration files with O_DIRECT when possible. Not
+#             all migration transports support this. (since 8.1)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
@@ -864,7 +867,7 @@
            'multifd-zlib-level', 'multifd-zstd-level',
            'block-bitmap-mapping',
            { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
-           'vcpu-dirty-limit'] }
+           'vcpu-dirty-limit', 'direct-io'] }
 
 ##
 # @MigrateSetParameters:
@@ -1016,6 +1019,9 @@
 # @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
 #     Defaults to 1.  (Since 8.1)
 #
+# @direct-io: Open migration files with O_DIRECT when possible. Not
+#             all migration transports support this. (since 8.1)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
@@ -1058,7 +1064,8 @@
             '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
             '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
                                             'features': [ 'unstable' ] },
-            '*vcpu-dirty-limit': 'uint64'} }
+            '*vcpu-dirty-limit': 'uint64',
+            '*direct-io': 'bool' } }
 
 ##
 # @migrate-set-parameters:
@@ -1230,6 +1237,9 @@
 # @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
 #     Defaults to 1.  (Since 8.1)
 #
+# @direct-io: Open migration files with O_DIRECT when possible. Not
+#             all migration transports support this. (since 8.1)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
@@ -1269,7 +1279,8 @@
             '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
             '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
                                             'features': [ 'unstable' ] },
-            '*vcpu-dirty-limit': 'uint64'} }
+            '*vcpu-dirty-limit': 'uint64',
+            '*direct-io': 'bool' } }
 
 ##
 # @query-migrate-parameters:
diff --git a/util/osdep.c b/util/osdep.c
index e996c4744a..d0227a60ab 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -277,6 +277,15 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive)
 }
 #endif
 
+bool qemu_has_direct_io(void)
+{
+#ifdef O_DIRECT
+    return true;
+#else
+    return false;
+#endif
+}
+
 static int qemu_open_cloexec(const char *name, int flags, mode_t mode)
 {
     int ret;
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH v2 29/29] tests/qtest: Add a test for migration with direct-io and multifd
  2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
                   ` (27 preceding siblings ...)
  2023-10-23 20:36 ` [PATCH v2 28/29] migration: Add direct-io parameter Fabiano Rosas
@ 2023-10-23 20:36 ` Fabiano Rosas
  2023-10-25  9:25   ` Daniel P. Berrangé
  28 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-23 20:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration-test.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index c74c911283..30e70c0e4e 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2077,6 +2077,16 @@ static void *migrate_multifd_fixed_ram_start(QTestState *from, QTestState *to)
     return NULL;
 }
 
+static void *migrate_multifd_fixed_ram_dio_start(QTestState *from, QTestState *to)
+{
+    migrate_multifd_fixed_ram_start(from, to);
+
+    migrate_set_parameter_bool(from, "direct-io", true);
+    migrate_set_parameter_bool(to, "direct-io", true);
+
+    return NULL;
+}
+
 static void test_multifd_file_fixed_ram_live(void)
 {
     g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
@@ -2103,6 +2113,18 @@ static void test_multifd_file_fixed_ram(void)
     test_file_common(&args, false, true);
 }
 
+static void test_multifd_file_fixed_ram_dio(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+                                           FILE_TEST_FILENAME);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = migrate_multifd_fixed_ram_dio_start,
+    };
+
+    test_file_common(&args, false, true);
+}
 
 static void test_precopy_tcp_plain(void)
 {
@@ -3182,6 +3204,9 @@ int main(int argc, char **argv)
     qtest_add_func("/migration/multifd/file/fixed-ram/live",
                    test_multifd_file_fixed_ram_live);
 
+    qtest_add_func("/migration/multifd/file/fixed-ram/dio",
+                   test_multifd_file_fixed_ram_dio);
+
 #ifdef CONFIG_GNUTLS
     qtest_add_func("/migration/precopy/unix/tls/psk",
                    test_precopy_unix_tls_psk);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-23 20:35 ` [PATCH v2 06/29] migration: Add auto-pause capability Fabiano Rosas
@ 2023-10-24  5:25   ` Markus Armbruster
  2023-10-24 18:12     ` Fabiano Rosas
  2023-10-25  8:48   ` Daniel P. Berrangé
  1 sibling, 1 reply; 128+ messages in thread
From: Markus Armbruster @ 2023-10-24  5:25 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Fabiano Rosas <farosas@suse.de> writes:

> Add a capability that allows the management layer to delegate to QEMU
> the decision of whether to pause a VM and perform a non-live
> migration. Depending on the type of migration being performed, this
> could bring performance benefits.
>
> Note that the capability is enabled by default but at this moment no
> migration scheme is making use of it.

This sounds as if the capability has no effect unless the "migration
scheme" (whatever that may be) opts into using it.  Am I confused?

> Signed-off-by: Fabiano Rosas <farosas@suse.de>

[...]

> diff --git a/qapi/migration.json b/qapi/migration.json
> index db3df12d6c..74f12adc0e 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -523,6 +523,10 @@
>  #     and can result in more stable read performance.  Requires KVM
>  #     with accelerator property "dirty-ring-size" set.  (Since 8.1)
>  #
> +# @auto-pause: If enabled, allows QEMU to decide whether to pause the
> +#     VM before migration for an optimal migration performance.
> +#     Enabled by default. (since 8.1)

If this needs an opt-in to take effect, it should be documented.

> +#
>  # Features:
>  #
>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
> @@ -539,7 +543,7 @@
>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>             'validate-uuid', 'background-snapshot',
>             'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
> -           'dirty-limit'] }
> +           'dirty-limit', 'auto-pause'] }
>  
>  ##
>  # @MigrationCapabilityStatus:



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 14/29] migration/ram: Introduce 'fixed-ram' migration capability
  2023-10-23 20:35 ` [PATCH v2 14/29] migration/ram: Introduce 'fixed-ram' migration capability Fabiano Rosas
@ 2023-10-24  5:33   ` Markus Armbruster
  2023-10-24 18:35     ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Markus Armbruster @ 2023-10-24  5:33 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Fabiano Rosas <farosas@suse.de> writes:

> Add a new migration capability 'fixed-ram'.
>
> The core of the feature is to ensure that each ram page has a specific
> offset in the resulting migration stream. The reason why we'd want
> such behavior are two fold:
>
>  - When doing a 'fixed-ram' migration the resulting file will have a
>    bounded size, since pages which are dirtied multiple times will
>    always go to a fixed location in the file, rather than constantly
>    being added to a sequential stream. This eliminates cases where a vm
>    with, say, 1G of ram can result in a migration file that's 10s of
>    GBs, provided that the workload constantly redirties memory.
>
>  - It paves the way to implement DIRECT_IO-enabled save/restore of the
>    migration stream as the pages are ensured to be written at aligned
>    offsets.
>
> For now, enabling the capability has no effect. The next couple of
> patches implement the core functionality.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  docs/devel/migration.rst | 14 ++++++++++++++
>  migration/options.c      | 37 +++++++++++++++++++++++++++++++++++++
>  migration/options.h      |  1 +
>  migration/savevm.c       |  1 +
>  qapi/migration.json      |  5 ++++-
>  5 files changed, 57 insertions(+), 1 deletion(-)
>
> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> index c3e1400c0c..6f898b5dbd 100644
> --- a/docs/devel/migration.rst
> +++ b/docs/devel/migration.rst
> @@ -566,6 +566,20 @@ Others (especially either older devices or system devices which for
>  some reason don't have a bus concept) make use of the ``instance id``
>  for otherwise identically named devices.
>  
> +Fixed-ram format
> +----------------
> +
> +When the ``fixed-ram`` capability is enabled, a slightly different
> +stream format is used for the RAM section. Instead of having a
> +sequential stream of pages that follow the RAMBlock headers, the dirty
> +pages for a RAMBlock follow its header. This ensures that each RAM
> +page has a fixed offset in the resulting migration stream.

This requires the migration stream to be seekable, as documented in the
QAPI schema below.  I think it's worth documenting here, as well.
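
To make the guarantee concrete, a rough sketch (illustrative only; the
actual header layout is defined by the series, not by this snippet):

    /* With fixed-ram, page @page_idx of a RAMBlock always maps to the
     * same file offset, which is what makes pwritev()/preadv() usable. */
    static off_t fixed_ram_page_offset(off_t pages_area_start,
                                       size_t page_size, uint64_t page_idx)
    {
        return pages_area_start + (off_t)page_idx * page_size;
    }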

> +
> +The ``fixed-ram`` capability can be enabled in both source and
> +destination with:
> +
> +    ``migrate_set_capability fixed-ram on``

Effect of enabling on the destination?

What happens when we enable it only on one end?

> +
>  Return path
>  -----------
>  

[...]

> diff --git a/qapi/migration.json b/qapi/migration.json
> index 74f12adc0e..1317dd32ab 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -527,6 +527,9 @@
>  #     VM before migration for optimal migration performance.
>  #     Enabled by default. (since 8.1)
>  #
> +# @fixed-ram: Migrate using fixed offsets for each RAM page. Requires

Two spaces between sentences for consistency, please.

> +#             a seekable transport such as a file.  (since 8.1)

What is a migration transport?  migration.json doesn't define the term.

Which transports are seekable?

Out of curiosity: what happens if the transport isn't seekable?

> +#
>  # Features:
>  #
>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
> @@ -543,7 +546,7 @@
>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>             'validate-uuid', 'background-snapshot',
>             'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
> -           'dirty-limit', 'auto-pause'] }
> +           'dirty-limit', 'auto-pause', 'fixed-ram'] }
>  
>  ##
>  # @MigrationCapabilityStatus:



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-23 20:36 ` [PATCH v2 28/29] migration: Add direct-io parameter Fabiano Rosas
@ 2023-10-24  5:41   ` Markus Armbruster
  2023-10-24 19:32     ` Fabiano Rosas
  2023-10-24  8:33   ` Daniel P. Berrangé
  2023-10-25  9:07   ` Daniel P. Berrangé
  2 siblings, 1 reply; 128+ messages in thread
From: Markus Armbruster @ 2023-10-24  5:41 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Fabiano Rosas <farosas@suse.de> writes:

> Add the direct-io migration parameter that tells the migration code to
> use O_DIRECT when opening the migration stream file whenever possible.
>
> This is currently only used for the secondary channels of fixed-ram
> migration, which can guarantee that writes are page aligned.
>
> However the parameter could be made to affect other types of
> file-based migrations in the future.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

When would you want to enable @direct-io, and when would you want to
leave it disabled?

What happens when you enable @direct-io with a migration that cannot use
O_DIRECT?


[...]

> diff --git a/qapi/migration.json b/qapi/migration.json
> index 1317dd32ab..3eb9e2c9b5 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -840,6 +840,9 @@
>  # @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
>  #     Defaults to 1.  (Since 8.1)
>  #
> +# @direct-io: Open migration files with O_DIRECT when possible. Not
> +#             all migration transports support this. (since 8.1)
> +#

Please format like

   # @direct-io: Open migration files with O_DIRECT when possible.  Not
   #     all migration transports support this.  (since 8.1)

to blend in with recent commit a937b6aa739 (qapi: Reformat doc comments
to conform to current conventions).

>  # Features:
>  #
>  # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
> @@ -864,7 +867,7 @@
>             'multifd-zlib-level', 'multifd-zstd-level',
>             'block-bitmap-mapping',
>             { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
> -           'vcpu-dirty-limit'] }
> +           'vcpu-dirty-limit', 'direct-io'] }
>  
>  ##
>  # @MigrateSetParameters:
> @@ -1016,6 +1019,9 @@
>  # @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
>  #     Defaults to 1.  (Since 8.1)
>  #
> +# @direct-io: Open migration files with O_DIRECT when possible. Not
> +#             all migration transports support this. (since 8.1)
> +#

Likewise.

>  # Features:
>  #
>  # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
> @@ -1058,7 +1064,8 @@
>              '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
>              '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
>                                              'features': [ 'unstable' ] },
> -            '*vcpu-dirty-limit': 'uint64'} }
> +            '*vcpu-dirty-limit': 'uint64',
> +            '*direct-io': 'bool' } }
>  
>  ##
>  # @migrate-set-parameters:
> @@ -1230,6 +1237,9 @@
>  # @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
>  #     Defaults to 1.  (Since 8.1)
>  #
> +# @direct-io: Open migration files with O_DIRECT when possible. Not
> +#             all migration transports support this. (since 8.1)
> +#

Likewise.

>  # Features:
>  #
>  # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
> @@ -1269,7 +1279,8 @@
>              '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
>              '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
>                                              'features': [ 'unstable' ] },
> -            '*vcpu-dirty-limit': 'uint64'} }
> +            '*vcpu-dirty-limit': 'uint64',
> +            '*direct-io': 'bool' } }
>  
>  ##
>  # @query-migrate-parameters:

This one is for the migration maintainers: we have MigrationCapability,
set with migrate-set-capabilities, and MigrationParameters, set with
migrate-set-parameters.  For a boolean configuration setting, either
works.  Which one should we use when?

[...]



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 10/29] io: Add generic pwritev/preadv interface
  2023-10-23 20:35 ` [PATCH v2 10/29] io: Add generic pwritev/preadv interface Fabiano Rosas
@ 2023-10-24  8:18   ` Daniel P. Berrangé
  2023-10-24 19:06     ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-24  8:18 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

On Mon, Oct 23, 2023 at 05:35:49PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
> 
> Introduce basic pwritev/preadv support in the generic channel layer.
> Specific implementation will follow for the file channel as this is
> required in order to support migration streams with fixed location of
> each ram page.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  include/io/channel.h | 82 ++++++++++++++++++++++++++++++++++++++++++++
>  io/channel.c         | 58 +++++++++++++++++++++++++++++++
>  2 files changed, 140 insertions(+)
> 
> diff --git a/include/io/channel.h b/include/io/channel.h
> index fcb19fd672..a8181d576a 100644
> --- a/include/io/channel.h
> +++ b/include/io/channel.h
> @@ -131,6 +131,16 @@ struct QIOChannelClass {
>                             Error **errp);
>  
>      /* Optional callbacks */
> +    ssize_t (*io_pwritev)(QIOChannel *ioc,
> +                          const struct iovec *iov,
> +                          size_t niov,
> +                          off_t offset,
> +                          Error **errp);
> +    ssize_t (*io_preadv)(QIOChannel *ioc,
> +                         const struct iovec *iov,
> +                         size_t niov,
> +                         off_t offset,
> +                         Error **errp);
>      int (*io_shutdown)(QIOChannel *ioc,
>                         QIOChannelShutdown how,
>                         Error **errp);
> @@ -529,6 +539,78 @@ void qio_channel_set_follow_coroutine_ctx(QIOChannel *ioc, bool enabled);
>  int qio_channel_close(QIOChannel *ioc,
>                        Error **errp);
>  
> +/**
> + * qio_channel_pwritev_full
> + * @ioc: the channel object
> + * @iov: the array of memory regions to write data from
> + * @niov: the length of the @iov array
> + * @offset: offset in the channel where writes should begin
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Not all implementations will support this facility, so may report
> + * an error. To avoid errors, the caller may check for the feature
> + * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
> + *
> + * Behaves as qio_channel_writev_full, apart from not supporting
> + * sending of file handles as well as beginning the write at the
> + * passed @offset
> + *
> + */
> +ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
> +                                 size_t niov, off_t offset, Error **errp);

In terms of naming this should be  just "_pwritev".

We don't support FD passing, so the "_full" suffix is not
appropriate

> +
> +/**
> + * qio_channel_pwritev
> + * @ioc: the channel object
> + * @buf: the memory region to write data from
> + * @buflen: the number of bytes in @buf
> + * @offset: offset in the channel where writes should begin
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Not all implementations will support this facility, so may report
> + * an error. To avoid errors, the caller may check for the feature
> + * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
> + *
> + */
> +ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
> +                            off_t offset, Error **errp);

This isn't passing a vector of buffers, so it should be just
"pwrite", not "pwritev".

> +
> +/**
> + * qio_channel_preadv_full
> + * @ioc: the channel object
> + * @iov: the array of memory regions to read data into
> + * @niov: the length of the @iov array
> + * @offset: offset in the channel where reads should begin
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Not all implementations will support this facility, so may report
> + * an error.  To avoid errors, the caller may check for the feature
> + * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
> + *
> + * Behaves as qio_channel_readv_full, apart from not supporting
> + * receiving of file handles as well as beginning the read at the
> + * passed @offset
> + *
> + */
> +ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
> +                                size_t niov, off_t offset, Error **errp);

"preadv"


> +
> +/**
> + * qio_channel_preadv
> + * @ioc: the channel object
> + * @buf: the memory region to read data into
> + * @buflen: the number of bytes in @buf
> + * @offset: offset in the channel where reads should begin
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Not all implementations will support this facility, so may report
> + * an error.  To avoid errors, the caller may check for the feature
> + * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
> + *
> + */
> +ssize_t qio_channel_preadv(QIOChannel *ioc, char *buf, size_t buflen,
> +                           off_t offset, Error **errp);

"pread"


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-23 20:36 ` [PATCH v2 28/29] migration: Add direct-io parameter Fabiano Rosas
  2023-10-24  5:41   ` Markus Armbruster
@ 2023-10-24  8:33   ` Daniel P. Berrangé
  2023-10-24 19:06     ` Fabiano Rosas
  2023-10-25  9:07   ` Daniel P. Berrangé
  2 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-24  8:33 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Mon, Oct 23, 2023 at 05:36:07PM -0300, Fabiano Rosas wrote:
> Add the direct-io migration parameter that tells the migration code to
> use O_DIRECT when opening the migration stream file whenever possible.
> 
> This is currently only used for the secondary channels of fixed-ram
> migration, which can guarantee that writes are page aligned.
> 
> However the parameter could be made to affect other types of
> file-based migrations in the future.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  include/qemu/osdep.h           |  2 ++
>  migration/file.c               | 15 ++++++++++++---
>  migration/migration-hmp-cmds.c | 10 ++++++++++
>  migration/options.c            | 30 ++++++++++++++++++++++++++++++
>  migration/options.h            |  1 +
>  qapi/migration.json            | 17 ++++++++++++++---
>  util/osdep.c                   |  9 +++++++++
>  7 files changed, 78 insertions(+), 6 deletions(-)
> 
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 475a1c62ff..ea5d29ab9b 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -597,6 +597,8 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive);
>  bool qemu_has_ofd_lock(void);
>  #endif
>  
> +bool qemu_has_direct_io(void);
> +
>  #if defined(__HAIKU__) && defined(__i386__)
>  #define FMT_pid "%ld"
>  #elif defined(WIN64)
> diff --git a/migration/file.c b/migration/file.c
> index ad75225f43..3d3c58ecad 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -11,9 +11,9 @@
>  #include "qemu/error-report.h"
>  #include "channel.h"
>  #include "file.h"
> -#include "migration.h"
>  #include "io/channel-file.h"
>  #include "io/channel-util.h"
> +#include "migration.h"
>  #include "options.h"
>  #include "trace.h"
>  
> @@ -77,9 +77,18 @@ void file_send_channel_create(QIOTaskFunc f, void *data)
>      QIOChannelFile *ioc;
>      QIOTask *task;
>      Error *errp = NULL;
> +    int flags = outgoing_args.flags;
>  
> -    ioc = qio_channel_file_new_path(outgoing_args.fname,
> -                                    outgoing_args.flags,
> +    if (migrate_direct_io() && qemu_has_direct_io()) {
> +        /*
> +         * Enable O_DIRECT for the secondary channels. These are used
> +         * for sending ram pages and writes should be guaranteed to be
> +         * aligned to at least page size.
> +         */
> +        flags |= O_DIRECT;
> +    }

IMHO we should not be silently ignoring the user's request for
direct I/O if we've been compiled for a platform which can't
support it. We should fail the setting of the direct I/O
parameter.

Also this is referencing O_DIRECT without any #ifdef check.
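
IOW, something along these lines (sketch):

    int flags = outgoing_args.flags;

#ifdef O_DIRECT
    if (migrate_direct_io()) {
        /* only set O_DIRECT where the platform defines it */
        flags |= O_DIRECT;
    }
#endif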


> +
> +    ioc = qio_channel_file_new_path(outgoing_args.fname, flags,
>                                      outgoing_args.mode, &errp);
>      if (!ioc) {
>          file_migration_cancel(errp);
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index a82597f18e..eab5ac3588 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -387,6 +387,12 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>          monitor_printf(mon, "%s: %" PRIu64 " MB/s\n",
>              MigrationParameter_str(MIGRATION_PARAMETER_VCPU_DIRTY_LIMIT),
>              params->vcpu_dirty_limit);
> +
> +        if (params->has_direct_io) {
> +            monitor_printf(mon, "%s: %s\n",
> +                           MigrationParameter_str(MIGRATION_PARAMETER_DIRECT_IO),
> +                           params->direct_io ? "on" : "off");
> +        }
>      }
>  
>      qapi_free_MigrationParameters(params);
> @@ -661,6 +667,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>          p->has_vcpu_dirty_limit = true;
>          visit_type_size(v, param, &p->vcpu_dirty_limit, &err);
>          break;
> +    case MIGRATION_PARAMETER_DIRECT_IO:
> +        p->has_direct_io = true;
> +        visit_type_bool(v, param, &p->direct_io, &err);
> +        break;
>      default:
>          assert(0);
>      }
> diff --git a/migration/options.c b/migration/options.c
> index 2193d69e71..6d0e3c26ae 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -817,6 +817,22 @@ int migrate_decompress_threads(void)
>      return s->parameters.decompress_threads;
>  }
>  
> +bool migrate_direct_io(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    /* For now O_DIRECT is only supported with fixed-ram */
> +    if (!s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM]) {
> +        return false;
> +    }
> +
> +    if (s->parameters.has_direct_io) {
> +        return s->parameters.direct_io;
> +    }
> +
> +    return false;
> +}
> +
>  uint64_t migrate_downtime_limit(void)
>  {
>      MigrationState *s = migrate_get_current();
> @@ -1025,6 +1041,11 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
>      params->has_vcpu_dirty_limit = true;
>      params->vcpu_dirty_limit = s->parameters.vcpu_dirty_limit;
>  
> +    if (s->parameters.has_direct_io) {
> +        params->has_direct_io = true;
> +        params->direct_io = s->parameters.direct_io;
> +    }
> +
>      return params;
>  }
>  
> @@ -1059,6 +1080,7 @@ void migrate_params_init(MigrationParameters *params)
>      params->has_announce_step = true;
>      params->has_x_vcpu_dirty_limit_period = true;
>      params->has_vcpu_dirty_limit = true;
> +    params->has_direct_io = qemu_has_direct_io();
>  }
>  
>  /*
> @@ -1356,6 +1378,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
>      if (params->has_vcpu_dirty_limit) {
>          dest->vcpu_dirty_limit = params->vcpu_dirty_limit;
>      }
> +
> +    if (params->has_direct_io) {
> +        dest->direct_io = params->direct_io;
> +    }
>  }
>  
>  static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
> @@ -1486,6 +1512,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
>      if (params->has_vcpu_dirty_limit) {
>          s->parameters.vcpu_dirty_limit = params->vcpu_dirty_limit;
>      }
> +
> +    if (params->has_direct_io) {

#ifndef O_DIRECT
     error_setg(errp, "Direct I/O is not supported on this platform");
#endif     

Should also be doing a check for the 'fixed-ram' capability being
set at this point.
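
e.g. (sketch; whether this lives in migrate_params_check() or
migrate_params_apply() is up to you):

    if (params->has_direct_io && params->direct_io &&
        !migrate_get_current()->capabilities[MIGRATION_CAPABILITY_FIXED_RAM]) {
        error_setg(errp, "direct-io requires the fixed-ram capability");
        return;
    }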

> +        s->parameters.direct_io = params->direct_io;
> +    }
>  }
>  
>  void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
> diff --git a/migration/options.h b/migration/options.h
> index 01bba5b928..280f86bed1 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -82,6 +82,7 @@ uint8_t migrate_cpu_throttle_increment(void);
>  uint8_t migrate_cpu_throttle_initial(void);
>  bool migrate_cpu_throttle_tailslow(void);
>  int migrate_decompress_threads(void);
> +bool migrate_direct_io(void);
>  uint64_t migrate_downtime_limit(void);
>  uint8_t migrate_max_cpu_throttle(void);
>  uint64_t migrate_max_bandwidth(void);
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 1317dd32ab..3eb9e2c9b5 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -840,6 +840,9 @@
>  # @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
>  #     Defaults to 1.  (Since 8.1)
>  #
> +# @direct-io: Open migration files with O_DIRECT when possible. Not
> +#             all migration transports support this. (since 8.1)

"Not all migration transports support this" is too vague.

Let's say what the requirement is

    "This requires that the 'fixed-ram' capability is enabled"

> +#
>  # Features:
>  #
>  # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
> @@ -864,7 +867,7 @@
>             'multifd-zlib-level', 'multifd-zstd-level',
>             'block-bitmap-mapping',
>             { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
> -           'vcpu-dirty-limit'] }
> +           'vcpu-dirty-limit', 'direct-io'] }
>  
>  ##
>  # @MigrateSetParameters:
> @@ -1016,6 +1019,9 @@
>  # @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
>  #     Defaults to 1.  (Since 8.1)
>  #
> +# @direct-io: Open migration files with O_DIRECT when possible. Not
> +#             all migration transports support this. (since 8.1)
> +#
>  # Features:
>  #
>  # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
> @@ -1058,7 +1064,8 @@
>              '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
>              '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
>                                              'features': [ 'unstable' ] },
> -            '*vcpu-dirty-limit': 'uint64'} }
> +            '*vcpu-dirty-limit': 'uint64',
> +            '*direct-io': 'bool' } }
>  
>  ##
>  # @migrate-set-parameters:
> @@ -1230,6 +1237,9 @@
>  # @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
>  #     Defaults to 1.  (Since 8.1)
>  #
> +# @direct-io: Open migration files with O_DIRECT when possible. Not
> +#             all migration transports support this. (since 8.1)
> +#
>  # Features:
>  #
>  # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
> @@ -1269,7 +1279,8 @@
>              '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
>              '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
>                                              'features': [ 'unstable' ] },
> -            '*vcpu-dirty-limit': 'uint64'} }
> +            '*vcpu-dirty-limit': 'uint64',
> +            '*direct-io': 'bool' } }
>  
>  ##
>  # @query-migrate-parameters:
> diff --git a/util/osdep.c b/util/osdep.c
> index e996c4744a..d0227a60ab 100644
> --- a/util/osdep.c
> +++ b/util/osdep.c
> @@ -277,6 +277,15 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive)
>  }
>  #endif
>  
> +bool qemu_has_direct_io(void)
> +{
> +#ifdef O_DIRECT
> +    return true;
> +#else
> +    return false;
> +#endif
> +}
> +
>  static int qemu_open_cloexec(const char *name, int flags, mode_t mode)
>  {
>      int ret;
> -- 
> 2.35.3
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 22/29] io: Add a pwritev/preadv version that takes a discontiguous iovec
  2023-10-23 20:36 ` [PATCH v2 22/29] io: Add a pwritev/preadv version that takes a discontiguous iovec Fabiano Rosas
@ 2023-10-24  8:50   ` Daniel P. Berrangé
  0 siblings, 0 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-24  8:50 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:36:01PM -0300, Fabiano Rosas wrote:
> For the upcoming support for fixed-ram migration with multifd, we need
> to be able to accept an iovec array with non-contiguous data.
> 
> Add a pwritev and preadv version that splits the array into contiguous
> segments before writing. With that we can have the ram code continue
> to add pages in any order and the multifd code continue to send large
> arrays for reading and writing.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> Since iovs can be non-contiguous, we'd need a separate array on the
> side to carry an extra file offset for each of them, so I'm relying on
> the fact that each iov lies within a single host page and passing in an
> encoded offset that takes the host page into account.
> ---
>  include/io/channel.h | 50 +++++++++++++++++++++++++++
>  io/channel.c         | 82 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 132 insertions(+)
> 
> diff --git a/include/io/channel.h b/include/io/channel.h
> index a8181d576a..51a99fb9f6 100644
> --- a/include/io/channel.h
> +++ b/include/io/channel.h
> @@ -33,8 +33,10 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass,
>  #define QIO_CHANNEL_ERR_BLOCK -2
>  
>  #define QIO_CHANNEL_WRITE_FLAG_ZERO_COPY 0x1
> +#define QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET 0x2
>  
>  #define QIO_CHANNEL_READ_FLAG_MSG_PEEK 0x1
> +#define QIO_CHANNEL_READ_FLAG_WITH_OFFSET 0x2
>  
>  typedef enum QIOChannelFeature QIOChannelFeature;
>  
> @@ -559,6 +561,30 @@ int qio_channel_close(QIOChannel *ioc,
>  ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
>                                   size_t niov, off_t offset, Error **errp);
>  
> +/**
> + * qio_channel_write_full_all:
> + * @ioc: the channel object
> + * @iov: the array of memory regions to write data from
> + * @niov: the length of the @iov array
> + * @offset: the iovec offset in the file where to write the data
> + * @fds: an array of file handles to send
> + * @nfds: number of file handles in @fds
> + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
> + * @errp: pointer to a NULL-initialized error object
> + *
> + *
> + * Selects between a writev or pwritev channel writer function.
> + *
> + * If QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET is passed in flags, pwritev is
> + * used and @offset is expected to be a meaningful value, @fds and
> + * @nfds are ignored; otherwise uses writev and @offset is ignored.
> + *
> + * Returns: 0 if all bytes were written, or -1 on error
> + */
> +int qio_channel_write_full_all(QIOChannel *ioc, const struct iovec *iov,
> +                               size_t niov, off_t offset, int *fds, size_t nfds,
> +                               int flags, Error **errp);
> +
>  /**
>   * qio_channel_pwritev
>   * @ioc: the channel object
> @@ -595,6 +621,30 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
>  ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
>                                  size_t niov, off_t offset, Error **errp);
>  
> +/**
> + * qio_channel_read_full_all:
> + * @ioc: the channel object
> + * @iov: the array of memory regions to read data to
> + * @niov: the length of the @iov array
> + * @offset: the iovec offset in the file from where to read the data
> + * @flags: read flags (QIO_CHANNEL_READ_FLAG_*)
> + * @errp: pointer to a NULL-initialized error object
> + *
> + *
> + * Selects between a readv or preadv channel reader function.
> + *
> + * If QIO_CHANNEL_READ_FLAG_WITH_OFFSET is passed in flags, preadv is
> + * used and @offset is expected to be a meaningful value; otherwise
> + * uses readv and @offset is ignored.
> + *
> + * Returns: 0 if all bytes were read, or -1 on error
> + */
> +int qio_channel_read_full_all(QIOChannel *ioc, const struct iovec *iov,
> +                              size_t niov, off_t offset,
> +                              int flags, Error **errp);
> +
>  /**
>   * qio_channel_preadv
>   * @ioc: the channel object
> diff --git a/io/channel.c b/io/channel.c
> index 770d61ea00..648b68451d 100644
> --- a/io/channel.c
> +++ b/io/channel.c
> @@ -472,6 +472,76 @@ ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
>      return klass->io_pwritev(ioc, iov, niov, offset, errp);
>  }
>  
> +static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
> +                                                 const struct iovec *iov,
> +                                                 size_t niov, off_t offset,
> +                                                 bool is_write, Error **errp)
> +{
> +    ssize_t ret = 0;
> +    int i, slice_idx, slice_num;
> +    uint64_t base, next, file_offset;
> +    size_t len;
> +
> +    slice_idx = 0;
> +    slice_num = 1;
> +
> +    /*
> +     * If the iov array doesn't have contiguous elements, we need to
> +     * split it in slices because we only have one (file) 'offset' for
> +     * the whole iov. Do this here so callers don't need to break the
> +     * iov array themselves.
> +     */
> +    for (i = 0; i < niov; i++, slice_num++) {
> +        base = (uint64_t) iov[i].iov_base;
> +
> +        if (i != niov - 1) {
> +            len = iov[i].iov_len;
> +            next = (uint64_t) iov[i + 1].iov_base;
> +
> +            if (base + len == next) {
> +                continue;
> +            }
> +        }
> +
> +        /*
> +         * Use the offset of the first element of the segment that
> +         * we're sending.
> +         */
> +        file_offset = offset + (uint64_t) iov[slice_idx].iov_base;
> +
> +        if (is_write) {
> +            ret = qio_channel_pwritev_full(ioc, &iov[slice_idx], slice_num,
> +                                           file_offset, errp);
> +        } else {
> +            ret = qio_channel_preadv_full(ioc, &iov[slice_idx], slice_num,
> +                                          file_offset, errp);
> +        }
> +
> +        if (ret < 0) {
> +            break;
> +        }
> +
> +        slice_idx += slice_num;
> +        slice_num = 0;
> +    }
> +
> +    return (ret < 0) ? -1 : 0;
> +}
> +
> +int qio_channel_write_full_all(QIOChannel *ioc,
> +                                const struct iovec *iov,
> +                                size_t niov, off_t offset,
> +                                int *fds, size_t nfds,
> +                                int flags, Error **errp)
> +{
> +    if (flags & QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET) {
> +        return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
> +                                                     offset, true, errp);
> +    }
> +
> +    return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0, flags, errp);
> +}

I don't much like this as a design, as it is two completely different
APIs shoved into one facade that is easy to misunderstand and misuse.
fds, nfds, and other flags values are all silently ignored in the first
branch. offset is silently ignored in the second branch. In fact there's
no functional benefit to the second branch at all, over calling the
existing apis.

I think that there should be qio_channel_{pwritev,preadv}_all
methods that take the 'flags' parameter.
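
i.e. something like (prototype sketch):

    int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
                                size_t niov, off_t offset, int flags,
                                Error **errp);
    int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
                               size_t niov, off_t offset, int flags,
                               Error **errp);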

> +
>  ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
>                              off_t offset, Error **errp)
>  {
> @@ -501,6 +571,18 @@ ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
>      return klass->io_preadv(ioc, iov, niov, offset, errp);
>  }
>  
> +int qio_channel_read_full_all(QIOChannel *ioc, const struct iovec *iov,
> +                              size_t niov, off_t offset,
> +                              int flags, Error **errp)
> +{
> +    if (flags & QIO_CHANNEL_READ_FLAG_WITH_OFFSET) {
> +        return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
> +                                                     offset, false, errp);
> +    }
> +
> +    return qio_channel_readv_full_all(ioc, iov, niov, NULL, NULL, errp);
> +}
> +
>  ssize_t qio_channel_preadv(QIOChannel *ioc, char *buf, size_t buflen,
>                             off_t offset, Error **errp)
>  {
> -- 
> 2.35.3
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-24  5:25   ` Markus Armbruster
@ 2023-10-24 18:12     ` Fabiano Rosas
  2023-10-25  5:33       ` Markus Armbruster
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-24 18:12 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, berrange, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Markus Armbruster <armbru@redhat.com> writes:

> Fabiano Rosas <farosas@suse.de> writes:
>
>> Add a capability that allows the management layer to delegate to QEMU
>> the decision of whether to pause a VM and perform a non-live
>> migration. Depending on the type of migration being performed, this
>> could bring performance benefits.
>>
>> Note that the capability is enabled by default but at this moment no
>> migration scheme is making use of it.
>
> This sounds as if the capability has no effect unless the "migration
> scheme" (whatever that may be) opts into using it.  Am I confused?
>

What I mean here is that this capability is implemented and functional,
but I'm not retroactively enabling any existing migration code to use
auto-pause. Otherwise people would start seeing their guests being
paused before migration in scenarios they never used to pause.

By "migration scheme" I mean types of migration. Or modes of
operation. Or exclusive parameters. Anything that is different enough
from what exists today that we would consider a different type of
migration. Anything that allows us to break backward compatibility
(because it never existed before to begin with).

E.g. this series introduces the fixed-ram migration. That never existed
before. So from the moment we enable that code to use this capability,
it will always do auto-pause, unless the management layer wants to avoid
it.
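
In code terms, the opt-in point would look roughly like this (a sketch;
the helper names are illustrative, not the final API):

    static bool migration_should_pause(void)
    {
        if (!migrate_auto_pause()) {
            /* the management layer turned the capability off */
            return false;
        }
        /* only migration schemes that explicitly opt in return true */
        return migrate_fixed_ram();
    }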

>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> [...]
>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index db3df12d6c..74f12adc0e 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -523,6 +523,10 @@
>>  #     and can result in more stable read performance.  Requires KVM
>>  #     with accelerator property "dirty-ring-size" set.  (Since 8.1)
>>  #
>> +# @auto-pause: If enabled, allows QEMU to decide whether to pause the
>> +#     VM before migration for optimal migration performance.
>> +#     Enabled by default. (since 8.1)
>
> If this needs an opt-in to take effect, it should be documented.
>

Something like this perhaps?

# @auto-pause: If enabled, allows QEMU to decide whether to pause the VM
#     before migration for optimal migration performance. Enabled by
#     default. New migration code needs to opt in at
#     migration_should_pause(); otherwise this behaves as if
#     disabled. (since 8.2)

>> +#
>>  # Features:
>>  #
>>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>> @@ -539,7 +543,7 @@
>>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>>             'validate-uuid', 'background-snapshot',
>>             'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
>> -           'dirty-limit'] }
>> +           'dirty-limit', 'auto-pause'] }
>>  
>>  ##
>>  # @MigrationCapabilityStatus:


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 14/29] migration/ram: Introduce 'fixed-ram' migration capability
  2023-10-24  5:33   ` Markus Armbruster
@ 2023-10-24 18:35     ` Fabiano Rosas
  2023-10-25  6:18       ` Markus Armbruster
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-24 18:35 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, berrange, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Markus Armbruster <armbru@redhat.com> writes:

> Fabiano Rosas <farosas@suse.de> writes:
>
>> Add a new migration capability 'fixed-ram'.
>>
>> The core of the feature is to ensure that each RAM page has a specific
>> offset in the resulting migration stream. The reasons why we'd want
>> such behavior are twofold:
>>
>>  - When doing a 'fixed-ram' migration the resulting file will have a
>>    bounded size, since pages which are dirtied multiple times will
>>    always go to a fixed location in the file, rather than constantly
>>    being added to a sequential stream. This eliminates cases where a VM
>>    with, say, 1G of RAM can result in a migration file that's tens of
>>    GBs if the workload constantly redirties memory.
>>
>>  - It paves the way to implement DIRECT_IO-enabled save/restore of the
>>    migration stream as the pages are ensured to be written at aligned
>>    offsets.
>>
>> For now, enabling the capability has no effect. The next couple of
>> patches implement the core functionality.
>>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  docs/devel/migration.rst | 14 ++++++++++++++
>>  migration/options.c      | 37 +++++++++++++++++++++++++++++++++++++
>>  migration/options.h      |  1 +
>>  migration/savevm.c       |  1 +
>>  qapi/migration.json      |  5 ++++-
>>  5 files changed, 57 insertions(+), 1 deletion(-)
>>
>> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
>> index c3e1400c0c..6f898b5dbd 100644
>> --- a/docs/devel/migration.rst
>> +++ b/docs/devel/migration.rst
>> @@ -566,6 +566,20 @@ Others (especially either older devices or system devices which for
>>  some reason don't have a bus concept) make use of the ``instance id``
>>  for otherwise identically named devices.
>>  
>> +Fixed-ram format
>> +----------------
>> +
>> +When the ``fixed-ram`` capability is enabled, a slightly different
>> +stream format is used for the RAM section. Instead of having a
>> +sequential stream of pages that follow the RAMBlock headers, the dirty
>> +pages for a RAMBlock follow its header. This ensures that each RAM
>> +page has a fixed offset in the resulting migration stream.
>
> This requires the migration stream to be seekable, as documented in the
> QAPI schema below.  I think it's worth documenting here, as well.
>

Ok.

>> +
>> +The ``fixed-ram`` capability can be enabled in both source and
>> +destination with:
>> +
>> +    ``migrate_set_capability fixed-ram on``
>
> Effect of enabling on the destination?
>
> What happens when we enable it only on one end?
>

qemu-system-x86_64: Capability fixed-ram is off, but received capability is on
qemu-system-x86_64: load of migration failed: Invalid argument

So I guess that *can* be enabled up there should become a *must*.

>> +
>>  Return path
>>  -----------
>>  
>
> [...]
>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 74f12adc0e..1317dd32ab 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -527,6 +527,9 @@
>>  #     VM before migration for optimal migration performance.
>>  #     Enabled by default. (since 8.1)
>>  #
>> +# @fixed-ram: Migrate using fixed offsets for each RAM page. Requires
>
> Two spaces between sentences for consistency, please.
>
>> +#             a seekable transport such as a file.  (since 8.1)
>
> What is a migration transport?  migration.json doesn't define the term.
>

The medium that transports the migration. We are about to define some
terms in the QAPI series:

[PATCH v15 00/14] migration: Modify 'migrate' and 'migrate-incoming'
QAPI commands for migration
https://lore.kernel.org/r/20231023182053.8711-1-farosas@suse.de

> Which transports are seekable?
>

The ones that implement QIO_CHANNEL_FEATURE_SEEKABLE. Currently only
QIOChannelFile.
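
Callers can probe for it before using the positional I/O API, e.g.
(sketch):

    if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SEEKABLE)) {
        error_setg(errp, "fixed-ram requires a seekable channel");
        return -1;
    }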

> Out of curiosity: what happens if the transport isn't seekable?
>

We fail the migration. At migration_channels_and_uri_compatible():

    if (migration_needs_seekable_channel() &&
        !uri_supports_seeking(uri)) {
        error_setg(errp, "Migration requires seekable transport (e.g. file)");
        compatible = false;
    }

>> +#
>>  # Features:
>>  #
>>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>> @@ -543,7 +546,7 @@
>>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>>             'validate-uuid', 'background-snapshot',
>>             'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
>> -           'dirty-limit', 'auto-pause'] }
>> +           'dirty-limit', 'auto-pause', 'fixed-ram'] }
>>  
>>  ##
>>  # @MigrationCapabilityStatus:


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-24  8:33   ` Daniel P. Berrangé
@ 2023-10-24 19:06     ` Fabiano Rosas
  0 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-24 19:06 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:36:07PM -0300, Fabiano Rosas wrote:
>> Add the direct-io migration parameter that tells the migration code to
>> use O_DIRECT when opening the migration stream file whenever possible.
>> 
>> This is currently only used for the secondary channels of fixed-ram
>> migration, which can guarantee that writes are page aligned.
>> 
>> However the parameter could be made to affect other types of
>> file-based migrations in the future.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  include/qemu/osdep.h           |  2 ++
>>  migration/file.c               | 15 ++++++++++++---
>>  migration/migration-hmp-cmds.c | 10 ++++++++++
>>  migration/options.c            | 30 ++++++++++++++++++++++++++++++
>>  migration/options.h            |  1 +
>>  qapi/migration.json            | 17 ++++++++++++++---
>>  util/osdep.c                   |  9 +++++++++
>>  7 files changed, 78 insertions(+), 6 deletions(-)
>> 
>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> index 475a1c62ff..ea5d29ab9b 100644
>> --- a/include/qemu/osdep.h
>> +++ b/include/qemu/osdep.h
>> @@ -597,6 +597,8 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive);
>>  bool qemu_has_ofd_lock(void);
>>  #endif
>>  
>> +bool qemu_has_direct_io(void);
>> +
>>  #if defined(__HAIKU__) && defined(__i386__)
>>  #define FMT_pid "%ld"
>>  #elif defined(WIN64)
>> diff --git a/migration/file.c b/migration/file.c
>> index ad75225f43..3d3c58ecad 100644
>> --- a/migration/file.c
>> +++ b/migration/file.c
>> @@ -11,9 +11,9 @@
>>  #include "qemu/error-report.h"
>>  #include "channel.h"
>>  #include "file.h"
>> -#include "migration.h"
>>  #include "io/channel-file.h"
>>  #include "io/channel-util.h"
>> +#include "migration.h"
>>  #include "options.h"
>>  #include "trace.h"
>>  
>> @@ -77,9 +77,18 @@ void file_send_channel_create(QIOTaskFunc f, void *data)
>>      QIOChannelFile *ioc;
>>      QIOTask *task;
>>      Error *errp = NULL;
>> +    int flags = outgoing_args.flags;
>>  
>> -    ioc = qio_channel_file_new_path(outgoing_args.fname,
>> -                                    outgoing_args.flags,
>> +    if (migrate_direct_io() && qemu_has_direct_io()) {
>> +        /*
>> +         * Enable O_DIRECT for the secondary channels. These are used
>> +         * for sending ram pages and writes should be guaranteed to be
>> +         * aligned to at least page size.
>> +         */
>> +        flags |= O_DIRECT;
>> +    }
>
> IMHO we should not be silently ignoring the user's request for
> direct I/O if we've been compiled for a platform which can't
> support it. We should fail the setting of the direct I/O
> parameter.
>

Good point.

> Also this is referencing O_DIRECT without any #ifdef check.
>

Ack.

>> +
>> +    ioc = qio_channel_file_new_path(outgoing_args.fname, flags,
>>                                      outgoing_args.mode, &errp);
>>      if (!ioc) {
>>          file_migration_cancel(errp);
>> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
>> index a82597f18e..eab5ac3588 100644
>> --- a/migration/migration-hmp-cmds.c
>> +++ b/migration/migration-hmp-cmds.c
>> @@ -387,6 +387,12 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>>          monitor_printf(mon, "%s: %" PRIu64 " MB/s\n",
>>              MigrationParameter_str(MIGRATION_PARAMETER_VCPU_DIRTY_LIMIT),
>>              params->vcpu_dirty_limit);
>> +
>> +        if (params->has_direct_io) {
>> +            monitor_printf(mon, "%s: %s\n",
>> +                           MigrationParameter_str(MIGRATION_PARAMETER_DIRECT_IO),
>> +                           params->direct_io ? "on" : "off");
>> +        }
>>      }
>>  
>>      qapi_free_MigrationParameters(params);
>> @@ -661,6 +667,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>>          p->has_vcpu_dirty_limit = true;
>>          visit_type_size(v, param, &p->vcpu_dirty_limit, &err);
>>          break;
>> +    case MIGRATION_PARAMETER_DIRECT_IO:
>> +        p->has_direct_io = true;
>> +        visit_type_bool(v, param, &p->direct_io, &err);
>> +        break;
>>      default:
>>          assert(0);
>>      }
>> diff --git a/migration/options.c b/migration/options.c
>> index 2193d69e71..6d0e3c26ae 100644
>> --- a/migration/options.c
>> +++ b/migration/options.c
>> @@ -817,6 +817,22 @@ int migrate_decompress_threads(void)
>>      return s->parameters.decompress_threads;
>>  }
>>  
>> +bool migrate_direct_io(void)
>> +{
>> +    MigrationState *s = migrate_get_current();
>> +
>> +    /* For now O_DIRECT is only supported with fixed-ram */
>> +    if (!s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM]) {
>> +        return false;
>> +    }
>> +
>> +    if (s->parameters.has_direct_io) {
>> +        return s->parameters.direct_io;
>> +    }
>> +
>> +    return false;
>> +}
>> +
>>  uint64_t migrate_downtime_limit(void)
>>  {
>>      MigrationState *s = migrate_get_current();
>> @@ -1025,6 +1041,11 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
>>      params->has_vcpu_dirty_limit = true;
>>      params->vcpu_dirty_limit = s->parameters.vcpu_dirty_limit;
>>  
>> +    if (s->parameters.has_direct_io) {
>> +        params->has_direct_io = true;
>> +        params->direct_io = s->parameters.direct_io;
>> +    }
>> +
>>      return params;
>>  }
>>  
>> @@ -1059,6 +1080,7 @@ void migrate_params_init(MigrationParameters *params)
>>      params->has_announce_step = true;
>>      params->has_x_vcpu_dirty_limit_period = true;
>>      params->has_vcpu_dirty_limit = true;
>> +    params->has_direct_io = qemu_has_direct_io();
>>  }
>>  
>>  /*
>> @@ -1356,6 +1378,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
>>      if (params->has_vcpu_dirty_limit) {
>>          dest->vcpu_dirty_limit = params->vcpu_dirty_limit;
>>      }
>> +
>> +    if (params->has_direct_io) {
>> +        dest->direct_io = params->direct_io;
>> +    }
>>  }
>>  
>>  static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
>> @@ -1486,6 +1512,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
>>      if (params->has_vcpu_dirty_limit) {
>>          s->parameters.vcpu_dirty_limit = params->vcpu_dirty_limit;
>>      }
>> +
>> +    if (params->has_direct_io) {
>
> #ifndef O_DIRECT
>      error_setg(errp, "Direct I/O is not supported on this platform");
> #endif     
>
> Should also be doing a check for the 'fixed-ram' capability being
> set at this point.
>

Ok.



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 10/29] io: Add generic pwritev/preadv interface
  2023-10-24  8:18   ` Daniel P. Berrangé
@ 2023-10-24 19:06     ` Fabiano Rosas
  0 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-24 19:06 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:35:49PM -0300, Fabiano Rosas wrote:
>> From: Nikolay Borisov <nborisov@suse.com>
>> 
>> Introduce basic pwritev/preadv support in the generic channel layer.
>> Specific implementation will follow for the file channel as this is
>> required in order to support migration streams with fixed location of
>> each ram page.
>> 
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  include/io/channel.h | 82 ++++++++++++++++++++++++++++++++++++++++++++
>>  io/channel.c         | 58 +++++++++++++++++++++++++++++++
>>  2 files changed, 140 insertions(+)
>> 
>> diff --git a/include/io/channel.h b/include/io/channel.h
>> index fcb19fd672..a8181d576a 100644
>> --- a/include/io/channel.h
>> +++ b/include/io/channel.h
>> @@ -131,6 +131,16 @@ struct QIOChannelClass {
>>                             Error **errp);
>>  
>>      /* Optional callbacks */
>> +    ssize_t (*io_pwritev)(QIOChannel *ioc,
>> +                          const struct iovec *iov,
>> +                          size_t niov,
>> +                          off_t offset,
>> +                          Error **errp);
>> +    ssize_t (*io_preadv)(QIOChannel *ioc,
>> +                         const struct iovec *iov,
>> +                         size_t niov,
>> +                         off_t offset,
>> +                         Error **errp);
>>      int (*io_shutdown)(QIOChannel *ioc,
>>                         QIOChannelShutdown how,
>>                         Error **errp);
>> @@ -529,6 +539,78 @@ void qio_channel_set_follow_coroutine_ctx(QIOChannel *ioc, bool enabled);
>>  int qio_channel_close(QIOChannel *ioc,
>>                        Error **errp);
>>  
>> +/**
>> + * qio_channel_pwritev_full
>> + * @ioc: the channel object
>> + * @iov: the array of memory regions to write data from
>> + * @niov: the length of the @iov array
>> + * @offset: offset in the channel where writes should begin
>> + * @errp: pointer to a NULL-initialized error object
>> + *
>> + * Not all implementations will support this facility, so may report
>> + * an error. To avoid errors, the caller may check for the feature
>> + * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
>> + *
>> + * Behaves as qio_channel_writev_full, apart from not supporting
>> + * sending of file handles as well as beginning the write at the
>> + * passed @offset
>> + *
>> + */
>> +ssize_t qio_channel_pwritev_full(QIOChannel *ioc, const struct iovec *iov,
>> +                                 size_t niov, off_t offset, Error **errp);
>
> In terms of naming this should be  just "_pwritev".
>
> We don't support FD passing, so the "_full" suffix is not
> appropriate
>
>> +
>> +/**
>> + * qio_channel_pwritev
>> + * @ioc: the channel object
>> + * @buf: the memory region to write data from
>> + * @buflen: the number of bytes in @buf
>> + * @offset: offset in the channel where writes should begin
>> + * @errp: pointer to a NULL-initialized error object
>> + *
>> + * Not all implementations will support this facility, so may report
>> + * an error. To avoid errors, the caller may check for the feature
>> + * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
>> + *
>> + */
>> +ssize_t qio_channel_pwritev(QIOChannel *ioc, char *buf, size_t buflen,
>> +                            off_t offset, Error **errp);
>
> This isn't passing a vector of buffers, so it should be just
> "pwrite", not "pwritev".
>
>> +
>> +/**
>> + * qio_channel_preadv_full
>> + * @ioc: the channel object
>> + * @iov: the array of memory regions to read data into
>> + * @niov: the length of the @iov array
>> + * @offset: offset in the channel where reads should begin
>> + * @errp: pointer to a NULL-initialized error object
>> + *
>> + * Not all implementations will support this facility, so may report
>> + * an error.  To avoid errors, the caller may check for the feature
>> + * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
>> + *
>> + * Behaves as qio_channel_readv_full, apart from not supporting
>> + * receiving of file handles as well as beginning the read at the
>> + * passed @offset
>> + *
>> + */
>> +ssize_t qio_channel_preadv_full(QIOChannel *ioc, const struct iovec *iov,
>> +                                size_t niov, off_t offset, Error **errp);
>
> "preadv"
>
>
>> +
>> +/**
>> + * qio_channel_preadv
>> + * @ioc: the channel object
>> + * @buf: the memory region to read data into
>> + * @buflen: the number of bytes in @buf
>> + * @offset: offset in the channel where reads should begin
>> + * @errp: pointer to a NULL-initialized error object
>> + *
>> + * Not all implementations will support this facility, so may report
>> + * an error.  To avoid errors, the caller may check for the feature
>> + * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
>> + *
>> + */
>> +ssize_t qio_channel_preadv(QIOChannel *ioc, char *buf, size_t buflen,
>> +                           off_t offset, Error **errp);
>
> "pread"
>
>
> With regards,
> Daniel

I'll fix all instances.

Thanks


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-24  5:41   ` Markus Armbruster
@ 2023-10-24 19:32     ` Fabiano Rosas
  2023-10-25  6:23       ` Markus Armbruster
  2023-10-25  8:44       ` Daniel P. Berrangé
  0 siblings, 2 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-24 19:32 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, berrange, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Markus Armbruster <armbru@redhat.com> writes:

> Fabiano Rosas <farosas@suse.de> writes:
>
>> Add the direct-io migration parameter that tells the migration code to
>> use O_DIRECT when opening the migration stream file whenever possible.
>>
>> This is currently only used for the secondary channels of fixed-ram
>> migration, which can guarantee that writes are page aligned.
>>
>> However the parameter could be made to affect other types of
>> file-based migrations in the future.
>>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> When would you want to enable @direct-io, and when would you want to
> leave it disabled?

That depends on a performance analysis. You'd generally leave it
disabled unless there's some indication that the operating system is
having trouble draining the page cache.

However I don't think QEMU should attempt any kind of prescription in
that regard.

From the migration implementation perspective, we need to provide
alignment guarantees on the stream before allowing direct IO to be
enabled. In this series we're just enabling it for the secondary multifd
channels which do page-aligned reads/writes.
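
The constraint is the usual O_DIRECT one (sketch, not actual migration
code; fd and page_idx are assumed to come from the channel setup):

    size_t align = qemu_real_host_page_size();
    void *buf;

    if (posix_memalign(&buf, align, align)) {
        return -1;
    }
    /* buffer, length and file offset are all page-size multiples */
    return pwrite(fd, buf, align, (off_t)page_idx * align);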

> What happens when you enable @direct-io with a migration that cannot use
> O_DIRECT?
>

In this version of the series Daniel suggested that we fail the
migration when the platform has no direct IO support or the migration
setup cannot use it.



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-24 18:12     ` Fabiano Rosas
@ 2023-10-25  5:33       ` Markus Armbruster
  0 siblings, 0 replies; 128+ messages in thread
From: Markus Armbruster @ 2023-10-25  5:33 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Fabiano Rosas <farosas@suse.de> writes:

> Markus Armbruster <armbru@redhat.com> writes:
>
>> Fabiano Rosas <farosas@suse.de> writes:
>>
>>> Add a capability that allows the management layer to delegate to QEMU
>>> the decision of whether to pause a VM and perform a non-live
>>> migration. Depending on the type of migration being performed, this
>>> could bring performance benefits.
>>>
>>> Note that the capability is enabled by default but at this moment no
>>> migration scheme is making use of it.
>>
>> This sounds as if the capability has no effect unless the "migration
>> scheme" (whatever that may be) opts into using it.  Am I confused?
>>
>
> What I mean here is that this capability is implemented and functional,
> but I'm not retroactively enabling any existing migration code to use
> auto-pause. Otherwise people would start seeing their guests being
> paused before migration in scenarios they never used to pause.
>
> By "migration scheme" I mean types of migration. Or modes of
> operation. Or exclusive parameters. Anything that is different enough
> from what exists today that we would consider a different type of
> migration. Anything that allows us to break backward compatibility
> (because it never existed before to begin with).
>
> E.g. this series introduces the fixed-ram migration. That never existed
> before. So from the moment we enable that code to use this capability,
> it will always do auto-pause, unless the management layer wants to avoid
> it.

So auto-pause's *effective* default depends on the migration scheme:
certain new schemes pause by default, everything else doesn't.  Is this
a good idea?

If it is, then we need to document this behavior clearly.

Here's another way to design the interface: keep the default behavior
consistent (no pause), and provide a way to ask for pause (fails if
migration scheme doesn't support it).
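
Something like this, roughly (a sketch; migration_scheme_supports_pause()
is a made-up helper, not something from this series):

    /* opt-in pause: default off, fails when the scheme can't do it */
    if (migrate_auto_pause()) {
        if (!migration_scheme_supports_pause(uri)) {
            error_setg(errp, "this migration scheme does not support "
                       "pausing the VM before migration");
            return;
        }
        global_state_store();
        vm_stop_force_state(RUN_STATE_PAUSED);
    }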

>>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>>
>> [...]
>>
>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>> index db3df12d6c..74f12adc0e 100644
>>> --- a/qapi/migration.json
>>> +++ b/qapi/migration.json
>>> @@ -523,6 +523,10 @@
>>>  #     and can result in more stable read performance.  Requires KVM
>>>  #     with accelerator property "dirty-ring-size" set.  (Since 8.1)
>>>  #
>>> +# @auto-pause: If enabled, allows QEMU to decide whether to pause the
>>> +#     VM before migration for an optimal migration performance.
>>> +#     Enabled by default. (since 8.1)
>>
>> If this needs an opt-in to take effect, it should be documented.
>
> Something like this perhaps?
>
> # @auto-pause: If enabled, allows QEMU to decide whether to pause the VM
> #     before migration for an optimal migration performance. Enabled by
> #     default. New migration code needs to opt-in at
> #     migration_should_pause(), otherwise this behaves as if
> #     disabled. (since 8.2)

Remember, this is user-facing documentation.  Talking about
migration_should_pause() makes no sense there.

Instead, you need to document what @auto-pause does: pause when a
condition specific to the migration scheme is met, and specify the
condition for each migration scheme.  The condition could be "never"
(auto-pause has no effect), "always", or something in between.

A configuration knob that has no effect feels like an interface blemish.
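
For instance, something along these lines (wording illustrative only):

  # @auto-pause: If enabled, QEMU pauses the VM before migration when
  #     the migration scheme implies a stopped VM at the end, e.g. a
  #     "file:" migration.  For all other schemes this capability has
  #     no effect.  Enabled by default.  (since 8.2)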

>>> +#
>>>  # Features:
>>>  #
>>>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>>> @@ -539,7 +543,7 @@
>>>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>>>             'validate-uuid', 'background-snapshot',
>>>             'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
>>> -           'dirty-limit'] }
>>> +           'dirty-limit', 'auto-pause'] }
>>>  
>>>  ##
>>>  # @MigrationCapabilityStatus:



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 14/29] migration/ram: Introduce 'fixed-ram' migration capability
  2023-10-24 18:35     ` Fabiano Rosas
@ 2023-10-25  6:18       ` Markus Armbruster
  0 siblings, 0 replies; 128+ messages in thread
From: Markus Armbruster @ 2023-10-25  6:18 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Fabiano Rosas <farosas@suse.de> writes:

> Markus Armbruster <armbru@redhat.com> writes:
>
>> Fabiano Rosas <farosas@suse.de> writes:
>>
>>> Add a new migration capability 'fixed-ram'.
>>>
>>> The core of the feature is to ensure that each ram page has a specific
>>> offset in the resulting migration stream. The reasons why we'd want
>>> such behavior are twofold:
>>>
>>>  - When doing a 'fixed-ram' migration the resulting file will have a
>>>    bounded size, since pages which are dirtied multiple times will
>>>    always go to a fixed location in the file, rather than constantly
>>>    being added to a sequential stream. This eliminates cases where a vm
>>>    with, say, 1G of ram can result in a migration file that's 10s of
>>>    GBs, provided that the workload constantly redirties memory.
>>>
>>>  - It paves the way to implement DIRECT_IO-enabled save/restore of the
>>>    migration stream as the pages are ensured to be written at aligned
>>>    offsets.
>>>
>>> For now, enabling the capability has no effect. The next couple of
>>> patches implement the core functionality.
>>>
>>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>>> ---
>>>  docs/devel/migration.rst | 14 ++++++++++++++
>>>  migration/options.c      | 37 +++++++++++++++++++++++++++++++++++++
>>>  migration/options.h      |  1 +
>>>  migration/savevm.c       |  1 +
>>>  qapi/migration.json      |  5 ++++-
>>>  5 files changed, 57 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
>>> index c3e1400c0c..6f898b5dbd 100644
>>> --- a/docs/devel/migration.rst
>>> +++ b/docs/devel/migration.rst
>>> @@ -566,6 +566,20 @@ Others (especially either older devices or system devices which for
>>>  some reason don't have a bus concept) make use of the ``instance id``
>>>  for otherwise identically named devices.
>>>  
>>> +Fixed-ram format
>>> +----------------
>>> +
>>> +When the ``fixed-ram`` capability is enabled, a slightly different
>>> +stream format is used for the RAM section. Instead of having a
>>> +sequential stream of pages that follow the RAMBlock headers, the dirty
>>> +pages for a RAMBlock follow its header. This ensures that each RAM
>>> +page has a fixed offset in the resulting migration stream.
>>
>> This requires the migration stream to be seekable, as documented in the
>> QAPI schema below.  I think it's worth documenting here, as well.
>>
>
> Ok.
>
>>> +
>>> +The ``fixed-ram`` capability can be enabled in both source and
>>> +destination with:
>>> +
>>> +    ``migrate_set_capability fixed-ram on``
>>
>> Effect of enabling on the destination?
>>
>> What happens when we enable it only on one end?
>>
>
> qemu-system-x86_64: Capability fixed-ram is off, but received capability is on
> qemu-system-x86_64: load of migration failed: Invalid argument
>
> So I guess the *can* up there should become a *must*.

Makes sense.

>>> +
>>>  Return path
>>>  -----------
>>>  
>>
>> [...]
>>
>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>> index 74f12adc0e..1317dd32ab 100644
>>> --- a/qapi/migration.json
>>> +++ b/qapi/migration.json
>>> @@ -527,6 +527,9 @@
>>>  #     VM before migration for an optimal migration performance.
>>>  #     Enabled by default. (since 8.1)
>>>  #
>>> +# @fixed-ram: Migrate using fixed offsets for each RAM page. Requires
>>
>> Two spaces between sentences for consistency, please.
>>
>>> +#             a seekable transport such as a file.  (since 8.1)
>>
>> What is a migration transport?  migration.json doesn't define the term.
>>
>
> The medium that transports the migration. We are about to define some
> terms in the QAPI series:
>
> [PATCH v15 00/14] migration: Modify 'migrate' and 'migrate-incoming'
> QAPI commands for migration
> https://lore.kernel.org/r/20231023182053.8711-1-farosas@suse.de

Can't find it there offhand.  No need to explain it further to me now,
just make sure it's defined at this point in the series when you respin.

>> Which transports are seekable?
>>
>
> The ones that implement QIO_CHANNEL_FEATURE_SEEKABLE. Currently only
> QIOChannelFile.

Transport seekability needs to be documented clearly.

>> Out of curiosity: what happens if the transport isn't seekable?
>
> We fail the migration. At migration_channels_and_uri_compatible():
>
>     if (migration_needs_seekable_channel() &&
>         !uri_supports_seeking(uri)) {
>         error_setg(errp, "Migration requires seekable transport (e.g. file)");
>         compatible = false;
>     }

Thanks!

>>> +#
>>>  # Features:
>>>  #
>>>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>>> @@ -543,7 +546,7 @@
>>>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>>>             'validate-uuid', 'background-snapshot',
>>>             'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
>>> -           'dirty-limit', 'auto-pause'] }
>>> +           'dirty-limit', 'auto-pause', 'fixed-ram'] }
>>>  
>>>  ##
>>>  # @MigrationCapabilityStatus:



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-24 19:32     ` Fabiano Rosas
@ 2023-10-25  6:23       ` Markus Armbruster
  2023-10-25  8:44       ` Daniel P. Berrangé
  1 sibling, 0 replies; 128+ messages in thread
From: Markus Armbruster @ 2023-10-25  6:23 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Fabiano Rosas <farosas@suse.de> writes:

> Markus Armbruster <armbru@redhat.com> writes:
>
>> Fabiano Rosas <farosas@suse.de> writes:
>>
>>> Add the direct-io migration parameter that tells the migration code to
>>> use O_DIRECT when opening the migration stream file whenever possible.
>>>
>>> This is currently only used for the secondary channels of fixed-ram
>>> migration, which can guarantee that writes are page aligned.
>>>
>>> However the parameter could be made to affect other types of
>>> file-based migrations in the future.
>>>
>>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>>
>> When would you want to enable @direct-io, and when would you want to
>> leave it disabled?
>
> That depends on a performance analysis. You'd generally leave it
> disabled unless there's some indication that the operating system is
> having trouble draining the page cache.
>
> However I don't think QEMU should attempt any kind of prescription in
> that regard.
>
> From the migration implementation perspective, we need to provide
> alignment guarantees on the stream before allowing direct IO to be
> enabled. In this series we're just enabling it for the secondary multifd
> channels which do page-aligned reads/writes.

I'm asking because QEMU provides too many configuration options with too
little guidance on how to use them.

"You'd generally leave it disabled unless there's some indication that
the operating system is having trouble draining the page cache" is
guidance.  It'll be a lot more useful in documentation than in the
mailing list archive ;)

>> What happens when you enable @direct-io with a migration that cannot use
>> O_DIRECT?
>
> In this version of the series Daniel suggested that we fail migration in
> case there's no support for direct IO or the migration doesn't support
> it.

Makes sense.



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-24 19:32     ` Fabiano Rosas
  2023-10-25  6:23       ` Markus Armbruster
@ 2023-10-25  8:44       ` Daniel P. Berrangé
  2023-10-25 14:32         ` Fabiano Rosas
  1 sibling, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25  8:44 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

On Tue, Oct 24, 2023 at 04:32:10PM -0300, Fabiano Rosas wrote:
> Markus Armbruster <armbru@redhat.com> writes:
> 
> > Fabiano Rosas <farosas@suse.de> writes:
> >
> >> Add the direct-io migration parameter that tells the migration code to
> >> use O_DIRECT when opening the migration stream file whenever possible.
> >>
> >> This is currently only used for the secondary channels of fixed-ram
> >> migration, which can guarantee that writes are page aligned.
> >>
> >> However the parameter could be made to affect other types of
> >> file-based migrations in the future.
> >>
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >
> > When would you want to enable @direct-io, and when would you want to
> > leave it disabled?
> 
> That depends on a performance analysis. You'd generally leave it
> disabled unless there's some indication that the operating system is
> having trouble draining the page cache.

That's not the usage model I would suggest.

The biggest value of the page cache comes when it holds data that
will be repeatedly accessed.

When you are saving/restoring a guest to file, that data is used
once only (assuming there's a large gap between save & restore).
By using the page cache to save a big guest we essentially purge
the page cache of most of its existing data that is likely to be
reaccessed, to fill it up with data never to be reaccessed.

I usually describe save/restore operations as trashing the page
cache.

IMHO, mgmt apps should request O_DIRECT always unless they expect
the save/restore operation to run in quick succession, or if they
know that the host has oodles of free RAM such that existing data
in the page cache won't be trashed, or if the host FS does not
support O_DIRECT of course.

> However I don't think QEMU should attempt any kind of prescription in
> that regard.

It shouldn't prescribe it, but I think our docs should encourage
its use where possible.
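
With the parameter from this series, that's presumably just:

  (qemu) migrate_set_parameter direct-io on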


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-23 20:35 ` [PATCH v2 06/29] migration: Add auto-pause capability Fabiano Rosas
  2023-10-24  5:25   ` Markus Armbruster
@ 2023-10-25  8:48   ` Daniel P. Berrangé
  2023-10-25 13:57     ` Fabiano Rosas
  1 sibling, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25  8:48 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Mon, Oct 23, 2023 at 05:35:45PM -0300, Fabiano Rosas wrote:
> Add a capability that allows the management layer to delegate to QEMU
> the decision of whether to pause a VM and perform a non-live
> migration. Depending on the type of migration being performed, this
> could bring performance benefits.

I'm not really seeing what problem this is solving.

Mgmt apps are perfectly capable of pausing the VM before issuing
the migrate operation.

IMHO changing the default pause behaviour when certain migration
capabilities are set creates an unnecessary surprise for mgmt
apps. Having to then also add an extra capability to turn off
this new feature just adds to the migration maint burden.

> 
> Note that the capability is enabled by default but at this moment no
> migration scheme is making use of it.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/migration.c | 19 +++++++++++++++++++
>  migration/options.c   |  9 +++++++++
>  migration/options.h   |  1 +
>  qapi/migration.json   |  6 +++++-
>  4 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index a6efbd837a..8b0c3b0911 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -124,6 +124,20 @@ migration_channels_and_uri_compatible(const char *uri, Error **errp)
>      return true;
>  }
>  
> +static bool migration_should_pause(const char *uri)
> +{
> +    if (!migrate_auto_pause()) {
> +        return false;
> +    }
> +
> +    /*
> +     * Return true for migration schemes that benefit from a nonlive
> +     * migration.
> +     */
> +
> +    return false;
> +}
> +
>  static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
>  {
>      uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
> @@ -1724,6 +1738,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>          }
>      }
>  
> +    if (migration_should_pause(uri)) {
> +        global_state_store();
> +        vm_stop_force_state(RUN_STATE_PAUSED);
> +    }
> +
>      if (strstart(uri, "tcp:", &p) ||
>          strstart(uri, "unix:", NULL) ||
>          strstart(uri, "vsock:", NULL)) {
> diff --git a/migration/options.c b/migration/options.c
> index 42fb818956..c3def757fe 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -200,6 +200,8 @@ Property migration_properties[] = {
>      DEFINE_PROP_MIG_CAP("x-switchover-ack",
>                          MIGRATION_CAPABILITY_SWITCHOVER_ACK),
>      DEFINE_PROP_MIG_CAP("x-dirty-limit", MIGRATION_CAPABILITY_DIRTY_LIMIT),
> +    DEFINE_PROP_BOOL("x-auto-pause", MigrationState,
> +                     capabilities[MIGRATION_CAPABILITY_AUTO_PAUSE], true),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -210,6 +212,13 @@ bool migrate_auto_converge(void)
>      return s->capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
>  }
>  
> +bool migrate_auto_pause(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    return s->capabilities[MIGRATION_CAPABILITY_AUTO_PAUSE];
> +}
> +
>  bool migrate_background_snapshot(void)
>  {
>      MigrationState *s = migrate_get_current();
> diff --git a/migration/options.h b/migration/options.h
> index 237f2d6b4a..d1ba5c9de7 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -24,6 +24,7 @@ extern Property migration_properties[];
>  /* capabilities */
>  
>  bool migrate_auto_converge(void);
> +bool migrate_auto_pause(void);
>  bool migrate_background_snapshot(void);
>  bool migrate_block(void);
>  bool migrate_colo(void);
> diff --git a/qapi/migration.json b/qapi/migration.json
> index db3df12d6c..74f12adc0e 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -523,6 +523,10 @@
>  #     and can result in more stable read performance.  Requires KVM
>  #     with accelerator property "dirty-ring-size" set.  (Since 8.1)
>  #
> +# @auto-pause: If enabled, allows QEMU to decide whether to pause the
> +#     VM before migration for an optimal migration performance.
> +#     Enabled by default. (since 8.1)
> +#
>  # Features:
>  #
>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
> @@ -539,7 +543,7 @@
>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>             'validate-uuid', 'background-snapshot',
>             'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
> -           'dirty-limit'] }
> +           'dirty-limit', 'auto-pause'] }
>  
>  ##
>  # @MigrationCapabilityStatus:
> -- 
> 2.35.3
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-23 20:36 ` [PATCH v2 28/29] migration: Add direct-io parameter Fabiano Rosas
  2023-10-24  5:41   ` Markus Armbruster
  2023-10-24  8:33   ` Daniel P. Berrangé
@ 2023-10-25  9:07   ` Daniel P. Berrangé
  2023-10-25 14:48     ` Fabiano Rosas
  2 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25  9:07 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Mon, Oct 23, 2023 at 05:36:07PM -0300, Fabiano Rosas wrote:
> Add the direct-io migration parameter that tells the migration code to
> use O_DIRECT when opening the migration stream file whenever possible.
> 
> This is currently only used for the secondary channels of fixed-ram
> migration, which can guarantee that writes are page aligned.

When you say "secondary channels", I presume you're meaning that
the bulk memory regions will be written with O_DIRECT, while
general vmstate will use normal I/O on the main channel ?  If so,
could we explain that a little more directly.

Having a mixture of O_DIRECT and non-O_DIRECT I/O on the same
file is a little bit of an unusual situation. It will work for
us because we're writing to different regions of the file in
each case.

Still I wonder if it would be sane in the outgoing case to
include a fsync() on the file in the main channel, to guarantee
that the whole saved file is on-media at completion ? Or perhaps
suggest in QAPI that mgmts might consider doing a fsync
themselves ?
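
Something as simple as this on the outgoing side, once all channels
have finished writing (a sketch, assuming direct access to the main
channel's fd):

  QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);

  /* make sure the whole saved file is on-media */
  if (fsync(fioc->fd) < 0) {
      error_setg_errno(errp, errno, "failed to fsync migration file");
  }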


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 24/29] migration/ram: Ignore multifd flush when doing fixed-ram migration
  2023-10-23 20:36 ` [PATCH v2 24/29] migration/ram: Ignore multifd flush when doing fixed-ram migration Fabiano Rosas
@ 2023-10-25  9:09   ` Daniel P. Berrangé
  0 siblings, 0 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25  9:09 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:36:03PM -0300, Fabiano Rosas wrote:
> Some functionalities of multifd are incompatible with the 'fixed-ram'
> migration format.
> 
> The MULTIFD_FLUSH flag in particular is not used because in fixed-ram
> there is no synchronicity between migration source and destination, so
> there is no need for a sync packet. In fact, fixed-ram disables
> packets in multifd as a whole.
> 
> Make sure RAM_SAVE_FLAG_MULTIFD_FLUSH is never emitted when fixed-ram
> is enabled.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/ram.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 8e34c1b597..3497ed186a 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1386,7 +1386,7 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
>          pss->page = 0;
>          pss->block = QLIST_NEXT_RCU(pss->block, next);
>          if (!pss->block) {
> -            if (migrate_multifd() &&
> +            if (!migrate_fixed_ram() && migrate_multifd() &&

If I'm nitpicking I would put migrate_multifd() first in the
conditional, because fixed-ram is a sub-feature of multifd.
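
i.e.:

    if (migrate_multifd() && !migrate_fixed_ram() &&
        !migrate_multifd_flush_after_each_section()) {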

Either way though

  Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


>                  !migrate_multifd_flush_after_each_section()) {
>                  QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
>                  int ret = multifd_send_sync_main(f);
> -- 
> 2.35.3
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 25/29] migration/multifd: Support outgoing fixed-ram stream format
  2023-10-23 20:36 ` [PATCH v2 25/29] migration/multifd: Support outgoing fixed-ram stream format Fabiano Rosas
@ 2023-10-25  9:23   ` Daniel P. Berrangé
  2023-10-25 14:21     ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25  9:23 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:36:04PM -0300, Fabiano Rosas wrote:
> The new fixed-ram stream format uses a file transport and puts ram
> pages in the migration file at their respective offsets. This can be
> done in parallel by using the pwritev system call, which takes iovecs
> and an offset.
> 
> Add support for enabling the new format along with multifd to make use
> of the threading and page handling already in place.
> 
> This requires multifd to stop sending headers and leave the stream
> format to the fixed-ram code. When it comes time to write the data, we
> need to call a version of qio_channel_write that can take an offset.
> 
> Usage on HMP is:
> 
> (qemu) stop
> (qemu) migrate_set_capability multifd on
> (qemu) migrate_set_capability fixed-ram on
> (qemu) migrate_set_parameter max-bandwidth 0
> (qemu) migrate_set_parameter multifd-channels 8
> (qemu) migrate file:migfile
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  include/qemu/bitops.h | 13 ++++++++++
>  migration/multifd.c   | 55 +++++++++++++++++++++++++++++++++++++++++--
>  migration/options.c   |  6 -----
>  migration/ram.c       |  2 +-
>  4 files changed, 67 insertions(+), 9 deletions(-)
> 
> diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
> index cb3526d1f4..2c0a2fe751 100644
> --- a/include/qemu/bitops.h
> +++ b/include/qemu/bitops.h
> @@ -67,6 +67,19 @@ static inline void clear_bit(long nr, unsigned long *addr)
>      *p &= ~mask;
>  }
>  
> +/**
> + * clear_bit_atomic - Clears a bit in memory atomically
> + * @nr: Bit to clear
> + * @addr: Address to start counting from
> + */
> +static inline void clear_bit_atomic(long nr, unsigned long *addr)
> +{
> +    unsigned long mask = BIT_MASK(nr);
> +    unsigned long *p = addr + BIT_WORD(nr);
> +
> +    return qatomic_and(p, ~mask);
> +}
> +
>  /**
>   * change_bit - Toggle a bit in memory
>   * @nr: Bit to change
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 20e8635740..3f95a41ee9 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -260,6 +260,19 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
>      g_free(pages);
>  }
>  
> +static void multifd_set_file_bitmap(MultiFDSendParams *p)
> +{
> +    MultiFDPages_t *pages = p->pages;
> +
> +    if (!pages->block) {
> +        return;
> +    }
> +
> +    for (int i = 0; i < p->normal_num; i++) {
> +        ramblock_set_shadow_bmap_atomic(pages->block, pages->offset[i]);
> +    }
> +}
> +
>  static void multifd_send_fill_packet(MultiFDSendParams *p)
>  {
>      MultiFDPacket_t *packet = p->packet;
> @@ -606,6 +619,29 @@ int multifd_send_sync_main(QEMUFile *f)
>          }
>      }
>  
> +    if (!migrate_multifd_packets()) {
> +        /*
> +         * There's no sync packet to send. Just make sure the sending
> +         * above has finished.
> +         */
> +        for (i = 0; i < migrate_multifd_channels(); i++) {
> +            qemu_sem_wait(&multifd_send_state->channels_ready);
> +        }
> +
> +        /* sanity check and release the channels */
> +        for (i = 0; i < migrate_multifd_channels(); i++) {
> +            MultiFDSendParams *p = &multifd_send_state->params[i];
> +
> +            qemu_mutex_lock(&p->mutex);
> +            assert(!p->pending_job || p->quit);
> +            qemu_mutex_unlock(&p->mutex);
> +
> +            qemu_sem_post(&p->sem);
> +        }
> +
> +        return 0;
> +    }
> +
>      /*
>       * When using zero-copy, it's necessary to flush the pages before any of
>       * the pages can be sent again, so we'll make sure the new version of the
> @@ -689,6 +725,8 @@ static void *multifd_send_thread(void *opaque)
>  
>          if (p->pending_job) {
>              uint32_t flags;
> +            uint64_t write_base;
> +
>              p->normal_num = 0;
>  
>              if (!use_packets || use_zero_copy_send) {
> @@ -713,6 +751,16 @@ static void *multifd_send_thread(void *opaque)
>              if (use_packets) {
>                  multifd_send_fill_packet(p);
>                  p->num_packets++;
> +                write_base = 0;
> +            } else {
> +                multifd_set_file_bitmap(p);
> +
> +                /*
> +                 * If we subtract the host page now, we don't need to
> +                 * pass it into qio_channel_write_full_all() below.
> +                 */
> +                write_base = p->pages->block->pages_offset -
> +                    (uint64_t)p->pages->block->host;
>              }
>  
>              flags = p->flags;
> @@ -738,8 +786,9 @@ static void *multifd_send_thread(void *opaque)
>                  p->iov[0].iov_base = p->packet;
>              }
>  
> -            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
> -                                              0, p->write_flags, &local_err);
> +            ret = qio_channel_write_full_all(p->c, p->iov, p->iovs_num,
> +                                             write_base, NULL, 0,
> +                                             p->write_flags, &local_err);
>              if (ret != 0) {
>                  break;
>              }
> @@ -969,6 +1018,8 @@ int multifd_save_setup(Error **errp)
>  
>          if (migrate_zero_copy_send()) {
>              p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
> +        } else if (!use_packets) {
> +            p->write_flags |= QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET;
>          } else {
>              p->write_flags = 0;
>          }

Ah, so this is why you had the weird overloaded design for
qio_channel_write_full_all in patch 22 that I queried. I'd
still prefer the simpler design at the QIO level, and just
call the appropriate function above.
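
i.e. something like this, where qio_channel_pwritev_all() stands in
for whatever the simpler offset-taking API ends up being called:

    if (use_packets) {
        ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num,
                                          NULL, 0, p->write_flags,
                                          &local_err);
    } else {
        ret = qio_channel_pwritev_all(p->c, p->iov, p->iovs_num,
                                      write_base, &local_err);
    }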

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 29/29] tests/qtest: Add a test for migration with direct-io and multifd
  2023-10-23 20:36 ` [PATCH v2 29/29] tests/qtest: Add a test for migration with direct-io and multifd Fabiano Rosas
@ 2023-10-25  9:25   ` Daniel P. Berrangé
  0 siblings, 0 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25  9:25 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini

On Mon, Oct 23, 2023 at 05:36:08PM -0300, Fabiano Rosas wrote:
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  tests/qtest/migration-test.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index c74c911283..30e70c0e4e 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -2077,6 +2077,16 @@ static void *migrate_multifd_fixed_ram_start(QTestState *from, QTestState *to)
>      return NULL;
>  }
>  
> +static void *migrate_multifd_fixed_ram_dio_start(QTestState *from, QTestState *to)
> +{
> +    migrate_multifd_fixed_ram_start(from, to);
> +
> +    migrate_set_parameter_bool(from, "direct-io", true);
> +    migrate_set_parameter_bool(to, "direct-io", true);
> +
> +    return NULL;
> +}
> +
>  static void test_multifd_file_fixed_ram_live(void)
>  {
>      g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
> @@ -2103,6 +2113,18 @@ static void test_multifd_file_fixed_ram(void)
>      test_file_common(&args, false, true);
>  }
>  
> +static void test_multifd_file_fixed_ram_dio(void)
> +{
> +    g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
> +                                           FILE_TEST_FILENAME);
> +    MigrateCommon args = {
> +        .connect_uri = uri,
> +        .listen_uri = "defer",
> +        .start_hook = migrate_multifd_fixed_ram_dio_start,
> +    };
> +
> +    test_file_common(&args, false, true);
> +}
>  
>  static void test_precopy_tcp_plain(void)
>  {
> @@ -3182,6 +3204,9 @@ int main(int argc, char **argv)
>      qtest_add_func("/migration/multifd/file/fixed-ram/live",
>                     test_multifd_file_fixed_ram_live);
>  
> +    qtest_add_func("/migration/multifd/file/fixed-ram/dio",
> +                   test_multifd_file_fixed_ram_dio);

All of the above should be put behind an #ifdef O_DIRECT check.

Also, even if the platform supports O_DIRECT, we shouldn't assume
the filesystem used for the QEMU build supports it. So we need to
open a tmp file at the same place as the save/restore and probe
to see if it succeeds with O_DIRECT, and skip the test if not.
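
e.g. something along these lines (a sketch):

  static bool probe_o_direct(const char *dir)
  {
  #ifdef O_DIRECT
      g_autofree char *path = g_strdup_printf("%s/probe-o-direct", dir);
      int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0660);
      bool supported = fd >= 0;

      if (supported) {
          close(fd);
      }
      unlink(path);
      return supported;
  #else
      return false;
  #endif
  }

and then in the test:

      if (!probe_o_direct(tmpfs)) {
          g_test_skip("O_DIRECT not supported by the filesystem");
          return;
      }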

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration
  2023-10-23 20:35 ` [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration Fabiano Rosas
@ 2023-10-25  9:39   ` Daniel P. Berrangé
  2023-10-25 14:03     ` Fabiano Rosas
  2023-11-01 15:23     ` Peter Xu
  2023-10-31 16:52   ` Peter Xu
  1 sibling, 2 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25  9:39 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé

On Mon, Oct 23, 2023 at 05:35:54PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
> 
> Implement the outgoing migration side for the 'fixed-ram' capability.
> 
> A bitmap is introduced to track which pages have been written in the
> migration file. Pages are written at a fixed location for every
> ramblock. Zero pages are ignored as they'd be zero in the destination
> migration as well.
> 
> The migration stream is altered to put the dirty pages for a ramblock
> after its header instead of having a sequential stream of pages that
> follow the ramblock headers. Since all pages have a fixed location,
> RAM_SAVE_FLAG_EOS is no longer generated on every migration iteration.
> 
> Without fixed-ram (current):
> 
> ramblock 1 header|ramblock 2 header|...|RAM_SAVE_FLAG_EOS|stream of
>  pages (iter 1)|RAM_SAVE_FLAG_EOS|stream of pages (iter 2)|...

Clearer to diagram this vertically, I think:

 ---------------------
 | ramblock 1 header |
 ---------------------
 | ramblock 2 header |
 ---------------------
 | ...               |
 ---------------------
 | ramblock n header |
 ---------------------
 | RAM_SAVE_FLAG_EOS |
 ---------------------
 | stream of pages   |
 | (iter 1)          |
 | ...               |
 ---------------------
 | RAM_SAVE_FLAG_EOS |
 ---------------------
 | stream of pages   |
 | (iter 2)          |
 | ...               |
 ---------------------
 | ...               |
 ---------------------
 

> 
> With fixed-ram (new):
> 
> ramblock 1 header|ramblock 1 fixed-ram header|ramblock 1 pages (fixed
>  offsets)|ramblock 2 header|ramblock 2 fixed-ram header|ramblock 2
>  pages (fixed offsets)|...|RAM_SAVE_FLAG_EOS

If I'm reading the code correctly the new format has some padding
such that each "ramblock pages" region starts on a 1 MB boundary.

eg so we get:

 --------------------------------
 | ramblock 1 header            |
 --------------------------------
 | ramblock 1 fixed-ram header  |
 --------------------------------
 | padding to next 1MB boundary |
 | ...                          |
 --------------------------------
 | ramblock 1 pages             |
 | ...                          |
 --------------------------------
 | ramblock 2 header            |
 --------------------------------
 | ramblock 2 fixed-ram header  |
 --------------------------------
 | padding to next 1MB boundary |
 | ...                          |
 --------------------------------
 | ramblock 2 pages             |
 | ...                          |
 --------------------------------
 | ...                          |
 --------------------------------
 | RAM_SAVE_FLAG_EOS            |
 --------------------------------
 | ...                          |
 -------------------------------

> 
> where:
>  - ramblock header: the generic information for a ramblock, such as
>    idstr, used_len, etc.
> 
>  - ramblock fixed-ram header: the new information added by this
>    feature: bitmap of pages written, bitmap size and offset of pages
>    in the migration file.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  include/exec/ramblock.h |  8 ++++
>  migration/options.c     |  3 --
>  migration/ram.c         | 98 ++++++++++++++++++++++++++++++++++++-----
>  3 files changed, 96 insertions(+), 13 deletions(-)
> 
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 69c6a53902..e0e3f16852 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -44,6 +44,14 @@ struct RAMBlock {
>      size_t page_size;
>      /* dirty bitmap used during migration */
>      unsigned long *bmap;
> +    /* shadow dirty bitmap used when migrating to a file */
> +    unsigned long *shadow_bmap;
> +    /*
> +     * offset in the file pages belonging to this ramblock are saved,
> +     * used only during migration to a file.
> +     */
> +    off_t bitmap_offset;
> +    uint64_t pages_offset;
>      /* bitmap of already received pages in postcopy */
>      unsigned long *receivedmap;
>  
> diff --git a/migration/options.c b/migration/options.c
> index 2622d8c483..9f693d909f 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -271,12 +271,9 @@ bool migrate_events(void)
>  
>  bool migrate_fixed_ram(void)
>  {
> -/*
>      MigrationState *s = migrate_get_current();
>  
>      return s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
> -*/
> -    return false;
>  }
>  
>  bool migrate_ignore_shared(void)
> diff --git a/migration/ram.c b/migration/ram.c
> index 92769902bb..152a03604f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1157,12 +1157,18 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
>          return 0;
>      }
>  
> +    stat64_add(&mig_stats.zero_pages, 1);
> +
> +    if (migrate_fixed_ram()) {
> +        /* zero pages are not transferred with fixed-ram */
> +        clear_bit(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
> +        return 1;
> +    }
> +
>      len += save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
>      qemu_put_byte(file, 0);
>      len += 1;
>      ram_release_page(pss->block->idstr, offset);
> -
> -    stat64_add(&mig_stats.zero_pages, 1);
>      ram_transferred_add(len);
>  
>      /*
> @@ -1220,14 +1226,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
>  {
>      QEMUFile *file = pss->pss_channel;
>  
> -    ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
> -                                         offset | RAM_SAVE_FLAG_PAGE));
> -    if (async) {
> -        qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> -                              migrate_release_ram() &&
> -                              migration_in_postcopy());
> +    if (migrate_fixed_ram()) {
> +        qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
> +                           block->pages_offset + offset);
> +        set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
>      } else {
> -        qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> +        ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
> +                                             offset | RAM_SAVE_FLAG_PAGE));
> +        if (async) {
> +            qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> +                                  migrate_release_ram() &&
> +                                  migration_in_postcopy());
> +        } else {
> +            qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> +        }
>      }
>      ram_transferred_add(TARGET_PAGE_SIZE);
>      stat64_add(&mig_stats.normal_pages, 1);
> @@ -2475,6 +2487,8 @@ static void ram_save_cleanup(void *opaque)
>          block->clear_bmap = NULL;
>          g_free(block->bmap);
>          block->bmap = NULL;
> +        g_free(block->shadow_bmap);
> +        block->shadow_bmap = NULL;
>      }
>  
>      xbzrle_cleanup();
> @@ -2842,6 +2856,7 @@ static void ram_list_init_bitmaps(void)
>               */
>              block->bmap = bitmap_new(pages);
>              bitmap_set(block->bmap, 0, pages);
> +            block->shadow_bmap = bitmap_new(block->used_length >> TARGET_PAGE_BITS);
>              block->clear_bmap_shift = shift;
>              block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
>          }
> @@ -2979,6 +2994,44 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
>      }
>  }
>  
> +#define FIXED_RAM_HDR_VERSION 1
> +struct FixedRamHeader {
> +    uint32_t version;
> +    uint64_t page_size;
> +    uint64_t bitmap_offset;
> +    uint64_t pages_offset;
> +    /* end of v1 */
> +} QEMU_PACKED;
> +
> +static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
> +{
> +    g_autofree struct FixedRamHeader *header;
> +    size_t header_size, bitmap_size;
> +    long num_pages;
> +
> +    header = g_new0(struct FixedRamHeader, 1);
> +    header_size = sizeof(struct FixedRamHeader);
> +
> +    num_pages = block->used_length >> TARGET_PAGE_BITS;
> +    bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> +
> +    /*
> +     * Save the file offsets of where the bitmap and the pages should
> +     * go as they are written at the end of migration and during the
> +     * iterative phase, respectively.
> +     */
> +    block->bitmap_offset = qemu_get_offset(file) + header_size;
> +    block->pages_offset = ROUND_UP(block->bitmap_offset +
> +                                   bitmap_size, 0x100000);

The 0x100000 gives us our 1 MB alignment.

This is quite an important thing, so can we put a nice clear
constant at the top of the file:

  /* Align fixed-ram pages data to the next 1 MB boundary */
  #define FIXED_RAM_PAGE_OFFSET_ALIGNMENT 0x100000

> @@ -3179,7 +3252,6 @@ out:
>          qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>          qemu_fflush(f);
>          ram_transferred_add(8);
> -
>          ret = qemu_file_get_error(f);
>      }
>      if (ret < 0) {

Spurious whitespace change, possibly better in whatever patch
introduced it, or dropped ?


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-10-23 20:35 ` [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore Fabiano Rosas
@ 2023-10-25  9:43   ` Daniel P. Berrangé
  2023-10-25 14:07     ` Fabiano Rosas
  2023-10-31 19:09   ` Peter Xu
  1 sibling, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25  9:43 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

On Mon, Oct 23, 2023 at 05:35:55PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
> 
> Add the necessary code to parse the format changes for the 'fixed-ram'
> capability.
> 
> One of the more notable changes in behavior is that in the 'fixed-ram'
> case ram pages are restored in one go rather than constantly looping
> through the migration stream.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> (farosas) reused more of the common code by making the fixed-ram
> function take only one ramblock and calling it from inside
> parse_ramblock.
> ---
>  migration/ram.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 93 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 152a03604f..cea6971ab2 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3032,6 +3032,32 @@ static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
>      qemu_put_buffer(file, (uint8_t *) header, header_size);
>  }
>  
> +static int fixed_ram_read_header(QEMUFile *file, struct FixedRamHeader *header)
> +{
> +    size_t ret, header_size = sizeof(struct FixedRamHeader);
> +
> +    ret = qemu_get_buffer(file, (uint8_t *)header, header_size);
> +    if (ret != header_size) {
> +        return -1;
> +    }
> +
> +    /* migration stream is big-endian */
> +    be32_to_cpus(&header->version);
> +
> +    if (header->version > FIXED_RAM_HDR_VERSION) {
> +        error_report("Migration fixed-ram capability version mismatch (expected %d, got %d)",
> +                     FIXED_RAM_HDR_VERSION, header->version);
> +        return -1;
> +    }
> +
> +    be64_to_cpus(&header->page_size);
> +    be64_to_cpus(&header->bitmap_offset);
> +    be64_to_cpus(&header->pages_offset);
> +
> +
> +    return 0;
> +}
> +
>  /*
>   * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
>   * long-running RCU critical section.  When rcu-reclaims in the code
> @@ -3932,6 +3958,68 @@ void colo_flush_ram_cache(void)
>      trace_colo_flush_ram_cache_end();
>  }
>  
> +static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
> +                                    long num_pages, unsigned long *bitmap)
> +{
> +    unsigned long set_bit_idx, clear_bit_idx;
> +    unsigned long len;
> +    ram_addr_t offset;
> +    void *host;
> +    size_t read, completed, read_len;
> +
> +    for (set_bit_idx = find_first_bit(bitmap, num_pages);
> +         set_bit_idx < num_pages;
> +         set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
> +
> +        clear_bit_idx = find_next_zero_bit(bitmap, num_pages, set_bit_idx + 1);
> +
> +        len = TARGET_PAGE_SIZE * (clear_bit_idx - set_bit_idx);
> +        offset = set_bit_idx << TARGET_PAGE_BITS;
> +
> +        for (read = 0, completed = 0; completed < len; offset += read) {
> +            host = host_from_ram_block_offset(block, offset);
> +            read_len = MIN(len, TARGET_PAGE_SIZE);
> +
> +            read = qemu_get_buffer_at(f, host, read_len,
> +                                      block->pages_offset + offset);
> +            completed += read;
> +        }
> +    }
> +}
> +
> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
> +{
> +    g_autofree unsigned long *bitmap = NULL;
> +    struct FixedRamHeader header;
> +    size_t bitmap_size;
> +    long num_pages;
> +    int ret = 0;
> +
> +    ret = fixed_ram_read_header(f, &header);
> +    if (ret < 0) {
> +        error_report("Error reading fixed-ram header");
> +        return -EINVAL;
> +    }
> +
> +    block->pages_offset = header.pages_offset;

Do you think it is worth sanity checking that 'pages_offset' is aligned
in some way?

It is nice that we have the flexibility to change the alignment in the
future if we find 1 MB is not optimal, so I wouldn't want to force a
1 MB alignment check there. Perhaps we could at least sanity check for
alignment at TARGET_PAGE_SIZE, to detect a gross data corruption
problem ?
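
e.g. (a sketch):

    if (!QEMU_IS_ALIGNED(header.pages_offset, TARGET_PAGE_SIZE)) {
        error_report("Corrupted fixed-ram pages_offset: 0x%" PRIx64,
                     header.pages_offset);
        return -EINVAL;
    }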

> +    num_pages = length / header.page_size;
> +    bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> +
> +    bitmap = g_malloc0(bitmap_size);
> +    if (qemu_get_buffer_at(f, (uint8_t *)bitmap, bitmap_size,
> +                           header.bitmap_offset) != bitmap_size) {
> +        error_report("Error parsing dirty bitmap");

s/parsing/reading/ since we're not actually parsing any semantic
info here.

> +        return -EINVAL;
> +    }
> +
> +    read_ramblock_fixed_ram(f, block, num_pages, bitmap);
> +
> +    /* Skip pages array */
> +    qemu_set_offset(f, block->pages_offset + length, SEEK_SET);
> +
> +    return ret;
> +}
> +
>  static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
>  {
>      int ret = 0;
> @@ -3940,6 +4028,10 @@ static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
>  
>      assert(block);
>  
> +    if (migrate_fixed_ram()) {
> +        return parse_ramblock_fixed_ram(f, block, length);
> +    }
> +
>      if (!qemu_ram_is_migratable(block)) {
>          error_report("block %s should not be migrated !", block->idstr);
>          return -EINVAL;
> @@ -4142,6 +4234,7 @@ static int ram_load_precopy(QEMUFile *f)
>                  migrate_multifd_flush_after_each_section()) {
>                  multifd_recv_sync_main();
>              }
> +
>              break;

Spurious whitespace


>          case RAM_SAVE_FLAG_HOOK:
>              ret = rdma_registration_handle(f);
> -- 
> 2.35.3
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 01/29] tests/qtest: migration events
  2023-10-23 20:35 ` [PATCH v2 01/29] tests/qtest: migration events Fabiano Rosas
@ 2023-10-25  9:44   ` Thomas Huth
  2023-10-25 10:14   ` Daniel P. Berrangé
  2023-10-25 13:21   ` Fabiano Rosas
  2 siblings, 0 replies; 128+ messages in thread
From: Thomas Huth @ 2023-10-25  9:44 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Steve Sistare, Laurent Vivier, Paolo Bonzini

On 23/10/2023 22.35, Fabiano Rosas wrote:
> From: Steve Sistare <steven.sistare@oracle.com>
> 
> Define a state object to capture events seen by migration tests, to allow
> more events to be captured in a subsequent patch, and simplify event
> checking in wait_for_migration_pass.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   tests/qtest/migration-helpers.c | 24 ++++-------
>   tests/qtest/migration-helpers.h |  8 ++--
>   tests/qtest/migration-test.c    | 74 +++++++++++++++------------------
>   3 files changed, 46 insertions(+), 60 deletions(-)
> 
> diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
> index 24fb7b3525..fd3b94efa2 100644
> --- a/tests/qtest/migration-helpers.c
> +++ b/tests/qtest/migration-helpers.c
> @@ -24,26 +24,16 @@
>    */
>   #define MIGRATION_STATUS_WAIT_TIMEOUT 120
>   
> -bool migrate_watch_for_stop(QTestState *who, const char *name,
> -                            QDict *event, void *opaque)
> -{
> -    bool *seen = opaque;
> -
> -    if (g_str_equal(name, "STOP")) {
> -        *seen = true;
> -        return true;
> -    }
> -
> -    return false;
> -}
> -
> -bool migrate_watch_for_resume(QTestState *who, const char *name,
> +bool migrate_watch_for_events(QTestState *who, const char *name,
>                                 QDict *event, void *opaque)
>   {
> -    bool *seen = opaque;
> +    QTestMigrationState *state = opaque;
>   
> -    if (g_str_equal(name, "RESUME")) {
> -        *seen = true;
> +    if (g_str_equal(name, "STOP")) {
> +        state->stop_seen = true;
> +        return true;
> +    } else if (g_str_equal(name, "RESUME")) {
> +        state->resume_seen = true;
>           return true;
>       }
>   
> diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
> index e31dc85cc7..c1d4c84995 100644
> --- a/tests/qtest/migration-helpers.h
> +++ b/tests/qtest/migration-helpers.h
> @@ -15,9 +15,11 @@
>   
>   #include "libqtest.h"
>   
> -bool migrate_watch_for_stop(QTestState *who, const char *name,
> -                            QDict *event, void *opaque);
> -bool migrate_watch_for_resume(QTestState *who, const char *name,
> +typedef struct QTestMigrationState {
> +    bool stop_seen, resume_seen;

Just a matter of taste, but for struct definitions, I'd prefer to put
each entry on its own line instead.

> +} QTestMigrationState;
> +
> +bool migrate_watch_for_events(QTestState *who, const char *name,
>                                 QDict *event, void *opaque);

  Thomas




^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support
  2023-10-23 20:35 ` [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
@ 2023-10-25  9:52   ` Daniel P. Berrangé
  2023-10-25 14:12     ` Fabiano Rosas
  2023-10-31 20:11   ` Peter Xu
  1 sibling, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25  9:52 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:35:58PM -0300, Fabiano Rosas wrote:
> Allow multifd to open file-backed channels. This will be used when
> enabling the fixed-ram migration stream format which expects a
> seekable transport.
> 
> The QIOChannel read and write methods will use the preadv/pwritev
> versions which don't update the file offset at each call so we can
> reuse the fd without re-opening for every channel.
> 
> Note that this is just setup code and multifd cannot yet make use of
> the file channels.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/file.c      | 64 +++++++++++++++++++++++++++++++++++++++++--
>  migration/file.h      | 10 +++++--
>  migration/migration.c |  2 +-
>  migration/multifd.c   | 14 ++++++++--
>  migration/options.c   |  7 +++++
>  migration/options.h   |  1 +
>  6 files changed, 90 insertions(+), 8 deletions(-)
> 
> diff --git a/migration/file.c b/migration/file.c
> index cf5b1bf365..93b9b7bf5d 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -17,6 +17,12 @@

> +void file_send_channel_create(QIOTaskFunc f, void *data)
> +{
> +    QIOChannelFile *ioc;
> +    QIOTask *task;
> +    Error *errp = NULL;
> +
> +    ioc = qio_channel_file_new_path(outgoing_args.fname,
> +                                    outgoing_args.flags,
> +                                    outgoing_args.mode, &errp);
> +    if (!ioc) {
> +        file_migration_cancel(errp);
> +        return;
> +    }
> +
> +    task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
> +    qio_task_run_in_thread(task, qio_channel_file_connect_worker,
> +                           (gpointer)data, NULL, NULL);
> +}
> +
>  void file_start_outgoing_migration(MigrationState *s, const char *filespec,
>                                     Error **errp)
>  {
> -    g_autofree char *filename = g_strdup(filespec);
>      g_autoptr(QIOChannelFile) fioc = NULL;
> +    g_autofree char *filename = g_strdup(filespec);
>      uint64_t offset = 0;
>      QIOChannel *ioc;
> +    int flags = O_CREAT | O_TRUNC | O_WRONLY;
> +    mode_t mode = 0660;
>  
>      trace_migration_file_outgoing(filename);
>  
> @@ -50,12 +105,15 @@ void file_start_outgoing_migration(MigrationState *s, const char *filespec,
>          return;
>      }
>  
> -    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
> -                                     0600, errp);
> +    fioc = qio_channel_file_new_path(filename, flags, mode, errp);

So this initially opens the file with O_CREAT|O_TRUNC, which
makes sense.

>      if (!fioc) {
>          return;
>      }
>  
> +    outgoing_args.fname = g_strdup(filename);
> +    outgoing_args.flags = flags;
> +    outgoing_args.mode = mode;

We're passing on O_CREAT|O_TRUNC to all the multifd threads too. This
doesn't make sense to me - the file should already exist and be truncated
by the time the threads open it. I would think they should only be using
O_WRONLY and no mode at all.
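
i.e. for the secondary channels, just (sketch):

    /* the main channel has already created and truncated the file */
    ioc = qio_channel_file_new_path(outgoing_args.fname,
                                    O_WRONLY, 0, &errp);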


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 01/29] tests/qtest: migration events
  2023-10-23 20:35 ` [PATCH v2 01/29] tests/qtest: migration events Fabiano Rosas
  2023-10-25  9:44   ` Thomas Huth
@ 2023-10-25 10:14   ` Daniel P. Berrangé
  2023-10-25 13:21   ` Fabiano Rosas
  2 siblings, 0 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 10:14 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Steve Sistare, Thomas Huth, Laurent Vivier,
	Paolo Bonzini

On Mon, Oct 23, 2023 at 05:35:40PM -0300, Fabiano Rosas wrote:
> From: Steve Sistare <steven.sistare@oracle.com>
> 
> Define a state object to capture events seen by migration tests, to allow
> more events to be captured in a subsequent patch, and simplify event
> checking in wait_for_migration_pass.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  tests/qtest/migration-helpers.c | 24 ++++-------
>  tests/qtest/migration-helpers.h |  8 ++--
>  tests/qtest/migration-test.c    | 74 +++++++++++++++------------------
>  3 files changed, 46 insertions(+), 60 deletions(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 02/29] tests/qtest: Move QTestMigrationState to libqtest
  2023-10-23 20:35 ` [PATCH v2 02/29] tests/qtest: Move QTestMigrationState to libqtest Fabiano Rosas
@ 2023-10-25 10:17   ` Daniel P. Berrangé
  2023-10-25 13:19     ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 10:17 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini

On Mon, Oct 23, 2023 at 05:35:41PM -0300, Fabiano Rosas wrote:
> Move the QTestMigrationState into QTestState so we don't have to pass
> it around to the wait_for_* helpers anymore. Since QTestState is
> private to libqtest.c, move the migration state struct to libqtest.h
> and add a getter.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  tests/qtest/libqtest.c          | 14 ++++++++++
>  tests/qtest/libqtest.h          | 23 ++++++++++++++++
>  tests/qtest/migration-helpers.c | 18 +++++++++++++
>  tests/qtest/migration-helpers.h |  8 +++---
>  tests/qtest/migration-test.c    | 47 +++++++++------------------------
>  5 files changed, 72 insertions(+), 38 deletions(-)
> 
> diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c
> index f33a210861..f7e85486dc 100644
> --- a/tests/qtest/libqtest.c
> +++ b/tests/qtest/libqtest.c
> @@ -87,6 +87,7 @@ struct QTestState
>      GList *pending_events;
>      QTestQMPEventCallback eventCB;
>      void *eventData;
> +    QTestMigrationState *migration_state;

It feels wrong to have something called MigrationState in the
general qtest code, even though in the end there's nothing
particularly migration-related about this struct.

With that in mind, we could just rename it to "QTestEventState"
instead.

>  };
>  
>  static GHookList abrt_hooks;
> @@ -500,6 +501,8 @@ static QTestState *qtest_init_internal(const char *qemu_bin,
>          s->irq_level[i] = false;
>      }
>  
> +    s->migration_state = g_new0(QTestMigrationState, 1);
> +
>      /*
>       * Stopping QEMU for debugging is not supported on Windows.
>       *
> @@ -601,6 +604,7 @@ void qtest_quit(QTestState *s)
>      close(s->fd);
>      close(s->qmp_fd);
>      g_string_free(s->rx, true);
> +    g_free(s->migration_state);
>  
>      for (GList *it = s->pending_events; it != NULL; it = it->next) {
>          qobject_unref((QDict *)it->data);
> @@ -854,6 +858,11 @@ void qtest_qmp_set_event_callback(QTestState *s,
>      s->eventData = opaque;
>  }
>  
> +void qtest_qmp_set_migration_callback(QTestState *s, QTestQMPEventCallback cb)
> +{
> +    qtest_qmp_set_event_callback(s, cb, s->migration_state);
> +}
> +
>  QDict *qtest_qmp_event_ref(QTestState *s, const char *event)
>  {
>      while (s->pending_events) {
> @@ -1906,3 +1915,8 @@ bool mkimg(const char *file, const char *fmt, unsigned size_mb)
>  
>      return ret && !err;
>  }
> +
> +QTestMigrationState *qtest_migration_state(QTestState *s)
> +{
> +    return s->migration_state;
> +}
> diff --git a/tests/qtest/libqtest.h b/tests/qtest/libqtest.h
> index 6e3d3525bf..0421a1da24 100644
> --- a/tests/qtest/libqtest.h
> +++ b/tests/qtest/libqtest.h
> @@ -23,6 +23,20 @@
>  
>  typedef struct QTestState QTestState;
>  
> +struct QTestMigrationState {
> +    bool stop_seen;
> +    bool resume_seen;
> +};
> +typedef struct QTestMigrationState QTestMigrationState;
> +
> +/**
> + * qtest_migration_state:
> + * @s: #QTestState instance to operate on.
> + *
> + * Returns: #QTestMigrationState instance.
> + */
> +QTestMigrationState *qtest_migration_state(QTestState *s);
> +
>  /**
>   * qtest_initf:
>   * @fmt: Format for creating other arguments to pass to QEMU, formatted
> @@ -288,6 +302,15 @@ typedef bool (*QTestQMPEventCallback)(QTestState *s, const char *name,
>  void qtest_qmp_set_event_callback(QTestState *s,
>                                    QTestQMPEventCallback cb, void *opaque);
>  
> +/**
> + * qtest_qmp_set_migration_callback:
> + * @s: #QTestSTate instance to operate on
> + * @cb: callback to invoke for events
> + *
> + * Like qtest_qmp_set_event_callback, but includes migration state events
> + */
> +void qtest_qmp_set_migration_callback(QTestState *s, QTestQMPEventCallback cb);
> +
>  /**
>   * qtest_qmp_eventwait:
>   * @s: #QTestState instance to operate on.
> diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
> index fd3b94efa2..cffa525c81 100644
> --- a/tests/qtest/migration-helpers.c
> +++ b/tests/qtest/migration-helpers.c
> @@ -92,6 +92,24 @@ void migrate_set_capability(QTestState *who, const char *capability,
>                               capability, value);
>  }
>  
> +void wait_for_stop(QTestState *who)
> +{
> +    QTestMigrationState *state = qtest_migration_state(who);
> +
> +    if (!state->stop_seen) {
> +        qtest_qmp_eventwait(who, "STOP");
> +    }
> +}
> +
> +void wait_for_resume(QTestState *who)
> +{
> +    QTestMigrationState *state = qtest_migration_state(who);
> +
> +    if (!state->resume_seen) {
> +        qtest_qmp_eventwait(who, "RESUME");
> +    }
> +}

I'd be inclined to put them into the libqtest.c file too, eg

  qtest_wait_for_resume/qtest_wait_for_stop
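
Since QTestState is private to libqtest.c, the wrappers could then
poke at the state directly - a minimal sketch, assuming the
s->migration_state field added by this patch:

    void qtest_wait_for_stop(QTestState *s)
    {
        if (!s->migration_state->stop_seen) {
            qtest_qmp_eventwait(s, "STOP");
        }
    }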

> +
>  void migrate_incoming_qmp(QTestState *to, const char *uri, const char *fmt, ...)
>  {
>      va_list ap;
> diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
> index c1d4c84995..7297f1ff2c 100644
> --- a/tests/qtest/migration-helpers.h
> +++ b/tests/qtest/migration-helpers.h
> @@ -15,13 +15,13 @@
>  
>  #include "libqtest.h"
>  
> -typedef struct QTestMigrationState {
> -    bool stop_seen, resume_seen;
> -} QTestMigrationState;
> -
>  bool migrate_watch_for_events(QTestState *who, const char *name,
>                                QDict *event, void *opaque);
>  
> +
> +void wait_for_stop(QTestState *who);
> +void wait_for_resume(QTestState *who);
> +
>  G_GNUC_PRINTF(3, 4)
>  void migrate_qmp(QTestState *who, const char *uri, const char *fmt, ...);
>  
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 0425d1d527..88e611e98f 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -43,8 +43,6 @@
>  unsigned start_address;
>  unsigned end_address;
>  static bool uffd_feature_thread_id;
> -static QTestMigrationState src_state;
> -static QTestMigrationState dst_state;
>  
>  /*
>   * An initial 3 MB offset is used as that corresponds
> @@ -230,13 +228,6 @@ static void wait_for_serial(const char *side)
>      } while (true);
>  }
>  
> -static void wait_for_stop(QTestState *who, QTestMigrationState *state)
> -{
> -    if (!state->stop_seen) {
> -        qtest_qmp_eventwait(who, "STOP");
> -    }
> -}
> -
>  /*
>   * It's tricky to use qemu's migration event capability with qtest,
>   * events suddenly appearing confuse the qmp()/hmp() responses.
> @@ -290,8 +281,9 @@ static void read_blocktime(QTestState *who)
>  static void wait_for_migration_pass(QTestState *who)
>  {
>      uint64_t pass, prev_pass = 0, changes = 0;
> +    QTestMigrationState *state = qtest_migration_state(who);
>  
> -    while (changes < 2 && !src_state.stop_seen) {
> +    while (changes < 2 && !state->stop_seen) {
>          usleep(1000);
>          pass = get_migration_pass(who);
>          changes += (pass != prev_pass);
> @@ -622,7 +614,7 @@ static void migrate_postcopy_start(QTestState *from, QTestState *to)
>  {
>      qtest_qmp_assert_success(from, "{ 'execute': 'migrate-start-postcopy' }");
>  
> -    wait_for_stop(from, &src_state);
> +    wait_for_stop(from);
>      qtest_qmp_eventwait(to, "RESUME");
>  }
>  
> @@ -757,9 +749,6 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>          }
>      }
>  
> -    dst_state = (QTestMigrationState) { };
> -    src_state = (QTestMigrationState) { };
> -
>      if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
>          memory_size = "150M";
>  
> @@ -849,9 +838,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>                                   ignore_stderr);
>      if (!args->only_target) {
>          *from = qtest_init_with_env(QEMU_ENV_SRC, cmd_source);
> -        qtest_qmp_set_event_callback(*from,
> -                                     migrate_watch_for_events,
> -                                     &src_state);
> +        qtest_qmp_set_migration_callback(*from, migrate_watch_for_events);
>      }
>  
>      cmd_target = g_strdup_printf("-accel kvm%s -accel tcg "
> @@ -870,9 +857,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>                                   args->opts_target ? args->opts_target : "",
>                                   ignore_stderr);
>      *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target);
> -    qtest_qmp_set_event_callback(*to,
> -                                 migrate_watch_for_events,
> -                                 &dst_state);
> +    qtest_qmp_set_migration_callback(*to, migrate_watch_for_events);
>  
>      /*
>       * Remove shmem file immediately to avoid memory leak in test failed case.
> @@ -1622,7 +1607,7 @@ static void test_precopy_common(MigrateCommon *args)
>           */
>          if (args->result == MIG_TEST_SUCCEED) {
>              qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
> -            wait_for_stop(from, &src_state);
> +            wait_for_stop(from);
>              migrate_ensure_converge(from);
>          }
>      }
> @@ -1668,7 +1653,7 @@ static void test_precopy_common(MigrateCommon *args)
>               */
>              wait_for_migration_complete(from);
>  
> -            wait_for_stop(from, &src_state);
> +            wait_for_stop(from);
>  
>          } else {
>              wait_for_migration_complete(from);
> @@ -1682,10 +1667,7 @@ static void test_precopy_common(MigrateCommon *args)
>              qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
>          }
>  
> -        if (!dst_state.resume_seen) {
> -            qtest_qmp_eventwait(to, "RESUME");
> -        }
> -
> +        wait_for_resume(to);
>          wait_for_serial("dest_serial");
>      }
>  
> @@ -1723,7 +1705,7 @@ static void test_file_common(MigrateCommon *args, bool stop_src)
>  
>      if (stop_src) {
>          qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
> -        wait_for_stop(from, &src_state);
> +        wait_for_stop(from);
>      }
>  
>      if (args->result == MIG_TEST_QMP_ERROR) {
> @@ -1745,10 +1727,7 @@ static void test_file_common(MigrateCommon *args, bool stop_src)
>          qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
>      }
>  
> -    if (!dst_state.resume_seen) {
> -        qtest_qmp_eventwait(to, "RESUME");
> -    }
> -
> +    wait_for_resume(to);
>      wait_for_serial("dest_serial");
>  
>  finish:
> @@ -1866,7 +1845,7 @@ static void test_ignore_shared(void)
>  
>      migrate_wait_for_dirty_mem(from, to);
>  
> -    wait_for_stop(from, &src_state);
> +    wait_for_stop(from);
>  
>      qtest_qmp_eventwait(to, "RESUME");
>  
> @@ -2376,7 +2355,7 @@ static void test_migrate_auto_converge(void)
>              break;
>          }
>          usleep(20);
> -        g_assert_false(src_state.stop_seen);
> +        g_assert_false(qtest_migration_state(from)->stop_seen);
>      } while (true);
>      /* The first percentage of throttling should be at least init_pct */
>      g_assert_cmpint(percentage, >=, init_pct);
> @@ -2715,7 +2694,7 @@ static void test_multifd_tcp_cancel(void)
>  
>      migrate_ensure_converge(from);
>  
> -    wait_for_stop(from, &src_state);
> +    wait_for_stop(from);
>      qtest_qmp_eventwait(to2, "RESUME");
>  
>      wait_for_serial("dest_serial");
> -- 
> 2.35.3
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 04/29] migration: Return the saved state from global_state_store
  2023-10-23 20:35 ` [PATCH v2 04/29] migration: Return the saved state from global_state_store Fabiano Rosas
@ 2023-10-25 10:19   ` Daniel P. Berrangé
  0 siblings, 0 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 10:19 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:35:43PM -0300, Fabiano Rosas wrote:
> There is a pattern of calling runstate_get() to store the current
> runstate and calling global_state_store() to save the current runstate
> for migration. Since global_state_store() also calls runstate_get(),
> make it return the runstate instead.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  include/migration/global_state.h | 2 +-
>  migration/global_state.c         | 7 +++++--
>  migration/migration.c            | 6 ++----
>  3 files changed, 8 insertions(+), 7 deletions(-)
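
If I'm reading it right, the calling pattern collapses from two calls
into one - roughly (sketch):

    /* before */
    RunState state = runstate_get();
    global_state_store();

    /* after */
    RunState state = global_state_store();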

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 12/29] migration/qemu-file: add utility methods for working with seekable channels
  2023-10-23 20:35 ` [PATCH v2 12/29] migration/qemu-file: add utility methods for working with seekable channels Fabiano Rosas
@ 2023-10-25 10:22   ` Daniel P. Berrangé
  0 siblings, 0 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 10:22 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

On Mon, Oct 23, 2023 at 05:35:51PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
> 
> Add utility methods that will be needed when implementing 'fixed-ram'
> migration capability.
> 
> qemu_file_is_seekable
> qemu_put_buffer_at
> qemu_get_buffer_at
> qemu_set_offset
> qemu_get_offset
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> fixed total_transferred accounting
> 
> restructured to use qio_channel_file_preadv instead of the _full
> variant
> ---
>  include/migration/qemu-file-types.h |  2 +
>  migration/qemu-file.c               | 80 +++++++++++++++++++++++++++++
>  migration/qemu-file.h               |  4 ++
>  3 files changed, 86 insertions(+)
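
From the later fixed-ram patches, the intended usage is roughly
(sketch, using the names above; block->pages_offset comes from the
fixed-ram patches):

    /* write a page at a fixed offset, assuming a seekable channel */
    if (qemu_file_is_seekable(file)) {
        qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
                           block->pages_offset + offset);
    }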

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 13/29] migration: fixed-ram: Add URI compatibility check
  2023-10-23 20:35 ` [PATCH v2 13/29] migration: fixed-ram: Add URI compatibility check Fabiano Rosas
@ 2023-10-25 10:27   ` Daniel P. Berrangé
  2023-10-31 16:06   ` Peter Xu
  1 sibling, 0 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 10:27 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:35:52PM -0300, Fabiano Rosas wrote:
> The fixed-ram migration format needs a channel that supports seeking
> to be able to write each page to an arbitrary offset in the migration
> stream.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/migration.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
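
Presumably the check boils down to rejecting non-seekable transports
up front - a hypothetical sketch, the actual patch may differ:

    if (migrate_fixed_ram() && !strstart(uri, "file:", NULL)) {
        error_setg(errp, "fixed-ram migration requires a 'file:' URI");
        return;
    }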

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 20/29] migration/multifd: Add incoming QIOChannelFile support
  2023-10-23 20:35 ` [PATCH v2 20/29] migration/multifd: Add incoming " Fabiano Rosas
@ 2023-10-25 10:29   ` Daniel P. Berrangé
  2023-10-25 14:18     ` Fabiano Rosas
  2023-10-31 21:28   ` Peter Xu
  1 sibling, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 10:29 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:35:59PM -0300, Fabiano Rosas wrote:
> On the receiving side we don't need to differentiate between main
> channel and threads, so whichever channel is defined first gets to be
> the main one. And since there are no packets, use the atomic channel
> count to index into the params array.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/file.c      | 39 +++++++++++++++++++++++++++++----------
>  migration/migration.c |  2 ++
>  migration/multifd.c   |  7 ++++++-
>  migration/multifd.h   |  1 +
>  4 files changed, 38 insertions(+), 11 deletions(-)
> 
> diff --git a/migration/file.c b/migration/file.c
> index 93b9b7bf5d..ad75225f43 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -6,13 +6,15 @@
>   */
>  
>  #include "qemu/osdep.h"
> -#include "qemu/cutils.h"
>  #include "qapi/error.h"
> +#include "qemu/cutils.h"
> +#include "qemu/error-report.h"
>  #include "channel.h"
>  #include "file.h"
>  #include "migration.h"
>  #include "io/channel-file.h"
>  #include "io/channel-util.h"
> +#include "options.h"
>  #include "trace.h"
>  
>  #define OFFSET_OPTION ",offset="
> @@ -136,7 +138,8 @@ void file_start_incoming_migration(const char *filespec, Error **errp)
>      g_autofree char *filename = g_strdup(filespec);
>      QIOChannelFile *fioc = NULL;
>      uint64_t offset = 0;
> -    QIOChannel *ioc;
> +    int channels = 1;
> +    int i = 0, fd;
>  
>      trace_migration_file_incoming(filename);
>  
> @@ -146,16 +149,32 @@ void file_start_incoming_migration(const char *filespec, Error **errp)
>  
>      fioc = qio_channel_file_new_path(filename, O_RDONLY, 0, errp);
>      if (!fioc) {
> -        return;
> +        goto out;
> +    }
> +
> +    if (migrate_multifd()) {
> +        channels += migrate_multifd_channels();
>      }
>  
> -    ioc = QIO_CHANNEL(fioc);
> -    if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
> +    fd = fioc->fd;
> +
> +    do {
> +        QIOChannel *ioc = QIO_CHANNEL(fioc);
> +
> +        if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
> +            return;
> +        }
> +
> +        qio_channel_set_name(ioc, "migration-file-incoming");
> +        qio_channel_add_watch_full(ioc, G_IO_IN,
> +                                   file_accept_incoming_migration,
> +                                   NULL, NULL,
> +                                   g_main_context_get_thread_default());
> +    } while (++i < channels && (fioc = qio_channel_file_new_fd(fd)));

IIUC, this loop is failing to call qio_channel_io_seek to set
the offset on the last 'fioc' that is created.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 02/29] tests/qtest: Move QTestMigrationState to libqtest
  2023-10-25 10:17   ` Daniel P. Berrangé
@ 2023-10-25 13:19     ` Fabiano Rosas
  0 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 13:19 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:35:41PM -0300, Fabiano Rosas wrote:
>> Move the QTestMigrationState into QTestState so we don't have to pass
>> it around to the wait_for_* helpers anymore. Since QTestState is
>> private to libqtest.c, move the migration state struct to libqtest.h
>> and add a getter.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  tests/qtest/libqtest.c          | 14 ++++++++++
>>  tests/qtest/libqtest.h          | 23 ++++++++++++++++
>>  tests/qtest/migration-helpers.c | 18 +++++++++++++
>>  tests/qtest/migration-helpers.h |  8 +++---
>>  tests/qtest/migration-test.c    | 47 +++++++++------------------------
>>  5 files changed, 72 insertions(+), 38 deletions(-)
>> 
>> diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c
>> index f33a210861..f7e85486dc 100644
>> --- a/tests/qtest/libqtest.c
>> +++ b/tests/qtest/libqtest.c
>> @@ -87,6 +87,7 @@ struct QTestState
>>      GList *pending_events;
>>      QTestQMPEventCallback eventCB;
>>      void *eventData;
>> +    QTestMigrationState *migration_state;
>
> It feels wrong to have something called MigrationState in the
> general qtest code. In the end there's nothing particularly
> migration related about this though.
>
> With that in mind, we could just rename it to "QTestEventState"
> instead.
>

Ok, will do.

>>  };
>>  
>>  static GHookList abrt_hooks;
>> @@ -500,6 +501,8 @@ static QTestState *qtest_init_internal(const char *qemu_bin,
>>          s->irq_level[i] = false;
>>      }
>>  
>> +    s->migration_state = g_new0(QTestMigrationState, 1);
>> +
>>      /*
>>       * Stopping QEMU for debugging is not supported on Windows.
>>       *
>> @@ -601,6 +604,7 @@ void qtest_quit(QTestState *s)
>>      close(s->fd);
>>      close(s->qmp_fd);
>>      g_string_free(s->rx, true);
>> +    g_free(s->migration_state);
>>  
>>      for (GList *it = s->pending_events; it != NULL; it = it->next) {
>>          qobject_unref((QDict *)it->data);
>> @@ -854,6 +858,11 @@ void qtest_qmp_set_event_callback(QTestState *s,
>>      s->eventData = opaque;
>>  }
>>  
>> +void qtest_qmp_set_migration_callback(QTestState *s, QTestQMPEventCallback cb)
>> +{
>> +    qtest_qmp_set_event_callback(s, cb, s->migration_state);
>> +}
>> +
>>  QDict *qtest_qmp_event_ref(QTestState *s, const char *event)
>>  {
>>      while (s->pending_events) {
>> @@ -1906,3 +1915,8 @@ bool mkimg(const char *file, const char *fmt, unsigned size_mb)
>>  
>>      return ret && !err;
>>  }
>> +
>> +QTestMigrationState *qtest_migration_state(QTestState *s)
>> +{
>> +    return s->migration_state;
>> +}
>> diff --git a/tests/qtest/libqtest.h b/tests/qtest/libqtest.h
>> index 6e3d3525bf..0421a1da24 100644
>> --- a/tests/qtest/libqtest.h
>> +++ b/tests/qtest/libqtest.h
>> @@ -23,6 +23,20 @@
>>  
>>  typedef struct QTestState QTestState;
>>  
>> +struct QTestMigrationState {
>> +    bool stop_seen;
>> +    bool resume_seen;
>> +};
>> +typedef struct QTestMigrationState QTestMigrationState;
>> +
>> +/**
>> + * qtest_migration_state:
>> + * @s: #QTestState instance to operate on.
>> + *
>> + * Returns: #QTestMigrationState instance.
>> + */
>> +QTestMigrationState *qtest_migration_state(QTestState *s);
>> +
>>  /**
>>   * qtest_initf:
>>   * @fmt: Format for creating other arguments to pass to QEMU, formatted
>> @@ -288,6 +302,15 @@ typedef bool (*QTestQMPEventCallback)(QTestState *s, const char *name,
>>  void qtest_qmp_set_event_callback(QTestState *s,
>>                                    QTestQMPEventCallback cb, void *opaque);
>>  
>> +/**
>> + * qtest_qmp_set_migration_callback:
>> + * @s: #QTestSTate instance to operate on
>> + * @cb: callback to invoke for events
>> + *
>> + * Like qtest_qmp_set_event_callback, but includes migration state events
>> + */
>> +void qtest_qmp_set_migration_callback(QTestState *s, QTestQMPEventCallback cb);
>> +
>>  /**
>>   * qtest_qmp_eventwait:
>>   * @s: #QTestState instance to operate on.
>> diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
>> index fd3b94efa2..cffa525c81 100644
>> --- a/tests/qtest/migration-helpers.c
>> +++ b/tests/qtest/migration-helpers.c
>> @@ -92,6 +92,24 @@ void migrate_set_capability(QTestState *who, const char *capability,
>>                               capability, value);
>>  }
>>  
>> +void wait_for_stop(QTestState *who)
>> +{
>> +    QTestMigrationState *state = qtest_migration_state(who);
>> +
>> +    if (!state->stop_seen) {
>> +        qtest_qmp_eventwait(who, "STOP");
>> +    }
>> +}
>> +
>> +void wait_for_resume(QTestState *who)
>> +{
>> +    QTestMigrationState *state = qtest_migration_state(who);
>> +
>> +    if (!state->resume_seen) {
>> +        qtest_qmp_eventwait(who, "RESUME");
>> +    }
>> +}
>
> I'd be included to put them into the libqtest.c file too eg
>
>   qtest_wait_for_resume/qtst_wait_for_stop
>

Haven't I done this already? It must have gotten lost in the various
rebases.

Thanks




* Re: [PATCH v2 01/29] tests/qtest: migration events
  2023-10-23 20:35 ` [PATCH v2 01/29] tests/qtest: migration events Fabiano Rosas
  2023-10-25  9:44   ` Thomas Huth
  2023-10-25 10:14   ` Daniel P. Berrangé
@ 2023-10-25 13:21   ` Fabiano Rosas
  2023-10-25 13:33     ` Steven Sistare
  2 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 13:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Steve Sistare, Thomas Huth, Laurent Vivier,
	Paolo Bonzini

Fabiano Rosas <farosas@suse.de> writes:

> From: Steve Sistare <steven.sistare@oracle.com>
>
> Define a state object to capture events seen by migration tests, to allow
> more events to be captured in a subsequent patch, and simplify event
> checking in wait_for_migration_pass.  No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Steven, I haven't seen the series that contained this patch for a
while; let me know if this is an acceptable version to merge. Let's make
sure your series gets in first so I don't get in your way.




* Re: [PATCH v2 01/29] tests/qtest: migration events
  2023-10-25 13:21   ` Fabiano Rosas
@ 2023-10-25 13:33     ` Steven Sistare
  0 siblings, 0 replies; 128+ messages in thread
From: Steven Sistare @ 2023-10-25 13:33 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini

On 10/25/2023 9:21 AM, Fabiano Rosas wrote:
> Fabiano Rosas <farosas@suse.de> writes:
> 
>> From: Steve Sistare <steven.sistare@oracle.com>
>>
>> Define a state object to capture events seen by migration tests, to allow
>> more events to be captured in a subsequent patch, and simplify event
>> checking in wait_for_migration_pass.  No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Steven, I haven't seen your series that contained this patch for a
> while, let me know if this is an acceptable version to merge. Let's make
> sure your series gets in first so I don't get in your way.

Sure, feel free to merge this patch as is, and thanks for asking.
My suspension series needs more work before I post its next version.

- Steve



* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-25  8:48   ` Daniel P. Berrangé
@ 2023-10-25 13:57     ` Fabiano Rosas
  2023-10-25 14:20       ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 13:57 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:35:45PM -0300, Fabiano Rosas wrote:
>> Add a capability that allows the management layer to delegate to QEMU
>> the decision of whether to pause a VM and perform a non-live
>> migration. Depending on the type of migration being performed, this
>> could bring performance benefits.
>
> I'm not really seeing what problem this is solving.
>

Well, this is the fruit of your discussion with Peter Xu in the previous
version of the patch.

To recap: he thinks QEMU is doing useless work with file migrations
because they are always asynchronous. He thinks we should always pause
before doing fixed-ram migration. You said that libvirt would rather use
fixed-ram for a broader set of savevm-style commands, so you'd rather
not always pause. I'm trying to cater to both of your wishes. This new
capability is the middle ground I came up with.

So fixed-ram would always pause the VM, because that is the primary
use-case, but libvirt would be allowed to say: don't pause this time.
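
In HMP terms that would be something like (assuming the capability
keeps the auto-pause name from this series):

(qemu) migrate_set_capability auto-pause off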

> Mgmt apps are perfectly capable of pausing the VM before issuing
> the migrate operation.
>

Right. But would QEMU be allowed to just assume that if a VM is paused
at the start of migration it can then go ahead and skip all dirty page
mechanisms?

Without pausing, we're basically doing *live* migration into a static
file that will be kept on disk for who knows how long before being
restored on the other side. We could release the src QEMU resources (a
bit) earlier if we paused the VM beforehand.

We're basically talking about whether we want the VM to be usable in the
(hopefully) very short time between issuing the migration command and
the migration being finished. We might be splitting hairs here, but we
need some sort of consensus.



* Re: [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration
  2023-10-25  9:39   ` Daniel P. Berrangé
@ 2023-10-25 14:03     ` Fabiano Rosas
  2023-11-01 15:23     ` Peter Xu
  1 sibling, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 14:03 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:35:54PM -0300, Fabiano Rosas wrote:
>> From: Nikolay Borisov <nborisov@suse.com>
>> 
>> Implement the outgoing migration side for the 'fixed-ram' capability.
>> 
>> A bitmap is introduced to track which pages have been written in the
>> migration file. Pages are written at a fixed location for every
>> ramblock. Zero pages are ignored as they'd be zero in the destination
>> migration as well.
>> 
>> The migration stream is altered to put the dirty pages for a ramblock
>> after its header instead of having a sequential stream of pages that
>> follow the ramblock headers. Since all pages have a fixed location,
>> RAM_SAVE_FLAG_EOS is no longer generated on every migration iteration.
>> 
>> Without fixed-ram (current):
>> 
>> ramblock 1 header|ramblock 2 header|...|RAM_SAVE_FLAG_EOS|stream of
>>  pages (iter 1)|RAM_SAVE_FLAG_EOS|stream of pages (iter 2)|...
>
> Clearer to diagram this vertically I think
>
>  ---------------------
>  | ramblock 1 header |
>  ---------------------
>  | ramblock 2 header |
>  ---------------------
>  | ...               |
>  ---------------------
>  | ramblock n header |
>  ---------------------
>  | RAM_SAVE_FLAG_EOS |
>  ---------------------
>  | stream of pages   |
>  | (iter 1)          |
>  | ...               |
>  ---------------------
>  | RAM_SAVE_FLAG_EOS |
>  ---------------------
>  | stream of pages   |
>  | (iter 2)          |
>  | ...               |
>  ---------------------
>  | ...               |
>  ---------------------
>  
>
>> 
>> With fixed-ram (new):
>> 
>> ramblock 1 header|ramblock 1 fixed-ram header|ramblock 1 pages (fixed
>>  offsets)|ramblock 2 header|ramblock 2 fixed-ram header|ramblock 2
>>  pages (fixed offsets)|...|RAM_SAVE_FLAG_EOS
>
> If I'm reading the code correctly the new format has some padding
> such that each "ramblock pages" region starts on a 1 MB boundary.
>
> eg so we get:
>
>  --------------------------------
>  | ramblock 1 header            |
>  --------------------------------
>  | ramblock 1 fixed-ram header  |
>  --------------------------------
>  | padding to next 1MB boundary |
>  | ...                          |
>  --------------------------------
>  | ramblock 1 pages             |
>  | ...                          |
>  --------------------------------
>  | ramblock 2 header            |
>  --------------------------------
>  | ramblock 2 fixed-ram header  |
>  --------------------------------
>  | padding to next 1MB boundary |
>  | ...                          |
>  --------------------------------
>  | ramblock 2 pages             |
>  | ...                          |
>  --------------------------------
>  | ...                          |
>  --------------------------------
>  | RAM_SAVE_FLAG_EOS            |
>  --------------------------------
>  | ...                          |
>  -------------------------------
>
>> 
>> where:
>>  - ramblock header: the generic information for a ramblock, such as
>>    idstr, used_len, etc.
>> 
>>  - ramblock fixed-ram header: the new information added by this
>>    feature: bitmap of pages written, bitmap size and offset of pages
>>    in the migration file.
>> 
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  include/exec/ramblock.h |  8 ++++
>>  migration/options.c     |  3 --
>>  migration/ram.c         | 98 ++++++++++++++++++++++++++++++++++++-----
>>  3 files changed, 96 insertions(+), 13 deletions(-)
>> 
>> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
>> index 69c6a53902..e0e3f16852 100644
>> --- a/include/exec/ramblock.h
>> +++ b/include/exec/ramblock.h
>> @@ -44,6 +44,14 @@ struct RAMBlock {
>>      size_t page_size;
>>      /* dirty bitmap used during migration */
>>      unsigned long *bmap;
>> +    /* shadow dirty bitmap used when migrating to a file */
>> +    unsigned long *shadow_bmap;
>> +    /*
>> +     * offset in the file pages belonging to this ramblock are saved,
>> +     * used only during migration to a file.
>> +     */
>> +    off_t bitmap_offset;
>> +    uint64_t pages_offset;
>>      /* bitmap of already received pages in postcopy */
>>      unsigned long *receivedmap;
>>  
>> diff --git a/migration/options.c b/migration/options.c
>> index 2622d8c483..9f693d909f 100644
>> --- a/migration/options.c
>> +++ b/migration/options.c
>> @@ -271,12 +271,9 @@ bool migrate_events(void)
>>  
>>  bool migrate_fixed_ram(void)
>>  {
>> -/*
>>      MigrationState *s = migrate_get_current();
>>  
>>      return s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
>> -*/
>> -    return false;
>>  }
>>  
>>  bool migrate_ignore_shared(void)
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 92769902bb..152a03604f 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1157,12 +1157,18 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
>>          return 0;
>>      }
>>  
>> +    stat64_add(&mig_stats.zero_pages, 1);
>> +
>> +    if (migrate_fixed_ram()) {
>> +        /* zero pages are not transferred with fixed-ram */
>> +        clear_bit(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
>> +        return 1;
>> +    }
>> +
>>      len += save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
>>      qemu_put_byte(file, 0);
>>      len += 1;
>>      ram_release_page(pss->block->idstr, offset);
>> -
>> -    stat64_add(&mig_stats.zero_pages, 1);
>>      ram_transferred_add(len);
>>  
>>      /*
>> @@ -1220,14 +1226,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
>>  {
>>      QEMUFile *file = pss->pss_channel;
>>  
>> -    ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
>> -                                         offset | RAM_SAVE_FLAG_PAGE));
>> -    if (async) {
>> -        qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
>> -                              migrate_release_ram() &&
>> -                              migration_in_postcopy());
>> +    if (migrate_fixed_ram()) {
>> +        qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
>> +                           block->pages_offset + offset);
>> +        set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
>>      } else {
>> -        qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
>> +        ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
>> +                                             offset | RAM_SAVE_FLAG_PAGE));
>> +        if (async) {
>> +            qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
>> +                                  migrate_release_ram() &&
>> +                                  migration_in_postcopy());
>> +        } else {
>> +            qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
>> +        }
>>      }
>>      ram_transferred_add(TARGET_PAGE_SIZE);
>>      stat64_add(&mig_stats.normal_pages, 1);
>> @@ -2475,6 +2487,8 @@ static void ram_save_cleanup(void *opaque)
>>          block->clear_bmap = NULL;
>>          g_free(block->bmap);
>>          block->bmap = NULL;
>> +        g_free(block->shadow_bmap);
>> +        block->shadow_bmap = NULL;
>>      }
>>  
>>      xbzrle_cleanup();
>> @@ -2842,6 +2856,7 @@ static void ram_list_init_bitmaps(void)
>>               */
>>              block->bmap = bitmap_new(pages);
>>              bitmap_set(block->bmap, 0, pages);
>> +            block->shadow_bmap = bitmap_new(block->used_length >> TARGET_PAGE_BITS);
>>              block->clear_bmap_shift = shift;
>>              block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
>>          }
>> @@ -2979,6 +2994,44 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
>>      }
>>  }
>>  
>> +#define FIXED_RAM_HDR_VERSION 1
>> +struct FixedRamHeader {
>> +    uint32_t version;
>> +    uint64_t page_size;
>> +    uint64_t bitmap_offset;
>> +    uint64_t pages_offset;
>> +    /* end of v1 */
>> +} QEMU_PACKED;
>> +
>> +static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
>> +{
>> +    g_autofree struct FixedRamHeader *header;
>> +    size_t header_size, bitmap_size;
>> +    long num_pages;
>> +
>> +    header = g_new0(struct FixedRamHeader, 1);
>> +    header_size = sizeof(struct FixedRamHeader);
>> +
>> +    num_pages = block->used_length >> TARGET_PAGE_BITS;
>> +    bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
>> +
>> +    /*
>> +     * Save the file offsets of where the bitmap and the pages should
>> +     * go as they are written at the end of migration and during the
>> +     * iterative phase, respectively.
>> +     */
>> +    block->bitmap_offset = qemu_get_offset(file) + header_size;
>> +    block->pages_offset = ROUND_UP(block->bitmap_offset +
>> +                                   bitmap_size, 0x100000);
>
> The 0x100000 gives us our 1 MB alignment.
>
> This is quite an important thing, so can we put a nice clear
> constant at the top of the file:
>
>   /* Align fixed-ram pages data to the next 1 MB boundary */
>   #define FIXED_RAM_PAGE_OFFSET_ALIGNMENT 0x100000
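
Sounds good. Then fixed_ram_insert_header would use it, something
like:

    block->pages_offset = ROUND_UP(block->bitmap_offset + bitmap_size,
                                   FIXED_RAM_PAGE_OFFSET_ALIGNMENT);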
>
>> @@ -3179,7 +3252,6 @@ out:
>>          qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>>          qemu_fflush(f);
>>          ram_transferred_add(8);
>> -
>>          ret = qemu_file_get_error(f);
>>      }
>>      if (ret < 0) {
>
> Supurious whitspace change, possibly better in whatever patch
> introduced it, or dropped ?
>
>
> With regards,
> Daniel

I'll apply all the suggestions. Thanks!



* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-10-25  9:43   ` Daniel P. Berrangé
@ 2023-10-25 14:07     ` Fabiano Rosas
  2023-10-31 19:03       ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 14:07 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:35:55PM -0300, Fabiano Rosas wrote:
>> From: Nikolay Borisov <nborisov@suse.com>
>> 
>> Add the necessary code to parse the format changes for the 'fixed-ram'
>> capability.
>> 
>> One of the more notable changes in behavior is that in the 'fixed-ram'
>> case ram pages are restored in one go rather than constantly looping
>> through the migration stream.
>> 
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> (farosas) reused more of the common code by making the fixed-ram
>> function take only one ramblock and calling it from inside
>> parse_ramblock.
>> ---
>>  migration/ram.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 93 insertions(+)
>> 
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 152a03604f..cea6971ab2 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -3032,6 +3032,32 @@ static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
>>      qemu_put_buffer(file, (uint8_t *) header, header_size);
>>  }
>>  
>> +static int fixed_ram_read_header(QEMUFile *file, struct FixedRamHeader *header)
>> +{
>> +    size_t ret, header_size = sizeof(struct FixedRamHeader);
>> +
>> +    ret = qemu_get_buffer(file, (uint8_t *)header, header_size);
>> +    if (ret != header_size) {
>> +        return -1;
>> +    }
>> +
>> +    /* migration stream is big-endian */
>> +    be32_to_cpus(&header->version);
>> +
>> +    if (header->version > FIXED_RAM_HDR_VERSION) {
>> +        error_report("Migration fixed-ram capability version mismatch (expected %d, got %d)",
>> +                     FIXED_RAM_HDR_VERSION, header->version);
>> +        return -1;
>> +    }
>> +
>> +    be64_to_cpus(&header->page_size);
>> +    be64_to_cpus(&header->bitmap_offset);
>> +    be64_to_cpus(&header->pages_offset);
>> +
>> +
>> +    return 0;
>> +}
>> +
>>  /*
>>   * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
>>   * long-running RCU critical section.  When rcu-reclaims in the code
>> @@ -3932,6 +3958,68 @@ void colo_flush_ram_cache(void)
>>      trace_colo_flush_ram_cache_end();
>>  }
>>  
>> +static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
>> +                                    long num_pages, unsigned long *bitmap)
>> +{
>> +    unsigned long set_bit_idx, clear_bit_idx;
>> +    unsigned long len;
>> +    ram_addr_t offset;
>> +    void *host;
>> +    size_t read, completed, read_len;
>> +
>> +    for (set_bit_idx = find_first_bit(bitmap, num_pages);
>> +         set_bit_idx < num_pages;
>> +         set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
>> +
>> +        clear_bit_idx = find_next_zero_bit(bitmap, num_pages, set_bit_idx + 1);
>> +
>> +        len = TARGET_PAGE_SIZE * (clear_bit_idx - set_bit_idx);
>> +        offset = set_bit_idx << TARGET_PAGE_BITS;
>> +
>> +        for (read = 0, completed = 0; completed < len; offset += read) {
>> +            host = host_from_ram_block_offset(block, offset);
>> +            read_len = MIN(len, TARGET_PAGE_SIZE);
>> +
>> +            read = qemu_get_buffer_at(f, host, read_len,
>> +                                      block->pages_offset + offset);
>> +            completed += read;
>> +        }
>> +    }
>> +}
>> +
>> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
>> +{
>> +    g_autofree unsigned long *bitmap = NULL;
>> +    struct FixedRamHeader header;
>> +    size_t bitmap_size;
>> +    long num_pages;
>> +    int ret = 0;
>> +
>> +    ret = fixed_ram_read_header(f, &header);
>> +    if (ret < 0) {
>> +        error_report("Error reading fixed-ram header");
>> +        return -EINVAL;
>> +    }
>> +
>> +    block->pages_offset = header.pages_offset;
>
> Do you think it is worth sanity checking that 'pages_offset' is aligned
> in some way.
>
> It is nice that we have flexibility to change the alignment in future
> if we find 1 MB is not optimal, so I wouldn't want to force 1MB align
> check htere. Perhaps we could at least sanity check for alignment at
> TARGET_PAGE_SIZE, to detect a gross data corruption problem ?
>

I don't see why not. I'll add it.
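
Something like this in parse_ramblock_fixed_ram, I imagine (sketch):

    if (!QEMU_IS_ALIGNED(header.pages_offset, TARGET_PAGE_SIZE)) {
        error_report("Error reading fixed-ram header: unaligned"
                     " pages_offset 0x%" PRIx64, header.pages_offset);
        return -EINVAL;
    }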




* Re: [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support
  2023-10-25  9:52   ` Daniel P. Berrangé
@ 2023-10-25 14:12     ` Fabiano Rosas
  2023-10-25 14:23       ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 14:12 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:35:58PM -0300, Fabiano Rosas wrote:
>> Allow multifd to open file-backed channels. This will be used when
>> enabling the fixed-ram migration stream format which expects a
>> seekable transport.
>> 
>> The QIOChannel read and write methods will use the preadv/pwritev
>> versions which don't update the file offset at each call so we can
>> reuse the fd without re-opening for every channel.
>> 
>> Note that this is just setup code and multifd cannot yet make use of
>> the file channels.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  migration/file.c      | 64 +++++++++++++++++++++++++++++++++++++++++--
>>  migration/file.h      | 10 +++++--
>>  migration/migration.c |  2 +-
>>  migration/multifd.c   | 14 ++++++++--
>>  migration/options.c   |  7 +++++
>>  migration/options.h   |  1 +
>>  6 files changed, 90 insertions(+), 8 deletions(-)
>> 
>> diff --git a/migration/file.c b/migration/file.c
>> index cf5b1bf365..93b9b7bf5d 100644
>> --- a/migration/file.c
>> +++ b/migration/file.c
>> @@ -17,6 +17,12 @@
>
>> +void file_send_channel_create(QIOTaskFunc f, void *data)
>> +{
>> +    QIOChannelFile *ioc;
>> +    QIOTask *task;
>> +    Error *errp = NULL;
>> +
>> +    ioc = qio_channel_file_new_path(outgoing_args.fname,
>> +                                    outgoing_args.flags,
>> +                                    outgoing_args.mode, &errp);
>> +    if (!ioc) {
>> +        file_migration_cancel(errp);
>> +        return;
>> +    }
>> +
>> +    task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
>> +    qio_task_run_in_thread(task, qio_channel_file_connect_worker,
>> +                           (gpointer)data, NULL, NULL);
>> +}
>> +
>>  void file_start_outgoing_migration(MigrationState *s, const char *filespec,
>>                                     Error **errp)
>>  {
>> -    g_autofree char *filename = g_strdup(filespec);
>>      g_autoptr(QIOChannelFile) fioc = NULL;
>> +    g_autofree char *filename = g_strdup(filespec);
>>      uint64_t offset = 0;
>>      QIOChannel *ioc;
>> +    int flags = O_CREAT | O_TRUNC | O_WRONLY;
>> +    mode_t mode = 0660;
>>  
>>      trace_migration_file_outgoing(filename);
>>  
>> @@ -50,12 +105,15 @@ void file_start_outgoing_migration(MigrationState *s, const char *filespec,
>>          return;
>>      }
>>  
>> -    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
>> -                                     0600, errp);

By the way, we're experimenting with add-fd to flesh out the interface
with libvirt and I see that the flags here can conflict with the flags
set on the fd passed through `virsh --pass-fd ...` due to this at
monitor_fdset_dup_fd_add():

    if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
        fd = mon_fdset_fd->fd;
        break;
    }

We're requiring the O_RDONLY, O_WRONLY, O_RDWR flags defined here to
match the fdset passed into QEMU. Should we just sync the code of the
two projects to use the same flags? That feels a little clumsy to me.

>> +    fioc = qio_channel_file_new_path(filename, flags, mode, errp);
>
> So this initially opens the file with O_CREAT|O_TRUNC which
> makes sense.
>
>>      if (!fioc) {
>>          return;
>>      }
>>  
>> +    outgoing_args.fname = g_strdup(filename);
>> +    outgoing_args.flags = flags;
>> +    outgoing_args.mode = mode;
>
> We're passing on O_CREAT|O_TRUNC to all the multifd threads too. This
> doesn't make sense to me - the file should already exist and be truncated
> by the time the threads open it. I would think they should only be using
> O_WRONLY and no mode at all.
>

Ok.



* Re: [PATCH v2 20/29] migration/multifd: Add incoming QIOChannelFile support
  2023-10-25 10:29   ` Daniel P. Berrangé
@ 2023-10-25 14:18     ` Fabiano Rosas
  0 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 14:18 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:35:59PM -0300, Fabiano Rosas wrote:
>> On the receiving side we don't need to differentiate between main
>> channel and threads, so whichever channel is defined first gets to be
>> the main one. And since there are no packets, use the atomic channel
>> count to index into the params array.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  migration/file.c      | 39 +++++++++++++++++++++++++++++----------
>>  migration/migration.c |  2 ++
>>  migration/multifd.c   |  7 ++++++-
>>  migration/multifd.h   |  1 +
>>  4 files changed, 38 insertions(+), 11 deletions(-)
>> 
>> diff --git a/migration/file.c b/migration/file.c
>> index 93b9b7bf5d..ad75225f43 100644
>> --- a/migration/file.c
>> +++ b/migration/file.c
>> @@ -6,13 +6,15 @@
>>   */
>>  
>>  #include "qemu/osdep.h"
>> -#include "qemu/cutils.h"
>>  #include "qapi/error.h"
>> +#include "qemu/cutils.h"
>> +#include "qemu/error-report.h"
>>  #include "channel.h"
>>  #include "file.h"
>>  #include "migration.h"
>>  #include "io/channel-file.h"
>>  #include "io/channel-util.h"
>> +#include "options.h"
>>  #include "trace.h"
>>  
>>  #define OFFSET_OPTION ",offset="
>> @@ -136,7 +138,8 @@ void file_start_incoming_migration(const char *filespec, Error **errp)
>>      g_autofree char *filename = g_strdup(filespec);
>>      QIOChannelFile *fioc = NULL;
>>      uint64_t offset = 0;
>> -    QIOChannel *ioc;
>> +    int channels = 1;
>> +    int i = 0, fd;
>>  
>>      trace_migration_file_incoming(filename);
>>  
>> @@ -146,16 +149,32 @@ void file_start_incoming_migration(const char *filespec, Error **errp)
>>  
>>      fioc = qio_channel_file_new_path(filename, O_RDONLY, 0, errp);
>>      if (!fioc) {
>> -        return;
>> +        goto out;
>> +    }
>> +
>> +    if (migrate_multifd()) {
>> +        channels += migrate_multifd_channels();
>>      }
>>  
>> -    ioc = QIO_CHANNEL(fioc);
>> -    if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
>> +    fd = fioc->fd;
>> +
>> +    do {
>> +        QIOChannel *ioc = QIO_CHANNEL(fioc);
>> +
>> +        if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
>> +            return;
>> +        }
>> +
>> +        qio_channel_set_name(ioc, "migration-file-incoming");
>> +        qio_channel_add_watch_full(ioc, G_IO_IN,
>> +                                   file_accept_incoming_migration,
>> +                                   NULL, NULL,
>> +                                   g_main_context_get_thread_default());
>> +    } while (++i < channels && (fioc = qio_channel_file_new_fd(fd)));
>
> IIUC, this loop is failing to call qio_channel_io_seek to set
> the offset on the last 'fioc' that is created.
>

Ah, this is actually bogus. We don't need to seek the secondary
channels at all; it does nothing, since we carry pointers to everything
in the fixed-ram header. The seek should be out of the loop.
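
I.e. seek only the initial channel before entering the loop - roughly:

    /* the offset only applies to the main channel */
    if (offset &&
        qio_channel_io_seek(QIO_CHANNEL(fioc), offset, SEEK_SET, errp) < 0) {
        return;
    }

    do {
        QIOChannel *ioc = QIO_CHANNEL(fioc);

        qio_channel_set_name(ioc, "migration-file-incoming");
        qio_channel_add_watch_full(ioc, G_IO_IN,
                                   file_accept_incoming_migration,
                                   NULL, NULL,
                                   g_main_context_get_thread_default());
    } while (++i < channels && (fioc = qio_channel_file_new_fd(fd)));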




* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-25 13:57     ` Fabiano Rosas
@ 2023-10-25 14:20       ` Daniel P. Berrangé
  2023-10-25 14:58         ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 14:20 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Wed, Oct 25, 2023 at 10:57:12AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Mon, Oct 23, 2023 at 05:35:45PM -0300, Fabiano Rosas wrote:
> >> Add a capability that allows the management layer to delegate to QEMU
> >> the decision of whether to pause a VM and perform a non-live
> >> migration. Depending on the type of migration being performed, this
> >> could bring performance benefits.
> >
> > I'm not really seeing what problem this is solving.
> >
> 
> Well, this is the fruit of your discussion with Peter Xu in the previous
> version of the patch.
> 
> To recap: he thinks QEMU is doing useless work with file migrations
> because they are always asynchronous. He thinks we should always pause
> before doing fixed-ram migration. You said that libvirt would rather use
> fixed-ram for a broader set of savevm-style commands, so you'd rather
> not always pause. I'm trying to cater to both of your wishes. This new
> capability is the middle ground I came up with.
> 
> So fixed-ram would always pause the VM, because that is the primary
> use-case, but libvirt would be allowed to say: don't pause this time.

If the VM is going to be powered off immediately after saving
a snapshot then yes, you might as well pause it, but we can't
assume that will be the case.  An equally common use case
would be for saving periodic snapshots of a running VM. This
should be transparent, such that the VM remains running the
whole time, except for a narrow window at the completion of RAM/state
saving where we flip the disk snapshots, so they are in sync
with the RAM snapshot.

IOW, save/restore to disk can imply paused, but snapshotting
should not imply paused. So I don't see an unambiguous
rationale for diverging and auto-pausing the VM whenever
fixed-ram is set.

> > Mgmt apps are perfectly capable of pausing the VM before issuing
> > the migrate operation.
> >
> 
> Right. But would QEMU be allowed to just assume that if a VM is paused
> at the start of migration it can then go ahead and skip all dirty page
> mechanisms?

Skipping dirty page tracking would imply that the mgmt app cannot
resume CPUs without either letting the operation complete, or
aborting it.

That is probably a reasonable assumption, as I can't come up with
a use case for starting out paused and then later resuming, unless
there was a scenario where you needed to synchronize something
external with the start of migration.  Synchronizing storage, though,
is something that happens at the end of migration instead.

> Without pausing, we're basically doing *live* migration into a static
> file that will be kept on disk for who knows how long before being
> restored on the other side. We could release the src QEMU resources (a
> bit) earlier if we paused the VM beforehand.

Can we really release resources early?  If the save operation fails
right at the end, we want to be able to resume execution of CPUs,
which assumes all resources are still available; otherwise we have
a failure scenario where we've not successfully saved to disk and
also no longer have the running QEMU.

> We're basically talking about whether we want the VM to be usable in the
> (hopefully) very short time between issuing the migration command and
> the migration being finished. We might be splitting hairs here, but we
> need some sort of consensus.

The time may not be very short for large VMs.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 25/29] migration/multifd: Support outgoing fixed-ram stream format
  2023-10-25  9:23   ` Daniel P. Berrangé
@ 2023-10-25 14:21     ` Fabiano Rosas
  0 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 14:21 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:36:04PM -0300, Fabiano Rosas wrote:
>> The new fixed-ram stream format uses a file transport and puts ram
>> pages in the migration file at their respective offsets and can be
>> done in parallel by using the pwritev system call which takes iovecs
>> and an offset.
>> 
>> Add support for enabling the new format along with multifd to make use
>> of the threading and page handling already in place.
>> 
>> This requires multifd to stop sending headers and leave the stream
>> format to the fixed-ram code. When it comes time to write the data, we
>> need to call a version of qio_channel_write that can take an offset.
>> 
>> Usage on HMP is:
>> 
>> (qemu) stop
>> (qemu) migrate_set_capability multifd on
>> (qemu) migrate_set_capability fixed-ram on
>> (qemu) migrate_set_parameter max-bandwidth 0
>> (qemu) migrate_set_parameter multifd-channels 8
>> (qemu) migrate file:migfile
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  include/qemu/bitops.h | 13 ++++++++++
>>  migration/multifd.c   | 55 +++++++++++++++++++++++++++++++++++++++++--
>>  migration/options.c   |  6 -----
>>  migration/ram.c       |  2 +-
>>  4 files changed, 67 insertions(+), 9 deletions(-)
>> 
>> diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
>> index cb3526d1f4..2c0a2fe751 100644
>> --- a/include/qemu/bitops.h
>> +++ b/include/qemu/bitops.h
>> @@ -67,6 +67,19 @@ static inline void clear_bit(long nr, unsigned long *addr)
>>      *p &= ~mask;
>>  }
>>  
>> +/**
>> + * clear_bit_atomic - Clears a bit in memory atomically
>> + * @nr: Bit to clear
>> + * @addr: Address to start counting from
>> + */
>> +static inline void clear_bit_atomic(long nr, unsigned long *addr)
>> +{
>> +    unsigned long mask = BIT_MASK(nr);
>> +    unsigned long *p = addr + BIT_WORD(nr);
>> +
>> +    return qatomic_and(p, ~mask);
>> +}
>> +
>>  /**
>>   * change_bit - Toggle a bit in memory
>>   * @nr: Bit to change
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index 20e8635740..3f95a41ee9 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -260,6 +260,19 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
>>      g_free(pages);
>>  }
>>  
>> +static void multifd_set_file_bitmap(MultiFDSendParams *p)
>> +{
>> +    MultiFDPages_t *pages = p->pages;
>> +
>> +    if (!pages->block) {
>> +        return;
>> +    }
>> +
>> +    for (int i = 0; i < p->normal_num; i++) {
>> +        ramblock_set_shadow_bmap_atomic(pages->block, pages->offset[i]);
>> +    }
>> +}
>> +
>>  static void multifd_send_fill_packet(MultiFDSendParams *p)
>>  {
>>      MultiFDPacket_t *packet = p->packet;
>> @@ -606,6 +619,29 @@ int multifd_send_sync_main(QEMUFile *f)
>>          }
>>      }
>>  
>> +    if (!migrate_multifd_packets()) {
>> +        /*
>> +         * There's no sync packet to send. Just make sure the sending
>> +         * above has finished.
>> +         */
>> +        for (i = 0; i < migrate_multifd_channels(); i++) {
>> +            qemu_sem_wait(&multifd_send_state->channels_ready);
>> +        }
>> +
>> +        /* sanity check and release the channels */
>> +        for (i = 0; i < migrate_multifd_channels(); i++) {
>> +            MultiFDSendParams *p = &multifd_send_state->params[i];
>> +
>> +            qemu_mutex_lock(&p->mutex);
>> +            assert(!p->pending_job || p->quit);
>> +            qemu_mutex_unlock(&p->mutex);
>> +
>> +            qemu_sem_post(&p->sem);
>> +        }
>> +
>> +        return 0;
>> +    }
>> +
>>      /*
>>       * When using zero-copy, it's necessary to flush the pages before any of
>>       * the pages can be sent again, so we'll make sure the new version of the
>> @@ -689,6 +725,8 @@ static void *multifd_send_thread(void *opaque)
>>  
>>          if (p->pending_job) {
>>              uint32_t flags;
>> +            uint64_t write_base;
>> +
>>              p->normal_num = 0;
>>  
>>              if (!use_packets || use_zero_copy_send) {
>> @@ -713,6 +751,16 @@ static void *multifd_send_thread(void *opaque)
>>              if (use_packets) {
>>                  multifd_send_fill_packet(p);
>>                  p->num_packets++;
>> +                write_base = 0;
>> +            } else {
>> +                multifd_set_file_bitmap(p);
>> +
>> +                /*
>> +                 * If we subtract the host page now, we don't need to
>> +                 * pass it into qio_channel_write_full_all() below.
>> +                 */
>> +                write_base = p->pages->block->pages_offset -
>> +                    (uint64_t)p->pages->block->host;
>>              }
>>  
>>              flags = p->flags;
>> @@ -738,8 +786,9 @@ static void *multifd_send_thread(void *opaque)
>>                  p->iov[0].iov_base = p->packet;
>>              }
>>  
>> -            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
>> -                                              0, p->write_flags, &local_err);
>> +            ret = qio_channel_write_full_all(p->c, p->iov, p->iovs_num,
>> +                                             write_base, NULL, 0,
>> +                                             p->write_flags, &local_err);
>>              if (ret != 0) {
>>                  break;
>>              }
>> @@ -969,6 +1018,8 @@ int multifd_save_setup(Error **errp)
>>  
>>          if (migrate_zero_copy_send()) {
>>              p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
>> +        } else if (!use_packets) {
>> +            p->write_flags |= QIO_CHANNEL_WRITE_FLAG_WITH_OFFSET;
>>          } else {
>>              p->write_flags = 0;
>>          }
>
> Ah, so this is why you had the weird overloaded design for
> qio_channel_write_full_all in patch 22 that I queried. I'd
> still prefer the simpler design at the QIO level, and just
> calling the appropriate function above. 

Yes, I didn't want to have multifd choosing between different IO
functions during (multifd) runtime. Here we set the flag at setup time
and forget about it.

I now understand multifd a bit more so I'm more confident changing the
code; let's see how your suggestion looks.
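
A minimal sketch of the pattern, with invented names rather than the
actual QEMU code: decide once at setup time, then let a single write
entry point branch on the flag instead of picking an IO function on
every iteration:

#define _GNU_SOURCE             /* for pwritev() */
#include <sys/uio.h>
#include <unistd.h>

#define WRITE_FLAG_WITH_OFFSET (1 << 0)

typedef struct {
    int fd;
    int write_flags;            /* decided once at channel setup */
} Channel;

static ssize_t channel_writev(Channel *c, const struct iovec *iov,
                              int iovcnt, off_t offset)
{
    if (c->write_flags & WRITE_FLAG_WITH_OFFSET) {
        /* fixed-ram: positioned write, file offset left untouched */
        return pwritev(c->fd, iov, iovcnt, offset);
    }
    /* stream format: plain sequential write */
    return writev(c->fd, iov, iovcnt);
}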


>
> With regards,
> Daniel


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support
  2023-10-25 14:12     ` Fabiano Rosas
@ 2023-10-25 14:23       ` Daniel P. Berrangé
  2023-10-25 15:00         ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 14:23 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

On Wed, Oct 25, 2023 at 11:12:38AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Mon, Oct 23, 2023 at 05:35:58PM -0300, Fabiano Rosas wrote:
> >> Allow multifd to open file-backed channels. This will be used when
> >> enabling the fixed-ram migration stream format which expects a
> >> seekable transport.
> >> 
> >> The QIOChannel read and write methods will use the preadv/pwritev
> >> versions which don't update the file offset at each call so we can
> >> reuse the fd without re-opening for every channel.
> >> 
> >> Note that this is just setup code and multifd cannot yet make use of
> >> the file channels.
> >> 
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> ---
> >>  migration/file.c      | 64 +++++++++++++++++++++++++++++++++++++++++--
> >>  migration/file.h      | 10 +++++--
> >>  migration/migration.c |  2 +-
> >>  migration/multifd.c   | 14 ++++++++--
> >>  migration/options.c   |  7 +++++
> >>  migration/options.h   |  1 +
> >>  6 files changed, 90 insertions(+), 8 deletions(-)
> >> 
> >> diff --git a/migration/file.c b/migration/file.c
> >> index cf5b1bf365..93b9b7bf5d 100644
> >> --- a/migration/file.c
> >> +++ b/migration/file.c
> >> @@ -17,6 +17,12 @@
> >
> >> +void file_send_channel_create(QIOTaskFunc f, void *data)
> >> +{
> >> +    QIOChannelFile *ioc;
> >> +    QIOTask *task;
> >> +    Error *errp = NULL;
> >> +
> >> +    ioc = qio_channel_file_new_path(outgoing_args.fname,
> >> +                                    outgoing_args.flags,
> >> +                                    outgoing_args.mode, &errp);
> >> +    if (!ioc) {
> >> +        file_migration_cancel(errp);
> >> +        return;
> >> +    }
> >> +
> >> +    task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
> >> +    qio_task_run_in_thread(task, qio_channel_file_connect_worker,
> >> +                           (gpointer)data, NULL, NULL);
> >> +}
> >> +
> >>  void file_start_outgoing_migration(MigrationState *s, const char *filespec,
> >>                                     Error **errp)
> >>  {
> >> -    g_autofree char *filename = g_strdup(filespec);
> >>      g_autoptr(QIOChannelFile) fioc = NULL;
> >> +    g_autofree char *filename = g_strdup(filespec);
> >>      uint64_t offset = 0;
> >>      QIOChannel *ioc;
> >> +    int flags = O_CREAT | O_TRUNC | O_WRONLY;
> >> +    mode_t mode = 0660;
> >>  
> >>      trace_migration_file_outgoing(filename);
> >>  
> >> @@ -50,12 +105,15 @@ void file_start_outgoing_migration(MigrationState *s, const char *filespec,
> >>          return;
> >>      }
> >>  
> >> -    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
> >> -                                     0600, errp);
> 
> By the way, we're experimenting with add-fd to flesh out the interface
> with libvirt and I see that the flags here can conflict with the flags
> set on the fd passed through `virsh --pass-fd ...` due to this at
> monitor_fdset_dup_fd_add():
> 
>     if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
>         fd = mon_fdset_fd->fd;
>         break;
>     }
> 
> We're requiring the O_RDONLY, O_WRONLY, O_RDWR flags defined here to
> match the fdset passed into QEMU. Should we just sync the code of the
> two projects to use the same flags? That feels a little clumsy to me.

Is there a reason for libvirt to have set O_RDONLY for a file used
for outgoing migration ?  I can't recall off-hand.

If we document the required O_ACCMODE against the 'file' address
schema in QAPI, I don't think it'd be too painful for mgmt apps.
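
For reference, O_ACCMODE masks only the access-mode bits, so the check
quoted above compares just O_RDONLY/O_WRONLY/O_RDWR and ignores bits
such as O_CREAT, O_TRUNC or O_DIRECT; a tiny standalone illustration:

#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    int ours = O_CREAT | O_TRUNC | O_WRONLY;  /* flags QEMU requests */
    int fdset = O_WRONLY;                     /* flags on the passed fd */

    /* only the access mode is compared, so this prints "match: 1" */
    printf("match: %d\n", (ours & O_ACCMODE) == (fdset & O_ACCMODE));
    return 0;
}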


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-25  8:44       ` Daniel P. Berrangé
@ 2023-10-25 14:32         ` Fabiano Rosas
  2023-10-25 14:43           ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 14:32 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Oct 24, 2023 at 04:32:10PM -0300, Fabiano Rosas wrote:
>> Markus Armbruster <armbru@redhat.com> writes:
>> 
>> > Fabiano Rosas <farosas@suse.de> writes:
>> >
>> >> Add the direct-io migration parameter that tells the migration code to
>> >> use O_DIRECT when opening the migration stream file whenever possible.
>> >>
>> >> This is currently only used for the secondary channels of fixed-ram
>> >> migration, which can guarantee that writes are page aligned.
>> >>
>> >> However the parameter could be made to affect other types of
>> >> file-based migrations in the future.
>> >>
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >
>> > When would you want to enable @direct-io, and when would you want to
>> > leave it disabled?
>> 
>> That depends on a performance analysis. You'd generally leave it
>> disabled unless there's some indication that the operating system is
>> having trouble draining the page cache.
>
> That's not the usage model I would suggest.
>

Hehe I took a shot at answering but I really wanted to say "ask Daniel".

> The biggest value of the page cache comes when it holds data that
> will be repeatedly accessed.
>
> When you are saving/restoring a guest to file, that data is used
> once only (assuming there's a large gap between save & restore).
> By using the page cache to save a big guest we essentially purge
> the page cache of most of its existing data that is likely to be
> reaccessed, to fill it up with data never to be reaccessed.
>
> I usually describe save/restore operations as trashing the page
> cache.
>
> IMHO, mgmt apps should request O_DIRECT always unless they expect
> the save/restore operation to run in quick succession, or if they
> know that the host has oodles of free RAM such that existing data
> in the page cache won't be trashed, or

Thanks, I'll try to incorporate this to some kind of doc in the next
version.

> if the host FS does not support O_DIRECT of course.

Should we try to probe for this and inform the user?


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-25 14:32         ` Fabiano Rosas
@ 2023-10-25 14:43           ` Daniel P. Berrangé
  2023-10-25 17:30             ` Fabiano Rosas
  2023-10-30 22:51             ` Fabiano Rosas
  0 siblings, 2 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 14:43 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

On Wed, Oct 25, 2023 at 11:32:00AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Tue, Oct 24, 2023 at 04:32:10PM -0300, Fabiano Rosas wrote:
> >> Markus Armbruster <armbru@redhat.com> writes:
> >> 
> >> > Fabiano Rosas <farosas@suse.de> writes:
> >> >
> >> >> Add the direct-io migration parameter that tells the migration code to
> >> >> use O_DIRECT when opening the migration stream file whenever possible.
> >> >>
> >> >> This is currently only used for the secondary channels of fixed-ram
> >> >> migration, which can guarantee that writes are page aligned.
> >> >>
> >> >> However the parameter could be made to affect other types of
> >> >> file-based migrations in the future.
> >> >>
> >> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> >
> >> > When would you want to enable @direct-io, and when would you want to
> >> > leave it disabled?
> >> 
> >> That depends on a performance analysis. You'd generally leave it
> >> disabled unless there's some indication that the operating system is
> >> having trouble draining the page cache.
> >
> > That's not the usage model I would suggest.
> >
> 
> Hehe I took a shot at answering but I really wanted to say "ask Daniel".
> 
> > The biggest value of the page cache comes when it holds data that
> > will be repeatedly accessed.
> >
> > When you are saving/restoring a guest to file, that data is used
> > once only (assuming there's a large gap between save & restore).
> > By using the page cache to save a big guest we essentially purge
> > the page cache of most of its existing data that is likely to be
> > reaccessed, to fill it up with data never to be reaccessed.
> >
> > I usually describe save/restore operations as trashing the page
> > cache.
> >
> > IMHO, mgmt apps should request O_DIRECT always unless they expect
> > the save/restore operation to run in quick succession, or if they
> > know that the host has oodles of free RAM such that existing data
> > in the page cache won't be trashed, or
> 
> Thanks, I'll try to incorporate this to some kind of doc in the next
> version.
> 
> > if the host FS does not support O_DIRECT of course.
> 
> Should we try to probe for this and inform the user?

qemu_open_internal() will already produce a nice error message. If it gets
EINVAL when using O_DIRECT, it'll retry without O_DIRECT and if that
works, it'll report "filesystem does not support O_DIRECT"
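
A simplified sketch of that diagnostic fallback (the real logic lives
in QEMU's qemu_open_internal(); error reporting is reduced to stderr
here, and the helper name is invented):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int open_direct_with_diagnosis(const char *path, int flags,
                                      mode_t mode)
{
    int fd = open(path, flags | O_DIRECT, mode);

    if (fd < 0 && errno == EINVAL) {
        /* retry without O_DIRECT purely to diagnose: if this works,
         * the filesystem simply doesn't support direct I/O */
        fd = open(path, flags, mode);
        if (fd >= 0) {
            close(fd);
            fprintf(stderr, "filesystem does not support O_DIRECT\n");
        }
        errno = EINVAL;
        return -1;
    }
    return fd;
}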

Having said that I see a problem with /dev/fdset handling, because
we're only validating O_ACCMODE and that excludes O_DIRECT.

If the mgmt apps passes an FD with O_DIRECT already set, then it
won't work for VMstate saving which is unaligned.

If the mgmt app passes an FD without O_DIRECT set, then we are
not setting O_DIRECT for the multifd RAM threads.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-25  9:07   ` Daniel P. Berrangé
@ 2023-10-25 14:48     ` Fabiano Rosas
  2023-10-25 15:22       ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 14:48 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:36:07PM -0300, Fabiano Rosas wrote:
>> Add the direct-io migration parameter that tells the migration code to
>> use O_DIRECT when opening the migration stream file whenever possible.
>> 
>> This is currently only used for the secondary channels of fixed-ram
>> migration, which can guarantee that writes are page aligned.
>
> When you say "secondary channels", I presume you're meaning that
> the bulk memory regions will be written with O_DIRECT, while
> general vmstate will use normal I/O on the main channel ?  If so,
> could we explain that a little more directly.

Yes, the main channel writes via QEMUFile, so no O_DIRECT. The channels
created via multifd_new_send_channel_create() have O_DIRECT enabled.

> Having a mixture of O_DIRECT and non-O_DIRECT I/O on the same
> file is a little bit of an unusual situation. It will work for
> us because we're writing to different regions of the file in
> each case.
>
> Still I wonder if it would be sane in the outgoing case to
> include a fsync() on the file in the main channel, to guarantee
> that the whole saved file is on-media at completion ? Or perhaps
> suggest in QAPI that mgmts might consider doing a fsync
> themselves ?

I think that should be present in QIOChannelFile in general. Not even
related to this series. I'll add it at qio_channel_file_close() unless
you object.



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-25 14:20       ` Daniel P. Berrangé
@ 2023-10-25 14:58         ` Peter Xu
  2023-10-25 15:25           ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-10-25 14:58 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Wed, Oct 25, 2023 at 03:20:16PM +0100, Daniel P. Berrangé wrote:
> On Wed, Oct 25, 2023 at 10:57:12AM -0300, Fabiano Rosas wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> > 
> > > On Mon, Oct 23, 2023 at 05:35:45PM -0300, Fabiano Rosas wrote:
> > >> Add a capability that allows the management layer to delegate to QEMU
> > >> the decision of whether to pause a VM and perform a non-live
> > >> migration. Depending on the type of migration being performed, this
> > >> could bring performance benefits.
> > >
> > > I'm not really see what problem this is solving.
> > >
> > 
> > Well, this is the fruit of your discussion with Peter Xu in the previous
> > version of the patch.
> > 
> > To recap: he thinks QEMU is doing useless work with file migrations
> > because they are always asynchronous. He thinks we should always pause
> > before doing fixed-ram migration. You said that libvirt would rather use
> > fixed-ram for a more broad set of savevm-style commands, so you'd rather
> > not always pause. I'm trying to cater to both of your wishes. This new
> > capability is the middle ground I came up with.
> > 
> > So fixed-ram would always pause the VM, because that is the primary
> > use-case, but libvirt would be allowed to say: don't pause this time.
> 
> If the VM is going to be powered off immediately after saving
> a snapshot then yes, you might as well pause it, but we can't
> assume that will be the case.  An equally common use case
> would be for saving periodic snapshots of a running VM. This
> should be transparent such that the VM remains running the
> whole time, except a narrow window at completion of RAM/state
> saving where we flip the disk snapshots, so they are in sync
> with the RAM snapshot.

Libvirt will still use fixed-ram for live snapshot purpose, especially for
Windows?  Then auto-pause may still be useful to identify that from what
Fabiano wants to achieve here (which is in reality, non-live)?

IIRC of previous discussion that was the major point that libvirt can still
leverage fixed-ram for a live case - since Windows lacks efficient live
snapshot (background-snapshot feature).

From that POV it sounds like auto-pause is a good knob for that.

> 
> IOW, save/restore to disk can imply paused, but snapshotting
> should not imply paused. So I don't see an unambiguous
> rationale that we should diverge when fixed-ram is set and
> auto-pause the VM.
> 
> > > Mgmt apps are perfectly capable of pausing the VM before issuing
> > > the migrate operation.
> > >
> > 
> > Right. But would QEMU be allowed to just assume that if a VM is paused
> > at the start of migration it can then go ahead and skip all dirty page
> > mechanisms?
> 
> Skipping dirty page tracking would imply that the mgmt app cannot
> resume CPUs without either letting the operation complete, or
> aborting it.
> 
> That is probably a reasonable assumption, as I can't come up with
> a use case for starting out paused and then later resuming, unless
> there was a scenario where you needed to synchronize something
> external with the start of migration.  Synchronizing storage though
> is something that happens at the end of migration instead.
> 
> > Without pausing, we're basically doing *live* migration into a static
> > file that will be kept on disk for who knows how long before being
> > restored on the other side. We could release the src QEMU resources (a
> > bit) earlier if we paused the VM beforehand.
> 
> Can we really release resources early ?  If the save operation fails
> right at the end, we want to be able to resume execution of CPUs,
> which assumes all resources are still available, otherwise we have
> a failure scenario where we've not successfully saved to disk and
> also don't still have the running QEMU.

Indeed we need to consider if the user starts the VM again during the
auto-pause-enabled migration.  A few options, one of which should allow
early freeing of resources.  Assuming auto-pause=on and migration started,
then:

  1) Allow the VM to start later

    1.a) Start dirty tracking right at this point

         I don't prefer this.  It would make everything transparent, but
         IMHO adds unnecessary complexity in maintaining the dirty
         tracking status.

    1.b) Fail the migration

         Can be a good option, IMHO, treating auto-pause as a promise from
         the user that the VM won't need to be running anymore.  If the VM
         starts, the promise is broken and the migration fails.

  2) Don't allow the VM to start later

         Can also be a good option.  In this case VM resources (I think
         mostly RAM) can be freed right after they are migrated.  If the
         user requests a VM start, fail the start instead of the migration
         itself.  The migration must succeed or the data is lost.

Thanks,

> 
> > We're basically talking about whether we want the VM to be usable in the
> > (hopefully) very short time between issuing the migration command and
> > the migration being finished. We might be splitting hairs here, but we
> > need some sort of consensus.
> 
> The time may not be very short for large VMs.
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support
  2023-10-25 14:23       ` Daniel P. Berrangé
@ 2023-10-25 15:00         ` Fabiano Rosas
  2023-10-25 15:26           ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 15:00 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Oct 25, 2023 at 11:12:38AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Mon, Oct 23, 2023 at 05:35:58PM -0300, Fabiano Rosas wrote:
>> >> Allow multifd to open file-backed channels. This will be used when
>> >> enabling the fixed-ram migration stream format which expects a
>> >> seekable transport.
>> >> 
>> >> The QIOChannel read and write methods will use the preadv/pwritev
>> >> versions which don't update the file offset at each call so we can
>> >> reuse the fd without re-opening for every channel.
>> >> 
>> >> Note that this is just setup code and multifd cannot yet make use of
>> >> the file channels.
>> >> 
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >> ---
>> >>  migration/file.c      | 64 +++++++++++++++++++++++++++++++++++++++++--
>> >>  migration/file.h      | 10 +++++--
>> >>  migration/migration.c |  2 +-
>> >>  migration/multifd.c   | 14 ++++++++--
>> >>  migration/options.c   |  7 +++++
>> >>  migration/options.h   |  1 +
>> >>  6 files changed, 90 insertions(+), 8 deletions(-)
>> >> 
>> >> diff --git a/migration/file.c b/migration/file.c
>> >> index cf5b1bf365..93b9b7bf5d 100644
>> >> --- a/migration/file.c
>> >> +++ b/migration/file.c
>> >> @@ -17,6 +17,12 @@
>> >
>> >> +void file_send_channel_create(QIOTaskFunc f, void *data)
>> >> +{
>> >> +    QIOChannelFile *ioc;
>> >> +    QIOTask *task;
>> >> +    Error *errp = NULL;
>> >> +
>> >> +    ioc = qio_channel_file_new_path(outgoing_args.fname,
>> >> +                                    outgoing_args.flags,
>> >> +                                    outgoing_args.mode, &errp);
>> >> +    if (!ioc) {
>> >> +        file_migration_cancel(errp);
>> >> +        return;
>> >> +    }
>> >> +
>> >> +    task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
>> >> +    qio_task_run_in_thread(task, qio_channel_file_connect_worker,
>> >> +                           (gpointer)data, NULL, NULL);
>> >> +}
>> >> +
>> >>  void file_start_outgoing_migration(MigrationState *s, const char *filespec,
>> >>                                     Error **errp)
>> >>  {
>> >> -    g_autofree char *filename = g_strdup(filespec);
>> >>      g_autoptr(QIOChannelFile) fioc = NULL;
>> >> +    g_autofree char *filename = g_strdup(filespec);
>> >>      uint64_t offset = 0;
>> >>      QIOChannel *ioc;
>> >> +    int flags = O_CREAT | O_TRUNC | O_WRONLY;
>> >> +    mode_t mode = 0660;
>> >>  
>> >>      trace_migration_file_outgoing(filename);
>> >>  
>> >> @@ -50,12 +105,15 @@ void file_start_outgoing_migration(MigrationState *s, const char *filespec,
>> >>          return;
>> >>      }
>> >>  
>> >> -    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
>> >> -                                     0600, errp);
>> 
>> By the way, we're experimenting with add-fd to flesh out the interface
>> with libvirt and I see that the flags here can conflict with the flags
>> set on the fd passed through `virsh --pass-fd ...` due to this at
>> monitor_fdset_dup_fd_add():
>> 
>>     if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
>>         fd = mon_fdset_fd->fd;
>>         break;
>>     }
>> 
>> We're requiring the O_RDONLY, O_WRONLY, O_RDWR flags defined here to
>> match the fdset passed into QEMU. Should we just sync the code of the
>> two projects to use the same flags? That feels a little clumsy to me.
>
> Is there a reason for libvirt to have set O_RDONLY for a file used
> for outgoing migration ?  I can't recall off-hand.
>

The flags need to match exactly, so if either libvirt or QEMU decided in
the future to use O_RDWR, we'd have a compatibility problem when passing
the fds around.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-25 14:48     ` Fabiano Rosas
@ 2023-10-25 15:22       ` Daniel P. Berrangé
  0 siblings, 0 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 15:22 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Wed, Oct 25, 2023 at 11:48:08AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Mon, Oct 23, 2023 at 05:36:07PM -0300, Fabiano Rosas wrote:
> >> Add the direct-io migration parameter that tells the migration code to
> >> use O_DIRECT when opening the migration stream file whenever possible.
> >> 
> >> This is currently only used for the secondary channels of fixed-ram
> >> migration, which can guarantee that writes are page aligned.
> >
> > When you say "secondary channels", I presume you're meaning that
> > the bulk memory regions will be written with O_DIRECT, while
> > general vmstate will use normal I/O on the main channel ?  If so,
> > could we explain that a little more directly.
> 
> Yes, the main channel writes via QEMUFile, so no O_DIRECT. The channels
> created via multifd_new_send_channel_create() have O_DIRECT enabled.
> 
> > Having a mixture of O_DIRECT and non-O_DIRECT I/O on the same
> > file is a little bit of an unusual situation. It will work for
> > us because we're writing to different regions of the file in
> > each case.
> >
> > Still I wonder if it would be sane in the outgoing case to
> > include a fsync() on the file in the main channel, to guarantee
> > that the whole saved file is on-media at completion ? Or perhaps
> > suggest in QAPI that mgmts might consider doing a fsync
> > themselves ?
> 
> I think that should be present in QIOChannelFile in general. Not even
> related to this series. I'll add it at qio_channel_file_close() unless
> you object.

Yes, looking at the places QIOChannelFile is used, I think they would
all benefit from fdatasync().  fsync() is probably too big of a hammer
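
A minimal sketch of what that close path could look like, assuming the
change lands in the file channel's close as discussed (function name
invented):

#include <stdio.h>
#include <unistd.h>

static int file_channel_close(int fd)
{
    /* force written data (and the metadata needed to reach it) to
     * media before the save is reported complete; fdatasync() skips
     * the extra inode-metadata flush a full fsync() would do */
    if (fdatasync(fd) < 0) {
        perror("fdatasync");
        close(fd);
        return -1;
    }
    return close(fd);
}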

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-25 14:58         ` Peter Xu
@ 2023-10-25 15:25           ` Daniel P. Berrangé
  2023-10-25 15:36             ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 15:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Wed, Oct 25, 2023 at 10:58:16AM -0400, Peter Xu wrote:
> On Wed, Oct 25, 2023 at 03:20:16PM +0100, Daniel P. Berrangé wrote:
> > On Wed, Oct 25, 2023 at 10:57:12AM -0300, Fabiano Rosas wrote:
> > > Daniel P. Berrangé <berrange@redhat.com> writes:
> > > 
> > > > On Mon, Oct 23, 2023 at 05:35:45PM -0300, Fabiano Rosas wrote:
> > > >> Add a capability that allows the management layer to delegate to QEMU
> > > >> the decision of whether to pause a VM and perform a non-live
> > > >> migration. Depending on the type of migration being performed, this
> > > >> could bring performance benefits.
> > > >
> > > > I'm not really see what problem this is solving.
> > > >
> > > 
> > > Well, this is the fruit of your discussion with Peter Xu in the previous
> > > version of the patch.
> > > 
> > > To recap: he thinks QEMU is doing useless work with file migrations
> > > because they are always asynchronous. He thinks we should always pause
> > > before doing fixed-ram migration. You said that libvirt would rather use
> > > fixed-ram for a more broad set of savevm-style commands, so you'd rather
> > > not always pause. I'm trying to cater to both of your wishes. This new
> > > capability is the middle ground I came up with.
> > > 
> > > So fixed-ram would always pause the VM, because that is the primary
> > > use-case, but libvirt would be allowed to say: don't pause this time.
> > 
> > If the VM is going to be powered off immediately after saving
> > a snapshot then yes, you might as well pause it, but we can't
> > assume that will be the case.  An equally common use case
> > would be for saving periodic snapshots of a running VM. This
> > should be transparent such that the VM remains running the
> > whole time, except a narrow window at completion of RAM/state
> > saving where we flip the disk snapshots, so they are in sync
> > with the RAM snapshot.
> 
> Libvirt will still use fixed-ram for live snapshot purpose, especially for
> Windows?  Then auto-pause may still be useful to identify that from what
> Fabiano wants to achieve here (which is in reality, non-live)?
> 
> IIRC of previous discussion that was the major point that libvirt can still
> leverage fixed-ram for a live case - since Windows lacks efficient live
> snapshot (background-snapshot feature).

Libvirt will use fixed-ram for all APIs it has that involve saving to
disk, with CPUs both running and paused.

> From that POV it sounds like auto-pause is a good knob for that.

From libvirt's POV auto-pause will create extra work for integration
for no gain.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support
  2023-10-25 15:00         ` Fabiano Rosas
@ 2023-10-25 15:26           ` Daniel P. Berrangé
  0 siblings, 0 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 15:26 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
	Claudio Fontana

On Wed, Oct 25, 2023 at 12:00:12PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Wed, Oct 25, 2023 at 11:12:38AM -0300, Fabiano Rosas wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >> 
> >> > On Mon, Oct 23, 2023 at 05:35:58PM -0300, Fabiano Rosas wrote:
> >> >> Allow multifd to open file-backed channels. This will be used when
> >> >> enabling the fixed-ram migration stream format which expects a
> >> >> seekable transport.
> >> >> 
> >> >> The QIOChannel read and write methods will use the preadv/pwritev
> >> >> versions which don't update the file offset at each call so we can
> >> >> reuse the fd without re-opening for every channel.
> >> >> 
> >> >> Note that this is just setup code and multifd cannot yet make use of
> >> >> the file channels.
> >> >> 
> >> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> >> ---
> >> >>  migration/file.c      | 64 +++++++++++++++++++++++++++++++++++++++++--
> >> >>  migration/file.h      | 10 +++++--
> >> >>  migration/migration.c |  2 +-
> >> >>  migration/multifd.c   | 14 ++++++++--
> >> >>  migration/options.c   |  7 +++++
> >> >>  migration/options.h   |  1 +
> >> >>  6 files changed, 90 insertions(+), 8 deletions(-)
> >> >> 
> >> >> diff --git a/migration/file.c b/migration/file.c
> >> >> index cf5b1bf365..93b9b7bf5d 100644
> >> >> --- a/migration/file.c
> >> >> +++ b/migration/file.c
> >> >> @@ -17,6 +17,12 @@
> >> >
> >> >> +void file_send_channel_create(QIOTaskFunc f, void *data)
> >> >> +{
> >> >> +    QIOChannelFile *ioc;
> >> >> +    QIOTask *task;
> >> >> +    Error *errp = NULL;
> >> >> +
> >> >> +    ioc = qio_channel_file_new_path(outgoing_args.fname,
> >> >> +                                    outgoing_args.flags,
> >> >> +                                    outgoing_args.mode, &errp);
> >> >> +    if (!ioc) {
> >> >> +        file_migration_cancel(errp);
> >> >> +        return;
> >> >> +    }
> >> >> +
> >> >> +    task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
> >> >> +    qio_task_run_in_thread(task, qio_channel_file_connect_worker,
> >> >> +                           (gpointer)data, NULL, NULL);
> >> >> +}
> >> >> +
> >> >>  void file_start_outgoing_migration(MigrationState *s, const char *filespec,
> >> >>                                     Error **errp)
> >> >>  {
> >> >> -    g_autofree char *filename = g_strdup(filespec);
> >> >>      g_autoptr(QIOChannelFile) fioc = NULL;
> >> >> +    g_autofree char *filename = g_strdup(filespec);
> >> >>      uint64_t offset = 0;
> >> >>      QIOChannel *ioc;
> >> >> +    int flags = O_CREAT | O_TRUNC | O_WRONLY;
> >> >> +    mode_t mode = 0660;
> >> >>  
> >> >>      trace_migration_file_outgoing(filename);
> >> >>  
> >> >> @@ -50,12 +105,15 @@ void file_start_outgoing_migration(MigrationState *s, const char *filespec,
> >> >>          return;
> >> >>      }
> >> >>  
> >> >> -    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
> >> >> -                                     0600, errp);
> >> 
> >> By the way, we're experimenting with add-fd to flesh out the interface
> >> with libvirt and I see that the flags here can conflict with the flags
> >> set on the fd passed through `virsh --pass-fd ...` due to this at
> >> monitor_fdset_dup_fd_add():
> >> 
> >>     if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
> >>         fd = mon_fdset_fd->fd;
> >>         break;
> >>     }
> >> 
> >> We're requiring the O_RDONLY, O_WRONLY, O_RDWR flags defined here to
> >> match the fdset passed into QEMU. Should we just sync the code of the
> >> two projects to use the same flags? That feels a little clumsy to me.
> >
> > Is there a reason for libvirt to have set O_RDONLY for a file used
> > for outgoing migration ?  I can't recall off-hand.
> >
> 
> The flags need to match exactly, so if either libvirt or QEMU decided in
> the future to use O_RDWR, we'd have a compatibility problem when passing
> the fds around.

The "safe" option would be to always open O_RDWR, even if it is
technically redundant for our current needs.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-25 15:25           ` Daniel P. Berrangé
@ 2023-10-25 15:36             ` Peter Xu
  2023-10-25 15:40               ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-10-25 15:36 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Wed, Oct 25, 2023 at 04:25:23PM +0100, Daniel P. Berrangé wrote:
> > Libvirt will still use fixed-ram for live snapshot purpose, especially for
> > Windows?  Then auto-pause may still be useful to identify that from what
> > Fabiano wants to achieve here (which is in reality, non-live)?
> > 
> > IIRC of previous discussion that was the major point that libvirt can still
> > leverage fixed-ram for a live case - since Windows lacks efficient live
> > snapshot (background-snapshot feature).
> 
> Libvirt will use fixed-ram for all APIs it has that involve saving to
> disk, with CPUs both running and paused.

There are still two scenarios.  How should we identify them, then?  For
sure we can always make it live, but QEMU needs that information to make it
efficient for non-live.

Considering the case when there's no auto-pause: Libvirt will still need
to know the scenario first and then decide whether to pause the VM before
migration or do nothing, am I right?

If so, can Libvirt replace that "pause VM" operation with setting
auto-pause=on here?  Again, the benefit is QEMU can benefit from it.

I think when pausing Libvirt can still receive an event, then it can
cooperate with state changes?  Meanwhile auto-pause=on will be set by
Libvirt too, so Libvirt will even have that expectation that QMP migrate
later on will pause the VM.

> 
> > From that POV it sounds like auto-pause is a good knob for that.
> 
> From libvirt's POV auto-pause will create extra work for integration
> for no gain.

Yes, I agree for Libvirt there's no gain, as the gain is on QEMU's side.
Could you elaborate on what the complexity is for Libvirt to support it?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-25 15:36             ` Peter Xu
@ 2023-10-25 15:40               ` Daniel P. Berrangé
  2023-10-25 17:20                 ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 15:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Wed, Oct 25, 2023 at 11:36:27AM -0400, Peter Xu wrote:
> On Wed, Oct 25, 2023 at 04:25:23PM +0100, Daniel P. Berrangé wrote:
> > > Libvirt will still use fixed-ram for live snapshot purpose, especially for
> > > Windows?  Then auto-pause may still be useful to identify that from what
> > > Fabiano wants to achieve here (which is in reality, non-live)?
> > > 
> > > IIRC of previous discussion that was the major point that libvirt can still
> > > leverage fixed-ram for a live case - since Windows lacks efficient live
> > > snapshot (background-snapshot feature).
> > 
> > Libvirt will use fixed-ram for all APIs it has that involve saving to
> > disk, with CPUs both running and paused.
> 
> There are still two scenarios.  How should we identify them, then?  For
> sure we can always make it live, but QEMU needs that information to make it
> efficient for non-live.
> 
> Considering the case when there's no auto-pause: Libvirt will still need
> to know the scenario first and then decide whether to pause the VM before
> migration or do nothing, am I right?

libvirt will issue a 'stop' before invoking 'migrate' if it
needs to. QEMU should be able to optimize that scenario if
it sees CPUs already stopped when migrate is started ?

> If so, can Libvirt replace that "pause VM" operation with setting
> auto-pause=on here?  Again, the benefit is QEMU can benefit from it.
> 
> I think when pausing Libvirt can still receive an event, then it can
> cooperate with state changes?  Meanwhile auto-pause=on will be set by
> Libvirt too, so Libvirt will even have that expectation that QMP migrate
> later on will pause the VM.
> 
> > 
> > > From that POV it sounds like auto-pause is a good knob for that.
> > 
> > From libvirt's POV auto-pause will create extra work for integration
> > for no gain.
> 
> Yes, I agree for Libvirt there's no gain, as the gain is on QEMU's side.
> Could you elaborate on what the complexity is for Libvirt to support it?

It increases the code paths because we will have to support
and test different behaviour wrt CPU state for fixed-ram
vs non-fixed ram usage.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-25 15:40               ` Daniel P. Berrangé
@ 2023-10-25 17:20                 ` Peter Xu
  2023-10-25 17:31                   ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-10-25 17:20 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Wed, Oct 25, 2023 at 04:40:52PM +0100, Daniel P. Berrangé wrote:
> On Wed, Oct 25, 2023 at 11:36:27AM -0400, Peter Xu wrote:
> > On Wed, Oct 25, 2023 at 04:25:23PM +0100, Daniel P. Berrangé wrote:
> > > > Libvirt will still use fixed-ram for live snapshot purpose, especially for
> > > > Windows?  Then auto-pause may still be useful to identify that from what
> > > > Fabiano wants to achieve here (which is in reality, non-live)?
> > > > 
> > > > IIRC of previous discussion that was the major point that libvirt can still
> > > > leverage fixed-ram for a live case - since Windows lacks efficient live
> > > > snapshot (background-snapshot feature).
> > > 
> > > Libvirt will use fixed-ram for all APIs it has that involve saving to
> > > disk, with CPUs both running and paused.
> > 
> > There are still two scenarios.  How should we identify them, then?  For
> > sure we can always make it live, but QEMU needs that information to make it
> > efficient for non-live.
> > 
> > Considering the case when there's no auto-pause: Libvirt will still
> > need to know the scenario first and then decide whether to pause the
> > VM before migration or do nothing, am I right?
> 
> libvirt will issue a 'stop' before invoking 'migrate' if it
> needs to. QEMU should be able to optimize that scenario if
> it sees CPUs already stopped when migrate is started ?
> 
> > If so, can Libvirt replace that "pause VM" operation with setting
> > auto-pause=on here?  Again, the benefit is QEMU can benefit from it.
> > 
> > I think when pausing Libvirt can still receive an event, then it can
> > cooperate with state changes?  Meanwhile auto-pause=on will be set by
> > Libvirt too, so Libvirt will even have that expectation that QMP migrate
> > later on will pause the VM.
> > 
> > > 
> > > > From that POV it sounds like auto-pause is a good knob for that.
> > > 
> > > From libvirt's POV auto-pause will create extra work for integration
> > > for no gain.
> > 
> > Yes, I agree for Libvirt there's no gain, as the gain is on QEMU's side.
> > Could you elaborate what is the complexity for Libvirt to support it?
> 
> It increases the code paths because we will have to support
> and test different behaviour wrt CPU state for fixed-ram
> vs non-fixed ram usage.

To me if the user scenario is different, it makes sense to have a flag
showing what the user wants to do.

Guessing that from "whether VM is running or not" could work in many cases
but not all.

It means at least for dirty tracking, we only have one option to make it
fully transparent: starting dirty tracking when the VM starts during such
a migration.  The complexity moves from Libvirt into migration / KVM in
this respect.

Meanwhile we lose some other potential optimizations for good: early
release of resources will never be possible anymore, because they need to
be prepared to be reused very soon even if we know they never will be.
But maybe that's not a major concern.

No strong opinion from my side.  I'll leave it to Fabiano.  I didn't see
any further optimization yet with the new cap in this series.  I think the
trick is that the current extra overheads are just not high enough for us
to care about, even if we know some work is pure overhead.  Then indeed we
can postpone the optimizations until they're justified as worthwhile.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-25 14:43           ` Daniel P. Berrangé
@ 2023-10-25 17:30             ` Fabiano Rosas
  2023-10-25 17:45               ` Daniel P. Berrangé
  2023-10-30 22:51             ` Fabiano Rosas
  1 sibling, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 17:30 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Oct 25, 2023 at 11:32:00AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Tue, Oct 24, 2023 at 04:32:10PM -0300, Fabiano Rosas wrote:
>> >> Markus Armbruster <armbru@redhat.com> writes:
>> >> 
>> >> > Fabiano Rosas <farosas@suse.de> writes:
>> >> >
>> >> >> Add the direct-io migration parameter that tells the migration code to
>> >> >> use O_DIRECT when opening the migration stream file whenever possible.
>> >> >>
>> >> >> This is currently only used for the secondary channels of fixed-ram
>> >> >> migration, which can guarantee that writes are page aligned.
>> >> >>
>> >> >> However the parameter could be made to affect other types of
>> >> >> file-based migrations in the future.
>> >> >>
>> >> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >> >
>> >> > When would you want to enable @direct-io, and when would you want to
>> >> > leave it disabled?
>> >> 
>> >> That depends on a performance analysis. You'd generally leave it
>> >> disabled unless there's some indication that the operating system is
>> >> having trouble draining the page cache.
>> >
>> > That's not the usage model I would suggest.
>> >
>> 
>> Hehe I took a shot at answering but I really wanted to say "ask Daniel".
>> 
>> > The biggest value of the page cache comes when it holds data that
>> > will be repeatedly accessed.
>> >
>> > When you are saving/restoring a guest to file, that data is used
>> > once only (assuming there's a large gap between save & restore).
>> > By using the page cache to save a big guest we essentially purge
>> > the page cache of most of its existing data that is likely to be
>> > reaccessed, to fill it up with data never to be reaccessed.
>> >
>> > I usually describe save/restore operations as trashing the page
>> > cache.
>> >
>> > IMHO, mgmt apps should request O_DIRECT always unless they expect
>> > the save/restore operation to run in quick succession, or if they
>> > know that the host has oodles of free RAM such that existing data
>> > in the page cache won't be trashed, or
>> 
>> Thanks, I'll try to incorporate this to some kind of doc in the next
>> version.
>> 
>> > if the host FS does not support O_DIRECT of course.
>> 
>> Should we try to probe for this and inform the user?
>
> qemu_open_internal() will already produce a nice error message. If it gets
> EINVAL when using O_DIRECT, it'll retry without O_DIRECT and if that
> works, it'll report "filesystem does not support O_DIRECT"
>
> Having said that I see a problem with /dev/fdset handling, because
> we're only validating O_ACCMODE and that excludes O_DIRECT.
>
> If the mgmt apps passes an FD with O_DIRECT already set, then it
> won't work for VMstate saving which is unaligned.
>
> If the mgmt app passes an FD without O_DIRECT set, then we are
> not setting O_DIRECT for the multifd RAM threads.

Worse, the fds get dup'ed so even without O_DIRECT, if we enable it for
the secondary channels the main channel will break on unaligned writes.

For now I can only think of requiring two fds. One for the main channel
and a second one for the rest of the channels. And validating the fd
flags to make sure O_DIRECT is only allowed to be set in the second fd.
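
A rough sketch of that validation, with invented names; the assumption
is that the main channel fd must stay cache-backed because vmstate
writes are unaligned, while the multifd fd may carry O_DIRECT:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>

static int validate_migration_fds(int main_fd, int multifd_fd)
{
    int main_flags = fcntl(main_fd, F_GETFL);

    if (main_flags < 0) {
        perror("fcntl");
        return -1;
    }
    if (main_flags & O_DIRECT) {
        /* vmstate on the main channel is written unaligned, which
         * O_DIRECT cannot accept */
        fprintf(stderr, "main channel fd must not have O_DIRECT set\n");
        return -1;
    }
    /* the second fd may have O_DIRECT or not: RAM pages written
     * through it are page aligned either way */
    (void)multifd_fd;
    return 0;
}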


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-25 17:20                 ` Peter Xu
@ 2023-10-25 17:31                   ` Daniel P. Berrangé
  2023-10-25 19:28                     ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 17:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Wed, Oct 25, 2023 at 01:20:52PM -0400, Peter Xu wrote:
> On Wed, Oct 25, 2023 at 04:40:52PM +0100, Daniel P. Berrangé wrote:
> > On Wed, Oct 25, 2023 at 11:36:27AM -0400, Peter Xu wrote:
> > > On Wed, Oct 25, 2023 at 04:25:23PM +0100, Daniel P. Berrangé wrote:
> > > > > Libvirt will still use fixed-ram for live snapshot purpose, especially for
> > > > > Windows?  Then auto-pause may still be useful to identify that from what
> > > > > Fabiano wants to achieve here (which is in reality, non-live)?
> > > > > 
> > > > > IIRC of previous discussion that was the major point that libvirt can still
> > > > > leverage fixed-ram for a live case - since Windows lacks efficient live
> > > > > snapshot (background-snapshot feature).
> > > > 
> > > > Libvirt will use fixed-ram for all APIs it has that involve saving to
> > > > disk, with CPUs both running and paused.
> > > 
> > > There are still two scenarios.  How should we identify them, then?  For
> > > sure we can always make it live, but QEMU needs that information to make it
> > > efficient for non-live.
> > > 
> > > Considering the case when there's no auto-pause: Libvirt will still
> > > need to know the scenario first and then decide whether to pause the
> > > VM before migration or do nothing, am I right?
> > 
> > libvirt will issue a 'stop' before invoking 'migrate' if it
> > needs to. QEMU should be able to optimize that scenario if
> > it sees CPUs already stopped when migrate is started ?
> > 
> > > If so, can Libvirt replace that "pause VM" operation with setting
> > > auto-pause=on here?  Again, the benefit is QEMU can benefit from it.
> > > 
> > > I think when pausing Libvirt can still receive an event, then it can
> > > cooperate with state changes?  Meanwhile auto-pause=on will be set by
> > > Libvirt too, so Libvirt will even have that expectation that QMP migrate
> > > later on will pause the VM.
> > > 
> > > > 
> > > > > From that POV it sounds like auto-pause is a good knob for that.
> > > > 
> > > > From libvirt's POV auto-pause will create extra work for integration
> > > > for no gain.
> > > 
> > > Yes, I agree for Libvirt there's no gain, as the gain is on QEMU's side.
> > Could you elaborate on what the complexity is for Libvirt to support it?
> > 
> > It increases the code paths because we will have to support
> > and test different behaviour wrt CPU state for fixed-ram
> > vs non-fixed ram usage.
> 
> To me if the user scenario is different, it makes sense to have a flag
> showing what the user wants to do.
> 
> Guessing that from "whether VM is running or not" could work in many cases
> but not all.
> 
> It means at least for dirty tracking, we only have one option to make it
> fully transparent: starting dirty tracking when the VM starts during such
> a migration.  The complexity moves from Libvirt into migration / KVM in
> this respect.

Even with auto-pause we can't skip dirty tracking, as we don't
guarantee the app won't run 'cont' at some point.

We could have an explicit capability 'dirty-tracking' which an app
could set as an explicit "promise" that it won't ever need to
(re)start CPUs while migration is running.   If dirty-tracking==no,
then any attempt to run 'cont' should return a hard error while
migration is running.
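
As a sketch of that promise, with invented names, the 'cont' path
would gain a guard along these lines:

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool migration_active;
    bool dirty_tracking;    /* capability as promised by the mgmt app */
} MigState;

static int vm_cont(MigState *s)
{
    if (s->migration_active && !s->dirty_tracking) {
        /* restarting CPUs without dirty tracking would silently
         * invalidate the snapshot being written, so fail hard */
        fprintf(stderr, "cannot resume CPUs: migration running with "
                        "dirty-tracking disabled\n");
        return -1;
    }
    /* ... resume CPUs ... */
    return 0;
}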

> Meanwhile we lose some other potential optimizations for good: early
> release of resources will never be possible anymore, because they need to
> be prepared to be reused very soon even if we know they never will be.
> But maybe that's not a major concern.

What resources can we release early, without harming our ability to
restart the current QEMU on failure ?  

> No strong opinion from my side.  I'll leave it to Fabiano.  I didn't see
> any further optimization yet with the new cap in this series.  I think the
> trick is that the current extra overheads are just not high enough for us
> to care about, even if we know some work is pure overhead.  Then indeed we
> can postpone the optimizations until they're justified as worthwhile.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-25 17:30             ` Fabiano Rosas
@ 2023-10-25 17:45               ` Daniel P. Berrangé
  2023-10-25 18:10                 ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-25 17:45 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

On Wed, Oct 25, 2023 at 02:30:01PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Wed, Oct 25, 2023 at 11:32:00AM -0300, Fabiano Rosas wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >> 
> >> > On Tue, Oct 24, 2023 at 04:32:10PM -0300, Fabiano Rosas wrote:
> >> >> Markus Armbruster <armbru@redhat.com> writes:
> >> >> 
> >> >> > Fabiano Rosas <farosas@suse.de> writes:
> >> >> >
> >> >> >> Add the direct-io migration parameter that tells the migration code to
> >> >> >> use O_DIRECT when opening the migration stream file whenever possible.
> >> >> >>
> >> >> >> This is currently only used for the secondary channels of fixed-ram
> >> >> >> migration, which can guarantee that writes are page aligned.
> >> >> >>
> >> >> >> However the parameter could be made to affect other types of
> >> >> >> file-based migrations in the future.
> >> >> >>
> >> >> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> >> >
> >> >> > When would you want to enable @direct-io, and when would you want to
> >> >> > leave it disabled?
> >> >> 
> >> >> That depends on a performance analysis. You'd generally leave it
> >> >> disabled unless there's some indication that the operating system is
> >> >> having trouble draining the page cache.
> >> >
> >> > That's not the usage model I would suggest.
> >> >
> >> 
> >> Hehe I took a shot at answering but I really wanted to say "ask Daniel".
> >> 
> >> > The biggest value of the page cache comes when it holds data that
> >> > will be repeatedly accessed.
> >> >
> >> > When you are saving/restoring a guest to file, that data is used
> >> > once only (assuming there's a large gap between save & restore).
> >> > By using the page cache to save a big guest we essentially purge
> >> > the page cache of most of its existing data that is likely to be
> >> > reaccessed, to fill it up with data never to be reaccessed.
> >> >
> >> > I usually describe save/restore operations as trashing the page
> >> > cache.
> >> >
> >> > IMHO, mgmt apps should request O_DIRECT always unless they expect
> >> > the save/restore operation to run in quick succession, or if they
> >> > know that the host has oodles of free RAM such that existing data
> >> > in the page cache won't be trashed, or
> >> 
> >> Thanks, I'll try to incorporate this to some kind of doc in the next
> >> version.
> >> 
> >> > if the host FS does not support O_DIRECT of course.
> >> 
> >> Should we try to probe for this and inform the user?
> >
> > qemu_open_internal will already produce a nice error message. If it gets
> > EINVAL when using O_DIRECT, it'll retry without O_DIRECT and, if that
> > works, it'll report "filesystem does not support O_DIRECT"
> >
> > Having said that I see a problem with /dev/fdset handling, because
> > we're only validating O_ACCMODE and that excludes O_DIRECT.
> >
> > If the mgmt app passes an FD with O_DIRECT already set, then it
> > won't work for VMstate saving which is unaligned.
> >
> > If the mgmt app passes an FD without O_DIRECT set, then we are
> > not setting O_DIRECT for the multifd RAM threads.
> 
> Worse, the fds get dup'ed, so even without O_DIRECT, if we enable it for
> the secondary channels the main channel will break on unaligned writes.
> 
> For now I can only think of requiring two fds. One for the main channel
> and a second one for the rest of the channels. And validating the fd
> flags to make sure O_DIRECT is only allowed to be set in the second fd.

In this new model I think there's no reason for libvirt to set O_DIRECT
for its own initial I/O. So we could just totally ignore O_DIRECT when
initially opening the QIOChannelFile.

Then provide a method on QIOChannelFile to enable O_DIRECT on the fly
which can be called for the multifd threads setup ?

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-25 17:45               ` Daniel P. Berrangé
@ 2023-10-25 18:10                 ` Fabiano Rosas
  0 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-25 18:10 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Oct 25, 2023 at 02:30:01PM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Wed, Oct 25, 2023 at 11:32:00AM -0300, Fabiano Rosas wrote:
>> >> Daniel P. Berrangé <berrange@redhat.com> writes:
>> >> 
>> >> > On Tue, Oct 24, 2023 at 04:32:10PM -0300, Fabiano Rosas wrote:
>> >> >> Markus Armbruster <armbru@redhat.com> writes:
>> >> >> 
>> >> >> > Fabiano Rosas <farosas@suse.de> writes:
>> >> >> >
>> >> >> >> Add the direct-io migration parameter that tells the migration code to
>> >> >> >> use O_DIRECT when opening the migration stream file whenever possible.
>> >> >> >>
>> >> >> >> This is currently only used for the secondary channels of fixed-ram
>> >> >> >> migration, which can guarantee that writes are page aligned.
>> >> >> >>
>> >> >> >> However the parameter could be made to affect other types of
>> >> >> >> file-based migrations in the future.
>> >> >> >>
>> >> >> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >> >> >
>> >> >> > When would you want to enable @direct-io, and when would you want to
>> >> >> > leave it disabled?
>> >> >> 
>> >> >> That depends on a performance analysis. You'd generally leave it
>> >> >> disabled unless there's some indication that the operating system is
>> >> >> having trouble draining the page cache.
>> >> >
>> >> > That's not the usage model I would suggest.
>> >> >
>> >> 
>> >> Hehe I took a shot at answering but I really wanted to say "ask Daniel".
>> >> 
>> >> > The biggest value of the page cache comes when it holds data that
>> >> > will be repeatedly accessed.
>> >> >
>> >> > When you are saving/restoring a guest to file, that data is used
>> >> > once only (assuming there's a large gap between save & restore).
>> >> > By using the page cache to save a big guest we essentially purge
>> >> > the page cache of most of its existing data that is likely to be
>> >> > reaccessed, to fill it up with data never to be reaccessed.
>> >> >
>> >> > I usually describe save/restore operations as trashing the page
>> >> > cache.
>> >> >
>> >> > IMHO, mgmt apps should request O_DIRECT always unless they expect
>> >> > the save/restore operation to run in quick succession, or if they
>> >> > know that the host has oodles of free RAM such that existing data
>> >> > in the page cache won't be trashed, or
>> >> 
>> >> Thanks, I'll try to incorporate this to some kind of doc in the next
>> >> version.
>> >> 
>> >> > if the host FS does not support O_DIRECT of course.
>> >> 
>> >> Should we try to probe for this and inform the user?
>> >
>> > qemu_open_internal will already produce a nice error message. If it gets
>> > EINVAL when using O_DIRECT, it'll retry without O_DIRECT and, if that
>> > works, it'll report "filesystem does not support O_DIRECT"
>> >
>> > Having said that I see a problem with /dev/fdset handling, because
>> > we're only validating O_ACCMODE and that excludes O_DIRECT.
>> >
>> > If the mgmt app passes an FD with O_DIRECT already set, then it
>> > won't work for VMstate saving which is unaligned.
>> >
>> > If the mgmt app passes an FD without O_DIRECT set, then we are
>> > not setting O_DIRECT for the multifd RAM threads.
>> 
>> Worse, the fds get dup'ed, so even without O_DIRECT, if we enable it for
>> the secondary channels the main channel will break on unaligned writes.
>> 
>> For now I can only think of requiring two fds. One for the main channel
>> and a second one for the rest of the channels. And validating the fd
>> flags to make sure O_DIRECT is only allowed to be set in the second fd.
>
> In this new model I think there's no reason for libvirt to set O_DIRECT
> for its own initial I/O. So we could just totally ignore O_DIRECT when
> initially opening the QIOChannelFile.
>

Yes. I still have to disallow setting it on the main channel just to be
safe.

> Then provide a method on QIOChannelFile to enable O_DIRECT on the fly
> which can be called for the multifd threads setup ?

Sure, but there's not really an "on the fly" here; after
file_send_channel_create() returns, the channel should be ready to
use. It would go from:

 flag |= O_DIRECT;
 qio_channel_file_new_path(...);

to:

 qio_channel_file_new_path(...);
 qio_channel_file_set_direct_io();

Which could be cleaner since the migration code doesn't have to check
for O_DIRECT support.
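
For completeness, a minimal sketch of what that setter could look like,
assuming Linux fcntl() semantics (illustrative only, not actual QEMU
code):

  void qio_channel_file_set_direct_io(QIOChannelFile *ioc, bool enable,
                                      Error **errp)
  {
      int flags = fcntl(ioc->fd, F_GETFL);

      if (flags == -1) {
          error_setg_errno(errp, errno, "Unable to read file handle flags");
          return;
      }

      flags = enable ? (flags | O_DIRECT) : (flags & ~O_DIRECT);

      if (fcntl(ioc->fd, F_SETFL, flags) == -1) {
          error_setg_errno(errp, errno, "Unable to set O_DIRECT");
      }
  }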


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 06/29] migration: Add auto-pause capability
  2023-10-25 17:31                   ` Daniel P. Berrangé
@ 2023-10-25 19:28                     ` Peter Xu
  0 siblings, 0 replies; 128+ messages in thread
From: Peter Xu @ 2023-10-25 19:28 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Eric Blake

On Wed, Oct 25, 2023 at 06:31:53PM +0100, Daniel P. Berrangé wrote:
> On Wed, Oct 25, 2023 at 01:20:52PM -0400, Peter Xu wrote:
> > On Wed, Oct 25, 2023 at 04:40:52PM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Oct 25, 2023 at 11:36:27AM -0400, Peter Xu wrote:
> > > > On Wed, Oct 25, 2023 at 04:25:23PM +0100, Daniel P. Berrangé wrote:
> > > > > > Libvirt will still use fixed-ram for live snapshot purposes, especially for
> > > > > > Windows?  Then auto-pause may still be useful to distinguish that from what
> > > > > > Fabiano wants to achieve here (which is, in reality, non-live)?
> > > > > > 
> > > > > > IIRC from the previous discussion that was the major point: libvirt can
> > > > > > still leverage fixed-ram for a live case - since Windows lacks efficient
> > > > > > live snapshot (background-snapshot feature).
> > > > > 
> > > > > Libvirt will use fixed-ram for all APIs it has that involve saving to
> > > > > disk, with CPUs both running and paused.
> > > > 
> > > > There are still two scenarios.  How should we identify them, then?  For
> > > > sure we can always make it live, but QEMU needs that information to make it
> > > > efficient for non-live.
> > > > 
> > > > Considering the case where there's no auto-pause: Libvirt will still need
> > > > to know the scenario first to decide whether to pause the VM before
> > > > migration or do nothing, am I right?
> > > 
> > > libvirt will issue a 'stop' before invoking 'migrate' if it
> > > needs to. QEMU should be able to optimize that scenario if
> > > it sees CPUs already stopped when migrate is started ?
> > > 
> > > > If so, can Libvirt replace that "pause VM" operation with setting
> > > > auto-pause=on here?  Again, the point is that QEMU can benefit from it.
> > > > 
> > > > I think when pausing, Libvirt can still receive an event, so it can
> > > > cooperate with state changes?  Meanwhile auto-pause=on will be set by
> > > > Libvirt too, so Libvirt will even have the expectation that the QMP
> > > > migrate later on will pause the VM.
> > > > 
> > > > > 
> > > > > > From that POV it sounds like auto-pause is a good knob for that.
> > > > > 
> > > > > From libvirt's POV auto-pause will create extra work for integration
> > > > > for no gain.
> > > > 
> > > > Yes, I agree for Libvirt there's no gain, as the gain is on QEMU's side.
> > > > Could you elaborate on what the complexity is for Libvirt to support it?
> > > 
> > > It increases the code paths because we will have to support
> > > and test different behaviour wrt CPU state for fixed-ram
> > > vs non-fixed ram usage.
> > 
> > To me if the user scenario is different, it makes sense to have a flag
> > showing what the user wants to do.
> > 
> > Guessing that from "whether VM is running or not" could work in many cases
> > but not all.
> > 
> > It means that, at least for dirty tracking, we only have one option to
> > make it fully transparent: starting dirty tracking when the VM starts
> > during such a migration.  The complexity moves from Libvirt into
> > migration / kvm in this respect.
> 
> Even with auto-pause we can't skip dirty tracking, as we don't
> guarantee the app won't run 'cont' at some point.
> 
> We could have an explicit capability 'dirty-tracking' which an app
> could set as an explicit "promise" that it won't ever need to
> (re)start CPUs while migration is running.   If dirty-tracking==no,
> then any attempt to run 'cont' should return a hard error while
> migration is running.

I had some thoughts even before this series on disabling dirty
tracking, but so far I think it might be better to keep "dirty tracking"
hidden as an internal flag, decided by other migration caps/parameters.

For example, postcopy-only migration will not require dirty tracking in
any form.  But that can be a higher-level "postcopy-only" capability,
or an even higher concept than that, which would then set
dirty_tracking=false internally.

I tried to list our options in the previous email.  Quoting from that:

https://lore.kernel.org/qemu-devel/ZTktCM%2FccipYaJ80@x1n/

  1) Allow VM starts later

    1.a) Start dirty tracking right at this point

         I don't prefer this.  It will make everything transparent, but IMHO
         adds unnecessary complexity in maintaining dirty tracking status.

    1.b) Fail the migration

         Can be a good option, IMHO, treating auto-pause as a promise from
         the user that the VM won't need to be running anymore.  If the VM
         starts, the promise is broken and the migration fails.

  2) Doesn't allow VM starts later

         Can also be a good option.  In this case VM resources (mostly
         RAM, I think) can be freed right after they are migrated.  If the
         user requests a VM start, fail the start instead of the migration
         itself.  Migration must succeed or data is lost.

So indeed we can fail the migration already if auto-pause=on.
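
A rough sketch of option 1.b, assuming a vm change state handler
registered by the migration code (helper names are illustrative only,
not existing code):

  static void auto_pause_vm_state_change(void *opaque, bool running,
                                         RunState state)
  {
      MigrationState *s = opaque;

      if (running && migrate_auto_pause() && migration_is_active(s)) {
          /* the auto-pause promise was broken: fail the migration */
          migration_cancel(NULL);
      }
  }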

> 
> > Meanwhile we lose some other potential optimizations for good; early
> > release of resources will never be possible anymore because they need to
> > be prepared to be reused very soon, even if we know they never will be.
> > But maybe that's not a major concern.
> 
> What resources can we release early, without harming our ability to
> restart the current QEMU on failure ?  

Indeed we can't, if we always allow a restart.

I think releasing resources early may not be a major benefit here even
with the option, depending on whether it can make a difference in any of
the use cases.  I don't see much yet.

Consider release-ram for postcopy: that makes sense only because we'll
instantiate two QEMUs, so early release limits total memory
consumption, more or less, to ~1 VM only.  Here we have only one single VM
anyway, so it may not be a problem to release everything later.

However, I still think there is something QEMU can do if it knows for
sure the VM won't ever be restarted.  Omitting dirty tracking is one
example.  Another simple example, an extension of dirty tracking: consider
the case where a device doesn't support dirty tracking; normally it would
need to block live migration, but it would work if auto-pause=true,
because tracking is not needed.  But once such a migration starts, we
can only either fail the migration if the VM restarts, or reject the VM
restart request.  So this can be about more than the "dirty tracking
overhead" itself.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-25 14:43           ` Daniel P. Berrangé
  2023-10-25 17:30             ` Fabiano Rosas
@ 2023-10-30 22:51             ` Fabiano Rosas
  2023-10-31  9:03               ` Daniel P. Berrangé
  1 sibling, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-30 22:51 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Oct 25, 2023 at 11:32:00AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Tue, Oct 24, 2023 at 04:32:10PM -0300, Fabiano Rosas wrote:
>> >> Markus Armbruster <armbru@redhat.com> writes:
>> >> 
>> >> > Fabiano Rosas <farosas@suse.de> writes:
>> >> >
>> >> >> Add the direct-io migration parameter that tells the migration code to
>> >> >> use O_DIRECT when opening the migration stream file whenever possible.
>> >> >>
>> >> >> This is currently only used for the secondary channels of fixed-ram
>> >> >> migration, which can guarantee that writes are page aligned.
>> >> >>
>> >> >> However the parameter could be made to affect other types of
>> >> >> file-based migrations in the future.
>> >> >>
>> >> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >> >
>> >> > When would you want to enable @direct-io, and when would you want to
>> >> > leave it disabled?
>> >> 
>> >> That depends on a performance analysis. You'd generally leave it
>> >> disabled unless there's some indication that the operating system is
>> >> having trouble draining the page cache.
>> >
>> > That's not the usage model I would suggest.
>> >
>> 
>> Hehe I took a shot at answering but I really wanted to say "ask Daniel".
>> 
>> > The biggest value of the page cache comes when it holds data that
>> > will be repeatedly accessed.
>> >
>> > When you are saving/restoring a guest to file, that data is used
>> > once only (assuming there's a large gap between save & restore).
>> > By using the page cache to save a big guest we essentially purge
>> > the page cache of most of its existing data that is likely to be
>> > reaccessed, to fill it up with data never to be reaccessed.
>> >
>> > I usually describe save/restore operations as trashing the page
>> > cache.
>> >
>> > IMHO, mgmt apps should request O_DIRECT always unless they expect
>> > the save/restore operation to run in quick succession, or if they
>> > know that the host has oodles of free RAM such that existing data
>> > in the page cache won't be trashed, or
>> 
>> Thanks, I'll try to incorporate this to some kind of doc in the next
>> version.
>> 
>> > if the host FS does not support O_DIRECT of course.
>> 
>> Should we try to probe for this and inform the user?
>
> qemu_open_internal will already produce a nice error message. If it gets
> EINVAL when using O_DIRECT, it'll retry without O_DIRECT and, if that
> works, it'll report "filesystem does not support O_DIRECT"
>
> Having said that I see a problem with /dev/fdset handling, because
> we're only validating O_ACCMODE and that excludes O_DIRECT.
>
> If the mgmt app passes an FD with O_DIRECT already set, then it
> won't work for VMstate saving which is unaligned.
>
> If the mgmt app passes an FD without O_DIRECT set, then we are
> not setting O_DIRECT for the multifd RAM threads.

I could use some advice on how to solve this situation. The fdset code
at monitor/fds.c and the add-fd command don't seem to be usable outside
the original use-case of passing fds with different open flags.

There are several problems, the biggest one being that there's no way to
manipulate the set of file descriptors aside from asking for duplication
of an fd that matches a particular set of flags.

That doesn't work for us because the two fds we need (one for main
channel, other for secondary channels) will have the same open flags. So
the fdset code will always return the first one it finds in the set.

Another problem (or feature) of the fdset code is that, when
qmp_remove_fd() is called, it only removes an fd if the VM runstate is
RUNNING, which means that the migration code cannot reliably remove the
fds after use. We need to be able to remove them to make sure we use the
correct fds in a subsequent migration.

I see some paths forward:

1) Require the user to put the fds in separate fdsets.

  This would be the easiest to handle in the migration code, but we
  would have to come up with special file: URL syntax to accommodate more
  than one fdset. Perhaps "file:/dev/fdsets/1,2" ?

2) Require the two fds in the same fdset and separate them ourselves.

  This would require extending the fdset code to allow more ways of
  manipulating the fdset. There's two options here:

  a) Implement a pop() operation in the fdset code. We receive the
     fdset, pop one fd from it and put it in a new fdset. I did some
     experimentation with this by having an fd->present flag and just
     skipping the fd during query-fdsets and
     monitor_fdset_dup_fd_add(). It works, but it's convoluted.

  b) Add support for removing the original fd when given the dup()ed
     fd. The list of duplicated fds is currently by-fdset and not
     by-original-fd, so this would require a larger code change.

3) Design a whole new URI.

  Here, there are the usual benefits and drawbacks of doing something
  from scratch, with the added drawback of dissociating from the file:
  URI, which is already well tested and easy to use when doing QEMU-only
  migration.


With the three options above there's still the issue of removing the
fd. I think the original commit[1] might have been mistaken in adding
the runstate_is_running() check for *both* the "removed = true" clause
and the "fd was never duplicated" clause. But it's hard to tell since
this whole feature is a bit opaque to me.

1- ebe52b592d (monitor: Prevent removing fd from set during init)
https://gitlab.com/qemu-project/qemu/-/commit/ebe52b592d

All in all, I'm inclined to consider the first option, unless someone
has a better idea. Assuming we can figure out the removal issue, that
is.

Thoughts?


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-30 22:51             ` Fabiano Rosas
@ 2023-10-31  9:03               ` Daniel P. Berrangé
  2023-10-31 13:05                 ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-31  9:03 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

On Mon, Oct 30, 2023 at 07:51:34PM -0300, Fabiano Rosas wrote:
> I could use some advice on how to solve this situation. The fdset code
> at monitor/fds.c and the add-fd command don't seem to be usable outside
> the original use-case of passing fds with different open flags.
> 
> There are several problems, the biggest one being that there's no way to
> manipulate the set of file descriptors aside from asking for duplication
> of an fd that matches a particular set of flags.
> 
> That doesn't work for us because the two fds we need (one for main
> channel, other for secondary channels) will have the same open flags. So
> the fdset code will always return the first one it finds in the set.

QEMU may want multiple FDs *internally*, but IMHO that fact should
not be exposed to mgmt applications. It would be valid for a QEMU
impl to share the same FD across multiple threads, or have a different
FD for each thread. All threads are using pread/pwrite, so it is safe
for them to use the same FD if they desire. It is a private impl choice
for QEMU at any given point in time and could change over time.

Thus from the POV of the mgmt app, QEMU is writing to a single file, no
matter how many threads are involved & thus it should only need to supply
a single FD for that file. QEMU can then call 'dup()' on that FD as many
times as it desires, and use fcntl() to toggle O_DIRECT if and when
it needs to.

> Another problem (or feature) of the fdset code is that, when
> qmp_remove_fd() is called, it only removes an fd if the VM runstate is
> RUNNING, which means that the migration code cannot reliably remove the
> fds after use. We need to be able to remove them to make sure we use the
> correct fds in a subsequent migration.

The "easy" option is to just add a new API that does what you want.
Maybe during review someone will then point out why the original
API works the way it does.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-31  9:03               ` Daniel P. Berrangé
@ 2023-10-31 13:05                 ` Fabiano Rosas
  2023-10-31 13:45                   ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-31 13:05 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Oct 30, 2023 at 07:51:34PM -0300, Fabiano Rosas wrote:
>> I could use some advice on how to solve this situation. The fdset code
>> at monitor/fds.c and the add-fd command don't seem to be usable outside
>> the original use-case of passing fds with different open flags.
>> 
>> There are several problems, the biggest one being that there's no way to
>> manipulate the set of file descriptors aside from asking for duplication
>> of an fd that matches a particular set of flags.
>> 
>> That doesn't work for us because the two fds we need (one for main
>> channel, other for secondary channels) will have the same open flags. So
>> the fdset code will always return the first one it finds in the set.
>
> QEMU may want multiple FDs *internally*, but IMHO that fact should
> not be exposed to mgmt applications. It would be valid for a QEMU
> impl to share the same FD across multiple threads, or have a different
> FD for each thread. All threads are using pread/pwrite, so it is safe
> for them to use the same FD if they desire. It is a private impl choice
> for QEMU at any given point in time and could change over time.
>

Sure, I don't disagree. However up until last week we had a seemingly
usable "add-fd" command that allows the user to provide a *set of file
descriptors* to QEMU. It's just now that we're learning that interface
serves only a special use-case.

> Thus from the POV of the mgmt app, QEMU is writing to a single file, no
> matter how many threads are involved & thus it should only need to supply
> a single FD for that file. QEMU can then call 'dup()' on that FD as many
> times as it desires, and use fcntl() to toggle O_DIRECT if and when
> it needs to.

Ok, so I think the way to go here is for QEMU to receive a file + offset
instead of an FD. That way QEMU can have adequate control of the
resources for the migration. I don't remember why we went on the FD
tangent. Is it not acceptable for libvirt to provide the file name +
offset?

>> Another problem (or feature) of the fdset code is that, when
>> qmp_remove_fd() is called, it only removes an fd if the VM runstate is
>> RUNNING, which means that the migration code cannot reliably remove the
>> fds after use. We need to be able to remove them to make sure we use the
>> correct fds in a subsequent migration.
>
> The "easy" option is to just add a new API that does what you want.
> Maybe during review someone will then point out why the original
> API works the way it does.

Hehe so I'll add a qmp_actually_remove_fd() =)


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-31 13:05                 ` Fabiano Rosas
@ 2023-10-31 13:45                   ` Daniel P. Berrangé
  2023-10-31 14:33                     ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-31 13:45 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

On Tue, Oct 31, 2023 at 10:05:56AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Mon, Oct 30, 2023 at 07:51:34PM -0300, Fabiano Rosas wrote:
> >> I could use some advice on how to solve this situation. The fdset code
> >> at monitor/fds.c and the add-fd command don't seem to be usable outside
> >> the original use-case of passing fds with different open flags.
> >> 
> >> There are several problems, the biggest one being that there's no way to
> >> manipulate the set of file descriptors aside from asking for duplication
> >> of an fd that matches a particular set of flags.
> >> 
> >> That doesn't work for us because the two fds we need (one for main
> >> channel, other for secondary channels) will have the same open flags. So
> >> the fdset code will always return the first one it finds in the set.
> >
> > QEMU may want multiple FDs *internally*, but IMHO that fact should
> > not be exposed to mgmt applications. It would be valid for a QEMU
> > impl to share the same FD across multiple threads, or have a different
> > FD for each thread. All threads are using pread/pwrite, so it is safe
> > for them to use the same FD if they desire. It is a private impl choice
> > for QEMU at any given point in time and could change over time.
> >
> 
> Sure, I don't disagree. However up until last week we had a seemingly
> usable "add-fd" command that allows the user to provide a *set of file
> descriptors* to QEMU. It's just now that we're learning that interface
> serves only a special use-case.

AFAICT though we don't need add-fd to support passing many files
for our needs. Saving only requires a single FD. All others can
be opened by dup(), so the limitation of add-fd is irrelevant
surely ?

> 
> > Thus from the POV of the mgmt app, QEMU is writing to a single file, no
> > matter how many threads are involved & thus it should only need to supply
> > a single FD for that file. QEMU can then call 'dup()' on that FD as many
> > times as it desires, and use fcntl() to toggle O_DIRECT if and when
> > it needs to.
> 
> Ok, so I think the way to go here is for QEMU to receive a file + offset
> instead of an FD. That way QEMU can have adequate control of the
> resources for the migration. I don't remember why we went on the FD
> tangent. Is it not acceptable for libvirt to provide the file name +
> offset?

FD passing means QEMU does not need privileges to open the file
which could be useful.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-31 13:45                   ` Daniel P. Berrangé
@ 2023-10-31 14:33                     ` Fabiano Rosas
  2023-10-31 15:22                       ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-31 14:33 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Oct 31, 2023 at 10:05:56AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Mon, Oct 30, 2023 at 07:51:34PM -0300, Fabiano Rosas wrote:
>> >> I could use some advice on how to solve this situation. The fdset code
>> >> at monitor/fds.c and the add-fd command don't seem to be usable outside
>> >> the original use-case of passing fds with different open flags.
>> >> 
>> >> There are several problems, the biggest one being that there's no way to
>> >> manipulate the set of file descriptors aside from asking for duplication
>> >> of an fd that matches a particular set of flags.
>> >> 
>> >> That doesn't work for us because the two fds we need (one for main
>> >> channel, other for secondary channels) will have the same open flags. So
>> >> the fdset code will always return the first one it finds in the set.
>> >
>> > QEMU may want multiple FDs *internally*, but IMHO that fact should
>> > not be exposed to mgmt applications. It would be valid for a QEMU
>> > impl to share the same FD across multiple threads, or have a different
>> > FD for each thread. All threads are using pread/pwrite, so it is safe
>> > for them to use the same FD if they desire. It is a private impl choice
>> > for QEMU at any given point in time and could change over time.
>> >
>> 
>> Sure, I don't disagree. However up until last week we had a seemingly
>> usable "add-fd" command that allows the user to provide a *set of file
>> descriptors* to QEMU. It's just now that we're learning that interface
>> serves only a special use-case.
>
> AFAICT though we don't need add-fd to support passing many files
> for our needs. Saving only requires a single FD. All others can
> be opened by dup(), so the limitation of add-fd is irrelevant
> surely ?

Only once we decide to use one FD. If we had a generic add-fd backend,
then that's already a user-facing API, so the "implementation detail"
argument becomes weaker.

With a single FD we'll need to be very careful about what code is
allowed to run while the multifd channels are doing IO. Since O_DIRECT
is not widely supported, now we have to also be careful about someone
using that QEMUFile handle to do unaligned writes and not even noticing
that it breaks direct IO. None of this is unworkable, of course, I just
find the design way clearer with just the file name + offset.

>> > Thus from the POV of the mgmt app, QEMU is writing to a single file, no
>> > matter how many threads are involved & thus it should only need to supply
>> > a single FD for that file. QEMU can then call 'dup()' on that FD as many
>> > times as it desires, and use fcntl() to toggle O_DIRECT if and when
>> > it needs to.
>> 
>> Ok, so I think the way to go here is for QEMU to receive a file + offset
>> instead of an FD. That way QEMU can have adequate control of the
>> resources for the migration. I don't remember why we went on the FD
>> tangent. Is it not acceptable for libvirt to provide the file name +
>> offset?
>
> FD passing means QEMU does not need privileges to open the file
> which could be useful.

Ok, let me give this a try. Let's see how bad it is to juggle the flag
around the main channel.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-31 14:33                     ` Fabiano Rosas
@ 2023-10-31 15:22                       ` Daniel P. Berrangé
  2023-10-31 15:52                         ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-31 15:22 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

On Tue, Oct 31, 2023 at 11:33:24AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Tue, Oct 31, 2023 at 10:05:56AM -0300, Fabiano Rosas wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >> 
> >> > On Mon, Oct 30, 2023 at 07:51:34PM -0300, Fabiano Rosas wrote:
> >> >> I could use some advice on how to solve this situation. The fdset code
> >> >> at monitor/fds.c and the add-fd command don't seem to be usable outside
> >> >> the original use-case of passing fds with different open flags.
> >> >> 
> >> >> There are several problems, the biggest one being that there's no way to
> >> >> manipulate the set of file descriptors aside from asking for duplication
> >> >> of an fd that matches a particular set of flags.
> >> >> 
> >> >> That doesn't work for us because the two fds we need (one for main
> >> >> channel, other for secondary channels) will have the same open flags. So
> >> >> the fdset code will always return the first one it finds in the set.
> >> >
> >> > QEMU may want multiple FDs *internally*, but IMHO that fact should
> >> > not be exposed to mgmt applications. It would be valid for a QEMU
> >> > impl to share the same FD across multiple threads, or have a different
> >> > FD for each thread. All threads are using pread/pwrite, so it is safe
> >> > for them to use the same FD if they desire. It is a private impl choice
> >> > for QEMU at any given point in time and could change over time.
> >> >
> >> 
> >> Sure, I don't disagree. However up until last week we had a seemingly
> >> usable "add-fd" command that allows the user to provide a *set of file
> >> descriptors* to QEMU. It's just now that we're learning that interface
> >> serves only a special use-case.
> >
> > AFAICT though we don't need add-fd to support passing many files
> > for our needs. Saving only requires a single FD. All others can
> > be opened by dup(), so the limitation of add-fd is irrelevant
> > surely ?
> 
> Only once we decide to use one FD. If we had a generic add-fd backend,
> then that's already a user-facing API, so the "implementation detail"
> argument becomes weaker.
> 
> With a single FD we'll need to be very careful about what code is
> allowed to run while the multifd channels are doing IO. Since O_DIRECT
> is not widely supported, now we have to also be careful about someone
> using that QEMUFile handle to do unaligned writes and not even noticing
> that it breaks direct IO. None of this in unworkable, of course, I just
> find the design way clearer with just the file name + offset.

I guess I'm not seeing the problem still.  A single FD is passed across
from libvirt, but QEMU is free to turn that into *many* FDs for its
internal use, using dup() and then setting O_DIRECT on as many/few of
the dup()d FDs as it wants to.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-31 15:22                       ` Daniel P. Berrangé
@ 2023-10-31 15:52                         ` Fabiano Rosas
  2023-10-31 15:58                           ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-31 15:52 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Oct 31, 2023 at 11:33:24AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Tue, Oct 31, 2023 at 10:05:56AM -0300, Fabiano Rosas wrote:
>> >> Daniel P. Berrangé <berrange@redhat.com> writes:
>> >> 
>> >> > On Mon, Oct 30, 2023 at 07:51:34PM -0300, Fabiano Rosas wrote:
>> >> >> I could use some advice on how to solve this situation. The fdset code
>> >> >> at monitor/fds.c and the add-fd command don't seem to be usable outside
>> >> >> the original use-case of passing fds with different open flags.
>> >> >> 
>> >> >> There are several problems, the biggest one being that there's no way to
>> >> >> manipulate the set of file descriptors aside from asking for duplication
>> >> >> of an fd that matches a particular set of flags.
>> >> >> 
>> >> >> That doesn't work for us because the two fds we need (one for main
>> >> >> channel, other for secondary channels) will have the same open flags. So
>> >> >> the fdset code will always return the first one it finds in the set.
>> >> >
>> >> > QEMU may want multiple FDs *internally*, but IMHO that fact should
>> >> > not be exposed to mgmt applications. It would be valid for a QEMU
>> >> > impl to share the same FD across multiple threads, or have a different
>> >> > FD for each thread. All threads are using pread/pwrite, so it is safe
>> >> > for them to use the same FD if they desire. It is a private impl choice
>> >> > for QEMU at any given point in time and could change over time.
>> >> >
>> >> 
>> >> Sure, I don't disagree. However up until last week we had a seemingly
>> >> usable "add-fd" command that allows the user to provide a *set of file
>> >> descriptors* to QEMU. It's just now that we're learning that interface
>> >> serves only a special use-case.
>> >
>> > AFAICT though we don't need add-fd to support passing many files
>> > for our needs. Saving only requires a single FD. All others can
>> > be opened by dup(), so the limitation of add-fd is irrelevant
>> > surely ?
>> 
>> Only once we decide to use one FD. If we had a generic add-fd backend,
>> then that's already a user-facing API, so the "implementation detail"
>> argument becomes weaker.
>> 
>> With a single FD we'll need to be very careful about what code is
>> allowed to run while the multifd channels are doing IO. Since O_DIRECT
>> is not widely supported, now we have to also be careful about someone
>> using that QEMUFile handle to do unaligned writes and not even noticing
>> that it breaks direct IO. None of this is unworkable, of course, I just
>> find the design way clearer with just the file name + offset.
>
> I guess I'm not seeing the problem still.  A single FD is passed across
> from libvirt, but QEMU is free to turn that into *many* FDs for its
> internal use, using dup() and then setting O_DIRECT on as many/few of
> > the dup()d FDs as it wants to.

The problem is that duplicated FDs share the file status flags. If we
set O_DIRECT on the multifd channels and the main thread happens to do
an unaligned write with qemu_file_put* then the filesystem will fail
that write.
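
The sharing is easy to demonstrate standalone (plain POSIX; Linux-only
because of O_DIRECT):

  #define _GNU_SOURCE
  #include <assert.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("/tmp/f", O_WRONLY | O_CREAT, 0600);
      int fd2 = dup(fd);

      /* set O_DIRECT on the duplicate only... */
      fcntl(fd2, F_SETFL, fcntl(fd2, F_GETFL) | O_DIRECT);

      /*
       * ...but both fds share one open file description, so the
       * original fd now has O_DIRECT too, and unaligned writes
       * through it will fail with EINVAL.
       */
      assert(fcntl(fd, F_GETFL) & O_DIRECT);

      close(fd2);
      close(fd);
      return 0;
  }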


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-31 15:52                         ` Fabiano Rosas
@ 2023-10-31 15:58                           ` Daniel P. Berrangé
  2023-10-31 19:05                             ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-10-31 15:58 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

On Tue, Oct 31, 2023 at 12:52:41PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> >
> > I guess I'm not seeing the problem still.  A single FD is passed across
> > from libvirt, but QEMU is free to turn that into *many* FDs for its
> > internal use, using dup() and then setting O_DIRECT on as many/few of
> > the dup()d FDs as its wants to.
> 
> The problem is that duplicated FDs share the file status flags. If we
> set O_DIRECT on the multifd channels and the main thread happens to do
> an unaligned write with qemu_file_put* then the filesystem will fail
> that write.

Doh, I had forgotten that sharing.

Do we have any synchronization between multifd channels and the main
thread ?  e.g. does the main thread wait for RAM sending completion
before carrying on writing other non-RAM data ?  If not, is it at all
practical to add such synchronization ?  IOW, to let us turn on O_DIRECT
at the start of a RAM section and turn it off again afterwards.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 13/29] migration: fixed-ram: Add URI compatibility check
  2023-10-23 20:35 ` [PATCH v2 13/29] migration: fixed-ram: Add URI compatibility check Fabiano Rosas
  2023-10-25 10:27   ` Daniel P. Berrangé
@ 2023-10-31 16:06   ` Peter Xu
  1 sibling, 0 replies; 128+ messages in thread
From: Peter Xu @ 2023-10-31 16:06 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:35:52PM -0300, Fabiano Rosas wrote:
> The fixed-ram migration format needs a channel that supports seeking
> to be able to write each page to an arbitrary offset in the migration
> stream.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/migration.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 692fbc5ad6..cabb3ad3a5 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -106,22 +106,40 @@ static bool migration_needs_multiple_sockets(void)
>      return migrate_multifd() || migrate_postcopy_preempt();
>  }
>  
> +static bool migration_needs_seekable_channel(void)
> +{
> +    return migrate_fixed_ram();
> +}
> +
>  static bool uri_supports_multi_channels(const char *uri)
>  {
>      return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
>             strstart(uri, "vsock:", NULL);
>  }
>  
> +static bool uri_supports_seeking(const char *uri)
> +{
> +    return strstart(uri, "file:", NULL);
> +}
> +
>  static bool
>  migration_channels_and_uri_compatible(const char *uri, Error **errp)
>  {
> +    bool compatible = true;
> +
> +    if (migration_needs_seekable_channel() &&
> +        !uri_supports_seeking(uri)) {
> +        error_setg(errp, "Migration requires seekable transport (e.g. file)");
> +        compatible = false;

We may want to return directly after setting errp once, as error_setg() can
trigger an assertion on "*errp == NULL" if the check below fails too.
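
I.e. something along these lines:

      if (migration_needs_seekable_channel() &&
          !uri_supports_seeking(uri)) {
          error_setg(errp, "Migration requires seekable transport (e.g. file)");
          return false;
      }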

> +    }
> +
>      if (migration_needs_multiple_sockets() &&
>          !uri_supports_multi_channels(uri)) {
>          error_setg(errp, "Migration requires multi-channel URIs (e.g. tcp)");
> -        return false;
> +        compatible = false;
>      }
>  
> -    return true;
> +    return compatible;
>  }
>  
>  static bool migration_should_pause(const char *uri)
> -- 
> 2.35.3
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration
  2023-10-23 20:35 ` [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration Fabiano Rosas
  2023-10-25  9:39   ` Daniel P. Berrangé
@ 2023-10-31 16:52   ` Peter Xu
  2023-10-31 17:33     ` Fabiano Rosas
  1 sibling, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-10-31 16:52 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé

On Mon, Oct 23, 2023 at 05:35:54PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
> 
> Implement the outgoing migration side for the 'fixed-ram' capability.
> 
> A bitmap is introduced to track which pages have been written in the
> migration file. Pages are written at a fixed location for every
> ramblock. Zero pages are ignored as they'd be zero in the destination
> migration as well.
> 
> The migration stream is altered to put the dirty pages for a ramblock
> after its header instead of having a sequential stream of pages that
> follow the ramblock headers. Since all pages have a fixed location,
> RAM_SAVE_FLAG_EOS is no longer generated on every migration iteration.
> 
> Without fixed-ram (current):
> 
> ramblock 1 header|ramblock 2 header|...|RAM_SAVE_FLAG_EOS|stream of
>  pages (iter 1)|RAM_SAVE_FLAG_EOS|stream of pages (iter 2)|...
> 
> With fixed-ram (new):
> 
> ramblock 1 header|ramblock 1 fixed-ram header|ramblock 1 pages (fixed
>  offsets)|ramblock 2 header|ramblock 2 fixed-ram header|ramblock 2
>  pages (fixed offsets)|...|RAM_SAVE_FLAG_EOS
> 
> where:
>  - ramblock header: the generic information for a ramblock, such as
>    idstr, used_len, etc.
> 
>  - ramblock fixed-ram header: the new information added by this
>    feature: bitmap of pages written, bitmap size and offset of pages
>    in the migration file.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  include/exec/ramblock.h |  8 ++++
>  migration/options.c     |  3 --
>  migration/ram.c         | 98 ++++++++++++++++++++++++++++++++++++-----
>  3 files changed, 96 insertions(+), 13 deletions(-)
> 
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 69c6a53902..e0e3f16852 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -44,6 +44,14 @@ struct RAMBlock {
>      size_t page_size;
>      /* dirty bitmap used during migration */
>      unsigned long *bmap;
> +    /* shadow dirty bitmap used when migrating to a file */
> +    unsigned long *shadow_bmap;
> +    /*
> +     * offset in the file pages belonging to this ramblock are saved,
> +     * used only during migration to a file.
> +     */
> +    off_t bitmap_offset;
> +    uint64_t pages_offset;
>      /* bitmap of already received pages in postcopy */
>      unsigned long *receivedmap;
>  
> diff --git a/migration/options.c b/migration/options.c
> index 2622d8c483..9f693d909f 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -271,12 +271,9 @@ bool migrate_events(void)
>  
>  bool migrate_fixed_ram(void)
>  {
> -/*
>      MigrationState *s = migrate_get_current();
>  
>      return s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
> -*/
> -    return false;
>  }
>  
>  bool migrate_ignore_shared(void)
> diff --git a/migration/ram.c b/migration/ram.c
> index 92769902bb..152a03604f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1157,12 +1157,18 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
>          return 0;
>      }
>  
> +    stat64_add(&mig_stats.zero_pages, 1);

Here we keep zero page accounting, but..

> +
> +    if (migrate_fixed_ram()) {
> +        /* zero pages are not transferred with fixed-ram */
> +        clear_bit(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
> +        return 1;
> +    }
> +
>      len += save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
>      qemu_put_byte(file, 0);
>      len += 1;
>      ram_release_page(pss->block->idstr, offset);
> -
> -    stat64_add(&mig_stats.zero_pages, 1);
>      ram_transferred_add(len);
>  
>      /*
> @@ -1220,14 +1226,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
>  {
>      QEMUFile *file = pss->pss_channel;
>  
> -    ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
> -                                         offset | RAM_SAVE_FLAG_PAGE));
> -    if (async) {
> -        qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> -                              migrate_release_ram() &&
> -                              migration_in_postcopy());
> +    if (migrate_fixed_ram()) {
> +        qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
> +                           block->pages_offset + offset);
> +        set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
>      } else {
> -        qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> +        ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
> +                                             offset | RAM_SAVE_FLAG_PAGE));

.. here we ignore the normal page accounting.

I think we should have the same behavior on both, perhaps keep them always?

> +        if (async) {
> +            qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> +                                  migrate_release_ram() &&
> +                                  migration_in_postcopy());
> +        } else {
> +            qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> +        }
>      }
>      ram_transferred_add(TARGET_PAGE_SIZE);
>      stat64_add(&mig_stats.normal_pages, 1);
> @@ -2475,6 +2487,8 @@ static void ram_save_cleanup(void *opaque)
>          block->clear_bmap = NULL;
>          g_free(block->bmap);
>          block->bmap = NULL;
> +        g_free(block->shadow_bmap);
> +        block->shadow_bmap = NULL;
>      }
>  
>      xbzrle_cleanup();
> @@ -2842,6 +2856,7 @@ static void ram_list_init_bitmaps(void)
>               */
>              block->bmap = bitmap_new(pages);
>              bitmap_set(block->bmap, 0, pages);
> +            block->shadow_bmap = bitmap_new(block->used_length >> TARGET_PAGE_BITS);

AFAICT bmap should also use used_length.  How about adding one more patch
to change that, then you can use "pages" here?

See ram_mig_ram_block_resized() where we call migration_cancel() if resized.

>              block->clear_bmap_shift = shift;
>              block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
>          }
> @@ -2979,6 +2994,44 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
>      }
>  }
>  
> +#define FIXED_RAM_HDR_VERSION 1
> +struct FixedRamHeader {
> +    uint32_t version;
> +    uint64_t page_size;
> +    uint64_t bitmap_offset;
> +    uint64_t pages_offset;
> +    /* end of v1 */
> +} QEMU_PACKED;
> +
> +static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
> +{
> +    g_autofree struct FixedRamHeader *header;
> +    size_t header_size, bitmap_size;
> +    long num_pages;
> +
> +    header = g_new0(struct FixedRamHeader, 1);
> +    header_size = sizeof(struct FixedRamHeader);
> +
> +    num_pages = block->used_length >> TARGET_PAGE_BITS;
> +    bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> +
> +    /*
> +     * Save the file offsets of where the bitmap and the pages should
> +     * go as they are written at the end of migration and during the
> +     * iterative phase, respectively.
> +     */
> +    block->bitmap_offset = qemu_get_offset(file) + header_size;
> +    block->pages_offset = ROUND_UP(block->bitmap_offset +
> +                                   bitmap_size, 0x100000);
> +
> +    header->version = cpu_to_be32(FIXED_RAM_HDR_VERSION);
> +    header->page_size = cpu_to_be64(TARGET_PAGE_SIZE);

This is the "page size" for the shadow bitmap, right?  Shall we state it
somewhere (e.g. explaining why it's not block->page_size)?

It's unfortunate that we already have things like:

            if (migrate_postcopy_ram() && block->page_size !=
                                          qemu_host_page_size) {
                qemu_put_be64(f, block->page_size);
            }

But indeed we can't merge them because they seem to serve different
purposes.

> +    header->bitmap_offset = cpu_to_be64(block->bitmap_offset);
> +    header->pages_offset = cpu_to_be64(block->pages_offset);
> +
> +    qemu_put_buffer(file, (uint8_t *) header, header_size);
> +}
> +
>  /*
>   * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
>   * long-running RCU critical section.  When rcu-reclaims in the code
> @@ -3028,6 +3081,12 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>              if (migrate_ignore_shared()) {
>                  qemu_put_be64(f, block->mr->addr);
>              }
> +
> +            if (migrate_fixed_ram()) {
> +                fixed_ram_insert_header(f, block);
> +                /* prepare offset for next ramblock */
> +                qemu_set_offset(f, block->pages_offset + block->used_length, SEEK_SET);
> +            }
>          }
>      }
>  
> @@ -3061,6 +3120,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>      return 0;
>  }
>  
> +static void ram_save_shadow_bmap(QEMUFile *f)
> +{
> +    RAMBlock *block;
> +
> +    RAMBLOCK_FOREACH_MIGRATABLE(block) {
> +        long num_pages = block->used_length >> TARGET_PAGE_BITS;
> +        long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> +        qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
> +                           block->bitmap_offset);
> +        /* to catch any thread late sending pages */
> +        block->shadow_bmap = NULL;

What is this for?  Wouldn't this leak the buffer already?
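
If the intent is really to clear it here, freeing it first would at
least avoid the leak, e.g.:

      g_clear_pointer(&block->shadow_bmap, g_free);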

> +    }
> +}
> +
>  /**
>   * ram_save_iterate: iterative stage for migration
>   *
> @@ -3179,7 +3252,6 @@ out:
>          qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>          qemu_fflush(f);
>          ram_transferred_add(8);
> -
>          ret = qemu_file_get_error(f);
>      }
>      if (ret < 0) {
> @@ -3256,7 +3328,13 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>      if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
>          qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
>      }
> +
> +    if (migrate_fixed_ram()) {
> +        ram_save_shadow_bmap(f);
> +    }
> +
>      qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +
>      qemu_fflush(f);
>  
>      return 0;
> -- 
> 2.35.3
> 

-- 
Peter Xu




* Re: [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration
  2023-10-31 16:52   ` Peter Xu
@ 2023-10-31 17:33     ` Fabiano Rosas
  2023-10-31 17:59       ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-31 17:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé

Peter Xu <peterx@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:35:54PM -0300, Fabiano Rosas wrote:
>> From: Nikolay Borisov <nborisov@suse.com>
>> 
>> Implement the outgoing migration side for the 'fixed-ram' capability.
>> 
>> A bitmap is introduced to track which pages have been written in the
>> migration file. Pages are written at a fixed location for every
>> ramblock. Zero pages are ignored as they'd be zero in the destination
>> as well.
>> 
>> The migration stream is altered to put the dirty pages for a ramblock
>> after its header instead of having a sequential stream of pages that
>> follow the ramblock headers. Since all pages have a fixed location,
>> RAM_SAVE_FLAG_EOS is no longer generated on every migration iteration.
>> 
>> Without fixed-ram (current):
>> 
>> ramblock 1 header|ramblock 2 header|...|RAM_SAVE_FLAG_EOS|stream of
>>  pages (iter 1)|RAM_SAVE_FLAG_EOS|stream of pages (iter 2)|...
>> 
>> With fixed-ram (new):
>> 
>> ramblock 1 header|ramblock 1 fixed-ram header|ramblock 1 pages (fixed
>>  offsets)|ramblock 2 header|ramblock 2 fixed-ram header|ramblock 2
>>  pages (fixed offsets)|...|RAM_SAVE_FLAG_EOS
>> 
>> where:
>>  - ramblock header: the generic information for a ramblock, such as
>>    idstr, used_len, etc.
>> 
>>  - ramblock fixed-ram header: the new information added by this
>>    feature: bitmap of pages written, bitmap size and offset of pages
>>    in the migration file.
>> 
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  include/exec/ramblock.h |  8 ++++
>>  migration/options.c     |  3 --
>>  migration/ram.c         | 98 ++++++++++++++++++++++++++++++++++++-----
>>  3 files changed, 96 insertions(+), 13 deletions(-)
>> 
>> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
>> index 69c6a53902..e0e3f16852 100644
>> --- a/include/exec/ramblock.h
>> +++ b/include/exec/ramblock.h
>> @@ -44,6 +44,14 @@ struct RAMBlock {
>>      size_t page_size;
>>      /* dirty bitmap used during migration */
>>      unsigned long *bmap;
>> +    /* shadow dirty bitmap used when migrating to a file */
>> +    unsigned long *shadow_bmap;
>> +    /*
>> +     * offset in the file pages belonging to this ramblock are saved,
>> +     * used only during migration to a file.
>> +     */
>> +    off_t bitmap_offset;
>> +    uint64_t pages_offset;
>>      /* bitmap of already received pages in postcopy */
>>      unsigned long *receivedmap;
>>  
>> diff --git a/migration/options.c b/migration/options.c
>> index 2622d8c483..9f693d909f 100644
>> --- a/migration/options.c
>> +++ b/migration/options.c
>> @@ -271,12 +271,9 @@ bool migrate_events(void)
>>  
>>  bool migrate_fixed_ram(void)
>>  {
>> -/*
>>      MigrationState *s = migrate_get_current();
>>  
>>      return s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
>> -*/
>> -    return false;
>>  }
>>  
>>  bool migrate_ignore_shared(void)
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 92769902bb..152a03604f 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1157,12 +1157,18 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
>>          return 0;
>>      }
>>  
>> +    stat64_add(&mig_stats.zero_pages, 1);
>
> Here we keep zero page accounting, but..
>
>> +
>> +    if (migrate_fixed_ram()) {
>> +        /* zero pages are not transferred with fixed-ram */
>> +        clear_bit(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
>> +        return 1;
>> +    }
>> +
>>      len += save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
>>      qemu_put_byte(file, 0);
>>      len += 1;
>>      ram_release_page(pss->block->idstr, offset);
>> -
>> -    stat64_add(&mig_stats.zero_pages, 1);
>>      ram_transferred_add(len);
>>  
>>      /*
>> @@ -1220,14 +1226,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
>>  {
>>      QEMUFile *file = pss->pss_channel;
>>  
>> -    ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
>> -                                         offset | RAM_SAVE_FLAG_PAGE));
>> -    if (async) {
>> -        qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
>> -                              migrate_release_ram() &&
>> -                              migration_in_postcopy());
>> +    if (migrate_fixed_ram()) {
>> +        qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
>> +                           block->pages_offset + offset);
>> +        set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
>>      } else {
>> -        qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
>> +        ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
>> +                                             offset | RAM_SAVE_FLAG_PAGE));
>
> .. here we ignored normal page accounting.
>
> I think we should have the same behavior on both, perhaps keep them always?
>

This is the accounting for the header only, if I'm not mistaken.

>> +        if (async) {
>> +            qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
>> +                                  migrate_release_ram() &&
>> +                                  migration_in_postcopy());
>> +        } else {
>> +            qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
>> +        }
>>      }
>>      ram_transferred_add(TARGET_PAGE_SIZE);
>>      stat64_add(&mig_stats.normal_pages, 1);

Here's the page accounting.

>> @@ -2475,6 +2487,8 @@ static void ram_save_cleanup(void *opaque)
>>          block->clear_bmap = NULL;
>>          g_free(block->bmap);
>>          block->bmap = NULL;
>> +        g_free(block->shadow_bmap);
>> +        block->shadow_bmap = NULL;
>>      }
>>  
>>      xbzrle_cleanup();
>> @@ -2842,6 +2856,7 @@ static void ram_list_init_bitmaps(void)
>>               */
>>              block->bmap = bitmap_new(pages);
>>              bitmap_set(block->bmap, 0, pages);
>> +            block->shadow_bmap = bitmap_new(block->used_length >> TARGET_PAGE_BITS);
>
> AFAICT bmap should also use used_length.  How about adding one more patch
> to change that, then you can use "pages" here?

It uses max_length. I don't know what the effects of that change
would be. I'll look into it.

> See ram_mig_ram_block_resized() where we call migration_cancel() if resized.
>
>>              block->clear_bmap_shift = shift;
>>              block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
>>          }
>> @@ -2979,6 +2994,44 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
>>      }
>>  }
>>  
>> +#define FIXED_RAM_HDR_VERSION 1
>> +struct FixedRamHeader {
>> +    uint32_t version;
>> +    uint64_t page_size;
>> +    uint64_t bitmap_offset;
>> +    uint64_t pages_offset;
>> +    /* end of v1 */
>> +} QEMU_PACKED;
>> +
>> +static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
>> +{
>> +    g_autofree struct FixedRamHeader *header;
>> +    size_t header_size, bitmap_size;
>> +    long num_pages;
>> +
>> +    header = g_new0(struct FixedRamHeader, 1);
>> +    header_size = sizeof(struct FixedRamHeader);
>> +
>> +    num_pages = block->used_length >> TARGET_PAGE_BITS;
>> +    bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
>> +
>> +    /*
>> +     * Save the file offsets of where the bitmap and the pages should
>> +     * go as they are written at the end of migration and during the
>> +     * iterative phase, respectively.
>> +     */
>> +    block->bitmap_offset = qemu_get_offset(file) + header_size;
>> +    block->pages_offset = ROUND_UP(block->bitmap_offset +
>> +                                   bitmap_size, 0x100000);
>> +
>> +    header->version = cpu_to_be32(FIXED_RAM_HDR_VERSION);
>> +    header->page_size = cpu_to_be64(TARGET_PAGE_SIZE);
>
> This is the "page size" for the shadow bitmap, right?  Shall we state it
> somewhere (e.g. explaining why it's not block->page_size)?

Ok.

> It's unfortunate that we already have things like:
>
>             if (migrate_postcopy_ram() && block->page_size !=
>                                           qemu_host_page_size) {
>                 qemu_put_be64(f, block->page_size);
>             }
>
> > But indeed we can't merge them because they seem to serve a different
> > purpose.
>
>> +    header->bitmap_offset = cpu_to_be64(block->bitmap_offset);
>> +    header->pages_offset = cpu_to_be64(block->pages_offset);
>> +
>> +    qemu_put_buffer(file, (uint8_t *) header, header_size);
>> +}
>> +
>>  /*
>>   * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
>>   * long-running RCU critical section.  When rcu-reclaims in the code
>> @@ -3028,6 +3081,12 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>              if (migrate_ignore_shared()) {
>>                  qemu_put_be64(f, block->mr->addr);
>>              }
>> +
>> +            if (migrate_fixed_ram()) {
>> +                fixed_ram_insert_header(f, block);
>> +                /* prepare offset for next ramblock */
>> +                qemu_set_offset(f, block->pages_offset + block->used_length, SEEK_SET);
>> +            }
>>          }
>>      }
>>  
>> @@ -3061,6 +3120,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>      return 0;
>>  }
>>  
>> +static void ram_save_shadow_bmap(QEMUFile *f)
>> +{
>> +    RAMBlock *block;
>> +
>> +    RAMBLOCK_FOREACH_MIGRATABLE(block) {
>> +        long num_pages = block->used_length >> TARGET_PAGE_BITS;
>> +        long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
>> +        qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
>> +                           block->bitmap_offset);
>> +        /* to catch any thread late sending pages */
>> +        block->shadow_bmap = NULL;
>
> What is this for?  Wouldn't this leak the buffer already?
>

Ah, this is debug code. It's because of multifd. In this series I don't
use sem_sync because there are no packets, but skipping it causes
multifd_send_sync_main() to return before the multifd channels have sent
all their pages. This is here so that a channel still sending pages
late crashes when it touches the bitmap.

I think it's worth keeping, but I'd have to move it to the multifd
patch and free the bitmap properly.
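
Something like this, roughly (sketch only, reusing the calls already in
the patch):

    qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
                       block->bitmap_offset);
    g_free(block->shadow_bmap);
    /* NULL it so any thread still sending pages crashes loudly */
    block->shadow_bmap = NULL;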

Thanks!




* Re: [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration
  2023-10-31 17:33     ` Fabiano Rosas
@ 2023-10-31 17:59       ` Peter Xu
  0 siblings, 0 replies; 128+ messages in thread
From: Peter Xu @ 2023-10-31 17:59 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé

On Tue, Oct 31, 2023 at 02:33:04PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Oct 23, 2023 at 05:35:54PM -0300, Fabiano Rosas wrote:
> >> From: Nikolay Borisov <nborisov@suse.com>
> >> 
> >> Implement the outgoing migration side for the 'fixed-ram' capability.
> >> 
> >> A bitmap is introduced to track which pages have been written in the
> >> migration file. Pages are written at a fixed location for every
> >> ramblock. Zero pages are ignored as they'd be zero in the destination
> >> as well.
> >> 
> >> The migration stream is altered to put the dirty pages for a ramblock
> >> after its header instead of having a sequential stream of pages that
> >> follow the ramblock headers. Since all pages have a fixed location,
> >> RAM_SAVE_FLAG_EOS is no longer generated on every migration iteration.
> >> 
> >> Without fixed-ram (current):
> >> 
> >> ramblock 1 header|ramblock 2 header|...|RAM_SAVE_FLAG_EOS|stream of
> >>  pages (iter 1)|RAM_SAVE_FLAG_EOS|stream of pages (iter 2)|...
> >> 
> >> With fixed-ram (new):
> >> 
> >> ramblock 1 header|ramblock 1 fixed-ram header|ramblock 1 pages (fixed
> >>  offsets)|ramblock 2 header|ramblock 2 fixed-ram header|ramblock 2
> >>  pages (fixed offsets)|...|RAM_SAVE_FLAG_EOS
> >> 
> >> where:
> >>  - ramblock header: the generic information for a ramblock, such as
> >>    idstr, used_len, etc.
> >> 
> >>  - ramblock fixed-ram header: the new information added by this
> >>    feature: bitmap of pages written, bitmap size and offset of pages
> >>    in the migration file.
> >> 
> >> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> ---
> >>  include/exec/ramblock.h |  8 ++++
> >>  migration/options.c     |  3 --
> >>  migration/ram.c         | 98 ++++++++++++++++++++++++++++++++++++-----
> >>  3 files changed, 96 insertions(+), 13 deletions(-)
> >> 
> >> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> >> index 69c6a53902..e0e3f16852 100644
> >> --- a/include/exec/ramblock.h
> >> +++ b/include/exec/ramblock.h
> >> @@ -44,6 +44,14 @@ struct RAMBlock {
> >>      size_t page_size;
> >>      /* dirty bitmap used during migration */
> >>      unsigned long *bmap;
> >> +    /* shadow dirty bitmap used when migrating to a file */
> >> +    unsigned long *shadow_bmap;
> >> +    /*
> >> +     * offset in the file where pages belonging to this ramblock are saved,
> >> +     * used only during migration to a file.
> >> +     */
> >> +    off_t bitmap_offset;
> >> +    uint64_t pages_offset;
> >>      /* bitmap of already received pages in postcopy */
> >>      unsigned long *receivedmap;
> >>  
> >> diff --git a/migration/options.c b/migration/options.c
> >> index 2622d8c483..9f693d909f 100644
> >> --- a/migration/options.c
> >> +++ b/migration/options.c
> >> @@ -271,12 +271,9 @@ bool migrate_events(void)
> >>  
> >>  bool migrate_fixed_ram(void)
> >>  {
> >> -/*
> >>      MigrationState *s = migrate_get_current();
> >>  
> >>      return s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
> >> -*/
> >> -    return false;

One more thing: maybe we can avoid this and just assume nobody will only
apply the previous patch and cause trouble.

> >>  }
> >>  
> >>  bool migrate_ignore_shared(void)
> >> diff --git a/migration/ram.c b/migration/ram.c
> >> index 92769902bb..152a03604f 100644
> >> --- a/migration/ram.c
> >> +++ b/migration/ram.c
> >> @@ -1157,12 +1157,18 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
> >>          return 0;
> >>      }
> >>  
> >> +    stat64_add(&mig_stats.zero_pages, 1);
> >
> > Here we keep zero page accounting, but..
> >
> >> +
> >> +    if (migrate_fixed_ram()) {
> >> +        /* zero pages are not transferred with fixed-ram */
> >> +        clear_bit(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
> >> +        return 1;
> >> +    }
> >> +
> >>      len += save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
> >>      qemu_put_byte(file, 0);
> >>      len += 1;
> >>      ram_release_page(pss->block->idstr, offset);
> >> -
> >> -    stat64_add(&mig_stats.zero_pages, 1);
> >>      ram_transferred_add(len);
> >>  
> >>      /*
> >> @@ -1220,14 +1226,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
> >>  {
> >>      QEMUFile *file = pss->pss_channel;
> >>  
> >> -    ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
> >> -                                         offset | RAM_SAVE_FLAG_PAGE));
> >> -    if (async) {
> >> -        qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> >> -                              migrate_release_ram() &&
> >> -                              migration_in_postcopy());
> >> +    if (migrate_fixed_ram()) {
> >> +        qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
> >> +                           block->pages_offset + offset);
> >> +        set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
> >>      } else {
> >> -        qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> >> +        ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
> >> +                                             offset | RAM_SAVE_FLAG_PAGE));
> >
> > .. here we ignored normal page accounting.
> >
> > I think we should have the same behavior on both, perhaps keep them always?
> >
> 
> This is the accounting for the header only, if I'm not mistaken.
> 
> >> +        if (async) {
> >> +            qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> >> +                                  migrate_release_ram() &&
> >> +                                  migration_in_postcopy());
> >> +        } else {
> >> +            qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> >> +        }
> >>      }
> >>      ram_transferred_add(TARGET_PAGE_SIZE);
> >>      stat64_add(&mig_stats.normal_pages, 1);
> 
> Here's the page accounting.

Oh, that's okay then.

> 
> >> @@ -2475,6 +2487,8 @@ static void ram_save_cleanup(void *opaque)
> >>          block->clear_bmap = NULL;
> >>          g_free(block->bmap);
> >>          block->bmap = NULL;
> >> +        g_free(block->shadow_bmap);
> >> +        block->shadow_bmap = NULL;
> >>      }
> >>  
> >>      xbzrle_cleanup();
> >> @@ -2842,6 +2856,7 @@ static void ram_list_init_bitmaps(void)
> >>               */
> >>              block->bmap = bitmap_new(pages);
> >>              bitmap_set(block->bmap, 0, pages);
> >> +            block->shadow_bmap = bitmap_new(block->used_length >> TARGET_PAGE_BITS);
> >
> > AFAICT bmap should also use used_length.  How about adding one more patch
> > to change that, then you can use "pages" here?
> 
> It uses max_length. I don't know what the effects of that change
> would be. I'll look into it.
> 
> > See ram_mig_ram_block_resized() where we call migration_cancel() if resized.
> >
> >>              block->clear_bmap_shift = shift;
> >>              block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
> >>          }
> >> @@ -2979,6 +2994,44 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
> >>      }
> >>  }
> >>  
> >> +#define FIXED_RAM_HDR_VERSION 1
> >> +struct FixedRamHeader {
> >> +    uint32_t version;
> >> +    uint64_t page_size;
> >> +    uint64_t bitmap_offset;
> >> +    uint64_t pages_offset;
> >> +    /* end of v1 */
> >> +} QEMU_PACKED;
> >> +
> >> +static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
> >> +{
> >> +    g_autofree struct FixedRamHeader *header;
> >> +    size_t header_size, bitmap_size;
> >> +    long num_pages;
> >> +
> >> +    header = g_new0(struct FixedRamHeader, 1);
> >> +    header_size = sizeof(struct FixedRamHeader);
> >> +
> >> +    num_pages = block->used_length >> TARGET_PAGE_BITS;
> >> +    bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> >> +
> >> +    /*
> >> +     * Save the file offsets of where the bitmap and the pages should
> >> +     * go as they are written at the end of migration and during the
> >> +     * iterative phase, respectively.
> >> +     */
> >> +    block->bitmap_offset = qemu_get_offset(file) + header_size;
> >> +    block->pages_offset = ROUND_UP(block->bitmap_offset +
> >> +                                   bitmap_size, 0x100000);
> >> +
> >> +    header->version = cpu_to_be32(FIXED_RAM_HDR_VERSION);
> >> +    header->page_size = cpu_to_be64(TARGET_PAGE_SIZE);
> >
> > This is the "page size" for the shadow bitmap, right?  Shall we state it
> > somewhere (e.g. explaining why it's not block->page_size)?
> 
> Ok.
> 
> > It's unfortunate that we already have things like:
> >
> >             if (migrate_postcopy_ram() && block->page_size !=
> >                                           qemu_host_page_size) {
> >                 qemu_put_be64(f, block->page_size);
> >             }
> >
> > But indeed we can't merge them because they seem to serve a different
> > purpose.
> >
> >> +    header->bitmap_offset = cpu_to_be64(block->bitmap_offset);
> >> +    header->pages_offset = cpu_to_be64(block->pages_offset);
> >> +
> >> +    qemu_put_buffer(file, (uint8_t *) header, header_size);
> >> +}
> >> +
> >>  /*
> >>   * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
> >>   * long-running RCU critical section.  When rcu-reclaims in the code
> >> @@ -3028,6 +3081,12 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> >>              if (migrate_ignore_shared()) {
> >>                  qemu_put_be64(f, block->mr->addr);
> >>              }
> >> +
> >> +            if (migrate_fixed_ram()) {
> >> +                fixed_ram_insert_header(f, block);
> >> +                /* prepare offset for next ramblock */
> >> +                qemu_set_offset(f, block->pages_offset + block->used_length, SEEK_SET);
> >> +            }
> >>          }
> >>      }
> >>  
> >> @@ -3061,6 +3120,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> >>      return 0;
> >>  }
> >>  
> >> +static void ram_save_shadow_bmap(QEMUFile *f)
> >> +{
> >> +    RAMBlock *block;
> >> +
> >> +    RAMBLOCK_FOREACH_MIGRATABLE(block) {
> >> +        long num_pages = block->used_length >> TARGET_PAGE_BITS;
> >> +        long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> >> +        qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
> >> +                           block->bitmap_offset);
> >> +        /* to catch any thread late sending pages */
> >> +        block->shadow_bmap = NULL;
> >
> > What is this for?  Wouldn't this leak the buffer already?
> >
> 
> Ah, this is debug code. It's because of multifd. In this series I don't
> use sem_sync because there are no packets, but skipping it causes
> multifd_send_sync_main() to return before the multifd channels have sent
> all their pages. This is here so that a channel still sending pages
> late crashes when it touches the bitmap.
> 
> I think it's worth keeping, but I'd have to move it to the multifd
> patch and free the bitmap properly.

Ok, I'll keep reading.

-- 
Peter Xu




* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-10-25 14:07     ` Fabiano Rosas
@ 2023-10-31 19:03       ` Peter Xu
  2023-11-01  9:26         ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-10-31 19:03 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Daniel P. Berrangé,
	qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

On Wed, Oct 25, 2023 at 11:07:33AM -0300, Fabiano Rosas wrote:
> >> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
> >> +{
> >> +    g_autofree unsigned long *bitmap = NULL;
> >> +    struct FixedRamHeader header;
> >> +    size_t bitmap_size;
> >> +    long num_pages;
> >> +    int ret = 0;
> >> +
> >> +    ret = fixed_ram_read_header(f, &header);
> >> +    if (ret < 0) {
> >> +        error_report("Error reading fixed-ram header");
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    block->pages_offset = header.pages_offset;
> >
> > Do you think it is worth sanity checking that 'pages_offset' is aligned
> > in some way.
> >
> > It is nice that we have flexibility to change the alignment in future
> > if we find 1 MB is not optimal, so I wouldn't want to force 1MB align
> > check htere. Perhaps we could at least sanity check for alignment at
> > TARGET_PAGE_SIZE, to detect a gross data corruption problem ?
> >
> 
> I don't see why not. I'll add it.

Is there any explanation for that 1MB offset, and how the number was
chosen?  Thanks,

-- 
Peter Xu




* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-31 15:58                           ` Daniel P. Berrangé
@ 2023-10-31 19:05                             ` Fabiano Rosas
  2023-11-01  9:30                               ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-31 19:05 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Oct 31, 2023 at 12:52:41PM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> >
>> > I guess I'm not seeing the problem still.  A single FD is passed across
>> > from libvirt, but QEMU is free to turn that into *many* FDs for its
>> > internal use, using dup() and then setting O_DIRECT on as many/few of
>> > the dup()d FDs as its wants to.
>> 
>> The problem is that duplicated FDs share the file status flags. If we
>> set O_DIRECT on the multifd channels and the main thread happens to do
>> an unaligned write with qemu_file_put* then the filesystem will fail
>> that write.
>
> Doh, I had forgotten that sharing.
>
> Do we have any synchronization between multifd  channels and the main
> thread ?  eg does the main thread wait for RAM sending completion
> before carrying on writing other non-RAM data ?

We do, but the issue with that approach is that there are no rules
for adding data into the stream. Anyone could add a qemu_put_* call
right in the middle of the section for whatever reason.

That is almost a separate matter due to our current compatibility model
being based on capabilities rather than resilience of the stream
format. So extraneous data in the stream always causes the migration to
break.

But with the O_DIRECT situation we'd be adding another aspect to
this. Not only would changing the code require syncing capabilities (as
it does today), it would also require knowing which parts of the stream
can be interrupted by new data and which cannot.

So while it would probably work, it's also a little fragile. If QEMU
were given 2 FDs or given access to the file, then only the multifd
channels would get O_DIRECT and they would be guaranteed not to have
extraneous unaligned data showing up.
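
To illustrate the flag sharing (minimal sketch, not QEMU code; the
filename is made up):

    int fd1 = open("migfile", O_WRONLY | O_CREAT, 0600);
    int fd2 = dup(fd1);

    /* dup'd fds share the open file description, hence the status flags */
    fcntl(fd2, F_SETFL, fcntl(fd2, F_GETFL) | O_DIRECT);

    /* fd1 now also has O_DIRECT set, so an unaligned write(fd1, ...)
     * fails with EINVAL */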





* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-10-23 20:35 ` [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore Fabiano Rosas
  2023-10-25  9:43   ` Daniel P. Berrangé
@ 2023-10-31 19:09   ` Peter Xu
  2023-10-31 20:00     ` Fabiano Rosas
  1 sibling, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-10-31 19:09 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

On Mon, Oct 23, 2023 at 05:35:55PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
> 
> Add the necessary code to parse the format changes for the 'fixed-ram'
> capability.
> 
> One of the more notable changes in behavior is that in the 'fixed-ram'
> case ram pages are restored in one go rather than constantly looping
> through the migration stream.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> (farosas) reused more of the common code by making the fixed-ram
> function take only one ramblock and calling it from inside
> parse_ramblock.
> ---
>  migration/ram.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 93 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 152a03604f..cea6971ab2 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3032,6 +3032,32 @@ static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
>      qemu_put_buffer(file, (uint8_t *) header, header_size);
>  }
>  
> +static int fixed_ram_read_header(QEMUFile *file, struct FixedRamHeader *header)
> +{
> +    size_t ret, header_size = sizeof(struct FixedRamHeader);
> +
> +    ret = qemu_get_buffer(file, (uint8_t *)header, header_size);
> +    if (ret != header_size) {
> +        return -1;
> +    }
> +
> +    /* migration stream is big-endian */
> +    be32_to_cpus(&header->version);
> +
> +    if (header->version > FIXED_RAM_HDR_VERSION) {
> +        error_report("Migration fixed-ram capability version mismatch (expected %d, got %d)",
> +                     FIXED_RAM_HDR_VERSION, header->version);

I know it doesn't matter a lot for now, but it'll be good to start using
Error** in new code?

> +        return -1;
> +    }
> +
> +    be64_to_cpus(&header->page_size);
> +    be64_to_cpus(&header->bitmap_offset);
> +    be64_to_cpus(&header->pages_offset);
> +
> +
> +    return 0;
> +}
> +
>  /*
>   * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
>   * long-running RCU critical section.  When rcu-reclaims in the code
> @@ -3932,6 +3958,68 @@ void colo_flush_ram_cache(void)
>      trace_colo_flush_ram_cache_end();
>  }
>  
> +static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
> +                                    long num_pages, unsigned long *bitmap)
> +{
> +    unsigned long set_bit_idx, clear_bit_idx;
> +    unsigned long len;
> +    ram_addr_t offset;
> +    void *host;
> +    size_t read, completed, read_len;
> +
> +    for (set_bit_idx = find_first_bit(bitmap, num_pages);
> +         set_bit_idx < num_pages;
> +         set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
> +
> +        clear_bit_idx = find_next_zero_bit(bitmap, num_pages, set_bit_idx + 1);
> +
> +        len = TARGET_PAGE_SIZE * (clear_bit_idx - set_bit_idx);
> +        offset = set_bit_idx << TARGET_PAGE_BITS;
> +
> +        for (read = 0, completed = 0; completed < len; offset += read) {
> +            host = host_from_ram_block_offset(block, offset);
> +            read_len = MIN(len, TARGET_PAGE_SIZE);

Why MIN()?  I haven't read qemu_get_buffer_at() yet, but shouldn't len
always be a multiple of the target page size or zero?

> +
> +            read = qemu_get_buffer_at(f, host, read_len,
> +                                      block->pages_offset + offset);
> +            completed += read;
> +        }
> +    }
> +}
> +
> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
> +{
> +    g_autofree unsigned long *bitmap = NULL;
> +    struct FixedRamHeader header;
> +    size_t bitmap_size;
> +    long num_pages;
> +    int ret = 0;
> +
> +    ret = fixed_ram_read_header(f, &header);
> +    if (ret < 0) {
> +        error_report("Error reading fixed-ram header");

Same here on error handling; suggest to use Error** from the start.

> +        return -EINVAL;
> +    }
> +
> +    block->pages_offset = header.pages_offset;
> +    num_pages = length / header.page_size;
> +    bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> +
> +    bitmap = g_malloc0(bitmap_size);
> +    if (qemu_get_buffer_at(f, (uint8_t *)bitmap, bitmap_size,
> +                           header.bitmap_offset) != bitmap_size) {
> +        error_report("Error parsing dirty bitmap");
> +        return -EINVAL;
> +    }
> +
> +    read_ramblock_fixed_ram(f, block, num_pages, bitmap);
> +
> +    /* Skip pages array */
> +    qemu_set_offset(f, block->pages_offset + length, SEEK_SET);
> +
> +    return ret;
> +}
> +
>  static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
>  {
>      int ret = 0;
> @@ -3940,6 +4028,10 @@ static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
>  
>      assert(block);
>  
> +    if (migrate_fixed_ram()) {
> +        return parse_ramblock_fixed_ram(f, block, length);
> +    }
> +
>      if (!qemu_ram_is_migratable(block)) {
>          error_report("block %s should not be migrated !", block->idstr);
>          return -EINVAL;
> @@ -4142,6 +4234,7 @@ static int ram_load_precopy(QEMUFile *f)
>                  migrate_multifd_flush_after_each_section()) {
>                  multifd_recv_sync_main();
>              }
> +
>              break;
>          case RAM_SAVE_FLAG_HOOK:
>              ret = rdma_registration_handle(f);
> -- 
> 2.35.3
> 

-- 
Peter Xu




* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-10-31 19:09   ` Peter Xu
@ 2023-10-31 20:00     ` Fabiano Rosas
  0 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-31 20:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

Peter Xu <peterx@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:35:55PM -0300, Fabiano Rosas wrote:
>> From: Nikolay Borisov <nborisov@suse.com>
>> 
>> Add the necessary code to parse the format changes for the 'fixed-ram'
>> capability.
>> 
>> One of the more notable changes in behavior is that in the 'fixed-ram'
>> case ram pages are restored in one go rather than constantly looping
>> through the migration stream.
>> 
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> (farosas) reused more of the common code by making the fixed-ram
>> function take only one ramblock and calling it from inside
>> parse_ramblock.
>> ---
>>  migration/ram.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 93 insertions(+)
>> 
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 152a03604f..cea6971ab2 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -3032,6 +3032,32 @@ static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
>>      qemu_put_buffer(file, (uint8_t *) header, header_size);
>>  }
>>  
>> +static int fixed_ram_read_header(QEMUFile *file, struct FixedRamHeader *header)
>> +{
>> +    size_t ret, header_size = sizeof(struct FixedRamHeader);
>> +
>> +    ret = qemu_get_buffer(file, (uint8_t *)header, header_size);
>> +    if (ret != header_size) {
>> +        return -1;
>> +    }
>> +
>> +    /* migration stream is big-endian */
>> +    be32_to_cpus(&header->version);
>> +
>> +    if (header->version > FIXED_RAM_HDR_VERSION) {
>> +        error_report("Migration fixed-ram capability version mismatch (expected %d, got %d)",
>> +                     FIXED_RAM_HDR_VERSION, header->version);
>
> I know it doesn't matter a lot for now, but it'll be good to start using
> Error** in new code?

This whole series was written before the many discussions we had about
error handling. Thanks for pointing that out, I'll revise and change
where appropriate.

>> +        return -1;
>> +    }
>> +
>> +    be64_to_cpus(&header->page_size);
>> +    be64_to_cpus(&header->bitmap_offset);
>> +    be64_to_cpus(&header->pages_offset);
>> +
>> +
>> +    return 0;
>> +}
>> +
>>  /*
>>   * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
>>   * long-running RCU critical section.  When rcu-reclaims in the code
>> @@ -3932,6 +3958,68 @@ void colo_flush_ram_cache(void)
>>      trace_colo_flush_ram_cache_end();
>>  }
>>  
>> +static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
>> +                                    long num_pages, unsigned long *bitmap)
>> +{
>> +    unsigned long set_bit_idx, clear_bit_idx;
>> +    unsigned long len;
>> +    ram_addr_t offset;
>> +    void *host;
>> +    size_t read, completed, read_len;
>> +
>> +    for (set_bit_idx = find_first_bit(bitmap, num_pages);
>> +         set_bit_idx < num_pages;
>> +         set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
>> +
>> +        clear_bit_idx = find_next_zero_bit(bitmap, num_pages, set_bit_idx + 1);
>> +
>> +        len = TARGET_PAGE_SIZE * (clear_bit_idx - set_bit_idx);
>> +        offset = set_bit_idx << TARGET_PAGE_BITS;
>> +
>> +        for (read = 0, completed = 0; completed < len; offset += read) {
>> +            host = host_from_ram_block_offset(block, offset);
>> +            read_len = MIN(len, TARGET_PAGE_SIZE);
>
> Why MIN()?  I haven't read qemu_get_buffer_at() yet, but shouldn't len
> always be a multiple of the target page size or zero?
>

Hmm, this is not my code. The original code had MIN(len, BUFSIZE), with
BUFSIZE defined as 4M. I think the idea might have been to cap the
number of pages read per call.

So it seems I made a mistake here, and this could be reading more
pages at a time.




* Re: [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support
  2023-10-23 20:35 ` [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
  2023-10-25  9:52   ` Daniel P. Berrangé
@ 2023-10-31 20:11   ` Peter Xu
  1 sibling, 0 replies; 128+ messages in thread
From: Peter Xu @ 2023-10-31 20:11 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:35:58PM -0300, Fabiano Rosas wrote:
> Allow multifd to open file-backed channels. This will be used when
> enabling the fixed-ram migration stream format which expects a
> seekable transport.
> 
> The QIOChannel read and write methods will use the preadv/pwritev
> versions which don't update the file offset at each call so we can
> reuse the fd without re-opening for every channel.
> 
> Note that this is just setup code and multifd cannot yet make use of
> the file channels.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/file.c      | 64 +++++++++++++++++++++++++++++++++++++++++--
>  migration/file.h      | 10 +++++--
>  migration/migration.c |  2 +-
>  migration/multifd.c   | 14 ++++++++--
>  migration/options.c   |  7 +++++
>  migration/options.h   |  1 +
>  6 files changed, 90 insertions(+), 8 deletions(-)
> 
> diff --git a/migration/file.c b/migration/file.c
> index cf5b1bf365..93b9b7bf5d 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -17,6 +17,12 @@
>  
>  #define OFFSET_OPTION ",offset="
>  
> +static struct FileOutgoingArgs {
> +    char *fname;
> +    int flags;
> +    int mode;
> +} outgoing_args;
> +
>  /* Remove the offset option from @filespec and return it in @offsetp. */
>  
>  static int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
> @@ -36,13 +42,62 @@ static int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
>      return 0;
>  }
>  
> +static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
> +{
> +    /* noop */
> +}
> +
> +static void file_migration_cancel(Error *errp)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> +                      MIGRATION_STATUS_FAILED);

It doesn't sound right to set FAILED here and then call cancel() afterwards
(which will try to set it to CANCELLING).

For the socket-based case, multifd sets the error and kicks the main thread
in multifd_new_send_channel_cleanup().  Can it be done similarly, rather
than calling migration_cancel()?

> +    migration_cancel(errp);
> +}

-- 
Peter Xu




* Re: [PATCH v2 20/29] migration/multifd: Add incoming QIOChannelFile support
  2023-10-23 20:35 ` [PATCH v2 20/29] migration/multifd: Add incoming " Fabiano Rosas
  2023-10-25 10:29   ` Daniel P. Berrangé
@ 2023-10-31 21:28   ` Peter Xu
  1 sibling, 0 replies; 128+ messages in thread
From: Peter Xu @ 2023-10-31 21:28 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:35:59PM -0300, Fabiano Rosas wrote:
> On the receiving side we don't need to differentiate between main
> channel and threads, so whichever channel is defined first gets to be
> the main one. And since there are no packets, use the atomic channel
> count to index into the params array.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/file.c      | 39 +++++++++++++++++++++++++++++----------
>  migration/migration.c |  2 ++
>  migration/multifd.c   |  7 ++++++-
>  migration/multifd.h   |  1 +
>  4 files changed, 38 insertions(+), 11 deletions(-)
> 
> diff --git a/migration/file.c b/migration/file.c
> index 93b9b7bf5d..ad75225f43 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -6,13 +6,15 @@
>   */
>  
>  #include "qemu/osdep.h"
> -#include "qemu/cutils.h"
>  #include "qapi/error.h"
> +#include "qemu/cutils.h"
> +#include "qemu/error-report.h"
>  #include "channel.h"
>  #include "file.h"
>  #include "migration.h"
>  #include "io/channel-file.h"
>  #include "io/channel-util.h"
> +#include "options.h"
>  #include "trace.h"
>  
>  #define OFFSET_OPTION ",offset="
> @@ -136,7 +138,8 @@ void file_start_incoming_migration(const char *filespec, Error **errp)
>      g_autofree char *filename = g_strdup(filespec);
>      QIOChannelFile *fioc = NULL;
>      uint64_t offset = 0;
> -    QIOChannel *ioc;
> +    int channels = 1;
> +    int i = 0, fd;
>  
>      trace_migration_file_incoming(filename);
>  
> @@ -146,16 +149,32 @@ void file_start_incoming_migration(const char *filespec, Error **errp)
>  
>      fioc = qio_channel_file_new_path(filename, O_RDONLY, 0, errp);
>      if (!fioc) {
> -        return;
> +        goto out;

Can't we just return here, since *errp is already set?  Why do we still need the error_report()?

> +    }
> +
> +    if (migrate_multifd()) {
> +        channels += migrate_multifd_channels();
>      }
>  
> -    ioc = QIO_CHANNEL(fioc);
> -    if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
> +    fd = fioc->fd;
> +
> +    do {
> +        QIOChannel *ioc = QIO_CHANNEL(fioc);
> +
> +        if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
> +            return;
> +        }
> +
> +        qio_channel_set_name(ioc, "migration-file-incoming");
> +        qio_channel_add_watch_full(ioc, G_IO_IN,
> +                                   file_accept_incoming_migration,
> +                                   NULL, NULL,
> +                                   g_main_context_get_thread_default());
> +    } while (++i < channels && (fioc = qio_channel_file_new_fd(fd)));
> +
> +out:
> +    if (!fioc) {
> +        error_report("Error creating migration incoming channel");
>          return;
>      }
> -    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-file-incoming");
> -    qio_channel_add_watch_full(ioc, G_IO_IN,
> -                               file_accept_incoming_migration,
> -                               NULL, NULL,
> -                               g_main_context_get_thread_default());
>  }
> diff --git a/migration/migration.c b/migration/migration.c
> index ba806cea55..5fa726f6d4 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -756,6 +756,8 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>          }
>  
>          default_channel = (channel_magic == cpu_to_be32(QEMU_VM_FILE_MAGIC));
> +    } else if (migrate_multifd() && migrate_fixed_ram()) {
> +        default_channel = multifd_recv_first_channel();

Is this check required?  IIUC you wanted to set default_channel only the
first time this function is triggered, but then IIUC that's exactly what:

        default_channel = !mis->from_src_file;

is about?

I think it may be clearer to also add "migrate_multifd_packets()" to the
previous "if" check to make sure fixed-ram won't peek it.

IIUC it only avoids that path now because the file URI doesn't yet
support QIO_CHANNEL_FEATURE_READ_MSG_PEEK; however, AFAIU it'll be fairly
easy to add, and even more reasonable for a file than for a socket.

Fundamentally that trick can only work with multifd init packets, which
matches what migrate_multifd_packets() means.

>      } else {
>          default_channel = !mis->from_src_file;
>      }
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 75a17ea8ab..ad51210f13 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -1242,6 +1242,11 @@ int multifd_load_setup(Error **errp)
>      return 0;
>  }
>  
> +bool multifd_recv_first_channel(void)
> +{
> +    return !multifd_recv_state;
> +}
> +
>  bool multifd_recv_all_channels_created(void)
>  {
>      int thread_count = migrate_multifd_channels();
> @@ -1284,7 +1289,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
>          /* initial packet */
>          num_packets = 1;
>      } else {
> -        id = 0;
> +        id = qatomic_read(&multifd_recv_state->count);

I was quite confused by the previous "id=0" and now this answers it.

Can we merge the two patches somehow?

>      }
>  
>      p = &multifd_recv_state->params[id];
> diff --git a/migration/multifd.h b/migration/multifd.h
> index a835643b48..a112ec7ac6 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -18,6 +18,7 @@ void multifd_save_cleanup(void);
>  int multifd_load_setup(Error **errp);
>  void multifd_load_cleanup(void);
>  void multifd_load_shutdown(void);
> +bool multifd_recv_first_channel(void);
>  bool multifd_recv_all_channels_created(void);
>  void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>  void multifd_recv_sync_main(void);
> -- 
> 2.35.3
> 

-- 
Peter Xu




* Re: [PATCH v2 21/29] migration/multifd: Add pages to the receiving side
  2023-10-23 20:36 ` [PATCH v2 21/29] migration/multifd: Add pages to the receiving side Fabiano Rosas
@ 2023-10-31 22:10   ` Peter Xu
  2023-10-31 23:18     ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-10-31 22:10 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:36:00PM -0300, Fabiano Rosas wrote:
> Currently multifd does not need to have knowledge of pages on the
> receiving side because all the information needed is within the
> packets that come in the stream.
> 
> We're about to add support to fixed-ram migration, which cannot use
> packets because it expects the ramblock section in the migration file
> to contain only the guest pages data.
> 
> Add a pointer to MultiFDPages in the multifd_recv_state and use the
> pages similarly to what we already do on the sending side. The pages
> are used to transfer data between the ram migration code in the main
> migration thread and the multifd receiving threads.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

If it'll be new code to maintain anyway, I think we don't necessarily
need to always use the multifd structs, right?

Rather than introducing MultiFDPages_t on the recv side, can we allow pages
to be distributed in chunks of (ramblock, start_offset, end_offset) tuples?
That'll be much more efficient than per-page.  We don't need page
granularity here on the recv side, we want to load chunks of memory fast.

We don't even need page granularity on the sender side, but since so far
only I cared about perf.. and obviously the plan is to eventually drop
auto-pause, so the VM can be running there, the sender must do that
per-page for now.  But on the recv side the VM must be stopped until all
RAM is loaded, so there's no such problem.  And since we'll introduce new
code anyway, IMHO we can decide how to do that even if we want to reuse
multifd.

The main thread can assign these (ramblock, start_offset, end_offset) jobs
to recv threads.  If a ramblock is too small (e.g. 1M), assign it to one
thread anyway.  If a ramblock is >512MB, cut it into slices and feed them
to the multifd threads one by one.  All the rest can stay the same.

Would that be better?  I would expect a measurable loading speed difference
with much larger chunks and those range-based tuples.
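
In pseudo-code, the job descriptor could be as simple as (rough sketch,
the struct name and fields are made up):

    typedef struct {
        RAMBlock *block;
        ram_addr_t start_offset;    /* inclusive, within the ramblock */
        ram_addr_t end_offset;      /* exclusive */
    } RecvJob;

The main thread would slice each ramblock into such jobs and hand them
to the recv threads.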

Thanks,

-- 
Peter Xu




* Re: [PATCH v2 21/29] migration/multifd: Add pages to the receiving side
  2023-10-31 22:10   ` Peter Xu
@ 2023-10-31 23:18     ` Fabiano Rosas
  2023-11-01 15:55       ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-10-31 23:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana

Peter Xu <peterx@redhat.com> writes:

> On Mon, Oct 23, 2023 at 05:36:00PM -0300, Fabiano Rosas wrote:
>> Currently multifd does not need to have knowledge of pages on the
>> receiving side because all the information needed is within the
>> packets that come in the stream.
>> 
>> We're about to add support to fixed-ram migration, which cannot use
>> packets because it expects the ramblock section in the migration file
>> to contain only the guest pages data.
>> 
>> Add a pointer to MultiFDPages in the multifd_recv_state and use the
>> pages similarly to what we already do on the sending side. The pages
>> are used to transfer data between the ram migration code in the main
>> migration thread and the multifd receiving threads.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> If it'll be new code to maintain anyway, I think we don't necessarily
> need to always use the multifd structs, right?
>

For the sending side, unrelated to this series, I'm experimenting with
defining a generic structure to be passed into multifd:

struct MultiFDData_t {
    void *opaque;
    size_t size;
    bool ready;
    void (*cleanup_fn)(void *);
};

The client code (ram.c) would use the opaque field to put whatever it
wants in it. Maybe we could have a similar concept on the receiving
side?
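
For example, the sending side could fill it in roughly like this
(hypothetical usage; pages, payload_size and pages_cleanup are made up):

    MultiFDData_t *data = g_new0(MultiFDData_t, 1);

    data->opaque = pages;              /* client-owned payload */
    data->size = payload_size;         /* bytes described by opaque */
    data->cleanup_fn = pages_cleanup;  /* called by multifd when done */
    data->ready = true;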

Here's a PoC I'm writing, if you're interested:

https://github.com/farosas/qemu/commits/multifd-packet-cleanups

(I'm delaying sending this to the list because we already have a
reasonable backlog of features and refactorings to merge.)

> Rather than introducing MultiFDPages_t on the recv side, can we allow pages
> to be distributed in chunks of (ramblock, start_offset, end_offset) tuples?
> That'll be much more efficient than per-page.  We don't need page
> granularity here on the recv side, we want to load chunks of memory fast.
>
> We don't even need page granularity on the sender side, but since so far
> only I cared about perf.. and obviously the plan is to eventually drop
> auto-pause, so the VM can be running there, the sender must do that
> per-page for now.  But on the recv side the VM must be stopped until all
> RAM is loaded, so there's no such problem.  And since we'll introduce new
> code anyway, IMHO we can decide how to do that even if we want to reuse
> multifd.
>
> The main thread can assign these (ramblock, start_offset, end_offset) jobs
> to recv threads.  If a ramblock is too small (e.g. 1M), assign it to one
> thread anyway.  If a ramblock is >512MB, cut it into slices and feed them
> to the multifd threads one by one.  All the rest can stay the same.
>
> Would that be better?  I would expect a measurable loading speed difference
> with much larger chunks and those range-based tuples.

I need to check how that would interact with the existing recv_thread
code. Hopefully there's nothing there preventing us from using a
different data structure.




* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-10-31 19:03       ` Peter Xu
@ 2023-11-01  9:26         ` Daniel P. Berrangé
  2023-11-01 14:21           ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-11-01  9:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

On Tue, Oct 31, 2023 at 03:03:50PM -0400, Peter Xu wrote:
> On Wed, Oct 25, 2023 at 11:07:33AM -0300, Fabiano Rosas wrote:
> > >> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
> > >> +{
> > >> +    g_autofree unsigned long *bitmap = NULL;
> > >> +    struct FixedRamHeader header;
> > >> +    size_t bitmap_size;
> > >> +    long num_pages;
> > >> +    int ret = 0;
> > >> +
> > >> +    ret = fixed_ram_read_header(f, &header);
> > >> +    if (ret < 0) {
> > >> +        error_report("Error reading fixed-ram header");
> > >> +        return -EINVAL;
> > >> +    }
> > >> +
> > >> +    block->pages_offset = header.pages_offset;
> > >
> > > Do you think it is worth sanity checking that 'pages_offset' is aligned
> > > in some way.
> > >
> > > It is nice that we have flexibility to change the alignment in future
> > > if we find 1 MB is not optimal, so I wouldn't want to force 1MB align
> > > check htere. Perhaps we could at least sanity check for alignment at
> > > TARGET_PAGE_SIZE, to detect a gross data corruption problem ?
> > >
> > 
> > I don't see why not. I'll add it.
> 
> Is there any explanation for that 1MB offset, and how the number was
> chosen?  Thanks,

The fixed-ram format is anticipating the use of O_DIRECT.

With O_DIRECT both the buffers in memory, and the file handle offset
have alignment requirements. The buffer alignments are usually page
sized, and QEMU RAM blocks will trivially satisfy those.

The file handle offset alignment varies per filesystem. While you can
query the alignment for the FS holding the file with statx(), that is
not appropriate to do. If a user saves/restores QEMU state to a file, we
must assume there is a chance the user will copy the saved state to a
different filesystem.

IOW, we want alignment to satisfy the likely worst case.

1 MB is a nice round number that is large enough that it is almost
certainly going to satisfy any filesystem alignment. In fact it is
likely massive overkill. Nonetheless, 1 MB is still tiny in the
context of guest RAM sizes, so no one is going to notice the padding
holes in the file from this.

IOW, the 1 MB figure is an arbitrary, but somewhat informed choice.
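
For reference, the per-filesystem query would look roughly like this
(rough sketch; STATX_DIOALIGN needs Linux 6.1+, which is exactly why we
can't rely on it):

    struct statx stx;

    if (statx(fd, "", AT_EMPTY_PATH, STATX_DIOALIGN, &stx) == 0 &&
        (stx.stx_mask & STATX_DIOALIGN)) {
        /* stx_dio_offset_align: required file offset alignment */
        /* stx_dio_mem_align: required memory buffer alignment */
    }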

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-10-31 19:05                             ` Fabiano Rosas
@ 2023-11-01  9:30                               ` Daniel P. Berrangé
  2023-11-01 12:16                                 ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-11-01  9:30 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

On Tue, Oct 31, 2023 at 04:05:46PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Tue, Oct 31, 2023 at 12:52:41PM -0300, Fabiano Rosas wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >> >
> >> > I guess I'm not seeing the problem still.  A single FD is passed across
> >> > from libvirt, but QEMU is free to turn that into *many* FDs for its
> >> > internal use, using dup() and then setting O_DIRECT on as many/few of
> >> > the dup()d FDs as its wants to.
> >> 
> >> The problem is that duplicated FDs share the file status flags. If we
> >> set O_DIRECT on the multifd channels and the main thread happens to do
> >> an unaligned write with qemu_file_put* then the filesystem will fail
> >> that write.
> >
> > Doh, I had forgotten that sharing.
> >
> > Do we have any synchronization between multifd  channels and the main
> > thread ?  eg does the main thread wait for RAM sending completion
> > before carrying on writing other non-RAM data ?
> 
> We do, but the issue with that approach is that there are no rules
> for adding data into the stream. Anyone could add a qemu_put_* call
> right in the middle of the section for whatever reason.
> 
> That is almost a separate matter due to our current compatibility model
> being based on capabilities rather than resilience of the stream
> format. So extraneous data in the stream always causes the migration to
> break.
> 
> But with the O_DIRECT situation we'd be adding another aspect to
> this. Not only would changing the code require syncing capabilities (as
> it does today), it would also require knowing which parts of the stream
> can be interrupted by new data and which cannot.
> 
> So while it would probably work, it's also a little fragile. If QEMU
> were given 2 FDs or given access to the file, then only the multifd
> channels would get O_DIRECT and they would be guaranteed not to have
> extraneous unaligned data showing up.

So the problem with add-fd is that when requesting a FD, the monitor
code masks flags with O_ACCMODE.  What if we extended it so that the
monitor masks with O_ACCMODE | O_DIRECT?

That would let us pass one plain FD and one O_DIRECT FD, and be able
to ask for each separately by setting O_DIRECT or not.

Existing users of add-fd are not likely to be affected since none of
them will be using O_DIRECT.
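
A sketch of that matching rule as a hypothetical helper (the real monitor
code is structured differently; this just illustrates the proposed mask):

    /* Match an fdset member's flags against the requested open flags,
     * comparing O_ACCMODE plus O_DIRECT instead of O_ACCMODE alone */
    static bool fdset_flags_match(int fd_flags, int requested_flags)
    {
        const int mask = O_ACCMODE | O_DIRECT;

        return (fd_flags & mask) == (requested_flags & mask);
    }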

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-11-01  9:30                               ` Daniel P. Berrangé
@ 2023-11-01 12:16                                 ` Fabiano Rosas
  2023-11-01 12:23                                   ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-11-01 12:16 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Oct 31, 2023 at 04:05:46PM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Tue, Oct 31, 2023 at 12:52:41PM -0300, Fabiano Rosas wrote:
>> >> Daniel P. Berrangé <berrange@redhat.com> writes:
>> >> >
>> >> > I guess I'm not seeing the problem still.  A single FD is passed across
>> >> > from libvirt, but QEMU is free to turn that into *many* FDs for its
>> >> > internal use, using dup() and then setting O_DIRECT on as many/few of
>> >> > the dup()d FDs as it wants to.
>> >> 
>> >> The problem is that duplicated FDs share the file status flags. If we
>> >> set O_DIRECT on the multifd channels and the main thread happens to do
>> >> an unaligned write with qemu_file_put* then the filesystem will fail
>> >> that write.
>> >
>> > Doh, I had forgotten that sharing.
>> >
>> > Do we have any synchronization between multifd  channels and the main
>> > thread ?  eg does the main thread wait for RAM sending completion
>> > before carrying on writing other non-RAM data ?
>> 
>> We do have, but the issue with that approach is that there are no rules
>> for adding data into the stream. Anyone could add a qemu_put_* call
>> right in the middle of the section for whatever reason.
>> 
>> That is almost a separate matter due to our current compatibility model
>> being based on capabilities rather than resilience of the stream
>> format. So extraneous data in the stream always causes the migration to
>> break.
>> 
>> But with the O_DIRECT situation we'd be adding another aspect to
>> this. Not only does changing the code require syncing capabilities (as it
>> does today), but it would also require knowing which parts of the stream
>> can be interrupted by new data and which cannot.
>> 
>> So while it would probably work, it's also a little fragile. If QEMU
>> were given 2 FDs or given access to the file, then only the multifd
>> channels would get O_DIRECT and they are guaranteed to not have
>> extraneous unaligned data showing up.
>
> So the problem with add-fd is that when requesting a FD, the monitor
> code masks flags with O_ACCMODE.  What if we extended it such that
> the monitor masked with O_ACCMODE | O_DIRECT?
>
> That would let us pass 1 plain FD and one O_DIRECT fd, and be able
> to ask for each separately by setting O_DIRECT or not.

That would likely work. The usage gets a little more complicated, but
we'd be using fdset as it was intended.

Should we keep the direct-io capability? If the user now needs to set
O_DIRECT and also set the cap, that seems a little redundant. I could
keep O_DIRECT in the flags (when supported) and test after open if we
got the flag set. If it's not set, then we remove O_DIRECT from the
flags and retry.
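
Roughly this shape, as a sketch (open_maybe_direct() is an invented name
and the retry policy is simplified):

    static int open_maybe_direct(const char *path, int flags, Error **errp)
    {
        int fd = qemu_open(path, flags | O_DIRECT, errp);

        /* with an fdset the returned FD may not have O_DIRECT set;
         * test after open and retry without the flag if needed */
        if (fd >= 0 && !(fcntl(fd, F_GETFL) & O_DIRECT)) {
            close(fd);
            fd = qemu_open(path, flags, errp);
        }
        return fd;
    }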

> Existing users of add-fd are not likely to be affected since none of
> them will be using O_DIRECT.

I had thought of passing a comparison function into
monitor_fdset_dup_fd_add() to avoid affecting existing users. That would
require plumbing it through qemu_open_internal() or moving
monitor_fdset_dup_fd_add() earlier in the stack (probably more
sensible). I'll not worry about that for now though, let's first make
sure the approach you suggested works.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-11-01 12:16                                 ` Fabiano Rosas
@ 2023-11-01 12:23                                   ` Daniel P. Berrangé
  2023-11-01 12:30                                     ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-11-01 12:23 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

On Wed, Nov 01, 2023 at 09:16:33AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> >
> > So the problem with add-fd is that when requesting a FD, the monitor
> > code masks flags with O_ACCMODE.  What if we extended it such that
> > the monitor masked with O_ACCMODE | O_DIRECT?
> >
> > That would let us pass 1 plain FD and one O_DIRECT fd, and be able
> > to ask for each separately by setting O_DIRECT or not.
> 
> That would likely work. The usage gets a little more complicated, but
> we'd be using fdset as it was intended.
> 
> Should we keep the direct-io capability? If the user now needs to set
> O_DIRECT and also set the cap, that seems a little redundant. I could
> keep O_DIRECT in the flags (when supported) and test after open if we
> got the flag set. If it's not set, then we remove O_DIRECT from the
> flags and retry.

While it is redundant, I like the idea of always requiring the
direct-io capability to be set, as a statement of intent. There's
a decent chance for apps to mess up with FD passing, and so by
seeing the 'direct-io' capability we know what the app intended
to do.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 28/29] migration: Add direct-io parameter
  2023-11-01 12:23                                   ` Daniel P. Berrangé
@ 2023-11-01 12:30                                     ` Fabiano Rosas
  0 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-11-01 12:30 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, qemu-devel, Juan Quintela, Peter Xu,
	Leonardo Bras, Claudio Fontana, Eric Blake

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Nov 01, 2023 at 09:16:33AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> >
>> > So the problem with add-fd is that when requesting a FD, the monitor
>> > code masks flags with O_ACCMODE.  What if we extended it such that
>> > the monitor masked with O_ACCMODE | O_DIRECT?
>> >
>> > That would let us pass 1 plain FD and one O_DIRECT fd, and be able
>> > to ask for each separately by setting O_DIRECT or not.
>> 
>> That would likely work. The usage gets a little more complicated, but
>> we'd be using fdset as it was intended.
>> 
>> Should we keep the direct-io capability? If the user now needs to set
>> O_DIRECT and also set the cap, that seems a little redundant. I could
>> keep O_DIRECT in the flags (when supported) and test after open if we
>> got the flag set. If it's not set, then we remove O_DIRECT from the
>> flags and retry.
>
> While it is redundant, I like the idea of always requiring the
> direct-io capability to be set, as a statement of intent. There's
> a decent chance for apps to mess up with FD passing, and so by
> seeing the 'direct-io' capability we know what the app intended
> to do.

Ok. I'll go write some code then. Thanks for the help!


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-11-01  9:26         ` Daniel P. Berrangé
@ 2023-11-01 14:21           ` Peter Xu
  2023-11-01 14:28             ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-11-01 14:21 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

On Wed, Nov 01, 2023 at 09:26:46AM +0000, Daniel P. Berrangé wrote:
> On Tue, Oct 31, 2023 at 03:03:50PM -0400, Peter Xu wrote:
> > On Wed, Oct 25, 2023 at 11:07:33AM -0300, Fabiano Rosas wrote:
> > > >> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
> > > >> +{
> > > >> +    g_autofree unsigned long *bitmap = NULL;
> > > >> +    struct FixedRamHeader header;
> > > >> +    size_t bitmap_size;
> > > >> +    long num_pages;
> > > >> +    int ret = 0;
> > > >> +
> > > >> +    ret = fixed_ram_read_header(f, &header);
> > > >> +    if (ret < 0) {
> > > >> +        error_report("Error reading fixed-ram header");
> > > >> +        return -EINVAL;
> > > >> +    }
> > > >> +
> > > >> +    block->pages_offset = header.pages_offset;
> > > >
> > > > Do you think it is worth sanity checking that 'pages_offset' is aligned
> > > > in some way.
> > > >
> > > > It is nice that we have flexibility to change the alignment in future
> > > > if we find 1 MB is not optimal, so I wouldn't want to force 1MB align
> > > > check htere. Perhaps we could at least sanity check for alignment at
> > > > TARGET_PAGE_SIZE, to detect a gross data corruption problem ?
> > > >
> > > 
> > > I don't see why not. I'll add it.
> > 
> > Is there any explanation on why that 1MB offset, and how the number is
> > chosen?  Thanks,
> 
> The fixed-ram format is anticipating the use of O_DIRECT.
> 
> With O_DIRECT both the buffers in memory, and the file handle offset
> have alignment requirements. The buffer alignments are usually page
> sized, and QEMU RAM blocks will trivially satisfy those.
> 
> The file handle offset alignment varies per filesystem. While you can
> query the alignment for the FS holding the file with statx(), that is
> not appropriate to do. If a user saves/restores QEMU state to file, we
> must assume there is a chance the user will copy the saved state to a
> different filesystem.
> 
> IOW, we want alignment to satisfy the likely worst case.
> 
> Picking 1 MB is a nice round number that is large enough that it is
> almost certainly going to satisfy any filesystem alignment. In fact
> it is likely massive overkill. None the less 1 MB is also still tiny

Is that calculated by something like the max of possible host (small) page
sizes?  I've no idea what it is for all archs; the max small page size I'm
aware of is 64K, but I don't know a lot of archs.

> in the context of guest RAM sizes, so no one is going to notice the
> padding holes in the file from this.
> 
> IOW, the 1 MB choice is an arbitrary, but somewhat informed choice.

I see, thanks.  Shall we document it clearly?  Then if there's a need to
adjust that value we will know what to reference.
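
For reference, the sanity check being discussed could look something like
this (names follow the quoted patch; the error message is illustrative):

    block->pages_offset = header.pages_offset;

    /* The save-side alignment (currently 1 MB) may change, so only
     * require page alignment here, to catch gross data corruption */
    if (!QEMU_IS_ALIGNED(block->pages_offset, TARGET_PAGE_SIZE)) {
        error_report("Error: unaligned fixed-ram pages_offset");
        return -EINVAL;
    }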

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-11-01 14:21           ` Peter Xu
@ 2023-11-01 14:28             ` Daniel P. Berrangé
  2023-11-01 15:00               ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-11-01 14:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

On Wed, Nov 01, 2023 at 10:21:07AM -0400, Peter Xu wrote:
> On Wed, Nov 01, 2023 at 09:26:46AM +0000, Daniel P. Berrangé wrote:
> > On Tue, Oct 31, 2023 at 03:03:50PM -0400, Peter Xu wrote:
> > > On Wed, Oct 25, 2023 at 11:07:33AM -0300, Fabiano Rosas wrote:
> > > > >> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
> > > > >> +{
> > > > >> +    g_autofree unsigned long *bitmap = NULL;
> > > > >> +    struct FixedRamHeader header;
> > > > >> +    size_t bitmap_size;
> > > > >> +    long num_pages;
> > > > >> +    int ret = 0;
> > > > >> +
> > > > >> +    ret = fixed_ram_read_header(f, &header);
> > > > >> +    if (ret < 0) {
> > > > >> +        error_report("Error reading fixed-ram header");
> > > > >> +        return -EINVAL;
> > > > >> +    }
> > > > >> +
> > > > >> +    block->pages_offset = header.pages_offset;
> > > > >
> > > > > Do you think it is worth sanity checking that 'pages_offset' is aligned
> > > > > in some way.
> > > > >
> > > > > It is nice that we have flexibility to change the alignment in future
> > > > > if we find 1 MB is not optimal, so I wouldn't want to force 1MB align
> > > > > check htere. Perhaps we could at least sanity check for alignment at
> > > > > TARGET_PAGE_SIZE, to detect a gross data corruption problem ?
> > > > >
> > > > 
> > > > I don't see why not. I'll add it.
> > > 
> > > Is there any explanation on why that 1MB offset, and how the number is
> > > chosen?  Thanks,
> > 
> > The fixed-ram format is anticipating the use of O_DIRECT.
> > 
> > With O_DIRECT both the buffers in memory, and the file handle offset
> > have alignment requirements. The buffer alignments are usually page
> > sized, and QEMU RAM blocks will trivially satisfy those.
> > 
> > The file handle offset alignment varies per filesystem. While you can
> > query the alignment for the FS holding the file with statx(), that is
> > not appropriate to do. If a user saves/restores QEMU state to file, we
> > must assume there is a chance the user will copy the saved state to a
> > different filesystem.
> > 
> > IOW, we want alignment to satisfy the likely worst case.
> > 
> > Picking 1 MB is a nice round number that is large enough that it is
> > almost certainly going to satisfy any filesystem alignment. In fact
> > it is likely massive overkill. None the less 1 MB is also still tiny
> 
> Is that calculated by something like max of possible host (small) page
> sizes?  I've no idea what's it for all archs, the max small page size I'm
> aware of is 64K, but I don't know a lot archs.

It wasn't anything as precise as that. It is literally just that "1MB" looks
large enough that we don't need to spend time investigating per-arch
page sizes.

Having said that I'm now having slight self-doubt wrt huge pages, though
I swear we investigated it last year when first discussing this feature.
The guest memory will of course already be suitably aligned, but I'm
wondering if the filesystem I/O places any offset alignment constraints
related to non-default page size.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 23/29] migration/ram: Add a wrapper for fixed-ram shadow bitmap
  2023-10-23 20:36 ` [PATCH v2 23/29] migration/ram: Add a wrapper for fixed-ram shadow bitmap Fabiano Rosas
@ 2023-11-01 14:29   ` Peter Xu
  0 siblings, 0 replies; 128+ messages in thread
From: Peter Xu @ 2023-11-01 14:29 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana

On Mon, Oct 23, 2023 at 05:36:02PM -0300, Fabiano Rosas wrote:
> We'll need to set the shadow_bmap bits from outside ram.c soon and
> TARGET_PAGE_BITS is poisoned, so add a wrapper to it.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Merge this into existing patch to add ram.c usage?

> ---
>  migration/ram.c | 5 +++++
>  migration/ram.h | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index cea6971ab2..8e34c1b597 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3160,6 +3160,11 @@ static void ram_save_shadow_bmap(QEMUFile *f)
>      }
>  }
>  
> +void ramblock_set_shadow_bmap_atomic(RAMBlock *block, ram_addr_t offset)
> +{
> +    set_bit_atomic(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
> +}
> +
>  /**
>   * ram_save_iterate: iterative stage for migration
>   *
> diff --git a/migration/ram.h b/migration/ram.h
> index 145c915ca7..1acadffb06 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -75,6 +75,7 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
>  bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
>  void postcopy_preempt_shutdown_file(MigrationState *s);
>  void *postcopy_preempt_thread(void *opaque);
> +void ramblock_set_shadow_bmap_atomic(RAMBlock *block, ram_addr_t offset);
>  
>  /* ram cache */
>  int colo_init_ram_cache(void);
> -- 
> 2.35.3
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-11-01 14:28             ` Daniel P. Berrangé
@ 2023-11-01 15:00               ` Peter Xu
  2023-11-06 13:18                 ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-11-01 15:00 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

On Wed, Nov 01, 2023 at 02:28:24PM +0000, Daniel P. Berrangé wrote:
> On Wed, Nov 01, 2023 at 10:21:07AM -0400, Peter Xu wrote:
> > On Wed, Nov 01, 2023 at 09:26:46AM +0000, Daniel P. Berrangé wrote:
> > > On Tue, Oct 31, 2023 at 03:03:50PM -0400, Peter Xu wrote:
> > > > On Wed, Oct 25, 2023 at 11:07:33AM -0300, Fabiano Rosas wrote:
> > > > > >> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
> > > > > >> +{
> > > > > >> +    g_autofree unsigned long *bitmap = NULL;
> > > > > >> +    struct FixedRamHeader header;
> > > > > >> +    size_t bitmap_size;
> > > > > >> +    long num_pages;
> > > > > >> +    int ret = 0;
> > > > > >> +
> > > > > >> +    ret = fixed_ram_read_header(f, &header);
> > > > > >> +    if (ret < 0) {
> > > > > >> +        error_report("Error reading fixed-ram header");
> > > > > >> +        return -EINVAL;
> > > > > >> +    }
> > > > > >> +
> > > > > >> +    block->pages_offset = header.pages_offset;
> > > > > >
> > > > > > Do you think it is worth sanity checking that 'pages_offset' is aligned
> > > > > > in some way.
> > > > > >
> > > > > > It is nice that we have flexibility to change the alignment in future
> > > > > > if we find 1 MB is not optimal, so I wouldn't want to force 1MB align
> > > > > > check htere. Perhaps we could at least sanity check for alignment at
> > > > > > TARGET_PAGE_SIZE, to detect a gross data corruption problem ?
> > > > > >
> > > > > 
> > > > > I don't see why not. I'll add it.
> > > > 
> > > > Is there any explanation on why that 1MB offset, and how the number is
> > > > chosen?  Thanks,
> > > 
> > > The fixed-ram format is anticipating the use of O_DIRECT.
> > > 
> > > With O_DIRECT both the buffers in memory, and the file handle offset
> > > have alignment requirements. The buffer alignments are usually page
> > > sized, and QEMU RAM blocks will trivially satisfy those.
> > > 
> > > The file handle offset alignment varies per filesystem. While you can
> > > query the alignment for the FS holding the file with statx(), that is
>> > > not appropriate to do. If a user saves/restores QEMU state to file, we
> > > must assume there is a chance the user will copy the saved state to a
> > > different filesystem.
> > > 
> > > IOW, we want alignment to satisfy the likely worst case.
> > > 
> > > Picking 1 MB is a nice round number that is large enough that it is
> > > almost certainly going to satisfy any filesystem alignment. In fact
> > > it is likely massive overkill. None the less 1 MB is also still tiny
> > 
> > Is that calculated by something like the max of possible host (small) page
> > sizes?  I've no idea what it is for all archs; the max small page size I'm
> > aware of is 64K, but I don't know a lot of archs.
> 
> It wasn't anything as precise as that. It is literally just that "1MB" looks
> large enough that we don't need to spend time investigating per-arch
> page sizes.

IMHO we need that precision in the reasoning, and to document it, even if
not in the exact number we prefer, which can be prone to change later.
Otherwise that value will be pure magic after a few years or even less, and
it'll be more of a challenge later to figure things out.

> 
> Having said that I'm now having slight self-doubt wrt huge pages, though
> I swear we investigated it last year when first discussing this feature.
> The guest memory will of course already be suitably aligned, but I'm
> wondering if the filesystem I/O places any offset alignment constraints
> related to non-default page size.

AFAIU direct IO is about pinning the IO buffers, playing the role of fs
cache instead.  If my understanding is correct, huge pages shouldn't be a
problem for such pinning, because it's legal to pin part of a huge page.

After the partial huge pages are pinned, they should be treated as normal fs
buffers when doing block IO.  And then the file offset should, per my
understanding, no longer be relevant to the type of backend behind the user
buffer that triggers the read()/write().

But maybe I missed something, if so that will need to be part of
documentation of 1MB magic value, IMHO.  We may want to double check with
that by doing fixed-ram migration on e.g. 1GB hugetlb memory-backend-file
with 1MB file offset per-ramblock.
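
Something along these lines, perhaps (a rough sketch; the fixed-ram
capability and the file: URI follow this series, paths and sizes are
illustrative):

    qemu-system-x86_64 -m 1G \
        -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages1G,prealloc=on \
        -machine memory-backend=mem0 ...

    (qemu) migrate_set_capability fixed-ram on
    (qemu) migrate file:/var/tmp/vm-state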

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration
  2023-10-25  9:39   ` Daniel P. Berrangé
  2023-10-25 14:03     ` Fabiano Rosas
@ 2023-11-01 15:23     ` Peter Xu
  2023-11-01 15:52       ` Daniel P. Berrangé
  1 sibling, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-11-01 15:23 UTC (permalink / raw)
  To: Daniel P. Berrangé, Fabiano Rosas
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé

On Wed, Oct 25, 2023 at 10:39:58AM +0100, Daniel P. Berrangé wrote:
> If I'm reading the code correctly the new format has some padding
> such that each "ramblock pages" region starts on a 1 MB boundary.
> 
> eg so we get:
> 
>  --------------------------------
>  | ramblock 1 header            |
>  --------------------------------
>  | ramblock 1 fixed-ram header  |
>  --------------------------------
>  | padding to next 1MB boundary |
>  | ...                          |
>  --------------------------------
>  | ramblock 1 pages             |
>  | ...                          |
>  --------------------------------
>  | ramblock 2 header            |
>  --------------------------------
>  | ramblock 2 fixed-ram header  |
>  --------------------------------
>  | padding to next 1MB boundary |
>  | ...                          |
>  --------------------------------
>  | ramblock 2 pages             |
>  | ...                          |
>  --------------------------------
>  | ...                          |
>  --------------------------------
>  | RAM_SAVE_FLAG_EOS            |
>  --------------------------------
>  | ...                          |
>  -------------------------------

When reading the series, I was thinking one more thing on whether fixed-ram
would like to leverage compression in the future?

To be exact, not really fixed-ram as a feature, but non-live snapshot as
the real use case.  More below.

I just noticed that compression can be a great feature to have for such use
case, where the image size can be further shrunk noticeably.  In this
case, speed of savevm may not matter as much as image size (as compression
can take some more cpu overhead): VM will be stopped anyway.

With current fixed-ram layout, we probably can't have compression due to
two reasons:

  - We offset each page with page alignment in the final image, and that's
    where fixed-ram as the term comes from; more fundamentally,

  - We allow src VM to run (dropping auto-pause as the plan, even if we
    plan to guarantee it not run; QEMU still can't take that as
    guaranteed), then we need page granule on storing pages, and then it's
    hard to know the size of each page after compression.

If with the guarantee that VM is stopped, I think compression should be
easy to get?  Because right after dropping the page-granule requirement, we
can compress in chunks, storing binary in the image, one page written once.
We may lose O_DIRECT but we can consider the hardware accelerators on
[de]compress if necessary.

I'm just raising this for discussion to make sure we're on the same page.
Again, maybe that's not a concern to anyone, but I want to double check
with all of us, because it will affect the whole design including the image
layout.  I want to make sure we don't introduce another totally different
image layout in the near future just to support compression.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration
  2023-11-01 15:23     ` Peter Xu
@ 2023-11-01 15:52       ` Daniel P. Berrangé
  2023-11-01 16:24         ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-11-01 15:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé

On Wed, Nov 01, 2023 at 11:23:37AM -0400, Peter Xu wrote:
> On Wed, Oct 25, 2023 at 10:39:58AM +0100, Daniel P. Berrangé wrote:
> > If I'm reading the code correctly the new format has some padding
> > such that each "ramblock pages" region starts on a 1 MB boundary.
> > 
> > eg so we get:
> > 
> >  --------------------------------
> >  | ramblock 1 header            |
> >  --------------------------------
> >  | ramblock 1 fixed-ram header  |
> >  --------------------------------
> >  | padding to next 1MB boundary |
> >  | ...                          |
> >  --------------------------------
> >  | ramblock 1 pages             |
> >  | ...                          |
> >  --------------------------------
> >  | ramblock 2 header            |
> >  --------------------------------
> >  | ramblock 2 fixed-ram header  |
> >  --------------------------------
> >  | padding to next 1MB boundary |
> >  | ...                          |
> >  --------------------------------
> >  | ramblock 2 pages             |
> >  | ...                          |
> >  --------------------------------
> >  | ...                          |
> >  --------------------------------
> >  | RAM_SAVE_FLAG_EOS            |
> >  --------------------------------
> >  | ...                          |
> >  -------------------------------
> 
> When reading the series, I was thinking one more thing on whether fixed-ram
> would like to leverage compression in the future?

Libvirt currently supports compression of saved state images, so yes,
I think compression is a desirable feature.

Due to libvirt's architecture it does compression on the stream and
the final step in the sequence bounce-buffers into suitably aligned
memory required for O_DIRECT.

> To be exact, not really fixed-ram as a feature, but non-live snapshot as
> the real use case.  More below.
> 
> I just noticed that compression can be a great feature to have for such use
> case, where the image size can be further shrunk noticeably.  In this
> case, speed of savevm may not matter as much as image size (as compression
> can take some more cpu overhead): VM will be stopped anyway.
> 
> With current fixed-ram layout, we probably can't have compression due to
> two reasons:
> 
>   - We offset each page with page alignment in the final image, and that's
>     where fixed-ram as the term comes from; more fundamentally,
> 
>   - We allow src VM to run (dropping auto-pause as the plan, even if we
>     plan to guarantee it not run; QEMU still can't take that as
>     guaranteed), then we need page granule on storing pages, and then it's
> >     hard to know the size of each page after compression.
> 
> If with the guarantee that VM is stopped, I think compression should be
> easy to get?  Because right after dropping the page-granule requirement, we
> can compress in chunks, storing binary in the image, one page written once.
> We may lose O_DIRECT but we can consider the hardware accelerators on
> [de]compress if necessary.

We can keep O_DIRECT if we buffer in QEMU between compressor output
and disk I/O, which is what libvirt does. QEMU would still be saving
at least one extra copy compared to libvirt.


The fixed RAM layout was primarily intended to allow easy parallel
I/O without needing any synchronization between threads. In theory
fixed RAM layout even allows you to do something fun like

   mapped_addr = mmap(NULL, ramblock_size, PROT_READ, MAP_SHARED,
                      save_state_fd, offset);
   memcpy(ramblock, mapped_addr, ramblock_size);
   munmap(mapped_addr, ramblock_size);

which would still be buffered I/O without O_DIRECT, but might be better
than many writes() as you avoid 1000's of syscalls.

Anyway back to compression, I think if you wanted to allow for parallel
I/O, then it would require a different "fixed ram" approach, where each
multifd  thread requested use of a 64 MB region, compressed until that
was full, then asked for another 64 MB region, repeat until done.

The reason we didn't want to break up the file format into regions like
this is because we wanted to allow for flexbility into configuration on
save / restore. eg  you might save using 7 threads, but restore using
3 threads. We didn't want the on-disk layout to have any structural
artifact that was related to the number of threads saving data, as that
would make restore less efficient. eg 2 threads would process 2 chunks
each and 1 thread would process 3 chunks, which is unbalanced.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 21/29] migration/multifd: Add pages to the receiving side
  2023-10-31 23:18     ` Fabiano Rosas
@ 2023-11-01 15:55       ` Peter Xu
  2023-11-01 17:20         ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-11-01 15:55 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana

On Tue, Oct 31, 2023 at 08:18:06PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Oct 23, 2023 at 05:36:00PM -0300, Fabiano Rosas wrote:
> >> Currently multifd does not need to have knowledge of pages on the
> >> receiving side because all the information needed is within the
> >> packets that come in the stream.
> >> 
> >> We're about to add support to fixed-ram migration, which cannot use
> >> packets because it expects the ramblock section in the migration file
> >> to contain only the guest pages data.
> >> 
> >> Add a pointer to MultiFDPages in the multifd_recv_state and use the
> >> pages similarly to what we already do on the sending side. The pages
> >> are used to transfer data between the ram migration code in the main
> >> migration thread and the multifd receiving threads.
> >> 
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >
> > If it'll be new code to maintain anyway, I think we don't necessarily
> > always use multifd structs, right?
> >
> 
> For the sending side, unrelated to this series, I'm experimenting with
> defining a generic structure to be passed into multifd:
> 
> struct MultiFDData_t {
>     void *opaque;
>     size_t size;
>     bool ready;
>     void (*cleanup_fn)(void *);
> };
> 
> The client code (ram.c) would use the opaque field to put whatever it
> wants in it. Maybe we could have a similar concept on the receiving
> side?
> 
> Here's a PoC I'm writing, if you're interested:
> 
> https://github.com/farosas/qemu/commits/multifd-packet-cleanups
> 
> (I'm delaying sending this to the list because we already have a
> reasonable backlog of features and refactorings to merge.)

I went through the idea; I agree it's reasonable to generalize multifd to
drop the page constraints.  Actually I'm wondering whether it would be
better to have a thread pool model for migration, so that multifd can be
based on it.

Something like: job submissions, proper locks, notifications, quits,
etc. with a bunch of API to manipulate the thread pool.

And actually.. I just noticed we have. :) See util/thread-pool.c.  I didn't
have a closer look, but that looks like something good if we can work on top
(e.g., I don't think we want the bottom halves..), or refactor to satisfy
all our needs from a migration pov.  Not something I'm asking right away, but
maybe we can at least keep an eye on it.

> 
> > Rather than introducing MultiFDPages_t into recv side, can we allow pages
> > to be distributed in chunks of (ramblock, start_offset, end_offset) tuples?
> > That'll be much more efficient than per-page.  We don't need page granule
> > here on recv side, we want to load chunks of mem fast.
> >
> > We don't even need page granule on sender side, but since only myself cared
> > about perf.. and obviously the plan is to even drop auto-pause, then VM can
> > be running there, so sender must do that per-page for now.  But now on recv
> > side VM must be stopped before all ram loaded, so there's no such problem.
> > And since we'll introduce new code anyway, IMHO we can decide how to do
> > that even if we want to reuse multifd.
> >
> > Main thread can assign these (ramblock, start_offset, end_offset) jobs to
> > recv threads.  If ramblock is too small (e.g. 1M), assign it anyway to one
> > thread.  If ramblock is >512MB, cut it into slices and feed them to multifd
> > threads one by one.  All the rest can be the same.
> >
> > Would that be better?  I would expect measurable loading speed difference
> > with much larger chunks and with that range-based tuples.
> 
> I need to check how that would interact with the existing recv_thread
> code. Hopefully there's nothing there preventing us from using a
> different data structure.

Sure, thanks.  Maybe there's a good way to provide a middle ground on both
"less code changes" and "easily maintainable", if that helps on this series
being merged.

What I want to make sure is that we don't introduce new complicated logic
while still not doing the job as correctly as we can.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration
  2023-11-01 15:52       ` Daniel P. Berrangé
@ 2023-11-01 16:24         ` Peter Xu
  2023-11-01 16:37           ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-11-01 16:24 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé

On Wed, Nov 01, 2023 at 03:52:18PM +0000, Daniel P. Berrangé wrote:
> On Wed, Nov 01, 2023 at 11:23:37AM -0400, Peter Xu wrote:
> > On Wed, Oct 25, 2023 at 10:39:58AM +0100, Daniel P. Berrangé wrote:
> > > If I'm reading the code correctly the new format has some padding
> > > such that each "ramblock pages" region starts on a 1 MB boundary.
> > > 
> > > eg so we get:
> > > 
> > >  --------------------------------
> > >  | ramblock 1 header            |
> > >  --------------------------------
> > >  | ramblock 1 fixed-ram header  |
> > >  --------------------------------
> > >  | padding to next 1MB boundary |
> > >  | ...                          |
> > >  --------------------------------
> > >  | ramblock 1 pages             |
> > >  | ...                          |
> > >  --------------------------------
> > >  | ramblock 2 header            |
> > >  --------------------------------
> > >  | ramblock 2 fixed-ram header  |
> > >  --------------------------------
> > >  | padding to next 1MB boundary |
> > >  | ...                          |
> > >  --------------------------------
> > >  | ramblock 2 pages             |
> > >  | ...                          |
> > >  --------------------------------
> > >  | ...                          |
> > >  --------------------------------
> > >  | RAM_SAVE_FLAG_EOS            |
> > >  --------------------------------
> > >  | ...                          |
> > >  -------------------------------
> > 
> > When reading the series, I was thinking one more thing on whether fixed-ram
> > would like to leverage compression in the future?
> 
> Libvirt currently supports compression of saved state images, so yes,
> I think compression is a desirable feature.

Ah, yeah this will work too; one more copy as you mentioned below, but
I assume that's not a major concern so far (or.. will it?).

> 
> Due to libvirt's architecture it does compression on the stream and
> the final step in the sequence bounce-buffers into suitably aligned
> memory required for O_DIRECT.
> 
> > To be exact, not really fixed-ram as a feature, but non-live snapshot as
> > the real use case.  More below.
> > 
> > I just noticed that compression can be a great feature to have for such use
> > case, where the image size can be further shrunk noticeably.  In this
> > case, speed of savevm may not matter as much as image size (as compression
> > can take some more cpu overhead): VM will be stopped anyway.
> > 
> > With current fixed-ram layout, we probably can't have compression due to
> > two reasons:
> > 
> >   - We offset each page with page alignment in the final image, and that's
> >     where fixed-ram as the term comes from; more fundamentally,
> > 
> >   - We allow src VM to run (dropping auto-pause as the plan, even if we
> >     plan to guarantee it not run; QEMU still can't take that as
> >     guaranteed), then we need page granule on storing pages, and then it's
> >     hard to know the size of each page after compression.
> > 
> > If with the guarantee that VM is stopped, I think compression should be
> > easy to get?  Because right after dropping the page-granule requirement, we
> > can compress in chunks, storing binary in the image, one page written once.
> > We may lose O_DIRECT but we can consider the hardware accelerators on
> > [de]compress if necessary.
> 
> We can keep O_DIRECT if we buffer in QEMU between compressor output
> and disk I/O, which is what libvirt does. QEMU would still be saving
> at least one extra copy compared to libvirt.
> 
> 
> The fixed RAM layout was primarily intended to allow easy parallel
> I/O without needing any synchronization between threads. In theory
> fixed RAM layout even allows you to do something fun like
> 
>    mapped_addr = mmap(NULL, ramblock_size, PROT_READ, MAP_SHARED,
>                       save_state_fd, offset);
>    memcpy(ramblock, mapped_addr, ramblock_size);
>    munmap(mapped_addr, ramblock_size);
> 
> which would still be buffered I/O without O_DIRECT, but might be better
> than many writes() as you avoid 1000's of syscalls.
> 
> Anyway back to compression, I think if you wanted to allow for parallel
> I/O, then it would require a different "fixed ram" approach, where each
> multifd  thread requested use of a 64 MB region, compressed until that
> was full, then asked for another 64 MB region, repeat until done.

Right, we need a constant buffer per-thread if so.

> 
> The reason we didn't want to break up the file format into regions like
> > this is because we wanted to allow for flexibility in configuration on
> save / restore. eg  you might save using 7 threads, but restore using
> 3 threads. We didn't want the on-disk layout to have any structural
> artifact that was related to the number of threads saving data, as that
> would make restore less efficient. eg 2 threads would process 2 chunks
> > each and 1 thread would process 3 chunks, which is unbalanced.

I didn't follow on why the image needs to contain thread number
information.

Can the sub-header for each compressed chunk be described as follows (assuming
it sits under a specific ramblock header, so the ramblock is known; see the
sketch below):

  - size of compressed data
  - (start_offset, end_offset) of pages this chunk of data represents

Then when saving, we assign 64M to each thread no matter how many there
are; each thread first compresses its 64M into binary, knowing the
size, then requests a writeback to the image, with the chunk header and
binary flushed.

Then the final image will be a sequence of chunks for each ramblock.
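
Concretely, a hypothetical sub-header for that scheme might look like
(names invented for illustration):

    /* One compressed chunk; it sits under a ramblock header, so the
     * ramblock itself is already known */
    typedef struct FixedRamCompressedChunk {
        uint64_t compressed_size; /* bytes of compressed payload following */
        uint64_t start_offset;    /* first page offset covered in the block */
        uint64_t end_offset;      /* one past the last page offset covered */
    } FixedRamCompressedChunk;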

Assuming decompress can do the same by assigning different chunks to each
decompress thread, no matter how many there are.

Would that work?

To go back to the original topic: I think it's fine if Libvirt does the
compression; that is indeed more flexible to do per-file, with whatever
compression algorithm the user wants, and it can even cover non-RAM data.

I think such considerations / thoughts on the compression solution would
also be nice to document in docs/ under this feature.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration
  2023-11-01 16:24         ` Peter Xu
@ 2023-11-01 16:37           ` Daniel P. Berrangé
  2023-11-01 17:30             ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-11-01 16:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé

On Wed, Nov 01, 2023 at 12:24:22PM -0400, Peter Xu wrote:
> On Wed, Nov 01, 2023 at 03:52:18PM +0000, Daniel P. Berrangé wrote:
> > On Wed, Nov 01, 2023 at 11:23:37AM -0400, Peter Xu wrote:
> > > On Wed, Oct 25, 2023 at 10:39:58AM +0100, Daniel P. Berrangé wrote:
> > > > If I'm reading the code correctly the new format has some padding
> > > > such that each "ramblock pages" region starts on a 1 MB boundary.
> > > > 
> > > > eg so we get:
> > > > 
> > > >  --------------------------------
> > > >  | ramblock 1 header            |
> > > >  --------------------------------
> > > >  | ramblock 1 fixed-ram header  |
> > > >  --------------------------------
> > > >  | padding to next 1MB boundary |
> > > >  | ...                          |
> > > >  --------------------------------
> > > >  | ramblock 1 pages             |
> > > >  | ...                          |
> > > >  --------------------------------
> > > >  | ramblock 2 header            |
> > > >  --------------------------------
> > > >  | ramblock 2 fixed-ram header  |
> > > >  --------------------------------
> > > >  | padding to next 1MB boundary |
> > > >  | ...                          |
> > > >  --------------------------------
> > > >  | ramblock 2 pages             |
> > > >  | ...                          |
> > > >  --------------------------------
> > > >  | ...                          |
> > > >  --------------------------------
> > > >  | RAM_SAVE_FLAG_EOS            |
> > > >  --------------------------------
> > > >  | ...                          |
> > > >  -------------------------------
> > > 
> > > When reading the series, I was thinking one more thing on whether fixed-ram
> > > would like to leverage compression in the future?
> > 
> > Libvirt currently supports compression of saved state images, so yes,
> > I think compression is a desirable feature.
> 
> Ah, yeah this will work too; one more copy as you mentioned below, but
> assume that's not a major concern so far (or.. will it?).
> 
> > 
> > Due to libvirt's architecture it does compression on the stream and
> > the final step in the sequence bounce-buffers into suitably aligned
> > memory required for O_DIRECT.
> > 
> > > To be exact, not really fixed-ram as a feature, but non-live snapshot as
> > > the real use case.  More below.
> > > 
> > > I just noticed that compression can be a great feature to have for such use
> > > case, where the image size can be further shrunk noticeably.  In this
> > > case, speed of savevm may not matter as much as image size (as compression
> > > can take some more cpu overhead): VM will be stopped anyway.
> > > 
> > > With current fixed-ram layout, we probably can't have compression due to
> > > two reasons:
> > > 
> > >   - We offset each page with page alignment in the final image, and that's
> > >     where fixed-ram as the term comes from; more fundamentally,
> > > 
> > >   - We allow src VM to run (dropping auto-pause as the plan, even if we
> > >     plan to guarantee it not run; QEMU still can't take that as
> > >     guaranteed), then we need page granule on storing pages, and then it's
> > >     hard to know the size of each page after compression.
> > > 
> > > If with the guarantee that VM is stopped, I think compression should be
> > > easy to get?  Because right after dropping the page-granule requirement, we
> > > can compress in chunks, storing binary in the image, one page written once.
> > > We may lose O_DIRECT but we can consider the hardware accelerators on
> > > [de]compress if necessary.
> > 
> > We can keep O_DIRECT if we buffer in QEMU between compressor output
> > and disk I/O, which is what libvirt does. QEMU would still be saving
> > at least one extra copy compared to libvirt.
> > 
> > 
> > The fixed RAM layout was primarily intended to allow easy parallel
> > I/O without needing any synchronization between threads. In theory
> > fixed RAM layout even allows you to do something fun like
> > 
> >    mapped_addr = mmap(NULL, ramblock_size, PROT_READ, MAP_SHARED,
> >                       save_state_fd, offset);
> >    memcpy(ramblock, mapped_addr, ramblock_size);
> >    munmap(mapped_addr, ramblock_size);
> > 
> > which would still be buffered I/O without O_DIRECT, but might be better
> > than many writes() as you avoid 1000's of syscalls.
> > 
> > Anyway back to compression, I think if you wanted to allow for parallel
> > I/O, then it would require a different "fixed ram" approach, where each
> > multifd  thread requested use of a 64 MB region, compressed until that
> > was full, then asked for another 64 MB region, repeat until done.
> 
> Right, we need a constant buffer per-thread if so.
> 
> > 
> > The reason we didn't want to break up the file format into regions like
> > this is because we wanted to allow for flexibility in configuration on
> > save / restore. eg  you might save using 7 threads, but restore using
> > 3 threads. We didn't want the on-disk layout to have any structural
> > artifact that was related to the number of threads saving data, as that
> > would make restore less efficient. eg 2 threads would process 2 chunks
> > each and 1 thread would process 3 chunks, which is unbalanced.
> 
> I didn't follow on why the image needs to contain thread number
> information.

It doesn't contain thread number information directly, but it can
be implicit in the data layout.

If you want parallel I/O, each thread has to know it is the only
one reading/writing to a particular region of the file. With the
fixed RAM layout in this series, the file offset directly maps
to the memory region. So if a thread has been given a guest page
to save it knows it will be the only thing writing to the file
at that offset. There is no relationship at all between the
number of threads and the file layout.

If you can't directly map pages to file offsets, then you need
some other way to lay out data such that each thread can safely
write. If you split up a file based on fixed size chunks, then
the number of chunks you end up with in the file is likely to be
a multiple of the number of threads you had saving data.

This means if you restore using a different number of threads,
you can't evenly assign file chunks to each restore thread.

There's no info about thread IDs in the file, but the data layout
reflects how many threads were doing work.

> Assuming decompress can do the same by assigning different chunks to each
> > decompress thread, no matter how many there are.
> 
> Would that work?

Again you get uneven workloads if the number of restore threads is
different than the save threads, as some threads will have to process
more chunks than other threads. If the chunks are small this might
not matter, if they are big it could matter.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 21/29] migration/multifd: Add pages to the receiving side
  2023-11-01 15:55       ` Peter Xu
@ 2023-11-01 17:20         ` Fabiano Rosas
  2023-11-01 17:35           ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-11-01 17:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana

Peter Xu <peterx@redhat.com> writes:

> On Tue, Oct 31, 2023 at 08:18:06PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Mon, Oct 23, 2023 at 05:36:00PM -0300, Fabiano Rosas wrote:
>> >> Currently multifd does not need to have knowledge of pages on the
>> >> receiving side because all the information needed is within the
>> >> packets that come in the stream.
>> >> 
>> >> We're about to add support to fixed-ram migration, which cannot use
>> >> packets because it expects the ramblock section in the migration file
>> >> to contain only the guest pages data.
>> >> 
>> >> Add a pointer to MultiFDPages in the multifd_recv_state and use the
>> >> pages similarly to what we already do on the sending side. The pages
>> >> are used to transfer data between the ram migration code in the main
>> >> migration thread and the multifd receiving threads.
>> >> 
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >
>> > If it'll be new code to maintain anyway, I think we don't necessarily
>> > always use multifd structs, right?
>> >
>> 
>> For the sending side, unrelated to this series, I'm experimenting with
>> defining a generic structure to be passed into multifd:
>> 
>> struct MultiFDData_t {
>>     void *opaque;
>>     size_t size;
>>     bool ready;
>>     void (*cleanup_fn)(void *);
>> };
>> 
>> The client code (ram.c) would use the opaque field to put whatever it
>> wants in it. Maybe we could have a similar concept on the receiving
>> side?
>> 
>> Here's a PoC I'm writing, if you're interested:
>> 
>> https://github.com/farosas/qemu/commits/multifd-packet-cleanups
>> 
>> (I'm delaying sending this to the list because we already have a
>> reasonable backlog of features and refactorings to merge.)
>
> I went through the idea; I agree it's reasonable to generalize multifd to
> drop the page constraints.

Ok, I'll propose it once the mailing list volume slows down.

> Actually I'm wondering whether it would be
> better to have a thread pool model for migration, so that multifd can be
> based on it.
>
> Something like: job submissions, proper locks, notifications, quits,
> etc. with a bunch of API to manipulate the thread pool.

I agree in principle.

> And actually.. I just noticed we have. :) See util/thread-pool.c.  I didn't
> have a closer look, but that looks like something good if we can work on top
> (e.g., I don't think we want the bottom halves..), or refactor to satisfy
> all our needs from a migration pov.  Not something I'm asking right away, but
> maybe we can at least keep an eye on it.
>

I wonder if adapting multifd to use a QIOTask for the channels would
make sense as an intermediary step. Seems simpler and would force us to
format multifd in more generic terms.



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration
  2023-11-01 16:37           ` Daniel P. Berrangé
@ 2023-11-01 17:30             ` Peter Xu
  0 siblings, 0 replies; 128+ messages in thread
From: Peter Xu @ 2023-11-01 17:30 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daudé

On Wed, Nov 01, 2023 at 04:37:12PM +0000, Daniel P. Berrangé wrote:
> It doesn't contain thread number information directly, but it can
> be implicit in the data layout.
> 
> If you want parallel I/O, each thread has to know it is the only
> one reading/writing to a particular region of the file. With the
> fixed RAM layout in this series, the file offset directly maps
> to the memory region. So if a thread has been given a guest page
> to save it knows it will be the only thing writing to the file
> at that offset. There is no relationship at all between the
> number of threads and the file layout.
> 
> If you can't directly map pages to file offsets, then you need
> some other way to lay out data such that each thread can safely
> write. If you split up a file based on fixed size chunks, then
> the number of chunks you end up with in the file is likely to be
> a multiple of the number of threads you had saving data.

What I was thinking is to provision fixed-size chunks in ramblock address
space, e.g. 64M of pages for each thread.  Each thread compresses into a
local buffer, then requests the file offset to write to only after the
compression has completed, because we'll need the compressed size to
request the file offset.
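
As a minimal sketch of the offset handout (qatomic_fetch_add() is QEMU's
atomic helper; the rest is invented for illustration):

    static uint64_t next_chunk_offset; /* next free byte in the image */

    /* reserve a file range for one compressed chunk once its size is
     * known; the returned offset is then exclusive to this thread */
    static uint64_t reserve_file_range(uint64_t compressed_size)
    {
        return qatomic_fetch_add(&next_chunk_offset, compressed_size);
    }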

> 
> This means if you restore using a different number of threads,
> you can't evenly assign file chunks to each restore thread.
> 
> There's no info about thread IDs in the file, but the data layout
> reflects how many threads were doing work.
> 
> > Assuming decompress can do the same by assigning different chunks to each
> > decompress thread, no matter how many are there.
> > 
> > Would that work?
> 
> Again you get uneven workloads if the number of restore threads is
> different than the save threads, as some threads will have to process
> more chunks than other threads. If the chunks are small this might
> not matter, if they are big it could matter.

Maybe you meant when the chunk size is only calculated from the thread count,
and when the chunk is very large?

If we have fixed size ramblock chunks, the number of chunks can be mostly
irrelevant, e.g. for a 4G guest it can contain 4G/64M=64 chunks.  64 chunks
can easily be decompressed concurrently with mostly whatever number of recv
threads.

Parallel IO is not a problem either, afaict, if each thread can request its
file offset to read/write.  The write side is a bit tricky with what I
said above: the offset can only be requested and exclusively assigned to the
writer thread after compression has finished and the thread knows how many
bytes it needs to store the results.  On the read side we know the binary
size of each chunk, so we can already mark each chunk exclusive to each
reader thread.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 21/29] migration/multifd: Add pages to the receiving side
  2023-11-01 17:20         ` Fabiano Rosas
@ 2023-11-01 17:35           ` Peter Xu
  2023-11-01 18:14             ` Fabiano Rosas
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-11-01 17:35 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana

On Wed, Nov 01, 2023 at 02:20:32PM -0300, Fabiano Rosas wrote:
> I wonder if adapting multifd to use a QIOTask for the channels would
> make sense as an intermediary step. Seems simpler and would force us to
> format multifd in more generic terms.

Isn't QIOTask event based, too?

From my previous experience, making it not gcontext based, if we already
have threads, is easier.  But maybe I didn't really get what you meant.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 21/29] migration/multifd: Add pages to the receiving side
  2023-11-01 17:35           ` Peter Xu
@ 2023-11-01 18:14             ` Fabiano Rosas
  0 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2023-11-01 18:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana

Peter Xu <peterx@redhat.com> writes:

> On Wed, Nov 01, 2023 at 02:20:32PM -0300, Fabiano Rosas wrote:
>> I wonder if adapting multifd to use a QIOTask for the channels would
>> make sense as an intermediary step. Seems simpler and would force us to
>> format multifd in more generic terms.
>
> Isn't QIOTask event based, too?
>
> From my previous experience, making it not gcontext based, if we already
> have threads, is easier.  But maybe I didn't really get what you meant.

Sorry, I wasn't thinking about the context aspect. I agree it's easier
without it.

I was talking about having standardized dispatch and completion code for
multifd without being a whole thread pool. So just something that takes
a function and a pointer to data, runs that in a thread with some
locking and returns in a sane way. Every thread we create in the
migration code has a different mechanism to return after an error and a
different way to do cleanup. The QIOTask seemed to fit that at a high
level.

I would be happy with just the return + cleanup part really. We've been
doing work around those areas for a while. If we could reuse generic
code for that it would be nice.
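
For illustration, a minimal sketch of that shape (names invented here;
this is not QIOTask's actual API, just the "dispatch a function, get a
uniform completion path" idea):

#include <pthread.h>
#include <stdlib.h>

typedef struct MigTask {
    void *(*func)(void *opaque);    /* the work to run */
    void (*cleanup)(void *opaque);  /* runs on every path, error or not */
    void *opaque;
    pthread_t thread;
} MigTask;

static void *mig_task_run(void *arg)
{
    MigTask *task = arg;
    void *ret = task->func(task->opaque);

    if (task->cleanup) {
        task->cleanup(task->opaque); /* one cleanup mechanism for all */
    }
    return ret;
}

static int mig_task_start(MigTask *task)
{
    return pthread_create(&task->thread, NULL, mig_task_run, task);
}

/* The single, uniform place where callers observe completion or error. */
static void *mig_task_wait(MigTask *task)
{
    void *ret = NULL;

    pthread_join(task->thread, &ret);
    return ret;
}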


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-11-01 15:00               ` Peter Xu
@ 2023-11-06 13:18                 ` Fabiano Rosas
  2023-11-06 21:00                   ` Peter Xu
  0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2023-11-06 13:18 UTC (permalink / raw)
  To: Peter Xu, Daniel P. Berrangé
  Cc: qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

Peter Xu <peterx@redhat.com> writes:

> On Wed, Nov 01, 2023 at 02:28:24PM +0000, Daniel P. Berrangé wrote:
>> On Wed, Nov 01, 2023 at 10:21:07AM -0400, Peter Xu wrote:
>> > On Wed, Nov 01, 2023 at 09:26:46AM +0000, Daniel P. Berrangé wrote:
>> > > On Tue, Oct 31, 2023 at 03:03:50PM -0400, Peter Xu wrote:
>> > > > On Wed, Oct 25, 2023 at 11:07:33AM -0300, Fabiano Rosas wrote:
>> > > > > >> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
>> > > > > >> +{
>> > > > > >> +    g_autofree unsigned long *bitmap = NULL;
>> > > > > >> +    struct FixedRamHeader header;
>> > > > > >> +    size_t bitmap_size;
>> > > > > >> +    long num_pages;
>> > > > > >> +    int ret = 0;
>> > > > > >> +
>> > > > > >> +    ret = fixed_ram_read_header(f, &header);
>> > > > > >> +    if (ret < 0) {
>> > > > > >> +        error_report("Error reading fixed-ram header");
>> > > > > >> +        return -EINVAL;
>> > > > > >> +    }
>> > > > > >> +
>> > > > > >> +    block->pages_offset = header.pages_offset;
>> > > > > >
>> > > > > > Do you think it is worth sanity checking that 'pages_offset' is aligned
>> > > > > > in some way.
>> > > > > >
>> > > > > > It is nice that we have flexibility to change the alignment in future
>> > > > > > if we find 1 MB is not optimal, so I wouldn't want to force 1MB align
>> > > > > > check there. Perhaps we could at least sanity check for alignment at
>> > > > > > TARGET_PAGE_SIZE, to detect a gross data corruption problem ?
>> > > > > >
>> > > > > 
>> > > > > I don't see why not. I'll add it.
>> > > > 
>> > > > Is there any explanation on why that 1MB offset, and how the number is
>> > > > chosen?  Thanks,
>> > > 
>> > > The fixed-ram format is anticipating the use of O_DIRECT.
>> > > 
>> > > With O_DIRECT both the buffers in memory, and the file handle offset
>> > > have alignment requirements. The buffer alignments are usually page
>> > > sized, and QEMU RAM blocks will trivially satisfy those.
>> > > 
>> > > The file handle offset alignment varies per filesystem. While you can
>> > > query the alignment for the FS holding the file with statx(), that is
>> > > not appropriate to do. If a user saves/restores QEMU state to file, we
>> > > must assume there is a chance the user will copy the saved state to a
>> > > different filesystem.
>> > > 
>> > > IOW, we want alignment to satisfy the likely worst case.
>> > > 
>> > > Picking 1 MB is a nice round number that is large enough that it is
>> > > almost certainly going to satisfy any filesystem alignment. In fact
>> > > it is likely massive overkill. None the less 1 MB is also still tiny
>> > 
>> > Is that calculated by something like max of possible host (small) page
>> > sizes?  I've no idea what it is for all archs; the max small page size
>> > I'm aware of is 64K, but I don't know a lot of archs.
>> 
>> It wasn't anything as precise as that. It is literally just that "1MB" looks
>> large enough that we don't need to spend time to investigate per arch
>> page sizes.
>
> IMHO we need that precision on reasoning and document it, even if not on
> the exact number we prefer, which can be prone to change later.  Otherwise
> that value will be pure magic after a few years or even less, and it'll
> be more of a challenge to figure things out later.
>
>> 
>> Having said that I'm now having slight self-doubt wrt huge pages, though
>> I swear we investigated it last year when first discussing this feature.
>> The guest memory will of course already be suitably aligned, but I'm
>> wondering if the filesystem I/O places any offset alignment constraints
>> related to non-default page size.
>
> AFAIU direct IO is about pinning the IO buffers, playing the role of fs
> cache instead.  If my understanding is correct, huge pages shouldn't be a
> problem for such pinning, because it's legal to pin part of a huge page.
>
> After the partial huge pages are pinned, they should be treated as normal
> fs buffers when doing block IO.  And then the file offset should, per my
> understanding, no longer be relevant to the type of backing memory of the
> user buffer that triggers the read()/write().
>
> But maybe I missed something; if so, that will need to be part of the
> documentation of the 1MB magic value, IMHO.  We may want to double check
> by doing a fixed-ram migration on e.g. a 1GB hugetlb memory-backend-file
> with a 1MB file offset per ramblock.

Does anyone have any indication that we need to relate the alignment to
the page size? All I find online points to the device block size being the
limiting factor for filesystems. There's also raw_probe_alignment() in
file-posix.c, which seems to check up to 4k and recommends disabling
O_DIRECT if an alignment is not found.

Note that we shouldn't have any problems changing the alignment we
choose since we have a pointer to the start of the aligned region which
goes along with the fixed-ram header. We could even do some probing like
the block layer does if we wanted.
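
For illustration, a rough sketch of such probing (loosely modeled on
raw_probe_alignment(), not a copy of it; Linux-specific, with error
handling trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Probe the smallest file-offset alignment O_DIRECT will accept by
 * attempting reads at increasing alignments; returns 0 on failure. */
static size_t probe_offset_alignment(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    void *buf = NULL;
    size_t align = 0;

    if (fd < 0 || posix_memalign(&buf, 4096, 4096)) {
        goto out;
    }
    for (size_t a = 512; a <= 4096; a <<= 1) {
        if (pread(fd, buf, a, a) >= 0) {
            align = a;
            break;
        }
    }
out:
    free(buf);
    if (fd >= 0) {
        close(fd);
    }
    return align;
}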


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-11-06 13:18                 ` Fabiano Rosas
@ 2023-11-06 21:00                   ` Peter Xu
  2023-11-07  9:02                     ` Daniel P. Berrangé
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2023-11-06 21:00 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Daniel P. Berrangé,
	qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

On Mon, Nov 06, 2023 at 10:18:03AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Wed, Nov 01, 2023 at 02:28:24PM +0000, Daniel P. Berrangé wrote:
> >> On Wed, Nov 01, 2023 at 10:21:07AM -0400, Peter Xu wrote:
> >> > On Wed, Nov 01, 2023 at 09:26:46AM +0000, Daniel P. Berrangé wrote:
> >> > > On Tue, Oct 31, 2023 at 03:03:50PM -0400, Peter Xu wrote:
> >> > > > On Wed, Oct 25, 2023 at 11:07:33AM -0300, Fabiano Rosas wrote:
> >> > > > > >> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
> >> > > > > >> +{
> >> > > > > >> +    g_autofree unsigned long *bitmap = NULL;
> >> > > > > >> +    struct FixedRamHeader header;
> >> > > > > >> +    size_t bitmap_size;
> >> > > > > >> +    long num_pages;
> >> > > > > >> +    int ret = 0;
> >> > > > > >> +
> >> > > > > >> +    ret = fixed_ram_read_header(f, &header);
> >> > > > > >> +    if (ret < 0) {
> >> > > > > >> +        error_report("Error reading fixed-ram header");
> >> > > > > >> +        return -EINVAL;
> >> > > > > >> +    }
> >> > > > > >> +
> >> > > > > >> +    block->pages_offset = header.pages_offset;
> >> > > > > >
> >> > > > > > Do you think it is worth sanity checking that 'pages_offset' is aligned
> >> > > > > > in some way.
> >> > > > > >
> >> > > > > > It is nice that we have flexibility to change the alignment in future
> >> > > > > > if we find 1 MB is not optimal, so I wouldn't want to force 1MB align
> >> > > > > > check there. Perhaps we could at least sanity check for alignment at
> >> > > > > > TARGET_PAGE_SIZE, to detect a gross data corruption problem ?
> >> > > > > >
> >> > > > > 
> >> > > > > I don't see why not. I'll add it.
> >> > > > 
> >> > > > Is there any explanation on why that 1MB offset, and how the number is
> >> > > > chosen?  Thanks,
> >> > > 
> >> > > The fixed-ram format is anticipating the use of O_DIRECT.
> >> > > 
> >> > > With O_DIRECT both the buffers in memory, and the file handle offset
> >> > > have alignment requirements. The buffer alignments are usually page
> >> > > sized, and QEMU RAM blocks will trivially satisfy those.
> >> > > 
> >> > > The file handle offset alignment varies per filesystem. While you can
> >> > > query the alignment for the FS holding the file with statx(), that is
> >> > > not appropriate to do. If a user saves/restores QEMU state to file, we
> >> > > must assume there is a chance the user will copy the saved state to a
> >> > > different filesystem.
> >> > > 
> >> > > IOW, we want alignment to satisfy the likely worst case.
> >> > > 
> >> > > Picking 1 MB is a nice round number that is large enough that it is
> >> > > almost certainly going to satisfy any filesystem alignment. In fact
> >> > > it is likely massive overkill. None the less 1 MB is also still tiny
> >> > 
> >> > Is that calculated by something like max of possible host (small) page
> >> > sizes?  I've no idea what it is for all archs; the max small page size
> >> > I'm aware of is 64K, but I don't know a lot of archs.
> >> 
> >> It wasn't anything as precise as that. It is literally just that "1MB" looks
> >> large enough that we don't need to spend time to investigate per arch
> >> page sizes.
> >
> > IMHO we need that precision on reasoning and document it, even if not on
> > the exact number we prefer, which can be prone to change later.  Otherwise
> > that value will be pure magic after a few years or even less, and it'll
> > be more of a challenge to figure things out later.
> >
> >> 
> >> Having said that I'm now having slight self-doubt wrt huge pages, though
> >> I swear we investigated it last year when first discussing this feature.
> >> The guest memory will of course already be suitably aligned, but I'm
> >> wondering if the filesystem I/O places any offset alignment constraints
> >> related to non-default page size.
> >
> > AFAIU direct IO is about pinning the IO buffers, playing the role of fs
> > cache instead.  If my understanding is correct, huge pages shouldn't be a
> > problem for such pinning, because it's legal to pin part of a huge page.
> >
> > After the partial huge pages are pinned, they should be treated as normal
> > fs buffers when doing block IO.  And then the file offset should, per my
> > understanding, no longer be relevant to the type of backing memory of the
> > user buffer that triggers the read()/write().
> >
> > But maybe I missed something; if so, that will need to be part of the
> > documentation of the 1MB magic value, IMHO.  We may want to double check
> > by doing a fixed-ram migration on e.g. a 1GB hugetlb memory-backend-file
> > with a 1MB file offset per ramblock.
> 
> Does anyone have any indication that we need to relate the alignment to
> the page size? All I find online points to the device block size being the
> limiting factor for filesystems. There's also raw_probe_alignment() in
> file-posix.c, which seems to check up to 4k and recommends disabling
> O_DIRECT if an alignment is not found.

Right, it should be more relevant to block size.

> 
> Note that we shouldn't have any problems changing the alignment we
> choose since we have a pointer to the start of the aligned region which
> goes along with the fixed-ram header. We could even do some probing like
> the block layer does if we wanted.

Having a 1MB offset is fine, especially since, as you said, the recv side
always fetches it from the headers.

Perhaps we make that 1MB a macro, and add a comment explaining whatever
understanding we have of where that 1MB comes from?  I think we can also
reference raw_probe_alignment() in the comment.

Meanwhile, now I'm wondering whether that 1MB is too conservative.  It's
only a problem if there can be tons of ramblocks (e.g. 1000 ramblocks mean
1MB*1000=1G for the headers).  I can't think of such a case, though.  We
can definitely leave that for later.
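
Something along these lines, say (the name and comment text here are
invented, not taken from the series):

#include "qemu/units.h"

/*
 * Each ramblock's pages start at a file offset aligned to this value so
 * that O_DIRECT file-offset alignment requirements are still met on
 * whatever filesystem the image may later be copied to.  The real
 * requirement is typically the device block size (<= 4k in practice,
 * cf. raw_probe_alignment() in block/file-posix.c); 1MB is deliberately
 * a generous round number.
 */
#define FIXED_RAM_PAGES_OFFSET_ALIGN (1 * MiB)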

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore
  2023-11-06 21:00                   ` Peter Xu
@ 2023-11-07  9:02                     ` Daniel P. Berrangé
  0 siblings, 0 replies; 128+ messages in thread
From: Daniel P. Berrangé @ 2023-11-07  9:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
	Claudio Fontana, Nikolay Borisov

On Mon, Nov 06, 2023 at 04:00:33PM -0500, Peter Xu wrote:
> On Mon, Nov 06, 2023 at 10:18:03AM -0300, Fabiano Rosas wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Wed, Nov 01, 2023 at 02:28:24PM +0000, Daniel P. Berrangé wrote:
> > >> On Wed, Nov 01, 2023 at 10:21:07AM -0400, Peter Xu wrote:
> > >> > On Wed, Nov 01, 2023 at 09:26:46AM +0000, Daniel P. Berrangé wrote:
> > >> > > On Tue, Oct 31, 2023 at 03:03:50PM -0400, Peter Xu wrote:
> > >> > > > On Wed, Oct 25, 2023 at 11:07:33AM -0300, Fabiano Rosas wrote:
> > >> > > > > >> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block, ram_addr_t length)
> > >> > > > > >> +{
> > >> > > > > >> +    g_autofree unsigned long *bitmap = NULL;
> > >> > > > > >> +    struct FixedRamHeader header;
> > >> > > > > >> +    size_t bitmap_size;
> > >> > > > > >> +    long num_pages;
> > >> > > > > >> +    int ret = 0;
> > >> > > > > >> +
> > >> > > > > >> +    ret = fixed_ram_read_header(f, &header);
> > >> > > > > >> +    if (ret < 0) {
> > >> > > > > >> +        error_report("Error reading fixed-ram header");
> > >> > > > > >> +        return -EINVAL;
> > >> > > > > >> +    }
> > >> > > > > >> +
> > >> > > > > >> +    block->pages_offset = header.pages_offset;
> > >> > > > > >
> > >> > > > > > Do you think it is worth sanity checking that 'pages_offset' is aligned
> > >> > > > > > in some way.
> > >> > > > > >
> > >> > > > > > It is nice that we have flexibility to change the alignment in future
> > >> > > > > > if we find 1 MB is not optimal, so I wouldn't want to force 1MB align
> > >> > > > > > check there. Perhaps we could at least sanity check for alignment at
> > >> > > > > > TARGET_PAGE_SIZE, to detect a gross data corruption problem ?
> > >> > > > > >
> > >> > > > > 
> > >> > > > > I don't see why not. I'll add it.
> > >> > > > 
> > >> > > > Is there any explanation on why that 1MB offset, and how the number is
> > >> > > > chosen?  Thanks,
> > >> > > 
> > >> > > The fixed-ram format is anticipating the use of O_DIRECT.
> > >> > > 
> > >> > > With O_DIRECT both the buffers in memory, and the file handle offset
> > >> > > have alignment requirements. The buffer alignments are usually page
> > >> > > sized, and QEMU RAM blocks will trivially satisfy those.
> > >> > > 
> > >> > > The file handle offset alignment varies per filesystem. While you can
> > >> > > query the alignment for the FS holding the file with statx(), that is
> > >> > > not appropriate to do. If a user saves/restores QEMU state to file, we
> > >> > > must assume there is a chance the user will copy the saved state to a
> > >> > > different filesystem.
> > >> > > 
> > >> > > IOW, we want alignment to satisfy the likely worst case.
> > >> > > 
> > >> > > Picking 1 MB is a nice round number that is large enough that it is
> > >> > > almost certainly going to satisfy any filesystem alignment. In fact
> > >> > > it is likely massive overkill. None the less 1 MB is also still tiny
> > >> > 
> > >> > Is that calculated by something like max of possible host (small) page
> > >> > sizes?  I've no idea what it is for all archs; the max small page size
> > >> > I'm aware of is 64K, but I don't know a lot of archs.
> > >> 
> > >> It wasn't anything as precise as that. It is literally just that "1MB" looks
> > >> large enough that we don't need to spend time to investigate per arch
> > >> page sizes.
> > >
> > > IMHO we need that precision on reasoning and document it, even if not on
> > > the exact number we prefer, which can be prone to change later.  Otherwise
> > > that value will be pure magic after a few years or even less, and it'll
> > > be more of a challenge to figure things out later.
> > >
> > >> 
> > >> Having said that I'm now having slight self-doubt wrt huge pages, though
> > >> I swear we investigated it last year when first discussing this feature.
> > >> The guest memory will of course already be suitably aligned, but I'm
> > >> wondering if the filesystem I/O places any offset alignment constraints
> > >> related to non-default page size.
> > >
> > > AFAIU direct IO is about pinning the IO buffers, playing the role of fs
> > > cache instead.  If my understanding is correct, huge pages shouldn't be a
> > > problem for such pinning, because it's legal to pin part of a huge page.
> > >
> > > After the partial huge pages are pinned, they should be treated as normal
> > > fs buffers when doing block IO.  And then the file offset should, per my
> > > understanding, no longer be relevant to the type of backing memory of the
> > > user buffer that triggers the read()/write().
> > >
> > > But maybe I missed something; if so, that will need to be part of the
> > > documentation of the 1MB magic value, IMHO.  We may want to double check
> > > by doing a fixed-ram migration on e.g. a 1GB hugetlb memory-backend-file
> > > with a 1MB file offset per ramblock.
> > 
> > Does anyone have any indication that we need to relate the alignment to
> > the page size? All I find online points to the device block size being the
> > limiting factor for filesystems. There's also raw_probe_alignment() in
> > file-posix.c, which seems to check up to 4k and recommends disabling
> > O_DIRECT if an alignment is not found.
> 
> Right, it should be more relevant to block size.
> 
> > 
> > Note that we shouldn't have any problems changing the alignment we
> > choose since we have a pointer to the start of the aligned region which
> > goes along with the fixed-ram header. We could even do some probing like
> > the block layer does if we wanted.
> 
> Having a 1MB offset is fine, especially since, as you said, the recv side
> always fetches it from the headers.
> 
> Perhaps we make that 1MB a macro, and add a comment explaining whatever
> understanding we have of where that 1MB comes from?  I think we can also
> reference raw_probe_alignment() in the comment.
> 
> Meanwhile, now I'm wondering whether that 1MB is too conservative.  It's
> only a problem if there can be tons of ramblocks (e.g. 1000 ramblocks mean
> 1MB*1000=1G for the headers).  I can't think of such a case, though.  We
> can definitely leave that for later.

NB, the space won't actually be consumed on disk. By seeking to each
ramblock's offset, the file simply gets an allocation hole. IOW, with 1000
ramblocks we will have 1 GB of holes.

Apps just have to be aware that they need to preserve sparseness if they
copy the file around. This is probably a good docs point.
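
"cp --sparse=always" handles that; in code, a minimal Linux-specific
sketch of a sparseness-preserving copy using SEEK_DATA/SEEK_HOLE (an
illustration, not code from any tool):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Copy only the data extents; the skipped ranges stay as holes. */
static int copy_sparse(int in_fd, int out_fd)
{
    char buf[65536];
    off_t end = lseek(in_fd, 0, SEEK_END);
    off_t data = 0;

    while ((data = lseek(in_fd, data, SEEK_DATA)) >= 0 && data < end) {
        off_t hole = lseek(in_fd, data, SEEK_HOLE);

        while (data < hole) {
            size_t len = (size_t)(hole - data);
            if (len > sizeof(buf)) {
                len = sizeof(buf);
            }
            ssize_t n = pread(in_fd, buf, len, data);
            if (n <= 0 || pwrite(out_fd, buf, n, data) != n) {
                return -1;
            }
            data += n;
        }
    }
    return ftruncate(out_fd, end); /* keep the size, incl. trailing hole */
}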

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 128+ messages in thread

end of thread, other threads:[~2023-11-07  9:03 UTC | newest]

Thread overview: 128+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-23 20:35 [PATCH v2 00/29] migration: File based migration with multifd and fixed-ram Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 01/29] tests/qtest: migration events Fabiano Rosas
2023-10-25  9:44   ` Thomas Huth
2023-10-25 10:14   ` Daniel P. Berrangé
2023-10-25 13:21   ` Fabiano Rosas
2023-10-25 13:33     ` Steven Sistare
2023-10-23 20:35 ` [PATCH v2 02/29] tests/qtest: Move QTestMigrationState to libqtest Fabiano Rosas
2023-10-25 10:17   ` Daniel P. Berrangé
2023-10-25 13:19     ` Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 03/29] tests/qtest: Allow waiting for migration events Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 04/29] migration: Return the saved state from global_state_store Fabiano Rosas
2023-10-25 10:19   ` Daniel P. Berrangé
2023-10-23 20:35 ` [PATCH v2 05/29] migration: Introduce global_state_store_once Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 06/29] migration: Add auto-pause capability Fabiano Rosas
2023-10-24  5:25   ` Markus Armbruster
2023-10-24 18:12     ` Fabiano Rosas
2023-10-25  5:33       ` Markus Armbruster
2023-10-25  8:48   ` Daniel P. Berrangé
2023-10-25 13:57     ` Fabiano Rosas
2023-10-25 14:20       ` Daniel P. Berrangé
2023-10-25 14:58         ` Peter Xu
2023-10-25 15:25           ` Daniel P. Berrangé
2023-10-25 15:36             ` Peter Xu
2023-10-25 15:40               ` Daniel P. Berrangé
2023-10-25 17:20                 ` Peter Xu
2023-10-25 17:31                   ` Daniel P. Berrangé
2023-10-25 19:28                     ` Peter Xu
2023-10-23 20:35 ` [PATCH v2 07/29] migration: Run "file:" migration with a stopped VM Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 08/29] tests/qtest: File migration auto-pause tests Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 09/29] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 10/29] io: Add generic pwritev/preadv interface Fabiano Rosas
2023-10-24  8:18   ` Daniel P. Berrangé
2023-10-24 19:06     ` Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 11/29] io: implement io_pwritev/preadv for QIOChannelFile Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 12/29] migration/qemu-file: add utility methods for working with seekable channels Fabiano Rosas
2023-10-25 10:22   ` Daniel P. Berrangé
2023-10-23 20:35 ` [PATCH v2 13/29] migration: fixed-ram: Add URI compatibility check Fabiano Rosas
2023-10-25 10:27   ` Daniel P. Berrangé
2023-10-31 16:06   ` Peter Xu
2023-10-23 20:35 ` [PATCH v2 14/29] migration/ram: Introduce 'fixed-ram' migration capability Fabiano Rosas
2023-10-24  5:33   ` Markus Armbruster
2023-10-24 18:35     ` Fabiano Rosas
2023-10-25  6:18       ` Markus Armbruster
2023-10-23 20:35 ` [PATCH v2 15/29] migration/ram: Add support for 'fixed-ram' outgoing migration Fabiano Rosas
2023-10-25  9:39   ` Daniel P. Berrangé
2023-10-25 14:03     ` Fabiano Rosas
2023-11-01 15:23     ` Peter Xu
2023-11-01 15:52       ` Daniel P. Berrangé
2023-11-01 16:24         ` Peter Xu
2023-11-01 16:37           ` Daniel P. Berrangé
2023-11-01 17:30             ` Peter Xu
2023-10-31 16:52   ` Peter Xu
2023-10-31 17:33     ` Fabiano Rosas
2023-10-31 17:59       ` Peter Xu
2023-10-23 20:35 ` [PATCH v2 16/29] migration/ram: Add support for 'fixed-ram' migration restore Fabiano Rosas
2023-10-25  9:43   ` Daniel P. Berrangé
2023-10-25 14:07     ` Fabiano Rosas
2023-10-31 19:03       ` Peter Xu
2023-11-01  9:26         ` Daniel P. Berrangé
2023-11-01 14:21           ` Peter Xu
2023-11-01 14:28             ` Daniel P. Berrangé
2023-11-01 15:00               ` Peter Xu
2023-11-06 13:18                 ` Fabiano Rosas
2023-11-06 21:00                   ` Peter Xu
2023-11-07  9:02                     ` Daniel P. Berrangé
2023-10-31 19:09   ` Peter Xu
2023-10-31 20:00     ` Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 17/29] tests/qtest: migration-test: Add tests for fixed-ram file-based migration Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 18/29] migration/multifd: Allow multifd without packets Fabiano Rosas
2023-10-23 20:35 ` [PATCH v2 19/29] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
2023-10-25  9:52   ` Daniel P. Berrangé
2023-10-25 14:12     ` Fabiano Rosas
2023-10-25 14:23       ` Daniel P. Berrangé
2023-10-25 15:00         ` Fabiano Rosas
2023-10-25 15:26           ` Daniel P. Berrangé
2023-10-31 20:11   ` Peter Xu
2023-10-23 20:35 ` [PATCH v2 20/29] migration/multifd: Add incoming " Fabiano Rosas
2023-10-25 10:29   ` Daniel P. Berrangé
2023-10-25 14:18     ` Fabiano Rosas
2023-10-31 21:28   ` Peter Xu
2023-10-23 20:36 ` [PATCH v2 21/29] migration/multifd: Add pages to the receiving side Fabiano Rosas
2023-10-31 22:10   ` Peter Xu
2023-10-31 23:18     ` Fabiano Rosas
2023-11-01 15:55       ` Peter Xu
2023-11-01 17:20         ` Fabiano Rosas
2023-11-01 17:35           ` Peter Xu
2023-11-01 18:14             ` Fabiano Rosas
2023-10-23 20:36 ` [PATCH v2 22/29] io: Add a pwritev/preadv version that takes a discontiguous iovec Fabiano Rosas
2023-10-24  8:50   ` Daniel P. Berrangé
2023-10-23 20:36 ` [PATCH v2 23/29] migration/ram: Add a wrapper for fixed-ram shadow bitmap Fabiano Rosas
2023-11-01 14:29   ` Peter Xu
2023-10-23 20:36 ` [PATCH v2 24/29] migration/ram: Ignore multifd flush when doing fixed-ram migration Fabiano Rosas
2023-10-25  9:09   ` Daniel P. Berrangé
2023-10-23 20:36 ` [PATCH v2 25/29] migration/multifd: Support outgoing fixed-ram stream format Fabiano Rosas
2023-10-25  9:23   ` Daniel P. Berrangé
2023-10-25 14:21     ` Fabiano Rosas
2023-10-23 20:36 ` [PATCH v2 26/29] migration/multifd: Support incoming " Fabiano Rosas
2023-10-23 20:36 ` [PATCH v2 27/29] tests/qtest: Add a multifd + fixed-ram migration test Fabiano Rosas
2023-10-23 20:36 ` [PATCH v2 28/29] migration: Add direct-io parameter Fabiano Rosas
2023-10-24  5:41   ` Markus Armbruster
2023-10-24 19:32     ` Fabiano Rosas
2023-10-25  6:23       ` Markus Armbruster
2023-10-25  8:44       ` Daniel P. Berrangé
2023-10-25 14:32         ` Fabiano Rosas
2023-10-25 14:43           ` Daniel P. Berrangé
2023-10-25 17:30             ` Fabiano Rosas
2023-10-25 17:45               ` Daniel P. Berrangé
2023-10-25 18:10                 ` Fabiano Rosas
2023-10-30 22:51             ` Fabiano Rosas
2023-10-31  9:03               ` Daniel P. Berrangé
2023-10-31 13:05                 ` Fabiano Rosas
2023-10-31 13:45                   ` Daniel P. Berrangé
2023-10-31 14:33                     ` Fabiano Rosas
2023-10-31 15:22                       ` Daniel P. Berrangé
2023-10-31 15:52                         ` Fabiano Rosas
2023-10-31 15:58                           ` Daniel P. Berrangé
2023-10-31 19:05                             ` Fabiano Rosas
2023-11-01  9:30                               ` Daniel P. Berrangé
2023-11-01 12:16                                 ` Fabiano Rosas
2023-11-01 12:23                                   ` Daniel P. Berrangé
2023-11-01 12:30                                     ` Fabiano Rosas
2023-10-24  8:33   ` Daniel P. Berrangé
2023-10-24 19:06     ` Fabiano Rosas
2023-10-25  9:07   ` Daniel P. Berrangé
2023-10-25 14:48     ` Fabiano Rosas
2023-10-25 15:22       ` Daniel P. Berrangé
2023-10-23 20:36 ` [PATCH v2 29/29] tests/qtest: Add a test for migration with direct-io and multifd Fabiano Rosas
2023-10-25  9:25   ` Daniel P. Berrangé
