* [PATCH v2 00/25] migration: Postcopy Preemption
@ 2022-03-01  8:39 Peter Xu
  2022-03-01  8:39 ` [PATCH v2 01/25] migration: Dump sub-cmd name in loadvm_process_command tp Peter Xu
                   ` (26 more replies)
  0 siblings, 27 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

This is v2 of the postcopy preempt series.  It can also be found here:

  https://github.com/xzpeter/qemu/tree/postcopy-preempt

RFC: https://lore.kernel.org/qemu-devel/20220119080929.39485-1-peterx@redhat.com
V1:  https://lore.kernel.org/qemu-devel/20220216062809.57179-1-peterx@redhat.com

v1->v2 changelog:
- Picked up more r-bs from Dave
- Rename both fault threads to drop "qemu/" prefix [Dave]
- Further rework on postcopy recovery, to be able to detect qemufile errors
  from either main channel or postcopy one [Dave]
- shutdown() qemufile before close on src postcopy channel when postcopy is
  paused [Dave]
- In postcopy_preempt_new_channel(), explicitly set the new channel in
  blocking state, even if it's the default [Dave]
- Make RAMState.postcopy_channel unsigned int [Dave]
- Added patches:
  - "migration: Create the postcopy preempt channel asynchronously"
  - "migration: Parameter x-postcopy-preempt-break-huge"
  - "migration: Add helpers to detect TLS capability"
  - "migration: Fail postcopy preempt with TLS"
  - "tests: Pass in MigrateStart** into test_migrate_start()"

Abstract
========

This series adds a new migration capability called "postcopy-preempt".  It can
be enabled when postcopy is enabled, and it greatly speeds up the handling of
postcopy page requests.

Below are some initial postcopy page request latency measurements taken with
the new series applied.

For each page size, I measured page request latency for three cases:

  (a) Vanilla:                the old postcopy
  (b) Preempt no-break-huge:  preempt enabled, x-postcopy-preempt-break-huge=off
  (c) Preempt full:           preempt enabled, x-postcopy-preempt-break-huge=on
                              (this is the default option when preempt enabled)

The x-postcopy-preempt-break-huge parameter is newly added in v2 so that the
behavior of breaking the send of a precopy huge page can be conditionally
disabled, for debugging purposes.  When it's off, postcopy will not preempt
precopy in the middle of sending a huge page, but postcopy will still use its
own channel.
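For reference, a hypothetical way to flip this debugging knob, assuming it is
wired up as a migration object property like other experimental x- knobs (the
exact spelling depends on how the patch exposes it):

```sh
# Enable postcopy with preemption, but keep precopy huge pages unbroken
# (illustration only; assumes the knob is a migration object property)
qemu-system-x86_64 ... \
    -global migration.x-postcopy-preempt-break-huge=off
```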

I tested the cases separately to give a rough idea of how much each part of
the change helped.  The overall benefit is the comparison between case (a)
and case (c).

  |-----------+---------+-----------------------+--------------|
  | Page size | Vanilla | Preempt no-break-huge | Preempt full |
  |-----------+---------+-----------------------+--------------|
  | 4K        |   10.68 |               N/A [*] |         0.57 |
  | 2M        |   10.58 |                  5.49 |         5.02 |
  | 1G        | 2046.65 |               933.185 |      649.445 |
  |-----------+---------+-----------------------+--------------|
  [*]: This case is N/A because 4K pages do not involve huge pages at all

[1] https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf

TODO List
=========

TLS support
-----------

I only noticed it was missing very recently.  Since soft freeze is coming and
I'm obviously still growing this series, I'd like to get the existing material
discussed first.  Let's see if it can still catch the train for the QEMU 7.0
release (soft freeze on 2022-03-08)..

Avoid precopy write() blocks postcopy
-------------------------------------

I haven't proven this, but I have always suspected that write() syscalls
blocking on precopy pages can delay postcopy service.  If we can solve this
problem, my wild guess is that we can further reduce the average page
latency.

At least two solutions come to mind: (1) make the write side of the migration
channel non-blocking too, or (2) use multiple threads on the send side, just
like multifd, though we may need a lock to protect which page to send (the
core idea is that we should _never_ rely on the main thread for anything;
multifd has that dependency in that pages can only be queued on the main
thread).

That can definitely be done and thought about later.

Multi-channel for preemption threads
------------------------------------

Currently the postcopy preempt feature uses only one extra channel and one
extra thread on dest (no new thread on src QEMU).  That should be good enough
for major use cases, but when the postcopy queue is long enough (e.g. hundreds
of vCPUs faulted on different pages) we could logically still observe higher
average delays.  Whether growing the number of threads/channels can solve
that is debatable, but it sounds worth a try.  That's yet another thing we
can think about after this patchset lands.

Logically the design provides space for that - the receiving postcopy preempt
thread understands the whole ram-layer migration protocol, and for multiple
channels and threads we could simply grow that into multiple threads handling
the same protocol (with multiple PostcopyTmpPages).  The source side needs
more thought on synchronization, but it shouldn't affect the protocol layer,
so it should be easy to keep compatible.

Patch Layout
============

Patch 1-3: Three leftover patches from patchset "[PATCH v3 0/8] migration:
Postcopy cleanup on ram disgard" that I picked up here too.

  https://lore.kernel.org/qemu-devel/20211224065000.97572-1-peterx@redhat.com/

  migration: Dump sub-cmd name in loadvm_process_command tp
  migration: Finer grained tracepoints for POSTCOPY_LISTEN
  migration: Tracepoint change in postcopy-run bottom half

Patch 4-9: Original postcopy preempt RFC preparation patches (with slight
modifications).

  migration: Introduce postcopy channels on dest node
  migration: Dump ramblock and offset too when non-same-page detected
  migration: Add postcopy_thread_create()
  migration: Move static var in ram_block_from_stream() into global
  migration: Add pss.postcopy_requested status
  migration: Move migrate_allow_multifd and helpers into migration.c

Patch 10-15: Some patches newly added while working on postcopy recovery
support.  After these patches, the migrate-recover command will allow
re-entrance, which is a very nice side effect.

  migration: Enlarge postcopy recovery to capture !-EIO too
  migration: postcopy_pause_fault_thread() never fails
  migration: Export ram_load_postcopy()
  migration: Move channel setup out of postcopy_try_recover()
  migration: Add migration_incoming_transport_cleanup()
  migration: Allow migrate-recover to run multiple times
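With the re-entrance change above, a failed recovery attempt can simply be
retried, e.g. by issuing the existing QMP command again with a corrected URI
(the address below is made up):

```json
{ "execute": "migrate-recover",
  "arguments": { "uri": "tcp:192.168.1.3:5555" } }
```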

Patch 16-19: The major work of postcopy preemption implementation is split into
four patches as suggested by Dave.

  migration: Add postcopy-preempt capability
  migration: Postcopy preemption preparation on channel creation
  migration: Postcopy preemption enablement
  migration: Postcopy recover with preempt enabled

Patch 20-23: Newly added patches in this v2 for different purposes.
             Mostly amendments to the existing postcopy preempt work.

  migration: Create the postcopy preempt channel asynchronously
  migration: Parameter x-postcopy-preempt-break-huge
  migration: Add helpers to detect TLS capability
  migration: Fail postcopy preempt with TLS for now

Patch 24-25: Test cases (including one more patch for cleanup)

  tests: Add postcopy preempt test
  tests: Pass in MigrateStart** into test_migrate_start()

Please review, thanks.

Peter Xu (25):
  migration: Dump sub-cmd name in loadvm_process_command tp
  migration: Finer grained tracepoints for POSTCOPY_LISTEN
  migration: Tracepoint change in postcopy-run bottom half
  migration: Introduce postcopy channels on dest node
  migration: Dump ramblock and offset too when non-same-page detected
  migration: Add postcopy_thread_create()
  migration: Move static var in ram_block_from_stream() into global
  migration: Add pss.postcopy_requested status
  migration: Move migrate_allow_multifd and helpers into migration.c
  migration: Enlarge postcopy recovery to capture !-EIO too
  migration: postcopy_pause_fault_thread() never fails
  migration: Export ram_load_postcopy()
  migration: Move channel setup out of postcopy_try_recover()
  migration: Add migration_incoming_transport_cleanup()
  migration: Allow migrate-recover to run multiple times
  migration: Add postcopy-preempt capability
  migration: Postcopy preemption preparation on channel creation
  migration: Postcopy preemption enablement
  migration: Postcopy recover with preempt enabled
  migration: Create the postcopy preempt channel asynchronously
  migration: Parameter x-postcopy-preempt-break-huge
  migration: Add helpers to detect TLS capability
  migration: Fail postcopy preempt with TLS for now
  tests: Add postcopy preempt test
  tests: Pass in MigrateStart** into test_migrate_start()

 migration/channel.c          |  10 +-
 migration/migration.c        | 235 ++++++++++++++++++++-----
 migration/migration.h        |  98 ++++++++++-
 migration/multifd.c          |  26 +--
 migration/multifd.h          |   2 -
 migration/postcopy-ram.c     | 244 +++++++++++++++++++++-----
 migration/postcopy-ram.h     |  15 ++
 migration/qemu-file.c        |  27 +++
 migration/qemu-file.h        |   1 +
 migration/ram.c              | 330 +++++++++++++++++++++++++++++++----
 migration/ram.h              |   3 +
 migration/savevm.c           |  70 ++++++--
 migration/socket.c           |  22 ++-
 migration/socket.h           |   1 +
 migration/trace-events       |  19 +-
 qapi/migration.json          |   8 +-
 tests/qtest/migration-test.c |  68 ++++++--
 17 files changed, 983 insertions(+), 196 deletions(-)

-- 
2.32.0



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 01/25] migration: Dump sub-cmd name in loadvm_process_command tp
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 02/25] migration: Finer grained tracepoints for POSTCOPY_LISTEN Peter Xu
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

It's easier to read the name of a sub-cmd rather than its index when debugging.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/savevm.c     | 3 ++-
 migration/trace-events | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 1599b02fbc..7bb65e1d61 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2273,12 +2273,13 @@ static int loadvm_process_command(QEMUFile *f)
         return qemu_file_get_error(f);
     }
 
-    trace_loadvm_process_command(cmd, len);
     if (cmd >= MIG_CMD_MAX || cmd == MIG_CMD_INVALID) {
         error_report("MIG_CMD 0x%x unknown (len 0x%x)", cmd, len);
         return -EINVAL;
     }
 
+    trace_loadvm_process_command(mig_cmd_args[cmd].name, len);
+
     if (mig_cmd_args[cmd].len != -1 && mig_cmd_args[cmd].len != len) {
         error_report("%s received with bad length - expecting %zu, got %d",
                      mig_cmd_args[cmd].name,
diff --git a/migration/trace-events b/migration/trace-events
index 48aa7b10ee..123cfe79d7 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -22,7 +22,7 @@ loadvm_postcopy_handle_resume(void) ""
 loadvm_postcopy_ram_handle_discard(void) ""
 loadvm_postcopy_ram_handle_discard_end(void) ""
 loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
-loadvm_process_command(uint16_t com, uint16_t len) "com=0x%x len=%d"
+loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
 loadvm_process_command_ping(uint32_t val) "0x%x"
 postcopy_ram_listen_thread_exit(void) ""
 postcopy_ram_listen_thread_start(void) ""
-- 
2.32.0




* [PATCH v2 02/25] migration: Finer grained tracepoints for POSTCOPY_LISTEN
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
  2022-03-01  8:39 ` [PATCH v2 01/25] migration: Dump sub-cmd name in loadvm_process_command tp Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 03/25] migration: Tracepoint change in postcopy-run bottom half Peter Xu
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Enabling postcopy listening takes a few steps; add a few tracepoints so that
we are ready to take some basic measurements of them.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/savevm.c     | 9 ++++++++-
 migration/trace-events | 2 +-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 7bb65e1d61..190cc5fc42 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1948,9 +1948,10 @@ static void *postcopy_ram_listen_thread(void *opaque)
 static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
 {
     PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_LISTENING);
-    trace_loadvm_postcopy_handle_listen();
     Error *local_err = NULL;
 
+    trace_loadvm_postcopy_handle_listen("enter");
+
     if (ps != POSTCOPY_INCOMING_ADVISE && ps != POSTCOPY_INCOMING_DISCARD) {
         error_report("CMD_POSTCOPY_LISTEN in wrong postcopy state (%d)", ps);
         return -1;
@@ -1965,6 +1966,8 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
         }
     }
 
+    trace_loadvm_postcopy_handle_listen("after discard");
+
     /*
      * Sensitise RAM - can now generate requests for blocks that don't exist
      * However, at this point the CPU shouldn't be running, and the IO
@@ -1977,6 +1980,8 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
         }
     }
 
+    trace_loadvm_postcopy_handle_listen("after uffd");
+
     if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_LISTEN, &local_err)) {
         error_report_err(local_err);
         return -1;
@@ -1991,6 +1996,8 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
     qemu_sem_wait(&mis->listen_thread_sem);
     qemu_sem_destroy(&mis->listen_thread_sem);
 
+    trace_loadvm_postcopy_handle_listen("return");
+
     return 0;
 }
 
diff --git a/migration/trace-events b/migration/trace-events
index 123cfe79d7..92596c00d8 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -14,7 +14,7 @@ loadvm_handle_cmd_packaged_main(int ret) "%d"
 loadvm_handle_cmd_packaged_received(int ret) "%d"
 loadvm_handle_recv_bitmap(char *s) "%s"
 loadvm_postcopy_handle_advise(void) ""
-loadvm_postcopy_handle_listen(void) ""
+loadvm_postcopy_handle_listen(const char *str) "%s"
 loadvm_postcopy_handle_run(void) ""
 loadvm_postcopy_handle_run_cpu_sync(void) ""
 loadvm_postcopy_handle_run_vmstart(void) ""
-- 
2.32.0




* [PATCH v2 03/25] migration: Tracepoint change in postcopy-run bottom half
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
  2022-03-01  8:39 ` [PATCH v2 01/25] migration: Dump sub-cmd name in loadvm_process_command tp Peter Xu
  2022-03-01  8:39 ` [PATCH v2 02/25] migration: Finer grained tracepoints for POSTCOPY_LISTEN Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 04/25] migration: Introduce postcopy channels on dest node Peter Xu
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Remove the two old tracepoints, which even sit right next to each other:

    trace_loadvm_postcopy_handle_run_cpu_sync()
    trace_loadvm_postcopy_handle_run_vmstart()

Add trace_loadvm_postcopy_handle_run_bh() with finer-grained tracing.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/savevm.c     | 12 +++++++++---
 migration/trace-events |  3 +--
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 190cc5fc42..41e3238798 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2006,13 +2006,19 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
     Error *local_err = NULL;
     MigrationIncomingState *mis = opaque;
 
+    trace_loadvm_postcopy_handle_run_bh("enter");
+
     /* TODO we should move all of this lot into postcopy_ram.c or a shared code
      * in migration.c
      */
     cpu_synchronize_all_post_init();
 
+    trace_loadvm_postcopy_handle_run_bh("after cpu sync");
+
     qemu_announce_self(&mis->announce_timer, migrate_announce_params());
 
+    trace_loadvm_postcopy_handle_run_bh("after announce");
+
     /* Make sure all file formats flush their mutable metadata.
      * If we get an error here, just don't restart the VM yet. */
     bdrv_invalidate_cache_all(&local_err);
@@ -2022,9 +2028,7 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
         autostart = false;
     }
 
-    trace_loadvm_postcopy_handle_run_cpu_sync();
-
-    trace_loadvm_postcopy_handle_run_vmstart();
+    trace_loadvm_postcopy_handle_run_bh("after invalidate cache");
 
     dirty_bitmap_mig_before_vm_start();
 
@@ -2037,6 +2041,8 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
     }
 
     qemu_bh_delete(mis->bh);
+
+    trace_loadvm_postcopy_handle_run_bh("return");
 }
 
 /* After all discards we can start running and asking for pages */
diff --git a/migration/trace-events b/migration/trace-events
index 92596c00d8..1aec580e92 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -16,8 +16,7 @@ loadvm_handle_recv_bitmap(char *s) "%s"
 loadvm_postcopy_handle_advise(void) ""
 loadvm_postcopy_handle_listen(const char *str) "%s"
 loadvm_postcopy_handle_run(void) ""
-loadvm_postcopy_handle_run_cpu_sync(void) ""
-loadvm_postcopy_handle_run_vmstart(void) ""
+loadvm_postcopy_handle_run_bh(const char *str) "%s"
 loadvm_postcopy_handle_resume(void) ""
 loadvm_postcopy_ram_handle_discard(void) ""
 loadvm_postcopy_ram_handle_discard_end(void) ""
-- 
2.32.0




* [PATCH v2 04/25] migration: Introduce postcopy channels on dest node
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (2 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 03/25] migration: Tracepoint change in postcopy-run bottom half Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 05/25] migration: Dump ramblock and offset too when non-same-page detected Peter Xu
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Postcopy handles huge pages in a special way: currently we can only have one
"channel" to transfer a page.

That's because when we install pages using UFFDIO_COPY, we need the whole
huge page ready; it also means we need a temp huge page when trying to
receive the whole content of the page.

Currently all maintenance around this tmp page is global: first we allocate a
temp huge page, then we maintain its status mostly within
ram_load_postcopy().

To enable multiple channels for postcopy, the first thing we need to do is to
prepare N temp huge pages as caching, one for each channel.

Meanwhile we need to maintain the tmp huge page status per-channel too.

To give an example, here are some local variables maintained in
ram_load_postcopy() that are responsible for the temp huge page status:

  - all_zero:     this keeps whether this huge page contains all zeros
  - target_pages: this counts how many target pages have been copied
  - host_page:    this keeps the host ptr for the page to install

Move all these fields to be together with the temp huge pages to form a new
structure called PostcopyTmpPage.  Then for each (future) postcopy channel, we
need one structure to keep the state around.

For vanilla postcopy, obviously there's only one channel.  It contains both
precopy and postcopy pages.

This patch teaches the dest migration node to start realizing the possible
number of postcopy channels by introducing the "postcopy_channels" variable.
Its value is calculated when postcopy is set up on the dest node (during the
POSTCOPY_LISTEN phase).

Vanilla postcopy will have channels=1, but when the postcopy-preempt
capability is enabled (in the future), we will boost it to 2, because even
during partial sending of a precopy huge page we still want to preempt it and
start sending the requested postcopy page right away (so we start to keep two
temp huge pages; more if we want to enable multifd).  In this patch there's a
TODO marked for that; so far channels is always set to 1.

We need to send one "host huge page" on one channel only and cannot split it,
because otherwise the data of the same huge page could be spread across more
than one channel, which would need more complicated logic to manage.  One
temp host huge page per channel will be enough for us for now.

Postcopy will still always use the index=0 huge page even after this patch.
However it prepares for later patches where multiple channels can be used
(which needs src intervention, because only the src knows which channel we
should use).

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.h    | 36 +++++++++++++++++++++++-
 migration/postcopy-ram.c | 60 ++++++++++++++++++++++++++++++----------
 migration/ram.c          | 43 ++++++++++++++--------------
 migration/savevm.c       | 12 ++++++++
 4 files changed, 113 insertions(+), 38 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index 8130b703eb..42c7395094 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -45,6 +45,24 @@ struct PostcopyBlocktimeContext;
  */
 #define CLEAR_BITMAP_SHIFT_MAX            31
 
+/* This is an abstraction of a "temp huge page" for postcopy's purpose */
+typedef struct {
+    /*
+     * This points to a temporary huge page as a buffer for UFFDIO_COPY.  It's
+     * mmap()ed and needs to be freed when cleanup.
+     */
+    void *tmp_huge_page;
+    /*
+     * This points to the host page we're going to install for this temp page.
+     * It tells us after we've received the whole page, where we should put it.
+     */
+    void *host_addr;
+    /* Number of small pages copied (in size of TARGET_PAGE_SIZE) */
+    unsigned int target_pages;
+    /* Whether this page contains all zeros */
+    bool all_zero;
+} PostcopyTmpPage;
+
 /* State for the incoming migration */
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
@@ -81,7 +99,22 @@ struct MigrationIncomingState {
     QemuMutex rp_mutex;    /* We send replies from multiple threads */
     /* RAMBlock of last request sent to source */
     RAMBlock *last_rb;
-    void     *postcopy_tmp_page;
+    /*
+     * Number of postcopy channels including the default precopy channel, so
+     * vanilla postcopy will only contain one channel which contain both
+     * precopy and postcopy streams.
+     *
+     * This is calculated when the src requests to enable postcopy but before
+     * it starts.  Its value can depend on e.g. whether postcopy preemption is
+     * enabled.
+     */
+    unsigned int postcopy_channels;
+    /*
+     * An array of temp host huge pages to be used, one for each postcopy
+     * channel.
+     */
+    PostcopyTmpPage *postcopy_tmp_pages;
+    /* This is shared for all postcopy channels */
     void     *postcopy_tmp_zero_page;
     /* PostCopyFD's for external userfaultfds & handlers of shared memory */
     GArray   *postcopy_remote_fds;
@@ -391,5 +424,6 @@ bool migration_rate_limit(void);
 void migration_cancel(const Error *error);
 
 void populate_vfio_info(MigrationInfo *info);
+void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
 
 #endif
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 2a2cc5faf8..30c3508f44 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -526,9 +526,18 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis)
 
 static void postcopy_temp_pages_cleanup(MigrationIncomingState *mis)
 {
-    if (mis->postcopy_tmp_page) {
-        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
-        mis->postcopy_tmp_page = NULL;
+    int i;
+
+    if (mis->postcopy_tmp_pages) {
+        for (i = 0; i < mis->postcopy_channels; i++) {
+            if (mis->postcopy_tmp_pages[i].tmp_huge_page) {
+                munmap(mis->postcopy_tmp_pages[i].tmp_huge_page,
+                       mis->largest_page_size);
+                mis->postcopy_tmp_pages[i].tmp_huge_page = NULL;
+            }
+        }
+        g_free(mis->postcopy_tmp_pages);
+        mis->postcopy_tmp_pages = NULL;
     }
 
     if (mis->postcopy_tmp_zero_page) {
@@ -1092,17 +1101,30 @@ retry:
 
 static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
 {
-    int err;
-
-    mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
-                                  PROT_READ | PROT_WRITE,
-                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
-    if (mis->postcopy_tmp_page == MAP_FAILED) {
-        err = errno;
-        mis->postcopy_tmp_page = NULL;
-        error_report("%s: Failed to map postcopy_tmp_page %s",
-                     __func__, strerror(err));
-        return -err;
+    PostcopyTmpPage *tmp_page;
+    int err, i, channels;
+    void *temp_page;
+
+    /* TODO: will be boosted when enable postcopy preemption */
+    mis->postcopy_channels = 1;
+
+    channels = mis->postcopy_channels;
+    mis->postcopy_tmp_pages = g_malloc0_n(sizeof(PostcopyTmpPage), channels);
+
+    for (i = 0; i < channels; i++) {
+        tmp_page = &mis->postcopy_tmp_pages[i];
+        temp_page = mmap(NULL, mis->largest_page_size, PROT_READ | PROT_WRITE,
+                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+        if (temp_page == MAP_FAILED) {
+            err = errno;
+            error_report("%s: Failed to map postcopy_tmp_pages[%d]: %s",
+                         __func__, i, strerror(err));
+            /* Clean up will be done later */
+            return -err;
+        }
+        tmp_page->tmp_huge_page = temp_page;
+        /* Initialize default states for each tmp page */
+        postcopy_temp_page_reset(tmp_page);
     }
 
     /*
@@ -1352,6 +1374,16 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
 #endif
 
 /* ------------------------------------------------------------------------- */
+void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page)
+{
+    tmp_page->target_pages = 0;
+    tmp_page->host_addr = NULL;
+    /*
+     * This is set to true when reset, and cleared as long as we received any
+     * of the non-zero small page within this huge page.
+     */
+    tmp_page->all_zero = true;
+}
 
 void postcopy_fault_thread_notify(MigrationIncomingState *mis)
 {
diff --git a/migration/ram.c b/migration/ram.c
index 781f0745dc..0fc6b8e349 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3641,11 +3641,8 @@ static int ram_load_postcopy(QEMUFile *f)
     bool place_needed = false;
     bool matches_target_page_size = false;
     MigrationIncomingState *mis = migration_incoming_get_current();
-    /* Temporary page that is later 'placed' */
-    void *postcopy_host_page = mis->postcopy_tmp_page;
-    void *host_page = NULL;
-    bool all_zero = true;
-    int target_pages = 0;
+    /* Currently we only use channel 0.  TODO: use all the channels */
+    PostcopyTmpPage *tmp_page = &mis->postcopy_tmp_pages[0];
 
     while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
         ram_addr_t addr;
@@ -3689,7 +3686,7 @@ static int ram_load_postcopy(QEMUFile *f)
                 ret = -EINVAL;
                 break;
             }
-            target_pages++;
+            tmp_page->target_pages++;
             matches_target_page_size = block->page_size == TARGET_PAGE_SIZE;
             /*
              * Postcopy requires that we place whole host pages atomically;
@@ -3701,15 +3698,16 @@ static int ram_load_postcopy(QEMUFile *f)
              * however the source ensures it always sends all the components
              * of a host page in one chunk.
              */
-            page_buffer = postcopy_host_page +
+            page_buffer = tmp_page->tmp_huge_page +
                           host_page_offset_from_ram_block_offset(block, addr);
             /* If all TP are zero then we can optimise the place */
-            if (target_pages == 1) {
-                host_page = host_page_from_ram_block_offset(block, addr);
-            } else if (host_page != host_page_from_ram_block_offset(block,
-                                                                    addr)) {
+            if (tmp_page->target_pages == 1) {
+                tmp_page->host_addr =
+                    host_page_from_ram_block_offset(block, addr);
+            } else if (tmp_page->host_addr !=
+                       host_page_from_ram_block_offset(block, addr)) {
                 /* not the 1st TP within the HP */
-                error_report("Non-same host page %p/%p", host_page,
+                error_report("Non-same host page %p/%p", tmp_page->host_addr,
                              host_page_from_ram_block_offset(block, addr));
                 ret = -EINVAL;
                 break;
@@ -3719,10 +3717,11 @@ static int ram_load_postcopy(QEMUFile *f)
              * If it's the last part of a host page then we place the host
              * page
              */
-            if (target_pages == (block->page_size / TARGET_PAGE_SIZE)) {
+            if (tmp_page->target_pages ==
+                (block->page_size / TARGET_PAGE_SIZE)) {
                 place_needed = true;
             }
-            place_source = postcopy_host_page;
+            place_source = tmp_page->tmp_huge_page;
         }
 
         switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
@@ -3736,12 +3735,12 @@ static int ram_load_postcopy(QEMUFile *f)
                 memset(page_buffer, ch, TARGET_PAGE_SIZE);
             }
             if (ch) {
-                all_zero = false;
+                tmp_page->all_zero = false;
             }
             break;
 
         case RAM_SAVE_FLAG_PAGE:
-            all_zero = false;
+            tmp_page->all_zero = false;
             if (!matches_target_page_size) {
                 /* For huge pages, we always use temporary buffer */
                 qemu_get_buffer(f, page_buffer, TARGET_PAGE_SIZE);
@@ -3759,7 +3758,7 @@ static int ram_load_postcopy(QEMUFile *f)
             }
             break;
         case RAM_SAVE_FLAG_COMPRESS_PAGE:
-            all_zero = false;
+            tmp_page->all_zero = false;
             len = qemu_get_be32(f);
             if (len < 0 || len > compressBound(TARGET_PAGE_SIZE)) {
                 error_report("Invalid compressed data length: %d", len);
@@ -3791,16 +3790,14 @@ static int ram_load_postcopy(QEMUFile *f)
         }
 
         if (!ret && place_needed) {
-            if (all_zero) {
-                ret = postcopy_place_page_zero(mis, host_page, block);
+            if (tmp_page->all_zero) {
+                ret = postcopy_place_page_zero(mis, tmp_page->host_addr, block);
             } else {
-                ret = postcopy_place_page(mis, host_page, place_source,
+                ret = postcopy_place_page(mis, tmp_page->host_addr, place_source,
                                           block);
             }
             place_needed = false;
-            target_pages = 0;
-            /* Assume we have a zero page until we detect something different */
-            all_zero = true;
+            postcopy_temp_page_reset(tmp_page);
         }
     }
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 41e3238798..0ccd7e5e3f 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2579,6 +2579,18 @@ void qemu_loadvm_state_cleanup(void)
 /* Return true if we should continue the migration, or false. */
 static bool postcopy_pause_incoming(MigrationIncomingState *mis)
 {
+    int i;
+
+    /*
+     * If network is interrupted, any temp page we received will be useless
+     * because we didn't mark them as "received" in receivedmap.  After a
+     * proper recovery later (which will sync src dirty bitmap with receivedmap
+     * on dest) these cached small pages will be resent again.
+     */
+    for (i = 0; i < mis->postcopy_channels; i++) {
+        postcopy_temp_page_reset(&mis->postcopy_tmp_pages[i]);
+    }
+
     trace_postcopy_pause_incoming();
 
     assert(migrate_postcopy_ram());
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 05/25] migration: Dump ramblock and offset too when non-same-page detected
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (3 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 04/25] migration: Introduce postcopy channels on dest node Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 06/25] migration: Add postcopy_thread_create() Peter Xu
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

In ram_load_postcopy() we try to detect the non-same-page case and dump an
error.  This error is very helpful for debugging.  Add the ramblock & offset
to the error log too.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 0fc6b8e349..3a216c2340 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3707,8 +3707,12 @@ static int ram_load_postcopy(QEMUFile *f)
             } else if (tmp_page->host_addr !=
                        host_page_from_ram_block_offset(block, addr)) {
                 /* not the 1st TP within the HP */
-                error_report("Non-same host page %p/%p", tmp_page->host_addr,
-                             host_page_from_ram_block_offset(block, addr));
+                error_report("Non-same host page detected.  Target host page %p, "
+                             "received host page %p "
+                             "(rb %s offset 0x"RAM_ADDR_FMT" target_pages %d)",
+                             tmp_page->host_addr,
+                             host_page_from_ram_block_offset(block, addr),
+                             block->idstr, addr, tmp_page->target_pages);
                 ret = -EINVAL;
                 break;
             }
-- 
2.32.0




* [PATCH v2 06/25] migration: Add postcopy_thread_create()
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (4 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 05/25] migration: Dump ramblock and offset too when non-same-page detected Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 07/25] migration: Move static var in ram_block_from_stream() into global Peter Xu
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Postcopy creates threads.  A common pattern is to init a semaphore and use it
to sync with the thread.  Namely, we have fault_thread_sem and
listen_thread_sem, and they're only used for this.

Make it shared infrastructure so it's easier to create yet another thread.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.h    |  8 +++++---
 migration/postcopy-ram.c | 23 +++++++++++++++++------
 migration/postcopy-ram.h |  4 ++++
 migration/savevm.c       | 12 +++---------
 4 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index 42c7395094..8445e1d14a 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -70,7 +70,11 @@ struct MigrationIncomingState {
     /* A hook to allow cleanup at the end of incoming migration */
     void *transport_data;
     void (*transport_cleanup)(void *data);
-
+    /*
+     * Used to sync thread creations.  Note that we can't create threads in
+     * parallel with this sem.
+     */
+    QemuSemaphore  thread_sync_sem;
     /*
      * Free at the start of the main state load, set as the main thread finishes
      * loading state.
@@ -83,13 +87,11 @@ struct MigrationIncomingState {
     size_t         largest_page_size;
     bool           have_fault_thread;
     QemuThread     fault_thread;
-    QemuSemaphore  fault_thread_sem;
     /* Set this when we want the fault thread to quit */
     bool           fault_thread_quit;
 
     bool           have_listen_thread;
     QemuThread     listen_thread;
-    QemuSemaphore  listen_thread_sem;
 
     /* For the kernel to send us notifications */
     int       userfault_fd;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 30c3508f44..d08d396c63 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -78,6 +78,20 @@ int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp)
                                             &pnd);
 }
 
+/*
+ * NOTE: this routine is not thread safe, we can't call it concurrently. But it
+ * should be good enough for migration's purposes.
+ */
+void postcopy_thread_create(MigrationIncomingState *mis,
+                            QemuThread *thread, const char *name,
+                            void *(*fn)(void *), int joinable)
+{
+    qemu_sem_init(&mis->thread_sync_sem, 0);
+    qemu_thread_create(thread, name, fn, mis, joinable);
+    qemu_sem_wait(&mis->thread_sync_sem);
+    qemu_sem_destroy(&mis->thread_sync_sem);
+}
+
 /* Postcopy needs to detect accesses to pages that haven't yet been copied
  * across, and efficiently map new pages in, the techniques for doing this
  * are target OS specific.
@@ -902,7 +916,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
     trace_postcopy_ram_fault_thread_entry();
     rcu_register_thread();
     mis->last_rb = NULL; /* last RAMBlock we sent part of */
-    qemu_sem_post(&mis->fault_thread_sem);
+    qemu_sem_post(&mis->thread_sync_sem);
 
     struct pollfd *pfd;
     size_t pfd_len = 2 + mis->postcopy_remote_fds->len;
@@ -1173,11 +1187,8 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
         return -1;
     }
 
-    qemu_sem_init(&mis->fault_thread_sem, 0);
-    qemu_thread_create(&mis->fault_thread, "postcopy/fault",
-                       postcopy_ram_fault_thread, mis, QEMU_THREAD_JOINABLE);
-    qemu_sem_wait(&mis->fault_thread_sem);
-    qemu_sem_destroy(&mis->fault_thread_sem);
+    postcopy_thread_create(mis, &mis->fault_thread, "postcopy/fault",
+                           postcopy_ram_fault_thread, QEMU_THREAD_JOINABLE);
     mis->have_fault_thread = true;
 
     /* Mark so that we get notified of accesses to unwritten areas */
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 6d2b3cf124..07684c0e1d 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -135,6 +135,10 @@ void postcopy_remove_notifier(NotifierWithReturn *n);
 /* Call the notifier list set by postcopy_add_start_notifier */
 int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp);
 
+void postcopy_thread_create(MigrationIncomingState *mis,
+                            QemuThread *thread, const char *name,
+                            void *(*fn)(void *), int joinable);
+
 struct PostCopyFD;
 
 /* ufd is a pointer to the struct uffd_msg *TODO: more Portable! */
diff --git a/migration/savevm.c b/migration/savevm.c
index 0ccd7e5e3f..967ff80547 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1863,7 +1863,7 @@ static void *postcopy_ram_listen_thread(void *opaque)
 
     migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
                                    MIGRATION_STATUS_POSTCOPY_ACTIVE);
-    qemu_sem_post(&mis->listen_thread_sem);
+    qemu_sem_post(&mis->thread_sync_sem);
     trace_postcopy_ram_listen_thread_start();
 
     rcu_register_thread();
@@ -1988,14 +1988,8 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
     }
 
     mis->have_listen_thread = true;
-    /* Start up the listening thread and wait for it to signal ready */
-    qemu_sem_init(&mis->listen_thread_sem, 0);
-    qemu_thread_create(&mis->listen_thread, "postcopy/listen",
-                       postcopy_ram_listen_thread, NULL,
-                       QEMU_THREAD_DETACHED);
-    qemu_sem_wait(&mis->listen_thread_sem);
-    qemu_sem_destroy(&mis->listen_thread_sem);
-
+    postcopy_thread_create(mis, &mis->listen_thread, "postcopy/listen",
+                           postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
     trace_loadvm_postcopy_handle_listen("return");
 
     return 0;
-- 
2.32.0




* [PATCH v2 07/25] migration: Move static var in ram_block_from_stream() into global
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (5 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 06/25] migration: Add postcopy_thread_create() Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 08/25] migration: Add pss.postcopy_requested status Peter Xu
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

The static variable is very unfriendly to threading ram_block_from_stream().
Move it into MigrationIncomingState.

Pass the incoming state pointer to ram_block_from_stream() at both call
sites.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.h |  3 ++-
 migration/ram.c       | 13 +++++++++----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index 8445e1d14a..d8b9850eae 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -66,7 +66,8 @@ typedef struct {
 /* State for the incoming migration */
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
-
+    /* Previously received RAM's RAMBlock pointer */
+    RAMBlock *last_recv_block;
     /* A hook to allow cleanup at the end of incoming migration */
     void *transport_data;
     void (*transport_cleanup)(void *data);
diff --git a/migration/ram.c b/migration/ram.c
index 3a216c2340..9516dd655a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3185,12 +3185,14 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
  *
  * Returns a pointer from within the RCU-protected ram_list.
  *
+ * @mis: the migration incoming state pointer
  * @f: QEMUFile where to read the data from
  * @flags: Page flags (mostly to see if it's a continuation of previous block)
  */
-static inline RAMBlock *ram_block_from_stream(QEMUFile *f, int flags)
+static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
+                                              QEMUFile *f, int flags)
 {
-    static RAMBlock *block;
+    RAMBlock *block = mis->last_recv_block;
     char id[256];
     uint8_t len;
 
@@ -3217,6 +3219,8 @@ static inline RAMBlock *ram_block_from_stream(QEMUFile *f, int flags)
         return NULL;
     }
 
+    mis->last_recv_block = block;
+
     return block;
 }
 
@@ -3669,7 +3673,7 @@ static int ram_load_postcopy(QEMUFile *f)
         trace_ram_load_postcopy_loop((uint64_t)addr, flags);
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE)) {
-            block = ram_block_from_stream(f, flags);
+            block = ram_block_from_stream(mis, f, flags);
             if (!block) {
                 ret = -EINVAL;
                 break;
@@ -3881,6 +3885,7 @@ void colo_flush_ram_cache(void)
  */
 static int ram_load_precopy(QEMUFile *f)
 {
+    MigrationIncomingState *mis = migration_incoming_get_current();
     int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0;
     /* ADVISE is earlier, it shows the source has the postcopy capability on */
     bool postcopy_advised = postcopy_is_advised();
@@ -3919,7 +3924,7 @@ static int ram_load_precopy(QEMUFile *f)
 
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
-            RAMBlock *block = ram_block_from_stream(f, flags);
+            RAMBlock *block = ram_block_from_stream(mis, f, flags);
 
             host = host_from_ram_block_offset(block, addr);
             /*
-- 
2.32.0




* [PATCH v2 08/25] migration: Add pss.postcopy_requested status
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (6 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 07/25] migration: Move static var in ram_block_from_stream() into global Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 09/25] migration: Move migrate_allow_multifd and helpers into migration.c Peter Xu
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

This boolean flag shows whether the current page during migration was
triggered by a postcopy request or not.  Then in ram_save_host_page() and
deeper in the stack we'll be able to reference the priority of this page.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 9516dd655a..f1de1a06e4 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -414,6 +414,8 @@ struct PageSearchStatus {
     unsigned long page;
     /* Set once we wrap around */
     bool         complete_round;
+    /* Whether current page is explicitly requested by postcopy */
+    bool         postcopy_requested;
 };
 typedef struct PageSearchStatus PageSearchStatus;
 
@@ -1487,6 +1489,9 @@ retry:
  */
 static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
 {
+    /* This is not a postcopy requested page */
+    pss->postcopy_requested = false;
+
     pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
     if (pss->complete_round && pss->block == rs->last_seen_block &&
         pss->page >= rs->last_page) {
@@ -1981,6 +1986,7 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
          * really rare.
          */
         pss->complete_round = false;
+        pss->postcopy_requested = true;
     }
 
     return !!block;
-- 
2.32.0




* [PATCH v2 09/25] migration: Move migrate_allow_multifd and helpers into migration.c
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (7 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 08/25] migration: Add pss.postcopy_requested status Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 10/25] migration: Enlarge postcopy recovery to capture !-EIO too Peter Xu
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

This variable, along with its helpers, is used to detect whether multiple
channels will be supported for migration.  In follow-up patches, there'll be
other capabilities that require multiple channels.  Hence move it out of the
multifd-specific code and make it public.  Meanwhile, rename it from
"multifd" to "multi_channels" to reflect its real meaning.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 22 +++++++++++++++++-----
 migration/migration.h |  3 +++
 migration/multifd.c   | 19 ++++---------------
 migration/multifd.h   |  2 --
 4 files changed, 24 insertions(+), 22 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index bcc385b94b..6e4cc9cc87 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -180,6 +180,18 @@ static int migration_maybe_pause(MigrationState *s,
                                  int new_state);
 static void migrate_fd_cancel(MigrationState *s);
 
+static bool migrate_allow_multi_channels = true;
+
+void migrate_protocol_allow_multi_channels(bool allow)
+{
+    migrate_allow_multi_channels = allow;
+}
+
+bool migrate_multi_channels_is_allowed(void)
+{
+    return migrate_allow_multi_channels;
+}
+
 static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
 {
     uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
@@ -463,12 +475,12 @@ static void qemu_start_incoming_migration(const char *uri, Error **errp)
 {
     const char *p = NULL;
 
-    migrate_protocol_allow_multifd(false); /* reset it anyway */
+    migrate_protocol_allow_multi_channels(false); /* reset it anyway */
     qapi_event_send_migration(MIGRATION_STATUS_SETUP);
     if (strstart(uri, "tcp:", &p) ||
         strstart(uri, "unix:", NULL) ||
         strstart(uri, "vsock:", NULL)) {
-        migrate_protocol_allow_multifd(true);
+        migrate_protocol_allow_multi_channels(true);
         socket_start_incoming_migration(p ? p : uri, errp);
 #ifdef CONFIG_RDMA
     } else if (strstart(uri, "rdma:", &p)) {
@@ -1255,7 +1267,7 @@ static bool migrate_caps_check(bool *cap_list,
 
     /* incoming side only */
     if (runstate_check(RUN_STATE_INMIGRATE) &&
-        !migrate_multifd_is_allowed() &&
+        !migrate_multi_channels_is_allowed() &&
         cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
         error_setg(errp, "multifd is not supported by current protocol");
         return false;
@@ -2313,11 +2325,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
         }
     }
 
-    migrate_protocol_allow_multifd(false);
+    migrate_protocol_allow_multi_channels(false);
     if (strstart(uri, "tcp:", &p) ||
         strstart(uri, "unix:", NULL) ||
         strstart(uri, "vsock:", NULL)) {
-        migrate_protocol_allow_multifd(true);
+        migrate_protocol_allow_multi_channels(true);
         socket_start_outgoing_migration(s, p ? p : uri, &local_err);
 #ifdef CONFIG_RDMA
     } else if (strstart(uri, "rdma:", &p)) {
diff --git a/migration/migration.h b/migration/migration.h
index d8b9850eae..d677a750c9 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -429,4 +429,7 @@ void migration_cancel(const Error *error);
 void populate_vfio_info(MigrationInfo *info);
 void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
 
+bool migrate_multi_channels_is_allowed(void);
+void migrate_protocol_allow_multi_channels(bool allow);
+
 #endif
diff --git a/migration/multifd.c b/migration/multifd.c
index 76b57a7177..180586dcde 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -517,7 +517,7 @@ void multifd_save_cleanup(void)
 {
     int i;
 
-    if (!migrate_use_multifd() || !migrate_multifd_is_allowed()) {
+    if (!migrate_use_multifd() || !migrate_multi_channels_is_allowed()) {
         return;
     }
     multifd_send_terminate_threads(NULL);
@@ -858,17 +858,6 @@ cleanup:
     multifd_new_send_channel_cleanup(p, sioc, local_err);
 }
 
-static bool migrate_allow_multifd = true;
-void migrate_protocol_allow_multifd(bool allow)
-{
-    migrate_allow_multifd = allow;
-}
-
-bool migrate_multifd_is_allowed(void)
-{
-    return migrate_allow_multifd;
-}
-
 int multifd_save_setup(Error **errp)
 {
     int thread_count;
@@ -879,7 +868,7 @@ int multifd_save_setup(Error **errp)
     if (!migrate_use_multifd()) {
         return 0;
     }
-    if (!migrate_multifd_is_allowed()) {
+    if (!migrate_multi_channels_is_allowed()) {
         error_setg(errp, "multifd is not supported by current protocol");
         return -1;
     }
@@ -980,7 +969,7 @@ int multifd_load_cleanup(Error **errp)
 {
     int i;
 
-    if (!migrate_use_multifd() || !migrate_multifd_is_allowed()) {
+    if (!migrate_use_multifd() || !migrate_multi_channels_is_allowed()) {
         return 0;
     }
     multifd_recv_terminate_threads(NULL);
@@ -1129,7 +1118,7 @@ int multifd_load_setup(Error **errp)
     if (!migrate_use_multifd()) {
         return 0;
     }
-    if (!migrate_multifd_is_allowed()) {
+    if (!migrate_multi_channels_is_allowed()) {
         error_setg(errp, "multifd is not supported by current protocol");
         return -1;
     }
diff --git a/migration/multifd.h b/migration/multifd.h
index 4dda900a0b..636e599395 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -13,8 +13,6 @@
 #ifndef QEMU_MIGRATION_MULTIFD_H
 #define QEMU_MIGRATION_MULTIFD_H
 
-bool migrate_multifd_is_allowed(void);
-void migrate_protocol_allow_multifd(bool allow);
 int multifd_save_setup(Error **errp);
 void multifd_save_cleanup(void);
 int multifd_load_setup(Error **errp);
-- 
2.32.0




* [PATCH v2 10/25] migration: Enlarge postcopy recovery to capture !-EIO too
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (8 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 09/25] migration: Move migrate_allow_multifd and helpers into migration.c Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 11/25] migration: postcopy_pause_fault_thread() never fails Peter Xu
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

We used to have quite a few places making sure -EIO happened, as that's the
only way to trigger postcopy recovery.  That's based on the assumption that
we'll only return -EIO for channel issues.

It'll work in 99.99% of cases, but logically that won't cover some corner
cases.  One example: ram_block_from_stream() could fail on an interrupted
network, and then -EINVAL will be returned instead of -EIO.

I remember Dave Gilbert pointed that out before, but somehow it was
overlooked.  Neither have I encountered anything outside the -EIO error.

However, we'd better touch that up before it triggers a rare VM data loss
during live migration.

To cover as many of those cases as possible, remove the -EIO restriction on
triggering postcopy recovery, because even if it's not a channel failure,
we can't do anything better than halting QEMU anyway - the corpse of the
process may even be used by a good hand to dig out useful memory regions, or
the admin could simply kill the process later on.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c    | 4 ++--
 migration/postcopy-ram.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 6e4cc9cc87..67520d3105 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2877,7 +2877,7 @@ retry:
 out:
     res = qemu_file_get_error(rp);
     if (res) {
-        if (res == -EIO && migration_in_postcopy()) {
+        if (res && migration_in_postcopy()) {
             /*
              * Maybe there is something we can do: it looks like a
              * network down issue, and we pause for a recovery.
@@ -3478,7 +3478,7 @@ static MigThrError migration_detect_error(MigrationState *s)
         error_free(local_error);
     }
 
-    if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret == -EIO) {
+    if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret) {
         /*
          * For postcopy, we allow the network to be down for a
          * while. After that, it can be continued by a
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index d08d396c63..b0d12d5053 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1039,7 +1039,7 @@ retry:
                                         msg.arg.pagefault.address);
             if (ret) {
                 /* May be network failure, try to wait for recovery */
-                if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
+                if (postcopy_pause_fault_thread(mis)) {
                     /* We got reconnected somehow, try to continue */
                     goto retry;
                 } else {
-- 
2.32.0




* [PATCH v2 11/25] migration: postcopy_pause_fault_thread() never fails
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (9 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 10/25] migration: Enlarge postcopy recovery to capture !-EIO too Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 12/25] migration: Export ram_load_postcopy() Peter Xu
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Per the title, remove the return code and simplify the callers, as the errors
can never be triggered.  No functional change intended.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 25 ++++---------------------
 1 file changed, 4 insertions(+), 21 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index b0d12d5053..32c52f4b1d 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -891,15 +891,11 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
                                       affected_cpu);
 }
 
-static bool postcopy_pause_fault_thread(MigrationIncomingState *mis)
+static void postcopy_pause_fault_thread(MigrationIncomingState *mis)
 {
     trace_postcopy_pause_fault_thread();
-
     qemu_sem_wait(&mis->postcopy_pause_sem_fault);
-
     trace_postcopy_pause_fault_thread_continued();
-
-    return true;
 }
 
 /*
@@ -959,13 +955,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
              * broken already using the event. We should hold until
              * the channel is rebuilt.
              */
-            if (postcopy_pause_fault_thread(mis)) {
-                /* Continue to read the userfaultfd */
-            } else {
-                error_report("%s: paused but don't allow to continue",
-                             __func__);
-                break;
-            }
+            postcopy_pause_fault_thread(mis);
         }
 
         if (pfd[1].revents) {
@@ -1039,15 +1029,8 @@ retry:
                                         msg.arg.pagefault.address);
             if (ret) {
                 /* May be network failure, try to wait for recovery */
-                if (postcopy_pause_fault_thread(mis)) {
-                    /* We got reconnected somehow, try to continue */
-                    goto retry;
-                } else {
-                    /* This is a unavoidable fault */
-                    error_report("%s: postcopy_request_page() get %d",
-                                 __func__, ret);
-                    break;
-                }
+                postcopy_pause_fault_thread(mis);
+                goto retry;
             }
         }
 
-- 
2.32.0




* [PATCH v2 12/25] migration: Export ram_load_postcopy()
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (10 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 11/25] migration: postcopy_pause_fault_thread() never fails Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 13/25] migration: Move channel setup out of postcopy_try_recover() Peter Xu
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Will be reused in postcopy fast load thread.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c | 2 +-
 migration/ram.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index f1de1a06e4..5cb5dfc2cc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3645,7 +3645,7 @@ int ram_postcopy_incoming_init(MigrationIncomingState *mis)
  *
  * @f: QEMUFile where to send the data
  */
-static int ram_load_postcopy(QEMUFile *f)
+int ram_load_postcopy(QEMUFile *f)
 {
     int flags = 0, ret = 0;
     bool place_needed = false;
diff --git a/migration/ram.h b/migration/ram.h
index 2c6dc3675d..ded0a3a086 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -61,6 +61,7 @@ void ram_postcopy_send_discard_bitmap(MigrationState *ms);
 /* For incoming postcopy discard */
 int ram_discard_range(const char *block_name, uint64_t start, size_t length);
 int ram_postcopy_incoming_init(MigrationIncomingState *mis);
+int ram_load_postcopy(QEMUFile *f);
 
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 
-- 
2.32.0




* [PATCH v2 13/25] migration: Move channel setup out of postcopy_try_recover()
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (11 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 12/25] migration: Export ram_load_postcopy() Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 14/25] migration: Add migration_incoming_transport_cleanup() Peter Xu
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

We used to use postcopy_try_recover() to replace migration_incoming_setup() to
setup incoming channels.  That's fine for the old world, but in the new world
there can be more than one channels that need setup.  Better move the channel
setup out of it so that postcopy_try_recover() only handles the last phase of
switching to the recovery phase.

To do that, in migration_fd_process_incoming(), move the postcopy_try_recover()
call to after migration_incoming_setup(), which sets up the channels.  In
migration_ioc_process_incoming(), postpone the recovery routine until right
before we jump into migration_incoming_process().

A side benefit is that we no longer need to pass a QEMUFile* into
postcopy_try_recover().  Remove the parameter.
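
The reordered flow can be modeled with a minimal sketch (hypothetical names,
not QEMU code): setup always runs first and records the main channel, so the
recovery check can simply assert on it instead of taking a file argument:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the reordered incoming path: setup_channels()
 * stands in for migration_incoming_setup() and always records the main
 * channel first, so try_recover() (standing in for postcopy_try_recover())
 * can rely on it being non-NULL instead of taking a QEMUFile* argument.
 */
struct incoming_state {
    void *from_src_file;   /* main channel, set during setup */
    bool paused;           /* true if a paused postcopy is pending */
};

static void setup_channels(struct incoming_state *mis, void *f)
{
    mis->from_src_file = f;
}

static bool try_recover(struct incoming_state *mis)
{
    if (!mis->paused) {
        return false;
    }
    assert(mis->from_src_file);   /* guaranteed by the new call order */
    return true;
}
```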

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 67520d3105..b2e6446457 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -665,19 +665,20 @@ void migration_incoming_process(void)
 }
 
 /* Returns true if recovered from a paused migration, otherwise false */
-static bool postcopy_try_recover(QEMUFile *f)
+static bool postcopy_try_recover(void)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
 
     if (mis->state == MIGRATION_STATUS_POSTCOPY_PAUSED) {
         /* Resumed from a paused postcopy migration */
 
-        mis->from_src_file = f;
+        /* This should be set already in migration_incoming_setup() */
+        assert(mis->from_src_file);
         /* Postcopy has standalone thread to do vm load */
-        qemu_file_set_blocking(f, true);
+        qemu_file_set_blocking(mis->from_src_file, true);
 
         /* Re-configure the return path */
-        mis->to_src_file = qemu_file_get_return_path(f);
+        mis->to_src_file = qemu_file_get_return_path(mis->from_src_file);
 
         migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
                           MIGRATION_STATUS_POSTCOPY_RECOVER);
@@ -698,11 +699,10 @@ static bool postcopy_try_recover(QEMUFile *f)
 
 void migration_fd_process_incoming(QEMUFile *f, Error **errp)
 {
-    if (postcopy_try_recover(f)) {
+    if (!migration_incoming_setup(f, errp)) {
         return;
     }
-
-    if (!migration_incoming_setup(f, errp)) {
+    if (postcopy_try_recover()) {
         return;
     }
     migration_incoming_process();
@@ -718,11 +718,6 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
         /* The first connection (multifd may have multiple) */
         QEMUFile *f = qemu_fopen_channel_input(ioc);
 
-        /* If it's a recovery, we're done */
-        if (postcopy_try_recover(f)) {
-            return;
-        }
-
         if (!migration_incoming_setup(f, errp)) {
             return;
         }
@@ -743,6 +738,10 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
     }
 
     if (start_migration) {
+        /* If it's a recovery, we're done */
+        if (postcopy_try_recover()) {
+            return;
+        }
         migration_incoming_process();
     }
 }
-- 
2.32.0




* [PATCH v2 14/25] migration: Add migration_incoming_transport_cleanup()
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (12 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 13/25] migration: Move channel setup out of postcopy_try_recover() Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 15/25] migration: Allow migrate-recover to run multiple times Peter Xu
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Add a helper to clean up the transport listener.

While at it, also NULL-ify the cleanup hook and its data, so that the helper
is safe to call multiple times.

Move the socket_address_list cleanup in as well, because that list mirrors
the listener channels and exists only for the purpose of query-migrate.
Hence whoever wants to clean up the listener transport should always clean
up the socket list too.

No functional change intended.
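
The pattern this helper follows can be sketched in isolation (hypothetical
names, not the QEMU structs): free the resource, then NULL both the hook and
its data, so a second call is a no-op:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical mirror of the idempotent-cleanup pattern used by
 * migration_incoming_transport_cleanup(): run the hook once, then NULL
 * both the hook and its data so repeat calls are safe no-ops.
 */
typedef struct {
    void (*transport_cleanup)(void *);
    void *transport_data;
} State;

static int cleanup_calls;

static void fake_cleanup(void *data)
{
    (void)data;
    cleanup_calls++;
}

static void transport_cleanup(State *s)
{
    if (s->transport_cleanup) {
        s->transport_cleanup(s->transport_data);
        s->transport_data = NULL;
        s->transport_cleanup = NULL;   /* makes repeat calls safe */
    }
}
```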

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 22 ++++++++++++++--------
 migration/migration.h |  1 +
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index b2e6446457..6bb321cdd3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -279,6 +279,19 @@ MigrationIncomingState *migration_incoming_get_current(void)
     return current_incoming;
 }
 
+void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
+{
+    if (mis->socket_address_list) {
+        qapi_free_SocketAddressList(mis->socket_address_list);
+        mis->socket_address_list = NULL;
+    }
+
+    if (mis->transport_cleanup) {
+        mis->transport_cleanup(mis->transport_data);
+        mis->transport_data = mis->transport_cleanup = NULL;
+    }
+}
+
 void migration_incoming_state_destroy(void)
 {
     struct MigrationIncomingState *mis = migration_incoming_get_current();
@@ -299,10 +312,8 @@ void migration_incoming_state_destroy(void)
         g_array_free(mis->postcopy_remote_fds, TRUE);
         mis->postcopy_remote_fds = NULL;
     }
-    if (mis->transport_cleanup) {
-        mis->transport_cleanup(mis->transport_data);
-    }
 
+    migration_incoming_transport_cleanup(mis);
     qemu_event_reset(&mis->main_thread_load_event);
 
     if (mis->page_requested) {
@@ -310,11 +321,6 @@ void migration_incoming_state_destroy(void)
         mis->page_requested = NULL;
     }
 
-    if (mis->socket_address_list) {
-        qapi_free_SocketAddressList(mis->socket_address_list);
-        mis->socket_address_list = NULL;
-    }
-
     yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
 
diff --git a/migration/migration.h b/migration/migration.h
index d677a750c9..f17ccc657c 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -166,6 +166,7 @@ struct MigrationIncomingState {
 
 MigrationIncomingState *migration_incoming_get_current(void);
 void migration_incoming_state_destroy(void);
+void migration_incoming_transport_cleanup(MigrationIncomingState *mis);
 /*
  * Functions to work with blocktime context
  */
-- 
2.32.0




* [PATCH v2 15/25] migration: Allow migrate-recover to run multiple times
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (13 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 14/25] migration: Add migration_incoming_transport_cleanup() Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 16/25] migration: Add postcopy-preempt capability Peter Xu
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Previously migration didn't have an easy way to clean up the listening
transport, so migrate recovery was only allowed to execute once.  That was
enforced with a trick flag, postcopy_recover_triggered.

Now that facility is in place.

Drop postcopy_recover_triggered and instead allow a new migrate-recover to
release the previous listener transport.
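
For illustration, the one-shot guard being removed can be sketched as follows
(hypothetical names; the real code used qatomic_cmpxchg()): a compare-and-swap
flag lets recovery be triggered exactly once, so a failed attempt had to reset
the flag by hand before the user could retry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Sketch of the one-shot trigger pattern this patch drops: only the
 * first caller wins the compare-and-swap; every later attempt fails
 * until the flag is explicitly reset.  With an idempotent transport
 * cleanup, this guard is no longer needed.
 */
static atomic_bool recover_triggered;

static bool try_trigger_recover(void)
{
    bool expected = false;
    return atomic_compare_exchange_strong(&recover_triggered,
                                          &expected, true);
}

static void reset_recover_trigger(void)
{
    atomic_store(&recover_triggered, false);
}
```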

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 13 ++-----------
 migration/migration.h |  1 -
 migration/savevm.c    |  3 ---
 3 files changed, 2 insertions(+), 15 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 6bb321cdd3..16086897aa 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2159,11 +2159,8 @@ void qmp_migrate_recover(const char *uri, Error **errp)
         return;
     }
 
-    if (qatomic_cmpxchg(&mis->postcopy_recover_triggered,
-                       false, true) == true) {
-        error_setg(errp, "Migrate recovery is triggered already");
-        return;
-    }
+    /* If there's an existing transport, release it */
+    migration_incoming_transport_cleanup(mis);
 
     /*
      * Note that this call will never start a real migration; it will
@@ -2171,12 +2168,6 @@ void qmp_migrate_recover(const char *uri, Error **errp)
      * to continue using that newly established channel.
      */
     qemu_start_incoming_migration(uri, errp);
-
-    /* Safe to dereference with the assert above */
-    if (*errp) {
-        /* Reset the flag so user could still retry */
-        qatomic_set(&mis->postcopy_recover_triggered, false);
-    }
 }
 
 void qmp_migrate_pause(Error **errp)
diff --git a/migration/migration.h b/migration/migration.h
index f17ccc657c..a863032b71 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -139,7 +139,6 @@ struct MigrationIncomingState {
     struct PostcopyBlocktimeContext *blocktime_ctx;
 
     /* notify PAUSED postcopy incoming migrations to try to continue */
-    bool postcopy_recover_triggered;
     QemuSemaphore postcopy_pause_sem_dst;
     QemuSemaphore postcopy_pause_sem_fault;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 967ff80547..254aa78234 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2589,9 +2589,6 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
 
     assert(migrate_postcopy_ram());
 
-    /* Clear the triggered bit to allow one recovery */
-    mis->postcopy_recover_triggered = false;
-
     /*
      * Unregister yank with either from/to src would work, since ioc behind it
      * is the same
-- 
2.32.0




* [PATCH v2 16/25] migration: Add postcopy-preempt capability
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (14 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 15/25] migration: Allow migrate-recover to run multiple times Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 17/25] migration: Postcopy preemption preparation on channel creation Peter Xu
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Firstly, postcopy already preempts precopy in the sense that we do
unqueue_page() before looking into the dirty bits.

However that's not enough.  For example, with host huge pages enabled, while
sending a precopy huge page, a postcopy request has to wait until the whole
huge page finishes.  That can introduce quite some delay; the bigger the huge
page, the larger the delay.

This patch adds a new capability that allows postcopy requests to preempt the
precopy page being sent (even in the middle of a huge page), so that postcopy
requests can be serviced faster.

Meanwhile, to send the requested pages faster still, bypass the precopy
stream by providing a standalone postcopy socket for them.

Since the new behavior is not compatible with the old one, it will not be the
default; it's enabled only when the new capability is set on both src/dst
QEMUs.

This patch only adds the capability itself; the logic will be added in
follow-up patches.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 23 +++++++++++++++++++++++
 migration/migration.h |  1 +
 qapi/migration.json   |  8 +++++++-
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 16086897aa..4c22bad304 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1235,6 +1235,11 @@ static bool migrate_caps_check(bool *cap_list,
             error_setg(errp, "Postcopy is not compatible with ignore-shared");
             return false;
         }
+
+        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
+            error_setg(errp, "Multifd is not supported in postcopy");
+            return false;
+        }
     }
 
     if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
@@ -1278,6 +1283,13 @@ static bool migrate_caps_check(bool *cap_list,
         return false;
     }
 
+    if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT]) {
+        if (!cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
+            error_setg(errp, "Postcopy preempt requires postcopy-ram");
+            return false;
+        }
+    }
+
     return true;
 }
 
@@ -2622,6 +2634,15 @@ bool migrate_background_snapshot(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT];
 }
 
+bool migrate_postcopy_preempt(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT];
+}
+
 /* migration thread support */
 /*
  * Something bad happened to the RP stream, mark an error
@@ -4232,6 +4253,8 @@ static Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-compress", MIGRATION_CAPABILITY_COMPRESS),
     DEFINE_PROP_MIG_CAP("x-events", MIGRATION_CAPABILITY_EVENTS),
     DEFINE_PROP_MIG_CAP("x-postcopy-ram", MIGRATION_CAPABILITY_POSTCOPY_RAM),
+    DEFINE_PROP_MIG_CAP("x-postcopy-preempt",
+                        MIGRATION_CAPABILITY_POSTCOPY_PREEMPT),
     DEFINE_PROP_MIG_CAP("x-colo", MIGRATION_CAPABILITY_X_COLO),
     DEFINE_PROP_MIG_CAP("x-release-ram", MIGRATION_CAPABILITY_RELEASE_RAM),
     DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
diff --git a/migration/migration.h b/migration/migration.h
index a863032b71..af4bcb19c2 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -394,6 +394,7 @@ int migrate_decompress_threads(void);
 bool migrate_use_events(void);
 bool migrate_postcopy_blocktime(void);
 bool migrate_background_snapshot(void);
+bool migrate_postcopy_preempt(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_shut(MigrationIncomingState *mis,
diff --git a/qapi/migration.json b/qapi/migration.json
index 5975a0e104..50878b5f3b 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -463,6 +463,12 @@
 #                       procedure starts. The VM RAM is saved with running VM.
 #                       (since 6.0)
 #
+# @postcopy-preempt: If enabled, the migration process will allow postcopy
+#                    requests to preempt precopy stream, so postcopy requests
+#                    will be handled faster.  This is a performance feature and
+#                    should not affect the correctness of postcopy migration.
+#                    (since 7.0)
+#
 # Features:
 # @unstable: Members @x-colo and @x-ignore-shared are experimental.
 #
@@ -476,7 +482,7 @@
            'block', 'return-path', 'pause-before-switchover', 'multifd',
            'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
            { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
-           'validate-uuid', 'background-snapshot'] }
+           'validate-uuid', 'background-snapshot', 'postcopy-preempt'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.32.0




* [PATCH v2 17/25] migration: Postcopy preemption preparation on channel creation
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (15 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 16/25] migration: Add postcopy-preempt capability Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 18/25] migration: Postcopy preemption enablement Peter Xu
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Create a new socket for postcopy, so that requested pages can be sent via
this dedicated channel without getting blocked behind precopy pages.

A new thread is also created on the dest QEMU to receive data from this new
channel, based on the ram_load_postcopy() routine.

The ram_load_postcopy(POSTCOPY) branch and the thread do not function yet;
that'll be done in follow-up patches.

Clean up the new sockets on both src/dst QEMUs, and also look after the new
thread to make sure it will be recycled properly.
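
The dest-side channel accounting can be sketched in miniature (hypothetical
struct and function names; only the enum values mirror the patch): the first
connection becomes the precopy (main) channel, the second the postcopy
preempt channel, and migration starts only once every expected channel is
present, analogous to migration_has_all_channels().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Channel indices mirroring the patch's PostcopyChannels enum */
enum { RAM_CHANNEL_PRECOPY = 0, RAM_CHANNEL_POSTCOPY = 1, RAM_CHANNEL_MAX };

/* Hypothetical model of the incoming side, not QEMU's real structs */
struct dest_state {
    void *channel[RAM_CHANNEL_MAX];
    bool preempt_enabled;
};

/* First connection is the main (precopy) channel, second is preempt */
static void accept_connection(struct dest_state *s, void *ioc)
{
    if (!s->channel[RAM_CHANNEL_PRECOPY]) {
        s->channel[RAM_CHANNEL_PRECOPY] = ioc;
    } else {
        s->channel[RAM_CHANNEL_POSTCOPY] = ioc;
    }
}

/* Ready to start only when every expected channel has arrived */
static bool has_all_channels(const struct dest_state *s)
{
    if (!s->channel[RAM_CHANNEL_PRECOPY]) {
        return false;
    }
    return !s->preempt_enabled || s->channel[RAM_CHANNEL_POSTCOPY] != NULL;
}
```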

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c    | 62 +++++++++++++++++++++++----
 migration/migration.h    |  8 ++++
 migration/postcopy-ram.c | 92 ++++++++++++++++++++++++++++++++++++++--
 migration/postcopy-ram.h | 10 +++++
 migration/ram.c          | 25 ++++++++---
 migration/ram.h          |  4 +-
 migration/socket.c       | 22 +++++++++-
 migration/socket.h       |  1 +
 migration/trace-events   |  3 ++
 9 files changed, 207 insertions(+), 20 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 4c22bad304..3d7f897b72 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -321,6 +321,12 @@ void migration_incoming_state_destroy(void)
         mis->page_requested = NULL;
     }
 
+    if (mis->postcopy_qemufile_dst) {
+        migration_ioc_unregister_yank_from_file(mis->postcopy_qemufile_dst);
+        qemu_fclose(mis->postcopy_qemufile_dst);
+        mis->postcopy_qemufile_dst = NULL;
+    }
+
     yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
 
@@ -714,15 +720,21 @@ void migration_fd_process_incoming(QEMUFile *f, Error **errp)
     migration_incoming_process();
 }
 
+static bool migration_needs_multiple_sockets(void)
+{
+    return migrate_use_multifd() || migrate_postcopy_preempt();
+}
+
 void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
     Error *local_err = NULL;
     bool start_migration;
+    QEMUFile *f;
 
     if (!mis->from_src_file) {
         /* The first connection (multifd may have multiple) */
-        QEMUFile *f = qemu_fopen_channel_input(ioc);
+        f = qemu_fopen_channel_input(ioc);
 
         if (!migration_incoming_setup(f, errp)) {
             return;
@@ -730,13 +742,18 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 
         /*
          * Common migration only needs one channel, so we can start
-         * right now.  Multifd needs more than one channel, we wait.
+         * right now.  Some features need more than one channel, we wait.
          */
-        start_migration = !migrate_use_multifd();
+        start_migration = !migration_needs_multiple_sockets();
     } else {
         /* Multiple connections */
-        assert(migrate_use_multifd());
-        start_migration = multifd_recv_new_channel(ioc, &local_err);
+        assert(migration_needs_multiple_sockets());
+        if (migrate_use_multifd()) {
+            start_migration = multifd_recv_new_channel(ioc, &local_err);
+        } else if (migrate_postcopy_preempt()) {
+            f = qemu_fopen_channel_input(ioc);
+            start_migration = postcopy_preempt_new_channel(mis, f);
+        }
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -761,11 +778,20 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 bool migration_has_all_channels(void)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
-    bool all_channels;
 
-    all_channels = multifd_recv_all_channels_created();
+    if (!mis->from_src_file) {
+        return false;
+    }
+
+    if (migrate_use_multifd()) {
+        return multifd_recv_all_channels_created();
+    }
+
+    if (migrate_postcopy_preempt()) {
+        return mis->postcopy_qemufile_dst != NULL;
+    }
 
-    return all_channels && mis->from_src_file != NULL;
+    return true;
 }
 
 /*
@@ -1858,6 +1884,12 @@ static void migrate_fd_cleanup(MigrationState *s)
         qemu_fclose(tmp);
     }
 
+    if (s->postcopy_qemufile_src) {
+        migration_ioc_unregister_yank_from_file(s->postcopy_qemufile_src);
+        qemu_fclose(s->postcopy_qemufile_src);
+        s->postcopy_qemufile_src = NULL;
+    }
+
     assert(!migration_is_active(s));
 
     if (s->state == MIGRATION_STATUS_CANCELLING) {
@@ -3233,6 +3265,11 @@ static void migration_completion(MigrationState *s)
         qemu_savevm_state_complete_postcopy(s->to_dst_file);
         qemu_mutex_unlock_iothread();
 
+        /* Shutdown the postcopy fast path thread */
+        if (migrate_postcopy_preempt()) {
+            postcopy_preempt_shutdown_file(s);
+        }
+
         trace_migration_completion_postcopy_end_after_complete();
     } else {
         goto fail;
@@ -4120,6 +4157,15 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
         }
     }
 
+    /* This needs to be done before resuming a postcopy */
+    if (postcopy_preempt_setup(s, &local_err)) {
+        error_report_err(local_err);
+        migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                          MIGRATION_STATUS_FAILED);
+        migrate_fd_cleanup(s);
+        return;
+    }
+
     if (resume) {
         /* Wakeup the main migration thread to do the recovery */
         migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
diff --git a/migration/migration.h b/migration/migration.h
index af4bcb19c2..caa910d956 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -23,6 +23,7 @@
 #include "io/channel-buffer.h"
 #include "net/announce.h"
 #include "qom/object.h"
+#include "postcopy-ram.h"
 
 struct PostcopyBlocktimeContext;
 
@@ -112,6 +113,11 @@ struct MigrationIncomingState {
      * enabled.
      */
     unsigned int postcopy_channels;
+    /* QEMUFile for postcopy only; it'll be handled by a separate thread */
+    QEMUFile *postcopy_qemufile_dst;
+    /* Postcopy priority thread is used to receive postcopy requested pages */
+    QemuThread postcopy_prio_thread;
+    bool postcopy_prio_thread_created;
     /*
      * An array of temp host huge pages to be used, one for each postcopy
      * channel.
@@ -192,6 +198,8 @@ struct MigrationState {
     QEMUBH *cleanup_bh;
     /* Protected by qemu_file_lock */
     QEMUFile *to_dst_file;
+    /* Postcopy specific transfer channel */
+    QEMUFile *postcopy_qemufile_src;
     QIOChannelBuffer *bioc;
     /*
      * Protects to_dst_file/from_dst_file pointers.  We need to make sure we
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 32c52f4b1d..df0c02f729 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -33,6 +33,9 @@
 #include "trace.h"
 #include "hw/boards.h"
 #include "exec/ramblock.h"
+#include "socket.h"
+#include "qemu-file-channel.h"
+#include "yank_functions.h"
 
 /* Arbitrary limit on size of each discard command,
  * keeps them around ~200 bytes
@@ -567,6 +570,11 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
 {
     trace_postcopy_ram_incoming_cleanup_entry();
 
+    if (mis->postcopy_prio_thread_created) {
+        qemu_thread_join(&mis->postcopy_prio_thread);
+        mis->postcopy_prio_thread_created = false;
+    }
+
     if (mis->have_fault_thread) {
         Error *local_err = NULL;
 
@@ -1102,8 +1110,13 @@ static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
     int err, i, channels;
     void *temp_page;
 
-    /* TODO: will be boosted when enable postcopy preemption */
-    mis->postcopy_channels = 1;
+    if (migrate_postcopy_preempt()) {
+        /* If preemption enabled, need extra channel for urgent requests */
+        mis->postcopy_channels = RAM_CHANNEL_MAX;
+    } else {
+        /* Both precopy/postcopy on the same channel */
+        mis->postcopy_channels = 1;
+    }
 
     channels = mis->postcopy_channels;
     mis->postcopy_tmp_pages = g_malloc0_n(sizeof(PostcopyTmpPage), channels);
@@ -1170,7 +1183,7 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
         return -1;
     }
 
-    postcopy_thread_create(mis, &mis->fault_thread, "postcopy/fault",
+    postcopy_thread_create(mis, &mis->fault_thread, "fault-default",
                            postcopy_ram_fault_thread, QEMU_THREAD_JOINABLE);
     mis->have_fault_thread = true;
 
@@ -1185,6 +1198,16 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
         return -1;
     }
 
+    if (migrate_postcopy_preempt()) {
+        /*
+         * This thread needs to be created after the temp pages because it'll fetch
+         * RAM_CHANNEL_POSTCOPY PostcopyTmpPage immediately.
+         */
+        postcopy_thread_create(mis, &mis->postcopy_prio_thread, "fault-fast",
+                               postcopy_preempt_thread, QEMU_THREAD_JOINABLE);
+        mis->postcopy_prio_thread_created = true;
+    }
+
     trace_postcopy_ram_enable_notify();
 
     return 0;
@@ -1514,3 +1537,66 @@ void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd)
         }
     }
 }
+
+bool postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file)
+{
+    /*
+     * The new loading channel has its own threads, so it needs to be
+     * blocked too.  It's by default true, just be explicit.
+     */
+    qemu_file_set_blocking(file, true);
+    mis->postcopy_qemufile_dst = file;
+    trace_postcopy_preempt_new_channel();
+
+    /* Start the migration immediately */
+    return true;
+}
+
+int postcopy_preempt_setup(MigrationState *s, Error **errp)
+{
+    QIOChannel *ioc;
+
+    if (!migrate_postcopy_preempt()) {
+        return 0;
+    }
+
+    if (!migrate_multi_channels_is_allowed()) {
+        error_setg(errp, "Postcopy preempt is not supported as current "
+                   "migration stream does not support multi-channels.");
+        return -1;
+    }
+
+    ioc = socket_send_channel_create_sync(errp);
+
+    if (ioc == NULL) {
+        return -1;
+    }
+
+    migration_ioc_register_yank(ioc);
+    s->postcopy_qemufile_src = qemu_fopen_channel_output(ioc);
+
+    trace_postcopy_preempt_new_channel();
+
+    return 0;
+}
+
+void *postcopy_preempt_thread(void *opaque)
+{
+    MigrationIncomingState *mis = opaque;
+    int ret;
+
+    trace_postcopy_preempt_thread_entry();
+
+    rcu_register_thread();
+
+    qemu_sem_post(&mis->thread_sync_sem);
+
+    /* Sending RAM_SAVE_FLAG_EOS to terminate this thread */
+    ret = ram_load_postcopy(mis->postcopy_qemufile_dst, RAM_CHANNEL_POSTCOPY);
+
+    rcu_unregister_thread();
+
+    trace_postcopy_preempt_thread_exit();
+
+    return ret == 0 ? NULL : (void *)-1;
+}
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 07684c0e1d..34b1080cde 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -183,4 +183,14 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd, uint64_t client_addr,
 int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                  uint64_t client_addr, uint64_t offset);
 
+/* Hard-code channels for now for postcopy preemption */
+enum PostcopyChannels {
+    RAM_CHANNEL_PRECOPY = 0,
+    RAM_CHANNEL_POSTCOPY = 1,
+    RAM_CHANNEL_MAX,
+};
+
+bool postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
+int postcopy_preempt_setup(MigrationState *s, Error **errp);
+
 #endif
diff --git a/migration/ram.c b/migration/ram.c
index 5cb5dfc2cc..713ef6e421 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3644,15 +3644,15 @@ int ram_postcopy_incoming_init(MigrationIncomingState *mis)
  * rcu_read_lock is taken prior to this being called.
  *
  * @f: QEMUFile where to send the data
+ * @channel: the channel to use for loading
  */
-int ram_load_postcopy(QEMUFile *f)
+int ram_load_postcopy(QEMUFile *f, int channel)
 {
     int flags = 0, ret = 0;
     bool place_needed = false;
     bool matches_target_page_size = false;
     MigrationIncomingState *mis = migration_incoming_get_current();
-    /* Currently we only use channel 0.  TODO: use all the channels */
-    PostcopyTmpPage *tmp_page = &mis->postcopy_tmp_pages[0];
+    PostcopyTmpPage *tmp_page = &mis->postcopy_tmp_pages[channel];
 
     while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
         ram_addr_t addr;
@@ -3717,10 +3717,10 @@ int ram_load_postcopy(QEMUFile *f)
             } else if (tmp_page->host_addr !=
                        host_page_from_ram_block_offset(block, addr)) {
                 /* not the 1st TP within the HP */
-                error_report("Non-same host page detected.  Target host page %p, "
-                             "received host page %p "
+                error_report("Non-same host page detected on channel %d: "
+                             "Target host page %p, received host page %p "
                              "(rb %s offset 0x"RAM_ADDR_FMT" target_pages %d)",
-                             tmp_page->host_addr,
+                             channel, tmp_page->host_addr,
                              host_page_from_ram_block_offset(block, addr),
                              block->idstr, addr, tmp_page->target_pages);
                 ret = -EINVAL;
@@ -4107,7 +4107,12 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
      */
     WITH_RCU_READ_LOCK_GUARD() {
         if (postcopy_running) {
-            ret = ram_load_postcopy(f);
+            /*
+             * Note!  Here RAM_CHANNEL_PRECOPY is the precopy channel of
+             * postcopy migration, we have another RAM_CHANNEL_POSTCOPY to
+             * service fast page faults.
+             */
+            ret = ram_load_postcopy(f, RAM_CHANNEL_PRECOPY);
         } else {
             ret = ram_load_precopy(f);
         }
@@ -4269,6 +4274,12 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
     return 0;
 }
 
+void postcopy_preempt_shutdown_file(MigrationState *s)
+{
+    qemu_put_be64(s->postcopy_qemufile_src, RAM_SAVE_FLAG_EOS);
+    qemu_fflush(s->postcopy_qemufile_src);
+}
+
 static SaveVMHandlers savevm_ram_handlers = {
     .save_setup = ram_save_setup,
     .save_live_iterate = ram_save_iterate,
diff --git a/migration/ram.h b/migration/ram.h
index ded0a3a086..5d90945a6e 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -61,7 +61,7 @@ void ram_postcopy_send_discard_bitmap(MigrationState *ms);
 /* For incoming postcopy discard */
 int ram_discard_range(const char *block_name, uint64_t start, size_t length);
 int ram_postcopy_incoming_init(MigrationIncomingState *mis);
-int ram_load_postcopy(QEMUFile *f);
+int ram_load_postcopy(QEMUFile *f, int channel);
 
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 
@@ -73,6 +73,8 @@ int64_t ramblock_recv_bitmap_send(QEMUFile *file,
                                   const char *block_name);
 int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
 bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
+void postcopy_preempt_shutdown_file(MigrationState *s);
+void *postcopy_preempt_thread(void *opaque);
 
 /* ram cache */
 int colo_init_ram_cache(void);
diff --git a/migration/socket.c b/migration/socket.c
index 05705a32d8..a7f345b353 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -26,7 +26,7 @@
 #include "io/channel-socket.h"
 #include "io/net-listener.h"
 #include "trace.h"
-
+#include "postcopy-ram.h"
 
 struct SocketOutgoingArgs {
     SocketAddress *saddr;
@@ -39,6 +39,24 @@ void socket_send_channel_create(QIOTaskFunc f, void *data)
                                      f, data, NULL, NULL);
 }
 
+QIOChannel *socket_send_channel_create_sync(Error **errp)
+{
+    QIOChannelSocket *sioc = qio_channel_socket_new();
+
+    if (!outgoing_args.saddr) {
+        object_unref(OBJECT(sioc));
+        error_setg(errp, "Initial sock address not set!");
+        return NULL;
+    }
+
+    if (qio_channel_socket_connect_sync(sioc, outgoing_args.saddr, errp) < 0) {
+        object_unref(OBJECT(sioc));
+        return NULL;
+    }
+
+    return QIO_CHANNEL(sioc);
+}
+
 int socket_send_channel_destroy(QIOChannel *send)
 {
     /* Remove channel */
@@ -158,6 +176,8 @@ socket_start_incoming_migration_internal(SocketAddress *saddr,
 
     if (migrate_use_multifd()) {
         num = migrate_multifd_channels();
+    } else if (migrate_postcopy_preempt()) {
+        num = RAM_CHANNEL_MAX;
     }
 
     if (qio_net_listener_open_sync(listener, saddr, num, errp) < 0) {
diff --git a/migration/socket.h b/migration/socket.h
index 891dbccceb..dc54df4e6c 100644
--- a/migration/socket.h
+++ b/migration/socket.h
@@ -21,6 +21,7 @@
 #include "io/task.h"
 
 void socket_send_channel_create(QIOTaskFunc f, void *data);
+QIOChannel *socket_send_channel_create_sync(Error **errp);
 int socket_send_channel_destroy(QIOChannel *send);
 
 void socket_start_incoming_migration(const char *str, Error **errp);
diff --git a/migration/trace-events b/migration/trace-events
index 1aec580e92..4474a76614 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -278,6 +278,9 @@ postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_off
 postcopy_request_shared_page_present(const char *sharer, const char *rb, uint64_t rb_offset) "%s already %s offset 0x%"PRIx64
 postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
 postcopy_page_req_del(void *addr, int count) "resolved page req %p total %d"
+postcopy_preempt_new_channel(void) ""
+postcopy_preempt_thread_entry(void) ""
+postcopy_preempt_thread_exit(void) ""
 
 get_mem_fault_cpu_index(int cpu, uint32_t pid) "cpu: %d, pid: %u"
 
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 18/25] migration: Postcopy preemption enablement
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (16 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 17/25] migration: Postcopy preemption preparation on channel creation Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 19/25] migration: Postcopy recover with preempt enabled Peter Xu
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

This patch enables the postcopy-preempt feature.

It contains two major changes to the migration logic:

(1) Postcopy requests are now sent via a different socket from the precopy
    background migration stream, so that they are isolated from the very high
    page request delays the shared stream would otherwise cause.

(2) For huge-page-enabled hosts: when there are postcopy requests, they can
    now interrupt an in-progress huge host page send on the source QEMU.

After this patch, we'll live migrate a VM with two channels for postcopy: (1)
PRECOPY channel, which is the default channel that transfers background pages;
and (2) POSTCOPY channel, which only transfers requested pages.

There's no strict rule on which channel to use; e.g., if a requested page is
already being transferred on the precopy channel, we keep using that channel
to transfer the page even though it was explicitly requested.  In 99% of
cases, however, we prioritize the channels so that requested pages are sent
via the postcopy channel whenever possible.

On the source QEMU, when we find a postcopy request, we interrupt the
PRECOPY channel's sending process and quickly switch to the POSTCOPY channel.
After all the high-priority postcopy pages have been serviced, we switch back
to the PRECOPY channel and continue sending the interrupted huge page.
No new thread is introduced on the source QEMU.
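The channel-switching rule above can be sketched in plain C.  This is a
simplified model only (`struct chan_state` and `choose_channel()` are
illustrative names, not the series' API): the key detail, as in the patch's
postcopy_preempt_choose_channel() below, is that the last-sent-block cache
must be invalidated whenever the channel changes, because the receiver tracks
that cache per channel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in state; the real code keeps this in RAMState. */
enum { CH_PRECOPY = 0, CH_POSTCOPY = 1 };

struct chan_state {
    int cur_channel;              /* channel currently in use */
    const void *last_sent_block;  /* must be reset on a switch */
};

/* Pick the channel for the next page; returns true if we switched. */
static bool choose_channel(struct chan_state *s, bool postcopy_requested)
{
    int want = postcopy_requested ? CH_POSTCOPY : CH_PRECOPY;

    if (want == s->cur_channel) {
        return false;
    }
    /*
     * On a switch, the "previous block sent" cache is invalid: the old
     * sent block may not be on the same channel anymore.
     */
    s->cur_channel = want;
    s->last_sent_block = NULL;
    return true;
}
```

A postcopy request flips the state to the POSTCOPY channel; restoring the
interrupted precopy huge page flips it back.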

On the destination QEMU, one new thread is introduced to receive page data from
the postcopy specific socket (done in the preparation patch).

This patch has a side effect: previously, after sending a postcopy page we
assumed the guest would access the following pages, so we kept sending from
there.  Now, instead of continuing from a postcopy requested page, we go back
and resume sending the precopy huge page (which may have been partially sent
before being interrupted by a postcopy request).

Whether that's a problem is debatable, because "assuming the guest will
continue to access the next page" may not really hold when huge pages are
used, especially large ones (e.g. 1GB pages).  With huge pages, that locality
hint is mostly meaningless.
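The "same huge page" reasoning used throughout the series boils down to
rounding offsets down to the host page size.  A minimal standalone sketch
(`round_down()` and `same_huge_page()` are illustrative helpers mirroring the
patch's offset_on_same_huge_page(); page sizes are assumed power-of-two):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round addr down to a power-of-two page size (as QEMU's ROUND_DOWN does). */
static inline uint64_t round_down(uint64_t addr, uint64_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Whether two offsets fall onto the same host huge page. */
static bool same_huge_page(uint64_t a, uint64_t b, uint64_t page_size)
{
    return round_down(a, page_size) == round_down(b, page_size);
}
```

This is the test that decides whether a postcopy request can simply be
dropped because the preempted precopy huge page already covers it.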

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c  |   2 +
 migration/migration.h  |   2 +-
 migration/ram.c        | 250 +++++++++++++++++++++++++++++++++++++++--
 migration/trace-events |   7 ++
 4 files changed, 252 insertions(+), 9 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 3d7f897b72..d20db04097 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3153,6 +3153,8 @@ static int postcopy_start(MigrationState *ms)
                               MIGRATION_STATUS_FAILED);
     }
 
+    trace_postcopy_preempt_enabled(migrate_postcopy_preempt());
+
     return ret;
 
 fail_closefb:
diff --git a/migration/migration.h b/migration/migration.h
index caa910d956..b8aacfe3af 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -68,7 +68,7 @@ typedef struct {
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
     /* Previously received RAM's RAMBlock pointer */
-    RAMBlock *last_recv_block;
+    RAMBlock *last_recv_block[RAM_CHANNEL_MAX];
     /* A hook to allow cleanup at the end of incoming migration */
     void *transport_data;
     void (*transport_cleanup)(void *data);
diff --git a/migration/ram.c b/migration/ram.c
index 713ef6e421..53dfd9be38 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -295,6 +295,20 @@ struct RAMSrcPageRequest {
     QSIMPLEQ_ENTRY(RAMSrcPageRequest) next_req;
 };
 
+typedef struct {
+    /*
+     * Cached ramblock/offset values if preempted.  They're only meaningful if
+     * preempted==true below.
+     */
+    RAMBlock *ram_block;
+    unsigned long ram_page;
+    /*
+     * Whether a postcopy preemption just happened.  Will be reset after
+     * precopy recovered to background migration.
+     */
+    bool preempted;
+} PostcopyPreemptState;
+
 /* State of RAM for migration */
 struct RAMState {
     /* QEMUFile used for this migration */
@@ -349,6 +363,14 @@ struct RAMState {
     /* Queue of outstanding page requests from the destination */
     QemuMutex src_page_req_mutex;
     QSIMPLEQ_HEAD(, RAMSrcPageRequest) src_page_requests;
+
+    /* Postcopy preemption information */
+    PostcopyPreemptState postcopy_preempt_state;
+    /*
+     * Current channel we're using on src VM.  Only valid if postcopy-preempt
+     * is enabled.
+     */
+    unsigned int postcopy_channel;
 };
 typedef struct RAMState RAMState;
 
@@ -356,6 +378,11 @@ static RAMState *ram_state;
 
 static NotifierWithReturnList precopy_notifier_list;
 
+static void postcopy_preempt_reset(RAMState *rs)
+{
+    memset(&rs->postcopy_preempt_state, 0, sizeof(PostcopyPreemptState));
+}
+
 /* Whether postcopy has queued requests? */
 static bool postcopy_has_request(RAMState *rs)
 {
@@ -1947,6 +1974,55 @@ void ram_write_tracking_stop(void)
 }
 #endif /* defined(__linux__) */
 
+/*
+ * Check whether two addresses/offsets of the ramblock fall onto the same
+ * host huge page.  Returns true if so, false otherwise.
+ */
+static bool offset_on_same_huge_page(RAMBlock *rb, uint64_t addr1,
+                                     uint64_t addr2)
+{
+    size_t page_size = qemu_ram_pagesize(rb);
+
+    addr1 = ROUND_DOWN(addr1, page_size);
+    addr2 = ROUND_DOWN(addr2, page_size);
+
+    return addr1 == addr2;
+}
+
+/*
+ * Does a previously preempted precopy huge page contain the currently
+ * requested page?  Returns true if so, false otherwise.
+ *
+ * This should really happen very rarely, because it means that while sending
+ * during background migration for postcopy we were sending exactly the page
+ * that some vcpu faulted on on the dest node.  When it happens, we probably
+ * don't need to do much but drop the request, because we know that right
+ * after we restore the precopy stream it'll be serviced.  It'll slightly
+ * affect the order in which postcopy requests are serviced (e.g. it's the
+ * same as moving the current request to the end of the queue) but it
+ * shouldn't be a big deal.  The most important thing is that we can _never_
+ * try to send a partially-sent huge page on the POSTCOPY channel again,
+ * otherwise that huge page will get "split brain" across the two channels
+ * (PRECOPY, POSTCOPY).
+ */
+static bool postcopy_preempted_contains(RAMState *rs, RAMBlock *block,
+                                        ram_addr_t offset)
+{
+    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
+
+    /* No preemption at all? */
+    if (!state->preempted) {
+        return false;
+    }
+
+    /* Not even the same ramblock? */
+    if (state->ram_block != block) {
+        return false;
+    }
+
+    return offset_on_same_huge_page(block, offset,
+                                    state->ram_page << TARGET_PAGE_BITS);
+}
+
 /**
  * get_queued_page: unqueue a page from the postcopy requests
  *
@@ -1962,9 +2038,17 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
     RAMBlock  *block;
     ram_addr_t offset;
 
+again:
     block = unqueue_page(rs, &offset);
 
-    if (!block) {
+    if (block) {
+        /* See comment above postcopy_preempted_contains() */
+        if (postcopy_preempted_contains(rs, block, offset)) {
+            trace_postcopy_preempt_hit(block->idstr, offset);
+            /* This request is dropped */
+            goto again;
+        }
+    } else {
         /*
          * Poll write faults too if background snapshot is enabled; that's
          * when we have vcpus got blocked by the write protected pages.
@@ -2180,6 +2264,117 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
     return ram_save_page(rs, pss);
 }
 
+static bool postcopy_needs_preempt(RAMState *rs, PageSearchStatus *pss)
+{
+    /* Eager preempt not enabled?  Then never do that. */
+    if (!migrate_postcopy_preempt()) {
+        return false;
+    }
+
+    /* If the ramblock we're sending only has small pages, never bother. */
+    if (qemu_ram_pagesize(pss->block) == TARGET_PAGE_SIZE) {
+        return false;
+    }
+
+    /* Not in postcopy at all? */
+    if (!migration_in_postcopy()) {
+        return false;
+    }
+
+    /*
+     * If we're already handling a postcopy request, don't preempt as this page
+     * has got the same high priority.
+     */
+    if (pss->postcopy_requested) {
+        return false;
+    }
+
+    /* If there are postcopy requests, check them! */
+    return postcopy_has_request(rs);
+}
+
+/* Preempt the precopy stream: cache the current PSS for a later restore */
+static void postcopy_do_preempt(RAMState *rs, PageSearchStatus *pss)
+{
+    PostcopyPreemptState *p_state = &rs->postcopy_preempt_state;
+
+    trace_postcopy_preempt_triggered(pss->block->idstr, pss->page);
+
+    /*
+     * Time to preempt precopy. Cache current PSS into preempt state, so that
+     * after handling the postcopy pages we can recover to it.  We need to do
+     * so because the dest VM will have part of the precopy huge page kept
+     * in its tmp huge page cache; better to move on with it when we can.
+     */
+    p_state->ram_block = pss->block;
+    p_state->ram_page = pss->page;
+    p_state->preempted = true;
+}
+
+/* Whether we're preempted by a postcopy request during sending a huge page */
+static bool postcopy_preempt_triggered(RAMState *rs)
+{
+    return rs->postcopy_preempt_state.preempted;
+}
+
+static void postcopy_preempt_restore(RAMState *rs, PageSearchStatus *pss)
+{
+    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
+
+    assert(state->preempted);
+
+    pss->block = state->ram_block;
+    pss->page = state->ram_page;
+    /* This is not a postcopy request but restoring previous precopy */
+    pss->postcopy_requested = false;
+
+    trace_postcopy_preempt_restored(pss->block->idstr, pss->page);
+
+    /* Reset preempt state, most importantly, set preempted==false */
+    postcopy_preempt_reset(rs);
+}
+
+static void postcopy_preempt_choose_channel(RAMState *rs, PageSearchStatus *pss)
+{
+    MigrationState *s = migrate_get_current();
+    unsigned int channel;
+    QEMUFile *next;
+
+    channel = pss->postcopy_requested ?
+        RAM_CHANNEL_POSTCOPY : RAM_CHANNEL_PRECOPY;
+
+    if (channel != rs->postcopy_channel) {
+        if (channel == RAM_CHANNEL_PRECOPY) {
+            next = s->to_dst_file;
+        } else {
+            next = s->postcopy_qemufile_src;
+        }
+        /* Update and cache the current channel */
+        rs->f = next;
+        rs->postcopy_channel = channel;
+
+        /*
+         * If channel switched, reset last_sent_block since the old sent block
+         * may not be on the same channel.
+         */
+        rs->last_sent_block = NULL;
+
+        trace_postcopy_preempt_switch_channel(channel);
+    }
+
+    trace_postcopy_preempt_send_host_page(pss->block->idstr, pss->page);
+}
+
+/* We need to make sure rs->f always points to the default channel elsewhere */
+static void postcopy_preempt_reset_channel(RAMState *rs)
+{
+    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
+        rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
+        rs->f = migrate_get_current()->to_dst_file;
+        trace_postcopy_preempt_reset_channel();
+    }
+}
+
 /**
  * ram_save_host_page: save a whole host page
  *
@@ -2211,7 +2406,16 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
         return 0;
     }
 
+    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
+        postcopy_preempt_choose_channel(rs, pss);
+    }
+
     do {
+        if (postcopy_needs_preempt(rs, pss)) {
+            postcopy_do_preempt(rs, pss);
+            break;
+        }
+
         /* Check the pages is dirty and if it is send it */
         if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
             tmppages = ram_save_target_page(rs, pss);
@@ -2235,6 +2439,19 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
     /* The offset we leave with is the min boundary of host page and block */
     pss->page = MIN(pss->page, hostpage_boundary);
 
+    /*
+     * When with postcopy preempt mode, flush the data as soon as possible for
+     * postcopy requests, because we've already sent a whole huge page, so the
+     * dst node should already have enough resources to atomically fill in
+     * the current missing page.
+     *
+     * More importantly, when using the separate postcopy channel, we must do
+     * explicit flush or it won't flush until the buffer is full.
+     */
+    if (migrate_postcopy_preempt() && pss->postcopy_requested) {
+        qemu_fflush(rs->f);
+    }
+
     res = ram_save_release_protection(rs, pss, start_page);
     return (res < 0 ? res : pages);
 }
@@ -2276,8 +2493,17 @@ static int ram_find_and_save_block(RAMState *rs)
         found = get_queued_page(rs, &pss);
 
         if (!found) {
-            /* priority queue empty, so just search for something dirty */
-            found = find_dirty_block(rs, &pss, &again);
+            /*
+             * Recover previous precopy ramblock/offset if postcopy has
+             * preempted precopy.  Otherwise find the next dirty bit.
+             */
+            if (postcopy_preempt_triggered(rs)) {
+                postcopy_preempt_restore(rs, &pss);
+                found = true;
+            } else {
+                /* priority queue empty, so just search for something dirty */
+                found = find_dirty_block(rs, &pss, &again);
+            }
         }
 
         if (found) {
@@ -2405,6 +2631,8 @@ static void ram_state_reset(RAMState *rs)
     rs->last_page = 0;
     rs->last_version = ram_list.version;
     rs->xbzrle_enabled = false;
+    postcopy_preempt_reset(rs);
+    rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
@@ -3043,6 +3271,8 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
     }
     qemu_mutex_unlock(&rs->bitmap_mutex);
 
+    postcopy_preempt_reset_channel(rs);
+
     /*
      * Must occur before EOS (or any QEMUFile operation)
      * because of RDMA protocol.
@@ -3112,6 +3342,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         ram_control_after_iterate(f, RAM_CONTROL_FINISH);
     }
 
+    postcopy_preempt_reset_channel(rs);
+
     if (ret >= 0) {
         multifd_send_sync_main(rs->f);
         qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
@@ -3194,11 +3426,13 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
  * @mis: the migration incoming state pointer
  * @f: QEMUFile where to read the data from
  * @flags: Page flags (mostly to see if it's a continuation of previous block)
+ * @channel: the channel we're using
  */
 static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
-                                              QEMUFile *f, int flags)
+                                              QEMUFile *f, int flags,
+                                              int channel)
 {
-    RAMBlock *block = mis->last_recv_block;
+    RAMBlock *block = mis->last_recv_block[channel];
     char id[256];
     uint8_t len;
 
@@ -3225,7 +3459,7 @@ static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
         return NULL;
     }
 
-    mis->last_recv_block = block;
+    mis->last_recv_block[channel] = block;
 
     return block;
 }
@@ -3679,7 +3913,7 @@ int ram_load_postcopy(QEMUFile *f, int channel)
         trace_ram_load_postcopy_loop((uint64_t)addr, flags);
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE)) {
-            block = ram_block_from_stream(mis, f, flags);
+            block = ram_block_from_stream(mis, f, flags, channel);
             if (!block) {
                 ret = -EINVAL;
                 break;
@@ -3930,7 +4164,7 @@ static int ram_load_precopy(QEMUFile *f)
 
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
-            RAMBlock *block = ram_block_from_stream(mis, f, flags);
+            RAMBlock *block = ram_block_from_stream(mis, f, flags, RAM_CHANNEL_PRECOPY);
 
             host = host_from_ram_block_offset(block, addr);
             /*
diff --git a/migration/trace-events b/migration/trace-events
index 4474a76614..b1155d9da0 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -111,6 +111,12 @@ ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRI
 ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
 ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
 unqueue_page(char *block, uint64_t offset, bool dirty) "ramblock '%s' offset 0x%"PRIx64" dirty %d"
+postcopy_preempt_triggered(char *str, unsigned long page) "during sending ramblock %s offset 0x%lx"
+postcopy_preempt_restored(char *str, unsigned long page) "ramblock %s offset 0x%lx"
+postcopy_preempt_hit(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
+postcopy_preempt_send_host_page(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
+postcopy_preempt_switch_channel(int channel) "%d"
+postcopy_preempt_reset_channel(void) ""
 
 # multifd.c
 multifd_new_send_channel_async(uint8_t id) "channel %u"
@@ -176,6 +182,7 @@ migration_thread_low_pending(uint64_t pending) "%" PRIu64
 migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
 process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
 process_incoming_migration_co_postcopy_end_main(void) ""
+postcopy_preempt_enabled(bool value) "%d"
 
 # channel.c
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 19/25] migration: Postcopy recover with preempt enabled
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (17 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 18/25] migration: Postcopy preemption enablement Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 20/25] migration: Create the postcopy preempt channel asynchronously Peter Xu
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

To allow postcopy recovery, the ram fast load (preempt-only) dest QEMU thread
needs similar fault-tolerance handling.  When ram_load_postcopy() fails,
instead of stopping, the thread halts on a semaphore, prepared to be kicked
again when recovery is detected.

A mutex is introduced to make sure there's no concurrent operation upon the
socket.  To keep it simple, the fast ram load thread takes the mutex for its
whole procedure, and only releases it while paused.  With that mutex held,
the main loading thread can safely release the fast-path socket when there
are network failures during postcopy.
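The locking discipline described above can be sketched with POSIX primitives.
This is an illustrative model only (the series uses a QemuMutex plus a
QemuSemaphore; here a condition variable stands in for the pause/kick
semaphore, and all names are invented): the worker holds the mutex for its
whole run and drops it only while paused, so the recovery side can acquire
the mutex exactly when the worker is halted and it's safe to touch the
socket.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t prio_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t resume_cond = PTHREAD_COND_INITIALIZER;
static bool resumed;
static int pages_loaded;

/* Called on channel error: the mutex is dropped only while we sleep. */
static void pause_fast_load(void)
{
    while (!resumed) {
        /* pthread_cond_wait() unlocks prio_mutex while waiting */
        pthread_cond_wait(&resume_cond, &prio_mutex);
    }
}

static void *fast_load_thread(void *opaque)
{
    (void)opaque;
    pthread_mutex_lock(&prio_mutex);  /* held for the whole lifecycle */
    pages_loaded++;                   /* first load attempt "fails"... */
    pause_fast_load();                /* ...so halt until recovery */
    pages_loaded++;                   /* resumed: load succeeds, we're done */
    pthread_mutex_unlock(&prio_mutex);
    return NULL;
}

/* Recovery side: only acquirable while the worker is paused. */
static void kick_fast_load(void)
{
    pthread_mutex_lock(&prio_mutex);
    /* ...this is where the broken channel could be closed safely... */
    resumed = true;
    pthread_cond_signal(&resume_cond);
    pthread_mutex_unlock(&prio_mutex);
}
```

Taking the mutex in kick_fast_load() is the same trick
postcopy_pause_incoming() uses below to know the fast ram load thread has
halted before closing postcopy_qemufile_dst.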

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c    | 27 +++++++++++++++++++++++----
 migration/migration.h    | 19 +++++++++++++++++++
 migration/postcopy-ram.c | 24 ++++++++++++++++++++++--
 migration/qemu-file.c    | 27 +++++++++++++++++++++++++++
 migration/qemu-file.h    |  1 +
 migration/savevm.c       | 21 +++++++++++++++++++--
 migration/trace-events   |  2 ++
 7 files changed, 113 insertions(+), 8 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index d20db04097..69778cab23 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -215,9 +215,11 @@ void migration_object_init(void)
     current_incoming->postcopy_remote_fds =
         g_array_new(FALSE, TRUE, sizeof(struct PostCopyFD));
     qemu_mutex_init(&current_incoming->rp_mutex);
+    qemu_mutex_init(&current_incoming->postcopy_prio_thread_mutex);
     qemu_event_init(&current_incoming->main_thread_load_event, false);
     qemu_sem_init(&current_incoming->postcopy_pause_sem_dst, 0);
     qemu_sem_init(&current_incoming->postcopy_pause_sem_fault, 0);
+    qemu_sem_init(&current_incoming->postcopy_pause_sem_fast_load, 0);
     qemu_mutex_init(&current_incoming->page_request_mutex);
     current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
 
@@ -697,9 +699,9 @@ static bool postcopy_try_recover(void)
 
         /*
          * Here, we only wake up the main loading thread (while the
-         * fault thread will still be waiting), so that we can receive
+         * other threads will still be waiting), so that we can receive
          * commands from source now, and answer it if needed. The
-         * fault thread will be woken up afterwards until we are sure
+         * other threads will be woken up afterwards once we are sure
          * that source is ready to reply to page requests.
          */
         qemu_sem_post(&mis->postcopy_pause_sem_dst);
@@ -3466,6 +3468,18 @@ static MigThrError postcopy_pause(MigrationState *s)
         qemu_file_shutdown(file);
         qemu_fclose(file);
 
+        /*
+         * Do the same to postcopy fast path socket too if there is.  No
+         * locking needed because no racer as long as we do this before setting
+         * status to paused.
+         */
+        if (s->postcopy_qemufile_src) {
+            migration_ioc_unregister_yank_from_file(s->postcopy_qemufile_src);
+            qemu_file_shutdown(s->postcopy_qemufile_src);
+            qemu_fclose(s->postcopy_qemufile_src);
+            s->postcopy_qemufile_src = NULL;
+        }
+
         migrate_set_state(&s->state, s->state,
                           MIGRATION_STATUS_POSTCOPY_PAUSED);
 
@@ -3521,8 +3535,13 @@ static MigThrError migration_detect_error(MigrationState *s)
         return MIG_THR_ERR_FATAL;
     }
 
-    /* Try to detect any file errors */
-    ret = qemu_file_get_error_obj(s->to_dst_file, &local_error);
+    /*
+     * Try to detect any file errors.  Note that postcopy_qemufile_src will
+     * be NULL when postcopy preempt is not enabled.
+     */
+    ret = qemu_file_get_error_obj_any(s->to_dst_file,
+                                      s->postcopy_qemufile_src,
+                                      &local_error);
     if (!ret) {
         /* Everything is fine */
         assert(!local_error);
diff --git a/migration/migration.h b/migration/migration.h
index b8aacfe3af..91f845e9e4 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -118,6 +118,18 @@ struct MigrationIncomingState {
     /* Postcopy priority thread is used to receive postcopy requested pages */
     QemuThread postcopy_prio_thread;
     bool postcopy_prio_thread_created;
+    /*
+     * Used to sync between the ram load main thread and the fast ram load
+     * thread.  It protects postcopy_qemufile_dst, which is the postcopy
+     * fast channel.
+     *
+     * The ram fast load thread will take it mostly for the whole lifecycle
+     * because it needs to continuously read data from the channel, and
+     * it'll only release this mutex if postcopy is interrupted, so that
+     * the ram load main thread will take this mutex over and properly
+     * release the broken channel.
+     */
+    QemuMutex postcopy_prio_thread_mutex;
     /*
      * An array of temp host huge pages to be used, one for each postcopy
      * channel.
@@ -147,6 +159,13 @@ struct MigrationIncomingState {
     /* notify PAUSED postcopy incoming migrations to try to continue */
     QemuSemaphore postcopy_pause_sem_dst;
     QemuSemaphore postcopy_pause_sem_fault;
+    /*
+     * This semaphore allows the ram fast load thread (only used when
+     * postcopy preempt is enabled) to fall asleep when a network
+     * interruption is detected.  When the recovery is done, the main load
+     * thread will kick the fast ram load thread using this semaphore.
+     */
+    QemuSemaphore postcopy_pause_sem_fast_load;
 
     /* List of listening socket addresses  */
     SocketAddressList *socket_address_list;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index df0c02f729..e20305a9e2 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1580,6 +1580,15 @@ int postcopy_preempt_setup(MigrationState *s, Error **errp)
     return 0;
 }
 
+static void postcopy_pause_ram_fast_load(MigrationIncomingState *mis)
+{
+    trace_postcopy_pause_fast_load();
+    qemu_mutex_unlock(&mis->postcopy_prio_thread_mutex);
+    qemu_sem_wait(&mis->postcopy_pause_sem_fast_load);
+    qemu_mutex_lock(&mis->postcopy_prio_thread_mutex);
+    trace_postcopy_pause_fast_load_continued();
+}
+
 void *postcopy_preempt_thread(void *opaque)
 {
     MigrationIncomingState *mis = opaque;
@@ -1592,11 +1601,22 @@ void *postcopy_preempt_thread(void *opaque)
     qemu_sem_post(&mis->thread_sync_sem);
 
     /* Sending RAM_SAVE_FLAG_EOS to terminate this thread */
-    ret = ram_load_postcopy(mis->postcopy_qemufile_dst, RAM_CHANNEL_POSTCOPY);
+    qemu_mutex_lock(&mis->postcopy_prio_thread_mutex);
+    while (1) {
+        ret = ram_load_postcopy(mis->postcopy_qemufile_dst, RAM_CHANNEL_POSTCOPY);
+        /* If error happened, go into recovery routine */
+        if (ret) {
+            postcopy_pause_ram_fast_load(mis);
+        } else {
+            /* We're done */
+            break;
+        }
+    }
+    qemu_mutex_unlock(&mis->postcopy_prio_thread_mutex);
 
     rcu_unregister_thread();
 
     trace_postcopy_preempt_thread_exit();
 
-    return ret == 0 ? NULL : (void *)-1;
+    return NULL;
 }
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 1479cddad9..397652f0ba 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -139,6 +139,33 @@ int qemu_file_get_error_obj(QEMUFile *f, Error **errp)
     return f->last_error;
 }
 
+/*
+ * Get last error for either stream f1 or f2 with optional Error*.
+ * The error returned (non-zero) can be either from f1 or f2.
+ *
+ * If either qemufile is NULL, the check on that file is skipped.
+ *
+ * When neither qemufile has an error, zero is returned.
+ */
+int qemu_file_get_error_obj_any(QEMUFile *f1, QEMUFile *f2, Error **errp)
+{
+    int ret = 0;
+
+    if (f1) {
+        ret = qemu_file_get_error_obj(f1, errp);
+        /* If there's already error detected, return */
+        if (ret) {
+            return ret;
+        }
+    }
+
+    if (f2) {
+        ret = qemu_file_get_error_obj(f2, errp);
+    }
+
+    return ret;
+}
+
 /*
  * Set the last error for stream f with optional Error*
  */
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 3f36d4dc8c..2564e5e1c7 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -156,6 +156,7 @@ void qemu_file_update_transfer(QEMUFile *f, int64_t len);
 void qemu_file_set_rate_limit(QEMUFile *f, int64_t new_rate);
 int64_t qemu_file_get_rate_limit(QEMUFile *f);
 int qemu_file_get_error_obj(QEMUFile *f, Error **errp);
+int qemu_file_get_error_obj_any(QEMUFile *f1, QEMUFile *f2, Error **errp);
 void qemu_file_set_error_obj(QEMUFile *f, int ret, Error *err);
 void qemu_file_set_error(QEMUFile *f, int ret);
 int qemu_file_shutdown(QEMUFile *f);
diff --git a/migration/savevm.c b/migration/savevm.c
index 254aa78234..24b69a1008 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2152,6 +2152,13 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
      */
     qemu_sem_post(&mis->postcopy_pause_sem_fault);
 
+    if (migrate_postcopy_preempt()) {
+        /* The channel should already be setup again; make sure of it */
+        assert(mis->postcopy_qemufile_dst);
+        /* Kick the fast ram load thread too */
+        qemu_sem_post(&mis->postcopy_pause_sem_fast_load);
+    }
+
     return 0;
 }
 
@@ -2607,6 +2614,16 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
     mis->to_src_file = NULL;
     qemu_mutex_unlock(&mis->rp_mutex);
 
+    if (mis->postcopy_qemufile_dst) {
+        qemu_file_shutdown(mis->postcopy_qemufile_dst);
+        /* Take the mutex to make sure the fast ram load thread halted */
+        qemu_mutex_lock(&mis->postcopy_prio_thread_mutex);
+        migration_ioc_unregister_yank_from_file(mis->postcopy_qemufile_dst);
+        qemu_fclose(mis->postcopy_qemufile_dst);
+        mis->postcopy_qemufile_dst = NULL;
+        qemu_mutex_unlock(&mis->postcopy_prio_thread_mutex);
+    }
+
     migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
                       MIGRATION_STATUS_POSTCOPY_PAUSED);
 
@@ -2634,8 +2651,8 @@ retry:
     while (true) {
         section_type = qemu_get_byte(f);
 
-        if (qemu_file_get_error(f)) {
-            ret = qemu_file_get_error(f);
+        ret = qemu_file_get_error_obj_any(f, mis->postcopy_qemufile_dst, NULL);
+        if (ret) {
             break;
         }
 
diff --git a/migration/trace-events b/migration/trace-events
index b1155d9da0..dfe36a3340 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -270,6 +270,8 @@ mark_postcopy_blocktime_begin(uint64_t addr, void *dd, uint32_t time, int cpu, i
 mark_postcopy_blocktime_end(uint64_t addr, void *dd, uint32_t time, int affected_cpu) "addr: 0x%" PRIx64 ", dd: %p, time: %u, affected_cpu: %d"
 postcopy_pause_fault_thread(void) ""
 postcopy_pause_fault_thread_continued(void) ""
+postcopy_pause_fast_load(void) ""
+postcopy_pause_fast_load_continued(void) ""
 postcopy_ram_fault_thread_entry(void) ""
 postcopy_ram_fault_thread_exit(void) ""
 postcopy_ram_fault_thread_fds_core(int baseufd, int quitfd) "ufd: %d quitfd: %d"
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 20/25] migration: Create the postcopy preempt channel asynchronously
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (18 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 19/25] migration: Postcopy recover with preempt enabled Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 21/25] migration: Parameter x-postcopy-preempt-break-huge Peter Xu
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

This patch allows the postcopy preempt channel to be created
asynchronously.  The benefit is that when the connection is slow, we won't
hold the BQL (and potentially block things like QMP) for a long time
without releasing it.

A function postcopy_preempt_wait_channel() is introduced, allowing the
migration thread to wait for the channel creation.  The channel is always
created by the main thread, which posts a new semaphore to tell the
migration thread that the channel has been created.

We'll need to wait for the new channel in two places: (1) when there's a
new postcopy migration that is starting, or (2) when there's a postcopy
migration to resume.

For the start of migration, we don't need to wait for this channel until
we actually want to start postcopy, i.e. in postcopy_start().  We'll fail
the migration if we find that the channel creation failed (which should
rarely happen in practice, because the main channel uses the same network
topology).

For a postcopy recovery, we'll need to wait in postcopy_pause().  In that
case, if the channel creation failed, we can't fail the migration or we'd
crash the VM; instead we stay in the PAUSED state, waiting for yet another
recovery attempt.
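
The post()/wait() handshake described above can be sketched outside of
QEMU as follows.  This is a minimal illustration using POSIX semaphores;
the struct and function names here are made up for the example and merely
stand in for MigrationState, the connect-completion callback, and
postcopy_preempt_wait_channel():

```c
#include <semaphore.h>
#include <stddef.h>

typedef struct {
    sem_t ready;    /* posted once per connection attempt */
    void *channel;  /* NULL means the attempt failed */
} PreemptState;

/* Completion callback, run in the main thread: record the result
 * (possibly NULL on failure) and always kick the waiter. */
static void connect_done(PreemptState *s, void *new_channel)
{
    s->channel = new_channel;
    sem_post(&s->ready);
}

/* Run in the migration thread: block until the main thread has
 * finished the attempt, then report success (0) or failure (-1). */
static int wait_channel(PreemptState *s)
{
    sem_wait(&s->ready);
    return s->channel ? 0 : -1;
}
```

The key property is that the semaphore is posted unconditionally while the
pointer carries the success/failure information, which keeps post() and
wait() balanced across both the start and recovery paths.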

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c    | 16 ++++++++++++
 migration/migration.h    |  7 +++++
 migration/postcopy-ram.c | 56 +++++++++++++++++++++++++++++++---------
 migration/postcopy-ram.h |  1 +
 4 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 69778cab23..78e1e6bfb9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3016,6 +3016,12 @@ static int postcopy_start(MigrationState *ms)
     int64_t bandwidth = migrate_max_postcopy_bandwidth();
     bool restart_block = false;
     int cur_state = MIGRATION_STATUS_ACTIVE;
+
+    if (postcopy_preempt_wait_channel(ms)) {
+        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
+        return -1;
+    }
+
     if (!migrate_pause_before_switchover()) {
         migrate_set_state(&ms->state, MIGRATION_STATUS_ACTIVE,
                           MIGRATION_STATUS_POSTCOPY_ACTIVE);
@@ -3497,6 +3503,14 @@ static MigThrError postcopy_pause(MigrationState *s)
         if (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER) {
             /* Woken up by a recover procedure. Give it a shot */
 
+            if (postcopy_preempt_wait_channel(s)) {
+                /*
+                 * Preempt enabled, and new channel creation failed; loop
+                 * back to wait for another recovery.
+                 */
+                continue;
+            }
+
             /*
              * Firstly, let's wake up the return path now, with a new
              * return path channel.
@@ -4356,6 +4370,7 @@ static void migration_instance_finalize(Object *obj)
     qemu_sem_destroy(&ms->postcopy_pause_sem);
     qemu_sem_destroy(&ms->postcopy_pause_rp_sem);
     qemu_sem_destroy(&ms->rp_state.rp_sem);
+    qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
     error_free(ms->error);
 }
 
@@ -4402,6 +4417,7 @@ static void migration_instance_init(Object *obj)
     qemu_sem_init(&ms->rp_state.rp_sem, 0);
     qemu_sem_init(&ms->rate_limit_sem, 0);
     qemu_sem_init(&ms->wait_unplug_sem, 0);
+    qemu_sem_init(&ms->postcopy_qemufile_src_sem, 0);
     qemu_mutex_init(&ms->qemu_file_lock);
 }
 
diff --git a/migration/migration.h b/migration/migration.h
index 91f845e9e4..f898b8547a 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -219,6 +219,13 @@ struct MigrationState {
     QEMUFile *to_dst_file;
     /* Postcopy specific transfer channel */
     QEMUFile *postcopy_qemufile_src;
+    /*
+     * It is posted when the preempt channel is established.  Note: this is
+     * used for both the start and recovery of a postcopy migration.  We'll
+     * post to this sem every time a new preempt channel is created in the
+     * main thread, and we keep post() and wait() paired.
+     */
+    QemuSemaphore postcopy_qemufile_src_sem;
     QIOChannelBuffer *bioc;
     /*
      * Protects to_dst_file/from_dst_file pointers.  We need to make sure we
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index e20305a9e2..3ead5b1b3c 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1552,10 +1552,50 @@ bool postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file)
     return true;
 }
 
-int postcopy_preempt_setup(MigrationState *s, Error **errp)
+static void
+postcopy_preempt_send_channel_new(QIOTask *task, gpointer opaque)
 {
-    QIOChannel *ioc;
+    MigrationState *s = opaque;
+    QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
+    Error *local_err = NULL;
+
+    if (qio_task_propagate_error(task, &local_err)) {
+        /* Something went wrong */
+        object_unref(OBJECT(ioc));
+        migrate_set_error(migrate_get_current(), local_err);
+        error_free(local_err);
+    } else {
+        migration_ioc_register_yank(ioc);
+        s->postcopy_qemufile_src = qemu_fopen_channel_output(ioc);
+        trace_postcopy_preempt_new_channel();
+    }
+
+    /*
+     * Kick the waiter in all cases.  The waiter should check upon
+     * postcopy_qemufile_src to know whether it failed or not.
+     */
+    qemu_sem_post(&s->postcopy_qemufile_src_sem);
+}
 
+/* Returns 0 if channel established, -1 for error. */
+int postcopy_preempt_wait_channel(MigrationState *s)
+{
+    /* If preempt not enabled, no need to wait */
+    if (!migrate_postcopy_preempt()) {
+        return 0;
+    }
+
+    /*
+     * We need the postcopy preempt channel to be established before
+     * starting doing anything.
+     */
+    qemu_sem_wait(&s->postcopy_qemufile_src_sem);
+
+    return s->postcopy_qemufile_src ? 0 : -1;
+}
+
+int postcopy_preempt_setup(MigrationState *s, Error **errp)
+{
     if (!migrate_postcopy_preempt()) {
         return 0;
     }
@@ -1566,16 +1606,8 @@ int postcopy_preempt_setup(MigrationState *s, Error **errp)
         return -1;
     }
 
-    ioc = socket_send_channel_create_sync(errp);
-
-    if (ioc == NULL) {
-        return -1;
-    }
-
-    migration_ioc_register_yank(ioc);
-    s->postcopy_qemufile_src = qemu_fopen_channel_output(ioc);
-
-    trace_postcopy_preempt_new_channel();
+    /* Kick an async task to connect */
+    socket_send_channel_create(postcopy_preempt_send_channel_new, s);
 
     return 0;
 }
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 34b1080cde..6147bf7d1d 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -192,5 +192,6 @@ enum PostcopyChannels {
 
 bool postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
 int postcopy_preempt_setup(MigrationState *s, Error **errp);
+int postcopy_preempt_wait_channel(MigrationState *s);
 
 #endif
-- 
2.32.0




* [PATCH v2 21/25] migration: Parameter x-postcopy-preempt-break-huge
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (19 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 20/25] migration: Create the postcopy preempt channel asynchronously Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 22/25] migration: Add helpers to detect TLS capability Peter Xu
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Add a parameter that can conditionally disable the "break sending huge
page" behavior in postcopy preemption.  By default it's enabled.

It should only be used for debugging purposes, and we should never remove
the "x-" prefix.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 2 ++
 migration/migration.h | 7 +++++++
 migration/ram.c       | 7 +++++++
 3 files changed, 16 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 78e1e6bfb9..cd4a150202 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -4325,6 +4325,8 @@ static Property migration_properties[] = {
     DEFINE_PROP_SIZE("announce-step", MigrationState,
                       parameters.announce_step,
                       DEFAULT_MIGRATE_ANNOUNCE_STEP),
+    DEFINE_PROP_BOOL("x-postcopy-preempt-break-huge", MigrationState,
+                      postcopy_preempt_break_huge, true),
 
     /* Migration capabilities */
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
diff --git a/migration/migration.h b/migration/migration.h
index f898b8547a..6ee520642f 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -340,6 +340,13 @@ struct MigrationState {
     bool send_configuration;
     /* Whether we send section footer during migration */
     bool send_section_footer;
+    /*
+     * Whether we allow breaking a huge page send when postcopy preempt is
+     * enabled.  When disabled, we won't interrupt precopy while sending a
+     * host huge page, which is the old behavior of vanilla postcopy.
+     * NOTE: this parameter is ignored if postcopy preempt is not enabled.
+     */
+    bool postcopy_preempt_break_huge;
 
     /* Needed by postcopy-pause state */
     QemuSemaphore postcopy_pause_sem;
diff --git a/migration/ram.c b/migration/ram.c
index 53dfd9be38..ede8aaac01 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2266,11 +2266,18 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
 
 static bool postcopy_needs_preempt(RAMState *rs, PageSearchStatus *pss)
 {
+    MigrationState *ms = migrate_get_current();
+
     /* Not enabled eager preempt?  Then never do that. */
     if (!migrate_postcopy_preempt()) {
         return false;
     }
 
+    /* If the user explicitly disabled breaking of huge pages, skip */
+    if (!ms->postcopy_preempt_break_huge) {
+        return false;
+    }
+
     /* If the ramblock we're sending is a small page?  Never bother. */
     if (qemu_ram_pagesize(pss->block) == TARGET_PAGE_SIZE) {
         return false;
-- 
2.32.0




* [PATCH v2 22/25] migration: Add helpers to detect TLS capability
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (20 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 21/25] migration: Parameter x-postcopy-preempt-break-huge Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 23/25] migration: Fail postcopy preempt with TLS for now Peter Xu
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Add migrate_tls_enabled() to detect whether TLS is configured.

Add migrate_channel_requires_tls() to detect whether the specific channel
requires TLS.

No functional change intended.
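
As a rough standalone sketch (not QEMU code; the structs below are
illustrative stand-ins for the migration parameters and for the
TYPE_QIO_CHANNEL_TLS type check), the two predicates reduce to:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *tls_creds;  /* NULL or "" means TLS is not configured */
} Params;

typedef struct {
    bool is_tls;  /* stands in for the TYPE_QIO_CHANNEL_TLS cast */
} Channel;

/* TLS is enabled when credentials are configured and non-empty. */
static bool tls_enabled(const Params *p)
{
    return p->tls_creds && *p->tls_creds;
}

/* A channel needs a TLS handshake when TLS is enabled but the channel
 * has not already been wrapped in a TLS layer. */
static bool channel_requires_tls(const Params *p, const Channel *ioc)
{
    return tls_enabled(p) && !ioc->is_tls;
}
```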

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/channel.c   | 10 ++--------
 migration/migration.c | 17 +++++++++++++++++
 migration/migration.h |  4 ++++
 migration/multifd.c   |  7 +------
 4 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/migration/channel.c b/migration/channel.c
index c4fc000a1a..85ac053275 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -38,10 +38,7 @@ void migration_channel_process_incoming(QIOChannel *ioc)
     trace_migration_set_incoming_channel(
         ioc, object_get_typename(OBJECT(ioc)));
 
-    if (s->parameters.tls_creds &&
-        *s->parameters.tls_creds &&
-        !object_dynamic_cast(OBJECT(ioc),
-                             TYPE_QIO_CHANNEL_TLS)) {
+    if (migrate_channel_requires_tls(ioc)) {
         migration_tls_channel_process_incoming(s, ioc, &local_err);
     } else {
         migration_ioc_register_yank(ioc);
@@ -71,10 +68,7 @@ void migration_channel_connect(MigrationState *s,
         ioc, object_get_typename(OBJECT(ioc)), hostname, error);
 
     if (!error) {
-        if (s->parameters.tls_creds &&
-            *s->parameters.tls_creds &&
-            !object_dynamic_cast(OBJECT(ioc),
-                                 TYPE_QIO_CHANNEL_TLS)) {
+        if (migrate_channel_requires_tls(ioc)) {
             migration_tls_channel_connect(s, ioc, hostname, &error);
 
             if (!error) {
diff --git a/migration/migration.c b/migration/migration.c
index cd4a150202..f30bad982c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -49,6 +49,7 @@
 #include "trace.h"
 #include "exec/target_page.h"
 #include "io/channel-buffer.h"
+#include "io/channel-tls.h"
 #include "migration/colo.h"
 #include "hw/boards.h"
 #include "hw/qdev-properties.h"
@@ -4246,6 +4247,22 @@ void migration_global_dump(Monitor *mon)
                    ms->clear_bitmap_shift);
 }
 
+bool migrate_tls_enabled(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->parameters.tls_creds && *s->parameters.tls_creds;
+}
+
+bool migrate_channel_requires_tls(QIOChannel *ioc)
+{
+    if (!migrate_tls_enabled()) {
+        return false;
+    }
+
+    return !object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_TLS);
+}
+
 #define DEFINE_PROP_MIG_CAP(name, x)             \
     DEFINE_PROP_BOOL(name, MigrationState, enabled_capabilities[x], false)
 
diff --git a/migration/migration.h b/migration/migration.h
index 6ee520642f..8b9ad7fe31 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -436,6 +436,10 @@ bool migrate_use_events(void);
 bool migrate_postcopy_blocktime(void);
 bool migrate_background_snapshot(void);
 bool migrate_postcopy_preempt(void);
+/* Whether TLS is enabled for migration */
+bool migrate_tls_enabled(void);
+/* Whether the QIO channel requires a further TLS handshake */
+bool migrate_channel_requires_tls(QIOChannel *ioc);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_shut(MigrationIncomingState *mis,
diff --git a/migration/multifd.c b/migration/multifd.c
index 180586dcde..46dfcbfa1d 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -784,16 +784,11 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
                                     QIOChannel *ioc,
                                     Error *error)
 {
-    MigrationState *s = migrate_get_current();
-
     trace_multifd_set_outgoing_channel(
         ioc, object_get_typename(OBJECT(ioc)), p->tls_hostname, error);
 
     if (!error) {
-        if (s->parameters.tls_creds &&
-            *s->parameters.tls_creds &&
-            !object_dynamic_cast(OBJECT(ioc),
-                                 TYPE_QIO_CHANNEL_TLS)) {
+        if (migrate_channel_requires_tls(ioc)) {
             multifd_tls_channel_connect(p, ioc, &error);
             if (!error) {
                 /*
-- 
2.32.0




* [PATCH v2 23/25] migration: Fail postcopy preempt with TLS for now
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (21 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 22/25] migration: Add helpers to detect TLS capability Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 24/25] tests: Add postcopy preempt test Peter Xu
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

The support is not there yet.  Until it is, fail the operation properly
when starting postcopy.  Failing at postcopy-start still allows the user
to proceed with e.g. a pure TLS precopy even if postcopy-ram is set.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index f30bad982c..95cfc483c9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1802,6 +1802,12 @@ void qmp_migrate_start_postcopy(Error **errp)
                          " started");
         return;
     }
+
+    if (migrate_postcopy_preempt() && migrate_tls_enabled()) {
+        error_setg(errp, "Postcopy preemption does not support TLS yet");
+        return;
+    }
+
     /*
      * we don't error if migration has finished since that would be racy
      * with issuing this command.
-- 
2.32.0




* [PATCH v2 24/25] tests: Add postcopy preempt test
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (22 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 23/25] migration: Fail postcopy preempt with TLS for now Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-01  8:39 ` [PATCH v2 25/25] tests: Pass in MigrateStart** into test_migrate_start() Peter Xu
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Two tests are added: a normal postcopy preempt test, and a recovery test.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tests/qtest/migration-test.c | 41 ++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 7b42f6fd90..09a9ce4401 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -470,6 +470,7 @@ typedef struct {
      */
     bool hide_stderr;
     bool use_shmem;
+    bool postcopy_preempt;
     /* only launch the target process */
     bool only_target;
     /* Use dirty ring if true; dirty logging otherwise */
@@ -663,6 +664,8 @@ static int migrate_postcopy_prepare(QTestState **from_ptr,
                                     MigrateStart *args)
 {
     g_autofree char *uri = g_strdup_printf("unix:%s/migsocket", tmpfs);
+    /* NOTE: args will be freed in test_migrate_start(), cache it */
+    bool postcopy_preempt = args->postcopy_preempt;
     QTestState *from, *to;
 
     if (test_migrate_start(&from, &to, uri, args)) {
@@ -673,6 +676,11 @@ static int migrate_postcopy_prepare(QTestState **from_ptr,
     migrate_set_capability(to, "postcopy-ram", true);
     migrate_set_capability(to, "postcopy-blocktime", true);
 
+    if (postcopy_preempt) {
+        migrate_set_capability(from, "postcopy-preempt", true);
+        migrate_set_capability(to, "postcopy-preempt", true);
+    }
+
     /* We want to pick a speed slow enough that the test completes
      * quickly, but that it doesn't complete precopy even on a slow
      * machine, so also set the downtime.
@@ -719,13 +727,29 @@ static void test_postcopy(void)
     migrate_postcopy_complete(from, to);
 }
 
-static void test_postcopy_recovery(void)
+static void test_postcopy_preempt(void)
+{
+    MigrateStart *args = migrate_start_new();
+    QTestState *from, *to;
+
+    args->postcopy_preempt = true;
+
+    if (migrate_postcopy_prepare(&from, &to, args)) {
+        return;
+    }
+    migrate_postcopy_start(from, to);
+    migrate_postcopy_complete(from, to);
+}
+
+/* @preempt: whether to use postcopy-preempt */
+static void test_postcopy_recovery(bool preempt)
 {
     MigrateStart *args = migrate_start_new();
     QTestState *from, *to;
     g_autofree char *uri = NULL;
 
     args->hide_stderr = true;
+    args->postcopy_preempt = preempt;
 
     if (migrate_postcopy_prepare(&from, &to, args)) {
         return;
@@ -781,6 +805,16 @@ static void test_postcopy_recovery(void)
     migrate_postcopy_complete(from, to);
 }
 
+static void test_postcopy_recovery_normal(void)
+{
+    test_postcopy_recovery(false);
+}
+
+static void test_postcopy_recovery_preempt(void)
+{
+    test_postcopy_recovery(true);
+}
+
 static void test_baddest(void)
 {
     MigrateStart *args = migrate_start_new();
@@ -1458,7 +1492,10 @@ int main(int argc, char **argv)
     module_call_init(MODULE_INIT_QOM);
 
     qtest_add_func("/migration/postcopy/unix", test_postcopy);
-    qtest_add_func("/migration/postcopy/recovery", test_postcopy_recovery);
+    qtest_add_func("/migration/postcopy/recovery", test_postcopy_recovery_normal);
+    qtest_add_func("/migration/postcopy/preempt/unix", test_postcopy_preempt);
+    qtest_add_func("/migration/postcopy/preempt/recovery",
+                   test_postcopy_recovery_preempt);
     qtest_add_func("/migration/bad_dest", test_baddest);
     qtest_add_func("/migration/precopy/unix", test_precopy_unix);
     qtest_add_func("/migration/precopy/tcp", test_precopy_tcp);
-- 
2.32.0




* [PATCH v2 25/25] tests: Pass in MigrateStart** into test_migrate_start()
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (23 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 24/25] tests: Add postcopy preempt test Peter Xu
@ 2022-03-01  8:39 ` Peter Xu
  2022-03-02 12:11   ` Dr. David Alan Gilbert
  2022-03-01  9:25 ` [PATCH v2 00/25] migration: Postcopy Preemption Daniel P. Berrangé
  2022-03-02 12:14 ` Dr. David Alan Gilbert
  26 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2022-03-01  8:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

test_migrate_start() releases the MigrateStart structure that is passed
in.  However, that's not obvious to the caller, because after the call
returns the pointer can still be referenced; it can easily become a
source of use-after-free bugs.

Let's pass in a double pointer instead, so we can safely clear the
pointer for the caller after the struct is released.
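
The ownership idiom can be illustrated with a minimal sketch (the names
below are made up and are not the test code itself): the callee takes a
double pointer so it can free the struct and clear the caller's copy in
one place.

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct {
    int hide_stderr;
} Args;

/* Consumes *pargs: reads what it needs, frees the struct, and NULLs
 * out the caller's pointer so it cannot dangle afterwards. */
static int start_test(Args **pargs)
{
    Args *args = *pargs;
    int ret = args->hide_stderr ? -1 : 0;

    free(args);
    *pargs = NULL;
    return ret;
}
```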

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tests/qtest/migration-test.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 09a9ce4401..67f0601988 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -496,7 +496,7 @@ static void migrate_start_destroy(MigrateStart *args)
 }
 
 static int test_migrate_start(QTestState **from, QTestState **to,
-                              const char *uri, MigrateStart *args)
+                              const char *uri, MigrateStart **pargs)
 {
     g_autofree gchar *arch_source = NULL;
     g_autofree gchar *arch_target = NULL;
@@ -508,6 +508,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
     g_autofree char *shmem_path = NULL;
     const char *arch = qtest_get_arch();
     const char *machine_opts = NULL;
+    MigrateStart *args = *pargs;
     const char *memory_size;
     int ret = 0;
 
@@ -622,6 +623,8 @@ static int test_migrate_start(QTestState **from, QTestState **to,
 
 out:
     migrate_start_destroy(args);
+    /* This tells the caller that this structure is gone */
+    *pargs = NULL;
     return ret;
 }
 
@@ -668,7 +671,7 @@ static int migrate_postcopy_prepare(QTestState **from_ptr,
     bool postcopy_preempt = args->postcopy_preempt;
     QTestState *from, *to;
 
-    if (test_migrate_start(&from, &to, uri, args)) {
+    if (test_migrate_start(&from, &to, uri, &args)) {
         return -1;
     }
 
@@ -822,7 +825,7 @@ static void test_baddest(void)
 
     args->hide_stderr = true;
 
-    if (test_migrate_start(&from, &to, "tcp:127.0.0.1:0", args)) {
+    if (test_migrate_start(&from, &to, "tcp:127.0.0.1:0", &args)) {
         return;
     }
     migrate_qmp(from, "tcp:127.0.0.1:0", "{}");
@@ -838,7 +841,7 @@ static void test_precopy_unix_common(bool dirty_ring)
 
     args->use_dirty_ring = dirty_ring;
 
-    if (test_migrate_start(&from, &to, uri, args)) {
+    if (test_migrate_start(&from, &to, uri, &args)) {
         return;
     }
 
@@ -926,7 +929,7 @@ static void test_xbzrle(const char *uri)
     MigrateStart *args = migrate_start_new();
     QTestState *from, *to;
 
-    if (test_migrate_start(&from, &to, uri, args)) {
+    if (test_migrate_start(&from, &to, uri, &args)) {
         return;
     }
 
@@ -980,7 +983,7 @@ static void test_precopy_tcp(void)
     g_autofree char *uri = NULL;
     QTestState *from, *to;
 
-    if (test_migrate_start(&from, &to, "tcp:127.0.0.1:0", args)) {
+    if (test_migrate_start(&from, &to, "tcp:127.0.0.1:0", &args)) {
         return;
     }
 
@@ -1025,7 +1028,7 @@ static void test_migrate_fd_proto(void)
     QDict *rsp;
     const char *error_desc;
 
-    if (test_migrate_start(&from, &to, "defer", args)) {
+    if (test_migrate_start(&from, &to, "defer", &args)) {
         return;
     }
 
@@ -1105,7 +1108,7 @@ static void do_test_validate_uuid(MigrateStart *args, bool should_fail)
     g_autofree char *uri = g_strdup_printf("unix:%s/migsocket", tmpfs);
     QTestState *from, *to;
 
-    if (test_migrate_start(&from, &to, uri, args)) {
+    if (test_migrate_start(&from, &to, uri, &args)) {
         return;
     }
 
@@ -1197,7 +1200,7 @@ static void test_migrate_auto_converge(void)
      */
     const int64_t expected_threshold = max_bandwidth * downtime_limit / 1000;
 
-    if (test_migrate_start(&from, &to, uri, args)) {
+    if (test_migrate_start(&from, &to, uri, &args)) {
         return;
     }
 
@@ -1266,7 +1269,7 @@ static void test_multifd_tcp(const char *method)
     QDict *rsp;
     g_autofree char *uri = NULL;
 
-    if (test_migrate_start(&from, &to, "defer", args)) {
+    if (test_migrate_start(&from, &to, "defer", &args)) {
         return;
     }
 
@@ -1352,7 +1355,7 @@ static void test_multifd_tcp_cancel(void)
 
     args->hide_stderr = true;
 
-    if (test_migrate_start(&from, &to, "defer", args)) {
+    if (test_migrate_start(&from, &to, "defer", &args)) {
         return;
     }
 
@@ -1391,7 +1394,7 @@ static void test_multifd_tcp_cancel(void)
     args = migrate_start_new();
     args->only_target = true;
 
-    if (test_migrate_start(&from, &to2, "defer", args)) {
+    if (test_migrate_start(&from, &to2, "defer", &args)) {
         return;
     }
 
-- 
2.32.0




* Re: [PATCH v2 00/25] migration: Postcopy Preemption
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (24 preceding siblings ...)
  2022-03-01  8:39 ` [PATCH v2 25/25] tests: Pass in MigrateStart** into test_migrate_start() Peter Xu
@ 2022-03-01  9:25 ` Daniel P. Berrangé
  2022-03-01 10:17   ` Peter Xu
  2022-03-02 12:14 ` Dr. David Alan Gilbert
  26 siblings, 1 reply; 47+ messages in thread
From: Daniel P. Berrangé @ 2022-03-01  9:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: Leonardo Bras Soares Passos, qemu-devel, Dr . David Alan Gilbert,
	Juan Quintela

On Tue, Mar 01, 2022 at 04:39:00PM +0800, Peter Xu wrote:
> This is v2 of postcopy preempt series.  It can also be found here:
> 
>   https://github.com/xzpeter/qemu/tree/postcopy-preempt
> 
> RFC: https://lore.kernel.org/qemu-devel/20220119080929.39485-1-peterx@redhat.com
> V1:  https://lore.kernel.org/qemu-devel/20220216062809.57179-1-peterx@redhat.com
> 
> v1->v2 changelog:
> - Picked up more r-bs from Dave
> - Rename both fault threads to drop "qemu/" prefix [Dave]
> - Further rework on postcopy recovery, to be able to detect qemufile errors
>   from either main channel or postcopy one [Dave]
> - shutdown() qemufile before close on src postcopy channel when postcopy is
>   paused [Dave]
> - In postcopy_preempt_new_channel(), explicitly set the new channel in
>   blocking state, even if it's the default [Dave]
> - Make RAMState.postcopy_channel unsigned int [Dave]
> - Added patches:
>   - "migration: Create the postcopy preempt channel asynchronously"
>   - "migration: Parameter x-postcopy-preempt-break-huge"
>   - "migration: Add helpers to detect TLS capability"
>   - "migration: Fail postcopy preempt with TLS"
>   - "tests: Pass in MigrateStart** into test_migrate_start()"
> 
> Abstract
> ========
> 
> This series added a new migration capability called "postcopy-preempt".  It can
> be enabled when postcopy is enabled, and it'll simply (but greatly) speed up
> postcopy page requests handling process.

Is there no way we can just automatically enable this new feature, rather
than requiring apps to specify yet another new flag ?

> TODO List
> =========
> 
> TLS support
> -----------
> 
> I only noticed its missing very recently.  Since soft freeze is coming, and
> obviously I'm still growing this series, so I tend to have the existing
> material discussed. Let's see if it can still catch the train for QEMU 7.0
> release (soft freeze on 2022-03-08)..

I don't like the idea of shipping something that is only half finished.
It means that when apps probe for the feature, they'll see preempt
capability present, but have no idea whether they're using a QEMU that
is broken when combined with TLS or not. We shouldn't merge something
just to meet the soft freeze deadline if we know key features are broken.

> Multi-channel for preemption threads
> ------------------------------------
> 
> Currently the postcopy preempt feature use only one extra channel and one
> extra thread on dest (no new thread on src QEMU).  It should be mostly good
> enough for major use cases, but when the postcopy queue is long enough
> (e.g. hundreds of vCPUs faulted on different pages) logically we could
> still observe more delays in average.  Whether growing threads/channels can
> solve it is debatable, but sounds worthwhile a try.  That's yet another
> thing we can think about after this patchset lands.

If we don't think about it upfront, then we'll possibly end up with
yet another tunable flag that apps have to worry about. It also
could make migration code even more complex if we have to support
two different scenarios. If we think multiple threads are going to
be a benefit, let's check that and, if so, design it into the exposed
application facing interface from the start rather than retrofitting
afterwards.

> Logically the design provides space for that - the receiving postcopy
> preempt thread can understand all ram-layer migration protocol, and for
> multi channel and multi threads we could simply grow that into multiple
> threads handling the same protocol (with multiple PostcopyTmpPage).  The
> source needs more thoughts on synchronizations, though, but it shouldn't
> affect the whole protocol layer, so should be easy to keep compatible.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 00/25] migration: Postcopy Preemption
  2022-03-01  9:25 ` [PATCH v2 00/25] migration: Postcopy Preemption Daniel P. Berrangé
@ 2022-03-01 10:17   ` Peter Xu
  2022-03-01 10:27     ` Daniel P. Berrangé
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2022-03-01 10:17 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Leonardo Bras Soares Passos, qemu-devel, Dr . David Alan Gilbert,
	Juan Quintela

On Tue, Mar 01, 2022 at 09:25:55AM +0000, Daniel P. Berrangé wrote:
> On Tue, Mar 01, 2022 at 04:39:00PM +0800, Peter Xu wrote:
> > This is v2 of postcopy preempt series.  It can also be found here:
> > 
> >   https://github.com/xzpeter/qemu/tree/postcopy-preempt
> > 
> > RFC: https://lore.kernel.org/qemu-devel/20220119080929.39485-1-peterx@redhat.com
> > V1:  https://lore.kernel.org/qemu-devel/20220216062809.57179-1-peterx@redhat.com
> > 
> > v1->v2 changelog:
> > - Picked up more r-bs from Dave
> > - Rename both fault threads to drop "qemu/" prefix [Dave]
> > - Further rework on postcopy recovery, to be able to detect qemufile errors
> >   from either main channel or postcopy one [Dave]
> > - shutdown() qemufile before close on src postcopy channel when postcopy is
> >   paused [Dave]
> > - In postcopy_preempt_new_channel(), explicitly set the new channel in
> >   blocking state, even if it's the default [Dave]
> > - Make RAMState.postcopy_channel unsigned int [Dave]
> > - Added patches:
> >   - "migration: Create the postcopy preempt channel asynchronously"
> >   - "migration: Parameter x-postcopy-preempt-break-huge"
> >   - "migration: Add helpers to detect TLS capability"
> >   - "migration: Fail postcopy preempt with TLS"
> >   - "tests: Pass in MigrateStart** into test_migrate_start()"
> > 
> > Abstract
> > ========
> > 
> > This series adds a new migration capability called "postcopy-preempt".  It can
> > be enabled when postcopy is enabled, and it simply (but greatly) speeds up
> > the postcopy page request handling process.
> 
> Is there no way we can just automatically enable this new feature, rather
> than requiring apps to specify yet another new flag ?

I didn't make it the default for now, but I have thought about making it
the default once it stabilizes a bit, perhaps on a new machine type.

I also don't know whether there are other limitations to it.  For example,
will a new socket pair be a problem for any VM environment (either a
limitation from the management app, container, and so on)?  I think it's
the same as multifd in that aspect, but I never explored it.

> 
> > TODO List
> > =========
> > 
> > TLS support
> > -----------
> > 
> > I only noticed it's missing very recently.  Since soft freeze is coming, and
> > obviously I'm still growing this series, so I tend to have the existing
> > material discussed. Let's see if it can still catch the train for QEMU 7.0
> > release (soft freeze on 2022-03-08)..
> 
> I don't like the idea of shipping something that is only half finished.
> It means that when apps probe for the feature, they'll see preempt
> capability present, but have no idea whether they're using a QEMU that
> is broken when combined with TLS or not. We shouldn't merge something
> just to meet the soft freeze deadline if we know key features are broken.

IMHO merging and declaring support are two separate problems.

To me, it's always fine to merge the code that implements the foundation of a
feature.  The feature can be worked upon in the future.

Requiring a feature to be "complete" can sometimes burden not only
the author of the series but also the reviewers.  IMHO it's not necessary
to bind these two ideas together.

It's sometimes also hard to define "complete": take TLS as an example,
probably no one even noticed that it won't work with TLS, and I only noticed
it in the past two days.  We obviously can't merge a partial patchset, but if
the patchset is well isolated, then it's not a blocker for merging, IMHO.

Per my understanding, what you're worried about is that we declare it
supported but then never know when TLS will be ready for it.  One solution is
that I can rename the capability with an x- prefix, then drop the x- prefix
once the TLS side is ready.  Then Libvirt or any mgmt software doesn't need
to support this until we drop the x-, so there's no compatibility risk.
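Purely as an illustration of that convention (the final capability name here
is an assumption, not a merged name), the experimental flag would be toggled
over QMP with the usual migrate-set-capabilities command, and mgmt apps would
simply skip any capability still carrying the x- prefix:

```python
import json

def set_capability_cmd(name: str, state: bool) -> str:
    """Build a QMP 'migrate-set-capabilities' command as a JSON string.

    The capability name 'x-postcopy-preempt' used below is hypothetical;
    the 'x-' prefix is QEMU's convention for unstable/experimental knobs.
    """
    cmd = {
        "execute": "migrate-set-capabilities",
        "arguments": {
            "capabilities": [{"capability": name, "state": state}],
        },
    }
    return json.dumps(cmd)

# While experimental, a mgmt app that opts in would send:
cmd = set_capability_cmd("x-postcopy-preempt", True)
# After the TLS gap is closed, the x- would be dropped:
# set_capability_cmd("postcopy-preempt", True)
```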

Would that sound okay to you?

I can always step back and work on TLS first before it's merged, but again
I don't think it's required.

> 
> > Multi-channel for preemption threads
> > ------------------------------------
> > 
> > Currently the postcopy preempt feature uses only one extra channel and one
> > extra thread on dest (no new thread on src QEMU).  It should be mostly good
> > enough for major use cases, but when the postcopy queue is long enough
> > (e.g. hundreds of vCPUs faulted on different pages) logically we could
> > still observe more delays on average.  Whether growing threads/channels can
> > solve it is debatable, but it sounds worth a try.  That's yet another
> > thing we can think about after this patchset lands.
> 
> If we don't think about it upfront, then we'll possibly end up with
> yet another tunable flag that apps have to worry about. It also
> could make migration code even more complex if we have to support
> two different scenarios. If we think multiple threads are going to
> be a benefit, let's check that and, if so, design it into the exposed
> application-facing interface from the start rather than retrofitting
> afterwards.

I am not very sure we need that multi-threading.  I'm always open to
opinions, though.

I raised it not as a "must do" but as an idea I have in mind.  I'm not sure
whether we'll have it some day.  Note that this is different from multifd,
where obviously only 1 thread doesn't sound right at all.  Postcopy is
different, because in most cases the new thread can be idle.

That's also why this series didn't start with N threads but 1 thread only,
because AFAICT it solves the major problem, which is the isolation of
page-request pages from background pages.  One thread is a very good fit in
many scenarios already.

So without obvious reasoning, I'm afraid it could over-engineer things if
we start with N threads, keeping most of them idle.

Even if we need it, we'll need a parameter to specify the number of
threads; that can be a good way to identify the boundary between old and new
QEMUs, so e.g. upper layers like Libvirt can easily detect the difference
if they want, by probing for the n_channels parameter.
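A sketch of that probing idea (the parameter name below is purely
hypothetical, not an existing QEMU parameter): the upper layer inspects the
query-migrate-parameters reply and treats the presence of the new key as the
old/new boundary.

```python
import json

def supports_multi_preempt(qmp_reply: str) -> bool:
    """Return True if the hypothetical 'postcopy-preempt-channels'
    parameter appears in a 'query-migrate-parameters' reply, i.e. the
    QEMU is new enough to grow multiple preempt threads/channels."""
    params = json.loads(qmp_reply).get("return", {})
    return "postcopy-preempt-channels" in params

# Sample replies an upper layer might see (contents invented for the sketch):
old_qemu = '{"return": {"downtime-limit": 300}}'
new_qemu = '{"return": {"downtime-limit": 300, "postcopy-preempt-channels": 4}}'
```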

Thanks,

-- 
Peter Xu




* Re: [PATCH v2 00/25] migration: Postcopy Preemption
  2022-03-01 10:17   ` Peter Xu
@ 2022-03-01 10:27     ` Daniel P. Berrangé
  2022-03-01 10:55       ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Daniel P. Berrangé @ 2022-03-01 10:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: Leonardo Bras Soares Passos, qemu-devel, Dr . David Alan Gilbert,
	Juan Quintela

On Tue, Mar 01, 2022 at 06:17:49PM +0800, Peter Xu wrote:
> On Tue, Mar 01, 2022 at 09:25:55AM +0000, Daniel P. Berrangé wrote:
> > On Tue, Mar 01, 2022 at 04:39:00PM +0800, Peter Xu wrote:
> > > This is v2 of postcopy preempt series.  It can also be found here:
> > > 
> > >   https://github.com/xzpeter/qemu/tree/postcopy-preempt
> > > 
> > > RFC: https://lore.kernel.org/qemu-devel/20220119080929.39485-1-peterx@redhat.com
> > > V1:  https://lore.kernel.org/qemu-devel/20220216062809.57179-1-peterx@redhat.com
> > > 
> > > v1->v2 changelog:
> > > - Picked up more r-bs from Dave
> > > - Rename both fault threads to drop "qemu/" prefix [Dave]
> > > - Further rework on postcopy recovery, to be able to detect qemufile errors
> > >   from either main channel or postcopy one [Dave]
> > > - shutdown() qemufile before close on src postcopy channel when postcopy is
> > >   paused [Dave]
> > > - In postcopy_preempt_new_channel(), explicitly set the new channel in
> > >   blocking state, even if it's the default [Dave]
> > > - Make RAMState.postcopy_channel unsigned int [Dave]
> > > - Added patches:
> > >   - "migration: Create the postcopy preempt channel asynchronously"
> > >   - "migration: Parameter x-postcopy-preempt-break-huge"
> > >   - "migration: Add helpers to detect TLS capability"
> > >   - "migration: Fail postcopy preempt with TLS"
> > >   - "tests: Pass in MigrateStart** into test_migrate_start()"
> > > 
> > > Abstract
> > > ========
> > > 
> > > This series adds a new migration capability called "postcopy-preempt".  It can
> > > be enabled when postcopy is enabled, and it simply (but greatly) speeds up
> > > the postcopy page request handling process.
> > 
> > Is there no way we can just automatically enable this new feature, rather
> > than requiring apps to specify yet another new flag ?
> 
> I didn't make it the default for now, but I do have thought about making it
> the default when it consolidates a bit, perhaps on a new machine type.

Toggling based on machine type feels strange.  If it needs a toggle, that
implies it is not something that can be transparently enabled without the
app being aware of it.  Given that, toggling based on machine type would be
inappropriate, as existing apps expect to keep working with a new QEMU.

> I also didn't know whether there's other limitations of it.  For example,
> will a new socket pair be a problem for any VM environment (either a
> limitation from the management app, container, and so on)?  I think it's
> the same to multifd in that aspect, but I never explored.

If it needs extra sockets that is something apps will need to be aware
of unfortunately and explicitly opt-in to :-( Migration is often
tunnelled/proxied over other channels, so whatever does that needs to
be aware of possibility of seeing extra sockets.

> > > TODO List
> > > =========
> > > 
> > > TLS support
> > > -----------
> > > 
> > > I only noticed it's missing very recently.  Since soft freeze is coming, and
> > > obviously I'm still growing this series, so I tend to have the existing
> > > material discussed. Let's see if it can still catch the train for QEMU 7.0
> > > release (soft freeze on 2022-03-08)..
> > 
> > I don't like the idea of shipping something that is only half finished.
> > It means that when apps probe for the feature, they'll see preempt
> > capability present, but have no idea whether they're using a QEMU that
> > is broken when combined with TLS or not. We shouldn't merge something
> > just to meet the soft freeze deadline if we know key features are broken.
> 
> IMHO merging and declaring support are two problems.
> 
> > To me, it's always fine to merge the code that implements the foundation of a
> > feature.  The feature can be worked upon in the future.
> 
> Requiring a feature to be "complete" sometimes can cause burden to not only
> the author of the series but also reviewers.  It's IMHO not necessary to
> bind these two ideas.
> 
> It's sometimes also hard to define "complete": take the TLS as example, no
> one probably even noticed that it won't work with TLS and I just noticed it
> merely these two days..  We obviously can't merge partial patchset, but if
> the patchset is well isolated, then it's not a blocker for merging, imho.
> 
> Per my understanding, what you worried is when we declare it supported but
> later we never know when TLS will be ready for it.  One solution is I can
> rename the capability as x-, then after the TLS side ready I drop the x-
> prefix.  Then Libvirt or any mgmt software doesn't need to support this
> until we drop the x-, so there's no risk of compatibility.
> 
> Would that sound okay to you?

If it has an x- prefix then we can basically ignore it from a mgmt app
POV until it is actually finished.

> I can always step back and work on TLS first before it's merged, but again
> I don't think it's required.

Apps increasingly consider use of TLS to be a mandatory feature for
migration, so until that works, this preempt has to be considered
unsupported & unfinished IMHO. So either TLS should be ready when
it merges, or it should be clearly marked unsupported at the QAPI
level.

> 
> > 
> > > Multi-channel for preemption threads
> > > ------------------------------------
> > > 
> > > Currently the postcopy preempt feature uses only one extra channel and one
> > > extra thread on dest (no new thread on src QEMU).  It should be mostly good
> > > enough for major use cases, but when the postcopy queue is long enough
> > > (e.g. hundreds of vCPUs faulted on different pages) logically we could
> > > still observe more delays on average.  Whether growing threads/channels can
> > > solve it is debatable, but it sounds worth a try.  That's yet another
> > > thing we can think about after this patchset lands.
> > 
> > If we don't think about it upfront, then we'll possibly end up with
> > yet another tunable flag that apps have to worry about. It also
> > could make migration code even more complex if we have to support
> > two different scenarios. If we think multiple threads are going to
> > be a benefit, let's check that and, if so, design it into the exposed
> > application-facing interface from the start rather than retrofitting
> > afterwards.
> 
> I am not very sure we need that multi-threading.  I'm always open to
> opinions, though.
> 
> I raised it not as "must to do" but some idea in mind.  I'm not sure
> whether we'll have it some day.  Note that this is different from multifd,
> obviously only 1 multifd thread doesn't sound right at all. Postcopy is
> different, because in most cases the new thread can be idle.
> 
> That's also why this series didn't start from N threads but 1 thread only,
> because AFAICT it solves the major problem, which is the isolation between
> page request pages and background pages. One thread is a very good fit in
> many scenarios already.
> 
> So without obvious reasoning, I'm afraid it could over-engineer things if
> we start with N threads, keeping most of them idle.
> 
> Even if we need it, we'll need a parameter to specify the number of
> threads, that can be a good way to identify the boundary with old/new
> QEMUs, so e.g. upper layers like Libvirt can easily detect this difference
> if it wants by trying to detect the n_channels parameter.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 00/25] migration: Postcopy Preemption
  2022-03-01 10:27     ` Daniel P. Berrangé
@ 2022-03-01 10:55       ` Peter Xu
  2022-03-01 16:51         ` Dr. David Alan Gilbert
  2022-03-01 18:05         ` [PATCH v2 00/25] migration: Postcopy Preemption Daniel P. Berrangé
  0 siblings, 2 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-01 10:55 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Leonardo Bras Soares Passos, qemu-devel, Dr . David Alan Gilbert,
	Juan Quintela

On Tue, Mar 01, 2022 at 10:27:10AM +0000, Daniel P. Berrangé wrote:
> > I also didn't know whether there's other limitations of it.  For example,
> > will a new socket pair be a problem for any VM environment (either a
> > limitation from the management app, container, and so on)?  I think it's
> > the same to multifd in that aspect, but I never explored.
> 
> If it needs extra sockets that is something apps will need to be aware
> of unfortunately and explicitly opt-in to :-( Migration is often
> tunnelled/proxied over other channels, so whatever does that needs to
> be aware of possibility of seeing extra sockets.

Ah, then probably it can never be the default.  But it could certainly be
nice if a higher level can opt in and make it the default at some point, as
long as it knows the network topology is safe for that.

> 
> > > > TODO List
> > > > =========
> > > > 
> > > > TLS support
> > > > -----------
> > > > 
> > > > I only noticed it's missing very recently.  Since soft freeze is coming, and
> > > > obviously I'm still growing this series, so I tend to have the existing
> > > > material discussed. Let's see if it can still catch the train for QEMU 7.0
> > > > release (soft freeze on 2022-03-08)..
> > > 
> > > I don't like the idea of shipping something that is only half finished.
> > > It means that when apps probe for the feature, they'll see preempt
> > > capability present, but have no idea whether they're using a QEMU that
> > > is broken when combined with TLS or not. We shouldn't merge something
> > > just to meet the soft freeze deadline if we know key features are broken.
> > 
> > IMHO merging and declaring support are two problems.
> > 
> > To me, it's always fine to merge the code that implements the foundation of a
> > feature.  The feature can be worked upon in the future.
> > 
> > Requiring a feature to be "complete" sometimes can cause burden to not only
> > the author of the series but also reviewers.  It's IMHO not necessary to
> > bind these two ideas.
> > 
> > It's sometimes also hard to define "complete": take the TLS as example, no
> > one probably even noticed that it won't work with TLS and I just noticed it
> > merely these two days..  We obviously can't merge partial patchset, but if
> > the patchset is well isolated, then it's not a blocker for merging, imho.
> > 
> > Per my understanding, what you worried is when we declare it supported but
> > later we never know when TLS will be ready for it.  One solution is I can
> > rename the capability as x-, then after the TLS side ready I drop the x-
> > prefix.  Then Libvirt or any mgmt software doesn't need to support this
> > until we drop the x-, so there's no risk of compatibility.
> > 
> > Would that sound okay to you?
> 
> If it has an x- prefix then we can basically ignore it from a mgmt app
> POV until it is actually finished.
> 
> > I can always step back and work on TLS first before it's merged, but again
> > I don't think it's required.
> 
> Apps increasingly consider use of TLS to be a mandatory feature for
> migration, so until that works, this preempt has to be considered
> unsupported & unfinished IMHO. So either TLS should be ready when
> it merges, or it should be clearly marked unsupported at the QAPI
> level.

Yes, I fully agree with that, and for huge VM migrations I think TLS is in
many cases mandatory.

I do plan to work on it right afterwards if this series lands, but as the
series grows I've started to think we should land the code that's already
solid.  Landing the code has another benefit: I want to make sure that the
merged code at least won't affect the existing features.

What I'm curious about is why TLS has been getting quite some attention in
the past few years, yet I didn't see any TLS selftests included in
migration-test.  That's something I wanted to look into, maybe even before
adding the preempt+TLS support.  But maybe I just missed something, as I
haven't used TLS much in the past.

Thanks,

-- 
Peter Xu




* Re: [PATCH v2 00/25] migration: Postcopy Preemption
  2022-03-01 10:55       ` Peter Xu
@ 2022-03-01 16:51         ` Dr. David Alan Gilbert
  2022-03-02  1:46           ` Peter Xu
  2022-03-14 18:49           ` Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption) Daniel P. Berrangé
  2022-03-01 18:05         ` [PATCH v2 00/25] migration: Postcopy Preemption Daniel P. Berrangé
  1 sibling, 2 replies; 47+ messages in thread
From: Dr. David Alan Gilbert @ 2022-03-01 16:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: Leonardo Bras Soares Passos, Daniel P. Berrangé,
	qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Mar 01, 2022 at 10:27:10AM +0000, Daniel P. Berrangé wrote:
> > > I also didn't know whether there's other limitations of it.  For example,
> > > will a new socket pair be a problem for any VM environment (either a
> > > limitation from the management app, container, and so on)?  I think it's
> > > the same to multifd in that aspect, but I never explored.
> > 
> > If it needs extra sockets that is something apps will need to be aware
> > of unfortunately and explicitly opt-in to :-( Migration is often
> > tunnelled/proxied over other channels, so whatever does that needs to
> > be aware of possibility of seeing extra sockets.
> 
> Ah, then probably it can never be the default.  But for sure it could be
> nice that higher level can opt-in and make it a default at some point as
> long as it knows the network topology is safe to do so.
> 
> > 
> > > > > TODO List
> > > > > =========
> > > > > 
> > > > > TLS support
> > > > > -----------
> > > > > 
> > > > > I only noticed it's missing very recently.  Since soft freeze is coming, and
> > > > > obviously I'm still growing this series, so I tend to have the existing
> > > > > material discussed. Let's see if it can still catch the train for QEMU 7.0
> > > > > release (soft freeze on 2022-03-08)..
> > > > 
> > > > I don't like the idea of shipping something that is only half finished.
> > > > It means that when apps probe for the feature, they'll see preempt
> > > > capability present, but have no idea whether they're using a QEMU that
> > > > is broken when combined with TLS or not. We shouldn't merge something
> > > > just to meet the soft freeze deadline if we know key features are broken.
> > > 
> > > IMHO merging and declaring support are two problems.
> > > 
> > > To me, it's always fine to merge the code that implements the foundation of a
> > > feature.  The feature can be worked upon in the future.
> > > 
> > > Requiring a feature to be "complete" sometimes can cause burden to not only
> > > the author of the series but also reviewers.  It's IMHO not necessary to
> > > bind these two ideas.
> > > 
> > > It's sometimes also hard to define "complete": take the TLS as example, no
> > > one probably even noticed that it won't work with TLS and I just noticed it
> > > merely these two days..  We obviously can't merge partial patchset, but if
> > > the patchset is well isolated, then it's not a blocker for merging, imho.
> > > 
> > > Per my understanding, what you worried is when we declare it supported but
> > > later we never know when TLS will be ready for it.  One solution is I can
> > > rename the capability as x-, then after the TLS side ready I drop the x-
> > > prefix.  Then Libvirt or any mgmt software doesn't need to support this
> > > until we drop the x-, so there's no risk of compatibility.
> > > 
> > > Would that sound okay to you?
> > 
> > If it has an x- prefix then we can basically ignore it from a mgmt app
> > POV until it is actually finished.
> > 
> > > I can always step back and work on TLS first before it's merged, but again
> > > I don't think it's required.
> > 
> > Apps increasingly consider use of TLS to be a mandatory feature for
> > migration, so until that works, this preempt has to be considered
> > unsupported & unfinished IMHO. So either TLS should be ready when
> > it merges, or it should be clearly marked unsupported at the QAPI
> > level.
> 
> Yes, I fully agree with it, and for huge vm migrations I think TLS is in
> many cases mandatory.
> 
> I do plan to work on it right afterwards if this series lands, but as the
> series grows I just noticed maybe we should start landing some code that's
> already solid.  Landing the code has another benefit: I want to make
> sure the code merged at least won't affect the existing features.
> 
> So what I'm curious is why TLS is getting quite some attention in the past
> few years but I didn't even see any selftests included in migration-test on
> tls.  That's something I wanted to look into, maybe even before adding the
> preempt+tls support. But maybe I just missed something, as I didn't use tls
> a lot in the past.

Hmm, I think it's worth getting TLS working before putting the full
series in, because it might impact the way you wire the channels up -
it's going to take some care; but let's see which parts we can/should
take.

Dave
> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v2 00/25] migration: Postcopy Preemption
  2022-03-01 10:55       ` Peter Xu
  2022-03-01 16:51         ` Dr. David Alan Gilbert
@ 2022-03-01 18:05         ` Daniel P. Berrangé
  2022-03-02  1:48           ` Peter Xu
  1 sibling, 1 reply; 47+ messages in thread
From: Daniel P. Berrangé @ 2022-03-01 18:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: Leonardo Bras Soares Passos, qemu-devel, Dr . David Alan Gilbert,
	Juan Quintela

On Tue, Mar 01, 2022 at 06:55:00PM +0800, Peter Xu wrote:
> On Tue, Mar 01, 2022 at 10:27:10AM +0000, Daniel P. Berrangé wrote:
> So what I'm curious is why TLS is getting quite some attention in the past
> few years but I didn't even see any selftests included in migration-test on
> tls.  That's something I wanted to look into, maybe even before adding the
> preempt+tls support. But maybe I just missed something, as I didn't use tls
> a lot in the past.

I'm going to send some patches for adding TLS to migration-test

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 00/25] migration: Postcopy Preemption
  2022-03-01 16:51         ` Dr. David Alan Gilbert
@ 2022-03-02  1:46           ` Peter Xu
  2022-03-14 18:49           ` Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption) Daniel P. Berrangé
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-02  1:46 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Leonardo Bras Soares Passos, Daniel P. Berrangé,
	qemu-devel, Juan Quintela

On Tue, Mar 01, 2022 at 04:51:09PM +0000, Dr. David Alan Gilbert wrote:
> Hmm, I think it's worth getting TLS working before putting the full
> series in, because it might impact the way you wire the channels up -
> it's going to take some care; but lets see which parts we can/should
> take.

IMHO it should be mostly transparent to the whole user interface and the
rest of the features, thanks to the well-abstracted QIOChannel layer, so
most code does not really need to worry about what kind of channel it is.
But sure, we don't need to rush.  Thanks,

-- 
Peter Xu




* Re: [PATCH v2 00/25] migration: Postcopy Preemption
  2022-03-01 18:05         ` [PATCH v2 00/25] migration: Postcopy Preemption Daniel P. Berrangé
@ 2022-03-02  1:48           ` Peter Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-02  1:48 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Leonardo Bras Soares Passos, qemu-devel, Dr . David Alan Gilbert,
	Juan Quintela

On Tue, Mar 01, 2022 at 06:05:50PM +0000, Daniel P. Berrangé wrote:
> On Tue, Mar 01, 2022 at 06:55:00PM +0800, Peter Xu wrote:
> > On Tue, Mar 01, 2022 at 10:27:10AM +0000, Daniel P. Berrangé wrote:
> > So what I'm curious is why TLS is getting quite some attention in the past
> > few years but I didn't even see any selftests included in migration-test on
> > tls.  That's something I wanted to look into, maybe even before adding the
> > preempt+tls support. But maybe I just missed something, as I didn't use tls
> > a lot in the past.
> 
> I'm going to send some patches for adding TLS to migration-test

That's great.  Please feel free to copy me, I can probably work the
preempt+tls upon, thanks.

-- 
Peter Xu




* Re: [PATCH v2 25/25] tests: Pass in MigrateStart** into test_migrate_start()
  2022-03-01  8:39 ` [PATCH v2 25/25] tests: Pass in MigrateStart** into test_migrate_start() Peter Xu
@ 2022-03-02 12:11   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 47+ messages in thread
From: Dr. David Alan Gilbert @ 2022-03-02 12:11 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> test_migrate_start() will release the MigrateStart structure that is passed
> in, however that's not super clear to the caller, because after the call
> returns the pointer can still be referenced by the caller.  It can easily
> be a source of use-after-free.
> 
> Let's pass in a double pointer of that, then we can safely clear the
> pointer for the caller after the struct is released.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
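(As a side note for readers: the double-pointer idiom the commit message
describes reduces to this minimal, self-contained sketch - the struct and
function names are simplified stand-ins, not the actual test code.)

```c
#include <assert.h>
#include <stdlib.h>

typedef struct MigrateStart {
    int only_target;            /* trimmed-down stand-in field */
} MigrateStart;

/* Consumes *pargs: the callee frees the struct and clears the caller's
 * pointer, so a later accidental dereference fails fast (NULL deref)
 * instead of silently reading freed memory. */
static int consume_args(MigrateStart **pargs)
{
    MigrateStart *args = *pargs;

    (void)args->only_target;    /* ... use args here ... */
    free(args);
    *pargs = NULL;              /* tell the caller the struct is gone */
    return 0;
}
```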

> ---
>  tests/qtest/migration-test.c | 27 +++++++++++++++------------
>  1 file changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 09a9ce4401..67f0601988 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -496,7 +496,7 @@ static void migrate_start_destroy(MigrateStart *args)
>  }
>  
>  static int test_migrate_start(QTestState **from, QTestState **to,
> -                              const char *uri, MigrateStart *args)
> +                              const char *uri, MigrateStart **pargs)
>  {
>      g_autofree gchar *arch_source = NULL;
>      g_autofree gchar *arch_target = NULL;
> @@ -508,6 +508,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>      g_autofree char *shmem_path = NULL;
>      const char *arch = qtest_get_arch();
>      const char *machine_opts = NULL;
> +    MigrateStart *args = *pargs;
>      const char *memory_size;
>      int ret = 0;
>  
> @@ -622,6 +623,8 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>  
>  out:
>      migrate_start_destroy(args);
> +    /* This tells the caller that this structure is gone */
> +    *pargs = NULL;
>      return ret;
>  }
>  
> @@ -668,7 +671,7 @@ static int migrate_postcopy_prepare(QTestState **from_ptr,
>      bool postcopy_preempt = args->postcopy_preempt;
>      QTestState *from, *to;
>  
> -    if (test_migrate_start(&from, &to, uri, args)) {
> +    if (test_migrate_start(&from, &to, uri, &args)) {
>          return -1;
>      }
>  
> @@ -822,7 +825,7 @@ static void test_baddest(void)
>  
>      args->hide_stderr = true;
>  
> -    if (test_migrate_start(&from, &to, "tcp:127.0.0.1:0", args)) {
> +    if (test_migrate_start(&from, &to, "tcp:127.0.0.1:0", &args)) {
>          return;
>      }
>      migrate_qmp(from, "tcp:127.0.0.1:0", "{}");
> @@ -838,7 +841,7 @@ static void test_precopy_unix_common(bool dirty_ring)
>  
>      args->use_dirty_ring = dirty_ring;
>  
> -    if (test_migrate_start(&from, &to, uri, args)) {
> +    if (test_migrate_start(&from, &to, uri, &args)) {
>          return;
>      }
>  
> @@ -926,7 +929,7 @@ static void test_xbzrle(const char *uri)
>      MigrateStart *args = migrate_start_new();
>      QTestState *from, *to;
>  
> -    if (test_migrate_start(&from, &to, uri, args)) {
> +    if (test_migrate_start(&from, &to, uri, &args)) {
>          return;
>      }
>  
> @@ -980,7 +983,7 @@ static void test_precopy_tcp(void)
>      g_autofree char *uri = NULL;
>      QTestState *from, *to;
>  
> -    if (test_migrate_start(&from, &to, "tcp:127.0.0.1:0", args)) {
> +    if (test_migrate_start(&from, &to, "tcp:127.0.0.1:0", &args)) {
>          return;
>      }
>  
> @@ -1025,7 +1028,7 @@ static void test_migrate_fd_proto(void)
>      QDict *rsp;
>      const char *error_desc;
>  
> -    if (test_migrate_start(&from, &to, "defer", args)) {
> +    if (test_migrate_start(&from, &to, "defer", &args)) {
>          return;
>      }
>  
> @@ -1105,7 +1108,7 @@ static void do_test_validate_uuid(MigrateStart *args, bool should_fail)
>      g_autofree char *uri = g_strdup_printf("unix:%s/migsocket", tmpfs);
>      QTestState *from, *to;
>  
> -    if (test_migrate_start(&from, &to, uri, args)) {
> +    if (test_migrate_start(&from, &to, uri, &args)) {
>          return;
>      }
>  
> @@ -1197,7 +1200,7 @@ static void test_migrate_auto_converge(void)
>       */
>      const int64_t expected_threshold = max_bandwidth * downtime_limit / 1000;
>  
> -    if (test_migrate_start(&from, &to, uri, args)) {
> +    if (test_migrate_start(&from, &to, uri, &args)) {
>          return;
>      }
>  
> @@ -1266,7 +1269,7 @@ static void test_multifd_tcp(const char *method)
>      QDict *rsp;
>      g_autofree char *uri = NULL;
>  
> -    if (test_migrate_start(&from, &to, "defer", args)) {
> +    if (test_migrate_start(&from, &to, "defer", &args)) {
>          return;
>      }
>  
> @@ -1352,7 +1355,7 @@ static void test_multifd_tcp_cancel(void)
>  
>      args->hide_stderr = true;
>  
> -    if (test_migrate_start(&from, &to, "defer", args)) {
> +    if (test_migrate_start(&from, &to, "defer", &args)) {
>          return;
>      }
>  
> @@ -1391,7 +1394,7 @@ static void test_multifd_tcp_cancel(void)
>      args = migrate_start_new();
>      args->only_target = true;
>  
> -    if (test_migrate_start(&from, &to2, "defer", args)) {
> +    if (test_migrate_start(&from, &to2, "defer", &args)) {
>          return;
>      }
>  
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 00/25] migration: Postcopy Preemption
  2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
                   ` (25 preceding siblings ...)
  2022-03-01  9:25 ` [PATCH v2 00/25] migration: Postcopy Preemption Daniel P. Berrangé
@ 2022-03-02 12:14 ` Dr. David Alan Gilbert
  2022-03-02 12:34   ` Peter Xu
  26 siblings, 1 reply; 47+ messages in thread
From: Dr. David Alan Gilbert @ 2022-03-02 12:14 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> This is v2 of postcopy preempt series.  It can also be found here:
> 
>   https://github.com/xzpeter/qemu/tree/postcopy-preempt
> 
> RFC: https://lore.kernel.org/qemu-devel/20220119080929.39485-1-peterx@redhat.com
> V1:  https://lore.kernel.org/qemu-devel/20220216062809.57179-1-peterx@redhat.com

I've queued some of this:

tests: Pass in MigrateStart** into test_migrate_start()
migration: Add migration_incoming_transport_cleanup()
migration: postcopy_pause_fault_thread() never fails
migration: Enlarge postcopy recovery to capture !-EIO too
migration: Move static var in ram_block_from_stream() into global
migration: Add postcopy_thread_create()
migration: Dump ramblock and offset too when non-same-page detected
migration: Introduce postcopy channels on dest node
migration: Tracepoint change in postcopy-run bottom half
migration: Finer grained tracepoints for POSTCOPY_LISTEN
migration: Dump sub-cmd name in loadvm_process_command tp

> v1->v2 changelog:
> - Picked up more r-bs from Dave
> - Rename both fault threads to drop "qemu/" prefix [Dave]
> - Further rework on postcopy recovery, to be able to detect qemufile errors
>   from either main channel or postcopy one [Dave]
> - shutdown() qemufile before close on src postcopy channel when postcopy is
>   paused [Dave]
> - In postcopy_preempt_new_channel(), explicitly set the new channel in
>   blocking state, even if it's the default [Dave]
> - Make RAMState.postcopy_channel unsigned int [Dave]
> - Added patches:
>   - "migration: Create the postcopy preempt channel asynchronously"
>   - "migration: Parameter x-postcopy-preempt-break-huge"
>   - "migration: Add helpers to detect TLS capability"
>   - "migration: Fail postcopy preempt with TLS"
>   - "tests: Pass in MigrateStart** into test_migrate_start()"
> 
> Abstract
> ========
> 
> This series adds a new migration capability called "postcopy-preempt".  It
> can be enabled when postcopy is enabled, and it simply (but greatly) speeds
> up the handling of postcopy page requests.
> 
> Below are some initial postcopy page-request latency measurements after
> applying the new series.
> 
> For each page size, I measured page request latency for three cases:
> 
>   (a) Vanilla:                the old postcopy
>   (b) Preempt no-break-huge:  preempt enabled, x-postcopy-preempt-break-huge=off
>   (c) Preempt full:           preempt enabled, x-postcopy-preempt-break-huge=on
>                               (this is the default option when preempt enabled)
> 
> The x-postcopy-preempt-break-huge parameter was just added in v2 to
> conditionally disable breaking out of sending a precopy huge page, for
> debugging purposes.  When it's off, postcopy will not preempt precopy in
> the middle of a huge page, but postcopy will still use its own channel.
> 
> I tested the cases separately to give a rough idea of how much each part
> of the change helped.  The overall benefit is the comparison between
> cases (a) and (c).
> 
>   |-----------+---------+-----------------------+--------------|
>   | Page size | Vanilla | Preempt no-break-huge | Preempt full |
>   |-----------+---------+-----------------------+--------------|
>   | 4K        |   10.68 |               N/A [*] |         0.57 |
>   | 2M        |   10.58 |                  5.49 |         5.02 |
>   | 1G        | 2046.65 |               933.185 |      649.445 |
>   |-----------+---------+-----------------------+--------------|
>   [*]: This case is N/A because a 4K page size involves no huge pages at all
> 
> [1] https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf
> 
> TODO List
> =========
> 
> TLS support
> -----------
> 
> I only noticed it was missing very recently.  Since soft freeze is coming,
> and I'm obviously still growing this series, I'd like to get the existing
> material discussed.  Let's see if it can still catch the train for the
> QEMU 7.0 release (soft freeze on 2022-03-08)..
> 
> Avoid precopy write() blocks postcopy
> -------------------------------------
> 
> I haven't proven this, but I've always thought that write() syscalls
> blocking on precopy pages can affect postcopy service.  If we can solve
> this problem, my wild guess is that we can further reduce the average page
> latency.
> 
> At least two solutions come to mind: (1) make the write side of the
> migration channel non-blocking too, or (2) use multiple threads on the
> send side, just like multifd, but with a lock protecting which page to
> send (the core idea being that we should _never_ rely on anything from the
> main thread; multifd has that dependency of queuing pages only on the main
> thread).
> 
> That can definitely be done and thought about later.
> 
> Multi-channel for preemption threads
> ------------------------------------
> 
> Currently the postcopy preempt feature uses only one extra channel and one
> extra thread on dest (no new thread on src QEMU).  That should be good
> enough for the major use cases, but when the postcopy queue gets long
> enough (e.g. hundreds of vCPUs faulted on different pages) we could
> logically still observe higher delays on average.  Whether growing the
> number of threads/channels can solve it is debatable, but it seems worth a
> try.  That's yet another thing we can think about after this patchset lands.
> 
> Logically the design provides space for that - the receiving postcopy
> preempt thread understands the whole ram-layer migration protocol, and for
> multiple channels and threads we could simply grow that into multiple
> threads handling the same protocol (with multiple PostcopyTmpPage).  The
> source side needs more thought on synchronization, but that shouldn't
> affect the protocol layer, so it should be easy to keep compatible.
> 
> Patch Layout
> ============
> 
> Patch 1-3: Three leftover patches from the patchset "[PATCH v3 0/8] migration:
> Postcopy cleanup on ram disgard" that I picked up here too.
> 
>   https://lore.kernel.org/qemu-devel/20211224065000.97572-1-peterx@redhat.com/
> 
>   migration: Dump sub-cmd name in loadvm_process_command tp
>   migration: Finer grained tracepoints for POSTCOPY_LISTEN
>   migration: Tracepoint change in postcopy-run bottom half
> 
> Patch 4-9: Original postcopy preempt RFC preparation patches (with slight
> modifications).
> 
>   migration: Introduce postcopy channels on dest node
>   migration: Dump ramblock and offset too when non-same-page detected
>   migration: Add postcopy_thread_create()
>   migration: Move static var in ram_block_from_stream() into global
>   migration: Add pss.postcopy_requested status
>   migration: Move migrate_allow_multifd and helpers into migration.c
> 
> Patch 10-15: Some patches newly added while working on postcopy recovery
> support.  After these patches the migrate-recover command will allow
> re-entrance, which is a very nice side effect.
> 
>   migration: Enlarge postcopy recovery to capture !-EIO too
>   migration: postcopy_pause_fault_thread() never fails
>   migration: Export ram_load_postcopy()
>   migration: Move channel setup out of postcopy_try_recover()
>   migration: Add migration_incoming_transport_cleanup()
>   migration: Allow migrate-recover to run multiple times
> 
> Patch 16-19: The major work of postcopy preemption implementation is split into
> four patches as suggested by Dave.
> 
>   migration: Add postcopy-preempt capability
>   migration: Postcopy preemption preparation on channel creation
>   migration: Postcopy preemption enablement
>   migration: Postcopy recover with preempt enabled
> 
> Patch 20-23: Patches newly added in this v2 for different purposes,
>              mostly amendments to the existing postcopy preempt work.
> 
>   migration: Create the postcopy preempt channel asynchronously
>   migration: Parameter x-postcopy-preempt-break-huge
>   migration: Add helpers to detect TLS capability
>   migration: Fail postcopy preempt with TLS for now
> 
> Patch 24-25: Test cases (including one more patch for cleanup)
> 
>   tests: Add postcopy preempt test
>   tests: Pass in MigrateStart** into test_migrate_start()
> 
> Please review, thanks.
> 
> Peter Xu (25):
>   migration: Dump sub-cmd name in loadvm_process_command tp
>   migration: Finer grained tracepoints for POSTCOPY_LISTEN
>   migration: Tracepoint change in postcopy-run bottom half
>   migration: Introduce postcopy channels on dest node
>   migration: Dump ramblock and offset too when non-same-page detected
>   migration: Add postcopy_thread_create()
>   migration: Move static var in ram_block_from_stream() into global
>   migration: Add pss.postcopy_requested status
>   migration: Move migrate_allow_multifd and helpers into migration.c
>   migration: Enlarge postcopy recovery to capture !-EIO too
>   migration: postcopy_pause_fault_thread() never fails
>   migration: Export ram_load_postcopy()
>   migration: Move channel setup out of postcopy_try_recover()
>   migration: Add migration_incoming_transport_cleanup()
>   migration: Allow migrate-recover to run multiple times
>   migration: Add postcopy-preempt capability
>   migration: Postcopy preemption preparation on channel creation
>   migration: Postcopy preemption enablement
>   migration: Postcopy recover with preempt enabled
>   migration: Create the postcopy preempt channel asynchronously
>   migration: Parameter x-postcopy-preempt-break-huge
>   migration: Add helpers to detect TLS capability
>   migration: Fail postcopy preempt with TLS for now
>   tests: Add postcopy preempt test
>   tests: Pass in MigrateStart** into test_migrate_start()
> 
>  migration/channel.c          |  10 +-
>  migration/migration.c        | 235 ++++++++++++++++++++-----
>  migration/migration.h        |  98 ++++++++++-
>  migration/multifd.c          |  26 +--
>  migration/multifd.h          |   2 -
>  migration/postcopy-ram.c     | 244 +++++++++++++++++++++-----
>  migration/postcopy-ram.h     |  15 ++
>  migration/qemu-file.c        |  27 +++
>  migration/qemu-file.h        |   1 +
>  migration/ram.c              | 330 +++++++++++++++++++++++++++++++----
>  migration/ram.h              |   3 +
>  migration/savevm.c           |  70 ++++++--
>  migration/socket.c           |  22 ++-
>  migration/socket.h           |   1 +
>  migration/trace-events       |  19 +-
>  qapi/migration.json          |   8 +-
>  tests/qtest/migration-test.c |  68 ++++++--
>  17 files changed, 983 insertions(+), 196 deletions(-)
> 
> -- 
> 2.32.0
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v2 00/25] migration: Postcopy Preemption
  2022-03-02 12:14 ` Dr. David Alan Gilbert
@ 2022-03-02 12:34   ` Peter Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-02 12:34 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

On Wed, Mar 02, 2022 at 12:14:30PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > This is v2 of postcopy preempt series.  It can also be found here:
> > 
> >   https://github.com/xzpeter/qemu/tree/postcopy-preempt
> > 
> > RFC: https://lore.kernel.org/qemu-devel/20220119080929.39485-1-peterx@redhat.com
> > V1:  https://lore.kernel.org/qemu-devel/20220216062809.57179-1-peterx@redhat.com
> 
> I've queued some of this:
> 
> tests: Pass in MigrateStart** into test_migrate_start()
> migration: Add migration_incoming_transport_cleanup()
> migration: postcopy_pause_fault_thread() never fails
> migration: Enlarge postcopy recovery to capture !-EIO too
> migration: Move static var in ram_block_from_stream() into global
> migration: Add postcopy_thread_create()
> migration: Dump ramblock and offset too when non-same-page detected
> migration: Introduce postcopy channels on dest node
> migration: Tracepoint change in postcopy-run bottom half
> migration: Finer grained tracepoints for POSTCOPY_LISTEN
> migration: Dump sub-cmd name in loadvm_process_command tp

They all look pretty safe to merge, thanks!

We could even consider postponing the merge of the two patches below:

  migration: Move static var in ram_block_from_stream() into global
  migration: Introduce postcopy channels on dest node

Fundamentally, if we only want to merge postcopy preempt as a whole, these
two patches don't bring anything beneficial to the existing code on their
own; they should be counted as part of the implementation of the preempt mode.

That said, I think it's still fine to merge them too, as long as we're sure
we'll eventually merge the preempt code once the TLS side of things is ready.

-- 
Peter Xu




* Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption)
  2022-03-01 16:51         ` Dr. David Alan Gilbert
  2022-03-02  1:46           ` Peter Xu
@ 2022-03-14 18:49           ` Daniel P. Berrangé
  2022-03-15  6:13             ` Peter Xu
  2022-03-15 10:43             ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 47+ messages in thread
From: Daniel P. Berrangé @ 2022-03-14 18:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Leonardo Bras Soares Passos, qemu-devel, Peter Xu, Juan Quintela

On Tue, Mar 01, 2022 at 04:51:09PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Tue, Mar 01, 2022 at 10:27:10AM +0000, Daniel P. Berrangé wrote:
> > > > I also didn't know whether there are other limitations.  For example,
> > > > will a new socket pair be a problem for any VM environment (whether a
> > > > limitation from the management app, container, and so on)?  I think it's
> > > > the same as multifd in that respect, but I never explored it.
> > > 
> > > If it needs extra sockets, that is unfortunately something apps will need
> > > to be aware of and explicitly opt in to :-( Migration is often
> > > tunnelled/proxied over other channels, so whatever does that needs to be
> > > aware of the possibility of seeing extra sockets.
> > 
> > Ah, then it probably can never be the default.  But it would certainly be
> > nice if the higher level could opt in and make it the default at some
> > point, as long as it knows the network topology is safe for it.
> > 
> > > 
> > > > > > TODO List
> > > > > > =========
> > > > > > 
> > > > > > TLS support
> > > > > > -----------
> > > > > > 
> > > > > > I only noticed it was missing very recently.  Since soft freeze is coming, and
> > > > > > obviously I'm still growing this series, so I tend to have the existing
> > > > > > material discussed. Let's see if it can still catch the train for QEMU 7.0
> > > > > > release (soft freeze on 2022-03-08)..
> > > > > 
> > > > > I don't like the idea of shipping something that is only half finished.
> > > > > It means that when apps probe for the feature, they'll see preempt
> > > > > capability present, but have no idea whether they're using a QEMU that
> > > > > is broken when combined with TLS or not. We shouldn't merge something
> > > > > just to meet the soft freeze deadline if we know key features are broken.
> > > > 
> > > > IMHO merging and declaring support are two separate problems.
> > > > 
> > > > To me, it's always fine to merge code that implements the foundation of a
> > > > feature.  The feature can be built upon in the future.
> > > > 
> > > > Requiring a feature to be "complete" can sometimes burden not only the
> > > > author of the series but also the reviewers.  IMHO it's not necessary to
> > > > bind these two ideas together.
> > > > 
> > > > It's sometimes also hard to define "complete": take TLS as an example;
> > > > probably no one even noticed that it won't work with TLS, and I only
> > > > noticed it myself these past two days.  We obviously can't merge a
> > > > partial patchset, but if the patchset is well isolated, it's not a
> > > > blocker for merging, imho.
> > > > 
> > > > Per my understanding, what worries you is that we'd declare it supported
> > > > but then never know when TLS will be ready for it.  One solution is that
> > > > I can rename the capability with an x- prefix, then drop the x- prefix
> > > > once the TLS side is ready.  Then Libvirt or any mgmt software doesn't
> > > > need to support this until we drop the x-, so there's no compatibility risk.
> > > > 
> > > > Would that sound okay to you?
> > > 
> > > If it has an x- prefix then we can basically ignore it from a mgmt app
> > > POV until it is actually finished.
> > > 
> > > > I can always step back and work on TLS first before it's merged, but again
> > > > I don't think it's required.
> > > 
> > > Apps increasingly consider use of TLS to be a mandatory feature for
> > > migration, so until that works, this preempt has to be considered
> > > unsupported & unfinished IMHO. So either TLS should be ready when
> > > it merges, or it should be clearly marked unsupported at the QAPI
> > > level.
> > 
> > Yes, I fully agree, and for huge VM migrations I think TLS is in many
> > cases mandatory.
> > 
> > I do plan to work on it right after this series lands, but as the series
> > grows I've started to think we should land the parts that are already
> > solid.  Landing the code has another benefit: I want to make sure the
> > merged code at least won't affect the existing features.
> > 
> > What I'm curious about is why TLS has gotten so much attention in the past
> > few years, yet I didn't see any TLS selftests included in migration-test.
> > That's something I wanted to look into, maybe even before adding the
> > preempt+tls support.  But maybe I just missed something, as I didn't use
> > TLS a lot in the past.
> 
> Hmm, I think it's worth getting TLS working before putting the full
> series in, because it might impact the way you wire the channels up -
> it's going to take some care; but let's see which parts we can/should
> take.

Taking a step back here and looking at the bigger picture of
migration protocol configuration....

Almost every time we add a new feature to migration, we end up
having to define at least one new migration parameter, then wire
it up in libvirt, and then the mgmt app too, often needing to
ensure it is turned on for both client and server at the same time.


For some features, requiring an explicit opt-in could make sense,
because we don't know for sure that the feature is always a benefit.
These are things that can be thought of as workload sensitive
tunables.


For other features though, it feels like we would be better off if
we could turn it on by default with no config. These are things
that can be thought of as migration infrastructure / transport
architectural designs.


eg it would be nice to be able to use multifd by default for
migration. We would still want a tunable to control the number
of channels, but we ought to be able to just start with a default
number of channels automatically, so the tunable is only needed
for special cases.

This post-copy is another case.  We should start off knowing
we can switch to post-copy at any time. We should further be
able to add pre-emption if we find it available. IOW, we should
not have required anything more than 'switch to post-copy' to
be exposed to mgmt apps.

Or enabling zero copy on either send or receive side.

Or enabling kernel-TLS offload

Or ..insert other interesting protocol feature...



All this stems from our current migration protocol that started
as a single unidirectional channel, which goes straight into
the migration data stream, with no protocol handshake  and
thus no feature negotiation either.

We've offloaded feature negotiation to libvirt and in turn to
the mgmt app, and this is awful for the layers above, but
also awful for QEMU. Because multifd requires mgmt app opt-in,
we can wait 10 years and there will still be countless apps
using single-fd mode because they've not been updated to
opt-in.  If we negotiated features at QEMU level we could
have everything using multifd in a few years, and have dropped
single-fd mode a few years later.


So rather than following our historical practice, and adding
yet another migration parameter for a specific feature, I'd
really encourage us to put a stop to it and future proof
ourselves.


Introduce one *final-no-more-never-again-after-this* migration
capability called "protocol-negotiation".


When that capability is set, first declare that henceforth the
migration transport is REQUIRED to support **multiple**,
**bi-directional** channels. We might only use 1 TCP channel
in some cases, but it declares our intent that we expect to be
able to use as many channels as we see fit henceforth.

Now define a protocol handshake. A 5 minute thought experiment
starts off with something simple:

   dst -> src:  Greeting Message:
                  Magic: "QEMU-MIGRATE"  12 bytes
                  Num Versions: 1 byte
                  Version list: 1 byte * num versions
                  Num features: 4 bytes
                  Feature list: string * num features

   src -> dst:  Greeting Reply:
                  Magic: "QEMU-MIGRATE" 12 bytes
                  Select version: 1 byte
                  Num select features: 4 bytes
                  Selected features: string * num features   

   .... possibly more src <-> dst messages depending on
        features negotiated....

   src -> dst:  start migration
 
    ...traditional migration stream runs now for the remainder
       of this connection ...



I suggest "dst" starts first, so that connecting to a dst lets you
easily debug whether QEMU is speaking v2 or just waiting for the
client to send something, as is traditionally the case.

This shouldn't need very much code, and it gives us flexibility
to do all sorts of interesting things going forward with less
overhead for everyone involved.

We can layer in a real authentication system like SASL after
the greeting without any libvirt / mgmt app support

We can enable zero-copy at will. We can enable kernel-TLS at
will. We can add new TCP connections for clever feature XYZ.

We get a back channel every time, so dst can pass info back
to the src to optimize behaviour.

We can experiment with features and throw them away again
later without involving the mgmt app, since we negotiate
their use.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption)
  2022-03-14 18:49           ` Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption) Daniel P. Berrangé
@ 2022-03-15  6:13             ` Peter Xu
  2022-03-15 11:15               ` Daniel P. Berrangé
  2022-03-15 10:43             ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Xu @ 2022-03-15  6:13 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, Juan Quintela, Dr. David Alan Gilbert,
	Leonardo Bras Soares Passos, qemu-devel

On Mon, Mar 14, 2022 at 06:49:25PM +0000, Daniel P. Berrangé wrote:
> Taking a step back here and looking at the bigger picture of
> migration protocol configuration....
> 
> Almost every time we add a new feature to migration, we end up
> having to define at least one new migration parameter, then wire
> it up in libvirt, and then the mgmt app too, often needing to
> ensure it is turned on for both client and server at the same time.
> 
> 
> For some features, requiring an explicit opt-in could make sense,
> because we don't know for sure that the feature is always a benefit.
> These are things that can be thought of as workload sensitive
> tunables.
> 
> 
> For other features though, it feels like we would be better off if
> we could turn it on by default with no config. These are things
> that can be thought of as migration infrastructure / transport
> architectural designs.

Thanks for raising this discussion.  That's something I wanted to raise too
but I just haven't, at least formally.

Actually I think I raised this question once or twice, but I just didn't
insist on pushing it. :)

> 
> 
> eg it would be nice to be able to use multifd by default for
> migration. We would still want a tunable to control the number
> of channels, but we ought to be able to just start with a default
> number of channels automatically, so the tunable is only needed
> for special cases.

I still remember you mentioned that upper-layer software can assume
migration uses only a single socket pair; I think that makes enabling
postcopy-preempt by default impossible.

Why multifd is different here?

> 
> This post-copy is another case.  We should start off knowing
> we can switch to post-copy at any time.

This one is kind of special and it'll be harder, IMHO.

AFAIU, postcopy users will always initiate the migration with at least a
full round of precopy, in the hope that all the static guest pages will be
migrated.

It may even keep going through a 2nd or 3rd iteration until the VM admin
decides it's the right time to trigger the final postcopy phase.  So at
least in some scenarios the pre->post switch does require human
intervention.

However, we could still have a parameter for the case where the user wants
to let QEMU decide the time of the switch; we could at least consider:

  -global migration.postcopy-auto-switch=off|first|immediate

We could define "off|first|immediate" as:

  - "off": this should still be the default, meaning we need an explicit
    migrate-start-postcopy QMP command to trigger the switch

  - "first": right after we finish the 1st round of precopy migration, we
    automatically switch to postcopy

  - "immediate": the most interesting one, which I'd like to try out:
    start postcopy right away, with no precopy at all.  It further means:

    - KVM dirty tracking is not needed at all, because all pages are dirty
      by default on dest qemu, so all pages need to be requested.  This
      removes _all_ dirty tracking complexity.

    - It will be the most bandwidth-friendly solution, because literally
      each guest page is only sent once.

I may have gone a bit off-topic, especially on the "immediate" option
above, but since we're talking about auto-switching postcopy I wanted to
mention it, because it has been on my mind for a very long time...

> We should further be able to add pre-emption if we find it available.

Yeah, here I have the same question as for multifd above.  I just don't
know whether QEMU has the knowledge to make this decision.  E.g., how could
QEMU know whether the upper app is tunneling the migration stream?  How
could QEMU know whether the upper app can handle multiple TCP sockets
well?

> IOW, we should
> not have required anything more than 'switch to post-copy' to
> be exposed to mgmtm apps.
> 
> Or enabling zero copy on either send or receive side.
> 
> Or enabling kernel-TLS offload
> 
> Or ..insert other interesting protocol feature...
> 
> 
> 
> All this stems from our current migration protocol that started
> as a single unidirectional channel, which goes straight into
> the migration data stream, with no protocol handshake  and
> thus no feature negotiation either.
> 
> We've offloaded feature negotiation to libvirt and in turn to
> the mgmt app and this is awful, for thue layers above, but
> also awful for QEMU. Because multifd requires mgmt app opt-in,
> we can wait 10 years and there will still be countless apps
> using single-fd mode because they've not been updated to
> opt-in.  If we negotiated features at QEMU level we could
> have everything using multifd in a few years, and have dropped
> single-fd mode a few years later.
> 
> 
> So rather than following our historical practice, anjd adding
> yet another migration parameter for a specific feature, I'd
> really encourage us to put a stop to it and future proof
> ourselves.
> 
> 
> Introduce one *final-no-more-never-again-after-this* migration
> capability called "protocol-negotiation".

Let's see what Juan/Dave/others think.. anyway, that's something I've
always wanted.

IMHO an even simpler name could be:

  -global migration.handshake=on

But the naming is not the important part.  The idea is that the protocol
should no longer be static (based only on the caps/params set by the user)
but dynamic, depending on how the handshake/negotiation goes.

I would very much second that idea if it comes one day.

> 
> 
> When that capability is set, first declare that henceforth the
> migration transport is REQUIRED to support **multiple**,
> **bi-directional** channels.

This new capability will simply need to depend on the return-path
capability we already have.  E.g., exec-typed migration can't enable
return-path, so it's not applicable here either.

When we introduced the return-path capability, we _could_ have required a
handshake already.  We didn't, iirc, because at that time I hadn't thought
through how to define this dynamic protocol, and there was a real problem
to solve: when the dest QEMU failed to load the last phase of device
state, we used to have a bug (without the return-path capability) where
both QEMUs would quit and the VM data would be corrupted.

The return-path capability resolves that problem because it allows the
dest QEMU to send a final ACK telling the source QEMU it's safe to quit;
otherwise the src QEMU always keeps the latest guest pages (we're only
talking about precopy here, of course..).

> We might only use 1 TCP channel
> in some cases, but it declares our intent that we expect to be
> able to use as many channels as we see fit henceforth.
> 
> Now define a protocol handshake. A 5 minute thought experiment
> starts off with something simple:
> 
>    dst -> src:  Greeting Message:
>                   Magic: "QEMU-MIGRATE"  12 bytes
>                   Num Versions: 1 byte
>                   Version list: 1 byte * num versions
>                   Num features: 4 bytes
>                   Feature list: string * num features
> 
>    src -> dst:  Greeting Reply:
>                   Magic: "QEMU-MIGRATE" 12 bytes
>                   Select version: 1 byte
>                   Num select features: 4 bytes
>                   Selected features: string * num features   
> 
>    .... possibly more src <-> dst messages depending on
>         features negotiated....
> 
>    src -> dst:  start migration
>  
>     ...traditional migration stream runs now for the remainder
>        of this connection ...
> 
> 
> 
> I suggest "dst" starts first, so that connecting to a dst lets you
> easily debug whether QEMU is speaking v2 or just waiting for the
> client to send something as traditionally the case.

No strong opinion on which QEMU should start the conversation; just to
mention that we may not be able to use this to identify whether it's an old
or a new QEMU, AFAIU, because of network delays.

We can never tell whether the dest QEMU stayed silent because it's an old
binary or because it's a new binary on a high-latency network.
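For concreteness, the greeting/reply layout sketched above could be packed
and parsed along these lines (purely a hypothetical illustration of the
proposed fields, not any existing QEMU format; the big-endian feature count
and NUL-terminated feature strings are my own assumptions):

```python
import struct

MAGIC = b"QEMU-MIGRATE"  # 12 bytes, per the proposal

def pack_greeting(versions, features):
    # Magic (12B), num versions (1B), version list (1B each),
    # num features (4B, big-endian assumed), NUL-terminated feature strings.
    msg = MAGIC + struct.pack("B", len(versions)) + bytes(versions)
    msg += struct.pack(">I", len(features))
    for feat in features:
        msg += feat.encode("ascii") + b"\0"
    return msg

def unpack_greeting(msg):
    assert msg[:12] == MAGIC, "not a v2 migration greeting"
    num_versions = msg[12]
    versions = list(msg[13:13 + num_versions])
    off = 13 + num_versions
    (num_features,) = struct.unpack_from(">I", msg, off)
    off += 4
    features = []
    for _ in range(num_features):
        end = msg.index(b"\0", off)       # each feature name is NUL-terminated
        features.append(msg[off:end].decode("ascii"))
        off = end + 1
    return versions, features
```

Whichever side speaks first, a parser this small is enough to tell a v2
greeting apart from a legacy migration stream once bytes do arrive.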

> 
> This shouldn't need very much code, and it gives us flexibility
> to do all sorts of interesting things going forward with less
> overhead for everyone involved.
> 
> We can layer in a real authentication system like SASL after
> the greeting without any libvirt / mgmt app support
> 
> We can enable zero-copy at will. We can enable kernel-TLS at
> will. We can add new TCP connections for clever feature XYZ.
> 
> We get a back channel every time, so dst can pass info back
> to the src to optimize behaviour.
> 
> We can experiment with features and throw them away again
> later without involving the mgmt app, since we negotiate
> their use.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption)
  2022-03-14 18:49           ` Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption) Daniel P. Berrangé
  2022-03-15  6:13             ` Peter Xu
@ 2022-03-15 10:43             ` Dr. David Alan Gilbert
  2022-03-15 11:05               ` Daniel P. Berrangé
  1 sibling, 1 reply; 47+ messages in thread
From: Dr. David Alan Gilbert @ 2022-03-15 10:43 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Leonardo Bras Soares Passos, qemu-devel, Peter Xu, Juan Quintela

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Mar 01, 2022 at 04:51:09PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Tue, Mar 01, 2022 at 10:27:10AM +0000, Daniel P. Berrangé wrote:
> > > > > I also didn't know whether there's other limitations of it.  For example,
> > > > > will a new socket pair be a problem for any VM environment (either a
> > > > > limitation from the management app, container, and so on)?  I think it's
> > > > > the same to multifd in that aspect, but I never explored.
> > > > 
> > > > If it needs extra sockets, that is something apps will unfortunately
> > > > need to be aware of and explicitly opt in to :-( Migration is often
> > > > tunnelled/proxied over other channels, so whatever does that needs to
> > > > be aware of the possibility of seeing extra sockets.
> > > 
> > > Ah, then probably it can never be the default.  But for sure it could be
> > > nice that higher level can opt-in and make it a default at some point as
> > > long as it knows the network topology is safe to do so.
> > > 
> > > > 
> > > > > > > TODO List
> > > > > > > =========
> > > > > > > 
> > > > > > > TLS support
> > > > > > > -----------
> > > > > > > 
> > > > > > > I only noticed its missing very recently.  Since soft freeze is coming, and
> > > > > > > obviously I'm still growing this series, so I tend to have the existing
> > > > > > > material discussed. Let's see if it can still catch the train for QEMU 7.0
> > > > > > > release (soft freeze on 2022-03-08)..
> > > > > > 
> > > > > > I don't like the idea of shipping something that is only half finished.
> > > > > > It means that when apps probe for the feature, they'll see preempt
> > > > > > capability present, but have no idea whether they're using a QEMU that
> > > > > > is broken when combined with TLS or not. We shouldn't merge something
> > > > > > just to meet the soft freeze deadline if we know key features are broken.
> > > > > 
> > > > > IMHO merging and declaring support are two problems.
> > > > > 
> > > > > To me, it's always fine to merge code that implements the foundation
> > > > > of a feature.  The feature can be worked upon in the future.
> > > > > 
> > > > > Requiring a feature to be "complete" can sometimes place a burden
> > > > > not only on the author of the series but also on reviewers.  It's
> > > > > IMHO not necessary to bind these two ideas.
> > > > > 
> > > > > It's sometimes also hard to define "complete": take TLS as an
> > > > > example; probably no one even noticed that it won't work with TLS,
> > > > > and I only noticed it in the last couple of days.  We obviously can't
> > > > > merge a partial patchset, but if the patchset is well isolated, then
> > > > > it's not a blocker for merging, IMHO.
> > > > > 
> > > > > Per my understanding, what you're worried about is that we declare
> > > > > it supported but never know when TLS will be ready for it.  One
> > > > > solution is that I can rename the capability with an x- prefix, then
> > > > > drop the prefix once the TLS side is ready.  Libvirt or any mgmt
> > > > > software then doesn't need to support this until we drop the x-, so
> > > > > there's no compatibility risk.
> > > > > 
> > > > > Would that sound okay to you?
> > > > 
> > > > If it has an x- prefix then we can basically ignore it from a mgmt app
> > > > POV until it is actually finished.
> > > > 
> > > > > I can always step back and work on TLS first before it's merged, but again
> > > > > I don't think it's required.
> > > > 
> > > > Apps increasingly consider use of TLS to be a mandatory feature for
> > > > migration, so until that works, this preempt has to be considered
> > > > unsupported & unfinished IMHO. So either TLS should be ready when
> > > > it merges, or it should be clearly marked unsupported at the QAPI
> > > > level.
> > > 
> > > Yes, I fully agree with that, and for huge VM migrations I think TLS
> > > is in many cases mandatory.
> > > 
> > > I do plan to work on it right after this series lands, but as the
> > > series grows I've started to think we should land the parts that are
> > > already solid.  Landing the code has another benefit: I want to make
> > > sure the merged code at least won't affect the existing features.
> > > 
> > > What I'm curious about is why TLS has gotten quite some attention in
> > > the past few years, yet I don't see any TLS selftests included in
> > > migration-test.  That's something I want to look into, maybe even
> > > before adding the preempt+TLS support.  But maybe I just missed
> > > something, as I haven't used TLS much in the past.
> > 
> > Hmm, I think it's worth getting TLS working before putting the full
> > series in, because it might impact the way you wire the channels up -
> > it's going to take some care; but let's see which parts we can/should
> > take.
> 
> Taking a step back here and looking at the bigger picture of
> migration protocol configuration....

OK, but let's keep that as a separate discussion and not bog down Peter's
improvement with a different overhaul.

> Almost every time we add a new feature to migration, we end up
> having to define at least one new migration parameter, then wire
> it up in libvirt, and then the mgmt app too, often needing to
> ensure it is turned on for both client and server at the same time.
> 
> 
> For some features, requiring an explicit opt-in could make sense,
> because we don't know for sure that the feature is always a benefit.
> These are things that can be thought of as workload sensitive
> tunables.
> 
> 
> For other features though, it feels like we would be better off if
> we could turn it on by default with no config. These are things
> that can be thought of as migration infrastructure / transport
> architectural designs.
> 
> 
> eg it would be nice to be able to use multifd by default for
> migration. We would still want a tunable to control the number
> of channels, but we ought to be able to just start with a default
> number of channels automatically, so the tunable is only needed
> for special cases.

Right, I agree in part - but we do need those tunables to exist; we rely
on being able to turn things on or off, or play with the tunables,
to debug and get performance.  We need libvirt to enumerate the tunables
from qemu rather than having to add code to libvirt every time.
They're all in QAPI definitions anyway - libvirt really shouldn't be
adding code each time.  Then we could have a virsh migrate --tunable
rather than having loads of extra options which all have different names
from qemu's name for the same feature.

> This post-copy is another case.  We should start off knowing
> we can switch to post-copy at any time. We should further be
> able to add pre-emption if we find it available. IOW, we should
> not have required anything more than 'switch to post-copy' to
> be exposed to mgmt apps.

Some of these things are tricky; for example knowing whether or not you
can do postcopy depends on your exact memory configuration; some of that
is tricky to probe.

> Or enabling zero copy on either send or receive side.
> 
> Or enabling kernel-TLS offload

Will kernel-TLS be something you'd want to automatically turn on?
We don't know yet whether it's a good idea if you don't have hardware
support.

> Or ..insert other interesting protocol feature...
> 
> 
> 
> All this stems from our current migration protocol that started
> as a single unidirectional channel, which goes straight into
> the migration data stream, with no protocol handshake  and
> thus no feature negotiation either.
> 
> We've offloaded feature negotiation to libvirt and in turn to
> the mgmt app and this is awful, for the layers above, but
> also awful for QEMU. Because multifd requires mgmt app opt-in,
> we can wait 10 years and there will still be countless apps
> using single-fd mode because they've not been updated to
> opt-in.  If we negotiated features at QEMU level we could
> have everything using multifd in a few years, and have dropped
> single-fd mode a few years later.
> 
> 
> So rather than following our historical practice, and adding
> yet another migration parameter for a specific feature, I'd
> really encourage us to put a stop to it and future proof
> ourselves.
> 
> 
> Introduce one *final-no-more-never-again-after-this* migration
> capability called "protocol-negotiation".
> 
> 
> When that capability is set, first declare that henceforth the
> migration transport is REQUIRED to support **multiple**,
> **bi-directional** channels. We might only use 1 TCP channel
> in some cases, but it declares our intent that we expect to be
> able to use as many channels as we see fit henceforth.
> 
> Now define a protocol handshake. A 5 minute thought experiment
> starts off with something simple:
> 
>    dst -> src:  Greeting Message:
>                   Magic: "QEMU-MIGRATE"  12 bytes
>                   Num Versions: 1 byte
>                   Version list: 1 byte * num versions
>                   Num features: 4 bytes
>                   Feature list: string * num features
> 
>    src -> dst:  Greeting Reply:
>                   Magic: "QEMU-MIGRATE" 12 bytes
>                   Select version: 1 byte
>                   Num select features: 4 bytes
>                   Selected features: string * num features   
> 
>    .... possibly more src <-> dst messages depending on
>         features negotiated....
> 
>    src -> dst:  start migration
>  
>     ...traditional migration stream runs now for the remainder
>        of this connection ...

Don't worry about designing the bytes; we already have a command
structure; we just need to add a MIG_CMD_FEATURES and a
MIG_RP_MSG_FEATURES.
(I'm not sure what we need to do for RDMA, or what we do for exec: or
savevm)
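Whatever the encoding, the negotiation itself only has to compute something
like the following (a toy sketch; the selection policy shown - highest
common version, feature-set intersection - is an assumption, not an agreed
design, and MIG_CMD_FEATURES / MIG_RP_MSG_FEATURES are the hypothetical
message names from the suggestion above):

```python
def negotiate(src_versions, src_features, dst_versions, dst_features):
    """Pick the highest protocol version both sides offer, plus the
    subset of features both sides advertise."""
    common = set(src_versions) & set(dst_versions)
    if not common:
        raise ValueError("no common migration protocol version")
    # The selected version/features would then be carried in the
    # hypothetical MIG_CMD_FEATURES / MIG_RP_MSG_FEATURES messages.
    return max(common), sorted(set(src_features) & set(dst_features))
```

The point is that everything a mgmt app used to opt in to by hand would
fall out of this intersection automatically.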

> I suggest "dst" starts first, so that connecting to a dst lets you
> easily debug whether QEMU is speaking v2 or just waiting for the
> client to send something as traditionally the case.
> 
> This shouldn't need very much code, and it gives us flexibility
> to do all sorts of interesting things going forward with less
> overhead for everyone involved.
> 
> We can layer in a real authentication system like SASL after
> the greeting without any libvirt / mgmt app support
> 
> We can enable zero-copy at will. We can enable kernel-TLS at
> will. We can add new TCP connections for clever feature XYZ.
> 
> We get a back channel every time, so dst can pass info back
> to the src to optimize behaviour.
> 
> We can experiment with features and throw them away again
> later without involving the mgmt app, since we negotiate
> their use.
> 
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption)
  2022-03-15 10:43             ` Dr. David Alan Gilbert
@ 2022-03-15 11:05               ` Daniel P. Berrangé
  0 siblings, 0 replies; 47+ messages in thread
From: Daniel P. Berrangé @ 2022-03-15 11:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Leonardo Bras Soares Passos, qemu-devel, Peter Xu, Juan Quintela

On Tue, Mar 15, 2022 at 10:43:45AM +0000, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > Almost every time we add a new feature to migration, we end up
> > having to define at least one new migration parameter, then wire
> > it up in libvirt, and then the mgmt app too, often needing to
> > ensure it is turned on for both client and server at the same time.
> > 
> > 
> > For some features, requiring an explicit opt-in could make sense,
> > because we don't know for sure that the feature is always a benefit.
> > These are things that can be thought of as workload sensitive
> > tunables.
> > 
> > 
> > For other features though, it feels like we would be better off if
> > we could turn it on by default with no config. These are things
> > that can be thought of as migration infrastructure / transport
> > architectural designs.
> > 
> > 
> > eg it would be nice to be able to use multifd by default for
> > migration. We would still want a tunable to control the number
> > of channels, but we ought to be able to just start with a default
> > number of channels automatically, so the tunable is only needed
> > for special cases.
> 
> Right, I agree in part - but we do need those tunables to exist; we rely
> on being able to turn things on or off, or play with the tunables
> to debug and get performance.  We need libvirt to enumerate the tunables
> from qemu rather than having to add code to libvirt every time.
> They're all in QAPI definitions anyway - libvirt really shouldn't be
> adding code each time.   Then we could have a  virsh migrate --tunable
> rather than having loads of extra options which all have different names
> from qemu's name for the same feature.

Provided tunables are strictly just tunables, that would be viable.
Right now our tunables are a mixture of tunables and low-level
data transport architectural knobs.

> > This post-copy is another case.  We should start off knowing
> > we can switch to post-copy at any time. We should further be
> > able to add pre-emption if we find it available. IOW, we should
> > not have required anything more than 'switch to post-copy' to
> > be exposed to mgmt apps.
> 
> Some of these things are tricky; for example knowing whether or not you
> can do postcopy depends on your exact memory configuration; some of that
> is tricky to probe.

I'm just referring to the postcopy capability that we need to
set upfront before starting the migration on both sides.  IIUC
that should be possible for QEMU to figure out automatically,
if it could negotiate with the dst QEMU.

Whether we ever switch from precopy to postcopy mode once
running can remain under mgmt app control.

> > Or enabling zero copy on either send or receive side.
> > 
> > Or enabling kernel-TLS offload
> 
> Will kernel-TLS be something you'd want to automatically turn on?
> We don't know yet whether it's a good idea if you don't have hardware
> support.

I'm pretty sure kTLS will always be a benefit, because even without
hardware offload you still benefit from getting the TLS encryption
onto a separate CPU core from QEMU's migration thread. We've measured
this already with NBD and I've no reason to suspect it will differ
for migration. 


> > Now define a protocol handshake. A 5 minute thought experiment
> > starts off with something simple:
> > 
> >    dst -> src:  Greeting Message:
> >                   Magic: "QEMU-MIGRATE"  12 bytes
> >                   Num Versions: 1 byte
> >                   Version list: 1 byte * num versions
> >                   Num features: 4 bytes
> >                   Feature list: string * num features
> > 
> >    src -> dst:  Greeting Reply:
> >                   Magic: "QEMU-MIGRATE" 12 bytes
> >                   Select version: 1 byte
> >                   Num select features: 4 bytes
> >                   Selected features: string * num features   
> > 
> >    .... possibly more src <-> dst messages depending on
> >         features negotiated....
> > 
> >    src -> dst:  start migration
> >  
> >     ...traditional migration stream runs now for the remainder
> >        of this connection ...
> 
> Don't worry about designing the bytes; we already have a command
> structure; we just need to add a MIG_CMD_FEATURES and a 
> MIG_RP_MSG_FEATURES
> (I'm not sure what we need to do for RDMA; or what we do for exec: or
> savevm)

For RDMA there are two options

 - Drop RDMA support (preferred ;-)

 - Use a regular TCP channel for the migration protocol
   handshake to do all the feature negotiation.  Open a
   second channel using RDMA just for the migration payload

Before considering "exec", let's think about "fd" as that's more
critical.

How can we get an arbitrary number of bi-directional channels
open when the user is passing in pre-opened FDs individually and
does not know upfront how many QEMU wants?

We could have an event that QEMU emits whenever it wants to be
given a new "fd" channel.  The mgmt app would watch for that and
pass in more pre-opened FDs in response.  Not too difficult.

Back to "exec" we have two options

 - Drop exec support, and just let the user spawn the
   program externally and pass in a pre-opened socket
   FDs for talking to it

 - Keep exec and make it use a socketpair instead of
   pipe FDs. Connect the socketpair to both stdin+stdout.
   Exec the program many times if needing many channels.
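The second exec option could look like this in outline (a hedged sketch
only; the `exec_channel` helper name and the use of Python's subprocess
module are illustrative, not QEMU code):

```python
import socket
import subprocess

def exec_channel(argv):
    """Spawn a helper program with a socketpair wired to its stdin+stdout,
    yielding one bi-directional channel; call again for more channels."""
    parent, child = socket.socketpair()
    # Wire both stdin and stdout of the child to the same socket end,
    # so a single FD carries traffic in both directions.
    proc = subprocess.Popen(argv, stdin=child, stdout=child)
    child.close()  # the parent keeps only its own end
    return parent, proc

# Example: 'cat' simply echoes the channel, so whatever we send on our
# end comes straight back; closing our end gives the child EOF.
sock, proc = exec_channel(["cat"])
sock.sendall(b"ping")
reply = sock.recv(4)
sock.close()
proc.wait()
```

Running the program once per channel, as suggested above, is then just a
matter of calling the helper N times.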


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption)
  2022-03-15  6:13             ` Peter Xu
@ 2022-03-15 11:15               ` Daniel P. Berrangé
  2022-03-16  3:30                 ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Daniel P. Berrangé @ 2022-03-15 11:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Juan Quintela, Dr. David Alan Gilbert,
	Leonardo Bras Soares Passos, qemu-devel

On Tue, Mar 15, 2022 at 02:13:43PM +0800, Peter Xu wrote:
> On Mon, Mar 14, 2022 at 06:49:25PM +0000, Daniel P. Berrangé wrote:
> > Taking a step back here and looking at the bigger picture of
> > migration protocol configuration....
> > 
> > Almost every time we add a new feature to migration, we end up
> > having to define at least one new migration parameter, then wire
> > it up in libvirt, and then the mgmt app too, often needing to
> > ensure it is turned on for both client and server at the same time.
> > 
> > 
> > For some features, requiring an explicit opt-in could make sense,
> > because we don't know for sure that the feature is always a benefit.
> > These are things that can be thought of as workload sensitive
> > tunables.
> > 
> > 
> > For other features though, it feels like we would be better off if
> > we could turn it on by default with no config. These are things
> > that can be thought of as migration infrastructure / transport
> > architectural designs.
> 
> Thanks for raising this discussion.  That's something I wanted to raise too
> but I just haven't, at least formally.
> 
> Actually I think I raised this question once or twice, but I just didn't
> insist trying. :)
> 
> > 
> > 
> > eg it would be nice to be able to use multifd by default for
> > migration. We would still want a tunable to control the number
> > of channels, but we ought to be able to just start with a default
> > number of channels automatically, so the tunable is only needed
> > for special cases.
> 
> I still remember you mentioned that the upper-layer software can have
> assumptions about using only one socket pair for migration; I think that
> makes postcopy-preempt by default impossible.
> 
> Why multifd is different here?

It isn't different. We went through the pain of extending libvirt
to know how to open many channels for multifd. We'll have to do
the same with this postcopy-preempt. To this day though, management
apps above libvirt largely don't enable multifd, which is a real
shame. This is the key reason I think we need to handle this at
the QEMU level automatically.

> > This post-copy is another case.  We should start off knowing
> > we can switch to post-copy at any time.
> 
> This one is kind of special and it'll be harder, IMHO.
> 
> AFAIU, postcopy users will always initiate the migration with at least a
> full round of precopy, with the hope that all the static guest pages will
> be migrated.

I think I didn't explain myself properly here. Today there are
two parts to postcopy usage in libvirt

  - Pass the "VIR_MIGRATE_POSTCOPY" when starting the migration.
    The migration still runs in pre-copy mode. This merely ensures
    we configure a bi-directional socket, so the app has the option
    to switch to postcopy later

  - Invoke virDomainMigrateStartPostCopy  to flip from pre-copy
    to post-copy phase. This requires you previously passed
    VIR_MIGRATE_POSTCOPY to enable its use.

The first point, using 'VIR_MIGRATE_POSTCOPY', should not exist.
That should be automatically negotiated and handled by QEMU.

Libvirt and mgmt apps should only need to care about whether
or not they call virDomainMigrateStartPostCopy to flip to
post-copy mode.

> > We should further be able to add pre-emption if we find it available.
> 
> Yeah, here I have the same question as for multifd above.  I just have no
> idea whether QEMU has the knowledge to make this decision.  E.g., how could
> QEMU know whether the upper app is not tunneling the migration stream?  How
> could QEMU know whether the upper app can handle multiple TCP sockets
> well?

It can't do this today - that's why we need the new migration protocol
feature negotiation I describe below.

> > So rather than following our historical practice, and adding
> > yet another migration parameter for a specific feature, I'd
> > really encourage us to put a stop to it and future proof
> > ourselves.
> > 
> > 
> > Introduce one *final-no-more-never-again-after-this* migration
> > capability called "protocol-negotiation".
> 
> Let's see how Juan/Dave/others think.. anyway, that's something I always
> wanted.
> 
> IMHO an even simpler term can be as simple as:
> 
>   -global migration.handshake=on

This is just inventing a new migration capability framework. We
can just use existing QMP for this.

> > When that capability is set, first declare that henceforth the
> > migration transport is REQUIRED to support **multiple**,
> > **bi-directional** channels.
> 
> This new capability will simply need to depend on the return-path
> capability we already have.  E.g. exec-typed migration won't be able to
> enable return-path, so not applicable to this one too.

'exec' can be made to work if desired. Currently we only create
a unidirectional pipe and wire it up to stdin for outgoing
migration. Nothing stops us declaring 'exec' uses a socketpair
wired to stdin + stdout, and supporting invoking 'exec' multiple
times to get many sockets.

> > Now define a protocol handshake. A 5 minute thought experiment
> > starts off with something simple:
> > 
> >    dst -> src:  Greeting Message:
> >                   Magic: "QEMU-MIGRATE"  12 bytes
> >                   Num Versions: 1 byte
> >                   Version list: 1 byte * num versions
> >                   Num features: 4 bytes
> >                   Feature list: string * num features
> > 
> >    src -> dst:  Greeting Reply:
> >                   Magic: "QEMU-MIGRATE" 12 bytes
> >                   Select version: 1 byte
> >                   Num select features: 4 bytes
> >                   Selected features: string * num features   
> > 
> >    .... possibly more src <-> dst messages depending on
> >         features negotiated....
> > 
> >    src -> dst:  start migration
> >  
> >     ...traditional migration stream runs now for the remainder
> >        of this connection ...
> > 
> > 
> > 
> > I suggest "dst" starts first, so that connecting to a dst lets you
> > easily debug whether QEMU is speaking v2 or just waiting for the
> > client to send something as traditionally the case.
> 
> No strong opinion on which QEMU should start the conversation, just to
> mention that we may not be able to use this to identify whether it's an old
> or new QEMU, afaiu, because of network delays?
> 
> We can never tell whether the dest QEMU didn't talk is because it's an old
> binary or it's new binary but with high latency network.

Sure, you wouldn't want to rely on it functionally. It is mostly
just a debugging aid so you can port-scan and show it is the new
QEMU migration protocol, not the old one.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption)
  2022-03-15 11:15               ` Daniel P. Berrangé
@ 2022-03-16  3:30                 ` Peter Xu
  2022-03-16  9:59                   ` Daniel P. Berrangé
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2022-03-16  3:30 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, Juan Quintela, Dr. David Alan Gilbert,
	Leonardo Bras Soares Passos, qemu-devel

On Tue, Mar 15, 2022 at 11:15:41AM +0000, Daniel P. Berrangé wrote:
> > I still remember you mentioned that the upper-layer software can have
> > assumptions about using only one socket pair for migration; I think that
> > makes postcopy-preempt by default impossible.
> > 
> > Why multifd is different here?
> 
> It isn't different. We went through the pain of extending libvirt
> to know how to open many channels for multifd. We'll have to do
> the same with this postcopy-preempt. To this day though, management
> apps above libvirt largely don't enable multifd, which is a real
> shame. This is the key reason I think we need to handle this at
> the QEMU level automatically.

But I still don't understand how QEMU could know about those tunnels,
which should be beyond QEMU's awareness?

The tunneling program can be some admin-initiated socat TCP forwarding
program, which by default may not allow more than one socket pair.

Or maybe I have mis-understood on what's the tunneling we're discussing?

> 
> > > This post-copy is another case.  We should start off knowing
> > > we can switch to post-copy at any time.
> > 
> > This one is kind of special and it'll be harder, IMHO.
> > 
> > AFAIU, postcopy users will always initiate the migration with at least a
> > full round of precopy, with the hope that all the static guest pages will
> > be migrated.
> 
> I think I didn't explain myself properly here. Today there are
> two parts to postcopy usage in libvirt
> 
>   - Pass the "VIR_MIGRATE_POSTCOPY" when starting the migration.
>     The migration still runs in pre-copy mode. This merely ensures
>     we configure a bi-directional socket, so the app has the option
>     to switch to postcopy later
> 
>   - Invoke virDomainMigrateStartPostCopy  to flip from pre-copy
>     to post-copy phase. This requires you previously passed
>     VIR_MIGRATE_POSTCOPY to enable its use.
> 
> The first point, using 'VIR_MIGRATE_POSTCOPY', should not exist.
> That should be automatically negotiated and handled by QEMU.
> 
> Libvirt and mgmt apps should only need to care about whether
> or not they call virDomainMigrateStartPostCopy to flip to
> post-copy mode.

Ah I see.  I think Dave also mentioned it'll be a bit tricky to do, but
it at least sounds doable.

> 
> > > We should further be able to add pre-emption if we find it available.
> > 
> > Yeah, here I have the same question as for multifd above.  I just have
> > no idea whether QEMU has the knowledge to make this decision.  E.g., how
> > could QEMU know whether the upper app is not tunneling the migration
> > stream?  How could QEMU know whether the upper app can handle multiple
> > TCP sockets well?
> 
> It can't do this today - that's why we need the new migration protocol
> feature negotiation I describe below.
> 
> > > So rather than following our historical practice, and adding
> > > yet another migration parameter for a specific feature, I'd
> > > really encourage us to put a stop to it and future proof
> > > ourselves.
> > > 
> > > 
> > > Introduce one *final-no-more-never-again-after-this* migration
> > > capability called "protocol-negotiation".
> > 
> > Let's see how Juan/Dave/others think.. anyway, that's something I always
> > wanted.
> > 
> > IMHO an even simpler term can be as simple as:
> > 
> >   -global migration.handshake=on
> 
> This is just inventing a new migration capability framework. We
> can just use existing QMP for this.

It's not a new one; it's just that a few years ago we exported the
migration capabilities to the cmdline too (2081475841fe8), even if mostly
for debugging purposes.  In my daily tests it's quite handy.

> 
> > > When that capability is set, first declare that henceforth the
> > > migration transport is REQUIRED to support **multiple**,
> > > **bi-directional** channels.
> > 
> > This new capability will simply need to depend on the return-path
> > capability we already have.  E.g. exec-typed migration won't be able to
> > enable return-path, so not applicable to this one too.
> 
> 'exec' can be made to work if desired. Currently we only create
> a unidirectional pipe and wire it up to stdin for outgoing
> migration. Nothing stops us declaring 'exec' uses a socketpair
> wired to stdin + stdout, and supporting invoking 'exec' multiple
> times to get many sockets.

Yeah, that sounds workable; it's just that we need to have users of it first.  One
point is that exec shouldn't be used in production but for quick hacks or
experiments, so supporting new performance/enhancement features for it sounds a
bit like over-engineering unless it's explicitly useful.
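For illustration, the socketpair wiring Dan describes could look like the
minimal sketch below (Python purely for demonstration; `cat` stands in for
whatever command an `exec:` migration URI would actually spawn):

```python
import socket
import subprocess

# Bidirectional 'exec' channel: wire ONE end of a socketpair to the
# child's stdin *and* stdout; keep the other end as the migration channel.
parent_sock, child_sock = socket.socketpair()

# 'cat' is a stand-in for the real exec: command; it echoes the stream back.
child = subprocess.Popen(
    ["cat"],
    stdin=child_sock.fileno(),
    stdout=child_sock.fileno(),
)
child_sock.close()  # the child now holds its own duplicates

parent_sock.sendall(b"migration stream bytes")
parent_sock.shutdown(socket.SHUT_WR)  # EOF for the child's stdin

# Read everything echoed back over the same bidirectional channel.
reply = b""
while chunk := parent_sock.recv(4096):
    reply += chunk

parent_sock.close()
child.wait()
print(reply.decode())
```

Invoking the command several times, each with its own socketpair, would then
yield the multiple bidirectional channels needed.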

Thanks,

-- 
Peter Xu




* Re: Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption)
  2022-03-16  3:30                 ` Peter Xu
@ 2022-03-16  9:59                   ` Daniel P. Berrangé
  2022-03-16 10:40                     ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Daniel P. Berrangé @ 2022-03-16  9:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Juan Quintela, Dr. David Alan Gilbert,
	Leonardo Bras Soares Passos, qemu-devel

On Wed, Mar 16, 2022 at 11:30:59AM +0800, Peter Xu wrote:
> On Tue, Mar 15, 2022 at 11:15:41AM +0000, Daniel P. Berrangé wrote:
> > > I still remember you mentioned that upper-layer software can have the
> > > assumption of using only 1 pair of sockets for migration; I think that makes
> > > postcopy-preempt by default impossible.
> > > 
> > > Why multifd is different here?
> > 
> > It isn't different. We went through the pain of extending libvirt
> > to know how to open many channels for multifd. We'll have to do
> > the same with this postcopy-pre-empt. To this day though, management
> > apps above libvirt largely don't enable multifd, which is a real
> > shame. This is the key reason I think we need to handle this at
> > the QEMU level automatically.
> 
> But I still don't understand how QEMU could know about those tunnels, which
> should be beyond QEMU's awareness?
> 
> The tunneling program can be some admin-initiated socat TCP forwarding
> program, which by default may not allow more than one socket pair.
> 
> Or maybe I have misunderstood what tunneling we're discussing?

I don't think I was talking about tunneling at all, just QEMU
migration protocol options!

If an app is tunnelling QEMU's migration protocol over some
channel, that isn't important to QEMU - regardless whether a
passed in 'fd:' protocol FD is a direct TCP socket, or a
UNIX socket for a tunnel, QEMU works the same way. In one
of my other replies I mention a way to make 'fd:' work with
an arbitrary number of channels, by using an event from QEMU
to request the app provide additional FDs.
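Sketched as a QMP exchange, that flow could look like the following.  Only
`add-fd` is an existing QMP command; the event name and the attach command
here are hypothetical, invented for illustration:

```
<- { "event": "MIGRATION_CHANNEL_REQUEST",
     "data": { "channels-needed": 1 } }
-> { "execute": "add-fd" }                   (fd passed via SCM_RIGHTS)
<- { "return": { "fdset-id": 1, "fd": 42 } }
-> { "execute": "migrate-attach-channel",
     "arguments": { "fdset-id": 1 } }
```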


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption)
  2022-03-16  9:59                   ` Daniel P. Berrangé
@ 2022-03-16 10:40                     ` Peter Xu
  2022-03-16 11:00                       ` Daniel P. Berrangé
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2022-03-16 10:40 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, Juan Quintela, Dr. David Alan Gilbert,
	Leonardo Bras Soares Passos, qemu-devel

On Wed, Mar 16, 2022 at 09:59:28AM +0000, Daniel P. Berrangé wrote:
> On Wed, Mar 16, 2022 at 11:30:59AM +0800, Peter Xu wrote:
> > On Tue, Mar 15, 2022 at 11:15:41AM +0000, Daniel P. Berrangé wrote:
> > > > I still remember you mentioned that upper-layer software can have the
> > > > assumption of using only 1 pair of sockets for migration; I think that makes
> > > > postcopy-preempt by default impossible.
> > > > 
> > > > Why multifd is different here?
> > > 
> > > It isn't different. We went through the pain of extending libvirt
> > > to know how to open many channels for multifd. We'll have to do
> > > the same with this postcopy-pre-empt. To this day though, management
> > > apps above libvirt largely don't enable multifd, which is a real
> > > shame. This is the key reason I think we need to handle this at
> > > the QEMU level automatically.
> > 
> > But I still don't understand how QEMU could know about those tunnels, which
> > should be beyond QEMU's awareness?
> > 
> > The tunneling program can be some admin-initiated socat TCP forwarding
> > program, which by default may not allow more than one socket pair.
> > 
> > Or maybe I have misunderstood what tunneling we're discussing?
> 
> I don't think I was talking about tunneling at all, just QEMU
> migration protocol options!

Ah. :)

> 
> If an app is tunnelling QEMU's migration protocol over some
> channel, that isn't important to QEMU - regardless whether a
> passed in 'fd:' protocol FD is a direct TCP socket, or a
> UNIX socket for a tunnel, QEMU works the same way. In one
> of my other replies I mention a way to make 'fd:' work with
> an arbitrary number of channels, by using an event from QEMU
> to request the app provide additional FDs.

I very much agree with the whole concept of what you proposed, both the
new negotiation phase itself and the idea that with the negotiation phase
we can try to auto-enable some features we couldn't before.

What I wanted to express is that we can't enable either preempt mode or multifd
automatically from QEMU even with them, because these two are quite special
IMHO in that QEMU doesn't know whether the mgmt app can handle the multiple
socket pairs.  Yes, we could teach QEMU to dynamically accept new "fd"s, but
again IMHO that still needs intervention from the mgmt app.

Thanks,

-- 
Peter Xu




* Re: Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption)
  2022-03-16 10:40                     ` Peter Xu
@ 2022-03-16 11:00                       ` Daniel P. Berrangé
  2022-03-18  7:08                         ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Daniel P. Berrangé @ 2022-03-16 11:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Juan Quintela, Dr. David Alan Gilbert,
	Leonardo Bras Soares Passos, qemu-devel

On Wed, Mar 16, 2022 at 06:40:08PM +0800, Peter Xu wrote:
> On Wed, Mar 16, 2022 at 09:59:28AM +0000, Daniel P. Berrangé wrote:
> > On Wed, Mar 16, 2022 at 11:30:59AM +0800, Peter Xu wrote:
> > > On Tue, Mar 15, 2022 at 11:15:41AM +0000, Daniel P. Berrangé wrote:
> > > > > I still remember you mentioned that upper-layer software can have the
> > > > > assumption of using only 1 pair of sockets for migration; I think that makes
> > > > > postcopy-preempt by default impossible.
> > > > > 
> > > > > Why multifd is different here?
> > > > 
> > > > It isn't different. We went through the pain of extending libvirt
> > > > to know how to open many channels for multifd. We'll have to do
> > > > the same with this postcopy-pre-empt. To this day though, management
> > > > apps above libvirt largely don't enable multifd, which is a real
> > > > shame. This is the key reason I think we need to handle this at
> > > > the QEMU level automatically.
> > > 
> > > But I still don't understand how QEMU could know about those tunnels, which
> > > should be beyond QEMU's awareness?
> > > 
> > > The tunneling program can be some admin-initiated socat TCP forwarding
> > > program, which by default may not allow more than one socket pair.
> > > 
> > > Or maybe I have misunderstood what tunneling we're discussing?
> > 
> > I don't think I was talking about tunneling at all, just QEMU
> > migration protocol options!
> 
> Ah. :)
> 
> > 
> > If an app is tunnelling QEMU's migration protocol over some
> > channel, that isn't important to QEMU - regardless whether a
> > passed in 'fd:' protocol FD is a direct TCP socket, or a
> > UNIX socket for a tunnel, QEMU works the same way. In one
> > of my other replies I mention a way to make 'fd:' work with
> > an arbitrary number of channels, by using an event from QEMU
> > to request the app provide additional FDs.
> 
> I very much agree on the whole concept of what you proposed, either on the
> new negotiation phase itself, or the idea that with the negotiation phase
> we can try to auto-enable some features we not used to.
> 
> What I wanted to express is we can't enable either preempt mode or multifd
> automatically from qemu even with them, because these two are quite special
> IMHO in that qemu doesn't know whether the mgmt app can handle the multiple
> socket pairs.  Yes we could teach qemu to dynamically accept new "fd"s, but
> again IMHO that still needs to be intervened by the mgmt app.

My proposal absolutely *can* let QEMU do that automatically, and that
is one of the most important benefits of it.

[quote]
Introduce one *final-no-more-never-again-after-this* migration
capability called "protocol-negotiation".

When that capability is set, first declare that henceforth the
migration transport is REQUIRED to support **multiple**,
**bi-directional** channels. We might only use 1 TCP channel
in some cases, but it declares our intent that we expect to be
able to use as many channels as we see fit henceforth.
[/quote]

IOW, any management app that enables 'protocol-negotiation' is explicitly
declaring that it accepts the new requirement to support multiple
channels. An app which enables the 'protocol-negotiation' capability while
only allowing 1 channel is simply broken, because it would be violating
the documented requirements for the capability.
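In QMP terms the opt-in would be a one-liner — `migrate-set-capabilities` is
the existing command, while the 'protocol-negotiation' capability is the
proposed (not yet existing) one:

```
-> { "execute": "migrate-set-capabilities",
     "arguments": { "capabilities": [
       { "capability": "protocol-negotiation", "state": true } ] } }
<- { "return": {} }
```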

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption)
  2022-03-16 11:00                       ` Daniel P. Berrangé
@ 2022-03-18  7:08                         ` Peter Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2022-03-18  7:08 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, Juan Quintela, Dr. David Alan Gilbert,
	Leonardo Bras Soares Passos, qemu-devel

On Wed, Mar 16, 2022 at 11:00:53AM +0000, Daniel P. Berrangé wrote:
> On Wed, Mar 16, 2022 at 06:40:08PM +0800, Peter Xu wrote:
> > On Wed, Mar 16, 2022 at 09:59:28AM +0000, Daniel P. Berrangé wrote:
> > > On Wed, Mar 16, 2022 at 11:30:59AM +0800, Peter Xu wrote:
> > > > On Tue, Mar 15, 2022 at 11:15:41AM +0000, Daniel P. Berrangé wrote:
> > > > > > I still remember you mentioned that upper-layer software can have the
> > > > > > assumption of using only 1 pair of sockets for migration; I think that makes
> > > > > > postcopy-preempt by default impossible.
> > > > > > 
> > > > > > Why multifd is different here?
> > > > > 
> > > > > It isn't different. We went through the pain of extending libvirt
> > > > > to know how to open many channels for multifd. We'll have to do
> > > > > the same with this postcopy-pre-empt. To this day though, management
> > > > > apps above libvirt largely don't enable multifd, which is a real
> > > > > shame. This is the key reason I think we need to handle this at
> > > > > the QEMU level automatically.
> > > > 
> > > > But I still don't understand how QEMU could know about those tunnels, which
> > > > should be beyond QEMU's awareness?
> > > > 
> > > > The tunneling program can be some admin-initiated socat TCP forwarding
> > > > program, which by default may not allow more than one socket pair.
> > > > 
> > > > Or maybe I have misunderstood what tunneling we're discussing?
> > > 
> > > I don't think I was talking about tunneling at all, just QEMU
> > > migration protocol options!
> > 
> > Ah. :)
> > 
> > > 
> > > If an app is tunnelling QEMU's migration protocol over some
> > > channel, that isn't important to QEMU - regardless whether a
> > > passed in 'fd:' protocol FD is a direct TCP socket, or a
> > > UNIX socket for a tunnel, QEMU works the same way. In one
> > > of my other replies I mention a way to make 'fd:' work with
> > > an arbitrary number of channels, by using an event from QEMU
> > > to request the app provide additional FDs.
> > 
> > I very much agree with the whole concept of what you proposed, both the
> > new negotiation phase itself and the idea that with the negotiation phase
> > we can try to auto-enable some features we couldn't before.
> > 
> > What I wanted to express is that we can't enable either preempt mode or multifd
> > automatically from QEMU even with them, because these two are quite special
> > IMHO in that QEMU doesn't know whether the mgmt app can handle the multiple
> > socket pairs.  Yes, we could teach QEMU to dynamically accept new "fd"s, but
> > again IMHO that still needs intervention from the mgmt app.
> 
> My proposal absolutely *can* let QEMU do that automatically, and that
> is one of the most important benefits of it.
> 
> [quote]
> Introduce one *final-no-more-never-again-after-this* migration
> capability called "protocol-negotiation".
> 
> When that capability is set, first declare that henceforth the
> migration transport is REQUIRED to support **multiple**,
> **bi-directional** channels. We might only use 1 TCP channel
> in some cases, but it declares our intent that we expect to be
> able to use as many channels as we see fit henceforth.
> [/quote]
> 
> IOW, any management app that enables 'protocol-negotiation' is explicitly
> declaring that it accepts the new requirement to support multiple
> channels. An app which enables the 'protocol-negotiation' capability while
> only allowing 1 channel is simply broken, because it would be violating
> the documented requirements for the capability.

Sorry, I mysteriously overlooked that paragraph.. it's just that from the
wording, "negotiation" shouldn't rely on multiple sockets, since by the
literal meaning any bidirectional channel should be negotiable.  But I
see what you mean now..  Thanks,

-- 
Peter Xu





Thread overview: 47+ messages
2022-03-01  8:39 [PATCH v2 00/25] migration: Postcopy Preemption Peter Xu
2022-03-01  8:39 ` [PATCH v2 01/25] migration: Dump sub-cmd name in loadvm_process_command tp Peter Xu
2022-03-01  8:39 ` [PATCH v2 02/25] migration: Finer grained tracepoints for POSTCOPY_LISTEN Peter Xu
2022-03-01  8:39 ` [PATCH v2 03/25] migration: Tracepoint change in postcopy-run bottom half Peter Xu
2022-03-01  8:39 ` [PATCH v2 04/25] migration: Introduce postcopy channels on dest node Peter Xu
2022-03-01  8:39 ` [PATCH v2 05/25] migration: Dump ramblock and offset too when non-same-page detected Peter Xu
2022-03-01  8:39 ` [PATCH v2 06/25] migration: Add postcopy_thread_create() Peter Xu
2022-03-01  8:39 ` [PATCH v2 07/25] migration: Move static var in ram_block_from_stream() into global Peter Xu
2022-03-01  8:39 ` [PATCH v2 08/25] migration: Add pss.postcopy_requested status Peter Xu
2022-03-01  8:39 ` [PATCH v2 09/25] migration: Move migrate_allow_multifd and helpers into migration.c Peter Xu
2022-03-01  8:39 ` [PATCH v2 10/25] migration: Enlarge postcopy recovery to capture !-EIO too Peter Xu
2022-03-01  8:39 ` [PATCH v2 11/25] migration: postcopy_pause_fault_thread() never fails Peter Xu
2022-03-01  8:39 ` [PATCH v2 12/25] migration: Export ram_load_postcopy() Peter Xu
2022-03-01  8:39 ` [PATCH v2 13/25] migration: Move channel setup out of postcopy_try_recover() Peter Xu
2022-03-01  8:39 ` [PATCH v2 14/25] migration: Add migration_incoming_transport_cleanup() Peter Xu
2022-03-01  8:39 ` [PATCH v2 15/25] migration: Allow migrate-recover to run multiple times Peter Xu
2022-03-01  8:39 ` [PATCH v2 16/25] migration: Add postcopy-preempt capability Peter Xu
2022-03-01  8:39 ` [PATCH v2 17/25] migration: Postcopy preemption preparation on channel creation Peter Xu
2022-03-01  8:39 ` [PATCH v2 18/25] migration: Postcopy preemption enablement Peter Xu
2022-03-01  8:39 ` [PATCH v2 19/25] migration: Postcopy recover with preempt enabled Peter Xu
2022-03-01  8:39 ` [PATCH v2 20/25] migration: Create the postcopy preempt channel asynchronously Peter Xu
2022-03-01  8:39 ` [PATCH v2 21/25] migration: Parameter x-postcopy-preempt-break-huge Peter Xu
2022-03-01  8:39 ` [PATCH v2 22/25] migration: Add helpers to detect TLS capability Peter Xu
2022-03-01  8:39 ` [PATCH v2 23/25] migration: Fail postcopy preempt with TLS for now Peter Xu
2022-03-01  8:39 ` [PATCH v2 24/25] tests: Add postcopy preempt test Peter Xu
2022-03-01  8:39 ` [PATCH v2 25/25] tests: Pass in MigrateStart** into test_migrate_start() Peter Xu
2022-03-02 12:11   ` Dr. David Alan Gilbert
2022-03-01  9:25 ` [PATCH v2 00/25] migration: Postcopy Preemption Daniel P. Berrangé
2022-03-01 10:17   ` Peter Xu
2022-03-01 10:27     ` Daniel P. Berrangé
2022-03-01 10:55       ` Peter Xu
2022-03-01 16:51         ` Dr. David Alan Gilbert
2022-03-02  1:46           ` Peter Xu
2022-03-14 18:49           ` Time to introduce a migration protocol negotiation (Re: [PATCH v2 00/25] migration: Postcopy Preemption) Daniel P. Berrangé
2022-03-15  6:13             ` Peter Xu
2022-03-15 11:15               ` Daniel P. Berrangé
2022-03-16  3:30                 ` Peter Xu
2022-03-16  9:59                   ` Daniel P. Berrangé
2022-03-16 10:40                     ` Peter Xu
2022-03-16 11:00                       ` Daniel P. Berrangé
2022-03-18  7:08                         ` Peter Xu
2022-03-15 10:43             ` Dr. David Alan Gilbert
2022-03-15 11:05               ` Daniel P. Berrangé
2022-03-01 18:05         ` [PATCH v2 00/25] migration: Postcopy Preemption Daniel P. Berrangé
2022-03-02  1:48           ` Peter Xu
2022-03-02 12:14 ` Dr. David Alan Gilbert
2022-03-02 12:34   ` Peter Xu
