* [PATCH RFC 00/15] migration: Postcopy Preemption
@ 2022-01-19  8:09 Peter Xu
  2022-01-19  8:09 ` [PATCH RFC 01/15] migration: No off-by-one for pss->page update in host page size Peter Xu
                   ` (15 more replies)
  0 siblings, 16 replies; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Based-on: <20211224065000.97572-1-peterx@redhat.com>

Human version - This patchset is based on:
  https://lore.kernel.org/qemu-devel/20211224065000.97572-1-peterx@redhat.com/

This series can also be found here:
  https://github.com/xzpeter/qemu/tree/postcopy-preempt

Abstract
========

This series adds a new migration capability called "postcopy-preempt".  It can
be enabled when postcopy is enabled, and it simply (but greatly) speeds up the
handling of postcopy page requests.
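
For reference, a hypothetical way to enable it from the HMP monitor (the
capability name comes from patch 13; the exact spelling on a given QEMU build
may differ):

  (qemu) migrate_set_capability postcopy-ram on       (on both sides)
  (qemu) migrate_set_capability postcopy-preempt on   (on both sides)
  (qemu) migrate -d tcp:<dst>:<port>                  (on the source)
  (qemu) migrate_start_postcopy                       (on the source)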

Some quick tests below measuring postcopy page request latency:

  - Guest config: 20G guest, 40 vcpus
  - Host config: 10Gbps host NIC attached between src/dst
  - Workload: one busy dirty thread, writing to 18G of memory (pre-faulted).
    (refers to the "2M/4K huge page, 1 dirty thread" tests below)
  - Script: see [1]

  |----------------+--------------+-----------------------|
  | Host page size | Vanilla (ms) | Postcopy Preempt (ms) |
  |----------------+--------------+-----------------------|
  | 2M             |        10.58 |                  4.96 |
  | 4K             |        10.68 |                  0.57 |
  |----------------+--------------+-----------------------|

For 2M host pages, that is roughly a 2x speedup (10.58ms -> 4.96ms); for 4K
pages, roughly an 18x speedup (10.68ms -> 0.57ms).

For more information on the testing, please refer to "Test Results" below.

Design
======

The postcopy-preempt feature contains two major reworks of postcopy page fault
handling:

    (1) Postcopy page requests are now served over a different socket from
        the precopy background migration stream, so page request handling is
        isolated from the (potentially very long) delays of that stream.

    (2) For huge page enabled hosts: postcopy requests can now interrupt
        (preempt) a partial send of a huge host page on the src QEMU.

The design is relatively straightforward; however, there are a number of small
implementation details that the patchset needs to address.  Many of them are
addressed as separate patches, while the rest is handled mostly in the big
patch that enables the whole feature.
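
To illustrate rework (1), here is a minimal C sketch of the core idea - all
names are invented for illustration and are not the actual patchset API: the
send path picks an output stream per page, so urgent pages never wait behind
the background precopy stream.

    #include <stdbool.h>

    /* Sketch only: QEMUFile stands in for the real migration stream type. */
    typedef struct QEMUFile QEMUFile;

    typedef struct {
        QEMUFile *precopy;    /* existing background migration stream */
        QEMUFile *postcopy;   /* new, dedicated page-request stream */
    } MigChannels;

    /* Pages explicitly faulted on by the dest bypass the precopy queue. */
    static QEMUFile *pick_channel(MigChannels *ch, bool postcopy_requested)
    {
        return postcopy_requested ? ch->postcopy : ch->precopy;
    }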

Postcopy recovery is not yet supported; it'll be done after the solution gets
some initial review.

Patch layout
============

The first 10 (out of 15) patches are mostly suitable for merging even without
the new feature, so they can be reviewed earlier.

Patches 11-14 implement the new feature: patches 11-13 are small preparation
patches, while the major change is in patch 14.

Patch 15 is a unit test.

Test Results
============

When measuring page request latency, I trapped the userfaultfd kernel faults
using the bpf script [1].  I ignored kvm fast page faults, because when one
happens it means no major/real page fault was needed, IOW, no query to the
src QEMU.

The numbers (and histograms) captured below are based on whole postcopy
migration runs, sampled with different configurations, from which the average
page request latency was calculated.  I also captured the latency
distributions; they are interesting to look at too.

One thing to mention is that I didn't test 1G pages.  That doesn't mean this
series won't help 1G - I believe it'll help no less than what I've tested -
it's just that for 1G huge pages the latency will be >1sec on a 10Gbps NIC,
so it's not really a usable scenario for any sensible user.
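
A back-of-the-envelope check for that claim (payload only, ignoring protocol
overhead and queuing):

  1G huge page = 8 * 2^30 bits ~= 8.6 Gbit
  8.6 Gbit / 10 Gbps ~= 0.86 sec on the wire per page

so with the shared stream and queuing on top, >1sec per request is expected.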

~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2M huge page, 1 dirty thread
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With vanilla postcopy:

Average: 10582 (us)

@delay_us:
[1K, 2K)               7 |                                                    |
[2K, 4K)               1 |                                                    |
[4K, 8K)               9 |                                                    |
[8K, 16K)           1983 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

With postcopy-preempt:

Average: 4960 (us)

@delay_us:
[1K, 2K)               5 |                                                    |
[2K, 4K)              44 |                                                    |
[4K, 8K)            3495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)            154 |@@                                                  |
[16K, 32K)             1 |                                                    |

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4K small page, 1 dirty thread
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With vanilla postcopy:

Average: 10676 (us)

@delay_us:
[4, 8)                 1 |                                                    |
[8, 16)                3 |                                                    |
[16, 32)               5 |                                                    |
[32, 64)               3 |                                                    |
[64, 128)             12 |                                                    |
[128, 256)            10 |                                                    |
[256, 512)            27 |                                                    |
[512, 1K)              5 |                                                    |
[1K, 2K)              11 |                                                    |
[2K, 4K)              17 |                                                    |
[4K, 8K)              10 |                                                    |
[8K, 16K)           2681 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)             6 |                                                    |

With postcopy-preempt:

Average: 570 (us)

@delay_us:
[16, 32)               5 |                                                    |
[32, 64)               6 |                                                    |
[64, 128)           8340 |@@@@@@@@@@@@@@@@@@                                  |
[128, 256)         23052 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)          8119 |@@@@@@@@@@@@@@@@@@                                  |
[512, 1K)            148 |                                                    |
[1K, 2K)             759 |@                                                   |
[2K, 4K)            6729 |@@@@@@@@@@@@@@@                                     |
[4K, 8K)              80 |                                                    |
[8K, 16K)            115 |                                                    |
[16K, 32K)            32 |                                                    |

One funny thing about 4K small pages is that with vanilla postcopy I didn't
even get a speedup compared to 2M pages, probably because the major overhead
is not sending the page itself, but other things (e.g. waiting for precopy to
flush the existing pages).

The other thing is that in the postcopy-preempt test I can still see a bunch
of page requests with 2ms-4ms latency.  That's probably what we'd like to dig
into next.  One possibility is that, since we share the same sending thread
on the src QEMU, we may have yielded because the precopy socket was full.
But that's TBD.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4K small page, 16 dirty threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As an extra test, I used 16 concurrent faulting threads, in which case the
postcopy request queue can grow relatively long.  It's done via:

  $ stress -m 16 --vm-bytes 1073741824 --vm-keep

(16 workers, each repeatedly dirtying its own 1GB buffer; --vm-keep avoids
re-allocating the buffer between rounds.)

With vanilla postcopy:

Average: 2244 (us)

@delay_us:
[0]                  556 |                                                    |
[1]                11251 |@@@@@@@@@@@@                                        |
[2, 4)             12094 |@@@@@@@@@@@@@                                       |
[4, 8)             12234 |@@@@@@@@@@@@@                                       |
[8, 16)            47144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)           42281 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[32, 64)           17676 |@@@@@@@@@@@@@@@@@@@                                 |
[64, 128)            952 |@                                                   |
[128, 256)           405 |                                                    |
[256, 512)           779 |                                                    |
[512, 1K)           1003 |@                                                   |
[1K, 2K)            1976 |@@                                                  |
[2K, 4K)            4865 |@@@@@                                               |
[4K, 8K)            5892 |@@@@@@                                              |
[8K, 16K)          26941 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[16K, 32K)           844 |                                                    |
[32K, 64K)            17 |                                                    |

With postcopy-preempt:

Average: 1064 (us)

@delay_us:
[0]                 1341 |                                                    |
[1]                30211 |@@@@@@@@@@@@                                        |
[2, 4)             32934 |@@@@@@@@@@@@@                                       |
[4, 8)             21295 |@@@@@@@@                                            |
[8, 16)           130774 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)           95128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[32, 64)           49591 |@@@@@@@@@@@@@@@@@@@                                 |
[64, 128)           3921 |@                                                   |
[128, 256)          1066 |                                                    |
[256, 512)          2730 |@                                                   |
[512, 1K)           1849 |                                                    |
[1K, 2K)             512 |                                                    |
[2K, 4K)            2355 |                                                    |
[4K, 8K)           48812 |@@@@@@@@@@@@@@@@@@@                                 |
[8K, 16K)          10026 |@@@                                                 |
[16K, 32K)           810 |                                                    |
[32K, 64K)            68 |                                                    |

A funny thing in this specific case is that, when there are tons of postcopy
requests, vanilla postcopy handles page requests even faster on average (2ms)
than when there's only 1 dirty thread (10ms).  That's probably because
unqueue_page() will almost always hit, so the precopy stream has less of an
effect on postcopy.  However, that's still slower than having a standalone
postcopy stream, as the preempt version does (1ms).

Any comments are welcome.

[1] https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf

Peter Xu (15):
  migration: No off-by-one for pss->page update in host page size
  migration: Allow pss->page jump over clean pages
  migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat
  migration: Add postcopy_has_request()
  migration: Simplify unqueue_page()
  migration: Move temp page setup and cleanup into separate functions
  migration: Introduce postcopy channels on dest node
  migration: Dump ramblock and offset too when non-same-page detected
  migration: Add postcopy_thread_create()
  migration: Move static var in ram_block_from_stream() into global
  migration: Add pss.postcopy_requested status
  migration: Move migrate_allow_multifd and helpers into migration.c
  migration: Add postcopy-preempt capability
  migration: Postcopy preemption on separate channel
  tests: Add postcopy preempt test

 migration/migration.c        | 107 +++++++--
 migration/migration.h        |  55 ++++-
 migration/multifd.c          |  19 +-
 migration/multifd.h          |   2 -
 migration/postcopy-ram.c     | 192 ++++++++++++----
 migration/postcopy-ram.h     |  14 ++
 migration/ram.c              | 417 ++++++++++++++++++++++++++++-------
 migration/ram.h              |   2 +
 migration/savevm.c           |  12 +-
 migration/socket.c           |  18 ++
 migration/socket.h           |   1 +
 migration/trace-events       |  12 +-
 qapi/migration.json          |   8 +-
 tests/qtest/migration-test.c |  21 ++
 14 files changed, 716 insertions(+), 164 deletions(-)

-- 
2.32.0




* [PATCH RFC 01/15] migration: No off-by-one for pss->page update in host page size
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-01-19 12:58   ` Dr. David Alan Gilbert
  2022-01-27  9:40   ` Juan Quintela
  2022-01-19  8:09 ` [PATCH RFC 02/15] migration: Allow pss->page jump over clean pages Peter Xu
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC
  To: qemu-devel
  Cc: Juan Quintela, Kunkun Jiang, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos, Keqian Zhu, Andrey Gruzdev

We used to do an off-by-one fixup for pss->page when finishing a host huge
page transfer.  That seems entirely unnecessary.  Drop it.

Cc: Keqian Zhu <zhukeqian1@huawei.com>
Cc: Kunkun Jiang <jiangkunkun@huawei.com>
Cc: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 5234d1ece1..381ad56d26 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1611,7 +1611,7 @@ static int ram_save_release_protection(RAMState *rs, PageSearchStatus *pss,
     /* Check if page is from UFFD-managed region. */
     if (pss->block->flags & RAM_UF_WRITEPROTECT) {
         void *page_address = pss->block->host + (start_page << TARGET_PAGE_BITS);
-        uint64_t run_length = (pss->page - start_page + 1) << TARGET_PAGE_BITS;
+        uint64_t run_length = (pss->page - start_page) << TARGET_PAGE_BITS;
 
         /* Flush async buffers before un-protect. */
         qemu_fflush(rs->f);
@@ -2230,7 +2230,7 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
              offset_in_ramblock(pss->block,
                                 ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
     /* The offset we leave with is the min boundary of host page and block */
-    pss->page = MIN(pss->page, hostpage_boundary) - 1;
+    pss->page = MIN(pss->page, hostpage_boundary);
 
     res = ram_save_release_protection(rs, pss, start_page);
     return (res < 0 ? res : pages);
-- 
2.32.0




* [PATCH RFC 02/15] migration: Allow pss->page jump over clean pages
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
  2022-01-19  8:09 ` [PATCH RFC 01/15] migration: No off-by-one for pss->page update in host page size Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-01-19 13:42   ` Dr. David Alan Gilbert
  2022-01-19  8:09 ` [PATCH RFC 03/15] migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat Peter Xu
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Kunkun Jiang, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos, Keqian Zhu

Commit ba1b7c812c ("migration/ram: Optimize ram_save_host_page()") optimized
the host huge page use case by scanning the dirty bitmap when looking for the
next dirty small page to migrate.

However, when updating pss->page before returning from that function, we take
the MIN() of two values: (1) the next dirty bit, or (2) the end of the huge
page just sent, to fix up pss->page.

That seems unnecessary, because I see nothing that requires pss->page to stay
within the current huge page boundary.

What we need here is probably MAX() instead of MIN(), so that next time we
start scanning from the next dirty bit.  And since pss->page can't be smaller
than hostpage_boundary (the loop guarantees it), that means we don't need to
fix it up at all.
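
As an illustration (numbers invented): take a 2M host page covering small
pages 0..511, with the next dirty bit at page 1000.  When the sending loop
exits, migration_bitmap_find_dirty() has already advanced pss->page to 1000;
clamping it back to the boundary (512) only forces the next iteration to
rescan the clean range 512..999, while without the fixup we resume directly
at page 1000.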

Cc: Keqian Zhu <zhukeqian1@huawei.com>
Cc: Kunkun Jiang <jiangkunkun@huawei.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 381ad56d26..94b0ad4234 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2229,8 +2229,6 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
     } while ((pss->page < hostpage_boundary) &&
              offset_in_ramblock(pss->block,
                                 ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
-    /* The offset we leave with is the min boundary of host page and block */
-    pss->page = MIN(pss->page, hostpage_boundary);
 
     res = ram_save_release_protection(rs, pss, start_page);
     return (res < 0 ? res : pages);
-- 
2.32.0




* [PATCH RFC 03/15] migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
  2022-01-19  8:09 ` [PATCH RFC 01/15] migration: No off-by-one for pss->page update in host page size Peter Xu
  2022-01-19  8:09 ` [PATCH RFC 02/15] migration: Allow pss->page jump over clean pages Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-01-19 14:15   ` Dr. David Alan Gilbert
  2022-01-27  9:40   ` Juan Quintela
  2022-01-19  8:09 ` [PATCH RFC 04/15] migration: Add postcopy_has_request() Peter Xu
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

This patch allows us to read the fault thread id even without the blocktime
feature enabled.  It's useful when tracing the postcopy fault thread on
faulted pages, so the thread id can be shown along with the address.

Remove the comments - they're not really helpful.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index d18b5d05b2..2176ed68a5 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -283,15 +283,13 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
     }
 
 #ifdef UFFD_FEATURE_THREAD_ID
-    if (migrate_postcopy_blocktime() && mis &&
-        UFFD_FEATURE_THREAD_ID & supported_features) {
-        /* kernel supports that feature */
-        /* don't create blocktime_context if it exists */
-        if (!mis->blocktime_ctx) {
-            mis->blocktime_ctx = blocktime_context_new();
-        }
-
+    if (UFFD_FEATURE_THREAD_ID & supported_features) {
         asked_features |= UFFD_FEATURE_THREAD_ID;
+        if (migrate_postcopy_blocktime()) {
+            if (!mis->blocktime_ctx) {
+                mis->blocktime_ctx = blocktime_context_new();
+            }
+        }
     }
 #endif
 
-- 
2.32.0




* [PATCH RFC 04/15] migration: Add postcopy_has_request()
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (2 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 03/15] migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-01-19 14:27   ` Dr. David Alan Gilbert
  2022-01-27  9:41   ` Juan Quintela
  2022-01-19  8:09 ` [PATCH RFC 05/15] migration: Simplify unqueue_page() Peter Xu
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Add a helper to detect whether postcopy has a pending request.

While at it, clean up the code a bit: in unqueue_page() we shouldn't need to
re-check for an empty queue after taking the lock, because we're the only one
(besides the cleanup code, which should never run during this process) that
takes requests off the list, so the request list can only grow, never shrink,
under the hood.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c | 45 ++++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 94b0ad4234..dc6ba041fa 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -354,6 +354,12 @@ static RAMState *ram_state;
 
 static NotifierWithReturnList precopy_notifier_list;
 
+/* Whether postcopy has queued requests? */
+static bool postcopy_has_request(RAMState *rs)
+{
+    return !QSIMPLEQ_EMPTY_ATOMIC(&rs->src_page_requests);
+}
+
 void precopy_infrastructure_init(void)
 {
     notifier_with_return_list_init(&precopy_notifier_list);
@@ -1533,28 +1539,33 @@ static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
  */
 static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
 {
+    struct RAMSrcPageRequest *entry;
     RAMBlock *block = NULL;
 
-    if (QSIMPLEQ_EMPTY_ATOMIC(&rs->src_page_requests)) {
+    if (!postcopy_has_request(rs)) {
         return NULL;
     }
 
     QEMU_LOCK_GUARD(&rs->src_page_req_mutex);
-    if (!QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
-        struct RAMSrcPageRequest *entry =
-                                QSIMPLEQ_FIRST(&rs->src_page_requests);
-        block = entry->rb;
-        *offset = entry->offset;
-
-        if (entry->len > TARGET_PAGE_SIZE) {
-            entry->len -= TARGET_PAGE_SIZE;
-            entry->offset += TARGET_PAGE_SIZE;
-        } else {
-            memory_region_unref(block->mr);
-            QSIMPLEQ_REMOVE_HEAD(&rs->src_page_requests, next_req);
-            g_free(entry);
-            migration_consume_urgent_request();
-        }
+
+    /*
+     * This should _never_ change even after we take the lock, because no one
+     * should be taking anything off the request list other than us.
+     */
+    assert(postcopy_has_request(rs));
+
+    entry = QSIMPLEQ_FIRST(&rs->src_page_requests);
+    block = entry->rb;
+    *offset = entry->offset;
+
+    if (entry->len > TARGET_PAGE_SIZE) {
+        entry->len -= TARGET_PAGE_SIZE;
+        entry->offset += TARGET_PAGE_SIZE;
+    } else {
+        memory_region_unref(block->mr);
+        QSIMPLEQ_REMOVE_HEAD(&rs->src_page_requests, next_req);
+        g_free(entry);
+        migration_consume_urgent_request();
     }
 
     return block;
@@ -2996,7 +3007,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
         t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
         i = 0;
         while ((ret = qemu_file_rate_limit(f)) == 0 ||
-                !QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
+               postcopy_has_request(rs)) {
             int pages;
 
             if (qemu_file_get_error(f)) {
-- 
2.32.0




* [PATCH RFC 05/15] migration: Simplify unqueue_page()
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (3 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 04/15] migration: Add postcopy_has_request() Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-01-19 16:36   ` Dr. David Alan Gilbert
  2022-01-27  9:41   ` Juan Quintela
  2022-01-19  8:09 ` [PATCH RFC 06/15] migration: Move temp page setup and cleanup into separate functions Peter Xu
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

This patch simplifies unqueue_page() on both sides of it (the function
itself, and its caller).

Firstly, right after unqueue_page() returns a block, we will definitely send
a whole huge page (see the ram_save_huge_page() call - it never exits before
finishing that huge page), so unqueue_page() does not need to step in small
page size units when huge pages are enabled on the ramblock.  IOW, only the
1st 4K page of a queued huge page request is ever useful; when unqueuing the
2nd+ entries we'd only find that the whole huge page has already been sent.
Stepping per huge page removes a lot of redundant unqueue_page() loops.

Meanwhile, drop the dirty check.  It's not helpful to call test_bit() on
every page to jump over clean pages, as ram_save_host_page() already does
so, in a faster way (see commit ba1b7c812c ("migration/ram: Optimize
ram_save_host_page()", 2021-05-13)).  So that's not necessary either.

Drop the two tracepoints along the way - based on the analysis above, it's
very possible that no one is really using them.
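
As an illustration: a queued request covering one 2M huge page used to be
consumed in up to 512 steps of TARGET_PAGE_SIZE (4K), where every step after
the first found the huge page already sent; stepping by qemu_ram_pagesize()
consumes the same request in a single unqueue.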

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c        | 34 ++++++++--------------------------
 migration/trace-events |  2 --
 2 files changed, 8 insertions(+), 28 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index dc6ba041fa..0df15ff663 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1541,6 +1541,7 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
 {
     struct RAMSrcPageRequest *entry;
     RAMBlock *block = NULL;
+    size_t page_size;
 
     if (!postcopy_has_request(rs)) {
         return NULL;
@@ -1557,10 +1558,13 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
     entry = QSIMPLEQ_FIRST(&rs->src_page_requests);
     block = entry->rb;
     *offset = entry->offset;
+    page_size = qemu_ram_pagesize(block);
+    /* Each page request should only be multiple page size of the ramblock */
+    assert((entry->len % page_size) == 0);
 
-    if (entry->len > TARGET_PAGE_SIZE) {
-        entry->len -= TARGET_PAGE_SIZE;
-        entry->offset += TARGET_PAGE_SIZE;
+    if (entry->len > page_size) {
+        entry->len -= page_size;
+        entry->offset += page_size;
     } else {
         memory_region_unref(block->mr);
         QSIMPLEQ_REMOVE_HEAD(&rs->src_page_requests, next_req);
@@ -1942,30 +1946,8 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
 {
     RAMBlock  *block;
     ram_addr_t offset;
-    bool dirty;
 
-    do {
-        block = unqueue_page(rs, &offset);
-        /*
-         * We're sending this page, and since it's postcopy nothing else
-         * will dirty it, and we must make sure it doesn't get sent again
-         * even if this queue request was received after the background
-         * search already sent it.
-         */
-        if (block) {
-            unsigned long page;
-
-            page = offset >> TARGET_PAGE_BITS;
-            dirty = test_bit(page, block->bmap);
-            if (!dirty) {
-                trace_get_queued_page_not_dirty(block->idstr, (uint64_t)offset,
-                                                page);
-            } else {
-                trace_get_queued_page(block->idstr, (uint64_t)offset, page);
-            }
-        }
-
-    } while (block && !dirty);
+    block = unqueue_page(rs, &offset);
 
     if (!block) {
         /*
diff --git a/migration/trace-events b/migration/trace-events
index e165687af2..3a9b3567ae 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -85,8 +85,6 @@ put_qlist_end(const char *field_name, const char *vmsd_name) "%s(%s)"
 qemu_file_fclose(void) ""
 
 # ram.c
-get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
-get_queued_page_not_dirty(const char *block_name, uint64_t tmp_offset, unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
 migration_bitmap_sync_start(void) ""
 migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64
 migration_bitmap_clear_dirty(char *str, uint64_t start, uint64_t size, unsigned long page) "rb %s start 0x%"PRIx64" size 0x%"PRIx64" page 0x%lx"
-- 
2.32.0




* [PATCH RFC 06/15] migration: Move temp page setup and cleanup into separate functions
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (4 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 05/15] migration: Simplify unqueue_page() Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-01-19 16:58   ` Dr. David Alan Gilbert
  2022-01-27  9:43   ` Juan Quintela
  2022-01-19  8:09 ` [PATCH RFC 07/15] migration: Introduce postcopy channels on dest node Peter Xu
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

The number of temp pages will need to grow if we want multiple channels for
postcopy, because each channel will need its own temp page to cache huge page
data.

Before doing that, clean up the related code.  No functional change intended.

While at it, touch up the errno handling a little bit on the setup side.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 82 +++++++++++++++++++++++++---------------
 1 file changed, 51 insertions(+), 31 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 2176ed68a5..e662dd05cc 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -523,6 +523,19 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis)
     return 0;
 }
 
+static void postcopy_temp_pages_cleanup(MigrationIncomingState *mis)
+{
+    if (mis->postcopy_tmp_page) {
+        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
+        mis->postcopy_tmp_page = NULL;
+    }
+
+    if (mis->postcopy_tmp_zero_page) {
+        munmap(mis->postcopy_tmp_zero_page, mis->largest_page_size);
+        mis->postcopy_tmp_zero_page = NULL;
+    }
+}
+
 /*
  * At the end of a migration where postcopy_ram_incoming_init was called.
  */
@@ -564,14 +577,8 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
         }
     }
 
-    if (mis->postcopy_tmp_page) {
-        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
-        mis->postcopy_tmp_page = NULL;
-    }
-    if (mis->postcopy_tmp_zero_page) {
-        munmap(mis->postcopy_tmp_zero_page, mis->largest_page_size);
-        mis->postcopy_tmp_zero_page = NULL;
-    }
+    postcopy_temp_pages_cleanup(mis);
+
     trace_postcopy_ram_incoming_cleanup_blocktime(
             get_postcopy_total_blocktime());
 
@@ -1082,6 +1089,40 @@ retry:
     return NULL;
 }
 
+static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
+{
+    int err;
+
+    mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
+                                  PROT_READ | PROT_WRITE,
+                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (mis->postcopy_tmp_page == MAP_FAILED) {
+        err = errno;
+        mis->postcopy_tmp_page = NULL;
+        error_report("%s: Failed to map postcopy_tmp_page %s",
+                     __func__, strerror(err));
+        return -err;
+    }
+
+    /*
+     * Map large zero page when kernel can't use UFFDIO_ZEROPAGE for hugepages
+     */
+    mis->postcopy_tmp_zero_page = mmap(NULL, mis->largest_page_size,
+                                       PROT_READ | PROT_WRITE,
+                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (mis->postcopy_tmp_zero_page == MAP_FAILED) {
+        err = errno;
+        mis->postcopy_tmp_zero_page = NULL;
+        error_report("%s: Failed to map large zero page %s",
+                     __func__, strerror(err));
+        return -err;
+    }
+
+    memset(mis->postcopy_tmp_zero_page, '\0', mis->largest_page_size);
+
+    return 0;
+}
+
 int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
 {
     /* Open the fd for the kernel to give us userfaults */
@@ -1122,32 +1163,11 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
         return -1;
     }
 
-    mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
-                                  PROT_READ | PROT_WRITE, MAP_PRIVATE |
-                                  MAP_ANONYMOUS, -1, 0);
-    if (mis->postcopy_tmp_page == MAP_FAILED) {
-        mis->postcopy_tmp_page = NULL;
-        error_report("%s: Failed to map postcopy_tmp_page %s",
-                     __func__, strerror(errno));
+    if (postcopy_temp_pages_setup(mis)) {
+        /* Error dumped in the sub-function */
         return -1;
     }
 
-    /*
-     * Map large zero page when kernel can't use UFFDIO_ZEROPAGE for hugepages
-     */
-    mis->postcopy_tmp_zero_page = mmap(NULL, mis->largest_page_size,
-                                       PROT_READ | PROT_WRITE,
-                                       MAP_PRIVATE | MAP_ANONYMOUS,
-                                       -1, 0);
-    if (mis->postcopy_tmp_zero_page == MAP_FAILED) {
-        int e = errno;
-        mis->postcopy_tmp_zero_page = NULL;
-        error_report("%s: Failed to map large zero page %s",
-                     __func__, strerror(e));
-        return -e;
-    }
-    memset(mis->postcopy_tmp_zero_page, '\0', mis->largest_page_size);
-
     trace_postcopy_ram_enable_notify();
 
     return 0;
-- 
2.32.0




* [PATCH RFC 07/15] migration: Introduce postcopy channels on dest node
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (5 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 06/15] migration: Move temp page setup and cleanup into separate functions Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-02-03 15:08   ` Dr. David Alan Gilbert
  2022-01-19  8:09 ` [PATCH RFC 08/15] migration: Dump ramblock and offset too when non-same-page detected Peter Xu
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Postcopy handles huge pages in a special way: currently we can only have one
"channel" to transfer a page.

That's because when we install pages using UFFDIO_COPY we need the whole huge
page ready; it also means we need a temp huge page to buffer the whole
content of the page while receiving it.

Currently all maintenance of this tmp page is global: first we allocate a
temp huge page, then we maintain its status mostly within
ram_load_postcopy().

To enable multiple channels for postcopy, the first thing we need to do is to
prepare N temp huge pages as caches, one per channel.

Meanwhile we need to maintain the tmp huge page status per-channel too.

To give an example, here are some of the local variables maintained in
ram_load_postcopy() that are responsible for tracking the temp huge page
status:

  - all_zero:     this keeps whether this huge page contains all zeros
  - target_pages: this counts how many target pages have been copied
  - host_page:    this keeps the host ptr for the page to install

Move all these fields to be together with the temp huge pages to form a new
structure called PostcopyTmpPage.  Then for each (future) postcopy channel, we
need one structure to keep the state around.

For vanilla postcopy, obviously there's only one channel.  It contains both
precopy and postcopy pages.

This patch teaches the dest migration node to realize the possible number of
postcopy channels by introducing the "postcopy_channels" variable.  Its value
is calculated when postcopy is set up on the dest node (during the
POSTCOPY_LISTEN phase).

Vanilla postcopy will have channels=1, but when the postcopy-preempt
capability is enabled (in the future), we will boost it to 2, because even
during a partial send of a precopy huge page we still want to preempt it and
start sending the postcopy-requested page right away (so we start to keep two
temp huge pages; more if we want to enable multifd).  In this patch there's a
TODO marked for that; so far channels is always set to 1.

We need to send one "host huge page" on one channel only, and we cannot split
it, because otherwise the data of the same huge page could live on more than
one channel, which would require much more complicated logic to manage.  One
temp host huge page per channel will be enough for us for now.
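
For example (a hypothetical split): if each half of a 2M huge page arrived on
a different channel, both receiving threads would need to fill the same temp
buffer and then coordinate who issues the single UFFDIO_COPY that installs
the whole page - exactly the complexity we want to avoid.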

Postcopy will still always use the index=0 temp page even after this patch.
However, it prepares for the later patches where multiple channels can be
used (which needs src intervention, because only the src knows which channel
each page should use).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.h    | 35 +++++++++++++++++++++++++++-
 migration/postcopy-ram.c | 50 +++++++++++++++++++++++++++++-----------
 migration/ram.c          | 43 +++++++++++++++++-----------------
 3 files changed, 91 insertions(+), 37 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index 8130b703eb..8bb2931312 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -45,6 +45,24 @@ struct PostcopyBlocktimeContext;
  */
 #define CLEAR_BITMAP_SHIFT_MAX            31
 
+/* This is an abstraction of a "temp huge page" for postcopy's purpose */
+typedef struct {
+    /*
+     * This points to a temporary huge page as a buffer for UFFDIO_COPY.  It's
+     * mmap()ed and needs to be freed when cleanup.
+     */
+    void *tmp_huge_page;
+    /*
+     * This points to the host page we're going to install for this temp page.
+     * It tells us after we've received the whole page, where we should put it.
+     */
+    void *host_addr;
+    /* Number of small pages copied (in size of TARGET_PAGE_SIZE) */
+    int target_pages;
+    /* Whether this page contains all zeros */
+    bool all_zero;
+} PostcopyTmpPage;
+
 /* State for the incoming migration */
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
@@ -81,7 +99,22 @@ struct MigrationIncomingState {
     QemuMutex rp_mutex;    /* We send replies from multiple threads */
     /* RAMBlock of last request sent to source */
     RAMBlock *last_rb;
-    void     *postcopy_tmp_page;
+    /*
+     * Number of postcopy channels including the default precopy channel, so
+     * vanilla postcopy will only contain one channel which contain both
+     * precopy and postcopy streams.
+     *
+     * This is calculated when the src requests to enable postcopy but before
+     * it starts.  Its value can depend on e.g. whether postcopy preemption is
+     * enabled.
+     */
+    int       postcopy_channels;
+    /*
+     * An array of temp host huge pages to be used, one for each postcopy
+     * channel.
+     */
+    PostcopyTmpPage *postcopy_tmp_pages;
+    /* This is shared for all postcopy channels */
     void     *postcopy_tmp_zero_page;
     /* PostCopyFD's for external userfaultfds & handlers of shared memory */
     GArray   *postcopy_remote_fds;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index e662dd05cc..d78e1b9373 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -525,9 +525,18 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis)
 
 static void postcopy_temp_pages_cleanup(MigrationIncomingState *mis)
 {
-    if (mis->postcopy_tmp_page) {
-        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
-        mis->postcopy_tmp_page = NULL;
+    int i;
+
+    if (mis->postcopy_tmp_pages) {
+        for (i = 0; i < mis->postcopy_channels; i++) {
+            if (mis->postcopy_tmp_pages[i].tmp_huge_page) {
+                munmap(mis->postcopy_tmp_pages[i].tmp_huge_page,
+                       mis->largest_page_size);
+                mis->postcopy_tmp_pages[i].tmp_huge_page = NULL;
+            }
+        }
+        g_free(mis->postcopy_tmp_pages);
+        mis->postcopy_tmp_pages = NULL;
     }
 
     if (mis->postcopy_tmp_zero_page) {
@@ -1091,17 +1100,30 @@ retry:
 
 static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
 {
-    int err;
-
-    mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
-                                  PROT_READ | PROT_WRITE,
-                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
-    if (mis->postcopy_tmp_page == MAP_FAILED) {
-        err = errno;
-        mis->postcopy_tmp_page = NULL;
-        error_report("%s: Failed to map postcopy_tmp_page %s",
-                     __func__, strerror(err));
-        return -err;
+    PostcopyTmpPage *tmp_page;
+    int err, i, channels;
+    void *temp_page;
+
+    /* TODO: will be boosted when enable postcopy preemption */
+    mis->postcopy_channels = 1;
+
+    channels = mis->postcopy_channels;
+    mis->postcopy_tmp_pages = g_malloc0(sizeof(PostcopyTmpPage) * channels);
+
+    for (i = 0; i < channels; i++) {
+        tmp_page = &mis->postcopy_tmp_pages[i];
+        temp_page = mmap(NULL, mis->largest_page_size, PROT_READ | PROT_WRITE,
+                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+        if (temp_page == MAP_FAILED) {
+            err = errno;
+            error_report("%s: Failed to map postcopy_tmp_pages[%d]: %s",
+                         __func__, i, strerror(err));
+            return -err;
+        }
+        tmp_page->tmp_huge_page = temp_page;
+        /* Initialize default states for each tmp page */
+        tmp_page->all_zero = true;
+        tmp_page->target_pages = 0;
     }
 
     /*
diff --git a/migration/ram.c b/migration/ram.c
index 0df15ff663..930e722e39 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3639,11 +3639,8 @@ static int ram_load_postcopy(QEMUFile *f)
     bool place_needed = false;
     bool matches_target_page_size = false;
     MigrationIncomingState *mis = migration_incoming_get_current();
-    /* Temporary page that is later 'placed' */
-    void *postcopy_host_page = mis->postcopy_tmp_page;
-    void *host_page = NULL;
-    bool all_zero = true;
-    int target_pages = 0;
+    /* Currently we only use channel 0.  TODO: use all the channels */
+    PostcopyTmpPage *tmp_page = &mis->postcopy_tmp_pages[0];
 
     while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
         ram_addr_t addr;
@@ -3687,7 +3684,7 @@ static int ram_load_postcopy(QEMUFile *f)
                 ret = -EINVAL;
                 break;
             }
-            target_pages++;
+            tmp_page->target_pages++;
             matches_target_page_size = block->page_size == TARGET_PAGE_SIZE;
             /*
              * Postcopy requires that we place whole host pages atomically;
@@ -3699,15 +3696,16 @@ static int ram_load_postcopy(QEMUFile *f)
              * however the source ensures it always sends all the components
              * of a host page in one chunk.
              */
-            page_buffer = postcopy_host_page +
+            page_buffer = tmp_page->tmp_huge_page +
                           host_page_offset_from_ram_block_offset(block, addr);
             /* If all TP are zero then we can optimise the place */
-            if (target_pages == 1) {
-                host_page = host_page_from_ram_block_offset(block, addr);
-            } else if (host_page != host_page_from_ram_block_offset(block,
-                                                                    addr)) {
+            if (tmp_page->target_pages == 1) {
+                tmp_page->host_addr =
+                    host_page_from_ram_block_offset(block, addr);
+            } else if (tmp_page->host_addr !=
+                       host_page_from_ram_block_offset(block, addr)) {
                 /* not the 1st TP within the HP */
-                error_report("Non-same host page %p/%p", host_page,
+                error_report("Non-same host page %p/%p", tmp_page->host_addr,
                              host_page_from_ram_block_offset(block, addr));
                 ret = -EINVAL;
                 break;
@@ -3717,10 +3715,11 @@ static int ram_load_postcopy(QEMUFile *f)
              * If it's the last part of a host page then we place the host
              * page
              */
-            if (target_pages == (block->page_size / TARGET_PAGE_SIZE)) {
+            if (tmp_page->target_pages ==
+                (block->page_size / TARGET_PAGE_SIZE)) {
                 place_needed = true;
             }
-            place_source = postcopy_host_page;
+            place_source = tmp_page->tmp_huge_page;
         }
 
         switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
@@ -3734,12 +3733,12 @@ static int ram_load_postcopy(QEMUFile *f)
                 memset(page_buffer, ch, TARGET_PAGE_SIZE);
             }
             if (ch) {
-                all_zero = false;
+                tmp_page->all_zero = false;
             }
             break;
 
         case RAM_SAVE_FLAG_PAGE:
-            all_zero = false;
+            tmp_page->all_zero = false;
             if (!matches_target_page_size) {
                 /* For huge pages, we always use temporary buffer */
                 qemu_get_buffer(f, page_buffer, TARGET_PAGE_SIZE);
@@ -3757,7 +3756,7 @@ static int ram_load_postcopy(QEMUFile *f)
             }
             break;
         case RAM_SAVE_FLAG_COMPRESS_PAGE:
-            all_zero = false;
+            tmp_page->all_zero = false;
             len = qemu_get_be32(f);
             if (len < 0 || len > compressBound(TARGET_PAGE_SIZE)) {
                 error_report("Invalid compressed data length: %d", len);
@@ -3789,16 +3788,16 @@ static int ram_load_postcopy(QEMUFile *f)
         }
 
         if (!ret && place_needed) {
-            if (all_zero) {
-                ret = postcopy_place_page_zero(mis, host_page, block);
+            if (tmp_page->all_zero) {
+                ret = postcopy_place_page_zero(mis, tmp_page->host_addr, block);
             } else {
-                ret = postcopy_place_page(mis, host_page, place_source,
+                ret = postcopy_place_page(mis, tmp_page->host_addr, place_source,
                                           block);
             }
             place_needed = false;
-            target_pages = 0;
+            tmp_page->target_pages = 0;
             /* Assume we have a zero page until we detect something different */
-            all_zero = true;
+            tmp_page->all_zero = true;
         }
     }
 
-- 
2.32.0




* [PATCH RFC 08/15] migration: Dump ramblock and offset too when non-same-page detected
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (6 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 07/15] migration: Introduce postcopy channels on dest node Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-02-03 15:15   ` Dr. David Alan Gilbert
  2022-01-19  8:09 ` [PATCH RFC 09/15] migration: Add postcopy_thread_create() Peter Xu
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

In ram_load_postcopy() we try to detect the non-same-page case and dump an
error.  This error is very helpful for debugging.  Add the ramblock & offset
to the error log too.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 930e722e39..3f823ffffc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3705,8 +3705,12 @@ static int ram_load_postcopy(QEMUFile *f)
             } else if (tmp_page->host_addr !=
                        host_page_from_ram_block_offset(block, addr)) {
                 /* not the 1st TP within the HP */
-                error_report("Non-same host page %p/%p", tmp_page->host_addr,
-                             host_page_from_ram_block_offset(block, addr));
+                error_report("Non-same host page detected.  Target host page %p, "
+                             "received host page %p "
+                             "(rb %s offset 0x"RAM_ADDR_FMT" target_pages %d)",
+                             tmp_page->host_addr,
+                             host_page_from_ram_block_offset(block, addr),
+                             block->idstr, addr, tmp_page->target_pages);
                 ret = -EINVAL;
                 break;
             }
-- 
2.32.0




* [PATCH RFC 09/15] migration: Add postcopy_thread_create()
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (7 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 08/15] migration: Dump ramblock and offset too when non-same-page detected Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-02-03 15:19   ` Dr. David Alan Gilbert
  2022-01-19  8:09 ` [PATCH RFC 10/15] migration: Move static var in ram_block_from_stream() into global Peter Xu
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Postcopy creates threads.  A common pattern is to init a semaphore and use it
to sync with the thread once it is created.  Namely, we have fault_thread_sem
and listen_thread_sem, and they're only used for this.

Make this a shared infrastructure, so it's easier to create yet another
thread.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.h    |  5 ++---
 migration/postcopy-ram.c | 19 +++++++++++++------
 migration/postcopy-ram.h |  4 ++++
 migration/savevm.c       | 12 +++---------
 4 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index 8bb2931312..35e7f7babe 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -70,7 +70,8 @@ struct MigrationIncomingState {
     /* A hook to allow cleanup at the end of incoming migration */
     void *transport_data;
     void (*transport_cleanup)(void *data);
-
+    /* Used to sync thread creations */
+    QemuSemaphore  thread_sync_sem;
     /*
      * Free at the start of the main state load, set as the main thread finishes
      * loading state.
@@ -83,13 +84,11 @@ struct MigrationIncomingState {
     size_t         largest_page_size;
     bool           have_fault_thread;
     QemuThread     fault_thread;
-    QemuSemaphore  fault_thread_sem;
     /* Set this when we want the fault thread to quit */
     bool           fault_thread_quit;
 
     bool           have_listen_thread;
     QemuThread     listen_thread;
-    QemuSemaphore  listen_thread_sem;
 
     /* For the kernel to send us notifications */
     int       userfault_fd;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index d78e1b9373..88c832eeba 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -77,6 +77,16 @@ int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp)
                                             &pnd);
 }
 
+void postcopy_thread_create(MigrationIncomingState *mis,
+                            QemuThread *thread, const char *name,
+                            void *(*fn)(void *), int joinable)
+{
+    qemu_sem_init(&mis->thread_sync_sem, 0);
+    qemu_thread_create(thread, name, fn, mis, joinable);
+    qemu_sem_wait(&mis->thread_sync_sem);
+    qemu_sem_destroy(&mis->thread_sync_sem);
+}
+
 /* Postcopy needs to detect accesses to pages that haven't yet been copied
  * across, and efficiently map new pages in, the techniques for doing this
  * are target OS specific.
@@ -901,7 +911,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
     trace_postcopy_ram_fault_thread_entry();
     rcu_register_thread();
     mis->last_rb = NULL; /* last RAMBlock we sent part of */
-    qemu_sem_post(&mis->fault_thread_sem);
+    qemu_sem_post(&mis->thread_sync_sem);
 
     struct pollfd *pfd;
     size_t pfd_len = 2 + mis->postcopy_remote_fds->len;
@@ -1172,11 +1182,8 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
         return -1;
     }
 
-    qemu_sem_init(&mis->fault_thread_sem, 0);
-    qemu_thread_create(&mis->fault_thread, "postcopy/fault",
-                       postcopy_ram_fault_thread, mis, QEMU_THREAD_JOINABLE);
-    qemu_sem_wait(&mis->fault_thread_sem);
-    qemu_sem_destroy(&mis->fault_thread_sem);
+    postcopy_thread_create(mis, &mis->fault_thread, "postcopy/fault",
+                           postcopy_ram_fault_thread, QEMU_THREAD_JOINABLE);
     mis->have_fault_thread = true;
 
     /* Mark so that we get notified of accesses to unwritten areas */
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 6d2b3cf124..07684c0e1d 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -135,6 +135,10 @@ void postcopy_remove_notifier(NotifierWithReturn *n);
 /* Call the notifier list set by postcopy_add_start_notifier */
 int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp);
 
+void postcopy_thread_create(MigrationIncomingState *mis,
+                            QemuThread *thread, const char *name,
+                            void *(*fn)(void *), int joinable);
+
 struct PostCopyFD;
 
 /* ufd is a pointer to the struct uffd_msg *TODO: more Portable! */
diff --git a/migration/savevm.c b/migration/savevm.c
index 3b8f565b14..3342b74c24 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1862,7 +1862,7 @@ static void *postcopy_ram_listen_thread(void *opaque)
 
     migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
                                    MIGRATION_STATUS_POSTCOPY_ACTIVE);
-    qemu_sem_post(&mis->listen_thread_sem);
+    qemu_sem_post(&mis->thread_sync_sem);
     trace_postcopy_ram_listen_thread_start();
 
     rcu_register_thread();
@@ -1987,14 +1987,8 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
     }
 
     mis->have_listen_thread = true;
-    /* Start up the listening thread and wait for it to signal ready */
-    qemu_sem_init(&mis->listen_thread_sem, 0);
-    qemu_thread_create(&mis->listen_thread, "postcopy/listen",
-                       postcopy_ram_listen_thread, NULL,
-                       QEMU_THREAD_DETACHED);
-    qemu_sem_wait(&mis->listen_thread_sem);
-    qemu_sem_destroy(&mis->listen_thread_sem);
-
+    postcopy_thread_create(mis, &mis->listen_thread, "postcopy/listen",
+                           postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
     trace_loadvm_postcopy_handle_listen("return");
 
     return 0;
-- 
2.32.0




* [PATCH RFC 10/15] migration: Move static var in ram_block_from_stream() into global
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (8 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 09/15] migration: Add postcopy_thread_create() Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-02-03 17:48   ` Dr. David Alan Gilbert
  2022-01-19  8:09 ` [PATCH RFC 11/15] migration: Add pss.postcopy_requested status Peter Xu
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

A static variable is very unfriendly to threading of
ram_block_from_stream().  Move it into MigrationIncomingState.

Pass the incoming state pointer to ram_block_from_stream() at both call
sites.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.h |  3 ++-
 migration/ram.c       | 13 +++++++++----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index 35e7f7babe..34b79cb961 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -66,7 +66,8 @@ typedef struct {
 /* State for the incoming migration */
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
-
+    /* Previously received RAM's RAMBlock pointer */
+    RAMBlock *last_recv_block;
     /* A hook to allow cleanup at the end of incoming migration */
     void *transport_data;
     void (*transport_cleanup)(void *data);
diff --git a/migration/ram.c b/migration/ram.c
index 3f823ffffc..3a7d943f9c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3183,12 +3183,14 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
  *
  * Returns a pointer from within the RCU-protected ram_list.
  *
+ * @mis: the migration incoming state pointer
  * @f: QEMUFile where to read the data from
  * @flags: Page flags (mostly to see if it's a continuation of previous block)
  */
-static inline RAMBlock *ram_block_from_stream(QEMUFile *f, int flags)
+static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
+                                              QEMUFile *f, int flags)
 {
-    static RAMBlock *block;
+    RAMBlock *block = mis->last_recv_block;
     char id[256];
     uint8_t len;
 
@@ -3215,6 +3217,8 @@ static inline RAMBlock *ram_block_from_stream(QEMUFile *f, int flags)
         return NULL;
     }
 
+    mis->last_recv_block = block;
+
     return block;
 }
 
@@ -3667,7 +3671,7 @@ static int ram_load_postcopy(QEMUFile *f)
         trace_ram_load_postcopy_loop((uint64_t)addr, flags);
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE)) {
-            block = ram_block_from_stream(f, flags);
+            block = ram_block_from_stream(mis, f, flags);
             if (!block) {
                 ret = -EINVAL;
                 break;
@@ -3881,6 +3885,7 @@ void colo_flush_ram_cache(void)
  */
 static int ram_load_precopy(QEMUFile *f)
 {
+    MigrationIncomingState *mis = migration_incoming_get_current();
     int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0;
     /* ADVISE is earlier, it shows the source has the postcopy capability on */
     bool postcopy_advised = postcopy_is_advised();
@@ -3919,7 +3924,7 @@ static int ram_load_precopy(QEMUFile *f)
 
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
-            RAMBlock *block = ram_block_from_stream(f, flags);
+            RAMBlock *block = ram_block_from_stream(mis, f, flags);
 
             host = host_from_ram_block_offset(block, addr);
             /*
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH RFC 11/15] migration: Add pss.postcopy_requested status
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (9 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 10/15] migration: Move static var in ram_block_from_stream() into global Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-02-03 15:42   ` Dr. David Alan Gilbert
  2022-01-19  8:09 ` [PATCH RFC 12/15] migration: Move migrate_allow_multifd and helpers into migration.c Peter Xu
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

This boolean flag shows whether the page currently being migrated was
requested by postcopy or not.  With it, ram_save_host_page() and the deeper
call stack can tell the priority of this page.
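
The flag is consumed later in the series; for example, patch 14 uses it to
pick the outgoing channel, roughly:

    /* From patch 14: postcopy-requested pages go out on the urgent channel */
    int channel = pss->postcopy_requested ?
        RAM_CHANNEL_POSTCOPY : RAM_CHANNEL_PRECOPY;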

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 3a7d943f9c..b7d17613e8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -400,6 +400,8 @@ struct PageSearchStatus {
     unsigned long page;
     /* Set once we wrap around */
     bool         complete_round;
+    /* Whether current page is explicitly requested by postcopy */
+    bool         postcopy_requested;
 };
 typedef struct PageSearchStatus PageSearchStatus;
 
@@ -1480,6 +1482,9 @@ retry:
  */
 static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
 {
+    /* This is not a postcopy requested page */
+    pss->postcopy_requested = false;
+
     pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
     if (pss->complete_round && pss->block == rs->last_seen_block &&
         pss->page >= rs->last_page) {
@@ -1971,6 +1976,7 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
          * really rare.
          */
         pss->complete_round = false;
+        pss->postcopy_requested = true;
     }
 
     return !!block;
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH RFC 12/15] migration: Move migrate_allow_multifd and helpers into migration.c
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (10 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 11/15] migration: Add pss.postcopy_requested status Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-02-03 15:44   ` Dr. David Alan Gilbert
  2022-01-19  8:09 ` [PATCH RFC 13/15] migration: Add postcopy-preempt capability Peter Xu
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

This variable, along with its helpers, is used to detect whether multiple
channels are supported by the migration protocol in use.  In follow-up
patches there'll be another capability that requires multiple channels, so
move it out of the multifd-specific code and make it public.  Meanwhile,
rename it from "multifd" to "multi_channels" to reflect its real meaning.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 22 +++++++++++++++++-----
 migration/migration.h |  3 +++
 migration/multifd.c   | 19 ++++---------------
 migration/multifd.h   |  2 --
 4 files changed, 24 insertions(+), 22 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 252ce1eaec..15a48b548a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -180,6 +180,18 @@ static int migration_maybe_pause(MigrationState *s,
                                  int new_state);
 static void migrate_fd_cancel(MigrationState *s);
 
+static bool migrate_allow_multi_channels = true;
+
+void migrate_protocol_allow_multi_channels(bool allow)
+{
+    migrate_allow_multi_channels = allow;
+}
+
+bool migrate_multi_channels_is_allowed(void)
+{
+    return migrate_allow_multi_channels;
+}
+
 static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
 {
     uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
@@ -463,12 +475,12 @@ static void qemu_start_incoming_migration(const char *uri, Error **errp)
 {
     const char *p = NULL;
 
-    migrate_protocol_allow_multifd(false); /* reset it anyway */
+    migrate_protocol_allow_multi_channels(false); /* reset it anyway */
     qapi_event_send_migration(MIGRATION_STATUS_SETUP);
     if (strstart(uri, "tcp:", &p) ||
         strstart(uri, "unix:", NULL) ||
         strstart(uri, "vsock:", NULL)) {
-        migrate_protocol_allow_multifd(true);
+        migrate_protocol_allow_multi_channels(true);
         socket_start_incoming_migration(p ? p : uri, errp);
 #ifdef CONFIG_RDMA
     } else if (strstart(uri, "rdma:", &p)) {
@@ -1252,7 +1264,7 @@ static bool migrate_caps_check(bool *cap_list,
 
     /* incoming side only */
     if (runstate_check(RUN_STATE_INMIGRATE) &&
-        !migrate_multifd_is_allowed() &&
+        !migrate_multi_channels_is_allowed() &&
         cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
         error_setg(errp, "multifd is not supported by current protocol");
         return false;
@@ -2310,11 +2322,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
         }
     }
 
-    migrate_protocol_allow_multifd(false);
+    migrate_protocol_allow_multi_channels(false);
     if (strstart(uri, "tcp:", &p) ||
         strstart(uri, "unix:", NULL) ||
         strstart(uri, "vsock:", NULL)) {
-        migrate_protocol_allow_multifd(true);
+        migrate_protocol_allow_multi_channels(true);
         socket_start_outgoing_migration(s, p ? p : uri, &local_err);
 #ifdef CONFIG_RDMA
     } else if (strstart(uri, "rdma:", &p)) {
diff --git a/migration/migration.h b/migration/migration.h
index 34b79cb961..d0c0902ec9 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -425,4 +425,7 @@ void migration_cancel(const Error *error);
 
 void populate_vfio_info(MigrationInfo *info);
 
+bool migrate_multi_channels_is_allowed(void);
+void migrate_protocol_allow_multi_channels(bool allow);
+
 #endif
diff --git a/migration/multifd.c b/migration/multifd.c
index 3242f688e5..64ca50de62 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -535,7 +535,7 @@ void multifd_save_cleanup(void)
 {
     int i;
 
-    if (!migrate_use_multifd() || !migrate_multifd_is_allowed()) {
+    if (!migrate_use_multifd() || !migrate_multi_channels_is_allowed()) {
         return;
     }
     multifd_send_terminate_threads(NULL);
@@ -870,17 +870,6 @@ cleanup:
     multifd_new_send_channel_cleanup(p, sioc, local_err);
 }
 
-static bool migrate_allow_multifd = true;
-void migrate_protocol_allow_multifd(bool allow)
-{
-    migrate_allow_multifd = allow;
-}
-
-bool migrate_multifd_is_allowed(void)
-{
-    return migrate_allow_multifd;
-}
-
 int multifd_save_setup(Error **errp)
 {
     int thread_count;
@@ -891,7 +880,7 @@ int multifd_save_setup(Error **errp)
     if (!migrate_use_multifd()) {
         return 0;
     }
-    if (!migrate_multifd_is_allowed()) {
+    if (!migrate_multi_channels_is_allowed()) {
         error_setg(errp, "multifd is not supported by current protocol");
         return -1;
     }
@@ -989,7 +978,7 @@ int multifd_load_cleanup(Error **errp)
 {
     int i;
 
-    if (!migrate_use_multifd() || !migrate_multifd_is_allowed()) {
+    if (!migrate_use_multifd() || !migrate_multi_channels_is_allowed()) {
         return 0;
     }
     multifd_recv_terminate_threads(NULL);
@@ -1138,7 +1127,7 @@ int multifd_load_setup(Error **errp)
     if (!migrate_use_multifd()) {
         return 0;
     }
-    if (!migrate_multifd_is_allowed()) {
+    if (!migrate_multi_channels_is_allowed()) {
         error_setg(errp, "multifd is not supported by current protocol");
         return -1;
     }
diff --git a/migration/multifd.h b/migration/multifd.h
index e57adc783b..0ed07794b6 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -13,8 +13,6 @@
 #ifndef QEMU_MIGRATION_MULTIFD_H
 #define QEMU_MIGRATION_MULTIFD_H
 
-bool migrate_multifd_is_allowed(void);
-void migrate_protocol_allow_multifd(bool allow);
 int multifd_save_setup(Error **errp);
 void multifd_save_cleanup(void);
 int multifd_load_setup(Error **errp);
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH RFC 13/15] migration: Add postcopy-preempt capability
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (11 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 12/15] migration: Move migrate_allow_multifd and helpers into migration.c Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-02-03 15:46   ` Dr. David Alan Gilbert
  2022-01-19  8:09 ` [PATCH RFC 14/15] migration: Postcopy preemption on separate channel Peter Xu
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Firstly, postcopy already preempts precopy to some degree, since we do
unqueue_page() before looking into the dirty bits.

However that's not enough: e.g., with host huge pages enabled, while a
precopy huge page is being sent, a postcopy request has to wait until the
whole huge page finishes sending.  That can introduce quite some delay; the
bigger the huge page, the larger the delay.

This patch adds a new capability that allows postcopy requests to preempt an
in-flight precopy huge page, so that postcopy requests can be serviced even
faster.

Meanwhile, to send them even faster, bypass the precopy stream by providing
a standalone postcopy socket for sending requested pages.

Since the new behavior is not compatible with the old one, it is not the
default; it's enabled only when the new capability is set on both src/dst
QEMUs.

This patch only adds the capability itself; the logic will be added in
follow-up patches.
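
As a usage sketch (not part of this patch): once the series is complete, a
qtest can turn the feature on with the migrate_set_capability() helper that
tests/qtest/migration-test.c already provides; it must be enabled on both
sides, together with postcopy-ram (patch 15 adds the real test):

    static void enable_postcopy_preempt(QTestState *from, QTestState *to)
    {
        /* Both sides must agree on the capabilities before migrating */
        migrate_set_capability(from, "postcopy-ram", true);
        migrate_set_capability(to, "postcopy-ram", true);
        migrate_set_capability(from, "postcopy-preempt", true);
        migrate_set_capability(to, "postcopy-preempt", true);
    }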

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 23 +++++++++++++++++++++++
 migration/migration.h |  1 +
 qapi/migration.json   |  8 +++++++-
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 15a48b548a..84a8fbd80d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1227,6 +1227,11 @@ static bool migrate_caps_check(bool *cap_list,
             error_setg(errp, "Postcopy is not compatible with ignore-shared");
             return false;
         }
+
+        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
+            error_setg(errp, "Multifd is not supported in postcopy");
+            return false;
+        }
     }
 
     if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
@@ -1270,6 +1275,13 @@ static bool migrate_caps_check(bool *cap_list,
         return false;
     }
 
+    if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT]) {
+        if (!cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
+            error_setg(errp, "Postcopy preempt requires postcopy-ram");
+            return false;
+        }
+    }
+
     return true;
 }
 
@@ -2623,6 +2635,15 @@ bool migrate_background_snapshot(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT];
 }
 
+bool migrate_postcopy_preempt(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT];
+}
+
 /* migration thread support */
 /*
  * Something bad happened to the RP stream, mark an error
@@ -4239,6 +4260,8 @@ static Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-compress", MIGRATION_CAPABILITY_COMPRESS),
     DEFINE_PROP_MIG_CAP("x-events", MIGRATION_CAPABILITY_EVENTS),
     DEFINE_PROP_MIG_CAP("x-postcopy-ram", MIGRATION_CAPABILITY_POSTCOPY_RAM),
+    DEFINE_PROP_MIG_CAP("x-postcopy-preempt",
+                        MIGRATION_CAPABILITY_POSTCOPY_PREEMPT),
     DEFINE_PROP_MIG_CAP("x-colo", MIGRATION_CAPABILITY_X_COLO),
     DEFINE_PROP_MIG_CAP("x-release-ram", MIGRATION_CAPABILITY_RELEASE_RAM),
     DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
diff --git a/migration/migration.h b/migration/migration.h
index d0c0902ec9..9d39ccfcf5 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -391,6 +391,7 @@ int migrate_decompress_threads(void);
 bool migrate_use_events(void);
 bool migrate_postcopy_blocktime(void);
 bool migrate_background_snapshot(void);
+bool migrate_postcopy_preempt(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_shut(MigrationIncomingState *mis,
diff --git a/qapi/migration.json b/qapi/migration.json
index bbfd48cf0b..f00b365bd5 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -452,6 +452,12 @@
 #                       procedure starts. The VM RAM is saved with running VM.
 #                       (since 6.0)
 #
+# @postcopy-preempt: If enabled, the migration process will allow postcopy
+#                    requests to preempt precopy stream, so postcopy requests
+#                    will be handled faster.  This is a performance feature and
+#                    should not affect the correctness of postcopy migration.
+#                    (since 7.0)
+#
 # Features:
 # @unstable: Members @x-colo and @x-ignore-shared are experimental.
 #
@@ -465,7 +471,7 @@
            'block', 'return-path', 'pause-before-switchover', 'multifd',
            'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
            { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
-           'validate-uuid', 'background-snapshot'] }
+           'validate-uuid', 'background-snapshot', 'postcopy-preempt'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH RFC 14/15] migration: Postcopy preemption on separate channel
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (12 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 13/15] migration: Add postcopy-preempt capability Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-02-03 17:45   ` Dr. David Alan Gilbert
  2022-01-19  8:09 ` [PATCH RFC 15/15] tests: Add postcopy preempt test Peter Xu
  2022-01-19 12:32 ` [PATCH RFC 00/15] migration: Postcopy Preemption Dr. David Alan Gilbert
  15 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

This patch enables the postcopy-preempt feature.

It contains two major changes to the migration logic:

  (1) Postcopy requests are now sent via a different socket from the precopy
      background migration stream, so as to be isolated from very high page
      request delays

  (2) For huge-page-enabled hosts: when there are postcopy requests, they
      can now interrupt the sending of a huge host page on the src QEMU
      partway through.

After this patch, we'll have two "channels" (or rather, sockets, because
it's only supported on socket-based channels) for postcopy: (1) the PRECOPY
channel (the default channel, which transfers background pages), and (2) the
POSTCOPY channel (which only transfers requested pages).

On the source QEMU, when we find a postcopy request, we interrupt the
PRECOPY channel's sending process and quickly switch to the POSTCOPY
channel.  After we've serviced all the high-priority postcopy pages, we
switch back to the PRECOPY channel and continue to send the interrupted huge
page.  No new thread is introduced on the source.

On the destination QEMU, one new thread is introduced to receive page data
from the postcopy-specific socket.

This patch has a side effect: previously, after sending a postcopy page,
we'd assume the guest would access the following pages and keep sending from
there.  Now, instead of going on from the postcopy-requested page, we go
back and continue sending the precopy huge page (which may have been
partially sent already, before a postcopy request intercepted it).

Whether that's a problem is debatable, because "assuming the guest will
continue to access the next page" doesn't really suit huge-page setups,
especially when the huge page is large (e.g., 1GB pages).  The locality hint
is pretty much meaningless with huge pages.

If postcopy preempt is enabled, a separate channel is created for it so that
it can be used later for postcopy-specific page requests.  On the dst node,
a standalone thread is used to receive postcopy-requested pages; it is
created along with the ram listen thread during the POSTCOPY_LISTEN phase.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c    |  62 +++++++--
 migration/migration.h    |  10 +-
 migration/postcopy-ram.c |  65 ++++++++-
 migration/postcopy-ram.h |  10 ++
 migration/ram.c          | 294 +++++++++++++++++++++++++++++++++++++--
 migration/ram.h          |   2 +
 migration/socket.c       |  18 +++
 migration/socket.h       |   1 +
 migration/trace-events   |  10 ++
 9 files changed, 445 insertions(+), 27 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 84a8fbd80d..13dc6ecd37 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -315,6 +315,12 @@ void migration_incoming_state_destroy(void)
         mis->socket_address_list = NULL;
     }
 
+    if (mis->postcopy_qemufile_dst) {
+        migration_ioc_unregister_yank_from_file(mis->postcopy_qemufile_dst);
+        qemu_fclose(mis->postcopy_qemufile_dst);
+        mis->postcopy_qemufile_dst = NULL;
+    }
+
     yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
 
@@ -708,15 +714,21 @@ void migration_fd_process_incoming(QEMUFile *f, Error **errp)
     migration_incoming_process();
 }
 
+static bool migration_needs_multiple_sockets(void)
+{
+    return migrate_use_multifd() || migrate_postcopy_preempt();
+}
+
 void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
     Error *local_err = NULL;
     bool start_migration;
+    QEMUFile *f;
 
     if (!mis->from_src_file) {
         /* The first connection (multifd may have multiple) */
-        QEMUFile *f = qemu_fopen_channel_input(ioc);
+        f = qemu_fopen_channel_input(ioc);
 
         /* If it's a recovery, we're done */
         if (postcopy_try_recover(f)) {
@@ -729,13 +741,18 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 
         /*
          * Common migration only needs one channel, so we can start
-         * right now.  Multifd needs more than one channel, we wait.
+         * right now.  Some features need more than one channel; for those we wait.
          */
-        start_migration = !migrate_use_multifd();
+        start_migration = !migration_needs_multiple_sockets();
     } else {
         /* Multiple connections */
-        assert(migrate_use_multifd());
-        start_migration = multifd_recv_new_channel(ioc, &local_err);
+        assert(migration_needs_multiple_sockets());
+        if (migrate_use_multifd()) {
+            start_migration = multifd_recv_new_channel(ioc, &local_err);
+        } else if (migrate_postcopy_preempt()) {
+            f = qemu_fopen_channel_input(ioc);
+            start_migration = postcopy_preempt_new_channel(mis, f);
+        }
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -756,11 +773,20 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 bool migration_has_all_channels(void)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
-    bool all_channels;
 
-    all_channels = multifd_recv_all_channels_created();
+    if (!mis->from_src_file) {
+        return false;
+    }
+
+    if (migrate_use_multifd()) {
+        return multifd_recv_all_channels_created();
+    }
+
+    if (migrate_postcopy_preempt()) {
+        return mis->postcopy_qemufile_dst != NULL;
+    }
 
-    return all_channels && mis->from_src_file != NULL;
+    return true;
 }
 
 /*
@@ -1850,6 +1876,11 @@ static void migrate_fd_cleanup(MigrationState *s)
         qemu_fclose(tmp);
     }
 
+    if (s->postcopy_qemufile_src) {
+        qemu_fclose(s->postcopy_qemufile_src);
+        s->postcopy_qemufile_src = NULL;
+    }
+
     assert(!migration_is_active(s));
 
     if (s->state == MIGRATION_STATUS_CANCELLING) {
@@ -3122,6 +3153,8 @@ static int postcopy_start(MigrationState *ms)
                               MIGRATION_STATUS_FAILED);
     }
 
+    trace_postcopy_preempt_enabled(migrate_postcopy_preempt());
+
     return ret;
 
 fail_closefb:
@@ -3234,6 +3267,11 @@ static void migration_completion(MigrationState *s)
         qemu_savevm_state_complete_postcopy(s->to_dst_file);
         qemu_mutex_unlock_iothread();
 
+        /* Shut down the postcopy fast-path thread */
+        if (migrate_postcopy_preempt()) {
+            postcopy_preempt_shutdown_file(s);
+        }
+
         trace_migration_completion_postcopy_end_after_complete();
     } else if (s->state == MIGRATION_STATUS_CANCELLING) {
         goto fail;
@@ -4143,6 +4181,14 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
         return;
     }
 
+    if (postcopy_preempt_setup(s, &local_err)) {
+        error_report_err(local_err);
+        migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                          MIGRATION_STATUS_FAILED);
+        migrate_fd_cleanup(s);
+        return;
+    }
+
     if (migrate_background_snapshot()) {
         qemu_thread_create(&s->thread, "bg_snapshot",
                 bg_migration_thread, s, QEMU_THREAD_JOINABLE);
diff --git a/migration/migration.h b/migration/migration.h
index 9d39ccfcf5..8786785b1f 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -23,6 +23,7 @@
 #include "io/channel-buffer.h"
 #include "net/announce.h"
 #include "qom/object.h"
+#include "postcopy-ram.h"
 
 struct PostcopyBlocktimeContext;
 
@@ -67,7 +68,7 @@ typedef struct {
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
     /* Previously received RAM's RAMBlock pointer */
-    RAMBlock *last_recv_block;
+    RAMBlock *last_recv_block[RAM_CHANNEL_MAX];
     /* A hook to allow cleanup at the end of incoming migration */
     void *transport_data;
     void (*transport_cleanup)(void *data);
@@ -109,6 +110,11 @@ struct MigrationIncomingState {
      * enabled.
      */
     int       postcopy_channels;
+    /* QEMUFile for postcopy only; it'll be handled by a separate thread */
+    QEMUFile *postcopy_qemufile_dst;
+    /* Postcopy priority thread is used to receive postcopy requested pages */
+    QemuThread     postcopy_prio_thread;
+    bool           postcopy_prio_thread_created;
     /*
      * An array of temp host huge pages to be used, one for each postcopy
      * channel.
@@ -189,6 +195,8 @@ struct MigrationState {
     QEMUBH *cleanup_bh;
     /* Protected by qemu_file_lock */
     QEMUFile *to_dst_file;
+    /* Postcopy specific transfer channel */
+    QEMUFile *postcopy_qemufile_src;
     QIOChannelBuffer *bioc;
     /*
      * Protects to_dst_file/from_dst_file pointers.  We need to make sure we
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 88c832eeba..9006e68fd1 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -32,6 +32,8 @@
 #include "trace.h"
 #include "hw/boards.h"
 #include "exec/ramblock.h"
+#include "socket.h"
+#include "qemu-file-channel.h"
 
 /* Arbitrary limit on size of each discard command,
  * keeps them around ~200 bytes
@@ -562,6 +564,11 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
 {
     trace_postcopy_ram_incoming_cleanup_entry();
 
+    if (mis->postcopy_prio_thread_created) {
+        qemu_thread_join(&mis->postcopy_prio_thread);
+        mis->postcopy_prio_thread_created = false;
+    }
+
     if (mis->have_fault_thread) {
         Error *local_err = NULL;
 
@@ -1114,8 +1121,13 @@ static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
     int err, i, channels;
     void *temp_page;
 
-    /* TODO: will be boosted when enable postcopy preemption */
-    mis->postcopy_channels = 1;
+    if (migrate_postcopy_preempt()) {
+        /* If preemption enabled, need extra channel for urgent requests */
+        mis->postcopy_channels = RAM_CHANNEL_MAX;
+    } else {
+        /* Both precopy/postcopy on the same channel */
+        mis->postcopy_channels = 1;
+    }
 
     channels = mis->postcopy_channels;
     mis->postcopy_tmp_pages = g_malloc0(sizeof(PostcopyTmpPage) * channels);
@@ -1182,7 +1194,7 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
         return -1;
     }
 
-    postcopy_thread_create(mis, &mis->fault_thread, "postcopy/fault",
+    postcopy_thread_create(mis, &mis->fault_thread, "qemu/fault-default",
                            postcopy_ram_fault_thread, QEMU_THREAD_JOINABLE);
     mis->have_fault_thread = true;
 
@@ -1197,6 +1209,16 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
         return -1;
     }
 
+    if (migrate_postcopy_preempt()) {
+        /*
+         * This thread needs to be created after the temp pages are set up,
+         * because it fetches the RAM_CHANNEL_POSTCOPY PostcopyTmpPage right away.
+         */
+        postcopy_thread_create(mis, &mis->postcopy_prio_thread, "qemu/fault-fast",
+                               postcopy_preempt_thread, QEMU_THREAD_JOINABLE);
+        mis->postcopy_prio_thread_created = true;
+    }
+
     trace_postcopy_ram_enable_notify();
 
     return 0;
@@ -1516,3 +1538,40 @@ void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd)
         }
     }
 }
+
+bool postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file)
+{
+    mis->postcopy_qemufile_dst = file;
+
+    trace_postcopy_preempt_new_channel();
+
+    /* Start the migration immediately */
+    return true;
+}
+
+int postcopy_preempt_setup(MigrationState *s, Error **errp)
+{
+    QIOChannel *ioc;
+
+    if (!migrate_postcopy_preempt()) {
+        return 0;
+    }
+
+    if (!migrate_multi_channels_is_allowed()) {
+        error_setg(errp, "Postcopy preempt is not supported as current "
+                   "migration stream does not support multi-channels.");
+        return -1;
+    }
+
+    ioc = socket_send_channel_create_sync(errp);
+
+    if (ioc == NULL) {
+        return -1;
+    }
+
+    s->postcopy_qemufile_src = qemu_fopen_channel_output(ioc);
+
+    trace_postcopy_preempt_new_channel();
+
+    return 0;
+}
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 07684c0e1d..34b1080cde 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -183,4 +183,14 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd, uint64_t client_addr,
 int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                  uint64_t client_addr, uint64_t offset);
 
+/* Hard-code channels for now for postcopy preemption */
+enum PostcopyChannels {
+    RAM_CHANNEL_PRECOPY = 0,
+    RAM_CHANNEL_POSTCOPY = 1,
+    RAM_CHANNEL_MAX,
+};
+
+bool postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
+int postcopy_preempt_setup(MigrationState *s, Error **errp);
+
 #endif
diff --git a/migration/ram.c b/migration/ram.c
index b7d17613e8..6a1ef86eca 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -294,6 +294,20 @@ struct RAMSrcPageRequest {
     QSIMPLEQ_ENTRY(RAMSrcPageRequest) next_req;
 };
 
+typedef struct {
+    /*
+     * Cached ramblock/offset values if preempted.  They're only meaningful if
+     * preempted==true below.
+     */
+    RAMBlock *ram_block;
+    unsigned long ram_page;
+    /*
+     * Whether a postcopy preemption just happened.  Reset after the
+     * preempted precopy page is restored and background sending resumes.
+     */
+    bool preempted;
+} PostcopyPreemptState;
+
 /* State of RAM for migration */
 struct RAMState {
     /* QEMUFile used for this migration */
@@ -347,6 +361,14 @@ struct RAMState {
     /* Queue of outstanding page requests from the destination */
     QemuMutex src_page_req_mutex;
     QSIMPLEQ_HEAD(, RAMSrcPageRequest) src_page_requests;
+
+    /* Postcopy preemption information */
+    PostcopyPreemptState postcopy_preempt_state;
+    /*
+     * Current channel we're using on src VM.  Only valid if postcopy-preempt
+     * is enabled.
+     */
+    int postcopy_channel;
 };
 typedef struct RAMState RAMState;
 
@@ -354,6 +376,11 @@ static RAMState *ram_state;
 
 static NotifierWithReturnList precopy_notifier_list;
 
+static void postcopy_preempt_reset(RAMState *rs)
+{
+    memset(&rs->postcopy_preempt_state, 0, sizeof(PostcopyPreemptState));
+}
+
 /* Whether postcopy has queued requests? */
 static bool postcopy_has_request(RAMState *rs)
 {
@@ -1937,6 +1964,55 @@ void ram_write_tracking_stop(void)
 }
 #endif /* defined(__linux__) */
 
+/*
+ * Check whether two addresses/offsets of the ramblock fall on the same host
+ * huge page.  Returns true if so, false otherwise.
+ */
+static bool offset_on_same_huge_page(RAMBlock *rb, uint64_t addr1,
+                                     uint64_t addr2)
+{
+    size_t page_size = qemu_ram_pagesize(rb);
+
+    addr1 = ROUND_DOWN(addr1, page_size);
+    addr2 = ROUND_DOWN(addr2, page_size);
+
+    return addr1 == addr2;
+}
+
+/*
+ * Does a previously preempted precopy huge page contain the currently
+ * requested page?  Returns true if so, false otherwise.
+ *
+ * This should happen very rarely, because it means that while sending
+ * background pages for postcopy we were sending exactly the page some
+ * vcpu faulted on on the dest node.  When it happens, we probably don't
+ * need to do much but drop the request; we know that right after we
+ * restore the precopy stream it'll be serviced.  It slightly affects the
+ * order in which postcopy requests are serviced (as if we had moved the
+ * current request to the end of the queue), but that shouldn't be a big
+ * deal.  The most important thing is that we can _never_ send a
+ * partially-sent huge page on the POSTCOPY channel again, otherwise that
+ * huge page would get "split brain" across the two channels.
+ */
+static bool postcopy_preempted_contains(RAMState *rs, RAMBlock *block,
+                                        ram_addr_t offset)
+{
+    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
+
+    /* No preemption at all? */
+    if (!state->preempted) {
+        return false;
+    }
+
+    /* Not even the same ramblock? */
+    if (state->ram_block != block) {
+        return false;
+    }
+
+    return offset_on_same_huge_page(block, offset,
+                                    state->ram_page << TARGET_PAGE_BITS);
+}
+
 /**
  * get_queued_page: unqueue a page from the postcopy requests
  *
@@ -1952,9 +2028,17 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
     RAMBlock  *block;
     ram_addr_t offset;
 
+again:
     block = unqueue_page(rs, &offset);
 
-    if (!block) {
+    if (block) {
+        /* See comment above postcopy_preempted_contains() */
+        if (postcopy_preempted_contains(rs, block, offset)) {
+            trace_postcopy_preempt_hit(block->idstr, offset);
+            /* This request is dropped */
+            goto again;
+        }
+    } else {
         /*
          * Poll write faults too if background snapshot is enabled; that's
          * when we have vcpus got blocked by the write protected pages.
@@ -2173,6 +2257,114 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
     return ram_save_page(rs, pss, last_stage);
 }
 
+static bool postcopy_needs_preempt(RAMState *rs, PageSearchStatus *pss)
+{
+    /* Eager preempt not enabled?  Then never preempt. */
+    if (!migrate_postcopy_preempt()) {
+        return false;
+    }
+
+    /* Does the ramblock we're sending use small pages?  Never bother. */
+    if (qemu_ram_pagesize(pss->block) == TARGET_PAGE_SIZE) {
+        return false;
+    }
+
+    /* Not in postcopy at all? */
+    if (!migration_in_postcopy()) {
+        return false;
+    }
+
+    /*
+     * If we're already handling a postcopy request, don't preempt as this page
+     * has got the same high priority.
+     */
+    if (pss->postcopy_requested) {
+        return false;
+    }
+
+    /* If there are pending postcopy requests, preempt! */
+    return postcopy_has_request(rs);
+}
+
+/* Preempt precopy, caching the current precopy page so we can resume later */
+static void postcopy_do_preempt(RAMState *rs, PageSearchStatus *pss)
+{
+    PostcopyPreemptState *p_state = &rs->postcopy_preempt_state;
+
+    trace_postcopy_preempt_triggered(pss->block->idstr, pss->page);
+
+    /*
+     * Time to preempt precopy.  Cache the current PSS in the preempt state
+     * so that we can return to it after handling the postcopy pages.  We
+     * need to, because the dest VM will have kept part of the precopy huge
+     * page in its tmp huge page cache; better to finish it when we can.
+     */
+    p_state->ram_block = pss->block;
+    p_state->ram_page = pss->page;
+    p_state->preempted = true;
+}
+
+/* Whether we're preempted by a postcopy request during sending a huge page */
+static bool postcopy_preempt_triggered(RAMState *rs)
+{
+    return rs->postcopy_preempt_state.preempted;
+}
+
+static void postcopy_preempt_restore(RAMState *rs, PageSearchStatus *pss)
+{
+    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
+
+    assert(state->preempted);
+
+    pss->block = state->ram_block;
+    pss->page = state->ram_page;
+    /* This is not a postcopy request but restoring previous precopy */
+    pss->postcopy_requested = false;
+
+    trace_postcopy_preempt_restored(pss->block->idstr, pss->page);
+
+    /* Reset preempt state, most importantly, set preempted==false */
+    postcopy_preempt_reset(rs);
+}
+
+static void postcopy_preempt_choose_channel(RAMState *rs, PageSearchStatus *pss)
+{
+    int channel = pss->postcopy_requested ? RAM_CHANNEL_POSTCOPY : RAM_CHANNEL_PRECOPY;
+    MigrationState *s = migrate_get_current();
+    QEMUFile *next;
+
+    if (channel != rs->postcopy_channel) {
+        if (channel == RAM_CHANNEL_PRECOPY) {
+            next = s->to_dst_file;
+        } else {
+            next = s->postcopy_qemufile_src;
+        }
+        /* Update and cache the current channel */
+        rs->f = next;
+        rs->postcopy_channel = channel;
+
+        /*
+         * If channel switched, reset last_sent_block since the old sent block
+         * may not be on the same channel.
+         */
+        rs->last_sent_block = NULL;
+
+        trace_postcopy_preempt_switch_channel(channel);
+    }
+
+    trace_postcopy_preempt_send_host_page(pss->block->idstr, pss->page);
+}
+
+/* We need to make sure rs->f always points to the default channel elsewhere */
+static void postcopy_preempt_reset_channel(RAMState *rs)
+{
+    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
+        rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
+        rs->f = migrate_get_current()->to_dst_file;
+        trace_postcopy_preempt_reset_channel();
+    }
+}
+
 /**
  * ram_save_host_page: save a whole host page
  *
@@ -2207,7 +2399,16 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
         return 0;
     }
 
+    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
+        postcopy_preempt_choose_channel(rs, pss);
+    }
+
     do {
+        if (postcopy_needs_preempt(rs, pss)) {
+            postcopy_do_preempt(rs, pss);
+            break;
+        }
+
         /* Check the pages is dirty and if it is send it */
         if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
             tmppages = ram_save_target_page(rs, pss, last_stage);
@@ -2229,6 +2430,19 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
              offset_in_ramblock(pss->block,
                                 ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
 
+    /*
+     * In postcopy preempt mode, flush the data as soon as possible for
+     * postcopy requests: we've already sent a whole huge page, so the dst
+     * node should have enough resources to atomically fill in the
+     * currently missing page.
+     *
+     * More importantly, when using a separate postcopy channel, we must
+     * flush explicitly or the data won't go out until the buffer is full.
+     */
+    if (migrate_postcopy_preempt() && pss->postcopy_requested) {
+        qemu_fflush(rs->f);
+    }
+
     res = ram_save_release_protection(rs, pss, start_page);
     return (res < 0 ? res : pages);
 }
@@ -2272,8 +2486,17 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
         found = get_queued_page(rs, &pss);
 
         if (!found) {
-            /* priority queue empty, so just search for something dirty */
-            found = find_dirty_block(rs, &pss, &again);
+            /*
+             * Recover previous precopy ramblock/offset if postcopy has
+             * preempted precopy.  Otherwise find the next dirty bit.
+             */
+            if (postcopy_preempt_triggered(rs)) {
+                postcopy_preempt_restore(rs, &pss);
+                found = true;
+            } else {
+                /* priority queue empty, so just search for something dirty */
+                found = find_dirty_block(rs, &pss, &again);
+            }
         }
 
         if (found) {
@@ -2401,6 +2624,8 @@ static void ram_state_reset(RAMState *rs)
     rs->last_page = 0;
     rs->last_version = ram_list.version;
     rs->xbzrle_enabled = false;
+    postcopy_preempt_reset(rs);
+    rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
@@ -3043,6 +3268,8 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
     }
     qemu_mutex_unlock(&rs->bitmap_mutex);
 
+    postcopy_preempt_reset_channel(rs);
+
     /*
      * Must occur before EOS (or any QEMUFile operation)
      * because of RDMA protocol.
@@ -3110,6 +3337,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         ram_control_after_iterate(f, RAM_CONTROL_FINISH);
     }
 
+    postcopy_preempt_reset_channel(rs);
+
     if (ret >= 0) {
         multifd_send_sync_main(rs->f);
         qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
@@ -3192,11 +3421,13 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
  * @mis: the migration incoming state pointer
  * @f: QEMUFile where to read the data from
  * @flags: Page flags (mostly to see if it's a continuation of previous block)
+ * @channel: the channel we're using
  */
 static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
-                                              QEMUFile *f, int flags)
+                                              QEMUFile *f, int flags,
+                                              int channel)
 {
-    RAMBlock *block = mis->last_recv_block;
+    RAMBlock *block = mis->last_recv_block[channel];
     char id[256];
     uint8_t len;
 
@@ -3223,7 +3454,7 @@ static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
         return NULL;
     }
 
-    mis->last_recv_block = block;
+    mis->last_recv_block[channel] = block;
 
     return block;
 }
@@ -3642,15 +3873,15 @@ int ram_postcopy_incoming_init(MigrationIncomingState *mis)
  * rcu_read_lock is taken prior to this being called.
  *
  * @f: QEMUFile where to send the data
+ * @channel: the channel to use for loading
  */
-static int ram_load_postcopy(QEMUFile *f)
+static int ram_load_postcopy(QEMUFile *f, int channel)
 {
     int flags = 0, ret = 0;
     bool place_needed = false;
     bool matches_target_page_size = false;
     MigrationIncomingState *mis = migration_incoming_get_current();
-    /* Currently we only use channel 0.  TODO: use all the channels */
-    PostcopyTmpPage *tmp_page = &mis->postcopy_tmp_pages[0];
+    PostcopyTmpPage *tmp_page = &mis->postcopy_tmp_pages[channel];
 
     while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
         ram_addr_t addr;
@@ -3677,7 +3908,7 @@ static int ram_load_postcopy(QEMUFile *f)
         trace_ram_load_postcopy_loop((uint64_t)addr, flags);
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE)) {
-            block = ram_block_from_stream(mis, f, flags);
+            block = ram_block_from_stream(mis, f, flags, channel);
             if (!block) {
                 ret = -EINVAL;
                 break;
@@ -3715,10 +3946,10 @@ static int ram_load_postcopy(QEMUFile *f)
             } else if (tmp_page->host_addr !=
                        host_page_from_ram_block_offset(block, addr)) {
                 /* not the 1st TP within the HP */
-                error_report("Non-same host page detected.  Target host page %p, "
-                             "received host page %p "
+                error_report("Non-same host page detected on channel %d: "
+                             "Target host page %p, received host page %p "
                              "(rb %s offset 0x"RAM_ADDR_FMT" target_pages %d)",
-                             tmp_page->host_addr,
+                             channel, tmp_page->host_addr,
                              host_page_from_ram_block_offset(block, addr),
                              block->idstr, addr, tmp_page->target_pages);
                 ret = -EINVAL;
@@ -3818,6 +4049,28 @@ static int ram_load_postcopy(QEMUFile *f)
     return ret;
 }
 
+void *postcopy_preempt_thread(void *opaque)
+{
+    MigrationIncomingState *mis = opaque;
+    int ret;
+
+    trace_postcopy_preempt_thread_entry();
+
+    rcu_register_thread();
+
+    qemu_sem_post(&mis->thread_sync_sem);
+
+    /* The source will send RAM_SAVE_FLAG_EOS to terminate this thread */
+    ret = ram_load_postcopy(mis->postcopy_qemufile_dst, RAM_CHANNEL_POSTCOPY);
+
+    rcu_unregister_thread();
+
+    trace_postcopy_preempt_thread_exit();
+
+    return ret == 0 ? NULL : (void *)-1;
+}
+
+
 static bool postcopy_is_advised(void)
 {
     PostcopyState ps = postcopy_state_get();
@@ -3930,7 +4183,7 @@ static int ram_load_precopy(QEMUFile *f)
 
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
-            RAMBlock *block = ram_block_from_stream(mis, f, flags);
+            RAMBlock *block = ram_block_from_stream(mis, f, flags, RAM_CHANNEL_PRECOPY);
 
             host = host_from_ram_block_offset(block, addr);
             /*
@@ -4107,7 +4360,12 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
      */
     WITH_RCU_READ_LOCK_GUARD() {
         if (postcopy_running) {
-            ret = ram_load_postcopy(f);
+            /*
+             * Note!  Here RAM_CHANNEL_PRECOPY is the precopy channel of
+             * postcopy migration; we have another RAM_CHANNEL_POSTCOPY to
+             * service fast page faults.
+             */
+            ret = ram_load_postcopy(f, RAM_CHANNEL_PRECOPY);
         } else {
             ret = ram_load_precopy(f);
         }
@@ -4269,6 +4527,12 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
     return 0;
 }
 
+void postcopy_preempt_shutdown_file(MigrationState *s)
+{
+    qemu_put_be64(s->postcopy_qemufile_src, RAM_SAVE_FLAG_EOS);
+    qemu_fflush(s->postcopy_qemufile_src);
+}
+
 static SaveVMHandlers savevm_ram_handlers = {
     .save_setup = ram_save_setup,
     .save_live_iterate = ram_save_iterate,
diff --git a/migration/ram.h b/migration/ram.h
index 2c6dc3675d..f31b8c0ece 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -72,6 +72,8 @@ int64_t ramblock_recv_bitmap_send(QEMUFile *file,
                                   const char *block_name);
 int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
 bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
+void postcopy_preempt_shutdown_file(MigrationState *s);
+void *postcopy_preempt_thread(void *opaque);
 
 /* ram cache */
 int colo_init_ram_cache(void);
diff --git a/migration/socket.c b/migration/socket.c
index 05705a32d8..955c5ebb10 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -39,6 +39,24 @@ void socket_send_channel_create(QIOTaskFunc f, void *data)
                                      f, data, NULL, NULL);
 }
 
+QIOChannel *socket_send_channel_create_sync(Error **errp)
+{
+    QIOChannelSocket *sioc = qio_channel_socket_new();
+
+    if (!outgoing_args.saddr) {
+        object_unref(OBJECT(sioc));
+        error_setg(errp, "Initial sock address not set!");
+        return NULL;
+    }
+
+    if (qio_channel_socket_connect_sync(sioc, outgoing_args.saddr, errp) < 0) {
+        object_unref(OBJECT(sioc));
+        return NULL;
+    }
+
+    return QIO_CHANNEL(sioc);
+}
+
 int socket_send_channel_destroy(QIOChannel *send)
 {
     /* Remove channel */
diff --git a/migration/socket.h b/migration/socket.h
index 891dbccceb..dc54df4e6c 100644
--- a/migration/socket.h
+++ b/migration/socket.h
@@ -21,6 +21,7 @@
 #include "io/task.h"
 
 void socket_send_channel_create(QIOTaskFunc f, void *data);
+QIOChannel *socket_send_channel_create_sync(Error **errp);
 int socket_send_channel_destroy(QIOChannel *send);
 
 void socket_start_incoming_migration(const char *str, Error **errp);
diff --git a/migration/trace-events b/migration/trace-events
index 3a9b3567ae..6452179bee 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -110,6 +110,12 @@ ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRI
 ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
 ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
 ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
+postcopy_preempt_triggered(char *str, unsigned long page) "during sending ramblock %s offset 0x%lx"
+postcopy_preempt_restored(char *str, unsigned long page) "ramblock %s offset 0x%lx"
+postcopy_preempt_hit(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
+postcopy_preempt_send_host_page(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
+postcopy_preempt_switch_channel(int channel) "%d"
+postcopy_preempt_reset_channel(void) ""
 
 # multifd.c
 multifd_new_send_channel_async(uint8_t id) "channel %d"
@@ -175,6 +181,7 @@ migration_thread_low_pending(uint64_t pending) "%" PRIu64
 migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
 process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
 process_incoming_migration_co_postcopy_end_main(void) ""
+postcopy_preempt_enabled(bool value) "%d"
 
 # channel.c
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
@@ -277,6 +284,9 @@ postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_off
 postcopy_request_shared_page_present(const char *sharer, const char *rb, uint64_t rb_offset) "%s already %s offset 0x%"PRIx64
 postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
 postcopy_page_req_del(void *addr, int count) "resolved page req %p total %d"
+postcopy_preempt_new_channel(void) ""
+postcopy_preempt_thread_entry(void) ""
+postcopy_preempt_thread_exit(void) ""
 
 get_mem_fault_cpu_index(int cpu, uint32_t pid) "cpu: %d, pid: %u"
 
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH RFC 15/15] tests: Add postcopy preempt test
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (13 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 14/15] migration: Postcopy preemption on separate channel Peter Xu
@ 2022-01-19  8:09 ` Peter Xu
  2022-02-03 15:53   ` Dr. David Alan Gilbert
  2022-01-19 12:32 ` [PATCH RFC 00/15] migration: Postcopy Preemption Dr. David Alan Gilbert
  15 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-19  8:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, peterx,
	Leonardo Bras Soares Passos

Add a postcopy preempt qtest.  It is the same as the existing postcopy test,
except that the "postcopy-preempt" capability is enabled on both sides
before the migration starts.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tests/qtest/migration-test.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 7b42f6fd90..93ff43bb3f 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -470,6 +470,7 @@ typedef struct {
      */
     bool hide_stderr;
     bool use_shmem;
+    bool postcopy_preempt;
     /* only launch the target process */
     bool only_target;
     /* Use dirty ring if true; dirty logging otherwise */
@@ -673,6 +674,11 @@ static int migrate_postcopy_prepare(QTestState **from_ptr,
     migrate_set_capability(to, "postcopy-ram", true);
     migrate_set_capability(to, "postcopy-blocktime", true);
 
+    if (args->postcopy_preempt) {
+        migrate_set_capability(from, "postcopy-preempt", true);
+        migrate_set_capability(to, "postcopy-preempt", true);
+    }
+
     /* We want to pick a speed slow enough that the test completes
      * quickly, but that it doesn't complete precopy even on a slow
      * machine, so also set the downtime.
@@ -719,6 +725,20 @@ static void test_postcopy(void)
     migrate_postcopy_complete(from, to);
 }
 
+static void test_postcopy_preempt(void)
+{
+    MigrateStart *args = migrate_start_new();
+    QTestState *from, *to;
+
+    args->postcopy_preempt = true;
+
+    if (migrate_postcopy_prepare(&from, &to, args)) {
+        return;
+    }
+    migrate_postcopy_start(from, to);
+    migrate_postcopy_complete(from, to);
+}
+
 static void test_postcopy_recovery(void)
 {
     MigrateStart *args = migrate_start_new();
@@ -1458,6 +1478,7 @@ int main(int argc, char **argv)
     module_call_init(MODULE_INIT_QOM);
 
     qtest_add_func("/migration/postcopy/unix", test_postcopy);
+    qtest_add_func("/migration/postcopy/preempt", test_postcopy_preempt);
     qtest_add_func("/migration/postcopy/recovery", test_postcopy_recovery);
     qtest_add_func("/migration/bad_dest", test_baddest);
     qtest_add_func("/migration/precopy/unix", test_precopy_unix);
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 00/15] migration: Postcopy Preemption
  2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
                   ` (14 preceding siblings ...)
  2022-01-19  8:09 ` [PATCH RFC 15/15] tests: Add postcopy preempt test Peter Xu
@ 2022-01-19 12:32 ` Dr. David Alan Gilbert
  15 siblings, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-19 12:32 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> Based-on: <20211224065000.97572-1-peterx@redhat.com>
> 
> Human version - This patchset is based on:
>   https://lore.kernel.org/qemu-devel/20211224065000.97572-1-peterx@redhat.com/
> 
> This series can also be found here:
>   https://github.com/xzpeter/qemu/tree/postcopy-preempt
> 
> Abstract
> ========
> 
> This series added a new migration capability called "postcopy-preempt".  It can
> be enabled when postcopy is enabled, and it'll simply (but greatly) speed up
> postcopy page requests handling process.
> 
> Some quick tests below measuring postcopy page request latency:
> 
>   - Guest config: 20G guest, 40 vcpus
>   - Host config: 10Gbps host NIC attached between src/dst
>   - Workload: one busy dirty thread, writting to 18G memory (pre-faulted).
>     (refers to "2M/4K huge page, 1 dirty thread" tests below)
>   - Script: see [1]
> 
>   |----------------+--------------+-----------------------|
>   | Host page size | Vanilla (ms) | Postcopy Preempt (ms) |
>   |----------------+--------------+-----------------------|
>   | 2M             |        10.58 |                  4.96 |
>   | 4K             |        10.68 |                  0.57 |
>   |----------------+--------------+-----------------------|
> 
> For 2M page, we got 1x speedup.  For 4K page, 18x speedup.
> 
> For more information on the testing, please refer to "Test Results" below.
> 
> Design
> ======
> 
> The postcopy-preempt feature contains two major reworks on postcopy page fault
> handlings:
> 
>     (1) Postcopy requests are now sent via a different socket from precopy
>         background migration stream, so as to be isolated from very high page
>         request delays
> 
>     (2) For huge page enabled hosts: when there's postcopy requests, they can
>         now intercept a partial sending of huge host pages on src QEMU.
> 
> The design is relatively straightforward, however there're trivial
> implementation details that the patchset needs to address.  Many of them are
> addressed as separate patches.  The rest is handled majorly in the big patch to
> enable the whole feature.
> 
> Postcopy recovery is not yet supported, it'll be done after some initial review
> on the solution first.
> 
> Patch layout
> ============
> 
> The initial 10 (out of 15) patches are mostly even suitable to be merged
> without the new feature, so they can be looked at even earlier.
> 
> Patch 11-14 implements the new feature, in which patches 11-13 are mostly still
> small and doing preparations, and the major change is done in patch 14.
> 
> Patch 15 is an unit test.
> 
> Tests Results
> ==================
> 
> When measuring the page request latency, I did that via trapping userfaultfd
> kernel faults using the bpf script [1]. I ignored kvm fast page faults, because
> when it happened it means no major/real page fault is even needed, IOW, no
> query to src QEMU.
> 
> The numbers (and histogram) I captured below are based on a whole procedure of
> postcopy migration that I sampled with different configurations, and the
> average page request latency was calculated.  I also captured the latency
> distribution, it's also interesting too to look at them here.
> 
> One thing to mention is I didn't even test 1G pages.  It doesn't mean that this
> series won't help 1G - actually it'll help no less than what I've tested I
> believe, it's just that for 1G huge pages the latency will be >1sec on 10Gbps
> nic so it's not really a usable scenario for any sensible customer.
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 2M huge page, 1 dirty thread
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> With vanilla postcopy:
> 
> Average: 10582 (us)
> 
> @delay_us:
> [1K, 2K)               7 |                                                    |
> [2K, 4K)               1 |                                                    |
> [4K, 8K)               9 |                                                    |
> [8K, 16K)           1983 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> With postcopy-preempt:
> 
> Average: 4960 (us)
> 
> @delay_us:
> [1K, 2K)               5 |                                                    |
> [2K, 4K)              44 |                                                    |
> [4K, 8K)            3495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K)            154 |@@                                                  |
> [16K, 32K)             1 |                                                    |
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 4K small page, 1 dirty thread
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> With vanilla postcopy:
> 
> Average: 10676 (us)
> 
> @delay_us:
> [4, 8)                 1 |                                                    |
> [8, 16)                3 |                                                    |
> [16, 32)               5 |                                                    |
> [32, 64)               3 |                                                    |
> [64, 128)             12 |                                                    |
> [128, 256)            10 |                                                    |
> [256, 512)            27 |                                                    |
> [512, 1K)              5 |                                                    |
> [1K, 2K)              11 |                                                    |
> [2K, 4K)              17 |                                                    |
> [4K, 8K)              10 |                                                    |
> [8K, 16K)           2681 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K)             6 |                                                    |
> 
> With postcopy preempt:
> 
> Average: 570 (us)
> 
> @delay_us:
> [16, 32)               5 |                                                    |
> [32, 64)               6 |                                                    |
> [64, 128)           8340 |@@@@@@@@@@@@@@@@@@                                  |
> [128, 256)         23052 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [256, 512)          8119 |@@@@@@@@@@@@@@@@@@                                  |
> [512, 1K)            148 |                                                    |
> [1K, 2K)             759 |@                                                   |
> [2K, 4K)            6729 |@@@@@@@@@@@@@@@                                     |
> [4K, 8K)              80 |                                                    |
> [8K, 16K)            115 |                                                    |
> [16K, 32K)            32 |                                                    |

Nice speedups.

> One funny thing about 4K small pages is that with vanilla postcopy I didn't
> even get a speedup compared to 2M pages, probably because the major overhead
> is not sending the page itself, but other things (e.g. waiting for precopy to
> flush the existing pages).
> 
> The other thing is that in the postcopy preempt test I can still see a bunch
> of 2ms-4ms latency page requests.  That's probably what we would like to dig
> into next.  One possibility is that since we share the same sending thread on
> the src QEMU, we could have yielded because the precopy socket was full.  But
> that's TBD.

I guess those could be pages queued behind others; or maybe something
like a page that starts getting sent on the main socket, is then
interrupted by another, and then the original page is wanted?

> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 4K small page, 16 dirty threads
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> As an extra test, I used 16 concurrent faulting threads; in this case the
> postcopy queue can be relatively long.  It's done via:
> 
>   $ stress -m 16 --vm-bytes 1073741824 --vm-keep
> 
> With vanilla postcopy:
> 
> Average: 2244 (us)
> 
> @delay_us:
> [0]                  556 |                                                    |
> [1]                11251 |@@@@@@@@@@@@                                        |
> [2, 4)             12094 |@@@@@@@@@@@@@                                       |
> [4, 8)             12234 |@@@@@@@@@@@@@                                       |
> [8, 16)            47144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16, 32)           42281 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [32, 64)           17676 |@@@@@@@@@@@@@@@@@@@                                 |
> [64, 128)            952 |@                                                   |
> [128, 256)           405 |                                                    |
> [256, 512)           779 |                                                    |
> [512, 1K)           1003 |@                                                   |
> [1K, 2K)            1976 |@@                                                  |
> [2K, 4K)            4865 |@@@@@                                               |
> [4K, 8K)            5892 |@@@@@@                                              |
> [8K, 16K)          26941 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
> [16K, 32K)           844 |                                                    |
> [32K, 64K)            17 |                                                    |
> 
> With postcopy preempt:
> 
> Average: 1064 (us)
> 
> @delay_us:
> [0]                 1341 |                                                    |
> [1]                30211 |@@@@@@@@@@@@                                        |
> [2, 4)             32934 |@@@@@@@@@@@@@                                       |
> [4, 8)             21295 |@@@@@@@@                                            |
> [8, 16)           130774 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16, 32)           95128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> [32, 64)           49591 |@@@@@@@@@@@@@@@@@@@                                 |
> [64, 128)           3921 |@                                                   |
> [128, 256)          1066 |                                                    |
> [256, 512)          2730 |@                                                   |
> [512, 1K)           1849 |                                                    |
> [1K, 2K)             512 |                                                    |
> [2K, 4K)            2355 |                                                    |
> [4K, 8K)           48812 |@@@@@@@@@@@@@@@@@@@                                 |
> [8K, 16K)          10026 |@@@                                                 |
> [16K, 32K)           810 |                                                    |
> [32K, 64K)            68 |                                                    |
> 
> In this specific case, a funny thing is that when there are tons of postcopy
> requests, vanilla postcopy handles page requests even faster (2ms average)
> than when there's only 1 dirty thread.  That's probably because
> unqueue_page() will always hit anyway, so precopy streaming has less effect
> on postcopy.  However that's still slower than having a standalone postcopy
> stream as the preempt version has (1ms).

Curious.

Dave

> Any comments are welcome.
> 
> [1] https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf
> 
> Peter Xu (15):
>   migration: No off-by-one for pss->page update in host page size
>   migration: Allow pss->page jump over clean pages
>   migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat
>   migration: Add postcopy_has_request()
>   migration: Simplify unqueue_page()
>   migration: Move temp page setup and cleanup into separate functions
>   migration: Introduce postcopy channels on dest node
>   migration: Dump ramblock and offset too when non-same-page detected
>   migration: Add postcopy_thread_create()
>   migration: Move static var in ram_block_from_stream() into global
>   migration: Add pss.postcopy_requested status
>   migration: Move migrate_allow_multifd and helpers into migration.c
>   migration: Add postcopy-preempt capability
>   migration: Postcopy preemption on separate channel
>   tests: Add postcopy preempt test
> 
>  migration/migration.c        | 107 +++++++--
>  migration/migration.h        |  55 ++++-
>  migration/multifd.c          |  19 +-
>  migration/multifd.h          |   2 -
>  migration/postcopy-ram.c     | 192 ++++++++++++----
>  migration/postcopy-ram.h     |  14 ++
>  migration/ram.c              | 417 ++++++++++++++++++++++++++++-------
>  migration/ram.h              |   2 +
>  migration/savevm.c           |  12 +-
>  migration/socket.c           |  18 ++
>  migration/socket.h           |   1 +
>  migration/trace-events       |  12 +-
>  qapi/migration.json          |   8 +-
>  tests/qtest/migration-test.c |  21 ++
>  14 files changed, 716 insertions(+), 164 deletions(-)
> 
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 01/15] migration: No off-by-one for pss->page update in host page size
  2022-01-19  8:09 ` [PATCH RFC 01/15] migration: No off-by-one for pss->page update in host page size Peter Xu
@ 2022-01-19 12:58   ` Dr. David Alan Gilbert
  2022-01-27  9:40   ` Juan Quintela
  1 sibling, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-19 12:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Juan Quintela, Kunkun Jiang, qemu-devel,
	Leonardo Bras Soares Passos, Keqian Zhu, Andrey Gruzdev

* Peter Xu (peterx@redhat.com) wrote:
> We used to do an off-by-one fixup for pss->page when finishing one host huge
> page transfer.  That seems to be entirely unnecessary.  Drop it.
> 
> Cc: Keqian Zhu <zhukeqian1@huawei.com>
> Cc: Kunkun Jiang <jiangkunkun@huawei.com>
> Cc: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Yes, I think so - I guess the -1 and +1 cancel so it works, and in
practice ram_save_host_page then points to one page inside the hugepage,
which is then always clean (because we just sent it), so it probably
survives.
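
To spell the arithmetic out: for a 2M host page (512 x 4K target pages),
say the loop ends with pss->page == hostpage_boundary == 512.  Then:

  old: pss->page = MIN(512, 512) - 1 = 511   (last page of this hugepage -
       just sent, hence clean, so the next bitmap scan skips it anyway)
  new: pss->page = MIN(512, 512)     = 512   (first page of the next one)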

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/ram.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 5234d1ece1..381ad56d26 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1611,7 +1611,7 @@ static int ram_save_release_protection(RAMState *rs, PageSearchStatus *pss,
>      /* Check if page is from UFFD-managed region. */
>      if (pss->block->flags & RAM_UF_WRITEPROTECT) {
>          void *page_address = pss->block->host + (start_page << TARGET_PAGE_BITS);
> -        uint64_t run_length = (pss->page - start_page + 1) << TARGET_PAGE_BITS;
> +        uint64_t run_length = (pss->page - start_page) << TARGET_PAGE_BITS;
>  
>          /* Flush async buffers before un-protect. */
>          qemu_fflush(rs->f);
> @@ -2230,7 +2230,7 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>               offset_in_ramblock(pss->block,
>                                  ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
>      /* The offset we leave with is the min boundary of host page and block */
> -    pss->page = MIN(pss->page, hostpage_boundary) - 1;
> +    pss->page = MIN(pss->page, hostpage_boundary);
>  
>      res = ram_save_release_protection(rs, pss, start_page);
>      return (res < 0 ? res : pages);
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 02/15] migration: Allow pss->page jump over clean pages
  2022-01-19  8:09 ` [PATCH RFC 02/15] migration: Allow pss->page jump over clean pages Peter Xu
@ 2022-01-19 13:42   ` Dr. David Alan Gilbert
  2022-01-20  2:12     ` Peter Xu
  0 siblings, 1 reply; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-19 13:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: Kunkun Jiang, Juan Quintela, Keqian Zhu, qemu-devel,
	Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> Commit ba1b7c812c ("migration/ram: Optimize ram_save_host_page()") managed to
> optimize the host huge page use case by scanning the dirty bitmap when looking
> for the next dirty small page to migrate.
> 
> However, when updating pss->page before returning from that function, we used
> the MIN() of these two values: (1) the next dirty bit, or (2) the end of the
> current sent huge page, to fix up pss->page.
> 
> That sounds unnecessary, because I see nowhere that requires pss->page to not
> go over the current huge page boundary.
> 
> What we need here is probably MAX() instead of MIN(), so that we'll start
> scanning from the next dirty bit next time.  Since pss->page can't be smaller
> than hostpage_boundary (the loop guarantees it), it probably means we don't
> need to fix it up at all.
> 
> Cc: Keqian Zhu <zhukeqian1@huawei.com>
> Cc: Kunkun Jiang <jiangkunkun@huawei.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>


Hmm, I think that's potentially necessary.  Note that the start of
ram_save_host_page stores 'start_page' at entry.
That 'start_page' goes to ram_save_release_protection, and so
I think it needs to be pagesize-aligned for the mmap/uffd that happens.
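
i.e. something like the following would need to hold - an illustrative
check only, not code from the series ('start_page' is counted in target
pages):

  unsigned long pages_per_hp = qemu_ram_pagesize(pss->block) / TARGET_PAGE_SIZE;
  assert(QEMU_IS_ALIGNED(start_page, pages_per_hp));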

Dave

> ---
>  migration/ram.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 381ad56d26..94b0ad4234 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2229,8 +2229,6 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>      } while ((pss->page < hostpage_boundary) &&
>               offset_in_ramblock(pss->block,
>                                  ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
> -    /* The offset we leave with is the min boundary of host page and block */
> -    pss->page = MIN(pss->page, hostpage_boundary);
>  
>      res = ram_save_release_protection(rs, pss, start_page);
>      return (res < 0 ? res : pages);
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 03/15] migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat
  2022-01-19  8:09 ` [PATCH RFC 03/15] migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat Peter Xu
@ 2022-01-19 14:15   ` Dr. David Alan Gilbert
  2022-01-27  9:40   ` Juan Quintela
  1 sibling, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-19 14:15 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> This patch allows us to read the tid even without the blocktime feature
> enabled.  It's useful when tracing the postcopy fault thread, to show the
> thread id along with the address of faulted pages.
> 
> Remove the comments - they're really not helpful at all.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/postcopy-ram.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index d18b5d05b2..2176ed68a5 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -283,15 +283,13 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
>      }
>  
>  #ifdef UFFD_FEATURE_THREAD_ID
> -    if (migrate_postcopy_blocktime() && mis &&
> -        UFFD_FEATURE_THREAD_ID & supported_features) {
> -        /* kernel supports that feature */
> -        /* don't create blocktime_context if it exists */
> -        if (!mis->blocktime_ctx) {
> -            mis->blocktime_ctx = blocktime_context_new();
> -        }
> -
> +    if (UFFD_FEATURE_THREAD_ID & supported_features) {
>          asked_features |= UFFD_FEATURE_THREAD_ID;
> +        if (migrate_postcopy_blocktime()) {
> +            if (!mis->blocktime_ctx) {
> +                mis->blocktime_ctx = blocktime_context_new();
> +            }
> +        }
>      }
>  #endif
>  
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 04/15] migration: Add postcopy_has_request()
  2022-01-19  8:09 ` [PATCH RFC 04/15] migration: Add postcopy_has_request() Peter Xu
@ 2022-01-19 14:27   ` Dr. David Alan Gilbert
  2022-01-27  9:41   ` Juan Quintela
  1 sibling, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-19 14:27 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> Add a helper to detect whether postcopy has a pending request.
> 
> While at it, clean up the code a bit: e.g. in unqueue_page() we shouldn't need
> to check again whether the queue is empty, because we're the only one (besides
> the cleanup code, which should never run during this process) that will take a
> request off the list, so the request list can only grow, not shrink, under the
> hood.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/ram.c | 45 ++++++++++++++++++++++++++++-----------------
>  1 file changed, 28 insertions(+), 17 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 94b0ad4234..dc6ba041fa 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -354,6 +354,12 @@ static RAMState *ram_state;
>  
>  static NotifierWithReturnList precopy_notifier_list;
>  
> +/* Whether postcopy has queued requests? */
> +static bool postcopy_has_request(RAMState *rs)
> +{
> +    return !QSIMPLEQ_EMPTY_ATOMIC(&rs->src_page_requests);
> +}
> +
>  void precopy_infrastructure_init(void)
>  {
>      notifier_with_return_list_init(&precopy_notifier_list);
> @@ -1533,28 +1539,33 @@ static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
>   */
>  static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
>  {
> +    struct RAMSrcPageRequest *entry;
>      RAMBlock *block = NULL;
>  
> -    if (QSIMPLEQ_EMPTY_ATOMIC(&rs->src_page_requests)) {
> +    if (!postcopy_has_request(rs)) {
>          return NULL;
>      }
>  
>      QEMU_LOCK_GUARD(&rs->src_page_req_mutex);
> -    if (!QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
> -        struct RAMSrcPageRequest *entry =
> -                                QSIMPLEQ_FIRST(&rs->src_page_requests);
> -        block = entry->rb;
> -        *offset = entry->offset;
> -
> -        if (entry->len > TARGET_PAGE_SIZE) {
> -            entry->len -= TARGET_PAGE_SIZE;
> -            entry->offset += TARGET_PAGE_SIZE;
> -        } else {
> -            memory_region_unref(block->mr);
> -            QSIMPLEQ_REMOVE_HEAD(&rs->src_page_requests, next_req);
> -            g_free(entry);
> -            migration_consume_urgent_request();
> -        }
> +
> +    /*
> +     * This should _never_ change even after we take the lock, because no one
> +     * should be taking anything off the request list other than us.
> +     */
> +    assert(postcopy_has_request(rs));
> +
> +    entry = QSIMPLEQ_FIRST(&rs->src_page_requests);
> +    block = entry->rb;
> +    *offset = entry->offset;
> +
> +    if (entry->len > TARGET_PAGE_SIZE) {
> +        entry->len -= TARGET_PAGE_SIZE;
> +        entry->offset += TARGET_PAGE_SIZE;
> +    } else {
> +        memory_region_unref(block->mr);
> +        QSIMPLEQ_REMOVE_HEAD(&rs->src_page_requests, next_req);
> +        g_free(entry);
> +        migration_consume_urgent_request();
>      }
>  
>      return block;
> @@ -2996,7 +3007,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>          t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>          i = 0;
>          while ((ret = qemu_file_rate_limit(f)) == 0 ||
> -                !QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
> +               postcopy_has_request(rs)) {
>              int pages;
>  
>              if (qemu_file_get_error(f)) {
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 05/15] migration: Simplify unqueue_page()
  2022-01-19  8:09 ` [PATCH RFC 05/15] migration: Simplify unqueue_page() Peter Xu
@ 2022-01-19 16:36   ` Dr. David Alan Gilbert
  2022-01-20  2:23     ` Peter Xu
  2022-01-27  9:41   ` Juan Quintela
  1 sibling, 1 reply; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-19 16:36 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> This patch simplifies unqueue_page() on both sides of it (itself, and caller).
> 
> Firstly, right after unqueue_page() returns true, we'll definitely send a
> huge page (see the ram_save_huge_page() call - it will _never_ exit before
> it finishes sending that huge page), so unqueue_page() does not need to step
> in small page sizes if huge pages are enabled on the ramblock.  IOW, it's
> destined that only the 1st 4K page will be valid; when we unqueue the 2nd+
> time we'll notice the whole huge page has already been sent anyway.
> Switching to operating on huge pages removes a lot of redundant
> unqueue_page() loops.
> 
> Meanwhile, drop the dirty check.  It's not helpful to call test_bit() every
> time to jump over clean pages, as ram_save_host_page() has already done so,
> and in a faster way (see commit ba1b7c812c ("migration/ram: Optimize
> ram_save_host_page()", 2021-05-13)).  So that's not necessary either.
> 
> Drop the two tracepoints along the way - based on the above analysis it's
> very possible that no one is really using them.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Yes, OK

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Although:
  a) You might like to keep a trace in get_queued_page just to see
what's getting unqueued
  b) I think originally it was a useful diagnostic to find out when we
were getting a lot of queue requests for pages that were already sent.

Dave


> ---
>  migration/ram.c        | 34 ++++++++--------------------------
>  migration/trace-events |  2 --
>  2 files changed, 8 insertions(+), 28 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index dc6ba041fa..0df15ff663 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1541,6 +1541,7 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
>  {
>      struct RAMSrcPageRequest *entry;
>      RAMBlock *block = NULL;
> +    size_t page_size;
>  
>      if (!postcopy_has_request(rs)) {
>          return NULL;
> @@ -1557,10 +1558,13 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
>      entry = QSIMPLEQ_FIRST(&rs->src_page_requests);
>      block = entry->rb;
>      *offset = entry->offset;
> +    page_size = qemu_ram_pagesize(block);
> +    /* Each page request should only be multiple page size of the ramblock */
> +    assert((entry->len % page_size) == 0);
>  
> -    if (entry->len > TARGET_PAGE_SIZE) {
> -        entry->len -= TARGET_PAGE_SIZE;
> -        entry->offset += TARGET_PAGE_SIZE;
> +    if (entry->len > page_size) {
> +        entry->len -= page_size;
> +        entry->offset += page_size;
>      } else {
>          memory_region_unref(block->mr);
>          QSIMPLEQ_REMOVE_HEAD(&rs->src_page_requests, next_req);
> @@ -1942,30 +1946,8 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>  {
>      RAMBlock  *block;
>      ram_addr_t offset;
> -    bool dirty;
>  
> -    do {
> -        block = unqueue_page(rs, &offset);
> -        /*
> -         * We're sending this page, and since it's postcopy nothing else
> -         * will dirty it, and we must make sure it doesn't get sent again
> -         * even if this queue request was received after the background
> -         * search already sent it.
> -         */
> -        if (block) {
> -            unsigned long page;
> -
> -            page = offset >> TARGET_PAGE_BITS;
> -            dirty = test_bit(page, block->bmap);
> -            if (!dirty) {
> -                trace_get_queued_page_not_dirty(block->idstr, (uint64_t)offset,
> -                                                page);
> -            } else {
> -                trace_get_queued_page(block->idstr, (uint64_t)offset, page);
> -            }
> -        }
> -
> -    } while (block && !dirty);
> +    block = unqueue_page(rs, &offset);
>  
>      if (!block) {
>          /*
> diff --git a/migration/trace-events b/migration/trace-events
> index e165687af2..3a9b3567ae 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -85,8 +85,6 @@ put_qlist_end(const char *field_name, const char *vmsd_name) "%s(%s)"
>  qemu_file_fclose(void) ""
>  
>  # ram.c
> -get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
> -get_queued_page_not_dirty(const char *block_name, uint64_t tmp_offset, unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
>  migration_bitmap_sync_start(void) ""
>  migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64
>  migration_bitmap_clear_dirty(char *str, uint64_t start, uint64_t size, unsigned long page) "rb %s start 0x%"PRIx64" size 0x%"PRIx64" page 0x%lx"
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 06/15] migration: Move temp page setup and cleanup into separate functions
  2022-01-19  8:09 ` [PATCH RFC 06/15] migration: Move temp page setup and cleanup into separate functions Peter Xu
@ 2022-01-19 16:58   ` Dr. David Alan Gilbert
  2022-01-27  9:43   ` Juan Quintela
  1 sibling, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-19 16:58 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> Temp pages will need to grow if we want to have multiple channels for postcopy,
> because each channel will need its own temp page to cache huge page data.
> 
> Before doing that, cleanup the related code.  No functional change intended.
> 
> Since at it, touch up the errno handling a little bit on the setup side.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/postcopy-ram.c | 82 +++++++++++++++++++++++++---------------
>  1 file changed, 51 insertions(+), 31 deletions(-)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 2176ed68a5..e662dd05cc 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -523,6 +523,19 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis)
>      return 0;
>  }
>  
> +static void postcopy_temp_pages_cleanup(MigrationIncomingState *mis)
> +{
> +    if (mis->postcopy_tmp_page) {
> +        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
> +        mis->postcopy_tmp_page = NULL;
> +    }
> +
> +    if (mis->postcopy_tmp_zero_page) {
> +        munmap(mis->postcopy_tmp_zero_page, mis->largest_page_size);
> +        mis->postcopy_tmp_zero_page = NULL;
> +    }
> +}
> +
>  /*
>   * At the end of a migration where postcopy_ram_incoming_init was called.
>   */
> @@ -564,14 +577,8 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>          }
>      }
>  
> -    if (mis->postcopy_tmp_page) {
> -        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
> -        mis->postcopy_tmp_page = NULL;
> -    }
> -    if (mis->postcopy_tmp_zero_page) {
> -        munmap(mis->postcopy_tmp_zero_page, mis->largest_page_size);
> -        mis->postcopy_tmp_zero_page = NULL;
> -    }
> +    postcopy_temp_pages_cleanup(mis);
> +
>      trace_postcopy_ram_incoming_cleanup_blocktime(
>              get_postcopy_total_blocktime());
>  
> @@ -1082,6 +1089,40 @@ retry:
>      return NULL;
>  }
>  
> +static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
> +{
> +    int err;
> +
> +    mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
> +                                  PROT_READ | PROT_WRITE,
> +                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +    if (mis->postcopy_tmp_page == MAP_FAILED) {
> +        err = errno;
> +        mis->postcopy_tmp_page = NULL;
> +        error_report("%s: Failed to map postcopy_tmp_page %s",
> +                     __func__, strerror(err));
> +        return -err;
> +    }
> +
> +    /*
> +     * Map large zero page when kernel can't use UFFDIO_ZEROPAGE for hugepages
> +     */
> +    mis->postcopy_tmp_zero_page = mmap(NULL, mis->largest_page_size,
> +                                       PROT_READ | PROT_WRITE,
> +                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +    if (mis->postcopy_tmp_zero_page == MAP_FAILED) {
> +        err = errno;
> +        mis->postcopy_tmp_zero_page = NULL;
> +        error_report("%s: Failed to map large zero page %s",
> +                     __func__, strerror(err));
> +        return -err;
> +    }
> +
> +    memset(mis->postcopy_tmp_zero_page, '\0', mis->largest_page_size);
> +
> +    return 0;
> +}
> +
>  int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
>  {
>      /* Open the fd for the kernel to give us userfaults */
> @@ -1122,32 +1163,11 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
>          return -1;
>      }
>  
> -    mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
> -                                  PROT_READ | PROT_WRITE, MAP_PRIVATE |
> -                                  MAP_ANONYMOUS, -1, 0);
> -    if (mis->postcopy_tmp_page == MAP_FAILED) {
> -        mis->postcopy_tmp_page = NULL;
> -        error_report("%s: Failed to map postcopy_tmp_page %s",
> -                     __func__, strerror(errno));
> +    if (postcopy_temp_pages_setup(mis)) {
> +        /* Error dumped in the sub-function */
>          return -1;
>      }
>  
> -    /*
> -     * Map large zero page when kernel can't use UFFDIO_ZEROPAGE for hugepages
> -     */
> -    mis->postcopy_tmp_zero_page = mmap(NULL, mis->largest_page_size,
> -                                       PROT_READ | PROT_WRITE,
> -                                       MAP_PRIVATE | MAP_ANONYMOUS,
> -                                       -1, 0);
> -    if (mis->postcopy_tmp_zero_page == MAP_FAILED) {
> -        int e = errno;
> -        mis->postcopy_tmp_zero_page = NULL;
> -        error_report("%s: Failed to map large zero page %s",
> -                     __func__, strerror(e));
> -        return -e;
> -    }
> -    memset(mis->postcopy_tmp_zero_page, '\0', mis->largest_page_size);
> -
>      trace_postcopy_ram_enable_notify();
>  
>      return 0;
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 02/15] migration: Allow pss->page jump over clean pages
  2022-01-19 13:42   ` Dr. David Alan Gilbert
@ 2022-01-20  2:12     ` Peter Xu
  2022-02-03 18:19       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-20  2:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kunkun Jiang, Juan Quintela, Keqian Zhu, qemu-devel,
	Leonardo Bras Soares Passos

On Wed, Jan 19, 2022 at 01:42:47PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Commit ba1b7c812c ("migration/ram: Optimize ram_save_host_page()") managed to
> > optimize the host huge page use case by scanning the dirty bitmap when looking
> > for the next dirty small page to migrate.
> > 
> > However, when updating pss->page before returning from that function, we used
> > the MIN() of these two values: (1) the next dirty bit, or (2) the end of the
> > current sent huge page, to fix up pss->page.
> > 
> > That sounds unnecessary, because I see nowhere that requires pss->page to not
> > go over the current huge page boundary.
> > 
> > What we need here is probably MAX() instead of MIN(), so that we'll start
> > scanning from the next dirty bit next time.  Since pss->page can't be smaller
> > than hostpage_boundary (the loop guarantees it), it probably means we don't
> > need to fix it up at all.
> > 
> > Cc: Keqian Zhu <zhukeqian1@huawei.com>
> > Cc: Kunkun Jiang <jiangkunkun@huawei.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> 
> Hmm, I think that's potentially necessary.  Note that the start of
> ram_save_host_page stores 'start_page' at entry.
> That 'start_page' goes to ram_save_release_protection, and so
> I think it needs to be pagesize-aligned for the mmap/uffd that happens.

Right, that's indeed a functional change, but IMHO it's also fine.

When reaching ram_save_release_protection(), what we guarantee is that the
page range below contains no dirty bits in the ramblock dirty bitmap:

  range0 = [start_page, pss->page)

Side note: inclusive at the start, but not inclusive at the end of range0
(that is, pss->page can be pointing to a dirty page).

What ram_save_release_protection() does is unprotect the pages and let them
run free.  If we're sure range0 contains no dirty pages, it means we have
already copied them over into the snapshot, so IIUC it's safe to unprotect all
of it (even if it's already bigger than the host page size)?
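
For reference, the unprotect boils down to a single UFFDIO_WRITEPROTECT
ioctl over range0.  A minimal sketch of the raw kernel API (not the exact
QEMU wrapper; page_address/run_length are the variables from the patch 1
hunk, uffd_fd the userfaultfd file descriptor):

  struct uffdio_writeprotect wp = {
      .range = {
          .start = (uint64_t)(uintptr_t)page_address,
          .len   = run_length,
      },
      .mode  = 0,    /* WP bit cleared -> remove write protection */
  };
  if (ioctl(uffd_fd, UFFDIO_WRITEPROTECT, &wp)) {
      /* error handling elided; this is where the pgtable walk happens */
  }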

That can be slightly less efficient for live snapshots in some extreme cases
(when unprotecting, we'll need to walk the pgtables in the uffd ioctl()), but
I don't expect live snapshots to be run on a huge VM, so hopefully it's still
fine?  Not to mention it should make live migration a little bit faster,
assuming that's more frequently used..

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 05/15] migration: Simplify unqueue_page()
  2022-01-19 16:36   ` Dr. David Alan Gilbert
@ 2022-01-20  2:23     ` Peter Xu
  2022-01-25 11:01       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-01-20  2:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

On Wed, Jan 19, 2022 at 04:36:50PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > This patch simplifies unqueue_page() on both sides of it (itself, and caller).
> > 
> > Firstly, right after unqueue_page() returns true, we'll definitely send a
> > huge page (see the ram_save_huge_page() call - it will _never_ exit before
> > it finishes sending that huge page), so unqueue_page() does not need to step
> > in small page sizes if huge pages are enabled on the ramblock.  IOW, it's
> > destined that only the 1st 4K page will be valid; when we unqueue the 2nd+
> > time we'll notice the whole huge page has already been sent anyway.
> > Switching to operating on huge pages removes a lot of redundant
> > unqueue_page() loops.
> > 
> > Meanwhile, drop the dirty check.  It's not helpful to call test_bit() every
> > time to jump over clean pages, as ram_save_host_page() has already done so,
> > and in a faster way (see commit ba1b7c812c ("migration/ram: Optimize
> > ram_save_host_page()", 2021-05-13)).  So that's not necessary either.
> > 
> > Drop the two tracepoints along the way - based on the above analysis it's
> > very possible that no one is really using them.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Yes, OK
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Although:
>   a) You might like to keep a trace in get_queued_page just to see
> what's getting unqueued
>   b) I think originally it was a useful diagnostic to find out when we
> were getting a lot of queue requests for pages that were already sent.

Ah, that makes sense.  How about I keep the test_bit but remove the loop?  I
can make both a) and b) into one tracepoint:

========
diff --git a/migration/ram.c b/migration/ram.c
index 0df15ff663..02f36fa6d5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1572,6 +1572,9 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
         migration_consume_urgent_request();
     }
 
+    trace_unqueue_page(block->idstr, *offset,
+                       test_bit((*offset >> TARGET_PAGE_BITS), block->bmap));
+
     return block;
 }
 
diff --git a/migration/trace-events b/migration/trace-events
index 3a9b3567ae..efa3a95f81 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -110,6 +110,7 @@ ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRI
 ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
 ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
 ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
+unqueue_page(char *block, uint64_t offset, bool dirty) "ramblock '%s' offset 0x%"PRIx64" dirty %d"
 
 # multifd.c
 multifd_new_send_channel_async(uint8_t id) "channel %d"
========

Thanks,

-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 05/15] migration: Simplify unqueue_page()
  2022-01-20  2:23     ` Peter Xu
@ 2022-01-25 11:01       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-25 11:01 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Jan 19, 2022 at 04:36:50PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > This patch simplifies unqueue_page() on both sides of it (itself, and caller).
> > > 
> > > Firstly, right after unqueue_page() returns true, we'll definitely send a
> > > huge page (see the ram_save_huge_page() call - it will _never_ exit before
> > > it finishes sending that huge page), so unqueue_page() does not need to step
> > > in small page sizes if huge pages are enabled on the ramblock.  IOW, it's
> > > destined that only the 1st 4K page will be valid; when we unqueue the 2nd+
> > > time we'll notice the whole huge page has already been sent anyway.
> > > Switching to operating on huge pages removes a lot of redundant
> > > unqueue_page() loops.
> > > 
> > > Meanwhile, drop the dirty check.  It's not helpful to call test_bit() every
> > > time to jump over clean pages, as ram_save_host_page() has already done so,
> > > and in a faster way (see commit ba1b7c812c ("migration/ram: Optimize
> > > ram_save_host_page()", 2021-05-13)).  So that's not necessary either.
> > > 
> > > Drop the two tracepoints along the way - based on the above analysis it's
> > > very possible that no one is really using them.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Yes, OK
> > 
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > 
> > Although:
> >   a) You might like to keep a trace in get_queued_page just to see
> > what's getting unqueued
> >   b) I think originally it was a useful diagnostic to find out when we
> > were getting a lot of queue requests for pages that were already sent.
> 
> Ah, that makes sense.  How about I keep the test_bit but remove the loop?  I
> can make both a) and b) into one tracepoint:

Yes, I think that's fine.


> ========
> diff --git a/migration/ram.c b/migration/ram.c
> index 0df15ff663..02f36fa6d5 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1572,6 +1572,9 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
>          migration_consume_urgent_request();
>      }
>  
> +    trace_unqueue_page(block->idstr, *offset,
> +                       test_bit((*offset >> TARGET_PAGE_BITS), block->bmap));
> +
>      return block;
>  }
>  
> diff --git a/migration/trace-events b/migration/trace-events
> index 3a9b3567ae..efa3a95f81 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -110,6 +110,7 @@ ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRI
>  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
>  ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
>  ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
> +unqueue_page(char *block, uint64_t offset, bool dirty) "ramblock '%s' offset 0x%"PRIx64" dirty %d"
>  
>  # multifd.c
>  multifd_new_send_channel_async(uint8_t id) "channel %d"
> ========
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 01/15] migration: No off-by-one for pss->page update in host page size
  2022-01-19  8:09 ` [PATCH RFC 01/15] migration: No off-by-one for pss->page update in host page size Peter Xu
  2022-01-19 12:58   ` Dr. David Alan Gilbert
@ 2022-01-27  9:40   ` Juan Quintela
  1 sibling, 0 replies; 53+ messages in thread
From: Juan Quintela @ 2022-01-27  9:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: Kunkun Jiang, qemu-devel, Dr . David Alan Gilbert,
	Leonardo Bras Soares Passos, Keqian Zhu, Andrey Gruzdev

Peter Xu <peterx@redhat.com> wrote:
> We used to do an off-by-one fixup for pss->page when finishing one host huge
> page transfer.  That seems to be entirely unnecessary.  Drop it.
>
> Cc: Keqian Zhu <zhukeqian1@huawei.com>
> Cc: Kunkun Jiang <jiangkunkun@huawei.com>
> Cc: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>


Reviewed-by: Juan Quintela <quintela@redhat.com>

queued.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 03/15] migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat
  2022-01-19  8:09 ` [PATCH RFC 03/15] migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat Peter Xu
  2022-01-19 14:15   ` Dr. David Alan Gilbert
@ 2022-01-27  9:40   ` Juan Quintela
  1 sibling, 0 replies; 53+ messages in thread
From: Juan Quintela @ 2022-01-27  9:40 UTC (permalink / raw)
  To: Peter Xu; +Cc: Leonardo Bras Soares Passos, qemu-devel, Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> This patch allows us to read the tid even without the blocktime feature
> enabled.  It's useful when tracing the postcopy fault thread, to show the
> thread id along with the address of faulted pages.
>
> Remove the comments - they're really not helpful at all.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

queued.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 04/15] migration: Add postcopy_has_request()
  2022-01-19  8:09 ` [PATCH RFC 04/15] migration: Add postcopy_has_request() Peter Xu
  2022-01-19 14:27   ` Dr. David Alan Gilbert
@ 2022-01-27  9:41   ` Juan Quintela
  1 sibling, 0 replies; 53+ messages in thread
From: Juan Quintela @ 2022-01-27  9:41 UTC (permalink / raw)
  To: Peter Xu; +Cc: Leonardo Bras Soares Passos, qemu-devel, Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Add a helper to detect whether postcopy has a pending request.
>
> While at it, clean up the code a bit: e.g. in unqueue_page() we shouldn't need
> to check again whether the queue is empty, because we're the only one (besides
> the cleanup code, which should never run during this process) that will take a
> request off the list, so the request list can only grow, not shrink, under the
> hood.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>
queued



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 05/15] migration: Simplify unqueue_page()
  2022-01-19  8:09 ` [PATCH RFC 05/15] migration: Simplify unqueue_page() Peter Xu
  2022-01-19 16:36   ` Dr. David Alan Gilbert
@ 2022-01-27  9:41   ` Juan Quintela
  1 sibling, 0 replies; 53+ messages in thread
From: Juan Quintela @ 2022-01-27  9:41 UTC (permalink / raw)
  To: Peter Xu; +Cc: Leonardo Bras Soares Passos, qemu-devel, Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> This patch simplifies unqueue_page() on both sides of it (itself, and caller).
>
> Firstly, right after unqueue_page() returns true, we'll definitely send a
> huge page (see the ram_save_huge_page() call - it will _never_ exit before
> it finishes sending that huge page), so unqueue_page() does not need to step
> in small page sizes if huge pages are enabled on the ramblock.  IOW, it's
> destined that only the 1st 4K page will be valid; when we unqueue the 2nd+
> time we'll notice the whole huge page has already been sent anyway.
> Switching to operating on huge pages removes a lot of redundant
> unqueue_page() loops.
>
> Meanwhile, drop the dirty check.  It's not helpful to call test_bit() every
> time to jump over clean pages, as ram_save_host_page() has already done so,
> and in a faster way (see commit ba1b7c812c ("migration/ram: Optimize
> ram_save_host_page()", 2021-05-13)).  So that's not necessary either.
>
> Drop the two tracepoints along the way - based on the above analysis it's
> very possible that no one is really using them.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

queued.

I added the extra tracepoint that you added later.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 06/15] migration: Move temp page setup and cleanup into separate functions
  2022-01-19  8:09 ` [PATCH RFC 06/15] migration: Move temp page setup and cleanup into separate functions Peter Xu
  2022-01-19 16:58   ` Dr. David Alan Gilbert
@ 2022-01-27  9:43   ` Juan Quintela
  1 sibling, 0 replies; 53+ messages in thread
From: Juan Quintela @ 2022-01-27  9:43 UTC (permalink / raw)
  To: Peter Xu; +Cc: Leonardo Bras Soares Passos, qemu-devel, Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Temp pages will need to grow if we want to have multiple channels for postcopy,
> because each channel will need its own temp page to cache huge page data.
>
> Before doing that, clean up the related code.  No functional change intended.
>
> While at it, touch up the errno handling a little bit on the setup side.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

queued.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 07/15] migration: Introduce postcopy channels on dest node
  2022-01-19  8:09 ` [PATCH RFC 07/15] migration: Introduce postcopy channels on dest node Peter Xu
@ 2022-02-03 15:08   ` Dr. David Alan Gilbert
  2022-02-08  3:27     ` Peter Xu
  0 siblings, 1 reply; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-03 15:08 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> Postcopy handles huge pages in a special way: currently we can only have
> one "channel" to transfer the page.
> 
> It's because when we install pages using UFFDIO_COPY, we need to have the
> whole huge page ready; it also means we need a temp huge page when trying to
> receive the whole content of the page.
> 
> Currently all the maintenance around this tmp page is global: firstly we'll
> allocate a temp huge page, then we maintain its status mostly within
> ram_load_postcopy().
> 
> To enable multiple channels for postcopy, the first thing we need to do is to
> prepare N temp huge pages as caches, one for each channel.
> 
> Meanwhile we need to maintain the tmp huge page status per-channel too.
> 
> To give an example, some local variables maintained in ram_load_postcopy()
> are listed below; they are responsible for maintaining temp huge page status:
> 
>   - all_zero:     this keeps whether this huge page contains all zeros
>   - target_pages: this counts how many target pages have been copied
>   - host_page:    this keeps the host ptr for the page to install
> 
> Move all these fields to be together with the temp huge pages to form a new
> structure called PostcopyTmpPage.  Then for each (future) postcopy channel, we
> need one structure to keep the state around.
> 
> For vanilla postcopy, obviously there's only one channel.  It contains both
> precopy and postcopy pages.
> 
> This patch teaches the dest migration node to work out the number of postcopy
> channels by introducing the "postcopy_channels" variable.  Its value is
> calculated when setting up postcopy on the dest node (during the
> POSTCOPY_LISTEN phase).
> 
> Vanilla postcopy will have channels=1, but when postcopy-preempt capability is
> enabled (in the future), we will boost it to 2 because even during partial
> sending of a precopy huge page we still want to preempt it and start sending
> the postcopy requested page right away (so we start to keep two temp huge
> pages; more if we want to enable multifd).  In this patch there's a TODO marked
> for that; so far the channel count is always set to 1.
> 
> We need to send one "host huge page" on one channel only and we cannot split
> it, because otherwise the data for the same huge page could live on more than
> one channel, which would need more complicated logic to manage.  One temp host
> huge page for each channel will be enough for us for now.
> 
> Postcopy will still always use the index=0 huge page even after this patch.
> However it prepares for the later patches where it can start to use multiple
> channels (which needs src intervention, because only the src knows which
> channel we should use).

Generally OK, some minor nits.

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.h    | 35 +++++++++++++++++++++++++++-
>  migration/postcopy-ram.c | 50 +++++++++++++++++++++++++++++-----------
>  migration/ram.c          | 43 +++++++++++++++++-----------------
>  3 files changed, 91 insertions(+), 37 deletions(-)
> 
> diff --git a/migration/migration.h b/migration/migration.h
> index 8130b703eb..8bb2931312 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -45,6 +45,24 @@ struct PostcopyBlocktimeContext;
>   */
>  #define CLEAR_BITMAP_SHIFT_MAX            31
>  
> +/* This is an abstraction of a "temp huge page" for postcopy's purpose */
> +typedef struct {
> +    /*
> +     * This points to a temporary huge page as a buffer for UFFDIO_COPY.  It's
> +     * mmap()ed and needs to be freed when cleanup.
> +     */
> +    void *tmp_huge_page;
> +    /*
> +     * This points to the host page we're going to install for this temp page.
> +     * It tells us after we've received the whole page, where we should put it.
> +     */
> +    void *host_addr;
> +    /* Number of small pages copied (in size of TARGET_PAGE_SIZE) */
> +    int target_pages;

Can we take the opportunity to convert this to an unsigned?

> +    /* Whether this page contains all zeros */
> +    bool all_zero;
> +} PostcopyTmpPage;
> +
>  /* State for the incoming migration */
>  struct MigrationIncomingState {
>      QEMUFile *from_src_file;
> @@ -81,7 +99,22 @@ struct MigrationIncomingState {
>      QemuMutex rp_mutex;    /* We send replies from multiple threads */
>      /* RAMBlock of last request sent to source */
>      RAMBlock *last_rb;
> -    void     *postcopy_tmp_page;
> +    /*
> +     * Number of postcopy channels including the default precopy channel, so
> +     * vanilla postcopy will only contain one channel which contain both
> +     * precopy and postcopy streams.
> +     *
> +     * This is calculated when the src requests to enable postcopy but before
> +     * it starts.  Its value can depend on e.g. whether postcopy preemption is
> +     * enabled.
> +     */
> +    int       postcopy_channels;

Also unsigned?

> +    /*
> +     * An array of temp host huge pages to be used, one for each postcopy
> +     * channel.
> +     */
> +    PostcopyTmpPage *postcopy_tmp_pages;
> +    /* This is shared for all postcopy channels */
>      void     *postcopy_tmp_zero_page;
>      /* PostCopyFD's for external userfaultfds & handlers of shared memory */
>      GArray   *postcopy_remote_fds;
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index e662dd05cc..d78e1b9373 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -525,9 +525,18 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis)
>  
>  static void postcopy_temp_pages_cleanup(MigrationIncomingState *mis)
>  {
> -    if (mis->postcopy_tmp_page) {
> -        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
> -        mis->postcopy_tmp_page = NULL;
> +    int i;
> +
> +    if (mis->postcopy_tmp_pages) {
> +        for (i = 0; i < mis->postcopy_channels; i++) {
> +            if (mis->postcopy_tmp_pages[i].tmp_huge_page) {
> +                munmap(mis->postcopy_tmp_pages[i].tmp_huge_page,
> +                       mis->largest_page_size);
> +                mis->postcopy_tmp_pages[i].tmp_huge_page = NULL;
> +            }
> +        }
> +        g_free(mis->postcopy_tmp_pages);
> +        mis->postcopy_tmp_pages = NULL;
>      }
>  
>      if (mis->postcopy_tmp_zero_page) {
> @@ -1091,17 +1100,30 @@ retry:
>  
>  static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
>  {
> -    int err;
> -
> -    mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
> -                                  PROT_READ | PROT_WRITE,
> -                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> -    if (mis->postcopy_tmp_page == MAP_FAILED) {
> -        err = errno;
> -        mis->postcopy_tmp_page = NULL;
> -        error_report("%s: Failed to map postcopy_tmp_page %s",
> -                     __func__, strerror(err));
> -        return -err;
> +    PostcopyTmpPage *tmp_page;
> +    int err, i, channels;
> +    void *temp_page;
> +
> +    /* TODO: will be boosted when enable postcopy preemption */
> +    mis->postcopy_channels = 1;
> +
> +    channels = mis->postcopy_channels;
> +    mis->postcopy_tmp_pages = g_malloc0(sizeof(PostcopyTmpPage) * channels);

I noticed we've started using g_malloc0_n in a few places.
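e.g. the allocation above could become:

  mis->postcopy_tmp_pages = g_malloc0_n(channels, sizeof(PostcopyTmpPage));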

> +    for (i = 0; i < channels; i++) {
> +        tmp_page = &mis->postcopy_tmp_pages[i];
> +        temp_page = mmap(NULL, mis->largest_page_size, PROT_READ | PROT_WRITE,
> +                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +        if (temp_page == MAP_FAILED) {
> +            err = errno;
> +            error_report("%s: Failed to map postcopy_tmp_pages[%d]: %s",
> +                         __func__, i, strerror(err));

Please call postcopy_temp_pages_cleanup here to clean up previous pages
that were successfully allocated.
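i.e. something like (postcopy_temp_pages_cleanup already copes with a
partially filled array, since it skips entries whose tmp_huge_page is
still NULL):

            postcopy_temp_pages_cleanup(mis);
            return -err;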

> +            return -err;
> +        }
> +        tmp_page->tmp_huge_page = temp_page;
> +        /* Initialize default states for each tmp page */
> +        tmp_page->all_zero = true;
> +        tmp_page->target_pages = 0;
>      }
>  
>      /*
> diff --git a/migration/ram.c b/migration/ram.c
> index 0df15ff663..930e722e39 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3639,11 +3639,8 @@ static int ram_load_postcopy(QEMUFile *f)
>      bool place_needed = false;
>      bool matches_target_page_size = false;
>      MigrationIncomingState *mis = migration_incoming_get_current();
> -    /* Temporary page that is later 'placed' */
> -    void *postcopy_host_page = mis->postcopy_tmp_page;
> -    void *host_page = NULL;
> -    bool all_zero = true;
> -    int target_pages = 0;
> +    /* Currently we only use channel 0.  TODO: use all the channels */
> +    PostcopyTmpPage *tmp_page = &mis->postcopy_tmp_pages[0];
>  
>      while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
>          ram_addr_t addr;
> @@ -3687,7 +3684,7 @@ static int ram_load_postcopy(QEMUFile *f)
>                  ret = -EINVAL;
>                  break;
>              }
> -            target_pages++;
> +            tmp_page->target_pages++;
>              matches_target_page_size = block->page_size == TARGET_PAGE_SIZE;
>              /*
>               * Postcopy requires that we place whole host pages atomically;
> @@ -3699,15 +3696,16 @@ static int ram_load_postcopy(QEMUFile *f)
>               * however the source ensures it always sends all the components
>               * of a host page in one chunk.
>               */
> -            page_buffer = postcopy_host_page +
> +            page_buffer = tmp_page->tmp_huge_page +
>                            host_page_offset_from_ram_block_offset(block, addr);
>              /* If all TP are zero then we can optimise the place */
> -            if (target_pages == 1) {
> -                host_page = host_page_from_ram_block_offset(block, addr);
> -            } else if (host_page != host_page_from_ram_block_offset(block,
> -                                                                    addr)) {
> +            if (tmp_page->target_pages == 1) {
> +                tmp_page->host_addr =
> +                    host_page_from_ram_block_offset(block, addr);
> +            } else if (tmp_page->host_addr !=
> +                       host_page_from_ram_block_offset(block, addr)) {
>                  /* not the 1st TP within the HP */
> -                error_report("Non-same host page %p/%p", host_page,
> +                error_report("Non-same host page %p/%p", tmp_page->host_addr,
>                               host_page_from_ram_block_offset(block, addr));
>                  ret = -EINVAL;
>                  break;
> @@ -3717,10 +3715,11 @@ static int ram_load_postcopy(QEMUFile *f)
>               * If it's the last part of a host page then we place the host
>               * page
>               */
> -            if (target_pages == (block->page_size / TARGET_PAGE_SIZE)) {
> +            if (tmp_page->target_pages ==
> +                (block->page_size / TARGET_PAGE_SIZE)) {
>                  place_needed = true;
>              }
> -            place_source = postcopy_host_page;
> +            place_source = tmp_page->tmp_huge_page;
>          }
>  
>          switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
> @@ -3734,12 +3733,12 @@ static int ram_load_postcopy(QEMUFile *f)
>                  memset(page_buffer, ch, TARGET_PAGE_SIZE);
>              }
>              if (ch) {
> -                all_zero = false;
> +                tmp_page->all_zero = false;
>              }
>              break;
>  
>          case RAM_SAVE_FLAG_PAGE:
> -            all_zero = false;
> +            tmp_page->all_zero = false;
>              if (!matches_target_page_size) {
>                  /* For huge pages, we always use temporary buffer */
>                  qemu_get_buffer(f, page_buffer, TARGET_PAGE_SIZE);
> @@ -3757,7 +3756,7 @@ static int ram_load_postcopy(QEMUFile *f)
>              }
>              break;
>          case RAM_SAVE_FLAG_COMPRESS_PAGE:
> -            all_zero = false;
> +            tmp_page->all_zero = false;
>              len = qemu_get_be32(f);
>              if (len < 0 || len > compressBound(TARGET_PAGE_SIZE)) {
>                  error_report("Invalid compressed data length: %d", len);
> @@ -3789,16 +3788,16 @@ static int ram_load_postcopy(QEMUFile *f)
>          }
>  
>          if (!ret && place_needed) {
> -            if (all_zero) {
> -                ret = postcopy_place_page_zero(mis, host_page, block);
> +            if (tmp_page->all_zero) {
> +                ret = postcopy_place_page_zero(mis, tmp_page->host_addr, block);
>              } else {
> -                ret = postcopy_place_page(mis, host_page, place_source,
> +                ret = postcopy_place_page(mis, tmp_page->host_addr, place_source,
>                                            block);
>              }
>              place_needed = false;
> -            target_pages = 0;
> +            tmp_page->target_pages = 0;
>              /* Assume we have a zero page until we detect something different */
> -            all_zero = true;
> +            tmp_page->all_zero = true;
>          }
>      }
>  
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 08/15] migration: Dump ramblock and offset too when non-same-page detected
  2022-01-19  8:09 ` [PATCH RFC 08/15] migration: Dump ramblock and offset too when non-same-page detected Peter Xu
@ 2022-02-03 15:15   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-03 15:15 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> In ram_load_postcopy() we'll try to detect the non-same-page case and dump an
> error.  This error is very helpful for debugging, so add the ramblock & offset
> to the error log too.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/ram.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 930e722e39..3f823ffffc 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3705,8 +3705,12 @@ static int ram_load_postcopy(QEMUFile *f)
>              } else if (tmp_page->host_addr !=
>                         host_page_from_ram_block_offset(block, addr)) {
>                  /* not the 1st TP within the HP */
> -                error_report("Non-same host page %p/%p", tmp_page->host_addr,
> -                             host_page_from_ram_block_offset(block, addr));
> +                error_report("Non-same host page detected.  Target host page %p, "
> +                             "received host page %p "
> +                             "(rb %s offset 0x"RAM_ADDR_FMT" target_pages %d)",
> +                             tmp_page->host_addr,
> +                             host_page_from_ram_block_offset(block, addr),
> +                             block->idstr, addr, tmp_page->target_pages);
>                  ret = -EINVAL;
>                  break;
>              }
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 09/15] migration: Add postcopy_thread_create()
  2022-01-19  8:09 ` [PATCH RFC 09/15] migration: Add postcopy_thread_create() Peter Xu
@ 2022-02-03 15:19   ` Dr. David Alan Gilbert
  2022-02-08  3:37     ` Peter Xu
  0 siblings, 1 reply; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-03 15:19 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> Postcopy creates threads.  A common pattern is that we init a sem and use it
> to sync with the thread.  Namely, we have fault_thread_sem and
> listen_thread_sem, and they're only used for this.
> 
> Make it a shared infrastructure so it's easier to create yet another thread.
> 

It might be worth a note saying you now share that sem, so you can't
start two threads in parallel.
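
Perhaps something like this above the helper (just a sketch):

    /*
     * Note: uses the shared mis->thread_sync_sem, so only one
     * postcopy_thread_create() call may be in flight at any one time.
     */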

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.h    |  5 ++---
>  migration/postcopy-ram.c | 19 +++++++++++++------
>  migration/postcopy-ram.h |  4 ++++
>  migration/savevm.c       | 12 +++---------
>  4 files changed, 22 insertions(+), 18 deletions(-)
> 
> diff --git a/migration/migration.h b/migration/migration.h
> index 8bb2931312..35e7f7babe 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -70,7 +70,8 @@ struct MigrationIncomingState {
>      /* A hook to allow cleanup at the end of incoming migration */
>      void *transport_data;
>      void (*transport_cleanup)(void *data);
> -
> +    /* Used to sync thread creations */
> +    QemuSemaphore  thread_sync_sem;
>      /*
>       * Free at the start of the main state load, set as the main thread finishes
>       * loading state.
> @@ -83,13 +84,11 @@ struct MigrationIncomingState {
>      size_t         largest_page_size;
>      bool           have_fault_thread;
>      QemuThread     fault_thread;
> -    QemuSemaphore  fault_thread_sem;
>      /* Set this when we want the fault thread to quit */
>      bool           fault_thread_quit;
>  
>      bool           have_listen_thread;
>      QemuThread     listen_thread;
> -    QemuSemaphore  listen_thread_sem;
>  
>      /* For the kernel to send us notifications */
>      int       userfault_fd;
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index d78e1b9373..88c832eeba 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -77,6 +77,16 @@ int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp)
>                                              &pnd);
>  }
>  
> +void postcopy_thread_create(MigrationIncomingState *mis,
> +                            QemuThread *thread, const char *name,
> +                            void *(*fn)(void *), int joinable)
> +{
> +    qemu_sem_init(&mis->thread_sync_sem, 0);
> +    qemu_thread_create(thread, name, fn, mis, joinable);
> +    qemu_sem_wait(&mis->thread_sync_sem);
> +    qemu_sem_destroy(&mis->thread_sync_sem);
> +}
> +
>  /* Postcopy needs to detect accesses to pages that haven't yet been copied
>   * across, and efficiently map new pages in, the techniques for doing this
>   * are target OS specific.
> @@ -901,7 +911,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
>      trace_postcopy_ram_fault_thread_entry();
>      rcu_register_thread();
>      mis->last_rb = NULL; /* last RAMBlock we sent part of */
> -    qemu_sem_post(&mis->fault_thread_sem);
> +    qemu_sem_post(&mis->thread_sync_sem);
>  
>      struct pollfd *pfd;
>      size_t pfd_len = 2 + mis->postcopy_remote_fds->len;
> @@ -1172,11 +1182,8 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
>          return -1;
>      }
>  
> -    qemu_sem_init(&mis->fault_thread_sem, 0);
> -    qemu_thread_create(&mis->fault_thread, "postcopy/fault",
> -                       postcopy_ram_fault_thread, mis, QEMU_THREAD_JOINABLE);
> -    qemu_sem_wait(&mis->fault_thread_sem);
> -    qemu_sem_destroy(&mis->fault_thread_sem);
> +    postcopy_thread_create(mis, &mis->fault_thread, "postcopy/fault",
> +                           postcopy_ram_fault_thread, QEMU_THREAD_JOINABLE);
>      mis->have_fault_thread = true;
>  
>      /* Mark so that we get notified of accesses to unwritten areas */
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index 6d2b3cf124..07684c0e1d 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -135,6 +135,10 @@ void postcopy_remove_notifier(NotifierWithReturn *n);
>  /* Call the notifier list set by postcopy_add_start_notifier */
>  int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp);
>  
> +void postcopy_thread_create(MigrationIncomingState *mis,
> +                            QemuThread *thread, const char *name,
> +                            void *(*fn)(void *), int joinable);
> +
>  struct PostCopyFD;
>  
>  /* ufd is a pointer to the struct uffd_msg *TODO: more Portable! */
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 3b8f565b14..3342b74c24 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1862,7 +1862,7 @@ static void *postcopy_ram_listen_thread(void *opaque)
>  
>      migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
>                                     MIGRATION_STATUS_POSTCOPY_ACTIVE);
> -    qemu_sem_post(&mis->listen_thread_sem);
> +    qemu_sem_post(&mis->thread_sync_sem);
>      trace_postcopy_ram_listen_thread_start();
>  
>      rcu_register_thread();
> @@ -1987,14 +1987,8 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>      }
>  
>      mis->have_listen_thread = true;
> -    /* Start up the listening thread and wait for it to signal ready */
> -    qemu_sem_init(&mis->listen_thread_sem, 0);
> -    qemu_thread_create(&mis->listen_thread, "postcopy/listen",
> -                       postcopy_ram_listen_thread, NULL,
> -                       QEMU_THREAD_DETACHED);
> -    qemu_sem_wait(&mis->listen_thread_sem);
> -    qemu_sem_destroy(&mis->listen_thread_sem);
> -
> +    postcopy_thread_create(mis, &mis->listen_thread, "postcopy/listen",
> +                           postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
>      trace_loadvm_postcopy_handle_listen("return");
>  
>      return 0;
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 11/15] migration: Add pss.postcopy_requested status
  2022-01-19  8:09 ` [PATCH RFC 11/15] migration: Add pss.postcopy_requested status Peter Xu
@ 2022-02-03 15:42   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-03 15:42 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> This boolean flag shows whether the current page being migrated was triggered
> by a postcopy request or not.  Then in ram_save_host_page() and deeper in the
> stack we'll be able to tell the priority of this page.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
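
For context, patch 14 later uses this flag to pick which channel a page is
sent on, roughly:

    channel = pss->postcopy_requested ? RAM_CHANNEL_POSTCOPY
                                      : RAM_CHANNEL_PRECOPY;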

> ---
>  migration/ram.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 3a7d943f9c..b7d17613e8 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -400,6 +400,8 @@ struct PageSearchStatus {
>      unsigned long page;
>      /* Set once we wrap around */
>      bool         complete_round;
> +    /* Whether current page is explicitly requested by postcopy */
> +    bool         postcopy_requested;
>  };
>  typedef struct PageSearchStatus PageSearchStatus;
>  
> @@ -1480,6 +1482,9 @@ retry:
>   */
>  static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
>  {
> +    /* This is not a postcopy requested page */
> +    pss->postcopy_requested = false;
> +
>      pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
>      if (pss->complete_round && pss->block == rs->last_seen_block &&
>          pss->page >= rs->last_page) {
> @@ -1971,6 +1976,7 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>           * really rare.
>           */
>          pss->complete_round = false;
> +        pss->postcopy_requested = true;
>      }
>  
>      return !!block;
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 12/15] migration: Move migrate_allow_multifd and helpers into migration.c
  2022-01-19  8:09 ` [PATCH RFC 12/15] migration: Move migrate_allow_multifd and helpers into migration.c Peter Xu
@ 2022-02-03 15:44   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-03 15:44 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> This variable, along with its helpers, is used to detect whether multiple
> channels will be supported for migration.  In follow-up patches, there'll be
> another capability that requires multiple channels.  Hence move it outside the
> multifd-specific code and make it public.  Meanwhile rename it from "multifd"
> to "multi_channels" to show its real meaning.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/migration.c | 22 +++++++++++++++++-----
>  migration/migration.h |  3 +++
>  migration/multifd.c   | 19 ++++---------------
>  migration/multifd.h   |  2 --
>  4 files changed, 24 insertions(+), 22 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 252ce1eaec..15a48b548a 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -180,6 +180,18 @@ static int migration_maybe_pause(MigrationState *s,
>                                   int new_state);
>  static void migrate_fd_cancel(MigrationState *s);
>  
> +static bool migrate_allow_multi_channels = true;
> +
> +void migrate_protocol_allow_multi_channels(bool allow)
> +{
> +    migrate_allow_multi_channels = allow;
> +}
> +
> +bool migrate_multi_channels_is_allowed(void)
> +{
> +    return migrate_allow_multi_channels;
> +}
> +
>  static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
>  {
>      uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
> @@ -463,12 +475,12 @@ static void qemu_start_incoming_migration(const char *uri, Error **errp)
>  {
>      const char *p = NULL;
>  
> -    migrate_protocol_allow_multifd(false); /* reset it anyway */
> +    migrate_protocol_allow_multi_channels(false); /* reset it anyway */
>      qapi_event_send_migration(MIGRATION_STATUS_SETUP);
>      if (strstart(uri, "tcp:", &p) ||
>          strstart(uri, "unix:", NULL) ||
>          strstart(uri, "vsock:", NULL)) {
> -        migrate_protocol_allow_multifd(true);
> +        migrate_protocol_allow_multi_channels(true);
>          socket_start_incoming_migration(p ? p : uri, errp);
>  #ifdef CONFIG_RDMA
>      } else if (strstart(uri, "rdma:", &p)) {
> @@ -1252,7 +1264,7 @@ static bool migrate_caps_check(bool *cap_list,
>  
>      /* incoming side only */
>      if (runstate_check(RUN_STATE_INMIGRATE) &&
> -        !migrate_multifd_is_allowed() &&
> +        !migrate_multi_channels_is_allowed() &&
>          cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
>          error_setg(errp, "multifd is not supported by current protocol");
>          return false;
> @@ -2310,11 +2322,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>          }
>      }
>  
> -    migrate_protocol_allow_multifd(false);
> +    migrate_protocol_allow_multi_channels(false);
>      if (strstart(uri, "tcp:", &p) ||
>          strstart(uri, "unix:", NULL) ||
>          strstart(uri, "vsock:", NULL)) {
> -        migrate_protocol_allow_multifd(true);
> +        migrate_protocol_allow_multi_channels(true);
>          socket_start_outgoing_migration(s, p ? p : uri, &local_err);
>  #ifdef CONFIG_RDMA
>      } else if (strstart(uri, "rdma:", &p)) {
> diff --git a/migration/migration.h b/migration/migration.h
> index 34b79cb961..d0c0902ec9 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -425,4 +425,7 @@ void migration_cancel(const Error *error);
>  
>  void populate_vfio_info(MigrationInfo *info);
>  
> +bool migrate_multi_channels_is_allowed(void);
> +void migrate_protocol_allow_multi_channels(bool allow);
> +
>  #endif
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 3242f688e5..64ca50de62 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -535,7 +535,7 @@ void multifd_save_cleanup(void)
>  {
>      int i;
>  
> -    if (!migrate_use_multifd() || !migrate_multifd_is_allowed()) {
> +    if (!migrate_use_multifd() || !migrate_multi_channels_is_allowed()) {
>          return;
>      }
>      multifd_send_terminate_threads(NULL);
> @@ -870,17 +870,6 @@ cleanup:
>      multifd_new_send_channel_cleanup(p, sioc, local_err);
>  }
>  
> -static bool migrate_allow_multifd = true;
> -void migrate_protocol_allow_multifd(bool allow)
> -{
> -    migrate_allow_multifd = allow;
> -}
> -
> -bool migrate_multifd_is_allowed(void)
> -{
> -    return migrate_allow_multifd;
> -}
> -
>  int multifd_save_setup(Error **errp)
>  {
>      int thread_count;
> @@ -891,7 +880,7 @@ int multifd_save_setup(Error **errp)
>      if (!migrate_use_multifd()) {
>          return 0;
>      }
> -    if (!migrate_multifd_is_allowed()) {
> +    if (!migrate_multi_channels_is_allowed()) {
>          error_setg(errp, "multifd is not supported by current protocol");
>          return -1;
>      }
> @@ -989,7 +978,7 @@ int multifd_load_cleanup(Error **errp)
>  {
>      int i;
>  
> -    if (!migrate_use_multifd() || !migrate_multifd_is_allowed()) {
> +    if (!migrate_use_multifd() || !migrate_multi_channels_is_allowed()) {
>          return 0;
>      }
>      multifd_recv_terminate_threads(NULL);
> @@ -1138,7 +1127,7 @@ int multifd_load_setup(Error **errp)
>      if (!migrate_use_multifd()) {
>          return 0;
>      }
> -    if (!migrate_multifd_is_allowed()) {
> +    if (!migrate_multi_channels_is_allowed()) {
>          error_setg(errp, "multifd is not supported by current protocol");
>          return -1;
>      }
> diff --git a/migration/multifd.h b/migration/multifd.h
> index e57adc783b..0ed07794b6 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -13,8 +13,6 @@
>  #ifndef QEMU_MIGRATION_MULTIFD_H
>  #define QEMU_MIGRATION_MULTIFD_H
>  
> -bool migrate_multifd_is_allowed(void);
> -void migrate_protocol_allow_multifd(bool allow);
>  int multifd_save_setup(Error **errp);
>  void multifd_save_cleanup(void);
>  int multifd_load_setup(Error **errp);
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 13/15] migration: Add postcopy-preempt capability
  2022-01-19  8:09 ` [PATCH RFC 13/15] migration: Add postcopy-preempt capability Peter Xu
@ 2022-02-03 15:46   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-03 15:46 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> Firstly, postcopy already preempts precopy due to the fact that we do
> unqueue_page() before looking into the dirty bits.
> 
> However that's not enough, e.g., when host huge pages are enabled: while a
> precopy huge page is being sent, a postcopy request needs to wait until the
> whole huge page finishes sending.  That can introduce quite some delay; the
> bigger the huge page is, the larger the delay it'll bring.
> 
> This patch adds a new capability to allow postcopy requests to preempt the
> existing precopy stream in the middle of sending a huge page, so that postcopy
> requests can be serviced even faster.
> 
> Meanwhile to send it even faster, bypass the precopy stream by providing a
> standalone postcopy socket for sending requested pages.
> 
> Since the new behavior will not be compatible with the old behavior, it will
> not be the default; it's enabled only when the new capability is set on both
> src/dst QEMUs.
> 
> This patch only adds the capability itself; the logic will be added in
> follow-up patches.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
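
For anyone trying this out: since the capability requires postcopy-ram (and
is incompatible with multifd), enabling it on both sides looks roughly like
this QMP sketch:

  { "execute": "migrate-set-capabilities",
    "arguments": { "capabilities": [
      { "capability": "postcopy-ram",     "state": true },
      { "capability": "postcopy-preempt", "state": true } ] } }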

> ---
>  migration/migration.c | 23 +++++++++++++++++++++++
>  migration/migration.h |  1 +
>  qapi/migration.json   |  8 +++++++-
>  3 files changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 15a48b548a..84a8fbd80d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1227,6 +1227,11 @@ static bool migrate_caps_check(bool *cap_list,
>              error_setg(errp, "Postcopy is not compatible with ignore-shared");
>              return false;
>          }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
> +            error_setg(errp, "Multifd is not supported in postcopy");
> +            return false;
> +        }
>      }
>  
>      if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
> @@ -1270,6 +1275,13 @@ static bool migrate_caps_check(bool *cap_list,
>          return false;
>      }
>  
> +    if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT]) {
> +        if (!cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
> +            error_setg(errp, "Postcopy preempt requires postcopy-ram");
> +            return false;
> +        }
> +    }
> +
>      return true;
>  }
>  
> @@ -2623,6 +2635,15 @@ bool migrate_background_snapshot(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT];
>  }
>  
> +bool migrate_postcopy_preempt(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT];
> +}
> +
>  /* migration thread support */
>  /*
>   * Something bad happened to the RP stream, mark an error
> @@ -4239,6 +4260,8 @@ static Property migration_properties[] = {
>      DEFINE_PROP_MIG_CAP("x-compress", MIGRATION_CAPABILITY_COMPRESS),
>      DEFINE_PROP_MIG_CAP("x-events", MIGRATION_CAPABILITY_EVENTS),
>      DEFINE_PROP_MIG_CAP("x-postcopy-ram", MIGRATION_CAPABILITY_POSTCOPY_RAM),
> +    DEFINE_PROP_MIG_CAP("x-postcopy-preempt",
> +                        MIGRATION_CAPABILITY_POSTCOPY_PREEMPT),
>      DEFINE_PROP_MIG_CAP("x-colo", MIGRATION_CAPABILITY_X_COLO),
>      DEFINE_PROP_MIG_CAP("x-release-ram", MIGRATION_CAPABILITY_RELEASE_RAM),
>      DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
> diff --git a/migration/migration.h b/migration/migration.h
> index d0c0902ec9..9d39ccfcf5 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -391,6 +391,7 @@ int migrate_decompress_threads(void);
>  bool migrate_use_events(void);
>  bool migrate_postcopy_blocktime(void);
>  bool migrate_background_snapshot(void);
> +bool migrate_postcopy_preempt(void);
>  
>  /* Sending on the return path - generic and then for each message type */
>  void migrate_send_rp_shut(MigrationIncomingState *mis,
> diff --git a/qapi/migration.json b/qapi/migration.json
> index bbfd48cf0b..f00b365bd5 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -452,6 +452,12 @@
>  #                       procedure starts. The VM RAM is saved with running VM.
>  #                       (since 6.0)
>  #
> +# @postcopy-preempt: If enabled, the migration process will allow postcopy
> +#                    requests to preempt precopy stream, so postcopy requests
> +#                    will be handled faster.  This is a performance feature and
> +#                    should not affect the correctness of postcopy migration.
> +#                    (since 7.0)
> +#
>  # Features:
>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>  #
> @@ -465,7 +471,7 @@
>             'block', 'return-path', 'pause-before-switchover', 'multifd',
>             'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
> -           'validate-uuid', 'background-snapshot'] }
> +           'validate-uuid', 'background-snapshot', 'postcopy-preempt'] }
>  
>  ##
>  # @MigrationCapabilityStatus:
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 15/15] tests: Add postcopy preempt test
  2022-01-19  8:09 ` [PATCH RFC 15/15] tests: Add postcopy preempt test Peter Xu
@ 2022-02-03 15:53   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-03 15:53 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
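
Once merged, the new case can be run standalone with something roughly like:

  QTEST_QEMU_BINARY=./qemu-system-x86_64 \
      ./tests/qtest/migration-test -p /migration/postcopy/preempt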

> ---
>  tests/qtest/migration-test.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 7b42f6fd90..93ff43bb3f 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -470,6 +470,7 @@ typedef struct {
>       */
>      bool hide_stderr;
>      bool use_shmem;
> +    bool postcopy_preempt;
>      /* only launch the target process */
>      bool only_target;
>      /* Use dirty ring if true; dirty logging otherwise */
> @@ -673,6 +674,11 @@ static int migrate_postcopy_prepare(QTestState **from_ptr,
>      migrate_set_capability(to, "postcopy-ram", true);
>      migrate_set_capability(to, "postcopy-blocktime", true);
>  
> +    if (args->postcopy_preempt) {
> +        migrate_set_capability(from, "postcopy-preempt", true);
> +        migrate_set_capability(to, "postcopy-preempt", true);
> +    }
> +
>      /* We want to pick a speed slow enough that the test completes
>       * quickly, but that it doesn't complete precopy even on a slow
>       * machine, so also set the downtime.
> @@ -719,6 +725,20 @@ static void test_postcopy(void)
>      migrate_postcopy_complete(from, to);
>  }
>  
> +static void test_postcopy_preempt(void)
> +{
> +    MigrateStart *args = migrate_start_new();
> +    QTestState *from, *to;
> +
> +    args->postcopy_preempt = true;
> +
> +    if (migrate_postcopy_prepare(&from, &to, args)) {
> +        return;
> +    }
> +    migrate_postcopy_start(from, to);
> +    migrate_postcopy_complete(from, to);
> +}
> +
>  static void test_postcopy_recovery(void)
>  {
>      MigrateStart *args = migrate_start_new();
> @@ -1458,6 +1478,7 @@ int main(int argc, char **argv)
>      module_call_init(MODULE_INIT_QOM);
>  
>      qtest_add_func("/migration/postcopy/unix", test_postcopy);
> +    qtest_add_func("/migration/postcopy/preempt", test_postcopy_preempt);
>      qtest_add_func("/migration/postcopy/recovery", test_postcopy_recovery);
>      qtest_add_func("/migration/bad_dest", test_baddest);
>      qtest_add_func("/migration/precopy/unix", test_precopy_unix);
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 14/15] migration: Postcopy preemption on separate channel
  2022-01-19  8:09 ` [PATCH RFC 14/15] migration: Postcopy preemption on separate channel Peter Xu
@ 2022-02-03 17:45   ` Dr. David Alan Gilbert
  2022-02-08  4:22     ` Peter Xu
  0 siblings, 1 reply; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-03 17:45 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> This patch enables postcopy-preempt feature.
> 
> It contains two major changes to the migration logic:
> 
>   (1) Postcopy requests are now sent via a different socket from the precopy
>       background migration stream, so as to be isolated from very high page
>       request delays
> 
>   (2) For huge-page-enabled hosts: when there are postcopy requests, they can
>       now interrupt the sending of a huge host page partway through on the
>       src QEMU.
> 
> After this patch, we'll have two "channels" (or rather, sockets, because it's
> only supported on socket-based channels) for postcopy: (1) the PRECOPY channel
> (the default channel that transfers background pages), and (2) the POSTCOPY
> channel (which only transfers requested pages).
> 
> On the source QEMU, when we find a postcopy request, we'll interrupt the
> PRECOPY channel sending process and quickly switch to the POSTCOPY channel.
> After we've serviced all the high-priority postcopy pages, we'll switch back
> to the PRECOPY channel and continue sending the interrupted huge page.
> No new thread is introduced on the source.
> 
> On the destination QEMU, one new thread is introduced to receive page data from
> the postcopy-specific socket.
> 
> This patch has a side effect.  Previously, after sending a postcopy page we'd
> assume the guest will access the follow-up pages, so we'd keep sending from
> there.  Now that's changed: instead of going on from a postcopy requested
> page, we'll go back and continue sending the precopy huge page (which may
> have been interrupted by a postcopy request and hence only partially sent).
> 
> Whether that's a problem is debatable, because "assuming the guest will
> continue to access the next page" doesn't really suit the case when huge pages
> are used, especially if the huge page is large (e.g. 1GB pages).  So that
> locality hint is mostly meaningless if huge pages are used.
> 
> If postcopy preempt is enabled, a separate channel is created for it so that
> it can be used later for postcopy-specific page requests.  On the dst node, a
> standalone thread is used to receive postcopy requested pages.  The thread is
> created along with the ram listen thread during the POSTCOPY_LISTEN phase.

I think this patch could do with being split into two: the first one dealing
with closing/opening the channels, and the second handling the data on the two
channels and doing the preemption.

Another thought is whether, if in the future we allow multifd +
postcopy, the multifd code would change - I think it would end up closer
to using multiple channels taking different pages on each one.


Do we need to do anything in postcopy recovery?

Dave

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c    |  62 +++++++--
>  migration/migration.h    |  10 +-
>  migration/postcopy-ram.c |  65 ++++++++-
>  migration/postcopy-ram.h |  10 ++
>  migration/ram.c          | 294 +++++++++++++++++++++++++++++++++++++--
>  migration/ram.h          |   2 +
>  migration/socket.c       |  18 +++
>  migration/socket.h       |   1 +
>  migration/trace-events   |  10 ++
>  9 files changed, 445 insertions(+), 27 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 84a8fbd80d..13dc6ecd37 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -315,6 +315,12 @@ void migration_incoming_state_destroy(void)
>          mis->socket_address_list = NULL;
>      }
>  
> +    if (mis->postcopy_qemufile_dst) {
> +        migration_ioc_unregister_yank_from_file(mis->postcopy_qemufile_dst);
> +        qemu_fclose(mis->postcopy_qemufile_dst);
> +        mis->postcopy_qemufile_dst = NULL;
> +    }
> +
>      yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>  }
>  
> @@ -708,15 +714,21 @@ void migration_fd_process_incoming(QEMUFile *f, Error **errp)
>      migration_incoming_process();
>  }
>  
> +static bool migration_needs_multiple_sockets(void)
> +{
> +    return migrate_use_multifd() || migrate_postcopy_preempt();
> +}
> +
>  void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>  {
>      MigrationIncomingState *mis = migration_incoming_get_current();
>      Error *local_err = NULL;
>      bool start_migration;
> +    QEMUFile *f;
>  
>      if (!mis->from_src_file) {
>          /* The first connection (multifd may have multiple) */
> -        QEMUFile *f = qemu_fopen_channel_input(ioc);
> +        f = qemu_fopen_channel_input(ioc);
>  
>          /* If it's a recovery, we're done */
>          if (postcopy_try_recover(f)) {
> @@ -729,13 +741,18 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>  
>          /*
>           * Common migration only needs one channel, so we can start
> -         * right now.  Multifd needs more than one channel, we wait.
> +         * right now.  Some features need more than one channel, we wait.
>           */
> -        start_migration = !migrate_use_multifd();
> +        start_migration = !migration_needs_multiple_sockets();
>      } else {
>          /* Multiple connections */
> -        assert(migrate_use_multifd());
> -        start_migration = multifd_recv_new_channel(ioc, &local_err);
> +        assert(migration_needs_multiple_sockets());
> +        if (migrate_use_multifd()) {
> +            start_migration = multifd_recv_new_channel(ioc, &local_err);
> +        } else if (migrate_postcopy_preempt()) {
> +            f = qemu_fopen_channel_input(ioc);
> +            start_migration = postcopy_preempt_new_channel(mis, f);
> +        }
>          if (local_err) {
>              error_propagate(errp, local_err);
>              return;
> @@ -756,11 +773,20 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>  bool migration_has_all_channels(void)
>  {
>      MigrationIncomingState *mis = migration_incoming_get_current();
> -    bool all_channels;
>  
> -    all_channels = multifd_recv_all_channels_created();
> +    if (!mis->from_src_file) {
> +        return false;
> +    }
> +
> +    if (migrate_use_multifd()) {
> +        return multifd_recv_all_channels_created();
> +    }
> +
> +    if (migrate_postcopy_preempt()) {
> +        return mis->postcopy_qemufile_dst != NULL;
> +    }
>  
> -    return all_channels && mis->from_src_file != NULL;
> +    return true;
>  }
>  
>  /*
> @@ -1850,6 +1876,11 @@ static void migrate_fd_cleanup(MigrationState *s)
>          qemu_fclose(tmp);
>      }
>  
> +    if (s->postcopy_qemufile_src) {
> +        qemu_fclose(s->postcopy_qemufile_src);
> +        s->postcopy_qemufile_src = NULL;
> +    }
> +
>      assert(!migration_is_active(s));
>  
>      if (s->state == MIGRATION_STATUS_CANCELLING) {
> @@ -3122,6 +3153,8 @@ static int postcopy_start(MigrationState *ms)
>                                MIGRATION_STATUS_FAILED);
>      }
>  
> +    trace_postcopy_preempt_enabled(migrate_postcopy_preempt());
> +
>      return ret;
>  
>  fail_closefb:
> @@ -3234,6 +3267,11 @@ static void migration_completion(MigrationState *s)
>          qemu_savevm_state_complete_postcopy(s->to_dst_file);
>          qemu_mutex_unlock_iothread();
>  
> +        /* Shutdown the postcopy fast path thread */
> +        if (migrate_postcopy_preempt()) {
> +            postcopy_preempt_shutdown_file(s);

We use 'shutdown' in a lot of places to mean shutdown(2), so this name
is confusing; here you're sending a simple end-of-stream message I
think.

> +        }
> +
>          trace_migration_completion_postcopy_end_after_complete();
>      } else if (s->state == MIGRATION_STATUS_CANCELLING) {
>          goto fail;
> @@ -4143,6 +4181,14 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>          return;
>      }
>  
> +    if (postcopy_preempt_setup(s, &local_err)) {
> +        error_report_err(local_err);
> +        migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> +                          MIGRATION_STATUS_FAILED);
> +        migrate_fd_cleanup(s);
> +        return;
> +    }
> +
>      if (migrate_background_snapshot()) {
>          qemu_thread_create(&s->thread, "bg_snapshot",
>                  bg_migration_thread, s, QEMU_THREAD_JOINABLE);
> diff --git a/migration/migration.h b/migration/migration.h
> index 9d39ccfcf5..8786785b1f 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -23,6 +23,7 @@
>  #include "io/channel-buffer.h"
>  #include "net/announce.h"
>  #include "qom/object.h"
> +#include "postcopy-ram.h"
>  
>  struct PostcopyBlocktimeContext;
>  
> @@ -67,7 +68,7 @@ typedef struct {
>  struct MigrationIncomingState {
>      QEMUFile *from_src_file;
>      /* Previously received RAM's RAMBlock pointer */
> -    RAMBlock *last_recv_block;
> +    RAMBlock *last_recv_block[RAM_CHANNEL_MAX];
>      /* A hook to allow cleanup at the end of incoming migration */
>      void *transport_data;
>      void (*transport_cleanup)(void *data);
> @@ -109,6 +110,11 @@ struct MigrationIncomingState {
>       * enabled.
>       */
>      int       postcopy_channels;
> +    /* QEMUFile for postcopy only; it'll be handled by a separate thread */
> +    QEMUFile *postcopy_qemufile_dst;
> +    /* Postcopy priority thread is used to receive postcopy requested pages */
> +    QemuThread     postcopy_prio_thread;
> +    bool           postcopy_prio_thread_created;
>      /*
>       * An array of temp host huge pages to be used, one for each postcopy
>       * channel.
> @@ -189,6 +195,8 @@ struct MigrationState {
>      QEMUBH *cleanup_bh;
>      /* Protected by qemu_file_lock */
>      QEMUFile *to_dst_file;
> +    /* Postcopy specific transfer channel */
> +    QEMUFile *postcopy_qemufile_src;
>      QIOChannelBuffer *bioc;
>      /*
>       * Protects to_dst_file/from_dst_file pointers.  We need to make sure we
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 88c832eeba..9006e68fd1 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -32,6 +32,8 @@
>  #include "trace.h"
>  #include "hw/boards.h"
>  #include "exec/ramblock.h"
> +#include "socket.h"
> +#include "qemu-file-channel.h"
>  
>  /* Arbitrary limit on size of each discard command,
>   * keeps them around ~200 bytes
> @@ -562,6 +564,11 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>  {
>      trace_postcopy_ram_incoming_cleanup_entry();
>  
> +    if (mis->postcopy_prio_thread_created) {
> +        qemu_thread_join(&mis->postcopy_prio_thread);
> +        mis->postcopy_prio_thread_created = false;
> +    }
> +
>      if (mis->have_fault_thread) {
>          Error *local_err = NULL;
>  
> @@ -1114,8 +1121,13 @@ static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
>      int err, i, channels;
>      void *temp_page;
>  
> -    /* TODO: will be boosted when enabling postcopy preemption */
> -    mis->postcopy_channels = 1;
> +    if (migrate_postcopy_preempt()) {
> +        /* If preemption enabled, need extra channel for urgent requests */
> +        mis->postcopy_channels = RAM_CHANNEL_MAX;
> +    } else {
> +        /* Both precopy/postcopy on the same channel */
> +        mis->postcopy_channels = 1;
> +    }
>  
>      channels = mis->postcopy_channels;
>      mis->postcopy_tmp_pages = g_malloc0(sizeof(PostcopyTmpPage) * channels);
> @@ -1182,7 +1194,7 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
>          return -1;
>      }
>  
> -    postcopy_thread_create(mis, &mis->fault_thread, "postcopy/fault",
> +    postcopy_thread_create(mis, &mis->fault_thread, "qemu/fault-default",

Note Linux has a 14 character max thread name size (which the previous
one just fitted); this name will be lost.  In theory you don't need the
qemu/ because we know the process name that owns the thread (?).
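
i.e. dropping the "qemu/" prefix keeps things within the limit, something
like (sketch; "fault-fast" below would similarly fit):

    postcopy_thread_create(mis, &mis->fault_thread, "fault-default",
                           postcopy_ram_fault_thread, QEMU_THREAD_JOINABLE);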

>                             postcopy_ram_fault_thread, QEMU_THREAD_JOINABLE);
>      mis->have_fault_thread = true;
>  
> @@ -1197,6 +1209,16 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
>          return -1;
>      }
>  
> +    if (migrate_postcopy_preempt()) {
> +        /*
> +         * This thread needs to be created after the temp pages because it'll fetch
> +         * RAM_CHANNEL_POSTCOPY PostcopyTmpPage immediately.
> +         */
> +        postcopy_thread_create(mis, &mis->postcopy_prio_thread, "qemu/fault-fast",

and again

> +                               postcopy_preempt_thread, QEMU_THREAD_JOINABLE);
> +        mis->postcopy_prio_thread_created = true;
> +    }
> +
>      trace_postcopy_ram_enable_notify();
>  
>      return 0;
> @@ -1516,3 +1538,40 @@ void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd)
>          }
>      }
>  }
> +
> +bool postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file)
> +{
> +    mis->postcopy_qemufile_dst = file;
> +
> +    trace_postcopy_preempt_new_channel();
> +
> +    /* Start the migration immediately */
> +    return true;
> +}
> +
> +int postcopy_preempt_setup(MigrationState *s, Error **errp)
> +{
> +    QIOChannel *ioc;
> +
> +    if (!migrate_postcopy_preempt()) {
> +        return 0;
> +    }
> +
> +    if (!migrate_multi_channels_is_allowed()) {
> +        error_setg(errp, "Postcopy preempt is not supported as current "
> +                   "migration stream does not support multi-channels.");
> +        return -1;
> +    }
> +
> +    ioc = socket_send_channel_create_sync(errp);
> +
> +    if (ioc == NULL) {
> +        return -1;
> +    }
> +
> +    s->postcopy_qemufile_src = qemu_fopen_channel_output(ioc);
> +
> +    trace_postcopy_preempt_new_channel();

Generally we've preferred trace names to approximately match the
function names; it tends to diverge a bit as we split/rename functions.
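
e.g. a dedicated trace point here would keep that property (sketch; it would
also need a matching trace-events entry):

    trace_postcopy_preempt_setup();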

> +    return 0;
> +}
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index 07684c0e1d..34b1080cde 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -183,4 +183,14 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd, uint64_t client_addr,
>  int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
>                                   uint64_t client_addr, uint64_t offset);
>  
> +/* Hard-code channels for now for postcopy preemption */
> +enum PostcopyChannels {
> +    RAM_CHANNEL_PRECOPY = 0,
> +    RAM_CHANNEL_POSTCOPY = 1,
> +    RAM_CHANNEL_MAX,
> +};
> +
> +bool postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
> +int postcopy_preempt_setup(MigrationState *s, Error **errp);
> +
>  #endif
> diff --git a/migration/ram.c b/migration/ram.c
> index b7d17613e8..6a1ef86eca 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -294,6 +294,20 @@ struct RAMSrcPageRequest {
>      QSIMPLEQ_ENTRY(RAMSrcPageRequest) next_req;
>  };
>  
> +typedef struct {
> +    /*
> +     * Cached ramblock/offset values if preempted.  They're only meaningful if
> +     * preempted==true below.
> +     */
> +    RAMBlock *ram_block;
> +    unsigned long ram_page;

Is this really a 'ram_block/ram_page' per channel, and the 'preempted'
is telling us which channel we're using?

> +    /*
> +     * Whether a postcopy preemption just happened.  Will be reset after
> +     * precopy recovered to background migration.
> +     */
> +    bool preempted;
> +} PostcopyPreemptState;
> +
>  /* State of RAM for migration */
>  struct RAMState {
>      /* QEMUFile used for this migration */
> @@ -347,6 +361,14 @@ struct RAMState {
>      /* Queue of outstanding page requests from the destination */
>      QemuMutex src_page_req_mutex;
>      QSIMPLEQ_HEAD(, RAMSrcPageRequest) src_page_requests;
> +
> +    /* Postcopy preemption informations */
> +    PostcopyPreemptState postcopy_preempt_state;
> +    /*
> +     * Current channel we're using on src VM.  Only valid if postcopy-preempt
> +     * is enabled.
> +     */
> +    int postcopy_channel;
>  };
>  typedef struct RAMState RAMState;
>  
> @@ -354,6 +376,11 @@ static RAMState *ram_state;
>  
>  static NotifierWithReturnList precopy_notifier_list;
>  
> +static void postcopy_preempt_reset(RAMState *rs)
> +{
> +    memset(&rs->postcopy_preempt_state, 0, sizeof(PostcopyPreemptState));
> +}
> +
>  /* Whether postcopy has queued requests? */
>  static bool postcopy_has_request(RAMState *rs)
>  {
> @@ -1937,6 +1964,55 @@ void ram_write_tracking_stop(void)
>  }
>  #endif /* defined(__linux__) */
>  
> +/*
> + * Check whether two addr/offset of the ramblock falls onto the same host huge
> + * page.  Returns true if so, false otherwise.
> + */
> +static bool offset_on_same_huge_page(RAMBlock *rb, uint64_t addr1,
> +                                     uint64_t addr2)
> +{
> +    size_t page_size = qemu_ram_pagesize(rb);
> +
> +    addr1 = ROUND_DOWN(addr1, page_size);
> +    addr2 = ROUND_DOWN(addr2, page_size);
> +
> +    return addr1 == addr2;
> +}
> +
> +/*
> + * Whether a previous preempted precopy huge page contains current requested
> + * page?  Returns true if so, false otherwise.
> + *
> + * This should really happen very rarely, because it means when we were sending
> + * during background migration for postcopy we're sending exactly the page that
> + * some vcpu got faulted on on dest node.  When it happens, we probably don't
> + * need to do much but drop the request, because we know right after we restore
> + * the precopy stream it'll be serviced.  It'll slightly affect the order of
> + * postcopy requests to be serviced (e.g. it'll be the same as we move current
> + * request to the end of the queue) but it shouldn't be a big deal.  The most
> + * important thing is we can _never_ try to send a partially-sent huge page on
> + * the POSTCOPY channel again, otherwise that huge page will get "split brain" on
> + * two channels (PRECOPY, POSTCOPY).
> + */
> +static bool postcopy_preempted_contains(RAMState *rs, RAMBlock *block,
> +                                        ram_addr_t offset)
> +{
> +    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
> +
> +    /* No preemption at all? */
> +    if (!state->preempted) {
> +        return false;
> +    }
> +
> +    /* Not even the same ramblock? */
> +    if (state->ram_block != block) {
> +        return false;
> +    }
> +
> +    return offset_on_same_huge_page(block, offset,
> +                                    state->ram_page << TARGET_PAGE_BITS);

Can you add a trace here - I'm curious how often this hits; if it hits a
lot then it probably tells us the guess about sequential pages being
rare is wrong.

> +}
> +
>  /**
>   * get_queued_page: unqueue a page from the postcopy requests
>   *
> @@ -1952,9 +2028,17 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>      RAMBlock  *block;
>      ram_addr_t offset;
>  
> +again:
>      block = unqueue_page(rs, &offset);
>  
> -    if (!block) {
> +    if (block) {
> +        /* See comment above postcopy_preempted_contains() */
> +        if (postcopy_preempted_contains(rs, block, offset)) {
> +            trace_postcopy_preempt_hit(block->idstr, offset);
> +            /* This request is dropped */
> +            goto again;
> +        }
> +    } else {
>          /*
>           * Poll write faults too if background snapshot is enabled; that's
>           * when we have vcpus got blocked by the write protected pages.
> @@ -2173,6 +2257,114 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
>      return ram_save_page(rs, pss, last_stage);
>  }
>  
> +static bool postcopy_needs_preempt(RAMState *rs, PageSearchStatus *pss)
> +{
> +    /* Not enabled eager preempt?  Then never do that. */
> +    if (!migrate_postcopy_preempt()) {
> +        return false;
> +    }
> +
> +    /* If the ramblock we're sending is a small page?  Never bother. */
> +    if (qemu_ram_pagesize(pss->block) == TARGET_PAGE_SIZE) {
> +        return false;
> +    }

Maybe that should check against qemu_real_host_page_size - so we still don't
bother on ARM or PPC with 16k/64k page sizes?
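
i.e. roughly (a sketch of the idea):

    /* Never bother if the ramblock isn't backed by host huge pages */
    if (qemu_ram_pagesize(pss->block) == qemu_real_host_page_size) {
        return false;
    }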

> +    /* Not in postcopy at all? */
> +    if (!migration_in_postcopy()) {
> +        return false;
> +    }
> +
> +    /*
> +     * If we're already handling a postcopy request, don't preempt as this page
> +     * has got the same high priority.
> +     */
> +    if (pss->postcopy_requested) {
> +        return false;
> +    }
> +
> +    /* If there's postcopy requests, then check it up! */
> +    return postcopy_has_request(rs);
> +}
> +
> +/* Returns true if we preempted precopy, false otherwise */
> +static void postcopy_do_preempt(RAMState *rs, PageSearchStatus *pss)
> +{
> +    PostcopyPreemptState *p_state = &rs->postcopy_preempt_state;
> +
> +    trace_postcopy_preempt_triggered(pss->block->idstr, pss->page);
> +
> +    /*
> +     * Time to preempt precopy. Cache current PSS into preempt state, so that
> +     * after handling the postcopy pages we can recover to it.  We need to do
> +     * so because the dest VM will have partial of the precopy huge page kept
> +     * over in its tmp huge page caches; better move on with it when we can.
> +     */
> +    p_state->ram_block = pss->block;
> +    p_state->ram_page = pss->page;
> +    p_state->preempted = true;
> +}
> +
> +/* Whether we're preempted by a postcopy request during sending a huge page */
> +static bool postcopy_preempt_triggered(RAMState *rs)
> +{
> +    return rs->postcopy_preempt_state.preempted;
> +}
> +
> +static void postcopy_preempt_restore(RAMState *rs, PageSearchStatus *pss)
> +{
> +    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
> +
> +    assert(state->preempted);
> +
> +    pss->block = state->ram_block;
> +    pss->page = state->ram_page;
> +    /* This is not a postcopy request but restoring previous precopy */
> +    pss->postcopy_requested = false;
> +
> +    trace_postcopy_preempt_restored(pss->block->idstr, pss->page);
> +
> +    /* Reset preempt state, most importantly, set preempted==false */
> +    postcopy_preempt_reset(rs);
> +}
> +
> +static void postcopy_preempt_choose_channel(RAMState *rs, PageSearchStatus *pss)
> +{
> +    int channel = pss->postcopy_requested ? RAM_CHANNEL_POSTCOPY : RAM_CHANNEL_PRECOPY;
> +    MigrationState *s = migrate_get_current();
> +    QEMUFile *next;
> +
> +    if (channel != rs->postcopy_channel) {
> +        if (channel == RAM_CHANNEL_PRECOPY) {
> +            next = s->to_dst_file;
> +        } else {
> +            next = s->postcopy_qemufile_src;
> +        }
> +        /* Update and cache the current channel */
> +        rs->f = next;
> +        rs->postcopy_channel = channel;
> +
> +        /*
> +         * If channel switched, reset last_sent_block since the old sent block
> +         * may not be on the same channel.
> +         */
> +        rs->last_sent_block = NULL;
> +
> +        trace_postcopy_preempt_switch_channel(channel);
> +    }
> +
> +    trace_postcopy_preempt_send_host_page(pss->block->idstr, pss->page);
> +}
> +
> +/* We need to make sure rs->f always points to the default channel elsewhere */
> +static void postcopy_preempt_reset_channel(RAMState *rs)
> +{
> +    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
> +        rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
> +        rs->f = migrate_get_current()->to_dst_file;
> +        trace_postcopy_preempt_reset_channel();
> +    }
> +}
> +
>  /**
>   * ram_save_host_page: save a whole host page
>   *
> @@ -2207,7 +2399,16 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>          return 0;
>      }
>  
> +    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
> +        postcopy_preempt_choose_channel(rs, pss);
> +    }
> +
>      do {
> +        if (postcopy_needs_preempt(rs, pss)) {
> +            postcopy_do_preempt(rs, pss);
> +            break;
> +        }
> +
>          /* Check the pages is dirty and if it is send it */
>          if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
>              tmppages = ram_save_target_page(rs, pss, last_stage);
> @@ -2229,6 +2430,19 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>               offset_in_ramblock(pss->block,
>                                  ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
>  
> +    /*
> +     * When with postcopy preempt mode, flush the data as soon as possible for
> +     * postcopy requests, because we've already sent a whole huge page, so the
> +     * dst node should already have enough resource to atomically filling in
> +     * the current missing page.
> +     *
> +     * More importantly, when using separate postcopy channel, we must do
> +     * explicit flush or it won't flush until the buffer is full.
> +     */
> +    if (migrate_postcopy_preempt() && pss->postcopy_requested) {
> +        qemu_fflush(rs->f);
> +    }
> +
>      res = ram_save_release_protection(rs, pss, start_page);
>      return (res < 0 ? res : pages);
>  }
> @@ -2272,8 +2486,17 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>          found = get_queued_page(rs, &pss);
>  
>          if (!found) {
> -            /* priority queue empty, so just search for something dirty */
> -            found = find_dirty_block(rs, &pss, &again);
> +            /*
> +             * Recover previous precopy ramblock/offset if postcopy has
> +             * preempted precopy.  Otherwise find the next dirty bit.
> +             */
> +            if (postcopy_preempt_triggered(rs)) {
> +                postcopy_preempt_restore(rs, &pss);
> +                found = true;
> +            } else {
> +                /* priority queue empty, so just search for something dirty */
> +                found = find_dirty_block(rs, &pss, &again);
> +            }
>          }
>  
>          if (found) {
> @@ -2401,6 +2624,8 @@ static void ram_state_reset(RAMState *rs)
>      rs->last_page = 0;
>      rs->last_version = ram_list.version;
>      rs->xbzrle_enabled = false;
> +    postcopy_preempt_reset(rs);
> +    rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
>  }
>  
>  #define MAX_WAIT 50 /* ms, half buffered_file limit */
> @@ -3043,6 +3268,8 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>      }
>      qemu_mutex_unlock(&rs->bitmap_mutex);
>  
> +    postcopy_preempt_reset_channel(rs);
> +
>      /*
>       * Must occur before EOS (or any QEMUFile operation)
>       * because of RDMA protocol.
> @@ -3110,6 +3337,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>          ram_control_after_iterate(f, RAM_CONTROL_FINISH);
>      }
>  
> +    postcopy_preempt_reset_channel(rs);
> +
>      if (ret >= 0) {
>          multifd_send_sync_main(rs->f);
>          qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> @@ -3192,11 +3421,13 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
>   * @mis: the migration incoming state pointer
>   * @f: QEMUFile where to read the data from
>   * @flags: Page flags (mostly to see if it's a continuation of previous block)
> + * @channel: the channel we're using
>   */
>  static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
> -                                              QEMUFile *f, int flags)
> +                                              QEMUFile *f, int flags,
> +                                              int channel)
>  {
> -    RAMBlock *block = mis->last_recv_block;
> +    RAMBlock *block = mis->last_recv_block[channel];
>      char id[256];
>      uint8_t len;
>  
> @@ -3223,7 +3454,7 @@ static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
>          return NULL;
>      }
>  
> -    mis->last_recv_block = block;
> +    mis->last_recv_block[channel] = block;
>  
>      return block;
>  }
> @@ -3642,15 +3873,15 @@ int ram_postcopy_incoming_init(MigrationIncomingState *mis)
>   * rcu_read_lock is taken prior to this being called.
>   *
>   * @f: QEMUFile where to send the data
> + * @channel: the channel to use for loading
>   */
> -static int ram_load_postcopy(QEMUFile *f)
> +static int ram_load_postcopy(QEMUFile *f, int channel)
>  {
>      int flags = 0, ret = 0;
>      bool place_needed = false;
>      bool matches_target_page_size = false;
>      MigrationIncomingState *mis = migration_incoming_get_current();
> -    /* Currently we only use channel 0.  TODO: use all the channels */
> -    PostcopyTmpPage *tmp_page = &mis->postcopy_tmp_pages[0];
> +    PostcopyTmpPage *tmp_page = &mis->postcopy_tmp_pages[channel];
>  
>      while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
>          ram_addr_t addr;
> @@ -3677,7 +3908,7 @@ static int ram_load_postcopy(QEMUFile *f)
>          trace_ram_load_postcopy_loop((uint64_t)addr, flags);
>          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
>                       RAM_SAVE_FLAG_COMPRESS_PAGE)) {
> -            block = ram_block_from_stream(mis, f, flags);
> +            block = ram_block_from_stream(mis, f, flags, channel);
>              if (!block) {
>                  ret = -EINVAL;
>                  break;
> @@ -3715,10 +3946,10 @@ static int ram_load_postcopy(QEMUFile *f)
>              } else if (tmp_page->host_addr !=
>                         host_page_from_ram_block_offset(block, addr)) {
>                  /* not the 1st TP within the HP */
> -                error_report("Non-same host page detected.  Target host page %p, "
> -                             "received host page %p "
> +                error_report("Non-same host page detected on channel %d: "
> +                             "Target host page %p, received host page %p "
>                               "(rb %s offset 0x"RAM_ADDR_FMT" target_pages %d)",
> -                             tmp_page->host_addr,
> +                             channel, tmp_page->host_addr,
>                               host_page_from_ram_block_offset(block, addr),
>                               block->idstr, addr, tmp_page->target_pages);
>                  ret = -EINVAL;
> @@ -3818,6 +4049,28 @@ static int ram_load_postcopy(QEMUFile *f)
>      return ret;
>  }
>  
> +void *postcopy_preempt_thread(void *opaque)
> +{
> +    MigrationIncomingState *mis = opaque;
> +    int ret;
> +
> +    trace_postcopy_preempt_thread_entry();
> +
> +    rcu_register_thread();
> +
> +    qemu_sem_post(&mis->thread_sync_sem);
> +
> +    /* The source sends RAM_SAVE_FLAG_EOS to terminate this thread */
> +    ret = ram_load_postcopy(mis->postcopy_qemufile_dst, RAM_CHANNEL_POSTCOPY);
> +
> +    rcu_unregister_thread();
> +
> +    trace_postcopy_preempt_thread_exit();
> +
> +    return ret == 0 ? NULL : (void *)-1;
> +}
> +
> +
>  static bool postcopy_is_advised(void)
>  {
>      PostcopyState ps = postcopy_state_get();
> @@ -3930,7 +4183,7 @@ static int ram_load_precopy(QEMUFile *f)
>  
>          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
>                       RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
> -            RAMBlock *block = ram_block_from_stream(mis, f, flags);
> +            RAMBlock *block = ram_block_from_stream(mis, f, flags, RAM_CHANNEL_PRECOPY);
>  
>              host = host_from_ram_block_offset(block, addr);
>              /*
> @@ -4107,7 +4360,12 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>       */
>      WITH_RCU_READ_LOCK_GUARD() {
>          if (postcopy_running) {
> -            ret = ram_load_postcopy(f);
> +            /*
> +             * Note!  Here RAM_CHANNEL_PRECOPY is the precopy channel of
> +             * postcopy migration; we have another RAM_CHANNEL_POSTCOPY to
> +             * service fast page faults.
> +             */
> +            ret = ram_load_postcopy(f, RAM_CHANNEL_PRECOPY);
>          } else {
>              ret = ram_load_precopy(f);
>          }
> @@ -4269,6 +4527,12 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>      return 0;
>  }
>  
> +void postcopy_preempt_shutdown_file(MigrationState *s)
> +{
> +    qemu_put_be64(s->postcopy_qemufile_src, RAM_SAVE_FLAG_EOS);
> +    qemu_fflush(s->postcopy_qemufile_src);
> +}
> +
>  static SaveVMHandlers savevm_ram_handlers = {
>      .save_setup = ram_save_setup,
>      .save_live_iterate = ram_save_iterate,
> diff --git a/migration/ram.h b/migration/ram.h
> index 2c6dc3675d..f31b8c0ece 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -72,6 +72,8 @@ int64_t ramblock_recv_bitmap_send(QEMUFile *file,
>                                    const char *block_name);
>  int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
>  bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
> +void postcopy_preempt_shutdown_file(MigrationState *s);
> +void *postcopy_preempt_thread(void *opaque);
>  
>  /* ram cache */
>  int colo_init_ram_cache(void);
> diff --git a/migration/socket.c b/migration/socket.c
> index 05705a32d8..955c5ebb10 100644
> --- a/migration/socket.c
> +++ b/migration/socket.c
> @@ -39,6 +39,24 @@ void socket_send_channel_create(QIOTaskFunc f, void *data)
>                                       f, data, NULL, NULL);
>  }
>  
> +QIOChannel *socket_send_channel_create_sync(Error **errp)
> +{
> +    QIOChannelSocket *sioc = qio_channel_socket_new();
> +
> +    if (!outgoing_args.saddr) {
> +        object_unref(OBJECT(sioc));
> +        error_setg(errp, "Initial sock address not set!");
> +        return NULL;
> +    }
> +
> +    if (qio_channel_socket_connect_sync(sioc, outgoing_args.saddr, errp) < 0) {
> +        object_unref(OBJECT(sioc));
> +        return NULL;
> +    }
> +
> +    return QIO_CHANNEL(sioc);
> +}
> +
>  int socket_send_channel_destroy(QIOChannel *send)
>  {
>      /* Remove channel */
> diff --git a/migration/socket.h b/migration/socket.h
> index 891dbccceb..dc54df4e6c 100644
> --- a/migration/socket.h
> +++ b/migration/socket.h
> @@ -21,6 +21,7 @@
>  #include "io/task.h"
>  
>  void socket_send_channel_create(QIOTaskFunc f, void *data);
> +QIOChannel *socket_send_channel_create_sync(Error **errp);
>  int socket_send_channel_destroy(QIOChannel *send);
>  
>  void socket_start_incoming_migration(const char *str, Error **errp);
> diff --git a/migration/trace-events b/migration/trace-events
> index 3a9b3567ae..6452179bee 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -110,6 +110,12 @@ ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRI
>  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
>  ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
>  ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
> +postcopy_preempt_triggered(char *str, unsigned long page) "during sending ramblock %s offset 0x%lx"
> +postcopy_preempt_restored(char *str, unsigned long page) "ramblock %s offset 0x%lx"
> +postcopy_preempt_hit(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
> +postcopy_preempt_send_host_page(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
> +postcopy_preempt_switch_channel(int channel) "%d"
> +postcopy_preempt_reset_channel(void) ""
>  
>  # multifd.c
>  multifd_new_send_channel_async(uint8_t id) "channel %d"
> @@ -175,6 +181,7 @@ migration_thread_low_pending(uint64_t pending) "%" PRIu64
>  migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
>  process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>  process_incoming_migration_co_postcopy_end_main(void) ""
> +postcopy_preempt_enabled(bool value) "%d"
>  
>  # channel.c
>  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
> @@ -277,6 +284,9 @@ postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_off
>  postcopy_request_shared_page_present(const char *sharer, const char *rb, uint64_t rb_offset) "%s already %s offset 0x%"PRIx64
>  postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
>  postcopy_page_req_del(void *addr, int count) "resolved page req %p total %d"
> +postcopy_preempt_new_channel(void) ""
> +postcopy_preempt_thread_entry(void) ""
> +postcopy_preempt_thread_exit(void) ""
>  
>  get_mem_fault_cpu_index(int cpu, uint32_t pid) "cpu: %d, pid: %u"
>  
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 10/15] migration: Move static var in ram_block_from_stream() into global
  2022-01-19  8:09 ` [PATCH RFC 10/15] migration: Move static var in ram_block_from_stream() into global Peter Xu
@ 2022-02-03 17:48   ` Dr. David Alan Gilbert
  2022-02-08  3:51     ` Peter Xu
  0 siblings, 1 reply; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-03 17:48 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> A static variable is very unfriendly to threading of ram_block_from_stream().
> Move it into MigrationIncomingState.
> 
> Make the incoming state pointer be passed to ram_block_from_stream() at both
> call sites.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

OK, but I'm not sure if I noticed where you changed this to be per
channel later?



Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Dave
> ---
>  migration/migration.h |  3 ++-
>  migration/ram.c       | 13 +++++++++----
>  2 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/migration/migration.h b/migration/migration.h
> index 35e7f7babe..34b79cb961 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -66,7 +66,8 @@ typedef struct {
>  /* State for the incoming migration */
>  struct MigrationIncomingState {
>      QEMUFile *from_src_file;
> -
> +    /* Previously received RAM's RAMBlock pointer */
> +    RAMBlock *last_recv_block;
>      /* A hook to allow cleanup at the end of incoming migration */
>      void *transport_data;
>      void (*transport_cleanup)(void *data);
> diff --git a/migration/ram.c b/migration/ram.c
> index 3f823ffffc..3a7d943f9c 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3183,12 +3183,14 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
>   *
>   * Returns a pointer from within the RCU-protected ram_list.
>   *
> + * @mis: the migration incoming state pointer
>   * @f: QEMUFile where to read the data from
>   * @flags: Page flags (mostly to see if it's a continuation of previous block)
>   */
> -static inline RAMBlock *ram_block_from_stream(QEMUFile *f, int flags)
> +static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
> +                                              QEMUFile *f, int flags)
>  {
> -    static RAMBlock *block;
> +    RAMBlock *block = mis->last_recv_block;
>      char id[256];
>      uint8_t len;
>  
> @@ -3215,6 +3217,8 @@ static inline RAMBlock *ram_block_from_stream(QEMUFile *f, int flags)
>          return NULL;
>      }
>  
> +    mis->last_recv_block = block;
> +
>      return block;
>  }
>  
> @@ -3667,7 +3671,7 @@ static int ram_load_postcopy(QEMUFile *f)
>          trace_ram_load_postcopy_loop((uint64_t)addr, flags);
>          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
>                       RAM_SAVE_FLAG_COMPRESS_PAGE)) {
> -            block = ram_block_from_stream(f, flags);
> +            block = ram_block_from_stream(mis, f, flags);
>              if (!block) {
>                  ret = -EINVAL;
>                  break;
> @@ -3881,6 +3885,7 @@ void colo_flush_ram_cache(void)
>   */
>  static int ram_load_precopy(QEMUFile *f)
>  {
> +    MigrationIncomingState *mis = migration_incoming_get_current();
>      int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0;
>      /* ADVISE is earlier, it shows the source has the postcopy capability on */
>      bool postcopy_advised = postcopy_is_advised();
> @@ -3919,7 +3924,7 @@ static int ram_load_precopy(QEMUFile *f)
>  
>          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
>                       RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
> -            RAMBlock *block = ram_block_from_stream(f, flags);
> +            RAMBlock *block = ram_block_from_stream(mis, f, flags);
>  
>              host = host_from_ram_block_offset(block, addr);
>              /*
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 02/15] migration: Allow pss->page jump over clean pages
  2022-01-20  2:12     ` Peter Xu
@ 2022-02-03 18:19       ` Dr. David Alan Gilbert
  2022-02-08  3:20         ` Peter Xu
  0 siblings, 1 reply; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-03 18:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: Kunkun Jiang, Juan Quintela, Keqian Zhu, qemu-devel,
	Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Jan 19, 2022 at 01:42:47PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > Commit ba1b7c812c ("migration/ram: Optimize ram_save_host_page()") managed to
> > > optimize host huge page use case by scanning the dirty bitmap when looking for
> > > the next dirty small page to migrate.
> > > 
> > > However when updating the pss->page before returning from that function, we
> > > used MIN() of these two values: (1) next dirty bit, or (2) end of current sent
> > > huge page, to fix up pss->page.
> > > 
> > > That sounds unnecessary, because I see nowhere that requires pss->page to
> > > not go over the current huge page boundary.
> > > 
> > > What we need here is probably MAX() instead of MIN() so that we'll start
> > > scanning from the next dirty bit next time. Since pss->page can't be smaller
> > > than hostpage_boundary (the loop guarantees it), it probably means we don't
> > > need to fix it up at all.
> > > 
> > > Cc: Keqian Zhu <zhukeqian1@huawei.com>
> > > Cc: Kunkun Jiang <jiangkunkun@huawei.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > 
> > Hmm, I think that's potentially necessary.  Note that the start of
> > ram_save_host_page stores the 'start_page' at entry.
> > That 'start_page' goes to ram_save_release_protection and so
> > I think it needs to be pagesize aligned for the mmap/uffd that happens.
> 
> Right, that's indeed a functional change, but IMHO it's also fine.
> 
> When reaching ram_save_release_protection(), what we guarantee is that the
> below page range contains no dirty bits in the ramblock dirty bitmap:
> 
>   range0 = [start_page, pss->page)
> 
> Side note: inclusive on start, but not inclusive on the end side of range0
> (that is, pss->page can be pointing to a dirty page).
> 
> What ram_save_release_protection() does is to unprotect the pages and let them
> run free.  If we're sure range0 contains no dirty page, it means we have
> already copied them over into the snapshot, so IIUC it's safe to unprotect all
> of it (even if it's already bigger than the host page size)?

I think what's worrying me is the alignment of the address going into
UFFDIO_WRITEPROTECT in uffd_change_protection - if it was previously
huge page aligned and now isn't, what breaks? (Did it support
hugepages?)

> That can be slightly less efficient for live snapshot in some extreme cases
> (when unprotecting, we'll need to walk the pgtables in the uffd ioctl()), but
> I don't expect live snapshot to be run on a huge VM, so hopefully it's still
> fine?  Not to mention it should make live migration a little bit faster,
> assuming that's more frequently used.

Hmm I don't think I understand that statement.

Dave

> 
> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 02/15] migration: Allow pss->page jump over clean pages
  2022-02-03 18:19       ` Dr. David Alan Gilbert
@ 2022-02-08  3:20         ` Peter Xu
  0 siblings, 0 replies; 53+ messages in thread
From: Peter Xu @ 2022-02-08  3:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kunkun Jiang, Juan Quintela, Keqian Zhu, qemu-devel,
	Leonardo Bras Soares Passos

On Thu, Feb 03, 2022 at 06:19:22PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Wed, Jan 19, 2022 at 01:42:47PM +0000, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > Commit ba1b7c812c ("migration/ram: Optimize ram_save_host_page()") managed to
> > > > optimize host huge page use case by scanning the dirty bitmap when looking for
> > > > the next dirty small page to migrate.
> > > > 
> > > > However when updating the pss->page before returning from that function, we
> > > > used MIN() of these two values: (1) next dirty bit, or (2) end of current sent
> > > > huge page, to fix up pss->page.
> > > > 
> > > > That sounds unnecessary, because I see nowhere that requires pss->page
> > > > to not go over the current huge page boundary.
> > > > 
> > > > What we need here is probably MAX() instead of MIN() so that we'll start
> > > > scanning from the next dirty bit next time. Since pss->page can't be smaller
> > > > than hostpage_boundary (the loop guarantees it), it probably means we don't
> > > > need to fix it up at all.
> > > > 
> > > > Cc: Keqian Zhu <zhukeqian1@huawei.com>
> > > > Cc: Kunkun Jiang <jiangkunkun@huawei.com>
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > 
> > > 
> > > Hmm, I think that's potentially necessary.  Note that the start of
> > > ram_save_host_page stores the 'start_page' at entry.
> > > That 'start_page' goes to ram_save_release_protection and so
> > > I think it needs to be pagesize aligned for the mmap/uffd that happens.
> > 
> > Right, that's indeed a functional change, but IMHO it's also fine.
> > 
> > When reaching ram_save_release_protection(), what we guarantee is that the
> > below page range contains no dirty bits in the ramblock dirty bitmap:
> > 
> >   range0 = [start_page, pss->page)
> > 
> > Side note: inclusive on start, but not inclusive on the end side of range0
> > (that is, pss->page can be pointing to a dirty page).
> > 
> > What ram_save_release_protection() does is to unprotect the pages and let them
> > run free.  If we're sure range0 contains no dirty page, it means we have
> > already copied them over into the snapshot, so IIUC it's safe to unprotect all
> > of it (even if it's already bigger than the host page size)?
> 
> I think what's worrying me is the alignment of the address going into
> UFFDIO_WRITEPROTECT in uffd_change_protection - if it was previously
> huge page aligned and now isn't, what breaks? (Did it support
> hugepages?)

Good point..

It doesn't support huge pages yet, but we'd better keep it always page aligned
for the unprotect ioctl.
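
Something like the below is what I have in mind (an untested sketch of the
idea only; range_start/range_end stand in for whatever range we end up
passing, while the alignment macros and uffd_change_protection() are the
existing helpers from osdep.h and util/userfaultfd.c):

/* Clamp the to-be-unprotected range to ramblock page boundaries */
uint64_t psize = qemu_ram_pagesize(pss->block);
uint64_t start = QEMU_ALIGN_DOWN(range_start, psize);
uint64_t end = QEMU_ALIGN_UP(range_end, psize);

uffd_change_protection(rs->uffdio_fd, (void *)(uintptr_t)start,
                       end - start, false /* wp */, false /* dont_wake */);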

> 
> > That can be slightly less efficient for live snapshot in some extreme cases
> > (when unprotecting, we'll need to walk the pgtables in the uffd ioctl()),
> > but I don't expect live snapshot to be run on a huge VM, so hopefully it's
> > still fine?  Not to mention it should make live migration a little bit
> > faster, assuming that's more frequently used.
> 
> Hmm I don't think I understand that statement.

I meant that since we've scanned over those clean pages we don't need to scan
them again in the next find_dirty_block() call for precopy, per the "faster"
statement.

But to make it simple I think I'll drop this patch in the next version.

Thanks!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 07/15] migration: Introduce postcopy channels on dest node
  2022-02-03 15:08   ` Dr. David Alan Gilbert
@ 2022-02-08  3:27     ` Peter Xu
  2022-02-08  9:43       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-02-08  3:27 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

On Thu, Feb 03, 2022 at 03:08:39PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Postcopy handles huge pages in a special way: currently we can only have
> > one "channel" to transfer the page.
> > 
> > That's because when we install pages using UFFDIO_COPY, we need to have the
> > whole huge page ready; it also means we need a temp huge page when trying
> > to receive the whole content of the page.
> > 
> > Currently all maintenance around this tmp page is global: firstly we'll
> > allocate a temp huge page, then we maintain its status mostly within
> > ram_load_postcopy().
> > 
> > To enable multiple channels for postcopy, the first thing we need to do is to
> > prepare N temp huge pages as caching, one for each channel.
> > 
> > Meanwhile we need to maintain the tmp huge page status per-channel too.
> > 
> > To give an example, some local variables maintained in ram_load_postcopy()
> > are listed; they are responsible for maintaining temp huge page status:
> > 
> >   - all_zero:     this keeps whether this huge page contains all zeros
> >   - target_pages: this counts how many target pages have been copied
> >   - host_page:    this keeps the host ptr for the page to install
> > 
> > Move all these fields to be together with the temp huge pages to form a new
> > structure called PostcopyTmpPage.  Then for each (future) postcopy channel, we
> > need one structure to keep the state around.
> > 
> > For vanilla postcopy, obviously there's only one channel.  It contains both
> > precopy and postcopy pages.
> > 
> > This patch teaches the dest migration node to start realizing the possible
> > number of postcopy channels by introducing the "postcopy_channels"
> > variable.  Its value is calculated when setting up postcopy on the dest
> > node (during the POSTCOPY_LISTEN phase).
> > 
> > Vanilla postcopy will have channels=1, but when postcopy-preempt capability is
> > enabled (in the future), we will boost it to 2 because even during partial
> > sending of a precopy huge page we still want to preempt it and start sending
> > the postcopy requested page right away (so we start to keep two temp huge
> > pages; more if we want to enable multifd).  In this patch there's a TODO
> > marked for that; so far the channel count is always set to 1.
> > 
> > We need to send one "host huge page" on one channel only and we cannot
> > split it, because otherwise the data of the same huge page could land on
> > more than one channel and we'd need more complicated logic to manage that.
> > One temp host huge page for each channel will be enough for us for now.
> > 
> > Postcopy will still always use the index=0 huge page even after this patch.
> > However it prepares for later patches where it can start to use multiple
> > channels (which needs src intervention, because only src knows which
> > channel we should use).
> 
> Generally OK, some minor nits.
> 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  migration/migration.h    | 35 +++++++++++++++++++++++++++-
> >  migration/postcopy-ram.c | 50 +++++++++++++++++++++++++++++-----------
> >  migration/ram.c          | 43 +++++++++++++++++-----------------
> >  3 files changed, 91 insertions(+), 37 deletions(-)
> > 
> > diff --git a/migration/migration.h b/migration/migration.h
> > index 8130b703eb..8bb2931312 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -45,6 +45,24 @@ struct PostcopyBlocktimeContext;
> >   */
> >  #define CLEAR_BITMAP_SHIFT_MAX            31
> >  
> > +/* This is an abstraction of a "temp huge page" for postcopy's purpose */
> > +typedef struct {
> > +    /*
> > +     * This points to a temporary huge page as a buffer for UFFDIO_COPY.  It's
> > +     * mmap()ed and needs to be freed when cleanup.
> > +     */
> > +    void *tmp_huge_page;
> > +    /*
> > +     * This points to the host page we're going to install for this temp page.
> > +     * It tells us after we've received the whole page, where we should put it.
> > +     */
> > +    void *host_addr;
> > +    /* Number of small pages copied (in size of TARGET_PAGE_SIZE) */
> > +    int target_pages;
> 
> Can we take the opportunity to convert this to an unsigned?

Sure.

> 
> > +    /* Whether this page contains all zeros */
> > +    bool all_zero;
> > +} PostcopyTmpPage;
> > +
> >  /* State for the incoming migration */
> >  struct MigrationIncomingState {
> >      QEMUFile *from_src_file;
> > @@ -81,7 +99,22 @@ struct MigrationIncomingState {
> >      QemuMutex rp_mutex;    /* We send replies from multiple threads */
> >      /* RAMBlock of last request sent to source */
> >      RAMBlock *last_rb;
> > -    void     *postcopy_tmp_page;
> > +    /*
> > +     * Number of postcopy channels including the default precopy channel, so
> > +     * vanilla postcopy will only contain one channel which contain both
> > +     * precopy and postcopy streams.
> > +     *
> > +     * This is calculated when the src requests to enable postcopy but before
> > +     * it starts.  Its value can depend on e.g. whether postcopy preemption is
> > +     * enabled.
> > +     */
> > +    int       postcopy_channels;
> 
> Also unsigned?

OK.

> 
> > +    /*
> > +     * An array of temp host huge pages to be used, one for each postcopy
> > +     * channel.
> > +     */
> > +    PostcopyTmpPage *postcopy_tmp_pages;
> > +    /* This is shared for all postcopy channels */
> >      void     *postcopy_tmp_zero_page;
> >      /* PostCopyFD's for external userfaultfds & handlers of shared memory */
> >      GArray   *postcopy_remote_fds;
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index e662dd05cc..d78e1b9373 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -525,9 +525,18 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis)
> >  
> >  static void postcopy_temp_pages_cleanup(MigrationIncomingState *mis)
> >  {
> > -    if (mis->postcopy_tmp_page) {
> > -        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
> > -        mis->postcopy_tmp_page = NULL;
> > +    int i;
> > +
> > +    if (mis->postcopy_tmp_pages) {
> > +        for (i = 0; i < mis->postcopy_channels; i++) {
> > +            if (mis->postcopy_tmp_pages[i].tmp_huge_page) {
> > +                munmap(mis->postcopy_tmp_pages[i].tmp_huge_page,
> > +                       mis->largest_page_size);
> > +                mis->postcopy_tmp_pages[i].tmp_huge_page = NULL;
> > +            }
> > +        }
> > +        g_free(mis->postcopy_tmp_pages);
> > +        mis->postcopy_tmp_pages = NULL;
> >      }
> >  
> >      if (mis->postcopy_tmp_zero_page) {
> > @@ -1091,17 +1100,30 @@ retry:
> >  
> >  static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
> >  {
> > -    int err;
> > -
> > -    mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
> > -                                  PROT_READ | PROT_WRITE,
> > -                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > -    if (mis->postcopy_tmp_page == MAP_FAILED) {
> > -        err = errno;
> > -        mis->postcopy_tmp_page = NULL;
> > -        error_report("%s: Failed to map postcopy_tmp_page %s",
> > -                     __func__, strerror(err));
> > -        return -err;
> > +    PostcopyTmpPage *tmp_page;
> > +    int err, i, channels;
> > +    void *temp_page;
> > +
> > +    /* TODO: will be boosted when enabling postcopy preemption */
> > +    mis->postcopy_channels = 1;
> > +
> > +    channels = mis->postcopy_channels;
> > +    mis->postcopy_tmp_pages = g_malloc0(sizeof(PostcopyTmpPage) * channels);
> 
> I noticed we've started using g_malloc0_n in a few places

Sure.

> 
> > +    for (i = 0; i < channels; i++) {
> > +        tmp_page = &mis->postcopy_tmp_pages[i];
> > +        temp_page = mmap(NULL, mis->largest_page_size, PROT_READ | PROT_WRITE,
> > +                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > +        if (temp_page == MAP_FAILED) {
> > +            err = errno;
> > +            error_report("%s: Failed to map postcopy_tmp_pages[%d]: %s",
> > +                         __func__, i, strerror(err));
> 
> > Please call postcopy_temp_pages_cleanup here to clean up previous pages
> > that were successfully allocated.

It'll be cleaned up later here:

  loadvm_postcopy_handle_listen
    postcopy_ram_incoming_setup
      postcopy_temp_pages_setup
    postcopy_ram_incoming_cleanup  <---------- if fail above, go here
      postcopy_temp_pages_cleanup

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 09/15] migration: Add postcopy_thread_create()
  2022-02-03 15:19   ` Dr. David Alan Gilbert
@ 2022-02-08  3:37     ` Peter Xu
  2022-02-08 11:16       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-02-08  3:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

On Thu, Feb 03, 2022 at 03:19:48PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Postcopy creates threads.  A common pattern is that we init a sem and use
> > it to sync with the thread.  Namely, we have fault_thread_sem and
> > listen_thread_sem and they're only used for this.
> > 
> > Make it a shared infrastructure so it's easier to create yet another thread.
> > 
> 
> It might be worth a note saying you now share that sem, so you can't
> start two threads in parallel.

I'll squash this into the patch:

---8<---
diff --git a/migration/migration.h b/migration/migration.h
index 845be3463c..2a311fd8d6 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -72,7 +72,10 @@ struct MigrationIncomingState {
     /* A hook to allow cleanup at the end of incoming migration */
     void *transport_data;
     void (*transport_cleanup)(void *data);
-    /* Used to sync thread creations */
+    /*
+     * Used to sync thread creations.  Note that we can't create threads in
+     * parallel with this sem.
+     */
     QemuSemaphore  thread_sync_sem;
     /*
      * Free at the start of the main state load, set as the main thread finishes
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 099d8ed478..1a3ba1db84 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -79,6 +79,10 @@ int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp)
                                             &pnd);
 }
 
+/*
+ * NOTE: this routine is not thread-safe; we can't call it concurrently.  But
+ * it should be good enough for migration's purposes.
+ */
 void postcopy_thread_create(MigrationIncomingState *mis,
                             QemuThread *thread, const char *name,
                             void *(*fn)(void *), int joinable)
---8<---
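
For reference, the body of postcopy_thread_create() that this comment guards
is essentially just a sem handshake, roughly like below (simplified sketch of
the patch, from memory):

void postcopy_thread_create(MigrationIncomingState *mis,
                            QemuThread *thread, const char *name,
                            void *(*fn)(void *), int joinable)
{
    qemu_sem_init(&mis->thread_sync_sem, 0);
    qemu_thread_create(thread, name, fn, mis, joinable);
    /* Block until the new thread posts thread_sync_sem */
    qemu_sem_wait(&mis->thread_sync_sem);
    qemu_sem_destroy(&mis->thread_sync_sem);
}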

> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thanks,

-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 10/15] migration: Move static var in ram_block_from_stream() into global
  2022-02-03 17:48   ` Dr. David Alan Gilbert
@ 2022-02-08  3:51     ` Peter Xu
  0 siblings, 0 replies; 53+ messages in thread
From: Peter Xu @ 2022-02-08  3:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

On Thu, Feb 03, 2022 at 05:48:31PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > A static variable is very unfriendly to threading of
> > ram_block_from_stream().  Move it into MigrationIncomingState.
> > 
> > Make the incoming state pointer be passed to ram_block_from_stream() at
> > both call sites.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> OK, but I'm not sure if I noticed where you changed this to be per
> channel later?

It's done in the last patch, which starts to pass the "channel" index into
ram_block_from_stream():

static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
                                              QEMUFile *f, int flags,
                                              int channel)
{
    RAMBlock *block = mis->last_recv_block[channel];
    ...
}

I could have moved it into the new PostcopyTmpPage structure, but it'll be a
bit weird because precopy also uses this to cache the block info, hence I made
it an array.
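
To illustrate why the cache needs to be per-channel (illustration only):

/*
 * RAM_SAVE_FLAG_CONTINUE means "same block as the last page on this
 * stream", so with two channels interleaving:
 *
 *   precopy channel:  [block A] [CONTINUE] [CONTINUE] ...
 *   postcopy channel: [block B] [CONTINUE] ...
 *
 * a single shared last_recv_block could make a postcopy CONTINUE resolve
 * to block A.  Hence one cached block pointer per channel.
 */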

> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 14/15] migration: Postcopy preemption on separate channel
  2022-02-03 17:45   ` Dr. David Alan Gilbert
@ 2022-02-08  4:22     ` Peter Xu
  2022-02-08 11:24       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-02-08  4:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

On Thu, Feb 03, 2022 at 05:45:32PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > This patch enables postcopy-preempt feature.
> > 
> > It contains two major changes to the migration logic:
> > 
> >   (1) Postcopy requests are now sent via a different socket from precopy
> >       background migration stream, so as to be isolated from very high page
> >       request delays
> > 
> >   (2) For huge page enabled hosts: when there's postcopy requests, they can now
> >       intercept a partial sending of huge host pages on src QEMU.
> > 
> > After this patch, we'll have two "channels" (or say, sockets, because it's only
> > supported on socket-based channels) for postcopy: (1) PRECOPY channel (which is
> > the default channel that transfers background pages), and (2) POSTCOPY
> > channel (which only transfers requested pages).
> > 
> > On the source QEMU, when we find a postcopy request, we'll interrupt the
> > PRECOPY channel sending process and quickly switch to the POSTCOPY channel.
> > After we've serviced all the high priority postcopy pages, we'll switch
> > back to the PRECOPY channel and continue to send the interrupted huge page.
> > There's no new thread introduced.
> > 
> > On the destination QEMU, one new thread is introduced to receive page data from
> > the postcopy specific socket.
> > 
> > This patch has a side effect.  Previously, after sending postcopy pages,
> > we'd assume the guest will access the follow-up pages and keep sending from
> > there.  Now that's changed: instead of going on from a postcopy requested
> > page, we'll go back and continue sending the precopy huge page (which may
> > have been intercepted by a postcopy request and sent partially before).
> > 
> > Whether that's a problem is debatable, because "assuming the guest will
> > continue to access the next page" doesn't really hold when huge pages are
> > used, especially if the huge page is large (e.g. 1GB pages).  So that
> > locality hint is mostly meaningless if huge pages are used.
> > 
> > If postcopy preempt is enabled, a separate channel is created for it so that it
> > can be used later for postcopy specific page requests.  On dst node, a
> > standalone thread is used to receive postcopy requested pages.  The thread is
> > created along with the ram listen thread during POSTCOPY_LISTEN phase.
> 
> I think this patch could do with being split into two; the first one that
> deals with closing/opening channels; and the second that handles the
> data on the two channels and does the preemption.

Sounds good, I'll give it a shot on the split.

> 
> Another thought is whether, if in the future we allow multifd +
> postcopy, the multifd code would change - I think it would end up closer
> to using multiple channels taking different pages on each one.

Right, so potentially the postcopy channels can be multi-threaded themselves too.

We've had a quick discussion on irc, just to recap: I didn't reuse multifd
infra because IMO multifd is designed with below ideas in mind:

  (1) Every multifd thread is equal
  (2) Throughput oriented

However I found that postcopy needs something different when it's mixed
together with multifd.

Firstly, we will have some channels sending as much as they can where latency
is not an issue (aka background pages).  However that's not suitable for page
requests, so we could also have channels that are servicing page faults from
dst.  In short, there are two types of channels/threads we want, and we may
want to treat them differently.
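
In code terms the two roles map to the channel indexes this series adds
(sketch; RAM_CHANNEL_PRECOPY/RAM_CHANNEL_POSTCOPY are from the series,
RAM_CHANNEL_MAX is assumed here just for sizing the per-channel arrays):

enum {
    RAM_CHANNEL_PRECOPY = 0, /* background pages, throughput oriented */
    RAM_CHANNEL_POSTCOPY,    /* requested pages, latency oriented */
    RAM_CHANNEL_MAX,
};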

The current model is we only have 1 postcopy channel and 1 precopy channel,
but it should be easy to make it N post + 1 pre based on this series.

So far all send() is still done in the migration thread, so there's no new
sender thread, only one more receiver thread. If we want to grow that 1->N
for postcopy channels we may want to move that out too, just like what we do
with multifd.  I'm not sure whether something could be reused there; that's
what I haven't explored yet, but this series should already provide a common
piece of refactoring, e.g. the tmp huge page handling on the dst node being
able to receive multiple huge pages.

This also reminded me: instead of a new capability, should I simply expose a
"postcopy-channels=N" parameter on the CLI so that we can be prepared for
multiple postcopy channels?

> 
> 
> Do we need to do anything in postcopy recovery?

Yes, it's a TODO (in the cover letter); if the whole thing looks sane I'll add
that in the non-RFC series.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 07/15] migration: Introduce postcopy channels on dest node
  2022-02-08  3:27     ` Peter Xu
@ 2022-02-08  9:43       ` Dr. David Alan Gilbert
  2022-02-08 10:07         ` Peter Xu
  0 siblings, 1 reply; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-08  9:43 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Feb 03, 2022 at 03:08:39PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > Postcopy handles huge pages in a special way: currently we can only have
> > > one "channel" to transfer the page.
> > > 
> > > That's because when we install pages using UFFDIO_COPY, we need to have
> > > the whole huge page ready; it also means we need a temp huge page when
> > > trying to receive the whole content of the page.
> > > 
> > > Currently all maintenance around this tmp page is global: firstly we'll
> > > allocate a temp huge page, then we maintain its status mostly within
> > > ram_load_postcopy().
> > > 
> > > To enable multiple channels for postcopy, the first thing we need to do is to
> > > prepare N temp huge pages as caching, one for each channel.
> > > 
> > > Meanwhile we need to maintain the tmp huge page status per-channel too.
> > > 
> > > To give an example, some local variables maintained in ram_load_postcopy()
> > > are listed; they are responsible for maintaining temp huge page status:
> > > 
> > >   - all_zero:     this keeps whether this huge page contains all zeros
> > >   - target_pages: this counts how many target pages have been copied
> > >   - host_page:    this keeps the host ptr for the page to install
> > > 
> > > Move all these fields to be together with the temp huge pages to form a new
> > > structure called PostcopyTmpPage.  Then for each (future) postcopy channel, we
> > > need one structure to keep the state around.
> > > 
> > > For vanilla postcopy, obviously there's only one channel.  It contains both
> > > precopy and postcopy pages.
> > > 
> > > This patch teaches the dest migration node to start realizing the
> > > possible number of postcopy channels by introducing the
> > > "postcopy_channels" variable.  Its value is calculated when setting up
> > > postcopy on the dest node (during the POSTCOPY_LISTEN phase).
> > > 
> > > Vanilla postcopy will have channels=1, but when postcopy-preempt capability is
> > > enabled (in the future), we will boost it to 2 because even during partial
> > > sending of a precopy huge page we still want to preempt it and start sending
> > > the postcopy requested page right away (so we start to keep two temp huge
> > > pages; more if we want to enable multifd).  In this patch there's a TODO
> > > marked for that; so far the channel count is always set to 1.
> > > 
> > > We need to send one "host huge page" on one channel only and we cannot
> > > split it, because otherwise the data of the same huge page could land on
> > > more than one channel and we'd need more complicated logic to manage
> > > that.  One temp host huge page for each channel will be enough for us
> > > for now.
> > > 
> > > Postcopy will still always use the index=0 huge page even after this
> > > patch.  However it prepares for later patches where it can start to use
> > > multiple channels (which needs src intervention, because only src knows
> > > which channel we should use).
> > 
> > Generally OK, some minor nits.
> > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  migration/migration.h    | 35 +++++++++++++++++++++++++++-
> > >  migration/postcopy-ram.c | 50 +++++++++++++++++++++++++++++-----------
> > >  migration/ram.c          | 43 +++++++++++++++++-----------------
> > >  3 files changed, 91 insertions(+), 37 deletions(-)
> > > 
> > > diff --git a/migration/migration.h b/migration/migration.h
> > > index 8130b703eb..8bb2931312 100644
> > > --- a/migration/migration.h
> > > +++ b/migration/migration.h
> > > @@ -45,6 +45,24 @@ struct PostcopyBlocktimeContext;
> > >   */
> > >  #define CLEAR_BITMAP_SHIFT_MAX            31
> > >  
> > > +/* This is an abstraction of a "temp huge page" for postcopy's purpose */
> > > +typedef struct {
> > > +    /*
> > > +     * This points to a temporary huge page as a buffer for UFFDIO_COPY.  It's
> > > +     * mmap()ed and needs to be freed when cleanup.
> > > +     */
> > > +    void *tmp_huge_page;
> > > +    /*
> > > +     * This points to the host page we're going to install for this temp page.
> > > +     * It tells us after we've received the whole page, where we should put it.
> > > +     */
> > > +    void *host_addr;
> > > +    /* Number of small pages copied (in size of TARGET_PAGE_SIZE) */
> > > +    int target_pages;
> > 
> > Can we take the opportunity to convert this to an unsigned?
> 
> Sure.
> 
> > 
> > > +    /* Whether this page contains all zeros */
> > > +    bool all_zero;
> > > +} PostcopyTmpPage;
> > > +
> > >  /* State for the incoming migration */
> > >  struct MigrationIncomingState {
> > >      QEMUFile *from_src_file;
> > > @@ -81,7 +99,22 @@ struct MigrationIncomingState {
> > >      QemuMutex rp_mutex;    /* We send replies from multiple threads */
> > >      /* RAMBlock of last request sent to source */
> > >      RAMBlock *last_rb;
> > > -    void     *postcopy_tmp_page;
> > > +    /*
> > > +     * Number of postcopy channels including the default precopy channel, so
> > > +     * vanilla postcopy will only contain one channel which contain both
> > > +     * precopy and postcopy streams.
> > > +     *
> > > +     * This is calculated when the src requests to enable postcopy but before
> > > +     * it starts.  Its value can depend on e.g. whether postcopy preemption is
> > > +     * enabled.
> > > +     */
> > > +    int       postcopy_channels;
> > 
> > Also unsigned?
> 
> OK.
> 
> > 
> > > +    /*
> > > +     * An array of temp host huge pages to be used, one for each postcopy
> > > +     * channel.
> > > +     */
> > > +    PostcopyTmpPage *postcopy_tmp_pages;
> > > +    /* This is shared for all postcopy channels */
> > >      void     *postcopy_tmp_zero_page;
> > >      /* PostCopyFD's for external userfaultfds & handlers of shared memory */
> > >      GArray   *postcopy_remote_fds;
> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > > index e662dd05cc..d78e1b9373 100644
> > > --- a/migration/postcopy-ram.c
> > > +++ b/migration/postcopy-ram.c
> > > @@ -525,9 +525,18 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis)
> > >  
> > >  static void postcopy_temp_pages_cleanup(MigrationIncomingState *mis)
> > >  {
> > > -    if (mis->postcopy_tmp_page) {
> > > -        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
> > > -        mis->postcopy_tmp_page = NULL;
> > > +    int i;
> > > +
> > > +    if (mis->postcopy_tmp_pages) {
> > > +        for (i = 0; i < mis->postcopy_channels; i++) {
> > > +            if (mis->postcopy_tmp_pages[i].tmp_huge_page) {
> > > +                munmap(mis->postcopy_tmp_pages[i].tmp_huge_page,
> > > +                       mis->largest_page_size);
> > > +                mis->postcopy_tmp_pages[i].tmp_huge_page = NULL;
> > > +            }
> > > +        }
> > > +        g_free(mis->postcopy_tmp_pages);
> > > +        mis->postcopy_tmp_pages = NULL;
> > >      }
> > >  
> > >      if (mis->postcopy_tmp_zero_page) {
> > > @@ -1091,17 +1100,30 @@ retry:
> > >  
> > >  static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
> > >  {
> > > -    int err;
> > > -
> > > -    mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
> > > -                                  PROT_READ | PROT_WRITE,
> > > -                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > -    if (mis->postcopy_tmp_page == MAP_FAILED) {
> > > -        err = errno;
> > > -        mis->postcopy_tmp_page = NULL;
> > > -        error_report("%s: Failed to map postcopy_tmp_page %s",
> > > -                     __func__, strerror(err));
> > > -        return -err;
> > > +    PostcopyTmpPage *tmp_page;
> > > +    int err, i, channels;
> > > +    void *temp_page;
> > > +
> > > +    /* TODO: will be boosted when enabling postcopy preemption */
> > > +    mis->postcopy_channels = 1;
> > > +
> > > +    channels = mis->postcopy_channels;
> > > +    mis->postcopy_tmp_pages = g_malloc0(sizeof(PostcopyTmpPage) * channels);
> > 
> > I noticed we've started using g_malloc0_n in a few places
> 
> Sure.
> 
> > 
> > > +    for (i = 0; i < channels; i++) {
> > > +        tmp_page = &mis->postcopy_tmp_pages[i];
> > > +        temp_page = mmap(NULL, mis->largest_page_size, PROT_READ | PROT_WRITE,
> > > +                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > +        if (temp_page == MAP_FAILED) {
> > > +            err = errno;
> > > +            error_report("%s: Failed to map postcopy_tmp_pages[%d]: %s",
> > > +                         __func__, i, strerror(err));
> > 
> > Please call postcopy_temp_pages_cleanup here to clean up previous pages
> > that were successfully allocated.
> 
> It'll be cleaned up later here:
> 
>   loadvm_postcopy_handle_listen
>     postcopy_ram_incoming_setup
>       postcopy_temp_pages_setup
>     postcopy_ram_incoming_cleanup  <---------- if fail above, go here
>       postcopy_temp_pages_cleanup

Ah OK, it might still be worth a comment.

Dave

> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 07/15] migration: Introduce postcopy channels on dest node
  2022-02-08  9:43       ` Dr. David Alan Gilbert
@ 2022-02-08 10:07         ` Peter Xu
  0 siblings, 0 replies; 53+ messages in thread
From: Peter Xu @ 2022-02-08 10:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

On Tue, Feb 08, 2022 at 09:43:49AM +0000, Dr. David Alan Gilbert wrote:
> > It'll be cleaned up later here:
> > 
> >   loadvm_postcopy_handle_listen
> >     postcopy_ram_incoming_setup
> >       postcopy_temp_pages_setup
> >     postcopy_ram_incoming_cleanup  <---------- if fail above, go here
> >       postcopy_temp_pages_cleanup
> 
> Ah OK, it might still be worth a comment.

Will do.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 09/15] migration: Add postcopy_thread_create()
  2022-02-08  3:37     ` Peter Xu
@ 2022-02-08 11:16       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-08 11:16 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Feb 03, 2022 at 03:19:48PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > Postcopy creates threads.  A common pattern is that we init a sem and
> > > use it to sync with the thread.  Namely, we have fault_thread_sem and
> > > listen_thread_sem and they're only used for this.
> > > 
> > > Make it a shared infrastructure so it's easier to create yet another thread.
> > > 
> > 
> > It might be worth a note saying you now share that sem, so you can't
> > start two threads in parallel.
> 
> I'll squash this into the patch:

Thanks

> ---8<---
> diff --git a/migration/migration.h b/migration/migration.h
> index 845be3463c..2a311fd8d6 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -72,7 +72,10 @@ struct MigrationIncomingState {
>      /* A hook to allow cleanup at the end of incoming migration */
>      void *transport_data;
>      void (*transport_cleanup)(void *data);
> -    /* Used to sync thread creations */
> +    /*
> +     * Used to sync thread creations.  Note that we can't create threads in
> +     * parallel with this sem.
> +     */
>      QemuSemaphore  thread_sync_sem;
>      /*
>       * Free at the start of the main state load, set as the main thread finishes
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 099d8ed478..1a3ba1db84 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -79,6 +79,10 @@ int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp)
>                                              &pnd);
>  }
>  
> +/*
> + * NOTE: this routine is not thread-safe; we can't call it concurrently.  But
> + * it should be good enough for migration's purposes.
> + */
>  void postcopy_thread_create(MigrationIncomingState *mis,
>                              QemuThread *thread, const char *name,
>                              void *(*fn)(void *), int joinable)
> ---8<---
> 
> > 
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 14/15] migration: Postcopy preemption on separate channel
  2022-02-08  4:22     ` Peter Xu
@ 2022-02-08 11:24       ` Dr. David Alan Gilbert
  2022-02-08 11:39         ` Peter Xu
  0 siblings, 1 reply; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-08 11:24 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Feb 03, 2022 at 05:45:32PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > This patch enables postcopy-preempt feature.
> > > 
> > > It contains two major changes to the migration logic:
> > > 
> > >   (1) Postcopy requests are now sent via a different socket from precopy
> > >       background migration stream, so as to be isolated from very high page
> > >       request delays
> > > 
> > >   (2) For huge page enabled hosts: when there's postcopy requests, they can now
> > >       intercept a partial sending of huge host pages on src QEMU.
> > > 
> > > After this patch, we'll have two "channels" (or say, sockets, because it's only
> > > supported on socket-based channels) for postcopy: (1) PRECOPY channel (which is
> > > the default channel that transfers background pages), and (2) POSTCOPY
> > > channel (which only transfers requested pages).
> > > 
> > > On the source QEMU, when we find a postcopy request, we'll interrupt the
> > > PRECOPY channel sending process and quickly switch to the POSTCOPY
> > > channel.  After we've serviced all the high priority postcopy pages,
> > > we'll switch back to the PRECOPY channel and continue to send the
> > > interrupted huge page.
> > > There's no new thread introduced.
> > > 
> > > On the destination QEMU, one new thread is introduced to receive page data from
> > > the postcopy specific socket.
> > > 
> > > This patch has a side effect.  Previously, after sending postcopy pages,
> > > we'd assume the guest will access the follow-up pages and keep sending
> > > from there.  Now that's changed: instead of going on from a postcopy
> > > requested page, we'll go back and continue sending the precopy huge page
> > > (which may have been intercepted by a postcopy request and sent partially
> > > before).
> > > 
> > > Whether that's a problem is debatable, because "assuming the guest will
> > > continue to access the next page" doesn't really hold when huge pages are
> > > used, especially if the huge page is large (e.g. 1GB pages).  So that
> > > locality hint is mostly meaningless if huge pages are used.
> > > 
> > > If postcopy preempt is enabled, a separate channel is created for it so that it
> > > can be used later for postcopy specific page requests.  On dst node, a
> > > standalone thread is used to receive postcopy requested pages.  The thread is
> > > created along with the ram listen thread during POSTCOPY_LISTEN phase.
> > 
> > I think this patch could do with being split into two; the first one that
> > deals with closing/opening channels; and the second that handles the
> > data on the two channels and does the preemption.
> 
> Sounds good, I'll give it a shot on the split.
> 
> > 
> > Another thought is whether, if in the future we allow multifd +
> > postcopy, the multifd code would change - I think it would end up closer
> > to using multiple channels taking different pages on each one.
> 
> Right, so potentially the postcopy channels can be multi-threaded themselves too.
> 
> We've had a quick discussion on irc, just to recap: I didn't reuse multifd
> infra because IMO multifd is designed with below ideas in mind:
> 
>   (1) Every multifd thread is equal
>   (2) Throughput oriented
> 
> However I found that postcopy needs something different when it's mixed
> together with multifd.
> 
> Firstly, we will have some channels sending as much as they can where latency
> is not an issue (aka background pages).  However that's not suitable for page
> requests, so we could also have channels that are servicing page faults from
> dst.  In short, there are two types of channels/threads we want, and we may
> want to treat them differently.
> 
> The current model is we only have 1 postcopy channel and 1 precopy channel,
> but it should be easy to make it N post + 1 pre based on this series.

It's not clear to me if we need to be able to do N post + M pre, or
whether we have a rule like always at least 1 post, but if there are more
page faults in the queue then you can steal all of the pre channels.

> So far all send() is still done in the migration thread, so there's no new
> sender thread, only one more receiver thread. If we want to grow that 1->N
> for postcopy channels we may want to move that out too, just like what we do
> with multifd.  I'm not sure whether something could be reused there; that's
> what I haven't explored yet, but this series should already provide a common
> piece of refactoring, e.g. the tmp huge page handling on the dst node being
> able to receive multiple huge pages.

Right; it makes me think the multifd+postcopy should just use channels.

> This also reminded me: instead of a new capability, should I simply expose a
> "postcopy-channels=N" parameter on the CLI so that we can be prepared for
> multiple postcopy channels?

I'm not sure we know enough yet about what configuration it would have;
I'd be tempted to just make it work for the user by enabling both
multifd and preemption and then using this new mechanism rather than
having to add yet another parameter.

Dave

> > 
> > 
> > Do we need to do anything in postcopy recovery?
> 
> Yes.  It's a todo (in the cover letter); if the whole thing looks sane I'll
> add that in the non-RFC series.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
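
As a rough illustration of the dst-side design discussed above - one
dedicated receiver thread per channel, created during POSTCOPY_LISTEN -
here is a minimal sketch.  This is not the patch code: the thread body and
the field names (preempt_quit, preempt_thread, postcopy_qemufile_dst) are
assumptions for illustration only.

    /* Sketch only: spawn a standalone thread for the preempt channel,
     * next to the existing ram listen thread.  Field names are made up. */
    static void *postcopy_preempt_thread(void *opaque)
    {
        MigrationIncomingState *mis = opaque;

        /* Keep receiving urgently requested pages on the preempt channel */
        while (!mis->preempt_quit) {
            ram_load_postcopy(mis->postcopy_qemufile_dst);
        }
        return NULL;
    }

    static void postcopy_listen_setup(MigrationIncomingState *mis)
    {
        /* The ram listen thread keeps draining the precopy stream; the
         * preempt thread only ever sees postcopy requested pages, so a
         * page fault is never stuck behind background traffic. */
        qemu_thread_create(&mis->preempt_thread, "mig/dst/preempt",
                           postcopy_preempt_thread, mis,
                           QEMU_THREAD_JOINABLE);
    }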




* Re: [PATCH RFC 14/15] migration: Postcopy preemption on separate channel
  2022-02-08 11:24       ` Dr. David Alan Gilbert
@ 2022-02-08 11:39         ` Peter Xu
  2022-02-08 13:23           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Xu @ 2022-02-08 11:39 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

On Tue, Feb 08, 2022 at 11:24:14AM +0000, Dr. David Alan Gilbert wrote:
> > The current model is we only have 1 postcopy channel and 1 precopy channel, but
> > it should be easier if we want to make it N post + 1 pre based on this series.
> 
> It's not clear to me if we need to be able to do N post + M pre, or
> whether we have a rule like: always at least 1 post, but if there are more
> page faults in the queue then you can steal all of the pre channels.

Right, a queue length >1 should easily happen with workloads in a real cloud
environment.  Though even with only 1 post channel we can already keep latency
below ~1ms with this series, even with 16 pending requests, per my tests.  I
think that may cover quite a few real workloads.

One thing to mention is that we should always assume the pre channels already
have tons of pages queued in the NIC send buffer, so they won't be good
candidates for postcopy requests, IMHO.  So I'm not sure whether we can mix use
of the pre/post channels - we may need to keep the post channels idle.

Then, if we keep some of the multifd channels idle, it becomes something other
than the existing multifd, since we will start to treat threads and channels
differently and break the "equality" rule of the strict multifd world.

> > This also reminds me: instead of a new capability, should I simply expose a
> > parameter "postcopy-channels=N" on the CLI, so that we're prepared for
> > multiple postcopy channels?
> 
> I'm not sure we know enough yet about what configuration it would have;
> I'd be tempted to just make it work for the user by enabling both
> multifd and preemption and then using this new mechanism rather than
> having to add yet another parameter.

Let me stick with the current capability bit then, so as to make it 1 pre + 1
post.  And we can leave N pre + 1 post for later.

Thanks,

-- 
Peter Xu
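
To make the send-side policy of the 1 pre + 1 post model concrete, a small
sketch follows.  Only pss->postcopy_requested comes from the series (patch
11); migrate_postcopy_preempt() and postcopy_qemufile_src are assumed names
for the capability check and the new channel.

    /* Sketch: pick the outgoing channel per page.  Urgent pages (dst page
     * faults) bypass the precopy stream, whose socket send buffer may
     * already be deep; background pages stay on the precopy stream. */
    static QEMUFile *pss_channel(MigrationState *s, PageSearchStatus *pss)
    {
        if (migrate_postcopy_preempt() && pss->postcopy_requested) {
            return s->postcopy_qemufile_src;   /* minimum latency */
        }
        return s->to_dst_file;                 /* maximum throughput */
    }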




* Re: [PATCH RFC 14/15] migration: Postcopy preemption on separate channel
  2022-02-08 11:39         ` Peter Xu
@ 2022-02-08 13:23           ` Dr. David Alan Gilbert
  2022-02-09  2:16             ` Peter Xu
  0 siblings, 1 reply; 53+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-08 13:23 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Feb 08, 2022 at 11:24:14AM +0000, Dr. David Alan Gilbert wrote:
> > > The current model is we only have 1 postcopy channel and 1 precopy channel, but
> > > it should be easier if we want to make it N post + 1 pre based on this series.
> > 
> > It's not clear to me if we need to be able to do N post + M pre, or
> > whether we have a rule like: always at least 1 post, but if there are more
> > page faults in the queue then you can steal all of the pre channels.
> 
> Right, a queue length >1 should easily happen with workloads in a real cloud
> environment.  Though even with only 1 post channel we can already keep latency
> below ~1ms with this series, even with 16 pending requests, per my tests.  I
> think that may cover quite a few real workloads.
> 
> One thing to mention is that we should always assume the pre channels already
> have tons of pages queued in the NIC send buffer, so they won't be good
> candidates for postcopy requests, IMHO.  So I'm not sure whether we can mix use
> of the pre/post channels - we may need to keep the post channels idle.

No, I'm not sure either; even with separate channels, do we have problems
with contention on the NIC?

Dave

> Then, if we keep some of the multifd channels idle, it becomes something other
> than the existing multifd, since we will start to treat threads and channels
> differently and break the "equality" rule of the strict multifd world.
> 
> > > This also reminds me: instead of a new capability, should I simply expose a
> > > parameter "postcopy-channels=N" on the CLI, so that we're prepared for
> > > multiple postcopy channels?
> > 
> > I'm not sure we know enough yet about what configuration it would have;
> > I'd be tempted to just make it work for the user by enabling both
> > multifd and preemption and then using this new mechanism rather than
> > having to add yet another parameter.
> 
> Let me stick with the current capability bit then, so as to make it 1 pre + 1
> post.  And we can leave N pre + 1 post for later.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 14/15] migration: Postcopy preemption on separate channel
  2022-02-08 13:23           ` Dr. David Alan Gilbert
@ 2022-02-09  2:16             ` Peter Xu
  0 siblings, 0 replies; 53+ messages in thread
From: Peter Xu @ 2022-02-09  2:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Leonardo Bras Soares Passos

On Tue, Feb 08, 2022 at 01:23:29PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Tue, Feb 08, 2022 at 11:24:14AM +0000, Dr. David Alan Gilbert wrote:
> > > > The current model is we only have 1 postcopy channel and 1 precopy channel, but
> > > > it should be easier if we want to make it N post + 1 pre based on this series.
> > > 
> > > It's not clear to me if we need to be able to do N post + M pre, or
> > > whether we have a rule like: always at least 1 post, but if there are more
> > > page faults in the queue then you can steal all of the pre channels.
> > 
> > Right, a queue length >1 should easily happen with workloads in a real cloud
> > environment.  Though even with only 1 post channel we can already keep latency
> > below ~1ms with this series, even with 16 pending requests, per my tests.  I
> > think that may cover quite a few real workloads.
> > 
> > One thing to mention is that we should always assume the pre channels already
> > have tons of pages queued in the NIC send buffer, so they won't be good
> > candidates for postcopy requests, IMHO.  So I'm not sure whether we can mix use
> > of the pre/post channels - we may need to keep the post channels idle.
> 
> No, I'm not sure either; even with separate channels, do we have problems
> with contention on the NIC?

Not on the NIC, but on the same socket, assuming each multifd thread works
with only one socket.

For example, say we have N multifd threads/sockets.  If we find some of them
are "free" from the multifd POV, it only means we can write() to those sockets;
it does not mean those sockets have empty send buffers.

IMHO that's the major problem we're facing: as long as a socket is shared
between pre and post purposes, the post pages can end up queued behind pre pages.

I think what we could do to provision multiple sockets is to keep M out of N
multifd sockets idle (M<N), servicing postcopy faults only.  Those M
sockets/threads would work quite differently from the rest of the multifd
threads, though, so we probably can't call it multifd anymore..

-- 
Peter Xu
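
A standalone sketch of that M-out-of-N provisioning idea (all names
invented, not multifd code): reserve M channels that never carry background
pages, so their socket send buffers stay empty and any of them can answer a
page fault immediately.

    #include <stdbool.h>
    #include <stddef.h>

    enum chan_role { CHAN_BACKGROUND, CHAN_FAULT };

    struct mig_channel {
        int sockfd;
        enum chan_role role;  /* fixed at setup: M fault, N-M background */
    };

    /* Fault channels are kept idle of background traffic, so a writable
     * fault channel really means low latency; a "free" background channel
     * may still have a deep send buffer (the head-of-line problem above). */
    static struct mig_channel *pick_channel(struct mig_channel *ch,
                                            size_t n, bool urgent)
    {
        enum chan_role want = urgent ? CHAN_FAULT : CHAN_BACKGROUND;

        for (size_t i = 0; i < n; i++) {
            if (ch[i].role == want) {
                return &ch[i];
            }
        }
        return NULL;
    }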




end of thread, other threads:[~2022-02-09  2:18 UTC | newest]

Thread overview: 53+ messages
2022-01-19  8:09 [PATCH RFC 00/15] migration: Postcopy Preemption Peter Xu
2022-01-19  8:09 ` [PATCH RFC 01/15] migration: No off-by-one for pss->page update in host page size Peter Xu
2022-01-19 12:58   ` Dr. David Alan Gilbert
2022-01-27  9:40   ` Juan Quintela
2022-01-19  8:09 ` [PATCH RFC 02/15] migration: Allow pss->page jump over clean pages Peter Xu
2022-01-19 13:42   ` Dr. David Alan Gilbert
2022-01-20  2:12     ` Peter Xu
2022-02-03 18:19       ` Dr. David Alan Gilbert
2022-02-08  3:20         ` Peter Xu
2022-01-19  8:09 ` [PATCH RFC 03/15] migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat Peter Xu
2022-01-19 14:15   ` Dr. David Alan Gilbert
2022-01-27  9:40   ` Juan Quintela
2022-01-19  8:09 ` [PATCH RFC 04/15] migration: Add postcopy_has_request() Peter Xu
2022-01-19 14:27   ` Dr. David Alan Gilbert
2022-01-27  9:41   ` Juan Quintela
2022-01-19  8:09 ` [PATCH RFC 05/15] migration: Simplify unqueue_page() Peter Xu
2022-01-19 16:36   ` Dr. David Alan Gilbert
2022-01-20  2:23     ` Peter Xu
2022-01-25 11:01       ` Dr. David Alan Gilbert
2022-01-27  9:41   ` Juan Quintela
2022-01-19  8:09 ` [PATCH RFC 06/15] migration: Move temp page setup and cleanup into separate functions Peter Xu
2022-01-19 16:58   ` Dr. David Alan Gilbert
2022-01-27  9:43   ` Juan Quintela
2022-01-19  8:09 ` [PATCH RFC 07/15] migration: Introduce postcopy channels on dest node Peter Xu
2022-02-03 15:08   ` Dr. David Alan Gilbert
2022-02-08  3:27     ` Peter Xu
2022-02-08  9:43       ` Dr. David Alan Gilbert
2022-02-08 10:07         ` Peter Xu
2022-01-19  8:09 ` [PATCH RFC 08/15] migration: Dump ramblock and offset too when non-same-page detected Peter Xu
2022-02-03 15:15   ` Dr. David Alan Gilbert
2022-01-19  8:09 ` [PATCH RFC 09/15] migration: Add postcopy_thread_create() Peter Xu
2022-02-03 15:19   ` Dr. David Alan Gilbert
2022-02-08  3:37     ` Peter Xu
2022-02-08 11:16       ` Dr. David Alan Gilbert
2022-01-19  8:09 ` [PATCH RFC 10/15] migration: Move static var in ram_block_from_stream() into global Peter Xu
2022-02-03 17:48   ` Dr. David Alan Gilbert
2022-02-08  3:51     ` Peter Xu
2022-01-19  8:09 ` [PATCH RFC 11/15] migration: Add pss.postcopy_requested status Peter Xu
2022-02-03 15:42   ` Dr. David Alan Gilbert
2022-01-19  8:09 ` [PATCH RFC 12/15] migration: Move migrate_allow_multifd and helpers into migration.c Peter Xu
2022-02-03 15:44   ` Dr. David Alan Gilbert
2022-01-19  8:09 ` [PATCH RFC 13/15] migration: Add postcopy-preempt capability Peter Xu
2022-02-03 15:46   ` Dr. David Alan Gilbert
2022-01-19  8:09 ` [PATCH RFC 14/15] migration: Postcopy preemption on separate channel Peter Xu
2022-02-03 17:45   ` Dr. David Alan Gilbert
2022-02-08  4:22     ` Peter Xu
2022-02-08 11:24       ` Dr. David Alan Gilbert
2022-02-08 11:39         ` Peter Xu
2022-02-08 13:23           ` Dr. David Alan Gilbert
2022-02-09  2:16             ` Peter Xu
2022-01-19  8:09 ` [PATCH RFC 15/15] tests: Add postcopy preempt test Peter Xu
2022-02-03 15:53   ` Dr. David Alan Gilbert
2022-01-19 12:32 ` [PATCH RFC 00/15] migration: Postcopy Preemption Dr. David Alan Gilbert
