* [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram
@ 2018-02-16 13:15 Dr. David Alan Gilbert (git)
  2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 01/29] migrate: Update ram_block_discard_range for shared Dr. David Alan Gilbert (git)
                   ` (29 more replies)
  0 siblings, 30 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:15 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Hi,
  This is the first non-RFC version of this patch set, which
enables postcopy migration with shared memory to a vhost-user process.
It's based on current head.

I've tested with vhost-user-bridge and a modified dpdk; both very
lightly.

Compared to v2, this series uses the just-merged reworks to the vhost
code (suggested by Igor), which make the huge page region merging a lot
simpler here. The set-mem-table handshake between the client and qemu
is now a bit more complex, to resolve a race in which the client could
start sending requests to qemu before qemu was ready to accept them.

Dave

Dr. David Alan Gilbert (29):
  migrate: Update ram_block_discard_range for shared
  qemu_ram_block_host_offset
  postcopy: use UFFDIO_ZEROPAGE only when available
  postcopy: Add notifier chain
  postcopy: Add vhost-user flag for postcopy and check it
  vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message
  libvhost-user: Support sending fds back to qemu
  libvhost-user: Open userfaultfd
  postcopy: Allow registering of fd handler
  vhost+postcopy: Register shared ufd with postcopy
  vhost+postcopy: Transmit 'listen' to client
  postcopy+vhost-user: Split set_mem_table for postcopy
  migration/ram: ramblock_recv_bitmap_test_byte_offset
  libvhost-user+postcopy: Register new regions with the ufd
  vhost+postcopy: Send address back to qemu
  vhost+postcopy: Stash RAMBlock and offset
  vhost+postcopy: Send requests to source for shared pages
  vhost+postcopy: Resolve client address
  postcopy: wake shared
  postcopy: postcopy_notify_shared_wake
  vhost+postcopy: Add vhost waker
  vhost+postcopy: Call wakeups
  libvhost-user: mprotect & madvises for postcopy
  vhost-user: Add VHOST_USER_POSTCOPY_END message
  vhost+postcopy: Wire up POSTCOPY_END notify
  vhost: Huge page align and merge
  postcopy: Allow shared memory
  libvhost-user: Claim support for postcopy
  postcopy shared docs

 contrib/libvhost-user/libvhost-user.c | 303 ++++++++++++++++++++++++-
 contrib/libvhost-user/libvhost-user.h |   8 +
 docs/devel/migration.rst              |  41 ++++
 docs/interop/vhost-user.txt           |  42 ++++
 exec.c                                |  85 +++++--
 hw/virtio/trace-events                |  16 +-
 hw/virtio/vhost-user.c                | 411 +++++++++++++++++++++++++++++++++-
 hw/virtio/vhost.c                     |  66 +++++-
 include/exec/cpu-common.h             |   4 +
 migration/migration.c                 |   6 +
 migration/migration.h                 |   4 +
 migration/postcopy-ram.c              | 350 +++++++++++++++++++++++------
 migration/postcopy-ram.h              |  69 ++++++
 migration/ram.c                       |   5 +
 migration/ram.h                       |   1 +
 migration/savevm.c                    |  13 ++
 migration/trace-events                |   6 +
 trace-events                          |   3 +-
 vl.c                                  |   2 +
 19 files changed, 1337 insertions(+), 98 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH v3 01/29] migrate: Update ram_block_discard_range for shared
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
@ 2018-02-16 13:15 ` Dr. David Alan Gilbert (git)
  2018-02-28  6:37   ` Peter Xu
  2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 02/29] qemu_ram_block_host_offset Dr. David Alan Gilbert (git)
                   ` (28 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:15 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The choice of call used to discard a block is getting more complicated
as we handle more cases.  Use fallocate(FALLOC_FL_PUNCH_HOLE) for any
file-backed case; it works for both hugepage and tmpfs backends.
Use madvise(MADV_DONTNEED) for the non-hugepage cases where the
mapping is anonymous or private.

Care should be taken when trying other backing files.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 exec.c       | 60 ++++++++++++++++++++++++++++++++++++++++++++++--------------
 trace-events |  3 ++-
 2 files changed, 48 insertions(+), 15 deletions(-)

diff --git a/exec.c b/exec.c
index e8d7b335b6..b1bb477776 100644
--- a/exec.c
+++ b/exec.c
@@ -3702,6 +3702,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
     }
 
     if ((start + length) <= rb->used_length) {
+        bool need_madvise, need_fallocate;
         uint8_t *host_endaddr = host_startaddr + length;
         if ((uintptr_t)host_endaddr & (rb->page_size - 1)) {
             error_report("ram_block_discard_range: Unaligned end address: %p",
@@ -3711,29 +3712,60 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
 
         errno = ENOTSUP; /* If we are missing MADVISE etc */
 
-        if (rb->page_size == qemu_host_page_size) {
-#if defined(CONFIG_MADVISE)
-            /* Note: We need the madvise MADV_DONTNEED behaviour of definitely
-             * freeing the page.
-             */
-            ret = madvise(host_startaddr, length, MADV_DONTNEED);
-#endif
-        } else {
-            /* Huge page case  - unfortunately it can't do DONTNEED, but
-             * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
-             * huge page file.
+        /* The logic here is messy;
+         *    madvise DONTNEED fails for hugepages
+         *    fallocate works on hugepages and shmem
+         */
+        need_madvise = (rb->page_size == qemu_host_page_size);
+        need_fallocate = rb->fd != -1;
+        if (need_fallocate) {
+            /* For a file, this causes the area of the file to be zero'd
+             * if read, and for hugetlbfs also causes it to be unmapped
+             * so a userfault will trigger.
              */
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
             ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             start, length);
+            if (ret) {
+                ret = -errno;
+                error_report("ram_block_discard_range: Failed to fallocate "
+                             "%s:%" PRIx64 " +%zx (%d)",
+                             rb->idstr, start, length, ret);
+                goto err;
+            }
+#else
+            ret = -ENOSYS;
+            error_report("ram_block_discard_range: fallocate not available/file"
+                         "%s:%" PRIx64 " +%zx (%d)",
+                         rb->idstr, start, length, ret);
+            goto err;
 #endif
         }
-        if (ret) {
-            ret = -errno;
-            error_report("ram_block_discard_range: Failed to discard range "
+        if (need_madvise) {
+            /* For normal RAM this causes it to be unmapped,
+             * for shared memory it causes the local mapping to disappear
+             * and to fall back on the file contents (which we just
+             * fallocate'd away).
+             */
+#if defined(CONFIG_MADVISE)
+            ret =  madvise(host_startaddr, length, MADV_DONTNEED);
+            if (ret) {
+                ret = -errno;
+                error_report("ram_block_discard_range: Failed to discard range "
+                             "%s:%" PRIx64 " +%zx (%d)",
+                             rb->idstr, start, length, ret);
+                goto err;
+            }
+#else
+            ret = -ENOSYS;
+            error_report("ram_block_discard_range: MADVISE not available"
                          "%s:%" PRIx64 " +%zx (%d)",
                          rb->idstr, start, length, ret);
+            goto err;
+#endif
         }
+        trace_ram_block_discard_range(rb->idstr, host_startaddr,
+                                      need_madvise, need_fallocate, ret);
     } else {
         error_report("ram_block_discard_range: Overrun block '%s' (%" PRIu64
                      "/%zx/" RAM_ADDR_FMT")",
diff --git a/trace-events b/trace-events
index ec95e67089..bf9741d930 100644
--- a/trace-events
+++ b/trace-events
@@ -55,9 +55,10 @@ dma_complete(void *dbs, int ret, void *cb) "dbs=%p ret=%d cb=%p"
 dma_blk_cb(void *dbs, int ret) "dbs=%p ret=%d"
 dma_map_wait(void *dbs) "dbs=%p"
 
-#  # exec.c
+# exec.c
 find_ram_offset(uint64_t size, uint64_t offset) "size: 0x%" PRIx64 " @ 0x%" PRIx64
 find_ram_offset_loop(uint64_t size, uint64_t candidate, uint64_t offset, uint64_t next, uint64_t mingap) "trying size: 0x%" PRIx64 " @ 0x%" PRIx64 ", offset: 0x%" PRIx64" next: 0x%" PRIx64 " mingap: 0x%" PRIx64
+ram_block_discard_range(const char *rbname, void *hva, bool need_madvise, bool need_fallocate, int ret) "%s@%p: madvise: %d fallocate: %d ret: %d"
 
 # memory.c
 memory_region_ops_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
-- 
2.14.3


* [Qemu-devel] [PATCH v3 02/29] qemu_ram_block_host_offset
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
  2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 01/29] migrate: Update ram_block_discard_range for shared Dr. David Alan Gilbert (git)
@ 2018-02-16 13:15 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 03/29] postcopy: use UFFDIO_ZEROPAGE only when available Dr. David Alan Gilbert (git)
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:15 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a utility to return the offset of a host pointer within a RAMBlock
(assuming we already know it's in that RAMBlock).

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 exec.c                    | 10 ++++++++++
 include/exec/cpu-common.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/exec.c b/exec.c
index b1bb477776..0ec73bc917 100644
--- a/exec.c
+++ b/exec.c
@@ -2293,6 +2293,16 @@ static void *qemu_ram_ptr_length(RAMBlock *ram_block, ram_addr_t addr,
     return ramblock_ptr(block, addr);
 }
 
+/* Return the offset of a hostpointer within a ramblock */
+ram_addr_t qemu_ram_block_host_offset(RAMBlock *rb, void *host)
+{
+    ram_addr_t res = (uint8_t *)host - (uint8_t *)rb->host;
+    assert((uintptr_t)host >= (uintptr_t)rb->host);
+    assert(res < rb->max_length);
+
+    return res;
+}
+
 /*
  * Translates a host ptr back to a RAMBlock, a ram_addr and an offset
  * in that RAMBlock.
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 74341b19d2..0d861a6289 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -68,6 +68,7 @@ ram_addr_t qemu_ram_addr_from_host(void *ptr);
 RAMBlock *qemu_ram_block_by_name(const char *name);
 RAMBlock *qemu_ram_block_from_host(void *ptr, bool round_offset,
                                    ram_addr_t *offset);
+ram_addr_t qemu_ram_block_host_offset(RAMBlock *rb, void *host);
 void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
 void qemu_ram_unset_idstr(RAMBlock *block);
 const char *qemu_ram_get_idstr(RAMBlock *rb);
-- 
2.14.3


* [Qemu-devel] [PATCH v3 03/29] postcopy: use UFFDIO_ZEROPAGE only when available
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
  2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 01/29] migrate: Update ram_block_discard_range for shared Dr. David Alan Gilbert (git)
  2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 02/29] qemu_ram_block_host_offset Dr. David Alan Gilbert (git)
@ 2018-02-16 13:15 ` Dr. David Alan Gilbert (git)
  2018-02-28  6:53   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 04/29] postcopy: Add notifier chain Dr. David Alan Gilbert (git)
                   ` (26 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:15 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a flag on the RAMBlock to state whether it has the
UFFDIO_ZEROPAGE capability, and use it when it's available.

This allows the use of postcopy on tmpfs as well as on
hugepage-backed files.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 exec.c                    | 15 +++++++++++++++
 include/exec/cpu-common.h |  3 +++
 migration/postcopy-ram.c  | 13 ++++++++++---
 3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/exec.c b/exec.c
index 0ec73bc917..1dc15298c2 100644
--- a/exec.c
+++ b/exec.c
@@ -99,6 +99,11 @@ static MemoryRegion io_mem_unassigned;
  */
 #define RAM_RESIZEABLE (1 << 2)
 
+/* UFFDIO_ZEROPAGE is available on this RAMBlock to atomically
+ * zero the page and wake waiting processes.
+ * (Set during postcopy)
+ */
+#define RAM_UF_ZEROPAGE (1 << 3)
 #endif
 
 #ifdef TARGET_PAGE_BITS_VARY
@@ -1767,6 +1772,16 @@ bool qemu_ram_is_shared(RAMBlock *rb)
     return rb->flags & RAM_SHARED;
 }
 
+bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
+{
+    return rb->flags & RAM_UF_ZEROPAGE;
+}
+
+void qemu_ram_set_uf_zeroable(RAMBlock *rb)
+{
+    rb->flags |= RAM_UF_ZEROPAGE;
+}
+
 /* Called with iothread lock held.  */
 void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
 {
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 0d861a6289..24d335f95d 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -73,6 +73,9 @@ void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
 void qemu_ram_unset_idstr(RAMBlock *block);
 const char *qemu_ram_get_idstr(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
+bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
+void qemu_ram_set_uf_zeroable(RAMBlock *rb);
+
 size_t qemu_ram_pagesize(RAMBlock *block);
 size_t qemu_ram_pagesize_largest(void);
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index bec6c2c66b..6297979700 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -490,6 +490,10 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
         error_report("%s userfault: Region doesn't support COPY", __func__);
         return -1;
     }
+    if (reg_struct.ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE)) {
+        RAMBlock *rb = qemu_ram_block_by_name(block_name);
+        qemu_ram_set_uf_zeroable(rb);
+    }
 
     return 0;
 }
@@ -699,11 +703,14 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
 int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
                              RAMBlock *rb)
 {
+    size_t pagesize = qemu_ram_pagesize(rb);
     trace_postcopy_place_page_zero(host);
 
-    if (qemu_ram_pagesize(rb) == getpagesize()) {
-        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, getpagesize(),
-                                rb)) {
+    /* Normal RAMBlocks can zero a page using UFFDIO_ZEROPAGE
+     * but it's not available for everything (e.g. hugetlbpages)
+     */
+    if (qemu_ram_is_uf_zeroable(rb)) {
+        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, pagesize, rb)) {
             int e = errno;
             error_report("%s: %s zero host: %p",
                          __func__, strerror(e), host);
-- 
2.14.3


* [Qemu-devel] [PATCH v3 04/29] postcopy: Add notifier chain
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (2 preceding siblings ...)
  2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 03/29] postcopy: use UFFDIO_ZEROPAGE only when available Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 05/29] postcopy: Add vhost-user flag for postcopy and check it Dr. David Alan Gilbert (git)
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a notifier chain for postcopy with a 'reason' flag
and an opportunity for a notifier member to return an error.

Call it when enabling postcopy.

This will initially be used to let devices declare that they're unable
to postcopy, and later to notify devices of stages within postcopy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 36 ++++++++++++++++++++++++++++++++++++
 migration/postcopy-ram.h | 26 ++++++++++++++++++++++++++
 vl.c                     |  2 ++
 3 files changed, 64 insertions(+)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 6297979700..fa98cf353b 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -23,6 +23,8 @@
 #include "savevm.h"
 #include "postcopy-ram.h"
 #include "ram.h"
+#include "qapi/error.h"
+#include "qemu/notify.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/balloon.h"
 #include "qemu/error-report.h"
@@ -45,6 +47,33 @@ struct PostcopyDiscardState {
     unsigned int nsentcmds;
 };
 
+static NotifierWithReturnList postcopy_notifier_list;
+
+void postcopy_infrastructure_init(void)
+{
+    notifier_with_return_list_init(&postcopy_notifier_list);
+}
+
+void postcopy_add_notifier(NotifierWithReturn *nn)
+{
+    notifier_with_return_list_add(&postcopy_notifier_list, nn);
+}
+
+void postcopy_remove_notifier(NotifierWithReturn *n)
+{
+    notifier_with_return_remove(n);
+}
+
+int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp)
+{
+    struct PostcopyNotifyData pnd;
+    pnd.reason = reason;
+    pnd.errp = errp;
+
+    return notifier_with_return_list_notify(&postcopy_notifier_list,
+                                            &pnd);
+}
+
 /* Postcopy needs to detect accesses to pages that haven't yet been copied
  * across, and efficiently map new pages in, the techniques for doing this
  * are target OS specific.
@@ -215,6 +244,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
     struct uffdio_register reg_struct;
     struct uffdio_range range_struct;
     uint64_t feature_mask;
+    Error *local_err = NULL;
 
     if (qemu_target_page_size() > pagesize) {
         error_report("Target page size bigger than host page size");
@@ -228,6 +258,12 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
         goto out;
     }
 
+    /* Give devices a chance to object */
+    if (postcopy_notify(POSTCOPY_NOTIFY_PROBE, &local_err)) {
+        error_report_err(local_err);
+        goto out;
+    }
+
     /* Version and features check */
     if (!ufd_check_and_apply(ufd, mis)) {
         goto out;
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 77ea0fd264..1eaf7975e9 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -114,4 +114,30 @@ PostcopyState postcopy_state_get(void);
 /* Set the state and return the old state */
 PostcopyState postcopy_state_set(PostcopyState new_state);
 
+/*
+ * To be called once at the start before any device initialisation
+ */
+void postcopy_infrastructure_init(void);
+
+/* Add a notifier to a list to be called when checking whether the devices
+ * can support postcopy.
+ * It's data is a *PostcopyNotifyData
+ * It should return 0 if OK, or a negative value on failure.
+ * On failure it must set the data->errp to an error.
+ *
+ */
+enum PostcopyNotifyReason {
+    POSTCOPY_NOTIFY_PROBE = 0,
+};
+
+struct PostcopyNotifyData {
+    enum PostcopyNotifyReason reason;
+    Error **errp;
+};
+
+void postcopy_add_notifier(NotifierWithReturn *nn);
+void postcopy_remove_notifier(NotifierWithReturn *n);
+/* Call the notifier list set by postcopy_add_start_notifier */
+int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp);
+
 #endif
diff --git a/vl.c b/vl.c
index 7a5554bc41..a092b3c6c6 100644
--- a/vl.c
+++ b/vl.c
@@ -94,6 +94,7 @@ int main(int argc, char **argv)
 #include "audio/audio.h"
 #include "sysemu/cpus.h"
 #include "migration/colo.h"
+#include "migration/postcopy-ram.h"
 #include "sysemu/kvm.h"
 #include "sysemu/hax.h"
 #include "qapi/qobject-input-visitor.h"
@@ -3119,6 +3120,7 @@ int main(int argc, char **argv, char **envp)
     module_call_init(MODULE_INIT_OPTS);
 
     runstate_init();
+    postcopy_infrastructure_init();
 
     if (qcrypto_init(&err) < 0) {
         error_reportf_err(err, "cannot initialize crypto: ");
-- 
2.14.3


* [Qemu-devel] [PATCH v3 05/29] postcopy: Add vhost-user flag for postcopy and check it
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (3 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 04/29] postcopy: Add notifier chain Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-28  7:14   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 06/29] vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message Dr. David Alan Gilbert (git)
                   ` (24 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a vhost feature flag for postcopy support, and
use the postcopy notifier to check it before allowing postcopy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.h |  1 +
 docs/interop/vhost-user.txt           | 10 +++++++++
 hw/virtio/vhost-user.c                | 41 ++++++++++++++++++++++++++++++++++-
 3 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index 18f95f65d7..a0dcc97c73 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -48,6 +48,7 @@ enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_NET_MTU = 4,
     VHOST_USER_PROTOCOL_F_SLAVE_REQ = 5,
     VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,
+    VHOST_USER_PROTOCOL_F_PAGEFAULT = 7,
 
     VHOST_USER_PROTOCOL_F_MAX
 };
diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index 9fcf48d611..e95cd82677 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -290,6 +290,15 @@ Once the source has finished migration, rings will be stopped by
 the source. No further update must be done before rings are
 restarted.
 
+In postcopy migration the slave is started before all the memory has been
+received from the source host, and care must be taken to avoid accessing pages
+that have yet to be received.  The slave opens a 'userfault'-fd and registers
+the memory with it; this fd is then passed back over to the master.
+The master services requests on the userfaultfd for pages that are accessed
+and when the page is available it performs WAKE ioctl's on the userfaultfd
+to wake the stalled slave.  The client indicates support for this via the
+VHOST_USER_PROTOCOL_F_PAGEFAULT feature.
+
 Memory access
 -------------
 
@@ -368,6 +377,7 @@ Protocol features
 #define VHOST_USER_PROTOCOL_F_MTU            4
 #define VHOST_USER_PROTOCOL_F_SLAVE_REQ      5
 #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN   6
+#define VHOST_USER_PROTOCOL_F_PAGEFAULT      7
 
 Master message types
 --------------------
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 6eb97980ad..d60f89cc16 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -17,6 +17,8 @@
 #include "sysemu/kvm.h"
 #include "qemu/error-report.h"
 #include "qemu/sockets.h"
+#include "migration/migration.h"
+#include "migration/postcopy-ram.h"
 
 #include <sys/ioctl.h>
 #include <sys/socket.h>
@@ -39,7 +41,7 @@ enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_NET_MTU = 4,
     VHOST_USER_PROTOCOL_F_SLAVE_REQ = 5,
     VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,
-
+    VHOST_USER_PROTOCOL_F_PAGEFAULT = 7,
     VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -148,8 +150,10 @@ static VhostUserMsg m __attribute__ ((unused));
 #define VHOST_USER_VERSION    (0x1)
 
 struct vhost_user {
+    struct vhost_dev *dev;
     CharBackend *chr;
     int slave_fd;
+    NotifierWithReturn postcopy_notifier;
 };
 
 static bool ioeventfd_enabled(void)
@@ -775,6 +779,33 @@ out:
     return ret;
 }
 
+static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
+                                        void *opaque)
+{
+    struct PostcopyNotifyData *pnd = opaque;
+    struct vhost_user *u = container_of(notifier, struct vhost_user,
+                                         postcopy_notifier);
+    struct vhost_dev *dev = u->dev;
+
+    switch (pnd->reason) {
+    case POSTCOPY_NOTIFY_PROBE:
+        if (!virtio_has_feature(dev->protocol_features,
+                                VHOST_USER_PROTOCOL_F_PAGEFAULT)) {
+            /* TODO: Get the device name into this error somehow */
+            error_setg(pnd->errp,
+                       "vhost-user backend not capable of postcopy");
+            return -ENOENT;
+        }
+        break;
+
+    default:
+        /* We ignore notifications we don't know */
+        break;
+    }
+
+    return 0;
+}
+
 static int vhost_user_init(struct vhost_dev *dev, void *opaque)
 {
     uint64_t features, protocol_features;
@@ -786,6 +817,7 @@ static int vhost_user_init(struct vhost_dev *dev, void *opaque)
     u = g_new0(struct vhost_user, 1);
     u->chr = opaque;
     u->slave_fd = -1;
+    u->dev = dev;
     dev->opaque = u;
 
     err = vhost_user_get_features(dev, &features);
@@ -842,6 +874,9 @@ static int vhost_user_init(struct vhost_dev *dev, void *opaque)
         return err;
     }
 
+    u->postcopy_notifier.notify = vhost_user_postcopy_notifier;
+    postcopy_add_notifier(&u->postcopy_notifier);
+
     return 0;
 }
 
@@ -852,6 +887,10 @@ static int vhost_user_cleanup(struct vhost_dev *dev)
     assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER);
 
     u = dev->opaque;
+    if (u->postcopy_notifier.notify) {
+        postcopy_remove_notifier(&u->postcopy_notifier);
+        u->postcopy_notifier.notify = NULL;
+    }
     if (u->slave_fd >= 0) {
         qemu_set_fd_handler(u->slave_fd, NULL, NULL, NULL);
         close(u->slave_fd);
-- 
2.14.3


* [Qemu-devel] [PATCH v3 06/29] vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (4 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 05/29] postcopy: Add vhost-user flag for postcopy and check it Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 07/29] libvhost-user: Support sending fds back to qemu Dr. David Alan Gilbert (git)
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Wire up a notifier to send a VHOST_USER_POSTCOPY_ADVISE
message on an incoming advise.

Later patches will fill in the behaviour/contents of the
message.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 14 ++++++++++
 contrib/libvhost-user/libvhost-user.h |  1 +
 docs/interop/vhost-user.txt           |  9 +++++++
 hw/virtio/vhost-user.c                | 48 +++++++++++++++++++++++++++++++++++
 migration/postcopy-ram.h              |  1 +
 migration/savevm.c                    |  6 +++++
 6 files changed, 79 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 2e358b5bce..71825d2dde 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -86,6 +86,7 @@ vu_request_to_string(unsigned int req)
         REQ(VHOST_USER_SET_VRING_ENDIAN),
         REQ(VHOST_USER_GET_CONFIG),
         REQ(VHOST_USER_SET_CONFIG),
+        REQ(VHOST_USER_POSTCOPY_ADVISE),
         REQ(VHOST_USER_MAX),
     };
 #undef REQ
@@ -856,6 +857,17 @@ vu_set_config(VuDev *dev, VhostUserMsg *vmsg)
     return false;
 }
 
+static bool
+vu_set_postcopy_advise(VuDev *dev, VhostUserMsg *vmsg)
+{
+    /* TODO: Open ufd, pass it back in the request
+     * TODO: Add addresses
+     */
+    vmsg->payload.u64 = 0xcafe;
+    vmsg->size = sizeof(vmsg->payload.u64);
+    return true; /* = send a reply */
+}
+
 static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -927,6 +939,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
         return vu_set_config(dev, vmsg);
     case VHOST_USER_NONE:
         break;
+    case VHOST_USER_POSTCOPY_ADVISE:
+        return vu_set_postcopy_advise(dev, vmsg);
     default:
         vmsg_close_fds(vmsg);
         vu_panic(dev, "Unhandled request: %d", vmsg->request);
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index a0dcc97c73..9c3a180777 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -82,6 +82,7 @@ typedef enum VhostUserRequest {
     VHOST_USER_SET_VRING_ENDIAN = 23,
     VHOST_USER_GET_CONFIG = 24,
     VHOST_USER_SET_CONFIG = 25,
+    VHOST_USER_POSTCOPY_ADVISE  = 26,
     VHOST_USER_MAX
 } VhostUserRequest;
 
diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index e95cd82677..621543e654 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -673,6 +673,15 @@ Master message types
       field, and slaves MUST NOT accept SET_CONFIG for read-only
       configuration space fields unless the live migration bit is set.
 
+ * VHOST_USER_POSTCOPY_ADVISE
+      Id: 26
+      Master payload: N/A
+      Slave payload: userfault fd + u64
+
+      Master advises slave that a migration with postcopy enabled is underway,
+      the slave must open a userfaultfd for later use.
+      Note that at this stage the migration is still in precopy mode.
+
 Slave message types
 -------------------
 
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index d60f89cc16..4f59993baa 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -74,6 +74,7 @@ typedef enum VhostUserRequest {
     VHOST_USER_SET_VRING_ENDIAN = 23,
     VHOST_USER_GET_CONFIG = 24,
     VHOST_USER_SET_CONFIG = 25,
+    VHOST_USER_POSTCOPY_ADVISE  = 26,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -779,6 +780,50 @@ out:
     return ret;
 }
 
+/*
+ * Called at the start of an inbound postcopy on reception of the
+ * 'advise' command.
+ */
+static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
+{
+    struct vhost_user *u = dev->opaque;
+    CharBackend *chr = u->chr;
+    int ufd;
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_POSTCOPY_ADVISE,
+        .hdr.flags = VHOST_USER_VERSION,
+    };
+
+    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
+        error_setg(errp, "Failed to send postcopy_advise to vhost");
+        return -1;
+    }
+
+    if (vhost_user_read(dev, &msg) < 0) {
+        error_setg(errp, "Failed to get postcopy_advise reply from vhost");
+        return -1;
+    }
+
+    if (msg.hdr.request != VHOST_USER_POSTCOPY_ADVISE) {
+        error_setg(errp, "Unexpected msg type. Expected %d received %d",
+                     VHOST_USER_POSTCOPY_ADVISE, msg.hdr.request);
+        return -1;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.u64)) {
+        error_setg(errp, "Received bad msg size.");
+        return -1;
+    }
+    ufd = qemu_chr_fe_get_msgfd(chr);
+    if (ufd < 0) {
+        error_setg(errp, "%s: Failed to get ufd", __func__);
+        return -1;
+    }
+
+    /* TODO: register ufd with userfault thread */
+    return 0;
+}
+
 static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
                                         void *opaque)
 {
@@ -798,6 +843,9 @@ static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
         }
         break;
 
+    case POSTCOPY_NOTIFY_INBOUND_ADVISE:
+        return vhost_user_postcopy_advise(dev, pnd->errp);
+
     default:
         /* We ignore notifications we don't know */
         break;
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 1eaf7975e9..bee21d4401 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -128,6 +128,7 @@ void postcopy_infrastructure_init(void);
  */
 enum PostcopyNotifyReason {
     POSTCOPY_NOTIFY_PROBE = 0,
+    POSTCOPY_NOTIFY_INBOUND_ADVISE,
 };
 
 struct PostcopyNotifyData {
diff --git a/migration/savevm.c b/migration/savevm.c
index 3f611c02e8..9840bcaac9 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1382,6 +1382,7 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
 {
     PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_ADVISE);
     uint64_t remote_pagesize_summary, local_pagesize_summary, remote_tps;
+    Error *local_err = NULL;
 
     trace_loadvm_postcopy_handle_advise();
     if (ps != POSTCOPY_INCOMING_NONE) {
@@ -1447,6 +1448,11 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
         return -1;
     }
 
+    if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_ADVISE, &local_err)) {
+        error_report_err(local_err);
+        return -1;
+    }
+
     if (ram_postcopy_incoming_init(mis)) {
         return -1;
     }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH v3 07/29] libvhost-user: Support sending fds back to qemu
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (5 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 06/29] vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 08/29] libvhost-user: Open userfaultfd Dr. David Alan Gilbert (git)
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Allow replies with fds (for postcopy)

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 71825d2dde..eb0ab9338c 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -246,6 +246,31 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
 {
     int rc;
     uint8_t *p = (uint8_t *)vmsg;
+    char control[CMSG_SPACE(VHOST_MEMORY_MAX_NREGIONS * sizeof(int))] = { };
+    struct iovec iov = {
+        .iov_base = (char *)vmsg,
+        .iov_len = VHOST_USER_HDR_SIZE,
+    };
+    struct msghdr msg = {
+        .msg_iov = &iov,
+        .msg_iovlen = 1,
+        .msg_control = control,
+    };
+    struct cmsghdr *cmsg;
+
+    memset(control, 0, sizeof(control));
+    assert(vmsg->fd_num <= VHOST_MEMORY_MAX_NREGIONS);
+    if (vmsg->fd_num > 0) {
+        size_t fdsize = vmsg->fd_num * sizeof(int);
+        msg.msg_controllen = CMSG_SPACE(fdsize);
+        cmsg = CMSG_FIRSTHDR(&msg);
+        cmsg->cmsg_len = CMSG_LEN(fdsize);
+        cmsg->cmsg_level = SOL_SOCKET;
+        cmsg->cmsg_type = SCM_RIGHTS;
+        memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
+    } else {
+        msg.msg_controllen = 0;
+    }
 
     /* Set the version in the flags when sending the reply */
     vmsg->flags &= ~VHOST_USER_VERSION_MASK;
@@ -253,7 +278,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
     vmsg->flags |= VHOST_USER_REPLY_MASK;
 
     do {
-        rc = write(conn_fd, p, VHOST_USER_HDR_SIZE);
+        rc = sendmsg(conn_fd, &msg, 0);
     } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
 
     do {
@@ -346,6 +371,7 @@ vu_get_features_exec(VuDev *dev, VhostUserMsg *vmsg)
     }
 
     vmsg->size = sizeof(vmsg->payload.u64);
+    vmsg->fd_num = 0;
 
     DPRINT("Sending back to guest u64: 0x%016"PRIx64"\n", vmsg->payload.u64);
 
@@ -501,6 +527,7 @@ vu_set_log_base_exec(VuDev *dev, VhostUserMsg *vmsg)
     dev->log_size = log_mmap_size;
 
     vmsg->size = sizeof(vmsg->payload.u64);
+    vmsg->fd_num = 0;
 
     return true;
 }
@@ -759,6 +786,7 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg *vmsg)
 
     vmsg->payload.u64 = features;
     vmsg->size = sizeof(vmsg->payload.u64);
+    vmsg->fd_num = 0;
 
     return true;
 }
-- 
2.14.3


* [Qemu-devel] [PATCH v3 08/29] libvhost-user: Open userfaultfd
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (6 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 07/29] libvhost-user: Support sending fds back to qemu Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 09/29] postcopy: Allow registering of fd handler Dr. David Alan Gilbert (git)
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Open a userfaultfd (on a postcopy_advise) and send it back in
the reply to the qemu for it to monitor.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 45 +++++++++++++++++++++++++++++++----
 contrib/libvhost-user/libvhost-user.h |  3 +++
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index eb0ab9338c..0b563fc5ae 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -26,9 +26,20 @@
 #include <sys/socket.h>
 #include <sys/eventfd.h>
 #include <sys/mman.h>
+#include "qemu/compiler.h"
+
+#if defined(__linux__)
+#include <sys/syscall.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
 #include <linux/vhost.h>
 
-#include "qemu/compiler.h"
+#ifdef __NR_userfaultfd
+#include <linux/userfaultfd.h>
+#endif
+
+#endif
+
 #include "qemu/atomic.h"
 
 #include "libvhost-user.h"
@@ -888,11 +899,37 @@ vu_set_config(VuDev *dev, VhostUserMsg *vmsg)
 static bool
 vu_set_postcopy_advise(VuDev *dev, VhostUserMsg *vmsg)
 {
-    /* TODO: Open ufd, pass it back in the request
-     * TODO: Add addresses
-     */
+    dev->postcopy_ufd = -1;
+#ifdef UFFDIO_API
+    struct uffdio_api api_struct;
+
+    dev->postcopy_ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+    /* TODO: Add addresses */
     vmsg->payload.u64 = 0xcafe;
     vmsg->size = sizeof(vmsg->payload.u64);
+#endif
+
+    if (dev->postcopy_ufd == -1) {
+        vu_panic(dev, "Userfaultfd not available: %s", strerror(errno));
+        goto out;
+    }
+
+#ifdef UFFDIO_API
+    api_struct.api = UFFD_API;
+    api_struct.features = 0;
+    if (ioctl(dev->postcopy_ufd, UFFDIO_API, &api_struct)) {
+        vu_panic(dev, "Failed UFFDIO_API: %s", strerror(errno));
+        close(dev->postcopy_ufd);
+        dev->postcopy_ufd = -1;
+        goto out;
+    }
+    /* TODO: Stash feature flags somewhere */
+#endif
+
+out:
+    /* Return a ufd to the QEMU */
+    vmsg->fd_num = 1;
+    vmsg->fds[0] = dev->postcopy_ufd;
     return true; /* = send a reply */
 }
 
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index 9c3a180777..bb33b33f3b 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -279,6 +279,9 @@ struct VuDev {
      * re-initialize */
     vu_panic_cb panic;
     const VuDevIface *iface;
+
+    /* Postcopy data */
+    int postcopy_ufd;
 };
 
 typedef struct VuVirtqElement {
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH v3 09/29] postcopy: Allow registering of fd handler
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (7 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 08/29] libvhost-user: Open userfaultfd Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-28  8:38   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 10/29] vhost+postcopy: Register shared ufd with postcopy Dr. David Alan Gilbert (git)
                   ` (20 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Allow other userfaultfds to be registered with the fault thread
so that handlers for shared memory can get responses.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/migration.c    |   6 ++
 migration/migration.h    |   2 +
 migration/postcopy-ram.c | 209 +++++++++++++++++++++++++++++++++++------------
 migration/postcopy-ram.h |  21 +++++
 migration/trace-events   |   2 +
 5 files changed, 187 insertions(+), 53 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 86d69120a6..d49eed1c5b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -155,6 +155,8 @@ MigrationIncomingState *migration_incoming_get_current(void)
     if (!once) {
         mis_current.state = MIGRATION_STATUS_NONE;
         memset(&mis_current, 0, sizeof(MigrationIncomingState));
+        mis_current.postcopy_remote_fds = g_array_new(FALSE, TRUE,
+                                                   sizeof(struct PostCopyFD));
         qemu_mutex_init(&mis_current.rp_mutex);
         qemu_event_init(&mis_current.main_thread_load_event, false);
         once = true;
@@ -177,6 +179,10 @@ void migration_incoming_state_destroy(void)
         qemu_fclose(mis->from_src_file);
         mis->from_src_file = NULL;
     }
+    if (mis->postcopy_remote_fds) {
+        g_array_free(mis->postcopy_remote_fds, TRUE);
+        mis->postcopy_remote_fds = NULL;
+    }
 
     qemu_event_reset(&mis->main_thread_load_event);
 }
diff --git a/migration/migration.h b/migration/migration.h
index 848f638a20..d158e62cf2 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -48,6 +48,8 @@ struct MigrationIncomingState {
     QemuMutex rp_mutex;    /* We send replies from multiple threads */
     void     *postcopy_tmp_page;
     void     *postcopy_tmp_zero_page;
+    /* PostCopyFD's for external userfaultfds & handlers of shared memory */
+    GArray   *postcopy_remote_fds;
 
     QEMUBH *bh;
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index fa98cf353b..d118b78bf5 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -542,29 +542,43 @@ static void *postcopy_ram_fault_thread(void *opaque)
     MigrationIncomingState *mis = opaque;
     struct uffd_msg msg;
     int ret;
+    size_t index;
     RAMBlock *rb = NULL;
     RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
 
     trace_postcopy_ram_fault_thread_entry();
     qemu_sem_post(&mis->fault_thread_sem);
 
+    struct pollfd *pfd;
+    size_t pfd_len = 2 + mis->postcopy_remote_fds->len;
+
+    pfd = g_new0(struct pollfd, pfd_len);
+
+    pfd[0].fd = mis->userfault_fd;
+    pfd[0].events = POLLIN;
+    pfd[1].fd = mis->userfault_quit_fd;
+    pfd[1].events = POLLIN; /* Waiting for eventfd to go positive */
+    trace_postcopy_ram_fault_thread_fds_core(pfd[0].fd, pfd[1].fd);
+    for (index = 0; index < mis->postcopy_remote_fds->len; index++) {
+        struct PostCopyFD *pcfd = &g_array_index(mis->postcopy_remote_fds,
+                                                 struct PostCopyFD, index);
+        pfd[2 + index].fd = pcfd->fd;
+        pfd[2 + index].events = POLLIN;
+        trace_postcopy_ram_fault_thread_fds_extra(2 + index, pcfd->idstr,
+                                                  pcfd->fd);
+    }
+
     while (true) {
         ram_addr_t rb_offset;
-        struct pollfd pfd[2];
+        int poll_result;
 
         /*
          * We're mainly waiting for the kernel to give us a faulting HVA,
          * however we can be told to quit via userfault_quit_fd which is
          * an eventfd
          */
-        pfd[0].fd = mis->userfault_fd;
-        pfd[0].events = POLLIN;
-        pfd[0].revents = 0;
-        pfd[1].fd = mis->userfault_quit_fd;
-        pfd[1].events = POLLIN; /* Waiting for eventfd to go positive */
-        pfd[1].revents = 0;
-
-        if (poll(pfd, 2, -1 /* Wait forever */) == -1) {
+        poll_result = poll(pfd, pfd_len, -1 /* Wait forever */);
+        if (poll_result == -1) {
             error_report("%s: userfault poll: %s", __func__, strerror(errno));
             break;
         }
@@ -574,57 +588,117 @@ static void *postcopy_ram_fault_thread(void *opaque)
             break;
         }
 
-        ret = read(mis->userfault_fd, &msg, sizeof(msg));
-        if (ret != sizeof(msg)) {
-            if (errno == EAGAIN) {
-                /*
-                 * if a wake up happens on the other thread just after
-                 * the poll, there is nothing to read.
-                 */
-                continue;
+        if (pfd[0].revents) {
+            poll_result--;
+            ret = read(mis->userfault_fd, &msg, sizeof(msg));
+            if (ret != sizeof(msg)) {
+                if (errno == EAGAIN) {
+                    /*
+                     * if a wake up happens on the other thread just after
+                     * the poll, there is nothing to read.
+                     */
+                    continue;
+                }
+                if (ret < 0) {
+                    error_report("%s: Failed to read full userfault "
+                                 "message: %s",
+                                 __func__, strerror(errno));
+                    break;
+                } else {
+                    error_report("%s: Read %d bytes from userfaultfd "
+                                 "expected %zd",
+                                 __func__, ret, sizeof(msg));
+                    break; /* Lost alignment, don't know what we'd read next */
+                }
             }
-            if (ret < 0) {
-                error_report("%s: Failed to read full userfault message: %s",
-                             __func__, strerror(errno));
-                break;
-            } else {
-                error_report("%s: Read %d bytes from userfaultfd expected %zd",
-                             __func__, ret, sizeof(msg));
-                break; /* Lost alignment, don't know what we'd read next */
+            if (msg.event != UFFD_EVENT_PAGEFAULT) {
+                error_report("%s: Read unexpected event %ud from userfaultfd",
+                             __func__, msg.event);
+                continue; /* It's not a page fault, shouldn't happen */
             }
-        }
-        if (msg.event != UFFD_EVENT_PAGEFAULT) {
-            error_report("%s: Read unexpected event %ud from userfaultfd",
-                         __func__, msg.event);
-            continue; /* It's not a page fault, shouldn't happen */
-        }
 
-        rb = qemu_ram_block_from_host(
-                 (void *)(uintptr_t)msg.arg.pagefault.address,
-                 true, &rb_offset);
-        if (!rb) {
-            error_report("postcopy_ram_fault_thread: Fault outside guest: %"
-                         PRIx64, (uint64_t)msg.arg.pagefault.address);
-            break;
-        }
+            rb = qemu_ram_block_from_host(
+                     (void *)(uintptr_t)msg.arg.pagefault.address,
+                     true, &rb_offset);
+            if (!rb) {
+                error_report("postcopy_ram_fault_thread: Fault outside guest: %"
+                             PRIx64, (uint64_t)msg.arg.pagefault.address);
+                break;
+            }
 
-        rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
-        trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
+            rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
+            trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
                                                 qemu_ram_get_idstr(rb),
                                                 rb_offset);
+            /*
+             * Send the request to the source - we want to request one
+             * of our host page sizes (which is >= TPS)
+             */
+            if (rb != last_rb) {
+                last_rb = rb;
+                migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
+                                         rb_offset, qemu_ram_pagesize(rb));
+            } else {
+                /* Save some space */
+                migrate_send_rp_req_pages(mis, NULL,
+                                         rb_offset, qemu_ram_pagesize(rb));
+            }
+        }
 
-        /*
-         * Send the request to the source - we want to request one
-         * of our host page sizes (which is >= TPS)
-         */
-        if (rb != last_rb) {
-            last_rb = rb;
-            migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
-                                     rb_offset, qemu_ram_pagesize(rb));
-        } else {
-            /* Save some space */
-            migrate_send_rp_req_pages(mis, NULL,
-                                     rb_offset, qemu_ram_pagesize(rb));
+        /* Now handle any requests from external processes on shared memory */
+        /* TODO: May need to handle devices deregistering during postcopy */
+        for (index = 2; index < pfd_len && poll_result; index++) {
+            if (pfd[index].revents) {
+                struct PostCopyFD *pcfd =
+                    &g_array_index(mis->postcopy_remote_fds,
+                                   struct PostCopyFD, index - 2);
+
+                poll_result--;
+                if (pfd[index].revents & POLLERR) {
+                    error_report("%s: POLLERR on poll %zd fd=%d",
+                                 __func__, index, pcfd->fd);
+                    pfd[index].events = 0;
+                    continue;
+                }
+
+                ret = read(pcfd->fd, &msg, sizeof(msg));
+                if (ret != sizeof(msg)) {
+                    if (errno == EAGAIN) {
+                        /*
+                         * if a wake up happens on the other thread just after
+                         * the poll, there is nothing to read.
+                         */
+                        continue;
+                    }
+                    if (ret < 0) {
+                        error_report("%s: Failed to read full userfault "
+                                     "message: %s (shared) revents=%d",
+                                     __func__, strerror(errno),
+                                     pfd[index].revents);
+                        /*TODO: Could just disable this sharer */
+                        break;
+                    } else {
+                        error_report("%s: Read %d bytes from userfaultfd "
+                                     "expected %zd (shared)",
+                                     __func__, ret, sizeof(msg));
+                        /*TODO: Could just disable this sharer */
+                        break; /*Lost alignment,don't know what we'd read next*/
+                    }
+                }
+                if (msg.event != UFFD_EVENT_PAGEFAULT) {
+                    error_report("%s: Read unexpected event %ud "
+                                 "from userfaultfd (shared)",
+                                 __func__, msg.event);
+                    continue; /* It's not a page fault, shouldn't happen */
+                }
+                /* Call the device handler registered with us */
+                ret = pcfd->handler(pcfd, &msg);
+                if (ret) {
+                    error_report("%s: Failed to resolve shared fault on %zd/%s",
+                                 __func__, index, pcfd->idstr);
+                    /* TODO: Fail? Disable this sharer? */
+                }
+            }
         }
     }
     trace_postcopy_ram_fault_thread_exit();
@@ -954,3 +1028,32 @@ PostcopyState postcopy_state_set(PostcopyState new_state)
 {
     return atomic_xchg(&incoming_postcopy_state, new_state);
 }
+
+/* Register a handler for external shared memory postcopy
+ * called on the destination.
+ */
+void postcopy_register_shared_ufd(struct PostCopyFD *pcfd)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    mis->postcopy_remote_fds = g_array_append_val(mis->postcopy_remote_fds,
+                                                  *pcfd);
+}
+
+/* Unregister a handler for external shared memory postcopy
+ */
+void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd)
+{
+    guint i;
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    GArray *pcrfds = mis->postcopy_remote_fds;
+
+    for (i = 0; i < pcrfds->len; i++) {
+        struct PostCopyFD *cur = &g_array_index(pcrfds, struct PostCopyFD, i);
+        if (cur->fd == pcfd->fd) {
+            mis->postcopy_remote_fds = g_array_remove_index(pcrfds, i);
+            return;
+        }
+    }
+}
+
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index bee21d4401..4bda5aa509 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -141,4 +141,25 @@ void postcopy_remove_notifier(NotifierWithReturn *n);
 /* Call the notifier list set by postcopy_add_start_notifier */
 int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp);
 
+struct PostCopyFD;
+
+/* ufd is a pointer to the struct uffd_msg *TODO: more Portable! */
+typedef int (*pcfdhandler)(struct PostCopyFD *pcfd, void *ufd);
+
+struct PostCopyFD {
+    int fd;
+    /* Data to pass to handler */
+    void *data;
+    /* Handler to be called whenever we get a poll event */
+    pcfdhandler handler;
+    /* A string to use in error messages */
+    char *idstr;
+};
+
+/* Register a userfaultfd owned by an external process for
+ * shared memory.
+ */
+void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
+void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
+
 #endif
diff --git a/migration/trace-events b/migration/trace-events
index 93961dea16..1e617ad7a6 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -190,6 +190,8 @@ postcopy_place_page_zero(void *host_addr) "host=%p"
 postcopy_ram_enable_notify(void) ""
 postcopy_ram_fault_thread_entry(void) ""
 postcopy_ram_fault_thread_exit(void) ""
+postcopy_ram_fault_thread_fds_core(int baseufd, int quitfd) "ufd: %d quitfd: %d"
+postcopy_ram_fault_thread_fds_extra(size_t index, const char *name, int fd) "%zd/%s: %d"
 postcopy_ram_fault_thread_quit(void) ""
 postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=0x%" PRIx64 " rb=%s offset=0x%zx"
 postcopy_ram_incoming_cleanup_closeuf(void) ""
-- 
2.14.3


* [Qemu-devel] [PATCH v3 10/29] vhost+postcopy: Register shared ufd with postcopy
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (8 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 09/29] postcopy: Allow registering of fd handler Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-28  8:46   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 11/29] vhost+postcopy: Transmit 'listen' to client Dr. David Alan Gilbert (git)
                   ` (19 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Register the UFD that comes in as the response to the 'advise' method
with the postcopy code.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/virtio/vhost-user.c   | 21 ++++++++++++++++++++-
 migration/postcopy-ram.h |  2 +-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 4f59993baa..dd4eb50668 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -24,6 +24,7 @@
 #include <sys/socket.h>
 #include <sys/un.h>
 #include <linux/vhost.h>
+#include <linux/userfaultfd.h>
 
 #define VHOST_MEMORY_MAX_NREGIONS    8
 #define VHOST_USER_F_PROTOCOL_FEATURES 30
@@ -155,6 +156,7 @@ struct vhost_user {
     CharBackend *chr;
     int slave_fd;
     NotifierWithReturn postcopy_notifier;
+    struct PostCopyFD  postcopy_fd;
 };
 
 static bool ioeventfd_enabled(void)
@@ -780,6 +782,17 @@ out:
     return ret;
 }
 
+/*
+ * Called back from the postcopy fault thread when a fault is received on our
+ * ufd.
+ * TODO: This is Linux specific
+ */
+static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
+                                             void *ufd)
+{
+    return 0;
+}
+
 /*
  * Called at the start of an inbound postcopy on reception of the
  * 'advise' command.
@@ -819,8 +832,14 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
         error_setg(errp, "%s: Failed to get ufd", __func__);
         return -1;
     }
+    fcntl(ufd, F_SETFL, O_NONBLOCK);
 
-    /* TODO: register ufd with userfault thread */
+    /* register ufd with userfault thread */
+    u->postcopy_fd.fd = ufd;
+    u->postcopy_fd.data = dev;
+    u->postcopy_fd.handler = vhost_user_postcopy_fault_handler;
+    u->postcopy_fd.idstr = "vhost-user"; /* Need to find unique name */
+    postcopy_register_shared_ufd(&u->postcopy_fd);
     return 0;
 }
 
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 4bda5aa509..23efbdf346 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -153,7 +153,7 @@ struct PostCopyFD {
     /* Handler to be called whenever we get a poll event */
     pcfdhandler handler;
     /* A string to use in error messages */
-    char *idstr;
+    const char *idstr;
 };
 
 /* Register a userfaultfd owned by an external process for
-- 
2.14.3


* [Qemu-devel] [PATCH v3 11/29] vhost+postcopy: Transmit 'listen' to client
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (9 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 10/29] vhost+postcopy: Register shared ufd with postcopy Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-28  8:42   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 12/29] postcopy+vhost-user: Split set_mem_table for postcopy Dr. David Alan Gilbert (git)
                   ` (18 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Notify the vhost-user client on reception of the 'postcopy-listen'
event from the source.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 19 +++++++++++++++++++
 contrib/libvhost-user/libvhost-user.h |  2 ++
 docs/interop/vhost-user.txt           |  6 ++++++
 hw/virtio/trace-events                |  3 +++
 hw/virtio/vhost-user.c                | 34 ++++++++++++++++++++++++++++++++++
 migration/postcopy-ram.h              |  1 +
 migration/savevm.c                    |  7 +++++++
 7 files changed, 72 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 0b563fc5ae..beec7695a8 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -98,6 +98,7 @@ vu_request_to_string(unsigned int req)
         REQ(VHOST_USER_GET_CONFIG),
         REQ(VHOST_USER_SET_CONFIG),
         REQ(VHOST_USER_POSTCOPY_ADVISE),
+        REQ(VHOST_USER_POSTCOPY_LISTEN),
         REQ(VHOST_USER_MAX),
     };
 #undef REQ
@@ -933,6 +934,22 @@ out:
     return true; /* = send a reply */
 }
 
+static bool
+vu_set_postcopy_listen(VuDev *dev, VhostUserMsg *vmsg)
+{
+    vmsg->payload.u64 = -1;
+    vmsg->size = sizeof(vmsg->payload.u64);
+
+    if (dev->nregions) {
+        vu_panic(dev, "Regions already registered at postcopy-listen");
+        return true;
+    }
+    dev->postcopy_listening = true;
+
+    vmsg->flags = VHOST_USER_VERSION |  VHOST_USER_REPLY_MASK;
+    vmsg->payload.u64 = 0; /* Success */
+    return true;
+}
 static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -1006,6 +1023,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
         break;
     case VHOST_USER_POSTCOPY_ADVISE:
         return vu_set_postcopy_advise(dev, vmsg);
+    case VHOST_USER_POSTCOPY_LISTEN:
+        return vu_set_postcopy_listen(dev, vmsg);
     default:
         vmsg_close_fds(vmsg);
         vu_panic(dev, "Unhandled request: %d", vmsg->request);
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index bb33b33f3b..fcba53c3c3 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -83,6 +83,7 @@ typedef enum VhostUserRequest {
     VHOST_USER_GET_CONFIG = 24,
     VHOST_USER_SET_CONFIG = 25,
     VHOST_USER_POSTCOPY_ADVISE  = 26,
+    VHOST_USER_POSTCOPY_LISTEN  = 27,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -282,6 +283,7 @@ struct VuDev {
 
     /* Postcopy data */
     int postcopy_ufd;
+    bool postcopy_listening;
 };
 
 typedef struct VuVirtqElement {
diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index 621543e654..bdec9ec0e8 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -682,6 +682,12 @@ Master message types
       the slave must open a userfaultfd for later use.
       Note that at this stage the migration is still in precopy mode.
 
+ * VHOST_USER_POSTCOPY_LISTEN
+      Id: 27
+      Master payload: N/A
+
+      Master advises slave that a transition to postcopy mode has happened.
+
 Slave message types
 -------------------
 
diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 742ff0f90b..06ec03d6e7 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -6,6 +6,9 @@ vhost_region_add_section(const char *name, uint64_t gpa, uint64_t size, uint64_t
 vhost_region_add_section_abut(const char *name, uint64_t new_size) "%s: 0x%"PRIx64
 vhost_section(const char *name, int r) "%s:%d"
 
+# hw/virtio/vhost-user.c
+vhost_user_postcopy_listen(void) ""
+
 # hw/virtio/virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
 virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) "vq %p elem %p len %u idx %u"
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index dd4eb50668..ec6a4a82fd 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -19,6 +19,7 @@
 #include "qemu/sockets.h"
 #include "migration/migration.h"
 #include "migration/postcopy-ram.h"
+#include "trace.h"
 
 #include <sys/ioctl.h>
 #include <sys/socket.h>
@@ -76,6 +77,7 @@ typedef enum VhostUserRequest {
     VHOST_USER_GET_CONFIG = 24,
     VHOST_USER_SET_CONFIG = 25,
     VHOST_USER_POSTCOPY_ADVISE  = 26,
+    VHOST_USER_POSTCOPY_LISTEN  = 27,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -157,6 +159,8 @@ struct vhost_user {
     int slave_fd;
     NotifierWithReturn postcopy_notifier;
     struct PostCopyFD  postcopy_fd;
+    /* True once we've entered postcopy_listen */
+    bool               postcopy_listen;
 };
 
 static bool ioeventfd_enabled(void)
@@ -843,6 +847,33 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
     return 0;
 }
 
+/*
+ * Called at the switch to postcopy on reception of the 'listen' command.
+ */
+static int vhost_user_postcopy_listen(struct vhost_dev *dev, Error **errp)
+{
+    struct vhost_user *u = dev->opaque;
+    int ret;
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_POSTCOPY_LISTEN,
+        .hdr.flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY_MASK,
+    };
+    u->postcopy_listen = true;
+    trace_vhost_user_postcopy_listen();
+    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
+        error_setg(errp, "Failed to send postcopy_listen to vhost");
+        return -1;
+    }
+
+    ret = process_message_reply(dev, &msg);
+    if (ret) {
+        error_setg(errp, "Failed to receive reply to postcopy_listen");
+        return ret;
+    }
+
+    return 0;
+}
+
 static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
                                         void *opaque)
 {
@@ -865,6 +896,9 @@ static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
     case POSTCOPY_NOTIFY_INBOUND_ADVISE:
         return vhost_user_postcopy_advise(dev, pnd->errp);
 
+    case POSTCOPY_NOTIFY_INBOUND_LISTEN:
+        return vhost_user_postcopy_listen(dev, pnd->errp);
+
     default:
         /* We ignore notifications we don't know */
         break;
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 23efbdf346..dbc2ee1f2b 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -129,6 +129,7 @@ void postcopy_infrastructure_init(void);
 enum PostcopyNotifyReason {
     POSTCOPY_NOTIFY_PROBE = 0,
     POSTCOPY_NOTIFY_INBOUND_ADVISE,
+    POSTCOPY_NOTIFY_INBOUND_LISTEN,
 };
 
 struct PostcopyNotifyData {
diff --git a/migration/savevm.c b/migration/savevm.c
index 9840bcaac9..fa779232cc 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1614,6 +1614,8 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
 {
     PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_LISTENING);
     trace_loadvm_postcopy_handle_listen();
+    Error *local_err = NULL;
+
     if (ps != POSTCOPY_INCOMING_ADVISE && ps != POSTCOPY_INCOMING_DISCARD) {
         error_report("CMD_POSTCOPY_LISTEN in wrong postcopy state (%d)", ps);
         return -1;
@@ -1639,6 +1641,11 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
         }
     }
 
+    if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_LISTEN, &local_err)) {
+        error_report_err(local_err);
+        return -1;
+    }
+
     if (mis->have_listen_thread) {
         error_report("CMD_POSTCOPY_RAM_LISTEN already has a listen thread");
         return -1;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH v3 12/29] postcopy+vhost-user: Split set_mem_table for postcopy
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (10 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 11/29] vhost+postcopy: Transmit 'listen' to client Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-28  8:49   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 13/29] migration/ram: ramblock_recv_bitmap_test_byte_offset Dr. David Alan Gilbert (git)
                   ` (17 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Split the set_mem_table routines in both qemu and libvhost-user,
because the postcopy versions are going to be quite different
once the changes in the later patches are added.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 53 ++++++++++++++++++++++++
 hw/virtio/vhost-user.c                | 77 ++++++++++++++++++++++++++++++++++-
 2 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index beec7695a8..4922b2c722 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -448,6 +448,55 @@ vu_reset_device_exec(VuDev *dev, VhostUserMsg *vmsg)
     return false;
 }
 
+static bool
+vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
+{
+    int i;
+    VhostUserMemory *memory = &vmsg->payload.memory;
+    dev->nregions = memory->nregions;
+    /* TODO: Postcopy specific code */
+    DPRINT("Nregions: %d\n", memory->nregions);
+    for (i = 0; i < dev->nregions; i++) {
+        void *mmap_addr;
+        VhostUserMemoryRegion *msg_region = &memory->regions[i];
+        VuDevRegion *dev_region = &dev->regions[i];
+
+        DPRINT("Region %d\n", i);
+        DPRINT("    guest_phys_addr: 0x%016"PRIx64"\n",
+               msg_region->guest_phys_addr);
+        DPRINT("    memory_size:     0x%016"PRIx64"\n",
+               msg_region->memory_size);
+        DPRINT("    userspace_addr   0x%016"PRIx64"\n",
+               msg_region->userspace_addr);
+        DPRINT("    mmap_offset      0x%016"PRIx64"\n",
+               msg_region->mmap_offset);
+
+        dev_region->gpa = msg_region->guest_phys_addr;
+        dev_region->size = msg_region->memory_size;
+        dev_region->qva = msg_region->userspace_addr;
+        dev_region->mmap_offset = msg_region->mmap_offset;
+
+        /* We don't use offset argument of mmap() since the
+         * mapped address has to be page aligned, and we use huge
+         * pages.  */
+        mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
+                         PROT_READ | PROT_WRITE, MAP_SHARED,
+                         vmsg->fds[i], 0);
+
+        if (mmap_addr == MAP_FAILED) {
+            vu_panic(dev, "region mmap error: %s", strerror(errno));
+        } else {
+            dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
+            DPRINT("    mmap_addr:       0x%016"PRIx64"\n",
+                   dev_region->mmap_addr);
+        }
+
+        close(vmsg->fds[i]);
+    }
+
+    return false;
+}
+
 static bool
 vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -464,6 +513,10 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
     }
     dev->nregions = memory->nregions;
 
+    if (dev->postcopy_listening) {
+        return vu_set_mem_table_exec_postcopy(dev, vmsg);
+    }
+
     DPRINT("Nregions: %d\n", memory->nregions);
     for (i = 0; i < dev->nregions; i++) {
         void *mmap_addr;
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ec6a4a82fd..64f4b3b3f9 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -325,15 +325,86 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
     return 0;
 }
 
+static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
+                                             struct vhost_memory *mem)
+{
+    int fds[VHOST_MEMORY_MAX_NREGIONS];
+    int i, fd;
+    size_t fd_num = 0;
+    bool reply_supported = virtio_has_feature(dev->protocol_features,
+                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
+    /* TODO: Add actual postcopy differences */
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_SET_MEM_TABLE,
+        .hdr.flags = VHOST_USER_VERSION,
+    };
+
+    if (reply_supported) {
+        msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+    }
+
+    for (i = 0; i < dev->mem->nregions; ++i) {
+        struct vhost_memory_region *reg = dev->mem->regions + i;
+        ram_addr_t offset;
+        MemoryRegion *mr;
+
+        assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
+        mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
+                                     &offset);
+        fd = memory_region_get_fd(mr);
+        if (fd > 0) {
+            msg.payload.memory.regions[fd_num].userspace_addr =
+                reg->userspace_addr;
+            msg.payload.memory.regions[fd_num].memory_size  = reg->memory_size;
+            msg.payload.memory.regions[fd_num].guest_phys_addr =
+                reg->guest_phys_addr;
+            msg.payload.memory.regions[fd_num].mmap_offset = offset;
+            assert(fd_num < VHOST_MEMORY_MAX_NREGIONS);
+            fds[fd_num++] = fd;
+        }
+    }
+
+    msg.payload.memory.nregions = fd_num;
+
+    if (!fd_num) {
+        error_report("Failed initializing vhost-user memory map, "
+                     "consider using -object memory-backend-file share=on");
+        return -1;
+    }
+
+    msg.hdr.size = sizeof(msg.payload.memory.nregions);
+    msg.hdr.size += sizeof(msg.payload.memory.padding);
+    msg.hdr.size += fd_num * sizeof(VhostUserMemoryRegion);
+
+    if (vhost_user_write(dev, &msg, fds, fd_num) < 0) {
+        return -1;
+    }
+
+    if (reply_supported) {
+        return process_message_reply(dev, &msg);
+    }
+
+    return 0;
+}
+
 static int vhost_user_set_mem_table(struct vhost_dev *dev,
                                     struct vhost_memory *mem)
 {
+    struct vhost_user *u = dev->opaque;
     int fds[VHOST_MEMORY_MAX_NREGIONS];
     int i, fd;
     size_t fd_num = 0;
+    bool do_postcopy = u->postcopy_listen && u->postcopy_fd.handler;
     bool reply_supported = virtio_has_feature(dev->protocol_features,
                                               VHOST_USER_PROTOCOL_F_REPLY_ACK);
 
+    if (do_postcopy) {
+        /* Postcopy has enough differences that it's best done in its own
+         * version
+         */
+        return vhost_user_set_mem_table_postcopy(dev, mem);
+    }
+
     VhostUserMsg msg = {
         .hdr.request = VHOST_USER_SET_MEM_TABLE,
         .hdr.flags = VHOST_USER_VERSION,
@@ -357,9 +428,11 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
                 error_report("Failed preparing vhost-user memory table msg");
                 return -1;
             }
-            msg.payload.memory.regions[fd_num].userspace_addr = reg->userspace_addr;
+            msg.payload.memory.regions[fd_num].userspace_addr =
+                reg->userspace_addr;
             msg.payload.memory.regions[fd_num].memory_size  = reg->memory_size;
-            msg.payload.memory.regions[fd_num].guest_phys_addr = reg->guest_phys_addr;
+            msg.payload.memory.regions[fd_num].guest_phys_addr =
+                reg->guest_phys_addr;
             msg.payload.memory.regions[fd_num].mmap_offset = offset;
             fds[fd_num++] = fd;
         }
-- 
2.14.3


* [Qemu-devel] [PATCH v3 13/29] migration/ram: ramblock_recv_bitmap_test_byte_offset
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (11 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 12/29] postcopy+vhost-user: Split set_mem_table for postcopy Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-28  8:52   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 14/29] libvhost-user+postcopy: Register new regions with the ufd Dr. David Alan Gilbert (git)
                   ` (16 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a utility for testing the received-page bitmap when you already
know the byte offset within the RAMBlock.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/ram.c | 5 +++++
 migration/ram.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 8333d8e35e..8db5e80500 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -169,6 +169,11 @@ int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr)
                     rb->receivedmap);
 }
 
+bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset)
+{
+    return test_bit(byte_offset >> TARGET_PAGE_BITS, rb->receivedmap);
+}
+
 void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr)
 {
     set_bit_atomic(ramblock_recv_bitmap_offset(host_addr, rb), rb->receivedmap);
diff --git a/migration/ram.h b/migration/ram.h
index f3a227b4fc..63a37c4683 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -60,6 +60,7 @@ int ram_postcopy_incoming_init(MigrationIncomingState *mis);
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
+bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
 void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
 void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
 
-- 
2.14.3


* [Qemu-devel] [PATCH v3 14/29] libvhost-user+postcopy: Register new regions with the ufd
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (12 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 13/29] migration/ram: ramblock_recv_bitmap_test_byte_offset Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 15/29] vhost+postcopy: Send address back to qemu Dr. David Alan Gilbert (git)
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

When new regions are sent to the client using SET_MEM_TABLE, register
them with the userfaultfd.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 4922b2c722..a18bc74a7c 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -494,6 +494,40 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
         close(vmsg->fds[i]);
     }
 
+    /* TODO: Get address back to QEMU */
+    for (i = 0; i < dev->nregions; i++) {
+        VuDevRegion *dev_region = &dev->regions[i];
+#ifdef UFFDIO_REGISTER
+        /* We should already have an open ufd. Mark each memory
+         * range as ufd.
+         * Note: Do we need any madvises? Well it's not been accessed
+         * yet, still probably need no THP to be safe, discard to be safe?
+         */
+        struct uffdio_register reg_struct;
+        reg_struct.range.start = (uintptr_t)dev_region->mmap_addr;
+        reg_struct.range.len = dev_region->size + dev_region->mmap_offset;
+        reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
+
+        if (ioctl(dev->postcopy_ufd, UFFDIO_REGISTER, &reg_struct)) {
+            vu_panic(dev, "%s: Failed to userfault region %d "
+                          "@%p + size:%zx offset: %zx: (ufd=%d)%s\n",
+                     __func__, i,
+                     dev_region->mmap_addr,
+                     dev_region->size, dev_region->mmap_offset,
+                     dev->postcopy_ufd, strerror(errno));
+            return false;
+        }
+        if (!(reg_struct.ioctls & ((__u64)1 << _UFFDIO_COPY))) {
+            vu_panic(dev, "%s Region (%d) doesn't support COPY",
+                     __func__, i);
+            return false;
+        }
+        DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
+                __func__, i, reg_struct.range.start, reg_struct.range.len);
+        /* TODO: Stash 'zero' support flags somewhere */
+#endif
+    }
+
     return false;
 }
 
-- 
2.14.3


* [Qemu-devel] [PATCH v3 15/29] vhost+postcopy: Send address back to qemu
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (13 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 14/29] libvhost-user+postcopy: Register new regions with the ufd Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-27 14:25   ` Michael S. Tsirkin
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 16/29] vhost+postcopy: Stash RAMBlock and offset Dr. David Alan Gilbert (git)
                   ` (14 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

We need a better way, but for now the client sends the addresses of
its mappings back to qemu so that qemu can interpret the messages on
the userfaultfd it reads.

This is done as a 3-stage handshake:
   QEMU -> client
      set_mem_table

   mmap stuff, get addresses

   client -> qemu
       here are the addresses

   qemu -> client
       OK - now you can use them

That ensures that qemu has registered the new addresses in its
userfault code before the client starts accessing them.

Note: We don't ask for the default 'ack' reply since we've got our own.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 24 ++++++++++++-
 docs/interop/vhost-user.txt           |  9 +++++
 hw/virtio/trace-events                |  1 +
 hw/virtio/vhost-user.c                | 67 +++++++++++++++++++++++++++++++++--
 4 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index a18bc74a7c..e02e5d6f46 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -491,10 +491,32 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
                    dev_region->mmap_addr);
         }
 
+        /* Return the address to QEMU so that it can translate the ufd
+         * fault addresses back.
+         */
+        msg_region->userspace_addr = (uintptr_t)(mmap_addr +
+                                                 dev_region->mmap_offset);
         close(vmsg->fds[i]);
     }
 
-    /* TODO: Get address back to QEMU */
+    /* Send the message back to qemu with the addresses filled in */
+    vmsg->fd_num = 0;
+    if (!vu_message_write(dev, dev->sock, vmsg)) {
+        vu_panic(dev, "failed to respond to set-mem-table for postcopy");
+        return false;
+    }
+
+    /* Wait for QEMU to confirm that it's registered the handler for the
+     * faults.
+     */
+    if (!vu_message_read(dev, dev->sock, vmsg) ||
+        vmsg->size != sizeof(vmsg->payload.u64) ||
+        vmsg->payload.u64 != 0) {
+        vu_panic(dev, "failed to receive valid ack for postcopy set-mem-table");
+        return false;
+    }
+
+    /* OK, now we can go and register the memory and generate faults */
     for (i = 0; i < dev->nregions; i++) {
         VuDevRegion *dev_region = &dev->regions[i];
 #ifdef UFFDIO_REGISTER
diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index bdec9ec0e8..5bbcab2cc4 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -454,12 +454,21 @@ Master message types
       Id: 5
       Equivalent ioctl: VHOST_SET_MEM_TABLE
       Master payload: memory regions description
+      Slave payload: (postcopy only) memory regions description
 
       Sets the memory map regions on the slave so it can translate the vring
       addresses. In the ancillary data there is an array of file descriptors
       for each memory mapped region. The size and ordering of the fds matches
       the number and ordering of memory regions.
 
+      When postcopy-listening has been received, SET_MEM_TABLE replies with
+      the bases of the memory mapped regions to the master.  It must have mmap'd
+      the regions but not yet accessed them and should not yet generate a userfault
+      event. Note NEED_REPLY_MASK is not set in this case.
+      QEMU will then reply to the list of mappings with an empty
+      VHOST_USER_SET_MEM_TABLE as an acknowledgement; only upon reception of this
+      message may the guest start accessing the memory and generating faults.
+
  * VHOST_USER_SET_LOG_BASE
 
       Id: 6
diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 06ec03d6e7..05d18ada77 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -8,6 +8,7 @@ vhost_section(const char *name, int r) "%s:%d"
 
 # hw/virtio/vhost-user.c
 vhost_user_postcopy_listen(void) ""
+vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
 
 # hw/virtio/virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 64f4b3b3f9..a060442cb9 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -159,6 +159,7 @@ struct vhost_user {
     int slave_fd;
     NotifierWithReturn postcopy_notifier;
     struct PostCopyFD  postcopy_fd;
+    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
     /* True once we've entered postcopy_listen */
     bool               postcopy_listen;
 };
@@ -328,12 +329,15 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
 static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
                                              struct vhost_memory *mem)
 {
+    struct vhost_user *u = dev->opaque;
     int fds[VHOST_MEMORY_MAX_NREGIONS];
     int i, fd;
     size_t fd_num = 0;
     bool reply_supported = virtio_has_feature(dev->protocol_features,
                                               VHOST_USER_PROTOCOL_F_REPLY_ACK);
-    /* TODO: Add actual postcopy differences */
+    VhostUserMsg msg_reply;
+    int region_i, msg_i;
+
     VhostUserMsg msg = {
         .hdr.request = VHOST_USER_SET_MEM_TABLE,
         .hdr.flags = VHOST_USER_VERSION,
@@ -380,6 +384,64 @@ static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
         return -1;
     }
 
+    if (vhost_user_read(dev, &msg_reply) < 0) {
+        return -1;
+    }
+
+    if (msg_reply.hdr.request != VHOST_USER_SET_MEM_TABLE) {
+        error_report("%s: Received unexpected msg type."
+                     "Expected %d received %d", __func__,
+                     VHOST_USER_SET_MEM_TABLE, msg_reply.hdr.request);
+        return -1;
+    }
+    /* We're using the same structure, just reusing one of the
+     * fields, so it should be the same size.
+     */
+    if (msg_reply.hdr.size != msg.hdr.size) {
+        error_report("%s: Unexpected size for postcopy reply "
+                     "%d vs %d", __func__, msg_reply.hdr.size, msg.hdr.size);
+        return -1;
+    }
+
+    memset(u->postcopy_client_bases, 0,
+           sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
+
+    /* They're in the same order as the regions that were sent
+     * but some of the regions were skipped (above) if they
+     * didn't have fd's
+    */
+    for (msg_i = 0, region_i = 0;
+         region_i < dev->mem->nregions;
+        region_i++) {
+        if (msg_i < fd_num &&
+            msg_reply.payload.memory.regions[msg_i].guest_phys_addr ==
+            dev->mem->regions[region_i].guest_phys_addr) {
+            u->postcopy_client_bases[region_i] =
+                msg_reply.payload.memory.regions[msg_i].userspace_addr;
+            trace_vhost_user_set_mem_table_postcopy(
+                msg_reply.payload.memory.regions[msg_i].userspace_addr,
+                msg.payload.memory.regions[msg_i].userspace_addr,
+                msg_i, region_i);
+            msg_i++;
+        }
+    }
+    if (msg_i != fd_num) {
+        error_report("%s: postcopy reply not fully consumed "
+                     "%d vs %zd",
+                     __func__, msg_i, fd_num);
+        return -1;
+    }
+    /* Now we've registered this with the postcopy code, we ack to the client,
+     * because now we're in the position to be able to deal with any faults
+     * it generates.
+     */
+    /* TODO: Use this for failure cases as well with a bad value */
+    msg.hdr.size = sizeof(msg.payload.u64);
+    msg.payload.u64 = 0; /* OK */
+    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
+        return -1;
+    }
+
     if (reply_supported) {
         return process_message_reply(dev, &msg);
     }
@@ -396,7 +458,8 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
     size_t fd_num = 0;
     bool do_postcopy = u->postcopy_listen && u->postcopy_fd.handler;
     bool reply_supported = virtio_has_feature(dev->protocol_features,
-                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
+                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
+                                          !do_postcopy;
 
     if (do_postcopy) {
         /* Postcopy has enough differences that it's best done in its own
-- 
2.14.3


* [Qemu-devel] [PATCH v3 16/29] vhost+postcopy: Stash RAMBlock and offset
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (14 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 15/29] vhost+postcopy: Send address back to qemu Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 17/29] vhost+postcopy: Send requests to source for shared pages Dr. David Alan Gilbert (git)
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Stash the RAMBlock and offset for later use when looking up
addresses.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/virtio/trace-events |  1 +
 hw/virtio/vhost-user.c | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 05d18ada77..d7e9e1084b 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -9,6 +9,7 @@ vhost_section(const char *name, int r) "%s:%d"
 # hw/virtio/vhost-user.c
 vhost_user_postcopy_listen(void) ""
 vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
+vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
 
 # hw/virtio/virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index a060442cb9..115ca5cdd3 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -160,6 +160,15 @@ struct vhost_user {
     NotifierWithReturn postcopy_notifier;
     struct PostCopyFD  postcopy_fd;
     uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
+    /* Length of the region_rb and region_rb_offset arrays */
+    size_t             region_rb_len;
+    /* RAMBlock associated with a given region */
+    RAMBlock         **region_rb;
+    /* The offset from the start of the RAMBlock to the start of the
+     * vhost region.
+     */
+    ram_addr_t        *region_rb_offset;
+
     /* True once we've entered postcopy_listen */
     bool               postcopy_listen;
 };
@@ -347,6 +356,17 @@ static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
         msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
     }
 
+    if (u->region_rb_len < dev->mem->nregions) {
+        u->region_rb = g_renew(RAMBlock*, u->region_rb, dev->mem->nregions);
+        u->region_rb_offset = g_renew(ram_addr_t, u->region_rb_offset,
+                                      dev->mem->nregions);
+        memset(&(u->region_rb[u->region_rb_len]), '\0',
+               sizeof(RAMBlock *) * (dev->mem->nregions - u->region_rb_len));
+        memset(&(u->region_rb_offset[u->region_rb_len]), '\0',
+               sizeof(ram_addr_t) * (dev->mem->nregions - u->region_rb_len));
+        u->region_rb_len = dev->mem->nregions;
+    }
+
     for (i = 0; i < dev->mem->nregions; ++i) {
         struct vhost_memory_region *reg = dev->mem->regions + i;
         ram_addr_t offset;
@@ -357,6 +377,12 @@ static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
                                      &offset);
         fd = memory_region_get_fd(mr);
         if (fd > 0) {
+            trace_vhost_user_set_mem_table_withfd(fd_num, mr->name,
+                                                  reg->memory_size,
+                                                  reg->guest_phys_addr,
+                                                  reg->userspace_addr, offset);
+            u->region_rb_offset[i] = offset;
+            u->region_rb[i] = mr->ram_block;
             msg.payload.memory.regions[fd_num].userspace_addr =
                 reg->userspace_addr;
             msg.payload.memory.regions[fd_num].memory_size  = reg->memory_size;
@@ -365,6 +391,9 @@ static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
             msg.payload.memory.regions[fd_num].mmap_offset = offset;
             assert(fd_num < VHOST_MEMORY_MAX_NREGIONS);
             fds[fd_num++] = fd;
+        } else {
+            u->region_rb_offset[i] = 0;
+            u->region_rb[i] = NULL;
         }
     }
 
@@ -1133,6 +1162,11 @@ static int vhost_user_cleanup(struct vhost_dev *dev)
         close(u->slave_fd);
         u->slave_fd = -1;
     }
+    g_free(u->region_rb);
+    u->region_rb = NULL;
+    g_free(u->region_rb_offset);
+    u->region_rb_offset = NULL;
+    u->region_rb_len = 0;
     g_free(u);
     dev->opaque = 0;
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH v3 17/29] vhost+postcopy: Send requests to source for shared pages
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (15 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 16/29] vhost+postcopy: Stash RAMBlock and offset Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-28 10:03   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 18/29] vhost+postcopy: Resolve client address Dr. David Alan Gilbert (git)
                   ` (12 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Send requests back to the source for pages that shared clients fault on.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
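A sketch of the "last RAMBlock" deduplication used above: the block name is only transmitted when the request targets a different RAMBlock than the previous one, which is what the `mis->last_rb` cache and the "Save some space" branch implement. The struct and function names here are illustrative, not QEMU's.

```c
#include <stddef.h>

/* Hypothetical mirror of the mis->last_rb cache: remember the last block
 * a request was sent for, so repeat requests can omit the name. */
struct req_state {
    const char *last_rb;
};

/* Returns the name to send with a page request: the block name when the
 * block changed, NULL when the previously sent name can be reused. */
static const char *name_to_send(struct req_state *s, const char *rb)
{
    if (rb != s->last_rb) {
        s->last_rb = rb;
        return rb;
    }
    return NULL;
}
```

The comparison is by pointer identity, matching the real code's `rb != mis->last_rb` check on RAMBlock pointers rather than a string compare.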
 migration/migration.h    |  2 ++
 migration/postcopy-ram.c | 31 ++++++++++++++++++++++++++++---
 migration/postcopy-ram.h |  3 +++
 migration/trace-events   |  2 ++
 4 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index d158e62cf2..457bf37ec2 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -46,6 +46,8 @@ struct MigrationIncomingState {
     int       userfault_quit_fd;
     QEMUFile *to_src_file;
     QemuMutex rp_mutex;    /* We send replies from multiple threads */
+    /* RAMBlock of last request sent to source */
+    RAMBlock *last_rb;
     void     *postcopy_tmp_page;
     void     *postcopy_tmp_zero_page;
     /* PostCopyFD's for external userfaultfds & handlers of shared memory */
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index d118b78bf5..277ff749a0 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -534,6 +534,31 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
     return 0;
 }
 
+/*
+ * Callback from shared fault handlers to ask for a page,
+ * the page must be specified by a RAMBlock and an offset in that rb
+ */
+int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
+                                 uint64_t client_addr, uint64_t rb_offset)
+{
+    size_t pagesize = qemu_ram_pagesize(rb);
+    uint64_t aligned_rbo = rb_offset & ~(pagesize - 1);
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
+                                       rb_offset);
+    /* TODO: Check bitmap to see if we already have the page */
+    if (rb != mis->last_rb) {
+        mis->last_rb = rb;
+        migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
+                                  aligned_rbo, pagesize);
+    } else {
+        /* Save some space */
+        migrate_send_rp_req_pages(mis, NULL, aligned_rbo, pagesize);
+    }
+    return 0;
+}
+
 /*
  * Handle faults detected by the USERFAULT markings
  */
@@ -544,9 +569,9 @@ static void *postcopy_ram_fault_thread(void *opaque)
     int ret;
     size_t index;
     RAMBlock *rb = NULL;
-    RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
 
     trace_postcopy_ram_fault_thread_entry();
+    mis->last_rb = NULL; /* last RAMBlock we sent part of */
     qemu_sem_post(&mis->fault_thread_sem);
 
     struct pollfd *pfd;
@@ -634,8 +659,8 @@ static void *postcopy_ram_fault_thread(void *opaque)
              * Send the request to the source - we want to request one
              * of our host page sizes (which is >= TPS)
              */
-            if (rb != last_rb) {
-                last_rb = rb;
+            if (rb != mis->last_rb) {
+                mis->last_rb = rb;
                 migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
                                          rb_offset, qemu_ram_pagesize(rb));
             } else {
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index dbc2ee1f2b..4c63f20df4 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -162,5 +162,8 @@ struct PostCopyFD {
  */
 void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
 void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
+/* Callback from shared fault handlers to ask for a page */
+int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
+                                 uint64_t client_addr, uint64_t offset);
 
 #endif
diff --git a/migration/trace-events b/migration/trace-events
index 1e617ad7a6..7c910b5479 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -198,6 +198,8 @@ postcopy_ram_incoming_cleanup_closeuf(void) ""
 postcopy_ram_incoming_cleanup_entry(void) ""
 postcopy_ram_incoming_cleanup_exit(void) ""
 postcopy_ram_incoming_cleanup_join(void) ""
+postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
+
 save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
 ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
-- 
2.14.3

* [Qemu-devel] [PATCH v3 18/29] vhost+postcopy: Resolve client address
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (16 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 17/29] vhost+postcopy: Send requests to source for shared pages Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-03-02  7:29   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 19/29] postcopy: wake shared Dr. David Alan Gilbert (git)
                   ` (11 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Resolve fault addresses read off the client's UFD into a RAMBlock
and offset, and call back to the postcopy code to ask for the page.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
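The resolution loop below can be sketched as a standalone lookup; the arrays stand in for `u->postcopy_client_bases` and the per-region sizes, and the function name is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified model of the scan in vhost_user_postcopy_fault_handler():
 * find the vhost region whose client-address window covers faultaddr and
 * report the offset of the fault within that region.
 * Returns the region index, or -1 when no region matches. */
static int find_fault_region(const uint64_t *client_bases,
                             const uint64_t *sizes, size_t nregions,
                             uint64_t faultaddr, uint64_t *region_offset)
{
    for (size_t i = 0; i < nregions; i++) {
        if (faultaddr >= client_bases[i] &&
            faultaddr - client_bases[i] < sizes[i]) {
            *region_offset = faultaddr - client_bases[i];
            return (int)i;
        }
    }
    return -1;
}
```

The real handler then adds `u->region_rb_offset[i]` to turn the region offset into a RAMBlock offset before calling postcopy_request_shared_page().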
 hw/virtio/trace-events |  3 +++
 hw/virtio/vhost-user.c | 30 +++++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index d7e9e1084b..3afd12cfea 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -7,6 +7,9 @@ vhost_region_add_section_abut(const char *name, uint64_t new_size) "%s: 0x%"PRIx
 vhost_section(const char *name, int r) "%s:%d"
 
 # hw/virtio/vhost-user.c
+vhost_user_postcopy_fault_handler(const char *name, uint64_t fault_address, int nregions) "%s: @0x%"PRIx64" nregions:%d"
+vhost_user_postcopy_fault_handler_loop(int i, uint64_t client_base, uint64_t size) "%d: client 0x%"PRIx64" +0x%"PRIx64
+vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t rb_offset) "%d: region_offset: 0x%"PRIx64" rb_offset:0x%"PRIx64
 vhost_user_postcopy_listen(void) ""
 vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
 vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 115ca5cdd3..4589bfd92e 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -959,7 +959,35 @@ out:
 static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
                                              void *ufd)
 {
-    return 0;
+    struct vhost_dev *dev = pcfd->data;
+    struct vhost_user *u = dev->opaque;
+    struct uffd_msg *msg = ufd;
+    uint64_t faultaddr = msg->arg.pagefault.address;
+    RAMBlock *rb = NULL;
+    uint64_t rb_offset;
+    int i;
+
+    trace_vhost_user_postcopy_fault_handler(pcfd->idstr, faultaddr,
+                                            dev->mem->nregions);
+    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
+        trace_vhost_user_postcopy_fault_handler_loop(i,
+                u->postcopy_client_bases[i], dev->mem->regions[i].memory_size);
+        if (faultaddr >= u->postcopy_client_bases[i]) {
+            /* Offset of the fault address in the vhost region */
+            uint64_t region_offset = faultaddr - u->postcopy_client_bases[i];
+            if (region_offset < dev->mem->regions[i].memory_size) {
+                rb_offset = region_offset + u->region_rb_offset[i];
+                trace_vhost_user_postcopy_fault_handler_found(i,
+                        region_offset, rb_offset);
+                rb = u->region_rb[i];
+                return postcopy_request_shared_page(pcfd, rb, faultaddr,
+                                                    rb_offset);
+            }
+        }
+    }
+    error_report("%s: Failed to find region for fault %" PRIx64,
+                 __func__, faultaddr);
+    return -1;
 }
 
 /*
-- 
2.14.3

* [Qemu-devel] [PATCH v3 19/29] postcopy: wake shared
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (17 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 18/29] vhost+postcopy: Resolve client address Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-03-02  7:44   ` Peter Xu
  2018-03-12 15:44   ` Marc-André Lureau
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 20/29] postcopy: postcopy_notify_shared_wake Dr. David Alan Gilbert (git)
                   ` (10 subsequent siblings)
  29 siblings, 2 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Send a 'wake' request on a userfaultfd for a shared process.
The address in the client's address space is specified together
with the RAMBlock it was resolved to.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
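How the wake range is computed can be sketched like this; the struct name is illustrative, while the real code fills a `struct uffdio_range` and issues `ioctl(UFFDIO_WAKE)` on the client's userfaultfd:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-in for struct uffdio_range. */
struct wake_range {
    uint64_t start;
    uint64_t len;
};

/* Mirror of how postcopy_wake_shared() builds the range handed to
 * UFFDIO_WAKE: the client address rounded down to a page boundary,
 * with a length of one (possibly huge) page. */
static struct wake_range make_wake_range(uint64_t client_addr,
                                         size_t pagesize)
{
    struct wake_range r;
    r.start = client_addr & ~(uint64_t)(pagesize - 1);
    r.len = pagesize;
    return r;
}
```

Note the mask works for any power-of-two page size, which matters because `qemu_ram_pagesize()` may return a hugepage size for hugetlbfs-backed blocks.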
 migration/postcopy-ram.c | 26 ++++++++++++++++++++++++++
 migration/postcopy-ram.h |  6 ++++++
 migration/trace-events   |  1 +
 3 files changed, 33 insertions(+)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 277ff749a0..67deae7e1c 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -534,6 +534,25 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
     return 0;
 }
 
+int postcopy_wake_shared(struct PostCopyFD *pcfd,
+                         uint64_t client_addr,
+                         RAMBlock *rb)
+{
+    size_t pagesize = qemu_ram_pagesize(rb);
+    struct uffdio_range range;
+    int ret;
+    trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb));
+    range.start = client_addr & ~(pagesize - 1);
+    range.len = pagesize;
+    ret = ioctl(pcfd->fd, UFFDIO_WAKE, &range);
+    if (ret) {
+        error_report("%s: Failed to wake: %zx in %s (%s)",
+                     __func__, (size_t)client_addr, qemu_ram_get_idstr(rb),
+                     strerror(errno));
+    }
+    return ret;
+}
+
 /*
  * Callback from shared fault handlers to ask for a page,
  * the page must be specified by a RAMBlock and an offset in that rb
@@ -951,6 +970,13 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
     return NULL;
 }
 
+int postcopy_wake_shared(struct PostCopyFD *pcfd,
+                         uint64_t client_addr,
+                         RAMBlock *rb)
+{
+    assert(0);
+    return -1;
+}
 #endif
 
 /* ------------------------------------------------------------------------- */
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 4c63f20df4..2e3dd844d5 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -162,6 +162,12 @@ struct PostCopyFD {
  */
 void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
 void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
+/* Notify a client ufd that a page is available
+ * Note: The 'client_addr' is in the address space of the client
+ * program, not QEMU
+ */
+int postcopy_wake_shared(struct PostCopyFD *pcfd, uint64_t client_addr,
+                         RAMBlock *rb);
 /* Callback from shared fault handlers to ask for a page */
 int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                  uint64_t client_addr, uint64_t offset);
diff --git a/migration/trace-events b/migration/trace-events
index 7c910b5479..b0acaaa8a0 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -199,6 +199,7 @@ postcopy_ram_incoming_cleanup_entry(void) ""
 postcopy_ram_incoming_cleanup_exit(void) ""
 postcopy_ram_incoming_cleanup_join(void) ""
 postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
+postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
 
 save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
-- 
2.14.3

* [Qemu-devel] [PATCH v3 20/29] postcopy: postcopy_notify_shared_wake
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (18 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 19/29] postcopy: wake shared Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-03-02  7:51   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 21/29] vhost+postcopy: Add vhost waker Dr. David Alan Gilbert (git)
                   ` (9 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a hook to allow a client userfaultfd to be 'woken'
when a page arrives, and a walker that calls that
hook for relevant clients given a RAMBlock and offset.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
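The hook-and-walker pattern added here can be sketched as below; the types and names are illustrative stand-ins for `PostCopyFD`, its `waker` member and `postcopy_notify_shared_wake()`:

```c
#include <stdint.h>
#include <stddef.h>

struct post_copy_fd;
typedef int (*pcfd_wake_fn)(struct post_copy_fd *pcfd, uint64_t offset);

/* Minimal stand-in for QEMU's PostCopyFD. */
struct post_copy_fd {
    int fd;
    pcfd_wake_fn waker;
};

static int wake_calls;

/* A waker that just counts invocations, standing in for the vhost-user
 * waker registered by a client. */
static int counting_waker(struct post_copy_fd *pcfd, uint64_t offset)
{
    (void)pcfd;
    (void)offset;
    wake_calls++;
    return 0;
}

/* Walk every registered fd and call its waker, stopping at the first
 * failure, as postcopy_notify_shared_wake() does over the GArray of
 * remote fds. */
static int notify_shared_wake(struct post_copy_fd *fds, size_t n,
                              uint64_t offset)
{
    for (size_t i = 0; i < n; i++) {
        int ret = fds[i].waker(&fds[i], offset);
        if (ret) {
            return ret;
        }
    }
    return 0;
}
```

Each waker decides for itself whether the (RAMBlock, offset) pair is relevant to its client, so the walker does not need to know which client maps which block.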
 migration/postcopy-ram.c | 16 ++++++++++++++++
 migration/postcopy-ram.h | 10 ++++++++++
 2 files changed, 26 insertions(+)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 67deae7e1c..879711968c 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -824,6 +824,22 @@ static int qemu_ufd_copy_ioctl(int userfault_fd, void *host_addr,
     return ret;
 }
 
+int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset)
+{
+    int i;
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    GArray *pcrfds = mis->postcopy_remote_fds;
+
+    for (i = 0; i < pcrfds->len; i++) {
+        struct PostCopyFD *cur = &g_array_index(pcrfds, struct PostCopyFD, i);
+        int ret = cur->waker(cur, rb, offset);
+        if (ret) {
+            return ret;
+        }
+    }
+    return 0;
+}
+
 /*
  * Place a host page (from) at (host) atomically
  * returns 0 on success
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 2e3dd844d5..2b71cf958e 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -146,6 +146,10 @@ struct PostCopyFD;
 
 /* ufd is a pointer to the struct uffd_msg *TODO: more Portable! */
 typedef int (*pcfdhandler)(struct PostCopyFD *pcfd, void *ufd);
+/* Notification to wake, either on place or on reception of
+ * a fault on something that's already arrived (race)
+ */
+typedef int (*pcfdwake)(struct PostCopyFD *pcfd, RAMBlock *rb, uint64_t offset);
 
 struct PostCopyFD {
     int fd;
@@ -153,6 +157,8 @@ struct PostCopyFD {
     void *data;
     /* Handler to be called whenever we get a poll event */
     pcfdhandler handler;
+    /* Notification to wake shared client */
+    pcfdwake waker;
     /* A string to use in error messages */
     const char *idstr;
 };
@@ -162,6 +168,10 @@ struct PostCopyFD {
  */
 void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
 void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
+/* Call each of the registered shared 'waker's, telling them of
+ * availability of a block.
+ */
+int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset);
 /* Notify a client ufd that a page is available
  * Note: The 'client_addr' is in the address space of the client
  * program, not QEMU
-- 
2.14.3

* [Qemu-devel] [PATCH v3 21/29] vhost+postcopy: Add vhost waker
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (19 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 20/29] postcopy: postcopy_notify_shared_wake Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-03-02  7:55   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 22/29] vhost+postcopy: Call wakeups Dr. David Alan Gilbert (git)
                   ` (8 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Register a waker function in vhost-user code to be notified when
pages arrive or a fault is raised for a page that has already been
received.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
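The waker's translation is the inverse of the fault handler's: given a RAMBlock offset, find a region whose RAMBlock slice covers it and compute the matching client address. A sketch, with arrays standing in for `u->region_rb_offset`, the region sizes and `u->postcopy_client_bases` (names hypothetical):

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified model of vhost_user_postcopy_waker(): map an offset within
 * a RAMBlock to the client's address space.  Sets *found to 0 and
 * returns 0 when no region covers the offset, in which case the real
 * code just reports "nomatch" and carries on. */
static uint64_t rb_offset_to_client(const uint64_t *rb_offsets,
                                    const uint64_t *sizes,
                                    const uint64_t *client_bases,
                                    size_t nregions, uint64_t offset,
                                    int *found)
{
    for (size_t i = 0; i < nregions; i++) {
        if (offset >= rb_offsets[i] &&
            offset < rb_offsets[i] + sizes[i]) {
            *found = 1;
            return (offset - rb_offsets[i]) + client_bases[i];
        }
    }
    *found = 0;
    return 0;
}
```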
 hw/virtio/trace-events |  3 +++
 hw/virtio/vhost-user.c | 30 ++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 3afd12cfea..fe5e0ff856 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -13,6 +13,9 @@ vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t
 vhost_user_postcopy_listen(void) ""
 vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
 vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
+vhost_user_postcopy_waker(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
+vhost_user_postcopy_waker_found(uint64_t client_addr) "0x%"PRIx64
+vhost_user_postcopy_waker_nomatch(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
 
 # hw/virtio/virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 4589bfd92e..74807091a0 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -990,6 +990,35 @@ static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
     return -1;
 }
 
+static int vhost_user_postcopy_waker(struct PostCopyFD *pcfd, RAMBlock *rb,
+                                     uint64_t offset)
+{
+    struct vhost_dev *dev = pcfd->data;
+    struct vhost_user *u = dev->opaque;
+    int i;
+
+    trace_vhost_user_postcopy_waker(qemu_ram_get_idstr(rb), offset);
+
+    if (!u) {
+        return 0;
+    }
+    /* Translate the offset into an address in the client's address space */
+    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
+        if (u->region_rb[i] == rb &&
+            offset >= u->region_rb_offset[i] &&
+            offset < (u->region_rb_offset[i] +
+                      dev->mem->regions[i].memory_size)) {
+            uint64_t client_addr = (offset - u->region_rb_offset[i]) +
+                                   u->postcopy_client_bases[i];
+            trace_vhost_user_postcopy_waker_found(client_addr);
+            return postcopy_wake_shared(pcfd, client_addr, rb);
+        }
+    }
+
+    trace_vhost_user_postcopy_waker_nomatch(qemu_ram_get_idstr(rb), offset);
+    return 0;
+}
+
 /*
  * Called at the start of an inbound postcopy on reception of the
  * 'advise' command.
@@ -1035,6 +1064,7 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
     u->postcopy_fd.fd = ufd;
     u->postcopy_fd.data = dev;
     u->postcopy_fd.handler = vhost_user_postcopy_fault_handler;
+    u->postcopy_fd.waker = vhost_user_postcopy_waker;
     u->postcopy_fd.idstr = "vhost-user"; /* Need to find unique name */
     postcopy_register_shared_ufd(&u->postcopy_fd);
     return 0;
-- 
2.14.3

* [Qemu-devel] [PATCH v3 22/29] vhost+postcopy: Call wakeups
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (20 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 21/29] vhost+postcopy: Add vhost waker Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-03-02  8:05   ` Peter Xu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 23/29] libvhost-user: mprotect & madvises for postcopy Dr. David Alan Gilbert (git)
                   ` (7 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Cause the vhost-user client to be woken up whenever:
  a) We place a page in postcopy mode
  b) We get a fault and the page has already been received

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
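The short-cut this patch adds in case (b) relies on the receive bitmap: a shared fault on an already-received page is answered with a wake instead of a fresh request to the source. A toy model of that check follows; QEMU's real bitmap (queried via `ramblock_recv_bitmap_test_byte_offset`) is packed per target page, while this sketch uses one byte per page for simplicity:

```c
#include <stdint.h>

/* Illustrative page size; QEMU uses the RAMBlock's page size. */
#define TOY_PAGESIZE 4096

/* Mark a page as received, keyed by any offset within it. */
static void recv_mark(uint8_t *bitmap, uint64_t rb_offset)
{
    bitmap[rb_offset / TOY_PAGESIZE] = 1;
}

/* Test whether the page containing rb_offset has been received. */
static int recv_test(const uint8_t *bitmap, uint64_t rb_offset)
{
    return bitmap[rb_offset / TOY_PAGESIZE];
}
```

Any offset within a received page tests positive, which is why postcopy_request_shared_page() can test the aligned offset and wake at the client's (unaligned) fault address.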
 migration/postcopy-ram.c | 14 ++++++++++----
 migration/trace-events   |  1 +
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 879711968c..13561703b5 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -566,7 +566,11 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
 
     trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
                                        rb_offset);
-    /* TODO: Check bitmap to see if we already have the page */
+    if (ramblock_recv_bitmap_test_byte_offset(rb, aligned_rbo)) {
+        trace_postcopy_request_shared_page_present(pcfd->idstr,
+                                        qemu_ram_get_idstr(rb), rb_offset);
+        return postcopy_wake_shared(pcfd, client_addr, rb);
+    }
     if (rb != mis->last_rb) {
         mis->last_rb = rb;
         migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
@@ -863,7 +867,8 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
     }
 
     trace_postcopy_place_page(host);
-    return 0;
+    return postcopy_notify_shared_wake(rb,
+                                       qemu_ram_block_host_offset(rb, host));
 }
 
 /*
@@ -887,6 +892,9 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
 
             return -e;
         }
+        return postcopy_notify_shared_wake(rb,
+                                           qemu_ram_block_host_offset(rb,
+                                                                      host));
     } else {
         /* The kernel can't use UFFDIO_ZEROPAGE for hugepages */
         if (!mis->postcopy_tmp_zero_page) {
@@ -906,8 +914,6 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
         return postcopy_place_page(mis, host, mis->postcopy_tmp_zero_page,
                                    rb);
     }
-
-    return 0;
 }
 
 /*
diff --git a/migration/trace-events b/migration/trace-events
index b0acaaa8a0..1e353a317f 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -199,6 +199,7 @@ postcopy_ram_incoming_cleanup_entry(void) ""
 postcopy_ram_incoming_cleanup_exit(void) ""
 postcopy_ram_incoming_cleanup_join(void) ""
 postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
+postcopy_request_shared_page_present(const char *sharer, const char *rb, uint64_t rb_offset) "%s already %s offset 0x%"PRIx64
 postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
 
 save_xbzrle_page_skipping(void) ""
-- 
2.14.3

* [Qemu-devel] [PATCH v3 23/29] libvhost-user: mprotect & madvises for postcopy
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (21 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 22/29] vhost+postcopy: Call wakeups Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 24/29] vhost-user: Add VHOST_USER_POSTCOPY_END message Dr. David Alan Gilbert (git)
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Clear the area and turn off THP.
Keep the area PROT_NONE until after we've registered it with
userfault, to catch any unexpected accesses.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 46 +++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index e02e5d6f46..1b224af706 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -454,7 +454,7 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
     int i;
     VhostUserMemory *memory = &vmsg->payload.memory;
     dev->nregions = memory->nregions;
-    /* TODO: Postcopy specific code */
+
     DPRINT("Nregions: %d\n", memory->nregions);
     for (i = 0; i < dev->nregions; i++) {
         void *mmap_addr;
@@ -478,9 +478,12 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
 
         /* We don't use offset argument of mmap() since the
          * mapped address has to be page aligned, and we use huge
-         * pages.  */
+         * pages.
+         * In postcopy we're using PROT_NONE here to catch anyone
+         * accessing it before we register it with userfault
+         */
         mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
-                         PROT_READ | PROT_WRITE, MAP_SHARED,
+                         PROT_NONE, MAP_SHARED,
                          vmsg->fds[i], 0);
 
         if (mmap_addr == MAP_FAILED) {
@@ -519,12 +522,38 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
     /* OK, now we can go and register the memory and generate faults */
     for (i = 0; i < dev->nregions; i++) {
         VuDevRegion *dev_region = &dev->regions[i];
+        int ret;
 #ifdef UFFDIO_REGISTER
         /* We should already have an open ufd. Mark each memory
          * range as ufd.
-         * Note: Do we need any madvises? Well it's not been accessed
-         * yet, still probably need no THP to be safe, discard to be safe?
+         * Discard any mapping we have here; note I can't use MADV_REMOVE
+         * or fallocate to make the hole since I don't want to lose
+         * data that's already arrived in the shared process.
+         * TODO: How to handle hugepages?
          */
+        ret = madvise((void *)dev_region->mmap_addr,
+                      dev_region->size + dev_region->mmap_offset,
+                      MADV_DONTNEED);
+        if (ret) {
+            fprintf(stderr,
+                    "%s: Failed to madvise(DONTNEED) region %d: %s\n",
+                    __func__, i, strerror(errno));
+        }
+        /* Turn off transparent hugepages so we don't lose wakeups
+         * in neighbouring pages.
+         * TODO: Turn this back on later.
+         */
+        ret = madvise((void *)dev_region->mmap_addr,
+                      dev_region->size + dev_region->mmap_offset,
+                      MADV_NOHUGEPAGE);
+        if (ret) {
+            /* Note: This can happen legally on kernels that are configured
+             * without madvise'able hugepages
+             */
+            fprintf(stderr,
+                    "%s: Failed to madvise(NOHUGEPAGE) region %d: %s\n",
+                    __func__, i, strerror(errno));
+        }
         struct uffdio_register reg_struct;
         reg_struct.range.start = (uintptr_t)dev_region->mmap_addr;
         reg_struct.range.len = dev_region->size + dev_region->mmap_offset;
@@ -546,6 +575,13 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
         }
         DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
                 __func__, i, reg_struct.range.start, reg_struct.range.len);
+        /* Now it's registered we can let the client at it */
+        if (mprotect((void *)dev_region->mmap_addr,
+                     dev_region->size + dev_region->mmap_offset,
+                     PROT_READ | PROT_WRITE)) {
+            vu_panic(dev, "failed to mprotect region %d for postcopy", i);
+            return false;
+        }
         /* TODO: Stash 'zero' support flags somewhere */
 #endif
     }
-- 
2.14.3

* [Qemu-devel] [PATCH v3 24/29] vhost-user: Add VHOST_USER_POSTCOPY_END message
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (22 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 23/29] libvhost-user: mprotect & madvises for postcopy Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-26 20:27   ` Michael S. Tsirkin
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 25/29] vhost+postcopy: Wire up POSTCOPY_END notify Dr. David Alan Gilbert (git)
                   ` (5 subsequent siblings)
  29 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

This message is sent just before the end of postcopy to get the
client to stop using userfault, since we won't respond to any more
requests.  It should close its userfaultfd so that any further pages
get mapped to the backing file automatically by the kernel, since
at this point we know we've received everything.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 23 +++++++++++++++++++++++
 contrib/libvhost-user/libvhost-user.h |  1 +
 docs/interop/vhost-user.txt           |  8 ++++++++
 hw/virtio/vhost-user.c                |  1 +
 4 files changed, 33 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 1b224af706..1f988ab787 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -99,6 +99,7 @@ vu_request_to_string(unsigned int req)
         REQ(VHOST_USER_SET_CONFIG),
         REQ(VHOST_USER_POSTCOPY_ADVISE),
         REQ(VHOST_USER_POSTCOPY_LISTEN),
+        REQ(VHOST_USER_POSTCOPY_END),
         REQ(VHOST_USER_MAX),
     };
 #undef REQ
@@ -1095,6 +1096,26 @@ vu_set_postcopy_listen(VuDev *dev, VhostUserMsg *vmsg)
     vmsg->payload.u64 = 0; /* Success */
     return true;
 }
+
+static bool
+vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
+{
+    DPRINT("%s: Entry\n", __func__);
+    dev->postcopy_listening = false;
+    if (dev->postcopy_ufd > 0) {
+        close(dev->postcopy_ufd);
+        dev->postcopy_ufd = -1;
+        DPRINT("%s: Done close\n", __func__);
+    }
+
+    vmsg->fd_num = 0;
+    vmsg->payload.u64 = 0;
+    vmsg->size = sizeof(vmsg->payload.u64);
+    vmsg->flags = VHOST_USER_VERSION | VHOST_USER_REPLY_MASK;
+    DPRINT("%s: exit\n", __func__);
+    return true;
+}
+
 static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -1170,6 +1191,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
         return vu_set_postcopy_advise(dev, vmsg);
     case VHOST_USER_POSTCOPY_LISTEN:
         return vu_set_postcopy_listen(dev, vmsg);
+    case VHOST_USER_POSTCOPY_END:
+        return vu_set_postcopy_end(dev, vmsg);
     default:
         vmsg_close_fds(vmsg);
         vu_panic(dev, "Unhandled request: %d", vmsg->request);
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index fcba53c3c3..9696b89f6e 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -84,6 +84,7 @@ typedef enum VhostUserRequest {
     VHOST_USER_SET_CONFIG = 25,
     VHOST_USER_POSTCOPY_ADVISE  = 26,
     VHOST_USER_POSTCOPY_LISTEN  = 27,
+    VHOST_USER_POSTCOPY_END     = 28,
     VHOST_USER_MAX
 } VhostUserRequest;
 
diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index 5bbcab2cc4..4bf7d8ef99 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -697,6 +697,14 @@ Master message types
 
       Master advises slave that a transition to postcopy mode has happened.
 
+ * VHOST_USER_POSTCOPY_END
+      Id: 28
+      Slave payload: u64
+
+      Master advises that postcopy migration has now completed.  The
+      slave must disable the userfaultfd. The response is an acknowledgement
+      only.
+
 Slave message types
 -------------------
 
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 74807091a0..cf7923b25f 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -78,6 +78,7 @@ typedef enum VhostUserRequest {
     VHOST_USER_SET_CONFIG = 25,
     VHOST_USER_POSTCOPY_ADVISE  = 26,
     VHOST_USER_POSTCOPY_LISTEN  = 27,
+    VHOST_USER_POSTCOPY_END     = 28,
     VHOST_USER_MAX
 } VhostUserRequest;
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 75+ messages in thread
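[Editorial note: the slave-side handling of VHOST_USER_POSTCOPY_END is small enough to model in isolation. A minimal sketch, with a hypothetical `MiniMsg` struct standing in for `VhostUserMsg`; the two flag constants match vhost-user's wire values:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define VHOST_USER_VERSION    0x1
#define VHOST_USER_REPLY_MASK (0x1 << 2)

/* Hypothetical trimmed-down message: just the fields the handler touches */
typedef struct {
    uint32_t flags;
    uint32_t size;
    uint64_t u64;
    int fd_num;
} MiniMsg;

/* Mirrors vu_set_postcopy_end(): stop listening, close the ufd, and fill
 * in a success acknowledgement for the master. */
static bool postcopy_end(int *postcopy_ufd, bool *listening, MiniMsg *reply)
{
    *listening = false;
    if (*postcopy_ufd >= 0) {          /* note: the patch itself tests > 0 */
        close(*postcopy_ufd);
        *postcopy_ufd = -1;
    }
    reply->fd_num = 0;
    reply->u64 = 0;                    /* success */
    reply->size = sizeof(reply->u64);
    reply->flags = VHOST_USER_VERSION | VHOST_USER_REPLY_MASK;
    return true;                       /* a reply must be sent */
}
```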

* [Qemu-devel] [PATCH v3 25/29] vhost+postcopy: Wire up POSTCOPY_END notify
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (23 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 24/29] vhost-user: Add VHOST_USER_POSTCOPY_END message Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 26/29] vhost: Huge page align and merge Dr. David Alan Gilbert (git)
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Wire up a call to the VHOST_USER_POSTCOPY_END message for the vhost
clients right before we ask the listener thread to shut down.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/virtio/trace-events   |  2 ++
 hw/virtio/vhost-user.c   | 34 ++++++++++++++++++++++++++++++++++
 migration/postcopy-ram.c |  5 +++++
 migration/postcopy-ram.h |  1 +
 4 files changed, 42 insertions(+)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index fe5e0ff856..857c495e65 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -7,6 +7,8 @@ vhost_region_add_section_abut(const char *name, uint64_t new_size) "%s: 0x%"PRIx
 vhost_section(const char *name, int r) "%s:%d"
 
 # hw/virtio/vhost-user.c
+vhost_user_postcopy_end_entry(void) ""
+vhost_user_postcopy_end_exit(void) ""
 vhost_user_postcopy_fault_handler(const char *name, uint64_t fault_address, int nregions) "%s: @0x%"PRIx64" nregions:%d"
 vhost_user_postcopy_fault_handler_loop(int i, uint64_t client_base, uint64_t size) "%d: client 0x%"PRIx64" +0x%"PRIx64
 vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t rb_offset) "%d: region_offset: 0x%"PRIx64" rb_offset:0x%"PRIx64
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index cf7923b25f..26c4ea5f31 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1098,6 +1098,37 @@ static int vhost_user_postcopy_listen(struct vhost_dev *dev, Error **errp)
     return 0;
 }
 
+/*
+ * Called at the end of postcopy
+ */
+static int vhost_user_postcopy_end(struct vhost_dev *dev, Error **errp)
+{
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_POSTCOPY_END,
+        .hdr.flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY_MASK,
+    };
+    int ret;
+    struct vhost_user *u = dev->opaque;
+
+    trace_vhost_user_postcopy_end_entry();
+    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
+        error_setg(errp, "Failed to send postcopy_end to vhost");
+        return -1;
+    }
+
+    ret = process_message_reply(dev, &msg);
+    if (ret) {
+        error_setg(errp, "Failed to receive reply to postcopy_end");
+        return ret;
+    }
+    postcopy_unregister_shared_ufd(&u->postcopy_fd);
+    u->postcopy_fd.handler = NULL;
+
+    trace_vhost_user_postcopy_end_exit();
+
+    return 0;
+}
+
 static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
                                         void *opaque)
 {
@@ -1123,6 +1154,9 @@ static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
     case POSTCOPY_NOTIFY_INBOUND_LISTEN:
         return vhost_user_postcopy_listen(dev, pnd->errp);
 
+    case POSTCOPY_NOTIFY_INBOUND_END:
+        return vhost_user_postcopy_end(dev, pnd->errp);
+
     default:
         /* We ignore notifications we don't know */
         break;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 13561703b5..14196d0654 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -414,7 +414,12 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
 
     if (mis->have_fault_thread) {
         uint64_t tmp64;
+        Error *local_err = NULL;
 
+        if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_END, &local_err)) {
+            error_report_err(local_err);
+            return -1;
+        }
         if (qemu_ram_foreach_block(cleanup_range, mis)) {
             return -1;
         }
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 2b71cf958e..cc9215482e 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -130,6 +130,7 @@ enum PostcopyNotifyReason {
     POSTCOPY_NOTIFY_PROBE = 0,
     POSTCOPY_NOTIFY_INBOUND_ADVISE,
     POSTCOPY_NOTIFY_INBOUND_LISTEN,
+    POSTCOPY_NOTIFY_INBOUND_END,
 };
 
 struct PostcopyNotifyData {
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH v3 26/29] vhost: Huge page align and merge
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (24 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 25/29] vhost+postcopy: Wire up POSTCOPY_END notify Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 27/29] postcopy: Allow shared memory Dr. David Alan Gilbert (git)
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Align sections to the RAMBlock's page size, and adjust the merging code
to deal with partial overlaps caused by that alignment.

This is needed for postcopy so that we can place/fetch whole hugepages
when under userfault.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/virtio/trace-events |  3 ++-
 hw/virtio/vhost.c      | 66 ++++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 58 insertions(+), 11 deletions(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 857c495e65..1422ff03ab 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -3,7 +3,8 @@
 # hw/virtio/vhost.c
 vhost_commit(bool started, bool changed) "Started: %d Changed: %d"
 vhost_region_add_section(const char *name, uint64_t gpa, uint64_t size, uint64_t host) "%s: 0x%"PRIx64"+0x%"PRIx64" @ 0x%"PRIx64
-vhost_region_add_section_abut(const char *name, uint64_t new_size) "%s: 0x%"PRIx64
+vhost_region_add_section_merge(const char *name, uint64_t new_size, uint64_t gpa, uint64_t owr) "%s: size: 0x%"PRIx64 " gpa: 0x%"PRIx64 " owr: 0x%"PRIx64
+vhost_region_add_section_aligned(const char *name, uint64_t gpa, uint64_t size, uint64_t host) "%s: 0x%"PRIx64"+0x%"PRIx64" @ 0x%"PRIx64
 vhost_section(const char *name, int r) "%s:%d"
 
 # hw/virtio/vhost-user.c
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 4a44e6e6bf..f39450d39e 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -518,10 +518,28 @@ static void vhost_region_add_section(struct vhost_dev *dev,
     uint64_t mrs_gpa = section->offset_within_address_space;
     uintptr_t mrs_host = (uintptr_t)memory_region_get_ram_ptr(section->mr) +
                          section->offset_within_region;
+    RAMBlock *mrs_rb = section->mr->ram_block;
+    size_t mrs_page = qemu_ram_pagesize(mrs_rb);
 
     trace_vhost_region_add_section(section->mr->name, mrs_gpa, mrs_size,
                                    mrs_host);
 
+    /* Round the section to its page size */
+    /* First align the start down to a page boundary */
+    uint64_t alignage = mrs_host & (mrs_page - 1);
+    if (alignage) {
+        mrs_host -= alignage;
+        mrs_size += alignage;
+        mrs_gpa  -= alignage;
+    }
+    /* Now align the size up to a page boundary */
+    alignage = mrs_size & (mrs_page - 1);
+    if (alignage) {
+        mrs_size += mrs_page - alignage;
+    }
+    trace_vhost_region_add_section_aligned(section->mr->name, mrs_gpa, mrs_size,
+                                           mrs_host);
+
     if (dev->n_tmp_sections) {
         /* Since we already have at least one section, lets see if
          * this extends it; since we're scanning in order, we only
@@ -538,18 +556,46 @@ static void vhost_region_add_section(struct vhost_dev *dev,
                         prev_sec->offset_within_region;
         uint64_t prev_host_end   = range_get_last(prev_host_start, prev_size);
 
-        if (prev_gpa_end + 1 == mrs_gpa &&
-            prev_host_end + 1 == mrs_host &&
-            section->mr == prev_sec->mr &&
-            (!dev->vhost_ops->vhost_backend_can_merge ||
-                dev->vhost_ops->vhost_backend_can_merge(dev,
+        if (mrs_gpa <= (prev_gpa_end + 1)) {
+            /* OK, looks like overlapping/intersecting - it's possible that
+             * the rounding to page sizes has made them overlap, but they should
+             * match up in the same RAMBlock if they do.
+             */
+            if (mrs_gpa < prev_gpa_start) {
+                error_report("%s: Section rounded to %"PRIx64
+                             " prior to previous %"PRIx64,
+                             __func__, mrs_gpa, prev_gpa_start);
+                /* A way to cleanly fail here would be better */
+                return;
+            }
+            /* Offset from the start of the previous GPA to this GPA */
+            size_t offset = mrs_gpa - prev_gpa_start;
+
+            if (prev_host_start + offset == mrs_host &&
+                section->mr == prev_sec->mr &&
+                (!dev->vhost_ops->vhost_backend_can_merge ||
+                 dev->vhost_ops->vhost_backend_can_merge(dev,
                     mrs_host, mrs_size,
                     prev_host_start, prev_size))) {
-            /* The two sections abut */
-            need_add = false;
-            prev_sec->size = int128_add(prev_sec->size, section->size);
-            trace_vhost_region_add_section_abut(section->mr->name,
-                                                mrs_size + prev_size);
+                uint64_t max_end = MAX(prev_host_end, mrs_host + mrs_size);
+                need_add = false;
+                prev_sec->offset_within_address_space =
+                    MIN(prev_gpa_start, mrs_gpa);
+                prev_sec->offset_within_region =
+                    MIN(prev_host_start, mrs_host) -
+                    (uintptr_t)memory_region_get_ram_ptr(prev_sec->mr);
+                prev_sec->size = int128_make64(max_end - MIN(prev_host_start,
+                                               mrs_host));
+                trace_vhost_region_add_section_merge(section->mr->name,
+                                        int128_get64(prev_sec->size),
+                                        prev_sec->offset_within_address_space,
+                                        prev_sec->offset_within_region);
+            } else {
+                error_report("%s: Overlapping but not coherent sections "
+                             "at %"PRIx64,
+                             __func__, mrs_gpa);
+                return;
+            }
         }
     }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 75+ messages in thread
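[Editorial note: the alignment arithmetic in the patch above is self-contained and worth checking by hand. A sketch of the round-down/round-up step; the function name is hypothetical, and `page` must be a power of two:]

```c
#include <assert.h>
#include <stdint.h>

/* Align [*host, *host + *size) out to the RAMBlock's page size, pulling
 * the GPA back by the same slack, as vhost_region_add_section() now does:
 * first the start is rounded down, then the size is rounded up. */
static void align_section(uint64_t *gpa, uint64_t *size, uint64_t *host,
                          uint64_t page)
{
    uint64_t alignage = *host & (page - 1);  /* slack below the start */
    if (alignage) {
        *host -= alignage;
        *size += alignage;
        *gpa  -= alignage;
    }
    alignage = *size & (page - 1);           /* slack above the end */
    if (alignage) {
        *size += page - alignage;
    }
}
```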

* [Qemu-devel] [PATCH v3 27/29] postcopy: Allow shared memory
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (25 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 26/29] vhost: Huge page align and merge Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 28/29] libvhost-user: Claim support for postcopy Dr. David Alan Gilbert (git)
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Now that we have the mechanisms in here, allow shared memory in a
postcopy.

Note that QEMU can't tell who all the users of a shared region are,
and thus can't tell whether all of them have appropriate support for
postcopy.  Devices that explicitly support shared memory
(e.g. vhost-user) must check, but that doesn't stop more unusual
configurations from causing problems.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
 migration/postcopy-ram.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 14196d0654..b683e24d0e 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -215,12 +215,6 @@ static int test_ramblock_postcopiable(const char *block_name, void *host_addr,
     RAMBlock *rb = qemu_ram_block_by_name(block_name);
     size_t pagesize = qemu_ram_pagesize(rb);
 
-    if (qemu_ram_is_shared(rb)) {
-        error_report("Postcopy on shared RAM (%s) is not yet supported",
-                     block_name);
-        return 1;
-    }
-
     if (length % pagesize) {
         error_report("Postcopy requires RAM blocks to be a page size multiple,"
                      " block %s is 0x" RAM_ADDR_FMT " bytes with a "
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH v3 28/29] libvhost-user: Claim support for postcopy
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (26 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 27/29] postcopy: Allow shared memory Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 29/29] postcopy shared docs Dr. David Alan Gilbert (git)
  2018-02-27 14:01 ` [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Michael S. Tsirkin
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Tell QEMU we understand the protocol features needed for postcopy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 1f988ab787..8acee9628d 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -185,6 +185,35 @@ vmsg_close_fds(VhostUserMsg *vmsg)
     }
 }
 
+/* A test to see if we have userfault available */
+static bool
+have_userfault(void)
+{
+#if defined(__linux__) && defined(__NR_userfaultfd) &&\
+        defined(UFFD_FEATURE_MISSING_SHMEM) &&\
+        defined(UFFD_FEATURE_MISSING_HUGETLBFS)
+    /* Now test the kernel we're running on really has the features */
+    int ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+    struct uffdio_api api_struct;
+    if (ufd < 0) {
+        return false;
+    }
+
+    api_struct.api = UFFD_API;
+    api_struct.features = UFFD_FEATURE_MISSING_SHMEM |
+                          UFFD_FEATURE_MISSING_HUGETLBFS;
+    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
+        close(ufd);
+        return false;
+    }
+    close(ufd);
+    return true;
+
+#else
+    return false;
+#endif
+}
+
 static bool
 vu_message_read(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
 {
@@ -938,6 +967,10 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg *vmsg)
     uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD |
                         1ULL << VHOST_USER_PROTOCOL_F_SLAVE_REQ;
 
+    if (have_userfault()) {
+        features |= 1ULL << VHOST_USER_PROTOCOL_F_PAGEFAULT;
+    }
+
     if (dev->iface->get_protocol_features) {
         features |= dev->iface->get_protocol_features(dev);
     }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH v3 29/29] postcopy shared docs
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (27 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 28/29] libvhost-user: Claim support for postcopy Dr. David Alan Gilbert (git)
@ 2018-02-16 13:16 ` Dr. David Alan Gilbert (git)
  2018-02-27 14:01 ` [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Michael S. Tsirkin
  29 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2018-02-16 13:16 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo, mst
  Cc: quintela, aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add some notes to the migration documentation for shared memory
postcopy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 docs/devel/migration.rst | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index bf97080dac..49fb44f606 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -577,3 +577,44 @@ Postcopy now works with hugetlbfs backed memory:
      hugepages works well, however 1GB hugepages are likely to be problematic
      since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
      and until the full page is transferred the destination thread is blocked.
+
+Postcopy with shared memory
+---------------------------
+
+Postcopy migration with shared memory needs explicit support from the other
+processes that share the memory and from QEMU. There are restrictions on the
+types of shared memory that userfault can support.
+
+The Linux kernel userfault support works on `/dev/shm` memory and on `hugetlbfs`
+(although the kernel doesn't provide an equivalent to `madvise(MADV_DONTNEED)`
+for hugetlbfs which may be a problem in some configurations).
+
+The vhost-user code in QEMU supports clients that have Postcopy support,
+and the `vhost-user-bridge` (in `tests/`) and the DPDK package have changes
+to support postcopy.
+
+The client needs to open a userfaultfd and register the areas
+of memory that it maps with userfault.  The client must then pass the
+userfaultfd back to QEMU together with a mapping table that allows
+fault addresses in the client's address space to be converted back to
+RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
+fault-thread and page requests are made on behalf of the client by QEMU.
+QEMU performs 'wake' operations on the client's userfaultfd to allow it
+to continue after a page has arrived.
+
+.. note::
+  There are two future improvements that would be nice:
+    a) Some way to make QEMU ignorant of the addresses in the client's
+       address space
+    b) Avoiding the need for QEMU to perform ufd-wake calls after the
+       pages have arrived
+
+Retro-fitting postcopy to existing clients is possible:
+  a) A mechanism is needed for the registration with userfault as above,
+     and the registration needs to be coordinated with the phases of
+     postcopy.  In vhost-user extra messages are added to the existing
+     control channel.
+  b) Any thread that can block due to guest memory accesses must be
+     identified and the implication understood; for example if the
+     guest memory access is made while holding a lock then all other
+     threads waiting for that lock will also be blocked.
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 75+ messages in thread
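[Editorial note: the client-side flow the docs describe (open a userfaultfd, handshake with UFFDIO_API, register mapped ranges) can be sketched in a few lines. The helper name is hypothetical, error handling is collapsed to -1, and unprivileged userfaultfd may be disabled by sysctl on some hosts:]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#if defined(__linux__) && defined(__NR_userfaultfd)
#include <linux/userfaultfd.h>
#endif

/* Hypothetical sketch: open a userfaultfd and register one mapped range
 * with it, as a vhost-user client would do for each region before handing
 * the fd back to QEMU along with the address-mapping table. */
static int register_range_with_userfault(void *addr, size_t len)
{
#if defined(__linux__) && defined(__NR_userfaultfd)
    int ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (ufd < 0) {
        return -1;
    }
    struct uffdio_api api = { .api = UFFD_API, .features = 0 };
    if (ioctl(ufd, UFFDIO_API, &api)) {
        close(ufd);
        return -1;
    }
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(ufd, UFFDIO_REGISTER, &reg)) {
        close(ufd);
        return -1;
    }
    return ufd;   /* the caller would pass this back to QEMU */
#else
    (void)addr; (void)len;
    return -1;
#endif
}
```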

* Re: [Qemu-devel] [PATCH v3 24/29] vhost-user: Add VHOST_USER_POSTCOPY_END message
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 24/29] vhost-user: Add VHOST_USER_POSTCOPY_END message Dr. David Alan Gilbert (git)
@ 2018-02-26 20:27   ` Michael S. Tsirkin
  2018-02-27 10:09     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2018-02-26 20:27 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:20PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> This message is sent just before the end of postcopy to get the
> client to stop using userfault since we wont respond to any more
> requests.  It should close userfaultfd so that any other pages
> get mapped to the backing file automatically by the kernel, since
> at this point we know we've received everything.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 23 +++++++++++++++++++++++
>  contrib/libvhost-user/libvhost-user.h |  1 +
>  docs/interop/vhost-user.txt           |  8 ++++++++
>  hw/virtio/vhost-user.c                |  1 +
>  4 files changed, 33 insertions(+)
> 
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index 1b224af706..1f988ab787 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -99,6 +99,7 @@ vu_request_to_string(unsigned int req)
>          REQ(VHOST_USER_SET_CONFIG),
>          REQ(VHOST_USER_POSTCOPY_ADVISE),
>          REQ(VHOST_USER_POSTCOPY_LISTEN),
> +        REQ(VHOST_USER_POSTCOPY_END),
>          REQ(VHOST_USER_MAX),
>      };
>  #undef REQ
> @@ -1095,6 +1096,26 @@ vu_set_postcopy_listen(VuDev *dev, VhostUserMsg *vmsg)
>      vmsg->payload.u64 = 0; /* Success */
>      return true;
>  }
> +
> +static bool
> +vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
> +{
> +    DPRINT("%s: Entry\n", __func__);
> +    dev->postcopy_listening = false;
> +    if (dev->postcopy_ufd > 0) {
> +        close(dev->postcopy_ufd);
> +        dev->postcopy_ufd = -1;
> +        DPRINT("%s: Done close\n", __func__);
> +    }
> +
> +    vmsg->fd_num = 0;
> +    vmsg->payload.u64 = 0;
> +    vmsg->size = sizeof(vmsg->payload.u64);
> +    vmsg->flags = VHOST_USER_VERSION |  VHOST_USER_REPLY_MASK;
> +    DPRINT("%s: exit\n", __func__);
> +    return true;
> +}
> +
>  static bool
>  vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>  {
> @@ -1170,6 +1191,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>          return vu_set_postcopy_advise(dev, vmsg);
>      case VHOST_USER_POSTCOPY_LISTEN:
>          return vu_set_postcopy_listen(dev, vmsg);
> +    case VHOST_USER_POSTCOPY_END:
> +        return vu_set_postcopy_end(dev, vmsg);
>      default:
>          vmsg_close_fds(vmsg);
>          vu_panic(dev, "Unhandled request: %d", vmsg->request);
> diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> index fcba53c3c3..9696b89f6e 100644
> --- a/contrib/libvhost-user/libvhost-user.h
> +++ b/contrib/libvhost-user/libvhost-user.h
> @@ -84,6 +84,7 @@ typedef enum VhostUserRequest {
>      VHOST_USER_SET_CONFIG = 25,
>      VHOST_USER_POSTCOPY_ADVISE  = 26,
>      VHOST_USER_POSTCOPY_LISTEN  = 27,
> +    VHOST_USER_POSTCOPY_END     = 28,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> index 5bbcab2cc4..4bf7d8ef99 100644
> --- a/docs/interop/vhost-user.txt
> +++ b/docs/interop/vhost-user.txt
> @@ -697,6 +697,14 @@ Master message types
>  
>        Master advises slave that a transition to postcopy mode has happened.
>  
> + * VHOST_USER_POSTCOPY_END
> +      Id: 28
> +      Slave payload: u64
> +
> +      Master advises that postcopy migration has now completed.  The
> +      slave must disable the userfaultfd. The response is an acknowledgement
> +      only.
> +
>  Slave message types
>  -------------------
>  

Which protocol feature enables this message?

> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 74807091a0..cf7923b25f 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -78,6 +78,7 @@ typedef enum VhostUserRequest {
>      VHOST_USER_SET_CONFIG = 25,
>      VHOST_USER_POSTCOPY_ADVISE  = 26,
>      VHOST_USER_POSTCOPY_LISTEN  = 27,
> +    VHOST_USER_POSTCOPY_END     = 28,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> -- 
> 2.14.3

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH v3 24/29] vhost-user: Add VHOST_USER_POSTCOPY_END message
  2018-02-26 20:27   ` Michael S. Tsirkin
@ 2018-02-27 10:09     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-27 10:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo,
	quintela, aarcange

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:16:20PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > This message is sent just before the end of postcopy to get the
> > client to stop using userfault since we wont respond to any more
> > requests.  It should close userfaultfd so that any other pages
> > get mapped to the backing file automatically by the kernel, since
> > at this point we know we've received everything.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 23 +++++++++++++++++++++++
> >  contrib/libvhost-user/libvhost-user.h |  1 +
> >  docs/interop/vhost-user.txt           |  8 ++++++++
> >  hw/virtio/vhost-user.c                |  1 +
> >  4 files changed, 33 insertions(+)
> > 
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index 1b224af706..1f988ab787 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -99,6 +99,7 @@ vu_request_to_string(unsigned int req)
> >          REQ(VHOST_USER_SET_CONFIG),
> >          REQ(VHOST_USER_POSTCOPY_ADVISE),
> >          REQ(VHOST_USER_POSTCOPY_LISTEN),
> > +        REQ(VHOST_USER_POSTCOPY_END),
> >          REQ(VHOST_USER_MAX),
> >      };
> >  #undef REQ
> > @@ -1095,6 +1096,26 @@ vu_set_postcopy_listen(VuDev *dev, VhostUserMsg *vmsg)
> >      vmsg->payload.u64 = 0; /* Success */
> >      return true;
> >  }
> > +
> > +static bool
> > +vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
> > +{
> > +    DPRINT("%s: Entry\n", __func__);
> > +    dev->postcopy_listening = false;
> > +    if (dev->postcopy_ufd > 0) {
> > +        close(dev->postcopy_ufd);
> > +        dev->postcopy_ufd = -1;
> > +        DPRINT("%s: Done close\n", __func__);
> > +    }
> > +
> > +    vmsg->fd_num = 0;
> > +    vmsg->payload.u64 = 0;
> > +    vmsg->size = sizeof(vmsg->payload.u64);
> > +    vmsg->flags = VHOST_USER_VERSION |  VHOST_USER_REPLY_MASK;
> > +    DPRINT("%s: exit\n", __func__);
> > +    return true;
> > +}
> > +
> >  static bool
> >  vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >  {
> > @@ -1170,6 +1191,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >          return vu_set_postcopy_advise(dev, vmsg);
> >      case VHOST_USER_POSTCOPY_LISTEN:
> >          return vu_set_postcopy_listen(dev, vmsg);
> > +    case VHOST_USER_POSTCOPY_END:
> > +        return vu_set_postcopy_end(dev, vmsg);
> >      default:
> >          vmsg_close_fds(vmsg);
> >          vu_panic(dev, "Unhandled request: %d", vmsg->request);
> > diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> > index fcba53c3c3..9696b89f6e 100644
> > --- a/contrib/libvhost-user/libvhost-user.h
> > +++ b/contrib/libvhost-user/libvhost-user.h
> > @@ -84,6 +84,7 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_SET_CONFIG = 25,
> >      VHOST_USER_POSTCOPY_ADVISE  = 26,
> >      VHOST_USER_POSTCOPY_LISTEN  = 27,
> > +    VHOST_USER_POSTCOPY_END     = 28,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >  
> > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > index 5bbcab2cc4..4bf7d8ef99 100644
> > --- a/docs/interop/vhost-user.txt
> > +++ b/docs/interop/vhost-user.txt
> > @@ -697,6 +697,14 @@ Master message types
> >  
> >        Master advises slave that a transition to postcopy mode has happened.
> >  
> > + * VHOST_USER_POSTCOPY_END
> > +      Id: 28
> > +      Slave payload: u64
> > +
> > +      Master advises that postcopy migration has now completed.  The
> > +      slave must disable the userfaultfd. The response is an acknowledgement
> > +      only.
> > +
> >  Slave message types
> >  -------------------
> >  
> 
> Which protocol feature enables this message?

VHOST_USER_PROTOCOL_F_PAGEFAULT - if you have that,
AND postcopy-ram is enabled
AND postcopy mode is entered

you'll get VHOST_USER_POSTCOPY_LISTEN and sometime later
VHOST_USER_POSTCOPY_END.

Dave

> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index 74807091a0..cf7923b25f 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -78,6 +78,7 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_SET_CONFIG = 25,
> >      VHOST_USER_POSTCOPY_ADVISE  = 26,
> >      VHOST_USER_POSTCOPY_LISTEN  = 27,
> > +    VHOST_USER_POSTCOPY_END     = 28,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >  
> > -- 
> > 2.14.3
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram
  2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                   ` (28 preceding siblings ...)
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 29/29] postcopy shared docs Dr. David Alan Gilbert (git)
@ 2018-02-27 14:01 ` Michael S. Tsirkin
  2018-02-27 20:05   ` Dr. David Alan Gilbert
  29 siblings, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2018-02-27 14:01 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:15:56PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Hi,
>   This is the first non-RFC version of this patch set that
> enables postcopy migration with shared memory to a vhost user process.
> It's based off current head.
> 
> I've tested with vhost-user-bridge and a modified dpdk; both very
> lightly.
> 
> Compared to v2 we're now using the just-merged reworks to the vhost
> code (suggested by Igor), so that the huge page region merging is now a lot simpler
> in this series. The handshake between the client and the qemu for the
> set-mem-table is now a bit more complex to resolve a previous race where
> the client would start sending requests to the qemu prior to the qemu
> being ready to accept them.
> 
> Dave

From vhost-user POV this seems mostly fine to me.

I would like to have dependency of specific messages on the
protocol features documented, and the order of messages
documented a bit more explicitly.




> Dr. David Alan Gilbert (29):
>   migrate: Update ram_block_discard_range for shared
>   qemu_ram_block_host_offset
>   postcopy: use UFFDIO_ZEROPAGE only when available
>   postcopy: Add notifier chain
>   postcopy: Add vhost-user flag for postcopy and check it
>   vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message
>   libvhost-user: Support sending fds back to qemu
>   libvhost-user: Open userfaultfd
>   postcopy: Allow registering of fd handler
>   vhost+postcopy: Register shared ufd with postcopy
>   vhost+postcopy: Transmit 'listen' to client
>   postcopy+vhost-user: Split set_mem_table for postcopy
>   migration/ram: ramblock_recv_bitmap_test_byte_offset
>   libvhost-user+postcopy: Register new regions with the ufd
>   vhost+postcopy: Send address back to qemu
>   vhost+postcopy: Stash RAMBlock and offset
>   vhost+postcopy: Send requests to source for shared pages
>   vhost+postcopy: Resolve client address
>   postcopy: wake shared
>   postcopy: postcopy_notify_shared_wake
>   vhost+postcopy: Add vhost waker
>   vhost+postcopy: Call wakeups
>   libvhost-user: mprotect & madvises for postcopy
>   vhost-user: Add VHOST_USER_POSTCOPY_END message
>   vhost+postcopy: Wire up POSTCOPY_END notify
>   vhost: Huge page align and merge
>   postcopy: Allow shared memory
>   libvhost-user: Claim support for postcopy
>   postcopy shared docs
> 
>  contrib/libvhost-user/libvhost-user.c | 303 ++++++++++++++++++++++++-
>  contrib/libvhost-user/libvhost-user.h |   8 +
>  docs/devel/migration.rst              |  41 ++++
>  docs/interop/vhost-user.txt           |  42 ++++
>  exec.c                                |  85 +++++--
>  hw/virtio/trace-events                |  16 +-
>  hw/virtio/vhost-user.c                | 411 +++++++++++++++++++++++++++++++++-
>  hw/virtio/vhost.c                     |  66 +++++-
>  include/exec/cpu-common.h             |   4 +
>  migration/migration.c                 |   6 +
>  migration/migration.h                 |   4 +
>  migration/postcopy-ram.c              | 350 +++++++++++++++++++++++------
>  migration/postcopy-ram.h              |  69 ++++++
>  migration/ram.c                       |   5 +
>  migration/ram.h                       |   1 +
>  migration/savevm.c                    |  13 ++
>  migration/trace-events                |   6 +
>  trace-events                          |   3 +-
>  vl.c                                  |   2 +
>  19 files changed, 1337 insertions(+), 98 deletions(-)
> 
> -- 
> 2.14.3


* Re: [Qemu-devel] [PATCH v3 15/29] vhost+postcopy: Send address back to qemu
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 15/29] vhost+postcopy: Send address back to qemu Dr. David Alan Gilbert (git)
@ 2018-02-27 14:25   ` Michael S. Tsirkin
  2018-02-27 19:54     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2018-02-27 14:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:11PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> We need a better way, but at the moment we need the address of the
> mappings sent back to qemu so it can interpret the messages on the
> userfaultfd it reads.
> 
> This is done as a 3 stage set:
>    QEMU -> client
>       set_mem_table
> 
>    mmap stuff, get addresses
> 
>    client -> qemu
>        here are the addresses
> 
>    qemu -> client
>        OK - now you can use them
> 
> That ensures that qemu has registered the new addresses in it's
> userfault code before the client starts accessing them.
> 
> Note: We don't ask for the default 'ack' reply since we've got our own.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 24 ++++++++++++-
>  docs/interop/vhost-user.txt           |  9 +++++
>  hw/virtio/trace-events                |  1 +
>  hw/virtio/vhost-user.c                | 67 +++++++++++++++++++++++++++++++++--
>  4 files changed, 98 insertions(+), 3 deletions(-)
> 
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index a18bc74a7c..e02e5d6f46 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -491,10 +491,32 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
>                     dev_region->mmap_addr);
>          }
>  
> +        /* Return the address to QEMU so that it can translate the ufd
> +         * fault addresses back.
> +         */
> +        msg_region->userspace_addr = (uintptr_t)(mmap_addr +
> +                                                 dev_region->mmap_offset);
>          close(vmsg->fds[i]);
>      }
>  
> -    /* TODO: Get address back to QEMU */
> +    /* Send the message back to qemu with the addresses filled in */
> +    vmsg->fd_num = 0;
> +    if (!vu_message_write(dev, dev->sock, vmsg)) {
> +        vu_panic(dev, "failed to respond to set-mem-table for postcopy");
> +        return false;
> +    }
> +
> +    /* Wait for QEMU to confirm that it's registered the handler for the
> +     * faults.
> +     */
> +    if (!vu_message_read(dev, dev->sock, vmsg) ||
> +        vmsg->size != sizeof(vmsg->payload.u64) ||
> +        vmsg->payload.u64 != 0) {
> +        vu_panic(dev, "failed to receive valid ack for postcopy set-mem-table");
> +        return false;
> +    }
> +
> +    /* OK, now we can go and register the memory and generate faults */
>      for (i = 0; i < dev->nregions; i++) {
>          VuDevRegion *dev_region = &dev->regions[i];
>  #ifdef UFFDIO_REGISTER
> diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> index bdec9ec0e8..5bbcab2cc4 100644
> --- a/docs/interop/vhost-user.txt
> +++ b/docs/interop/vhost-user.txt
> @@ -454,12 +454,21 @@ Master message types
>        Id: 5
>        Equivalent ioctl: VHOST_SET_MEM_TABLE
>        Master payload: memory regions description
> +      Slave payload: (postcopy only) memory regions description
>  
>        Sets the memory map regions on the slave so it can translate the vring
>        addresses. In the ancillary data there is an array of file descriptors
>        for each memory mapped region. The size and ordering of the fds matches
>        the number and ordering of memory regions.
>  
> +      When postcopy-listening has been received,

Which message is this?

> SET_MEM_TABLE replies with
> +      the bases of the memory mapped regions to the master.  It must have mmap'd
> +      the regions but not yet accessed them and should not yet generate a userfault
> +      event. Note NEED_REPLY_MASK is not set in this case.
> +      QEMU will then reply back to the list of mappings with an empty
> +      VHOST_USER_SET_MEM_TABLE as an acknolwedgment; only upon reception of this
> +      message may the guest start accessing the memory and generating faults.
> +
>   * VHOST_USER_SET_LOG_BASE
>  
>        Id: 6

As you say yourself, this is probably the best we can do for now,
but it's not ideal. So I think it's a good idea to isolate this
behind a separate protocol feature bit. For now it will be required
for postcopy, when it's fixed in kernel we can drop it
cleanly.


> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 06ec03d6e7..05d18ada77 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -8,6 +8,7 @@ vhost_section(const char *name, int r) "%s:%d"
>  
>  # hw/virtio/vhost-user.c
>  vhost_user_postcopy_listen(void) ""
> +vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
>  
>  # hw/virtio/virtio.c
>  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 64f4b3b3f9..a060442cb9 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -159,6 +159,7 @@ struct vhost_user {
>      int slave_fd;
>      NotifierWithReturn postcopy_notifier;
>      struct PostCopyFD  postcopy_fd;
> +    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
>      /* True once we've entered postcopy_listen */
>      bool               postcopy_listen;
>  };
> @@ -328,12 +329,15 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
>  static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
>                                               struct vhost_memory *mem)
>  {
> +    struct vhost_user *u = dev->opaque;
>      int fds[VHOST_MEMORY_MAX_NREGIONS];
>      int i, fd;
>      size_t fd_num = 0;
>      bool reply_supported = virtio_has_feature(dev->protocol_features,
>                                                VHOST_USER_PROTOCOL_F_REPLY_ACK);
> -    /* TODO: Add actual postcopy differences */
> +    VhostUserMsg msg_reply;
> +    int region_i, msg_i;
> +
>      VhostUserMsg msg = {
>          .hdr.request = VHOST_USER_SET_MEM_TABLE,
>          .hdr.flags = VHOST_USER_VERSION,
> @@ -380,6 +384,64 @@ static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
>          return -1;
>      }
>  
> +    if (vhost_user_read(dev, &msg_reply) < 0) {
> +        return -1;
> +    }
> +
> +    if (msg_reply.hdr.request != VHOST_USER_SET_MEM_TABLE) {
> +        error_report("%s: Received unexpected msg type."
> +                     "Expected %d received %d", __func__,
> +                     VHOST_USER_SET_MEM_TABLE, msg_reply.hdr.request);
> +        return -1;
> +    }
> +    /* We're using the same structure, just reusing one of the
> +     * fields, so it should be the same size.
> +     */
> +    if (msg_reply.hdr.size != msg.hdr.size) {
> +        error_report("%s: Unexpected size for postcopy reply "
> +                     "%d vs %d", __func__, msg_reply.hdr.size, msg.hdr.size);
> +        return -1;
> +    }
> +
> +    memset(u->postcopy_client_bases, 0,
> +           sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
> +
> +    /* They're in the same order as the regions that were sent
> +     * but some of the regions were skipped (above) if they
> +     * didn't have fd's
> +    */
> +    for (msg_i = 0, region_i = 0;
> +         region_i < dev->mem->nregions;
> +        region_i++) {
> +        if (msg_i < fd_num &&
> +            msg_reply.payload.memory.regions[msg_i].guest_phys_addr ==
> +            dev->mem->regions[region_i].guest_phys_addr) {
> +            u->postcopy_client_bases[region_i] =
> +                msg_reply.payload.memory.regions[msg_i].userspace_addr;
> +            trace_vhost_user_set_mem_table_postcopy(
> +                msg_reply.payload.memory.regions[msg_i].userspace_addr,
> +                msg.payload.memory.regions[msg_i].userspace_addr,
> +                msg_i, region_i);
> +            msg_i++;
> +        }
> +    }
> +    if (msg_i != fd_num) {
> +        error_report("%s: postcopy reply not fully consumed "
> +                     "%d vs %zd",
> +                     __func__, msg_i, fd_num);
> +        return -1;
> +    }
> +    /* Now we've registered this with the postcopy code, we ack to the client,
> +     * because now we're in the position to be able to deal with any faults
> +     * it generates.
> +     */
> +    /* TODO: Use this for failure cases as well with a bad value */
> +    msg.hdr.size = sizeof(msg.payload.u64);
> +    msg.payload.u64 = 0; /* OK */
> +    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
> +        return -1;
> +    }
> +
>      if (reply_supported) {
>          return process_message_reply(dev, &msg);
>      }
> @@ -396,7 +458,8 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
>      size_t fd_num = 0;
>      bool do_postcopy = u->postcopy_listen && u->postcopy_fd.handler;
>      bool reply_supported = virtio_has_feature(dev->protocol_features,
> -                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> +                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
> +                                          !do_postcopy;
>  
>      if (do_postcopy) {
>          /* Postcopy has enough differences that it's best done in it's own
> -- 
> 2.14.3


* Re: [Qemu-devel] [PATCH v3 15/29] vhost+postcopy: Send address back to qemu
  2018-02-27 14:25   ` Michael S. Tsirkin
@ 2018-02-27 19:54     ` Dr. David Alan Gilbert
  2018-02-27 20:25       ` Michael S. Tsirkin
  0 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-27 19:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo,
	quintela, aarcange

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:16:11PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > We need a better way, but at the moment we need the address of the
> > mappings sent back to qemu so it can interpret the messages on the
> > userfaultfd it reads.
> > 
> > This is done as a 3 stage set:
> >    QEMU -> client
> >       set_mem_table
> > 
> >    mmap stuff, get addresses
> > 
> >    client -> qemu
> >        here are the addresses
> > 
> >    qemu -> client
> >        OK - now you can use them
> > 
> > That ensures that qemu has registered the new addresses in it's
> > userfault code before the client starts accessing them.
> > 
> > Note: We don't ask for the default 'ack' reply since we've got our own.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 24 ++++++++++++-
> >  docs/interop/vhost-user.txt           |  9 +++++
> >  hw/virtio/trace-events                |  1 +
> >  hw/virtio/vhost-user.c                | 67 +++++++++++++++++++++++++++++++++--
> >  4 files changed, 98 insertions(+), 3 deletions(-)
> > 
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index a18bc74a7c..e02e5d6f46 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -491,10 +491,32 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
> >                     dev_region->mmap_addr);
> >          }
> >  
> > +        /* Return the address to QEMU so that it can translate the ufd
> > +         * fault addresses back.
> > +         */
> > +        msg_region->userspace_addr = (uintptr_t)(mmap_addr +
> > +                                                 dev_region->mmap_offset);
> >          close(vmsg->fds[i]);
> >      }
> >  
> > -    /* TODO: Get address back to QEMU */
> > +    /* Send the message back to qemu with the addresses filled in */
> > +    vmsg->fd_num = 0;
> > +    if (!vu_message_write(dev, dev->sock, vmsg)) {
> > +        vu_panic(dev, "failed to respond to set-mem-table for postcopy");
> > +        return false;
> > +    }
> > +
> > +    /* Wait for QEMU to confirm that it's registered the handler for the
> > +     * faults.
> > +     */
> > +    if (!vu_message_read(dev, dev->sock, vmsg) ||
> > +        vmsg->size != sizeof(vmsg->payload.u64) ||
> > +        vmsg->payload.u64 != 0) {
> > +        vu_panic(dev, "failed to receive valid ack for postcopy set-mem-table");
> > +        return false;
> > +    }
> > +
> > +    /* OK, now we can go and register the memory and generate faults */
> >      for (i = 0; i < dev->nregions; i++) {
> >          VuDevRegion *dev_region = &dev->regions[i];
> >  #ifdef UFFDIO_REGISTER
> > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > index bdec9ec0e8..5bbcab2cc4 100644
> > --- a/docs/interop/vhost-user.txt
> > +++ b/docs/interop/vhost-user.txt
> > @@ -454,12 +454,21 @@ Master message types
> >        Id: 5
> >        Equivalent ioctl: VHOST_SET_MEM_TABLE
> >        Master payload: memory regions description
> > +      Slave payload: (postcopy only) memory regions description
> >  
> >        Sets the memory map regions on the slave so it can translate the vring
> >        addresses. In the ancillary data there is an array of file descriptors
> >        for each memory mapped region. The size and ordering of the fds matches
> >        the number and ordering of memory regions.
> >  
> > +      When postcopy-listening has been received,
> 
> Which message is this?

VHOST_USER_POSTCOPY_LISTEN

Do you want me just to change that to 'When VHOST_USER_POSTCOPY_LISTEN
has been received'?

> > SET_MEM_TABLE replies with
> > +      the bases of the memory mapped regions to the master.  It must have mmap'd
> > +      the regions but not yet accessed them and should not yet generate a userfault
> > +      event. Note NEED_REPLY_MASK is not set in this case.
> > +      QEMU will then reply back to the list of mappings with an empty
> > +      VHOST_USER_SET_MEM_TABLE as an acknolwedgment; only upon reception of this
> > +      message may the guest start accessing the memory and generating faults.
> > +
> >   * VHOST_USER_SET_LOG_BASE
> >  
> >        Id: 6
> 
> As you say yourself, this is probably the best we can do for now,
> but it's not ideal. So I think it's a good idea to isolate this
> behind a separate protocol feature bit. For now it will be required
> for postcopy, when it's fixed in kernel we can drop it
> cleanly.
> 

While we've talked about ways of avoiding the exact addresses being
known by the slave, I'm not sure we've talked about a way of removing
this handshake, although it's doable if we move more of the work to the
QEMU side.

Dave

> > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > index 06ec03d6e7..05d18ada77 100644
> > --- a/hw/virtio/trace-events
> > +++ b/hw/virtio/trace-events
> > @@ -8,6 +8,7 @@ vhost_section(const char *name, int r) "%s:%d"
> >  
> >  # hw/virtio/vhost-user.c
> >  vhost_user_postcopy_listen(void) ""
> > +vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> >  
> >  # hw/virtio/virtio.c
> >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index 64f4b3b3f9..a060442cb9 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -159,6 +159,7 @@ struct vhost_user {
> >      int slave_fd;
> >      NotifierWithReturn postcopy_notifier;
> >      struct PostCopyFD  postcopy_fd;
> > +    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
> >      /* True once we've entered postcopy_listen */
> >      bool               postcopy_listen;
> >  };
> > @@ -328,12 +329,15 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
> >  static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
> >                                               struct vhost_memory *mem)
> >  {
> > +    struct vhost_user *u = dev->opaque;
> >      int fds[VHOST_MEMORY_MAX_NREGIONS];
> >      int i, fd;
> >      size_t fd_num = 0;
> >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> >                                                VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > -    /* TODO: Add actual postcopy differences */
> > +    VhostUserMsg msg_reply;
> > +    int region_i, msg_i;
> > +
> >      VhostUserMsg msg = {
> >          .hdr.request = VHOST_USER_SET_MEM_TABLE,
> >          .hdr.flags = VHOST_USER_VERSION,
> > @@ -380,6 +384,64 @@ static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
> >          return -1;
> >      }
> >  
> > +    if (vhost_user_read(dev, &msg_reply) < 0) {
> > +        return -1;
> > +    }
> > +
> > +    if (msg_reply.hdr.request != VHOST_USER_SET_MEM_TABLE) {
> > +        error_report("%s: Received unexpected msg type."
> > +                     "Expected %d received %d", __func__,
> > +                     VHOST_USER_SET_MEM_TABLE, msg_reply.hdr.request);
> > +        return -1;
> > +    }
> > +    /* We're using the same structure, just reusing one of the
> > +     * fields, so it should be the same size.
> > +     */
> > +    if (msg_reply.hdr.size != msg.hdr.size) {
> > +        error_report("%s: Unexpected size for postcopy reply "
> > +                     "%d vs %d", __func__, msg_reply.hdr.size, msg.hdr.size);
> > +        return -1;
> > +    }
> > +
> > +    memset(u->postcopy_client_bases, 0,
> > +           sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
> > +
> > +    /* They're in the same order as the regions that were sent
> > +     * but some of the regions were skipped (above) if they
> > +     * didn't have fd's
> > +    */
> > +    for (msg_i = 0, region_i = 0;
> > +         region_i < dev->mem->nregions;
> > +        region_i++) {
> > +        if (msg_i < fd_num &&
> > +            msg_reply.payload.memory.regions[msg_i].guest_phys_addr ==
> > +            dev->mem->regions[region_i].guest_phys_addr) {
> > +            u->postcopy_client_bases[region_i] =
> > +                msg_reply.payload.memory.regions[msg_i].userspace_addr;
> > +            trace_vhost_user_set_mem_table_postcopy(
> > +                msg_reply.payload.memory.regions[msg_i].userspace_addr,
> > +                msg.payload.memory.regions[msg_i].userspace_addr,
> > +                msg_i, region_i);
> > +            msg_i++;
> > +        }
> > +    }
> > +    if (msg_i != fd_num) {
> > +        error_report("%s: postcopy reply not fully consumed "
> > +                     "%d vs %zd",
> > +                     __func__, msg_i, fd_num);
> > +        return -1;
> > +    }
> > +    /* Now we've registered this with the postcopy code, we ack to the client,
> > +     * because now we're in the position to be able to deal with any faults
> > +     * it generates.
> > +     */
> > +    /* TODO: Use this for failure cases as well with a bad value */
> > +    msg.hdr.size = sizeof(msg.payload.u64);
> > +    msg.payload.u64 = 0; /* OK */
> > +    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
> > +        return -1;
> > +    }
> > +
> >      if (reply_supported) {
> >          return process_message_reply(dev, &msg);
> >      }
> > @@ -396,7 +458,8 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> >      size_t fd_num = 0;
> >      bool do_postcopy = u->postcopy_listen && u->postcopy_fd.handler;
> >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> > -                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > +                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
> > +                                          !do_postcopy;
> >  
> >      if (do_postcopy) {
> >          /* Postcopy has enough differences that it's best done in it's own
> > -- 
> > 2.14.3
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram
  2018-02-27 14:01 ` [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Michael S. Tsirkin
@ 2018-02-27 20:05   ` Dr. David Alan Gilbert
  2018-02-27 20:23     ` Michael S. Tsirkin
  0 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-27 20:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo,
	quintela, aarcange

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:15:56PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Hi,
> >   This is the first non-RFC version of this patch set that
> > enables postcopy migration with shared memory to a vhost user process.
> > It's based off current head.
> > 
> > I've tested with vhost-user-bridge and a modified dpdk; both very
> > lightly.
> > 
> > Compared to v2 we're now using the just-merged reworks to the vhost
> > code (suggested by Igor), so that the huge page region merging is now a lot simpler
> > in this series. The handshake between the client and the qemu for the
> > set-mem-table is now a bit more complex to resolve a previous race where
> > the client would start sending requests to the qemu prior to the qemu
> > being ready to accept them.
> > 
> > Dave
> 
> From vhost-user POV this seems mostly fine to me.

OK, great - it would be nice to get this merged in the upcoming release
(Hint: Anyone else please review!)

> I would like to have dependency of specific messages on the
> protocol features documented, and the order of messages
> documented a bit more explicitly.

Something like the following? (appropriately merged in with the
individual commits):

diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index 4bf7d8ef99..7841812766 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -461,7 +461,7 @@ Master message types
       for each memory mapped region. The size and ordering of the fds matches
       the number and ordering of memory regions.
 
-      When postcopy-listening has been received, SET_MEM_TABLE replies with
+      When VHOST_USER_POSTCOPY_LISTEN has been received, SET_MEM_TABLE replies with
       the bases of the memory mapped regions to the master.  It must have mmap'd
       the regions but not yet accessed them and should not yet generate a userfault
       event. Note NEED_REPLY_MASK is not set in this case.
@@ -687,7 +687,8 @@ Master message types
       Master payload: N/A
       Slave payload: userfault fd + u64
 
-      Master advises slave that a migration with postcopy enabled is underway,
+      When VHOST_USER_PROTOCOL_F_PAGEFAULT is supported, the
+      master advises slave that a migration with postcopy enabled is underway,
       the slave must open a userfaultfd for later use.
       Note that at this stage the migration is still in precopy mode.
 
@@ -696,6 +697,8 @@ Master message types
       Master payload: N/A
 
       Master advises slave that a transition to postcopy mode has happened.
+      This is always sent sometime after a VHOST_USER_POSTCOPY_ADVISE, and
+      thus only when VHOST_USER_PROTOCOL_F_PAGEFAULT is supported.
 
  * VHOST_USER_POSTCOPY_END
       Id: 28
@@ -704,6 +707,8 @@ Master message types
       Master advises that postcopy migration has now completed.  The
       slave must disable the userfaultfd. The response is an acknowledgement
       only.
+      This message is sent at the end of the migration, after
+      VHOST_USER_POSTCOPY_LISTEN was previously sent.
 
 Slave message types
 -------------------

Dave

> 
> 
> 
> > Dr. David Alan Gilbert (29):
> >   migrate: Update ram_block_discard_range for shared
> >   qemu_ram_block_host_offset
> >   postcopy: use UFFDIO_ZEROPAGE only when available
> >   postcopy: Add notifier chain
> >   postcopy: Add vhost-user flag for postcopy and check it
> >   vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message
> >   libvhost-user: Support sending fds back to qemu
> >   libvhost-user: Open userfaultfd
> >   postcopy: Allow registering of fd handler
> >   vhost+postcopy: Register shared ufd with postcopy
> >   vhost+postcopy: Transmit 'listen' to client
> >   postcopy+vhost-user: Split set_mem_table for postcopy
> >   migration/ram: ramblock_recv_bitmap_test_byte_offset
> >   libvhost-user+postcopy: Register new regions with the ufd
> >   vhost+postcopy: Send address back to qemu
> >   vhost+postcopy: Stash RAMBlock and offset
> >   vhost+postcopy: Send requests to source for shared pages
> >   vhost+postcopy: Resolve client address
> >   postcopy: wake shared
> >   postcopy: postcopy_notify_shared_wake
> >   vhost+postcopy: Add vhost waker
> >   vhost+postcopy: Call wakeups
> >   libvhost-user: mprotect & madvises for postcopy
> >   vhost-user: Add VHOST_USER_POSTCOPY_END message
> >   vhost+postcopy: Wire up POSTCOPY_END notify
> >   vhost: Huge page align and merge
> >   postcopy: Allow shared memory
> >   libvhost-user: Claim support for postcopy
> >   postcopy shared docs
> > 
> >  contrib/libvhost-user/libvhost-user.c | 303 ++++++++++++++++++++++++-
> >  contrib/libvhost-user/libvhost-user.h |   8 +
> >  docs/devel/migration.rst              |  41 ++++
> >  docs/interop/vhost-user.txt           |  42 ++++
> >  exec.c                                |  85 +++++--
> >  hw/virtio/trace-events                |  16 +-
> >  hw/virtio/vhost-user.c                | 411 +++++++++++++++++++++++++++++++++-
> >  hw/virtio/vhost.c                     |  66 +++++-
> >  include/exec/cpu-common.h             |   4 +
> >  migration/migration.c                 |   6 +
> >  migration/migration.h                 |   4 +
> >  migration/postcopy-ram.c              | 350 +++++++++++++++++++++++------
> >  migration/postcopy-ram.h              |  69 ++++++
> >  migration/ram.c                       |   5 +
> >  migration/ram.h                       |   1 +
> >  migration/savevm.c                    |  13 ++
> >  migration/trace-events                |   6 +
> >  trace-events                          |   3 +-
> >  vl.c                                  |   2 +
> >  19 files changed, 1337 insertions(+), 98 deletions(-)
> > 
> > -- 
> > 2.14.3
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram
  2018-02-27 20:05   ` Dr. David Alan Gilbert
@ 2018-02-27 20:23     ` Michael S. Tsirkin
  2018-02-28 18:38       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2018-02-27 20:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo,
	quintela, aarcange

On Tue, Feb 27, 2018 at 08:05:25PM +0000, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > On Fri, Feb 16, 2018 at 01:15:56PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Hi,
> > >   This is the first non-RFC version of this patch set that
> > > enables postcopy migration with shared memory to a vhost user process.
> > > It's based off current head.
> > > 
> > > I've tested with vhost-user-bridge and a modified dpdk; both very
> > > lightly.
> > > 
> > > Compared to v2 we're now using the just-merged reworks to the vhost
> > > code (suggested by Igor), so that the huge page region merging is now a lot simpler
> > > in this series. The handshake between the client and the qemu for the
> > > set-mem-table is now a bit more complex to resolve a previous race where
> > > the client would start sending requests to the qemu prior to the qemu
> > > being ready to accept them.
> > > 
> > > Dave
> > 
> > From vhost-user POV this seems mostly fine to me.
> 
> OK, great - it would be nice to get this merged in the upcoming release
> (Hint: Anyone else please review!)
> 
> > I would like to have dependency of specific messages on the
> > protocol features documented, and the order of messages
> > documented a bit more explicitly.
> 
> Something like the following? (appropriately merged in with the
> individual commits):
> 
> diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> index 4bf7d8ef99..7841812766 100644
> --- a/docs/interop/vhost-user.txt
> +++ b/docs/interop/vhost-user.txt
> @@ -461,7 +461,7 @@ Master message types
>        for each memory mapped region. The size and ordering of the fds matches
>        the number and ordering of memory regions.
>  
> -      When postcopy-listening has been received, SET_MEM_TABLE replies with
> +      When VHOST_USER_POSTCOPY_LISTEN has been received, SET_MEM_TABLE replies with
>        the bases of the memory mapped regions to the master.  It must have mmap'd
>        the regions but not yet accessed them and should not yet generate a userfault
>        event. Note NEED_REPLY_MASK is not set in this case.
> @@ -687,7 +687,8 @@ Master message types
>        Master payload: N/A
>        Slave payload: userfault fd + u64
>  
> -      Master advises slave that a migration with postcopy enabled is underway,
> +      When VHOST_USER_PROTOCOL_F_PAGEFAULT is supported, the
> +      master advises slave that a migration with postcopy enabled is underway,
>        the slave must open a userfaultfd for later use.
>        Note that at this stage the migration is still in precopy mode.
>  
> @@ -696,6 +697,8 @@ Master message types
>        Master payload: N/A
>  
>        Master advises slave that a transition to postcopy mode has happened.
> +      This is always sent sometime after a VHOST_USER_POSTCOPY_ADVISE, and
> +      thus only when VHOST_USER_PROTOCOL_F_PAGEFAULT is supported.
>  
>   * VHOST_USER_POSTCOPY_END
>        Id: 28
> @@ -704,6 +707,8 @@ Master message types
>        Master advises that postcopy migration has now completed.  The
>        slave must disable the userfaultfd. The response is an acknowledgement
>        only.
> +      This message is sent at the end of the migration, after
> +      VHOST_USER_POSTCOPY_LISTEN was previously sent.

And maybe mention VHOST_USER_PROTOCOL_F_PAGEFAULT here too.

>  Slave message types
>  -------------------
> 
> Dave
> 
> > 
> > 
> > 
> > > Dr. David Alan Gilbert (29):
> > >   migrate: Update ram_block_discard_range for shared
> > >   qemu_ram_block_host_offset
> > >   postcopy: use UFFDIO_ZEROPAGE only when available
> > >   postcopy: Add notifier chain
> > >   postcopy: Add vhost-user flag for postcopy and check it
> > >   vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message
> > >   libvhost-user: Support sending fds back to qemu
> > >   libvhost-user: Open userfaultfd
> > >   postcopy: Allow registering of fd handler
> > >   vhost+postcopy: Register shared ufd with postcopy
> > >   vhost+postcopy: Transmit 'listen' to client
> > >   postcopy+vhost-user: Split set_mem_table for postcopy
> > >   migration/ram: ramblock_recv_bitmap_test_byte_offset
> > >   libvhost-user+postcopy: Register new regions with the ufd
> > >   vhost+postcopy: Send address back to qemu
> > >   vhost+postcopy: Stash RAMBlock and offset
> > >   vhost+postcopy: Send requests to source for shared pages
> > >   vhost+postcopy: Resolve client address
> > >   postcopy: wake shared
> > >   postcopy: postcopy_notify_shared_wake
> > >   vhost+postcopy: Add vhost waker
> > >   vhost+postcopy: Call wakeups
> > >   libvhost-user: mprotect & madvises for postcopy
> > >   vhost-user: Add VHOST_USER_POSTCOPY_END message
> > >   vhost+postcopy: Wire up POSTCOPY_END notify
> > >   vhost: Huge page align and merge
> > >   postcopy: Allow shared memory
> > >   libvhost-user: Claim support for postcopy
> > >   postcopy shared docs
> > > 
> > >  contrib/libvhost-user/libvhost-user.c | 303 ++++++++++++++++++++++++-
> > >  contrib/libvhost-user/libvhost-user.h |   8 +
> > >  docs/devel/migration.rst              |  41 ++++
> > >  docs/interop/vhost-user.txt           |  42 ++++
> > >  exec.c                                |  85 +++++--
> > >  hw/virtio/trace-events                |  16 +-
> > >  hw/virtio/vhost-user.c                | 411 +++++++++++++++++++++++++++++++++-
> > >  hw/virtio/vhost.c                     |  66 +++++-
> > >  include/exec/cpu-common.h             |   4 +
> > >  migration/migration.c                 |   6 +
> > >  migration/migration.h                 |   4 +
> > >  migration/postcopy-ram.c              | 350 +++++++++++++++++++++++------
> > >  migration/postcopy-ram.h              |  69 ++++++
> > >  migration/ram.c                       |   5 +
> > >  migration/ram.h                       |   1 +
> > >  migration/savevm.c                    |  13 ++
> > >  migration/trace-events                |   6 +
> > >  trace-events                          |   3 +-
> > >  vl.c                                  |   2 +
> > >  19 files changed, 1337 insertions(+), 98 deletions(-)
> > > 
> > > -- 
> > > 2.14.3
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v3 15/29] vhost+postcopy: Send address back to qemu
  2018-02-27 19:54     ` Dr. David Alan Gilbert
@ 2018-02-27 20:25       ` Michael S. Tsirkin
  2018-02-28 18:26         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2018-02-27 20:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo,
	quintela, aarcange

On Tue, Feb 27, 2018 at 07:54:18PM +0000, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > On Fri, Feb 16, 2018 at 01:16:11PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > We need a better way, but at the moment we need the address of the
> > > mappings sent back to qemu so it can interpret the messages on the
> > > userfaultfd it reads.
> > > 
> > > This is done as a 3 stage set:
> > >    QEMU -> client
> > >       set_mem_table
> > > 
> > >    mmap stuff, get addresses
> > > 
> > >    client -> qemu
> > >        here are the addresses
> > > 
> > >    qemu -> client
> > >        OK - now you can use them
> > > 
> > > That ensures that qemu has registered the new addresses in its
> > > userfault code before the client starts accessing them.
> > > 
> > > Note: We don't ask for the default 'ack' reply since we've got our own.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  contrib/libvhost-user/libvhost-user.c | 24 ++++++++++++-
> > >  docs/interop/vhost-user.txt           |  9 +++++
> > >  hw/virtio/trace-events                |  1 +
> > >  hw/virtio/vhost-user.c                | 67 +++++++++++++++++++++++++++++++++--
> > >  4 files changed, 98 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > > index a18bc74a7c..e02e5d6f46 100644
> > > --- a/contrib/libvhost-user/libvhost-user.c
> > > +++ b/contrib/libvhost-user/libvhost-user.c
> > > @@ -491,10 +491,32 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
> > >                     dev_region->mmap_addr);
> > >          }
> > >  
> > > +        /* Return the address to QEMU so that it can translate the ufd
> > > +         * fault addresses back.
> > > +         */
> > > +        msg_region->userspace_addr = (uintptr_t)(mmap_addr +
> > > +                                                 dev_region->mmap_offset);
> > >          close(vmsg->fds[i]);
> > >      }
> > >  
> > > -    /* TODO: Get address back to QEMU */
> > > +    /* Send the message back to qemu with the addresses filled in */
> > > +    vmsg->fd_num = 0;
> > > +    if (!vu_message_write(dev, dev->sock, vmsg)) {
> > > +        vu_panic(dev, "failed to respond to set-mem-table for postcopy");
> > > +        return false;
> > > +    }
> > > +
> > > +    /* Wait for QEMU to confirm that it's registered the handler for the
> > > +     * faults.
> > > +     */
> > > +    if (!vu_message_read(dev, dev->sock, vmsg) ||
> > > +        vmsg->size != sizeof(vmsg->payload.u64) ||
> > > +        vmsg->payload.u64 != 0) {
> > > +        vu_panic(dev, "failed to receive valid ack for postcopy set-mem-table");
> > > +        return false;
> > > +    }
> > > +
> > > +    /* OK, now we can go and register the memory and generate faults */
> > >      for (i = 0; i < dev->nregions; i++) {
> > >          VuDevRegion *dev_region = &dev->regions[i];
> > >  #ifdef UFFDIO_REGISTER
> > > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > > index bdec9ec0e8..5bbcab2cc4 100644
> > > --- a/docs/interop/vhost-user.txt
> > > +++ b/docs/interop/vhost-user.txt
> > > @@ -454,12 +454,21 @@ Master message types
> > >        Id: 5
> > >        Equivalent ioctl: VHOST_SET_MEM_TABLE
> > >        Master payload: memory regions description
> > > +      Slave payload: (postcopy only) memory regions description
> > >  
> > >        Sets the memory map regions on the slave so it can translate the vring
> > >        addresses. In the ancillary data there is an array of file descriptors
> > >        for each memory mapped region. The size and ordering of the fds matches
> > >        the number and ordering of memory regions.
> > >  
> > > +      When postcopy-listening has been received,
> > 
> > Which message is this?
> 
> VHOST_USER_POSTCOPY_LISTEN
> 
> Do you want me just to change that to, 'When VHOST_USER_POSTCOPY_LISTEN
> has been received' ?

I think it's better this way, yes.

> > > SET_MEM_TABLE replies with
> > > +      the bases of the memory mapped regions to the master.  It must have mmap'd
> > > +      the regions but not yet accessed them and should not yet generate a userfault
> > > +      event. Note NEED_REPLY_MASK is not set in this case.
> > > +      QEMU will then reply back to the list of mappings with an empty
> > > +      VHOST_USER_SET_MEM_TABLE as an acknowledgment; only upon reception of this
> > > +      message may the guest start accessing the memory and generating faults.
> > > +
> > >   * VHOST_USER_SET_LOG_BASE
> > >  
> > >        Id: 6
> > 
> > As you say yourself, this is probably the best we can do for now,
> > but it's not ideal. So I think it's a good idea to isolate this
> > behind a separate protocol feature bit. For now it will be required
> > for postcopy, when it's fixed in kernel we can drop it
> > cleanly.
> > 
> 
> While we've talked about ways of avoiding the exact addresses being
> known by the slave, I'm not sure we've talked about a way of removing
> this handshake; although it's doable if we move more of the work to the QEMU
> side.
> 
> Dave

Some kernel changes might conceivably remove the need to use the
address with userfaultfd, too.

> > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > index 06ec03d6e7..05d18ada77 100644
> > > --- a/hw/virtio/trace-events
> > > +++ b/hw/virtio/trace-events
> > > @@ -8,6 +8,7 @@ vhost_section(const char *name, int r) "%s:%d"
> > >  
> > >  # hw/virtio/vhost-user.c
> > >  vhost_user_postcopy_listen(void) ""
> > > +vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> > >  
> > >  # hw/virtio/virtio.c
> > >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > index 64f4b3b3f9..a060442cb9 100644
> > > --- a/hw/virtio/vhost-user.c
> > > +++ b/hw/virtio/vhost-user.c
> > > @@ -159,6 +159,7 @@ struct vhost_user {
> > >      int slave_fd;
> > >      NotifierWithReturn postcopy_notifier;
> > >      struct PostCopyFD  postcopy_fd;
> > > +    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
> > >      /* True once we've entered postcopy_listen */
> > >      bool               postcopy_listen;
> > >  };
> > > @@ -328,12 +329,15 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
> > >  static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
> > >                                               struct vhost_memory *mem)
> > >  {
> > > +    struct vhost_user *u = dev->opaque;
> > >      int fds[VHOST_MEMORY_MAX_NREGIONS];
> > >      int i, fd;
> > >      size_t fd_num = 0;
> > >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> > >                                                VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > > -    /* TODO: Add actual postcopy differences */
> > > +    VhostUserMsg msg_reply;
> > > +    int region_i, msg_i;
> > > +
> > >      VhostUserMsg msg = {
> > >          .hdr.request = VHOST_USER_SET_MEM_TABLE,
> > >          .hdr.flags = VHOST_USER_VERSION,
> > > @@ -380,6 +384,64 @@ static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
> > >          return -1;
> > >      }
> > >  
> > > +    if (vhost_user_read(dev, &msg_reply) < 0) {
> > > +        return -1;
> > > +    }
> > > +
> > > +    if (msg_reply.hdr.request != VHOST_USER_SET_MEM_TABLE) {
> > > +        error_report("%s: Received unexpected msg type."
> > > +                     "Expected %d received %d", __func__,
> > > +                     VHOST_USER_SET_MEM_TABLE, msg_reply.hdr.request);
> > > +        return -1;
> > > +    }
> > > +    /* We're using the same structure, just reusing one of the
> > > +     * fields, so it should be the same size.
> > > +     */
> > > +    if (msg_reply.hdr.size != msg.hdr.size) {
> > > +        error_report("%s: Unexpected size for postcopy reply "
> > > +                     "%d vs %d", __func__, msg_reply.hdr.size, msg.hdr.size);
> > > +        return -1;
> > > +    }
> > > +
> > > +    memset(u->postcopy_client_bases, 0,
> > > +           sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
> > > +
> > > +    /* They're in the same order as the regions that were sent
> > > +     * but some of the regions were skipped (above) if they
> > > +     * didn't have fd's
> > > +    */
> > > +    for (msg_i = 0, region_i = 0;
> > > +         region_i < dev->mem->nregions;
> > > +        region_i++) {
> > > +        if (msg_i < fd_num &&
> > > +            msg_reply.payload.memory.regions[msg_i].guest_phys_addr ==
> > > +            dev->mem->regions[region_i].guest_phys_addr) {
> > > +            u->postcopy_client_bases[region_i] =
> > > +                msg_reply.payload.memory.regions[msg_i].userspace_addr;
> > > +            trace_vhost_user_set_mem_table_postcopy(
> > > +                msg_reply.payload.memory.regions[msg_i].userspace_addr,
> > > +                msg.payload.memory.regions[msg_i].userspace_addr,
> > > +                msg_i, region_i);
> > > +            msg_i++;
> > > +        }
> > > +    }
> > > +    if (msg_i != fd_num) {
> > > +        error_report("%s: postcopy reply not fully consumed "
> > > +                     "%d vs %zd",
> > > +                     __func__, msg_i, fd_num);
> > > +        return -1;
> > > +    }
> > > +    /* Now we've registered this with the postcopy code, we ack to the client,
> > > +     * because now we're in the position to be able to deal with any faults
> > > +     * it generates.
> > > +     */
> > > +    /* TODO: Use this for failure cases as well with a bad value */
> > > +    msg.hdr.size = sizeof(msg.payload.u64);
> > > +    msg.payload.u64 = 0; /* OK */
> > > +    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
> > > +        return -1;
> > > +    }
> > > +
> > >      if (reply_supported) {
> > >          return process_message_reply(dev, &msg);
> > >      }
> > > @@ -396,7 +458,8 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > >      size_t fd_num = 0;
> > >      bool do_postcopy = u->postcopy_listen && u->postcopy_fd.handler;
> > >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> > > -                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > > +                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
> > > +                                          !do_postcopy;
> > >  
> > >      if (do_postcopy) {
> > >          /* Postcopy has enough differences that it's best done in its own
> > > -- 
> > > 2.14.3
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v3 01/29] migrate: Update ram_block_discard_range for shared
  2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 01/29] migrate: Update ram_block_discard_range for shared Dr. David Alan Gilbert (git)
@ 2018-02-28  6:37   ` Peter Xu
  2018-02-28 19:54     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-02-28  6:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:15:57PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The choice of call to discard a block is getting more complicated
> for other cases.   We use fallocate PUNCH_HOLE in any file cases;
> it works for both hugepage and for tmpfs.
> We use the DONTNEED for non-hugepage cases either where they're
> anonymous or where they're private.
> 
> Care should be taken when trying other backing files.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  exec.c       | 60 ++++++++++++++++++++++++++++++++++++++++++++++--------------
>  trace-events |  3 ++-
>  2 files changed, 48 insertions(+), 15 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index e8d7b335b6..b1bb477776 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -3702,6 +3702,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>      }
>  
>      if ((start + length) <= rb->used_length) {
> +        bool need_madvise, need_fallocate;
>          uint8_t *host_endaddr = host_startaddr + length;
>          if ((uintptr_t)host_endaddr & (rb->page_size - 1)) {
>              error_report("ram_block_discard_range: Unaligned end address: %p",
> @@ -3711,29 +3712,60 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>  
>          errno = ENOTSUP; /* If we are missing MADVISE etc */
>  
> -        if (rb->page_size == qemu_host_page_size) {
> -#if defined(CONFIG_MADVISE)
> -            /* Note: We need the madvise MADV_DONTNEED behaviour of definitely
> -             * freeing the page.
> -             */
> -            ret = madvise(host_startaddr, length, MADV_DONTNEED);
> -#endif
> -        } else {
> -            /* Huge page case  - unfortunately it can't do DONTNEED, but
> -             * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
> -             * huge page file.
> +        /* The logic here is messy;
> +         *    madvise DONTNEED fails for hugepages
> +         *    fallocate works on hugepages and shmem
> +         */
> +        need_madvise = (rb->page_size == qemu_host_page_size);
> +        need_fallocate = rb->fd != -1;
> +        if (need_fallocate) {
> +            /* For a file, this causes the area of the file to be zero'd
> +             * if read, and for hugetlbfs also causes it to be unmapped
> +             * so a userfault will trigger.
>               */
>  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>              ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                              start, length);
> +            if (ret) {
> +                ret = -errno;
> +                error_report("ram_block_discard_range: Failed to fallocate "
> +                             "%s:%" PRIx64 " +%zx (%d)",
> +                             rb->idstr, start, length, ret);
> +                goto err;
> +            }
> +#else
> +            ret = -ENOSYS;
> +            error_report("ram_block_discard_range: fallocate not available/file"
> +                         "%s:%" PRIx64 " +%zx (%d)",
> +                         rb->idstr, start, length, ret);
> +            goto err;
>  #endif
>          }
> -        if (ret) {
> -            ret = -errno;
> -            error_report("ram_block_discard_range: Failed to discard range "
> +        if (need_madvise) {
> +            /* For normal RAM this causes it to be unmapped,
> +             * for shared memory it causes the local mapping to disappear
> +             * and to fall back on the file contents (which we just
> +             * fallocate'd away).
> +             */
> +#if defined(CONFIG_MADVISE)
> +            ret =  madvise(host_startaddr, length, MADV_DONTNEED);
> +            if (ret) {
> +                ret = -errno;
> +                error_report("ram_block_discard_range: Failed to discard range "
> +                             "%s:%" PRIx64 " +%zx (%d)",
> +                             rb->idstr, start, length, ret);
> +                goto err;
> +            }
> +#else
> +            ret = -ENOSYS;
> +            error_report("ram_block_discard_range: MADVISE not available"
>                           "%s:%" PRIx64 " +%zx (%d)",
>                           rb->idstr, start, length, ret);
> +            goto err;
> +#endif
>          }
> +        trace_ram_block_discard_range(rb->idstr, host_startaddr,
> +                                      need_madvise, need_fallocate, ret);

Nit: worth logging the length too, since it's named "range"?

Either with/without:

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 03/29] postcopy: use UFFDIO_ZEROPAGE only when available
  2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 03/29] postcopy: use UFFDIO_ZEROPAGE only when available Dr. David Alan Gilbert (git)
@ 2018-02-28  6:53   ` Peter Xu
  2018-03-05 17:23     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-02-28  6:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:15:59PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Use a flag on the RAMBlock to state whether it has the
> UFFDIO_ZEROPAGE capability, use it when it's available.
> 
> This allows the use of postcopy on tmpfs as well as hugepage
> backed files.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  exec.c                    | 15 +++++++++++++++
>  include/exec/cpu-common.h |  3 +++
>  migration/postcopy-ram.c  | 13 ++++++++++---
>  3 files changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 0ec73bc917..1dc15298c2 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -99,6 +99,11 @@ static MemoryRegion io_mem_unassigned;
>   */
>  #define RAM_RESIZEABLE (1 << 2)
>  
> +/* UFFDIO_ZEROPAGE is available on this RAMBlock to atomically
> + * zero the page and wake waiting processes.
> + * (Set during postcopy)
> + */
> +#define RAM_UF_ZEROPAGE (1 << 3)
>  #endif
>  
>  #ifdef TARGET_PAGE_BITS_VARY
> @@ -1767,6 +1772,16 @@ bool qemu_ram_is_shared(RAMBlock *rb)
>      return rb->flags & RAM_SHARED;
>  }
>  
> +bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
> +{
> +    return rb->flags & RAM_UF_ZEROPAGE;
> +}
> +
> +void qemu_ram_set_uf_zeroable(RAMBlock *rb)
> +{
> +    rb->flags |= RAM_UF_ZEROPAGE;
> +}
> +
>  /* Called with iothread lock held.  */
>  void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
>  {
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 0d861a6289..24d335f95d 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -73,6 +73,9 @@ void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
>  void qemu_ram_unset_idstr(RAMBlock *block);
>  const char *qemu_ram_get_idstr(RAMBlock *rb);
>  bool qemu_ram_is_shared(RAMBlock *rb);
> +bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
> +void qemu_ram_set_uf_zeroable(RAMBlock *rb);
> +
>  size_t qemu_ram_pagesize(RAMBlock *block);
>  size_t qemu_ram_pagesize_largest(void);
>  
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index bec6c2c66b..6297979700 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -490,6 +490,10 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
>          error_report("%s userfault: Region doesn't support COPY", __func__);
>          return -1;
>      }
> +    if (reg_struct.ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE)) {
> +        RAMBlock *rb = qemu_ram_block_by_name(block_name);
> +        qemu_ram_set_uf_zeroable(rb);
> +    }

So the zeroable flag is only set after the listening phase of an
incoming postcopy migration.  One thing I'm a bit worried about is
that someone else wanting to use the flag for a RAMBlock may not
notice this.  Say, qemu_ram_is_uf_zeroable() is not valid if there is
no such incoming postcopy migration.

Maybe worth add a comment in the flag definition about this?

Not a big deal (considering that I see no other potential QEMU user
for userfaultfd in the short term), so either way:

Reviewed-by: Peter Xu <peterx@redhat.com>

>  
>      return 0;
>  }
> @@ -699,11 +703,14 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>  int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
>                               RAMBlock *rb)
>  {
> +    size_t pagesize = qemu_ram_pagesize(rb);
>      trace_postcopy_place_page_zero(host);
>  
> -    if (qemu_ram_pagesize(rb) == getpagesize()) {
> -        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, getpagesize(),
> -                                rb)) {
> +    /* Normal RAMBlocks can zero a page using UFFDIO_ZEROPAGE
> +     * but it's not available for everything (e.g. hugetlbpages)
> +     */
> +    if (qemu_ram_is_uf_zeroable(rb)) {
> +        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, pagesize, rb)) {
>              int e = errno;
>              error_report("%s: %s zero host: %p",
>                           __func__, strerror(e), host);
> -- 
> 2.14.3
> 

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 05/29] postcopy: Add vhost-user flag for postcopy and check it
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 05/29] postcopy: Add vhost-user flag for postcopy and check it Dr. David Alan Gilbert (git)
@ 2018-02-28  7:14   ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2018-02-28  7:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:01PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Add a vhost feature flag for postcopy support, and
> use the postcopy notifier to check it before allowing postcopy.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 09/29] postcopy: Allow registering of fd handler
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 09/29] postcopy: Allow registering of fd handler Dr. David Alan Gilbert (git)
@ 2018-02-28  8:38   ` Peter Xu
  2018-03-05 17:35     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-02-28  8:38 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:05PM +0000, Dr. David Alan Gilbert (git) wrote:

[...]

> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index bee21d4401..4bda5aa509 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -141,4 +141,25 @@ void postcopy_remove_notifier(NotifierWithReturn *n);
>  /* Call the notifier list set by postcopy_add_start_notifier */
>  int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp);
>  
> +struct PostCopyFD;
> +
> +/* ufd is a pointer to the struct uffd_msg *TODO: more Portable! */
> +typedef int (*pcfdhandler)(struct PostCopyFD *pcfd, void *ufd);
> +
> +struct PostCopyFD {
> +    int fd;
> +    /* Data to pass to handler */
> +    void *data;
> +    /* Handler to be called whenever we get a poll event */
> +    pcfdhandler handler;
> +    /* A string to use in error messages */
> +    char *idstr;

This was changed to const char in next patch.  We can move it here?

The patch is a big one; there are quite a lot of TODOs, and I still
think some helper functions could be shared between the fd handling
for 0-1 and 2-N, but it looks good to me for merging as a first version.

After we have confirmed the definition of PostCopyFD please add:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 11/29] vhost+postcopy: Transmit 'listen' to client
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 11/29] vhost+postcopy: Transmit 'listen' to client Dr. David Alan Gilbert (git)
@ 2018-02-28  8:42   ` Peter Xu
  2018-03-05 17:42     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-02-28  8:42 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:07PM +0000, Dr. David Alan Gilbert (git) wrote:

[...]

>  typedef struct VuVirtqElement {
> diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> index 621543e654..bdec9ec0e8 100644
> --- a/docs/interop/vhost-user.txt
> +++ b/docs/interop/vhost-user.txt
> @@ -682,6 +682,12 @@ Master message types
>        the slave must open a userfaultfd for later use.
>        Note that at this stage the migration is still in precopy mode.
>  
> + * VHOST_USER_POSTCOPY_LISTEN
> +      Id: 27
> +      Master payload: N/A
> +
> +      Master advises slave that a transition to postcopy mode has happened.

Could we add something to explain why this 'listen' needs to be
broadcast to the clients?  I failed to work it out quickly
myself. :(

-- 
Peter Xu

* Re: [Qemu-devel] [PATCH v3 10/29] vhost+postcopy: Register shared ufd with postcopy
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 10/29] vhost+postcopy: Register shared ufd with postcopy Dr. David Alan Gilbert (git)
@ 2018-02-28  8:46   ` Peter Xu
  2018-03-05 18:21     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-02-28  8:46 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:06PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Register the UFD that comes in as the response to the 'advise' method
> with the postcopy code.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/virtio/vhost-user.c   | 21 ++++++++++++++++++++-
>  migration/postcopy-ram.h |  2 +-
>  2 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 4f59993baa..dd4eb50668 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -24,6 +24,7 @@
>  #include <sys/socket.h>
>  #include <sys/un.h>
>  #include <linux/vhost.h>
> +#include <linux/userfaultfd.h>

Why this line?

>  
>  #define VHOST_MEMORY_MAX_NREGIONS    8
>  #define VHOST_USER_F_PROTOCOL_FEATURES 30
> @@ -155,6 +156,7 @@ struct vhost_user {
>      CharBackend *chr;
>      int slave_fd;
>      NotifierWithReturn postcopy_notifier;
> +    struct PostCopyFD  postcopy_fd;
>  };
>  
>  static bool ioeventfd_enabled(void)
> @@ -780,6 +782,17 @@ out:
>      return ret;
>  }
>  
> +/*
> + * Called back from the postcopy fault thread when a fault is received on our
> + * ufd.
> + * TODO: This is Linux specific
> + */
> +static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
> +                                             void *ufd)
> +{
> +    return 0;
> +}
> +
>  /*
>   * Called at the start of an inbound postcopy on reception of the
>   * 'advise' command.
> @@ -819,8 +832,14 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
>          error_setg(errp, "%s: Failed to get ufd", __func__);
>          return -1;
>      }
> +    fcntl(ufd, F_SETFL, O_NONBLOCK);

Only curious: would it work even without this line?
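For illustration, the property this fcntl buys can be sketched with a pipe standing in for the ufd: with O_NONBLOCK set, a read() when no event is pending returns EAGAIN instead of stalling the calling thread. This is a generic POSIX sketch, not code from the series:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Generic sketch: a nonblocking event read, as a poll() loop would use it.
 * Returns 0 when no event is pending (EAGAIN), otherwise the read result. */
static ssize_t try_read_event(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n < 0 && errno == EAGAIN) {
        return 0;   /* nothing pending: go back to poll() rather than block */
    }
    return n;
}

static int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```

Presumably reads would usually succeed anyway because they follow a poll() wakeup, but a blocking fd leaves the thread stuck if the event is consumed elsewhere, so the flag is cheap insurance.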

>  
> -    /* TODO: register ufd with userfault thread */
> +    /* register ufd with userfault thread */
> +    u->postcopy_fd.fd = ufd;
> +    u->postcopy_fd.data = dev;
> +    u->postcopy_fd.handler = vhost_user_postcopy_fault_handler;
> +    u->postcopy_fd.idstr = "vhost-user"; /* Need to find unique name */
> +    postcopy_register_shared_ufd(&u->postcopy_fd);
>      return 0;
>  }
>  
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index 4bda5aa509..23efbdf346 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -153,7 +153,7 @@ struct PostCopyFD {
>      /* Handler to be called whenever we get a poll event */
>      pcfdhandler handler;
>      /* A string to use in error messages */
> -    char *idstr;
> +    const char *idstr;

Move to previous patch?

>  };
>  
>  /* Register a userfaultfd owned by an external process for
> -- 
> 2.14.3
> 

-- 
Peter Xu

* Re: [Qemu-devel] [PATCH v3 12/29] postcopy+vhost-user: Split set_mem_table for postcopy
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 12/29] postcopy+vhost-user: Split set_mem_table for postcopy Dr. David Alan Gilbert (git)
@ 2018-02-28  8:49   ` Peter Xu
  2018-03-05 18:45     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-02-28  8:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:08PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Split the set_mem_table routines in both qemu and libvhost-user
> because the postcopy versions are going to be quite different
> once changes in the later patches are added.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 53 ++++++++++++++++++++++++
>  hw/virtio/vhost-user.c                | 77 ++++++++++++++++++++++++++++++++++-
>  2 files changed, 128 insertions(+), 2 deletions(-)
> 
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index beec7695a8..4922b2c722 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -448,6 +448,55 @@ vu_reset_device_exec(VuDev *dev, VhostUserMsg *vmsg)
>      return false;
>  }
>  
> +static bool
> +vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
> +{
> +    int i;
> +    VhostUserMemory *memory = &vmsg->payload.memory;
> +    dev->nregions = memory->nregions;
> +    /* TODO: Postcopy specific code */
> +    DPRINT("Nregions: %d\n", memory->nregions);
> +    for (i = 0; i < dev->nregions; i++) {
> +        void *mmap_addr;
> +        VhostUserMemoryRegion *msg_region = &memory->regions[i];
> +        VuDevRegion *dev_region = &dev->regions[i];
> +
> +        DPRINT("Region %d\n", i);
> +        DPRINT("    guest_phys_addr: 0x%016"PRIx64"\n",
> +               msg_region->guest_phys_addr);
> +        DPRINT("    memory_size:     0x%016"PRIx64"\n",
> +               msg_region->memory_size);
> +        DPRINT("    userspace_addr   0x%016"PRIx64"\n",
> +               msg_region->userspace_addr);
> +        DPRINT("    mmap_offset      0x%016"PRIx64"\n",
> +               msg_region->mmap_offset);
> +
> +        dev_region->gpa = msg_region->guest_phys_addr;
> +        dev_region->size = msg_region->memory_size;
> +        dev_region->qva = msg_region->userspace_addr;
> +        dev_region->mmap_offset = msg_region->mmap_offset;
> +
> +        /* We don't use offset argument of mmap() since the
> +         * mapped address has to be page aligned, and we use huge
> +         * pages.  */
> +        mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
> +                         PROT_READ | PROT_WRITE, MAP_SHARED,
> +                         vmsg->fds[i], 0);
> +
> +        if (mmap_addr == MAP_FAILED) {
> +            vu_panic(dev, "region mmap error: %s", strerror(errno));
> +        } else {
> +            dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
> +            DPRINT("    mmap_addr:       0x%016"PRIx64"\n",
> +                   dev_region->mmap_addr);
> +        }
> +
> +        close(vmsg->fds[i]);
> +    }
> +
> +    return false;
> +}
> +
>  static bool
>  vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
>  {
> @@ -464,6 +513,10 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
>      }
>      dev->nregions = memory->nregions;
>  
> +    if (dev->postcopy_listening) {
> +        return vu_set_mem_table_exec_postcopy(dev, vmsg);
> +    }
> +
>      DPRINT("Nregions: %d\n", memory->nregions);
>      for (i = 0; i < dev->nregions; i++) {
>          void *mmap_addr;
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index ec6a4a82fd..64f4b3b3f9 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -325,15 +325,86 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
>      return 0;
>  }
>  
> +static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
> +                                             struct vhost_memory *mem)
> +{
> +    int fds[VHOST_MEMORY_MAX_NREGIONS];
> +    int i, fd;
> +    size_t fd_num = 0;
> +    bool reply_supported = virtio_has_feature(dev->protocol_features,
> +                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> +    /* TODO: Add actual postcopy differences */
> +    VhostUserMsg msg = {
> +        .hdr.request = VHOST_USER_SET_MEM_TABLE,
> +        .hdr.flags = VHOST_USER_VERSION,
> +    };
> +
> +    if (reply_supported) {
> +        msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
> +    }
> +
> +    for (i = 0; i < dev->mem->nregions; ++i) {
> +        struct vhost_memory_region *reg = dev->mem->regions + i;
> +        ram_addr_t offset;
> +        MemoryRegion *mr;
> +
> +        assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
> +        mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
> +                                     &offset);
> +        fd = memory_region_get_fd(mr);
> +        if (fd > 0) {
> +            msg.payload.memory.regions[fd_num].userspace_addr =
> +                reg->userspace_addr;
> +            msg.payload.memory.regions[fd_num].memory_size  = reg->memory_size;
> +            msg.payload.memory.regions[fd_num].guest_phys_addr =
> +                reg->guest_phys_addr;
> +            msg.payload.memory.regions[fd_num].mmap_offset = offset;
> +            assert(fd_num < VHOST_MEMORY_MAX_NREGIONS);
> +            fds[fd_num++] = fd;
> +        }
> +    }
> +
> +    msg.payload.memory.nregions = fd_num;
> +
> +    if (!fd_num) {
> +        error_report("Failed initializing vhost-user memory map, "
> +                     "consider using -object memory-backend-file share=on");
> +        return -1;
> +    }
> +
> +    msg.hdr.size = sizeof(msg.payload.memory.nregions);
> +    msg.hdr.size += sizeof(msg.payload.memory.padding);
> +    msg.hdr.size += fd_num * sizeof(VhostUserMemoryRegion);
> +
> +    if (vhost_user_write(dev, &msg, fds, fd_num) < 0) {
> +        return -1;
> +    }
> +
> +    if (reply_supported) {
> +        return process_message_reply(dev, &msg);
> +    }
> +
> +    return 0;
> +}
> +
>  static int vhost_user_set_mem_table(struct vhost_dev *dev,
>                                      struct vhost_memory *mem)
>  {
> +    struct vhost_user *u = dev->opaque;
>      int fds[VHOST_MEMORY_MAX_NREGIONS];
>      int i, fd;
>      size_t fd_num = 0;
> +    bool do_postcopy = u->postcopy_listen && u->postcopy_fd.handler;
>      bool reply_supported = virtio_has_feature(dev->protocol_features,
>                                                VHOST_USER_PROTOCOL_F_REPLY_ACK);
>  
> +    if (do_postcopy) {
> +        /* Postcopy has enough differences that it's best done in its own
> +         * version
> +         */
> +        return vhost_user_set_mem_table_postcopy(dev, mem);
> +    }
> +
>      VhostUserMsg msg = {
>          .hdr.request = VHOST_USER_SET_MEM_TABLE,
>          .hdr.flags = VHOST_USER_VERSION,
> @@ -357,9 +428,11 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
>                  error_report("Failed preparing vhost-user memory table msg");
>                  return -1;
>              }
> -            msg.payload.memory.regions[fd_num].userspace_addr = reg->userspace_addr;
> +            msg.payload.memory.regions[fd_num].userspace_addr =
> +                reg->userspace_addr;
>              msg.payload.memory.regions[fd_num].memory_size  = reg->memory_size;
> -            msg.payload.memory.regions[fd_num].guest_phys_addr = reg->guest_phys_addr;
> +            msg.payload.memory.regions[fd_num].guest_phys_addr =
> +                reg->guest_phys_addr;

Could these newline changes be avoided?

So after this patch there's no functional change, only the split of
the set_mem_table operation, right?

Thanks,

>              msg.payload.memory.regions[fd_num].mmap_offset = offset;
>              fds[fd_num++] = fd;
>          }
> -- 
> 2.14.3
> 

-- 
Peter Xu
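
[Editorial footnote on the hunk above: vu_set_mem_table_exec_postcopy maps each region from file offset 0 and later adds mmap_offset because mmap()'s offset argument must be page-aligned, and for hugetlbfs fds that means huge-page-aligned. A minimal standalone sketch of the trick, with an illustrative helper name and a regular file standing in for the region fd:]

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Illustrative sketch: rather than passing mmap_offset to mmap() (whose
 * offset must be a multiple of the, possibly huge, page size), map the file
 * from offset 0 over size + mmap_offset bytes and address the interesting
 * data at mmap_addr + mmap_offset.  Returns the usable address or NULL. */
static void *map_with_offset(int fd, size_t size, size_t mmap_offset)
{
    void *addr = mmap(NULL, size + mmap_offset, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    return addr == MAP_FAILED ? NULL : (char *)addr + mmap_offset;
}
```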

* Re: [Qemu-devel] [PATCH v3 13/29] migration/ram: ramblock_recv_bitmap_test_byte_offset
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 13/29] migration/ram: ramblock_recv_bitmap_test_byte_offset Dr. David Alan Gilbert (git)
@ 2018-02-28  8:52   ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2018-02-28  8:52 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:09PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Utility for testing the map when you already know the offset
> in the RAMBlock.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

> ---
>  migration/ram.c | 5 +++++
>  migration/ram.h | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 8333d8e35e..8db5e80500 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -169,6 +169,11 @@ int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr)
>                      rb->receivedmap);
>  }
>  
> +bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset)
> +{
> +    return test_bit(byte_offset >> TARGET_PAGE_BITS, rb->receivedmap);
> +}
> +
>  void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr)
>  {
>      set_bit_atomic(ramblock_recv_bitmap_offset(host_addr, rb), rb->receivedmap);
> diff --git a/migration/ram.h b/migration/ram.h
> index f3a227b4fc..63a37c4683 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -60,6 +60,7 @@ int ram_postcopy_incoming_init(MigrationIncomingState *mis);
>  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
>  
>  int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
> +bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
>  void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
>  void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
>  
> -- 
> 2.14.3
> 

-- 
Peter Xu

* Re: [Qemu-devel] [PATCH v3 17/29] vhost+postcopy: Send requests to source for shared pages
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 17/29] vhost+postcopy: Send requests to source for shared pages Dr. David Alan Gilbert (git)
@ 2018-02-28 10:03   ` Peter Xu
  2018-03-05 18:55     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-02-28 10:03 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:13PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Send requests back to the source for shared page requests.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/migration.h    |  2 ++
>  migration/postcopy-ram.c | 31 ++++++++++++++++++++++++++++---
>  migration/postcopy-ram.h |  3 +++
>  migration/trace-events   |  2 ++
>  4 files changed, 35 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/migration.h b/migration/migration.h
> index d158e62cf2..457bf37ec2 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -46,6 +46,8 @@ struct MigrationIncomingState {
>      int       userfault_quit_fd;
>      QEMUFile *to_src_file;
>      QemuMutex rp_mutex;    /* We send replies from multiple threads */
> +    /* RAMBlock of last request sent to source */
> +    RAMBlock *last_rb;
>      void     *postcopy_tmp_page;
>      void     *postcopy_tmp_zero_page;
>      /* PostCopyFD's for external userfaultfds & handlers of shared memory */
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index d118b78bf5..277ff749a0 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -534,6 +534,31 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
>      return 0;
>  }
>  
> +/*
> + * Callback from shared fault handlers to ask for a page,
> + * the page must be specified by a RAMBlock and an offset in that rb
> + */
> +int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
> +                                 uint64_t client_addr, uint64_t rb_offset)
> +{
> +    size_t pagesize = qemu_ram_pagesize(rb);
> +    uint64_t aligned_rbo = rb_offset & ~(pagesize - 1);
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +
> +    trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
> +                                       rb_offset);
> +    /* TODO: Check bitmap to see if we already have the page */
> +    if (rb != mis->last_rb) {
> +        mis->last_rb = rb;
> +        migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
> +                                  aligned_rbo, pagesize);
> +    } else {
> +        /* Save some space */
> +        migrate_send_rp_req_pages(mis, NULL, aligned_rbo, pagesize);
> +    }
> +    return 0;
> +}
> +

So IIUC this can only be called from within the page fault thread,
otherwise there can be a race.  Is there a way to guarantee this?  Or
do we need a comment for that?
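For reference, the two bookkeeping steps in postcopy_request_shared_page quoted above, aligning the fault offset down to the RAMBlock's page size and omitting the idstr when it matches the last request, can be sketched in isolation. The names are illustrative, and a string compare stands in for the RAMBlock pointer comparison in the real code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch, not the series' API: align a fault offset down to the
 * block's page size, mirroring "rb_offset & ~(pagesize - 1)" above. */
static uint64_t align_down(uint64_t rb_offset, uint64_t pagesize)
{
    return rb_offset & ~(pagesize - 1);
}

/* Returns the idstr to put in the request, or NULL when it matches the one
 * sent last time (the "save some space" case, where the source reuses the
 * previously named block). */
static const char *idstr_to_send(const char **last_idstr, const char *idstr)
{
    if (*last_idstr != NULL && strcmp(*last_idstr, idstr) == 0) {
        return NULL;
    }
    *last_idstr = idstr;
    return idstr;
}
```

Since the last-sent state (mis->last_rb in the real code) is read and written without locking, the sketch also shows why all callers need to stay on a single thread, as the thread-affinity question above notes.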

>  /*
>   * Handle faults detected by the USERFAULT markings
>   */
> @@ -544,9 +569,9 @@ static void *postcopy_ram_fault_thread(void *opaque)
>      int ret;
>      size_t index;
>      RAMBlock *rb = NULL;
> -    RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
>  
>      trace_postcopy_ram_fault_thread_entry();
> +    mis->last_rb = NULL; /* last RAMBlock we sent part of */
>      qemu_sem_post(&mis->fault_thread_sem);
>  
>      struct pollfd *pfd;
> @@ -634,8 +659,8 @@ static void *postcopy_ram_fault_thread(void *opaque)
>               * Send the request to the source - we want to request one
>               * of our host page sizes (which is >= TPS)
>               */
> -            if (rb != last_rb) {
> -                last_rb = rb;
> +            if (rb != mis->last_rb) {
> +                mis->last_rb = rb;
>                  migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
>                                           rb_offset, qemu_ram_pagesize(rb));
>              } else {
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index dbc2ee1f2b..4c63f20df4 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -162,5 +162,8 @@ struct PostCopyFD {
>   */
>  void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
>  void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
> +/* Callback from shared fault handlers to ask for a page */
> +int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
> +                                 uint64_t client_addr, uint64_t offset);
>  
>  #endif
> diff --git a/migration/trace-events b/migration/trace-events
> index 1e617ad7a6..7c910b5479 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -198,6 +198,8 @@ postcopy_ram_incoming_cleanup_closeuf(void) ""
>  postcopy_ram_incoming_cleanup_entry(void) ""
>  postcopy_ram_incoming_cleanup_exit(void) ""
>  postcopy_ram_incoming_cleanup_join(void) ""
> +postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
> +
>  save_xbzrle_page_skipping(void) ""
>  save_xbzrle_page_overflow(void) ""
>  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> -- 
> 2.14.3
> 

-- 
Peter Xu

* Re: [Qemu-devel] [PATCH v3 15/29] vhost+postcopy: Send address back to qemu
  2018-02-27 20:25       ` Michael S. Tsirkin
@ 2018-02-28 18:26         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-28 18:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo,
	quintela, aarcange

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Tue, Feb 27, 2018 at 07:54:18PM +0000, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > On Fri, Feb 16, 2018 at 01:16:11PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > We need a better way, but at the moment we need the address of the
> > > > mappings sent back to qemu so it can interpret the messages on the
> > > > userfaultfd it reads.
> > > > 
> > > > This is done as a 3 stage set:
> > > >    QEMU -> client
> > > >       set_mem_table
> > > > 
> > > >    mmap stuff, get addresses
> > > > 
> > > >    client -> qemu
> > > >        here are the addresses
> > > > 
> > > >    qemu -> client
> > > >        OK - now you can use them
> > > > 
> > > > That ensures that qemu has registered the new addresses in its
> > > > userfault code before the client starts accessing them.
> > > > 
> > > > Note: We don't ask for the default 'ack' reply since we've got our own.
> > > > 
> > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > ---
> > > >  contrib/libvhost-user/libvhost-user.c | 24 ++++++++++++-
> > > >  docs/interop/vhost-user.txt           |  9 +++++
> > > >  hw/virtio/trace-events                |  1 +
> > > >  hw/virtio/vhost-user.c                | 67 +++++++++++++++++++++++++++++++++--
> > > >  4 files changed, 98 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > > > index a18bc74a7c..e02e5d6f46 100644
> > > > --- a/contrib/libvhost-user/libvhost-user.c
> > > > +++ b/contrib/libvhost-user/libvhost-user.c
> > > > @@ -491,10 +491,32 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
> > > >                     dev_region->mmap_addr);
> > > >          }
> > > >  
> > > > +        /* Return the address to QEMU so that it can translate the ufd
> > > > +         * fault addresses back.
> > > > +         */
> > > > +        msg_region->userspace_addr = (uintptr_t)(mmap_addr +
> > > > +                                                 dev_region->mmap_offset);
> > > >          close(vmsg->fds[i]);
> > > >      }
> > > >  
> > > > -    /* TODO: Get address back to QEMU */
> > > > +    /* Send the message back to qemu with the addresses filled in */
> > > > +    vmsg->fd_num = 0;
> > > > +    if (!vu_message_write(dev, dev->sock, vmsg)) {
> > > > +        vu_panic(dev, "failed to respond to set-mem-table for postcopy");
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    /* Wait for QEMU to confirm that it's registered the handler for the
> > > > +     * faults.
> > > > +     */
> > > > +    if (!vu_message_read(dev, dev->sock, vmsg) ||
> > > > +        vmsg->size != sizeof(vmsg->payload.u64) ||
> > > > +        vmsg->payload.u64 != 0) {
> > > > +        vu_panic(dev, "failed to receive valid ack for postcopy set-mem-table");
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    /* OK, now we can go and register the memory and generate faults */
> > > >      for (i = 0; i < dev->nregions; i++) {
> > > >          VuDevRegion *dev_region = &dev->regions[i];
> > > >  #ifdef UFFDIO_REGISTER
> > > > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > > > index bdec9ec0e8..5bbcab2cc4 100644
> > > > --- a/docs/interop/vhost-user.txt
> > > > +++ b/docs/interop/vhost-user.txt
> > > > @@ -454,12 +454,21 @@ Master message types
> > > >        Id: 5
> > > >        Equivalent ioctl: VHOST_SET_MEM_TABLE
> > > >        Master payload: memory regions description
> > > > +      Slave payload: (postcopy only) memory regions description
> > > >  
> > > >        Sets the memory map regions on the slave so it can translate the vring
> > > >        addresses. In the ancillary data there is an array of file descriptors
> > > >        for each memory mapped region. The size and ordering of the fds matches
> > > >        the number and ordering of memory regions.
> > > >  
> > > > +      When postcopy-listening has been received,
> > > 
> > > Which message is this?
> > 
> > VHOST_USER_POSTCOPY_LISTEN
> > 
> > Do you want me just to change that to, 'When VHOST_USER_POSTCOPY_LISTEN
> > has been received' ?
> 
> I think it's better this way, yes.

Done.

> > > > SET_MEM_TABLE replies with
> > > > +      the bases of the memory mapped regions to the master.  It must have mmap'd
> > > > +      the regions but not yet accessed them and should not yet generate a userfault
> > > > +      event. Note NEED_REPLY_MASK is not set in this case.
> > > > +      QEMU will then reply back to the list of mappings with an empty
> > > > +      VHOST_USER_SET_MEM_TABLE as an acknowledgment; only upon reception of this
> > > > +      message may the guest start accessing the memory and generating faults.
> > > > +
> > > >   * VHOST_USER_SET_LOG_BASE
> > > >  
> > > >        Id: 6
> > > 
> > > As you say yourself, this is probably the best we can do for now,
> > > but it's not ideal. So I think it's a good idea to isolate this
> > > behind a separate protocol feature bit. For now it will be required
> > > for postcopy; when it's fixed in the kernel we can drop it
> > > cleanly.
> > > 
> > 
> > While we've talked about ways of avoiding the exact addresses being
> > known by the slave, I'm not sure we've talked about a way of removing
> > this handshake, although it's doable if we move more of the work to the QEMU
> > side.
> > 
> > Dave
> 
> Some kernel changes might thinkably remove the need for use of the
> address with userfaultfd, too.
> 
> > > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > > index 06ec03d6e7..05d18ada77 100644
> > > > --- a/hw/virtio/trace-events
> > > > +++ b/hw/virtio/trace-events
> > > > @@ -8,6 +8,7 @@ vhost_section(const char *name, int r) "%s:%d"
> > > >  
> > > >  # hw/virtio/vhost-user.c
> > > >  vhost_user_postcopy_listen(void) ""
> > > > +vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> > > >  
> > > >  # hw/virtio/virtio.c
> > > >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > index 64f4b3b3f9..a060442cb9 100644
> > > > --- a/hw/virtio/vhost-user.c
> > > > +++ b/hw/virtio/vhost-user.c
> > > > @@ -159,6 +159,7 @@ struct vhost_user {
> > > >      int slave_fd;
> > > >      NotifierWithReturn postcopy_notifier;
> > > >      struct PostCopyFD  postcopy_fd;
> > > > +    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
> > > >      /* True once we've entered postcopy_listen */
> > > >      bool               postcopy_listen;
> > > >  };
> > > > @@ -328,12 +329,15 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
> > > >  static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
> > > >                                               struct vhost_memory *mem)
> > > >  {
> > > > +    struct vhost_user *u = dev->opaque;
> > > >      int fds[VHOST_MEMORY_MAX_NREGIONS];
> > > >      int i, fd;
> > > >      size_t fd_num = 0;
> > > >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> > > >                                                VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > > > -    /* TODO: Add actual postcopy differences */
> > > > +    VhostUserMsg msg_reply;
> > > > +    int region_i, msg_i;
> > > > +
> > > >      VhostUserMsg msg = {
> > > >          .hdr.request = VHOST_USER_SET_MEM_TABLE,
> > > >          .hdr.flags = VHOST_USER_VERSION,
> > > > @@ -380,6 +384,64 @@ static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
> > > >          return -1;
> > > >      }
> > > >  
> > > > +    if (vhost_user_read(dev, &msg_reply) < 0) {
> > > > +        return -1;
> > > > +    }
> > > > +
> > > > +    if (msg_reply.hdr.request != VHOST_USER_SET_MEM_TABLE) {
> > > > +        error_report("%s: Received unexpected msg type."
> > > > +                     "Expected %d received %d", __func__,
> > > > +                     VHOST_USER_SET_MEM_TABLE, msg_reply.hdr.request);
> > > > +        return -1;
> > > > +    }
> > > > +    /* We're using the same structure, just reusing one of the
> > > > +     * fields, so it should be the same size.
> > > > +     */
> > > > +    if (msg_reply.hdr.size != msg.hdr.size) {
> > > > +        error_report("%s: Unexpected size for postcopy reply "
> > > > +                     "%d vs %d", __func__, msg_reply.hdr.size, msg.hdr.size);
> > > > +        return -1;
> > > > +    }
> > > > +
> > > > +    memset(u->postcopy_client_bases, 0,
> > > > +           sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
> > > > +
> > > > +    /* They're in the same order as the regions that were sent
> > > > +     * but some of the regions were skipped (above) if they
> > > > +     * didn't have fd's
> > > > +    */
> > > > +    for (msg_i = 0, region_i = 0;
> > > > +         region_i < dev->mem->nregions;
> > > > +        region_i++) {
> > > > +        if (msg_i < fd_num &&
> > > > +            msg_reply.payload.memory.regions[msg_i].guest_phys_addr ==
> > > > +            dev->mem->regions[region_i].guest_phys_addr) {
> > > > +            u->postcopy_client_bases[region_i] =
> > > > +                msg_reply.payload.memory.regions[msg_i].userspace_addr;
> > > > +            trace_vhost_user_set_mem_table_postcopy(
> > > > +                msg_reply.payload.memory.regions[msg_i].userspace_addr,
> > > > +                msg.payload.memory.regions[msg_i].userspace_addr,
> > > > +                msg_i, region_i);
> > > > +            msg_i++;
> > > > +        }
> > > > +    }
> > > > +    if (msg_i != fd_num) {
> > > > +        error_report("%s: postcopy reply not fully consumed "
> > > > +                     "%d vs %zd",
> > > > +                     __func__, msg_i, fd_num);
> > > > +        return -1;
> > > > +    }
> > > > +    /* Now we've registered this with the postcopy code, we ack to the client,
> > > > +     * because now we're in the position to be able to deal with any faults
> > > > +     * it generates.
> > > > +     */
> > > > +    /* TODO: Use this for failure cases as well with a bad value */
> > > > +    msg.hdr.size = sizeof(msg.payload.u64);
> > > > +    msg.payload.u64 = 0; /* OK */
> > > > +    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
> > > > +        return -1;
> > > > +    }
> > > > +
> > > >      if (reply_supported) {
> > > >          return process_message_reply(dev, &msg);
> > > >      }
> > > > @@ -396,7 +458,8 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > > >      size_t fd_num = 0;
> > > >      bool do_postcopy = u->postcopy_listen && u->postcopy_fd.handler;
> > > >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> > > > -                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > > > +                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
> > > > +                                          !do_postcopy;
> > > >  
> > > >      if (do_postcopy) {
> > > >          /* Postcopy has enough differences that it's best done in its own
> > > > -- 
> > > > 2.14.3
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
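
The region-matching part of the reply handling quoted above can be isolated as follows. This is an illustrative sketch, not code from the series: struct region and stash_client_bases are simplified stand-ins for the VhostUserMemory regions and vhost_user state, and the fd/mmap checks of the real loop are omitted — only the guest_phys_addr lockstep matching shown in the hunk is modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical simplified stand-in for a memory region descriptor. */
struct region {
    uint64_t guest_phys_addr;
    uint64_t userspace_addr;
};

/* Walk the full device region list; whenever the next unconsumed reply
 * region matches on guest_phys_addr, stash the client-side base for
 * that device region and advance.  Returns the number of reply regions
 * consumed, which the caller checks against fd_num (the "postcopy
 * reply not fully consumed" error in the patch). */
static size_t stash_client_bases(const struct region *reply, size_t n_reply,
                                 const struct region *dev, size_t n_dev,
                                 uint64_t *client_bases)
{
    size_t msg_i = 0;

    for (size_t region_i = 0; region_i < n_dev; region_i++) {
        if (msg_i < n_reply &&
            reply[msg_i].guest_phys_addr == dev[region_i].guest_phys_addr) {
            client_bases[region_i] = reply[msg_i].userspace_addr;
            msg_i++;
        }
    }
    return msg_i;
}
```

The lockstep walk works because the slave replies with regions in the same order they were sent; a region without an fd simply never appears in the reply.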


* Re: [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram
  2018-02-27 20:23     ` Michael S. Tsirkin
@ 2018-02-28 18:38       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-28 18:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, peterx, imammedo,
	quintela, aarcange

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Tue, Feb 27, 2018 at 08:05:25PM +0000, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > On Fri, Feb 16, 2018 at 01:15:56PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > Hi,
> > > >   This is the first non-RFC version of this patch set that
> > > > enables postcopy migration with shared memory to a vhost user process.
> > > > It's based off current head.
> > > > 
> > > > I've tested with vhost-user-bridge and a modified dpdk; both very
> > > > lightly.
> > > > 
> > > > Compared to v2 we're now using the just-merged reworks to the vhost
> > > > code (suggested by Igor), so that the huge page region merging is now a lot simpler
> > > > in this series. The handshake between the client and the qemu for the
> > > > set-mem-table is now a bit more complex to resolve a previous race where
> > > > the client would start sending requests to the qemu prior to the qemu
> > > > being ready to accept them.
> > > > 
> > > > Dave
> > > 
> > > From vhost-user POV this seems mostly fine to me.
> > 
> > OK, great - it would be nice to get this merged in the upcoming release
> > (Hint: Anyone else please review!)
> > 
> > > I would like to have dependency of specific messages on the
> > > protocol features documented, and the order of messages
> > > documented a bit more explicitly.
> > 
> > Something like the following? (appropriately merged in with the
> > individual commits):
> > 
> > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > index 4bf7d8ef99..7841812766 100644
> > --- a/docs/interop/vhost-user.txt
> > +++ b/docs/interop/vhost-user.txt
> > @@ -461,7 +461,7 @@ Master message types
> >        for each memory mapped region. The size and ordering of the fds matches
> >        the number and ordering of memory regions.
> >  
> > -      When postcopy-listening has been received, SET_MEM_TABLE replies with
> > +      When VHOST_USER_POSTCOPY_LISTEN has been received, SET_MEM_TABLE replies with
> >        the bases of the memory mapped regions to the master.  It must have mmap'd
> >        the regions but not yet accessed them and should not yet generate a userfault
> >        event. Note NEED_REPLY_MASK is not set in this case.
> > @@ -687,7 +687,8 @@ Master message types
> >        Master payload: N/A
> >        Slave payload: userfault fd + u64
> >  
> > -      Master advises slave that a migration with postcopy enabled is underway,
> > +      When VHOST_USER_PROTOCOL_F_PAGEFAULT is supported, the
> > +      master advises slave that a migration with postcopy enabled is underway,
> >        the slave must open a userfaultfd for later use.
> >        Note that at this stage the migration is still in precopy mode.
> >  
> > @@ -696,6 +697,8 @@ Master message types
> >        Master payload: N/A
> >  
> >        Master advises slave that a transition to postcopy mode has happened.
> > +      This is always sent sometime after a VHOST_USER_POSTCOPY_ADVISE, and
> > +      thus only when VHOST_USER_PROTOCOL_F_PAGEFAULT is supported.
> >  
> >   * VHOST_USER_POSTCOPY_END
> >        Id: 28
> > @@ -704,6 +707,8 @@ Master message types
> >        Master advises that postcopy migration has now completed.  The
> >        slave must disable the userfaultfd. The response is an acknowledgement
> >        only.
> > +      This message is sent at the end of the migration, after
> > +      VHOST_USER_POSTCOPY_LISTEN was previously sent.
> 
> And maybe mention VHOST_USER_PROTOCOL_F_PAGEFAULT here too.

Done.

Dave

> >  Slave message types
> >  -------------------
> > 
> > Dave
> > 
> > > 
> > > 
> > > 
> > > > Dr. David Alan Gilbert (29):
> > > >   migrate: Update ram_block_discard_range for shared
> > > >   qemu_ram_block_host_offset
> > > >   postcopy: use UFFDIO_ZEROPAGE only when available
> > > >   postcopy: Add notifier chain
> > > >   postcopy: Add vhost-user flag for postcopy and check it
> > > >   vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message
> > > >   libvhost-user: Support sending fds back to qemu
> > > >   libvhost-user: Open userfaultfd
> > > >   postcopy: Allow registering of fd handler
> > > >   vhost+postcopy: Register shared ufd with postcopy
> > > >   vhost+postcopy: Transmit 'listen' to client
> > > >   postcopy+vhost-user: Split set_mem_table for postcopy
> > > >   migration/ram: ramblock_recv_bitmap_test_byte_offset
> > > >   libvhost-user+postcopy: Register new regions with the ufd
> > > >   vhost+postcopy: Send address back to qemu
> > > >   vhost+postcopy: Stash RAMBlock and offset
> > > >   vhost+postcopy: Send requests to source for shared pages
> > > >   vhost+postcopy: Resolve client address
> > > >   postcopy: wake shared
> > > >   postcopy: postcopy_notify_shared_wake
> > > >   vhost+postcopy: Add vhost waker
> > > >   vhost+postcopy: Call wakeups
> > > >   libvhost-user: mprotect & madvises for postcopy
> > > >   vhost-user: Add VHOST_USER_POSTCOPY_END message
> > > >   vhost+postcopy: Wire up POSTCOPY_END notify
> > > >   vhost: Huge page align and merge
> > > >   postcopy: Allow shared memory
> > > >   libvhost-user: Claim support for postcopy
> > > >   postcopy shared docs
> > > > 
> > > >  contrib/libvhost-user/libvhost-user.c | 303 ++++++++++++++++++++++++-
> > > >  contrib/libvhost-user/libvhost-user.h |   8 +
> > > >  docs/devel/migration.rst              |  41 ++++
> > > >  docs/interop/vhost-user.txt           |  42 ++++
> > > >  exec.c                                |  85 +++++--
> > > >  hw/virtio/trace-events                |  16 +-
> > > >  hw/virtio/vhost-user.c                | 411 +++++++++++++++++++++++++++++++++-
> > > >  hw/virtio/vhost.c                     |  66 +++++-
> > > >  include/exec/cpu-common.h             |   4 +
> > > >  migration/migration.c                 |   6 +
> > > >  migration/migration.h                 |   4 +
> > > >  migration/postcopy-ram.c              | 350 +++++++++++++++++++++++------
> > > >  migration/postcopy-ram.h              |  69 ++++++
> > > >  migration/ram.c                       |   5 +
> > > >  migration/ram.h                       |   1 +
> > > >  migration/savevm.c                    |  13 ++
> > > >  migration/trace-events                |   6 +
> > > >  trace-events                          |   3 +-
> > > >  vl.c                                  |   2 +
> > > >  19 files changed, 1337 insertions(+), 98 deletions(-)
> > > > 
> > > > -- 
> > > > 2.14.3
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
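
The ordering constraints the documentation hunk spells out (ADVISE, then LISTEN, then END, all gated on VHOST_USER_PROTOCOL_F_PAGEFAULT) can be expressed as a small state machine. The sketch below is illustrative only — the names and validator are ours, not part of the series:

```c
#include <assert.h>
#include <stdbool.h>

enum pc_msg { PC_ADVISE, PC_LISTEN, PC_END };
enum pc_state { PC_NONE, PC_ADVISED, PC_LISTENING, PC_ENDED };

/* Returns true and advances the state if message m is legal now.
 * Every postcopy message requires the PAGEFAULT protocol feature. */
static bool pc_msg_allowed(enum pc_state *s, enum pc_msg m, bool f_pagefault)
{
    if (!f_pagefault) {
        return false;               /* feature not negotiated */
    }
    switch (m) {
    case PC_ADVISE:
        if (*s != PC_NONE) return false;
        *s = PC_ADVISED;
        return true;
    case PC_LISTEN:
        if (*s != PC_ADVISED) return false;
        *s = PC_LISTENING;
        return true;
    case PC_END:
        if (*s != PC_LISTENING) return false;
        *s = PC_ENDED;
        return true;
    }
    return false;
}
```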


* Re: [Qemu-devel] [PATCH v3 01/29] migrate: Update ram_block_discard_range for shared
  2018-02-28  6:37   ` Peter Xu
@ 2018-02-28 19:54     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-02-28 19:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:15:57PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The choice of call to discard a block is getting more complicated
> > for other cases.   We use fallocate PUNCH_HOLE in any file cases;
> > it works for both hugepage and for tmpfs.
> > We use the DONTNEED for non-hugepage cases either where they're
> > anonymous or where they're private.
> > 
> > Care should be taken when trying other backing files.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  exec.c       | 60 ++++++++++++++++++++++++++++++++++++++++++++++--------------
> >  trace-events |  3 ++-
> >  2 files changed, 48 insertions(+), 15 deletions(-)
> > 
> > diff --git a/exec.c b/exec.c
> > index e8d7b335b6..b1bb477776 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -3702,6 +3702,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
> >      }
> >  
> >      if ((start + length) <= rb->used_length) {
> > +        bool need_madvise, need_fallocate;
> >          uint8_t *host_endaddr = host_startaddr + length;
> >          if ((uintptr_t)host_endaddr & (rb->page_size - 1)) {
> >              error_report("ram_block_discard_range: Unaligned end address: %p",
> > @@ -3711,29 +3712,60 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
> >  
> >          errno = ENOTSUP; /* If we are missing MADVISE etc */
> >  
> > -        if (rb->page_size == qemu_host_page_size) {
> > -#if defined(CONFIG_MADVISE)
> > -            /* Note: We need the madvise MADV_DONTNEED behaviour of definitely
> > -             * freeing the page.
> > -             */
> > -            ret = madvise(host_startaddr, length, MADV_DONTNEED);
> > -#endif
> > -        } else {
> > -            /* Huge page case  - unfortunately it can't do DONTNEED, but
> > -             * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
> > -             * huge page file.
> > +        /* The logic here is messy;
> > +         *    madvise DONTNEED fails for hugepages
> > +         *    fallocate works on hugepages and shmem
> > +         */
> > +        need_madvise = (rb->page_size == qemu_host_page_size);
> > +        need_fallocate = rb->fd != -1;
> > +        if (need_fallocate) {
> > +            /* For a file, this causes the area of the file to be zero'd
> > +             * if read, and for hugetlbfs also causes it to be unmapped
> > +             * so a userfault will trigger.
> >               */
> >  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> >              ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> >                              start, length);
> > +            if (ret) {
> > +                ret = -errno;
> > +                error_report("ram_block_discard_range: Failed to fallocate "
> > +                             "%s:%" PRIx64 " +%zx (%d)",
> > +                             rb->idstr, start, length, ret);
> > +                goto err;
> > +            }
> > +#else
> > +            ret = -ENOSYS;
> > +            error_report("ram_block_discard_range: fallocate not available/file"
> > +                         "%s:%" PRIx64 " +%zx (%d)",
> > +                         rb->idstr, start, length, ret);
> > +            goto err;
> >  #endif
> >          }
> > -        if (ret) {
> > -            ret = -errno;
> > -            error_report("ram_block_discard_range: Failed to discard range "
> > +        if (need_madvise) {
> > +            /* For normal RAM this causes it to be unmapped,
> > +             * for shared memory it causes the local mapping to disappear
> > +             * and to fall back on the file contents (which we just
> > +             * fallocate'd away).
> > +             */
> > +#if defined(CONFIG_MADVISE)
> > +            ret =  madvise(host_startaddr, length, MADV_DONTNEED);
> > +            if (ret) {
> > +                ret = -errno;
> > +                error_report("ram_block_discard_range: Failed to discard range "
> > +                             "%s:%" PRIx64 " +%zx (%d)",
> > +                             rb->idstr, start, length, ret);
> > +                goto err;
> > +            }
> > +#else
> > +            ret = -ENOSYS;
> > +            error_report("ram_block_discard_range: MADVISE not available"
> >                           "%s:%" PRIx64 " +%zx (%d)",
> >                           rb->idstr, start, length, ret);
> > +            goto err;
> > +#endif
> >          }
> > +        trace_ram_block_discard_range(rb->idstr, host_startaddr,
> > +                                      need_madvise, need_fallocate, ret);
> 
> Nit: worth logging the length too, given it's named a "range"?

Done.

> Either with/without:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks.

Dave

> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
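
The need_madvise/need_fallocate selection the patch introduces can be restated as a pure decision function. The names below are ours for illustration; the real code goes on to perform the actual fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) and madvise(MADV_DONTNEED) calls on the range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct discard_plan {
    bool need_madvise;      /* MADV_DONTNEED on the mapping */
    bool need_fallocate;    /* FALLOC_FL_PUNCH_HOLE on the backing file */
};

/* madvise DONTNEED only works for host-page-sized blocks (it fails for
 * hugepages); fallocate punch-hole applies whenever the block is
 * file-backed (fd != -1), covering both hugetlbfs and shmem/tmpfs. */
static struct discard_plan plan_discard(size_t block_page_size,
                                        size_t host_page_size, int fd)
{
    struct discard_plan p;

    p.need_madvise = (block_page_size == host_page_size);
    p.need_fallocate = (fd != -1);
    return p;
}
```

Shared small-page memory (shmem) is the case that needs both: the punch-hole drops the file contents, and the madvise drops the local mapping so a later access faults back onto the (now empty) file.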


* Re: [Qemu-devel] [PATCH v3 18/29] vhost+postcopy: Resolve client address
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 18/29] vhost+postcopy: Resolve client address Dr. David Alan Gilbert (git)
@ 2018-03-02  7:29   ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2018-03-02  7:29 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:14PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Resolve fault addresses read off the clients UFD into RAMBlock
> and offset, and call back to the postcopy code to ask for the page.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 19/29] postcopy: wake shared
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 19/29] postcopy: wake shared Dr. David Alan Gilbert (git)
@ 2018-03-02  7:44   ` Peter Xu
  2018-03-05 19:35     ` Dr. David Alan Gilbert
  2018-03-12 15:44   ` Marc-André Lureau
  1 sibling, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-03-02  7:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:15PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Send a 'wake' request on a userfaultfd for a shared process.
> The address in the clients address space is specified together
> with the RAMBlock it was resolved to.

I think it's "providing a helper to send WAKE to uffd" rather than
really sending it.

Otherwise it looks good to me.  Thanks,

> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/postcopy-ram.c | 26 ++++++++++++++++++++++++++
>  migration/postcopy-ram.h |  6 ++++++
>  migration/trace-events   |  1 +
>  3 files changed, 33 insertions(+)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 277ff749a0..67deae7e1c 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -534,6 +534,25 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
>      return 0;
>  }
>  
> +int postcopy_wake_shared(struct PostCopyFD *pcfd,
> +                         uint64_t client_addr,
> +                         RAMBlock *rb)
> +{
> +    size_t pagesize = qemu_ram_pagesize(rb);
> +    struct uffdio_range range;
> +    int ret;
> +    trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb));
> +    range.start = client_addr & ~(pagesize - 1);
> +    range.len = pagesize;
> +    ret = ioctl(pcfd->fd, UFFDIO_WAKE, &range);
> +    if (ret) {
> +        error_report("%s: Failed to wake: %zx in %s (%s)",
> +                     __func__, (size_t)client_addr, qemu_ram_get_idstr(rb),
> +                     strerror(errno));
> +    }
> +    return ret;
> +}
> +
>  /*
>   * Callback from shared fault handlers to ask for a page,
>   * the page must be specified by a RAMBlock and an offset in that rb
> @@ -951,6 +970,13 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
>      return NULL;
>  }
>  
> +int postcopy_wake_shared(struct PostCopyFD *pcfd,
> +                         uint64_t client_addr,
> +                         RAMBlock *rb)
> +{
> +    assert(0);
> +    return -1;
> +}
>  #endif
>  
>  /* ------------------------------------------------------------------------- */
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index 4c63f20df4..2e3dd844d5 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -162,6 +162,12 @@ struct PostCopyFD {
>   */
>  void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
>  void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
> +/* Notify a client ufd that a page is available
> + * Note: The 'client_address' is in the address space of the client
> + * program not QEMU
> + */
> +int postcopy_wake_shared(struct PostCopyFD *pcfd, uint64_t client_addr,
> +                         RAMBlock *rb);
>  /* Callback from shared fault handlers to ask for a page */
>  int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
>                                   uint64_t client_addr, uint64_t offset);
> diff --git a/migration/trace-events b/migration/trace-events
> index 7c910b5479..b0acaaa8a0 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -199,6 +199,7 @@ postcopy_ram_incoming_cleanup_entry(void) ""
>  postcopy_ram_incoming_cleanup_exit(void) ""
>  postcopy_ram_incoming_cleanup_join(void) ""
>  postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
> +postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
>  
>  save_xbzrle_page_skipping(void) ""
>  save_xbzrle_page_overflow(void) ""
> -- 
> 2.14.3
> 

-- 
Peter Xu
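
postcopy_wake_shared rounds the faulting client address down to a page boundary before issuing the UFFDIO_WAKE ioctl. The computation on its own, assuming (as the kernel requires) a power-of-two page size:

```c
#include <assert.h>
#include <stdint.h>

/* Start of the page containing client_addr; this becomes
 * uffdio_range.start, with range.len set to pagesize. */
static uint64_t wake_range_start(uint64_t client_addr, uint64_t pagesize)
{
    return client_addr & ~(pagesize - 1);
}
```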


* Re: [Qemu-devel] [PATCH v3 20/29] postcopy: postcopy_notify_shared_wake
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 20/29] postcopy: postcopy_notify_shared_wake Dr. David Alan Gilbert (git)
@ 2018-03-02  7:51   ` Peter Xu
  2018-03-05 19:55     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-03-02  7:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:16PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Add a hook to allow a client userfaultfd to be 'woken'
> when a page arrives, and a walker that calls that
> hook for relevant clients given a RAMBlock and offset.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/postcopy-ram.c | 16 ++++++++++++++++
>  migration/postcopy-ram.h | 10 ++++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 67deae7e1c..879711968c 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -824,6 +824,22 @@ static int qemu_ufd_copy_ioctl(int userfault_fd, void *host_addr,
>      return ret;
>  }
>  
> +int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset)
> +{
> +    int i;
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +    GArray *pcrfds = mis->postcopy_remote_fds;
> +
> +    for (i = 0; i < pcrfds->len; i++) {
> +        struct PostCopyFD *cur = &g_array_index(pcrfds, struct PostCopyFD, i);
> +        int ret = cur->waker(cur, rb, offset);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +    return 0;
> +}
> +

We should know which FD needs what pages, right?  With that
information, we could notify only the ones that have page faulted on
exactly the same page.  Otherwise we do UFFDIO_WAKE once for each
client when a page is ready, even if the clients have not page faulted
at all?

But for the first version, I think it's fine.  And I believe if we
maintain the faulted addresses we need some way to sync between the
wake thread and fault thread too.  And I have no idea whether
this difference will be any kind of bottleneck at all, since I guess
the network link should still be the postcopy bottleneck, considering
that 10g is mostly what we have now (or even 1g).

Reviewed-by: Peter Xu <peterx@redhat.com>

>  /*
>   * Place a host page (from) at (host) atomically
>   * returns 0 on success
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index 2e3dd844d5..2b71cf958e 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -146,6 +146,10 @@ struct PostCopyFD;
>  
>  /* ufd is a pointer to the struct uffd_msg *TODO: more Portable! */
>  typedef int (*pcfdhandler)(struct PostCopyFD *pcfd, void *ufd);
> +/* Notification to wake, either on place or on reception of
> + * a fault on something that's already arrived (race)
> + */
> +typedef int (*pcfdwake)(struct PostCopyFD *pcfd, RAMBlock *rb, uint64_t offset);
>  
>  struct PostCopyFD {
>      int fd;
> @@ -153,6 +157,8 @@ struct PostCopyFD {
>      void *data;
>      /* Handler to be called whenever we get a poll event */
>      pcfdhandler handler;
> +    /* Notification to wake shared client */
> +    pcfdwake waker;
>      /* A string to use in error messages */
>      const char *idstr;
>  };
> @@ -162,6 +168,10 @@ struct PostCopyFD {
>   */
>  void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
>  void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
> > +/* Call each of the shared 'waker's registered, telling them of
> + * availability of a block.
> + */
> +int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset);
>  /* Notify a client ufd that a page is available
>   * Note: The 'client_address' is in the address space of the client
>   * program not QEMU
> -- 
> 2.14.3
> 

-- 
Peter Xu
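
The walker pattern under discussion, stripped of the GArray and MigrationIncomingState plumbing, amounts to the sketch below. The struct and the demo wakers are simplified stand-ins, for illustration only:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in for struct PostCopyFD: just the waker hook. */
struct pcfd {
    int (*waker)(struct pcfd *pcfd, uint64_t offset);
    int fd;
};

/* Call every registered waker in turn; propagate the first failure,
 * leaving later wakers uncalled, exactly as the quoted loop does. */
static int notify_shared_wake(struct pcfd *fds, size_t n, uint64_t offset)
{
    for (size_t i = 0; i < n; i++) {
        int ret = fds[i].waker(&fds[i], offset);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Demo wakers for exercising the walker. */
static int calls;

static int demo_waker(struct pcfd *pcfd, uint64_t offset)
{
    (void)pcfd; (void)offset;
    calls++;
    return 0;
}

static int demo_fail(struct pcfd *pcfd, uint64_t offset)
{
    (void)pcfd; (void)offset;
    return -1;
}
```

As the review notes, this wakes every client for every placed page; a per-FD record of outstanding faults would let the walker skip clients that never faulted on that page.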


* Re: [Qemu-devel] [PATCH v3 21/29] vhost+postcopy: Add vhost waker
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 21/29] vhost+postcopy: Add vhost waker Dr. David Alan Gilbert (git)
@ 2018-03-02  7:55   ` Peter Xu
  2018-03-05 20:16     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-03-02  7:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:17PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Register a waker function in vhost-user code to be notified when
> pages arrive or requests to previously mapped pages get requested.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/virtio/trace-events |  3 +++
>  hw/virtio/vhost-user.c | 30 ++++++++++++++++++++++++++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 3afd12cfea..fe5e0ff856 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -13,6 +13,9 @@ vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t
>  vhost_user_postcopy_listen(void) ""
>  vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
>  vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
> +vhost_user_postcopy_waker(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
> +vhost_user_postcopy_waker_found(uint64_t client_addr) "0x%"PRIx64
> +vhost_user_postcopy_waker_nomatch(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
>  
>  # hw/virtio/virtio.c
>  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 4589bfd92e..74807091a0 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -990,6 +990,35 @@ static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
>      return -1;
>  }
>  
> +static int vhost_user_postcopy_waker(struct PostCopyFD *pcfd, RAMBlock *rb,
> +                                     uint64_t offset)
> +{
> +    struct vhost_dev *dev = pcfd->data;
> +    struct vhost_user *u = dev->opaque;
> +    int i;
> +
> +    trace_vhost_user_postcopy_waker(qemu_ram_get_idstr(rb), offset);
> +
> +    if (!u) {
> +        return 0;
> +    }
> +    /* Translate the offset into an address in the clients address space */
> +    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
> +        if (u->region_rb[i] == rb &&
> +            offset >= u->region_rb_offset[i] &&
> +            offset < (u->region_rb_offset[i] +
> +                      dev->mem->regions[i].memory_size)) {
> +            uint64_t client_addr = (offset - u->region_rb_offset[i]) +
> +                                   u->postcopy_client_bases[i];
> +            trace_vhost_user_postcopy_waker_found(client_addr);
> +            return postcopy_wake_shared(pcfd, client_addr, rb);
> +        }
> +    }
> +
> +    trace_vhost_user_postcopy_waker_nomatch(qemu_ram_get_idstr(rb), offset);
> +    return 0;

Can we really reach here?

> +}
> +
>  /*
>   * Called at the start of an inbound postcopy on reception of the
>   * 'advise' command.
> @@ -1035,6 +1064,7 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
>      u->postcopy_fd.fd = ufd;
>      u->postcopy_fd.data = dev;
>      u->postcopy_fd.handler = vhost_user_postcopy_fault_handler;
> +    u->postcopy_fd.waker = vhost_user_postcopy_waker;
>      u->postcopy_fd.idstr = "vhost-user"; /* Need to find unique name */
>      postcopy_register_shared_ufd(&u->postcopy_fd);
>      return 0;
> -- 
> 2.14.3
> 

Thanks,

-- 
Peter Xu
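
The translation vhost_user_postcopy_waker performs — from a RAMBlock offset to a client-side virtual address — can be sketched with simplified stand-in types. This is illustrative only; the real loop also checks that the region's RAMBlock (u->region_rb[i]) matches before doing the arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Simplified per-region state, standing in for region_rb_offset[],
 * memory_size and postcopy_client_bases[] in the patch. */
struct uregion {
    uint64_t rb_offset;     /* region start within the RAMBlock */
    uint64_t memory_size;
    uint64_t client_base;   /* client base recorded at SET_MEM_TABLE */
};

/* Returns 0 and sets *client_addr when a region covers the offset,
 * -1 when none does (the "waker_nomatch" trace in the patch). */
static int translate_offset(const struct uregion *r, size_t n,
                            uint64_t offset, uint64_t *client_addr)
{
    for (size_t i = 0; i < n; i++) {
        if (offset >= r[i].rb_offset &&
            offset < r[i].rb_offset + r[i].memory_size) {
            *client_addr = (offset - r[i].rb_offset) + r[i].client_base;
            return 0;
        }
    }
    return -1;
}
```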


* Re: [Qemu-devel] [PATCH v3 22/29] vhost+postcopy: Call wakeups
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 22/29] vhost+postcopy: Call wakeups Dr. David Alan Gilbert (git)
@ 2018-03-02  8:05   ` Peter Xu
  2018-03-06 10:36     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-03-02  8:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Fri, Feb 16, 2018 at 01:16:18PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Cause the vhost-user client to be woken up whenever:
>   a) We place a page in postcopy mode
>   b) We get a fault and the page has already been received
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/postcopy-ram.c | 14 ++++++++++----
>  migration/trace-events   |  1 +
>  2 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 879711968c..13561703b5 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -566,7 +566,11 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
>  
>      trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
>                                         rb_offset);
> -    /* TODO: Check bitmap to see if we already have the page */
> +    if (ramblock_recv_bitmap_test_byte_offset(rb, aligned_rbo)) {
> +        trace_postcopy_request_shared_page_present(pcfd->idstr,
> +                                        qemu_ram_get_idstr(rb), rb_offset);
> +        return postcopy_wake_shared(pcfd, client_addr, rb);
> +    }
>      if (rb != mis->last_rb) {
>          mis->last_rb = rb;
>          migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
> @@ -863,7 +867,8 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>      }
>  
>      trace_postcopy_place_page(host);
> -    return 0;
> +    return postcopy_notify_shared_wake(rb,
> +                                       qemu_ram_block_host_offset(rb, host));
>  }
>  
>  /*
> @@ -887,6 +892,9 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
>  
>              return -e;
>          }
> +        return postcopy_notify_shared_wake(rb,
> +                                           qemu_ram_block_host_offset(rb,
> +                                                                      host));
>      } else {
>          /* The kernel can't use UFFDIO_ZEROPAGE for hugepages */
>          if (!mis->postcopy_tmp_zero_page) {
> @@ -906,8 +914,6 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
>          return postcopy_place_page(mis, host, mis->postcopy_tmp_zero_page,
>                                     rb);
>      }
> -
> -    return 0;
>  }

Could there be race?  E.g.:

              ram_load_thread             page_fault_thread
             -----------------           -------------------

                                          if (recv_bitmap_set())
                                              wake()
             copy_page()
             recv_bitmap_set()
             wake()
                                          request_page()

Then the last requested page may never be serviced?

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 03/29] postcopy: use UFFDIO_ZEROPAGE only when available
  2018-02-28  6:53   ` Peter Xu
@ 2018-03-05 17:23     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-05 17:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:15:59PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Use a flag on the RAMBlock to state whether it has the
> > UFFDIO_ZEROPAGE capability, use it when it's available.
> > 
> > This allows the use of postcopy on tmpfs as well as hugepage
> > backed files.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  exec.c                    | 15 +++++++++++++++
> >  include/exec/cpu-common.h |  3 +++
> >  migration/postcopy-ram.c  | 13 ++++++++++---
> >  3 files changed, 28 insertions(+), 3 deletions(-)
> > 
> > diff --git a/exec.c b/exec.c
> > index 0ec73bc917..1dc15298c2 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -99,6 +99,11 @@ static MemoryRegion io_mem_unassigned;
> >   */
> >  #define RAM_RESIZEABLE (1 << 2)
> >  
> > +/* UFFDIO_ZEROPAGE is available on this RAMBlock to atomically
> > + * zero the page and wake waiting processes.
> > + * (Set during postcopy)
> > + */
> > +#define RAM_UF_ZEROPAGE (1 << 3)
> >  #endif
> >  
> >  #ifdef TARGET_PAGE_BITS_VARY
> > @@ -1767,6 +1772,16 @@ bool qemu_ram_is_shared(RAMBlock *rb)
> >      return rb->flags & RAM_SHARED;
> >  }
> >  
> > +bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
> > +{
> > +    return rb->flags & RAM_UF_ZEROPAGE;
> > +}
> > +
> > +void qemu_ram_set_uf_zeroable(RAMBlock *rb)
> > +{
> > +    rb->flags |= RAM_UF_ZEROPAGE;
> > +}
> > +
> >  /* Called with iothread lock held.  */
> >  void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
> >  {
> > diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> > index 0d861a6289..24d335f95d 100644
> > --- a/include/exec/cpu-common.h
> > +++ b/include/exec/cpu-common.h
> > @@ -73,6 +73,9 @@ void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
> >  void qemu_ram_unset_idstr(RAMBlock *block);
> >  const char *qemu_ram_get_idstr(RAMBlock *rb);
> >  bool qemu_ram_is_shared(RAMBlock *rb);
> > +bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
> > +void qemu_ram_set_uf_zeroable(RAMBlock *rb);
> > +
> >  size_t qemu_ram_pagesize(RAMBlock *block);
> >  size_t qemu_ram_pagesize_largest(void);
> >  
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index bec6c2c66b..6297979700 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -490,6 +490,10 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> >          error_report("%s userfault: Region doesn't support COPY", __func__);
> >          return -1;
> >      }
> > +    if (reg_struct.ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE)) {
> > +        RAMBlock *rb = qemu_ram_block_by_name(block_name);
> > +        qemu_ram_set_uf_zeroable(rb);
> > +    }
> 
> So the zeroable flag is only set after a listening operation of
> postcopy migration.  One thing I am a bit worried about is that if
> someone else wants to use the flag for a RAMBlock he/she may not
> notice this.  Say, qemu_ram_is_uf_zeroable() is not valid if there is
> no such incoming postcopy migration.

Yes, the problem is you don't get to know until you register it with
userfault.

> Maybe worth add a comment in the flag definition about this?

Yes, I've added:
/* Note: Only set at the start of postcopy */

above the qemu_ram_is_uf_zeroable() definition; also see the existing
'(Set during postcopy)' comment above the #define.
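The check under discussion reduces to a flag bit that only comes into
existence once userfault registration has reported ZEROPAGE support
for the range.  A minimal standalone sketch of that pattern (mock
struct and mock bit values, not QEMU's real RAMBlock or the kernel's
real _UFFDIO_ZEROPAGE constant):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock stand-ins: QEMU's real flag lives in exec.c, and the real
 * _UFFDIO_ZEROPAGE bit number comes from linux/userfaultfd.h. */
#define MOCK_RAM_UF_ZEROPAGE (1 << 3)
#define MOCK_UFFDIO_ZEROPAGE 4

typedef struct MockRAMBlock {
    uint32_t flags;
} MockRAMBlock;

static bool mock_is_uf_zeroable(const MockRAMBlock *rb)
{
    return rb->flags & MOCK_RAM_UF_ZEROPAGE;
}

/* Mirrors the check added to ram_block_enable_notify(): only mark the
 * block zeroable if the kernel reported the ioctl for this range. */
static void mock_mark_if_zeroable(MockRAMBlock *rb, uint64_t reported_ioctls)
{
    if (reported_ioctls & ((uint64_t)1 << MOCK_UFFDIO_ZEROPAGE)) {
        rb->flags |= MOCK_RAM_UF_ZEROPAGE;
    }
}
```

Until postcopy registers the block, the predicate simply reports
false, which is exactly the ambiguity the new comment warns about.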


> Not a big deal (considering that I see no potential QEMU user for
> userfaultfd in the short term), so no matter what:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks.

Dave

> 
> >  
> >      return 0;
> >  }
> > @@ -699,11 +703,14 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> >  int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> >                               RAMBlock *rb)
> >  {
> > +    size_t pagesize = qemu_ram_pagesize(rb);
> >      trace_postcopy_place_page_zero(host);
> >  
> > -    if (qemu_ram_pagesize(rb) == getpagesize()) {
> > -        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, getpagesize(),
> > -                                rb)) {
> > +    /* Normal RAMBlocks can zero a page using UFFDIO_ZEROPAGE
> > +     * but it's not available for everything (e.g. hugetlbpages)
> > +     */
> > +    if (qemu_ram_is_uf_zeroable(rb)) {
> > +        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, pagesize, rb)) {
> >              int e = errno;
> >              error_report("%s: %s zero host: %p",
> >                           __func__, strerror(e), host);
> > -- 
> > 2.14.3
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH v3 09/29] postcopy: Allow registering of fd handler
  2018-02-28  8:38   ` Peter Xu
@ 2018-03-05 17:35     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-05 17:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:16:05PM +0000, Dr. David Alan Gilbert (git) wrote:
> 
> [...]
> 
> > diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> > index bee21d4401..4bda5aa509 100644
> > --- a/migration/postcopy-ram.h
> > +++ b/migration/postcopy-ram.h
> > @@ -141,4 +141,25 @@ void postcopy_remove_notifier(NotifierWithReturn *n);
> >  /* Call the notifier list set by postcopy_add_start_notifier */
> >  int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp);
> >  
> > +struct PostCopyFD;
> > +
> > +/* ufd is a pointer to the struct uffd_msg *TODO: more Portable! */
> > +typedef int (*pcfdhandler)(struct PostCopyFD *pcfd, void *ufd);
> > +
> > +struct PostCopyFD {
> > +    int fd;
> > +    /* Data to pass to handler */
> > +    void *data;
> > +    /* Handler to be called whenever we get a poll event */
> > +    pcfdhandler handler;
> > +    /* A string to use in error messages */
> > +    char *idstr;
> 
> This was changed to const char in next patch.  We can move it here?

Oops, yes, done.

> The patch is a big one; there are quite a lot of TODOs, and I still
> think some helper functions could be shared between the fd handling
> for 0-1 and 2-N, but it looks good to me for merging as a first
> version.

Yes, it feels like it should be possible.

> After we have confirmed the definition of PostCopyFD please add:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks.

Dave

> 
> Thanks,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH v3 11/29] vhost+postcopy: Transmit 'listen' to client
  2018-02-28  8:42   ` Peter Xu
@ 2018-03-05 17:42     ` Dr. David Alan Gilbert
  2018-03-06  7:06       ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-05 17:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:16:07PM +0000, Dr. David Alan Gilbert (git) wrote:
> 
> [...]
> 
> >  typedef struct VuVirtqElement {
> > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > index 621543e654..bdec9ec0e8 100644
> > --- a/docs/interop/vhost-user.txt
> > +++ b/docs/interop/vhost-user.txt
> > @@ -682,6 +682,12 @@ Master message types
> >        the slave must open a userfaultfd for later use.
> >        Note that at this stage the migration is still in precopy mode.
> >  
> > + * VHOST_USER_POSTCOPY_LISTEN
> > +      Id: 27
> > +      Master payload: N/A
> > +
> > +      Master advises slave that a transition to postcopy mode has happened.
> 
> Could we add something to explain why this listen needs to be
> broadcast to clients?  Since I failed to work it out quickly
> myself. :(

I've changed this to:

 * VHOST_USER_POSTCOPY_LISTEN
      Id: 29
      Master payload: N/A

      Master advises slave that a transition to postcopy mode has happened.
      The slave must ensure that shared memory is registered with userfaultfd
      to cause faulting of non-present pages.

      This is always sent sometime after a VHOST_USER_POSTCOPY_ADVISE, and
      thus only when VHOST_USER_PROTOCOL_F_PAGEFAULT is supported.
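As a rough sketch of what that obligation looks like on the slave
side (hypothetical helper names; in this series the real work happens
in libvhost-user when it handles the post-listen set_mem_table), each
mapped region is registered with the userfaultfd in MISSING mode so
that touching a non-present page faults into the fault handler:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Build a registration request for one shared-memory region.
 * (Helper name is illustrative, not from the series.) */
static struct uffdio_register make_register_req(uint64_t start, uint64_t len)
{
    struct uffdio_register reg = {
        .range = { .start = start, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    return reg;
}

/* Register the region; on success the kernel fills reg.ioctls with
 * the operations (COPY/ZEROPAGE/...) it supports for this range. */
static int register_with_uffd(int ufd, uint64_t start, uint64_t len)
{
    struct uffdio_register reg = make_register_req(start, len);
    return ioctl(ufd, UFFDIO_REGISTER, &reg);
}
```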

Dave

> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH v3 10/29] vhost+postcopy: Register shared ufd with postcopy
  2018-02-28  8:46   ` Peter Xu
@ 2018-03-05 18:21     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-05 18:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:16:06PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Register the UFD that comes in as the response to the 'advise' method
> > with the postcopy code.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  hw/virtio/vhost-user.c   | 21 ++++++++++++++++++++-
> >  migration/postcopy-ram.h |  2 +-
> >  2 files changed, 21 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index 4f59993baa..dd4eb50668 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -24,6 +24,7 @@
> >  #include <sys/socket.h>
> >  #include <sys/un.h>
> >  #include <linux/vhost.h>
> > +#include <linux/userfaultfd.h>
> 
> Why this line?

Actually, we don't need this until a few patches later in
'Resolve client address' where we pick apart the message
received from the kernel.  I've moved it there.

> >  
> >  #define VHOST_MEMORY_MAX_NREGIONS    8
> >  #define VHOST_USER_F_PROTOCOL_FEATURES 30
> > @@ -155,6 +156,7 @@ struct vhost_user {
> >      CharBackend *chr;
> >      int slave_fd;
> >      NotifierWithReturn postcopy_notifier;
> > +    struct PostCopyFD  postcopy_fd;
> >  };
> >  
> >  static bool ioeventfd_enabled(void)
> > @@ -780,6 +782,17 @@ out:
> >      return ret;
> >  }
> >  
> > +/*
> > + * Called back from the postcopy fault thread when a fault is received on our
> > + * ufd.
> > + * TODO: This is Linux specific
> > + */
> > +static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
> > +                                             void *ufd)
> > +{
> > +    return 0;
> > +}
> > +
> >  /*
> >   * Called at the start of an inbound postcopy on reception of the
> >   * 'advise' command.
> > @@ -819,8 +832,14 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
> >          error_setg(errp, "%s: Failed to get ufd", __func__);
> >          return -1;
> >      }
> > +    fcntl(ufd, F_SETFL, O_NONBLOCK);
> 
> Only curious: would it work even without this line?

Probably; it's used in a poll() anyway, so it should be fine.
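The poll-then-read pattern in question can be demonstrated on an
ordinary pipe, since creating a real userfaultfd needs kernel
support; O_NONBLOCK is just a belt-and-braces guarantee that the read
can never hang if nothing turns out to be ready after all:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Poll-then-read loop in the style of the postcopy fault thread,
 * demonstrated on a pipe rather than a userfaultfd.  Returns the
 * number of bytes read, or 0 if nothing was ready. */
static int read_one_event(int fd, char *buf, size_t len)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    if (poll(&pfd, 1, 0) <= 0 || !(pfd.revents & POLLIN)) {
        return 0;                 /* nothing ready; don't touch the fd */
    }
    /* With O_NONBLOCK set, a racing consumer means EAGAIN, not a hang */
    ssize_t n = read(fd, buf, len);
    return n > 0 ? (int)n : 0;
}
```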

> >  
> > -    /* TODO: register ufd with userfault thread */
> > +    /* register ufd with userfault thread */
> > +    u->postcopy_fd.fd = ufd;
> > +    u->postcopy_fd.data = dev;
> > +    u->postcopy_fd.handler = vhost_user_postcopy_fault_handler;
> > +    u->postcopy_fd.idstr = "vhost-user"; /* Need to find unique name */
> > +    postcopy_register_shared_ufd(&u->postcopy_fd);
> >      return 0;
> >  }
> >  
> > diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> > index 4bda5aa509..23efbdf346 100644
> > --- a/migration/postcopy-ram.h
> > +++ b/migration/postcopy-ram.h
> > @@ -153,7 +153,7 @@ struct PostCopyFD {
> >      /* Handler to be called whenever we get a poll event */
> >      pcfdhandler handler;
> >      /* A string to use in error messages */
> > -    char *idstr;
> > +    const char *idstr;
> 
> Move to previous patch?

Done.

Dave

> >  };
> >  
> >  /* Register a userfaultfd owned by an external process for
> > -- 
> > 2.14.3
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH v3 12/29] postcopy+vhost-user: Split set_mem_table for postcopy
  2018-02-28  8:49   ` Peter Xu
@ 2018-03-05 18:45     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-05 18:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:16:08PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Split the set_mem_table routines in both qemu and libvhost-user
> > because the postcopy versions are going to be quite different
> > once changes in the later patches are added.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 53 ++++++++++++++++++++++++
> >  hw/virtio/vhost-user.c                | 77 ++++++++++++++++++++++++++++++++++-
> >  2 files changed, 128 insertions(+), 2 deletions(-)
> > 
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index beec7695a8..4922b2c722 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -448,6 +448,55 @@ vu_reset_device_exec(VuDev *dev, VhostUserMsg *vmsg)
> >      return false;
> >  }
> >  
> > +static bool
> > +vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
> > +{
> > +    int i;
> > +    VhostUserMemory *memory = &vmsg->payload.memory;
> > +    dev->nregions = memory->nregions;
> > +    /* TODO: Postcopy specific code */
> > +    DPRINT("Nregions: %d\n", memory->nregions);
> > +    for (i = 0; i < dev->nregions; i++) {
> > +        void *mmap_addr;
> > +        VhostUserMemoryRegion *msg_region = &memory->regions[i];
> > +        VuDevRegion *dev_region = &dev->regions[i];
> > +
> > +        DPRINT("Region %d\n", i);
> > +        DPRINT("    guest_phys_addr: 0x%016"PRIx64"\n",
> > +               msg_region->guest_phys_addr);
> > +        DPRINT("    memory_size:     0x%016"PRIx64"\n",
> > +               msg_region->memory_size);
> > +        DPRINT("    userspace_addr   0x%016"PRIx64"\n",
> > +               msg_region->userspace_addr);
> > +        DPRINT("    mmap_offset      0x%016"PRIx64"\n",
> > +               msg_region->mmap_offset);
> > +
> > +        dev_region->gpa = msg_region->guest_phys_addr;
> > +        dev_region->size = msg_region->memory_size;
> > +        dev_region->qva = msg_region->userspace_addr;
> > +        dev_region->mmap_offset = msg_region->mmap_offset;
> > +
> > +        /* We don't use offset argument of mmap() since the
> > +         * mapped address has to be page aligned, and we use huge
> > +         * pages.  */
> > +        mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
> > +                         PROT_READ | PROT_WRITE, MAP_SHARED,
> > +                         vmsg->fds[i], 0);
> > +
> > +        if (mmap_addr == MAP_FAILED) {
> > +            vu_panic(dev, "region mmap error: %s", strerror(errno));
> > +        } else {
> > +            dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
> > +            DPRINT("    mmap_addr:       0x%016"PRIx64"\n",
> > +                   dev_region->mmap_addr);
> > +        }
> > +
> > +        close(vmsg->fds[i]);
> > +    }
> > +
> > +    return false;
> > +}
> > +
> >  static bool
> >  vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> >  {
> > @@ -464,6 +513,10 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> >      }
> >      dev->nregions = memory->nregions;
> >  
> > +    if (dev->postcopy_listening) {
> > +        return vu_set_mem_table_exec_postcopy(dev, vmsg);
> > +    }
> > +
> >      DPRINT("Nregions: %d\n", memory->nregions);
> >      for (i = 0; i < dev->nregions; i++) {
> >          void *mmap_addr;
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index ec6a4a82fd..64f4b3b3f9 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -325,15 +325,86 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
> >      return 0;
> >  }
> >  
> > +static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
> > +                                             struct vhost_memory *mem)
> > +{
> > +    int fds[VHOST_MEMORY_MAX_NREGIONS];
> > +    int i, fd;
> > +    size_t fd_num = 0;
> > +    bool reply_supported = virtio_has_feature(dev->protocol_features,
> > +                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > +    /* TODO: Add actual postcopy differences */
> > +    VhostUserMsg msg = {
> > +        .hdr.request = VHOST_USER_SET_MEM_TABLE,
> > +        .hdr.flags = VHOST_USER_VERSION,
> > +    };
> > +
> > +    if (reply_supported) {
> > +        msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
> > +    }
> > +
> > +    for (i = 0; i < dev->mem->nregions; ++i) {
> > +        struct vhost_memory_region *reg = dev->mem->regions + i;
> > +        ram_addr_t offset;
> > +        MemoryRegion *mr;
> > +
> > +        assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
> > +        mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
> > +                                     &offset);
> > +        fd = memory_region_get_fd(mr);
> > +        if (fd > 0) {
> > +            msg.payload.memory.regions[fd_num].userspace_addr =
> > +                reg->userspace_addr;
> > +            msg.payload.memory.regions[fd_num].memory_size  = reg->memory_size;
> > +            msg.payload.memory.regions[fd_num].guest_phys_addr =
> > +                reg->guest_phys_addr;
> > +            msg.payload.memory.regions[fd_num].mmap_offset = offset;
> > +            assert(fd_num < VHOST_MEMORY_MAX_NREGIONS);
> > +            fds[fd_num++] = fd;
> > +        }
> > +    }
> > +
> > +    msg.payload.memory.nregions = fd_num;
> > +
> > +    if (!fd_num) {
> > +        error_report("Failed initializing vhost-user memory map, "
> > +                     "consider using -object memory-backend-file share=on");
> > +        return -1;
> > +    }
> > +
> > +    msg.hdr.size = sizeof(msg.payload.memory.nregions);
> > +    msg.hdr.size += sizeof(msg.payload.memory.padding);
> > +    msg.hdr.size += fd_num * sizeof(VhostUserMemoryRegion);
> > +
> > +    if (vhost_user_write(dev, &msg, fds, fd_num) < 0) {
> > +        return -1;
> > +    }
> > +
> > +    if (reply_supported) {
> > +        return process_message_reply(dev, &msg);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  static int vhost_user_set_mem_table(struct vhost_dev *dev,
> >                                      struct vhost_memory *mem)
> >  {
> > +    struct vhost_user *u = dev->opaque;
> >      int fds[VHOST_MEMORY_MAX_NREGIONS];
> >      int i, fd;
> >      size_t fd_num = 0;
> > +    bool do_postcopy = u->postcopy_listen && u->postcopy_fd.handler;
> >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> >                                                VHOST_USER_PROTOCOL_F_REPLY_ACK);
> >  
> > +    if (do_postcopy) {
> > +        /* Postcopy has enough differences that it's best done in it's own
> > +         * version
> > +         */
> > +        return vhost_user_set_mem_table_postcopy(dev, mem);
> > +    }
> > +
> >      VhostUserMsg msg = {
> >          .hdr.request = VHOST_USER_SET_MEM_TABLE,
> >          .hdr.flags = VHOST_USER_VERSION,
> > @@ -357,9 +428,11 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> >                  error_report("Failed preparing vhost-user memory table msg");
> >                  return -1;
> >              }
> > -            msg.payload.memory.regions[fd_num].userspace_addr = reg->userspace_addr;
> > +            msg.payload.memory.regions[fd_num].userspace_addr =
> > +                reg->userspace_addr;
> >              msg.payload.memory.regions[fd_num].memory_size  = reg->memory_size;
> > -            msg.payload.memory.regions[fd_num].guest_phys_addr = reg->guest_phys_addr;
> > +            msg.payload.memory.regions[fd_num].guest_phys_addr =
> > +                reg->guest_phys_addr;
> 
> These newline changes might be avoided?

They could, but those lines are over 80 characters long, so while I was
taking a copy of the code I fixed the style on the copy to keep it
consistent.

> So after this patch there's no functional change, only the code
> split of the set_mem_table operation, right?

Right; the changes to the postcopy version come later.

Dave

> Thanks,
> 
> >              msg.payload.memory.regions[fd_num].mmap_offset = offset;
> >              fds[fd_num++] = fd;
> >          }
> > -- 
> > 2.14.3
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH v3 17/29] vhost+postcopy: Send requests to source for shared pages
  2018-02-28 10:03   ` Peter Xu
@ 2018-03-05 18:55     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-05 18:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:16:13PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Send requests back to the source for shared page requests.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  migration/migration.h    |  2 ++
> >  migration/postcopy-ram.c | 31 ++++++++++++++++++++++++++++---
> >  migration/postcopy-ram.h |  3 +++
> >  migration/trace-events   |  2 ++
> >  4 files changed, 35 insertions(+), 3 deletions(-)
> > 
> > diff --git a/migration/migration.h b/migration/migration.h
> > index d158e62cf2..457bf37ec2 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -46,6 +46,8 @@ struct MigrationIncomingState {
> >      int       userfault_quit_fd;
> >      QEMUFile *to_src_file;
> >      QemuMutex rp_mutex;    /* We send replies from multiple threads */
> > +    /* RAMBlock of last request sent to source */
> > +    RAMBlock *last_rb;
> >      void     *postcopy_tmp_page;
> >      void     *postcopy_tmp_zero_page;
> >      /* PostCopyFD's for external userfaultfds & handlers of shared memory */
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index d118b78bf5..277ff749a0 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -534,6 +534,31 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> >      return 0;
> >  }
> >  
> > +/*
> > + * Callback from shared fault handlers to ask for a page,
> > + * the page must be specified by a RAMBlock and an offset in that rb
> > + */
> > +int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
> > +                                 uint64_t client_addr, uint64_t rb_offset)
> > +{
> > +    size_t pagesize = qemu_ram_pagesize(rb);
> > +    uint64_t aligned_rbo = rb_offset & ~(pagesize - 1);
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +
> > +    trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
> > +                                       rb_offset);
> > +    /* TODO: Check bitmap to see if we already have the page */
> > +    if (rb != mis->last_rb) {
> > +        mis->last_rb = rb;
> > +        migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
> > +                                  aligned_rbo, pagesize);
> > +    } else {
> > +        /* Save some space */
> > +        migrate_send_rp_req_pages(mis, NULL, aligned_rbo, pagesize);
> > +    }
> > +    return 0;
> > +}
> > +
> 
> So IIUC this can only be called within the page fault thread or there
> can be race.  Is there a way to guarantee this?  Or do we need a
> comment for that?

I don't think there's a way to guarantee it, especially since it has
to be called by the device-specific shared handlers in another file.
I've updated the comment to:

/*
 * Callback from shared fault handlers to ask for a page,
 * the page must be specified by a RAMBlock and an offset in that rb
 * Note: Only for use by shared fault handlers (in fault thread)
 */
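The reason the fault-thread restriction matters is the last_rb cache:
the block name is elided from a request when it matches the previous
request, so two concurrent senders would corrupt the stream.  A stub
model of that optimisation (simplified types, not QEMU's real
structures):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One outgoing page request; block_name == NULL means "same block as
 * the previous request", which is the "save some space" case. */
typedef struct Req {
    const char *block_name;
    uint64_t offset;
} Req;

/* Build a request, aligning the offset down to the block's page size
 * and eliding the name when it matches the cached last block.  Only
 * safe if all callers run in one thread (the fault thread). */
static Req make_request(const char **last_rb, const char *rb,
                        uint64_t rb_offset, uint64_t pagesize)
{
    Req r = { .block_name = NULL, .offset = rb_offset & ~(pagesize - 1) };

    if (*last_rb == NULL || strcmp(*last_rb, rb) != 0) {
        *last_rb = rb;
        r.block_name = rb;        /* block changed: name must be sent */
    }
    return r;
}
```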

Dave

> >  /*
> >   * Handle faults detected by the USERFAULT markings
> >   */
> > @@ -544,9 +569,9 @@ static void *postcopy_ram_fault_thread(void *opaque)
> >      int ret;
> >      size_t index;
> >      RAMBlock *rb = NULL;
> > -    RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
> >  
> >      trace_postcopy_ram_fault_thread_entry();
> > +    mis->last_rb = NULL; /* last RAMBlock we sent part of */
> >      qemu_sem_post(&mis->fault_thread_sem);
> >  
> >      struct pollfd *pfd;
> > @@ -634,8 +659,8 @@ static void *postcopy_ram_fault_thread(void *opaque)
> >               * Send the request to the source - we want to request one
> >               * of our host page sizes (which is >= TPS)
> >               */
> > -            if (rb != last_rb) {
> > -                last_rb = rb;
> > +            if (rb != mis->last_rb) {
> > +                mis->last_rb = rb;
> >                  migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
> >                                           rb_offset, qemu_ram_pagesize(rb));
> >              } else {
> > diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> > index dbc2ee1f2b..4c63f20df4 100644
> > --- a/migration/postcopy-ram.h
> > +++ b/migration/postcopy-ram.h
> > @@ -162,5 +162,8 @@ struct PostCopyFD {
> >   */
> >  void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
> >  void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
> > +/* Callback from shared fault handlers to ask for a page */
> > +int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
> > +                                 uint64_t client_addr, uint64_t offset);
> >  
> >  #endif
> > diff --git a/migration/trace-events b/migration/trace-events
> > index 1e617ad7a6..7c910b5479 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -198,6 +198,8 @@ postcopy_ram_incoming_cleanup_closeuf(void) ""
> >  postcopy_ram_incoming_cleanup_entry(void) ""
> >  postcopy_ram_incoming_cleanup_exit(void) ""
> >  postcopy_ram_incoming_cleanup_join(void) ""
> > +postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
> > +
> >  save_xbzrle_page_skipping(void) ""
> >  save_xbzrle_page_overflow(void) ""
> >  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> > -- 
> > 2.14.3
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH v3 19/29] postcopy: wake shared
  2018-03-02  7:44   ` Peter Xu
@ 2018-03-05 19:35     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-05 19:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:16:15PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Send a 'wake' request on a userfaultfd for a shared process.
> > The address in the client's address space is specified together
> > with the RAMBlock it was resolved to.
> 
> I think it's "providing a helper to send WAKE to uffd" rather than
> really sending it.
> 
> Otherwise it looks good to me.  Thanks,

Reworded to:

postcopy: helper for waking shared

Provide a helper to send a 'wake' request on a userfaultfd for
a shared process.  
The address in the client's address space is specified together
with the RAMBlock it was resolved to.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Dave

> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  migration/postcopy-ram.c | 26 ++++++++++++++++++++++++++
> >  migration/postcopy-ram.h |  6 ++++++
> >  migration/trace-events   |  1 +
> >  3 files changed, 33 insertions(+)
> > 
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 277ff749a0..67deae7e1c 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -534,6 +534,25 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> >      return 0;
> >  }
> >  
> > +int postcopy_wake_shared(struct PostCopyFD *pcfd,
> > +                         uint64_t client_addr,
> > +                         RAMBlock *rb)
> > +{
> > +    size_t pagesize = qemu_ram_pagesize(rb);
> > +    struct uffdio_range range;
> > +    int ret;
> > +    trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb));
> > +    range.start = client_addr & ~(pagesize - 1);
> > +    range.len = pagesize;
> > +    ret = ioctl(pcfd->fd, UFFDIO_WAKE, &range);
> > +    if (ret) {
> > +        error_report("%s: Failed to wake: %zx in %s (%s)",
> > +                     __func__, (size_t)client_addr, qemu_ram_get_idstr(rb),
> > +                     strerror(errno));
> > +    }
> > +    return ret;
> > +}
> > +
> >  /*
> >   * Callback from shared fault handlers to ask for a page,
> >   * the page must be specified by a RAMBlock and an offset in that rb
> > @@ -951,6 +970,13 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
> >      return NULL;
> >  }
> >  
> > +int postcopy_wake_shared(struct PostCopyFD *pcfd,
> > +                         uint64_t client_addr,
> > +                         RAMBlock *rb)
> > +{
> > +    assert(0);
> > +    return -1;
> > +}
> >  #endif
> >  
> >  /* ------------------------------------------------------------------------- */
> > diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> > index 4c63f20df4..2e3dd844d5 100644
> > --- a/migration/postcopy-ram.h
> > +++ b/migration/postcopy-ram.h
> > @@ -162,6 +162,12 @@ struct PostCopyFD {
> >   */
> >  void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
> >  void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
> > +/* Notify a client ufd that a page is available
> > + * Note: The 'client_address' is in the address space of the client
> > + * program not QEMU
> > + */
> > +int postcopy_wake_shared(struct PostCopyFD *pcfd, uint64_t client_addr,
> > +                         RAMBlock *rb);
> >  /* Callback from shared fault handlers to ask for a page */
> >  int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
> >                                   uint64_t client_addr, uint64_t offset);
> > diff --git a/migration/trace-events b/migration/trace-events
> > index 7c910b5479..b0acaaa8a0 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -199,6 +199,7 @@ postcopy_ram_incoming_cleanup_entry(void) ""
> >  postcopy_ram_incoming_cleanup_exit(void) ""
> >  postcopy_ram_incoming_cleanup_join(void) ""
> >  postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
> > +postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
> >  
> >  save_xbzrle_page_skipping(void) ""
> >  save_xbzrle_page_overflow(void) ""
> > -- 
> > 2.14.3
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH v3 20/29] postcopy: postcopy_notify_shared_wake
  2018-03-02  7:51   ` Peter Xu
@ 2018-03-05 19:55     ` Dr. David Alan Gilbert
  2018-03-06  3:37       ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-05 19:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:16:16PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Add a hook to allow a client userfaultfd to be 'woken'
> > when a page arrives, and a walker that calls that
> > hook for relevant clients given a RAMBlock and offset.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  migration/postcopy-ram.c | 16 ++++++++++++++++
> >  migration/postcopy-ram.h | 10 ++++++++++
> >  2 files changed, 26 insertions(+)
> > 
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 67deae7e1c..879711968c 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -824,6 +824,22 @@ static int qemu_ufd_copy_ioctl(int userfault_fd, void *host_addr,
> >      return ret;
> >  }
> >  
> > +int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset)
> > +{
> > +    int i;
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +    GArray *pcrfds = mis->postcopy_remote_fds;
> > +
> > +    for (i = 0; i < pcrfds->len; i++) {
> > +        struct PostCopyFD *cur = &g_array_index(pcrfds, struct PostCopyFD, i);
> > +        int ret = cur->waker(cur, rb, offset);
> > +        if (ret) {
> > +            return ret;
> > +        }
> > +    }
> > +    return 0;
> > +}
> > +
> 
> We should know which FD needs what pages, right?  With that
> information, can we notify only the ones that have page faulted on
> exactly the same page?  Otherwise we do UFFDIO_WAKE once for each
> client when a page is ready, even if the clients have not page faulted
> at all?

The 'waker' function we call knows that; we don't.  See
'vhost_user_postcopy_waker' in the next patch: it works out whether
the address the waker is called for is one it's responsible for.
Also note that a shared page might be shared between multiple other
programs - not just one.  In our case that could be two vhost-user
devices wired to two separate processes.
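Concretely, here's a minimal sketch of that responsibility check (simplified, made-up types and values - not the actual QEMU structures; the real translation in vhost_user_postcopy_waker walks the vhost region table):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified record of one region a client mapped. */
struct region {
    const char *rb_idstr;    /* idstr of the RAMBlock this region maps */
    uint64_t rb_offset;      /* offset of the region start in the block */
    uint64_t size;
    uint64_t client_base;    /* where the client mapped the region */
};

/* Return the client-side address for (rb, offset), or 0 if this
 * region isn't responsible for the page and the caller moves on. */
static uint64_t waker_translate(const struct region *r,
                                const char *rb, uint64_t offset)
{
    if (strcmp(r->rb_idstr, rb) == 0 &&
        offset >= r->rb_offset &&
        offset < r->rb_offset + r->size) {
        return (offset - r->rb_offset) + r->client_base;
    }
    return 0;
}
```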

> But for the first version, I think it's fine.  And I believe if we
> maintain the faulted addresses we need some way to sync between the
> wake thread and fault thread too.

Hmm can you explain that a bit more?

> And I totally have no idea on how
> this difference will be any kind of bottle neck at all, since I guess
> the network link should still be the postcopy bottleneck considering
> that 10g is mostly what we have now (or even, 1g).
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks.

Dave

> 
> >  /*
> >   * Place a host page (from) at (host) atomically
> >   * returns 0 on success
> > diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> > index 2e3dd844d5..2b71cf958e 100644
> > --- a/migration/postcopy-ram.h
> > +++ b/migration/postcopy-ram.h
> > @@ -146,6 +146,10 @@ struct PostCopyFD;
> >  
> >  /* ufd is a pointer to the struct uffd_msg *TODO: more Portable! */
> >  typedef int (*pcfdhandler)(struct PostCopyFD *pcfd, void *ufd);
> > +/* Notification to wake, either on place or on reception of
> > + * a fault on something that's already arrived (race)
> > + */
> > +typedef int (*pcfdwake)(struct PostCopyFD *pcfd, RAMBlock *rb, uint64_t offset);
> >  
> >  struct PostCopyFD {
> >      int fd;
> > @@ -153,6 +157,8 @@ struct PostCopyFD {
> >      void *data;
> >      /* Handler to be called whenever we get a poll event */
> >      pcfdhandler handler;
> > +    /* Notification to wake shared client */
> > +    pcfdwake waker;
> >      /* A string to use in error messages */
> >      const char *idstr;
> >  };
> > @@ -162,6 +168,10 @@ struct PostCopyFD {
> >   */
> >  void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
> >  void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
> > +/* Call each of the shared 'waker's registerd telling them of
> > + * availability of a block.
> > + */
> > +int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset);
> >  /* Notify a client ufd that a page is available
> >   * Note: The 'client_address' is in the address space of the client
> >   * program not QEMU
> > -- 
> > 2.14.3
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH v3 21/29] vhost+postcopy: Add vhost waker
  2018-03-02  7:55   ` Peter Xu
@ 2018-03-05 20:16     ` Dr. David Alan Gilbert
  2018-03-06  7:19       ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-05 20:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:16:17PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Register a waker function in vhost-user code to be notified when
> > pages arrive or requests to previously mapped pages get requested.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  hw/virtio/trace-events |  3 +++
> >  hw/virtio/vhost-user.c | 30 ++++++++++++++++++++++++++++++
> >  2 files changed, 33 insertions(+)
> > 
> > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > index 3afd12cfea..fe5e0ff856 100644
> > --- a/hw/virtio/trace-events
> > +++ b/hw/virtio/trace-events
> > @@ -13,6 +13,9 @@ vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t
> >  vhost_user_postcopy_listen(void) ""
> >  vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> >  vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
> > +vhost_user_postcopy_waker(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
> > +vhost_user_postcopy_waker_found(uint64_t client_addr) "0x%"PRIx64
> > +vhost_user_postcopy_waker_nomatch(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
> >  
> >  # hw/virtio/virtio.c
> >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index 4589bfd92e..74807091a0 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -990,6 +990,35 @@ static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
> >      return -1;
> >  }
> >  
> > +static int vhost_user_postcopy_waker(struct PostCopyFD *pcfd, RAMBlock *rb,
> > +                                     uint64_t offset)
> > +{
> > +    struct vhost_dev *dev = pcfd->data;
> > +    struct vhost_user *u = dev->opaque;
> > +    int i;
> > +
> > +    trace_vhost_user_postcopy_waker(qemu_ram_get_idstr(rb), offset);
> > +
> > +    if (!u) {
> > +        return 0;
> > +    }
> > +    /* Translate the offset into an address in the clients address space */
> > +    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
> > +        if (u->region_rb[i] == rb &&
> > +            offset >= u->region_rb_offset[i] &&
> > +            offset < (u->region_rb_offset[i] +
> > +                      dev->mem->regions[i].memory_size)) {
> > +            uint64_t client_addr = (offset - u->region_rb_offset[i]) +
> > +                                   u->postcopy_client_bases[i];
> > +            trace_vhost_user_postcopy_waker_found(client_addr);
> > +            return postcopy_wake_shared(pcfd, client_addr, rb);
> > +        }
> > +    }
> > +
> > +    trace_vhost_user_postcopy_waker_nomatch(qemu_ram_get_idstr(rb), offset);
> > +    return 0;
> 
> Can we really reach here?

Yes; note that all the wakers registered get called for all pages
received, so that:
  a) A page not in shared memory, or not actually registered with a
device, still calls the wakers, and it's up to each waker to figure out
whether it's interested on behalf of the device it belongs to.

  b) With two devices registered, they might each have registered
different areas of shared memory, and thus it's up to the waker to figure
out whether it's interested in this specific page.
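That dispatch can be modelled like this (a toy, single-process model with invented types - the real loop is postcopy_notify_shared_wake in the quoted patch): two clients register disjoint regions, each waker returns 0 for pages outside its own region, and the walk continues through all of them:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for PostCopyFD plus its waker hook. */
struct fd_state {
    uint64_t start, size;   /* region this client registered */
    int woken;              /* did we issue a (mock) wake? */
};

static int waker(struct fd_state *s, uint64_t offset)
{
    if (offset < s->start || offset >= s->start + s->size) {
        return 0;           /* not ours: no-op, walk continues */
    }
    s->woken = 1;           /* would call postcopy_wake_shared() */
    return 0;
}

/* Model of the notify walk: every registered waker gets called. */
static int notify_all(struct fd_state *fds, int n, uint64_t offset)
{
    for (int i = 0; i < n; i++) {
        int ret = waker(&fds[i], offset);
        if (ret) {
            return ret;
        }
    }
    return 0;
}
```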

Dave

> > +}
> > +
> >  /*
> >   * Called at the start of an inbound postcopy on reception of the
> >   * 'advise' command.
> > @@ -1035,6 +1064,7 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
> >      u->postcopy_fd.fd = ufd;
> >      u->postcopy_fd.data = dev;
> >      u->postcopy_fd.handler = vhost_user_postcopy_fault_handler;
> > +    u->postcopy_fd.waker = vhost_user_postcopy_waker;
> >      u->postcopy_fd.idstr = "vhost-user"; /* Need to find unique name */
> >      postcopy_register_shared_ufd(&u->postcopy_fd);
> >      return 0;
> > -- 
> > 2.14.3
> > 
> 
> Thanks,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v3 20/29] postcopy: postcopy_notify_shared_wake
  2018-03-05 19:55     ` Dr. David Alan Gilbert
@ 2018-03-06  3:37       ` Peter Xu
  2018-03-06 10:54         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-03-06  3:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Mon, Mar 05, 2018 at 07:55:13PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Fri, Feb 16, 2018 at 01:16:16PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Add a hook to allow a client userfaultfd to be 'woken'
> > > when a page arrives, and a walker that calls that
> > > hook for relevant clients given a RAMBlock and offset.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  migration/postcopy-ram.c | 16 ++++++++++++++++
> > >  migration/postcopy-ram.h | 10 ++++++++++
> > >  2 files changed, 26 insertions(+)
> > > 
> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > > index 67deae7e1c..879711968c 100644
> > > --- a/migration/postcopy-ram.c
> > > +++ b/migration/postcopy-ram.c
> > > @@ -824,6 +824,22 @@ static int qemu_ufd_copy_ioctl(int userfault_fd, void *host_addr,
> > >      return ret;
> > >  }
> > >  
> > > +int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset)
> > > +{
> > > +    int i;
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > +    GArray *pcrfds = mis->postcopy_remote_fds;
> > > +
> > > +    for (i = 0; i < pcrfds->len; i++) {
> > > +        struct PostCopyFD *cur = &g_array_index(pcrfds, struct PostCopyFD, i);
> > > +        int ret = cur->waker(cur, rb, offset);
> > > +        if (ret) {
> > > +            return ret;
> > > +        }
> > > +    }
> > > +    return 0;
> > > +}
> > > +
> > 
> > We should know that which FD needs what pages, right?  If with that
> > information, we can only notify the ones who have page faulted on
> > exactly the same page?  Otherwise we do UFFDIO_WAKE once for each
> > client when a page is ready, even if the clients have not page faulted
> > at all?
> 
> The 'waker' function we call knows that, we don't; see the
> 'vhost_user_postcopy_waker' in the next patch, and it hunts down whether
> the address the waker is called for is one it's responsible for.

For vhost-user devices, they should always be responsible for almost
all RAM exported to the guest?  If so, they will always be notified to
wake up whenever a page is copied?

Here I was thinking not only about responsible ranges - it was about
whether each PostcopyFD could note down the faulted addresses that
are waiting to be serviced.  Then when we do the wake up, we could
possibly skip notifying a PostcopyFD when the copied page does not
cover any of the faulted addresses on that PostcopyFD?

> Also note that a shared page might be shared between multiple other
> programs - not just one.  In our case that could be two vhost-user
> devices wired to two separate processes.

Yeah, but the idea still stands IMHO - we can notify only those
PostcopyFDs that have already faulted on the page and skip the rest.
For sure there can be more than one candidate for the wakeup, since
multiple PostcopyFDs may have captured a page fault on the same
page (or even the same address).

> 
> > But for the first version, I think it's fine.  And I believe if we
> > maintain the faulted addresses we need some way to sync between the
> > wake thread and fault thread too.
> 
> Hmm can you explain that a bit more?

Basically the above was what I thought - record the faulted addresses
against the specific PostcopyFD when the page fault happens; then we
know which page(s) a PostcopyFD needs.  But with that, we'll
possibly need a lock to protect the information (or some other sync
method).
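A sketch of what that bookkeeping might look like (purely hypothetical - this is the suggested alternative, not anything in the series; names and the fixed-size list are made up): the fault thread records each faulted address under a lock, and the wake path only reports a hit for an FD whose pending list contains the copied page:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define MAX_PENDING 16

/* Hypothetical per-FD record of outstanding faults. */
struct fd_faults {
    pthread_mutex_t lock;
    uint64_t pending[MAX_PENDING];
    int npending;
};

/* Fault thread: remember the address before forwarding the request. */
static void record_fault(struct fd_faults *f, uint64_t addr)
{
    pthread_mutex_lock(&f->lock);
    if (f->npending < MAX_PENDING) {
        f->pending[f->npending++] = addr;
    }
    pthread_mutex_unlock(&f->lock);
}

/* Wake path: return 1 (and clear the entry) only if this FD actually
 * faulted on the page, so uninterested FDs can be skipped. */
static int should_wake(struct fd_faults *f, uint64_t addr)
{
    int hit = 0;
    pthread_mutex_lock(&f->lock);
    for (int i = 0; i < f->npending; i++) {
        if (f->pending[i] == addr) {
            f->pending[i] = f->pending[--f->npending];
            hit = 1;
            break;
        }
    }
    pthread_mutex_unlock(&f->lock);
    return hit;
}
```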

(Hope I didn't miss anything important along the way)

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 11/29] vhost+postcopy: Transmit 'listen' to client
  2018-03-05 17:42     ` Dr. David Alan Gilbert
@ 2018-03-06  7:06       ` Peter Xu
  2018-03-06 11:20         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2018-03-06  7:06 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Mon, Mar 05, 2018 at 05:42:42PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Fri, Feb 16, 2018 at 01:16:07PM +0000, Dr. David Alan Gilbert (git) wrote:
> > 
> > [...]
> > 
> > >  typedef struct VuVirtqElement {
> > > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > > index 621543e654..bdec9ec0e8 100644
> > > --- a/docs/interop/vhost-user.txt
> > > +++ b/docs/interop/vhost-user.txt
> > > @@ -682,6 +682,12 @@ Master message types
> > >        the slave must open a userfaultfd for later use.
> > >        Note that at this stage the migration is still in precopy mode.
> > >  
> > > + * VHOST_USER_POSTCOPY_LISTEN
> > > +      Id: 27
> > > +      Master payload: N/A
> > > +
> > > +      Master advises slave that a transition to postcopy mode has happened.
> > 
> > Could we add something to explain why this listen needs to be
> > broadcasted to clients?  Since I failed to find it out quickly
> > myself. :(
> 
> I've changed this to:
> 
>  * VHOST_USER_POSTCOPY_LISTEN
>       Id: 29
>       Master payload: N/A
> 
>       Master advises slave that a transition to postcopy mode has happened.
>       The slave must ensure that shared memory is registered with userfaultfd
>       to cause faulting of non-present pages.

But shouldn't this be assured by the SET_MEM_TABLE call?

Sorry for not being that familiar with the vhost-user protocol... but
what's the correct order of these commands?

  POSTCOPY_ADVISE
  POSTCOPY_LISTEN
  SET_MEM_TABLE

?  Thanks,

> 
>       This is always sent sometime after a VHOST_USER_POSTCOPY_ADVISE, and
>       thus only when VHOST_USER_PROTOCOL_F_PAGEFAULT is supported.
> 
> Dave
> 
> > -- 
> > Peter Xu
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 21/29] vhost+postcopy: Add vhost waker
  2018-03-05 20:16     ` Dr. David Alan Gilbert
@ 2018-03-06  7:19       ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2018-03-06  7:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Mon, Mar 05, 2018 at 08:16:44PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Fri, Feb 16, 2018 at 01:16:17PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Register a waker function in vhost-user code to be notified when
> > > pages arrive or requests to previously mapped pages get requested.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  hw/virtio/trace-events |  3 +++
> > >  hw/virtio/vhost-user.c | 30 ++++++++++++++++++++++++++++++
> > >  2 files changed, 33 insertions(+)
> > > 
> > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > index 3afd12cfea..fe5e0ff856 100644
> > > --- a/hw/virtio/trace-events
> > > +++ b/hw/virtio/trace-events
> > > @@ -13,6 +13,9 @@ vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t
> > >  vhost_user_postcopy_listen(void) ""
> > >  vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> > >  vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
> > > +vhost_user_postcopy_waker(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
> > > +vhost_user_postcopy_waker_found(uint64_t client_addr) "0x%"PRIx64
> > > +vhost_user_postcopy_waker_nomatch(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
> > >  
> > >  # hw/virtio/virtio.c
> > >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > index 4589bfd92e..74807091a0 100644
> > > --- a/hw/virtio/vhost-user.c
> > > +++ b/hw/virtio/vhost-user.c
> > > @@ -990,6 +990,35 @@ static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
> > >      return -1;
> > >  }
> > >  
> > > +static int vhost_user_postcopy_waker(struct PostCopyFD *pcfd, RAMBlock *rb,
> > > +                                     uint64_t offset)
> > > +{
> > > +    struct vhost_dev *dev = pcfd->data;
> > > +    struct vhost_user *u = dev->opaque;
> > > +    int i;
> > > +
> > > +    trace_vhost_user_postcopy_waker(qemu_ram_get_idstr(rb), offset);
> > > +
> > > +    if (!u) {
> > > +        return 0;
> > > +    }
> > > +    /* Translate the offset into an address in the clients address space */
> > > +    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
> > > +        if (u->region_rb[i] == rb &&
> > > +            offset >= u->region_rb_offset[i] &&
> > > +            offset < (u->region_rb_offset[i] +
> > > +                      dev->mem->regions[i].memory_size)) {
> > > +            uint64_t client_addr = (offset - u->region_rb_offset[i]) +
> > > +                                   u->postcopy_client_bases[i];
> > > +            trace_vhost_user_postcopy_waker_found(client_addr);
> > > +            return postcopy_wake_shared(pcfd, client_addr, rb);
> > > +        }
> > > +    }
> > > +
> > > +    trace_vhost_user_postcopy_waker_nomatch(qemu_ram_get_idstr(rb), offset);
> > > +    return 0;
> > 
> > Can we really reach here?
> 
> Yes; note that all the waker's registered get called for all pages
> received
> so that:
>   a) A page not in shared memory, or not actually registered with a
> device, still calls the waker's and it's upto the waker to figure out
> whether it's interested for the device it belongs to.
> 
>   b) With two devices registered, they might each have registered
> different areas of shared memory, and thus it's upto the waker to figure
> out if it's interested in this specific page.

Indeed.

Again, if we note down faulted addresses for each PostcopyFD, IMHO we
can even ignore this check, since if the copied page covers any of the
faulted addresses of the FD we'll definitely need to send the wake;
otherwise we don't.  But the current patch is also okay to me now.

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 22/29] vhost+postcopy: Call wakeups
  2018-03-02  8:05   ` Peter Xu
@ 2018-03-06 10:36     ` Dr. David Alan Gilbert
  2018-03-08  6:22       ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-06 10:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Feb 16, 2018 at 01:16:18PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Cause the vhost-user client to be woken up whenever:
> >   a) We place a page in postcopy mode
> >   b) We get a fault and the page has already been received
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  migration/postcopy-ram.c | 14 ++++++++++----
> >  migration/trace-events   |  1 +
> >  2 files changed, 11 insertions(+), 4 deletions(-)
> > 
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 879711968c..13561703b5 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -566,7 +566,11 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
> >  
> >      trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
> >                                         rb_offset);
> > -    /* TODO: Check bitmap to see if we already have the page */
> > +    if (ramblock_recv_bitmap_test_byte_offset(rb, aligned_rbo)) {
> > +        trace_postcopy_request_shared_page_present(pcfd->idstr,
> > +                                        qemu_ram_get_idstr(rb), rb_offset);
> > +        return postcopy_wake_shared(pcfd, client_addr, rb);
> > +    }
> >      if (rb != mis->last_rb) {
> >          mis->last_rb = rb;
> >          migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
> > @@ -863,7 +867,8 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> >      }
> >  
> >      trace_postcopy_place_page(host);
> > -    return 0;
> > +    return postcopy_notify_shared_wake(rb,
> > +                                       qemu_ram_block_host_offset(rb, host));
> >  }
> >  
> >  /*
> > @@ -887,6 +892,9 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> >  
> >              return -e;
> >          }
> > +        return postcopy_notify_shared_wake(rb,
> > +                                           qemu_ram_block_host_offset(rb,
> > +                                                                      host));
> >      } else {
> >          /* The kernel can't use UFFDIO_ZEROPAGE for hugepages */
> >          if (!mis->postcopy_tmp_zero_page) {
> > @@ -906,8 +914,6 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> >          return postcopy_place_page(mis, host, mis->postcopy_tmp_zero_page,
> >                                     rb);
> >      }
> > -
> > -    return 0;
> >  }
> 
> Could there be race?  E.g.:
> 
>               ram_load_thread             page_fault_thread
>              -----------------           -------------------
> 
>                                           if (recv_bitmap_set())
>                                               wake()
>              copy_page()
>              recv_bitmap_set()
>              wake()
>                                           request_page()
> 
> Then the last requested page may never be serviced?

The postcopy finishes when the last page is received, and thus when that
also performs the wake() (from the load thread); so that's not a
problem.
You can get the case where a page that qemu has already received still
needs to be woken for the shared users (which is why we have the wake in
the fault thread).
When the postcopy finishes, the client is sent a POSTCOPY_END, at which
point it closes its userfaultfd and should wake everything that
remains; so any late requests shouldn't be a problem (the END is sent
before the fault thread quits).
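The interleaving above can be modelled with two flags (a toy, single-threaded model of the ordering, not the real code): whichever side runs second sees the other's update, so the wake is never lost:

```c
#include <assert.h>

/* Toy model: one page, one shared client. */
struct state {
    int received;   /* recv-bitmap bit: page has been copied in */
    int woken;      /* client's fault has been woken */
};

/* Load thread: place the page, mark it received, then wake. */
static void place_page(struct state *s)
{
    s->received = 1;
    s->woken = 1;       /* postcopy_notify_shared_wake() */
}

/* Fault thread: if the page already arrived, wake immediately;
 * otherwise request it - the later place_page() will do the wake. */
static void handle_fault(struct state *s)
{
    if (s->received) {
        s->woken = 1;   /* postcopy_wake_shared() */
    }
    /* else: migrate_send_rp_req_pages(...) */
}
```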

Dave


> Thanks,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v3 20/29] postcopy: postcopy_notify_shared_wake
  2018-03-06  3:37       ` Peter Xu
@ 2018-03-06 10:54         ` Dr. David Alan Gilbert
  2018-03-07 10:13           ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-06 10:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Mon, Mar 05, 2018 at 07:55:13PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Fri, Feb 16, 2018 at 01:16:16PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > Add a hook to allow a client userfaultfd to be 'woken'
> > > > when a page arrives, and a walker that calls that
> > > > hook for relevant clients given a RAMBlock and offset.
> > > > 
> > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > ---
> > > >  migration/postcopy-ram.c | 16 ++++++++++++++++
> > > >  migration/postcopy-ram.h | 10 ++++++++++
> > > >  2 files changed, 26 insertions(+)
> > > > 
> > > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > > > index 67deae7e1c..879711968c 100644
> > > > --- a/migration/postcopy-ram.c
> > > > +++ b/migration/postcopy-ram.c
> > > > @@ -824,6 +824,22 @@ static int qemu_ufd_copy_ioctl(int userfault_fd, void *host_addr,
> > > >      return ret;
> > > >  }
> > > >  
> > > > +int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset)
> > > > +{
> > > > +    int i;
> > > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > > +    GArray *pcrfds = mis->postcopy_remote_fds;
> > > > +
> > > > +    for (i = 0; i < pcrfds->len; i++) {
> > > > +        struct PostCopyFD *cur = &g_array_index(pcrfds, struct PostCopyFD, i);
> > > > +        int ret = cur->waker(cur, rb, offset);
> > > > +        if (ret) {
> > > > +            return ret;
> > > > +        }
> > > > +    }
> > > > +    return 0;
> > > > +}
> > > > +
> > > 
> > > We should know that which FD needs what pages, right?  If with that
> > > information, we can only notify the ones who have page faulted on
> > > exactly the same page?  Otherwise we do UFFDIO_WAKE once for each
> > > client when a page is ready, even if the clients have not page faulted
> > > at all?
> > 
> > The 'waker' function we call knows that, we don't; see the
> > 'vhost_user_postcopy_waker' in the next patch, and it hunts down whether
> > the address the waker is called for is one it's responsible for.
> 
> For vhost-user devices, they should be always responsible for mostly
> all RAM exported on the guest?  If so, they will always be notified to
> wake up if a page is copied?

Right; but this patch isn't vhost-user specific; this is more general.

> Here I was thinking not only about responsible ranges - It was about
> whether each PostcopyFD could note down the faulted addresses that
> were waiting to be service.  Then when we do the wake up, we could
> possibly skip notifying the PostcopyFD when the copied page is not
> covering any of the faulted addresses on that PostcopyFD?

Yes, that would be possible - in this case I made that the job of
the device that had registered (i.e. the waker method) rather than
the core postcopy code.

> > Also note that a shared page might be shared between multiple other
> > programs - not just one.  In our case that could be two vhost-user
> > devices wired to two separate processes.
> 
> Yeah, but the idea still stands IMHO - we can notify only those
> PostcopyFDs that have faulted on the page already and skip the rest.
> For sure there can be more than one candidate for the wakeup, since
> there can be multiple PostcopyFDs that captured page fault on the same
> page (or even, same address).
> 
> > 
> > > But for the first version, I think it's fine.  And I believe if we
> > > maintain the faulted addresses we need some way to sync between the
> > > wake thread and fault thread too.
> > 
> > Hmm can you explain that a bit more?
> 
> Basically above was what I thought - to record the faulted addresses
> with specific PostcopyFD when page fault happened, then we may know
> which page(s) will a PostcopyFD need.  But when with that, we'll
> possibly need a lock to protect the information (or any other sync
> method).

OK, but I think you're suggesting building a whole new data structure to
know which ones need notifying;  that sounds like a lot of extra
complexity for not much gain.

Dave

> (Hope I didn't miss anything important along the way)
> 
> Thanks,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v3 11/29] vhost+postcopy: Transmit 'listen' to client
  2018-03-06  7:06       ` Peter Xu
@ 2018-03-06 11:20         ` Dr. David Alan Gilbert
  2018-03-07 10:05           ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-06 11:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

* Peter Xu (peterx@redhat.com) wrote:
> On Mon, Mar 05, 2018 at 05:42:42PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Fri, Feb 16, 2018 at 01:16:07PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > 
> > > [...]
> > > 
> > > >  typedef struct VuVirtqElement {
> > > > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > > > index 621543e654..bdec9ec0e8 100644
> > > > --- a/docs/interop/vhost-user.txt
> > > > +++ b/docs/interop/vhost-user.txt
> > > > @@ -682,6 +682,12 @@ Master message types
> > > >        the slave must open a userfaultfd for later use.
> > > >        Note that at this stage the migration is still in precopy mode.
> > > >  
> > > > + * VHOST_USER_POSTCOPY_LISTEN
> > > > +      Id: 27
> > > > +      Master payload: N/A
> > > > +
> > > > +      Master advises slave that a transition to postcopy mode has happened.
> > > 
> > > Could we add something to explain why this listen needs to be
> > > broadcasted to clients?  Since I failed to find it out quickly
> > > myself. :(
> > 
> > I've changed this to:
> > 
> >  * VHOST_USER_POSTCOPY_LISTEN
> >       Id: 29
> >       Master payload: N/A
> > 
> >       Master advises slave that a transition to postcopy mode has happened.
> >       The slave must ensure that shared memory is registered with userfaultfd
> >       to cause faulting of non-present pages.
> 
> But shouldn't this be assured by the SET_MEM_TABLE call?

Yes, it is the set_mem_table that does the registration - but it only
registers the memory with userfaultfd if it has received a 'listen'
notification; otherwise it assumes it's a normal precopy migration.
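A minimal model of that slave-side gating (hypothetical names - libvhost-user's real handlers differ): SET_MEM_TABLE only registers the new mappings with userfaultfd when a POSTCOPY_LISTEN has already been seen:

```c
#include <assert.h>

/* Simplified slave state. */
struct slave {
    int postcopy_listening;     /* set by VHOST_USER_POSTCOPY_LISTEN */
    int uffd_registered;        /* regions registered with userfaultfd */
};

static void handle_postcopy_listen(struct slave *s)
{
    s->postcopy_listening = 1;
}

static void handle_set_mem_table(struct slave *s)
{
    /* mmap() the regions in either mode (elided here)... */
    if (s->postcopy_listening) {
        /* ...but only do the UFFDIO_REGISTER step in postcopy. */
        s->uffd_registered = 1;
    }
}
```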

> Sorry for being not that familiar with vhost-user protocol... but
> what's the correct order of these commands?
> 
>   POSTCOPY_ADVISE
>   POSTCOPY_LISTEN
>   SET_MEM_TABLE

Right.

Dave

> ?  Thanks,
> 
> > 
> >       This is always sent sometime after a VHOST_USER_POSTCOPY_ADVISE, and
> >       thus only when VHOST_USER_PROTOCOL_F_PAGEFAULT is supported.
> > 
> > Dave
> > 
> > > -- 
> > > Peter Xu
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v3 11/29] vhost+postcopy: Transmit 'listen' to client
  2018-03-06 11:20         ` Dr. David Alan Gilbert
@ 2018-03-07 10:05           ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2018-03-07 10:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Tue, Mar 06, 2018 at 11:20:56AM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Mon, Mar 05, 2018 at 05:42:42PM +0000, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > On Fri, Feb 16, 2018 at 01:16:07PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > 
> > > > [...]
> > > > 
> > > > >  typedef struct VuVirtqElement {
> > > > > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > > > > index 621543e654..bdec9ec0e8 100644
> > > > > --- a/docs/interop/vhost-user.txt
> > > > > +++ b/docs/interop/vhost-user.txt
> > > > > @@ -682,6 +682,12 @@ Master message types
> > > > >        the slave must open a userfaultfd for later use.
> > > > >        Note that at this stage the migration is still in precopy mode.
> > > > >  
> > > > > + * VHOST_USER_POSTCOPY_LISTEN
> > > > > +      Id: 27
> > > > > +      Master payload: N/A
> > > > > +
> > > > > +      Master advises slave that a transition to postcopy mode has happened.
> > > > 
> > > > Could we add something to explain why this listen needs to be
> > > > broadcasted to clients?  Since I failed to find it out quickly
> > > > myself. :(
> > > 
> > > I've changed this to:
> > > 
> > >  * VHOST_USER_POSTCOPY_LISTEN
> > >       Id: 29
> > >       Master payload: N/A
> > > 
> > >       Master advises slave that a transition to postcopy mode has happened.
> > >       The slave must ensure that shared memory is registered with userfaultfd
> > >       to cause faulting of non-present pages.
> > 
> > But shouldn't this be assured by the SET_MEM_TABLE call?
> 
> Yes, it is set_mem_table that does the registration - but it only
> registers the memory with userfaultfd if it has received a 'listen'
> notification; otherwise it assumes it's a normal precopy migration.

I think I've got the picture now.  With the enhanced documentation in
place, please add:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 20/29] postcopy: postcopy_notify_shared_wake
  2018-03-06 10:54         ` Dr. David Alan Gilbert
@ 2018-03-07 10:13           ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2018-03-07 10:13 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Tue, Mar 06, 2018 at 10:54:18AM +0000, Dr. David Alan Gilbert wrote:

[...]

> > Basically the above was what I thought - record the faulted addresses
> > against the specific PostcopyFD when a page fault happens, so we know
> > which page(s) a PostcopyFD will need.  But with that, we'll
> > possibly need a lock (or some other sync method) to protect the
> > information.
> 
> OK, but I think you're suggesting building a whole new data structure to
> know which ones need notifying;  that sounds like a lot of extra
> complexity for not much gain.

Yes, we may need a new structure (or just a list of addresses?), and
indeed I have no idea how much that would help us.  I think it depends
on how many useless wakeups we will have, and how expensive each such
wakeup notification is.  Again, I think the current solution is good
enough as long as we don't see an explicit performance blocker, and we
can rethink it when really needed.  Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 22/29] vhost+postcopy: Call wakeups
  2018-03-06 10:36     ` Dr. David Alan Gilbert
@ 2018-03-08  6:22       ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2018-03-08  6:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, marcandre.lureau, imammedo, mst,
	quintela, aarcange

On Tue, Mar 06, 2018 at 10:36:52AM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Fri, Feb 16, 2018 at 01:16:18PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Cause the vhost-user client to be woken up whenever:
> > >   a) We place a page in postcopy mode
> > >   b) We get a fault and the page has already been received
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  migration/postcopy-ram.c | 14 ++++++++++----
> > >  migration/trace-events   |  1 +
> > >  2 files changed, 11 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > > index 879711968c..13561703b5 100644
> > > --- a/migration/postcopy-ram.c
> > > +++ b/migration/postcopy-ram.c
> > > @@ -566,7 +566,11 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
> > >  
> > >      trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
> > >                                         rb_offset);
> > > -    /* TODO: Check bitmap to see if we already have the page */
> > > +    if (ramblock_recv_bitmap_test_byte_offset(rb, aligned_rbo)) {
> > > +        trace_postcopy_request_shared_page_present(pcfd->idstr,
> > > +                                        qemu_ram_get_idstr(rb), rb_offset);
> > > +        return postcopy_wake_shared(pcfd, client_addr, rb);
> > > +    }
> > >      if (rb != mis->last_rb) {
> > >          mis->last_rb = rb;
> > >          migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
> > > @@ -863,7 +867,8 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> > >      }
> > >  
> > >      trace_postcopy_place_page(host);
> > > -    return 0;
> > > +    return postcopy_notify_shared_wake(rb,
> > > +                                       qemu_ram_block_host_offset(rb, host));
> > >  }
> > >  
> > >  /*
> > > @@ -887,6 +892,9 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> > >  
> > >              return -e;
> > >          }
> > > +        return postcopy_notify_shared_wake(rb,
> > > +                                           qemu_ram_block_host_offset(rb,
> > > +                                                                      host));
> > >      } else {
> > >          /* The kernel can't use UFFDIO_ZEROPAGE for hugepages */
> > >          if (!mis->postcopy_tmp_zero_page) {
> > > @@ -906,8 +914,6 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> > >          return postcopy_place_page(mis, host, mis->postcopy_tmp_zero_page,
> > >                                     rb);
> > >      }
> > > -
> > > -    return 0;
> > >  }
> > 
> > Could there be race?  E.g.:
> > 
> >               ram_load_thread             page_fault_thread
> >              -----------------           -------------------
> > 
> >                                           if (recv_bitmap_set())
> >                                               wake()
> >              copy_page()
> >              recv_bitmap_set()
> >              wake()
> >                                           request_page()
> > 
> > Then the last requested page may never be serviced?
> 
> The postcopy finishes when the last page is received, and thus when that
> also performs the wake() (from the load thread); so that's not a
> problem.
> You can get the case where a page that qemu has already received
> still needs to be woken for the shared users (which is why we have
> the wake in the fault-thread).
> When the postcopy finishes, the client is sent a POSTCOPY_END, at
> which point it closes its userfaultfd and should wake everything
> remaining up; so any late requests shouldn't be a problem (the END
> is sent before the fault-thread quits).

Yeah, now I think the race doesn't exist - the wake() in ram_load_thread
will wake up the paused thread in this case.  I misunderstood.

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v3 19/29] postcopy: wake shared
  2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 19/29] postcopy: wake shared Dr. David Alan Gilbert (git)
  2018-03-02  7:44   ` Peter Xu
@ 2018-03-12 15:44   ` Marc-André Lureau
  2018-03-12 16:42     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 75+ messages in thread
From: Marc-André Lureau @ 2018-03-12 15:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU, Maxime Coquelin, Peter Xu, Igor Mammedov,
	Michael S. Tsirkin, Andrea Arcangeli, Juan Quintela

Hi

On Fri, Feb 16, 2018 at 2:16 PM, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Send a 'wake' request on a userfaultfd for a shared process.
> The address in the client's address space is specified together
> with the RAMBlock it was resolved to.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>


> ---
>  migration/postcopy-ram.c | 26 ++++++++++++++++++++++++++
>  migration/postcopy-ram.h |  6 ++++++
>  migration/trace-events   |  1 +
>  3 files changed, 33 insertions(+)
>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 277ff749a0..67deae7e1c 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -534,6 +534,25 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
>      return 0;
>  }
>
> +int postcopy_wake_shared(struct PostCopyFD *pcfd,
> +                         uint64_t client_addr,
> +                         RAMBlock *rb)
> +{
> +    size_t pagesize = qemu_ram_pagesize(rb);
> +    struct uffdio_range range;
> +    int ret;
> +    trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb));
> +    range.start = client_addr & ~(pagesize - 1);
> +    range.len = pagesize;
> +    ret = ioctl(pcfd->fd, UFFDIO_WAKE, &range);
> +    if (ret) {
> +        error_report("%s: Failed to wake: %zx in %s (%s)",
> +                     __func__, (size_t)client_addr, qemu_ram_get_idstr(rb),
> +                     strerror(errno));
> +    }
> +    return ret;
> +}
> +
>  /*
>   * Callback from shared fault handlers to ask for a page,
>   * the page must be specified by a RAMBlock and an offset in that rb
> @@ -951,6 +970,13 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
>      return NULL;
>  }
>
> +int postcopy_wake_shared(struct PostCopyFD *pcfd,
> +                         uint64_t client_addr,
> +                         RAMBlock *rb)
> +{
> +    assert(0);
> +    return -1;
> +}
>  #endif
>
>  /* ------------------------------------------------------------------------- */
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index 4c63f20df4..2e3dd844d5 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -162,6 +162,12 @@ struct PostCopyFD {
>   */
>  void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
>  void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
> +/* Notify a client ufd that a page is available
> + * Note: The 'client_address' is in the address space of the client
> + * program not QEMU
> + */

Any reason not to follow the classic function declaration / API
documentation style:
/**
 * func:
 * @arg: blah
 *
 * Lorem ipsum...
 */

(future documentation tooling will hopefully parse it etc)

> +int postcopy_wake_shared(struct PostCopyFD *pcfd, uint64_t client_addr,
> +                         RAMBlock *rb);
>  /* Callback from shared fault handlers to ask for a page */
>  int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
>                                   uint64_t client_addr, uint64_t offset);
> diff --git a/migration/trace-events b/migration/trace-events
> index 7c910b5479..b0acaaa8a0 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -199,6 +199,7 @@ postcopy_ram_incoming_cleanup_entry(void) ""
>  postcopy_ram_incoming_cleanup_exit(void) ""
>  postcopy_ram_incoming_cleanup_join(void) ""
>  postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
> +postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
>
>  save_xbzrle_page_skipping(void) ""
>  save_xbzrle_page_overflow(void) ""
> --
> 2.14.3
>
>



-- 
Marc-André Lureau


* Re: [Qemu-devel] [PATCH v3 19/29] postcopy: wake shared
  2018-03-12 15:44   ` Marc-André Lureau
@ 2018-03-12 16:42     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 75+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-12 16:42 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Maxime Coquelin, Peter Xu, Igor Mammedov,
	Michael S. Tsirkin, Andrea Arcangeli, Juan Quintela

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> Hi
> 
> On Fri, Feb 16, 2018 at 2:16 PM, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > Send a 'wake' request on a userfaultfd for a shared process.
> > The address in the client's address space is specified together
> > with the RAMBlock it was resolved to.
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

Thanks.

> 
> > [...]
> > +/* Notify a client ufd that a page is available
> > + * Note: The 'client_address' is in the address space of the client
> > + * program not QEMU
> > + */
> 
> Any reason not to follow the classic function declaration / API
> documentation style:
> /**
>  * func:
>  * @arg: blah
>  *
>  * Lorem ipsum...
>  */
> 
> (future documentation tooling will hopefully parse it etc)

I've added it; it's not something I generally do except for big external
interfaces.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Thread overview: 75+ messages
2018-02-16 13:15 [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 01/29] migrate: Update ram_block_discard_range for shared Dr. David Alan Gilbert (git)
2018-02-28  6:37   ` Peter Xu
2018-02-28 19:54     ` Dr. David Alan Gilbert
2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 02/29] qemu_ram_block_host_offset Dr. David Alan Gilbert (git)
2018-02-16 13:15 ` [Qemu-devel] [PATCH v3 03/29] postcopy: use UFFDIO_ZEROPAGE only when available Dr. David Alan Gilbert (git)
2018-02-28  6:53   ` Peter Xu
2018-03-05 17:23     ` Dr. David Alan Gilbert
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 04/29] postcopy: Add notifier chain Dr. David Alan Gilbert (git)
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 05/29] postcopy: Add vhost-user flag for postcopy and check it Dr. David Alan Gilbert (git)
2018-02-28  7:14   ` Peter Xu
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 06/29] vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message Dr. David Alan Gilbert (git)
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 07/29] libvhost-user: Support sending fds back to qemu Dr. David Alan Gilbert (git)
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 08/29] libvhost-user: Open userfaultfd Dr. David Alan Gilbert (git)
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 09/29] postcopy: Allow registering of fd handler Dr. David Alan Gilbert (git)
2018-02-28  8:38   ` Peter Xu
2018-03-05 17:35     ` Dr. David Alan Gilbert
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 10/29] vhost+postcopy: Register shared ufd with postcopy Dr. David Alan Gilbert (git)
2018-02-28  8:46   ` Peter Xu
2018-03-05 18:21     ` Dr. David Alan Gilbert
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 11/29] vhost+postcopy: Transmit 'listen' to client Dr. David Alan Gilbert (git)
2018-02-28  8:42   ` Peter Xu
2018-03-05 17:42     ` Dr. David Alan Gilbert
2018-03-06  7:06       ` Peter Xu
2018-03-06 11:20         ` Dr. David Alan Gilbert
2018-03-07 10:05           ` Peter Xu
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 12/29] postcopy+vhost-user: Split set_mem_table for postcopy Dr. David Alan Gilbert (git)
2018-02-28  8:49   ` Peter Xu
2018-03-05 18:45     ` Dr. David Alan Gilbert
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 13/29] migration/ram: ramblock_recv_bitmap_test_byte_offset Dr. David Alan Gilbert (git)
2018-02-28  8:52   ` Peter Xu
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 14/29] libvhost-user+postcopy: Register new regions with the ufd Dr. David Alan Gilbert (git)
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 15/29] vhost+postcopy: Send address back to qemu Dr. David Alan Gilbert (git)
2018-02-27 14:25   ` Michael S. Tsirkin
2018-02-27 19:54     ` Dr. David Alan Gilbert
2018-02-27 20:25       ` Michael S. Tsirkin
2018-02-28 18:26         ` Dr. David Alan Gilbert
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 16/29] vhost+postcopy: Stash RAMBlock and offset Dr. David Alan Gilbert (git)
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 17/29] vhost+postcopy: Send requests to source for shared pages Dr. David Alan Gilbert (git)
2018-02-28 10:03   ` Peter Xu
2018-03-05 18:55     ` Dr. David Alan Gilbert
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 18/29] vhost+postcopy: Resolve client address Dr. David Alan Gilbert (git)
2018-03-02  7:29   ` Peter Xu
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 19/29] postcopy: wake shared Dr. David Alan Gilbert (git)
2018-03-02  7:44   ` Peter Xu
2018-03-05 19:35     ` Dr. David Alan Gilbert
2018-03-12 15:44   ` Marc-André Lureau
2018-03-12 16:42     ` Dr. David Alan Gilbert
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 20/29] postcopy: postcopy_notify_shared_wake Dr. David Alan Gilbert (git)
2018-03-02  7:51   ` Peter Xu
2018-03-05 19:55     ` Dr. David Alan Gilbert
2018-03-06  3:37       ` Peter Xu
2018-03-06 10:54         ` Dr. David Alan Gilbert
2018-03-07 10:13           ` Peter Xu
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 21/29] vhost+postcopy: Add vhost waker Dr. David Alan Gilbert (git)
2018-03-02  7:55   ` Peter Xu
2018-03-05 20:16     ` Dr. David Alan Gilbert
2018-03-06  7:19       ` Peter Xu
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 22/29] vhost+postcopy: Call wakeups Dr. David Alan Gilbert (git)
2018-03-02  8:05   ` Peter Xu
2018-03-06 10:36     ` Dr. David Alan Gilbert
2018-03-08  6:22       ` Peter Xu
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 23/29] libvhost-user: mprotect & madvises for postcopy Dr. David Alan Gilbert (git)
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 24/29] vhost-user: Add VHOST_USER_POSTCOPY_END message Dr. David Alan Gilbert (git)
2018-02-26 20:27   ` Michael S. Tsirkin
2018-02-27 10:09     ` Dr. David Alan Gilbert
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 25/29] vhost+postcopy: Wire up POSTCOPY_END notify Dr. David Alan Gilbert (git)
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 26/29] vhost: Huge page align and merge Dr. David Alan Gilbert (git)
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 27/29] postcopy: Allow shared memory Dr. David Alan Gilbert (git)
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 28/29] libvhost-user: Claim support for postcopy Dr. David Alan Gilbert (git)
2018-02-16 13:16 ` [Qemu-devel] [PATCH v3 29/29] postcopy shared docs Dr. David Alan Gilbert (git)
2018-02-27 14:01 ` [Qemu-devel] [PATCH v3 00/29] postcopy+vhost-user/shared ram Michael S. Tsirkin
2018-02-27 20:05   ` Dr. David Alan Gilbert
2018-02-27 20:23     ` Michael S. Tsirkin
2018-02-28 18:38       ` Dr. David Alan Gilbert
