* [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram
@ 2017-08-24 19:26 ` Dr. David Alan Gilbert (git)
  2017-08-24 19:26   ` [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started Dr. David Alan Gilbert (git)
                     ` (32 more replies)
  0 siblings, 33 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:26 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Hi,
  This is an RFC/WIP series that enables postcopy migration
with shared memory to a vhost-user process.
It's based off current-head + Alexey's bitmap series.

It's tested with vhost-user-bridge and with dpdk (modified by Maxime;
those changes will get posted separately) - both very lightly.

It's still got a few very rough edges, but it successfully migrates
with both normal and huge pages (2M).

The major difference over v1 is that there's a set of code
that merges vhost regions together on the qemu side so that
we get a single hugepage region on the PC spanning the 640k
hole (the hole hopefully isn't accessed by the client,
but the client used to align around it anyway).

It's also got a lot of cleanups from the comments from v1
but there are still a few things that need work.
In particular, there's still the hack around qemu waiting
for the set_mem_table to come back; I also worry what would
happen if a set-mem-table was triggered during a migrate;
I suspect it would break badly.

One thing that turned out not to be a problem was madvises for hugepages;
because we register userfault directly after mmap'ing the
region in the client, we have no pages mapped and hence
the madvises/fallocates are fortunately not compulsory.
Still, I'd like a way to do it; it would feel safer.

A copy of this code, based off the current 2.10.0-rc4
together with Alexey's bitmap code is available here:
    https://github.com/dagrh/qemu/tree/vhost-wipv2

Dave

Dr. David Alan Gilbert (32):
  vhu: vu_queue_started
  vhub: Only process received packets on started queues
  migrate: Update ram_block_discard_range for shared
  qemu_ram_block_host_offset
  migration/ram: ramblock_recv_bitmap_test_byte_offset
  postcopy: use UFFDIO_ZEROPAGE only when available
  postcopy: Add notifier chain
  postcopy: Add vhost-user flag for postcopy and check it
  vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message
  vhub: Support sending fds back to qemu
  vhub: Open userfaultfd
  postcopy: Allow registering of fd handler
  vhost+postcopy: Register shared ufd with postcopy
  vhost+postcopy: Transmit 'listen' to client
  vhost+postcopy: Register new regions with the ufd
  vhost+postcopy: Send address back to qemu
  vhost+postcopy: Stash RAMBlock and offset
  vhost+postcopy: Send requests to source for shared pages
  vhost+postcopy: Resolve client address
  postcopy: wake shared
  postcopy: postcopy_notify_shared_wake
  vhost+postcopy: Add vhost waker
  vhost+postcopy: Call wakeups
  vub+postcopy: madvises
  vhost+postcopy: Lock around set_mem_table
  vhost: Add VHOST_USER_POSTCOPY_END message
  vhost+postcopy: Wire up POSTCOPY_END notify
  postcopy: Allow shared memory
  vhost-user: Claim support for postcopy
  vhost: Merge neighbouring hugepage regions where appropriate
  vhost: Don't break merged regions on small remove/non-adds
  postcopy shared docs

 contrib/libvhost-user/libvhost-user.c | 226 ++++++++++++++++++++-
 contrib/libvhost-user/libvhost-user.h |  22 ++-
 docs/devel/migration.txt              |  39 ++++
 docs/interop/vhost-user.txt           |  39 ++++
 exec.c                                |  60 ++++--
 hw/virtio/trace-events                |  27 +++
 hw/virtio/vhost-user.c                | 326 +++++++++++++++++++++++++++++-
 hw/virtio/vhost.c                     | 121 +++++++++++-
 include/exec/cpu-common.h             |   4 +
 migration/migration.c                 |   3 +
 migration/migration.h                 |   4 +
 migration/postcopy-ram.c              | 359 +++++++++++++++++++++++++++-------
 migration/postcopy-ram.h              |  69 +++++++
 migration/ram.c                       |   5 +
 migration/ram.h                       |   1 +
 migration/savevm.c                    |  13 ++
 migration/trace-events                |   6 +
 tests/vhost-user-bridge.c             |   1 +
 trace-events                          |   3 +
 vl.c                                  |   2 +
 20 files changed, 1241 insertions(+), 89 deletions(-)

-- 
2.13.5


* [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
@ 2017-08-24 19:26   ` Dr. David Alan Gilbert (git)
  2017-08-24 23:10     ` Marc-André Lureau
  2017-08-30 13:02     ` Michael S. Tsirkin
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 02/32] vhub: Only process received packets on started queues Dr. David Alan Gilbert (git)
                     ` (31 subsequent siblings)
  32 siblings, 2 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:26 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a vu_queue_started method to complement vu_queue_enabled.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 6 ++++++
 contrib/libvhost-user/libvhost-user.h | 9 +++++++++
 2 files changed, 15 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 35fa0c5e56..201b9846e9 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -930,6 +930,12 @@ vu_queue_enabled(VuDev *dev, VuVirtq *vq)
     return vq->enable;
 }
 
+bool
+vu_queue_started(VuDev *dev, VuVirtq *vq)
+{
+    return vq->started;
+}
+
 static inline uint16_t
 vring_avail_flags(VuVirtq *vq)
 {
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index 53ef222c0b..acd019876d 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -328,6 +328,15 @@ void vu_queue_set_notification(VuDev *dev, VuVirtq *vq, int enable);
 bool vu_queue_enabled(VuDev *dev, VuVirtq *vq);
 
 /**
+ * vu_queue_started:
+ * @dev: a VuDev context
+ * @vq: a VuVirtq queue
+ *
+ * Returns: whether the queue is started.
+ */
+bool vu_queue_started(VuDev *dev, VuVirtq *vq);
+
+/**
  * vu_queue_empty:
  * @dev: a VuDev context
  * @vq: a VuVirtq queue
-- 
2.13.5


* [Qemu-devel] [RFC v2 02/32] vhub: Only process received packets on started queues
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
  2017-08-24 19:26   ` [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30  9:59     ` Marc-André Lureau
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 03/32] migrate: Update ram_block_discard_range for shared Dr. David Alan Gilbert (git)
                     ` (30 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Only process received packets if the queue has been started.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 tests/vhost-user-bridge.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tests/vhost-user-bridge.c b/tests/vhost-user-bridge.c
index 1e5b5ca3da..324abee53d 100644
--- a/tests/vhost-user-bridge.c
+++ b/tests/vhost-user-bridge.c
@@ -277,6 +277,7 @@ vubr_backend_recv_cb(int sock, void *ctx)
     DPRINT("    hdrlen = %d\n", hdrlen);
 
     if (!vu_queue_enabled(dev, vq) ||
+        !vu_queue_started(dev, vq) ||
         !vu_queue_avail_bytes(dev, vq, hdrlen, 0)) {
         DPRINT("Got UDP packet, but no available descriptors on RX virtq.\n");
         return;
-- 
2.13.5


* [Qemu-devel] [RFC v2 03/32] migrate: Update ram_block_discard_range for shared
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
  2017-08-24 19:26   ` [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started Dr. David Alan Gilbert (git)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 02/32] vhub: Only process received packets on started queues Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-29  5:30     ` Peter Xu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 04/32] qemu_ram_block_host_offset Dr. David Alan Gilbert (git)
                     ` (29 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The choice of call to discard a block is getting more complicated
for other cases.  We use fallocate PUNCH_HOLE for all file-backed cases;
it works for both hugepage and for tmpfs.
We use madvise DONTNEED for non-hugepage cases, either where they're
anonymous or where they're private.

Care should be taken when trying other backing files.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 exec.c       | 35 ++++++++++++++++++++++++-----------
 trace-events |  3 +++
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/exec.c b/exec.c
index d20c34ca83..67df2909ce 100644
--- a/exec.c
+++ b/exec.c
@@ -3573,6 +3573,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
     }
 
     if ((start + length) <= rb->used_length) {
+        bool need_madvise, need_fallocate;
         uint8_t *host_endaddr = host_startaddr + length;
         if ((uintptr_t)host_endaddr & (rb->page_size - 1)) {
             error_report("ram_block_discard_range: Unaligned end address: %p",
@@ -3582,23 +3583,35 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
 
         errno = ENOTSUP; /* If we are missing MADVISE etc */
 
-        if (rb->page_size == qemu_host_page_size) {
-#if defined(CONFIG_MADVISE)
-            /* Note: We need the madvise MADV_DONTNEED behaviour of definitely
-             * freeing the page.
-             */
-            ret = madvise(host_startaddr, length, MADV_DONTNEED);
-#endif
-        } else {
-            /* Huge page case  - unfortunately it can't do DONTNEED, but
-             * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
-             * huge page file.
+        /* The logic here is messy;
+         *    madvise DONTNEED fails for hugepages
+         *    fallocate works on hugepages and shmem
+         */
+        need_madvise = (rb->page_size == qemu_host_page_size);
+        need_fallocate = rb->fd != -1;
+        if (need_fallocate) {
+            /* For a file, this causes the area of the file to be zero'd
+             * if read, and for hugetlbfs also causes it to be unmapped
+             * so a userfault will trigger.
              */
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
             ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             start, length);
 #endif
         }
+        /* i.e. need madvise but skip it if the fallocate failed */
+        if (need_madvise && (!need_fallocate || (ret == 0))) {
+            /* For normal RAM this causes it to be unmapped,
+             * for shared memory it causes the local mapping to disappear
+             * and to fall back on the file contents (which we just
+             * fallocate'd away).
+             */
+#if defined(CONFIG_MADVISE)
+            ret =  madvise(host_startaddr, length, MADV_DONTNEED);
+#endif
+        }
+        trace_ram_block_discard_range(rb->idstr, host_startaddr,
+                                      need_madvise, need_fallocate, ret);
         if (ret) {
             ret = -errno;
             error_report("ram_block_discard_range: Failed to discard range "
diff --git a/trace-events b/trace-events
index 1f50f56d9d..213ee34f89 100644
--- a/trace-events
+++ b/trace-events
@@ -55,6 +55,9 @@ dma_complete(void *dbs, int ret, void *cb) "dbs=%p ret=%d cb=%p"
 dma_blk_cb(void *dbs, int ret) "dbs=%p ret=%d"
 dma_map_wait(void *dbs) "dbs=%p"
 
+# exec.c
+ram_block_discard_range(const char *rbname, void *hva, bool need_madvise, bool need_fallocate, int ret) "%s@%p: madvise: %d fallocate: %d ret: %d"
+
 # memory.c
 memory_region_ops_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
 memory_region_ops_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
-- 
2.13.5
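
As a reading aid for the patch above, the choice between the two discard
calls can be sketched as a small standalone decision function.  This is only
an illustration under assumptions: struct discard_plan, plan_discard() and
should_madvise() are hypothetical names that do not appear in the patch, and
nothing here calls madvise() or fallocate() itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of which discard calls a block needs. */
struct discard_plan {
    bool need_madvise;   /* madvise(MADV_DONTNEED) on the mapping */
    bool need_fallocate; /* fallocate(FALLOC_FL_PUNCH_HOLE) on the file */
};

/*
 * Mirror the decision in the patch: DONTNEED is only attempted when the
 * block uses the host's normal page size, and hole punching is attempted
 * whenever the block is file backed (fd != -1).
 */
struct discard_plan plan_discard(size_t block_page_size,
                                 size_t host_page_size, int fd)
{
    struct discard_plan plan;

    assert(host_page_size != 0 && block_page_size >= host_page_size);
    plan.need_madvise = (block_page_size == host_page_size);
    plan.need_fallocate = (fd != -1);
    return plan;
}

/* The madvise is skipped if a preceding fallocate failed (ret != 0). */
bool should_madvise(struct discard_plan plan, int fallocate_ret)
{
    return plan.need_madvise && (!plan.need_fallocate || fallocate_ret == 0);
}
```

With these assumptions, an anonymous normal-page block gets only the
madvise, a hugetlbfs block gets only the fallocate, and a tmpfs block with
normal pages gets the fallocate followed by the madvise.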


* [Qemu-devel] [RFC v2 04/32] qemu_ram_block_host_offset
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (2 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 03/32] migrate: Update ram_block_discard_range for shared Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-25 12:11     ` Philippe Mathieu-Daudé
  2017-08-29  5:36     ` Peter Xu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 05/32] migration/ram: ramblock_recv_bitmap_test_byte_offset Dr. David Alan Gilbert (git)
                     ` (28 subsequent siblings)
  32 siblings, 2 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Utility to give the offset of a host pointer within a RAMBlock
(assuming we already know it's in that RAMBlock)

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 exec.c                    | 10 ++++++++++
 include/exec/cpu-common.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/exec.c b/exec.c
index 67df2909ce..35b4cea2ed 100644
--- a/exec.c
+++ b/exec.c
@@ -2231,6 +2231,16 @@ static void *qemu_ram_ptr_length(RAMBlock *ram_block, ram_addr_t addr,
     return ramblock_ptr(block, addr);
 }
 
+/* Return the offset of a hostpointer within a ramblock */
+ram_addr_t qemu_ram_block_host_offset(RAMBlock *rb, void *host)
+{
+    ram_addr_t res = (uint8_t *)host - (uint8_t *)rb->host;
+    assert((uint8_t *)host >= (uint8_t *)rb->host);
+    assert(res < rb->max_length);
+
+    return res;
+}
+
 /*
  * Translates a host ptr back to a RAMBlock, a ram_addr and an offset
  * in that RAMBlock.
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 74341b19d2..0d861a6289 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -68,6 +68,7 @@ ram_addr_t qemu_ram_addr_from_host(void *ptr);
 RAMBlock *qemu_ram_block_by_name(const char *name);
 RAMBlock *qemu_ram_block_from_host(void *ptr, bool round_offset,
                                    ram_addr_t *offset);
+ram_addr_t qemu_ram_block_host_offset(RAMBlock *rb, void *host);
 void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
 void qemu_ram_unset_idstr(RAMBlock *block);
 const char *qemu_ram_get_idstr(RAMBlock *rb);
-- 
2.13.5
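
The helper above is plain pointer arithmetic guarded by two assertions.  A
minimal standalone sketch follows, assuming a cut-down block type:
struct demo_ram_block and demo_block_host_offset() are hypothetical
stand-ins for RAMBlock and the patch's function, kept only to show the
bounds checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two RAMBlock fields the helper reads. */
struct demo_ram_block {
    uint8_t *host;      /* start of the block's host mapping */
    size_t max_length;  /* size of the mapping in bytes */
};

/* Return the byte offset of 'host' within the block's mapping. */
size_t demo_block_host_offset(const struct demo_ram_block *rb,
                              const void *host)
{
    const uint8_t *p = host;

    /* Check the lower bound first, then that the offset is in range. */
    assert(p >= rb->host);
    size_t off = (size_t)(p - rb->host);
    assert(off < rb->max_length);
    return off;
}
```

As in the patch, the caller is expected to know already that the pointer
belongs to the block; the assertions only catch misuse.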


* [Qemu-devel] [RFC v2 05/32] migration/ram: ramblock_recv_bitmap_test_byte_offset
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (3 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 04/32] qemu_ram_block_host_offset Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 06/32] postcopy: use UFFDIO_ZEROPAGE only when available Dr. David Alan Gilbert (git)
                     ` (27 subsequent siblings)
  32 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Utility for testing the map when you already know the offset
in the RAMBlock.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/ram.c | 5 +++++
 migration/ram.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index affb20cb5a..fbb874fb83 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -164,6 +164,11 @@ int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr)
                     rb->receivedmap);
 }
 
+bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset)
+{
+    return test_bit(byte_offset >> TARGET_PAGE_BITS, rb->receivedmap);
+}
+
 void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr)
 {
     set_bit_atomic(ramblock_recv_bitmap_offset(host_addr, rb), rb->receivedmap);
diff --git a/migration/ram.h b/migration/ram.h
index 4db992298a..8720b9de73 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -54,6 +54,7 @@ int ram_postcopy_incoming_init(MigrationIncomingState *mis);
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
+bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
 void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
 void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
 void ramblock_recv_bitmap_clear(RAMBlock *rb, void *host_addr);
-- 
2.13.5
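
The test added above turns a byte offset into a page index and looks up one
bit per page.  A standalone sketch of that mapping, assuming 4 KiB target
pages; DEMO_PAGE_BITS, demo_set_received() and demo_test_received() are
hypothetical names, and the real code uses QEMU's bitmap helpers rather
than this hand-rolled bitmap.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_BITS 12  /* assumption: 4 KiB target pages */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Mark the page containing byte_offset as received. */
void demo_set_received(unsigned long *map, uint64_t byte_offset)
{
    uint64_t page = byte_offset >> DEMO_PAGE_BITS;

    assert(map != NULL);
    map[page / DEMO_BITS_PER_LONG] |= 1UL << (page % DEMO_BITS_PER_LONG);
}

/* Test whether the page containing byte_offset has been received. */
bool demo_test_received(const unsigned long *map, uint64_t byte_offset)
{
    uint64_t page = byte_offset >> DEMO_PAGE_BITS;

    assert(map != NULL);
    return (map[page / DEMO_BITS_PER_LONG] >>
            (page % DEMO_BITS_PER_LONG)) & 1UL;
}
```

Any offset inside the same page maps to the same bit, so a caller that
already holds a block-relative offset does not need a host address.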


* [Qemu-devel] [RFC v2 06/32] postcopy: use UFFDIO_ZEROPAGE only when available
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (4 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 05/32] migration/ram: ramblock_recv_bitmap_test_byte_offset Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30  9:57     ` Marc-André Lureau
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 07/32] postcopy: Add notifier chain Dr. David Alan Gilbert (git)
                     ` (26 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Use a flag on the RAMBlock to state whether it has the
UFFDIO_ZEROPAGE capability, and use it when it's available.

This allows the use of postcopy on tmpfs as well as hugepage
backed files.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 exec.c                    | 15 +++++++++++++++
 include/exec/cpu-common.h |  3 +++
 migration/postcopy-ram.c  | 14 +++++++++++---
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/exec.c b/exec.c
index 35b4cea2ed..80c3d1d121 100644
--- a/exec.c
+++ b/exec.c
@@ -103,6 +103,11 @@ static MemoryRegion io_mem_unassigned;
  */
 #define RAM_RESIZEABLE (1 << 2)
 
+/* UFFDIO_ZEROPAGE is available on this RAMBlock to atomically
+ * zero the page and wake waiting processes.
+ * (Set during postcopy)
+ */
+#define RAM_UF_ZEROPAGE (1 << 3)
 #endif
 
 #ifdef TARGET_PAGE_BITS_VARY
@@ -1705,6 +1710,16 @@ bool qemu_ram_is_shared(RAMBlock *rb)
     return rb->flags & RAM_SHARED;
 }
 
+bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
+{
+    return rb->flags & RAM_UF_ZEROPAGE;
+}
+
+void qemu_ram_set_uf_zeroable(RAMBlock *rb)
+{
+    rb->flags |= RAM_UF_ZEROPAGE;
+}
+
 /* Called with iothread lock held.  */
 void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
 {
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 0d861a6289..24d335f95d 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -73,6 +73,9 @@ void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
 void qemu_ram_unset_idstr(RAMBlock *block);
 const char *qemu_ram_get_idstr(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
+bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
+void qemu_ram_set_uf_zeroable(RAMBlock *rb);
+
 size_t qemu_ram_pagesize(RAMBlock *block);
 size_t qemu_ram_pagesize_largest(void);
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 7a414ebad8..640b72d86d 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -408,6 +408,11 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
         error_report("%s userfault: Region doesn't support COPY", __func__);
         return -1;
     }
+    if (reg_struct.ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE)) {
+        RAMBlock *rb = qemu_ram_block_by_name(block_name);
+        qemu_ram_set_uf_zeroable(rb);
+    }
+
 
     return 0;
 }
@@ -617,11 +622,14 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
 int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
                              RAMBlock *rb)
 {
+    size_t pagesize = qemu_ram_pagesize(rb);
     trace_postcopy_place_page_zero(host);
 
-    if (qemu_ram_pagesize(rb) == getpagesize()) {
-        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, getpagesize(),
-                                rb)) {
+    /* Normal RAMBlocks can zero a page using UFFDIO_ZEROPAGE
+     * but it's not available for everything (e.g. hugetlbpages)
+     */
+    if (qemu_ram_is_uf_zeroable(rb)) {
+        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, pagesize, rb)) {
             int e = errno;
             error_report("%s: %s zero host: %p",
                          __func__, strerror(e), host);
-- 
2.13.5
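
The capability check above can be sketched on its own: the mask reported
when a range is registered is tested for one bit, and the result is cached
as a flag on the block.  This is a sketch under assumptions: the DEMO_
constants are illustrative stand-ins (real code takes _UFFDIO_ZEROPAGE from
<linux/userfaultfd.h>), and struct demo_zp_block is a hypothetical
replacement for RAMBlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins, not taken from the kernel headers. */
#define DEMO_UFFDIO_ZEROPAGE 0x04        /* bit number in the ioctls mask */
#define DEMO_RAM_UF_ZEROPAGE (1u << 3)   /* flag cached on the block */

/* Hypothetical cut-down block carrying only the flags word. */
struct demo_zp_block {
    uint32_t flags;
};

/* Record zeropage support if the registration mask reports it. */
void demo_note_zeropage(struct demo_zp_block *rb, uint64_t ioctls)
{
    assert(rb != NULL);
    if (ioctls & ((uint64_t)1 << DEMO_UFFDIO_ZEROPAGE)) {
        rb->flags |= DEMO_RAM_UF_ZEROPAGE;
    }
}

/* Later callers test the cached flag instead of comparing page sizes. */
bool demo_is_uf_zeroable(const struct demo_zp_block *rb)
{
    return (rb->flags & DEMO_RAM_UF_ZEROPAGE) != 0;
}
```

Caching the answer per block is what lets the zero-page path stop assuming
that "normal page size" implies "zeropage is available".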


* [Qemu-devel] [RFC v2 07/32] postcopy: Add notifier chain
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (5 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 06/32] postcopy: use UFFDIO_ZEROPAGE only when available Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-29  6:02     ` Peter Xu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 08/32] postcopy: Add vhost-user flag for postcopy and check it Dr. David Alan Gilbert (git)
                     ` (25 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a notifier chain for postcopy with a 'reason' flag
and an opportunity for a notifier member to return an error.

Call it when enabling postcopy.

This will initially be used to enable devices to declare they're unable
to postcopy and later to notify devices of stages within postcopy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/postcopy-ram.c | 41 +++++++++++++++++++++++++++++++++++++++++
 migration/postcopy-ram.h | 26 ++++++++++++++++++++++++++
 vl.c                     |  2 ++
 3 files changed, 69 insertions(+)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 640b72d86d..95007c00ef 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -23,6 +23,8 @@
 #include "savevm.h"
 #include "postcopy-ram.h"
 #include "ram.h"
+#include "qapi/error.h"
+#include "qemu/notify.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/balloon.h"
 #include "qemu/error-report.h"
@@ -45,6 +47,38 @@ struct PostcopyDiscardState {
     unsigned int nsentcmds;
 };
 
+/* A notifier chain for postcopy
+ * The notifier should return 0 if it's OK, or a
+ * -errno on error.
+ * The notifier should expect a struct PostcopyNotifyData * as its data
+ */
+static NotifierWithReturnList postcopy_notifier_list;
+
+void postcopy_infrastructure_init(void)
+{
+    notifier_with_return_list_init(&postcopy_notifier_list);
+}
+
+void postcopy_add_notifier(NotifierWithReturn *nn)
+{
+    notifier_with_return_list_add(&postcopy_notifier_list, nn);
+}
+
+void postcopy_remove_notifier(NotifierWithReturn *n)
+{
+    notifier_with_return_remove(n);
+}
+
+int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp)
+{
+    struct PostcopyNotifyData pnd;
+    pnd.reason = reason;
+    pnd.errp = errp;
+
+    return notifier_with_return_list_notify(&postcopy_notifier_list,
+                                            &pnd);
+}
+
 /* Postcopy needs to detect accesses to pages that haven't yet been copied
  * across, and efficiently map new pages in, the techniques for doing this
  * are target OS specific.
@@ -133,6 +167,7 @@ bool postcopy_ram_supported_by_host(void)
     struct uffdio_register reg_struct;
     struct uffdio_range range_struct;
     uint64_t feature_mask;
+    Error *local_err = NULL;
 
     if (qemu_target_page_size() > pagesize) {
         error_report("Target page size bigger than host page size");
@@ -146,6 +181,12 @@ bool postcopy_ram_supported_by_host(void)
         goto out;
     }
 
+    /* Give devices a chance to object */
+    if (postcopy_notify(POSTCOPY_NOTIFY_PROBE, &local_err)) {
+        error_report_err(local_err);
+        goto out;
+    }
+
     /* Version and features check */
     if (!ufd_version_check(ufd)) {
         goto out;
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 78a3591322..d688411674 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -114,4 +114,30 @@ PostcopyState postcopy_state_get(void);
 /* Set the state and return the old state */
 PostcopyState postcopy_state_set(PostcopyState new_state);
 
+/*
+ * To be called once at the start before any device initialisation
+ */
+void postcopy_infrastructure_init(void);
+
+/* Add a notifier to a list to be called when checking whether the devices
+ * can support postcopy.
+ * Its data is a struct PostcopyNotifyData *
+ * It should return 0 if OK, or a negative value on failure.
+ * On failure it must set the data->errp to an error.
+ *
+ */
+enum PostcopyNotifyReason {
+    POSTCOPY_NOTIFY_PROBE = 0,
+};
+
+struct PostcopyNotifyData {
+    enum PostcopyNotifyReason reason;
+    Error **errp;
+};
+
+void postcopy_add_notifier(NotifierWithReturn *nn);
+void postcopy_remove_notifier(NotifierWithReturn *n);
+/* Call the notifier list set by postcopy_add_notifier */
+int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp);
+
 #endif
diff --git a/vl.c b/vl.c
index 8e247cc2a2..65dd9dc324 100644
--- a/vl.c
+++ b/vl.c
@@ -95,6 +95,7 @@ int main(int argc, char **argv)
 #include "audio/audio.h"
 #include "sysemu/cpus.h"
 #include "migration/colo.h"
+#include "migration/postcopy-ram.h"
 #include "sysemu/kvm.h"
 #include "sysemu/hax.h"
 #include "qapi/qobject-input-visitor.h"
@@ -3082,6 +3083,7 @@ int main(int argc, char **argv, char **envp)
     module_call_init(MODULE_INIT_OPTS);
 
     runstate_init();
+    postcopy_infrastructure_init();
 
     if (qcrypto_init(&err) < 0) {
         error_reportf_err(err, "cannot initialize crypto: ");
-- 
2.13.5
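
The notifier chain above walks a list of callbacks and stops at the first
one that objects.  A self-contained sketch of that pattern follows; it is
not QEMU's NotifierWithReturnList, and every demo_ name is hypothetical.
The data pointer here is a plain string slot rather than the patch's
PostcopyNotifyData.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical notifier: a callback plus a link to the next entry. */
struct demo_notifier {
    int (*notify)(struct demo_notifier *n, void *data);
    struct demo_notifier *next;
};

struct demo_notifier_list {
    struct demo_notifier *head;
};

void demo_list_add(struct demo_notifier_list *l, struct demo_notifier *n)
{
    assert(n->notify != NULL);
    n->next = l->head;
    l->head = n;
}

/* Call each notifier in turn; the first non-zero return ends the walk. */
int demo_list_notify(struct demo_notifier_list *l, void *data)
{
    for (struct demo_notifier *n = l->head; n != NULL; n = n->next) {
        int ret = n->notify(n, data);
        if (ret != 0) {
            return ret;
        }
    }
    return 0;
}

/* A device that has no objection. */
int demo_accept(struct demo_notifier *n, void *data)
{
    (void)n;
    (void)data;
    return 0;
}

/* A device that objects and leaves a reason for the caller. */
int demo_object(struct demo_notifier *n, void *data)
{
    (void)n;
    *(const char **)data = "device cannot postcopy";
    return -ENOENT;
}
```

A caller would add one notifier per device and treat any non-zero result
from demo_list_notify() as "postcopy not supported", reporting the reason
left behind by the objecting notifier.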


* [Qemu-devel] [RFC v2 08/32] postcopy: Add vhost-user flag for postcopy and check it
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (6 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 07/32] postcopy: Add notifier chain Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-29  6:22     ` Peter Xu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 09/32] vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message Dr. David Alan Gilbert (git)
                     ` (24 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a vhost feature flag for postcopy support, and
use the postcopy notifier to check it before allowing postcopy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.h |  1 +
 docs/interop/vhost-user.txt           | 10 +++++++++
 hw/virtio/vhost-user.c                | 40 ++++++++++++++++++++++++++++++++++-
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index acd019876d..95d0d34a28 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -34,6 +34,7 @@ enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_MQ = 0,
     VHOST_USER_PROTOCOL_F_LOG_SHMFD = 1,
     VHOST_USER_PROTOCOL_F_RARP = 2,
+    VHOST_USER_PROTOCOL_F_PAGEFAULT = 7,
 
     VHOST_USER_PROTOCOL_F_MAX
 };
diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index 954771d0d8..a279560eb0 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -273,6 +273,15 @@ Once the source has finished migration, rings will be stopped by
 the source. No further update must be done before rings are
 restarted.
 
+In postcopy migration the slave is started before all the memory has been
+received from the source host, and care must be taken to avoid accessing pages
+that have yet to be received.  The slave opens a 'userfault'-fd and registers
+the memory with it; this fd is then passed back over to the master.
+The master services requests on the userfaultfd for pages that are accessed
+and when the page is available it performs WAKE ioctls on the userfaultfd
+to wake the stalled slave.  The slave indicates support for this via the
+VHOST_USER_PROTOCOL_F_PAGEFAULT feature.
+
 IOMMU support
 -------------
 
@@ -327,6 +336,7 @@ Protocol features
 #define VHOST_USER_PROTOCOL_F_MTU            4
 #define VHOST_USER_PROTOCOL_F_SLAVE_REQ      5
 #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN   6
+#define VHOST_USER_PROTOCOL_F_PAGEFAULT      7
 
 Master message types
 --------------------
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 093675ed98..c51bbd1296 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -17,6 +17,8 @@
 #include "sysemu/kvm.h"
 #include "qemu/error-report.h"
 #include "qemu/sockets.h"
+#include "migration/migration.h"
+#include "migration/postcopy-ram.h"
 
 #include <sys/ioctl.h>
 #include <sys/socket.h>
@@ -34,7 +36,7 @@ enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_NET_MTU = 4,
     VHOST_USER_PROTOCOL_F_SLAVE_REQ = 5,
     VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,
-
+    VHOST_USER_PROTOCOL_F_PAGEFAULT = 7,
     VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -123,8 +125,10 @@ static VhostUserMsg m __attribute__ ((unused));
 #define VHOST_USER_VERSION    (0x1)
 
 struct vhost_user {
+    struct vhost_dev *dev;
     CharBackend *chr;
     int slave_fd;
+    NotifierWithReturn postcopy_notifier;
 };
 
 static bool ioeventfd_enabled(void)
@@ -720,6 +724,33 @@ out:
     return ret;
 }
 
+static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
+                                        void *opaque)
+{
+    struct PostcopyNotifyData *pnd = opaque;
+    struct vhost_user *u = container_of(notifier, struct vhost_user,
+                                         postcopy_notifier);
+    struct vhost_dev *dev = u->dev;
+
+    switch (pnd->reason) {
+    case POSTCOPY_NOTIFY_PROBE:
+        if (!virtio_has_feature(dev->protocol_features,
+                                VHOST_USER_PROTOCOL_F_PAGEFAULT)) {
+            /* TODO: Get the device name into this error somehow */
+            error_setg(pnd->errp,
+                       "vhost-user backend not capable of postcopy");
+            return -ENOENT;
+        }
+        break;
+
+    default:
+        /* We ignore notifications we don't know */
+        break;
+    }
+
+    return 0;
+}
+
 static int vhost_user_init(struct vhost_dev *dev, void *opaque)
 {
     uint64_t features, protocol_features;
@@ -731,6 +762,7 @@ static int vhost_user_init(struct vhost_dev *dev, void *opaque)
     u = g_new0(struct vhost_user, 1);
     u->chr = opaque;
     u->slave_fd = -1;
+    u->dev = dev;
     dev->opaque = u;
 
     err = vhost_user_get_features(dev, &features);
@@ -787,6 +819,9 @@ static int vhost_user_init(struct vhost_dev *dev, void *opaque)
         return err;
     }
 
+    u->postcopy_notifier.notify = vhost_user_postcopy_notifier;
+    postcopy_add_notifier(&u->postcopy_notifier);
+
     return 0;
 }
 
@@ -797,6 +832,9 @@ static int vhost_user_cleanup(struct vhost_dev *dev)
     assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER);
 
     u = dev->opaque;
+    if (u->postcopy_notifier.notify) {
+        postcopy_remove_notifier(&u->postcopy_notifier);
+    }
     if (u->slave_fd >= 0) {
         qemu_set_fd_handler(u->slave_fd, NULL, NULL, NULL);
         close(u->slave_fd);
-- 
2.13.5


* [Qemu-devel] [RFC v2 09/32] vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (7 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 08/32] postcopy: Add vhost-user flag for postcopy and check it Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30 10:07     ` Marc-André Lureau
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 10/32] vhub: Support sending fds back to qemu Dr. David Alan Gilbert (git)
                     ` (23 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Wire up a notifier to send a VHOST_USER_POSTCOPY_ADVISE
message on an incoming advise.

Later patches will fill in the behaviour/contents of the
message.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
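Not part of the patch: a minimal sketch of the notifier flow described
above, under stated assumptions. The types `struct notifier` and
`struct backend`, the functions `backend_notifier` and `notify_all`, and
the `advise_sent` counter are hypothetical stand-ins for QEMU's
NotifierWithReturn list and for the real message exchange. Only the two
notify reasons and the feature bit number (7) are taken from this series.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Reasons mirror enum PostcopyNotifyReason from this series. */
enum notify_reason { NOTIFY_PROBE = 0, NOTIFY_INBOUND_ADVISE };

#define PROTOCOL_F_PAGEFAULT 7 /* bit number used by this series */

/* Hypothetical stand-in for a vhost-user device's state. */
struct backend {
    uint64_t protocol_features; /* bits negotiated with the slave */
    int advise_sent;            /* counts simulated ADVISE messages */
};

/* Hypothetical stand-in for NotifierWithReturn. */
struct notifier {
    int (*notify)(struct notifier *n, enum notify_reason reason);
    struct backend *be;
};

static int backend_notifier(struct notifier *n, enum notify_reason reason)
{
    struct backend *be = n->be;

    switch (reason) {
    case NOTIFY_PROBE:
        /* Refuse postcopy if the slave did not offer the feature bit. */
        if (!(be->protocol_features & (1ULL << PROTOCOL_F_PAGEFAULT))) {
            return -ENOENT;
        }
        break;
    case NOTIFY_INBOUND_ADVISE:
        /* The real code sends VHOST_USER_POSTCOPY_ADVISE here. */
        be->advise_sent++;
        break;
    default:
        /* Unknown reasons are ignored, as in the patch. */
        break;
    }
    return 0;
}

/* Call every notifier in order; stop at the first non-zero return. */
static int notify_all(struct notifier *list, size_t n,
                      enum notify_reason reason)
{
    for (size_t i = 0; i < n; i++) {
        int ret = list[i].notify(&list[i], reason);
        if (ret) {
            return ret;
        }
    }
    return 0;
}
```

In this sketch a single backend without the feature bit is enough to make
the probe fail for the whole list, which is the behaviour the PROBE case
in the patch relies on.
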
 contrib/libvhost-user/libvhost-user.c | 21 ++++++++++++---
 contrib/libvhost-user/libvhost-user.h |  6 ++++-
 docs/interop/vhost-user.txt           |  9 +++++++
 hw/virtio/vhost-user.c                | 48 +++++++++++++++++++++++++++++++++++
 migration/postcopy-ram.h              |  1 +
 migration/savevm.c                    |  6 +++++
 6 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 201b9846e9..8bbdf5fb40 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -42,9 +42,6 @@ vu_request_to_string(int req)
         REQ(VHOST_USER_NONE),
         REQ(VHOST_USER_GET_FEATURES),
         REQ(VHOST_USER_SET_FEATURES),
-        REQ(VHOST_USER_NONE),
-        REQ(VHOST_USER_GET_FEATURES),
-        REQ(VHOST_USER_SET_FEATURES),
         REQ(VHOST_USER_SET_OWNER),
         REQ(VHOST_USER_RESET_OWNER),
         REQ(VHOST_USER_SET_MEM_TABLE),
@@ -62,7 +59,10 @@ vu_request_to_string(int req)
         REQ(VHOST_USER_GET_QUEUE_NUM),
         REQ(VHOST_USER_SET_VRING_ENABLE),
         REQ(VHOST_USER_SEND_RARP),
-        REQ(VHOST_USER_INPUT_GET_CONFIG),
+        REQ(VHOST_USER_SET_SLAVE_REQ_FD),
+        REQ(VHOST_USER_IOTLB_MSG),
+        REQ(VHOST_USER_SET_VRING_ENDIAN),
+        REQ(VHOST_USER_POSTCOPY_ADVISE),
         REQ(VHOST_USER_MAX),
     };
 #undef REQ
@@ -744,6 +744,17 @@ vu_set_vring_enable_exec(VuDev *dev, VhostUserMsg *vmsg)
 }
 
 static bool
+vu_set_postcopy_advise(VuDev *dev, VhostUserMsg *vmsg)
+{
+    /* TODO: Open ufd, pass it back in the request
+     * TODO: Add addresses 
+     */
+    vmsg->payload.u64 = 0xcafe;
+    vmsg->size = sizeof(vmsg->payload.u64);
+    return true; /* = send a reply */
+}
+
+static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
     int do_reply = 0;
@@ -808,6 +819,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
         return vu_set_vring_enable_exec(dev, vmsg);
     case VHOST_USER_NONE:
         break;
+    case VHOST_USER_POSTCOPY_ADVISE:
+        return vu_set_postcopy_advise(dev, vmsg);
     default:
         vmsg_close_fds(vmsg);
         vu_panic(dev, "Unhandled request: %d", vmsg->request);
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index 95d0d34a28..3987ce643d 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -62,7 +62,11 @@ typedef enum VhostUserRequest {
     VHOST_USER_GET_QUEUE_NUM = 17,
     VHOST_USER_SET_VRING_ENABLE = 18,
     VHOST_USER_SEND_RARP = 19,
-    VHOST_USER_INPUT_GET_CONFIG = 20,
+    VHOST_USER_NET_SET_MTU      = 20,
+    VHOST_USER_SET_SLAVE_REQ_FD = 21,
+    VHOST_USER_IOTLB_MSG        = 22,
+    VHOST_USER_SET_VRING_ENDIAN = 23,
+    VHOST_USER_POSTCOPY_ADVISE  = 24,
     VHOST_USER_MAX
 } VhostUserRequest;
 
diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index a279560eb0..dad2a1b343 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -606,6 +606,15 @@ Master message types
       and expect this message once (per VQ) during device configuration
       (ie. before the master starts the VQ).
 
+ * VHOST_USER_POSTCOPY_ADVISE
+      Id: 24
+      Master payload: N/A
+      Slave payload: userfault fd + u64
+
+      Master advises slave that a migration with postcopy enabled is underway;
+      the slave must open a userfaultfd for later use.
+      Note that at this stage the migration is still in precopy mode.
+
 Slave message types
 -------------------
 
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index c51bbd1296..7063e4df61 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -67,6 +67,7 @@ typedef enum VhostUserRequest {
     VHOST_USER_SET_SLAVE_REQ_FD = 21,
     VHOST_USER_IOTLB_MSG = 22,
     VHOST_USER_SET_VRING_ENDIAN = 23,
+    VHOST_USER_POSTCOPY_ADVISE  = 24,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -724,6 +725,50 @@ out:
     return ret;
 }
 
+/*
+ * Called at the start of an inbound postcopy on reception of the
+ * 'advise' command.
+ */
+static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
+{
+    struct vhost_user *u = dev->opaque;
+    CharBackend *chr = u->chr;
+    int ufd;
+    VhostUserMsg msg = {
+        .request = VHOST_USER_POSTCOPY_ADVISE,
+        .flags = VHOST_USER_VERSION,
+    };
+
+    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
+        error_setg(errp, "Failed to send postcopy_advise to vhost");
+        return -1;
+    }
+
+    if (vhost_user_read(dev, &msg) < 0) {
+        error_setg(errp, "Failed to get postcopy_advise reply from vhost");
+        return -1;
+    }
+
+    if (msg.request != VHOST_USER_POSTCOPY_ADVISE) {
+        error_setg(errp, "Unexpected msg type. Expected %d received %d",
+                     VHOST_USER_POSTCOPY_ADVISE, msg.request);
+        return -1;
+    }
+
+    if (msg.size != sizeof(msg.payload.u64)) {
+        error_setg(errp, "Received bad msg size.");
+        return -1;
+    }
+    ufd = qemu_chr_fe_get_msgfd(chr);
+    if (ufd < 0) {
+        error_setg(errp, "%s: Failed to get ufd", __func__);
+        return -1;
+    }
+
+    /* TODO: register ufd with userfault thread */
+    return 0;
+}
+
 static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
                                         void *opaque)
 {
@@ -743,6 +788,9 @@ static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
         }
         break;
 
+    case POSTCOPY_NOTIFY_INBOUND_ADVISE:
+        return vhost_user_postcopy_advise(dev, pnd->errp);
+
     default:
         /* We ignore notifications we don't know */
         break;
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index d688411674..70d4b09659 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -128,6 +128,7 @@ void postcopy_infrastructure_init(void);
  */
 enum PostcopyNotifyReason {
     POSTCOPY_NOTIFY_PROBE = 0,
+    POSTCOPY_NOTIFY_INBOUND_ADVISE,
 };
 
 struct PostcopyNotifyData {
diff --git a/migration/savevm.c b/migration/savevm.c
index fdd15fa0a7..d35911731d 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1343,6 +1343,7 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
 {
     PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_ADVISE);
     uint64_t remote_pagesize_summary, local_pagesize_summary, remote_tps;
+    Error *local_err = NULL;
 
     trace_loadvm_postcopy_handle_advise();
     if (ps != POSTCOPY_INCOMING_NONE) {
@@ -1390,6 +1391,11 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
         return -1;
     }
 
+    if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_ADVISE, &local_err)) {
+        error_report_err(local_err);
+        return -1;
+    }
+
     if (ram_postcopy_incoming_init(mis)) {
         return -1;
     }
-- 
2.13.5


* [Qemu-devel] [RFC v2 10/32] vhub: Support sending fds back to qemu
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (8 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 09/32] vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30 10:22     ` Marc-André Lureau
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 11/32] vhub: Open userfaultfd Dr. David Alan Gilbert (git)
                     ` (22 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Allow replies with fds (for postcopy)

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
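Not part of the patch: a self-contained sketch of passing descriptors in
a reply over a Unix socket with SCM_RIGHTS, which is the mechanism the
change below adds to vu_message_write(). The helper names
`send_with_fds` and `recv_with_fd` and the `MAX_FDS` bound are
assumptions made for illustration; they are not libvhost-user API. The
sketch attaches control data only when there is at least one descriptor
to send.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_FDS 8 /* assumption: same bound as VHOST_MEMORY_MAX_NREGIONS */

/* Send len bytes of buf, attaching nfds descriptors as SCM_RIGHTS data. */
static ssize_t send_with_fds(int sock, const void *buf, size_t len,
                             const int *fds, size_t nfds)
{
    /* The union gives the control buffer the alignment cmsghdr needs. */
    union {
        char space[CMSG_SPACE(MAX_FDS * sizeof(int))];
        struct cmsghdr align;
    } control;
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    if (nfds > MAX_FDS) {
        return -1;
    }
    if (nfds > 0) {
        size_t fdsize = nfds * sizeof(int);
        struct cmsghdr *cmsg;

        memset(&control, 0, sizeof(control));
        msg.msg_control = control.space;
        msg.msg_controllen = CMSG_SPACE(fdsize);
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_len = CMSG_LEN(fdsize);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        memcpy(CMSG_DATA(cmsg), fds, fdsize);
    }
    return sendmsg(sock, &msg, 0);
}

/* Receive up to len bytes; store one passed descriptor in *fd, or -1. */
static ssize_t recv_with_fd(int sock, void *buf, size_t len, int *fd)
{
    union {
        char space[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } control;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = control.space,
        .msg_controllen = sizeof(control.space),
    };
    struct cmsghdr *c;
    ssize_t n = recvmsg(sock, &msg, 0);

    *fd = -1;
    if (n < 0) {
        return n;
    }
    for (c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS &&
            c->cmsg_len >= CMSG_LEN(sizeof(int))) {
            memcpy(fd, CMSG_DATA(c), sizeof(int));
            break;
        }
    }
    return n;
}
```

The receiver gets a new descriptor number that refers to the same open
file, so the sender may close its own copy once sendmsg() has returned.
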
 contrib/libvhost-user/libvhost-user.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 8bbdf5fb40..47884c0a15 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -213,6 +213,30 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
 {
     int rc;
     uint8_t *p = (uint8_t *)vmsg;
+    char control[CMSG_SPACE(VHOST_MEMORY_MAX_NREGIONS * sizeof(int))] = { };
+    struct iovec iov = {
+        .iov_base = (char *)vmsg,
+        .iov_len = VHOST_USER_HDR_SIZE,
+    };
+    struct msghdr msg = {
+        .msg_iov = &iov,
+        .msg_iovlen = 1,
+        .msg_control = control,
+    };
+    struct cmsghdr *cmsg;
+
+    memset(control, 0, sizeof(control));
+    if (vmsg->fd_num) {
+        size_t fdsize = vmsg->fd_num * sizeof(int);
+        msg.msg_controllen = CMSG_SPACE(fdsize);
+        cmsg = CMSG_FIRSTHDR(&msg);
+        cmsg->cmsg_len = CMSG_LEN(fdsize);
+        cmsg->cmsg_level = SOL_SOCKET;
+        cmsg->cmsg_type = SCM_RIGHTS;
+        memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
+    } else {
+        msg.msg_controllen = 0;
+    }
 
     /* Set the version in the flags when sending the reply */
     vmsg->flags &= ~VHOST_USER_VERSION_MASK;
@@ -220,7 +244,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
     vmsg->flags |= VHOST_USER_REPLY_MASK;
 
     do {
-        rc = write(conn_fd, p, VHOST_USER_HDR_SIZE);
+        rc = sendmsg(conn_fd, &msg, 0);
     } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
 
     do {
@@ -313,6 +337,7 @@ vu_get_features_exec(VuDev *dev, VhostUserMsg *vmsg)
     }
 
     vmsg->size = sizeof(vmsg->payload.u64);
+    vmsg->fd_num = 0;
 
     DPRINT("Sending back to guest u64: 0x%016"PRIx64"\n", vmsg->payload.u64);
 
@@ -454,6 +479,7 @@ vu_set_log_base_exec(VuDev *dev, VhostUserMsg *vmsg)
     dev->log_size = log_mmap_size;
 
     vmsg->size = sizeof(vmsg->payload.u64);
+    vmsg->fd_num = 0;
 
     return true;
 }
@@ -698,6 +724,7 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg *vmsg)
 
     vmsg->payload.u64 = features;
     vmsg->size = sizeof(vmsg->payload.u64);
+    vmsg->fd_num = 0;
 
     return true;
 }
-- 
2.13.5


* [Qemu-devel] [RFC v2 11/32] vhub: Open userfaultfd
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (9 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 10/32] vhub: Support sending fds back to qemu Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-29  6:40     ` Peter Xu
  2017-08-30 10:30     ` Marc-André Lureau
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 12/32] postcopy: Allow registering of fd handler Dr. David Alan Gilbert (git)
                     ` (21 subsequent siblings)
  32 siblings, 2 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Open a userfaultfd (on a postcopy_advise) and send it back to QEMU in
the reply so that QEMU can monitor it.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
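Not part of the patch: a Linux-only sketch of the two steps performed
below, opening a userfaultfd and negotiating the API with UFFDIO_API.
The helper names `open_userfaultfd` and `reply_fd_count` are
hypothetical. One deliberate difference from the patch as posted: on
failure this sketch attaches no descriptor to the reply, whereas the
patch still sets fd_num to 1 with fds[0] equal to -1. Whether the
syscall succeeds depends on kernel support and on whether unprivileged
use is allowed, so the caller has to cope with -1.

```c
#define _GNU_SOURCE /* for syscall() and O_CLOEXEC */
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Open a userfaultfd and negotiate the API version.
 * Returns the descriptor, or -1 with errno set if the kernel refuses.
 */
static int open_userfaultfd(void)
{
    struct uffdio_api api = { .api = UFFD_API, .features = 0 };
    int ufd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    if (ufd < 0) {
        return -1;
    }
    if (ioctl(ufd, UFFDIO_API, &api) < 0) {
        int saved = errno; /* close() may overwrite errno */

        close(ufd);
        errno = saved;
        return -1;
    }
    /* api.features now holds the features the kernel offers. */
    return ufd;
}

/* Number of descriptors to attach to the reply: none on failure. */
static int reply_fd_count(int ufd)
{
    return ufd >= 0 ? 1 : 0;
}
```
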
 contrib/libvhost-user/libvhost-user.c | 26 +++++++++++++++++++++++---
 contrib/libvhost-user/libvhost-user.h |  3 +++
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 47884c0a15..f9b5b12b28 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -15,6 +15,7 @@
 
 #include <qemu/osdep.h>
 #include <sys/eventfd.h>
+#include <sys/syscall.h>
 #include <linux/vhost.h>
 
 #include "qemu/atomic.h"
@@ -773,11 +774,30 @@ vu_set_vring_enable_exec(VuDev *dev, VhostUserMsg *vmsg)
 static bool
 vu_set_postcopy_advise(VuDev *dev, VhostUserMsg *vmsg)
 {
-    /* TODO: Open ufd, pass it back in the request
-     * TODO: Add addresses 
-     */
+    struct uffdio_api api_struct;
+
+    dev->postcopy_ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+    /* TODO: Add addresses */
     vmsg->payload.u64 = 0xcafe;
     vmsg->size = sizeof(vmsg->payload.u64);
+
+    if (dev->postcopy_ufd == -1) {
+        vu_panic(dev, "Userfaultfd not available: %s", strerror(errno));
+        goto out;
+    }
+    api_struct.api = UFFD_API;
+    api_struct.features = 0;
+    if (ioctl(dev->postcopy_ufd, UFFDIO_API, &api_struct)) {
+        vu_panic(dev, "Failed UFFDIO_API: %s", strerror(errno));
+        close(dev->postcopy_ufd);
+        dev->postcopy_ufd = -1;
+        goto out;
+    }
+    /* TODO: Stash feature flags somewhere */
+out:
+    /* Return the ufd to QEMU */
+    vmsg->fd_num = 1;
+    vmsg->fds[0] = dev->postcopy_ufd;
     return true; /* = send a reply */
 }
 
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index 3987ce643d..3e8efdd919 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -234,6 +234,9 @@ struct VuDev {
      * re-initialize */
     vu_panic_cb panic;
     const VuDevIface *iface;
+
+    /* Postcopy data */
+    int postcopy_ufd;
 };
 
 typedef struct VuVirtqElement {
-- 
2.13.5


* [Qemu-devel] [RFC v2 12/32] postcopy: Allow registering of fd handler
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (10 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 11/32] vhub: Open userfaultfd Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 13/32] vhost+postcopy: Register shared ufd with postcopy Dr. David Alan Gilbert (git)
                     ` (20 subsequent siblings)
  32 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Allow other userfaultfds to be registered with the fault thread so
that handlers for shared memory are called when a fault arrives on them.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
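Not part of the patch: a simplified sketch of the dispatch pattern the
fault thread uses below, polling a quit fd plus a variable number of
registered fds and calling a per-fd handler. `struct shared_fd`,
`poll_once` and `count_handler` are hypothetical names, and the handler
signature is reduced compared with pcfdhandler. Unlike the patch, which
builds the pollfd array once before entering the loop, the sketch
rebuilds it on every call.

```c
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical, simplified counterpart of struct PostCopyFD. */
struct shared_fd {
    int fd;
    int (*handler)(struct shared_fd *sfd); /* called when fd is readable */
    void *data;                            /* private to the handler */
};

/*
 * Poll quit_fd plus every registered fd once and dispatch handlers.
 * Returns 1 if quit_fd fired, 0 otherwise, -1 on error.
 */
static int poll_once(int quit_fd, struct shared_fd *fds, size_t nfds,
                     int timeout_ms)
{
    size_t pfd_len = 1 + nfds;
    struct pollfd *pfd = calloc(pfd_len, sizeof(*pfd));
    int ready, ret = 0;

    if (!pfd) {
        return -1;
    }
    pfd[0].fd = quit_fd;
    pfd[0].events = POLLIN;
    for (size_t i = 0; i < nfds; i++) {
        pfd[1 + i].fd = fds[i].fd;
        pfd[1 + i].events = POLLIN;
    }

    ready = poll(pfd, pfd_len, timeout_ms);
    if (ready < 0) {
        ret = -1;
    } else if (pfd[0].revents) {
        ret = 1;
    } else {
        /* 'ready' counts fds with events, so stop once all are handled. */
        for (size_t i = 0; i < nfds && ready > 0; i++) {
            if (!pfd[1 + i].revents) {
                continue;
            }
            ready--;
            if ((pfd[1 + i].revents & POLLIN) && fds[i].handler(&fds[i])) {
                ret = -1;
            }
        }
    }
    free(pfd);
    return ret;
}

/* Example handler: drain one byte and count the call in *data. */
static int count_handler(struct shared_fd *sfd)
{
    char c;

    if (read(sfd->fd, &c, 1) != 1) {
        return -1;
    }
    (*(int *)sfd->data)++;
    return 0;
}
```

Rebuilding the array per call costs an allocation but picks up fds
registered or removed between calls, which is the case the TODO in the
patch about devices deregistering during postcopy leaves open.
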
 migration/migration.c    |   3 +
 migration/migration.h    |   2 +
 migration/postcopy-ram.c | 212 +++++++++++++++++++++++++++++++++++------------
 migration/postcopy-ram.h |  21 +++++
 migration/trace-events   |   2 +
 5 files changed, 186 insertions(+), 54 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index c3fe0ed9ca..2c43d730e2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -144,6 +144,8 @@ MigrationIncomingState *migration_incoming_get_current(void)
     if (!once) {
         mis_current.state = MIGRATION_STATUS_NONE;
         memset(&mis_current, 0, sizeof(MigrationIncomingState));
+        mis_current.postcopy_remote_fds = g_array_new(FALSE, TRUE,
+                                                   sizeof(struct PostCopyFD));
         qemu_mutex_init(&mis_current.rp_mutex);
         qemu_event_init(&mis_current.main_thread_load_event, false);
         once = true;
@@ -166,6 +168,7 @@ void migration_incoming_state_destroy(void)
         qemu_fclose(mis->from_src_file);
         mis->from_src_file = NULL;
     }
+    g_array_free(mis->postcopy_remote_fds, TRUE);
 
     qemu_event_destroy(&mis->main_thread_load_event);
 }
diff --git a/migration/migration.h b/migration/migration.h
index 148c9facbc..9fcea6bb25 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -48,6 +48,8 @@ struct MigrationIncomingState {
     QemuMutex rp_mutex;    /* We send replies from multiple threads */
     void     *postcopy_tmp_page;
     void     *postcopy_tmp_zero_page;
+    /* PostCopyFD's for external userfaultfds & handlers of shared memory */
+    GArray   *postcopy_remote_fds;
 
     QEMUBH *bh;
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 95007c00ef..faee7708ff 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -466,29 +466,43 @@ static void *postcopy_ram_fault_thread(void *opaque)
     MigrationIncomingState *mis = opaque;
     struct uffd_msg msg;
     int ret;
+    size_t index;
     RAMBlock *rb = NULL;
     RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
 
     trace_postcopy_ram_fault_thread_entry();
     qemu_sem_post(&mis->fault_thread_sem);
 
+    struct pollfd *pfd;
+    size_t pfd_len = 2 + mis->postcopy_remote_fds->len;
+
+    pfd = g_new0(struct pollfd, pfd_len);
+
+    pfd[0].fd = mis->userfault_fd;
+    pfd[0].events = POLLIN;
+    pfd[1].fd = mis->userfault_quit_fd;
+    pfd[1].events = POLLIN; /* Waiting for eventfd to go positive */
+    trace_postcopy_ram_fault_thread_fds_core(pfd[0].fd, pfd[1].fd);
+    for (index = 0; index < mis->postcopy_remote_fds->len; index++) {
+        struct PostCopyFD *pcfd = &g_array_index(mis->postcopy_remote_fds,
+                                                 struct PostCopyFD, index);
+        pfd[2 + index].fd = pcfd->fd;
+        pfd[2 + index].events = POLLIN;
+        trace_postcopy_ram_fault_thread_fds_extra(2 + index, pcfd->idstr,
+                                                  pcfd->fd);
+    }
+
     while (true) {
         ram_addr_t rb_offset;
-        struct pollfd pfd[2];
+        int poll_result;
 
         /*
          * We're mainly waiting for the kernel to give us a faulting HVA,
          * however we can be told to quit via userfault_quit_fd which is
          * an eventfd
          */
-        pfd[0].fd = mis->userfault_fd;
-        pfd[0].events = POLLIN;
-        pfd[0].revents = 0;
-        pfd[1].fd = mis->userfault_quit_fd;
-        pfd[1].events = POLLIN; /* Waiting for eventfd to go positive */
-        pfd[1].revents = 0;
-
-        if (poll(pfd, 2, -1 /* Wait forever */) == -1) {
+        poll_result = poll(pfd, pfd_len, -1 /* Wait forever */);
+        if (poll_result == -1) {
             error_report("%s: userfault poll: %s", __func__, strerror(errno));
             break;
         }
@@ -498,57 +512,118 @@ static void *postcopy_ram_fault_thread(void *opaque)
             break;
         }
 
-        ret = read(mis->userfault_fd, &msg, sizeof(msg));
-        if (ret != sizeof(msg)) {
-            if (errno == EAGAIN) {
-                /*
-                 * if a wake up happens on the other thread just after
-                 * the poll, there is nothing to read.
-                 */
-                continue;
+        if (pfd[0].revents) {
+            poll_result--;
+            ret = read(mis->userfault_fd, &msg, sizeof(msg));
+            if (ret != sizeof(msg)) {
+                if (errno == EAGAIN) {
+                    /*
+                     * if a wake up happens on the other thread just after
+                     * the poll, there is nothing to read.
+                     */
+                    continue;
+                }
+                if (ret < 0) {
+                    error_report("%s: Failed to read full userfault "
+                                 "message: %s",
+                                 __func__, strerror(errno));
+                    break;
+                } else {
+                    error_report("%s: Read %d bytes from userfaultfd "
+                                 "expected %zd",
+                                 __func__, ret, sizeof(msg));
+                    break; /* Lost alignment, don't know what we'd read next */
+                }
+            }
+            if (msg.event != UFFD_EVENT_PAGEFAULT) {
+                error_report("%s: Read unexpected event %u from userfaultfd",
+                             __func__, msg.event);
+                continue; /* It's not a page fault, shouldn't happen */
             }
-            if (ret < 0) {
-                error_report("%s: Failed to read full userfault message: %s",
-                             __func__, strerror(errno));
+
+            rb = qemu_ram_block_from_host(
+                     (void *)(uintptr_t)msg.arg.pagefault.address,
+                     true, &rb_offset);
+            if (!rb) {
+                error_report("postcopy_ram_fault_thread: Fault outside guest: %"
+                             PRIx64, (uint64_t)msg.arg.pagefault.address);
                 break;
-            } else {
-                error_report("%s: Read %d bytes from userfaultfd expected %zd",
-                             __func__, ret, sizeof(msg));
-                break; /* Lost alignment, don't know what we'd read next */
             }
-        }
-        if (msg.event != UFFD_EVENT_PAGEFAULT) {
-            error_report("%s: Read unexpected event %ud from userfaultfd",
-                         __func__, msg.event);
-            continue; /* It's not a page fault, shouldn't happen */
-        }
 
-        rb = qemu_ram_block_from_host(
-                 (void *)(uintptr_t)msg.arg.pagefault.address,
-                 true, &rb_offset);
-        if (!rb) {
-            error_report("postcopy_ram_fault_thread: Fault outside guest: %"
-                         PRIx64, (uint64_t)msg.arg.pagefault.address);
-            break;
-        }
+            rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
+            trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
+                                                    qemu_ram_get_idstr(rb),
+                                                    rb_offset);
 
-        rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
-        trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
-                                                qemu_ram_get_idstr(rb),
-                                                rb_offset);
+            /*
+             * Send the request to the source - we want to request one
+             * of our host page sizes (which is >= TPS)
+             */
+            if (rb != last_rb) {
+                last_rb = rb;
+                migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
+                                         rb_offset, qemu_ram_pagesize(rb));
+            } else {
+                /* Save some space */
+                migrate_send_rp_req_pages(mis, NULL,
+                                         rb_offset, qemu_ram_pagesize(rb));
+            }
+        }
 
-        /*
-         * Send the request to the source - we want to request one
-         * of our host page sizes (which is >= TPS)
-         */
-        if (rb != last_rb) {
-            last_rb = rb;
-            migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
-                                     rb_offset, qemu_ram_pagesize(rb));
-        } else {
-            /* Save some space */
-            migrate_send_rp_req_pages(mis, NULL,
-                                     rb_offset, qemu_ram_pagesize(rb));
+        /* Now handle any requests from external processes on shared memory */
+        /* TODO: May need to handle devices deregistering during postcopy */
+        for (index = 2; index < pfd_len && poll_result; index++) {
+            if (pfd[index].revents) {
+                struct PostCopyFD *pcfd =
+                    &g_array_index(mis->postcopy_remote_fds,
+                                   struct PostCopyFD, index - 2);
+
+                poll_result--;
+                if (pfd[index].revents & POLLERR) {
+                    error_report("%s: POLLERR on poll %zd fd=%d",
+                                 __func__, index, pcfd->fd);
+                    pfd[index].events = 0;
+                    continue;
+                }
+
+                ret = read(pcfd->fd, &msg, sizeof(msg));
+                if (ret != sizeof(msg)) {
+                    if (errno == EAGAIN) {
+                        /*
+                         * if a wake up happens on the other thread just after
+                         * the poll, there is nothing to read.
+                         */
+                        continue;
+                    }
+                    if (ret < 0) {
+                        error_report("%s: Failed to read full userfault "
+                                     "message: %s (shared) revents=%d",
+                                     __func__, strerror(errno),
+                                     pfd[index].revents);
+                        /*TODO: Could just disable this sharer */
+                        break;
+                    } else {
+                        error_report("%s: Read %d bytes from userfaultfd "
+                                     "expected %zd (shared)",
+                                     __func__, ret, sizeof(msg));
+                        /*TODO: Could just disable this sharer */
+                        break; /*Lost alignment,don't know what we'd read next*/
+                    }
+                }
+                if (msg.event != UFFD_EVENT_PAGEFAULT) {
+                    error_report("%s: Read unexpected event %u "
+                                 "from userfaultfd (shared)",
+                                 __func__, msg.event);
+                    continue; /* It's not a page fault, shouldn't happen */
+                }
+                /* Call the device handler registered with us */
+                ret = pcfd->handler(pcfd, &msg);
+                if (ret) {
+                    error_report("%s: Failed to resolve shared fault on %zd/%s",
+                                 __func__, index, pcfd->idstr);
+                    /* TODO: Fail? Disable this sharer? */
+                }
+            }
         }
     }
     trace_postcopy_ram_fault_thread_exit();
@@ -878,3 +953,32 @@ PostcopyState postcopy_state_set(PostcopyState new_state)
 {
     return atomic_xchg(&incoming_postcopy_state, new_state);
 }
+
+/* Register a handler for external shared memory postcopy
+ * called on the destination.
+ */
+void postcopy_register_shared_ufd(struct PostCopyFD *pcfd)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    mis->postcopy_remote_fds = g_array_append_val(mis->postcopy_remote_fds,
+                                                  *pcfd);
+}
+
+/* Unregister a handler for external shared memory postcopy
+ */
+void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd)
+{
+    guint i;
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    GArray *pcrfds = mis->postcopy_remote_fds;
+
+    for (i = 0; i < pcrfds->len; i++) {
+        struct PostCopyFD *cur = &g_array_index(pcrfds, struct PostCopyFD, i);
+        if (cur->fd == pcfd->fd) {
+            mis->postcopy_remote_fds = g_array_remove_index(pcrfds, i);
+            return;
+        }
+    }
+}
+
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 70d4b09659..ba8a8ffec5 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -141,4 +141,25 @@ void postcopy_remove_notifier(NotifierWithReturn *n);
 /* Call the notifier list set by postcopy_add_start_notifier */
 int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp);
 
+struct PostCopyFD;
+
+/* ufd is a pointer to the struct uffd_msg; TODO: make this more portable */
+typedef int (*pcfdhandler)(struct PostCopyFD *pcfd, void *ufd);
+
+struct PostCopyFD {
+    int fd;
+    /* Data to pass to handler */
+    void *data;
+    /* Handler to be called whenever we get a poll event */
+    pcfdhandler handler;
+    /* A string to use in error messages */
+    char *idstr;
+};
+
+/* Register a userfaultfd owned by an external process for
+ * shared memory.
+ */
+void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
+void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
+
 #endif
diff --git a/migration/trace-events b/migration/trace-events
index 7a3b5144ff..23f4e5339b 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -189,6 +189,8 @@ postcopy_place_page_zero(void *host_addr) "host=%p"
 postcopy_ram_enable_notify(void) ""
 postcopy_ram_fault_thread_entry(void) ""
 postcopy_ram_fault_thread_exit(void) ""
+postcopy_ram_fault_thread_fds_core(int baseufd, int quitfd) "ufd: %d quitfd: %d"
+postcopy_ram_fault_thread_fds_extra(size_t index, const char *name, int fd) "%zd/%s: %d"
 postcopy_ram_fault_thread_quit(void) ""
 postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=0x%" PRIx64 " rb=%s offset=0x%zx"
 postcopy_ram_incoming_cleanup_closeuf(void) ""
-- 
2.13.5
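
The register/unregister pair declared above can be modelled outside QEMU roughly as follows. This is only a minimal sketch under stated assumptions, not the code from the series: it uses a plain realloc-grown array where the patch uses a GArray, and SharedFD, SharedFDRegistry, registry_add, registry_remove, registry_dispatch and counting_handler are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct SharedFD;

/* Hypothetical stand-in for pcfdhandler: called with the entry and a message */
typedef int (*shared_fd_handler)(struct SharedFD *sfd, void *msg);

/* Hypothetical stand-in for struct PostCopyFD */
struct SharedFD {
    int fd;
    void *data;                /* Data to pass to handler */
    shared_fd_handler handler; /* Called when the fd has an event */
    const char *idstr;         /* Used in error messages */
};

struct SharedFDRegistry {
    struct SharedFD *entries;
    size_t len;
    size_t cap;
};

/* Append a copy of *sfd; returns 0 on success, -1 on allocation failure */
static int registry_add(struct SharedFDRegistry *r, const struct SharedFD *sfd)
{
    assert(sfd->handler != NULL);
    if (r->len == r->cap) {
        size_t ncap = r->cap ? r->cap * 2 : 4;
        struct SharedFD *grown = realloc(r->entries, ncap * sizeof(*grown));
        if (!grown) {
            return -1;
        }
        r->entries = grown;
        r->cap = ncap;
    }
    r->entries[r->len++] = *sfd;
    return 0;
}

/* Remove the first entry whose fd matches; returns -1 if none is found */
static int registry_remove(struct SharedFDRegistry *r, int fd)
{
    for (size_t i = 0; i < r->len; i++) {
        if (r->entries[i].fd == fd) {
            memmove(&r->entries[i], &r->entries[i + 1],
                    (r->len - i - 1) * sizeof(r->entries[0]));
            r->len--;
            return 0;
        }
    }
    return -1;
}

/* Call the handler registered for fd; returns -1 if fd is not registered */
static int registry_dispatch(struct SharedFDRegistry *r, int fd, void *msg)
{
    for (size_t i = 0; i < r->len; i++) {
        if (r->entries[i].fd == fd) {
            return r->entries[i].handler(&r->entries[i], msg);
        }
    }
    return -1;
}

/* Example handler: counts how often it was called via the data pointer */
static int counting_handler(struct SharedFD *sfd, void *msg)
{
    (void)msg;
    (*(int *)sfd->data)++;
    return 0;
}
```

Matching on the fd when removing, as the patch does, means the caller only needs the fd value to unregister and not the exact pointer it registered.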

* [Qemu-devel] [RFC v2 13/32] vhost+postcopy: Register shared ufd with postcopy
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (11 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 12/32] postcopy: Allow registering of fd handler Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 14/32] vhost+postcopy: Transmit 'listen' to client Dr. David Alan Gilbert (git)
                     ` (19 subsequent siblings)
  32 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Register the UFD that comes in as the response to the 'advise' method
with the postcopy code.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/virtio/vhost-user.c   | 21 ++++++++++++++++++++-
 migration/postcopy-ram.h |  2 +-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 7063e4df61..b7898f8939 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -24,6 +24,7 @@
 #include <sys/socket.h>
 #include <sys/un.h>
 #include <linux/vhost.h>
+#include <linux/userfaultfd.h>
 
 #define VHOST_MEMORY_MAX_NREGIONS    8
 #define VHOST_USER_F_PROTOCOL_FEATURES 30
@@ -130,6 +131,7 @@ struct vhost_user {
     CharBackend *chr;
     int slave_fd;
     NotifierWithReturn postcopy_notifier;
+    struct PostCopyFD  postcopy_fd;
 };
 
 static bool ioeventfd_enabled(void)
@@ -726,6 +728,17 @@ out:
 }
 
 /*
+ * Called back from the postcopy fault thread when a fault is received on our
+ * ufd.
+ * TODO: This is Linux specific
+ */
+static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
+                                             void *ufd)
+{
+    return 0;
+}
+
+/*
  * Called at the start of an inbound postcopy on reception of the
  * 'advise' command.
  */
@@ -764,8 +777,14 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
         error_setg(errp, "%s: Failed to get ufd", __func__);
         return -1;
     }
+    fcntl(ufd, F_SETFL, O_NONBLOCK);
 
-    /* TODO: register ufd with userfault thread */
+    /* register ufd with userfault thread */
+    u->postcopy_fd.fd = ufd;
+    u->postcopy_fd.data = dev;
+    u->postcopy_fd.handler = vhost_user_postcopy_fault_handler;
+    u->postcopy_fd.idstr = "vhost-user"; /* Need to find unique name */
+    postcopy_register_shared_ufd(&u->postcopy_fd);
     return 0;
 }
 
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index ba8a8ffec5..28c216cc7a 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -153,7 +153,7 @@ struct PostCopyFD {
     /* Handler to be called whenever we get a poll event */
     pcfdhandler handler;
     /* A string to use in error messages */
-    char *idstr;
+    const char *idstr;
 };
 
 /* Register a userfaultfd owned by an external process for
-- 
2.13.5
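
The patch above sets O_NONBLOCK with a single F_SETFL call and does not check the result. Since F_SETFL replaces the whole set of file status flags, a common alternative is to read the current flags first and OR the new one in. The following is a sketch of that variant, offered as an illustration and not as what the series does; set_nonblocking and is_nonblocking are hypothetical helper names.

```c
#include <assert.h>
#include <fcntl.h>

/* Add O_NONBLOCK to fd, keeping its other status flags.
 * Returns 0 on success, -1 on failure (errno is set by fcntl). */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) {
        return -1;
    }
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
        return -1;
    }
    return 0;
}

/* Returns 1 if O_NONBLOCK is set on fd, 0 if not, -1 on error */
static int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) {
        return -1;
    }
    return (flags & O_NONBLOCK) ? 1 : 0;
}
```

For a freshly received userfaultfd the two approaches are likely to behave the same; the difference only matters when the descriptor already carries status flags that should be kept.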

* [Qemu-devel] [RFC v2 14/32] vhost+postcopy: Transmit 'listen' to client
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (12 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 13/32] vhost+postcopy: Register shared ufd with postcopy Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30 10:37     ` Marc-André Lureau
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 15/32] vhost+postcopy: Register new regions with the ufd Dr. David Alan Gilbert (git)
                     ` (18 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Notify the vhost-user client on reception of the 'postcopy-listen'
event from the source.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 21 +++++++++++++++++++++
 contrib/libvhost-user/libvhost-user.h |  2 ++
 docs/interop/vhost-user.txt           |  6 ++++++
 hw/virtio/trace-events                |  3 +++
 hw/virtio/vhost-user.c                | 30 ++++++++++++++++++++++++++++++
 migration/postcopy-ram.h              |  1 +
 migration/savevm.c                    |  7 +++++++
 7 files changed, 70 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index f9b5b12b28..e8accf11db 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -15,7 +15,9 @@
 
 #include <qemu/osdep.h>
 #include <sys/eventfd.h>
+#include <sys/ioctl.h>
 #include <sys/syscall.h>
+#include <linux/userfaultfd.h>
 #include <linux/vhost.h>
 
 #include "qemu/atomic.h"
@@ -64,6 +66,7 @@ vu_request_to_string(int req)
         REQ(VHOST_USER_IOTLB_MSG),
         REQ(VHOST_USER_SET_VRING_ENDIAN),
         REQ(VHOST_USER_POSTCOPY_ADVISE),
+        REQ(VHOST_USER_POSTCOPY_LISTEN),
         REQ(VHOST_USER_MAX),
     };
 #undef REQ
@@ -802,6 +805,22 @@ out:
 }
 
 static bool
+vu_set_postcopy_listen(VuDev *dev, VhostUserMsg *vmsg)
+{
+    vmsg->payload.u64 = -1;
+    vmsg->size = sizeof(vmsg->payload.u64);
+
+    if (dev->nregions) {
+        vu_panic(dev, "Regions already registered at postcopy-listen");
+        return true;
+    }
+    dev->postcopy_listening = true;
+
+    vmsg->flags = VHOST_USER_VERSION | VHOST_USER_REPLY_MASK;
+    vmsg->payload.u64 = 0; /* Success */
+    return true;
+}
+static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
     int do_reply = 0;
@@ -868,6 +887,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
         break;
     case VHOST_USER_POSTCOPY_ADVISE:
         return vu_set_postcopy_advise(dev, vmsg);
+    case VHOST_USER_POSTCOPY_LISTEN:
+        return vu_set_postcopy_listen(dev, vmsg);
     default:
         vmsg_close_fds(vmsg);
         vu_panic(dev, "Unhandled request: %d", vmsg->request);
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index 3e8efdd919..29c11ba56c 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -67,6 +67,7 @@ typedef enum VhostUserRequest {
     VHOST_USER_IOTLB_MSG        = 22,
     VHOST_USER_SET_VRING_ENDIAN = 23,
     VHOST_USER_POSTCOPY_ADVISE  = 24,
+    VHOST_USER_POSTCOPY_LISTEN  = 25,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -237,6 +238,7 @@ struct VuDev {
 
     /* Postcopy data */
     int postcopy_ufd;
+    bool postcopy_listening;
 };
 
 typedef struct VuVirtqElement {
diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index dad2a1b343..73c3dd74db 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -615,6 +615,12 @@ Master message types
       the slave must open a userfaultfd for later use.
       Note that at this stage the migration is still in precopy mode.
 
+ * VHOST_USER_POSTCOPY_LISTEN
+      Id: 25
+      Master payload: N/A
+
+      Master advises slave that a transition to postcopy mode has happened.
+
 Slave message types
 -------------------
 
diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 775461ae98..f736c7c84f 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -1,5 +1,8 @@
 # See docs/devel/tracing.txt for syntax documentation.
 
+# hw/virtio/vhost-user.c
+vhost_user_postcopy_listen(void) ""
+
 # hw/virtio/virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
 virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) "vq %p elem %p len %u idx %u"
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index b7898f8939..9178271ab2 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -69,6 +69,7 @@ typedef enum VhostUserRequest {
     VHOST_USER_IOTLB_MSG = 22,
     VHOST_USER_SET_VRING_ENDIAN = 23,
     VHOST_USER_POSTCOPY_ADVISE  = 24,
+    VHOST_USER_POSTCOPY_LISTEN  = 25,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -788,6 +789,32 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
     return 0;
 }
 
+/*
+ * Called at the switch to postcopy on reception of the 'listen' command.
+ */
+static int vhost_user_postcopy_listen(struct vhost_dev *dev, Error **errp)
+{
+    int ret;
+    VhostUserMsg msg = {
+        .request = VHOST_USER_POSTCOPY_LISTEN,
+        .flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY_MASK,
+    };
+
+    trace_vhost_user_postcopy_listen();
+    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
+        error_setg(errp, "Failed to send postcopy_listen to vhost");
+        return -1;
+    }
+
+    ret = process_message_reply(dev, &msg);
+    if (ret) {
+        error_setg(errp, "Failed to receive reply to postcopy_listen");
+        return ret;
+    }
+
+    return 0;
+}
+
 static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
                                         void *opaque)
 {
@@ -810,6 +837,9 @@ static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
     case POSTCOPY_NOTIFY_INBOUND_ADVISE:
         return vhost_user_postcopy_advise(dev, pnd->errp);
 
+    case POSTCOPY_NOTIFY_INBOUND_LISTEN:
+        return vhost_user_postcopy_listen(dev, pnd->errp);
+
     default:
         /* We ignore notifications we don't know */
         break;
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 28c216cc7a..873c147b68 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -129,6 +129,7 @@ void postcopy_infrastructure_init(void);
 enum PostcopyNotifyReason {
     POSTCOPY_NOTIFY_PROBE = 0,
     POSTCOPY_NOTIFY_INBOUND_ADVISE,
+    POSTCOPY_NOTIFY_INBOUND_LISTEN,
 };
 
 struct PostcopyNotifyData {
diff --git a/migration/savevm.c b/migration/savevm.c
index d35911731d..72f084e10d 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1557,6 +1557,8 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
 {
     PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_LISTENING);
     trace_loadvm_postcopy_handle_listen();
+    Error *local_err = NULL;
+
     if (ps != POSTCOPY_INCOMING_ADVISE && ps != POSTCOPY_INCOMING_DISCARD) {
         error_report("CMD_POSTCOPY_LISTEN in wrong postcopy state (%d)", ps);
         return -1;
@@ -1578,6 +1580,11 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
         return -1;
     }
 
+    if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_LISTEN, &local_err)) {
+        error_report_err(local_err);
+        return -1;
+    }
+
     if (mis->have_listen_thread) {
         error_report("CMD_POSTCOPY_RAM_LISTEN already has a listen thread");
         return -1;
-- 
2.13.5

* [Qemu-devel] [RFC v2 15/32] vhost+postcopy: Register new regions with the ufd
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (13 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 14/32] vhost+postcopy: Transmit 'listen' to client Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30 10:42     ` Marc-André Lureau
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu Dr. David Alan Gilbert (git)
                     ` (17 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

When new regions are sent to the client using SET_MEM_TABLE, register
them with the userfaultfd.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index e8accf11db..e6ab059a03 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -449,6 +449,38 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
                    dev_region->mmap_addr);
         }
 
+        if (dev->postcopy_listening) {
+            /* We should already have an open ufd. Mark each memory
+             * range as ufd.
+             * Note: Do we need any madvises? Well it's not been accessed
+             * yet, still probably need no THP to be safe, discard to be safe?
+             */
+            struct uffdio_register reg_struct;
+            reg_struct.range.start = (uintptr_t)dev_region->mmap_addr;
+            reg_struct.range.len = dev_region->size + dev_region->mmap_offset;
+            reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
+
+            if (ioctl(dev->postcopy_ufd, UFFDIO_REGISTER, &reg_struct)) {
+                vu_panic(dev, "%s: Failed to userfault region %d "
+                              "@%p + %zx: (ufd=%d)%s\n",
+                         __func__, i,
+                         dev_region->mmap_addr,
+                         dev_region->size + dev_region->mmap_offset,
+                         dev->postcopy_ufd, strerror(errno));
+                continue;
+            }
+            if (!(reg_struct.ioctls & ((__u64)1 << _UFFDIO_COPY))) {
+                vu_panic(dev, "%s: Region (%d) doesn't support COPY",
+                         __func__, i);
+                continue;
+            }
+            DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
+                    __func__, i, reg_struct.range.start, reg_struct.range.len);
+            /* TODO: Stash 'zero' support flags somewhere */
+            /* TODO: Get address back to QEMU */
+
+        }
+
         close(vmsg->fds[i]);
     }
 
-- 
2.13.5
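
The two pieces of logic in the hunk above that do not need a live userfaultfd, building the registration range and checking that the kernel reported UFFDIO_COPY support, can be pulled into small functions. This is a sketch assuming Linux headers are available; make_missing_registration and supports_copy are hypothetical helper names, and the ioctl call itself is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/userfaultfd.h>

/* Build the argument for UFFDIO_REGISTER covering a whole mapping.
 * As in the patch, the length is the region size plus the mmap offset,
 * because the mapping starts at offset 0 of the backing file. */
static struct uffdio_register make_missing_registration(uint64_t mmap_addr,
                                                        uint64_t size,
                                                        uint64_t mmap_offset)
{
    struct uffdio_register reg = {0};

    reg.range.start = mmap_addr;
    reg.range.len = size + mmap_offset;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    return reg;
}

/* After a successful UFFDIO_REGISTER the kernel fills in reg->ioctls;
 * returns non-zero if UFFDIO_COPY is usable on the registered range. */
static int supports_copy(const struct uffdio_register *reg)
{
    return (reg->ioctls & ((__u64)1 << _UFFDIO_COPY)) != 0;
}
```

A caller would pass the returned structure to ioctl(ufd, UFFDIO_REGISTER, &reg) and only then consult supports_copy(), since the ioctls field is an output of that call.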

* [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (14 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 15/32] vhost+postcopy: Register new regions with the ufd Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-29  8:30     ` Peter Xu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 17/32] vhost+postcopy: Stash RAMBlock and offset Dr. David Alan Gilbert (git)
                     ` (16 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

We need a better way, but for now the client has to send the addresses
of its mappings back to qemu so that qemu can interpret the fault
messages it reads from the userfaultfd.

Note: We don't ask for the default 'ack' reply since we've got our own.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 15 ++++++++-
 docs/interop/vhost-user.txt           |  6 ++++
 hw/virtio/trace-events                |  1 +
 hw/virtio/vhost-user.c                | 57 ++++++++++++++++++++++++++++++++++-
 4 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index e6ab059a03..5ec54f7d60 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -477,13 +477,26 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
             DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
                     __func__, i, reg_struct.range.start, reg_struct.range.len);
             /* TODO: Stash 'zero' support flags somewhere */
-            /* TODO: Get address back to QEMU */
 
+            /* TODO: We need to find a way for qemu not to see the virtual
+             * addresses of the clients, so as to keep better separation.
+             */
+            /* Return the address to QEMU so that it can translate the ufd
+             * fault addresses back.
+             */
+            msg_region->userspace_addr = (uintptr_t)(mmap_addr +
+                                                     dev_region->mmap_offset);
         }
 
         close(vmsg->fds[i]);
     }
 
+    if (dev->postcopy_listening) {
+        /* Need to return the addresses - send the updated message back */
+        vmsg->fd_num = 0;
+        return true;
+    }
+
     return false;
 }
 
diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index 73c3dd74db..b2a548c94d 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -413,12 +413,18 @@ Master message types
       Id: 5
       Equivalent ioctl: VHOST_SET_MEM_TABLE
       Master payload: memory regions description
+      Slave payload: (postcopy only) memory regions description
 
       Sets the memory map regions on the slave so it can translate the vring
       addresses. In the ancillary data there is an array of file descriptors
       for each memory mapped region. The size and ordering of the fds matches
       the number and ordering of memory regions.
 
+      When postcopy-listening has been received, the slave replies to
+      SET_MEM_TABLE with the bases of its memory mapped regions.  The slave
+      must have mmap'd the regions and enabled userfaultfd on them.  Note
+      NEED_REPLY_MASK is not set in this case.
+
  * VHOST_USER_SET_LOG_BASE
 
       Id: 6
diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index f736c7c84f..63fd4a79cf 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -2,6 +2,7 @@
 
 # hw/virtio/vhost-user.c
 vhost_user_postcopy_listen(void) ""
+vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
 
 # hw/virtio/virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 9178271ab2..2e4eb0864a 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -19,6 +19,7 @@
 #include "qemu/sockets.h"
 #include "migration/migration.h"
 #include "migration/postcopy-ram.h"
+#include "trace.h"
 
 #include <sys/ioctl.h>
 #include <sys/socket.h>
@@ -133,6 +134,7 @@ struct vhost_user {
     int slave_fd;
     NotifierWithReturn postcopy_notifier;
     struct PostCopyFD  postcopy_fd;
+    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
 };
 
 static bool ioeventfd_enabled(void)
@@ -300,11 +302,13 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
 static int vhost_user_set_mem_table(struct vhost_dev *dev,
                                     struct vhost_memory *mem)
 {
+    struct vhost_user *u = dev->opaque;
     int fds[VHOST_MEMORY_MAX_NREGIONS];
     int i, fd;
     size_t fd_num = 0;
     bool reply_supported = virtio_has_feature(dev->protocol_features,
-                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
+                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
+                           !u->postcopy_fd.handler;
 
     VhostUserMsg msg = {
         .request = VHOST_USER_SET_MEM_TABLE,
@@ -350,6 +354,57 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
         return -1;
     }
 
+    if (u->postcopy_fd.handler) {
+        VhostUserMsg msg_reply;
+        int region_i, reply_i;
+        if (vhost_user_read(dev, &msg_reply) < 0) {
+            return -1;
+        }
+
+        if (msg_reply.request != VHOST_USER_SET_MEM_TABLE) {
+            error_report("%s: Received unexpected msg type. "
+                         "Expected %d received %d", __func__,
+                         VHOST_USER_SET_MEM_TABLE, msg_reply.request);
+            return -1;
+        }
+        /* We're using the same structure, just reusing one of the
+         * fields, so it should be the same size.
+         */
+        if (msg_reply.size != msg.size) {
+            error_report("%s: Unexpected size for postcopy reply "
+                         "%d vs %d", __func__, msg_reply.size, msg.size);
+            return -1;
+        }
+
+        memset(u->postcopy_client_bases, 0,
+               sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
+
+        /* They're in the same order as the regions that were sent
+         * but some of the regions were skipped (above) if they
+         * didn't have fd's
+         */
+        for (reply_i = 0, region_i = 0;
+             region_i < dev->mem->nregions;
+             region_i++) {
+            if (reply_i < fd_num &&
+                msg_reply.payload.memory.regions[reply_i].guest_phys_addr ==
+                dev->mem->regions[region_i].guest_phys_addr) {
+                u->postcopy_client_bases[region_i] =
+                    msg_reply.payload.memory.regions[reply_i].userspace_addr;
+                trace_vhost_user_set_mem_table_postcopy(
+                    msg_reply.payload.memory.regions[reply_i].userspace_addr,
+                    msg.payload.memory.regions[reply_i].userspace_addr,
+                    reply_i, region_i);
+                reply_i++;
+            }
+        }
+        if (reply_i != fd_num) {
+            error_report("%s: postcopy reply not fully consumed "
+                         "%d vs %zu",
+                         __func__, reply_i, fd_num);
+            return -1;
+        }
+    }
     if (reply_supported) {
         return process_message_reply(dev, &msg);
     }
-- 
2.13.5
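
The reply handling above walks two indices: one over all of qemu's regions and one over the (possibly shorter) list the client sent back, because regions without an fd were never sent. A minimal standalone sketch of that walk, assuming the reply entries are compared by guest physical address and are indexed by the reply counter throughout; Region and match_reply_bases are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for a vhost memory region entry */
struct Region {
    uint64_t guest_phys_addr;
    uint64_t userspace_addr;
};

/* For each of our regions, store the client's base address if the client
 * replied for it, or 0 if that region was skipped.  The reply must be in
 * the same order as the regions.  Returns 0 if every reply entry was
 * consumed, -1 otherwise. */
static int match_reply_bases(const struct Region *regions, size_t nregions,
                             const struct Region *reply, size_t nreply,
                             uint64_t *client_bases)
{
    size_t reply_i = 0;

    assert(regions != NULL && client_bases != NULL);
    for (size_t region_i = 0; region_i < nregions; region_i++) {
        client_bases[region_i] = 0;
        if (reply_i < nreply &&
            reply[reply_i].guest_phys_addr ==
            regions[region_i].guest_phys_addr) {
            client_bases[region_i] = reply[reply_i].userspace_addr;
            reply_i++;
        }
    }
    return reply_i == nreply ? 0 : -1;
}
```

The final check plays the same role as the "postcopy reply not fully consumed" error: a leftover reply entry means the two sides disagree about the region list.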

* [Qemu-devel] [RFC v2 17/32] vhost+postcopy: Stash RAMBlock and offset
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (15 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30  5:51     ` Peter Xu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 18/32] vhost+postcopy: Send requests to source for shared pages Dr. David Alan Gilbert (git)
                     ` (15 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Stash the RAMBlock and offset of each vhost region for later use when
looking up addresses.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/virtio/trace-events |  1 +
 hw/virtio/vhost-user.c | 30 ++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 63fd4a79cf..5067dee19b 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -3,6 +3,7 @@
 # hw/virtio/vhost-user.c
 vhost_user_postcopy_listen(void) ""
 vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
+vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
 
 # hw/virtio/virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 2e4eb0864a..fbe2743298 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -135,6 +135,14 @@ struct vhost_user {
     NotifierWithReturn postcopy_notifier;
     struct PostCopyFD  postcopy_fd;
     uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
+    /* Length of the region_rb and region_rb_offset arrays */
+    size_t             region_rb_len;
+    /* RAMBlock associated with a given region */
+    RAMBlock         **region_rb;
+    /* The offset from the start of the RAMBlock to the start of the
+     * vhost region.
+     */
+    ram_addr_t        *region_rb_offset;
 };
 
 static bool ioeventfd_enabled(void)
@@ -319,6 +327,17 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
         msg.flags |= VHOST_USER_NEED_REPLY_MASK;
     }
 
+    if (u->region_rb_len < dev->mem->nregions) {
+        u->region_rb = g_renew(RAMBlock*, u->region_rb, dev->mem->nregions);
+        u->region_rb_offset = g_renew(ram_addr_t, u->region_rb_offset,
+                                      dev->mem->nregions);
+        memset(&(u->region_rb[u->region_rb_len]), '\0',
+               sizeof(RAMBlock *) * (dev->mem->nregions - u->region_rb_len));
+        memset(&(u->region_rb_offset[u->region_rb_len]), '\0',
+               sizeof(ram_addr_t) * (dev->mem->nregions - u->region_rb_len));
+        u->region_rb_len = dev->mem->nregions;
+    }
+
     for (i = 0; i < dev->mem->nregions; ++i) {
         struct vhost_memory_region *reg = dev->mem->regions + i;
         ram_addr_t offset;
@@ -327,8 +346,14 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
         assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
         mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
                                      &offset);
+        u->region_rb_offset[i] = offset;
+        u->region_rb[i] = mr->ram_block;
         fd = memory_region_get_fd(mr);
         if (fd > 0) {
+            trace_vhost_user_set_mem_table_withfd(fd_num, mr->name,
+                                                  reg->memory_size,
+                                                  reg->guest_phys_addr,
+                                                  reg->userspace_addr, offset);
             msg.payload.memory.regions[fd_num].userspace_addr = reg->userspace_addr;
             msg.payload.memory.regions[fd_num].memory_size  = reg->memory_size;
             msg.payload.memory.regions[fd_num].guest_phys_addr = reg->guest_phys_addr;
@@ -992,6 +1017,11 @@ static int vhost_user_cleanup(struct vhost_dev *dev)
         close(u->slave_fd);
         u->slave_fd = -1;
     }
+    g_free(u->region_rb);
+    u->region_rb = NULL;
+    g_free(u->region_rb_offset);
+    u->region_rb_offset = NULL;
+    u->region_rb_len = 0;
     g_free(u);
     dev->opaque = 0;
 
-- 
2.13.5
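
The g_renew-plus-memset pattern above grows an array while guaranteeing that only the newly added tail is zeroed and existing entries are kept. A rough equivalent with plain realloc, as a sketch only; grow_zeroed is a hypothetical helper, and unlike g_renew it reports allocation failure instead of aborting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Grow *arr from old_len to new_len elements of elem_size bytes each,
 * zero-filling only the new tail.  Does nothing if new_len <= old_len.
 * Returns 0 on success, -1 on overflow or allocation failure (in which
 * case *arr is left untouched). */
static int grow_zeroed(void **arr, size_t elem_size,
                       size_t old_len, size_t new_len)
{
    void *grown;

    assert(arr != NULL && elem_size > 0);
    if (new_len <= old_len) {
        return 0;
    }
    if (new_len > SIZE_MAX / elem_size) {
        return -1;
    }
    grown = realloc(*arr, new_len * elem_size);
    if (!grown) {
        return -1;
    }
    memset((char *)grown + old_len * elem_size, 0,
           (new_len - old_len) * elem_size);
    *arr = grown;
    return 0;
}
```

When two parallel arrays share one length field, as region_rb and region_rb_offset do, the length should only be updated after both have been grown.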

* [Qemu-devel] [RFC v2 18/32] vhost+postcopy: Send requests to source for shared pages
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (16 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 17/32] vhost+postcopy: Stash RAMBlock and offset Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 19/32] vhost+postcopy: Resolve client address Dr. David Alan Gilbert (git)
                     ` (14 subsequent siblings)
  32 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Send requests to the source for pages that shared fault handlers ask for.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/migration.h    |  2 ++
 migration/postcopy-ram.c | 31 ++++++++++++++++++++++++++++---
 migration/postcopy-ram.h |  3 +++
 migration/trace-events   |  2 ++
 4 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index 9fcea6bb25..214b0b6afd 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -46,6 +46,8 @@ struct MigrationIncomingState {
     int       userfault_quit_fd;
     QEMUFile *to_src_file;
     QemuMutex rp_mutex;    /* We send replies from multiple threads */
+    /* RAMBlock of last request sent to source */
+    RAMBlock *last_rb;
     void     *postcopy_tmp_page;
     void     *postcopy_tmp_zero_page;
     /* PostCopyFD's for external userfaultfds & handlers of shared memory */
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index faee7708ff..2d77674b94 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -459,6 +459,31 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
 }
 
 /*
+ * Callback from shared fault handlers to ask for a page;
+ * the page must be specified by a RAMBlock and an offset in that rb
+ */
+int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
+                                 uint64_t client_addr, uint64_t rb_offset)
+{
+    size_t pagesize = qemu_ram_pagesize(rb);
+    uint64_t aligned_rbo = rb_offset & ~(pagesize - 1);
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
+                                       rb_offset);
+    /* TODO: Check bitmap to see if we already have the page */
+    if (rb != mis->last_rb) {
+        mis->last_rb = rb;
+        migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
+                                  aligned_rbo, pagesize);
+    } else {
+        /* Save some space */
+        migrate_send_rp_req_pages(mis, NULL, aligned_rbo, pagesize);
+    }
+    return 0;
+}
+
+/*
  * Handle faults detected by the USERFAULT markings
  */
 static void *postcopy_ram_fault_thread(void *opaque)
@@ -468,9 +493,9 @@ static void *postcopy_ram_fault_thread(void *opaque)
     int ret;
     size_t index;
     RAMBlock *rb = NULL;
-    RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
 
     trace_postcopy_ram_fault_thread_entry();
+    mis->last_rb = NULL; /* last RAMBlock we sent part of */
     qemu_sem_post(&mis->fault_thread_sem);
 
     struct pollfd *pfd;
@@ -559,8 +584,8 @@ static void *postcopy_ram_fault_thread(void *opaque)
              * Send the request to the source - we want to request one
              * of our host page sizes (which is >= TPS)
              */
-            if (rb != last_rb) {
-                last_rb = rb;
+            if (rb != mis->last_rb) {
+                mis->last_rb = rb;
                 migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
                                          rb_offset, qemu_ram_pagesize(rb));
             } else {
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 873c147b68..69e88b0174 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -162,5 +162,8 @@ struct PostCopyFD {
  */
 void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
 void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
+/* Callback from shared fault handlers to ask for a page */
+int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
+                                 uint64_t client_addr, uint64_t offset);
 
 #endif
diff --git a/migration/trace-events b/migration/trace-events
index 23f4e5339b..3a0b143f7e 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -197,6 +197,8 @@ postcopy_ram_incoming_cleanup_closeuf(void) ""
 postcopy_ram_incoming_cleanup_entry(void) ""
 postcopy_ram_incoming_cleanup_exit(void) ""
 postcopy_ram_incoming_cleanup_join(void) ""
+postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
+
 save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
 ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
-- 
2.13.5

* [Qemu-devel] [RFC v2 19/32] vhost+postcopy: Resolve client address
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (17 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 18/32] vhost+postcopy: Send requests to source for shared pages Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30  5:28     ` Peter Xu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 20/32] postcopy: wake shared Dr. David Alan Gilbert (git)
                     ` (13 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Resolve fault addresses read off the client's UFD into RAMBlock
and offset, and call back to the postcopy code to ask for the page.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
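A minimal standalone sketch of the lookup described above, for readers who
want to see the address arithmetic in isolation.  The struct, the example
table and the function name are hypothetical stand-ins for the per-region
state the patch keeps in struct vhost_user; the RAMBlock is reduced to an
integer id, and the end of a region is treated as exclusive.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-region state. */
struct region_map {
    uint64_t client_base;   /* where the client mmap'd the region */
    uint64_t size;          /* length of the region in bytes */
    uint64_t rb_offset;     /* offset of the region inside its RAMBlock */
    int      rb_id;         /* stands in for the RAMBlock pointer */
};

/* Made-up example layout: two regions backed by the same block. */
static const struct region_map example_map[] = {
    { 0x7f0000000000ULL, 0x000a0000ULL, 0x00000000ULL, 0 },
    { 0x7f0040000000ULL, 0x3ff00000ULL, 0x00100000ULL, 0 },
};

/*
 * Translate a fault address in the client's address space into an offset
 * within the backing block.  Returns UINT64_MAX if no region contains it.
 */
static uint64_t resolve_client_fault(const struct region_map *map, size_t n,
                                     uint64_t fault_addr)
{
    for (size_t i = 0; i < n; i++) {
        if (fault_addr < map[i].client_base) {
            continue;
        }
        /* Offset of the fault address within this region */
        uint64_t region_offset = fault_addr - map[i].client_base;
        if (region_offset < map[i].size) {
            return region_offset + map[i].rb_offset;
        }
    }
    return UINT64_MAX;
}
```

With the example table, a fault at 0x7f0040002000 lands 0x2000 bytes into
the second region and so resolves to block offset 0x102000.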
 hw/virtio/trace-events |  3 +++
 hw/virtio/vhost-user.c | 30 +++++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 5067dee19b..f7d4b831fe 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -1,6 +1,9 @@
 # See docs/devel/tracing.txt for syntax documentation.
 
 # hw/virtio/vhost-user.c
+vhost_user_postcopy_fault_handler(const char *name, uint64_t fault_address, int nregions) "%s: @0x%"PRIx64" nregions:%d"
+vhost_user_postcopy_fault_handler_loop(int i, uint64_t client_base, uint64_t size) "%d: client 0x%"PRIx64" +0x%"PRIx64
+vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t rb_offset) "%d: region_offset: 0x%"PRIx64" rb_offset:0x%"PRIx64
 vhost_user_postcopy_listen(void) ""
 vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
 vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index fbe2743298..2897ff70b3 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -816,7 +816,35 @@ out:
 static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
                                              void *ufd)
 {
-    return 0;
+    struct vhost_dev *dev = pcfd->data;
+    struct vhost_user *u = dev->opaque;
+    struct uffd_msg *msg = ufd;
+    uint64_t faultaddr = msg->arg.pagefault.address;
+    RAMBlock *rb = NULL;
+    uint64_t rb_offset;
+    int i;
+
+    trace_vhost_user_postcopy_fault_handler(pcfd->idstr, faultaddr,
+                                            dev->mem->nregions);
+    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
+        trace_vhost_user_postcopy_fault_handler_loop(i,
+                u->postcopy_client_bases[i], dev->mem->regions[i].memory_size);
+        if (faultaddr >= u->postcopy_client_bases[i]) {
+            /* Offset of the fault address in the vhost region */
+            uint64_t region_offset = faultaddr - u->postcopy_client_bases[i];
+            if (region_offset < dev->mem->regions[i].memory_size) {
+                rb_offset = region_offset + u->region_rb_offset[i];
+                trace_vhost_user_postcopy_fault_handler_found(i,
+                        region_offset, rb_offset);
+                rb = u->region_rb[i];
+                return postcopy_request_shared_page(pcfd, rb, faultaddr,
+                                                    rb_offset);
+            }
+        }
+    }
+    error_report("%s: Failed to find region for fault %" PRIx64,
+                 __func__, faultaddr);
+    return -1;
 }
 
 /*
-- 
2.13.5

* [Qemu-devel] [RFC v2 20/32] postcopy: wake shared
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (18 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 19/32] vhost+postcopy: Resolve client address Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 21/32] postcopy: postcopy_notify_shared_wake Dr. David Alan Gilbert (git)
                     ` (12 subsequent siblings)
  32 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Send a 'wake' request on a userfaultfd for a shared process.
The address in the client's address space is specified together
with the RAMBlock it was resolved to.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
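A small sketch of how the wake range is derived from the client address:
the start is rounded down to the page size of the backing block and the
length is one page.  struct wake_range is a hypothetical mirror of the
start/len fields of the kernel's struct uffdio_range, used here so the
example builds without the userfaultfd header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the two fields of struct uffdio_range. */
struct wake_range {
    uint64_t start;
    uint64_t len;
};

/* Compute the page-aligned range covering client_addr. */
static struct wake_range wake_range_for(uint64_t client_addr,
                                        uint64_t pagesize)
{
    /* The mask trick below is only valid for power-of-two page sizes. */
    assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);

    struct wake_range r = {
        .start = client_addr & ~(pagesize - 1),
        .len   = pagesize,
    };
    return r;
}
```

The same helper covers both 4k pages and 2M hugepages, since only the
pagesize argument changes.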
 migration/postcopy-ram.c | 26 ++++++++++++++++++++++++++
 migration/postcopy-ram.h |  6 ++++++
 migration/trace-events   |  1 +
 3 files changed, 33 insertions(+)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 2d77674b94..2c9680ef7a 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -458,6 +458,25 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
     return 0;
 }
 
+int postcopy_wake_shared(struct PostCopyFD *pcfd,
+                         uint64_t client_addr,
+                         RAMBlock *rb)
+{
+    size_t pagesize = qemu_ram_pagesize(rb);
+    struct uffdio_range range;
+    int ret;
+    trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb));
+    range.start = client_addr & ~(pagesize - 1);
+    range.len = pagesize;
+    ret = ioctl(pcfd->fd, UFFDIO_WAKE, &range);
+    if (ret) {
+        error_report("%s: Failed to wake: %" PRIx64 " in %s (%s)",
+                     __func__, client_addr, qemu_ram_get_idstr(rb),
+                     strerror(errno));
+    }
+    return ret;
+}
+
 /*
  * Callback from shared fault handlers to ask for a page,
  * the page must be specified by a RAMBlock and an offset in that rb
@@ -876,6 +895,13 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
     return NULL;
 }
 
+int postcopy_wake_shared(struct PostCopyFD *pcfd,
+                         uint64_t client_addr,
+                         RAMBlock *rb)
+{
+    assert(0);
+    return -1;
+}
 #endif
 
 /* ------------------------------------------------------------------------- */
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 69e88b0174..d2b2f5f4aa 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -162,6 +162,12 @@ struct PostCopyFD {
  */
 void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
 void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
+/* Notify a client ufd that a page is available
+ * Note: The 'client_address' is in the address space of the client
+ * program not QEMU
+ */
+int postcopy_wake_shared(struct PostCopyFD *pcfd, uint64_t client_addr,
+                         RAMBlock *rb);
 /* Callback from shared fault handlers to ask for a page */
 int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                  uint64_t client_addr, uint64_t offset);
diff --git a/migration/trace-events b/migration/trace-events
index 3a0b143f7e..535e7ad84b 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -198,6 +198,7 @@ postcopy_ram_incoming_cleanup_entry(void) ""
 postcopy_ram_incoming_cleanup_exit(void) ""
 postcopy_ram_incoming_cleanup_join(void) ""
 postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
+postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
 
 save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
-- 
2.13.5

* [Qemu-devel] [RFC v2 21/32] postcopy: postcopy_notify_shared_wake
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (19 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 20/32] postcopy: wake shared Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 22/32] vhost+postcopy: Add vhost waker Dr. David Alan Gilbert (git)
                     ` (11 subsequent siblings)
  32 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a hook to allow a client userfaultfd to be 'woken'
when a page arrives, and a walker that calls that
hook for relevant clients given a RAMBlock and offset.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
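A standalone sketch of the walker pattern: each registered client carries
a function pointer, and the walk stops at the first waker that reports an
error.  The types and names here are hypothetical (a plain array replaces
the GArray, and an integer id replaces the RAMBlock pointer).

```c
#include <stddef.h>
#include <stdint.h>

struct client;
typedef int (*waker_fn)(struct client *c, int rb_id, uint64_t offset);

/* Hypothetical, cut-down equivalent of struct PostCopyFD. */
struct client {
    waker_fn waker;
    int calls;      /* how many times the waker was invoked */
    int result;     /* value this test waker should return */
};

/* Test waker: counts its invocations and returns a preset result. */
static int counting_waker(struct client *c, int rb_id, uint64_t offset)
{
    (void)rb_id;
    (void)offset;
    c->calls++;
    return c->result;
}

/* Call every client's waker; stop and propagate the first failure. */
static int notify_all(struct client *clients, size_t n,
                      int rb_id, uint64_t offset)
{
    for (size_t i = 0; i < n; i++) {
        int ret = clients[i].waker(&clients[i], rb_id, offset);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Made-up example: the second of three clients fails. */
static struct client example_clients[3] = {
    { counting_waker, 0, 0 },
    { counting_waker, 0, -1 },
    { counting_waker, 0, 0 },
};
```

One consequence of stopping early is that clients after a failing one are
not woken for that page; whether that is the desired policy is a design
question for the real code rather than something this sketch settles.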
 migration/postcopy-ram.c | 16 ++++++++++++++++
 migration/postcopy-ram.h | 10 ++++++++++
 2 files changed, 26 insertions(+)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 2c9680ef7a..40b58a7912 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -749,6 +749,22 @@ static int qemu_ufd_copy_ioctl(int userfault_fd, void *host_addr,
     return ret;
 }
 
+int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset)
+{
+    int i;
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    GArray *pcrfds = mis->postcopy_remote_fds;
+
+    for (i = 0; i < pcrfds->len; i++) {
+        struct PostCopyFD *cur = &g_array_index(pcrfds, struct PostCopyFD, i);
+        int ret = cur->waker(cur, rb, offset);
+        if (ret) {
+            return ret;
+        }
+    }
+    return 0;
+}
+
 /*
  * Place a host page (from) at (host) atomically
  * returns 0 on success
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index d2b2f5f4aa..ecf731c689 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -146,6 +146,10 @@ struct PostCopyFD;
 
 /* ufd is a pointer to the struct uffd_msg *TODO: more Portable! */
 typedef int (*pcfdhandler)(struct PostCopyFD *pcfd, void *ufd);
+/* Notification to wake, either on place or on reception of
+ * a fault on something that's already arrived (race)
+ */
+typedef int (*pcfdwake)(struct PostCopyFD *pcfd, RAMBlock *rb, uint64_t offset);
 
 struct PostCopyFD {
     int fd;
@@ -153,6 +157,8 @@ struct PostCopyFD {
     void *data;
     /* Handler to be called whenever we get a poll event */
     pcfdhandler handler;
+    /* Notification to wake shared client */
+    pcfdwake waker;
     /* A string to use in error messages */
     const char *idstr;
 };
@@ -162,6 +168,10 @@ struct PostCopyFD {
  */
 void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
 void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
+/* Call each of the shared 'waker's registered telling them of
+ * availability of a block.
+ */
+int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset);
 /* Notify a client ufd that a page is available
  * Note: The 'client_address' is in the address space of the client
  * program not QEMU
-- 
2.13.5

* [Qemu-devel] [RFC v2 22/32] vhost+postcopy: Add vhost waker
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (20 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 21/32] postcopy: postcopy_notify_shared_wake Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30  5:55     ` Peter Xu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 23/32] vhost+postcopy: Call wakeups Dr. David Alan Gilbert (git)
                     ` (10 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Register a waker function in vhost-user code to be notified when
pages arrive or when a request is made for a page that is already mapped.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
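A sketch of the reverse translation the waker performs: given a block and
an offset within it, find the vhost region covering that offset and
compute the matching address in the client's address space.  The struct,
the NO_MATCH sentinel and the example table are hypothetical; in the
patch itself a miss is not an error and simply returns 0.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_MATCH UINT64_MAX   /* hypothetical sentinel for "no region" */

/* Hypothetical, simplified stand-in for the per-region state. */
struct region_map {
    uint64_t client_base;   /* where the client mmap'd the region */
    uint64_t size;          /* length of the region in bytes */
    uint64_t rb_offset;     /* offset of the region inside its block */
    int      rb_id;         /* stands in for the RAMBlock pointer */
};

/* Made-up example layout: two regions backed by block 0. */
static const struct region_map example_map[] = {
    { 0x7f0000000000ULL, 0x000a0000ULL, 0x00000000ULL, 0 },
    { 0x7f0040000000ULL, 0x3ff00000ULL, 0x00100000ULL, 0 },
};

/* Translate (block, offset) into an address in the client's space. */
static uint64_t client_addr_for(const struct region_map *map, size_t n,
                                int rb_id, uint64_t offset)
{
    for (size_t i = 0; i < n; i++) {
        if (map[i].rb_id == rb_id &&
            offset >= map[i].rb_offset &&
            offset < map[i].rb_offset + map[i].size) {
            return (offset - map[i].rb_offset) + map[i].client_base;
        }
    }
    return NO_MATCH;
}
```

Offsets that fall in a gap between regions (0xa0000 in the example table)
or that belong to a block the client never mapped yield NO_MATCH.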
 hw/virtio/trace-events |  3 +++
 hw/virtio/vhost-user.c | 26 ++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index f7d4b831fe..adebf6dc6b 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -7,6 +7,9 @@ vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t
 vhost_user_postcopy_listen(void) ""
 vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
 vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
+vhost_user_postcopy_waker(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
+vhost_user_postcopy_waker_found(uint64_t client_addr) "0x%"PRIx64
+vhost_user_postcopy_waker_nomatch(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
 
 # hw/virtio/virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 2897ff70b3..3bff33a1a6 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -847,6 +847,31 @@ static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
     return -1;
 }
 
+static int vhost_user_postcopy_waker(struct PostCopyFD *pcfd, RAMBlock *rb,
+                                     uint64_t offset)
+{
+    struct vhost_dev *dev = pcfd->data;
+    struct vhost_user *u = dev->opaque;
+    int i;
+
+    trace_vhost_user_postcopy_waker(qemu_ram_get_idstr(rb), offset);
+    /* Translate the offset into an address in the clients address space */
+    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
+        if (u->region_rb[i] == rb &&
+            offset >= u->region_rb_offset[i] &&
+            offset < (u->region_rb_offset[i] +
+                      dev->mem->regions[i].memory_size)) {
+            uint64_t client_addr = (offset - u->region_rb_offset[i]) +
+                                   u->postcopy_client_bases[i];
+            trace_vhost_user_postcopy_waker_found(client_addr);
+            return postcopy_wake_shared(pcfd, client_addr, rb);
+        }
+    }
+
+    trace_vhost_user_postcopy_waker_nomatch(qemu_ram_get_idstr(rb), offset);
+    return 0;
+}
+
 /*
  * Called at the start of an inbound postcopy on reception of the
  * 'advise' command.
@@ -892,6 +917,7 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
     u->postcopy_fd.fd = ufd;
     u->postcopy_fd.data = dev;
     u->postcopy_fd.handler = vhost_user_postcopy_fault_handler;
+    u->postcopy_fd.waker = vhost_user_postcopy_waker;
     u->postcopy_fd.idstr = "vhost-user"; /* Need to find unique name */
     postcopy_register_shared_ufd(&u->postcopy_fd);
     return 0;
-- 
2.13.5

* [Qemu-devel] [RFC v2 23/32] vhost+postcopy: Call wakeups
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (21 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 22/32] vhost+postcopy: Add vhost waker Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 24/32] vub+postcopy: madvises Dr. David Alan Gilbert (git)
                     ` (9 subsequent siblings)
  32 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Cause the vhost-user client to be woken up whenever:
  a) We place a page in postcopy mode
  b) We get a fault and the page has already been received

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
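A sketch of the decision made for case (b): on a fault from a shared
client, consult a received-page bitmap; if the page is already present
wake the client, otherwise request the page from the source.  This is a
simplification under stated assumptions: the bitmap here holds one bit
per host page, and the names below are hypothetical rather than QEMU's.

```c
#include <stdbool.h>
#include <stdint.h>

enum fault_action { ACTION_WAKE, ACTION_REQUEST };

/* Test one bit in a bitmap with one bit per page (an assumption). */
static bool page_received(const unsigned long *bitmap, uint64_t page_index)
{
    const unsigned bits = sizeof(unsigned long) * 8;
    return (bitmap[page_index / bits] >> (page_index % bits)) & 1UL;
}

/* Decide what to do for a fault at rb_offset within the block. */
static enum fault_action on_shared_fault(const unsigned long *bitmap,
                                         uint64_t rb_offset,
                                         uint64_t pagesize)
{
    /* Round down to the start of the containing page */
    uint64_t aligned = rb_offset & ~(pagesize - 1);

    return page_received(bitmap, aligned / pagesize) ? ACTION_WAKE
                                                     : ACTION_REQUEST;
}

/* Made-up example: pages 0 and 2 have already arrived. */
static const unsigned long example_bitmap[1] = { 0x5UL };
```

Checking before requesting avoids asking the source for a page that raced
in between the client's fault and the handler running.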
 migration/postcopy-ram.c | 14 ++++++++++----
 migration/trace-events   |  1 +
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 40b58a7912..7d0786ff04 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -490,7 +490,11 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
 
     trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
                                        rb_offset);
-    /* TODO: Check bitmap to see if we already have the page */
+    if (ramblock_recv_bitmap_test_byte_offset(rb, aligned_rbo)) {
+        trace_postcopy_request_shared_page_present(pcfd->idstr,
+                                        qemu_ram_get_idstr(rb), rb_offset);
+        return postcopy_wake_shared(pcfd, client_addr, rb);
+    }
     if (rb != mis->last_rb) {
         mis->last_rb = rb;
         migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
@@ -788,7 +792,8 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
     }
 
     trace_postcopy_place_page(host);
-    return 0;
+    return postcopy_notify_shared_wake(rb,
+                                       qemu_ram_block_host_offset(rb, host));
 }
 
 /*
@@ -812,6 +817,9 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
 
             return -e;
         }
+        return postcopy_notify_shared_wake(rb,
+                                           qemu_ram_block_host_offset(rb,
+                                                                      host));
     } else {
         /* The kernel can't use UFFDIO_ZEROPAGE for hugepages */
         if (!mis->postcopy_tmp_zero_page) {
@@ -831,8 +839,6 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
         return postcopy_place_page(mis, host, mis->postcopy_tmp_zero_page,
                                    rb);
     }
-
-    return 0;
 }
 
 /*
diff --git a/migration/trace-events b/migration/trace-events
index 535e7ad84b..10cff5a068 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -198,6 +198,7 @@ postcopy_ram_incoming_cleanup_entry(void) ""
 postcopy_ram_incoming_cleanup_exit(void) ""
 postcopy_ram_incoming_cleanup_join(void) ""
 postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
+postcopy_request_shared_page_present(const char *sharer, const char *rb, uint64_t rb_offset) "%s already %s offset 0x%"PRIx64
 postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
 
 save_xbzrle_page_skipping(void) ""
-- 
2.13.5

* [Qemu-devel] [RFC v2 24/32] vub+postcopy: madvises
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (22 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 23/32] vhost+postcopy: Call wakeups Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30 10:48     ` Marc-André Lureau
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 25/32] vhost+postcopy: Lock around set_mem_table Dr. David Alan Gilbert (git)
                     ` (8 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Clear the area and turn off THP.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
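A Linux-only sketch of the two madvise() calls on a freshly mapped
region.  prepare_region() and demo_discard() are hypothetical names.  The
demo uses a private anonymous mapping, which is an assumption made to
keep it self-contained: there MADV_DONTNEED makes the next read return
zeroes, whereas on the shared file mapping used by the client it only
drops the local mapping and leaves the backing data alone.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Discard current mappings and turn off THP for [addr, addr + len). */
static int prepare_region(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_DONTNEED)) {
        fprintf(stderr, "madvise(DONTNEED): %s\n", strerror(errno));
        return -1;
    }
    if (madvise(addr, len, MADV_NOHUGEPAGE)) {
        /* Tolerated: kernels built without THP madvise reject this. */
        fprintf(stderr, "madvise(NOHUGEPAGE): %s\n", strerror(errno));
    }
    return 0;
}

/*
 * Map some anonymous memory, dirty it, discard it, and return the first
 * byte as read back afterwards (or -1 if setup failed).
 */
static int demo_discard(void)
{
    size_t len = 1 << 16;   /* 64 KiB, a multiple of common page sizes */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    p[0] = 0xaa;
    int first = (prepare_region(p, len) == 0) ? p[0] : -1;
    munmap(p, len);
    return first;
}
```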
 contrib/libvhost-user/libvhost-user.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 5ec54f7d60..d816851c6d 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -450,11 +450,39 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
         }
 
         if (dev->postcopy_listening) {
+            int ret;
             /* We should already have an open ufd need to mark each memory
              * range as ufd.
-             * Note: Do we need any madvises? Well it's not been accessed
-             * yet, still probably need no THP to be safe, discard to be safe?
              */
+
+            /* Discard any mapping we have here; note I can't use MADV_REMOVE
+             * or fallocate to make the hole since I don't want to lose
+             * data that's already arrived in the shared process.
+             * TODO: How to do hugepage
+             */
+            ret = madvise((void *)dev_region->mmap_addr,
+                          dev_region->size + dev_region->mmap_offset,
+                          MADV_DONTNEED);
+            if (ret) {
+                fprintf(stderr,
+                        "%s: Failed to madvise(DONTNEED) region %d: %s\n",
+                        __func__, i, strerror(errno));
+            }
+            /* Turn off transparent hugepages so we don't get lost wakeups
+             * in neighbouring pages.
+             * TODO: Turn this back on later.
+             */
+            ret = madvise((void *)dev_region->mmap_addr,
+                          dev_region->size + dev_region->mmap_offset,
+                          MADV_NOHUGEPAGE);
+            if (ret) {
+                /* Note: This can happen legally on kernels that are configured
+                 * without madvise'able hugepages
+                 */
+                fprintf(stderr,
+                        "%s: Failed to madvise(NOHUGEPAGE) region %d: %s\n",
+                        __func__, i, strerror(errno));
+            }
             struct uffdio_register reg_struct;
             reg_struct.range.start = (uintptr_t)dev_region->mmap_addr;
             reg_struct.range.len = dev_region->size + dev_region->mmap_offset;
-- 
2.13.5

* [Qemu-devel] [RFC v2 25/32] vhost+postcopy: Lock around set_mem_table
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (23 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 24/32] vub+postcopy: madvises Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30  6:50     ` Peter Xu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 26/32] vhost: Add VHOST_USER_POSTCOPY_END message Dr. David Alan Gilbert (git)
                     ` (7 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

**HACK - better solution needed**
We have the situation where:

     qemu                      bridge

     send set_mem_table
                              map memory
  a)                          mark area with UFD
                              send reply with map addresses
  b)                          start using
  c) receive reply

  As soon as (a) happens qemu might start seeing faults
from memory accesses (but doesn't until b); but it can't
process those faults until (c) when it's received the
mmap addresses.

Make the fault handler spin until it gets the reply in (c).

At the very least this needs some proper locks, but preferably
we need to split the message.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
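A sketch of the flag-and-spin scheme in isolation.  It uses C11
<stdatomic.h> as a stand-in for QEMU's atomic_set()/atomic_mb_read(), and
the struct and function names are hypothetical.  Unlike the patch, the
wait here is bounded by max_spins; that bound is an addition made for the
example so that it cannot hang.

```c
#define _DEFAULT_SOURCE
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for the flag added to struct vhost_user. */
struct mem_table_state {
    atomic_bool in_set_mem_table;
};

/* Static storage: the flag starts out false. */
static struct mem_table_state example_state;

/* Set before sending set_mem_table, cleared on every exit path. */
static void set_mem_table_begin(struct mem_table_state *s)
{
    atomic_store(&s->in_set_mem_table, true);
}

static void set_mem_table_end(struct mem_table_state *s)
{
    atomic_store(&s->in_set_mem_table, false);
}

/*
 * Sleep-poll until the flag clears.  Returns the number of sleeps taken,
 * or -1 if the flag is still set after max_spins sleeps.
 */
static int wait_for_mem_table(struct mem_table_state *s, int max_spins)
{
    int spins = 0;

    while (atomic_load(&s->in_set_mem_table)) {
        if (spins == max_spins) {
            return -1;
        }
        usleep(1000);
        spins++;
    }
    return spins;
}
```

A condition variable or semaphore would avoid the polling delay, which is
in line with the "proper locks" remark in the description above.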
 hw/virtio/trace-events |  1 +
 hw/virtio/vhost-user.c | 17 ++++++++++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index adebf6dc6b..065822c70a 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -10,6 +10,7 @@ vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_siz
 vhost_user_postcopy_waker(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
 vhost_user_postcopy_waker_found(uint64_t client_addr) "0x%"PRIx64
 vhost_user_postcopy_waker_nomatch(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
+vhost_user_postcopy_waker_spin(const char *rb) "%s"
 
 # hw/virtio/virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 3bff33a1a6..4d03383a66 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -143,6 +143,7 @@ struct vhost_user {
      * vhost region.
      */
     ram_addr_t        *region_rb_offset;
+    uint64_t           in_set_mem_table; /*Hack! 1 while waiting for set_mem_table reply */
 };
 
 static bool ioeventfd_enabled(void)
@@ -338,6 +339,7 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
         u->region_rb_len = dev->mem->nregions;
     }
 
+    atomic_set(&u->in_set_mem_table, true);
     for (i = 0; i < dev->mem->nregions; ++i) {
         struct vhost_memory_region *reg = dev->mem->regions + i;
         ram_addr_t offset;
@@ -368,14 +370,15 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
     if (!fd_num) {
         error_report("Failed initializing vhost-user memory map, "
                      "consider using -object memory-backend-file share=on");
+        atomic_set(&u->in_set_mem_table, false);
         return -1;
     }
 
     msg.size = sizeof(msg.payload.memory.nregions);
     msg.size += sizeof(msg.payload.memory.padding);
     msg.size += fd_num * sizeof(VhostUserMemoryRegion);
-
     if (vhost_user_write(dev, &msg, fds, fd_num) < 0) {
+        atomic_set(&u->in_set_mem_table, false);
         return -1;
     }
 
@@ -390,6 +393,7 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
             error_report("%s: Received unexpected msg type."
                          "Expected %d received %d", __func__,
                          VHOST_USER_SET_MEM_TABLE, msg_reply.request);
+            atomic_set(&u->in_set_mem_table, false);
             return -1;
         }
         /* We're using the same structure, just reusing one of the
@@ -398,6 +402,7 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
         if (msg_reply.size != msg.size) {
             error_report("%s: Unexpected size for postcopy reply "
                          "%d vs %d", __func__, msg_reply.size, msg.size);
+            atomic_set(&u->in_set_mem_table, false);
             return -1;
         }
 
@@ -427,9 +432,11 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
             error_report("%s: postcopy reply not fully consumed "
                          "%d vs %zd",
                          __func__, reply_i, fd_num);
+            atomic_set(&u->in_set_mem_table, false);
             return -1;
         }
     }
+    atomic_set(&u->in_set_mem_table, false);
     if (reply_supported) {
         return process_message_reply(dev, &msg);
     }
@@ -855,6 +862,14 @@ static int vhost_user_postcopy_waker(struct PostCopyFD *pcfd, RAMBlock *rb,
     int i;
 
     trace_vhost_user_postcopy_waker(qemu_ram_get_idstr(rb), offset);
+    if (!u) {
+        return 0;
+    }
+    while (atomic_mb_read(&u->in_set_mem_table)) {
+        trace_vhost_user_postcopy_waker_spin(qemu_ram_get_idstr(rb));
+        usleep(1000*100);
+    }
+
     /* Translate the offset into an address in the clients address space */
     for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
         if (u->region_rb[i] == rb &&
-- 
2.13.5

* [Qemu-devel] [RFC v2 26/32] vhost: Add VHOST_USER_POSTCOPY_END message
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (24 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 25/32] vhost+postcopy: Lock around set_mem_table Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30  6:55     ` Peter Xu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 27/32] vhost+postcopy: Wire up POSTCOPY_END notify Dr. David Alan Gilbert (git)
                     ` (6 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

This message is sent just before the end of postcopy to get the
client to stop using userfault since we won't respond to any more
requests.  It should close userfaultfd so that any other pages
get mapped to the backing file automatically by the kernel, since
at this point we know we've received everything.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
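A sketch of what a client does on receiving this message: stop listening,
close the userfaultfd if one is open, and reply with a zero u64 as the
acknowledgement.  The struct and function names are hypothetical
reductions of VuDev and VhostUserMsg.  Note one assumption: the sketch
treats any non-negative descriptor as open, which differs slightly from
the "> 0" test used in the diff.

```c
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical reduction of the postcopy fields in VuDev. */
struct client_dev {
    bool postcopy_listening;
    int  postcopy_ufd;      /* -1 when no userfaultfd is open */
};

/* Hypothetical reduction of the reply message. */
struct end_reply {
    uint64_t payload;       /* 0 means acknowledged */
    int      fd_num;        /* no descriptors travel with the reply */
};

/* Handle the end-of-postcopy message and build the acknowledgement. */
static struct end_reply handle_postcopy_end(struct client_dev *dev)
{
    dev->postcopy_listening = false;
    if (dev->postcopy_ufd >= 0) {
        /* Closing the fd lets the kernel map remaining pages normally */
        close(dev->postcopy_ufd);
        dev->postcopy_ufd = -1;
    }

    struct end_reply r = { .payload = 0, .fd_num = 0 };
    return r;
}

/* Made-up example device that is listening but has no fd open. */
static struct client_dev example_dev = { true, -1 };
```

Calling the handler twice is harmless in this sketch, because the
descriptor is reset to -1 after the first close.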
 contrib/libvhost-user/libvhost-user.c | 23 +++++++++++++++++++++++
 contrib/libvhost-user/libvhost-user.h |  1 +
 docs/interop/vhost-user.txt           |  8 ++++++++
 hw/virtio/vhost-user.c                |  1 +
 4 files changed, 33 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index d816851c6d..23bff47649 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -67,6 +67,7 @@ vu_request_to_string(int req)
         REQ(VHOST_USER_SET_VRING_ENDIAN),
         REQ(VHOST_USER_POSTCOPY_ADVISE),
         REQ(VHOST_USER_POSTCOPY_LISTEN),
+        REQ(VHOST_USER_POSTCOPY_END),
         REQ(VHOST_USER_MAX),
     };
 #undef REQ
@@ -893,6 +894,26 @@ vu_set_postcopy_listen(VuDev *dev, VhostUserMsg *vmsg)
     vmsg->payload.u64 = 0; /* Success */
     return true;
 }
+
+static bool
+vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
+{
+    DPRINT("%s: Entry\n", __func__);
+    dev->postcopy_listening = false;
+    if (dev->postcopy_ufd > 0) {
+        close(dev->postcopy_ufd);
+        dev->postcopy_ufd = -1;
+        DPRINT("%s: Done close\n", __func__);
+    }
+
+    vmsg->fd_num = 0;
+    vmsg->payload.u64 = 0;
+    vmsg->size = sizeof(vmsg->payload.u64);
+    vmsg->flags = VHOST_USER_VERSION | VHOST_USER_REPLY_MASK;
+    DPRINT("%s: exit\n", __func__);
+    return true;
+}
+
 static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -962,6 +983,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
         return vu_set_postcopy_advise(dev, vmsg);
     case VHOST_USER_POSTCOPY_LISTEN:
         return vu_set_postcopy_listen(dev, vmsg);
+    case VHOST_USER_POSTCOPY_END:
+        return vu_set_postcopy_end(dev, vmsg);
     default:
         vmsg_close_fds(vmsg);
         vu_panic(dev, "Unhandled request: %d", vmsg->request);
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index 29c11ba56c..a78596e6fd 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -68,6 +68,7 @@ typedef enum VhostUserRequest {
     VHOST_USER_SET_VRING_ENDIAN = 23,
     VHOST_USER_POSTCOPY_ADVISE  = 24,
     VHOST_USER_POSTCOPY_LISTEN  = 25,
+    VHOST_USER_POSTCOPY_END     = 26,
     VHOST_USER_MAX
 } VhostUserRequest;
 
diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index b2a548c94d..d6586e0b43 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -627,6 +627,14 @@ Master message types
 
       Master advises slave that a transition to postcopy mode has happened.
 
+ * VHOST_USER_POSTCOPY_END
+      Id: 26
+      Slave payload: u64
+
+      Master advises that postcopy migration has now completed.  The
+      slave must disable the userfaultfd. The response is an acknowledgement
+      only.
+
 Slave message types
 -------------------
 
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 4d03383a66..c2e55be0fd 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -71,6 +71,7 @@ typedef enum VhostUserRequest {
     VHOST_USER_SET_VRING_ENDIAN = 23,
     VHOST_USER_POSTCOPY_ADVISE  = 24,
     VHOST_USER_POSTCOPY_LISTEN  = 25,
+    VHOST_USER_POSTCOPY_END     = 26,
     VHOST_USER_MAX
 } VhostUserRequest;
 
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [Qemu-devel] [RFC v2 27/32] vhost+postcopy: Wire up POSTCOPY_END notify
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (25 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 26/32] vhost: Add VHOST_USER_POSTCOPY_END message Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30  6:57     ` Peter Xu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 28/32] postcopy: Allow shared memory Dr. David Alan Gilbert (git)
                     ` (5 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Wire up a call that sends the VHOST_USER_POSTCOPY_END message to the
vhost clients right before we ask the listener thread to shut down.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/virtio/trace-events   |  2 ++
 hw/virtio/vhost-user.c   | 30 ++++++++++++++++++++++++++++++
 migration/postcopy-ram.c |  5 +++++
 migration/postcopy-ram.h |  1 +
 4 files changed, 38 insertions(+)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 065822c70a..5b599617a1 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -1,6 +1,8 @@
 # See docs/devel/tracing.txt for syntax documentation.
 
 # hw/virtio/vhost-user.c
+vhost_user_postcopy_end_entry(void) ""
+vhost_user_postcopy_end_exit(void) ""
 vhost_user_postcopy_fault_handler(const char *name, uint64_t fault_address, int nregions) "%s: @0x%"PRIx64" nregions:%d"
 vhost_user_postcopy_fault_handler_loop(int i, uint64_t client_base, uint64_t size) "%d: client 0x%"PRIx64" +0x%"PRIx64
 vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t rb_offset) "%d: region_offset: 0x%"PRIx64" rb_offset:0x%"PRIx64
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index c2e55be0fd..d4461459fe 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -965,6 +965,33 @@ static int vhost_user_postcopy_listen(struct vhost_dev *dev, Error **errp)
     return 0;
 }
 
+/*
+ * Called at the end of postcopy
+ */
+static int vhost_user_postcopy_end(struct vhost_dev *dev, Error **errp)
+{
+    VhostUserMsg msg = {
+        .request = VHOST_USER_POSTCOPY_END,
+        .flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY_MASK,
+    };
+    int ret;
+
+    trace_vhost_user_postcopy_end_entry();
+    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
+        error_setg(errp, "Failed to send postcopy_end to vhost");
+        return -1;
+    }
+
+    ret = process_message_reply(dev, &msg);
+    if (ret) {
+        error_setg(errp, "Failed to receive reply to postcopy_end");
+        return ret;
+    }
+    trace_vhost_user_postcopy_end_exit();
+
+    return 0;
+}
+
 static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
                                         void *opaque)
 {
@@ -990,6 +1017,9 @@ static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
     case POSTCOPY_NOTIFY_INBOUND_LISTEN:
         return vhost_user_postcopy_listen(dev, pnd->errp);
 
+    case POSTCOPY_NOTIFY_INBOUND_END:
+        return vhost_user_postcopy_end(dev, pnd->errp);
+
     default:
         /* We ignore notifications we don't know */
         break;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 7d0786ff04..28791cf1f1 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -337,7 +337,12 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
 
     if (mis->have_fault_thread) {
         uint64_t tmp64;
+        Error *local_err = NULL;
 
+        if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_END, &local_err)) {
+            error_report_err(local_err);
+            return -1;
+        }
         if (qemu_ram_foreach_block(cleanup_range, mis)) {
             return -1;
         }
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index ecf731c689..d0dc838001 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -130,6 +130,7 @@ enum PostcopyNotifyReason {
     POSTCOPY_NOTIFY_PROBE = 0,
     POSTCOPY_NOTIFY_INBOUND_ADVISE,
     POSTCOPY_NOTIFY_INBOUND_LISTEN,
+    POSTCOPY_NOTIFY_INBOUND_END,
 };
 
 struct PostcopyNotifyData {
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [Qemu-devel] [RFC v2 28/32] postcopy: Allow shared memory
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (26 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 27/32] vhost+postcopy: Wire up POSTCOPY_END notify Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30 10:39     ` Marc-André Lureau
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 29/32] vhost-user: Claim support for postcopy Dr. David Alan Gilbert (git)
                     ` (4 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Now that we have the mechanisms in here, allow shared memory in a
postcopy.

Note that QEMU can't tell who all the users of shared regions are
and thus can't tell whether all the users of the shared regions
have appropriate support for postcopy.  Those devices that explicitly
support shared memory (e.g. vhost-user) must check, but it doesn't
stop weirder configurations causing problems.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/postcopy-ram.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 28791cf1f1..89c3aadda1 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -138,12 +138,6 @@ static int test_ramblock_postcopiable(const char *block_name, void *host_addr,
     RAMBlock *rb = qemu_ram_block_by_name(block_name);
     size_t pagesize = qemu_ram_pagesize(rb);
 
-    if (qemu_ram_is_shared(rb)) {
-        error_report("Postcopy on shared RAM (%s) is not yet supported",
-                     block_name);
-        return 1;
-    }
-
     if (length % pagesize) {
         error_report("Postcopy requires RAM blocks to be a page size multiple,"
                      " block %s is 0x" RAM_ADDR_FMT " bytes with a "
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [Qemu-devel] [RFC v2 29/32] vhost-user: Claim support for postcopy
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (27 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 28/32] postcopy: Allow shared memory Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-30 10:50     ` Marc-André Lureau
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 30/32] vhost: Merge neighbouring hugepage regions where appropriate Dr. David Alan Gilbert (git)
                     ` (3 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Tell QEMU we understand the protocol features needed for postcopy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 23bff47649..290748733b 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -144,6 +144,35 @@ vmsg_close_fds(VhostUserMsg *vmsg)
     }
 }
 
+/* A test to see if we have userfault available */
+static bool
+have_userfault(void)
+{
+#if defined(__linux__) && defined(__NR_userfaultfd) &&\
+        defined(UFFD_FEATURE_MISSING_SHMEM) &&\
+        defined(UFFD_FEATURE_MISSING_HUGETLBFS)
+    /* Now test the kernel we're running on really has the features */
+    int ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+    struct uffdio_api api_struct;
+    if (ufd < 0) {
+        return false;
+    }
+
+    api_struct.api = UFFD_API;
+    api_struct.features = UFFD_FEATURE_MISSING_SHMEM |
+                          UFFD_FEATURE_MISSING_HUGETLBFS;
+    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
+        close(ufd);
+        return false;
+    }
+    close(ufd);
+    return true;
+
+#else
+    return false;
+#endif
+}
+
 static bool
 vu_message_read(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
 {
@@ -796,6 +825,10 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg *vmsg)
 {
     uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD;
 
+    if (have_userfault()) {
+        features |= 1ULL << VHOST_USER_PROTOCOL_F_PAGEFAULT;
+    }
+
     if (dev->iface->get_protocol_features) {
         features |= dev->iface->get_protocol_features(dev);
     }
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [Qemu-devel] [RFC v2 30/32] vhost: Merge neighbouring hugepage regions where appropriate
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (28 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 29/32] vhost-user: Claim support for postcopy Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-09-14  9:18     ` Igor Mammedov
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 31/32] vhost: Don't break merged regions on small remove/non-adds Dr. David Alan Gilbert (git)
                     ` (2 subsequent siblings)
  32 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Where two regions are created with a gap such that when aligned
to hugepage boundaries, the two regions overlap, merge them.

I also add quite a few trace events to see what's going on.

Note: This doesn't handle all the cases, but does handle the common
case on a PC due to the 640k hole.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 hw/virtio/trace-events | 11 +++++++
 hw/virtio/vhost.c      | 79 +++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 89 insertions(+), 1 deletion(-)
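
Note (illustration only, not part of the patch): a minimal standalone
sketch of the condition described in the commit message, namely that two
regions separated by a small gap overlap once each is stretched to
hugepage boundaries. The struct and function names here are hypothetical
and are not QEMU APIs; the patch itself uses a narrower check (the other
region must start exactly at the aligned-down address, with a matching
userspace address).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct vhost_memory_region; illustration only. */
struct region {
    uint64_t gpa;   /* guest physical start address */
    uint64_t size;  /* length in bytes */
};

/* Round addr down to a multiple of pagesize (pagesize is a power of two). */
static uint64_t align_down(uint64_t addr, uint64_t pagesize)
{
    assert(pagesize && (pagesize & (pagesize - 1)) == 0);
    return addr & ~(pagesize - 1);
}

/* Round addr up to a multiple of pagesize. */
static uint64_t align_up(uint64_t addr, uint64_t pagesize)
{
    return align_down(addr + pagesize - 1, pagesize);
}

/*
 * True if 'lo' and 'hi' (with lo below hi) overlap once both are
 * stretched out to pagesize boundaries.
 */
static bool overlap_when_aligned(const struct region *lo,
                                 const struct region *hi,
                                 uint64_t pagesize)
{
    uint64_t lo_end = align_up(lo->gpa + lo->size, pagesize); /* exclusive */
    uint64_t hi_start = align_down(hi->gpa, pagesize);

    return hi_start < lo_end;
}
```

With a PC-style layout (RAM below 640k, a hole, then RAM from 0xc0000)
the two regions overlap for 2M pages but not for 4K pages, which is why
only the hugepage case needs the merge.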

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 5b599617a1..f98efb39fd 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -1,5 +1,16 @@
 # See docs/devel/tracing.txt for syntax documentation.
 
+# hw/virtio/vhost.c
+vhost_dev_assign_memory_merged(int from, int to, uint64_t size, uint64_t start_addr, uint64_t uaddr) "f/t=%d/%d 0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
+vhost_dev_assign_memory_not_merged(uint64_t size, uint64_t start_addr, uint64_t uaddr) "0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
+vhost_dev_assign_memory_entry(uint64_t size, uint64_t start_addr, uint64_t uaddr) "0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
+vhost_dev_assign_memory_exit(uint32_t nregions) "%"PRId32
+vhost_huge_page_stretch_and_merge_entry(uint32_t nregions) "%"PRId32
+vhost_huge_page_stretch_and_merge_can(void) ""
+vhost_huge_page_stretch_and_merge_size_align(int d, uint64_t gpa, uint64_t align) "%d: gpa: 0x%"PRIx64" align: 0x%"PRIx64
+vhost_huge_page_stretch_and_merge_start_align(int d, uint64_t gpa, uint64_t align) "%d: gpa: 0x%"PRIx64" align: 0x%"PRIx64
+vhost_section(const char *name, int r) "%s:%d"
+
 # hw/virtio/vhost-user.c
 vhost_user_postcopy_end_entry(void) ""
 vhost_user_postcopy_end_exit(void) ""
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 6eddb099b0..fb506e747f 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -27,6 +27,7 @@
 #include "hw/virtio/virtio-access.h"
 #include "migration/blocker.h"
 #include "sysemu/dma.h"
+#include "trace.h"
 
 /* enabled until disconnected backend stabilizes */
 #define _VHOST_DEBUG 1
@@ -250,6 +251,8 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
 {
     int from, to;
     struct vhost_memory_region *merged = NULL;
+    trace_vhost_dev_assign_memory_entry(size, start_addr, uaddr);
+
     for (from = 0, to = 0; from < dev->mem->nregions; ++from, ++to) {
         struct vhost_memory_region *reg = dev->mem->regions + to;
         uint64_t prlast, urlast;
@@ -293,11 +296,13 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
         uaddr = merged->userspace_addr = u;
         start_addr = merged->guest_phys_addr = s;
         size = merged->memory_size = e - s + 1;
+        trace_vhost_dev_assign_memory_merged(from, to, size, start_addr, uaddr);
         assert(merged->memory_size);
     }
 
     if (!merged) {
         struct vhost_memory_region *reg = dev->mem->regions + to;
+        trace_vhost_dev_assign_memory_not_merged(size, start_addr, uaddr);
         memset(reg, 0, sizeof *reg);
         reg->memory_size = size;
         assert(reg->memory_size);
@@ -307,6 +312,7 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
     }
     assert(to <= dev->mem->nregions + 1);
     dev->mem->nregions = to;
+    trace_vhost_dev_assign_memory_exit(to);
 }
 
 static uint64_t vhost_get_log_size(struct vhost_dev *dev)
@@ -610,8 +616,12 @@ static void vhost_set_memory(MemoryListener *listener,
 
 static bool vhost_section(MemoryRegionSection *section)
 {
-    return memory_region_is_ram(section->mr) &&
+    bool result;
+    result = memory_region_is_ram(section->mr) &&
         !memory_region_is_rom(section->mr);
+
+    trace_vhost_section(section->mr->name, result);
+    return result;
 }
 
 static void vhost_begin(MemoryListener *listener)
@@ -622,6 +632,68 @@ static void vhost_begin(MemoryListener *listener)
     dev->mem_changed_start_addr = -1;
 }
 
+/* Look for regions that are hugepage backed but not aligned
+ * and fix them up to be aligned.
+ * TODO: For now this is just enough to deal with the 640k hole
+ */
+static bool vhost_huge_page_stretch_and_merge(struct vhost_dev *dev)
+{
+    int i, j;
+    bool result = true;
+    trace_vhost_huge_page_stretch_and_merge_entry(dev->mem->nregions);
+
+    for (i = 0; i < dev->mem->nregions; i++) {
+        struct vhost_memory_region *reg = dev->mem->regions + i;
+        ram_addr_t offset;
+        RAMBlock *rb = qemu_ram_block_from_host((void *)reg->userspace_addr,
+                                                false, &offset);
+        size_t pagesize = qemu_ram_pagesize(rb);
+        uint64_t alignage;
+        alignage = reg->guest_phys_addr & (pagesize - 1);
+        if (alignage) {
+
+            trace_vhost_huge_page_stretch_and_merge_start_align(i,
+                                                (uint64_t)reg->guest_phys_addr,
+                                                alignage);
+            for (j = 0; j < dev->mem->nregions; j++) {
+                struct vhost_memory_region *oreg = dev->mem->regions + j;
+                if (j == i) {
+                    continue;
+                }
+
+                if (oreg->guest_phys_addr ==
+                        (reg->guest_phys_addr - alignage) &&
+                    oreg->userspace_addr ==
+                         (reg->userspace_addr - alignage)) {
+                    struct vhost_memory_region treg = *reg;
+                    trace_vhost_huge_page_stretch_and_merge_can();
+                    vhost_dev_unassign_memory(dev, oreg->guest_phys_addr,
+                                              oreg->memory_size);
+                    vhost_dev_unassign_memory(dev, treg.guest_phys_addr,
+                                              treg.memory_size);
+                    vhost_dev_assign_memory(dev,
+                                            treg.guest_phys_addr - alignage,
+                                            treg.memory_size + alignage,
+                                            treg.userspace_addr - alignage);
+                    return vhost_huge_page_stretch_and_merge(dev);
+                }
+            }
+        }
+        alignage = reg->memory_size & (pagesize - 1);
+        if (alignage) {
+            trace_vhost_huge_page_stretch_and_merge_size_align(i,
+                                               (uint64_t)reg->guest_phys_addr,
+                                               alignage);
+            /* We ignore this if we find something else to merge,
+             * so we only return false if we're left with this
+             */
+            result = false;
+        }
+    }
+
+    return result;
+}
+
 static void vhost_commit(MemoryListener *listener)
 {
     struct vhost_dev *dev = container_of(listener, struct vhost_dev,
@@ -641,6 +713,7 @@ static void vhost_commit(MemoryListener *listener)
         return;
     }
 
+    vhost_huge_page_stretch_and_merge(dev);
     if (dev->started) {
         start_addr = dev->mem_changed_start_addr;
         size = dev->mem_changed_end_addr - dev->mem_changed_start_addr + 1;
@@ -1512,6 +1585,10 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         goto fail_features;
     }
 
+    if (!vhost_huge_page_stretch_and_merge(hdev)) {
+        VHOST_OPS_DEBUG("vhost_huge_page_stretch_and_merge failed");
+        goto fail_mem;
+    }
     if (vhost_dev_has_iommu(hdev)) {
         memory_listener_register(&hdev->iommu_listener, vdev->dma_as);
     }
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [Qemu-devel] [RFC v2 31/32] vhost: Don't break merged regions on small remove/non-adds
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (29 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 30/32] vhost: Merge neighbouring hugepage regions where appropriate Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 32/32] postcopy shared docs Dr. David Alan Gilbert (git)
  2017-09-01 13:34   ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Alexey Perevalov
  32 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The previous patch merges hugepage regions with small gaps
back into one; this patch stops some of the cases that split
them up again.

vhost_set_memory avoids adding small regions that are being
dirty-monitored (typically VGA regions), but vhost_set_memory
does remove these regions from any region they overlap.

Avoid doing that removal if we'll just merge them again.

The typical case is where the VGA regions dynamically change
at run time after we've done the merge; although the merge works, the
result is that the memory is marked as having changed and
a set_mem_table message is transmitted.  By avoiding the split
we avoid the join and we avoid marking the regions as having
changed, and thus avoid sending the set_mem_table.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/virtio/trace-events |  2 ++
 hw/virtio/vhost.c      | 42 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 43 insertions(+), 1 deletion(-)
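
Note (illustration only, not part of the patch): a standalone sketch of
the decision made by vhost_dev_would_remerge below, with the RAMBlock
lookup replaced by plain parameters. The function signature and the
host_pagesize parameter are assumptions made for the example; the patch
obtains those values from qemu_ram_pagesize() and getpagesize().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Last byte address of [start, start + size); size must be non-zero. */
static uint64_t range_last(uint64_t start, uint64_t size)
{
    assert(size != 0);
    return start + size - 1;
}

/*
 * Returns true if cutting [start, start + size) out of the region
 * [reg_start, reg_start + reg_size) would just be undone by the
 * hugepage merge, so the removal can be skipped.
 */
static bool would_remerge(uint64_t reg_start, uint64_t reg_size,
                          uint64_t reg_pagesize, uint64_t host_pagesize,
                          uint64_t start, uint64_t size)
{
    uint64_t reglast = range_last(reg_start, reg_size);
    uint64_t memlast = range_last(start, size);

    /* The hole hangs over either end of the region: not our case. */
    if (memlast > reglast || start < reg_start) {
        return false;
    }
    /* Normal-sized pages: no alignment merging happens. */
    if (reg_pagesize == host_pagesize) {
        return false;
    }
    /* A hole of a whole hugepage or more is assumed to stay split. */
    if (size >= reg_pagesize) {
        return false;
    }
    return true;
}
```

For a 2G region backed by 2M pages, removing a 128k VGA-style window at
0xa0000 returns true, so no split (and no set_mem_table) is needed.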

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index f98efb39fd..e22d2055b0 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -10,6 +10,8 @@ vhost_huge_page_stretch_and_merge_can(void) ""
 vhost_huge_page_stretch_and_merge_size_align(int d, uint64_t gpa, uint64_t align) "%d: gpa: 0x%"PRIx64" align: 0x%"PRIx64
 vhost_huge_page_stretch_and_merge_start_align(int d, uint64_t gpa, uint64_t align) "%d: gpa: 0x%"PRIx64" align: 0x%"PRIx64
 vhost_section(const char *name, int r) "%s:%d"
+vhost_dev_would_remerge_no(uint64_t start_addr, uint64_t size) "0x%"PRIx64" + 0x%"PRIx64
+vhost_dev_would_remerge_yes(uint64_t start_addr, uint64_t size) "0x%"PRIx64" + 0x%"PRIx64
 
 # hw/virtio/vhost-user.c
 vhost_user_postcopy_end_entry(void) ""
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index fb506e747f..18714f6d03 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -564,6 +564,43 @@ static bool vhost_dev_cmp_memory(struct vhost_dev *dev,
     return uaddr != reg->userspace_addr + start_addr - reg->guest_phys_addr;
 }
 
+/* Called by vhost_set_memory on removal where start_addr/size have
+ * been found to correspond to an existing region (reg).
+ * Returns True iff removing the section would be undone later
+ * by merging (for hugepage) and thus there's no point in removing it.
+ */
+static bool vhost_dev_would_remerge(struct vhost_dev *dev,
+                                    struct vhost_memory_region *reg,
+                                    hwaddr start_addr, ram_addr_t size)
+{
+    uint64_t reglast = range_get_last(reg->guest_phys_addr, reg->memory_size);
+    uint64_t memlast = range_get_last(start_addr, size);
+    ram_addr_t offset;
+    RAMBlock *rb = qemu_ram_block_from_host((void *)reg->userspace_addr,
+                                            false, &offset);
+    size_t reg_pagesize = qemu_ram_pagesize(rb);
+
+    /* If the region being deleted hangs over the end of the existing
+     * region, it's not a case we're interested in.
+     * If it's just a normal sized page then there won't normally be any
+     * alignment merging going on,
+     * and if the hole being cut is at least the pagesize of the
+     * region then assume it won't be remerged.
+     */
+    if (memlast > reglast || start_addr < reg->guest_phys_addr ||
+        reg_pagesize == getpagesize() ||
+        size >= reg_pagesize) {
+        trace_vhost_dev_would_remerge_no(start_addr, size);
+        return false;
+    }
+
+    /* This is a small chunk overlapping a big hugepage region,
+     * deleting it will get remerged
+     */
+    trace_vhost_dev_would_remerge_yes(start_addr, size);
+    return true;
+}
+
 static void vhost_set_memory(MemoryListener *listener,
                              MemoryRegionSection *section,
                              bool add)
@@ -594,7 +631,10 @@ static void vhost_set_memory(MemoryListener *listener,
             return;
         }
     } else {
-        if (!vhost_dev_find_reg(dev, start_addr, size)) {
+        struct vhost_memory_region *reg = vhost_dev_find_reg(dev,
+                                                             start_addr,
+                                                             size);
+        if (!reg || vhost_dev_would_remerge(dev, reg, start_addr, size)) {
             /* Removing region that we don't access. Nothing to do. */
             return;
         }
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [Qemu-devel] [RFC v2 32/32] postcopy shared docs
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (30 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 31/32] vhost: Don't break merged regions on small remove/non-adds Dr. David Alan Gilbert (git)
@ 2017-08-24 19:27   ` Dr. David Alan Gilbert (git)
  2017-09-01 13:34   ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Alexey Perevalov
  32 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-08-24 19:27 UTC (permalink / raw)
  To: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add some notes to the migration documentation for shared memory
postcopy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 docs/devel/migration.txt | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
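
Note (illustration only, not part of the patch): the documentation below
says the client passes QEMU a mapping table so that fault addresses in
the client's address space can be converted back to RAMBlock/offsets.
A minimal sketch of such a lookup follows; the struct layout, field
names and function name are hypothetical and do not reflect the actual
structures used in hw/virtio/vhost-user.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one region as mapped by the client. */
struct client_mapping {
    uint64_t client_base;  /* start of the mapping in the client */
    uint64_t size;         /* length of the mapping in bytes */
    uint64_t rb_offset;    /* offset of the mapping within its RAMBlock */
    const char *rb_name;   /* name of the backing RAMBlock */
};

/*
 * Convert a fault address reported on the client's userfaultfd into a
 * RAMBlock name and offset. Returns false if no mapping covers it.
 */
static bool translate_fault(const struct client_mapping *map, size_t n,
                            uint64_t fault_addr,
                            const char **rb_name, uint64_t *rb_offset)
{
    assert(rb_name && rb_offset);

    for (size_t i = 0; i < n; i++) {
        const struct client_mapping *m = &map[i];

        if (fault_addr >= m->client_base &&
            fault_addr - m->client_base < m->size) {
            *rb_name = m->rb_name;
            *rb_offset = m->rb_offset + (fault_addr - m->client_base);
            return true;
        }
    }
    return false;
}
```

The resulting RAMBlock/offset pair is what a page request to the source
would be keyed on; once the page arrives, the wake is issued against the
original client address.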

diff --git a/docs/devel/migration.txt b/docs/devel/migration.txt
index 1b940a829b..d4c344c671 100644
--- a/docs/devel/migration.txt
+++ b/docs/devel/migration.txt
@@ -553,3 +553,42 @@ Postcopy now works with hugetlbfs backed memory:
      hugepages works well, however 1GB hugepages are likely to be problematic
      since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
      and until the full page is transferred the destination thread is blocked.
+
+=== Postcopy with shared memory ===
+
+Postcopy migration with shared memory needs explicit support from the other
+processes that share memory and from QEMU. There are restrictions on the type of
+shared memory that userfault can support.
+
+The Linux kernel userfault support works on /dev/shm memory and on hugetlbfs
+(although the kernel doesn't provide an equivalent to madvise(MADV_DONTNEED)
+for hugetlbfs which may be a problem in some configurations).
+
+The vhost-user code in QEMU supports clients that have Postcopy support,
+and the vhost-user-bridge (in tests/) and the DPDK package have changes
+to support postcopy.
+
+The client needs to open a userfaultfd and register the areas
+of memory that it maps with userfault.  The client must then pass the
+userfaultfd back to QEMU together with a mapping table that allows
+fault addresses in the client's address space to be converted back to
+RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
+fault-thread and page requests are made on behalf of the client by QEMU.
+QEMU performs 'wake' operations on the client's userfaultfd to allow it
+to continue after a page has arrived.
+
+  Note: There are two future improvements that would be nice:
+    a) Some way to make QEMU ignorant of the addresses in the client's
+      address space
+    b) Avoiding the need for QEMU to perform ufd-wake calls after the
+      pages have arrived
+
+Retro-fitting postcopy to existing clients is possible:
+  a) A mechanism is needed for the registration with userfault as above,
+     and the registration needs to be coordinated with the phases of
+     postcopy.  In vhost-user extra messages are added to the existing
+     control channel.
+  b) Any thread that can block due to guest memory accesses must be
+     identified and the implication understood; for example if the
+     guest memory access is made while holding a lock then all other
+     threads waiting for that lock will also be blocked.
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started
  2017-08-24 19:26   ` [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started Dr. David Alan Gilbert (git)
@ 2017-08-24 23:10     ` Marc-André Lureau
  2017-08-25 14:58       ` Dr. David Alan Gilbert
  2017-08-30 13:02     ` Michael S. Tsirkin
  1 sibling, 1 reply; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-24 23:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git),
	qemu-devel, maxime.coquelin, a.perevalov, mst
  Cc: lvivier, aarcange, felipe, peterx, quintela

Hi

On Thu, Aug 24, 2017 at 9:39 PM Dr. David Alan Gilbert (git) <
dgilbert@redhat.com> wrote:

> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Add a vu_queue_started method to complement vu_queue_enabled.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>



> ---
>  contrib/libvhost-user/libvhost-user.c | 6 ++++++
>  contrib/libvhost-user/libvhost-user.h | 9 +++++++++
>  2 files changed, 15 insertions(+)
>
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index 35fa0c5e56..201b9846e9 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -930,6 +930,12 @@ vu_queue_enabled(VuDev *dev, VuVirtq *vq)
>      return vq->enable;
>  }
>
> +bool
> +vu_queue_started(VuDev *dev, VuVirtq *vq)
>

I guess we could make it const, but this is true for many other functions.
Could be done later in one go.

> +{
> +    return vq->started;
> +}
> +
>  static inline uint16_t
>  vring_avail_flags(VuVirtq *vq)
>  {
> diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> index 53ef222c0b..acd019876d 100644
> --- a/contrib/libvhost-user/libvhost-user.h
> +++ b/contrib/libvhost-user/libvhost-user.h
> @@ -328,6 +328,15 @@ void vu_queue_set_notification(VuDev *dev, VuVirtq *vq, int enable);
>  bool vu_queue_enabled(VuDev *dev, VuVirtq *vq);
>
>  /**
> + * vu_queue_started:
> + * @dev: a VuDev context
> + * @vq: a VuVirtq queue
> + *
> + * Returns: whether the queue is started.
> + */
> +bool vu_queue_started(VuDev *dev, VuVirtq *vq);
> +
> +/**
>   * vu_queue_empty:
>   * @dev: a VuDev context
>   * @vq: a VuVirtq queue
> --
> 2.13.5
>
>
> --
Marc-André Lureau

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 04/32] qemu_ram_block_host_offset
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 04/32] qemu_ram_block_host_offset Dr. David Alan Gilbert (git)
@ 2017-08-25 12:11     ` Philippe Mathieu-Daudé
  2017-08-25 15:28       ` Dr. David Alan Gilbert
  2017-08-29  5:36     ` Peter Xu
  1 sibling, 1 reply; 94+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-08-25 12:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git),
	qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau
  Cc: lvivier, aarcange, felipe, peterx, quintela

Hi David,

On 08/24/2017 04:27 PM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Utility to give the offset of a host pointer within a RAMBlock
> (assuming we already know it's in that RAMBlock)
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>   exec.c                    | 10 ++++++++++
>   include/exec/cpu-common.h |  1 +
>   2 files changed, 11 insertions(+)
> 
> diff --git a/exec.c b/exec.c
> index 67df2909ce..35b4cea2ed 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2231,6 +2231,16 @@ static void *qemu_ram_ptr_length(RAMBlock *ram_block, ram_addr_t addr,
>       return ramblock_ptr(block, addr);
>   }
>   
> +/* Return the offset of a hostpointer within a ramblock */
> +ram_addr_t qemu_ram_block_host_offset(RAMBlock *rb, void *host)
> +{
> +    ram_addr_t res = (uint8_t *)host - (uint8_t *)rb->host;

What about using ptrdiff_t here,

> +    assert((uint8_t *)host >= (uint8_t *)rb->host);

and uintptr_t here?

> +    assert(res < rb->max_length);
> +
> +    return res;
> +}
> +
>   /*
>    * Translates a host ptr back to a RAMBlock, a ram_addr and an offset
>    * in that RAMBlock.
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 74341b19d2..0d861a6289 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -68,6 +68,7 @@ ram_addr_t qemu_ram_addr_from_host(void *ptr);
>   RAMBlock *qemu_ram_block_by_name(const char *name);
>   RAMBlock *qemu_ram_block_from_host(void *ptr, bool round_offset,
>                                      ram_addr_t *offset);
> +ram_addr_t qemu_ram_block_host_offset(RAMBlock *rb, void *host);
>   void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
>   void qemu_ram_unset_idstr(RAMBlock *block);
>   const char *qemu_ram_get_idstr(RAMBlock *rb);
> 


* Re: [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started
  2017-08-24 23:10     ` Marc-André Lureau
@ 2017-08-25 14:58       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-25 14:58 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, lvivier, aarcange,
	felipe, peterx, quintela

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> Hi
> 
> On Thu, Aug 24, 2017 at 9:39 PM Dr. David Alan Gilbert (git) <
> dgilbert@redhat.com> wrote:
> 
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > Add a vu_queue_started method to complement vu_queue_enabled.
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >
> 
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

Thanks.

> 
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 6 ++++++
> >  contrib/libvhost-user/libvhost-user.h | 9 +++++++++
> >  2 files changed, 15 insertions(+)
> >
> > diff --git a/contrib/libvhost-user/libvhost-user.c
> > b/contrib/libvhost-user/libvhost-user.c
> > index 35fa0c5e56..201b9846e9 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -930,6 +930,12 @@ vu_queue_enabled(VuDev *dev, VuVirtq *vq)
> >      return vq->enable;
> >  }
> >
> > +bool
> > +vu_queue_started(VuDev *dev, VuVirtq *vq)
> >
> 
> I guess we could make it const, but this is true for many other functions.
> Could be done later in one go.

Thanks; I've added the consts.

Dave

> > +{
> > +    return vq->started;
> > +}
> > +
> >  static inline uint16_t
> >  vring_avail_flags(VuVirtq *vq)
> >  {
> > diff --git a/contrib/libvhost-user/libvhost-user.h
> > b/contrib/libvhost-user/libvhost-user.h
> > index 53ef222c0b..acd019876d 100644
> > --- a/contrib/libvhost-user/libvhost-user.h
> > +++ b/contrib/libvhost-user/libvhost-user.h
> > @@ -328,6 +328,15 @@ void vu_queue_set_notification(VuDev *dev, VuVirtq
> > *vq, int enable);
> >  bool vu_queue_enabled(VuDev *dev, VuVirtq *vq);
> >
> >  /**
> > + * vu_queue_started:
> > + * @dev: a VuDev context
> > + * @vq: a VuVirtq queue
> > + *
> > + * Returns: whether the queue is started.
> > + */
> > +bool vu_queue_started(VuDev *dev, VuVirtq *vq);
> > +
> > +/**
> >   * vu_queue_empty:
> >   * @dev: a VuDev context
> >   * @vq: a VuVirtq queue
> > --
> > 2.13.5
> >
> >
> > --
> Marc-André Lureau
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 04/32] qemu_ram_block_host_offset
  2017-08-25 12:11     ` Philippe Mathieu-Daudé
@ 2017-08-25 15:28       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-25 15:28 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	lvivier, aarcange, felipe, peterx, quintela

* Philippe Mathieu-Daudé (f4bug@amsat.org) wrote:
> Hi David,
> 
> On 08/24/2017 04:27 PM, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Utility to give the offset of a host pointer within a RAMBlock
> > (assuming we already know it's in that RAMBlock)
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >   exec.c                    | 10 ++++++++++
> >   include/exec/cpu-common.h |  1 +
> >   2 files changed, 11 insertions(+)
> > 
> > diff --git a/exec.c b/exec.c
> > index 67df2909ce..35b4cea2ed 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -2231,6 +2231,16 @@ static void *qemu_ram_ptr_length(RAMBlock *ram_block, ram_addr_t addr,
> >       return ramblock_ptr(block, addr);
> >   }
> > +/* Return the offset of a hostpointer within a ramblock */
> > +ram_addr_t qemu_ram_block_host_offset(RAMBlock *rb, void *host)
> > +{
> > +    ram_addr_t res = (uint8_t *)host - (uint8_t *)rb->host;
> 
> What about using ptrdiff_t here,

We tend to use ram_addr_t for offsets in RAM, and so that's
the return type of the function, and we're also comparing this
value to rb->max_length below which is also a ram_addr_t.

> > +    assert((uint8_t *)host >= (uint8_t *)rb->host);
> 
> and uintptr_t here?

Done.
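
For reference, the combination discussed above (an unsigned ram_addr_t
result, with the range check done on uintptr_t values) can be sketched
roughly as below.  This is only an illustration: struct demo_ram_block,
demo_block_host_offset() and the uint64_t typedef are stand-ins assumed
for the example, not the real RAMBlock or the code in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: ram_addr_t is a 64-bit unsigned type. */
typedef uint64_t ram_addr_t;

/* Hypothetical stand-in for the two RAMBlock fields the helper reads. */
struct demo_ram_block {
    uint8_t *host;          /* start of the host mapping */
    ram_addr_t max_length;  /* size of the mapping in bytes */
};

/* Return the offset of a host pointer within the block. */
static ram_addr_t demo_block_host_offset(const struct demo_ram_block *rb,
                                         const void *host)
{
    /* Compare as integers before subtracting, so the unsigned
     * result below cannot wrap around. */
    assert((uintptr_t)host >= (uintptr_t)rb->host);

    ram_addr_t res = (ram_addr_t)((uintptr_t)host - (uintptr_t)rb->host);

    /* The pointer must lie inside the block. */
    assert(res < rb->max_length);
    return res;
}
```

A signed ptrdiff_t would also hold the difference, but it would then
have to be converted before the comparison against max_length, which is
unsigned.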

Thanks,

Dave

> > +    assert(res < rb->max_length);
> > +
> > +    return res;
> > +}
> > +
> >   /*
> >    * Translates a host ptr back to a RAMBlock, a ram_addr and an offset
> >    * in that RAMBlock.
> > diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> > index 74341b19d2..0d861a6289 100644
> > --- a/include/exec/cpu-common.h
> > +++ b/include/exec/cpu-common.h
> > @@ -68,6 +68,7 @@ ram_addr_t qemu_ram_addr_from_host(void *ptr);
> >   RAMBlock *qemu_ram_block_by_name(const char *name);
> >   RAMBlock *qemu_ram_block_from_host(void *ptr, bool round_offset,
> >                                      ram_addr_t *offset);
> > +ram_addr_t qemu_ram_block_host_offset(RAMBlock *rb, void *host);
> >   void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
> >   void qemu_ram_unset_idstr(RAMBlock *block);
> >   const char *qemu_ram_get_idstr(RAMBlock *rb);
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 03/32] migrate: Update ram_block_discard_range for shared
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 03/32] migrate: Update ram_block_discard_range for shared Dr. David Alan Gilbert (git)
@ 2017-08-29  5:30     ` Peter Xu
  2017-09-18 12:18       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Xu @ 2017-08-29  5:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:01PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The choice of call to discard a block is getting more complicated
> for other cases.   We use fallocate PUNCH_HOLE in any file cases;
> it works for both hugepage and for tmpfs.
> We use the DONTNEED for non-hugepage cases either where they're
> anonymous or where they're private.
> 
> Care should be taken when trying other backing files.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  exec.c       | 35 ++++++++++++++++++++++++-----------
>  trace-events |  3 +++
>  2 files changed, 27 insertions(+), 11 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index d20c34ca83..67df2909ce 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -3573,6 +3573,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>      }
>  
>      if ((start + length) <= rb->used_length) {
> +        bool need_madvise, need_fallocate;
>          uint8_t *host_endaddr = host_startaddr + length;
>          if ((uintptr_t)host_endaddr & (rb->page_size - 1)) {
>              error_report("ram_block_discard_range: Unaligned end address: %p",
> @@ -3582,23 +3583,35 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>  
>          errno = ENOTSUP; /* If we are missing MADVISE etc */
>  
> -        if (rb->page_size == qemu_host_page_size) {
> -#if defined(CONFIG_MADVISE)
> -            /* Note: We need the madvise MADV_DONTNEED behaviour of definitely
> -             * freeing the page.
> -             */
> -            ret = madvise(host_startaddr, length, MADV_DONTNEED);
> -#endif
> -        } else {
> -            /* Huge page case  - unfortunately it can't do DONTNEED, but
> -             * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
> -             * huge page file.
> +        /* The logic here is messy;
> +         *    madvise DONTNEED fails for hugepages
> +         *    fallocate works on hugepages and shmem
> +         */
> +        need_madvise = (rb->page_size == qemu_host_page_size);
> +        need_fallocate = rb->fd != -1;
> +        if (need_fallocate) {
> +            /* For a file, this causes the area of the file to be zero'd
> +             * if read, and for hugetlbfs also causes it to be unmapped
> +             * so a userfault will trigger.
>               */
>  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>              ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                              start, length);
>  #endif
>          }
> +        /* i.e. need madvise but skip it if the fallocate failed */
> +        if (need_madvise && (!need_fallocate || (ret == 0))) {

I'd slightly prefer:

  trace_ram_block_discard_range();

  if (need_fallocate) {
    ret = fallocate();
    if (ret) {
      error_report();
      goto err;
    }
  }

  if (need_madvise) {
    ret = madvise();
    if (ret) {
      error_report();
      goto err;
    }
  }

But that is a personal preference.  Either way:

Reviewed-by: Peter Xu <peterx@redhat.com>
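
To make the early-return ordering concrete, here is a minimal
standalone sketch.  demo_discard_range() and its parameters are
hypothetical names chosen for the example (the real function takes a
RAMBlock and also does tracing and error reporting); only fallocate()
and madvise() are real calls, and the sketch assumes Linux.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Discard [start, start + length) of a mapping that begins at host_base.
 * fd is the backing file, or -1 for anonymous memory.
 * need_madvise is decided by the caller (false for hugepages).
 * Returns 0 on success, or -errno from the first call that fails.
 */
static int demo_discard_range(int fd, uint8_t *host_base, off_t start,
                              size_t length, bool need_madvise)
{
    bool need_fallocate = fd != -1;

    assert(host_base != NULL);

    if (need_fallocate) {
        /* Punch a hole in the file; bail out before touching the mapping
         * if that fails. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      start, length)) {
            return -errno;
        }
    }

    if (need_madvise) {
        /* Drop the local mapping so the next access faults again. */
        if (madvise(host_base + start, length, MADV_DONTNEED)) {
            return -errno;
        }
    }

    return 0;
}
```

With this shape each step reports its own failure, and the madvise is
simply never reached when the fallocate fails, so the combined
condition on ret is not needed.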

> +            /* For normal RAM this causes it to be unmapped,
> +             * for shared memory it causes the local mapping to disappear
> +             * and to fall back on the file contents (which we just
> +             * fallocate'd away).
> +             */
> +#if defined(CONFIG_MADVISE)
> +            ret =  madvise(host_startaddr, length, MADV_DONTNEED);
> +#endif
> +        }
> +        trace_ram_block_discard_range(rb->idstr, host_startaddr,
> +                                      need_madvise, need_fallocate, ret);
>          if (ret) {
>              ret = -errno;
>              error_report("ram_block_discard_range: Failed to discard range "
> diff --git a/trace-events b/trace-events
> index 1f50f56d9d..213ee34f89 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -55,6 +55,9 @@ dma_complete(void *dbs, int ret, void *cb) "dbs=%p ret=%d cb=%p"
>  dma_blk_cb(void *dbs, int ret) "dbs=%p ret=%d"
>  dma_map_wait(void *dbs) "dbs=%p"
>  
> +# exec.c
> +ram_block_discard_range(const char *rbname, void *hva, bool need_madvise, bool need_fallocate, int ret) "%s@%p: madvise: %d fallocate: %d ret: %d"
> +
>  # memory.c
>  memory_region_ops_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
>  memory_region_ops_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
> -- 
> 2.13.5
> 

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 04/32] qemu_ram_block_host_offset
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 04/32] qemu_ram_block_host_offset Dr. David Alan Gilbert (git)
  2017-08-25 12:11     ` Philippe Mathieu-Daudé
@ 2017-08-29  5:36     ` Peter Xu
  1 sibling, 0 replies; 94+ messages in thread
From: Peter Xu @ 2017-08-29  5:36 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:02PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Utility to give the offset of a host pointer within a RAMBlock
> (assuming we already know it's in that RAMBlock)
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 07/32] postcopy: Add notifier chain
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 07/32] postcopy: Add notifier chain Dr. David Alan Gilbert (git)
@ 2017-08-29  6:02     ` Peter Xu
  2017-09-11 17:00       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Xu @ 2017-08-29  6:02 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:05PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Add a notifier chain for postcopy with a 'reason' flag
> and an opportunity for a notifier member to return an error.
> 
> Call it when enabling postcopy.
> 
> This will initially be used to enable devices to declare they're unable
> to postcopy and later to notify devices of stages within postcopy.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/postcopy-ram.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  migration/postcopy-ram.h | 26 ++++++++++++++++++++++++++
>  vl.c                     |  2 ++
>  3 files changed, 69 insertions(+)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 640b72d86d..95007c00ef 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -23,6 +23,8 @@
>  #include "savevm.h"
>  #include "postcopy-ram.h"
>  #include "ram.h"
> +#include "qapi/error.h"
> +#include "qemu/notify.h"
>  #include "sysemu/sysemu.h"
>  #include "sysemu/balloon.h"
>  #include "qemu/error-report.h"
> @@ -45,6 +47,38 @@ struct PostcopyDiscardState {
>      unsigned int nsentcmds;
>  };
>  
> +/* A notifier chain for postcopy
> + * The notifier should return 0 if it's OK, or a
> + * -errno on error.
> + * The notifier should expect an Error ** as it's data

Shouldn't this be "PostcopyNotifyData *" rather than "Error **"?

Maybe we can just remove this comment block, since there is a
similar one in the header below.

Other than that:

Reviewed-by: Peter Xu <peterx@redhat.com>
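
As an illustration of what the notifier actually receives, here is a
small self-contained sketch of a notifier chain with return values.
All demo_* names are made up for the example, and the list is a plain
singly linked one rather than QEMU's NotifierWithReturnList; the point
is only that the callback's data is the reason-plus-error struct, and
that the first non-zero return stops the walk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum demo_reason {
    DEMO_NOTIFY_PROBE = 0,
};

/* What each notifier is handed: the reason and a place for an error. */
struct demo_notify_data {
    enum demo_reason reason;
    const char **errp;      /* set to a message on failure */
};

struct demo_notifier {
    int (*notify)(struct demo_notifier *n, void *data);
    struct demo_notifier *next;
};

static struct demo_notifier *demo_chain;

static void demo_add_notifier(struct demo_notifier *n)
{
    assert(n->notify != NULL);
    n->next = demo_chain;
    demo_chain = n;
}

/* Call each notifier in turn; stop at the first non-zero return. */
static int demo_notify(enum demo_reason reason, const char **errp)
{
    struct demo_notify_data pnd = { .reason = reason, .errp = errp };

    for (struct demo_notifier *n = demo_chain; n; n = n->next) {
        int ret = n->notify(n, &pnd);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* A device that cannot do postcopy objects to the probe. */
static int demo_device_cannot_postcopy(struct demo_notifier *n, void *data)
{
    struct demo_notify_data *pnd = data;

    (void)n;
    if (pnd->reason == DEMO_NOTIFY_PROBE) {
        *pnd->errp = "device not capable of postcopy";
        return -ENOENT;
    }
    return 0;   /* ignore reasons we don't know */
}
```

With an empty chain demo_notify() returns 0; once a device like the one
above is registered, a probe returns -ENOENT and the caller finds the
message through errp.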

> + */
> +static NotifierWithReturnList postcopy_notifier_list;
> +
> +void postcopy_infrastructure_init(void)
> +{
> +    notifier_with_return_list_init(&postcopy_notifier_list);
> +}
> +
> +void postcopy_add_notifier(NotifierWithReturn *nn)
> +{
> +    notifier_with_return_list_add(&postcopy_notifier_list, nn);
> +}
> +
> +void postcopy_remove_notifier(NotifierWithReturn *n)
> +{
> +    notifier_with_return_remove(n);
> +}
> +
> +int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp)
> +{
> +    struct PostcopyNotifyData pnd;
> +    pnd.reason = reason;
> +    pnd.errp = errp;
> +
> +    return notifier_with_return_list_notify(&postcopy_notifier_list,
> +                                            &pnd);
> +}
> +
>  /* Postcopy needs to detect accesses to pages that haven't yet been copied
>   * across, and efficiently map new pages in, the techniques for doing this
>   * are target OS specific.
> @@ -133,6 +167,7 @@ bool postcopy_ram_supported_by_host(void)
>      struct uffdio_register reg_struct;
>      struct uffdio_range range_struct;
>      uint64_t feature_mask;
> +    Error *local_err = NULL;
>  
>      if (qemu_target_page_size() > pagesize) {
>          error_report("Target page size bigger than host page size");
> @@ -146,6 +181,12 @@ bool postcopy_ram_supported_by_host(void)
>          goto out;
>      }
>  
> +    /* Give devices a chance to object */
> +    if (postcopy_notify(POSTCOPY_NOTIFY_PROBE, &local_err)) {
> +        error_report_err(local_err);
> +        goto out;
> +    }
> +
>      /* Version and features check */
>      if (!ufd_version_check(ufd)) {
>          goto out;
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index 78a3591322..d688411674 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -114,4 +114,30 @@ PostcopyState postcopy_state_get(void);
>  /* Set the state and return the old state */
>  PostcopyState postcopy_state_set(PostcopyState new_state);
>  
> +/*
> + * To be called once at the start before any device initialisation
> + */
> +void postcopy_infrastructure_init(void);
> +
> +/* Add a notifier to a list to be called when checking whether the devices
> + * can support postcopy.
> + * It's data is a *PostcopyNotifyData
> + * It should return 0 if OK, or a negative value on failure.
> + * On failure it must set the data->errp to an error.
> + *
> + */
> +enum PostcopyNotifyReason {
> +    POSTCOPY_NOTIFY_PROBE = 0,
> +};
> +
> +struct PostcopyNotifyData {
> +    enum PostcopyNotifyReason reason;
> +    Error **errp;
> +};
> +
> +void postcopy_add_notifier(NotifierWithReturn *nn);
> +void postcopy_remove_notifier(NotifierWithReturn *n);
> +/* Call the notifier list set by postcopy_add_start_notifier */
> +int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp);
> +
>  #endif
> diff --git a/vl.c b/vl.c
> index 8e247cc2a2..65dd9dc324 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -95,6 +95,7 @@ int main(int argc, char **argv)
>  #include "audio/audio.h"
>  #include "sysemu/cpus.h"
>  #include "migration/colo.h"
> +#include "migration/postcopy-ram.h"
>  #include "sysemu/kvm.h"
>  #include "sysemu/hax.h"
>  #include "qapi/qobject-input-visitor.h"
> @@ -3082,6 +3083,7 @@ int main(int argc, char **argv, char **envp)
>      module_call_init(MODULE_INIT_OPTS);
>  
>      runstate_init();
> +    postcopy_infrastructure_init();
>  
>      if (qcrypto_init(&err) < 0) {
>          error_reportf_err(err, "cannot initialize crypto: ");
> -- 
> 2.13.5
> 

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 08/32] postcopy: Add vhost-user flag for postcopy and check it
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 08/32] postcopy: Add vhost-user flag for postcopy and check it Dr. David Alan Gilbert (git)
@ 2017-08-29  6:22     ` Peter Xu
  2017-09-13 14:34       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Xu @ 2017-08-29  6:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:06PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Add a vhost feature flag for postcopy support, and
> use the postcopy notifier to check it before allowing postcopy.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.h |  1 +
>  docs/interop/vhost-user.txt           | 10 +++++++++
>  hw/virtio/vhost-user.c                | 40 ++++++++++++++++++++++++++++++++++-
>  3 files changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> index acd019876d..95d0d34a28 100644
> --- a/contrib/libvhost-user/libvhost-user.h
> +++ b/contrib/libvhost-user/libvhost-user.h
> @@ -34,6 +34,7 @@ enum VhostUserProtocolFeature {
>      VHOST_USER_PROTOCOL_F_MQ = 0,
>      VHOST_USER_PROTOCOL_F_LOG_SHMFD = 1,
>      VHOST_USER_PROTOCOL_F_RARP = 2,
> +    VHOST_USER_PROTOCOL_F_PAGEFAULT = 7,
>  
>      VHOST_USER_PROTOCOL_F_MAX
>  };
> diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> index 954771d0d8..a279560eb0 100644
> --- a/docs/interop/vhost-user.txt
> +++ b/docs/interop/vhost-user.txt
> @@ -273,6 +273,15 @@ Once the source has finished migration, rings will be stopped by
>  the source. No further update must be done before rings are
>  restarted.
>  
> +In postcopy migration the slave is started before all the memory has been
> +received from the source host, and care must be taken to avoid accessing pages
> +that have yet to be received.  The slave opens a 'userfault'-fd and registers
> +the memory with it; this fd is then passed back over to the master.
> +The master services requests on the userfaultfd for pages that are accessed
> +and when the page is available it performs WAKE ioctl's on the userfaultfd
> +to wake the stalled slave.  The client indicates support for this via the
> +VHOST_USER_PROTOCOL_F_PAGEFAULT feature.
> +
>  IOMMU support
>  -------------
>  
> @@ -327,6 +336,7 @@ Protocol features
>  #define VHOST_USER_PROTOCOL_F_MTU            4
>  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ      5
>  #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN   6
> +#define VHOST_USER_PROTOCOL_F_PAGEFAULT      7
>  
>  Master message types
>  --------------------
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 093675ed98..c51bbd1296 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -17,6 +17,8 @@
>  #include "sysemu/kvm.h"
>  #include "qemu/error-report.h"
>  #include "qemu/sockets.h"
> +#include "migration/migration.h"
> +#include "migration/postcopy-ram.h"
>  
>  #include <sys/ioctl.h>
>  #include <sys/socket.h>
> @@ -34,7 +36,7 @@ enum VhostUserProtocolFeature {
>      VHOST_USER_PROTOCOL_F_NET_MTU = 4,
>      VHOST_USER_PROTOCOL_F_SLAVE_REQ = 5,
>      VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,
> -
> +    VHOST_USER_PROTOCOL_F_PAGEFAULT = 7,
>      VHOST_USER_PROTOCOL_F_MAX
>  };
>  
> @@ -123,8 +125,10 @@ static VhostUserMsg m __attribute__ ((unused));
>  #define VHOST_USER_VERSION    (0x1)
>  
>  struct vhost_user {
> +    struct vhost_dev *dev;
>      CharBackend *chr;
>      int slave_fd;
> +    NotifierWithReturn postcopy_notifier;
>  };
>  
>  static bool ioeventfd_enabled(void)
> @@ -720,6 +724,33 @@ out:
>      return ret;
>  }
>  
> +static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
> +                                        void *opaque)
> +{
> +    struct PostcopyNotifyData *pnd = opaque;
> +    struct vhost_user *u = container_of(notifier, struct vhost_user,
> +                                         postcopy_notifier);
> +    struct vhost_dev *dev = u->dev;
> +
> +    switch (pnd->reason) {
> +    case POSTCOPY_NOTIFY_PROBE:
> +        if (!virtio_has_feature(dev->protocol_features,
> +                                VHOST_USER_PROTOCOL_F_PAGEFAULT)) {
> +            /* TODO: Get the device name into this error somehow */
> +            error_setg(pnd->errp,
> +                       "vhost-user backend not capable of postcopy");
> +            return -ENOENT;
> +        }
> +        break;
> +
> +    default:
> +        /* We ignore notifications we don't know */
> +        break;
> +    }
> +
> +    return 0;
> +}
> +
>  static int vhost_user_init(struct vhost_dev *dev, void *opaque)
>  {
>      uint64_t features, protocol_features;
> @@ -731,6 +762,7 @@ static int vhost_user_init(struct vhost_dev *dev, void *opaque)
>      u = g_new0(struct vhost_user, 1);
>      u->chr = opaque;
>      u->slave_fd = -1;
> +    u->dev = dev;
>      dev->opaque = u;
>  
>      err = vhost_user_get_features(dev, &features);
> @@ -787,6 +819,9 @@ static int vhost_user_init(struct vhost_dev *dev, void *opaque)
>          return err;
>      }
>  
> +    u->postcopy_notifier.notify = vhost_user_postcopy_notifier;
> +    postcopy_add_notifier(&u->postcopy_notifier);
> +
>      return 0;
>  }
>  
> @@ -797,6 +832,9 @@ static int vhost_user_cleanup(struct vhost_dev *dev)
>      assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER);
>  
>      u = dev->opaque;
> +    if (u->postcopy_notifier.notify) {

Detecting initialisation via the notify hook looks slightly strange to
me...  If we keep it, I'm not sure whether we also need:

           u->postcopy_notifier.notify = NULL;

Otherwise I'm not sure whether a 2nd call to vhost_user_cleanup() can be
dangerous, since postcopy_remove_notifier() would be called twice.

Besides that, the patch looks good to me.  Thanks,
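
A rough sketch of the concern, using a made-up intrusive list rather
than QEMU's notifier list (every demo_* name here is hypothetical):
removing a node leaves its own link pointers stale, so a second removal
would write through them.  Clearing the hook after the first removal
turns a repeated cleanup into a no-op.

```c
#include <assert.h>
#include <stddef.h>

struct demo_node {
    int (*notify)(void *data);
    struct demo_node *next;
    struct demo_node **pprev;   /* address of the pointer that points at us */
};

static struct demo_node *demo_list;
static int demo_remove_calls;

static int demo_noop(void *data)
{
    (void)data;
    return 0;
}

static void demo_list_add(struct demo_node *n)
{
    n->next = demo_list;
    n->pprev = &demo_list;
    if (demo_list) {
        demo_list->pprev = &n->next;
    }
    demo_list = n;
}

/* Unlink the node.  Its own next/pprev are left stale, so calling this
 * twice on the same node is not safe. */
static void demo_list_remove(struct demo_node *n)
{
    assert(n->pprev != NULL);
    *n->pprev = n->next;
    if (n->next) {
        n->next->pprev = n->pprev;
    }
    demo_remove_calls++;
}

/* Cleanup that tolerates being called more than once. */
static void demo_cleanup(struct demo_node *n)
{
    if (n->notify) {
        demo_list_remove(n);
        n->notify = NULL;   /* a later cleanup skips the removal */
    }
}
```

Whether vhost_user_cleanup() can really be entered twice for the same
device is a separate question; the sketch only shows that the guard is
cheap.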

> +        postcopy_remove_notifier(&u->postcopy_notifier);
> +    }
>      if (u->slave_fd >= 0) {
>          qemu_set_fd_handler(u->slave_fd, NULL, NULL, NULL);
>          close(u->slave_fd);
> -- 
> 2.13.5
> 

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 11/32] vhub: Open userfaultfd
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 11/32] vhub: Open userfaultfd Dr. David Alan Gilbert (git)
@ 2017-08-29  6:40     ` Peter Xu
  2017-09-15 17:33       ` Dr. David Alan Gilbert
  2017-08-30 10:30     ` Marc-André Lureau
  1 sibling, 1 reply; 94+ messages in thread
From: Peter Xu @ 2017-08-29  6:40 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:09PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Open a userfaultfd (on a postcopy_advise) and send it back in
> the reply to the qemu for it to monitor.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 26 +++++++++++++++++++++++---
>  contrib/libvhost-user/libvhost-user.h |  3 +++
>  2 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index 47884c0a15..f9b5b12b28 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -15,6 +15,7 @@
>  
>  #include <qemu/osdep.h>
>  #include <sys/eventfd.h>
> +#include <sys/syscall.h>
>  #include <linux/vhost.h>
>  
>  #include "qemu/atomic.h"
> @@ -773,11 +774,30 @@ vu_set_vring_enable_exec(VuDev *dev, VhostUserMsg *vmsg)
>  static bool
>  vu_set_postcopy_advise(VuDev *dev, VhostUserMsg *vmsg)
>  {
> -    /* TODO: Open ufd, pass it back in the request
> -     * TODO: Add addresses 
> -     */
> +    struct uffdio_api api_struct;
> +
> +    dev->postcopy_ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> +    /* TODO: Add addresses */
>      vmsg->payload.u64 = 0xcafe;
>      vmsg->size = sizeof(vmsg->payload.u64);
> +
> +    if (dev->postcopy_ufd == -1) {
> +        vu_panic(dev, "Userfaultfd not available: %s", strerror(errno));
> +        goto out;

We got an error but still goto out?  I feel we should reply with
some kind of error code when an error happens.

> +    }
> +    api_struct.api = UFFD_API;
> +    api_struct.features = 0;
> +    if (ioctl(dev->postcopy_ufd, UFFDIO_API, &api_struct)) {
> +        vu_panic(dev, "Failed UFFDIO_API: %s", strerror(errno));
> +        close(dev->postcopy_ufd);
> +        dev->postcopy_ufd = -1;
> +        goto out;

Same here.
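
One possible shape for this, as a hedged sketch only:
demo_open_userfaultfd() is a hypothetical helper, not part of
libvhost-user, and it assumes Linux with userfaultfd headers available.
It hands either a usable fd or a negative errno back to the caller,
which can then decide what to put in the reply instead of passing on an
fd of -1.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Open a userfaultfd and perform the API handshake.
 * Returns the fd on success, or -errno on failure
 * (for example when userfaultfd is unavailable or not permitted).
 */
static int demo_open_userfaultfd(void)
{
    struct uffdio_api api_struct = { .api = UFFD_API, .features = 0 };
    int ufd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    if (ufd == -1) {
        return -errno;
    }

    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
        int err = errno;    /* save it: close() may change errno */

        close(ufd);
        return -err;
    }

    assert(ufd >= 0);
    return ufd;
}
```

The caller could then set fd_num to 1 only when the result is
non-negative, and report the failure in the payload otherwise; how that
failure is encoded in the protocol is left open here.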

> +    }
> +    /* TODO: Stash feature flags somewhere */
> +out:
> +    /* Return a ufd to the QEMU */
> +    vmsg->fd_num = 1;
> +    vmsg->fds[0] = dev->postcopy_ufd;
>      return true; /* = send a reply */
>  }
>  
> diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> index 3987ce643d..3e8efdd919 100644
> --- a/contrib/libvhost-user/libvhost-user.h
> +++ b/contrib/libvhost-user/libvhost-user.h
> @@ -234,6 +234,9 @@ struct VuDev {
>       * re-initialize */
>      vu_panic_cb panic;
>      const VuDevIface *iface;
> +
> +    /* Postcopy data */
> +    int postcopy_ufd;
>  };
>  
>  typedef struct VuVirtqElement {
> -- 
> 2.13.5
> 

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu Dr. David Alan Gilbert (git)
@ 2017-08-29  8:30     ` Peter Xu
  2017-09-12 17:15       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Xu @ 2017-08-29  8:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:14PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> We need a better way, but at the moment we need the address of the
> mappings sent back to qemu so it can interpret the messages on the
> userfaultfd it reads.
> 
> Note: We don't ask for the default 'ack' reply since we've got our own.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 15 ++++++++-
>  docs/interop/vhost-user.txt           |  6 ++++
>  hw/virtio/trace-events                |  1 +
>  hw/virtio/vhost-user.c                | 57 ++++++++++++++++++++++++++++++++++-
>  4 files changed, 77 insertions(+), 2 deletions(-)
> 
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index e6ab059a03..5ec54f7d60 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -477,13 +477,26 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
>              DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
>                      __func__, i, reg_struct.range.start, reg_struct.range.len);
>              /* TODO: Stash 'zero' support flags somewhere */
> -            /* TODO: Get address back to QEMU */
>  
> +            /* TODO: We need to find a way for the qemu not to see the virtual
> +             * addresses of the clients, so as to keep better separation.
> +             */
> +            /* Return the address to QEMU so that it can translate the ufd
> +             * fault addresses back.
> +             */
> +            msg_region->userspace_addr = (uintptr_t)(mmap_addr +
> +                                                     dev_region->mmap_offset);
>          }
>  
>          close(vmsg->fds[i]);
>      }
>  
> +    if (dev->postcopy_listening) {
> +        /* Need to return the addresses - send the updated message back */
> +        vmsg->fd_num = 0;
> +        return true;
> +    }
> +
>      return false;
>  }
>  
> diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> index 73c3dd74db..b2a548c94d 100644
> --- a/docs/interop/vhost-user.txt
> +++ b/docs/interop/vhost-user.txt
> @@ -413,12 +413,18 @@ Master message types
>        Id: 5
>        Equivalent ioctl: VHOST_SET_MEM_TABLE
>        Master payload: memory regions description
> +      Slave payload: (postcopy only) memory regions description
>  
>        Sets the memory map regions on the slave so it can translate the vring
>        addresses. In the ancillary data there is an array of file descriptors
>        for each memory mapped region. The size and ordering of the fds matches
>        the number and ordering of memory regions.
>  
> +      When postcopy-listening has been received, SET_MEM_TABLE replies with
> +      the bases of the memory mapped regions to the master.  It must have mmap'd
> +      the regions and enabled userfaultfd on them.  Note NEED_REPLY_MASK
> +      is not set in this case.
> +
>   * VHOST_USER_SET_LOG_BASE
>  
>        Id: 6
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index f736c7c84f..63fd4a79cf 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -2,6 +2,7 @@
>  
>  # hw/virtio/vhost-user.c
>  vhost_user_postcopy_listen(void) ""
> +vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
>  
>  # hw/virtio/virtio.c
>  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 9178271ab2..2e4eb0864a 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -19,6 +19,7 @@
>  #include "qemu/sockets.h"
>  #include "migration/migration.h"
>  #include "migration/postcopy-ram.h"
> +#include "trace.h"
>  
>  #include <sys/ioctl.h>
>  #include <sys/socket.h>
> @@ -133,6 +134,7 @@ struct vhost_user {
>      int slave_fd;
>      NotifierWithReturn postcopy_notifier;
>      struct PostCopyFD  postcopy_fd;
> +    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
>  };
>  
>  static bool ioeventfd_enabled(void)
> @@ -300,11 +302,13 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
>  static int vhost_user_set_mem_table(struct vhost_dev *dev,
>                                      struct vhost_memory *mem)
>  {
> +    struct vhost_user *u = dev->opaque;
>      int fds[VHOST_MEMORY_MAX_NREGIONS];
>      int i, fd;
>      size_t fd_num = 0;
>      bool reply_supported = virtio_has_feature(dev->protocol_features,
> -                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> +                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
> +                           !u->postcopy_fd.handler;

(Nit: the indentation of these continuation lines looks off.)

>  
>      VhostUserMsg msg = {
>          .request = VHOST_USER_SET_MEM_TABLE,
> @@ -350,6 +354,57 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
>          return -1;
>      }
>  
> +    if (u->postcopy_fd.handler) {

It seems that after this handler is set, we never clean it up.  Do we
need to unset it somewhere? (maybe vhost_user_postcopy_end?)

> +        VhostUserMsg msg_reply;
> +        int region_i, reply_i;
> +        if (vhost_user_read(dev, &msg_reply) < 0) {
> +            return -1;
> +        }
> +
> +        if (msg_reply.request != VHOST_USER_SET_MEM_TABLE) {
> +            error_report("%s: Received unexpected msg type."
> +                         "Expected %d received %d", __func__,
> +                         VHOST_USER_SET_MEM_TABLE, msg_reply.request);
> +            return -1;
> +        }
> +        /* We're using the same structure, just reusing one of the
> +         * fields, so it should be the same size.
> +         */
> +        if (msg_reply.size != msg.size) {
> +            error_report("%s: Unexpected size for postcopy reply "
> +                         "%d vs %d", __func__, msg_reply.size, msg.size);
> +            return -1;
> +        }
> +
> +        memset(u->postcopy_client_bases, 0,
> +               sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
> +
> +        /* They're in the same order as the regions that were sent
> +         * but some of the regions were skipped (above) if they
> +         * didn't have fd's
> +        */
> +        for (reply_i = 0, region_i = 0;
> +             region_i < dev->mem->nregions;
> +             region_i++) {
> +            if (reply_i < fd_num &&
> +                msg_reply.payload.memory.regions[region_i].guest_phys_addr ==
                                                    ^^^^^^^^
                                          should this be reply_i?

(And maybe we can use pointers for the regions for better readability?)

> +                dev->mem->regions[region_i].guest_phys_addr) {
> +                u->postcopy_client_bases[region_i] =
> +                    msg_reply.payload.memory.regions[reply_i].userspace_addr;
> +                trace_vhost_user_set_mem_table_postcopy(
> +                    msg_reply.payload.memory.regions[reply_i].userspace_addr,
> +                    msg.payload.memory.regions[reply_i].userspace_addr,
> +                    reply_i, region_i);
> +                reply_i++;
> +            }
> +        }
> +        if (reply_i != fd_num) {
> +            error_report("%s: postcopy reply not fully consumed "
> +                         "%d vs %zd",
> +                         __func__, reply_i, fd_num);
> +            return -1;
> +        }
> +    }
>      if (reply_supported) {
>          return process_message_reply(dev, &msg);
>      }
> -- 
> 2.13.5
> 
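
To make the indexing question above concrete, here is a minimal standalone
sketch of the matching loop, with the reply indexed by reply_i and pointers
used for the two regions. "struct region" and match_reply_regions() are
made-up stand-ins for illustration, not QEMU code, and only the two fields
the loop touches are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two vhost region fields the loop uses. */
struct region {
    uint64_t guest_phys_addr;
    uint64_t userspace_addr;
};

/*
 * Fill bases[], which is indexed like dev_regions[], from the reply.
 * Regions that were sent without an fd have no entry in the reply and
 * keep a base of 0.
 * Returns 0 if every reply entry was consumed, -1 otherwise.
 */
int match_reply_regions(const struct region *dev_regions, size_t nregions,
                        const struct region *reply, size_t nreply,
                        uint64_t *bases)
{
    size_t region_i, reply_i = 0;

    assert(bases != NULL);
    for (region_i = 0; region_i < nregions; region_i++) {
        const struct region *dev_reg = &dev_regions[region_i];
        const struct region *reply_reg;

        bases[region_i] = 0;
        if (reply_i == nreply) {
            continue; /* reply exhausted; remaining regions had no fd */
        }
        /* The reply is denser than the region list, so use reply_i here. */
        reply_reg = &reply[reply_i];
        if (reply_reg->guest_phys_addr == dev_reg->guest_phys_addr) {
            bases[region_i] = reply_reg->userspace_addr;
            reply_i++;
        }
    }
    return reply_i == nreply ? 0 : -1;
}
```

With three regions where the middle one had no fd, the reply carries only
two entries, so indexing the reply by region_i would compare against an
entry that was never sent.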

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 19/32] vhost+postcopy: Resolve client address
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 19/32] vhost+postcopy: Resolve client address Dr. David Alan Gilbert (git)
@ 2017-08-30  5:28     ` Peter Xu
  2017-09-11 11:58       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Xu @ 2017-08-30  5:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:17PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Resolve fault addresses read off the clients UFD into RAMBlock
> and offset, and call back to the postcopy code to ask for the page.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/virtio/trace-events |  3 +++
>  hw/virtio/vhost-user.c | 30 +++++++++++++++++++++++++++++-
>  2 files changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 5067dee19b..f7d4b831fe 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -1,6 +1,9 @@
>  # See docs/devel/tracing.txt for syntax documentation.
>  
>  # hw/virtio/vhost-user.c
> +vhost_user_postcopy_fault_handler(const char *name, uint64_t fault_address, int nregions) "%s: @0x%"PRIx64" nregions:%d"
> +vhost_user_postcopy_fault_handler_loop(int i, uint64_t client_base, uint64_t size) "%d: client 0x%"PRIx64" +0x%"PRIx64
> +vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t rb_offset) "%d: region_offset: 0x%"PRIx64" rb_offset:0x%"PRIx64
>  vhost_user_postcopy_listen(void) ""
>  vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
>  vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index fbe2743298..2897ff70b3 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -816,7 +816,35 @@ out:
>  static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
>                                               void *ufd)
>  {
> -    return 0;
> +    struct vhost_dev *dev = pcfd->data;
> +    struct vhost_user *u = dev->opaque;
> +    struct uffd_msg *msg = ufd;
> +    uint64_t faultaddr = msg->arg.pagefault.address;
> +    RAMBlock *rb = NULL;
> +    uint64_t rb_offset;
> +    int i;
> +
> +    trace_vhost_user_postcopy_fault_handler(pcfd->idstr, faultaddr,
> +                                            dev->mem->nregions);
> +    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {

Should dev->mem->nregions always be the same as u->region_rb_len?

> +        trace_vhost_user_postcopy_fault_handler_loop(i,
> +                u->postcopy_client_bases[i], dev->mem->regions[i].memory_size);
> +        if (faultaddr >= u->postcopy_client_bases[i]) {
> +            /* Ofset of the fault address in the vhost region */
> +            uint64_t region_offset = faultaddr - u->postcopy_client_bases[i];
> +            if (region_offset <= dev->mem->regions[i].memory_size) {

Should be "<" rather than "<="?  Say:

Region 1: [0, 1M), size 1M
Region 2: [1M, 2M), size 1M

Looks like otherwise faultaddr=1M will fall into region 1, while it
should be region 2?

> +                rb_offset = region_offset + u->region_rb_offset[i];
> +                trace_vhost_user_postcopy_fault_handler_found(i,
> +                        region_offset, rb_offset);
> +                rb = u->region_rb[i];

Nit: this "rb" might be avoided if only used once.

> +                return postcopy_request_shared_page(pcfd, rb, faultaddr,
> +                                                    rb_offset);
> +            }
> +        }
> +    }
> +    error_report("%s: Failed to find region for fault %" PRIx64,
> +                 __func__, faultaddr);
> +    return -1;
>  }
>  
>  /*
> -- 
> 2.13.5
> 
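
For the "<" vs "<=" point, a small standalone sketch of the lookup with
half-open ranges may help. "struct client_region" and find_fault_region()
are hypothetical names used only for this illustration; they model just
the client base and the size of each region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: one vhost region as seen by the client. */
struct client_region {
    uint64_t client_base;  /* base address in the client's address space */
    uint64_t memory_size;  /* length of the region in bytes */
};

/*
 * Return the index of the region containing faultaddr, or -1 if no region
 * does.  Each region covers the half-open range
 * [client_base, client_base + memory_size).
 */
int find_fault_region(const struct client_region *regions, size_t nregions,
                      uint64_t faultaddr, uint64_t *region_offset)
{
    size_t i;

    assert(region_offset != NULL);
    for (i = 0; i < nregions; i++) {
        uint64_t off;

        if (faultaddr < regions[i].client_base) {
            continue;
        }
        off = faultaddr - regions[i].client_base;
        /* Strictly less than: an offset equal to the size is one past the end. */
        if (off < regions[i].memory_size) {
            *region_offset = off;
            return (int)i;
        }
    }
    return -1;
}
```

With two adjacent 1M regions, a fault at exactly 1M resolves to the second
region at offset 0. With "<=" it would resolve to the first region at an
offset one byte past its end.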

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 17/32] vhost+postcopy: Stash RAMBlock and offset
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 17/32] vhost+postcopy: Stash RAMBlock and offset Dr. David Alan Gilbert (git)
@ 2017-08-30  5:51     ` Peter Xu
  2017-09-13 15:59       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Xu @ 2017-08-30  5:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:15PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Stash the RAMBlock and offset for later use looking up
> addresses.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/virtio/trace-events |  1 +
>  hw/virtio/vhost-user.c | 30 ++++++++++++++++++++++++++++++
>  2 files changed, 31 insertions(+)
> 
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 63fd4a79cf..5067dee19b 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -3,6 +3,7 @@
>  # hw/virtio/vhost-user.c
>  vhost_user_postcopy_listen(void) ""
>  vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> +vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
>  
>  # hw/virtio/virtio.c
>  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 2e4eb0864a..fbe2743298 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -135,6 +135,14 @@ struct vhost_user {
>      NotifierWithReturn postcopy_notifier;
>      struct PostCopyFD  postcopy_fd;
>      uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
> +    /* Length of the region_rb and region_rb_offset arrays */
> +    size_t             region_rb_len;
> +    /* RAMBlock associated with a given region */
> +    RAMBlock         **region_rb;
> +    /* The offset from the start of the RAMBlock to the start of the
> +     * vhost region.
> +     */
> +    ram_addr_t        *region_rb_offset;
>  };
>  
>  static bool ioeventfd_enabled(void)
> @@ -319,6 +327,17 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
>          msg.flags |= VHOST_USER_NEED_REPLY_MASK;
>      }
>  
> +    if (u->region_rb_len < dev->mem->nregions) {
> +        u->region_rb = g_renew(RAMBlock*, u->region_rb, dev->mem->nregions);
> +        u->region_rb_offset = g_renew(ram_addr_t, u->region_rb_offset,
> +                                      dev->mem->nregions);
> +        memset(&(u->region_rb[u->region_rb_len]), '\0',
> +               sizeof(RAMBlock *) * (dev->mem->nregions - u->region_rb_len));
> +        memset(&(u->region_rb_offset[u->region_rb_len]), '\0',
> +               sizeof(ram_addr_t) * (dev->mem->nregions - u->region_rb_len));
> +        u->region_rb_len = dev->mem->nregions;
> +    }
> +
>      for (i = 0; i < dev->mem->nregions; ++i) {
>          struct vhost_memory_region *reg = dev->mem->regions + i;
>          ram_addr_t offset;
> @@ -327,8 +346,14 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
>          assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
>          mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
>                                       &offset);
> +        u->region_rb_offset[i] = offset;
> +        u->region_rb[i] = mr->ram_block;

Do we need to record this info even if fd <= 0?

>          fd = memory_region_get_fd(mr);
>          if (fd > 0) {
> +            trace_vhost_user_set_mem_table_withfd(fd_num, mr->name,
> +                                                  reg->memory_size,
> +                                                  reg->guest_phys_addr,
> +                                                  reg->userspace_addr, offset);
>              msg.payload.memory.regions[fd_num].userspace_addr = reg->userspace_addr;
>              msg.payload.memory.regions[fd_num].memory_size  = reg->memory_size;
>              msg.payload.memory.regions[fd_num].guest_phys_addr = reg->guest_phys_addr;
> @@ -992,6 +1017,11 @@ static int vhost_user_cleanup(struct vhost_dev *dev)
>          close(u->slave_fd);
>          u->slave_fd = -1;
>      }
> +    g_free(u->region_rb);
> +    u->region_rb = NULL;
> +    g_free(u->region_rb_offset);
> +    u->region_rb_offset = NULL;
> +    u->region_rb_len = 0;
>      g_free(u);
>      dev->opaque = 0;
>  
> -- 
> 2.13.5
> 
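
As a side note on the array handling, below is a rough plain-C sketch of
the grow-and-zero pattern the patch implements with g_renew(). It uses
realloc() instead of glib so that it stands alone; "struct region_map",
region_map_reserve() and region_map_record() are invented names. It
assumes, as the later fault handler patch seems to, that both arrays are
indexed by region number rather than by fd number.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the region_rb / region_rb_offset pair. */
struct region_map {
    size_t    len;     /* number of slots in both arrays */
    void    **rb;      /* stand-in for RAMBlock *, one per region */
    uint64_t *offset;  /* offset of the region within its block */
};

/*
 * Make sure both arrays have at least nregions slots.  Newly added slots
 * are zeroed; existing slots are preserved.  The arrays never shrink.
 * Returns 0 on success, -1 on allocation failure.
 */
int region_map_reserve(struct region_map *m, size_t nregions)
{
    void **rb;
    uint64_t *off;

    assert(m != NULL);
    if (nregions <= m->len) {
        return 0;
    }
    rb = realloc(m->rb, nregions * sizeof(*rb));
    if (!rb) {
        return -1;
    }
    m->rb = rb;
    off = realloc(m->offset, nregions * sizeof(*off));
    if (!off) {
        return -1;
    }
    m->offset = off;
    /* Zero only the tail that was just added. */
    memset(&m->rb[m->len], 0, (nregions - m->len) * sizeof(*rb));
    memset(&m->offset[m->len], 0, (nregions - m->len) * sizeof(*off));
    m->len = nregions;
    return 0;
}

/* Record the block and offset for region i, whether or not it has an fd. */
void region_map_record(struct region_map *m, size_t i, void *rb,
                       uint64_t offset)
{
    assert(i < m->len);
    m->rb[i] = rb;
    m->offset[i] = offset;
}
```

Recording every region keeps slot i of these arrays aligned with slot i of
the region list, at the cost of a few entries that are never looked up.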

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 22/32] vhost+postcopy: Add vhost waker
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 22/32] vhost+postcopy: Add vhost waker Dr. David Alan Gilbert (git)
@ 2017-08-30  5:55     ` Peter Xu
  2017-09-13 13:09       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Xu @ 2017-08-30  5:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:20PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Register a waker function in vhost-user code to be notified when
> pages arrive or requests to previously mapped pages get requested.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/virtio/trace-events |  3 +++
>  hw/virtio/vhost-user.c | 26 ++++++++++++++++++++++++++
>  2 files changed, 29 insertions(+)
> 
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index f7d4b831fe..adebf6dc6b 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -7,6 +7,9 @@ vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t
>  vhost_user_postcopy_listen(void) ""
>  vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
>  vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
> +vhost_user_postcopy_waker(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
> +vhost_user_postcopy_waker_found(uint64_t client_addr) "0x%"PRIx64
> +vhost_user_postcopy_waker_nomatch(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
>  
>  # hw/virtio/virtio.c
>  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 2897ff70b3..3bff33a1a6 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -847,6 +847,31 @@ static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
>      return -1;
>  }
>  
> +static int vhost_user_postcopy_waker(struct PostCopyFD *pcfd, RAMBlock *rb,
> +                                     uint64_t offset)
> +{
> +    struct vhost_dev *dev = pcfd->data;
> +    struct vhost_user *u = dev->opaque;
> +    int i;
> +
> +    trace_vhost_user_postcopy_waker(qemu_ram_get_idstr(rb), offset);
> +    /* Translate the offset into an address in the clients address space */
> +    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
> +        if (u->region_rb[i] == rb &&
> +            offset >= u->region_rb_offset[i] &&
> +            offset < (u->region_rb_offset[i] +
> +                      dev->mem->regions[i].memory_size)) {

Just curious: the checks against offset should only be for safety, right?
Is there a valid case where rb is correct but the offset falls outside
the range of that RAMBlock?

> +            uint64_t client_addr = (offset - u->region_rb_offset[i]) +
> +                                   u->postcopy_client_bases[i];
> +            trace_vhost_user_postcopy_waker_found(client_addr);
> +            return postcopy_wake_shared(pcfd, client_addr, rb);
> +        }
> +    }
> +
> +    trace_vhost_user_postcopy_waker_nomatch(qemu_ram_get_idstr(rb), offset);
> +    return 0;
> +}
> +
>  /*
>   * Called at the start of an inbound postcopy on reception of the
>   * 'advise' command.
> @@ -892,6 +917,7 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
>      u->postcopy_fd.fd = ufd;
>      u->postcopy_fd.data = dev;
>      u->postcopy_fd.handler = vhost_user_postcopy_fault_handler;
> +    u->postcopy_fd.waker = vhost_user_postcopy_waker;
>      u->postcopy_fd.idstr = "vhost-user"; /* Need to find unique name */
>      postcopy_register_shared_ufd(&u->postcopy_fd);
>      return 0;
> -- 
> 2.13.5
> 
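
One case where the offset check would matter, sketched below under the
assumption that a single RAMBlock can back more than one vhost region
(for example when a region list is split around a hole). "struct
mapped_region" and rb_offset_to_client_addr() are made-up names for this
illustration, and the block is identified by an opaque pointer only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-region data the waker consults. */
struct mapped_region {
    const void *rb;          /* identity of the backing block */
    uint64_t    rb_offset;   /* offset of the region start within the block */
    uint64_t    memory_size; /* length of the region in bytes */
    uint64_t    client_base; /* where the client mapped the region */
};

/*
 * Translate (rb, offset) into an address in the client's address space.
 * Returns 0 and stores the address on a match, -1 if no region covers it.
 */
int rb_offset_to_client_addr(const struct mapped_region *regions, size_t n,
                             const void *rb, uint64_t offset,
                             uint64_t *client_addr)
{
    size_t i;

    assert(client_addr != NULL);
    for (i = 0; i < n; i++) {
        const struct mapped_region *r = &regions[i];

        /* Matching rb alone is ambiguous if the block backs two regions. */
        if (r->rb == rb &&
            offset >= r->rb_offset &&
            offset - r->rb_offset < r->memory_size) {
            *client_addr = (offset - r->rb_offset) + r->client_base;
            return 0;
        }
    }
    return -1;
}
```

If that assumption holds, dropping the offset test would make the first
region with a matching rb win, and an offset belonging to the second
region would be translated against the wrong client base.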

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 25/32] vhost+postcopy: Lock around set_mem_table
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 25/32] vhost+postcopy: Lock around set_mem_table Dr. David Alan Gilbert (git)
@ 2017-08-30  6:50     ` Peter Xu
  2017-09-25 17:56       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Xu @ 2017-08-30  6:50 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:23PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> **HACK - better solution needed **
> We have the situation where:
> 
>      qemu                      bridge
> 
>      send set_mem_table
>                               map memory
>   a)                          mark area with UFD
>                               send reply with map addresses
>   b)                          start using
>   c) receive reply
> 
>   As soon as (a) happens qemu might start seeing faults
> from memory accesses (but doesn't until b); but it can't
> process those faults until (c) when it's received the
> mmap addresses.
> 
> Make the fault handler spin until it gets the reply in (c).
> 
> At the very least this needs some proper locks, but preferably
> we need to split the message.

I see discussions about the slave channel and an ack mechanism in the
previous post.  So it's still not adopted (although it looks doable)?
What's our plan going forward?
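
For reference, the ordering problem between (a) and (c) can be modelled
in isolation roughly as below. This is only a sketch of the "spin until
the reply arrives" idea using C11 atomics, not the code in the patch;
"struct client_bases", MAX_REGIONS and the three functions are invented
for the example. It covers a single publication only and does not handle
a second set_mem_table arriving while faults are being served.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 8 /* hypothetical limit for this sketch */

struct client_bases {
    uint64_t    base[MAX_REGIONS];
    atomic_bool valid; /* set once the reply in (c) has been processed */
};

/* Main thread, step (c): store the bases, then publish them. */
void bases_publish(struct client_bases *cb, const uint64_t *bases, size_t n)
{
    size_t i;

    assert(n <= MAX_REGIONS);
    for (i = 0; i < n; i++) {
        cb->base[i] = bases[i];
    }
    /* Release: the stores above become visible before 'valid' does. */
    atomic_store_explicit(&cb->valid, true, memory_order_release);
}

/* Fault thread: returns false if the reply has not been processed yet. */
bool bases_try_get(struct client_bases *cb, size_t i, uint64_t *base)
{
    assert(i < MAX_REGIONS);
    if (!atomic_load_explicit(&cb->valid, memory_order_acquire)) {
        return false;
    }
    *base = cb->base[i];
    return true;
}

/* Fault thread: busy-wait until the bases are available. */
uint64_t bases_wait_get(struct client_bases *cb, size_t i)
{
    uint64_t base;

    while (!bases_try_get(cb, i, &base)) {
        /* spin; a non-hack version would block on a lock or condition */
    }
    return base;
}
```

Splitting the message, or acking over the slave channel, would avoid the
wait entirely because the fault thread would never run before the bases
are known.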

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 26/32] vhost: Add VHOST_USER_POSTCOPY_END message
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 26/32] vhost: Add VHOST_USER_POSTCOPY_END message Dr. David Alan Gilbert (git)
@ 2017-08-30  6:55     ` Peter Xu
  2017-09-11 11:31       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Xu @ 2017-08-30  6:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:24PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> This message is sent just before the end of postcopy to get the
> client to stop using userfault since we wont respond to any more
> requests.  It should close userfaultfd so that any other pages
> get mapped to the backing file automatically by the kernel, since
> at this point we know we've received everything.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

(I feel like the title should be for "vub", not "vhost"?)

Reviewed-by: Peter Xu <peterx@redhat.com>

> ---
>  contrib/libvhost-user/libvhost-user.c | 23 +++++++++++++++++++++++
>  contrib/libvhost-user/libvhost-user.h |  1 +
>  docs/interop/vhost-user.txt           |  8 ++++++++
>  hw/virtio/vhost-user.c                |  1 +
>  4 files changed, 33 insertions(+)
> 
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index d816851c6d..23bff47649 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -67,6 +67,7 @@ vu_request_to_string(int req)
>          REQ(VHOST_USER_SET_VRING_ENDIAN),
>          REQ(VHOST_USER_POSTCOPY_ADVISE),
>          REQ(VHOST_USER_POSTCOPY_LISTEN),
> +        REQ(VHOST_USER_POSTCOPY_END),
>          REQ(VHOST_USER_MAX),
>      };
>  #undef REQ
> @@ -893,6 +894,26 @@ vu_set_postcopy_listen(VuDev *dev, VhostUserMsg *vmsg)
>      vmsg->payload.u64 = 0; /* Success */
>      return true;
>  }
> +
> +static bool
> +vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
> +{
> +    DPRINT("%s: Entry\n", __func__);
> +    dev->postcopy_listening = false;
> +    if (dev->postcopy_ufd > 0) {
> +        close(dev->postcopy_ufd);
> +        dev->postcopy_ufd = -1;
> +        DPRINT("%s: Done close\n", __func__);
> +    }
> +
> +    vmsg->fd_num = 0;
> +    vmsg->payload.u64 = 0;
> +    vmsg->size = sizeof(vmsg->payload.u64);
> +    vmsg->flags = VHOST_USER_VERSION |  VHOST_USER_REPLY_MASK;
> +    DPRINT("%s: exit\n", __func__);
> +    return true;
> +}
> +
>  static bool
>  vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>  {
> @@ -962,6 +983,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>          return vu_set_postcopy_advise(dev, vmsg);
>      case VHOST_USER_POSTCOPY_LISTEN:
>          return vu_set_postcopy_listen(dev, vmsg);
> +    case VHOST_USER_POSTCOPY_END:
> +        return vu_set_postcopy_end(dev, vmsg);
>      default:
>          vmsg_close_fds(vmsg);
>          vu_panic(dev, "Unhandled request: %d", vmsg->request);
> diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> index 29c11ba56c..a78596e6fd 100644
> --- a/contrib/libvhost-user/libvhost-user.h
> +++ b/contrib/libvhost-user/libvhost-user.h
> @@ -68,6 +68,7 @@ typedef enum VhostUserRequest {
>      VHOST_USER_SET_VRING_ENDIAN = 23,
>      VHOST_USER_POSTCOPY_ADVISE  = 24,
>      VHOST_USER_POSTCOPY_LISTEN  = 25,
> +    VHOST_USER_POSTCOPY_END     = 26,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> index b2a548c94d..d6586e0b43 100644
> --- a/docs/interop/vhost-user.txt
> +++ b/docs/interop/vhost-user.txt
> @@ -627,6 +627,14 @@ Master message types
>  
>        Master advises slave that a transition to postcopy mode has happened.
>  
> + * VHOST_USER_POSTCOPY_END
> +      Id: 26
> +      Slave payload: u64
> +
> +      Master advises that postcopy migration has now completed.  The
> +      slave must disable the userfaultfd. The response is an acknowledgement
> +      only.
> +
>  Slave message types
>  -------------------
>  
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 4d03383a66..c2e55be0fd 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -71,6 +71,7 @@ typedef enum VhostUserRequest {
>      VHOST_USER_SET_VRING_ENDIAN = 23,
>      VHOST_USER_POSTCOPY_ADVISE  = 24,
>      VHOST_USER_POSTCOPY_LISTEN  = 25,
> +    VHOST_USER_POSTCOPY_END     = 26,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> -- 
> 2.13.5
> 

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 27/32] vhost+postcopy: Wire up POSTCOPY_END notify
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 27/32] vhost+postcopy: Wire up POSTCOPY_END notify Dr. David Alan Gilbert (git)
@ 2017-08-30  6:57     ` Peter Xu
  0 siblings, 0 replies; 94+ messages in thread
From: Peter Xu @ 2017-08-30  6:57 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Thu, Aug 24, 2017 at 08:27:25PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Wire up a call to VHOST_USER_POSTCOPY_END message to the vhost clients
> right before we ask the listener thread to shutdown.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

> ---
>  hw/virtio/trace-events   |  2 ++
>  hw/virtio/vhost-user.c   | 30 ++++++++++++++++++++++++++++++
>  migration/postcopy-ram.c |  5 +++++
>  migration/postcopy-ram.h |  1 +
>  4 files changed, 38 insertions(+)
> 
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 065822c70a..5b599617a1 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -1,6 +1,8 @@
>  # See docs/devel/tracing.txt for syntax documentation.
>  
>  # hw/virtio/vhost-user.c
> +vhost_user_postcopy_end_entry(void) ""
> +vhost_user_postcopy_end_exit(void) ""
>  vhost_user_postcopy_fault_handler(const char *name, uint64_t fault_address, int nregions) "%s: @0x%"PRIx64" nregions:%d"
>  vhost_user_postcopy_fault_handler_loop(int i, uint64_t client_base, uint64_t size) "%d: client 0x%"PRIx64" +0x%"PRIx64
>  vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t rb_offset) "%d: region_offset: 0x%"PRIx64" rb_offset:0x%"PRIx64
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index c2e55be0fd..d4461459fe 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -965,6 +965,33 @@ static int vhost_user_postcopy_listen(struct vhost_dev *dev, Error **errp)
>      return 0;
>  }
>  
> +/*
> + * Called at the end of postcopy
> + */
> +static int vhost_user_postcopy_end(struct vhost_dev *dev, Error **errp)
> +{
> +    VhostUserMsg msg = {
> +        .request = VHOST_USER_POSTCOPY_END,
> +        .flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY_MASK,
> +    };
> +    int ret;
> +
> +    trace_vhost_user_postcopy_end_entry();
> +    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
> +        error_setg(errp, "Failed to send postcopy_end to vhost");
> +        return -1;
> +    }
> +
> +    ret = process_message_reply(dev, &msg);
> +    if (ret) {
> +        error_setg(errp, "Failed to receive reply to postcopy_end");
> +        return ret;
> +    }
> +    trace_vhost_user_postcopy_end_exit();
> +
> +    return 0;
> +}
> +
>  static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
>                                          void *opaque)
>  {
> @@ -990,6 +1017,9 @@ static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
>      case POSTCOPY_NOTIFY_INBOUND_LISTEN:
>          return vhost_user_postcopy_listen(dev, pnd->errp);
>  
> +    case POSTCOPY_NOTIFY_INBOUND_END:
> +        return vhost_user_postcopy_end(dev, pnd->errp);
> +
>      default:
>          /* We ignore notifications we don't know */
>          break;
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 7d0786ff04..28791cf1f1 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -337,7 +337,12 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>  
>      if (mis->have_fault_thread) {
>          uint64_t tmp64;
> +        Error *local_err = NULL;
>  
> +        if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_END, &local_err)) {
> +            error_report_err(local_err);
> +            return -1;
> +        }
>          if (qemu_ram_foreach_block(cleanup_range, mis)) {
>              return -1;
>          }
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index ecf731c689..d0dc838001 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -130,6 +130,7 @@ enum PostcopyNotifyReason {
>      POSTCOPY_NOTIFY_PROBE = 0,
>      POSTCOPY_NOTIFY_INBOUND_ADVISE,
>      POSTCOPY_NOTIFY_INBOUND_LISTEN,
> +    POSTCOPY_NOTIFY_INBOUND_END,
>  };
>  
>  struct PostcopyNotifyData {
> -- 
> 2.13.5
> 

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 06/32] postcopy: use UFFDIO_ZEROPAGE only when available
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 06/32] postcopy: use UFFDIO_ZEROPAGE only when available Dr. David Alan Gilbert (git)
@ 2017-08-30  9:57     ` Marc-André Lureau
  2017-09-07 10:55       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-30  9:57 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

Hi

On Thu, Aug 24, 2017 at 9:27 PM, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Use a flag on the RAMBlock to state whether it has the
> UFFDIO_ZEROPAGE capability, use it when it's available.
>
> This allows the use of postcopy on tmpfs as well as hugepage
> backed files.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  exec.c                    | 15 +++++++++++++++
>  include/exec/cpu-common.h |  3 +++
>  migration/postcopy-ram.c  | 14 +++++++++++---
>  3 files changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/exec.c b/exec.c
> index 35b4cea2ed..80c3d1d121 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -103,6 +103,11 @@ static MemoryRegion io_mem_unassigned;
>   */
>  #define RAM_RESIZEABLE (1 << 2)
>
> +/* UFFDIO_ZEROPAGE is available on this RAMBlock to atomically
> + * zero the page and wake waiting processes.
> + * (Set during postcopy)
> + */
> +#define RAM_UF_ZEROPAGE (1 << 3)
>  #endif
>
>  #ifdef TARGET_PAGE_BITS_VARY
> @@ -1705,6 +1710,16 @@ bool qemu_ram_is_shared(RAMBlock *rb)
>      return rb->flags & RAM_SHARED;
>  }
>
> +bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
> +{
> +    return rb->flags & RAM_UF_ZEROPAGE;
> +}
> +
> +void qemu_ram_set_uf_zeroable(RAMBlock *rb)
> +{
> +    rb->flags |= RAM_UF_ZEROPAGE;
> +}
> +
>  /* Called with iothread lock held.  */
>  void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
>  {
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 0d861a6289..24d335f95d 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -73,6 +73,9 @@ void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
>  void qemu_ram_unset_idstr(RAMBlock *block);
>  const char *qemu_ram_get_idstr(RAMBlock *rb);
>  bool qemu_ram_is_shared(RAMBlock *rb);
> +bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
> +void qemu_ram_set_uf_zeroable(RAMBlock *rb);
> +
>  size_t qemu_ram_pagesize(RAMBlock *block);
>  size_t qemu_ram_pagesize_largest(void);
>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 7a414ebad8..640b72d86d 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -408,6 +408,11 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
>          error_report("%s userfault: Region doesn't support COPY", __func__);
>          return -1;
>      }
> +    if (reg_struct.ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE)) {
> +        RAMBlock *rb = qemu_ram_block_by_name(block_name);
> +        qemu_ram_set_uf_zeroable(rb);
> +    }
> +
>

extra empty line

>      return 0;
>  }
> @@ -617,11 +622,14 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>  int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
>                               RAMBlock *rb)
>  {
> +    size_t pagesize = qemu_ram_pagesize(rb);
>      trace_postcopy_place_page_zero(host);
>
> -    if (qemu_ram_pagesize(rb) == getpagesize()) {

Is this check dropped intentionally?

> -        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, getpagesize(),
> -                                rb)) {
> +    /* Normal RAMBlocks can zero a page using UFFDIO_ZEROPAGE
> +     * but it's not available for everything (e.g. hugetlbpages)
> +     */
> +    if (qemu_ram_is_uf_zeroable(rb)) {
> +        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, pagesize, rb)) {
>              int e = errno;
>              error_report("%s: %s zero host: %p",
>                           __func__, strerror(e), host);
> --
> 2.13.5
>
>



-- 
Marc-André Lureau


* Re: [Qemu-devel] [RFC v2 02/32] vhub: Only process received packets on started queues
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 02/32] vhub: Only process received packets on started queues Dr. David Alan Gilbert (git)
@ 2017-08-30  9:59     ` Marc-André Lureau
  0 siblings, 0 replies; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-30  9:59 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

On Thu, Aug 24, 2017 at 9:27 PM, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Only process received packets if the queue has been started.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  tests/vhost-user-bridge.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/tests/vhost-user-bridge.c b/tests/vhost-user-bridge.c
> index 1e5b5ca3da..324abee53d 100644
> --- a/tests/vhost-user-bridge.c
> +++ b/tests/vhost-user-bridge.c
> @@ -277,6 +277,7 @@ vubr_backend_recv_cb(int sock, void *ctx)
>      DPRINT("    hdrlen = %d\n", hdrlen);
>
>      if (!vu_queue_enabled(dev, vq) ||
> +        !vu_queue_started(dev, vq) ||
>          !vu_queue_avail_bytes(dev, vq, hdrlen, 0)) {
>          DPRINT("Got UDP packet, but no available descriptors on RX virtq.\n");
>          return;

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>



-- 
Marc-André Lureau


* Re: [Qemu-devel] [RFC v2 09/32] vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 09/32] vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message Dr. David Alan Gilbert (git)
@ 2017-08-30 10:07     ` Marc-André Lureau
  2017-09-07 11:04       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-30 10:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

Hi

On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Wire up a notifier to send a VHOST_USER_POSTCOPY_ADVISE
> message on an incoming advise.
>
> Later patches will fill in the behaviour/contents of the
> message.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 21 ++++++++++++---
>  contrib/libvhost-user/libvhost-user.h |  6 ++++-
>  docs/interop/vhost-user.txt           |  9 +++++++
>  hw/virtio/vhost-user.c                | 48 +++++++++++++++++++++++++++++++++++
>  migration/postcopy-ram.h              |  1 +
>  migration/savevm.c                    |  6 +++++
>  6 files changed, 86 insertions(+), 5 deletions(-)
>
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index 201b9846e9..8bbdf5fb40 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -42,9 +42,6 @@ vu_request_to_string(int req)
>          REQ(VHOST_USER_NONE),
>          REQ(VHOST_USER_GET_FEATURES),
>          REQ(VHOST_USER_SET_FEATURES),
> -        REQ(VHOST_USER_NONE),
> -        REQ(VHOST_USER_GET_FEATURES),
> -        REQ(VHOST_USER_SET_FEATURES),

nice cleanup ;)

>          REQ(VHOST_USER_SET_OWNER),
>          REQ(VHOST_USER_RESET_OWNER),
>          REQ(VHOST_USER_SET_MEM_TABLE),
> @@ -62,7 +59,10 @@ vu_request_to_string(int req)
>          REQ(VHOST_USER_GET_QUEUE_NUM),
>          REQ(VHOST_USER_SET_VRING_ENABLE),
>          REQ(VHOST_USER_SEND_RARP),
> -        REQ(VHOST_USER_INPUT_GET_CONFIG),
> +        REQ(VHOST_USER_SET_SLAVE_REQ_FD),
> +        REQ(VHOST_USER_IOTLB_MSG),
> +        REQ(VHOST_USER_SET_VRING_ENDIAN),
> +        REQ(VHOST_USER_POSTCOPY_ADVISE),
>          REQ(VHOST_USER_MAX),
>      };
>  #undef REQ
> @@ -744,6 +744,17 @@ vu_set_vring_enable_exec(VuDev *dev, VhostUserMsg *vmsg)
>  }
>
>  static bool
> +vu_set_postcopy_advise(VuDev *dev, VhostUserMsg *vmsg)
> +{
> +    /* TODO: Open ufd, pass it back in the request
> +     * TODO: Add addresses

Extra EOL space

> +     */
> +    vmsg->payload.u64 = 0xcafe;
> +    vmsg->size = sizeof(vmsg->payload.u64);
> +    return true; /* = send a reply */
> +}
> +
> +static bool
>  vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>  {
>      int do_reply = 0;
> @@ -808,6 +819,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>          return vu_set_vring_enable_exec(dev, vmsg);
>      case VHOST_USER_NONE:
>          break;
> +    case VHOST_USER_POSTCOPY_ADVISE:
> +        return vu_set_postcopy_advise(dev, vmsg);
>      default:
>          vmsg_close_fds(vmsg);
>          vu_panic(dev, "Unhandled request: %d", vmsg->request);
> diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> index 95d0d34a28..3987ce643d 100644
> --- a/contrib/libvhost-user/libvhost-user.h
> +++ b/contrib/libvhost-user/libvhost-user.h
> @@ -62,7 +62,11 @@ typedef enum VhostUserRequest {
>      VHOST_USER_GET_QUEUE_NUM = 17,
>      VHOST_USER_SET_VRING_ENABLE = 18,
>      VHOST_USER_SEND_RARP = 19,
> -    VHOST_USER_INPUT_GET_CONFIG = 20,
> +    VHOST_USER_NET_SET_MTU      = 20,
> +    VHOST_USER_SET_SLAVE_REQ_FD = 21,
> +    VHOST_USER_IOTLB_MSG        = 22,
> +    VHOST_USER_SET_VRING_ENDIAN = 23,
> +    VHOST_USER_POSTCOPY_ADVISE  = 24,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>
> diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> index a279560eb0..dad2a1b343 100644
> --- a/docs/interop/vhost-user.txt
> +++ b/docs/interop/vhost-user.txt
> @@ -606,6 +606,15 @@ Master message types
>        and expect this message once (per VQ) during device configuration
>        (ie. before the master starts the VQ).
>
> + * VHOST_USER_POSTCOPY_ADVISE
> +      Id: 24
> +      Master payload: N/A
> +      Slave payload: userfault fd + u64
> +
> +      Master advises slave that a migration with postcopy enabled is underway,
> +      the slave must open a userfaultfd for later use.
> +      Note that at this stage the migration is still in precopy mode.
> +
>  Slave message types
>  -------------------
>
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index c51bbd1296..7063e4df61 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -67,6 +67,7 @@ typedef enum VhostUserRequest {
>      VHOST_USER_SET_SLAVE_REQ_FD = 21,
>      VHOST_USER_IOTLB_MSG = 22,
>      VHOST_USER_SET_VRING_ENDIAN = 23,
> +    VHOST_USER_POSTCOPY_ADVISE  = 24,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>
> @@ -724,6 +725,50 @@ out:
>      return ret;
>  }
>
> +/*
> + * Called at the start of an inbound postcopy on reception of the
> + * 'advise' command.
> + */
> +static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
> +{
> +    struct vhost_user *u = dev->opaque;
> +    CharBackend *chr = u->chr;
> +    int ufd;
> +    VhostUserMsg msg = {
> +        .request = VHOST_USER_POSTCOPY_ADVISE,
> +        .flags = VHOST_USER_VERSION,
> +    };
> +
> +    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
> +        error_setg(errp, "Failed to send postcopy_advise to vhost");
> +        return -1;
> +    }
> +
> +    if (vhost_user_read(dev, &msg) < 0) {
> +        error_setg(errp, "Failed to get postcopy_advise reply from vhost");
> +        return -1;
> +    }
> +
> +    if (msg.request != VHOST_USER_POSTCOPY_ADVISE) {
> +        error_setg(errp, "Unexpected msg type. Expected %d received %d",
> +                     VHOST_USER_POSTCOPY_ADVISE, msg.request);
> +        return -1;
> +    }
> +
> +    if (msg.size != sizeof(msg.payload.u64)) {
> +        error_setg(errp, "Received bad msg size.");
> +        return -1;
> +    }
> +    ufd = qemu_chr_fe_get_msgfd(chr);
> +    if (ufd < 0) {
> +        error_setg(errp, "%s: Failed to get ufd", __func__);
> +        return -1;
> +    }
> +
> +    /* TODO: register ufd with userfault thread */
> +    return 0;
> +}
> +
>  static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
>                                          void *opaque)
>  {
> @@ -743,6 +788,9 @@ static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
>          }
>          break;
>
> +    case POSTCOPY_NOTIFY_INBOUND_ADVISE:
> +        return vhost_user_postcopy_advise(dev, pnd->errp);
> +
>      default:
>          /* We ignore notifications we don't know */
>          break;
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index d688411674..70d4b09659 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -128,6 +128,7 @@ void postcopy_infrastructure_init(void);
>   */
>  enum PostcopyNotifyReason {
>      POSTCOPY_NOTIFY_PROBE = 0,
> +    POSTCOPY_NOTIFY_INBOUND_ADVISE,
>  };
>
>  struct PostcopyNotifyData {
> diff --git a/migration/savevm.c b/migration/savevm.c
> index fdd15fa0a7..d35911731d 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1343,6 +1343,7 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
>  {
>      PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_ADVISE);
>      uint64_t remote_pagesize_summary, local_pagesize_summary, remote_tps;
> +    Error *local_err = NULL;
>
>      trace_loadvm_postcopy_handle_advise();
>      if (ps != POSTCOPY_INCOMING_NONE) {
> @@ -1390,6 +1391,11 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
>          return -1;
>      }
>
> +    if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_ADVISE, &local_err)) {
> +        error_report_err(local_err);
> +        return -1;
> +    }
> +
>      if (ram_postcopy_incoming_init(mis)) {
>          return -1;
>      }
> --
> 2.13.5
>
>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>




-- 
Marc-André Lureau


* Re: [Qemu-devel] [RFC v2 10/32] vhub: Support sending fds back to qemu
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 10/32] vhub: Support sending fds back to qemu Dr. David Alan Gilbert (git)
@ 2017-08-30 10:22     ` Marc-André Lureau
  2017-09-07 11:31       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-30 10:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

Hi

On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Allow replies with fds (for postcopy)
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index 8bbdf5fb40..47884c0a15 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -213,6 +213,30 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
>  {
>      int rc;
>      uint8_t *p = (uint8_t *)vmsg;
> +    char control[CMSG_SPACE(VHOST_MEMORY_MAX_NREGIONS * sizeof(int))] = { };
> +    struct iovec iov = {
> +        .iov_base = (char *)vmsg,
> +        .iov_len = VHOST_USER_HDR_SIZE,
> +    };
> +    struct msghdr msg = {
> +        .msg_iov = &iov,
> +        .msg_iovlen = 1,
> +        .msg_control = control,
> +    };
> +    struct cmsghdr *cmsg;
> +
> +    memset(control, 0, sizeof(control));
> +    if (vmsg->fds) {

This is always going to be true, right (fds is an array in VhostUserMsg)?
Check vmsg->fd_num > 0 instead?

I would also add a check or assert(vmsg->fd_num <= VHOST_MEMORY_MAX_NREGIONS)

> +        size_t fdsize = vmsg->fd_num * sizeof(int);
> +        msg.msg_controllen = CMSG_SPACE(fdsize);
> +        cmsg = CMSG_FIRSTHDR(&msg);
> +        cmsg->cmsg_len = CMSG_LEN(fdsize);
> +        cmsg->cmsg_level = SOL_SOCKET;
> +        cmsg->cmsg_type = SCM_RIGHTS;
> +        memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
> +    } else {
> +        msg.msg_controllen = 0;
> +    }
>
>      /* Set the version in the flags when sending the reply */
>      vmsg->flags &= ~VHOST_USER_VERSION_MASK;
> @@ -220,7 +244,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
>      vmsg->flags |= VHOST_USER_REPLY_MASK;
>
>      do {
> -        rc = write(conn_fd, p, VHOST_USER_HDR_SIZE);
> +        rc = sendmsg(conn_fd, &msg, 0);
>      } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
>
>      do {
> @@ -313,6 +337,7 @@ vu_get_features_exec(VuDev *dev, VhostUserMsg *vmsg)
>      }
>
>      vmsg->size = sizeof(vmsg->payload.u64);
> +    vmsg->fd_num = 0;
>
>      DPRINT("Sending back to guest u64: 0x%016"PRIx64"\n", vmsg->payload.u64);
>
> @@ -454,6 +479,7 @@ vu_set_log_base_exec(VuDev *dev, VhostUserMsg *vmsg)
>      dev->log_size = log_mmap_size;
>
>      vmsg->size = sizeof(vmsg->payload.u64);
> +    vmsg->fd_num = 0;
>
>      return true;
>  }
> @@ -698,6 +724,7 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg *vmsg)
>
>      vmsg->payload.u64 = features;
>      vmsg->size = sizeof(vmsg->payload.u64);
> +    vmsg->fd_num = 0;
>
>      return true;
>  }
> --
> 2.13.5
>
>

other than that
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
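
To illustrate the fd_num check suggested above, here is a minimal standalone
sketch, not the libvhost-user code itself. The names send_with_fds and MAX_FDS
are made up for the example (MAX_FDS stands in for VHOST_MEMORY_MAX_NREGIONS);
the control message is only attached when there is at least one fd to pass.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical limit; stands in for VHOST_MEMORY_MAX_NREGIONS. */
#define MAX_FDS 8

/*
 * Send len bytes of buf over a Unix socket, attaching fd_num file
 * descriptors as SCM_RIGHTS ancillary data when fd_num > 0.
 * Returns the result of sendmsg().
 */
ssize_t send_with_fds(int sock, const void *buf, size_t len,
                      const int *fds, size_t fd_num)
{
    char control[CMSG_SPACE(MAX_FDS * sizeof(int))];
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
    ssize_t rc;

    /* Guard the fixed-size control buffer. */
    assert(fd_num <= MAX_FDS);

    if (fd_num > 0) {
        size_t fdsize = fd_num * sizeof(int);
        struct cmsghdr *cmsg;

        memset(control, 0, sizeof(control));
        msg.msg_control = control;
        msg.msg_controllen = CMSG_SPACE(fdsize);
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_len = CMSG_LEN(fdsize);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        memcpy(CMSG_DATA(cmsg), fds, fdsize);
    }
    /* With fd_num == 0, msg_control stays NULL and msg_controllen 0. */

    do {
        rc = sendmsg(sock, &msg, 0);
    } while (rc < 0 && (errno == EINTR || errno == EAGAIN));

    return rc;
}
```

Testing the count rather than the array pointer also means the replies that
set fd_num = 0 never send an empty SCM_RIGHTS message.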



-- 
Marc-André Lureau


* Re: [Qemu-devel] [RFC v2 11/32] vhub: Open userfaultfd
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 11/32] vhub: Open userfaultfd Dr. David Alan Gilbert (git)
  2017-08-29  6:40     ` Peter Xu
@ 2017-08-30 10:30     ` Marc-André Lureau
  2017-09-07 16:36       ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-30 10:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

I would rather use a "libvhost-user:" commit message prefix (same for the
other libvhost-user patches in this series)

On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Open a userfaultfd (on a postcopy_advise) and send it back in
> the reply to the qemu for it to monitor.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 26 +++++++++++++++++++++++---
>  contrib/libvhost-user/libvhost-user.h |  3 +++
>  2 files changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index 47884c0a15..f9b5b12b28 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -15,6 +15,7 @@
>
>  #include <qemu/osdep.h>
>  #include <sys/eventfd.h>
> +#include <sys/syscall.h>
>  #include <linux/vhost.h>
>
>  #include "qemu/atomic.h"
> @@ -773,11 +774,30 @@ vu_set_vring_enable_exec(VuDev *dev, VhostUserMsg *vmsg)
>  static bool
>  vu_set_postcopy_advise(VuDev *dev, VhostUserMsg *vmsg)
>  {
> -    /* TODO: Open ufd, pass it back in the request
> -     * TODO: Add addresses
> -     */
> +    struct uffdio_api api_struct;
> +
> +    dev->postcopy_ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

This will likely fail to compile on !Linux, could you add some
appropriate #ifdef?

> +    /* TODO: Add addresses */
>      vmsg->payload.u64 = 0xcafe;
>      vmsg->size = sizeof(vmsg->payload.u64);
> +
> +    if (dev->postcopy_ufd == -1) {
> +        vu_panic(dev, "Userfaultfd not available: %s", strerror(errno));
> +        goto out;
> +    }
> +    api_struct.api = UFFD_API;
> +    api_struct.features = 0;
> +    if (ioctl(dev->postcopy_ufd, UFFDIO_API, &api_struct)) {
> +        vu_panic(dev, "Failed UFFDIO_API: %s", strerror(errno));
> +        close(dev->postcopy_ufd);
> +        dev->postcopy_ufd = -1;
> +        goto out;
> +    }
> +    /* TODO: Stash feature flags somewhere */
> +out:
> +    /* Return a ufd to the QEMU */
> +    vmsg->fd_num = 1;
> +    vmsg->fds[0] = dev->postcopy_ufd;
>      return true; /* = send a reply */
>  }
>
> diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> index 3987ce643d..3e8efdd919 100644
> --- a/contrib/libvhost-user/libvhost-user.h
> +++ b/contrib/libvhost-user/libvhost-user.h
> @@ -234,6 +234,9 @@ struct VuDev {
>       * re-initialize */
>      vu_panic_cb panic;
>      const VuDevIface *iface;
> +
> +    /* Postcopy data */
> +    int postcopy_ufd;
>  };
>
>  typedef struct VuVirtqElement {
> --
> 2.13.5
>
>
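
On the #ifdef request for the userfaultfd syscall above: one possible shape is
sketched below, assuming a guard on __linux__ and __NR_userfaultfd and Linux
headers new enough to ship linux/userfaultfd.h. The helper name
open_postcopy_ufd is made up for the example and the guard QEMU ends up using
may well differ.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#if defined(__linux__)
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>
#endif

/*
 * Open a userfaultfd and negotiate the API version.
 * Returns the fd on success, or -1 with errno set
 * (ENOSYS on hosts built without userfaultfd support).
 */
int open_postcopy_ufd(void)
{
#if defined(__linux__) && defined(__NR_userfaultfd)
    struct uffdio_api api_struct = { .api = UFFD_API, .features = 0 };
    int ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    if (ufd == -1) {
        return -1;
    }
    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
        int e = errno;

        /* Do not leak the fd when the API handshake fails. */
        close(ufd);
        errno = e;
        return -1;
    }
    return ufd;
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

With a helper like this, the reply path could also set fd_num to 0 when no
ufd was opened instead of sending -1 as a descriptor.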



-- 
Marc-André Lureau


* Re: [Qemu-devel] [RFC v2 14/32] vhost+postcopy: Transmit 'listen' to client
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 14/32] vhost+postcopy: Transmit 'listen' to client Dr. David Alan Gilbert (git)
@ 2017-08-30 10:37     ` Marc-André Lureau
  2017-09-07 12:10       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-30 10:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Notify the vhost-user client on reception of the 'postcopy-listen'
> event from the source.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 21 +++++++++++++++++++++
>  contrib/libvhost-user/libvhost-user.h |  2 ++
>  docs/interop/vhost-user.txt           |  6 ++++++
>  hw/virtio/trace-events                |  3 +++
>  hw/virtio/vhost-user.c                | 30 ++++++++++++++++++++++++++++++
>  migration/postcopy-ram.h              |  1 +
>  migration/savevm.c                    |  7 +++++++
>  7 files changed, 70 insertions(+)
>
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index f9b5b12b28..e8accf11db 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -15,7 +15,9 @@
>
>  #include <qemu/osdep.h>
>  #include <sys/eventfd.h>
> +#include <sys/ioctl.h>
>  #include <sys/syscall.h>
> +#include <linux/userfaultfd.h>
>  #include <linux/vhost.h>
>

These includes belong in an earlier patch (the one that first uses userfaultfd)

>  #include "qemu/atomic.h"
> @@ -64,6 +66,7 @@ vu_request_to_string(int req)
>          REQ(VHOST_USER_IOTLB_MSG),
>          REQ(VHOST_USER_SET_VRING_ENDIAN),
>          REQ(VHOST_USER_POSTCOPY_ADVISE),
> +        REQ(VHOST_USER_POSTCOPY_LISTEN),
>          REQ(VHOST_USER_MAX),
>      };
>  #undef REQ
> @@ -802,6 +805,22 @@ out:
>  }
>
>  static bool
> +vu_set_postcopy_listen(VuDev *dev, VhostUserMsg *vmsg)
> +{
> +    vmsg->payload.u64 = -1;
> +    vmsg->size = sizeof(vmsg->payload.u64);
> +
> +    if (dev->nregions) {
> +        vu_panic(dev, "Regions already registered at postcopy-listen");
> +        return true;
> +    }
> +    dev->postcopy_listening = true;
> +
> +    vmsg->flags = VHOST_USER_VERSION |  VHOST_USER_REPLY_MASK;
> +    vmsg->payload.u64 = 0; /* Success */
> +    return true;
> +}
> +static bool
>  vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>  {
>      int do_reply = 0;
> @@ -868,6 +887,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>          break;
>      case VHOST_USER_POSTCOPY_ADVISE:
>          return vu_set_postcopy_advise(dev, vmsg);
> +    case VHOST_USER_POSTCOPY_LISTEN:
> +        return vu_set_postcopy_listen(dev, vmsg);
>      default:
>          vmsg_close_fds(vmsg);
>          vu_panic(dev, "Unhandled request: %d", vmsg->request);
> diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> index 3e8efdd919..29c11ba56c 100644
> --- a/contrib/libvhost-user/libvhost-user.h
> +++ b/contrib/libvhost-user/libvhost-user.h
> @@ -67,6 +67,7 @@ typedef enum VhostUserRequest {
>      VHOST_USER_IOTLB_MSG        = 22,
>      VHOST_USER_SET_VRING_ENDIAN = 23,
>      VHOST_USER_POSTCOPY_ADVISE  = 24,
> +    VHOST_USER_POSTCOPY_LISTEN  = 25,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>
> @@ -237,6 +238,7 @@ struct VuDev {
>
>      /* Postcopy data */
>      int postcopy_ufd;
> +    bool postcopy_listening;
>  };
>
>  typedef struct VuVirtqElement {
> diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> index dad2a1b343..73c3dd74db 100644
> --- a/docs/interop/vhost-user.txt
> +++ b/docs/interop/vhost-user.txt
> @@ -615,6 +615,12 @@ Master message types
>        the slave must open a userfaultfd for later use.
>        Note that at this stage the migration is still in precopy mode.
>
> + * VHOST_USER_POSTCOPY_LISTEN
> +      Id: 25
> +      Master payload: N/A
> +
> +      Master advises slave that a transition to postcopy mode has happened.
> +
>  Slave message types
>  -------------------
>
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 775461ae98..f736c7c84f 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -1,5 +1,8 @@
>  # See docs/devel/tracing.txt for syntax documentation.
>
> +# hw/virtio/vhost-user.c
> +vhost_user_postcopy_listen(void) ""
> +
>  # hw/virtio/virtio.c
>  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
>  virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) "vq %p elem %p len %u idx %u"
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index b7898f8939..9178271ab2 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -69,6 +69,7 @@ typedef enum VhostUserRequest {
>      VHOST_USER_IOTLB_MSG = 22,
>      VHOST_USER_SET_VRING_ENDIAN = 23,
>      VHOST_USER_POSTCOPY_ADVISE  = 24,
> +    VHOST_USER_POSTCOPY_LISTEN  = 25,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>
> @@ -788,6 +789,32 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
>      return 0;
>  }
>
> +/*
> + * Called at the switch to postcopy on reception of the 'listen' command.
> + */
> +static int vhost_user_postcopy_listen(struct vhost_dev *dev, Error **errp)
> +{
> +    int ret;
> +    VhostUserMsg msg = {
> +        .request = VHOST_USER_POSTCOPY_LISTEN,
> +        .flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY_MASK,
> +    };
> +
> +    trace_vhost_user_postcopy_listen();
> +    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
> +        error_setg(errp, "Failed to send postcopy_listen to vhost");
> +        return -1;
> +    }
> +
> +    ret = process_message_reply(dev, &msg);
> +    if (ret) {
> +        error_setg(errp, "Failed to receive reply to postcopy_listen");
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
>  static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
>                                          void *opaque)
>  {
> @@ -810,6 +837,9 @@ static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
>      case POSTCOPY_NOTIFY_INBOUND_ADVISE:
>          return vhost_user_postcopy_advise(dev, pnd->errp);
>
> +    case POSTCOPY_NOTIFY_INBOUND_LISTEN:
> +        return vhost_user_postcopy_listen(dev, pnd->errp);
> +
>      default:
>          /* We ignore notifications we don't know */
>          break;
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index 28c216cc7a..873c147b68 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -129,6 +129,7 @@ void postcopy_infrastructure_init(void);
>  enum PostcopyNotifyReason {
>      POSTCOPY_NOTIFY_PROBE = 0,
>      POSTCOPY_NOTIFY_INBOUND_ADVISE,
> +    POSTCOPY_NOTIFY_INBOUND_LISTEN,
>  };
>
>  struct PostcopyNotifyData {
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d35911731d..72f084e10d 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1557,6 +1557,8 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>  {
>      PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_LISTENING);
>      trace_loadvm_postcopy_handle_listen();
> +    Error *local_err = NULL;
> +
>      if (ps != POSTCOPY_INCOMING_ADVISE && ps != POSTCOPY_INCOMING_DISCARD) {
>          error_report("CMD_POSTCOPY_LISTEN in wrong postcopy state (%d)", ps);
>          return -1;
> @@ -1578,6 +1580,11 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>          return -1;
>      }
>
> +    if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_LISTEN, &local_err)) {
> +        error_report_err(local_err);
> +        return -1;
> +    }
> +
>      if (mis->have_listen_thread) {
>          error_report("CMD_POSTCOPY_RAM_LISTEN already has a listen thread");
>          return -1;
> --
> 2.13.5
>
>


Looks good to me otherwise,
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>


-- 
Marc-André Lureau


* Re: [Qemu-devel] [RFC v2 28/32] postcopy: Allow shared memory
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 28/32] postcopy: Allow shared memory Dr. David Alan Gilbert (git)
@ 2017-08-30 10:39     ` Marc-André Lureau
  2017-09-07 12:15       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-30 10:39 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

Hi

On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Now that we have the mechanisms in here, allow shared memory in a
> postcopy.
>
> Note that QEMU can't tell who all the users of shared regions are
> and thus can't tell whether all the users of the shared regions
> have appropriate support for postcopy.  Those devices that explicitly
> support shared memory (e.g. vhost-user) must check, but it doesn't
> stop weirder configurations causing problems.
>

Other users should have their own migration blocker, I guess.

> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>


> ---
>  migration/postcopy-ram.c | 6 ------
>  1 file changed, 6 deletions(-)
>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 28791cf1f1..89c3aadda1 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -138,12 +138,6 @@ static int test_ramblock_postcopiable(const char *block_name, void *host_addr,
>      RAMBlock *rb = qemu_ram_block_by_name(block_name);
>      size_t pagesize = qemu_ram_pagesize(rb);
>
> -    if (qemu_ram_is_shared(rb)) {
> -        error_report("Postcopy on shared RAM (%s) is not yet supported",
> -                     block_name);
> -        return 1;
> -    }
> -
>      if (length % pagesize) {
>          error_report("Postcopy requires RAM blocks to be a page size multiple,"
>                       " block %s is 0x" RAM_ADDR_FMT " bytes with a "
> --
> 2.13.5
>
>



-- 
Marc-André Lureau


* Re: [Qemu-devel] [RFC v2 15/32] vhost+postcopy: Register new regions with the ufd
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 15/32] vhost+postcopy: Register new regions with the ufd Dr. David Alan Gilbert (git)
@ 2017-08-30 10:42     ` Marc-André Lureau
  2017-09-08 14:50       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-30 10:42 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

Use "libvhost-user: " commit title tag/prefix?

On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> When new regions are sent to the client using SET_MEM_TABLE, register
> them with the userfaultfd.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
>
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index e8accf11db..e6ab059a03 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -449,6 +449,38 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
>                     dev_region->mmap_addr);
>          }
>
> +        if (dev->postcopy_listening) {
> +            /* We should already have an open ufd need to mark each memory
> +             * range as ufd.
> +             * Note: Do we need any madvises? Well it's not been accessed
> +             * yet, still probably need no THP to be safe, discard to be safe?
> +             */
> +            struct uffdio_register reg_struct;
> +            reg_struct.range.start = (uintptr_t)dev_region->mmap_addr;
> +            reg_struct.range.len = dev_region->size + dev_region->mmap_offset;
> +            reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
> +
> +            if (ioctl(dev->postcopy_ufd, UFFDIO_REGISTER, &reg_struct)) {
> +                vu_panic(dev, "%s: Failed to userfault region %d "
> +                              "@%p + %zx: (ufd=%d)%s\n",
> +                         __func__, i,
> +                         dev_region->mmap_addr,
> +                         dev_region->size + dev_region->mmap_offset,
> +                         dev->postcopy_ufd, strerror(errno));
> +                continue;

vu_panic() is meant for unrecoverable errors, so I would suggest returning here

> +            }
> +            if (!(reg_struct.ioctls & ((__u64)1 << _UFFDIO_COPY))) {
> +                vu_panic(dev, "%s Region (%d) doesn't support COPY",
> +                         __func__, i);
> +                continue;
> +            }
> +            DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
> +                    __func__, i, reg_struct.range.start, reg_struct.range.len);
> +            /* TODO: Stash 'zero' support flags somewhere */
> +            /* TODO: Get address back to QEMU */
> +
> +        }
> +
>          close(vmsg->fds[i]);
>      }

This patch would be nicer if it also compiled on non-Linux hosts and on systems without userfault support.
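
For illustration only (not part of the posted patch), a minimal sketch of
both suggestions: the registration is compiled only where the userfault
headers exist, and any failure is reported to the caller so it can stop
instead of moving on to the next region. The helper name
register_region_ufd() and the macro SKETCH_HAVE_USERFAULT are hypothetical;
only the ioctl and the flag names come from the patch above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__linux__)
#include <sys/syscall.h>        /* provides __NR_userfaultfd when available */
#endif

#if defined(__linux__) && defined(__NR_userfaultfd)
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>
#define SKETCH_HAVE_USERFAULT 1 /* hypothetical build-time switch */
#endif

/*
 * Hypothetical helper: register [addr, addr + len) with the userfault fd.
 * Returns true only if registration worked and UFFDIO_COPY is supported;
 * on builds without userfault it always returns false.
 */
static bool
register_region_ufd(int ufd, void *addr, uint64_t len)
{
#ifdef SKETCH_HAVE_USERFAULT
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };

    if (ioctl(ufd, UFFDIO_REGISTER, &reg)) {
        return false;           /* caller should panic and return */
    }
    if (!(reg.ioctls & ((__u64)1 << _UFFDIO_COPY))) {
        return false;           /* region cannot be filled with COPY */
    }
    return true;
#else
    (void)ufd;
    (void)addr;
    (void)len;
    return false;
#endif
}
```

In vu_set_mem_table_exec() the caller could then do something along the
lines of "if (!register_region_ufd(...)) { vu_panic(...); return false; }",
assuming that a false return is the right way to abort the handler.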


-- 
Marc-André Lureau

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 24/32] vub+postcopy: madvises
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 24/32] vub+postcopy: madvises Dr. David Alan Gilbert (git)
@ 2017-08-30 10:48     ` Marc-André Lureau
  2017-09-07 12:30       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-30 10:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

Hi

A subject such as "libvhost-user: madvises for postcopy" would be nicer, IMHO.

On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Clear the area and turn off THP.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index 5ec54f7d60..d816851c6d 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -450,11 +450,39 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
>          }
>
>          if (dev->postcopy_listening) {
> +            int ret;
>              /* We should already have an open ufd need to mark each memory
>               * range as ufd.
> -             * Note: Do we need any madvises? Well it's not been accessed
> -             * yet, still probably need no THP to be safe, discard to be safe?
>               */
> +
> +            /* Discard any mapping we have here; note I can't use MADV_REMOVE
> +             * or fallocate to make the hole since I don't want to lose
> +             * data that's already arrived in the shared process.
> +             * TODO: How to do hugepage
> +             */
> +            ret = madvise((void *)dev_region->mmap_addr,
> +                          dev_region->size + dev_region->mmap_offset,
> +                          MADV_DONTNEED);
> +            if (ret) {
> +                fprintf(stderr,
> +                        "%s: Failed to madvise(DONTNEED) region %d: %s\n",
> +                        __func__, i, strerror(errno));
> +            }
> +            /* Turn off transparent hugepages so we don't lose wakeups
> +             * in neighbouring pages.
> +             * TODO: Turn this back on later.
> +             */
> +            ret = madvise((void *)dev_region->mmap_addr,
> +                          dev_region->size + dev_region->mmap_offset,
> +                          MADV_NOHUGEPAGE);
> +            if (ret) {
> +                /* Note: This can happen legally on kernels that are configured
> +                 * without madvise'able hugepages
> +                 */
> +                fprintf(stderr,
> +                        "%s: Failed to madvise(NOHUGEPAGE) region %d: %s\n",
> +                        __func__, i, strerror(errno));
> +            }
>              struct uffdio_register reg_struct;
>              reg_struct.range.start = (uintptr_t)dev_region->mmap_addr;
>              reg_struct.range.len = dev_region->size + dev_region->mmap_offset;
> --
> 2.13.5
>

Are the errors meant to be non-fatal? The patch looks OK to me, despite the TODOs :).
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>





-- 
Marc-André Lureau

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 29/32] vhost-user: Claim support for postcopy
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 29/32] vhost-user: Claim support for postcopy Dr. David Alan Gilbert (git)
@ 2017-08-30 10:50     ` Marc-André Lureau
  0 siblings, 0 replies; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-30 10:50 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

Hi

Use the "libvhost-user: " prefix, so we don't confuse this with QEMU's own vhost-user code.

On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Tell QEMU we understand the protocol features needed for postcopy.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
>
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index 23bff47649..290748733b 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -144,6 +144,35 @@ vmsg_close_fds(VhostUserMsg *vmsg)
>      }
>  }
>
> +/* A test to see if we have userfault available */
> +static bool
> +have_userfault(void)
> +{
> +#if defined(__linux__) && defined(__NR_userfaultfd) &&\
> +        defined(UFFD_FEATURE_MISSING_SHMEM) &&\
> +        defined(UFFD_FEATURE_MISSING_HUGETLBFS)
> +    /* Now test the kernel we're running on really has the features */
> +    int ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> +    struct uffdio_api api_struct;
> +    if (ufd < 0) {
> +        return false;
> +    }
> +
> +    api_struct.api = UFFD_API;
> +    api_struct.features = UFFD_FEATURE_MISSING_SHMEM |
> +                          UFFD_FEATURE_MISSING_HUGETLBFS;
> +    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> +        close(ufd);
> +        return false;
> +    }
> +    close(ufd);
> +    return true;
> +
> +#else
> +    return false;
> +#endif
> +}
> +
>  static bool
>  vu_message_read(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
>  {
> @@ -796,6 +825,10 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg *vmsg)
>  {
>      uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD;
>
> +    if (have_userfault()) {
> +        features |= 1ULL << VHOST_USER_PROTOCOL_F_PAGEFAULT;
> +    }
> +
>      if (dev->iface->get_protocol_features) {
>          features |= dev->iface->get_protocol_features(dev);
>      }
> --
> 2.13.5
>
>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>




-- 
Marc-André Lureau

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started
  2017-08-24 19:26   ` [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started Dr. David Alan Gilbert (git)
  2017-08-24 23:10     ` Marc-André Lureau
@ 2017-08-30 13:02     ` Michael S. Tsirkin
  2017-08-30 13:13       ` Marc-André Lureau
  1 sibling, 1 reply; 94+ messages in thread
From: Michael S. Tsirkin @ 2017-08-30 13:02 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, marcandre.lureau,
	quintela, peterx, lvivier, aarcange, felipe

> Subject: Re: [RFC v2 01/32] vhu: vu_queue_started

I mean, how did we end up with "vhu"? I never meant that to happen :(
It's merely in the commit title but maybe we can come up with a
better-sounding way to abbreviate vhost user.

-- 
MST

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started
  2017-08-30 13:02     ` Michael S. Tsirkin
@ 2017-08-30 13:13       ` Marc-André Lureau
  2017-09-05 12:58         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Marc-André Lureau @ 2017-08-30 13:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Dr. David Alan Gilbert (git),
	qemu-devel, maxime coquelin, a perevalov, quintela, peterx,
	lvivier, aarcange, felipe

Hi

----- Original Message -----
> > Subject: Re: [RFC v2 01/32] vhu: vu_queue_started
> 
> I mean, how did we end up with "vhu"? I never meant that to happen :(
> It's merely in the commit title but maybe we can come up with a
> better-sounding way to abbreviate vhost user.

vhu, no idea :)

I would suggest to use libvhost-user: prefix in commit messages for contrib/libvhost-user changes.

vhost-user: for qemu vhost-user changes.

vhost-user-bridge: (or vub:), vhost-user-scsi: (or vus:)  for those backends.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram
  2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
                     ` (31 preceding siblings ...)
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 32/32] postcopy shared docs Dr. David Alan Gilbert (git)
@ 2017-09-01 13:34   ` Alexey Perevalov
  2017-09-01 13:42     ` Maxime Coquelin
  32 siblings, 1 reply; 94+ messages in thread
From: Alexey Perevalov @ 2017-09-01 13:34 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git),
	qemu-devel, maxime.coquelin, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

Hello David,

You wrote in previous version:

>We've had a postcopy migrate work now, with a few hacks we're still
>cleaning up, both on vhost-user-bridge and dpdk; so I'll get this
>updated and reposted.

I would like to know more about the DPDK work. Do you know whether somebody is assigned to that task?



On 08/24/2017 10:26 PM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Hi,
>    This is a RFC/WIP series that enables postcopy migration
> with shared memory to a vhost-user process.
> It's based off current-head + Alexey's bitmap series
>
> It's tested with vhost-user-bridge and a dpdk (modified by Maxime
> that will get posted separately) - both very lightly.
>
> It's still got a few very rough edges, but it successfully migrates
> with both normal and huge pages (2M).
>
> The major difference over v1 is that there's a set of code
> that merges vhost regions together on the qemu side so that
> we get a single hugepage region on the PC spanning the 640k
> hole (the hole hopefully isn't accessed by the client,
> but the client used to align around it anyway).
>
> It's also got a lot of cleanups from the comments from v1
> but there's still a few things that need work.
> In particular, there's still the hack around qemu waiting
> for the set_mem_table to come back; I also worry what would
> happen if a set-mem-table was triggered during a migrate;
> I suspect it would break badly.
>
> One problem that didn't cause a problem was madvises for hugepages;
> because we register userfault directly after mmap'ing the
> region in the client, we have no pages mapped and hence
> the madvise's/fallocate's are fortunately not compulsory.
> Still, I'd like a way to do it, it would feel safer.
>
> A copy of this code, based off the current 2.10.0-rc4
> together with Alexey's bitmap code is available here:
>      https://github.com/dagrh/qemu/tree/vhost-wipv2
>
> Dave
>
> Dr. David Alan Gilbert (32):
>    vhu: vu_queue_started
>    vhub: Only process received packets on started queues
>    migrate: Update ram_block_discard_range for shared
>    qemu_ram_block_host_offset
>    migration/ram: ramblock_recv_bitmap_test_byte_offset
>    postcopy: use UFFDIO_ZEROPAGE only when available
>    postcopy: Add notifier chain
>    postcopy: Add vhost-user flag for postcopy and check it
>    vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message
>    vhub: Support sending fds back to qemu
>    vhub: Open userfaultfd
>    postcopy: Allow registering of fd handler
>    vhost+postcopy: Register shared ufd with postcopy
>    vhost+postcopy: Transmit 'listen' to client
>    vhost+postcopy: Register new regions with the ufd
>    vhost+postcopy: Send address back to qemu
>    vhost+postcopy: Stash RAMBlock and offset
>    vhost+postcopy: Send requests to source for shared pages
>    vhost+postcopy: Resolve client address
>    postcopy: wake shared
>    postcopy: postcopy_notify_shared_wake
>    vhost+postcopy: Add vhost waker
>    vhost+postcopy: Call wakeups
>    vub+postcopy: madvises
>    vhost+postcopy: Lock around set_mem_table
>    vhost: Add VHOST_USER_POSTCOPY_END message
>    vhost+postcopy: Wire up POSTCOPY_END notify
>    postcopy: Allow shared memory
>    vhost-user: Claim support for postcopy
>    vhost: Merge neighbouring hugepage regions where appropriate
>    vhost: Don't break merged regions on small remove/non-adds
>    postcopy shared docs
>
>   contrib/libvhost-user/libvhost-user.c | 226 ++++++++++++++++++++-
>   contrib/libvhost-user/libvhost-user.h |  22 ++-
>   docs/devel/migration.txt              |  39 ++++
>   docs/interop/vhost-user.txt           |  39 ++++
>   exec.c                                |  60 ++++--
>   hw/virtio/trace-events                |  27 +++
>   hw/virtio/vhost-user.c                | 326 +++++++++++++++++++++++++++++-
>   hw/virtio/vhost.c                     | 121 +++++++++++-
>   include/exec/cpu-common.h             |   4 +
>   migration/migration.c                 |   3 +
>   migration/migration.h                 |   4 +
>   migration/postcopy-ram.c              | 359 +++++++++++++++++++++++++++-------
>   migration/postcopy-ram.h              |  69 +++++++
>   migration/ram.c                       |   5 +
>   migration/ram.h                       |   1 +
>   migration/savevm.c                    |  13 ++
>   migration/trace-events                |   6 +
>   tests/vhost-user-bridge.c             |   1 +
>   trace-events                          |   3 +
>   vl.c                                  |   2 +
>   20 files changed, 1241 insertions(+), 89 deletions(-)
>

-- 
Best regards,
Alexey Perevalov

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram
  2017-09-01 13:34   ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Alexey Perevalov
@ 2017-09-01 13:42     ` Maxime Coquelin
  2017-10-16  8:32       ` Alexey Perevalov
  0 siblings, 1 reply; 94+ messages in thread
From: Maxime Coquelin @ 2017-09-01 13:42 UTC (permalink / raw)
  To: Alexey Perevalov, Dr. David Alan Gilbert (git),
	qemu-devel, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

Hello Alexey,

On 09/01/2017 03:34 PM, Alexey Perevalov wrote:
> Hello David,
> 
> You wrote in previous version:
> 
>> We've had a postcopy migrate work now, with a few hacks we're still
>> cleaning up, both on vhost-user-bridge and dpdk; so I'll get this
>> updated and reposted.
> 
> I would like to know more about the DPDK work. Do you know whether
> somebody is assigned to that task?

I did the DPDK (rough) prototype, you may find it here:
https://gitlab.com/mcoquelin/dpdk-next-virtio/commits/postcopy_proto_v1

Cheers,
Maxime

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started
  2017-08-30 13:13       ` Marc-André Lureau
@ 2017-09-05 12:58         ` Dr. David Alan Gilbert
  2017-09-05 13:01           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-05 12:58 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Michael S. Tsirkin, qemu-devel, maxime coquelin, a perevalov,
	quintela, peterx, lvivier, aarcange, felipe

* Marc-André Lureau (marcandre.lureau@redhat.com) wrote:
> Hi
> 
> ----- Original Message -----
> > > Subject: Re: [RFC v2 01/32] vhu: vu_queue_started
> > 
> > I mean, how did we end up with "vhu"? I never meant that to happen :(
> > It's merely in the commit title but maybe we can come up with a
> > better-sounding way to abbreviate vhost user.
> 
> vhu, no idea :)
> 
> I would suggest to use libvhost-user: prefix in commit messages for contrib/libvhost-user changes.
> 
> vhost-user: for qemu vhost-user changes.

Changed to vhost-user.

Dave

> vhost-user-bridge: (or vub:), vhost-user-scsi: (or vus:)  for those backends.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started
  2017-09-05 12:58         ` Dr. David Alan Gilbert
@ 2017-09-05 13:01           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-05 13:01 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Michael S. Tsirkin, qemu-devel, maxime coquelin, a perevalov,
	quintela, peterx, lvivier, aarcange, felipe

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * Marc-André Lureau (marcandre.lureau@redhat.com) wrote:
> > Hi
> > 
> > ----- Original Message -----
> > > > Subject: Re: [RFC v2 01/32] vhu: vu_queue_started
> > > 
> > > I mean, how did we end up with "vhu"? I never meant that to happen :(
> > > It's merely in the commit title but maybe we can come up with a
> > > better-sounding way to abbreviate vhost user.
> > 
> > vhu, no idea :)
> > 
> > I would suggest to use libvhost-user: prefix in commit messages for contrib/libvhost-user changes.
> > 
> > vhost-user: for qemu vhost-user changes.
> 
> Changed to vhost-user.

Oops, I meant: changed to libvhost-user for this one.

Dave

> Dave
> 
> > vhost-user-bridge: (or vub:), vhost-user-scsi: (or vus:)  for those backends.
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 06/32] postcopy: use UFFDIO_ZEROPAGE only when available
  2017-08-30  9:57     ` Marc-André Lureau
@ 2017-09-07 10:55       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 10:55 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> Hi
> 
> On Thu, Aug 24, 2017 at 9:27 PM, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > Use a flag on the RAMBlock to state whether it has the
> > UFFDIO_ZEROPAGE capability, use it when it's available.
> >
> > This allows the use of postcopy on tmpfs as well as hugepage
> > backed files.
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  exec.c                    | 15 +++++++++++++++
> >  include/exec/cpu-common.h |  3 +++
> >  migration/postcopy-ram.c  | 14 +++++++++++---
> >  3 files changed, 29 insertions(+), 3 deletions(-)
> >
> > diff --git a/exec.c b/exec.c
> > index 35b4cea2ed..80c3d1d121 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -103,6 +103,11 @@ static MemoryRegion io_mem_unassigned;
> >   */
> >  #define RAM_RESIZEABLE (1 << 2)
> >
> > +/* UFFDIO_ZEROPAGE is available on this RAMBlock to atomically
> > + * zero the page and wake waiting processes.
> > + * (Set during postcopy)
> > + */
> > +#define RAM_UF_ZEROPAGE (1 << 3)
> >  #endif
> >
> >  #ifdef TARGET_PAGE_BITS_VARY
> > @@ -1705,6 +1710,16 @@ bool qemu_ram_is_shared(RAMBlock *rb)
> >      return rb->flags & RAM_SHARED;
> >  }
> >
> > +bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
> > +{
> > +    return rb->flags & RAM_UF_ZEROPAGE;
> > +}
> > +
> > +void qemu_ram_set_uf_zeroable(RAMBlock *rb)
> > +{
> > +    rb->flags |= RAM_UF_ZEROPAGE;
> > +}
> > +
> >  /* Called with iothread lock held.  */
> >  void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
> >  {
> > diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> > index 0d861a6289..24d335f95d 100644
> > --- a/include/exec/cpu-common.h
> > +++ b/include/exec/cpu-common.h
> > @@ -73,6 +73,9 @@ void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
> >  void qemu_ram_unset_idstr(RAMBlock *block);
> >  const char *qemu_ram_get_idstr(RAMBlock *rb);
> >  bool qemu_ram_is_shared(RAMBlock *rb);
> > +bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
> > +void qemu_ram_set_uf_zeroable(RAMBlock *rb);
> > +
> >  size_t qemu_ram_pagesize(RAMBlock *block);
> >  size_t qemu_ram_pagesize_largest(void);
> >
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 7a414ebad8..640b72d86d 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -408,6 +408,11 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> >          error_report("%s userfault: Region doesn't support COPY", __func__);
> >          return -1;
> >      }
> > +    if (reg_struct.ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE)) {
> > +        RAMBlock *rb = qemu_ram_block_by_name(block_name);
> > +        qemu_ram_set_uf_zeroable(rb);
> > +    }
> > +
> >
> 
> extra empty line

Thanks, gone.

> >      return 0;
> >  }
> > @@ -617,11 +622,14 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> >  int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> >                               RAMBlock *rb)
> >  {
> > +    size_t pagesize = qemu_ram_pagesize(rb);
> >      trace_postcopy_place_page_zero(host);
> >
> > -    if (qemu_ram_pagesize(rb) == getpagesize()) {
> 
> > Is this check dropped intentionally?

Yes. Previously we relied on the rule that hugepages can't be zeroed this
way, so we compared the RAMBlock's page size with the host page size. Now
we use the capability flags returned by the uffdio registration of this
RAMBlock, which tell us whether UFFDIO_ZEROPAGE can be used on the block.
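
For illustration only, a minimal sketch of that capability test on the
'ioctls' mask that UFFDIO_REGISTER fills in. The function name
region_supports_zeropage() is hypothetical; the bit test is the same one
the patch applies before calling qemu_ram_set_uf_zeroable().

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: 'ioctls' is the uffdio_register.ioctls value the
 * kernel returned for a registered range. The range can be zero-filled
 * with UFFDIO_ZEROPAGE only if the corresponding bit is set.
 */
static bool
region_supports_zeropage(uint64_t ioctls)
{
    return (ioctls & ((uint64_t)1 << _UFFDIO_ZEROPAGE)) != 0;
}
```

A caller would record the result once per RAMBlock at registration time
and consult it later when placing a zero page, rather than inferring the
answer from the page size.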

Dave

> > -        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, getpagesize(),
> > -                                rb)) {
> > +    /* Normal RAMBlocks can zero a page using UFFDIO_ZEROPAGE
> > +     * but it's not available for everything (e.g. hugetlbpages)
> > +     */
> > +    if (qemu_ram_is_uf_zeroable(rb)) {
> > +        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, pagesize, rb)) {
> >              int e = errno;
> >              error_report("%s: %s zero host: %p",
> >                           __func__, strerror(e), host);
> > --
> > 2.13.5
> >
> >
> 
> 
> 
> -- 
> Marc-André Lureau
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 09/32] vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message
  2017-08-30 10:07     ` Marc-André Lureau
@ 2017-09-07 11:04       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 11:04 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> Hi
> 
> On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > Wire up a notifier to send a VHOST_USER_POSTCOPY_ADVISE
> > message on an incoming advise.
> >
> > Later patches will fill in the behaviour/contents of the
> > message.
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 21 ++++++++++++---
> >  contrib/libvhost-user/libvhost-user.h |  6 ++++-
> >  docs/interop/vhost-user.txt           |  9 +++++++
> >  hw/virtio/vhost-user.c                | 48 +++++++++++++++++++++++++++++++++++
> >  migration/postcopy-ram.h              |  1 +
> >  migration/savevm.c                    |  6 +++++
> >  6 files changed, 86 insertions(+), 5 deletions(-)
> >
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index 201b9846e9..8bbdf5fb40 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -42,9 +42,6 @@ vu_request_to_string(int req)
> >          REQ(VHOST_USER_NONE),
> >          REQ(VHOST_USER_GET_FEATURES),
> >          REQ(VHOST_USER_SET_FEATURES),
> > -        REQ(VHOST_USER_NONE),
> > -        REQ(VHOST_USER_GET_FEATURES),
> > -        REQ(VHOST_USER_SET_FEATURES),
> 
> nice cleanup ;)
> 
> >          REQ(VHOST_USER_SET_OWNER),
> >          REQ(VHOST_USER_RESET_OWNER),
> >          REQ(VHOST_USER_SET_MEM_TABLE),
> > @@ -62,7 +59,10 @@ vu_request_to_string(int req)
> >          REQ(VHOST_USER_GET_QUEUE_NUM),
> >          REQ(VHOST_USER_SET_VRING_ENABLE),
> >          REQ(VHOST_USER_SEND_RARP),
> > -        REQ(VHOST_USER_INPUT_GET_CONFIG),
> > +        REQ(VHOST_USER_SET_SLAVE_REQ_FD),
> > +        REQ(VHOST_USER_IOTLB_MSG),
> > +        REQ(VHOST_USER_SET_VRING_ENDIAN),
> > +        REQ(VHOST_USER_POSTCOPY_ADVISE),
> >          REQ(VHOST_USER_MAX),
> >      };
> >  #undef REQ
> > @@ -744,6 +744,17 @@ vu_set_vring_enable_exec(VuDev *dev, VhostUserMsg *vmsg)
> >  }
> >
> >  static bool
> > +vu_set_postcopy_advise(VuDev *dev, VhostUserMsg *vmsg)
> > +{
> > +    /* TODO: Open ufd, pass it back in the request
> > +     * TODO: Add addresses
> 
> Extra EOL space

Gone.

> > +     */
> > +    vmsg->payload.u64 = 0xcafe;
> > +    vmsg->size = sizeof(vmsg->payload.u64);
> > +    return true; /* = send a reply */
> > +}
> > +
> > +static bool
> >  vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >  {
> >      int do_reply = 0;
> > @@ -808,6 +819,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >          return vu_set_vring_enable_exec(dev, vmsg);
> >      case VHOST_USER_NONE:
> >          break;
> > +    case VHOST_USER_POSTCOPY_ADVISE:
> > +        return vu_set_postcopy_advise(dev, vmsg);
> >      default:
> >          vmsg_close_fds(vmsg);
> >          vu_panic(dev, "Unhandled request: %d", vmsg->request);
> > diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> > index 95d0d34a28..3987ce643d 100644
> > --- a/contrib/libvhost-user/libvhost-user.h
> > +++ b/contrib/libvhost-user/libvhost-user.h
> > @@ -62,7 +62,11 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_GET_QUEUE_NUM = 17,
> >      VHOST_USER_SET_VRING_ENABLE = 18,
> >      VHOST_USER_SEND_RARP = 19,
> > -    VHOST_USER_INPUT_GET_CONFIG = 20,
> > +    VHOST_USER_NET_SET_MTU      = 20,
> > +    VHOST_USER_SET_SLAVE_REQ_FD = 21,
> > +    VHOST_USER_IOTLB_MSG        = 22,
> > +    VHOST_USER_SET_VRING_ENDIAN = 23,
> > +    VHOST_USER_POSTCOPY_ADVISE  = 24,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >
> > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > index a279560eb0..dad2a1b343 100644
> > --- a/docs/interop/vhost-user.txt
> > +++ b/docs/interop/vhost-user.txt
> > @@ -606,6 +606,15 @@ Master message types
> >        and expect this message once (per VQ) during device configuration
> >        (ie. before the master starts the VQ).
> >
> > + * VHOST_USER_POSTCOPY_ADVISE
> > +      Id: 24
> > +      Master payload: N/A
> > +      Slave payload: userfault fd + u64
> > +
> > +      Master advises slave that a migration with postcopy enabled is underway,
> > +      the slave must open a userfaultfd for later use.
> > +      Note that at this stage the migration is still in precopy mode.
> > +
> >  Slave message types
> >  -------------------
> >
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index c51bbd1296..7063e4df61 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -67,6 +67,7 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_SET_SLAVE_REQ_FD = 21,
> >      VHOST_USER_IOTLB_MSG = 22,
> >      VHOST_USER_SET_VRING_ENDIAN = 23,
> > +    VHOST_USER_POSTCOPY_ADVISE  = 24,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >
> > @@ -724,6 +725,50 @@ out:
> >      return ret;
> >  }
> >
> > +/*
> > + * Called at the start of an inbound postcopy on reception of the
> > + * 'advise' command.
> > + */
> > +static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
> > +{
> > +    struct vhost_user *u = dev->opaque;
> > +    CharBackend *chr = u->chr;
> > +    int ufd;
> > +    VhostUserMsg msg = {
> > +        .request = VHOST_USER_POSTCOPY_ADVISE,
> > +        .flags = VHOST_USER_VERSION,
> > +    };
> > +
> > +    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
> > +        error_setg(errp, "Failed to send postcopy_advise to vhost");
> > +        return -1;
> > +    }
> > +
> > +    if (vhost_user_read(dev, &msg) < 0) {
> > +        error_setg(errp, "Failed to get postcopy_advise reply from vhost");
> > +        return -1;
> > +    }
> > +
> > +    if (msg.request != VHOST_USER_POSTCOPY_ADVISE) {
> > +        error_setg(errp, "Unexpected msg type. Expected %d received %d",
> > +                     VHOST_USER_POSTCOPY_ADVISE, msg.request);
> > +        return -1;
> > +    }
> > +
> > +    if (msg.size != sizeof(msg.payload.u64)) {
> > +        error_setg(errp, "Received bad msg size.");
> > +        return -1;
> > +    }
> > +    ufd = qemu_chr_fe_get_msgfd(chr);
> > +    if (ufd < 0) {
> > +        error_setg(errp, "%s: Failed to get ufd", __func__);
> > +        return -1;
> > +    }
> > +
> > +    /* TODO: register ufd with userfault thread */
> > +    return 0;
> > +}
> > +
> >  static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
> >                                          void *opaque)
> >  {
> > @@ -743,6 +788,9 @@ static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
> >          }
> >          break;
> >
> > +    case POSTCOPY_NOTIFY_INBOUND_ADVISE:
> > +        return vhost_user_postcopy_advise(dev, pnd->errp);
> > +
> >      default:
> >          /* We ignore notifications we don't know */
> >          break;
> > diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> > index d688411674..70d4b09659 100644
> > --- a/migration/postcopy-ram.h
> > +++ b/migration/postcopy-ram.h
> > @@ -128,6 +128,7 @@ void postcopy_infrastructure_init(void);
> >   */
> >  enum PostcopyNotifyReason {
> >      POSTCOPY_NOTIFY_PROBE = 0,
> > +    POSTCOPY_NOTIFY_INBOUND_ADVISE,
> >  };
> >
> >  struct PostcopyNotifyData {
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index fdd15fa0a7..d35911731d 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1343,6 +1343,7 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
> >  {
> >      PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_ADVISE);
> >      uint64_t remote_pagesize_summary, local_pagesize_summary, remote_tps;
> > +    Error *local_err = NULL;
> >
> >      trace_loadvm_postcopy_handle_advise();
> >      if (ps != POSTCOPY_INCOMING_NONE) {
> > @@ -1390,6 +1391,11 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
> >          return -1;
> >      }
> >
> > +    if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_ADVISE, &local_err)) {
> > +        error_report_err(local_err);
> > +        return -1;
> > +    }
> > +
> >      if (ram_postcopy_incoming_init(mis)) {
> >          return -1;
> >      }
> > --
> > 2.13.5
> >
> >
> 
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

Thanks.

Dave

> 
> 
> 
> 
> -- 
> Marc-André Lureau
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 10/32] vhub: Support sending fds back to qemu
  2017-08-30 10:22     ` Marc-André Lureau
@ 2017-09-07 11:31       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 11:31 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> Hi
> 
> On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > Allow replies with fds (for postcopy)
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 29 ++++++++++++++++++++++++++++-
> >  1 file changed, 28 insertions(+), 1 deletion(-)
> >
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index 8bbdf5fb40..47884c0a15 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -213,6 +213,30 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
> >  {
> >      int rc;
> >      uint8_t *p = (uint8_t *)vmsg;
> > +    char control[CMSG_SPACE(VHOST_MEMORY_MAX_NREGIONS * sizeof(int))] = { };
> > +    struct iovec iov = {
> > +        .iov_base = (char *)vmsg,
> > +        .iov_len = VHOST_USER_HDR_SIZE,
> > +    };
> > +    struct msghdr msg = {
> > +        .msg_iov = &iov,
> > +        .msg_iovlen = 1,
> > +        .msg_control = control,
> > +    };
> > +    struct cmsghdr *cmsg;
> > +
> > +    memset(control, 0, sizeof(control));
> > +    if (vmsg->fds) {
> 
> This is going to be always true, right? Check vmsg->fd_num > 0 instead?

Ah yes, thanks.

> I would also add check or assert(vmsg->fd_num <= VHOST_MEMORY_MAX_NREGIONS)

-    if (vmsg->fds) {
+    assert(vmsg->fd_num <= VHOST_MEMORY_MAX_NREGIONS);
+    if (vmsg->fd_num > 0) {
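
For anyone following along, here is a minimal standalone sketch of the
fd-passing pattern with the bound check applied. It is not the
libvhost-user code: MAX_FDS, send_with_fds, recv_with_fd and
fd_roundtrip_demo are made-up names for illustration, and MAX_FDS merely
stands in for VHOST_MEMORY_MAX_NREGIONS.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_FDS 8 /* hypothetical limit, stands in for VHOST_MEMORY_MAX_NREGIONS */

/* Control buffer; the union keeps it suitably aligned for struct cmsghdr. */
union fd_control {
    char buf[CMSG_SPACE(MAX_FDS * sizeof(int))];
    struct cmsghdr align;
};

/* Send len bytes from buf, attaching nfds descriptors only when nfds > 0. */
static ssize_t send_with_fds(int sock, const void *buf, size_t len,
                             const int *fds, size_t nfds)
{
    union fd_control control;
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    assert(nfds <= MAX_FDS); /* never overrun the control buffer */
    memset(&control, 0, sizeof(control));
    if (nfds > 0) {
        size_t fdsize = nfds * sizeof(int);
        struct cmsghdr *cmsg;

        msg.msg_control = control.buf;
        msg.msg_controllen = CMSG_SPACE(fdsize);
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_len = CMSG_LEN(fdsize);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        memcpy(CMSG_DATA(cmsg), fds, fdsize);
    }
    return sendmsg(sock, &msg, 0);
}

/* Receive up to len bytes and at most one descriptor (-1 if none came). */
static ssize_t recv_with_fd(int sock, void *buf, size_t len, int *fd_out)
{
    union fd_control control;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = control.buf,
        .msg_controllen = sizeof(control.buf),
    };
    struct cmsghdr *cmsg;
    ssize_t n = recvmsg(sock, &msg, 0);

    *fd_out = -1;
    if (n < 0) {
        return n;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS &&
        cmsg->cmsg_len >= CMSG_LEN(sizeof(int))) {
        memcpy(fd_out, CMSG_DATA(cmsg), sizeof(int));
    }
    return n;
}

/* Pass a pipe's read end over a socketpair; 0 if the received copy works. */
static int fd_roundtrip_demo(void)
{
    int sv[2], pfd[2], got = -1;
    char hdr = 'H', in = 0, byte = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) || pipe(pfd)) {
        return -1;
    }
    if (send_with_fds(sv[0], &hdr, 1, &pfd[0], 1) != 1 ||
        recv_with_fd(sv[1], &in, 1, &got) != 1 || got < 0) {
        return -1;
    }
    /* Data written to the pipe must be readable through the passed fd. */
    if (write(pfd[1], "x", 1) != 1 || read(got, &byte, 1) != 1) {
        return -1;
    }
    close(got); close(pfd[0]); close(pfd[1]); close(sv[0]); close(sv[1]);
    return (in == 'H' && byte == 'x') ? 0 : -1;
}
```

Testing the count rather than the array is the point here: the fds
member is an array inside the message struct, so it is never NULL.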

> 
> > +        size_t fdsize = vmsg->fd_num * sizeof(int);
> > +        msg.msg_controllen = CMSG_SPACE(fdsize);
> > +        cmsg = CMSG_FIRSTHDR(&msg);
> > +        cmsg->cmsg_len = CMSG_LEN(fdsize);
> > +        cmsg->cmsg_level = SOL_SOCKET;
> > +        cmsg->cmsg_type = SCM_RIGHTS;
> > +        memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
> > +    } else {
> > +        msg.msg_controllen = 0;
> > +    }
> >
> >      /* Set the version in the flags when sending the reply */
> >      vmsg->flags &= ~VHOST_USER_VERSION_MASK;
> > @@ -220,7 +244,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
> >      vmsg->flags |= VHOST_USER_REPLY_MASK;
> >
> >      do {
> > -        rc = write(conn_fd, p, VHOST_USER_HDR_SIZE);
> > +        rc = sendmsg(conn_fd, &msg, 0);
> >      } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
> >
> >      do {
> > @@ -313,6 +337,7 @@ vu_get_features_exec(VuDev *dev, VhostUserMsg *vmsg)
> >      }
> >
> >      vmsg->size = sizeof(vmsg->payload.u64);
> > +    vmsg->fd_num = 0;
> >
> >      DPRINT("Sending back to guest u64: 0x%016"PRIx64"\n", vmsg->payload.u64);
> >
> > @@ -454,6 +479,7 @@ vu_set_log_base_exec(VuDev *dev, VhostUserMsg *vmsg)
> >      dev->log_size = log_mmap_size;
> >
> >      vmsg->size = sizeof(vmsg->payload.u64);
> > +    vmsg->fd_num = 0;
> >
> >      return true;
> >  }
> > @@ -698,6 +724,7 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg *vmsg)
> >
> >      vmsg->payload.u64 = features;
> >      vmsg->size = sizeof(vmsg->payload.u64);
> > +    vmsg->fd_num = 0;
> >
> >      return true;
> >  }
> > --
> > 2.13.5
> >
> >
> 
> other than that
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

Thanks.

Dave

> 
> 
> -- 
> Marc-André Lureau
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 14/32] vhost+postcopy: Transmit 'listen' to client
  2017-08-30 10:37     ` Marc-André Lureau
@ 2017-09-07 12:10       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 12:10 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > Notify the vhost-user client on reception of the 'postcopy-listen'
> > event from the source.
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 21 +++++++++++++++++++++
> >  contrib/libvhost-user/libvhost-user.h |  2 ++
> >  docs/interop/vhost-user.txt           |  6 ++++++
> >  hw/virtio/trace-events                |  3 +++
> >  hw/virtio/vhost-user.c                | 30 ++++++++++++++++++++++++++++++
> >  migration/postcopy-ram.h              |  1 +
> >  migration/savevm.c                    |  7 +++++++
> >  7 files changed, 70 insertions(+)
> >
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index f9b5b12b28..e8accf11db 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -15,7 +15,9 @@
> >
> >  #include <qemu/osdep.h>
> >  #include <sys/eventfd.h>
> > +#include <sys/ioctl.h>
> >  #include <sys/syscall.h>
> > +#include <linux/userfaultfd.h>
> >  #include <linux/vhost.h>
> >
> 
> Belong to an earlier patch

oops, thanks.

> >  #include "qemu/atomic.h"
> > @@ -64,6 +66,7 @@ vu_request_to_string(int req)
> >          REQ(VHOST_USER_IOTLB_MSG),
> >          REQ(VHOST_USER_SET_VRING_ENDIAN),
> >          REQ(VHOST_USER_POSTCOPY_ADVISE),
> > +        REQ(VHOST_USER_POSTCOPY_LISTEN),
> >          REQ(VHOST_USER_MAX),
> >      };
> >  #undef REQ
> > @@ -802,6 +805,22 @@ out:
> >  }
> >
> >  static bool
> > +vu_set_postcopy_listen(VuDev *dev, VhostUserMsg *vmsg)
> > +{
> > +    vmsg->payload.u64 = -1;
> > +    vmsg->size = sizeof(vmsg->payload.u64);
> > +
> > +    if (dev->nregions) {
> > +        vu_panic(dev, "Regions already registered at postcopy-listen");
> > +        return true;
> > +    }
> > +    dev->postcopy_listening = true;
> > +
> > +    vmsg->flags = VHOST_USER_VERSION |  VHOST_USER_REPLY_MASK;
> > +    vmsg->payload.u64 = 0; /* Success */
> > +    return true;
> > +}
> > +static bool
> >  vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >  {
> >      int do_reply = 0;
> > @@ -868,6 +887,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >          break;
> >      case VHOST_USER_POSTCOPY_ADVISE:
> >          return vu_set_postcopy_advise(dev, vmsg);
> > +    case VHOST_USER_POSTCOPY_LISTEN:
> > +        return vu_set_postcopy_listen(dev, vmsg);
> >      default:
> >          vmsg_close_fds(vmsg);
> >          vu_panic(dev, "Unhandled request: %d", vmsg->request);
> > diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> > index 3e8efdd919..29c11ba56c 100644
> > --- a/contrib/libvhost-user/libvhost-user.h
> > +++ b/contrib/libvhost-user/libvhost-user.h
> > @@ -67,6 +67,7 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_IOTLB_MSG        = 22,
> >      VHOST_USER_SET_VRING_ENDIAN = 23,
> >      VHOST_USER_POSTCOPY_ADVISE  = 24,
> > +    VHOST_USER_POSTCOPY_LISTEN  = 25,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >
> > @@ -237,6 +238,7 @@ struct VuDev {
> >
> >      /* Postcopy data */
> >      int postcopy_ufd;
> > +    bool postcopy_listening;
> >  };
> >
> >  typedef struct VuVirtqElement {
> > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > index dad2a1b343..73c3dd74db 100644
> > --- a/docs/interop/vhost-user.txt
> > +++ b/docs/interop/vhost-user.txt
> > @@ -615,6 +615,12 @@ Master message types
> >        the slave must open a userfaultfd for later use.
> >        Note that at this stage the migration is still in precopy mode.
> >
> > + * VHOST_USER_POSTCOPY_LISTEN
> > +      Id: 25
> > +      Master payload: N/A
> > +
> > +      Master advises slave that a transition to postcopy mode has happened.
> > +
> >  Slave message types
> >  -------------------
> >
> > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > index 775461ae98..f736c7c84f 100644
> > --- a/hw/virtio/trace-events
> > +++ b/hw/virtio/trace-events
> > @@ -1,5 +1,8 @@
> >  # See docs/devel/tracing.txt for syntax documentation.
> >
> > +# hw/virtio/vhost-user.c
> > +vhost_user_postcopy_listen(void) ""
> > +
> >  # hw/virtio/virtio.c
> >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> >  virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) "vq %p elem %p len %u idx %u"
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index b7898f8939..9178271ab2 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -69,6 +69,7 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_IOTLB_MSG = 22,
> >      VHOST_USER_SET_VRING_ENDIAN = 23,
> >      VHOST_USER_POSTCOPY_ADVISE  = 24,
> > +    VHOST_USER_POSTCOPY_LISTEN  = 25,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >
> > @@ -788,6 +789,32 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
> >      return 0;
> >  }
> >
> > +/*
> > + * Called at the switch to postcopy on reception of the 'listen' command.
> > + */
> > +static int vhost_user_postcopy_listen(struct vhost_dev *dev, Error **errp)
> > +{
> > +    int ret;
> > +    VhostUserMsg msg = {
> > +        .request = VHOST_USER_POSTCOPY_LISTEN,
> > +        .flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY_MASK,
> > +    };
> > +
> > +    trace_vhost_user_postcopy_listen();
> > +    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
> > +        error_setg(errp, "Failed to send postcopy_listen to vhost");
> > +        return -1;
> > +    }
> > +
> > +    ret = process_message_reply(dev, &msg);
> > +    if (ret) {
> > +        error_setg(errp, "Failed to receive reply to postcopy_listen");
> > +        return ret;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
> >                                          void *opaque)
> >  {
> > @@ -810,6 +837,9 @@ static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
> >      case POSTCOPY_NOTIFY_INBOUND_ADVISE:
> >          return vhost_user_postcopy_advise(dev, pnd->errp);
> >
> > +    case POSTCOPY_NOTIFY_INBOUND_LISTEN:
> > +        return vhost_user_postcopy_listen(dev, pnd->errp);
> > +
> >      default:
> >          /* We ignore notifications we don't know */
> >          break;
> > diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> > index 28c216cc7a..873c147b68 100644
> > --- a/migration/postcopy-ram.h
> > +++ b/migration/postcopy-ram.h
> > @@ -129,6 +129,7 @@ void postcopy_infrastructure_init(void);
> >  enum PostcopyNotifyReason {
> >      POSTCOPY_NOTIFY_PROBE = 0,
> >      POSTCOPY_NOTIFY_INBOUND_ADVISE,
> > +    POSTCOPY_NOTIFY_INBOUND_LISTEN,
> >  };
> >
> >  struct PostcopyNotifyData {
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index d35911731d..72f084e10d 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1557,6 +1557,8 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
> >  {
> >      PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_LISTENING);
> >      trace_loadvm_postcopy_handle_listen();
> > +    Error *local_err = NULL;
> > +
> >      if (ps != POSTCOPY_INCOMING_ADVISE && ps != POSTCOPY_INCOMING_DISCARD) {
> >          error_report("CMD_POSTCOPY_LISTEN in wrong postcopy state (%d)", ps);
> >          return -1;
> > @@ -1578,6 +1580,11 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
> >          return -1;
> >      }
> >
> > +    if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_LISTEN, &local_err)) {
> > +        error_report_err(local_err);
> > +        return -1;
> > +    }
> > +
> >      if (mis->have_listen_thread) {
> >          error_report("CMD_POSTCOPY_RAM_LISTEN already has a listen thread");
> >          return -1;
> > --
> > 2.13.5
> >
> >
> 
> 
> Looks good to me otherwise,
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

Thanks.

Dave

> 
> -- 
> Marc-André Lureau
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 28/32] postcopy: Allow shared memory
  2017-08-30 10:39     ` Marc-André Lureau
@ 2017-09-07 12:15       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 12:15 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> Hi
> 
> On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > Now that we have the mechanisms in here, allow shared memory in a
> > postcopy.
> >
> > Note that QEMU can't tell who all the users of shared regions are
> > and thus can't tell whether all the users of the shared regions
> > have appropriate support for postcopy.  Those devices that explicitly
> > support shared memory (e.g. vhost-user) must check, but it doesn't
> > stop weirder configurations causing problems.
> >
> 
> Other users should have their own migration blocker, I guess.

Yes, the ones that know about it.
The tricky thing is that you can add share=on to any memory backend
object; QEMU can't tell that the reason it's shared is purely because
it's used by vhost-user. For all QEMU knows it could be shared with 5
other things as well.

> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

Thanks.

Dave

> 
> 
> > ---
> >  migration/postcopy-ram.c | 6 ------
> >  1 file changed, 6 deletions(-)
> >
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 28791cf1f1..89c3aadda1 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -138,12 +138,6 @@ static int test_ramblock_postcopiable(const char *block_name, void *host_addr,
> >      RAMBlock *rb = qemu_ram_block_by_name(block_name);
> >      size_t pagesize = qemu_ram_pagesize(rb);
> >
> > -    if (qemu_ram_is_shared(rb)) {
> > -        error_report("Postcopy on shared RAM (%s) is not yet supported",
> > -                     block_name);
> > -        return 1;
> > -    }
> > -
> >      if (length % pagesize) {
> >          error_report("Postcopy requires RAM blocks to be a page size multiple,"
> >                       " block %s is 0x" RAM_ADDR_FMT " bytes with a "
> > --
> > 2.13.5
> >
> >
> 
> 
> 
> -- 
> Marc-André Lureau
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 24/32] vub+postcopy: madvises
  2017-08-30 10:48     ` Marc-André Lureau
@ 2017-09-07 12:30       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 12:30 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> Hi
> 
> "libvhost-user: madvises for postcopy" for ex, would be nicer imho

Done.

> On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > Clear the area and turn off THP.
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 32 ++++++++++++++++++++++++++++++--
> >  1 file changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index 5ec54f7d60..d816851c6d 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -450,11 +450,39 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> >          }
> >
> >          if (dev->postcopy_listening) {
> > +            int ret;
> >              /* We should already have an open ufd need to mark each memory
> >               * range as ufd.
> > -             * Note: Do we need any madvises? Well it's not been accessed
> > -             * yet, still probably need no THP to be safe, discard to be safe?
> >               */
> > +
> > +            /* Discard any mapping we have here; note I can't use MADV_REMOVE
> > +             * or fallocate to make the hole since I don't want to lose
> > +             * data that's already arrived in the shared process.
> > +             * TODO: How to do hugepage
> > +             */
> > +            ret = madvise((void *)dev_region->mmap_addr,
> > +                          dev_region->size + dev_region->mmap_offset,
> > +                          MADV_DONTNEED);
> > +            if (ret) {
> > +                fprintf(stderr,
> > +                        "%s: Failed to madvise(DONTNEED) region %d: %s\n",
> > +                        __func__, i, strerror(errno));
> > +            }
> > +            /* Turn off transparent hugepages so we dont get lose wakeups
> > +             * in neighbouring pages.
> > +             * TODO: Turn this backon later.
> > +             */
> > +            ret = madvise((void *)dev_region->mmap_addr,
> > +                          dev_region->size + dev_region->mmap_offset,
> > +                          MADV_NOHUGEPAGE);
> > +            if (ret) {
> > +                /* Note: This can happen legally on kernels that are configured
> > +                 * without madvise'able hugepages
> > +                 */
> > +                fprintf(stderr,
> > +                        "%s: Failed to madvise(NOHUGEPAGE) region %d: %s\n",
> > +                        __func__, i, strerror(errno));
> > +            }
> >              struct uffdio_register reg_struct;
> >              reg_struct.range.start = (uintptr_t)dev_region->mmap_addr;
> >              reg_struct.range.len = dev_region->size + dev_region->mmap_offset;
> > --
> > 2.13.5
> >
> 
> Errors are non-fatal? patch looks ok to me, despite the TODOs :).

The DONTNEED is actually not critical in this case: since we've
only just mmap'd the region there should be nothing mapped in there
that needs clearing with DONTNEED. It's still a good safeguard.
There's no equivalent syscall we can make for postcopy.

The madvise(NOHUGEPAGE) can fail on a kernel that's configured
without CONFIG_TRANSPARENT_HUGEPAGE_MADVISE. If TRANSPARENT_HUGEPAGE=n
then that's OK, since it just means we don't have transparent hugepages
and thus can't turn them off (I think we saw this on s390 or aarch64).

If CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS is set then things get
messy. IMHO there are way too many kernel config flags for transparent
hugepages!
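
To make the "non-fatal" policy concrete, here is a small standalone
sketch, assuming a private anonymous mapping rather than the shared
region the patch deals with. advise_region and advise_demo are
hypothetical names, and reporting a failure count instead of aborting is
just one possible policy, not what the patch does verbatim.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for madvise() and MAP_ANONYMOUS under strict -std= */
#endif
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Advise the kernel about a region before it is registered for userfault.
 * Returns the number of advice calls that failed; callers may treat a
 * non-zero result as a warning only. */
static int advise_region(void *addr, size_t len)
{
    int failures = 0;

    assert(addr != NULL && len > 0);

    /* Drop anything already mapped so that later accesses fault. */
    if (madvise(addr, len, MADV_DONTNEED)) {
        fprintf(stderr, "madvise(DONTNEED) failed: %s\n", strerror(errno));
        failures++;
    }
#ifdef MADV_NOHUGEPAGE
    /* Ask for no transparent hugepages in this range. EINVAL is expected
     * on kernels built without THP, so it is only reported. */
    if (madvise(addr, len, MADV_NOHUGEPAGE)) {
        fprintf(stderr, "madvise(NOHUGEPAGE) failed: %s\n", strerror(errno));
        failures++;
    }
#endif
    return failures;
}

/* Dirty one page, advise the region, and check the page was discarded. */
static int advise_demo(void)
{
    size_t len = 2 * 1024 * 1024;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int discarded;

    if (p == MAP_FAILED) {
        return -1;
    }
    p[0] = 0xaa;
    advise_region(p, len);
    /* A private anonymous page reads back as zero after DONTNEED. */
    discarded = (p[0] == 0);
    munmap(p, len);
    return discarded ? 0 : -1;
}
```

The demo only checks the DONTNEED effect, since whether NOHUGEPAGE
succeeds depends on how the running kernel was configured.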


> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>


Thanks,

Dave

> 
> 
> 
> 
> -- 
> Marc-André Lureau
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 11/32] vhub: Open userfaultfd
  2017-08-30 10:30     ` Marc-André Lureau
@ 2017-09-07 16:36       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 16:36 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> I would rather use libvhost-user: message prefix (same for similar
> libvhost-user patches)

Done.

> On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > Open a userfaultfd (on a postcopy_advise) and send it back in
> > the reply to the qemu for it to monitor.
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 26 +++++++++++++++++++++++---
> >  contrib/libvhost-user/libvhost-user.h |  3 +++
> >  2 files changed, 26 insertions(+), 3 deletions(-)
> >
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index 47884c0a15..f9b5b12b28 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -15,6 +15,7 @@
> >
> >  #include <qemu/osdep.h>
> >  #include <sys/eventfd.h>
> > +#include <sys/syscall.h>
> >  #include <linux/vhost.h>
> >
> >  #include "qemu/atomic.h"
> > @@ -773,11 +774,30 @@ vu_set_vring_enable_exec(VuDev *dev, VhostUserMsg *vmsg)
> >  static bool
> >  vu_set_postcopy_advise(VuDev *dev, VhostUserMsg *vmsg)
> >  {
> > -    /* TODO: Open ufd, pass it back in the request
> > -     * TODO: Add addresses
> > -     */
> > +    struct uffdio_api api_struct;
> > +
> > +    dev->postcopy_ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> 
> This will likely fail to compile on !Linux, could you add some
> appropriate #ifdef?

Note that we already #include <linux/vhost.h>, so this file only built
on Linux even before I came along; however, I'll add some ifdefs for
my new code.
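
Roughly the shape I have in mind, as a standalone sketch rather than the
final patch: open_userfaultfd is a hypothetical helper name, and
returning ENOSYS on hosts without the syscall is an assumption about how
the fallback might look.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() and O_CLOEXEC under strict -std= */
#endif
#include <assert.h>
#include <errno.h>

#if defined(__linux__)
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>
#endif

/* Open a userfaultfd and negotiate the API version.
 * Returns the fd, or -1 with errno set (ENOSYS where it is compiled out). */
static int open_userfaultfd(void)
{
#if defined(__linux__) && defined(__NR_userfaultfd)
    struct uffdio_api api = { .api = UFFD_API, .features = 0 };
    int ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    if (ufd == -1) {
        return -1;
    }
    if (ioctl(ufd, UFFDIO_API, &api)) {
        int saved = errno; /* keep the ioctl error across close() */

        close(ufd);
        errno = saved;
        return -1;
    }
    return ufd;
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

On a host that forbids or lacks userfaultfd the helper simply fails with
errno set, so the caller can report it and refuse postcopy.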

Dave


> > +    /* TODO: Add addresses */
> >      vmsg->payload.u64 = 0xcafe;
> >      vmsg->size = sizeof(vmsg->payload.u64);
> > +
> > +    if (dev->postcopy_ufd == -1) {
> > +        vu_panic(dev, "Userfaultfd not available: %s", strerror(errno));
> > +        goto out;
> > +    }
> > +    api_struct.api = UFFD_API;
> > +    api_struct.features = 0;
> > +    if (ioctl(dev->postcopy_ufd, UFFDIO_API, &api_struct)) {
> > +        vu_panic(dev, "Failed UFFDIO_API: %s", strerror(errno));
> > +        close(dev->postcopy_ufd);
> > +        dev->postcopy_ufd = -1;
> > +        goto out;
> > +    }
> > +    /* TODO: Stash feature flags somewhere */
> > +out:
> > +    /* Return a ufd to the QEMU */
> > +    vmsg->fd_num = 1;
> > +    vmsg->fds[0] = dev->postcopy_ufd;
> >      return true; /* = send a reply */
> >  }
> >
> > diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> > index 3987ce643d..3e8efdd919 100644
> > --- a/contrib/libvhost-user/libvhost-user.h
> > +++ b/contrib/libvhost-user/libvhost-user.h
> > @@ -234,6 +234,9 @@ struct VuDev {
> >       * re-initialize */
> >      vu_panic_cb panic;
> >      const VuDevIface *iface;
> > +
> > +    /* Postcopy data */
> > +    int postcopy_ufd;
> >  };
> >
> >  typedef struct VuVirtqElement {
> > --
> > 2.13.5
> >
> >
> 
> 
> 
> -- 
> Marc-André Lureau
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 15/32] vhost+postcopy: Register new regions with the ufd
  2017-08-30 10:42     ` Marc-André Lureau
@ 2017-09-08 14:50       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-08 14:50 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Maxime Coquelin, a.perevalov, Michael S. Tsirkin,
	Laurent Vivier, aarcange, Felipe Franciosi, Peter Xu,
	Juan Quintela

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> Use "libvhost-user: " commit title tag/prefix?
> 
> On Thu, Aug 24, 2017 at 12:27 PM, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > When new regions are sent to the client using SET_MEM_TABLE, register
> > them with the userfaultfd.
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 32 ++++++++++++++++++++++++++++++++
> >  1 file changed, 32 insertions(+)
> >
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index e8accf11db..e6ab059a03 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -449,6 +449,38 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> >                     dev_region->mmap_addr);
> >          }
> >
> > +        if (dev->postcopy_listening) {
> > +            /* We should already have an open ufd need to mark each memory
> > +             * range as ufd.
> > +             * Note: Do we need any madvises? Well it's not been accessed
> > +             * yet, still probably need no THP to be safe, discard to be safe?
> > +             */
> > +            struct uffdio_register reg_struct;
> > +            reg_struct.range.start = (uintptr_t)dev_region->mmap_addr;
> > +            reg_struct.range.len = dev_region->size + dev_region->mmap_offset;
> > +            reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
> > +
> > +            if (ioctl(dev->postcopy_ufd, UFFDIO_REGISTER, &reg_struct)) {
> > +                vu_panic(dev, "%s: Failed to userfault region %d "
> > +                              "@%p + %zx: (ufd=%d)%s\n",
> > +                         __func__, i,
> > +                         dev_region->mmap_addr,
> > +                         dev_region->size + dev_region->mmap_offset,
> > +                         dev->postcopy_ufd, strerror(errno));
> > +                continue;
> 
> panic is supposed to be unrecoverable errors, so I would suggest to return here

Done.

> > +            }
> > +            if (!(reg_struct.ioctls & ((__u64)1 << _UFFDIO_COPY))) {
> > +                vu_panic(dev, "%s Region (%d) doesn't support COPY",
> > +                         __func__, i);
> > +                continue;
> > +            }
> > +            DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
> > +                    __func__, i, reg_struct.range.start, reg_struct.range.len);
> > +            /* TODO: Stash 'zero' support flags somewhere */
> > +            /* TODO: Get address back to QEMU */
> > +
> > +        }
> > +
> >          close(vmsg->fds[i]);
> >      }
> 
> This patch would be nicer if it compiles on !Linux / without userfault.

Done; I've wrapped the inside of this if in #ifdef UFFDIO_REGISTER.
I'll also add code so that, without userfault, it never gets as far as
setting postcopy_listening.
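
As a standalone sketch of both changes (the #ifdef and returning on the
first failure instead of continuing): struct region, register_region and
register_all are hypothetical names, not the libvhost-user ones, and the
errno values chosen for the failure paths are assumptions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__linux__)
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>
#endif

/* Hypothetical description of one mapped memory region. */
struct region {
    void *addr;
    size_t len;
};

/* Register one range for missing-page faults; 0 on success, -1 on error. */
static int register_region(int ufd, void *addr, size_t len)
{
#ifdef UFFDIO_REGISTER
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };

    if (ioctl(ufd, UFFDIO_REGISTER, &reg)) {
        return -1;
    }
    /* The range must support UFFDIO_COPY for pages to be placed later. */
    if (!(reg.ioctls & ((__u64)1 << _UFFDIO_COPY))) {
        errno = ENOTSUP;
        return -1;
    }
    return 0;
#else
    (void)ufd;
    (void)addr;
    (void)len;
    errno = ENOSYS;
    return -1;
#endif
}

/* Register every region, stopping at the first failure. */
static int register_all(int ufd, const struct region *r, size_t n)
{
    assert(r != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (register_region(ufd, r[i].addr, r[i].len)) {
            return -1; /* unrecoverable: do not carry on with other regions */
        }
    }
    return 0;
}
```

With an invalid descriptor the first ioctl fails and register_all
returns straight away, which is the behaviour asked for in the review.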

Dave

> -- 
> Marc-André Lureau
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 26/32] vhost: Add VHOST_USER_POSTCOPY_END message
  2017-08-30  6:55     ` Peter Xu
@ 2017-09-11 11:31       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-11 11:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 24, 2017 at 08:27:24PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > This message is sent just before the end of postcopy to get the
> > client to stop using userfault since we wont respond to any more
> > requests.  It should close userfaultfd so that any other pages
> > get mapped to the backing file automatically by the kernel, since
> > at this point we know we've received everything.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> (I feel like the title should be for "vub", not "vhost"?)

I've changed it to vhost-user.

> Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks.

Dave

> 
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 23 +++++++++++++++++++++++
> >  contrib/libvhost-user/libvhost-user.h |  1 +
> >  docs/interop/vhost-user.txt           |  8 ++++++++
> >  hw/virtio/vhost-user.c                |  1 +
> >  4 files changed, 33 insertions(+)
> > 
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index d816851c6d..23bff47649 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -67,6 +67,7 @@ vu_request_to_string(int req)
> >          REQ(VHOST_USER_SET_VRING_ENDIAN),
> >          REQ(VHOST_USER_POSTCOPY_ADVISE),
> >          REQ(VHOST_USER_POSTCOPY_LISTEN),
> > +        REQ(VHOST_USER_POSTCOPY_END),
> >          REQ(VHOST_USER_MAX),
> >      };
> >  #undef REQ
> > @@ -893,6 +894,26 @@ vu_set_postcopy_listen(VuDev *dev, VhostUserMsg *vmsg)
> >      vmsg->payload.u64 = 0; /* Success */
> >      return true;
> >  }
> > +
> > +static bool
> > +vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
> > +{
> > +    DPRINT("%s: Entry\n", __func__);
> > +    dev->postcopy_listening = false;
> > +    if (dev->postcopy_ufd > 0) {
> > +        close(dev->postcopy_ufd);
> > +        dev->postcopy_ufd = -1;
> > +        DPRINT("%s: Done close\n", __func__);
> > +    }
> > +
> > +    vmsg->fd_num = 0;
> > +    vmsg->payload.u64 = 0;
> > +    vmsg->size = sizeof(vmsg->payload.u64);
> > +    vmsg->flags = VHOST_USER_VERSION |  VHOST_USER_REPLY_MASK;
> > +    DPRINT("%s: exit\n", __func__);
> > +    return true;
> > +}
> > +
> >  static bool
> >  vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >  {
> > @@ -962,6 +983,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >          return vu_set_postcopy_advise(dev, vmsg);
> >      case VHOST_USER_POSTCOPY_LISTEN:
> >          return vu_set_postcopy_listen(dev, vmsg);
> > +    case VHOST_USER_POSTCOPY_END:
> > +        return vu_set_postcopy_end(dev, vmsg);
> >      default:
> >          vmsg_close_fds(vmsg);
> >          vu_panic(dev, "Unhandled request: %d", vmsg->request);
> > diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> > index 29c11ba56c..a78596e6fd 100644
> > --- a/contrib/libvhost-user/libvhost-user.h
> > +++ b/contrib/libvhost-user/libvhost-user.h
> > @@ -68,6 +68,7 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_SET_VRING_ENDIAN = 23,
> >      VHOST_USER_POSTCOPY_ADVISE  = 24,
> >      VHOST_USER_POSTCOPY_LISTEN  = 25,
> > +    VHOST_USER_POSTCOPY_END     = 26,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >  
> > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > index b2a548c94d..d6586e0b43 100644
> > --- a/docs/interop/vhost-user.txt
> > +++ b/docs/interop/vhost-user.txt
> > @@ -627,6 +627,14 @@ Master message types
> >  
> >        Master advises slave that a transition to postcopy mode has happened.
> >  
> > + * VHOST_USER_POSTCOPY_END
> > +      Id: 26
> > +      Slave payload: u64
> > +
> > +      Master advises that postcopy migration has now completed.  The
> > +      slave must disable the userfaultfd. The response is an acknowledgement
> > +      only.
> > +
> >  Slave message types
> >  -------------------
> >  
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index 4d03383a66..c2e55be0fd 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -71,6 +71,7 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_SET_VRING_ENDIAN = 23,
> >      VHOST_USER_POSTCOPY_ADVISE  = 24,
> >      VHOST_USER_POSTCOPY_LISTEN  = 25,
> > +    VHOST_USER_POSTCOPY_END     = 26,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >  
> > -- 
> > 2.13.5
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 19/32] vhost+postcopy: Resolve client address
  2017-08-30  5:28     ` Peter Xu
@ 2017-09-11 11:58       ` Dr. David Alan Gilbert
  2017-09-13  5:18         ` Peter Xu
  0 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-11 11:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 24, 2017 at 08:27:17PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Resolve fault addresses read off the clients UFD into RAMBlock
> > and offset, and call back to the postcopy code to ask for the page.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  hw/virtio/trace-events |  3 +++
> >  hw/virtio/vhost-user.c | 30 +++++++++++++++++++++++++++++-
> >  2 files changed, 32 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > index 5067dee19b..f7d4b831fe 100644
> > --- a/hw/virtio/trace-events
> > +++ b/hw/virtio/trace-events
> > @@ -1,6 +1,9 @@
> >  # See docs/devel/tracing.txt for syntax documentation.
> >  
> >  # hw/virtio/vhost-user.c
> > +vhost_user_postcopy_fault_handler(const char *name, uint64_t fault_address, int nregions) "%s: @0x%"PRIx64" nregions:%d"
> > +vhost_user_postcopy_fault_handler_loop(int i, uint64_t client_base, uint64_t size) "%d: client 0x%"PRIx64" +0x%"PRIx64
> > +vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t rb_offset) "%d: region_offset: 0x%"PRIx64" rb_offset:0x%"PRIx64
> >  vhost_user_postcopy_listen(void) ""
> >  vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> >  vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index fbe2743298..2897ff70b3 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -816,7 +816,35 @@ out:
> >  static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
> >                                               void *ufd)
> >  {
> > -    return 0;
> > +    struct vhost_dev *dev = pcfd->data;
> > +    struct vhost_user *u = dev->opaque;
> > +    struct uffd_msg *msg = ufd;
> > +    uint64_t faultaddr = msg->arg.pagefault.address;
> > +    RAMBlock *rb = NULL;
> > +    uint64_t rb_offset;
> > +    int i;
> > +
> > +    trace_vhost_user_postcopy_fault_handler(pcfd->idstr, faultaddr,
> > +                                            dev->mem->nregions);
> > +    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
> 
> Should dev->mem->nregions always the same as u->region_rb_len?

u->region_rb_len only gets updated when vhost_user_set_mem_table is
called, so I think there are short periods of time when they don't
quite match.
(We do have to take some more care than we are at the moment during
updates, because this address resolution happens off the postcopy
thread)

> > +        trace_vhost_user_postcopy_fault_handler_loop(i,
> > +                u->postcopy_client_bases[i], dev->mem->regions[i].memory_size);
> > +        if (faultaddr >= u->postcopy_client_bases[i]) {
> > +            /* Ofset of the fault address in the vhost region */
> > +            uint64_t region_offset = faultaddr - u->postcopy_client_bases[i];
> > +            if (region_offset <= dev->mem->regions[i].memory_size) {
> 
> Should be "<" rather than "<="?  Say:
> 
> Region 1: [0, 1M), size 1M
> Region 2: [1M, 2M), size 1M
> 
> Looks like otherwise faultaddr=1M will fall into region 1, while it
> should be region 2?

Fixed; thanks.
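
To illustrate the boundary case, here is a minimal standalone sketch of a
half-open lookup.  The struct and function names are hypothetical and are
not the ones in vhost-user.c; the point is only that an address equal to
base + size must not match the region that ends there.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a vhost region covering [base, base + size) */
struct region {
    uint64_t base;
    uint64_t size;
};

/* Return the index of the region containing addr, or -1 if none does. */
static int find_region(const struct region *r, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        /* Strict '<' keeps the interval half-open at the top end */
        if (addr >= r[i].base && addr - r[i].base < r[i].size) {
            return (int)i;
        }
    }
    return -1;
}

/* Two adjacent 1M regions, as in the example above */
static const struct region example_regions[] = {
    { 0x000000, 0x100000 },
    { 0x100000, 0x100000 },
};
```

With these two regions, a fault at exactly 1M resolves to index 1; with
"<=" it would have matched index 0 first.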

> 
> > +                rb_offset = region_offset + u->region_rb_offset[i];
> > +                trace_vhost_user_postcopy_fault_handler_found(i,
> > +                        region_offset, rb_offset);
> > +                rb = u->region_rb[i];
> 
> Nit: this "rb" might be avoided if only used once.

It's only a local, ok if it makes it a little more readable.

Dave

> > +                return postcopy_request_shared_page(pcfd, rb, faultaddr,
> > +                                                    rb_offset);
> > +            }
> > +        }
> > +    }
> > +    error_report("%s: Failed to find region for fault %" PRIx64,
> > +                 __func__, faultaddr);
> > +    return -1;
> >  }
> >  
> >  /*
> > -- 
> > 2.13.5
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 07/32] postcopy: Add notifier chain
  2017-08-29  6:02     ` Peter Xu
@ 2017-09-11 17:00       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-11 17:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 24, 2017 at 08:27:05PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Add a notifier chain for postcopy with a 'reason' flag
> > and an opportunity for a notifier member to return an error.
> > 
> > Call it when enabling postcopy.
> > 
> > This will initially used to enable devices to declare they're unable
> > to postcopy and later to notify of devices of stages within postcopy.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  migration/postcopy-ram.c | 41 +++++++++++++++++++++++++++++++++++++++++
> >  migration/postcopy-ram.h | 26 ++++++++++++++++++++++++++
> >  vl.c                     |  2 ++
> >  3 files changed, 69 insertions(+)
> > 
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 640b72d86d..95007c00ef 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -23,6 +23,8 @@
> >  #include "savevm.h"
> >  #include "postcopy-ram.h"
> >  #include "ram.h"
> > +#include "qapi/error.h"
> > +#include "qemu/notify.h"
> >  #include "sysemu/sysemu.h"
> >  #include "sysemu/balloon.h"
> >  #include "qemu/error-report.h"
> > @@ -45,6 +47,38 @@ struct PostcopyDiscardState {
> >      unsigned int nsentcmds;
> >  };
> >  
> > +/* A notifier chain for postcopy
> > + * The notifier should return 0 if it's OK, or a
> > + * -errno on error.
> > + * The notifier should expect an Error ** as it's data
> 
> "PostcopyNotifyData *" but not "Error **"?

Ah well spotted.

> Maybe we can just remove this block of comment since there is a
> similar one in the header below.

Yes, that's what I've done.

> Besides:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks.

Dave

> 
> > + */
> > +static NotifierWithReturnList postcopy_notifier_list;
> > +
> > +void postcopy_infrastructure_init(void)
> > +{
> > +    notifier_with_return_list_init(&postcopy_notifier_list);
> > +}
> > +
> > +void postcopy_add_notifier(NotifierWithReturn *nn)
> > +{
> > +    notifier_with_return_list_add(&postcopy_notifier_list, nn);
> > +}
> > +
> > +void postcopy_remove_notifier(NotifierWithReturn *n)
> > +{
> > +    notifier_with_return_remove(n);
> > +}
> > +
> > +int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp)
> > +{
> > +    struct PostcopyNotifyData pnd;
> > +    pnd.reason = reason;
> > +    pnd.errp = errp;
> > +
> > +    return notifier_with_return_list_notify(&postcopy_notifier_list,
> > +                                            &pnd);
> > +}
> > +
> >  /* Postcopy needs to detect accesses to pages that haven't yet been copied
> >   * across, and efficiently map new pages in, the techniques for doing this
> >   * are target OS specific.
> > @@ -133,6 +167,7 @@ bool postcopy_ram_supported_by_host(void)
> >      struct uffdio_register reg_struct;
> >      struct uffdio_range range_struct;
> >      uint64_t feature_mask;
> > +    Error *local_err = NULL;
> >  
> >      if (qemu_target_page_size() > pagesize) {
> >          error_report("Target page size bigger than host page size");
> > @@ -146,6 +181,12 @@ bool postcopy_ram_supported_by_host(void)
> >          goto out;
> >      }
> >  
> > +    /* Give devices a chance to object */
> > +    if (postcopy_notify(POSTCOPY_NOTIFY_PROBE, &local_err)) {
> > +        error_report_err(local_err);
> > +        goto out;
> > +    }
> > +
> >      /* Version and features check */
> >      if (!ufd_version_check(ufd)) {
> >          goto out;
> > diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> > index 78a3591322..d688411674 100644
> > --- a/migration/postcopy-ram.h
> > +++ b/migration/postcopy-ram.h
> > @@ -114,4 +114,30 @@ PostcopyState postcopy_state_get(void);
> >  /* Set the state and return the old state */
> >  PostcopyState postcopy_state_set(PostcopyState new_state);
> >  
> > +/*
> > + * To be called once at the start before any device initialisation
> > + */
> > +void postcopy_infrastructure_init(void);
> > +
> > +/* Add a notifier to a list to be called when checking whether the devices
> > + * can support postcopy.
> > + * It's data is a *PostcopyNotifyData
> > + * It should return 0 if OK, or a negative value on failure.
> > + * On failure it must set the data->errp to an error.
> > + *
> > + */
> > +enum PostcopyNotifyReason {
> > +    POSTCOPY_NOTIFY_PROBE = 0,
> > +};
> > +
> > +struct PostcopyNotifyData {
> > +    enum PostcopyNotifyReason reason;
> > +    Error **errp;
> > +};
> > +
> > +void postcopy_add_notifier(NotifierWithReturn *nn);
> > +void postcopy_remove_notifier(NotifierWithReturn *n);
> > +/* Call the notifier list set by postcopy_add_start_notifier */
> > +int postcopy_notify(enum PostcopyNotifyReason reason, Error **errp);
> > +
> >  #endif
> > diff --git a/vl.c b/vl.c
> > index 8e247cc2a2..65dd9dc324 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -95,6 +95,7 @@ int main(int argc, char **argv)
> >  #include "audio/audio.h"
> >  #include "sysemu/cpus.h"
> >  #include "migration/colo.h"
> > +#include "migration/postcopy-ram.h"
> >  #include "sysemu/kvm.h"
> >  #include "sysemu/hax.h"
> >  #include "qapi/qobject-input-visitor.h"
> > @@ -3082,6 +3083,7 @@ int main(int argc, char **argv, char **envp)
> >      module_call_init(MODULE_INIT_OPTS);
> >  
> >      runstate_init();
> > +    postcopy_infrastructure_init();
> >  
> >      if (qcrypto_init(&err) < 0) {
> >          error_reportf_err(err, "cannot initialize crypto: ");
> > -- 
> > 2.13.5
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu
  2017-08-29  8:30     ` Peter Xu
@ 2017-09-12 17:15       ` Dr. David Alan Gilbert
  2017-09-13  4:29         ` Peter Xu
  0 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-12 17:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 24, 2017 at 08:27:14PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > We need a better way, but at the moment we need the address of the
> > mappings sent back to qemu so it can interpret the messages on the
> > userfaultfd it reads.
> > 
> > Note: We don't ask for the default 'ack' reply since we've got our own.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 15 ++++++++-
> >  docs/interop/vhost-user.txt           |  6 ++++
> >  hw/virtio/trace-events                |  1 +
> >  hw/virtio/vhost-user.c                | 57 ++++++++++++++++++++++++++++++++++-
> >  4 files changed, 77 insertions(+), 2 deletions(-)
> > 
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index e6ab059a03..5ec54f7d60 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -477,13 +477,26 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> >              DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
> >                      __func__, i, reg_struct.range.start, reg_struct.range.len);
> >              /* TODO: Stash 'zero' support flags somewhere */
> > -            /* TODO: Get address back to QEMU */
> >  
> > +            /* TODO: We need to find a way for the qemu not to see the virtual
> > +             * addresses of the clients, so as to keep better separation.
> > +             */
> > +            /* Return the address to QEMU so that it can translate the ufd
> > +             * fault addresses back.
> > +             */
> > +            msg_region->userspace_addr = (uintptr_t)(mmap_addr +
> > +                                                     dev_region->mmap_offset);
> >          }
> >  
> >          close(vmsg->fds[i]);
> >      }
> >  
> > +    if (dev->postcopy_listening) {
> > +        /* Need to return the addresses - send the updated message back */
> > +        vmsg->fd_num = 0;
> > +        return true;
> > +    }
> > +
> >      return false;
> >  }
> >  
> > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > index 73c3dd74db..b2a548c94d 100644
> > --- a/docs/interop/vhost-user.txt
> > +++ b/docs/interop/vhost-user.txt
> > @@ -413,12 +413,18 @@ Master message types
> >        Id: 5
> >        Equivalent ioctl: VHOST_SET_MEM_TABLE
> >        Master payload: memory regions description
> > +      Slave payload: (postcopy only) memory regions description
> >  
> >        Sets the memory map regions on the slave so it can translate the vring
> >        addresses. In the ancillary data there is an array of file descriptors
> >        for each memory mapped region. The size and ordering of the fds matches
> >        the number and ordering of memory regions.
> >  
> > +      When postcopy-listening has been received, SET_MEM_TABLE replies with
> > +      the bases of the memory mapped regions to the master.  It must have mmap'd
> > +      the regions and enabled userfaultfd on them.  Note NEED_REPLY_MASK
> > +      is not set in this case.
> > +
> >   * VHOST_USER_SET_LOG_BASE
> >  
> >        Id: 6
> > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > index f736c7c84f..63fd4a79cf 100644
> > --- a/hw/virtio/trace-events
> > +++ b/hw/virtio/trace-events
> > @@ -2,6 +2,7 @@
> >  
> >  # hw/virtio/vhost-user.c
> >  vhost_user_postcopy_listen(void) ""
> > +vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> >  
> >  # hw/virtio/virtio.c
> >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index 9178271ab2..2e4eb0864a 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -19,6 +19,7 @@
> >  #include "qemu/sockets.h"
> >  #include "migration/migration.h"
> >  #include "migration/postcopy-ram.h"
> > +#include "trace.h"
> >  
> >  #include <sys/ioctl.h>
> >  #include <sys/socket.h>
> > @@ -133,6 +134,7 @@ struct vhost_user {
> >      int slave_fd;
> >      NotifierWithReturn postcopy_notifier;
> >      struct PostCopyFD  postcopy_fd;
> > +    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
> >  };
> >  
> >  static bool ioeventfd_enabled(void)
> > @@ -300,11 +302,13 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
> >  static int vhost_user_set_mem_table(struct vhost_dev *dev,
> >                                      struct vhost_memory *mem)
> >  {
> > +    struct vhost_user *u = dev->opaque;
> >      int fds[VHOST_MEMORY_MAX_NREGIONS];
> >      int i, fd;
> >      size_t fd_num = 0;
> >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> > -                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > +                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
> > +                           !u->postcopy_fd.handler;
> 
> (indent)

Fixed

> >  
> >      VhostUserMsg msg = {
> >          .request = VHOST_USER_SET_MEM_TABLE,
> > @@ -350,6 +354,57 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> >          return -1;
> >      }
> >  
> > +    if (u->postcopy_fd.handler) {
> 
> It seems that after this handler is set, we never clean it up.  Do we
> need to unset it somewhere? (maybe vhost_user_postcopy_end?)

Hmm yes I'll have a look at that.

> > +        VhostUserMsg msg_reply;
> > +        int region_i, reply_i;
> > +        if (vhost_user_read(dev, &msg_reply) < 0) {
> > +            return -1;
> > +        }
> > +
> > +        if (msg_reply.request != VHOST_USER_SET_MEM_TABLE) {
> > +            error_report("%s: Received unexpected msg type."
> > +                         "Expected %d received %d", __func__,
> > +                         VHOST_USER_SET_MEM_TABLE, msg_reply.request);
> > +            return -1;
> > +        }
> > +        /* We're using the same structure, just reusing one of the
> > +         * fields, so it should be the same size.
> > +         */
> > +        if (msg_reply.size != msg.size) {
> > +            error_report("%s: Unexpected size for postcopy reply "
> > +                         "%d vs %d", __func__, msg_reply.size, msg.size);
> > +            return -1;
> > +        }
> > +
> > +        memset(u->postcopy_client_bases, 0,
> > +               sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
> > +
> > +        /* They're in the same order as the regions that were sent
> > +         * but some of the regions were skipped (above) if they
> > +         * didn't have fd's
> > +        */
> > +        for (reply_i = 0, region_i = 0;
> > +             region_i < dev->mem->nregions;
> > +             region_i++) {
> > +            if (reply_i < fd_num &&
> > +                msg_reply.payload.memory.regions[region_i].guest_phys_addr ==
>                                                     ^^^^^^^^
>                                           should this be reply_i?

Yes it should - nicely spotted

> (And maybe we can use pointers for the regions for better readability?)

I'm nervous of doing that since VhostUserMsg is 'packed' - and I'm not
convinced it's legal to take a pointer to a member (although I think
we do it in a whole bunch of places and clang moans about it).
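
A small sketch of the concern, using a made-up struct rather than
VhostUserMsg and assuming the GCC/clang attribute syntax: in a packed
struct a 64-bit member can sit at an odd offset, so a uint64_t * that
points at it may be misaligned, and dereferencing such a pointer is
undefined behaviour in C even where the hardware tolerates it (I believe
the clang warning in question is -Waddress-of-packed-member).  Reading the
member through the struct, or copying it out with memcpy, avoids forming
that pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packed message; not the real vhost-user layout */
struct demo_msg {
    uint8_t  flags;
    uint64_t userspace_addr;   /* lands at offset 1, not 8 */
} __attribute__((packed));

/* Safe: access via the struct, so the compiler emits an unaligned load */
static uint64_t get_addr(const struct demo_msg *m)
{
    return m->userspace_addr;
}

/* Also safe: copy the bytes out without ever forming a uint64_t pointer */
static uint64_t get_addr_memcpy(const struct demo_msg *m)
{
    uint64_t v;

    memcpy(&v,
           (const unsigned char *)m + offsetof(struct demo_msg, userspace_addr),
           sizeof(v));
    return v;
}
```

The risky pattern is "uint64_t *p = &m->userspace_addr; ... *p", because
once the address is stored in a plain uint64_t * the compiler is free to
assume natural alignment.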

> > +                dev->mem->regions[region_i].guest_phys_addr) {
> > +                u->postcopy_client_bases[region_i] =
> > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr;
> > +                trace_vhost_user_set_mem_table_postcopy(
> > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr,
> > +                    msg.payload.memory.regions[reply_i].userspace_addr,
                                                    ^^^^^^^
                        and I think this one is region_i
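
For reference, a standalone sketch of the matching loop with both index
fixes applied.  The types and the function name are simplified,
hypothetical stand-ins for the QEMU structures, so treat it as an
illustration of the indexing rather than the actual patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified region descriptor */
struct mem_region {
    uint64_t guest_phys_addr;
    uint64_t userspace_addr;
};

/*
 * sent[] holds every device region; reply[] holds only the n_reply regions
 * that had fds, in the same order.  bases[region_i] receives the client
 * address for matched regions and stays zero for skipped ones.
 * Returns 0 if every reply entry was consumed, -1 otherwise.
 */
static int match_reply(const struct mem_region *sent, size_t n_sent,
                       const struct mem_region *reply, size_t n_reply,
                       uint64_t *bases)
{
    size_t reply_i = 0;

    for (size_t region_i = 0; region_i < n_sent; region_i++) {
        bases[region_i] = 0;
        /* reply[] is indexed by reply_i, sent[] by region_i */
        if (reply_i < n_reply &&
            reply[reply_i].guest_phys_addr == sent[region_i].guest_phys_addr) {
            bases[region_i] = reply[reply_i].userspace_addr;
            reply_i++;
        }
    }
    return reply_i == n_reply ? 0 : -1;
}
```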

Dave

> > +                    reply_i, region_i);
> > +                reply_i++;
> > +            }
> > +        }
> > +        if (reply_i != fd_num) {
> > +            error_report("%s: postcopy reply not fully consumed "
> > +                         "%d vs %zd",
> > +                         __func__, reply_i, fd_num);
> > +            return -1;
> > +        }
> > +    }
> >      if (reply_supported) {
> >          return process_message_reply(dev, &msg);
> >      }
> > -- 
> > 2.13.5
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu
  2017-09-12 17:15       ` Dr. David Alan Gilbert
@ 2017-09-13  4:29         ` Peter Xu
  2017-09-13 12:15           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Xu @ 2017-09-13  4:29 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Tue, Sep 12, 2017 at 06:15:13PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Thu, Aug 24, 2017 at 08:27:14PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > We need a better way, but at the moment we need the address of the
> > > mappings sent back to qemu so it can interpret the messages on the
> > > userfaultfd it reads.
> > > 
> > > Note: We don't ask for the default 'ack' reply since we've got our own.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  contrib/libvhost-user/libvhost-user.c | 15 ++++++++-
> > >  docs/interop/vhost-user.txt           |  6 ++++
> > >  hw/virtio/trace-events                |  1 +
> > >  hw/virtio/vhost-user.c                | 57 ++++++++++++++++++++++++++++++++++-
> > >  4 files changed, 77 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > > index e6ab059a03..5ec54f7d60 100644
> > > --- a/contrib/libvhost-user/libvhost-user.c
> > > +++ b/contrib/libvhost-user/libvhost-user.c
> > > @@ -477,13 +477,26 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> > >              DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
> > >                      __func__, i, reg_struct.range.start, reg_struct.range.len);
> > >              /* TODO: Stash 'zero' support flags somewhere */
> > > -            /* TODO: Get address back to QEMU */
> > >  
> > > +            /* TODO: We need to find a way for the qemu not to see the virtual
> > > +             * addresses of the clients, so as to keep better separation.
> > > +             */
> > > +            /* Return the address to QEMU so that it can translate the ufd
> > > +             * fault addresses back.
> > > +             */
> > > +            msg_region->userspace_addr = (uintptr_t)(mmap_addr +
> > > +                                                     dev_region->mmap_offset);
> > >          }
> > >  
> > >          close(vmsg->fds[i]);
> > >      }
> > >  
> > > +    if (dev->postcopy_listening) {
> > > +        /* Need to return the addresses - send the updated message back */
> > > +        vmsg->fd_num = 0;
> > > +        return true;
> > > +    }
> > > +
> > >      return false;
> > >  }
> > >  
> > > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > > index 73c3dd74db..b2a548c94d 100644
> > > --- a/docs/interop/vhost-user.txt
> > > +++ b/docs/interop/vhost-user.txt
> > > @@ -413,12 +413,18 @@ Master message types
> > >        Id: 5
> > >        Equivalent ioctl: VHOST_SET_MEM_TABLE
> > >        Master payload: memory regions description
> > > +      Slave payload: (postcopy only) memory regions description
> > >  
> > >        Sets the memory map regions on the slave so it can translate the vring
> > >        addresses. In the ancillary data there is an array of file descriptors
> > >        for each memory mapped region. The size and ordering of the fds matches
> > >        the number and ordering of memory regions.
> > >  
> > > +      When postcopy-listening has been received, SET_MEM_TABLE replies with
> > > +      the bases of the memory mapped regions to the master.  It must have mmap'd
> > > +      the regions and enabled userfaultfd on them.  Note NEED_REPLY_MASK
> > > +      is not set in this case.
> > > +
> > >   * VHOST_USER_SET_LOG_BASE
> > >  
> > >        Id: 6
> > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > index f736c7c84f..63fd4a79cf 100644
> > > --- a/hw/virtio/trace-events
> > > +++ b/hw/virtio/trace-events
> > > @@ -2,6 +2,7 @@
> > >  
> > >  # hw/virtio/vhost-user.c
> > >  vhost_user_postcopy_listen(void) ""
> > > +vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> > >  
> > >  # hw/virtio/virtio.c
> > >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > index 9178271ab2..2e4eb0864a 100644
> > > --- a/hw/virtio/vhost-user.c
> > > +++ b/hw/virtio/vhost-user.c
> > > @@ -19,6 +19,7 @@
> > >  #include "qemu/sockets.h"
> > >  #include "migration/migration.h"
> > >  #include "migration/postcopy-ram.h"
> > > +#include "trace.h"
> > >  
> > >  #include <sys/ioctl.h>
> > >  #include <sys/socket.h>
> > > @@ -133,6 +134,7 @@ struct vhost_user {
> > >      int slave_fd;
> > >      NotifierWithReturn postcopy_notifier;
> > >      struct PostCopyFD  postcopy_fd;
> > > +    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
> > >  };
> > >  
> > >  static bool ioeventfd_enabled(void)
> > > @@ -300,11 +302,13 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
> > >  static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > >                                      struct vhost_memory *mem)
> > >  {
> > > +    struct vhost_user *u = dev->opaque;
> > >      int fds[VHOST_MEMORY_MAX_NREGIONS];
> > >      int i, fd;
> > >      size_t fd_num = 0;
> > >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> > > -                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > > +                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
> > > +                           !u->postcopy_fd.handler;
> > 
> > (indent)
> 
> Fixed
> 
> > >  
> > >      VhostUserMsg msg = {
> > >          .request = VHOST_USER_SET_MEM_TABLE,
> > > @@ -350,6 +354,57 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > >          return -1;
> > >      }
> > >  
> > > +    if (u->postcopy_fd.handler) {
> > 
> > It seems that after this handler is set, we never clean it up.  Do we
> > need to unset it somewhere? (maybe vhost_user_postcopy_end?)
> 
> Hmm yes I'll have a look at that.
> 
> > > +        VhostUserMsg msg_reply;
> > > +        int region_i, reply_i;
> > > +        if (vhost_user_read(dev, &msg_reply) < 0) {
> > > +            return -1;
> > > +        }
> > > +
> > > +        if (msg_reply.request != VHOST_USER_SET_MEM_TABLE) {
> > > +            error_report("%s: Received unexpected msg type."
> > > +                         "Expected %d received %d", __func__,
> > > +                         VHOST_USER_SET_MEM_TABLE, msg_reply.request);
> > > +            return -1;
> > > +        }
> > > +        /* We're using the same structure, just reusing one of the
> > > +         * fields, so it should be the same size.
> > > +         */
> > > +        if (msg_reply.size != msg.size) {
> > > +            error_report("%s: Unexpected size for postcopy reply "
> > > +                         "%d vs %d", __func__, msg_reply.size, msg.size);
> > > +            return -1;
> > > +        }
> > > +
> > > +        memset(u->postcopy_client_bases, 0,
> > > +               sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
> > > +
> > > +        /* They're in the same order as the regions that were sent
> > > +         * but some of the regions were skipped (above) if they
> > > +         * didn't have fd's
> > > +        */
> > > +        for (reply_i = 0, region_i = 0;
> > > +             region_i < dev->mem->nregions;
> > > +             region_i++) {
> > > +            if (reply_i < fd_num &&
> > > +                msg_reply.payload.memory.regions[region_i].guest_phys_addr ==
> >                                                     ^^^^^^^^
> >                                           should this be reply_i?
> 
> Yes it should - nicely spotted
> 
> > (And maybe we can use pointers for the regions for better readability?)
> 
> I'm nervous of doing that since VhostUserMsg is 'packed' - and I'm not
> convinced it's legal to take a pointer to a member (although I think
> we do it in a whole bunch of places and clang moans about it).

Could I ask why a packed struct is not suitable for taking pointers to
its fields?  I hardly use clang, and I feel like there is something I
may have missed in C programming...

> 
> > > +                dev->mem->regions[region_i].guest_phys_addr) {
> > > +                u->postcopy_client_bases[region_i] =
> > > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr;
> > > +                trace_vhost_user_set_mem_table_postcopy(
> > > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr,
> > > +                    msg.payload.memory.regions[reply_i].userspace_addr,
>                                                     ^^^^^^^
>                         and I think this one is region_i

Hmm... shouldn't msg.payload.memory.regions[] be defined with size
VHOST_MEMORY_MAX_NREGIONS as well?

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 19/32] vhost+postcopy: Resolve client address
  2017-09-11 11:58       ` Dr. David Alan Gilbert
@ 2017-09-13  5:18         ` Peter Xu
  0 siblings, 0 replies; 94+ messages in thread
From: Peter Xu @ 2017-09-13  5:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Mon, Sep 11, 2017 at 12:58:15PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Thu, Aug 24, 2017 at 08:27:17PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Resolve fault addresses read off the clients UFD into RAMBlock
> > > and offset, and call back to the postcopy code to ask for the page.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  hw/virtio/trace-events |  3 +++
> > >  hw/virtio/vhost-user.c | 30 +++++++++++++++++++++++++++++-
> > >  2 files changed, 32 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > index 5067dee19b..f7d4b831fe 100644
> > > --- a/hw/virtio/trace-events
> > > +++ b/hw/virtio/trace-events
> > > @@ -1,6 +1,9 @@
> > >  # See docs/devel/tracing.txt for syntax documentation.
> > >  
> > >  # hw/virtio/vhost-user.c
> > > +vhost_user_postcopy_fault_handler(const char *name, uint64_t fault_address, int nregions) "%s: @0x%"PRIx64" nregions:%d"
> > > +vhost_user_postcopy_fault_handler_loop(int i, uint64_t client_base, uint64_t size) "%d: client 0x%"PRIx64" +0x%"PRIx64
> > > +vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t rb_offset) "%d: region_offset: 0x%"PRIx64" rb_offset:0x%"PRIx64
> > >  vhost_user_postcopy_listen(void) ""
> > >  vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> > >  vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
> > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > index fbe2743298..2897ff70b3 100644
> > > --- a/hw/virtio/vhost-user.c
> > > +++ b/hw/virtio/vhost-user.c
> > > @@ -816,7 +816,35 @@ out:
> > >  static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
> > >                                               void *ufd)
> > >  {
> > > -    return 0;
> > > +    struct vhost_dev *dev = pcfd->data;
> > > +    struct vhost_user *u = dev->opaque;
> > > +    struct uffd_msg *msg = ufd;
> > > +    uint64_t faultaddr = msg->arg.pagefault.address;
> > > +    RAMBlock *rb = NULL;
> > > +    uint64_t rb_offset;
> > > +    int i;
> > > +
> > > +    trace_vhost_user_postcopy_fault_handler(pcfd->idstr, faultaddr,
> > > +                                            dev->mem->nregions);
> > > +    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
> > 
> > > Should dev->mem->nregions always be the same as u->region_rb_len?
> 
> u->region_rb_len only gets updated when vhost_user_set_mem_table is
> called, so I think there are short periods of time when they don't
> quite match.
> (We do have to take some more care than we are at the moment during
> updates, because this address resolution happens off the postcopy
> thread)

I see, so memory layout can change along the way...

But I still doubt whether this single MIN() can work.

Say, we have these arrays already:

- array A: dev->mem->regions[]
- array B: u->region_rb[]
- array C: u->postcopy_client_bases[]

These arrays should always be aligned with each other (index "i" of
array "A/B/C" will always describe the same memory region).  But since
we can change the memory layout dynamically during postcopy,
array A can grow/shrink/change in the following path:

  vhost_region_{add|delete}
    updates array A              (1)
  vhost_region_{add|delete}
    updates array A              (2)
  vhost_region_{add|delete}
    updates array A              (3)
  ...
  vhost_commit
    vhost_set_mem_table
      align arrays B/C with A    (4)

IMHO array A may not really match B/C during steps (1)-(3), until step
(4) re-aligns them?  And if they are not aligned with each other, I
guess a single MIN() won't help much? (Since the indexing below would
be problematic?)

(Hmm, can we just disallow memory change during postcopy for now?)

> 
> > > +        trace_vhost_user_postcopy_fault_handler_loop(i,
> > > +                u->postcopy_client_bases[i], dev->mem->regions[i].memory_size);
> > > +        if (faultaddr >= u->postcopy_client_bases[i]) {

Ah, wait...

postcopy_client_bases[] is now defined with static size
VHOST_MEMORY_MAX_NREGIONS.  Shouldn't it be dynamically allocated as
well with dev->mem->nregions, just like vhost_user.region_rb[]?

Maybe we want to leave postcopy_client_bases[i] as zero when
dev->mem->regions[i] is not a vhost-user supported region (one without
an "fd")?

> > > +            /* Offset of the fault address in the vhost region */
> > > +            uint64_t region_offset = faultaddr - u->postcopy_client_bases[i];
> > > +            if (region_offset <= dev->mem->regions[i].memory_size) {
> > 
> > Should be "<" rather than "<="?  Say:
> > 
> > Region 1: [0, 1M), size 1M
> > Region 2: [1M, 2M), size 1M
> > 
> > Looks like otherwise faultaddr=1M will fall into region 1, while it
> > should be region 2?
> 
> Fixed; thanks.
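
For reference, the boundary case above as a minimal sketch.
demo_in_region() is a made-up helper, not a QEMU function; it only shows
the half-open [base, base + size) test that the "<" comparison gives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr lies in the half-open range [base, base + size).
 * Hypothetical helper for illustration only. */
static bool demo_in_region(uint64_t base, uint64_t size, uint64_t addr)
{
    assert(size <= UINT64_MAX - base);   /* the range must not wrap */
    return addr >= base && addr - base < size;
}
```

With Region 1 = [0, 1M) and Region 2 = [1M, 2M), an address of exactly 1M
is rejected for Region 1 and accepted for Region 2, so neighbouring
regions never both claim the boundary address.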
> 
> > 
> > > +                rb_offset = region_offset + u->region_rb_offset[i];
> > > +                trace_vhost_user_postcopy_fault_handler_found(i,
> > > +                        region_offset, rb_offset);
> > > +                rb = u->region_rb[i];
> > 
> > Nit: this "rb" might be avoided if only used once.
> 
> It's only a local, ok if it makes it a little more readable.
> 
> Dave
> 
> > > +                return postcopy_request_shared_page(pcfd, rb, faultaddr,
> > > +                                                    rb_offset);
> > > +            }
> > > +        }
> > > +    }
> > > +    error_report("%s: Failed to find region for fault %" PRIx64,
> > > +                 __func__, faultaddr);
> > > +    return -1;
> > >  }
> > >  
> > >  /*
> > > -- 
> > > 2.13.5
> > > 
> > 
> > -- 
> > Peter Xu
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu
  2017-09-13  4:29         ` Peter Xu
@ 2017-09-13 12:15           ` Dr. David Alan Gilbert
  2017-09-15  8:57             ` Peter Xu
  0 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-13 12:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Sep 12, 2017 at 06:15:13PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Thu, Aug 24, 2017 at 08:27:14PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > We need a better way, but at the moment we need the address of the
> > > > mappings sent back to qemu so it can interpret the messages on the
> > > > userfaultfd it reads.
> > > > 
> > > > Note: We don't ask for the default 'ack' reply since we've got our own.
> > > > 
> > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > ---
> > > >  contrib/libvhost-user/libvhost-user.c | 15 ++++++++-
> > > >  docs/interop/vhost-user.txt           |  6 ++++
> > > >  hw/virtio/trace-events                |  1 +
> > > >  hw/virtio/vhost-user.c                | 57 ++++++++++++++++++++++++++++++++++-
> > > >  4 files changed, 77 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > > > index e6ab059a03..5ec54f7d60 100644
> > > > --- a/contrib/libvhost-user/libvhost-user.c
> > > > +++ b/contrib/libvhost-user/libvhost-user.c
> > > > @@ -477,13 +477,26 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> > > >              DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
> > > >                      __func__, i, reg_struct.range.start, reg_struct.range.len);
> > > >              /* TODO: Stash 'zero' support flags somewhere */
> > > > -            /* TODO: Get address back to QEMU */
> > > >  
> > > > +            /* TODO: We need to find a way for the qemu not to see the virtual
> > > > +             * addresses of the clients, so as to keep better separation.
> > > > +             */
> > > > +            /* Return the address to QEMU so that it can translate the ufd
> > > > +             * fault addresses back.
> > > > +             */
> > > > +            msg_region->userspace_addr = (uintptr_t)(mmap_addr +
> > > > +                                                     dev_region->mmap_offset);
> > > >          }
> > > >  
> > > >          close(vmsg->fds[i]);
> > > >      }
> > > >  
> > > > +    if (dev->postcopy_listening) {
> > > > +        /* Need to return the addresses - send the updated message back */
> > > > +        vmsg->fd_num = 0;
> > > > +        return true;
> > > > +    }
> > > > +
> > > >      return false;
> > > >  }
> > > >  
> > > > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > > > index 73c3dd74db..b2a548c94d 100644
> > > > --- a/docs/interop/vhost-user.txt
> > > > +++ b/docs/interop/vhost-user.txt
> > > > @@ -413,12 +413,18 @@ Master message types
> > > >        Id: 5
> > > >        Equivalent ioctl: VHOST_SET_MEM_TABLE
> > > >        Master payload: memory regions description
> > > > +      Slave payload: (postcopy only) memory regions description
> > > >  
> > > >        Sets the memory map regions on the slave so it can translate the vring
> > > >        addresses. In the ancillary data there is an array of file descriptors
> > > >        for each memory mapped region. The size and ordering of the fds matches
> > > >        the number and ordering of memory regions.
> > > >  
> > > > +      When postcopy-listening has been received, SET_MEM_TABLE replies with
> > > > +      the bases of the memory mapped regions to the master.  It must have mmap'd
> > > > +      the regions and enabled userfaultfd on them.  Note NEED_REPLY_MASK
> > > > +      is not set in this case.
> > > > +
> > > >   * VHOST_USER_SET_LOG_BASE
> > > >  
> > > >        Id: 6
> > > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > > index f736c7c84f..63fd4a79cf 100644
> > > > --- a/hw/virtio/trace-events
> > > > +++ b/hw/virtio/trace-events
> > > > @@ -2,6 +2,7 @@
> > > >  
> > > >  # hw/virtio/vhost-user.c
> > > >  vhost_user_postcopy_listen(void) ""
> > > > +vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> > > >  
> > > >  # hw/virtio/virtio.c
> > > >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > index 9178271ab2..2e4eb0864a 100644
> > > > --- a/hw/virtio/vhost-user.c
> > > > +++ b/hw/virtio/vhost-user.c
> > > > @@ -19,6 +19,7 @@
> > > >  #include "qemu/sockets.h"
> > > >  #include "migration/migration.h"
> > > >  #include "migration/postcopy-ram.h"
> > > > +#include "trace.h"
> > > >  
> > > >  #include <sys/ioctl.h>
> > > >  #include <sys/socket.h>
> > > > @@ -133,6 +134,7 @@ struct vhost_user {
> > > >      int slave_fd;
> > > >      NotifierWithReturn postcopy_notifier;
> > > >      struct PostCopyFD  postcopy_fd;
> > > > +    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
> > > >  };
> > > >  
> > > >  static bool ioeventfd_enabled(void)
> > > > @@ -300,11 +302,13 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
> > > >  static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > > >                                      struct vhost_memory *mem)
> > > >  {
> > > > +    struct vhost_user *u = dev->opaque;
> > > >      int fds[VHOST_MEMORY_MAX_NREGIONS];
> > > >      int i, fd;
> > > >      size_t fd_num = 0;
> > > >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> > > > -                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > > > +                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
> > > > +                           !u->postcopy_fd.handler;
> > > 
> > > (indent)
> > 
> > Fixed
> > 
> > > >  
> > > >      VhostUserMsg msg = {
> > > >          .request = VHOST_USER_SET_MEM_TABLE,
> > > > @@ -350,6 +354,57 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > > >          return -1;
> > > >      }
> > > >  
> > > > +    if (u->postcopy_fd.handler) {
> > > 
> > > It seems that after this handler is set, we never clean it up.  Do we
> > > need to unset it somewhere? (maybe vhost_user_postcopy_end?)
> > 
> > Hmm yes I'll have a look at that.
> > 
> > > > +        VhostUserMsg msg_reply;
> > > > +        int region_i, reply_i;
> > > > +        if (vhost_user_read(dev, &msg_reply) < 0) {
> > > > +            return -1;
> > > > +        }
> > > > +
> > > > +        if (msg_reply.request != VHOST_USER_SET_MEM_TABLE) {
> > > > +            error_report("%s: Received unexpected msg type."
> > > > +                         "Expected %d received %d", __func__,
> > > > +                         VHOST_USER_SET_MEM_TABLE, msg_reply.request);
> > > > +            return -1;
> > > > +        }
> > > > +        /* We're using the same structure, just reusing one of the
> > > > +         * fields, so it should be the same size.
> > > > +         */
> > > > +        if (msg_reply.size != msg.size) {
> > > > +            error_report("%s: Unexpected size for postcopy reply "
> > > > +                         "%d vs %d", __func__, msg_reply.size, msg.size);
> > > > +            return -1;
> > > > +        }
> > > > +
> > > > +        memset(u->postcopy_client_bases, 0,
> > > > +               sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
> > > > +
> > > > +        /* They're in the same order as the regions that were sent
> > > > +         * but some of the regions were skipped (above) if they
> > > > +         * didn't have fd's
> > > > +        */
> > > > +        for (reply_i = 0, region_i = 0;
> > > > +             region_i < dev->mem->nregions;
> > > > +             region_i++) {
> > > > +            if (reply_i < fd_num &&
> > > > +                msg_reply.payload.memory.regions[region_i].guest_phys_addr ==
> > >                                                     ^^^^^^^^
> > >                                           should this be reply_i?
> > 
> > Yes it should - nicely spotted
> > 
> > > (And maybe we can use pointers for the regions for better readability?)
> > 
> > I'm nervous of doing that since VhostUserMsg is 'packed' - and I'm not
> > convinced it's legal to take a pointer to a member (although I think
> > we do it in a whole bunch of places and clang moans about it).
> 
> Could I ask why a packed struct is not suitable for taking pointers
> to its fields?  I hardly use clang, and I feel like
> there is something I may have missed in C programming...

The problem is that when you 'pack' a structure all the alignment rules
you normally have go away;  when the compiler knows it's accessing
a packed structure that's OK because the compiler knows not to rely
on those alignments;  however if I took a pointer to the
regions table in the msg I'd end up with a VhostUserMemoryRegion*
and a pointer like that carries nothing to tell the compiler to take
care about alignment.
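
As a rough illustration (a sketch only, assuming GCC/clang's
__attribute__((packed)); DemoRegion, DemoMsg and demo_get_region() are
made-up stand-ins, not the real VhostUserMsg definitions), one way to
avoid holding a possibly misaligned typed pointer is to copy the entry
out into an ordinary local:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down stand-ins for the region and message types */
typedef struct {
    uint64_t guest_phys_addr;
    uint64_t userspace_addr;
} DemoRegion;

typedef struct {
    uint8_t    request;      /* 1 byte, so regions[] starts at offset 1 */
    DemoRegion regions[2];
} __attribute__((packed)) DemoMsg;

/* Copy region i out of the packed message into a naturally aligned
 * local.  memcpy() makes no alignment assumptions about its source,
 * unlike a dereference through a DemoRegion pointer. */
static DemoRegion demo_get_region(const DemoMsg *msg, size_t i)
{
    DemoRegion out;
    const unsigned char *raw = (const unsigned char *)msg;

    assert(i < 2);
    memcpy(&out, raw + offsetof(DemoMsg, regions) + i * sizeof(DemoRegion),
           sizeof(out));
    return out;
}
```

Accessing msg->regions[i].userspace_addr directly stays fine, since the
compiler can see that the access goes through the packed type; the copy
is only needed when something wants a plain DemoRegion to point at.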

> > 
> > > > +                dev->mem->regions[region_i].guest_phys_addr) {
> > > > +                u->postcopy_client_bases[region_i] =
> > > > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr;
> > > > +                trace_vhost_user_set_mem_table_postcopy(
> > > > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr,
> > > > +                    msg.payload.memory.regions[reply_i].userspace_addr,
> >                                                     ^^^^^^^
> >                         and I think this one is region_i
> 
> Hmm... shouldn't msg.payload.memory.regions[] be defined with size
> VHOST_MEMORY_MAX_NREGIONS as well?

Yes, it already is; msg is a VhostUserMsg, payload.memory is a
VhostUserMemory and it has:
  VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];

Dave

> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 22/32] vhost+postcopy: Add vhost waker
  2017-08-30  5:55     ` Peter Xu
@ 2017-09-13 13:09       ` Dr. David Alan Gilbert
  2017-09-18  3:57         ` Peter Xu
  0 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-13 13:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 24, 2017 at 08:27:20PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Register a waker function in vhost-user code to be notified when
> > pages arrive or when previously mapped pages are requested.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  hw/virtio/trace-events |  3 +++
> >  hw/virtio/vhost-user.c | 26 ++++++++++++++++++++++++++
> >  2 files changed, 29 insertions(+)
> > 
> > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > index f7d4b831fe..adebf6dc6b 100644
> > --- a/hw/virtio/trace-events
> > +++ b/hw/virtio/trace-events
> > @@ -7,6 +7,9 @@ vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t
> >  vhost_user_postcopy_listen(void) ""
> >  vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> >  vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
> > +vhost_user_postcopy_waker(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
> > +vhost_user_postcopy_waker_found(uint64_t client_addr) "0x%"PRIx64
> > +vhost_user_postcopy_waker_nomatch(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
> >  
> >  # hw/virtio/virtio.c
> >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index 2897ff70b3..3bff33a1a6 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -847,6 +847,31 @@ static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
> >      return -1;
> >  }
> >  
> > +static int vhost_user_postcopy_waker(struct PostCopyFD *pcfd, RAMBlock *rb,
> > +                                     uint64_t offset)
> > +{
> > +    struct vhost_dev *dev = pcfd->data;
> > +    struct vhost_user *u = dev->opaque;
> > +    int i;
> > +
> > +    trace_vhost_user_postcopy_waker(qemu_ram_get_idstr(rb), offset);
> > +    /* Translate the offset into an address in the client's address space */
> > +    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
> > +        if (u->region_rb[i] == rb &&
> > +            offset >= u->region_rb_offset[i] &&
> > +            offset < (u->region_rb_offset[i] +
> > +                      dev->mem->regions[i].memory_size)) {
> 
> Just curious: checks against offset should only be for safety, right?
> Is there a valid case where rb is correct but the offset falls outside
> the range of that RAMBlock?

Yes, I think that case does exist.

'regions' are mapping regions as visible from the guest, but there may
be two regions that are mapped to the same RAMBlock.  In our world
the cleanest example of that is an x86 guest with 8GB of RAM; it
has a single pc.ram RAMBlock of 8GB in size, but that's mapped
in two chunks, a 3GB chunk at the bottom of physical address space
and a 5GB chunk that starts on the 4GB boundary - i.e. leaving
a 1GB hole.
In this structure that appears as two regions each with the same rb and
different offsets.
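
A small sketch of that 8GB layout, to show why the offset check is what
picks the region.  Everything here is illustrative: the DemoRegionMap
type, the integer rb_id standing in for the RAMBlock pointer, and the
client base addresses are all made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_GiB (1ULL << 30)

typedef struct {
    int      rb_id;        /* stands in for the RAMBlock pointer */
    uint64_t rb_offset;    /* offset of the region within the RAMBlock */
    uint64_t size;         /* region size */
    uint64_t client_base;  /* made-up mapping address in the client */
} DemoRegionMap;

/* One 8GB pc.ram block (id 0) seen as a 3GB region and a 5GB region */
static const DemoRegionMap demo_maps[] = {
    { 0, 0,            3 * DEMO_GiB, 0x7f0000000000ULL },
    { 0, 3 * DEMO_GiB, 5 * DEMO_GiB, 0x7f8000000000ULL },
};

/* Translate (rb, offset) to a client address.
 * Returns the region index on success, -1 if no region covers it. */
static int demo_rb_to_client(const DemoRegionMap *maps, size_t n,
                             int rb_id, uint64_t offset,
                             uint64_t *client_addr)
{
    assert(client_addr != NULL);
    for (size_t i = 0; i < n; i++) {
        if (maps[i].rb_id == rb_id &&
            offset >= maps[i].rb_offset &&
            offset - maps[i].rb_offset < maps[i].size) {
            *client_addr = maps[i].client_base +
                           (offset - maps[i].rb_offset);
            return (int)i;
        }
    }
    return -1;
}
```

An offset of 4GB into pc.ram matches the rb of both entries, but only the
second one covers it, giving the second client base plus 1GB.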

Dave

> > +            uint64_t client_addr = (offset - u->region_rb_offset[i]) +
> > +                                   u->postcopy_client_bases[i];
> > +            trace_vhost_user_postcopy_waker_found(client_addr);
> > +            return postcopy_wake_shared(pcfd, client_addr, rb);
> > +        }
> > +    }
> > +
> > +    trace_vhost_user_postcopy_waker_nomatch(qemu_ram_get_idstr(rb), offset);
> > +    return 0;
> > +}
> > +
> >  /*
> >   * Called at the start of an inbound postcopy on reception of the
> >   * 'advise' command.
> > @@ -892,6 +917,7 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
> >      u->postcopy_fd.fd = ufd;
> >      u->postcopy_fd.data = dev;
> >      u->postcopy_fd.handler = vhost_user_postcopy_fault_handler;
> > +    u->postcopy_fd.waker = vhost_user_postcopy_waker;
> >      u->postcopy_fd.idstr = "vhost-user"; /* Need to find unique name */
> >      postcopy_register_shared_ufd(&u->postcopy_fd);
> >      return 0;
> > -- 
> > 2.13.5
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 08/32] postcopy: Add vhost-user flag for postcopy and check it
  2017-08-29  6:22     ` Peter Xu
@ 2017-09-13 14:34       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-13 14:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 24, 2017 at 08:27:06PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Add a vhost feature flag for postcopy support, and
> > use the postcopy notifier to check it before allowing postcopy.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.h |  1 +
> >  docs/interop/vhost-user.txt           | 10 +++++++++
> >  hw/virtio/vhost-user.c                | 40 ++++++++++++++++++++++++++++++++++-
> >  3 files changed, 50 insertions(+), 1 deletion(-)
> > 
> > diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> > index acd019876d..95d0d34a28 100644
> > --- a/contrib/libvhost-user/libvhost-user.h
> > +++ b/contrib/libvhost-user/libvhost-user.h
> > @@ -34,6 +34,7 @@ enum VhostUserProtocolFeature {
> >      VHOST_USER_PROTOCOL_F_MQ = 0,
> >      VHOST_USER_PROTOCOL_F_LOG_SHMFD = 1,
> >      VHOST_USER_PROTOCOL_F_RARP = 2,
> > +    VHOST_USER_PROTOCOL_F_PAGEFAULT = 7,
> >  
> >      VHOST_USER_PROTOCOL_F_MAX
> >  };
> > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > index 954771d0d8..a279560eb0 100644
> > --- a/docs/interop/vhost-user.txt
> > +++ b/docs/interop/vhost-user.txt
> > @@ -273,6 +273,15 @@ Once the source has finished migration, rings will be stopped by
> >  the source. No further update must be done before rings are
> >  restarted.
> >  
> > +In postcopy migration the slave is started before all the memory has been
> > +received from the source host, and care must be taken to avoid accessing pages
> > +that have yet to be received.  The slave opens a 'userfault'-fd and registers
> > +the memory with it; this fd is then passed back over to the master.
> > +The master services requests on the userfaultfd for pages that are accessed
> > +and when the page is available it performs WAKE ioctl's on the userfaultfd
> > +to wake the stalled slave.  The client indicates support for this via the
> > +VHOST_USER_PROTOCOL_F_PAGEFAULT feature.
> > +
> >  IOMMU support
> >  -------------
> >  
> > @@ -327,6 +336,7 @@ Protocol features
> >  #define VHOST_USER_PROTOCOL_F_MTU            4
> >  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ      5
> >  #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN   6
> > +#define VHOST_USER_PROTOCOL_F_PAGEFAULT      7
> >  
> >  Master message types
> >  --------------------
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index 093675ed98..c51bbd1296 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -17,6 +17,8 @@
> >  #include "sysemu/kvm.h"
> >  #include "qemu/error-report.h"
> >  #include "qemu/sockets.h"
> > +#include "migration/migration.h"
> > +#include "migration/postcopy-ram.h"
> >  
> >  #include <sys/ioctl.h>
> >  #include <sys/socket.h>
> > @@ -34,7 +36,7 @@ enum VhostUserProtocolFeature {
> >      VHOST_USER_PROTOCOL_F_NET_MTU = 4,
> >      VHOST_USER_PROTOCOL_F_SLAVE_REQ = 5,
> >      VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,
> > -
> > +    VHOST_USER_PROTOCOL_F_PAGEFAULT = 7,
> >      VHOST_USER_PROTOCOL_F_MAX
> >  };
> >  
> > @@ -123,8 +125,10 @@ static VhostUserMsg m __attribute__ ((unused));
> >  #define VHOST_USER_VERSION    (0x1)
> >  
> >  struct vhost_user {
> > +    struct vhost_dev *dev;
> >      CharBackend *chr;
> >      int slave_fd;
> > +    NotifierWithReturn postcopy_notifier;
> >  };
> >  
> >  static bool ioeventfd_enabled(void)
> > @@ -720,6 +724,33 @@ out:
> >      return ret;
> >  }
> >  
> > +static int vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
> > +                                        void *opaque)
> > +{
> > +    struct PostcopyNotifyData *pnd = opaque;
> > +    struct vhost_user *u = container_of(notifier, struct vhost_user,
> > +                                         postcopy_notifier);
> > +    struct vhost_dev *dev = u->dev;
> > +
> > +    switch (pnd->reason) {
> > +    case POSTCOPY_NOTIFY_PROBE:
> > +        if (!virtio_has_feature(dev->protocol_features,
> > +                                VHOST_USER_PROTOCOL_F_PAGEFAULT)) {
> > +            /* TODO: Get the device name into this error somehow */
> > +            error_setg(pnd->errp,
> > +                       "vhost-user backend not capable of postcopy");
> > +            return -ENOENT;
> > +        }
> > +        break;
> > +
> > +    default:
> > +        /* We ignore notifications we don't know */
> > +        break;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  static int vhost_user_init(struct vhost_dev *dev, void *opaque)
> >  {
> >      uint64_t features, protocol_features;
> > @@ -731,6 +762,7 @@ static int vhost_user_init(struct vhost_dev *dev, void *opaque)
> >      u = g_new0(struct vhost_user, 1);
> >      u->chr = opaque;
> >      u->slave_fd = -1;
> > +    u->dev = dev;
> >      dev->opaque = u;
> >  
> >      err = vhost_user_get_features(dev, &features);
> > @@ -787,6 +819,9 @@ static int vhost_user_init(struct vhost_dev *dev, void *opaque)
> >          return err;
> >      }
> >  
> > +    u->postcopy_notifier.notify = vhost_user_postcopy_notifier;
> > +    postcopy_add_notifier(&u->postcopy_notifier);
> > +
> >      return 0;
> >  }
> >  
> > @@ -797,6 +832,9 @@ static int vhost_user_cleanup(struct vhost_dev *dev)
> >      assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER);
> >  
> >      u = dev->opaque;
> > +    if (u->postcopy_notifier.notify) {
> 
> Detecting init using the notify hook is slightly strange here for
> me... If so, not sure whether we also need:
> 
>            u->postcopy_notifier.notify = NULL;
> 
> Or I'm not sure whether a 2nd call to vhost_user_cleanup() can be
> dangerous since postcopy_remove_notifier() will be called twice.

I've added the NULL assignment; I think I'd assumed that the notifier
remove also did that.
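
Roughly the shape of it, as a stand-alone sketch rather than the actual
change: DemoNotifier, demo_remove_notifier() and demo_cleanup() are
made-up stand-ins for NotifierWithReturn, postcopy_remove_notifier() and
vhost_user_cleanup().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for NotifierWithReturn */
typedef struct DemoNotifier {
    int (*notify)(struct DemoNotifier *n, void *data);
} DemoNotifier;

static int demo_remove_calls;   /* counts removals, for illustration */

/* Hypothetical stand-in for postcopy_remove_notifier() */
static void demo_remove_notifier(DemoNotifier *n)
{
    assert(n != NULL);
    demo_remove_calls++;
}

static int demo_notify(DemoNotifier *n, void *data)
{
    (void)n;
    (void)data;
    return 0;
}

/* The notify hook doubles as the "registered" flag, so it has to be
 * cleared once the notifier has been removed. */
static void demo_cleanup(DemoNotifier *n)
{
    if (n->notify) {
        demo_remove_notifier(n);
        n->notify = NULL;
    }
}
```

With the hook cleared, a second demo_cleanup() call on the same notifier
is a no-op instead of a second removal.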

Dave

> 
> Besides that, the patch looks good to me.  Thanks,
> 
> > +        postcopy_remove_notifier(&u->postcopy_notifier);
> > +    }
> >      if (u->slave_fd >= 0) {
> >          qemu_set_fd_handler(u->slave_fd, NULL, NULL, NULL);
> >          close(u->slave_fd);
> > -- 
> > 2.13.5
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 17/32] vhost+postcopy: Stash RAMBlock and offset
  2017-08-30  5:51     ` Peter Xu
@ 2017-09-13 15:59       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-13 15:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 24, 2017 at 08:27:15PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Stash the RAMBlock and offset for later use looking up
> > addresses.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  hw/virtio/trace-events |  1 +
> >  hw/virtio/vhost-user.c | 30 ++++++++++++++++++++++++++++++
> >  2 files changed, 31 insertions(+)
> > 
> > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > index 63fd4a79cf..5067dee19b 100644
> > --- a/hw/virtio/trace-events
> > +++ b/hw/virtio/trace-events
> > @@ -3,6 +3,7 @@
> >  # hw/virtio/vhost-user.c
> >  vhost_user_postcopy_listen(void) ""
> >  vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> > +vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
> >  
> >  # hw/virtio/virtio.c
> >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index 2e4eb0864a..fbe2743298 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -135,6 +135,14 @@ struct vhost_user {
> >      NotifierWithReturn postcopy_notifier;
> >      struct PostCopyFD  postcopy_fd;
> >      uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
> > +    /* Length of the region_rb and region_rb_offset arrays */
> > +    size_t             region_rb_len;
> > +    /* RAMBlock associated with a given region */
> > +    RAMBlock         **region_rb;
> > +    /* The offset from the start of the RAMBlock to the start of the
> > +     * vhost region.
> > +     */
> > +    ram_addr_t        *region_rb_offset;
> >  };
> >  
> >  static bool ioeventfd_enabled(void)
> > @@ -319,6 +327,17 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> >          msg.flags |= VHOST_USER_NEED_REPLY_MASK;
> >      }
> >  
> > +    if (u->region_rb_len < dev->mem->nregions) {
> > +        u->region_rb = g_renew(RAMBlock*, u->region_rb, dev->mem->nregions);
> > +        u->region_rb_offset = g_renew(ram_addr_t, u->region_rb_offset,
> > +                                      dev->mem->nregions);
> > +        memset(&(u->region_rb[u->region_rb_len]), '\0',
> > +               sizeof(RAMBlock *) * (dev->mem->nregions - u->region_rb_len));
> > +        memset(&(u->region_rb_offset[u->region_rb_len]), '\0',
> > +               sizeof(ram_addr_t) * (dev->mem->nregions - u->region_rb_len));
> > +        u->region_rb_len = dev->mem->nregions;
> > +    }
> > +
> >      for (i = 0; i < dev->mem->nregions; ++i) {
> >          struct vhost_memory_region *reg = dev->mem->regions + i;
> >          ram_addr_t offset;
> > @@ -327,8 +346,14 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> >          assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
> >          mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
> >                                       &offset);
> > +        u->region_rb_offset[i] = offset;
> > +        u->region_rb[i] = mr->ram_block;
> 
> Do we need to record this info even if fd <= 0?

Hmm no we don't;  I've moved them down into the if block and 0
initialised them in the non-fd case just to make sure we don't use
them accidentally.

Dave

> >          fd = memory_region_get_fd(mr);
> >          if (fd > 0) {
> > +            trace_vhost_user_set_mem_table_withfd(fd_num, mr->name,
> > +                                                  reg->memory_size,
> > +                                                  reg->guest_phys_addr,
> > +                                                  reg->userspace_addr, offset);
> >              msg.payload.memory.regions[fd_num].userspace_addr = reg->userspace_addr;
> >              msg.payload.memory.regions[fd_num].memory_size  = reg->memory_size;
> >              msg.payload.memory.regions[fd_num].guest_phys_addr = reg->guest_phys_addr;
> > @@ -992,6 +1017,11 @@ static int vhost_user_cleanup(struct vhost_dev *dev)
> >          close(u->slave_fd);
> >          u->slave_fd = -1;
> >      }
> > +    g_free(u->region_rb);
> > +    u->region_rb = NULL;
> > +    g_free(u->region_rb_offset);
> > +    u->region_rb_offset = NULL;
> > +    u->region_rb_len = 0;
> >      g_free(u);
> >      dev->opaque = 0;
> >  
> > -- 
> > 2.13.5
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 30/32] vhost: Merge neighbouring hugepage regions where appropriate
  2017-08-24 19:27   ` [Qemu-devel] [RFC v2 30/32] vhost: Merge neighbouring hugepage regions where appropriate Dr. David Alan Gilbert (git)
@ 2017-09-14  9:18     ` Igor Mammedov
  2017-09-25 11:19       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 94+ messages in thread
From: Igor Mammedov @ 2017-09-14  9:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	lvivier, aarcange, felipe, peterx, quintela

On Thu, 24 Aug 2017 20:27:28 +0100
"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:

> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Where two regions are created with a gap such that when aligned
> to hugepage boundaries, the two regions overlap, merge them.
Why only hugepage boundaries? It should be applicable to any alignment.

I'd say the patch isn't what I had in mind when we discussed the issue;
it builds on the already existing merging code and complicates the
code even more.

Have you looked into the possibility of rebuilding the memory map from
scratch, from the set of memory sections that vhost tracks, every time
vhost_region_add/vhost_region_del is called, or even only at
vhost_commit() time to reduce the number of rebuilds?
That should simplify the algorithm a lot, as memory sections come
from the flat view and never overlap, unlike the current merged memory
map in vhost_dev::mem, so it won't have to deal with first splitting
and then merging back every time the flatview changes.
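
Something along these lines is what I'm thinking of. It is only a sketch
under my own assumptions: struct sect and rebuild_map() are hypothetical
names, not existing vhost code, and the input is assumed to be already
sorted by guest physical address:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a tracked memory section; field names are
 * illustrative only. */
struct sect {
    uint64_t gpa;   /* guest physical address */
    uint64_t size;  /* bytes */
    uint64_t hva;   /* host (userspace) virtual address */
};

/*
 * Rebuild 'out' from 'in', which must be sorted by gpa and non-overlapping.
 * Neighbours that are contiguous in both gpa and hva are folded into one
 * entry. Returns the number of entries written; 'out' needs room for n.
 */
static size_t rebuild_map(const struct sect *in, size_t n, struct sect *out)
{
    size_t nout = 0;

    for (size_t i = 0; i < n; i++) {
        if (nout > 0) {
            struct sect *prev = &out[nout - 1];

            /* flat view property: sections never overlap */
            assert(prev->gpa + prev->size <= in[i].gpa);
            if (prev->gpa + prev->size == in[i].gpa &&
                prev->hva + prev->size == in[i].hva) {
                prev->size += in[i].size;   /* extend, nothing to split */
                continue;
            }
        }
        out[nout++] = in[i];
    }
    return nout;
}
```

Because the input never overlaps, there is only an "extend the previous
entry or append a new one" decision, with no split step at all. Any
alignment stretching could then be applied in the same pass.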

> I also add quite a few trace events to see what's going on.
> 
> Note: This doesn't handle all the cases, but does handle the common
> case on a PC due to the 640k hole.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>  hw/virtio/trace-events | 11 +++++++
>  hw/virtio/vhost.c      | 79 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 89 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 5b599617a1..f98efb39fd 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -1,5 +1,16 @@
>  # See docs/devel/tracing.txt for syntax documentation.
>  
> +# hw/virtio/vhost.c
> +vhost_dev_assign_memory_merged(int from, int to, uint64_t size, uint64_t start_addr, uint64_t uaddr) "f/t=%d/%d 0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
> +vhost_dev_assign_memory_not_merged(uint64_t size, uint64_t start_addr, uint64_t uaddr) "0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
> +vhost_dev_assign_memory_entry(uint64_t size, uint64_t start_addr, uint64_t uaddr) "0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
> +vhost_dev_assign_memory_exit(uint32_t nregions) "%"PRId32
> +vhost_huge_page_stretch_and_merge_entry(uint32_t nregions) "%"PRId32
> +vhost_huge_page_stretch_and_merge_can(void) ""
> +vhost_huge_page_stretch_and_merge_size_align(int d, uint64_t gpa, uint64_t align) "%d: gpa: 0x%"PRIx64" align: 0x%"PRIx64
> +vhost_huge_page_stretch_and_merge_start_align(int d, uint64_t gpa, uint64_t align) "%d: gpa: 0x%"PRIx64" align: 0x%"PRIx64
> +vhost_section(const char *name, int r) "%s:%d"
> +
>  # hw/virtio/vhost-user.c
>  vhost_user_postcopy_end_entry(void) ""
>  vhost_user_postcopy_end_exit(void) ""
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 6eddb099b0..fb506e747f 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -27,6 +27,7 @@
>  #include "hw/virtio/virtio-access.h"
>  #include "migration/blocker.h"
>  #include "sysemu/dma.h"
> +#include "trace.h"
>  
>  /* enabled until disconnected backend stabilizes */
>  #define _VHOST_DEBUG 1
> @@ -250,6 +251,8 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
>  {
>      int from, to;
>      struct vhost_memory_region *merged = NULL;
> +    trace_vhost_dev_assign_memory_entry(size, start_addr, uaddr);
> +
>      for (from = 0, to = 0; from < dev->mem->nregions; ++from, ++to) {
>          struct vhost_memory_region *reg = dev->mem->regions + to;
>          uint64_t prlast, urlast;
> @@ -293,11 +296,13 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
>          uaddr = merged->userspace_addr = u;
>          start_addr = merged->guest_phys_addr = s;
>          size = merged->memory_size = e - s + 1;
> +        trace_vhost_dev_assign_memory_merged(from, to, size, start_addr, uaddr);
>          assert(merged->memory_size);
>      }
>  
>      if (!merged) {
>          struct vhost_memory_region *reg = dev->mem->regions + to;
> +        trace_vhost_dev_assign_memory_not_merged(size, start_addr, uaddr);
>          memset(reg, 0, sizeof *reg);
>          reg->memory_size = size;
>          assert(reg->memory_size);
> @@ -307,6 +312,7 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
>      }
>      assert(to <= dev->mem->nregions + 1);
>      dev->mem->nregions = to;
> +    trace_vhost_dev_assign_memory_exit(to);
>  }
>  
>  static uint64_t vhost_get_log_size(struct vhost_dev *dev)
> @@ -610,8 +616,12 @@ static void vhost_set_memory(MemoryListener *listener,
>  
>  static bool vhost_section(MemoryRegionSection *section)
>  {
> -    return memory_region_is_ram(section->mr) &&
> +    bool result;
> +    result = memory_region_is_ram(section->mr) &&
>          !memory_region_is_rom(section->mr);
> +
> +    trace_vhost_section(section->mr->name, result);
> +    return result;
>  }
>  
>  static void vhost_begin(MemoryListener *listener)
> @@ -622,6 +632,68 @@ static void vhost_begin(MemoryListener *listener)
>      dev->mem_changed_start_addr = -1;
>  }
>  
> +/* Look for regions that are hugepage backed but not aligned
> + * and fix them up to be aligned.
> + * TODO: For now this is just enough to deal with the 640k hole
> + */
> +static bool vhost_huge_page_stretch_and_merge(struct vhost_dev *dev)
> +{
> +    int i, j;
> +    bool result = true;
> +    trace_vhost_huge_page_stretch_and_merge_entry(dev->mem->nregions);
> +
> +    for (i = 0; i < dev->mem->nregions; i++) {
> +        struct vhost_memory_region *reg = dev->mem->regions + i;
> +        ram_addr_t offset;
> +        RAMBlock *rb = qemu_ram_block_from_host((void *)reg->userspace_addr,
> +                                                false, &offset);
> +        size_t pagesize = qemu_ram_pagesize(rb);
> +        uint64_t alignage;
> +        alignage = reg->guest_phys_addr & (pagesize - 1);
> +        if (alignage) {
> +
> +            trace_vhost_huge_page_stretch_and_merge_start_align(i,
> +                                                (uint64_t)reg->guest_phys_addr,
> +                                                alignage);
> +            for (j = 0; j < dev->mem->nregions; j++) {
> +                struct vhost_memory_region *oreg = dev->mem->regions + j;
> +                if (j == i) {
> +                    continue;
> +                }
> +
> +                if (oreg->guest_phys_addr ==
> +                        (reg->guest_phys_addr - alignage) &&
> +                    oreg->userspace_addr ==
> +                         (reg->userspace_addr - alignage)) {
> +                    struct vhost_memory_region treg = *reg;
> +                    trace_vhost_huge_page_stretch_and_merge_can();
> +                    vhost_dev_unassign_memory(dev, oreg->guest_phys_addr,
> +                                              oreg->memory_size);
> +                    vhost_dev_unassign_memory(dev, treg.guest_phys_addr,
> +                                              treg.memory_size);
> +                    vhost_dev_assign_memory(dev,
> +                                            treg.guest_phys_addr - alignage,
> +                                            treg.memory_size + alignage,
> +                                            treg.userspace_addr - alignage);
> +                    return vhost_huge_page_stretch_and_merge(dev);
> +                }
> +            }
> +        }
> +        alignage = reg->memory_size & (pagesize - 1);
> +        if (alignage) {
> +            trace_vhost_huge_page_stretch_and_merge_size_align(i,
> +                                               (uint64_t)reg->guest_phys_addr,
> +                                               alignage);
> +            /* We ignore this if we find something else to merge,
> +             * so we only return false if we're left with this
> +             */
> +            result = false;
> +        }
> +    }
> +
> +    return result;
> +}
> +
>  static void vhost_commit(MemoryListener *listener)
>  {
>      struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> @@ -641,6 +713,7 @@ static void vhost_commit(MemoryListener *listener)
>          return;
>      }
>  
> +    vhost_huge_page_stretch_and_merge(dev);
>      if (dev->started) {
>          start_addr = dev->mem_changed_start_addr;
>          size = dev->mem_changed_end_addr - dev->mem_changed_start_addr + 1;
> @@ -1512,6 +1585,10 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
>          goto fail_features;
>      }
>  
> +    if (!vhost_huge_page_stretch_and_merge(hdev)) {
> +        VHOST_OPS_DEBUG("vhost_huge_page_stretch_and_merge failed");
> +        goto fail_mem;
> +    }
>      if (vhost_dev_has_iommu(hdev)) {
>          memory_listener_register(&hdev->iommu_listener, vdev->dma_as);
>      }


* Re: [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu
  2017-09-13 12:15           ` Dr. David Alan Gilbert
@ 2017-09-15  8:57             ` Peter Xu
  2017-09-15 15:32               ` Dr. David Alan Gilbert
  2017-09-18  9:31               ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 94+ messages in thread
From: Peter Xu @ 2017-09-15  8:57 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Wed, Sep 13, 2017 at 01:15:32PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Tue, Sep 12, 2017 at 06:15:13PM +0100, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > On Thu, Aug 24, 2017 at 08:27:14PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > 
> > > > > We need a better way, but at the moment we need the address of the
> > > > > mappings sent back to qemu so it can interpret the messages on the
> > > > > userfaultfd it reads.
> > > > > 
> > > > > Note: We don't ask for the default 'ack' reply since we've got our own.
> > > > > 
> > > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > > ---
> > > > >  contrib/libvhost-user/libvhost-user.c | 15 ++++++++-
> > > > >  docs/interop/vhost-user.txt           |  6 ++++
> > > > >  hw/virtio/trace-events                |  1 +
> > > > >  hw/virtio/vhost-user.c                | 57 ++++++++++++++++++++++++++++++++++-
> > > > >  4 files changed, 77 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > > > > index e6ab059a03..5ec54f7d60 100644
> > > > > --- a/contrib/libvhost-user/libvhost-user.c
> > > > > +++ b/contrib/libvhost-user/libvhost-user.c
> > > > > @@ -477,13 +477,26 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> > > > >              DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
> > > > >                      __func__, i, reg_struct.range.start, reg_struct.range.len);
> > > > >              /* TODO: Stash 'zero' support flags somewhere */
> > > > > -            /* TODO: Get address back to QEMU */
> > > > >  
> > > > > +            /* TODO: We need to find a way for the qemu not to see the virtual
> > > > > +             * addresses of the clients, so as to keep better separation.
> > > > > +             */
> > > > > +            /* Return the address to QEMU so that it can translate the ufd
> > > > > +             * fault addresses back.
> > > > > +             */
> > > > > +            msg_region->userspace_addr = (uintptr_t)(mmap_addr +
> > > > > +                                                     dev_region->mmap_offset);
> > > > >          }
> > > > >  
> > > > >          close(vmsg->fds[i]);
> > > > >      }
> > > > >  
> > > > > +    if (dev->postcopy_listening) {
> > > > > +        /* Need to return the addresses - send the updated message back */
> > > > > +        vmsg->fd_num = 0;
> > > > > +        return true;
> > > > > +    }
> > > > > +
> > > > >      return false;
> > > > >  }
> > > > >  
> > > > > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > > > > index 73c3dd74db..b2a548c94d 100644
> > > > > --- a/docs/interop/vhost-user.txt
> > > > > +++ b/docs/interop/vhost-user.txt
> > > > > @@ -413,12 +413,18 @@ Master message types
> > > > >        Id: 5
> > > > >        Equivalent ioctl: VHOST_SET_MEM_TABLE
> > > > >        Master payload: memory regions description
> > > > > +      Slave payload: (postcopy only) memory regions description
> > > > >  
> > > > >        Sets the memory map regions on the slave so it can translate the vring
> > > > >        addresses. In the ancillary data there is an array of file descriptors
> > > > >        for each memory mapped region. The size and ordering of the fds matches
> > > > >        the number and ordering of memory regions.
> > > > >  
> > > > > +      When postcopy-listening has been received, SET_MEM_TABLE replies with
> > > > > +      the bases of the memory mapped regions to the master.  It must have mmap'd
> > > > > +      the regions and enabled userfaultfd on them.  Note NEED_REPLY_MASK
> > > > > +      is not set in this case.
> > > > > +
> > > > >   * VHOST_USER_SET_LOG_BASE
> > > > >  
> > > > >        Id: 6
> > > > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > > > index f736c7c84f..63fd4a79cf 100644
> > > > > --- a/hw/virtio/trace-events
> > > > > +++ b/hw/virtio/trace-events
> > > > > @@ -2,6 +2,7 @@
> > > > >  
> > > > >  # hw/virtio/vhost-user.c
> > > > >  vhost_user_postcopy_listen(void) ""
> > > > > +vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> > > > >  
> > > > >  # hw/virtio/virtio.c
> > > > >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > > index 9178271ab2..2e4eb0864a 100644
> > > > > --- a/hw/virtio/vhost-user.c
> > > > > +++ b/hw/virtio/vhost-user.c
> > > > > @@ -19,6 +19,7 @@
> > > > >  #include "qemu/sockets.h"
> > > > >  #include "migration/migration.h"
> > > > >  #include "migration/postcopy-ram.h"
> > > > > +#include "trace.h"
> > > > >  
> > > > >  #include <sys/ioctl.h>
> > > > >  #include <sys/socket.h>
> > > > > @@ -133,6 +134,7 @@ struct vhost_user {
> > > > >      int slave_fd;
> > > > >      NotifierWithReturn postcopy_notifier;
> > > > >      struct PostCopyFD  postcopy_fd;
> > > > > +    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
> > > > >  };
> > > > >  
> > > > >  static bool ioeventfd_enabled(void)
> > > > > @@ -300,11 +302,13 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
> > > > >  static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > > > >                                      struct vhost_memory *mem)
> > > > >  {
> > > > > +    struct vhost_user *u = dev->opaque;
> > > > >      int fds[VHOST_MEMORY_MAX_NREGIONS];
> > > > >      int i, fd;
> > > > >      size_t fd_num = 0;
> > > > >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> > > > > -                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > > > > +                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
> > > > > +                           !u->postcopy_fd.handler;
> > > > 
> > > > (indent)
> > > 
> > > Fixed
> > > 
> > > > >  
> > > > >      VhostUserMsg msg = {
> > > > >          .request = VHOST_USER_SET_MEM_TABLE,
> > > > > @@ -350,6 +354,57 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > > > >          return -1;
> > > > >      }
> > > > >  
> > > > > +    if (u->postcopy_fd.handler) {
> > > > 
> > > > It seems that after this handler is set, we never clean it up.  Do we
> > > > need to unset it somewhere? (maybe vhost_user_postcopy_end?)
> > > 
> > > Hmm yes I'll have a look at that.
> > > 
> > > > > +        VhostUserMsg msg_reply;
> > > > > +        int region_i, reply_i;
> > > > > +        if (vhost_user_read(dev, &msg_reply) < 0) {
> > > > > +            return -1;
> > > > > +        }
> > > > > +
> > > > > +        if (msg_reply.request != VHOST_USER_SET_MEM_TABLE) {
> > > > > +            error_report("%s: Received unexpected msg type."
> > > > > +                         "Expected %d received %d", __func__,
> > > > > +                         VHOST_USER_SET_MEM_TABLE, msg_reply.request);
> > > > > +            return -1;
> > > > > +        }
> > > > > +        /* We're using the same structure, just reusing one of the
> > > > > +         * fields, so it should be the same size.
> > > > > +         */
> > > > > +        if (msg_reply.size != msg.size) {
> > > > > +            error_report("%s: Unexpected size for postcopy reply "
> > > > > +                         "%d vs %d", __func__, msg_reply.size, msg.size);
> > > > > +            return -1;
> > > > > +        }
> > > > > +
> > > > > +        memset(u->postcopy_client_bases, 0,
> > > > > +               sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
> > > > > +
> > > > > +        /* They're in the same order as the regions that were sent
> > > > > +         * but some of the regions were skipped (above) if they
> > > > > +         * didn't have fd's
> > > > > +        */
> > > > > +        for (reply_i = 0, region_i = 0;
> > > > > +             region_i < dev->mem->nregions;
> > > > > +             region_i++) {
> > > > > +            if (reply_i < fd_num &&
> > > > > +                msg_reply.payload.memory.regions[region_i].guest_phys_addr ==
> > > >                                                     ^^^^^^^^
> > > >                                           should this be reply_i?
> > > 
> > > Yes it should - nicely spotted
> > > 
> > > > (And maybe we can use pointers for the regions for better readability?)
> > > 
> > > I'm nervous of doing that since VhostUserMsg is 'packed' - and I'm not
> > > convinced it's legal to take a pointer to a member (although I think
> > > we do it in a whole bunch of places and clang moans about it).
> > 
> > Could I ask why packed struct is not suitable for taking field
> > pointers out of the structs?  I hardly use clang, and I feel like
> > there is something I may have missed in C programming...
> 
> The problem is that when you 'pack' a structure all the alignment rules
> you normally have go away;  when the compiler knows it's accessing
> a packed structure that's OK because the compiler knows not to rely
> on those alignments;  however if I took a pointer to the
> regions table in the msg I'd end up with a VhostUserMemoryRegion*
> and a pointer like that carries nothing to tell the compiler to take
> care about alignment.

Ah I see.

I did a test with gcc:

#include <stdio.h>

struct test {
    unsigned short a;
    unsigned long b;
};

struct test2 {
    struct test c;
} __attribute__ ((packed));

int main(void)
{
    printf("test is %zu, test2 is %zu\n",
           sizeof(struct test), sizeof(struct test2));
    return 0;
}

This outputs:

test is 16, test2 is 16

So I think even if test2 is marked as packed, it still keeps the layout
that test was defined with (otherwise I would expect test to be 16B while
test2 is 10B)?  I tried with clang and got the same result.

gcc version 6.1.1 20160621 (Red Hat 6.1.1-3) (GCC) 
clang version 3.8.1 (tags/RELEASE_381/final)

> 
> > > 
> > > > > +                dev->mem->regions[region_i].guest_phys_addr) {
> > > > > +                u->postcopy_client_bases[region_i] =
> > > > > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr;
> > > > > +                trace_vhost_user_set_mem_table_postcopy(
> > > > > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr,
> > > > > +                    msg.payload.memory.regions[reply_i].userspace_addr,
> > >                                                     ^^^^^^^
> > >                         and I think this one is region_i
> > 
> > Hmm... shouldn't msg.payload.memory.regions[] defined with size
> > VHOST_MEMORY_MAX_NREGIONS as well?
> 
> Yes, it already is; msg is a VhostUserMsg, payload.memory is a
> VhostUserMemory and it has:
>   VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];

Sorry, I expressed that badly.  I mean, we should then still use reply_i
here, right?  Thanks,
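
To spell out the indexing I have in mind, here is a minimal sketch. The
struct and function names are made up for this mail, and it assumes (as the
comment in the patch says) that the sent message and the reply both hold
only the regions that had an fd, in the same order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified region record. */
struct region {
    uint64_t guest_phys_addr;
    uint64_t userspace_addr;
};

/*
 * dev_regions[]  : every region, indexed by region_i
 * sent[]/reply[] : only the regions that had an fd, indexed by reply_i
 * bases[]/qhva[] : per device region; 0 where no reply entry matched
 * Returns how many reply entries were matched.
 */
static size_t match_replies(const struct region *dev_regions, size_t nregions,
                            const struct region *sent,
                            const struct region *reply, size_t fd_num,
                            uint64_t *bases, uint64_t *qhva)
{
    size_t reply_i = 0;

    for (size_t region_i = 0; region_i < nregions; region_i++) {
        bases[region_i] = 0;
        qhva[region_i] = 0;
        if (reply_i < fd_num &&
            reply[reply_i].guest_phys_addr ==
            dev_regions[region_i].guest_phys_addr) {
            /* client address comes from the reply ... */
            bases[region_i] = reply[reply_i].userspace_addr;
            /* ... and the matching qemu address from what we sent */
            qhva[region_i] = sent[reply_i].userspace_addr;
            reply_i++;
        }
    }
    return reply_i;
}
```

Only the device's own region table takes region_i; with three regions of
which one has no fd, sent[] and reply[] have just two entries, so indexing
either of them by region_i would run off the end.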

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu
  2017-09-15  8:57             ` Peter Xu
@ 2017-09-15 15:32               ` Dr. David Alan Gilbert
  2017-09-18  9:31               ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-15 15:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Sep 13, 2017 at 01:15:32PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Tue, Sep 12, 2017 at 06:15:13PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Peter Xu (peterx@redhat.com) wrote:
> > > > > On Thu, Aug 24, 2017 at 08:27:14PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > 
> > > > > > We need a better way, but at the moment we need the address of the
> > > > > > mappings sent back to qemu so it can interpret the messages on the
> > > > > > userfaultfd it reads.
> > > > > > 
> > > > > > Note: We don't ask for the default 'ack' reply since we've got our own.
> > > > > > 
> > > > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > > > ---
> > > > > >  contrib/libvhost-user/libvhost-user.c | 15 ++++++++-
> > > > > >  docs/interop/vhost-user.txt           |  6 ++++
> > > > > >  hw/virtio/trace-events                |  1 +
> > > > > >  hw/virtio/vhost-user.c                | 57 ++++++++++++++++++++++++++++++++++-
> > > > > >  4 files changed, 77 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > > > > > index e6ab059a03..5ec54f7d60 100644
> > > > > > --- a/contrib/libvhost-user/libvhost-user.c
> > > > > > +++ b/contrib/libvhost-user/libvhost-user.c
> > > > > > @@ -477,13 +477,26 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> > > > > >              DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
> > > > > >                      __func__, i, reg_struct.range.start, reg_struct.range.len);
> > > > > >              /* TODO: Stash 'zero' support flags somewhere */
> > > > > > -            /* TODO: Get address back to QEMU */
> > > > > >  
> > > > > > +            /* TODO: We need to find a way for the qemu not to see the virtual
> > > > > > +             * addresses of the clients, so as to keep better separation.
> > > > > > +             */
> > > > > > +            /* Return the address to QEMU so that it can translate the ufd
> > > > > > +             * fault addresses back.
> > > > > > +             */
> > > > > > +            msg_region->userspace_addr = (uintptr_t)(mmap_addr +
> > > > > > +                                                     dev_region->mmap_offset);
> > > > > >          }
> > > > > >  
> > > > > >          close(vmsg->fds[i]);
> > > > > >      }
> > > > > >  
> > > > > > +    if (dev->postcopy_listening) {
> > > > > > +        /* Need to return the addresses - send the updated message back */
> > > > > > +        vmsg->fd_num = 0;
> > > > > > +        return true;
> > > > > > +    }
> > > > > > +
> > > > > >      return false;
> > > > > >  }
> > > > > >  
> > > > > > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > > > > > index 73c3dd74db..b2a548c94d 100644
> > > > > > --- a/docs/interop/vhost-user.txt
> > > > > > +++ b/docs/interop/vhost-user.txt
> > > > > > @@ -413,12 +413,18 @@ Master message types
> > > > > >        Id: 5
> > > > > >        Equivalent ioctl: VHOST_SET_MEM_TABLE
> > > > > >        Master payload: memory regions description
> > > > > > +      Slave payload: (postcopy only) memory regions description
> > > > > >  
> > > > > >        Sets the memory map regions on the slave so it can translate the vring
> > > > > >        addresses. In the ancillary data there is an array of file descriptors
> > > > > >        for each memory mapped region. The size and ordering of the fds matches
> > > > > >        the number and ordering of memory regions.
> > > > > >  
> > > > > > +      When postcopy-listening has been received, SET_MEM_TABLE replies with
> > > > > > +      the bases of the memory mapped regions to the master.  It must have mmap'd
> > > > > > +      the regions and enabled userfaultfd on them.  Note NEED_REPLY_MASK
> > > > > > +      is not set in this case.
> > > > > > +
> > > > > >   * VHOST_USER_SET_LOG_BASE
> > > > > >  
> > > > > >        Id: 6
> > > > > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > > > > index f736c7c84f..63fd4a79cf 100644
> > > > > > --- a/hw/virtio/trace-events
> > > > > > +++ b/hw/virtio/trace-events
> > > > > > @@ -2,6 +2,7 @@
> > > > > >  
> > > > > >  # hw/virtio/vhost-user.c
> > > > > >  vhost_user_postcopy_listen(void) ""
> > > > > > +vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> > > > > >  
> > > > > >  # hw/virtio/virtio.c
> > > > > >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > > > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > > > index 9178271ab2..2e4eb0864a 100644
> > > > > > --- a/hw/virtio/vhost-user.c
> > > > > > +++ b/hw/virtio/vhost-user.c
> > > > > > @@ -19,6 +19,7 @@
> > > > > >  #include "qemu/sockets.h"
> > > > > >  #include "migration/migration.h"
> > > > > >  #include "migration/postcopy-ram.h"
> > > > > > +#include "trace.h"
> > > > > >  
> > > > > >  #include <sys/ioctl.h>
> > > > > >  #include <sys/socket.h>
> > > > > > @@ -133,6 +134,7 @@ struct vhost_user {
> > > > > >      int slave_fd;
> > > > > >      NotifierWithReturn postcopy_notifier;
> > > > > >      struct PostCopyFD  postcopy_fd;
> > > > > > +    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
> > > > > >  };
> > > > > >  
> > > > > >  static bool ioeventfd_enabled(void)
> > > > > > @@ -300,11 +302,13 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
> > > > > >  static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > > > > >                                      struct vhost_memory *mem)
> > > > > >  {
> > > > > > +    struct vhost_user *u = dev->opaque;
> > > > > >      int fds[VHOST_MEMORY_MAX_NREGIONS];
> > > > > >      int i, fd;
> > > > > >      size_t fd_num = 0;
> > > > > >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> > > > > > -                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > > > > > +                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
> > > > > > +                           !u->postcopy_fd.handler;
> > > > > 
> > > > > (indent)
> > > > 
> > > > Fixed
> > > > 
> > > > > >  
> > > > > >      VhostUserMsg msg = {
> > > > > >          .request = VHOST_USER_SET_MEM_TABLE,
> > > > > > @@ -350,6 +354,57 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > > > > >          return -1;
> > > > > >      }
> > > > > >  
> > > > > > +    if (u->postcopy_fd.handler) {
> > > > > 
> > > > > It seems that after this handler is set, we never clean it up.  Do we
> > > > > need to unset it somewhere? (maybe vhost_user_postcopy_end?)
> > > > 
> > > > Hmm yes I'll have a look at that.
> > > > 
> > > > > > +        VhostUserMsg msg_reply;
> > > > > > +        int region_i, reply_i;
> > > > > > +        if (vhost_user_read(dev, &msg_reply) < 0) {
> > > > > > +            return -1;
> > > > > > +        }
> > > > > > +
> > > > > > +        if (msg_reply.request != VHOST_USER_SET_MEM_TABLE) {
> > > > > > +            error_report("%s: Received unexpected msg type."
> > > > > > +                         "Expected %d received %d", __func__,
> > > > > > +                         VHOST_USER_SET_MEM_TABLE, msg_reply.request);
> > > > > > +            return -1;
> > > > > > +        }
> > > > > > +        /* We're using the same structure, just reusing one of the
> > > > > > +         * fields, so it should be the same size.
> > > > > > +         */
> > > > > > +        if (msg_reply.size != msg.size) {
> > > > > > +            error_report("%s: Unexpected size for postcopy reply "
> > > > > > +                         "%d vs %d", __func__, msg_reply.size, msg.size);
> > > > > > +            return -1;
> > > > > > +        }
> > > > > > +
> > > > > > +        memset(u->postcopy_client_bases, 0,
> > > > > > +               sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
> > > > > > +
> > > > > > +        /* They're in the same order as the regions that were sent
> > > > > > +         * but some of the regions were skipped (above) if they
> > > > > > +         * didn't have fd's
> > > > > > +        */
> > > > > > +        for (reply_i = 0, region_i = 0;
> > > > > > +             region_i < dev->mem->nregions;
> > > > > > +             region_i++) {
> > > > > > +            if (reply_i < fd_num &&
> > > > > > +                msg_reply.payload.memory.regions[region_i].guest_phys_addr ==
> > > > >                                                     ^^^^^^^^
> > > > >                                           should this be reply_i?
> > > > 
> > > > Yes it should - nicely spotted
> > > > 
> > > > > (And maybe we can use pointers for the regions for better readability?)
> > > > 
> > > > I'm nervous of doing that since VhostUserMsg is 'packed' - and I'm not
> > > > convinced it's legal to take a pointer to a member (although I think
> > > > we do it in a whole bunch of places and clang moans about it).
> > > 
> > > Could I ask why packed struct is not suitable for taking field
> > > pointers out of the structs?  I hardly use clang, and I feel like
> > > there is something I may have missed in C programming...
> > 
> > The problem is that when you 'pack' a structure all the alignment rules
> > you normally have go away;  when the compiler knows it's accessing
> > a packed structure that's OK because the compiler knows not to rely
> > on those alignments;  however if I took a pointer to the
> > regions table in the msg I'd end up with a VhostUserMemoryRegion*
> > and a pointer like that carries nothing to tell the compiler to take
> > care about alignment.
> 
> Ah I see.
> 
> I did a test with gcc:
> 
> #include <stdio.h>
> 
> struct test {
>     unsigned short a;
>     unsigned long b;
> };
> 
> struct test2 {
>     struct test c;
> } __attribute__ ((packed));
> 
> int main(void)
> {
>     printf("test is %zu, test2 is %zu\n",
>            sizeof(struct test), sizeof(struct test2));
>     return 0;
> }
> 
> This outputs:
> 
> test is 16, test2 is 16
> 
> So I think even if test2 is marked as packed, it'll still keep how
> test is defined (or I would expect test be 16B while test2 be 10B)?  I
> tried with clang and got the same result.
> 
> gcc version 6.1.1 20160621 (Red Hat 6.1.1-3) (GCC) 
> clang version 3.8.1 (tags/RELEASE_381/final)

Note it's alignment, not size, that's the problem (and any portability
test is always wrong on one compiler!)

#include <stdio.h>
#include <stddef.h>

struct test {
    unsigned long c;
    unsigned short d;
};

struct test2 {
    unsigned short a;
    struct test b;      /* packed, so b follows a with no padding */
} __attribute__ ((packed));

int main(void)
{
    struct test2 t2;
    struct test *tp = &t2.b;        /* plain pointer: packing info is lost */
    unsigned long *tpc = &tp->c;    /* so this is assumed to be aligned */
    printf("t2 at %p t2.b at %p tp->c at %p\n",
           (void *)&t2, (void *)tp, (void *)tpc);
    return 0;
}

t2 at 0x7ffe7a235a30 t2.b at 0x7ffe7a235a32 tp->c at 0x7ffe7a235a32

So you see that both the 'unsigned long *tpc' and the 'struct test *tp'
are unaligned, but there's nothing in either type that tells you that.
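
If you do need the contents, one way round it is to go through the packed
type, or to copy the member out, and never keep a plain pointer into it.
This is a sketch only, assuming gcc/clang's packed attribute; the struct
and function names are made up for this mail:

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the example above; names are illustrative. */
struct inner {
    unsigned long c;
    unsigned short d;
};

struct outer {
    unsigned short a;
    struct inner b;
} __attribute__ ((packed));

/* Access through the packed type: the compiler knows 'o->b.c' may be
 * unaligned and generates a suitable load. */
static unsigned long read_c(const struct outer *o)
{
    return o->b.c;
}

/* Copy the whole member into a normally aligned local; pointers to the
 * copy are then safe to hand around. */
static struct inner copy_inner(const struct outer *o)
{
    struct inner tmp = o->b;
    return tmp;
}
```

The cost is a copy, but neither function ever creates a 'struct inner *'
or 'unsigned long *' that points into the packed structure.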

> > 
> > > > 
> > > > > > +                dev->mem->regions[region_i].guest_phys_addr) {
> > > > > > +                u->postcopy_client_bases[region_i] =
> > > > > > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr;
> > > > > > +                trace_vhost_user_set_mem_table_postcopy(
> > > > > > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr,
> > > > > > +                    msg.payload.memory.regions[reply_i].userspace_addr,
> > > >                                                     ^^^^^^^
> > > >                         and I think this one is region_i
> > > 
> > > Hmm... shouldn't msg.payload.memory.regions[] defined with size
> > > VHOST_MEMORY_MAX_NREGIONS as well?
> > 
> > Yes, it already is; msg is a VhostUserMsg, payload.memory is a
> > VhostUserMemory and it has:
> >   VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
> 
> Sorry I mis-expressed.  I mean, then we should still use reply_i here,
> right?  Thanks,

Why? Aren't we indexing msg_reply by reply_i and msg by region_i ?

Dave

> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 11/32] vhub: Open userfaultfd
  2017-08-29  6:40     ` Peter Xu
@ 2017-09-15 17:33       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-15 17:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 24, 2017 at 08:27:09PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Open a userfaultfd (on a postcopy_advise) and send it back in
> > the reply to the qemu for it to monitor.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  contrib/libvhost-user/libvhost-user.c | 26 +++++++++++++++++++++++---
> >  contrib/libvhost-user/libvhost-user.h |  3 +++
> >  2 files changed, 26 insertions(+), 3 deletions(-)
> > 
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index 47884c0a15..f9b5b12b28 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -15,6 +15,7 @@
> >  
> >  #include <qemu/osdep.h>
> >  #include <sys/eventfd.h>
> > +#include <sys/syscall.h>
> >  #include <linux/vhost.h>
> >  
> >  #include "qemu/atomic.h"
> > @@ -773,11 +774,30 @@ vu_set_vring_enable_exec(VuDev *dev, VhostUserMsg *vmsg)
> >  static bool
> >  vu_set_postcopy_advise(VuDev *dev, VhostUserMsg *vmsg)
> >  {
> > -    /* TODO: Open ufd, pass it back in the request
> > -     * TODO: Add addresses 
> > -     */
> > +    struct uffdio_api api_struct;
> > +
> > +    dev->postcopy_ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> > +    /* TODO: Add addresses */
> >      vmsg->payload.u64 = 0xcafe;
> >      vmsg->size = sizeof(vmsg->payload.u64);
> > +
> > +    if (dev->postcopy_ufd == -1) {
> > +        vu_panic(dev, "Userfaultfd not available: %s", strerror(errno));
> > +        goto out;
> 
> We got error but still goto out?  I feel like we should reply with
> some kind of error code when any error happens.

Note that the out: code returns dev->postcopy_ufd to qemu; in this
case it's -1, so it is already flagged as an error.

> > +    }
> > +    api_struct.api = UFFD_API;
> > +    api_struct.features = 0;
> > +    if (ioctl(dev->postcopy_ufd, UFFDIO_API, &api_struct)) {
> > +        vu_panic(dev, "Failed UFFDIO_API: %s", strerror(errno));
> > +        close(dev->postcopy_ufd);
> > +        dev->postcopy_ufd = -1;
> > +        goto out;
> 
> Same here.

And there we explicitly set dev->postcopy_ufd = -1 before going to out
so it's sent as the error value.

> > +    }
> > +    /* TODO: Stash feature flags somewhere */
> > +out:
> > +    /* Return a ufd to the QEMU */
> > +    vmsg->fd_num = 1;
> > +    vmsg->fds[0] = dev->postcopy_ufd;

^^^ see that's where the -1's end up.
> >      return true; /* = send a reply */
> >  }

Dave

> >  
> > diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> > index 3987ce643d..3e8efdd919 100644
> > --- a/contrib/libvhost-user/libvhost-user.h
> > +++ b/contrib/libvhost-user/libvhost-user.h
> > @@ -234,6 +234,9 @@ struct VuDev {
> >       * re-initialize */
> >      vu_panic_cb panic;
> >      const VuDevIface *iface;
> > +
> > +    /* Postcopy data */
> > +    int postcopy_ufd;
> >  };
> >  
> >  typedef struct VuVirtqElement {
> > -- 
> > 2.13.5
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 22/32] vhost+postcopy: Add vhost waker
  2017-09-13 13:09       ` Dr. David Alan Gilbert
@ 2017-09-18  3:57         ` Peter Xu
  0 siblings, 0 replies; 94+ messages in thread
From: Peter Xu @ 2017-09-18  3:57 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

On Wed, Sep 13, 2017 at 02:09:02PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Thu, Aug 24, 2017 at 08:27:20PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Register a waker function in vhost-user code to be notified when
> > > pages arrive or requests to previously mapped pages get requested.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  hw/virtio/trace-events |  3 +++
> > >  hw/virtio/vhost-user.c | 26 ++++++++++++++++++++++++++
> > >  2 files changed, 29 insertions(+)
> > > 
> > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > index f7d4b831fe..adebf6dc6b 100644
> > > --- a/hw/virtio/trace-events
> > > +++ b/hw/virtio/trace-events
> > > @@ -7,6 +7,9 @@ vhost_user_postcopy_fault_handler_found(int i, uint64_t region_offset, uint64_t
> > >  vhost_user_postcopy_listen(void) ""
> > >  vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> > >  vhost_user_set_mem_table_withfd(int index, const char *name, uint64_t memory_size, uint64_t guest_phys_addr, uint64_t userspace_addr, uint64_t offset) "%d:%s: size:0x%"PRIx64" GPA:0x%"PRIx64" QVA/userspace:0x%"PRIx64" RB offset:0x%"PRIx64
> > > +vhost_user_postcopy_waker(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
> > > +vhost_user_postcopy_waker_found(uint64_t client_addr) "0x%"PRIx64
> > > +vhost_user_postcopy_waker_nomatch(const char *rb, uint64_t rb_offset) "%s + 0x%"PRIx64
> > >  
> > >  # hw/virtio/virtio.c
> > >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > index 2897ff70b3..3bff33a1a6 100644
> > > --- a/hw/virtio/vhost-user.c
> > > +++ b/hw/virtio/vhost-user.c
> > > @@ -847,6 +847,31 @@ static int vhost_user_postcopy_fault_handler(struct PostCopyFD *pcfd,
> > >      return -1;
> > >  }
> > >  
> > > +static int vhost_user_postcopy_waker(struct PostCopyFD *pcfd, RAMBlock *rb,
> > > +                                     uint64_t offset)
> > > +{
> > > +    struct vhost_dev *dev = pcfd->data;
> > > +    struct vhost_user *u = dev->opaque;
> > > +    int i;
> > > +
> > > +    trace_vhost_user_postcopy_waker(qemu_ram_get_idstr(rb), offset);
> > > +    /* Translate the offset into an address in the clients address space */
> > > +    for (i = 0; i < MIN(dev->mem->nregions, u->region_rb_len); i++) {
> > > +        if (u->region_rb[i] == rb &&
> > > +            offset >= u->region_rb_offset[i] &&
> > > +            offset < (u->region_rb_offset[i] +
> > > +                      dev->mem->regions[i].memory_size)) {
> > 
> > Just curious: checks against offset should only be for safety, right?
> > Is there valid case that even rb is correct but the offset gets out of
> > the range of that RAMBlock?
> 
> Yes, I think that case does exist.
> 
> 'regions' are mapping regions as visible from the guest, but there may
> be two regions that are mapped to the same RAMBlock.  In our world
> the cleanest example of that is an x86 guest with 8GB of RAM; it
> has a single pc.ram RAMBlock of 8GB in size, but that's mapped
> in two chunks, a 3GB chunk at the bottom of physical address space
> and a 5GB chunk that starts on the 4GB boundary - i.e. leaving
> a 1GB hole.
> In this structure that appears as two regions each with the same rb and
> different offsets.

Yeah I missed that.  Thanks!
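
A small sketch of the lookup being discussed, assuming a cut-down region
record (struct region and find_client_addr() are hypothetical, not the
vhost-user structures, and the final address arithmetic is a guess at
the part of the loop not quoted above).  With two regions sharing one
RAMBlock, the offset range check is what picks the right one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one region. */
struct region {
    const void *rb;        /* identity of the backing RAMBlock */
    uint64_t rb_offset;    /* where the region starts within the block */
    uint64_t memory_size;  /* length of the region */
    uint64_t client_base;  /* where the client mapped the region */
};

/* Returns the client address for (rb, offset), or 0 if no region
 * covers that offset of that block.
 */
static uint64_t find_client_addr(const struct region *regs, size_t n,
                                 const void *rb, uint64_t offset)
{
    for (size_t i = 0; i < n; i++) {
        if (regs[i].rb == rb &&
            offset >= regs[i].rb_offset &&
            offset < regs[i].rb_offset + regs[i].memory_size) {
            return regs[i].client_base + (offset - regs[i].rb_offset);
        }
    }
    return 0;
}
```

For the 8GB example, the two entries would be { pc.ram, 0, 3GB, ... }
and { pc.ram, 3GB, 5GB, ... }; an offset of 3GB + 0x2000 fails the range
check on the first and matches the second.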

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu
  2017-09-15  8:57             ` Peter Xu
  2017-09-15 15:32               ` Dr. David Alan Gilbert
@ 2017-09-18  9:31               ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-18  9:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Sep 13, 2017 at 01:15:32PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Tue, Sep 12, 2017 at 06:15:13PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Peter Xu (peterx@redhat.com) wrote:
> > > > > On Thu, Aug 24, 2017 at 08:27:14PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > 
> > > > > > We need a better way, but at the moment we need the address of the
> > > > > > mappings sent back to qemu so it can interpret the messages on the
> > > > > > userfaultfd it reads.
> > > > > > 
> > > > > > Note: We don't ask for the default 'ack' reply since we've got our own.
> > > > > > 
> > > > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > > > ---
> > > > > >  contrib/libvhost-user/libvhost-user.c | 15 ++++++++-
> > > > > >  docs/interop/vhost-user.txt           |  6 ++++
> > > > > >  hw/virtio/trace-events                |  1 +
> > > > > >  hw/virtio/vhost-user.c                | 57 ++++++++++++++++++++++++++++++++++-
> > > > > >  4 files changed, 77 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > > > > > index e6ab059a03..5ec54f7d60 100644
> > > > > > --- a/contrib/libvhost-user/libvhost-user.c
> > > > > > +++ b/contrib/libvhost-user/libvhost-user.c
> > > > > > @@ -477,13 +477,26 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> > > > > >              DPRINT("%s: region %d: Registered userfault for %llx + %llx\n",
> > > > > >                      __func__, i, reg_struct.range.start, reg_struct.range.len);
> > > > > >              /* TODO: Stash 'zero' support flags somewhere */
> > > > > > -            /* TODO: Get address back to QEMU */
> > > > > >  
> > > > > > +            /* TODO: We need to find a way for the qemu not to see the virtual
> > > > > > +             * addresses of the clients, so as to keep better separation.
> > > > > > +             */
> > > > > > +            /* Return the address to QEMU so that it can translate the ufd
> > > > > > +             * fault addresses back.
> > > > > > +             */
> > > > > > +            msg_region->userspace_addr = (uintptr_t)(mmap_addr +
> > > > > > +                                                     dev_region->mmap_offset);
> > > > > >          }
> > > > > >  
> > > > > >          close(vmsg->fds[i]);
> > > > > >      }
> > > > > >  
> > > > > > +    if (dev->postcopy_listening) {
> > > > > > +        /* Need to return the addresses - send the updated message back */
> > > > > > +        vmsg->fd_num = 0;
> > > > > > +        return true;
> > > > > > +    }
> > > > > > +
> > > > > >      return false;
> > > > > >  }
> > > > > >  
> > > > > > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > > > > > index 73c3dd74db..b2a548c94d 100644
> > > > > > --- a/docs/interop/vhost-user.txt
> > > > > > +++ b/docs/interop/vhost-user.txt
> > > > > > @@ -413,12 +413,18 @@ Master message types
> > > > > >        Id: 5
> > > > > >        Equivalent ioctl: VHOST_SET_MEM_TABLE
> > > > > >        Master payload: memory regions description
> > > > > > +      Slave payload: (postcopy only) memory regions description
> > > > > >  
> > > > > >        Sets the memory map regions on the slave so it can translate the vring
> > > > > >        addresses. In the ancillary data there is an array of file descriptors
> > > > > >        for each memory mapped region. The size and ordering of the fds matches
> > > > > >        the number and ordering of memory regions.
> > > > > >  
> > > > > > +      When postcopy-listening has been received, SET_MEM_TABLE replies with
> > > > > > +      the bases of the memory mapped regions to the master.  It must have mmap'd
> > > > > > +      the regions and enabled userfaultfd on them.  Note NEED_REPLY_MASK
> > > > > > +      is not set in this case.
> > > > > > +
> > > > > >   * VHOST_USER_SET_LOG_BASE
> > > > > >  
> > > > > >        Id: 6
> > > > > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > > > > index f736c7c84f..63fd4a79cf 100644
> > > > > > --- a/hw/virtio/trace-events
> > > > > > +++ b/hw/virtio/trace-events
> > > > > > @@ -2,6 +2,7 @@
> > > > > >  
> > > > > >  # hw/virtio/vhost-user.c
> > > > > >  vhost_user_postcopy_listen(void) ""
> > > > > > +vhost_user_set_mem_table_postcopy(uint64_t client_addr, uint64_t qhva, int reply_i, int region_i) "client:0x%"PRIx64" for hva: 0x%"PRIx64" reply %d region %d"
> > > > > >  
> > > > > >  # hw/virtio/virtio.c
> > > > > >  virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > > > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > > > index 9178271ab2..2e4eb0864a 100644
> > > > > > --- a/hw/virtio/vhost-user.c
> > > > > > +++ b/hw/virtio/vhost-user.c
> > > > > > @@ -19,6 +19,7 @@
> > > > > >  #include "qemu/sockets.h"
> > > > > >  #include "migration/migration.h"
> > > > > >  #include "migration/postcopy-ram.h"
> > > > > > +#include "trace.h"
> > > > > >  
> > > > > >  #include <sys/ioctl.h>
> > > > > >  #include <sys/socket.h>
> > > > > > @@ -133,6 +134,7 @@ struct vhost_user {
> > > > > >      int slave_fd;
> > > > > >      NotifierWithReturn postcopy_notifier;
> > > > > >      struct PostCopyFD  postcopy_fd;
> > > > > > +    uint64_t           postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
> > > > > >  };
> > > > > >  
> > > > > >  static bool ioeventfd_enabled(void)
> > > > > > @@ -300,11 +302,13 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
> > > > > >  static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > > > > >                                      struct vhost_memory *mem)
> > > > > >  {
> > > > > > +    struct vhost_user *u = dev->opaque;
> > > > > >      int fds[VHOST_MEMORY_MAX_NREGIONS];
> > > > > >      int i, fd;
> > > > > >      size_t fd_num = 0;
> > > > > >      bool reply_supported = virtio_has_feature(dev->protocol_features,
> > > > > > -                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> > > > > > +                                          VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
> > > > > > +                           !u->postcopy_fd.handler;
> > > > > 
> > > > > (indent)
> > > > 
> > > > Fixed
> > > > 
> > > > > >  
> > > > > >      VhostUserMsg msg = {
> > > > > >          .request = VHOST_USER_SET_MEM_TABLE,
> > > > > > @@ -350,6 +354,57 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev,
> > > > > >          return -1;
> > > > > >      }
> > > > > >  
> > > > > > +    if (u->postcopy_fd.handler) {
> > > > > 
> > > > > It seems that after this handler is set, we never clean it up.  Do we
> > > > > need to unset it somewhere? (maybe vhost_user_postcopy_end?)
> > > > 
> > > > Hmm yes I'll have a look at that.
> > > > 
> > > > > > +        VhostUserMsg msg_reply;
> > > > > > +        int region_i, reply_i;
> > > > > > +        if (vhost_user_read(dev, &msg_reply) < 0) {
> > > > > > +            return -1;
> > > > > > +        }
> > > > > > +
> > > > > > +        if (msg_reply.request != VHOST_USER_SET_MEM_TABLE) {
> > > > > > +            error_report("%s: Received unexpected msg type."
> > > > > > +                         "Expected %d received %d", __func__,
> > > > > > +                         VHOST_USER_SET_MEM_TABLE, msg_reply.request);
> > > > > > +            return -1;
> > > > > > +        }
> > > > > > +        /* We're using the same structure, just reusing one of the
> > > > > > +         * fields, so it should be the same size.
> > > > > > +         */
> > > > > > +        if (msg_reply.size != msg.size) {
> > > > > > +            error_report("%s: Unexpected size for postcopy reply "
> > > > > > +                         "%d vs %d", __func__, msg_reply.size, msg.size);
> > > > > > +            return -1;
> > > > > > +        }
> > > > > > +
> > > > > > +        memset(u->postcopy_client_bases, 0,
> > > > > > +               sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS);
> > > > > > +
> > > > > > +        /* They're in the same order as the regions that were sent
> > > > > > +         * but some of the regions were skipped (above) if they
> > > > > > +         * didn't have fd's
> > > > > > +        */
> > > > > > +        for (reply_i = 0, region_i = 0;
> > > > > > +             region_i < dev->mem->nregions;
> > > > > > +             region_i++) {
> > > > > > +            if (reply_i < fd_num &&
> > > > > > +                msg_reply.payload.memory.regions[region_i].guest_phys_addr ==
> > > > >                                                     ^^^^^^^^
> > > > >                                           should this be reply_i?
> > > > 
> > > > Yes it should - nicely spotted
> > > > 
> > > > > (And maybe we can use pointers for the regions for better readability?)
> > > > 

<snip>

> > > > > > +                dev->mem->regions[region_i].guest_phys_addr) {
> > > > > > +                u->postcopy_client_bases[region_i] =
> > > > > > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr;
> > > > > > +                trace_vhost_user_set_mem_table_postcopy(
> > > > > > +                    msg_reply.payload.memory.regions[reply_i].userspace_addr,
> > > > > > +                    msg.payload.memory.regions[reply_i].userspace_addr,
> > > >                                                     ^^^^^^^
> > > >                         and I think this one is region_i
> > > 
> > > Hmm... shouldn't msg.payload.memory.regions[] defined with size
> > > VHOST_MEMORY_MAX_NREGIONS as well?
> > 
> > Yes, it already is; msg is a VhostUserMsg, payload.memory is a
> > VhostUserMemory and it has:
> >   VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
> 
> Sorry I mis-expressed.  I mean, then we should still use reply_i here,
> right?  Thanks,

You're right! I've renamed 'reply_i' to 'msg_i' - it's always an index
into the messages (either of them).
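
Roughly, the indexing then looks like the sketch below.  This is a
standalone illustration with cut-down, hypothetical types (msg_region,
dev_region, match_reply()), not the code in the next version: region_i
walks the device's regions, msg_i walks the message entries and only
advances on a match, since regions without fds were never sent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct msg_region {      /* hypothetical cut-down message entry */
    uint64_t guest_phys_addr;
    uint64_t userspace_addr;
};

struct dev_region {      /* hypothetical cut-down device entry */
    uint64_t guest_phys_addr;
};

/* Fill bases[region_i] from the reply; regions with no matching
 * message entry keep a base of 0.  Returns how many entries matched.
 */
static size_t match_reply(const struct dev_region *regions, size_t nregions,
                          const struct msg_region *reply, size_t nmsg,
                          uint64_t *bases)
{
    size_t msg_i = 0;

    memset(bases, 0, sizeof(bases[0]) * nregions);
    for (size_t region_i = 0; region_i < nregions; region_i++) {
        if (msg_i < nmsg &&
            reply[msg_i].guest_phys_addr ==
            regions[region_i].guest_phys_addr) {
            bases[region_i] = reply[msg_i].userspace_addr;
            msg_i++;
        }
    }
    return msg_i;
}
```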

Dave

> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 03/32] migrate: Update ram_block_discard_range for shared
  2017-08-29  5:30     ` Peter Xu
@ 2017-09-18 12:18       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-18 12:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 24, 2017 at 08:27:01PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The choice of call to discard a block is getting more complicated
> > for other cases.   We use fallocate PUNCH_HOLE in any file cases;
> > it works for both hugepage and for tmpfs.
> > We use the DONTNEED for non-hugepage cases either where they're
> > anonymous or where they're private.
> > 
> > Care should be taken when trying other backing files.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  exec.c       | 35 ++++++++++++++++++++++++-----------
> >  trace-events |  3 +++
> >  2 files changed, 27 insertions(+), 11 deletions(-)
> > 
> > diff --git a/exec.c b/exec.c
> > index d20c34ca83..67df2909ce 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -3573,6 +3573,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
> >      }
> >  
> >      if ((start + length) <= rb->used_length) {
> > +        bool need_madvise, need_fallocate;
> >          uint8_t *host_endaddr = host_startaddr + length;
> >          if ((uintptr_t)host_endaddr & (rb->page_size - 1)) {
> >              error_report("ram_block_discard_range: Unaligned end address: %p",
> > @@ -3582,23 +3583,35 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
> >  
> >          errno = ENOTSUP; /* If we are missing MADVISE etc */
> >  
> > -        if (rb->page_size == qemu_host_page_size) {
> > -#if defined(CONFIG_MADVISE)
> > -            /* Note: We need the madvise MADV_DONTNEED behaviour of definitely
> > -             * freeing the page.
> > -             */
> > -            ret = madvise(host_startaddr, length, MADV_DONTNEED);
> > -#endif
> > -        } else {
> > -            /* Huge page case  - unfortunately it can't do DONTNEED, but
> > -             * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
> > -             * huge page file.
> > +        /* The logic here is messy;
> > +         *    madvise DONTNEED fails for hugepages
> > +         *    fallocate works on hugepages and shmem
> > +         */
> > +        need_madvise = (rb->page_size == qemu_host_page_size);
> > +        need_fallocate = rb->fd != -1;
> > +        if (need_fallocate) {
> > +            /* For a file, this causes the area of the file to be zero'd
> > +             * if read, and for hugetlbfs also causes it to be unmapped
> > +             * so a userfault will trigger.
> >               */
> >  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> >              ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> >                              start, length);
> >  #endif
> >          }
> > +        /* i.e. need madvise but skip it if the fallocate failed */
> > +        if (need_madvise && (!need_fallocate || (ret == 0))) {
> 
> I'll slightly prefer:
> 
>   trace_ram_block_discard_range();
> 
>   if (need_fallocate) {
>     ret = fallocate();
>     if (ret) {
>       error_report();
>       goto err;
>     }
>   }
> 
>   if (need_madvise) {
>     ret = madvise();
>     if (ret) {
>       error_report();
>       goto err;
>     }
>   }

OK, I've reworked it more like that.
(It's a little more complex because of the #ifdefs)
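
As a standalone sketch of that shape (discard_range() is a hypothetical
helper, not the reworked QEMU function, and the guard tests the libc
macro directly rather than the CONFIG_ symbols):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Hypothetical helper: discard 'length' bytes of a mapping.
 * 'fd' is the backing file or -1 for anonymous memory, 'start' is the
 * offset within the file and 'host_addr' the mapped address of it.
 * Returns 0 on success or -errno from the first step that fails.
 */
static int discard_range(int fd, void *host_addr, off_t start, size_t length,
                         bool need_madvise)
{
    if (fd != -1) {
#ifdef FALLOC_FL_PUNCH_HOLE
        /* Drop the range from the file, keeping the file size. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      start, length)) {
            return -errno;
        }
#else
        return -ENOTSUP;
#endif
    }

    if (need_madvise) {
        /* Drop the local mapping of the range. */
        if (madvise(host_addr, length, MADV_DONTNEED)) {
            return -errno;
        }
    }
    return 0;
}
```

Each step bails out on its own failure, so the madvise is never
attempted after a failed fallocate.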

Dave

> But it is personal preference.  For either way:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> > +            /* For normal RAM this causes it to be unmapped,
> > +             * for shared memory it causes the local mapping to disappear
> > +             * and to fall back on the file contents (which we just
> > +             * fallocate'd away).
> > +             */
> > +#if defined(CONFIG_MADVISE)
> > +            ret =  madvise(host_startaddr, length, MADV_DONTNEED);
> > +#endif
> > +        }
> > +        trace_ram_block_discard_range(rb->idstr, host_startaddr,
> > +                                      need_madvise, need_fallocate, ret);
> >          if (ret) {
> >              ret = -errno;
> >              error_report("ram_block_discard_range: Failed to discard range "
> > diff --git a/trace-events b/trace-events
> > index 1f50f56d9d..213ee34f89 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -55,6 +55,9 @@ dma_complete(void *dbs, int ret, void *cb) "dbs=%p ret=%d cb=%p"
> >  dma_blk_cb(void *dbs, int ret) "dbs=%p ret=%d"
> >  dma_map_wait(void *dbs) "dbs=%p"
> >  
> > +# exec.c
> > +ram_block_discard_range(const char *rbname, void *hva, bool need_madvise, bool need_fallocate, int ret) "%s@%p: madvise: %d fallocate: %d ret: %d"
> > +
> >  # memory.c
> >  memory_region_ops_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
> >  memory_region_ops_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr 0x%"PRIx64" value 0x%"PRIx64" size %u"
> > -- 
> > 2.13.5
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 30/32] vhost: Merge neighbouring hugepage regions where appropriate
  2017-09-14  9:18     ` Igor Mammedov
@ 2017-09-25 11:19       ` Dr. David Alan Gilbert
  2017-10-02 13:49         ` Igor Mammedov
  0 siblings, 1 reply; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-25 11:19 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	lvivier, aarcange, felipe, peterx, quintela

* Igor Mammedov (imammedo@redhat.com) wrote:
> On Thu, 24 Aug 2017 20:27:28 +0100
> "Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> 
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Where two regions are created with a gap such that when aligned
> > to hugepage boundaries, the two regions overlap, merge them.
> why only hugepage boundaries, it should be applicable any alignment

Actually this patch isn't huge-page specific - it just aligns to the
pagesize; but do we ever hit a case where a region is smaller than a
normal page and thus is changed by this?
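
To make the start-alignment check concrete, here's a standalone sketch
of it with a cut-down, hypothetical region type; alignage() and
can_stretch_onto() are illustrative names, and the logic follows the
start-address test in the quoted patch below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct region {          /* hypothetical cut-down vhost_memory_region */
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;
};

/* How far 'addr' sits above the previous pagesize boundary
 * (pagesize must be a power of two).
 */
static uint64_t alignage(uint64_t addr, uint64_t pagesize)
{
    return addr & (pagesize - 1);
}

/* True if 'reg' has an unaligned start and 'other' begins exactly
 * where 'reg' would begin once rounded down, in both guest-physical
 * and userspace addresses.
 */
static bool can_stretch_onto(const struct region *reg,
                             const struct region *other, uint64_t pagesize)
{
    uint64_t a = alignage(reg->guest_phys_addr, pagesize);

    return a != 0 &&
           other->guest_phys_addr == reg->guest_phys_addr - a &&
           other->userspace_addr == reg->userspace_addr - a;
}
```

With a 2M pagesize, a region starting at GPA 0xc0000 has an alignage of
0xc0000 and can stretch onto a region at GPA 0 with a matching userspace
base; with a 4K pagesize the same start is already aligned, so nothing
changes.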

> I'd say the patch isn't what I've had in mind when we discussed issue,

Ah

> it builds on already existing merging code and complicates
> code even more.

Yes it is a little complex.

> Have you looked into possibility to rebuild memory map from scratch
> every time vhost_region_add/vhost_region_del is called or even at
> vhost_commit() time to reduce rebuild from a set of memory sections
> that vhost tracks?
> That should simplify algorithm a lot as memory sections are coming
> from flat view and never overlap compared to current merged memory
> map in vhost_dev::mem, so it won't have to deal with first splitting
> and then merging back every time flatview changes.

I hadn't; I was concentrating on changing the existing code rather than
reworking it - especially since I don't/didn't know much about the
notifiers.

Are you suggesting that basically vhost_region_add/del do nothing
(except maybe set a flag) and the real work gets done in vhost_commit()?
(I also found I had to call the merge from vhost_dev_start as well as
vhost_commit - I guess from the first use?)

If I did everything in vhost_commit, where would I start?  Would that
use something like address_space_to_flatview(address_space_memory) to
get the main FlatView and then somehow walk that?

Dave

> > I also add quite a few trace events to see what's going on.
> > 
> > Note: This doesn't handle all the cases, but does handle the common
> > case on a PC due to the 640k hole.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> > ---
> >  hw/virtio/trace-events | 11 +++++++
> >  hw/virtio/vhost.c      | 79 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 89 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > index 5b599617a1..f98efb39fd 100644
> > --- a/hw/virtio/trace-events
> > +++ b/hw/virtio/trace-events
> > @@ -1,5 +1,16 @@
> >  # See docs/devel/tracing.txt for syntax documentation.
> >  
> > +# hw/virtio/vhost.c
> > +vhost_dev_assign_memory_merged(int from, int to, uint64_t size, uint64_t start_addr, uint64_t uaddr) "f/t=%d/%d 0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
> > +vhost_dev_assign_memory_not_merged(uint64_t size, uint64_t start_addr, uint64_t uaddr) "0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
> > +vhost_dev_assign_memory_entry(uint64_t size, uint64_t start_addr, uint64_t uaddr) "0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
> > +vhost_dev_assign_memory_exit(uint32_t nregions) "%"PRId32
> > +vhost_huge_page_stretch_and_merge_entry(uint32_t nregions) "%"PRId32
> > +vhost_huge_page_stretch_and_merge_can(void) ""
> > +vhost_huge_page_stretch_and_merge_size_align(int d, uint64_t gpa, uint64_t align) "%d: gpa: 0x%"PRIx64" align: 0x%"PRIx64
> > +vhost_huge_page_stretch_and_merge_start_align(int d, uint64_t gpa, uint64_t align) "%d: gpa: 0x%"PRIx64" align: 0x%"PRIx64
> > +vhost_section(const char *name, int r) "%s:%d"
> > +
> >  # hw/virtio/vhost-user.c
> >  vhost_user_postcopy_end_entry(void) ""
> >  vhost_user_postcopy_end_exit(void) ""
> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index 6eddb099b0..fb506e747f 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -27,6 +27,7 @@
> >  #include "hw/virtio/virtio-access.h"
> >  #include "migration/blocker.h"
> >  #include "sysemu/dma.h"
> > +#include "trace.h"
> >  
> >  /* enabled until disconnected backend stabilizes */
> >  #define _VHOST_DEBUG 1
> > @@ -250,6 +251,8 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
> >  {
> >      int from, to;
> >      struct vhost_memory_region *merged = NULL;
> > +    trace_vhost_dev_assign_memory_entry(size, start_addr, uaddr);
> > +
> >      for (from = 0, to = 0; from < dev->mem->nregions; ++from, ++to) {
> >          struct vhost_memory_region *reg = dev->mem->regions + to;
> >          uint64_t prlast, urlast;
> > @@ -293,11 +296,13 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
> >          uaddr = merged->userspace_addr = u;
> >          start_addr = merged->guest_phys_addr = s;
> >          size = merged->memory_size = e - s + 1;
> > +        trace_vhost_dev_assign_memory_merged(from, to, size, start_addr, uaddr);
> >          assert(merged->memory_size);
> >      }
> >  
> >      if (!merged) {
> >          struct vhost_memory_region *reg = dev->mem->regions + to;
> > +        trace_vhost_dev_assign_memory_not_merged(size, start_addr, uaddr);
> >          memset(reg, 0, sizeof *reg);
> >          reg->memory_size = size;
> >          assert(reg->memory_size);
> > @@ -307,6 +312,7 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
> >      }
> >      assert(to <= dev->mem->nregions + 1);
> >      dev->mem->nregions = to;
> > +    trace_vhost_dev_assign_memory_exit(to);
> >  }
> >  
> >  static uint64_t vhost_get_log_size(struct vhost_dev *dev)
> > @@ -610,8 +616,12 @@ static void vhost_set_memory(MemoryListener *listener,
> >  
> >  static bool vhost_section(MemoryRegionSection *section)
> >  {
> > -    return memory_region_is_ram(section->mr) &&
> > +    bool result;
> > +    result = memory_region_is_ram(section->mr) &&
> >          !memory_region_is_rom(section->mr);
> > +
> > +    trace_vhost_section(section->mr->name, result);
> > +    return result;
> >  }
> >  
> >  static void vhost_begin(MemoryListener *listener)
> > @@ -622,6 +632,68 @@ static void vhost_begin(MemoryListener *listener)
> >      dev->mem_changed_start_addr = -1;
> >  }
> >  
> > +/* Look for regions that are hugepage backed but not aligned
> > + * and fix them up to be aligned.
> > + * TODO: For now this is just enough to deal with the 640k hole
> > + */
> > +static bool vhost_huge_page_stretch_and_merge(struct vhost_dev *dev)
> > +{
> > +    int i, j;
> > +    bool result = true;
> > +    trace_vhost_huge_page_stretch_and_merge_entry(dev->mem->nregions);
> > +
> > +    for (i = 0; i < dev->mem->nregions; i++) {
> > +        struct vhost_memory_region *reg = dev->mem->regions + i;
> > +        ram_addr_t offset;
> > +        RAMBlock *rb = qemu_ram_block_from_host((void *)reg->userspace_addr,
> > +                                                false, &offset);
> > +        size_t pagesize = qemu_ram_pagesize(rb);
> > +        uint64_t alignage;
> > +        alignage = reg->guest_phys_addr & (pagesize - 1);
> > +        if (alignage) {
> > +
> > +            trace_vhost_huge_page_stretch_and_merge_start_align(i,
> > +                                                (uint64_t)reg->guest_phys_addr,
> > +                                                alignage);
> > +            for (j = 0; j < dev->mem->nregions; j++) {
> > +                struct vhost_memory_region *oreg = dev->mem->regions + j;
> > +                if (j == i) {
> > +                    continue;
> > +                }
> > +
> > +                if (oreg->guest_phys_addr ==
> > +                        (reg->guest_phys_addr - alignage) &&
> > +                    oreg->userspace_addr ==
> > +                         (reg->userspace_addr - alignage)) {
> > +                    struct vhost_memory_region treg = *reg;
> > +                    trace_vhost_huge_page_stretch_and_merge_can();
> > +                    vhost_dev_unassign_memory(dev, oreg->guest_phys_addr,
> > +                                              oreg->memory_size);
> > +                    vhost_dev_unassign_memory(dev, treg.guest_phys_addr,
> > +                                              treg.memory_size);
> > +                    vhost_dev_assign_memory(dev,
> > +                                            treg.guest_phys_addr - alignage,
> > +                                            treg.memory_size + alignage,
> > +                                            treg.userspace_addr - alignage);
> > +                    return vhost_huge_page_stretch_and_merge(dev);
> > +                }
> > +            }
> > +        }
> > +        alignage = reg->memory_size & (pagesize - 1);
> > +        if (alignage) {
> > +            trace_vhost_huge_page_stretch_and_merge_size_align(i,
> > +                                               (uint64_t)reg->guest_phys_addr,
> > +                                               alignage);
> > +            /* We ignore this if we find something else to merge,
> > +             * so we only return false if we're left with this
> > +             */
> > +            result = false;
> > +        }
> > +    }
> > +
> > +    return result;
> > +}
> > +
> >  static void vhost_commit(MemoryListener *listener)
> >  {
> >      struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> > @@ -641,6 +713,7 @@ static void vhost_commit(MemoryListener *listener)
> >          return;
> >      }
> >  
> > +    vhost_huge_page_stretch_and_merge(dev);
> >      if (dev->started) {
> >          start_addr = dev->mem_changed_start_addr;
> >          size = dev->mem_changed_end_addr - dev->mem_changed_start_addr + 1;
> > @@ -1512,6 +1585,10 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
> >          goto fail_features;
> >      }
> >  
> > +    if (!vhost_huge_page_stretch_and_merge(hdev)) {
> > +        VHOST_OPS_DEBUG("vhost_huge_page_stretch_and_merge failed");
> > +        goto fail_mem;
> > +    }
> >      if (vhost_dev_has_iommu(hdev)) {
> >          memory_listener_register(&hdev->iommu_listener, vdev->dma_as);
> >      }
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 25/32] vhost+postcopy: Lock around set_mem_table
  2017-08-30  6:50     ` Peter Xu
@ 2017-09-25 17:56       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-25 17:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, maxime.coquelin, a.perevalov, mst, marcandre.lureau,
	quintela, lvivier, aarcange, felipe

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 24, 2017 at 08:27:23PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > **HACK - better solution needed **
> > We have the situation where:
> > 
> >      qemu                      bridge
> > 
> >      send set_mem_table
> >                               map memory
> >   a)                          mark area with UFD
> >                               send reply with map addresses
> >   b)                          start using
> >   c) receive reply
> > 
> >   As soon as (a) happens qemu might start seeing faults
> > from memory accesses (but doesn't until b); but it can't
> > process those faults until (c) when it's received the
> > mmap addresses.
> > 
> > Make the fault handler spin until it gets the reply in (c).
> > 
> > At the very least this needs some proper locks, but preferably
> > we need to split the message.
> 
> I see discussions about slave channel and ack mechanism in previous
> post.  So it's still not adopted (which looks doable)?  What's our
> further plan?

Yep I'm going to look at the slave channel stuff.
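
As a rough illustration of the "proper locks" alternative to spinning mentioned in the commit message, the fault handler could sleep on a condition variable until the reply in (c) arrives. This is only a sketch and not code from the patch: the struct and function names are hypothetical, and plain pthreads are assumed here to keep it self-contained (QEMU itself would more likely use its own mutex/condition wrappers).

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state shared between the thread that sends set_mem_table
 * and the userfault handler thread; not a structure from the patch. */
struct mem_table_sync {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            reply_received;  /* true once step (c) has happened */
    uint64_t        client_base;     /* mmap address reported by the client */
};

/* Called by the sender once the reply in step (c) arrives. */
static void mem_table_reply_arrived(struct mem_table_sync *s, uint64_t base)
{
    pthread_mutex_lock(&s->lock);
    s->client_base = base;
    s->reply_received = true;
    pthread_cond_broadcast(&s->cond);   /* wake any waiting fault handler */
    pthread_mutex_unlock(&s->lock);
}

/* Called by the fault handler: sleeps (rather than spins) until the
 * client's address is known, then returns it. */
static uint64_t mem_table_wait_for_reply(struct mem_table_sync *s)
{
    uint64_t base;

    pthread_mutex_lock(&s->lock);
    while (!s->reply_received) {
        pthread_cond_wait(&s->cond, &s->lock);
    }
    base = s->client_base;
    pthread_mutex_unlock(&s->lock);
    return base;
}
```

This only removes the busy-wait; it does not address the underlying ordering problem, which is what splitting the message or using the slave channel would be for.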

Dave

> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 30/32] vhost: Merge neighbouring hugepage regions where appropriate
  2017-09-25 11:19       ` Dr. David Alan Gilbert
@ 2017-10-02 13:49         ` Igor Mammedov
  0 siblings, 0 replies; 94+ messages in thread
From: Igor Mammedov @ 2017-10-02 13:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: lvivier, aarcange, mst, quintela, qemu-devel, peterx,
	a.perevalov, maxime.coquelin, felipe, marcandre.lureau

On Mon, 25 Sep 2017 12:19:55 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Igor Mammedov (imammedo@redhat.com) wrote:
> > On Thu, 24 Aug 2017 20:27:28 +0100
> > "Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> >   
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Where two regions are created with a gap such that when aligned
> > > to hugepage boundaries, the two regions overlap, merge them.  
> > why only hugepage boundaries, it should be applicable any alignment  
> 
> Actually this patch isn't huge-page specific - it just aligns to the
> pagesize; but do we ever hit a case where a region is smaller than a
> normal page and thus is changed by this?
> 
> > I'd say the patch isn't what I've had in mind when we discussed issue,  
> 
> Ah
> 
> > it builds on already existing merging code and complicates
> > code even more.  
> 
> Yes it is a little complex.
> 
> > Have you looked into possibility to rebuild memory map from scratch
> > every time vhost_region_add/vhost_region_del is called or even at
> > vhost_commit() time to reduce rebuild from a set of memory sections
> > that vhost tracks?
> > That should simplify algorithm a lot as memory sections are coming
> > from flat view and never overlap compared to current merged memory
> > map in vhost_dev::mem, so it won't have to deal with first splitting
> > and then merging back every time flatview changes.  
> 
> I hadn't; I was concentrating on changing the existing code rather than
> reworking it - especially since I don't/didn't know much about the
> notifiers.
> 
> Are you suggesting that basically vhost_region_add/del do nothing
> (except maybe set a flag) and the real work gets done in vhost_commit()?
> (I also found I had to call the merge from vhost_dev_start as well as
> vhost_commit - I guess from the first use?)
yep, i.e. build memmap on request.


> If I just did everything in vhost_commit where do I start - is that
> using something like address_space_to_flatview(address_space_memory) to
> get the main FlatView and somehow walk that?
vhost already tracks the flat view through the vhost_region_add/vhost_region_del
notifiers, by saving references to MemoryRegionSection-s.
Memory sections have the following properties/behavior:
 1. they never overlap
 2. when we map something over an existing memory section, the
    notifier first removes the former section and then gets several
    region_add calls that add the newly split, non-overlapping sections.

#2 happens multiple times when we start the VM (before machine_done)
   and several times during firmware boot, when some registers are
   (un)mapped during chip-set initialization.

so currently vhost_set_memory() is called uselessly multiple times
before the memmap is actually needed/used, and it maintains what is
essentially an optimized/sorted version of mem_sections[].
What I suggest is to
 1. stop rebuilding the memmap in vhost_set_memory on 'every' flatview
    change and do it only when the memmap is actually used
 2. get rid of the duplicate data kept in regions[] and the complex code
    that maintains it, and
      2.1 use mem_sections[] directly to build the memmap on request.
      2.2 sort mem_sections[] by start_addr when the memmap is built;
          that could help to merge neighboring/mergeable sections on the fly
          without the need to resplit/merge regions[] in the internally
          maintained memmap.

Implementing both points would allow dropping a bunch of complex
code that sort of duplicates what flatview already does, and I'd guess
this patch would be much simpler as a result.
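
Points 2.1 and 2.2 could look roughly like the sketch below: sort the tracked sections by start address, then walk them once and merge any section that directly follows the previous one in both guest-physical and host-virtual space. The `struct sect` type and the `build_memmap()` name are hypothetical, simplified stand-ins and not the real MemoryRegionSection or vhost structures.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a tracked memory section. */
struct sect {
    uint64_t gpa;   /* guest physical start address */
    uint64_t size;  /* length in bytes */
    uint64_t hva;   /* host (userspace) virtual start address */
};

/* qsort comparator: order sections by guest physical start address. */
static int sect_cmp(const void *a, const void *b)
{
    const struct sect *x = a, *y = b;

    return (x->gpa > y->gpa) - (x->gpa < y->gpa);
}

/* Sort s[] by gpa, then write the merged regions to out[], which must
 * have room for n entries.  Returns the number of regions produced. */
static size_t build_memmap(struct sect *s, size_t n, struct sect *out)
{
    size_t i, nout = 0;

    qsort(s, n, sizeof(*s), sect_cmp);
    for (i = 0; i < n; i++) {
        struct sect *last = nout ? &out[nout - 1] : NULL;

        if (last &&
            last->gpa + last->size == s[i].gpa &&
            last->hva + last->size == s[i].hva) {
            /* Contiguous in both address spaces: extend the last region. */
            last->size += s[i].size;
        } else {
            out[nout++] = s[i];
        }
    }
    return nout;
}
```

Because the sections never overlap, a single pass over the sorted array is enough; nothing ever has to be split again.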

Optionally, there is an idea to allow merging neighboring sections
even if there are gaps between them, provided that the GPA->HVA distance
of the sections being merged is the same (i.e. the sections belong to the
same MR with some holes punched in the flatview by MMIO);
that should allow better memmap compression than we have now.
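
The gap-tolerant merge could be sketched as below: two regions are merged whenever their guest-physical to host-virtual offset is identical, whether or not they touch. The `struct region` type and both function names are hypothetical, and the sketch assumes the first region starts at or below the second.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified region descriptor. */
struct region {
    uint64_t gpa;   /* guest physical start address */
    uint64_t size;  /* length in bytes */
    uint64_t hva;   /* host (userspace) virtual start address */
};

/* True if both regions use the same gpa -> hva offset, i.e. they look
 * like pieces of one mapping.  Unsigned wrap-around is harmless here
 * because only equality is tested. */
static bool same_offset(const struct region *a, const struct region *b)
{
    return a->hva - a->gpa == b->hva - b->gpa;
}

/* Extend 'a' so that it also covers 'b', including any gap between
 * them.  Requires a->gpa <= b->gpa.  Returns true if merged. */
static bool merge_across_gap(struct region *a, const struct region *b)
{
    uint64_t a_end = a->gpa + a->size;
    uint64_t b_end = b->gpa + b->size;

    if (b->gpa < a->gpa || !same_offset(a, b)) {
        return false;
    }
    if (b_end > a_end) {
        a->size = b_end - a->gpa;
    }
    return true;
}
```

For example, a region covering guest addresses 0 to 0xa0000 and one starting at 0xc0000 in the same mapping would collapse into a single region starting at 0. The merged region then also spans the hole, so this only works if the client never touches the hole.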

PS:
the refactoring should probably be split into a separate series
that goes in first.

> Dave
> 
> > > I also add quite a few trace events to see what's going on.
> > > 
> > > Note: This doesn't handle all the cases, but does handle the common
> > > case on a PC due to the 640k hole.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > ---
> > >  hw/virtio/trace-events | 11 +++++++
> > >  hw/virtio/vhost.c      | 79 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  2 files changed, 89 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > index 5b599617a1..f98efb39fd 100644
> > > --- a/hw/virtio/trace-events
> > > +++ b/hw/virtio/trace-events
> > > @@ -1,5 +1,16 @@
> > >  # See docs/devel/tracing.txt for syntax documentation.
> > >  
> > > +# hw/virtio/vhost.c
> > > +vhost_dev_assign_memory_merged(int from, int to, uint64_t size, uint64_t start_addr, uint64_t uaddr) "f/t=%d/%d 0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
> > > +vhost_dev_assign_memory_not_merged(uint64_t size, uint64_t start_addr, uint64_t uaddr) "0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
> > > +vhost_dev_assign_memory_entry(uint64_t size, uint64_t start_addr, uint64_t uaddr) "0x%"PRIx64" @ P: 0x%"PRIx64" U: 0x%"PRIx64
> > > +vhost_dev_assign_memory_exit(uint32_t nregions) "%"PRId32
> > > +vhost_huge_page_stretch_and_merge_entry(uint32_t nregions) "%"PRId32
> > > +vhost_huge_page_stretch_and_merge_can(void) ""
> > > +vhost_huge_page_stretch_and_merge_size_align(int d, uint64_t gpa, uint64_t align) "%d: gpa: 0x%"PRIx64" align: 0x%"PRIx64
> > > +vhost_huge_page_stretch_and_merge_start_align(int d, uint64_t gpa, uint64_t align) "%d: gpa: 0x%"PRIx64" align: 0x%"PRIx64
> > > +vhost_section(const char *name, int r) "%s:%d"
> > > +
> > >  # hw/virtio/vhost-user.c
> > >  vhost_user_postcopy_end_entry(void) ""
> > >  vhost_user_postcopy_end_exit(void) ""
> > > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > > index 6eddb099b0..fb506e747f 100644
> > > --- a/hw/virtio/vhost.c
> > > +++ b/hw/virtio/vhost.c
> > > @@ -27,6 +27,7 @@
> > >  #include "hw/virtio/virtio-access.h"
> > >  #include "migration/blocker.h"
> > >  #include "sysemu/dma.h"
> > > +#include "trace.h"
> > >  
> > >  /* enabled until disconnected backend stabilizes */
> > >  #define _VHOST_DEBUG 1
> > > @@ -250,6 +251,8 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
> > >  {
> > >      int from, to;
> > >      struct vhost_memory_region *merged = NULL;
> > > +    trace_vhost_dev_assign_memory_entry(size, start_addr, uaddr);
> > > +
> > >      for (from = 0, to = 0; from < dev->mem->nregions; ++from, ++to) {
> > >          struct vhost_memory_region *reg = dev->mem->regions + to;
> > >          uint64_t prlast, urlast;
> > > @@ -293,11 +296,13 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
> > >          uaddr = merged->userspace_addr = u;
> > >          start_addr = merged->guest_phys_addr = s;
> > >          size = merged->memory_size = e - s + 1;
> > > +        trace_vhost_dev_assign_memory_merged(from, to, size, start_addr, uaddr);
> > >          assert(merged->memory_size);
> > >      }
> > >  
> > >      if (!merged) {
> > >          struct vhost_memory_region *reg = dev->mem->regions + to;
> > > +        trace_vhost_dev_assign_memory_not_merged(size, start_addr, uaddr);
> > >          memset(reg, 0, sizeof *reg);
> > >          reg->memory_size = size;
> > >          assert(reg->memory_size);
> > > @@ -307,6 +312,7 @@ static void vhost_dev_assign_memory(struct vhost_dev *dev,
> > >      }
> > >      assert(to <= dev->mem->nregions + 1);
> > >      dev->mem->nregions = to;
> > > +    trace_vhost_dev_assign_memory_exit(to);
> > >  }
> > >  
> > >  static uint64_t vhost_get_log_size(struct vhost_dev *dev)
> > > @@ -610,8 +616,12 @@ static void vhost_set_memory(MemoryListener *listener,
> > >  
> > >  static bool vhost_section(MemoryRegionSection *section)
> > >  {
> > > -    return memory_region_is_ram(section->mr) &&
> > > +    bool result;
> > > +    result = memory_region_is_ram(section->mr) &&
> > >          !memory_region_is_rom(section->mr);
> > > +
> > > +    trace_vhost_section(section->mr->name, result);
> > > +    return result;
> > >  }
> > >  
> > >  static void vhost_begin(MemoryListener *listener)
> > > @@ -622,6 +632,68 @@ static void vhost_begin(MemoryListener *listener)
> > >      dev->mem_changed_start_addr = -1;
> > >  }
> > >  
> > > +/* Look for regions that are hugepage backed but not aligned
> > > + * and fix them up to be aligned.
> > > + * TODO: For now this is just enough to deal with the 640k hole
> > > + */
> > > +static bool vhost_huge_page_stretch_and_merge(struct vhost_dev *dev)
> > > +{
> > > +    int i, j;
> > > +    bool result = true;
> > > +    trace_vhost_huge_page_stretch_and_merge_entry(dev->mem->nregions);
> > > +
> > > +    for (i = 0; i < dev->mem->nregions; i++) {
> > > +        struct vhost_memory_region *reg = dev->mem->regions + i;
> > > +        ram_addr_t offset;
> > > +        RAMBlock *rb = qemu_ram_block_from_host((void *)reg->userspace_addr,
> > > +                                                false, &offset);
> > > +        size_t pagesize = qemu_ram_pagesize(rb);
> > > +        uint64_t alignage;
> > > +        alignage = reg->guest_phys_addr & (pagesize - 1);
> > > +        if (alignage) {
> > > +
> > > +            trace_vhost_huge_page_stretch_and_merge_start_align(i,
> > > +                                                (uint64_t)reg->guest_phys_addr,
> > > +                                                alignage);
> > > +            for (j = 0; j < dev->mem->nregions; j++) {
> > > +                struct vhost_memory_region *oreg = dev->mem->regions + j;
> > > +                if (j == i) {
> > > +                    continue;
> > > +                }
> > > +
> > > +                if (oreg->guest_phys_addr ==
> > > +                        (reg->guest_phys_addr - alignage) &&
> > > +                    oreg->userspace_addr ==
> > > +                         (reg->userspace_addr - alignage)) {
> > > +                    struct vhost_memory_region treg = *reg;
> > > +                    trace_vhost_huge_page_stretch_and_merge_can();
> > > +                    vhost_dev_unassign_memory(dev, oreg->guest_phys_addr,
> > > +                                              oreg->memory_size);
> > > +                    vhost_dev_unassign_memory(dev, treg.guest_phys_addr,
> > > +                                              treg.memory_size);
> > > +                    vhost_dev_assign_memory(dev,
> > > +                                            treg.guest_phys_addr - alignage,
> > > +                                            treg.memory_size + alignage,
> > > +                                            treg.userspace_addr - alignage);
> > > +                    return vhost_huge_page_stretch_and_merge(dev);
> > > +                }
> > > +            }
> > > +        }
> > > +        alignage = reg->memory_size & (pagesize - 1);
> > > +        if (alignage) {
> > > +            trace_vhost_huge_page_stretch_and_merge_size_align(i,
> > > +                                               (uint64_t)reg->guest_phys_addr,
> > > +                                               alignage);
> > > +            /* We ignore this if we find something else to merge,
> > > +             * so we only return false if we're left with this
> > > +             */
> > > +            result = false;
> > > +        }
> > > +    }
> > > +
> > > +    return result;
> > > +}
> > > +
> > >  static void vhost_commit(MemoryListener *listener)
> > >  {
> > >      struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> > > @@ -641,6 +713,7 @@ static void vhost_commit(MemoryListener *listener)
> > >          return;
> > >      }
> > >  
> > > +    vhost_huge_page_stretch_and_merge(dev);
> > >      if (dev->started) {
> > >          start_addr = dev->mem_changed_start_addr;
> > >          size = dev->mem_changed_end_addr - dev->mem_changed_start_addr + 1;
> > > @@ -1512,6 +1585,10 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
> > >          goto fail_features;
> > >      }
> > >  
> > > +    if (!vhost_huge_page_stretch_and_merge(hdev)) {
> > > +        VHOST_OPS_DEBUG("vhost_huge_page_stretch_and_merge failed");
> > > +        goto fail_mem;
> > > +    }
> > >      if (vhost_dev_has_iommu(hdev)) {
> > >          memory_listener_register(&hdev->iommu_listener, vdev->dma_as);
> > >      }  
> >   
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram
  2017-09-01 13:42     ` Maxime Coquelin
@ 2017-10-16  8:32       ` Alexey Perevalov
  0 siblings, 0 replies; 94+ messages in thread
From: Alexey Perevalov @ 2017-10-16  8:32 UTC (permalink / raw)
  To: Maxime Coquelin, Dr. David Alan Gilbert (git),
	qemu-devel, mst, marcandre.lureau
  Cc: quintela, peterx, lvivier, aarcange, felipe

Hello Maxime

On 09/01/2017 04:42 PM, Maxime Coquelin wrote:
> Hello Alexey,
>
> On 09/01/2017 03:34 PM, Alexey Perevalov wrote:
>> Hello David,
>>
>> You wrote in previous version:
>>
>>> We've had a postcopy migrate work now, with a few hacks we're still
>>> cleaning up, both on vhost-user-bridge and dpdk; so I'll get this
>>> updated and reposted.
>>
>> I want to know more about DPDK work, do you know, is somebody 
>> assigned to that task?
>
> I did the DPDK (rough) prototype, you may find it here:
> https://gitlab.com/mcoquelin/dpdk-next-virtio/commits/postcopy_proto_v1

I found that it is for the previous version of the patchset. Do you have any updates?
>
> Cheers,
> Maxime
>
>
>

-- 
Best regards,
Alexey Perevalov

^ permalink raw reply	[flat|nested] 94+ messages in thread

end of thread, other threads:[~2017-10-16  8:32 UTC | newest]

Thread overview: 94+ messages
     [not found] <CGME20170824192750epcas5p484df9724ca7c0a259a4dd85425a69e1d@epcas5p4.samsung.com>
2017-08-24 19:26 ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Dr. David Alan Gilbert (git)
2017-08-24 19:26   ` [Qemu-devel] [RFC v2 01/32] vhu: vu_queue_started Dr. David Alan Gilbert (git)
2017-08-24 23:10     ` Marc-André Lureau
2017-08-25 14:58       ` Dr. David Alan Gilbert
2017-08-30 13:02     ` Michael S. Tsirkin
2017-08-30 13:13       ` Marc-André Lureau
2017-09-05 12:58         ` Dr. David Alan Gilbert
2017-09-05 13:01           ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 02/32] vhub: Only process received packets on started queues Dr. David Alan Gilbert (git)
2017-08-30  9:59     ` Marc-André Lureau
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 03/32] migrate: Update ram_block_discard_range for shared Dr. David Alan Gilbert (git)
2017-08-29  5:30     ` Peter Xu
2017-09-18 12:18       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 04/32] qemu_ram_block_host_offset Dr. David Alan Gilbert (git)
2017-08-25 12:11     ` Philippe Mathieu-Daudé
2017-08-25 15:28       ` Dr. David Alan Gilbert
2017-08-29  5:36     ` Peter Xu
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 05/32] migration/ram: ramblock_recv_bitmap_test_byte_offset Dr. David Alan Gilbert (git)
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 06/32] postcopy: use UFFDIO_ZEROPAGE only when available Dr. David Alan Gilbert (git)
2017-08-30  9:57     ` Marc-André Lureau
2017-09-07 10:55       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 07/32] postcopy: Add notifier chain Dr. David Alan Gilbert (git)
2017-08-29  6:02     ` Peter Xu
2017-09-11 17:00       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 08/32] postcopy: Add vhost-user flag for postcopy and check it Dr. David Alan Gilbert (git)
2017-08-29  6:22     ` Peter Xu
2017-09-13 14:34       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 09/32] vhost-user: Add 'VHOST_USER_POSTCOPY_ADVISE' message Dr. David Alan Gilbert (git)
2017-08-30 10:07     ` Marc-André Lureau
2017-09-07 11:04       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 10/32] vhub: Support sending fds back to qemu Dr. David Alan Gilbert (git)
2017-08-30 10:22     ` Marc-André Lureau
2017-09-07 11:31       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 11/32] vhub: Open userfaultfd Dr. David Alan Gilbert (git)
2017-08-29  6:40     ` Peter Xu
2017-09-15 17:33       ` Dr. David Alan Gilbert
2017-08-30 10:30     ` Marc-André Lureau
2017-09-07 16:36       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 12/32] postcopy: Allow registering of fd handler Dr. David Alan Gilbert (git)
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 13/32] vhost+postcopy: Register shared ufd with postcopy Dr. David Alan Gilbert (git)
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 14/32] vhost+postcopy: Transmit 'listen' to client Dr. David Alan Gilbert (git)
2017-08-30 10:37     ` Marc-André Lureau
2017-09-07 12:10       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 15/32] vhost+postcopy: Register new regions with the ufd Dr. David Alan Gilbert (git)
2017-08-30 10:42     ` Marc-André Lureau
2017-09-08 14:50       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 16/32] vhost+postcopy: Send address back to qemu Dr. David Alan Gilbert (git)
2017-08-29  8:30     ` Peter Xu
2017-09-12 17:15       ` Dr. David Alan Gilbert
2017-09-13  4:29         ` Peter Xu
2017-09-13 12:15           ` Dr. David Alan Gilbert
2017-09-15  8:57             ` Peter Xu
2017-09-15 15:32               ` Dr. David Alan Gilbert
2017-09-18  9:31               ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 17/32] vhost+postcopy: Stash RAMBlock and offset Dr. David Alan Gilbert (git)
2017-08-30  5:51     ` Peter Xu
2017-09-13 15:59       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 18/32] vhost+postcopy: Send requests to source for shared pages Dr. David Alan Gilbert (git)
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 19/32] vhost+postcopy: Resolve client address Dr. David Alan Gilbert (git)
2017-08-30  5:28     ` Peter Xu
2017-09-11 11:58       ` Dr. David Alan Gilbert
2017-09-13  5:18         ` Peter Xu
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 20/32] postcopy: wake shared Dr. David Alan Gilbert (git)
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 21/32] postcopy: postcopy_notify_shared_wake Dr. David Alan Gilbert (git)
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 22/32] vhost+postcopy: Add vhost waker Dr. David Alan Gilbert (git)
2017-08-30  5:55     ` Peter Xu
2017-09-13 13:09       ` Dr. David Alan Gilbert
2017-09-18  3:57         ` Peter Xu
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 23/32] vhost+postcopy: Call wakeups Dr. David Alan Gilbert (git)
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 24/32] vub+postcopy: madvises Dr. David Alan Gilbert (git)
2017-08-30 10:48     ` Marc-André Lureau
2017-09-07 12:30       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 25/32] vhost+postcopy: Lock around set_mem_table Dr. David Alan Gilbert (git)
2017-08-30  6:50     ` Peter Xu
2017-09-25 17:56       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 26/32] vhost: Add VHOST_USER_POSTCOPY_END message Dr. David Alan Gilbert (git)
2017-08-30  6:55     ` Peter Xu
2017-09-11 11:31       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 27/32] vhost+postcopy: Wire up POSTCOPY_END notify Dr. David Alan Gilbert (git)
2017-08-30  6:57     ` Peter Xu
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 28/32] postcopy: Allow shared memory Dr. David Alan Gilbert (git)
2017-08-30 10:39     ` Marc-André Lureau
2017-09-07 12:15       ` Dr. David Alan Gilbert
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 29/32] vhost-user: Claim support for postcopy Dr. David Alan Gilbert (git)
2017-08-30 10:50     ` Marc-André Lureau
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 30/32] vhost: Merge neighbouring hugepage regions where appropriate Dr. David Alan Gilbert (git)
2017-09-14  9:18     ` Igor Mammedov
2017-09-25 11:19       ` Dr. David Alan Gilbert
2017-10-02 13:49         ` Igor Mammedov
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 31/32] vhost: Don't break merged regions on small remove/non-adds Dr. David Alan Gilbert (git)
2017-08-24 19:27   ` [Qemu-devel] [RFC v2 32/32] postcopy shared docs Dr. David Alan Gilbert (git)
2017-09-01 13:34   ` [Qemu-devel] [RFC v2 00/32] postcopy+vhost-user/shared ram Alexey Perevalov
2017-09-01 13:42     ` Maxime Coquelin
2017-10-16  8:32       ` Alexey Perevalov
