* [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:32 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Hi,
  The existing postcopy code, and the userfault kernel
code that supports it, only works for normal anonymous memory.
Kernel support for userfault on hugetlbfs is working
its way upstream; it's in the linux-mm tree.
You can get a version at:
   git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
on the origin/userfault branch.

Note that while this code supports arbitrarily sized hugepages,
it doesn't make sense for pages above the few-MB range,
so while 2MB is fine, 1GB is probably a bad idea;
this code waits for and transmits whole huge pages, and a
1GB page would take about 1 second to transfer over a 10Gbps
link - which is way too long to pause the destination for.
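
(As a back-of-the-envelope check of that number: a 10Gbps link moves at
most 1.25GBytes/s, so a 1GiB page needs roughly 1GiB / 1.25GB/s ~= 0.86s
on the wire before the faulting vCPU on the destination can continue,
ignoring all protocol overhead; a 2MB page by the same arithmetic is
only about 1.7ms.)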

Dave

Dr. David Alan Gilbert (16):
  postcopy: Transmit ram size summary word
  postcopy: Transmit and compare individual page sizes
  postcopy: Chunk discards for hugepages
  exec: ram_block_discard_range
  postcopy: enhance ram_block_discard_range for hugepages
  Fold postcopy_ram_discard_range into ram_discard_range
  postcopy: Record largest page size
  postcopy: Plumb pagesize down into place helpers
  postcopy: Use temporary for placing zero huge pages
  postcopy: Load huge pages in one go
  postcopy: Mask fault addresses to huge page boundary
  postcopy: Send whole huge pages
  postcopy: Allow hugepages
  postcopy: Update userfaultfd.h header
  postcopy: Check for userfault+hugepage feature
  postcopy: Add doc about hugepages and postcopy

 docs/migration.txt                |  13 ++++
 exec.c                            |  83 +++++++++++++++++++++++
 include/exec/cpu-common.h         |   2 +
 include/exec/memory.h             |   1 -
 include/migration/migration.h     |   3 +
 include/migration/postcopy-ram.h  |  13 ++--
 linux-headers/linux/userfaultfd.h |  81 +++++++++++++++++++---
 migration/migration.c             |   1 +
 migration/postcopy-ram.c          | 138 +++++++++++++++++---------------------
 migration/ram.c                   | 109 ++++++++++++++++++------------
 migration/savevm.c                |  32 ++++++---
 migration/trace-events            |   2 +-
 12 files changed, 328 insertions(+), 150 deletions(-)

-- 
2.9.3

* [Qemu-devel] [PATCH v2 01/16] postcopy: Transmit ram size summary word
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:32 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Replace the host page-size in the 'advise' command with a pagesize
summary bitmap; if the VM is just using normal RAM then
this will be exactly the same as before, but if it's using
huge pages the values will differ, and thus:
   a) Migration from/to old QEMUs that don't understand huge pages
      will fail early.
   b) Migrations where the RAMBlocks use different page sizes will
      also fail early.

This catches it very early; earlier than the detailed per-block
check in the next patch.
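
As a concrete illustration (assuming an x86-64 host with 4KiB base
pages), a VM using only normal RAM would now advertise

   summary = 0x1000

while one that also has a 2MB hugetlbfs-backed RAMBlock would advertise

   summary = 0x1000 | 0x200000 = 0x201000

so a destination with a different mix of page sizes - or an old QEMU
that still compares the value against getpagesize() - rejects the
migration straight away.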

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |  1 +
 migration/ram.c               | 17 +++++++++++++++++
 migration/savevm.c            | 32 +++++++++++++++++++++-----------
 3 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index af9135f..96c9d6e 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -366,6 +366,7 @@ void global_state_store_running(void);
 void flush_page_queue(MigrationState *ms);
 int ram_save_queue_pages(MigrationState *ms, const char *rbname,
                          ram_addr_t start, ram_addr_t len);
+uint64_t ram_pagesize_summary(void);
 
 PostcopyState postcopy_state_get(void);
 /* Set the state and return the old state */
diff --git a/migration/ram.c b/migration/ram.c
index ef8fadf..b405e4a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -600,6 +600,23 @@ static void migration_bitmap_sync_init(void)
     iterations_prev = 0;
 }
 
+/* Returns a summary bitmap of the page sizes of all RAMBlocks;
+ * for VMs with just normal pages this is equivalent to the
+ * host page size.  If it's got some huge pages then it's the OR
+ * of all the different page sizes.
+ */
+uint64_t ram_pagesize_summary(void)
+{
+    RAMBlock *block;
+    uint64_t summary = 0;
+
+    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+        summary |= block->page_size;
+    }
+
+    return summary;
+}
+
 static void migration_bitmap_sync(void)
 {
     RAMBlock *block;
diff --git a/migration/savevm.c b/migration/savevm.c
index de86db0..e83d01a 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -869,7 +869,7 @@ int qemu_savevm_send_packaged(QEMUFile *f, const uint8_t *buf, size_t len)
 void qemu_savevm_send_postcopy_advise(QEMUFile *f)
 {
     uint64_t tmp[2];
-    tmp[0] = cpu_to_be64(getpagesize());
+    tmp[0] = cpu_to_be64(ram_pagesize_summary());
     tmp[1] = cpu_to_be64(1ul << qemu_target_page_bits());
 
     trace_qemu_savevm_send_postcopy_advise();
@@ -1346,7 +1346,7 @@ static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
 {
     PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_ADVISE);
-    uint64_t remote_hps, remote_tps;
+    uint64_t remote_pagesize_summary, local_pagesize_summary, remote_tps;
 
     trace_loadvm_postcopy_handle_advise();
     if (ps != POSTCOPY_INCOMING_NONE) {
@@ -1359,17 +1359,27 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
         return -1;
     }
 
-    remote_hps = qemu_get_be64(mis->from_src_file);
-    if (remote_hps != getpagesize())  {
+    remote_pagesize_summary = qemu_get_be64(mis->from_src_file);
+    local_pagesize_summary = ram_pagesize_summary();
+
+    if (remote_pagesize_summary != local_pagesize_summary)  {
         /*
-         * Some combinations of mismatch are probably possible but it gets
-         * a bit more complicated.  In particular we need to place whole
-         * host pages on the dest at once, and we need to ensure that we
-         * handle dirtying to make sure we never end up sending part of
-         * a hostpage on it's own.
+         * This detects two potential causes of mismatch:
+         *   a) A mismatch in host page sizes
+         *      Some combinations of mismatch are probably possible but it gets
+         *      a bit more complicated.  In particular we need to place whole
+         *      host pages on the dest at once, and we need to ensure that we
+         *      handle dirtying to make sure we never end up sending part of
+         *      a hostpage on its own.
+         *   b) The use of different huge page sizes on source/destination;
+         *      a finer-grained test is performed during RAM block migration,
+         *      but this test here causes a nice early clear failure, and
+         *      also fails when passed to an older qemu that doesn't
+         *      do huge pages.
          */
-        error_report("Postcopy needs matching host page sizes (s=%d d=%d)",
-                     (int)remote_hps, getpagesize());
+        error_report("Postcopy needs matching RAM page sizes (s=%" PRIx64
+                                                             " d=%" PRIx64 ")",
+                     remote_pagesize_summary, local_pagesize_summary);
         return -1;
     }
 
-- 
2.9.3

* [Qemu-devel] [PATCH v2 02/16] postcopy: Transmit and compare individual page sizes
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:32 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

When using postcopy with hugepages, we require the source
and destination page sizes for any RAMBlock to match; note
that different RAMBlocks in the same VM can have different
page sizes.

Transmit them as part of the RAM information header and
fail if there's a difference.
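
With this change the per-RAMBlock record in the setup stage looks
roughly like this (the final field is the new, optional one; see the
diff below):

   byte    strlen(idstr)
   bytes   idstr
   be64    used_length
   be64    page_size   (only when postcopy is enabled and the block's
                        page size differs from the host page size)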

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/ram.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index b405e4a..5726563 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1979,6 +1979,9 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
         qemu_put_byte(f, strlen(block->idstr));
         qemu_put_buffer(f, (uint8_t *)block->idstr, strlen(block->idstr));
         qemu_put_be64(f, block->used_length);
+        if (migrate_postcopy_ram() && block->page_size != qemu_host_page_size) {
+            qemu_put_be64(f, block->page_size);
+        }
     }
 
     rcu_read_unlock();
@@ -2480,6 +2483,8 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
      * be atomic
      */
     bool postcopy_running = postcopy_state_get() >= POSTCOPY_INCOMING_LISTENING;
+    /* ADVISE is earlier, it shows the source has the postcopy capability on */
+    bool postcopy_advised = postcopy_state_get() >= POSTCOPY_INCOMING_ADVISE;
 
     seq_iter++;
 
@@ -2544,6 +2549,18 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
                             error_report_err(local_err);
                         }
                     }
+                    /* For postcopy we need to check hugepage sizes match */
+                    if (postcopy_advised &&
+                        block->page_size != qemu_host_page_size) {
+                        uint64_t remote_page_size = qemu_get_be64(f);
+                        if (remote_page_size != block->page_size) {
+                            error_report("Mismatched RAM page size %s "
+                                         "(local) %zd != %" PRId64,
+                                         id, block->page_size,
+                                         remote_page_size);
+                            ret = -EINVAL;
+                        }
+                    }
                     ram_control_load_hook(f, RAM_CONTROL_BLOCK_REG,
                                           block->idstr);
                 } else {
-- 
2.9.3

* [Qemu-devel] [PATCH v2 03/16] postcopy: Chunk discards for hugepages
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:32 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

At the start of the postcopy phase, partially sent huge pages
must be discarded.  The code for dealing with host page sizes larger
than the target page size can be reused for this case.
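
For example, with 4KiB target pages a 2MB hugetlbfs RAMBlock gives a
host_ratio of 512, so a partially sent huge page is discarded (or
re-marked dirty) as a single unit of 512 target pages.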

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 migration/ram.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 5726563..d33bd21 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1627,12 +1627,17 @@ static void postcopy_chunk_hostpages_pass(MigrationState *ms, bool unsent_pass,
 {
     unsigned long *bitmap;
     unsigned long *unsentmap;
-    unsigned int host_ratio = qemu_host_page_size / TARGET_PAGE_SIZE;
+    unsigned int host_ratio = block->page_size / TARGET_PAGE_SIZE;
     unsigned long first = block->offset >> TARGET_PAGE_BITS;
     unsigned long len = block->used_length >> TARGET_PAGE_BITS;
     unsigned long last = first + (len - 1);
     unsigned long run_start;
 
+    if (block->page_size == TARGET_PAGE_SIZE) {
+        /* Easy case - TPS==HPS for a non-huge page RAMBlock */
+        return;
+    }
+
     bitmap = atomic_rcu_read(&migration_bitmap_rcu)->bmap;
     unsentmap = atomic_rcu_read(&migration_bitmap_rcu)->unsentmap;
 
@@ -1736,7 +1741,8 @@ static void postcopy_chunk_hostpages_pass(MigrationState *ms, bool unsent_pass,
  * Utility for the outgoing postcopy code.
  *
  * Discard any partially sent host-page size chunks, mark any partially
- * dirty host-page size chunks as all dirty.
+ * dirty host-page size chunks as all dirty.  In this case the host-page
+ * is the host-page for the particular RAMBlock, i.e. it might be a huge page
  *
  * Returns: 0 on success
  */
@@ -1744,11 +1750,6 @@ static int postcopy_chunk_hostpages(MigrationState *ms)
 {
     struct RAMBlock *block;
 
-    if (qemu_host_page_size == TARGET_PAGE_SIZE) {
-        /* Easy case - TPS==HPS - nothing to be done */
-        return 0;
-    }
-
     /* Easiest way to make sure we don't resume in the middle of a host-page */
     last_seen_block = NULL;
     last_sent_block = NULL;
@@ -1804,7 +1805,7 @@ int ram_postcopy_send_discard_bitmap(MigrationState *ms)
         return -EINVAL;
     }
 
-    /* Deal with TPS != HPS */
+    /* Deal with TPS != HPS and huge pages */
     ret = postcopy_chunk_hostpages(ms);
     if (ret) {
         rcu_read_unlock();
-- 
2.9.3

* [Qemu-devel] [PATCH v2 04/16] exec: ram_block_discard_range
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:32 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Create ram_block_discard_range in exec.c to replace
postcopy_ram_discard_range and most of ram_discard_range.

Those two routines are a bit of a weird combination, and
ram_discard_range is about to get more complex for hugepages.
It's OS-dependent code (so shouldn't be in migration/ram.c) but
it needs quite a bit of the innards of RAMBlock, so it doesn't
belong in os*.c either.
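
A minimal sketch of the intended use (the block name and the absence of
real error handling are illustrative only, not part of the patch):

    RAMBlock *rb = qemu_ram_block_by_name("pc.ram");   /* example name */

    if (rb) {
        /* Drop every page of the block: reads then return zero, or
         * fault if the range is registered with userfaultfd.
         */
        if (ram_block_discard_range(rb, 0, rb->used_length)) {
            /* ram_block_discard_range() already error_report()ed */
        }
    }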

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 exec.c                    | 59 +++++++++++++++++++++++++++++++++++++++++++++++
 include/exec/cpu-common.h |  1 +
 2 files changed, 60 insertions(+)

diff --git a/exec.c b/exec.c
index 8b9ed73..e040cdf 100644
--- a/exec.c
+++ b/exec.c
@@ -45,6 +45,12 @@
 #include "exec/address-spaces.h"
 #include "sysemu/xen-mapcache.h"
 #include "trace-root.h"
+
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+#include <fcntl.h>
+#include <linux/falloc.h>
+#endif
+
 #endif
 #include "exec/cpu-all.h"
 #include "qemu/rcu_queue.h"
@@ -3286,4 +3292,57 @@ int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
     rcu_read_unlock();
     return ret;
 }
+
+/*
+ * Unmap pages of memory from start to start+length such that
+ * they a) read as 0, b) Trigger whatever fault mechanism
+ * the OS provides for postcopy.
+ * The pages must be unmapped by the end of the function.
+ * Returns: 0 on success, non-0 on failure
+ *
+ */
+int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
+{
+    int ret = -1;
+
+    rcu_read_lock();
+    uint8_t *host_startaddr = rb->host + start;
+
+    if ((uintptr_t)host_startaddr & (rb->page_size - 1)) {
+        error_report("ram_block_discard_range: Unaligned start address: %p",
+                     host_startaddr);
+        goto err;
+    }
+
+    if ((start + length) <= rb->used_length) {
+        uint8_t *host_endaddr = host_startaddr + length;
+        if ((uintptr_t)host_endaddr & (rb->page_size - 1)) {
+            error_report("ram_block_discard_range: Unaligned end address: %p",
+                         host_endaddr);
+            goto err;
+        }
+
+        errno = ENOTSUP; /* If we are missing MADVISE etc */
+
+#if defined(CONFIG_MADVISE)
+        ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
+#endif
+        if (ret) {
+            ret = -errno;
+            error_report("ram_block_discard_range: Failed to discard range "
+                         "%s:%" PRIx64 " +%zx (%d)",
+                         rb->idstr, start, length, ret);
+        }
+    } else {
+        error_report("ram_block_discard_range: Overrun block '%s' (%" PRIu64
+                     "/%zx/" RAM_ADDR_FMT")",
+                     rb->idstr, start, length, rb->used_length);
+    }
+
+err:
+    rcu_read_unlock();
+
+    return ret;
+}
+
 #endif
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index bd15853..1350c2e 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -105,6 +105,7 @@ typedef int (RAMBlockIterFunc)(const char *block_name, void *host_addr,
     ram_addr_t offset, ram_addr_t length, void *opaque);
 
 int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
+int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length);
 
 #endif
 
-- 
2.9.3

* [Qemu-devel] [PATCH v2 05/16] postcopy: enhance ram_block_discard_range for hugepages
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:32 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Unfortunately madvise DONTNEED doesn't work on hugetlbfs,
so use fallocate(FALLOC_FL_PUNCH_HOLE) instead.
qemu_fd_getpagesize only sets the page size based on the file
if the file is from hugetlbfs.
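
In isolation the mechanism looks something like the sketch below
(standalone and illustrative only - the hugetlbfs path and the 2MB page
size are assumptions, not QEMU code):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Assumes a hugetlbfs mount at /dev/hugepages with 2MB pages */
        int fd = open("/dev/hugepages/example", O_CREAT | O_RDWR, 0600);
        off_t len = 2 * 1024 * 1024;            /* one huge page */

        if (fd < 0 || ftruncate(fd, len) < 0) {
            perror("setup");
            return 1;
        }
        /* madvise(MADV_DONTNEED) doesn't work here; punching a hole drops
         * the backing huge page instead, and FALLOC_FL_KEEP_SIZE keeps the
         * file length unchanged so existing mappings stay valid.
         */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0, len)) {
            perror("fallocate");
            return 1;
        }
        close(fd);
        return 0;
    }

The next access to the punched range then takes a fault again, which is
exactly the hook the userfault-based postcopy path relies on.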

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 exec.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/exec.c b/exec.c
index e040cdf..c25f6b3 100644
--- a/exec.c
+++ b/exec.c
@@ -3324,9 +3324,20 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
 
         errno = ENOTSUP; /* If we are missing MADVISE etc */
 
+        if (rb->page_size == qemu_host_page_size) {
 #if defined(CONFIG_MADVISE)
-        ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
+            ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
 #endif
+        } else {
+            /* Huge page case  - unfortunately it can't do DONTNEED, but
+             * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
+             * huge page file.
+             */
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+            ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+                            start, length);
+#endif
+        }
         if (ret) {
             ret = -errno;
             error_report("ram_block_discard_range: Failed to discard range "
-- 
2.9.3

* [Qemu-devel] [PATCH v2 06/16] Fold postcopy_ram_discard_range into ram_discard_range
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:32 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Using the previously created ram_block_discard_range,
kill off postcopy_ram_discard_range.
ram_discard_range is now just a wrapper that does the RAMBlock name lookup.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/postcopy-ram.h |  7 -------
 migration/postcopy-ram.c         | 30 +-----------------------------
 migration/ram.c                  | 24 +++---------------------
 migration/trace-events           |  2 +-
 4 files changed, 5 insertions(+), 58 deletions(-)

diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
index b6a7491f..43bbbca 100644
--- a/include/migration/postcopy-ram.h
+++ b/include/migration/postcopy-ram.h
@@ -35,13 +35,6 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages);
 int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis);
 
 /*
- * Discard the contents of 'length' bytes from 'start'
- * We can assume that if we've been called postcopy_ram_hosttest returned true
- */
-int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
-                               size_t length);
-
-/*
  * Userfault requires us to mark RAM as NOHUGEPAGE prior to discard
  * however leaving it until after precopy means that most of the precopy
  * data is still THPd
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index a40dddb..1e3d22f 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -200,27 +200,6 @@ out:
     return ret;
 }
 
-/**
- * postcopy_ram_discard_range: Discard a range of memory.
- * We can assume that if we've been called postcopy_ram_hosttest returned true.
- *
- * @mis: Current incoming migration state.
- * @start, @length: range of memory to discard.
- *
- * returns: 0 on success.
- */
-int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
-                               size_t length)
-{
-    trace_postcopy_ram_discard_range(start, length);
-    if (madvise(start, length, MADV_DONTNEED)) {
-        error_report("%s MADV_DONTNEED: %s", __func__, strerror(errno));
-        return -1;
-    }
-
-    return 0;
-}
-
 /*
  * Setup an area of RAM so that it *can* be used for postcopy later; this
  * must be done right at the start prior to pre-copy.
@@ -239,7 +218,7 @@ static int init_range(const char *block_name, void *host_addr,
      * - we're going to get the copy from the source anyway.
      * (Precopy will just overwrite this data, so doesn't need the discard)
      */
-    if (postcopy_ram_discard_range(mis, host_addr, length)) {
+    if (ram_discard_range(mis, block_name, 0, length)) {
         return -1;
     }
 
@@ -658,13 +637,6 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
     return -1;
 }
 
-int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
-                               size_t length)
-{
-    assert(0);
-    return -1;
-}
-
 int postcopy_ram_prepare_discard(MigrationIncomingState *mis)
 {
     assert(0);
diff --git a/migration/ram.c b/migration/ram.c
index d33bd21..136996a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1845,6 +1845,8 @@ int ram_discard_range(MigrationIncomingState *mis,
 {
     int ret = -1;
 
+    trace_ram_discard_range(block_name, start, length);
+
     rcu_read_lock();
     RAMBlock *rb = qemu_ram_block_by_name(block_name);
 
@@ -1854,27 +1856,7 @@ int ram_discard_range(MigrationIncomingState *mis,
         goto err;
     }
 
-    uint8_t *host_startaddr = rb->host + start;
-
-    if ((uintptr_t)host_startaddr & (qemu_host_page_size - 1)) {
-        error_report("ram_discard_range: Unaligned start address: %p",
-                     host_startaddr);
-        goto err;
-    }
-
-    if ((start + length) <= rb->used_length) {
-        uint8_t *host_endaddr = host_startaddr + length;
-        if ((uintptr_t)host_endaddr & (qemu_host_page_size - 1)) {
-            error_report("ram_discard_range: Unaligned end address: %p",
-                         host_endaddr);
-            goto err;
-        }
-        ret = postcopy_ram_discard_range(mis, host_startaddr, length);
-    } else {
-        error_report("ram_discard_range: Overrun block '%s' (%" PRIu64
-                     "/%zx/" RAM_ADDR_FMT")",
-                     block_name, start, length, rb->used_length);
-    }
+    ret = ram_block_discard_range(rb, start, length);
 
 err:
     rcu_read_unlock();
diff --git a/migration/trace-events b/migration/trace-events
index fa660e3..7372ce2 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -68,6 +68,7 @@ get_queued_page_not_dirty(const char *block_name, uint64_t tmp_offset, uint64_t
 migration_bitmap_sync_start(void) ""
 migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64
 migration_throttle(void) ""
+ram_discard_range(const char *rbname, uint64_t start, size_t len) "%s: start: %" PRIx64 " %zx"
 ram_load_postcopy_loop(uint64_t addr, int flags) "@%" PRIx64 " %x"
 ram_postcopy_send_discard_bitmap(void) ""
 ram_save_queue_pages(const char *rbname, size_t start, size_t len) "%s: start: %zx len: %zx"
@@ -176,7 +177,6 @@ rdma_start_outgoing_migration_after_rdma_source_init(void) ""
 # migration/postcopy-ram.c
 postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
 postcopy_discard_send_range(const char *ramblock, unsigned long start, unsigned long length) "%s:%lx/%lx"
-postcopy_ram_discard_range(void *start, size_t length) "%p,+%zx"
 postcopy_cleanup_range(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
 postcopy_init_range(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
 postcopy_nhp_range(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
-- 
2.9.3

* [Qemu-devel] [PATCH v2 07/16] postcopy: Record largest page size
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:32 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Record the largest page size in use; we'll need it soon for allocating
temporary buffers.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 exec.c                        | 13 +++++++++++++
 include/exec/cpu-common.h     |  1 +
 include/migration/migration.h |  1 +
 migration/migration.c         |  1 +
 4 files changed, 16 insertions(+)

diff --git a/exec.c b/exec.c
index c25f6b3..59f3b6b 100644
--- a/exec.c
+++ b/exec.c
@@ -1524,6 +1524,19 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
     return rb->page_size;
 }
 
+/* Returns the largest size of page in use */
+size_t qemu_ram_pagesize_largest(void)
+{
+    RAMBlock *block;
+    size_t largest = 0;
+
+    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+        largest = MAX(largest, qemu_ram_pagesize(block));
+    }
+
+    return largest;
+}
+
 static int memory_try_enable_merging(void *addr, size_t len)
 {
     if (!machine_mem_merge(current_machine)) {
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 1350c2e..8c305aa 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -64,6 +64,7 @@ void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
 void qemu_ram_unset_idstr(RAMBlock *block);
 const char *qemu_ram_get_idstr(RAMBlock *rb);
 size_t qemu_ram_pagesize(RAMBlock *block);
+size_t qemu_ram_pagesize_largest(void);
 
 void cpu_physical_memory_rw(hwaddr addr, uint8_t *buf,
                             int len, int is_write);
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 96c9d6e..c9c1d5f 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -92,6 +92,7 @@ struct MigrationIncomingState {
      */
     QemuEvent main_thread_load_event;
 
+    size_t         largest_page_size;
     bool           have_fault_thread;
     QemuThread     fault_thread;
     QemuSemaphore  fault_thread_sem;
diff --git a/migration/migration.c b/migration/migration.c
index 283677c..e0fdafc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -387,6 +387,7 @@ static void process_incoming_migration_co(void *opaque)
     int ret;
 
     mis = migration_incoming_state_new(f);
+    mis->largest_page_size = qemu_ram_pagesize_largest();
     postcopy_state_set(POSTCOPY_INCOMING_NONE);
     migrate_set_state(&mis->state, MIGRATION_STATUS_NONE,
                       MIGRATION_STATUS_ACTIVE);
-- 
2.9.3

* [Qemu-devel] [PATCH v2 08/16] postcopy: Plumb pagesize down into place helpers
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:32 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Now that we deal with both normal-sized pages and huge pages, we need
to tell the place handlers the size we're dealing with
and make sure the temporary page is large enough.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/postcopy-ram.h |  6 +++--
 migration/postcopy-ram.c         | 47 ++++++++++++++++++++++++----------------
 migration/ram.c                  | 15 +++++++------
 3 files changed, 40 insertions(+), 28 deletions(-)

diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
index 43bbbca..8e036b9 100644
--- a/include/migration/postcopy-ram.h
+++ b/include/migration/postcopy-ram.h
@@ -74,13 +74,15 @@ void postcopy_discard_send_finish(MigrationState *ms,
  *    to use other postcopy_ routines to allocate.
  * returns 0 on success
  */
-int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from);
+int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
+                        size_t pagesize);
 
 /*
  * Place a zero page at (host) atomically
  * returns 0 on success
  */
-int postcopy_place_page_zero(MigrationIncomingState *mis, void *host);
+int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
+                             size_t pagesize);
 
 /*
  * Allocate a page of memory that can be mapped at a later point in time
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 1e3d22f..a8b7fed 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -321,7 +321,7 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
     migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0);
 
     if (mis->postcopy_tmp_page) {
-        munmap(mis->postcopy_tmp_page, getpagesize());
+        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
         mis->postcopy_tmp_page = NULL;
     }
     trace_postcopy_ram_incoming_cleanup_exit();
@@ -543,13 +543,14 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
  * Place a host page (from) at (host) atomically
  * returns 0 on success
  */
-int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
+int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
+                        size_t pagesize)
 {
     struct uffdio_copy copy_struct;
 
     copy_struct.dst = (uint64_t)(uintptr_t)host;
     copy_struct.src = (uint64_t)(uintptr_t)from;
-    copy_struct.len = getpagesize();
+    copy_struct.len = pagesize;
     copy_struct.mode = 0;
 
     /* copy also acks to the kernel waking the stalled thread up
@@ -559,8 +560,8 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
      */
     if (ioctl(mis->userfault_fd, UFFDIO_COPY, &copy_struct)) {
         int e = errno;
-        error_report("%s: %s copy host: %p from: %p",
-                     __func__, strerror(e), host, from);
+        error_report("%s: %s copy host: %p from: %p (size: %zd)",
+                     __func__, strerror(e), host, from, pagesize);
 
         return -e;
     }
@@ -573,23 +574,29 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
  * Place a zero page at (host) atomically
  * returns 0 on success
  */
-int postcopy_place_page_zero(MigrationIncomingState *mis, void *host)
+int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
+                             size_t pagesize)
 {
-    struct uffdio_zeropage zero_struct;
+    trace_postcopy_place_page_zero(host);
 
-    zero_struct.range.start = (uint64_t)(uintptr_t)host;
-    zero_struct.range.len = getpagesize();
-    zero_struct.mode = 0;
+    if (pagesize == getpagesize()) {
+        struct uffdio_zeropage zero_struct;
+        zero_struct.range.start = (uint64_t)(uintptr_t)host;
+        zero_struct.range.len = getpagesize();
+        zero_struct.mode = 0;
 
-    if (ioctl(mis->userfault_fd, UFFDIO_ZEROPAGE, &zero_struct)) {
-        int e = errno;
-        error_report("%s: %s zero host: %p",
-                     __func__, strerror(e), host);
+        if (ioctl(mis->userfault_fd, UFFDIO_ZEROPAGE, &zero_struct)) {
+            int e = errno;
+            error_report("%s: %s zero host: %p",
+                         __func__, strerror(e), host);
 
-        return -e;
+            return -e;
+        }
+    } else {
+        /* TODO: The kernel can't use UFFDIO_ZEROPAGE for hugepages */
+        assert(0);
     }
 
-    trace_postcopy_place_page_zero(host);
     return 0;
 }
 
@@ -604,7 +611,7 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host)
 void *postcopy_get_tmp_page(MigrationIncomingState *mis)
 {
     if (!mis->postcopy_tmp_page) {
-        mis->postcopy_tmp_page = mmap(NULL, getpagesize(),
+        mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
                              PROT_READ | PROT_WRITE, MAP_PRIVATE |
                              MAP_ANONYMOUS, -1, 0);
         if (mis->postcopy_tmp_page == MAP_FAILED) {
@@ -649,13 +656,15 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
     return -1;
 }
 
-int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
+int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
+                        size_t pagesize)
 {
     assert(0);
     return -1;
 }
 
-int postcopy_place_page_zero(MigrationIncomingState *mis, void *host)
+int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
+                        size_t pagesize)
 {
     assert(0);
     return -1;
diff --git a/migration/ram.c b/migration/ram.c
index 136996a..ff448ef 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2354,6 +2354,7 @@ static int ram_load_postcopy(QEMUFile *f)
         void *host = NULL;
         void *page_buffer = NULL;
         void *place_source = NULL;
+        RAMBlock *block = NULL;
         uint8_t ch;
 
         addr = qemu_get_be64(f);
@@ -2363,7 +2364,7 @@ static int ram_load_postcopy(QEMUFile *f)
         trace_ram_load_postcopy_loop((uint64_t)addr, flags);
         place_needed = false;
         if (flags & (RAM_SAVE_FLAG_COMPRESS | RAM_SAVE_FLAG_PAGE)) {
-            RAMBlock *block = ram_block_from_stream(f, flags);
+            block = ram_block_from_stream(f, flags);
 
             host = host_from_ram_block_offset(block, addr);
             if (!host) {
@@ -2438,14 +2439,14 @@ static int ram_load_postcopy(QEMUFile *f)
 
         if (place_needed) {
             /* This gets called at the last target page in the host page */
+            void *place_dest = host + TARGET_PAGE_SIZE - block->page_size;
+
             if (all_zero) {
-                ret = postcopy_place_page_zero(mis,
-                                               host + TARGET_PAGE_SIZE -
-                                               qemu_host_page_size);
+                ret = postcopy_place_page_zero(mis, place_dest,
+                                               block->page_size);
             } else {
-                ret = postcopy_place_page(mis, host + TARGET_PAGE_SIZE -
-                                               qemu_host_page_size,
-                                               place_source);
+                ret = postcopy_place_page(mis, place_dest,
+                                          place_source, block->page_size);
             }
         }
         if (!ret) {
-- 
2.9.3

* [Qemu-devel] [PATCH v2 09/16] postcopy: Use temporary for placing zero huge pages
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:32 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The kernel can't do UFFDIO_ZEROPAGE for huge pages, so we have
to allocate a temporary (always zero) page and use UFFDIO_COPY
on it.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 include/migration/migration.h |  1 +
 migration/postcopy-ram.c      | 23 +++++++++++++++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index c9c1d5f..bd399fc 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -108,6 +108,7 @@ struct MigrationIncomingState {
     QEMUFile *to_src_file;
     QemuMutex rp_mutex;    /* We send replies from multiple threads */
     void     *postcopy_tmp_page;
+    void     *postcopy_tmp_zero_page;
 
     QEMUBH *bh;
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index a8b7fed..4c736d2 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -324,6 +324,10 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
         munmap(mis->postcopy_tmp_page, mis->largest_page_size);
         mis->postcopy_tmp_page = NULL;
     }
+    if (mis->postcopy_tmp_zero_page) {
+        munmap(mis->postcopy_tmp_zero_page, mis->largest_page_size);
+        mis->postcopy_tmp_zero_page = NULL;
+    }
     trace_postcopy_ram_incoming_cleanup_exit();
     return 0;
 }
@@ -593,8 +597,23 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
             return -e;
         }
     } else {
-        /* TODO: The kernel can't use UFFDIO_ZEROPAGE for hugepages */
-        assert(0);
+        /* The kernel can't use UFFDIO_ZEROPAGE for hugepages */
+        if (!mis->postcopy_tmp_zero_page) {
+            mis->postcopy_tmp_zero_page = mmap(NULL, mis->largest_page_size,
+                                               PROT_READ | PROT_WRITE,
+                                               MAP_PRIVATE | MAP_ANONYMOUS,
+                                               -1, 0);
+            if (mis->postcopy_tmp_zero_page == MAP_FAILED) {
+                int e = errno;
+                mis->postcopy_tmp_zero_page = NULL;
+                error_report("%s: %s mapping large zero page",
+                             __func__, strerror(e));
+                return -e;
+            }
+            memset(mis->postcopy_tmp_zero_page, '\0', mis->largest_page_size);
+        }
+        return postcopy_place_page(mis, host, mis->postcopy_tmp_zero_page,
+                                   pagesize);
     }
 
     return 0;
-- 
2.9.3

* [Qemu-devel] [PATCH v2 10/16] postcopy: Load huge pages in one go
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:33 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The existing postcopy RAM load loop already ensures that it
glues together whole host pages from the target-page-sized chunks sent
over the wire.  Modify the definition of the host page that it uses
to be the RAMBlock's page size, and thus a huge page where appropriate.
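
Concretely, for a RAMBlock backed by 2MB huge pages with 4KiB target
pages, 512 consecutive target-page chunks are accumulated into the
temporary postcopy_host_page and only the last one triggers the place
operation; for a normal 4KiB RAMBlock the behaviour is unchanged.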

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 migration/ram.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index ff448ef..88d9444 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2342,7 +2342,7 @@ static int ram_load_postcopy(QEMUFile *f)
 {
     int flags = 0, ret = 0;
     bool place_needed = false;
-    bool matching_page_sizes = qemu_host_page_size == TARGET_PAGE_SIZE;
+    bool matching_page_sizes = false;
     MigrationIncomingState *mis = migration_incoming_get_current();
     /* Temporary page that is later 'placed' */
     void *postcopy_host_page = postcopy_get_tmp_page(mis);
@@ -2372,8 +2372,11 @@ static int ram_load_postcopy(QEMUFile *f)
                 ret = -EINVAL;
                 break;
             }
+            matching_page_sizes = block->page_size == TARGET_PAGE_SIZE;
             /*
-             * Postcopy requires that we place whole host pages atomically.
+             * Postcopy requires that we place whole host pages atomically;
+             * these may be huge pages for RAMBlocks that are backed by
+             * hugetlbfs.
              * To make it atomic, the data is read into a temporary page
              * that's moved into place later.
              * The migration protocol uses,  possibly smaller, target-pages
@@ -2381,9 +2384,9 @@ static int ram_load_postcopy(QEMUFile *f)
              * of a host page in order.
              */
             page_buffer = postcopy_host_page +
-                          ((uintptr_t)host & ~qemu_host_page_mask);
+                          ((uintptr_t)host & (block->page_size - 1));
             /* If all TP are zero then we can optimise the place */
-            if (!((uintptr_t)host & ~qemu_host_page_mask)) {
+            if (!((uintptr_t)host & (block->page_size - 1))) {
                 all_zero = true;
             } else {
                 /* not the 1st TP within the HP */
@@ -2401,7 +2404,7 @@ static int ram_load_postcopy(QEMUFile *f)
              * page
              */
             place_needed = (((uintptr_t)host + TARGET_PAGE_SIZE) &
-                                     ~qemu_host_page_mask) == 0;
+                                     (block->page_size - 1)) == 0;
             place_source = postcopy_host_page;
         }
         last_host = host;
-- 
2.9.3

* [Qemu-devel] [PATCH v2 11/16] postcopy: Mask fault addresses to huge page boundary
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:33 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Currently the fault address received by userfault is rounded to
the host page boundary and a host page is requested from the source.
Use the current RAMBlock page size instead of the general host page
size so that for RAMBlocks backed by huge pages we request the whole
huge page.
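
For example, a fault at offset 0x2a1234 within a RAMBlock backed by 2MB
pages is now masked down to 0x200000 and the whole 2MB starting there is
requested; with the old getpagesize()-based mask it would have been
rounded to 0x2a1000 and only a single 4KiB page requested.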

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 include/exec/memory.h    | 1 -
 migration/postcopy-ram.c | 7 +++----
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 987f925..c428891 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1614,7 +1614,6 @@ MemTxResult address_space_read_continue(AddressSpace *as, hwaddr addr,
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
                                     MemTxAttrs attrs, uint8_t *buf, int len);
 void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
-
 static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
 {
     if (is_write) {
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 4c736d2..03cbd6e 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -403,7 +403,6 @@ static void *postcopy_ram_fault_thread(void *opaque)
     MigrationIncomingState *mis = opaque;
     struct uffd_msg msg;
     int ret;
-    size_t hostpagesize = getpagesize();
     RAMBlock *rb = NULL;
     RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
 
@@ -470,7 +469,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
             break;
         }
 
-        rb_offset &= ~(hostpagesize - 1);
+        rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
         trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
                                                 qemu_ram_get_idstr(rb),
                                                 rb_offset);
@@ -482,11 +481,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
         if (rb != last_rb) {
             last_rb = rb;
             migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
-                                     rb_offset, hostpagesize);
+                                     rb_offset, qemu_ram_pagesize(rb));
         } else {
             /* Save some space */
             migrate_send_rp_req_pages(mis, NULL,
-                                     rb_offset, hostpagesize);
+                                     rb_offset, qemu_ram_pagesize(rb));
         }
     }
     trace_postcopy_ram_fault_thread_exit();
-- 
2.9.3

* [Qemu-devel] [PATCH v2 12/16] postcopy: Send whole huge pages
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:33 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The RAM save code uses ram_save_host_page to send whole
host pages at a time; change this to use the host page size associated
with the RAMBlock, which may be a huge page.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 migration/ram.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index 88d9444..2350f71 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1281,6 +1281,8 @@ static int ram_save_target_page(MigrationState *ms, QEMUFile *f,
  *                     offset to point into the middle of a host page
  *                     in which case the remainder of the hostpage is sent.
  *                     Only dirty target pages are sent.
+ *                     Note that the host page size may be a huge page for this
+ *                     block.
  *
  * Returns: Number of pages written.
  *
@@ -1299,6 +1301,8 @@ static int ram_save_host_page(MigrationState *ms, QEMUFile *f,
                               ram_addr_t dirty_ram_abs)
 {
     int tmppages, pages = 0;
+    size_t pagesize = qemu_ram_pagesize(pss->block);
+
     do {
         tmppages = ram_save_target_page(ms, f, pss, last_stage,
                                         bytes_transferred, dirty_ram_abs);
@@ -1309,7 +1313,7 @@ static int ram_save_host_page(MigrationState *ms, QEMUFile *f,
         pages += tmppages;
         pss->offset += TARGET_PAGE_SIZE;
         dirty_ram_abs += TARGET_PAGE_SIZE;
-    } while (pss->offset & (qemu_host_page_size - 1));
+    } while (pss->offset & (pagesize - 1));
 
     /* The offset we leave with is the last one we looked at */
     pss->offset -= TARGET_PAGE_SIZE;
-- 
2.9.3

* [Qemu-devel] [PATCH v2 13/16] postcopy: Allow hugepages
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:33 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Allow huge pages in postcopy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 migration/postcopy-ram.c | 25 +------------------------
 1 file changed, 1 insertion(+), 24 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 03cbd6e..6b30b43 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -85,24 +85,6 @@ static bool ufd_version_check(int ufd)
 }
 
 /*
- * Check for things that postcopy won't support; returns 0 if the block
- * is fine.
- */
-static int check_range(const char *block_name, void *host_addr,
-                      ram_addr_t offset, ram_addr_t length, void *opaque)
-{
-    RAMBlock *rb = qemu_ram_block_by_name(block_name);
-
-    if (qemu_ram_pagesize(rb) > getpagesize()) {
-        error_report("Postcopy doesn't support large page sizes yet (%s)",
-                     block_name);
-        return -E2BIG;
-    }
-
-    return 0;
-}
-
-/*
  * Note: This has the side effect of munlock'ing all of RAM, that's
  * normally fine since if the postcopy succeeds it gets turned back on at the
  * end.
@@ -122,12 +104,6 @@ bool postcopy_ram_supported_by_host(void)
         goto out;
     }
 
-    /* Check for anything about the RAMBlocks we don't support */
-    if (qemu_ram_foreach_block(check_range, NULL)) {
-        /* check_range will have printed its own error */
-        goto out;
-    }
-
     ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
     if (ufd == -1) {
         error_report("%s: userfaultfd not available: %s", __func__,
@@ -139,6 +115,7 @@ bool postcopy_ram_supported_by_host(void)
     if (!ufd_version_check(ufd)) {
         goto out;
     }
+    /* TODO: Only allow huge pages if the kernel supports it */
 
     /*
      * userfault and mlock don't go together; we'll put it back later if
-- 
2.9.3

* [Qemu-devel] [PATCH v2 14/16] postcopy: Update userfaultfd.h header
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:33 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

We use a new userfaultfd define, so update the header.
(Not needed if someone just runs the update script once it's
gone into the main kernel).

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 linux-headers/linux/userfaultfd.h | 81 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 71 insertions(+), 10 deletions(-)

diff --git a/linux-headers/linux/userfaultfd.h b/linux-headers/linux/userfaultfd.h
index 19e8453..a7c1a62 100644
--- a/linux-headers/linux/userfaultfd.h
+++ b/linux-headers/linux/userfaultfd.h
@@ -11,13 +11,19 @@
 
 #include <linux/types.h>
 
-#define UFFD_API ((__u64)0xAA)
 /*
- * After implementing the respective features it will become:
- * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
- *			      UFFD_FEATURE_EVENT_FORK)
+ * If the UFFDIO_API is upgraded someday, the UFFDIO_UNREGISTER and
+ * UFFDIO_WAKE ioctls should be defined as _IOW and not as _IOR.  In
+ * userfaultfd.h we assumed the kernel was reading (instead _IOC_READ
+ * means the userland is reading).
  */
-#define UFFD_API_FEATURES (0)
+#define UFFD_API ((__u64)0xAA)
+#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP |	\
+			   UFFD_FEATURE_EVENT_FORK |		\
+			   UFFD_FEATURE_EVENT_REMAP |		\
+			   UFFD_FEATURE_EVENT_MADVDONTNEED |	\
+			   UFFD_FEATURE_MISSING_HUGETLBFS |	\
+			   UFFD_FEATURE_MISSING_SHMEM)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -25,7 +31,11 @@
 #define UFFD_API_RANGE_IOCTLS			\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
-	 (__u64)1 << _UFFDIO_ZEROPAGE)
+	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
+	 (__u64)1 << _UFFDIO_WRITEPROTECT)
+#define UFFD_API_RANGE_IOCTLS_BASIC		\
+	((__u64)1 << _UFFDIO_WAKE |		\
+	 (__u64)1 << _UFFDIO_COPY)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -40,6 +50,7 @@
 #define _UFFDIO_WAKE			(0x02)
 #define _UFFDIO_COPY			(0x03)
 #define _UFFDIO_ZEROPAGE		(0x04)
+#define _UFFDIO_WRITEPROTECT		(0x05)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -56,6 +67,8 @@
 				      struct uffdio_copy)
 #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
 				      struct uffdio_zeropage)
+#define UFFDIO_WRITEPROTECT	_IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
+				      struct uffdio_writeprotect)
 
 /* read() structure */
 struct uffd_msg {
@@ -72,6 +85,21 @@ struct uffd_msg {
 		} pagefault;
 
 		struct {
+			__u32	ufd;
+		} fork;
+
+		struct {
+			__u64	from;
+			__u64	to;
+			__u64	len;
+		} remap;
+
+		struct {
+			__u64	start;
+			__u64	end;
+		} madv_dn;
+
+		struct {
 			/* unused reserved fields */
 			__u64	reserved1;
 			__u64	reserved2;
@@ -84,9 +112,9 @@ struct uffd_msg {
  * Start at 0x12 and not at 0 to be more strict against bugs.
  */
 #define UFFD_EVENT_PAGEFAULT	0x12
-#if 0 /* not available yet */
 #define UFFD_EVENT_FORK		0x13
-#endif
+#define UFFD_EVENT_REMAP	0x14
+#define UFFD_EVENT_MADVDONTNEED	0x15
 
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
@@ -104,11 +132,37 @@ struct uffdio_api {
 	 * Note: UFFD_EVENT_PAGEFAULT and UFFD_PAGEFAULT_FLAG_WRITE
 	 * are to be considered implicitly always enabled in all kernels as
 	 * long as the uffdio_api.api requested matches UFFD_API.
+	 *
+	 * UFFD_FEATURE_MISSING_HUGETLBFS means an UFFDIO_REGISTER
+	 * with UFFDIO_REGISTER_MODE_MISSING mode will succeed on
+	 * hugetlbfs virtual memory ranges. Adding or not adding
+	 * UFFD_FEATURE_MISSING_HUGETLBFS to uffdio_api.features has
+	 * no real functional effect after UFFDIO_API returns, but
+	 * it's only useful for an initial feature set probe at
+	 * UFFDIO_API time. There are two ways to use it:
+	 *
+	 * 1) by adding UFFD_FEATURE_MISSING_HUGETLBFS to the
+	 *    uffdio_api.features before calling UFFDIO_API, an error
+	 *    will be returned by UFFDIO_API on a kernel without
+	 *    hugetlbfs missing support
+	 *
+	 * 2) the UFFD_FEATURE_MISSING_HUGETLBFS can not be added in
+	 *    uffdio_api.features and instead it will be set by the
+	 *    kernel in the uffdio_api.features if the kernel supports
+	 *    it, so userland can later check if the feature flag is
+	 *    present in uffdio_api.features after UFFDIO_API
+	 *    succeeded.
+	 *
+	 * UFFD_FEATURE_MISSING_SHMEM works the same as
+	 * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem
+	 * (i.e. tmpfs and other shmem based APIs).
 	 */
-#if 0 /* not available yet */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
-#endif
+#define UFFD_FEATURE_EVENT_REMAP		(1<<2)
+#define UFFD_FEATURE_EVENT_MADVDONTNEED		(1<<3)
+#define UFFD_FEATURE_MISSING_HUGETLBFS		(1<<4)
+#define UFFD_FEATURE_MISSING_SHMEM		(1<<5)
 	__u64 features;
 
 	__u64 ioctls;
@@ -164,4 +218,11 @@ struct uffdio_zeropage {
 	__s64 zeropage;
 };
 
+struct uffdio_writeprotect {
+	struct uffdio_range range;
+	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
+#define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
+#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
+	__u64 mode;
+};
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread
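
The comment block added to the header above describes two ways for userland to
use the new UFFD_FEATURE_MISSING_HUGETLBFS bit. As a minimal sketch (not code
from this series; 'ufd' is assumed to be an already-open userfaultfd
descriptor), method (1) - requesting the feature up front so that UFFDIO_API
fails outright on a kernel without hugetlbfs userfault support - looks
roughly like:

    struct uffdio_api api = {
        .api      = UFFD_API,
        .features = UFFD_FEATURE_MISSING_HUGETLBFS,  /* ask for it explicitly */
    };

    if (ioctl(ufd, UFFDIO_API, &api) == -1) {
        /* Old kernel: no userfault-on-hugetlbfs support. */
    }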

* [Qemu-devel] [PATCH v2 15/16] postcopy: Check for userfault+hugepage feature
  2017-02-06 17:32 [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support Dr. David Alan Gilbert (git)
                   ` (13 preceding siblings ...)
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 14/16] postcopy: Update userfaultfd.h header Dr. David Alan Gilbert (git)
@ 2017-02-06 17:33 ` Dr. David Alan Gilbert (git)
  2017-02-24 16:12   ` Laurent Vivier
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 16/16] postcopy: Add doc about hugepages and postcopy Dr. David Alan Gilbert (git)
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:33 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

We need extra Linux kernel support (~4.11) for userfaults
on hugetlbfs; check for it.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 migration/postcopy-ram.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 6b30b43..102fb61 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -81,6 +81,17 @@ static bool ufd_version_check(int ufd)
         return false;
     }
 
+    if (getpagesize() != ram_pagesize_summary()) {
+        bool have_hp = false;
+        /* We've got a huge page */
+#ifdef UFFD_FEATURE_MISSING_HUGETLBFS
+        have_hp = api_struct.features & UFFD_FEATURE_MISSING_HUGETLBFS;
+#endif
+        if (!have_hp) {
+            error_report("Userfault on this host does not support huge pages");
+            return false;
+        }
+    }
     return true;
 }
 
@@ -115,7 +126,6 @@ bool postcopy_ram_supported_by_host(void)
     if (!ufd_version_check(ufd)) {
         goto out;
     }
-    /* TODO: Only allow huge pages if the kernel supports it */
 
     /*
      * userfault and mlock don't go together; we'll put it back later if
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread
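
The check added above boils down to method (2) from the header comment: call
UFFDIO_API without requesting any features and inspect what the kernel reports
back. A minimal standalone sketch of such a probe (assuming only the
userfaultfd ABI from the previous patch; error reporting trimmed):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/userfaultfd.h>

    /* Returns 1 if this kernel can handle userfault on hugetlbfs pages. */
    int host_supports_hugetlbfs_userfault(void)
    {
        struct uffdio_api api = { .api = UFFD_API, .features = 0 };
        int ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
        int ok = 0;

        if (ufd != -1) {
            if (!ioctl(ufd, UFFDIO_API, &api)) {
    #ifdef UFFD_FEATURE_MISSING_HUGETLBFS
                /* Feature bits are reported back even when not requested. */
                ok = !!(api.features & UFFD_FEATURE_MISSING_HUGETLBFS);
    #endif
            }
            close(ufd);
        }
        return ok;
    }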

* [Qemu-devel] [PATCH v2 16/16] postcopy: Add doc about hugepages and postcopy
  2017-02-06 17:32 [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support Dr. David Alan Gilbert (git)
                   ` (14 preceding siblings ...)
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 15/16] postcopy: Check for userfault+hugepage feature Dr. David Alan Gilbert (git)
@ 2017-02-06 17:33 ` Dr. David Alan Gilbert (git)
  2017-02-24 13:25   ` Juan Quintela
  2017-02-24 16:12   ` Laurent Vivier
  2017-02-06 17:45 ` [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support Dr. David Alan Gilbert
                   ` (2 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-02-06 17:33 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 docs/migration.txt | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/docs/migration.txt b/docs/migration.txt
index 6503c17..b462ead 100644
--- a/docs/migration.txt
+++ b/docs/migration.txt
@@ -482,3 +482,16 @@ request for a page that has already been sent is ignored.  Duplicate requests
 such as this can happen as a page is sent at about the same time the
 destination accesses it.
 
+=== Postcopy with hugepages ===
+
+Postcopy now works with hugetlbfs backed memory:
+  a) The linux kernel on the destination must support userfault on hugepages.
+  b) The huge-page configuration on the source and destination VMs must be
+     identical; i.e. RAMBlocks on both sides must use the same page size.
+  c) Note that -mem-path /dev/hugepages  will fall back to allocating normal
+     RAM if it doesn't have enough hugepages, triggering (b) to fail.
+     Using -mem-prealloc enforces the allocation using hugepages.
+  d) Care should be taken with the size of hugepage used; postcopy with 2MB
+     hugepages works well, however 1GB hugepages are likely to be problematic
+     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
+     and until the full page is transferred the destination thread is blocked.
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread
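
For reference, the kind of command line points (b) and (c) are talking about
(the same style that appears later in this thread; the 1G size is only an
example) looks like:

    -m 1024 \
    -object memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages \
    -numa node,memdev=mem \
    -mem-prealloc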

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-06 17:32 [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support Dr. David Alan Gilbert (git)
                   ` (15 preceding siblings ...)
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 16/16] postcopy: Add doc about hugepages and postcopy Dr. David Alan Gilbert (git)
@ 2017-02-06 17:45 ` Dr. David Alan Gilbert
       [not found]   ` <CGME20170213171108eucas1p147999fc8b6980ff89a67626b78b12e44@eucas1p1.samsung.com>
  2017-02-22 16:43 ` Laurent Vivier
  2017-02-24 10:04 ` Dr. David Alan Gilbert
  18 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-06 17:45 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: aarcange

* Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Hi,
>   The existing postcopy code, and the userfault kernel
> code that supports it, only works for normal anonymous memory.
> Kernel support for userfault on hugetlbfs is working
> it's way upstream; it's in the linux-mm tree,
> You can get a version at:
>    git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> on the origin/userfault branch.
> 
> Note that while this code supports arbitrary sized hugepages,
> it doesn't make sense with pages above the few-MB region,
> so while 2MB is fine, 1GB is probably a bad idea;
> this code waits for and transmits whole huge pages, and a
> 1GB page would take about 1 second to transfer over a 10Gbps
> link - which is way too long to pause the destination for.
> 
> Dave

Oops I missed the v2 changes from the message:

v2
  Flip ram-size summary word/compare individual page size patches around
  Individual page size comparison is done in ram_load if 'advise' has been
    received rather than checking migrate_postcopy_ram()
  Moved discard code into exec.c, reworked ram_discard_range

Dave

> Dr. David Alan Gilbert (16):
>   postcopy: Transmit ram size summary word
>   postcopy: Transmit and compare individual page sizes
>   postcopy: Chunk discards for hugepages
>   exec: ram_block_discard_range
>   postcopy: enhance ram_block_discard_range for hugepages
>   Fold postcopy_ram_discard_range into ram_discard_range
>   postcopy: Record largest page size
>   postcopy: Plumb pagesize down into place helpers
>   postcopy: Use temporary for placing zero huge pages
>   postcopy: Load huge pages in one go
>   postcopy: Mask fault addresses to huge page boundary
>   postcopy: Send whole huge pages
>   postcopy: Allow hugepages
>   postcopy: Update userfaultfd.h header
>   postcopy: Check for userfault+hugepage feature
>   postcopy: Add doc about hugepages and postcopy
> 
>  docs/migration.txt                |  13 ++++
>  exec.c                            |  83 +++++++++++++++++++++++
>  include/exec/cpu-common.h         |   2 +
>  include/exec/memory.h             |   1 -
>  include/migration/migration.h     |   3 +
>  include/migration/postcopy-ram.h  |  13 ++--
>  linux-headers/linux/userfaultfd.h |  81 +++++++++++++++++++---
>  migration/migration.c             |   1 +
>  migration/postcopy-ram.c          | 138 +++++++++++++++++---------------------
>  migration/ram.c                   | 109 ++++++++++++++++++------------
>  migration/savevm.c                |  32 ++++++---
>  migration/trace-events            |   2 +-
>  12 files changed, 328 insertions(+), 150 deletions(-)
> 
> -- 
> 2.9.3
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
       [not found]   ` <CGME20170213171108eucas1p147999fc8b6980ff89a67626b78b12e44@eucas1p1.samsung.com>
@ 2017-02-13 17:11     ` Alexey Perevalov
  2017-02-13 17:57       ` Andrea Arcangeli
  2017-02-13 18:16       ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 73+ messages in thread
From: Alexey Perevalov @ 2017-02-13 17:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: qemu-devel, quintela, aarcange

 Hello David!

I have checked your series with 1G hugepages, but only in a 1 Gbit/sec
network environment.
I started Ubuntu with just a console interface and gave it only 1G of
RAM; inside Ubuntu I started the stress command
(stress --cpu 4 --io 4 --vm 4 --vm-bytes 256000000 &).
In such an environment precopy live migration was impossible - it never
finished, and just kept sending pages indefinitely (it looks like the
dpkg scenario).

I also modified the stress utility
http://people.seas.harvard.edu/~apw/stress/stress-1.0.4.tar.gz
because it writes the same value `Z` into memory every time. My
modified version writes a newly incremented value on every allocation pass.
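
Roughly, the modified write loop looks like the following (a simplified
sketch based on the description above, not the exact patched code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    int main(void)
    {
        size_t bytes = 256000000;            /* matches --vm-bytes above */
        unsigned char *buf = malloc(bytes);
        unsigned value = 0;

        if (!buf) {
            return 1;
        }
        /* Each pass rewrites the whole buffer with a new value, then logs
         * "sec_since_epoch.microsec:value" - the format quoted below. */
        for (;;) {
            struct timeval tv;

            value++;
            memset(buf, value & 0xff, bytes);
            gettimeofday(&tv, NULL);
            printf("%ld.%06ld:%u\n", (long)tv.tv_sec, (long)tv.tv_usec, value);
            fflush(stdout);
        }
    }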

I'm using Arcangeli's kernel only at the destination.


I got somewhat contradictory results. Downtime for 1G hugepages is close to
that for 2MB hugepages - around 7 ms (in the 2MB hugepage scenario downtime
was around 8 ms).
I based that on query-migrate:
{"return": {"status": "completed", "setup-time": 6, "downtime": 6, "total-time": 9668, "ram": {"total": 1091379200, "postcopy-requests": 1, "dirty-sync-count": 2, "remaining": 0, "mbps": 879.786851, "transferred": 1063007296, "duplicate": 7449, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1060868096, "normal": 259001}}}

The documentation says the downtime field is measured in ms.


So I traced it (I added an additional trace call,
trace_postcopy_place_page_start(host, from, pagesize), into postcopy_place_page):

postcopy_ram_fault_thread_request Request for HVA=7f6dc0000000 rb=/objects/mem offset=0
postcopy_place_page_start host=0x7f6dc0000000 from=0x7f6d70000000, pagesize=40000000
postcopy_place_page_start host=0x7f6e0e800000 from=0x55b665969619, pagesize=1000
postcopy_place_page_start host=0x7f6e0e801000 from=0x55b6659684e8, pagesize=1000
several pages with 4Kb step ...
postcopy_place_page_start host=0x7f6e0e817000 from=0x55b6659694f0, pagesize=1000

4K pages, starting from address 0x7f6e0e800000 - that's
vga.ram, /rom@etc/acpi/tables etc.
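
For reference, the extra trace point amounts to roughly the following
(a sketch assuming the usual trace-events declaration style, not the
exact hunk used here):

    # in migration/trace-events
    postcopy_place_page_start(void *host, void *from, size_t size) "host=%p from=%p pagesize=0x%zx"

    /* at the top of postcopy_place_page() in migration/postcopy-ram.c */
    trace_postcopy_place_page_start(host, from, pagesize);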

Frankly speaking, right now I don't have any idea why the hugepage wasn't
resent. Maybe my expectation of it is wrong, as well as my understanding.)

The stress utility also logged the value for me into a file, in the form
sec_since_epoch.microsec:value:
1487003192.728493:22
1487003197.335362:23
*1487003213.367260:24*
*1487003238.480379:25*
1487003243.315299:26
1487003250.775721:27
1487003255.473792:28

That means rewriting the 256MB of memory byte-by-byte normally took around
5 sec, but at the moment of migration it took 25 sec.


One more request.
QEMU can use mem-path on hugetlbfs together with the share option
(-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on); the VM
in this case will start and work properly (it will allocate the memory
with mmap), but when such a region is the destination of a postcopy live
migration the UFFDIO_COPY ioctl will fail for it - in Arcangeli's git tree
there is a check preventing this
(if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED)).
Is it possible to handle such a situation in qemu?


On Mon, Feb 06, 2017 at 05:45:30PM +0000, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Hi,
> >   The existing postcopy code, and the userfault kernel
> > code that supports it, only works for normal anonymous memory.
> > Kernel support for userfault on hugetlbfs is working
> > it's way upstream; it's in the linux-mm tree,
> > You can get a version at:
> >    git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> > on the origin/userfault branch.
> > 
> > Note that while this code supports arbitrary sized hugepages,
> > it doesn't make sense with pages above the few-MB region,
> > so while 2MB is fine, 1GB is probably a bad idea;
> > this code waits for and transmits whole huge pages, and a
> > 1GB page would take about 1 second to transfer over a 10Gbps
> > link - which is way too long to pause the destination for.
> > 
> > Dave
> 
> Oops I missed the v2 changes from the message:
> 
> v2
>   Flip ram-size summary word/compare individual page size patches around
>   Individual page size comparison is done in ram_load if 'advise' has been
>     received rather than checking migrate_postcopy_ram()
>   Moved discard code into exec.c, reworked ram_discard_range
> 
> Dave

Thank you; right now it's not necessary to set the
postcopy-ram capability on the destination machine.


> > [...]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-13 17:11     ` Alexey Perevalov
@ 2017-02-13 17:57       ` Andrea Arcangeli
  2017-02-13 18:10         ` Andrea Arcangeli
  2017-02-14 14:48         ` Alexey Perevalov
  2017-02-13 18:16       ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2017-02-13 17:57 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: Dr. David Alan Gilbert, qemu-devel, quintela, kravetz

Hello,

On Mon, Feb 13, 2017 at 08:11:06PM +0300, Alexey Perevalov wrote:
> Another one request.
> QEMU could use mem_path in hugefs with share key simultaneously
> (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> in this case will start and will properly work (it will allocate memory
> with mmap), but in case of destination for postcopy live migration
> UFFDIO_COPY ioctl will fail for
> such region, in Arcangeli's git tree there is such prevent check
> (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> Is it possible to handle such situation at qemu?

It'd be nice to lift this hugetlbfs !VM_SHARED restriction, I agree; I
already asked Mike (CC'ed) why it is there, because I'm afraid it's a
leftover from the anon version, where VM_SHARED means a very different
thing, but it was already lifted for shmem. share=on should already
work on top of tmpfs, and also with THP on tmpfs enabled.

For hugetlbfs and shmem it should generally be more complicated to
cope with private mappings than shared ones; shared is just the native
form of the pseudofs, without having to deal with private COW aliases,
so it's hard to imagine something going wrong for VM_SHARED if the
MAP_PRIVATE mapping already works fine. If it turns out to be
superfluous, the check may just be turned into
"vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED".

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-13 17:57       ` Andrea Arcangeli
@ 2017-02-13 18:10         ` Andrea Arcangeli
  2017-02-13 21:59           ` Mike Kravetz
  2017-02-14 14:48         ` Alexey Perevalov
  1 sibling, 1 reply; 73+ messages in thread
From: Andrea Arcangeli @ 2017-02-13 18:10 UTC (permalink / raw)
  To: Alexey Perevalov
  Cc: Dr. David Alan Gilbert, qemu-devel, quintela, Mike Kravetz

On Mon, Feb 13, 2017 at 06:57:22PM +0100, Andrea Arcangeli wrote:
> Hello,
> 
> On Mon, Feb 13, 2017 at 08:11:06PM +0300, Alexey Perevalov wrote:
> > Another one request.
> > QEMU could use mem_path in hugefs with share key simultaneously
> > (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> > in this case will start and will properly work (it will allocate memory
> > with mmap), but in case of destination for postcopy live migration
> > UFFDIO_COPY ioctl will fail for
> > such region, in Arcangeli's git tree there is such prevent check
> > (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> > Is it possible to handle such situation at qemu?
> 
> It'd be nice to lift this hugetlbfs !VM_SHARED restriction I agree, I
> already asked Mike (CC'ed) why is there, because I'm afraid it's a

I Cc'ed a non-existent email address (a mail client autocompletion error);
corrected the CC.

> leftover from the anon version where VM_SHARED means a very different
> thing but it was already lifted for shmem. share=on should already
> work on top of tmpfs and also with THP on tmpfs enabled.
> 
> For hugetlbfs and shmem it should be generally more complicated to
> cope with private mappings than shared ones, shared is just the native
> form of the pseudofs without having to deal with private COWs aliases
> so it's hard to imagine something going wrong for VM_SHARED if the
> MAP_PRIVATE mapping already works fine. If it turns out to be
> superflous the check may be just turned into
> "vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED".
> 
> Thanks,
> Andrea

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-13 17:11     ` Alexey Perevalov
  2017-02-13 17:57       ` Andrea Arcangeli
@ 2017-02-13 18:16       ` Dr. David Alan Gilbert
  2017-02-14 16:22         ` Alexey Perevalov
  1 sibling, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-13 18:16 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: qemu-devel, quintela, aarcange

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
>  Hello David!

Hi Alexey,

> I have checked you series with 1G hugepage, but only in 1 Gbit/sec network
> environment.

Can you show the qemu command line you're using?  I'm just trying
to make sure I understand where your hugepages are; running 1G hostpages
across a 1Gbit/sec network for postcopy would be pretty poor - it would take
~10 seconds to transfer the page.

> I started Ubuntu just with console interface and gave to it only 1G of
> RAM, inside Ubuntu I started stress command

> (stress --cpu 4 --io 4 --vm 4 --vm-bytes 256000000 &)
> in such environment precopy live migration was impossible, it never
> being finished, in this case it infinitely sends pages (it looks like
> dpkg scenario).
> 
> Also I modified stress utility
> http://people.seas.harvard.edu/~apw/stress/stress-1.0.4.tar.gz
> due to it wrote into memory every time the same value `Z`. My
> modified version writes every allocation new incremented value.

I use google's stressapptest normally; although remember to turn
off the bit where it pauses.

> I'm using Arcangeli's kernel only at the destination.
> 
> I got controversial results. Downtime for 1G hugepage is close to 2Mb
> hugepage and it took around 7 ms (in 2Mb hugepage scenario downtime was
> around 8 ms).
> I made that opinion by query-migrate.
> {"return": {"status": "completed", "setup-time": 6, "downtime": 6, "total-time": 9668, "ram": {"total": 1091379200, "postcopy-requests": 1, "dirty-sync-count": 2, "remaining": 0, "mbps": 879.786851, "transferred": 1063007296, "duplicate": 7449, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1060868096, "normal": 259001}}}
> 
> Documentation says about downtime field - measurement unit is ms.

The downtime measurement field is pretty meaningless for postcopy; it's only
the time from stopping the VM until the point where we tell the destination it
can start running.  Meaningful measurements really only come from inside the
guest, or from the page-place latencies.

> So I traced it (I added additional trace into postcopy_place_page
> trace_postcopy_place_page_start(host, from, pagesize); )
> 
> postcopy_ram_fault_thread_request Request for HVA=7f6dc0000000 rb=/objects/mem offset=0
> postcopy_place_page_start host=0x7f6dc0000000 from=0x7f6d70000000, pagesize=40000000
> postcopy_place_page_start host=0x7f6e0e800000 from=0x55b665969619, pagesize=1000
> postcopy_place_page_start host=0x7f6e0e801000 from=0x55b6659684e8, pagesize=1000
> several pages with 4Kb step ...
> postcopy_place_page_start host=0x7f6e0e817000 from=0x55b6659694f0, pagesize=1000
> 
> 4K pages, started from 0x7f6e0e800000 address it's
> vga.ram, /rom@etc/acpi/tables etc.
> 
> Frankly saying, right now, I don't have any ideas why hugepage wasn't
> resent. Maybe my expectation of it is wrong as well as understanding )

That's pretty much what I expect to see - before you get into postcopy
mode everything is sent as individual 4k pages (in order); once we're
in postcopy mode we send each page no more than once.  So your
huge page comes across once - and there it is.

> stress utility also duplicated for me value into appropriate file:
> sec_since_epoch.microsec:value
> 1487003192.728493:22
> 1487003197.335362:23
> *1487003213.367260:24*
> *1487003238.480379:25*
> 1487003243.315299:26
> 1487003250.775721:27
> 1487003255.473792:28
> 
> It mean rewriting 256Mb of memory per byte took around 5 sec, but at
> the moment of migration it took 25 sec.

Right, now this is the thing that's more useful to measure.
That's not too surprising; when it migrates, that data is changing rapidly,
so it's going to have to pause and wait for that whole 1GB to be transferred.
Your 1Gbps network is going to take about 10 seconds to transfer that
1GB page - and that's if you're lucky and it saturates the network.
So it's going to take at least 10 seconds longer than it normally
would, plus any other overheads - so at least 15 seconds.
This is why I say it's a bad idea to use 1GB host pages with postcopy.
Of course it would be fun to find where the other 10 seconds went!
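
(Back-of-envelope, using round numbers rather than anything measured here:
1GB is roughly 8-8.6 gigabits, so even a fully saturated 1Gbps link needs
about 8-9 seconds for a single 1GB page, and a 10Gbps link about 0.9 seconds
- which is where the "about 10 seconds" above and the cover letter's "about
1 second" figures come from.)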

You might like to add timing to the tracing so you can see the time between the
fault thread requesting the page and it arriving.

> Another one request.
> QEMU could use mem_path in hugefs with share key simultaneously
> (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> in this case will start and will properly work (it will allocate memory
> with mmap), but in case of destination for postcopy live migration
> UFFDIO_COPY ioctl will fail for
> such region, in Arcangeli's git tree there is such prevent check
> (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> Is it possible to handle such situation at qemu?

Imagine that you had shared memory; what semantics would you like
to see?  What happens to the other process?

Dave

> [...]
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-13 18:10         ` Andrea Arcangeli
@ 2017-02-13 21:59           ` Mike Kravetz
  0 siblings, 0 replies; 73+ messages in thread
From: Mike Kravetz @ 2017-02-13 21:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Alexey Perevalov
  Cc: Dr. David Alan Gilbert, qemu-devel, quintela

On 02/13/2017 10:10 AM, Andrea Arcangeli wrote:
> On Mon, Feb 13, 2017 at 06:57:22PM +0100, Andrea Arcangeli wrote:
>> Hello,
>>
>> On Mon, Feb 13, 2017 at 08:11:06PM +0300, Alexey Perevalov wrote:
>>> Another one request.
>>> QEMU could use mem_path in hugefs with share key simultaneously
>>> (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
>>> in this case will start and will properly work (it will allocate memory
>>> with mmap), but in case of destination for postcopy live migration
>>> UFFDIO_COPY ioctl will fail for
>>> such region, in Arcangeli's git tree there is such prevent check
>>> (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
>>> Is it possible to handle such situation at qemu?
>>
>> It'd be nice to lift this hugetlbfs !VM_SHARED restriction I agree, I
>> already asked Mike (CC'ed) why is there, because I'm afraid it's a
> 
> Cc'ed not existent email, mail client autocompletion error, corrected
> the CC.
> 
>> leftover from the anon version where VM_SHARED means a very different
>> thing but it was already lifted for shmem. share=on should already
>> work on top of tmpfs and also with THP on tmpfs enabled.
>>
>> For hugetlbfs and shmem it should be generally more complicated to
>> cope with private mappings than shared ones, shared is just the native
>> form of the pseudofs without having to deal with private COWs aliases
>> so it's hard to imagine something going wrong for VM_SHARED if the
>> MAP_PRIVATE mapping already works fine. If it turns out to be
>> superflous the check may be just turned into
>> "vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED".
>>
>> Thanks,
>> Andrea

Sorry, I did not see this e-mail earlier.

Andrea is correct in that the VM_SHARED restriction for hugetlbfs was there
to make the code common with the anon version.  The use case I had was
simply to 'catch' no-page hugetlbfs faults, private -or- shared.  That is why
you can register hugetlbfs shared regions.

I can take a look at what it would take to enable copy, and agree with Andrea
that it should be relatively easy.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-13 17:57       ` Andrea Arcangeli
  2017-02-13 18:10         ` Andrea Arcangeli
@ 2017-02-14 14:48         ` Alexey Perevalov
  2017-02-17 16:47           ` Andrea Arcangeli
  1 sibling, 1 reply; 73+ messages in thread
From: Alexey Perevalov @ 2017-02-14 14:48 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: quintela, kravetz, Dr. David Alan Gilbert, qemu-devel

On Mon, Feb 13, 2017 at 06:57:22PM +0100, Andrea Arcangeli wrote:
> Hello,
> 
> On Mon, Feb 13, 2017 at 08:11:06PM +0300, Alexey Perevalov wrote:
> > Another one request.
> > QEMU could use mem_path in hugefs with share key simultaneously
> > (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> > in this case will start and will properly work (it will allocate memory
> > with mmap), but in case of destination for postcopy live migration
> > UFFDIO_COPY ioctl will fail for
> > such region, in Arcangeli's git tree there is such prevent check
> > (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> > Is it possible to handle such situation at qemu?
> 
> It'd be nice to lift this hugetlbfs !VM_SHARED restriction I agree, I
> already asked Mike (CC'ed) why is there, because I'm afraid it's a
> leftover from the anon version where VM_SHARED means a very different
> thing but it was already lifted for shmem. share=on should already
> work on top of tmpfs and also with THP on tmpfs enabled.
> 
> For hugetlbfs and shmem it should be generally more complicated to
> cope with private mappings than shared ones, shared is just the native
> form of the pseudofs without having to deal with private COWs aliases
> so it's hard to imagine something going wrong for VM_SHARED if the
> MAP_PRIVATE mapping already works fine. If it turns out to be
> superflous the check may be just turned into
> "vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED".

Great. As far as I know, -netdev type=vhost-user requires share=on in
-object memory-backend in the ovs-dpdk scenario:
http://wiki.qemu-project.org/Documentation/vhost-user-ovs-dpdk

> 
> Thanks,
> Andrea
>

BR,
Alexey

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-13 18:16       ` Dr. David Alan Gilbert
@ 2017-02-14 16:22         ` Alexey Perevalov
  2017-02-14 19:34           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 73+ messages in thread
From: Alexey Perevalov @ 2017-02-14 16:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: aarcange, qemu-devel, quintela

Hi David,

Thank you, now it's clear.

On Mon, Feb 13, 2017 at 06:16:02PM +0000, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> >  Hello David!
> 
> Hi Alexey,
> 
> > I have checked you series with 1G hugepage, but only in 1 Gbit/sec network
> > environment.
> 
> Can you show the qemu command line you're using?  I'm just trying
> to make sure I understand where your hugepages are; running 1G hostpages
> across a 1Gbit/sec network for postcopy would be pretty poor - it would take
> ~10 seconds to transfer the page.

sure
-hda ./Ubuntu.img -name PAU,debug-threads=on -boot d -net nic -net user
-m 1024 -localtime -nographic -enable-kvm -incoming tcp:0:4444 -object
memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages -mem-prealloc
-numa node,memdev=mem -trace events=/tmp/events -chardev
socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control
> 
> > I started Ubuntu just with console interface and gave to it only 1G of
> > RAM, inside Ubuntu I started stress command
> 
> > (stress --cpu 4 --io 4 --vm 4 --vm-bytes 256000000 &)
> > in such environment precopy live migration was impossible, it never
> > being finished, in this case it infinitely sends pages (it looks like
> > dpkg scenario).
> > 
> > Also I modified stress utility
> > http://people.seas.harvard.edu/~apw/stress/stress-1.0.4.tar.gz
> > due to it wrote into memory every time the same value `Z`. My
> > modified version writes every allocation new incremented value.
> 
> I use google's stressapptest normally; although remember to turn
> off the bit where it pauses.

I decided to use it too
stressapptest -s 300 -M 256 -m 8 -W

> 
> > I'm using Arcangeli's kernel only at the destination.
> > 
> > I got controversial results. Downtime for 1G hugepage is close to 2Mb
> > hugepage and it took around 7 ms (in 2Mb hugepage scenario downtime was
> > around 8 ms).
> > I made that opinion by query-migrate.
> > {"return": {"status": "completed", "setup-time": 6, "downtime": 6, "total-time": 9668, "ram": {"total": 1091379200, "postcopy-requests": 1, "dirty-sync-count": 2, "remaining": 0, "mbps": 879.786851, "transferred": 1063007296, "duplicate": 7449, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1060868096, "normal": 259001}}}
> > 
> > Documentation says about downtime field - measurement unit is ms.
> 
> The downtime measurement field is pretty meaningless for postcopy; it's only
> the time from stopping the VM until the point where we tell the destination it
> can start running.  Meaningful measurements are only from inside the guest
> really, or the place latencys.
>

Maybe we could improve it by receiving such information from the destination?
I'd like to do that.
> > So I traced it (I added additional trace into postcopy_place_page
> > trace_postcopy_place_page_start(host, from, pagesize); )
> > 
> > postcopy_ram_fault_thread_request Request for HVA=7f6dc0000000 rb=/objects/mem offset=0
> > postcopy_place_page_start host=0x7f6dc0000000 from=0x7f6d70000000, pagesize=40000000
> > postcopy_place_page_start host=0x7f6e0e800000 from=0x55b665969619, pagesize=1000
> > postcopy_place_page_start host=0x7f6e0e801000 from=0x55b6659684e8, pagesize=1000
> > several pages with 4Kb step ...
> > postcopy_place_page_start host=0x7f6e0e817000 from=0x55b6659694f0, pagesize=1000
> > 
> > 4K pages, started from 0x7f6e0e800000 address it's
> > vga.ram, /rom@etc/acpi/tables etc.
> > 
> > Frankly saying, right now, I don't have any ideas why hugepage wasn't
> > resent. Maybe my expectation of it is wrong as well as understanding )
> 
> That's pretty much what I expect to see - before you get into postcopy
> mode everything is sent as individual 4k pages (in order); once we're
> in postcopy mode we send each page no more than once.  So you're
> huge page comes across once - and there it is.
> 
> > stress utility also duplicated for me value into appropriate file:
> > sec_since_epoch.microsec:value
> > 1487003192.728493:22
> > 1487003197.335362:23
> > *1487003213.367260:24*
> > *1487003238.480379:25*
> > 1487003243.315299:26
> > 1487003250.775721:27
> > 1487003255.473792:28
> > 
> > It mean rewriting 256Mb of memory per byte took around 5 sec, but at
> > the moment of migration it took 25 sec.
> 
> right, now this is the thing that's more useful to measure.
> That's not too surprising; when it migrates that data is changing rapidly
> so it's going to have to pause and wait for that whole 1GB to be transferred.
> Your 1Gbps network is going to take about 10 seconds to transfer that
> 1GB page - and that's if you're lucky and it saturates the network.
> SO it's going to take at least 10 seconds longer than it normally
> would, plus any other overheads - so at least 15 seconds.
> This is why I say it's a bad idea to use 1GB host pages with postcopy.
> Of course it would be fun to find where the other 10 seconds went!
> 
> You might like to add timing to the tracing so you can see the time between the
> fault thread requesting the page and it arriving.
>
yes, sorry I forgot about timing
20806@1487084818.270993:postcopy_ram_fault_thread_request Request for HVA=7f0280000000 rb=/objects/mem offset=0
20806@1487084818.271038:qemu_loadvm_state_section 8
20806@1487084818.271056:loadvm_process_command com=0x2 len=4
20806@1487084818.271089:qemu_loadvm_state_section 2
20806@1487084823.315919:postcopy_place_page_start host=0x7f0280000000 from=0x7f0240000000, pagesize=40000000

1487084823.315919 - 1487084818.270993 = 5.044926 sec.
Machines connected w/o any routers, directly by cable.

> > Another one request.
> > QEMU could use mem_path in hugefs with share key simultaneously
> > (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> > in this case will start and will properly work (it will allocate memory
> > with mmap), but in case of destination for postcopy live migration
> > UFFDIO_COPY ioctl will fail for
> > such region, in Arcangeli's git tree there is such prevent check
> > (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> > Is it possible to handle such situation at qemu?
> 
> Imagine that you had shared memory; what semantics would you like
> to see ?  What happens to the other process?

Honestly, initially I thought about handling such an error, but I quite
forgot about vhost-user in ovs-dpdk.

> Dave
> 
> > [...]

-- 

BR
Alexey

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-14 16:22         ` Alexey Perevalov
@ 2017-02-14 19:34           ` Dr. David Alan Gilbert
  2017-02-21  7:31             ` Alexey Perevalov
  0 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-14 19:34 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: aarcange, qemu-devel, quintela

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> Hi David,
> 
> Thank your, now it's clear.
> 
> On Mon, Feb 13, 2017 at 06:16:02PM +0000, Dr. David Alan Gilbert wrote:
> > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > >  Hello David!
> > 
> > Hi Alexey,
> > 
> > > I have checked you series with 1G hugepage, but only in 1 Gbit/sec network
> > > environment.
> > 
> > Can you show the qemu command line you're using?  I'm just trying
> > to make sure I understand where your hugepages are; running 1G hostpages
> > across a 1Gbit/sec network for postcopy would be pretty poor - it would take
> > ~10 seconds to transfer the page.
> 
> sure
> -hda ./Ubuntu.img -name PAU,debug-threads=on -boot d -net nic -net user
> -m 1024 -localtime -nographic -enable-kvm -incoming tcp:0:4444 -object
> memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages -mem-prealloc
> -numa node,memdev=mem -trace events=/tmp/events -chardev
> socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control

OK, it's a pretty unusual setup - a 1G page guest with 1G of guest RAM.

> > 
> > > I started Ubuntu just with console interface and gave to it only 1G of
> > > RAM, inside Ubuntu I started stress command
> > 
> > > (stress --cpu 4 --io 4 --vm 4 --vm-bytes 256000000 &)
> > > in such environment precopy live migration was impossible, it never
> > > being finished, in this case it infinitely sends pages (it looks like
> > > dpkg scenario).
> > > 
> > > Also I modified stress utility
> > > http://people.seas.harvard.edu/~apw/stress/stress-1.0.4.tar.gz
> > > due to it wrote into memory every time the same value `Z`. My
> > > modified version writes every allocation new incremented value.
> > 
> > I use google's stressapptest normally; although remember to turn
> > off the bit where it pauses.
> 
> I decided to use it too
> stressapptest -s 300 -M 256 -m 8 -W
> 
> > 
> > > I'm using Arcangeli's kernel only at the destination.
> > > 
> > > I got controversial results. Downtime for 1G hugepage is close to 2Mb
> > > hugepage and it took around 7 ms (in 2Mb hugepage scenario downtime was
> > > around 8 ms).
> > > I made that opinion by query-migrate.
> > > {"return": {"status": "completed", "setup-time": 6, "downtime": 6, "total-time": 9668, "ram": {"total": 1091379200, "postcopy-requests": 1, "dirty-sync-count": 2, "remaining": 0, "mbps": 879.786851, "transferred": 1063007296, "duplicate": 7449, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1060868096, "normal": 259001}}}
> > > 
> > > Documentation says about downtime field - measurement unit is ms.
> > 
> > The downtime measurement field is pretty meaningless for postcopy; it's only
> > the time from stopping the VM until the point where we tell the destination it
> > can start running.  Meaningful measurements are only from inside the guest
> > really, or the place latencys.
> >
> 
> Maybe improve it by receiving such information from destination?
> I wish to do that.
> > > So I traced it (I added additional trace into postcopy_place_page
> > > trace_postcopy_place_page_start(host, from, pagesize); )
> > > 
> > > postcopy_ram_fault_thread_request Request for HVA=7f6dc0000000 rb=/objects/mem offset=0
> > > postcopy_place_page_start host=0x7f6dc0000000 from=0x7f6d70000000, pagesize=40000000
> > > postcopy_place_page_start host=0x7f6e0e800000 from=0x55b665969619, pagesize=1000
> > > postcopy_place_page_start host=0x7f6e0e801000 from=0x55b6659684e8, pagesize=1000
> > > several pages with 4Kb step ...
> > > postcopy_place_page_start host=0x7f6e0e817000 from=0x55b6659694f0, pagesize=1000
> > > 
> > > 4K pages, started from 0x7f6e0e800000 address it's
> > > vga.ram, /rom@etc/acpi/tables etc.
> > > 
> > > Frankly saying, right now, I don't have any ideas why hugepage wasn't
> > > resent. Maybe my expectation of it is wrong as well as understanding )
> > 
> > That's pretty much what I expect to see - before you get into postcopy
> > mode everything is sent as individual 4k pages (in order); once we're
> > in postcopy mode we send each page no more than once.  So you're
> > huge page comes across once - and there it is.
> > 
> > > stress utility also duplicated for me value into appropriate file:
> > > sec_since_epoch.microsec:value
> > > 1487003192.728493:22
> > > 1487003197.335362:23
> > > *1487003213.367260:24*
> > > *1487003238.480379:25*
> > > 1487003243.315299:26
> > > 1487003250.775721:27
> > > 1487003255.473792:28
> > > 
> > > It mean rewriting 256Mb of memory per byte took around 5 sec, but at
> > > the moment of migration it took 25 sec.
> > 
> > right, now this is the thing that's more useful to measure.
> > That's not too surprising; when it migrates that data is changing rapidly
> > so it's going to have to pause and wait for that whole 1GB to be transferred.
> > Your 1Gbps network is going to take about 10 seconds to transfer that
> > 1GB page - and that's if you're lucky and it saturates the network.
> > SO it's going to take at least 10 seconds longer than it normally
> > would, plus any other overheads - so at least 15 seconds.
> > This is why I say it's a bad idea to use 1GB host pages with postcopy.
> > Of course it would be fun to find where the other 10 seconds went!
> > 
> > You might like to add timing to the tracing so you can see the time between the
> > fault thread requesting the page and it arriving.
> >
> yes, sorry I forgot about timing
> 20806@1487084818.270993:postcopy_ram_fault_thread_request Request for HVA=7f0280000000 rb=/objects/mem offset=0
> 20806@1487084818.271038:qemu_loadvm_state_section 8
> 20806@1487084818.271056:loadvm_process_command com=0x2 len=4
> 20806@1487084818.271089:qemu_loadvm_state_section 2
> 20806@1487084823.315919:postcopy_place_page_start host=0x7f0280000000 from=0x7f0240000000, pagesize=40000000
> 
> 1487084823.315919 - 1487084818.270993 = 5.044926 sec.
> Machines connected w/o any routers, directly by cable.

OK, the fact it's only 5 seconds rather than 10 suggests, I think, that a lot
of the memory was all zero and so didn't take up the whole bandwidth.
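
(Rough arithmetic: 5 seconds at a saturated 1Gbps is only around 600MB on
the wire, so on that reading something like 400MB of the 1GB page would have
been zero pages that cost almost nothing to send.)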

> > > Another one request.
> > > QEMU could use mem_path in hugefs with share key simultaneously
> > > (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> > > in this case will start and will properly work (it will allocate memory
> > > with mmap), but in case of destination for postcopy live migration
> > > UFFDIO_COPY ioctl will fail for
> > > such region, in Arcangeli's git tree there is such prevent check
> > > (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> > > Is it possible to handle such situation at qemu?
> > 
> > Imagine that you had shared memory; what semantics would you like
> > to see ?  What happens to the other process?
> 
> Honestly, initially, I thought to handle such error, but I quit forgot
> about vhost-user in ovs-dpdk.

Yes, I don't know much about vhost-user; but we'll have to think carefully
about the way things behave when they're accessing memory that's shared
with qemu during migration.  Writing to the source after we've started
the postcopy phase is not allowed.  Accessing the destination memory
during postcopy will produce pauses in the other processes accessing it
(I think) and they mustn't do various types of madvise etc - so
I'm sure there will be things we find out the hard way!

Dave

> [...]
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-14 14:48         ` Alexey Perevalov
@ 2017-02-17 16:47           ` Andrea Arcangeli
  2017-02-20 16:01             ` Alexey Perevalov
  0 siblings, 1 reply; 73+ messages in thread
From: Andrea Arcangeli @ 2017-02-17 16:47 UTC (permalink / raw)
  To: Alexey Perevalov
  Cc: quintela, kravetz, Dr. David Alan Gilbert, qemu-devel, Mike Kravetz

Hello Alexey,

On Tue, Feb 14, 2017 at 05:48:25PM +0300, Alexey Perevalov wrote:
> On Mon, Feb 13, 2017 at 06:57:22PM +0100, Andrea Arcangeli wrote:
> > Hello,
> > 
> > On Mon, Feb 13, 2017 at 08:11:06PM +0300, Alexey Perevalov wrote:
> > > Another one request.
> > > QEMU could use mem_path in hugefs with share key simultaneously
> > > (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> > > in this case will start and will properly work (it will allocate memory
> > > with mmap), but in case of destination for postcopy live migration
> > > UFFDIO_COPY ioctl will fail for
> > > such region, in Arcangeli's git tree there is such prevent check
> > > (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> > > Is it possible to handle such situation at qemu?
> > 
> > It'd be nice to lift this hugetlbfs !VM_SHARED restriction I agree, I
> > already asked Mike (CC'ed) why is there, because I'm afraid it's a
> > leftover from the anon version where VM_SHARED means a very different
> > thing but it was already lifted for shmem. share=on should already
> > work on top of tmpfs and also with THP on tmpfs enabled.
> > 
> > For hugetlbfs and shmem it should be generally more complicated to
> > cope with private mappings than shared ones, shared is just the native
> > form of the pseudofs without having to deal with private COWs aliases
> > so it's hard to imagine something going wrong for VM_SHARED if the
> > MAP_PRIVATE mapping already works fine. If it turns out to be
> > superflous the check may be just turned into
> > "vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED".
> 
> Great, as I know  -netdev type=vhost-user requires share=on in
> -object memory-backend in ovs-dpdk scenario
> http://wiki.qemu-project.org/Documentation/vhost-user-ovs-dpdk

share=on should work now with the current aa.git userfault branch, and the
support is already included in -mm; it should all get merged upstream
in kernel 4.11.

Could you test the current aa.git userfault branch to verify postcopy
live migration works fine on hugetlbfs share=on?

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-17 16:47           ` Andrea Arcangeli
@ 2017-02-20 16:01             ` Alexey Perevalov
  0 siblings, 0 replies; 73+ messages in thread
From: Alexey Perevalov @ 2017-02-20 16:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, Mike Kravetz, kravetz, Dr. David Alan Gilbert, quintela

Hello Andrea,


On Fri, Feb 17, 2017 at 05:47:30PM +0100, Andrea Arcangeli wrote:
> Hello Alexey,
> 
> On Tue, Feb 14, 2017 at 05:48:25PM +0300, Alexey Perevalov wrote:
> > On Mon, Feb 13, 2017 at 06:57:22PM +0100, Andrea Arcangeli wrote:
> > > Hello,
> > > 
> > > On Mon, Feb 13, 2017 at 08:11:06PM +0300, Alexey Perevalov wrote:
> > > > Another one request.
> > > > QEMU could use mem_path in hugefs with share key simultaneously
> > > > (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> > > > in this case will start and will properly work (it will allocate memory
> > > > with mmap), but in case of destination for postcopy live migration
> > > > UFFDIO_COPY ioctl will fail for
> > > > such region, in Arcangeli's git tree there is such prevent check
> > > > (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> > > > Is it possible to handle such situation at qemu?
> > > 
> > > It'd be nice to lift this hugetlbfs !VM_SHARED restriction I agree, I
> > > already asked Mike (CC'ed) why is there, because I'm afraid it's a
> > > leftover from the anon version where VM_SHARED means a very different
> > > thing but it was already lifted for shmem. share=on should already
> > > work on top of tmpfs and also with THP on tmpfs enabled.
> > > 
> > > For hugetlbfs and shmem it should be generally more complicated to
> > > cope with private mappings than shared ones, shared is just the native
> > > form of the pseudofs without having to deal with private COWs aliases
> > > so it's hard to imagine something going wrong for VM_SHARED if the
> > > MAP_PRIVATE mapping already works fine. If it turns out to be
> > > superflous the check may be just turned into
> > > "vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED".
> > 
> > Great, as I know  -netdev type=vhost-user requires share=on in
> > -object memory-backend in ovs-dpdk scenario
> > http://wiki.qemu-project.org/Documentation/vhost-user-ovs-dpdk
> 
> share=on should work now with current aa.git userfault branch, and the
> support is already included in -mm, it should all get merged upstream
> in kernel 4.11.
> 
> Could you test the current aa.git userfault branch to verify postcopy
> live migration works fine on hugetlbfs share=on?
>

Yes, I had already tried your suggestion of using the alternative check
"vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED", but in
that case the dst page was still anonymous after the ioctl had passed
successfully.

There is no such bug in the latest aa.git now:
"userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings"
solved the issue of the page staying anonymous after UFFDIO_COPY.
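
For reference, a minimal standalone sketch of the operation under
discussion - UFFDIO_COPY into a MAP_SHARED hugetlbfs mapping, i.e. the
share=on shape.  It assumes /dev/hugepages is a mounted 2MB hugetlbfs
and a kernel with the hugetlbfs userfault support; the file name is made
up, error handling is minimal, and this is not code from the series:

#include <linux/userfaultfd.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#define HPAGE (2UL * 1024 * 1024)

static void die(const char *what) { perror(what); exit(1); }

int main(void)
{
    int fd = open("/dev/hugepages/uffd-copy-test", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, HPAGE)) die("hugetlbfs file");

    /* MAP_SHARED hugetlbfs mapping: the same shape as share=on */
    void *dst = mmap(NULL, HPAGE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *src = mmap(NULL, HPAGE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED || src == MAP_FAILED) die("mmap");
    memset(src, 0x5a, HPAGE);

    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
    if (uffd < 0) die("userfaultfd");

    struct uffdio_api api = { .api = UFFD_API };
    if (ioctl(uffd, UFFDIO_API, &api)) die("UFFDIO_API");

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)dst, .len = HPAGE },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg)) die("UFFDIO_REGISTER");

    /* the ioctl that failed for MAP_SHARED hugetlbfs before the fix */
    struct uffdio_copy copy = {
        .dst = (unsigned long)dst, .src = (unsigned long)src,
        .len = HPAGE, .mode = 0,
    };
    if (ioctl(uffd, UFFDIO_COPY, &copy)) die("UFFDIO_COPY");

    printf("placed page, first byte now 0x%x\n", *(unsigned char *)dst);
    return 0;
}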


> Thanks!
> Andrea
> 

-- 

BR
Alexey

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-14 19:34           ` Dr. David Alan Gilbert
@ 2017-02-21  7:31             ` Alexey Perevalov
  2017-02-21 10:03               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 73+ messages in thread
From: Alexey Perevalov @ 2017-02-21  7:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: aarcange, qemu-devel, quintela


Hello David,

On Tue, Feb 14, 2017 at 07:34:26PM +0000, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > Hi David,
> > 
> > Thank your, now it's clear.
> > 
> > On Mon, Feb 13, 2017 at 06:16:02PM +0000, Dr. David Alan Gilbert wrote:
> > > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > >  Hello David!
> > > 
> > > Hi Alexey,
> > > 
> > > > I have checked you series with 1G hugepage, but only in 1 Gbit/sec network
> > > > environment.
> > > 
> > > Can you show the qemu command line you're using?  I'm just trying
> > > to make sure I understand where your hugepages are; running 1G hostpages
> > > across a 1Gbit/sec network for postcopy would be pretty poor - it would take
> > > ~10 seconds to transfer the page.
> > 
> > sure
> > -hda ./Ubuntu.img -name PAU,debug-threads=on -boot d -net nic -net user
> > -m 1024 -localtime -nographic -enable-kvm -incoming tcp:0:4444 -object
> > memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages -mem-prealloc
> > -numa node,memdev=mem -trace events=/tmp/events -chardev
> > socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock,server,nowait
> > -mon chardev=charmonitor,id=monitor,mode=control
> 
> OK, it's a pretty unusual setup - a 1G page guest with 1G of guest RAM.
> 
> > > 
> > > > I started Ubuntu just with console interface and gave to it only 1G of
> > > > RAM, inside Ubuntu I started stress command
> > > 
> > > > (stress --cpu 4 --io 4 --vm 4 --vm-bytes 256000000 &)
> > > > in such environment precopy live migration was impossible, it never
> > > > being finished, in this case it infinitely sends pages (it looks like
> > > > dpkg scenario).
> > > > 
> > > > Also I modified stress utility
> > > > http://people.seas.harvard.edu/~apw/stress/stress-1.0.4.tar.gz
> > > > due to it wrote into memory every time the same value `Z`. My
> > > > modified version writes every allocation new incremented value.
> > > 
> > > I use google's stressapptest normally; although remember to turn
> > > off the bit where it pauses.
> > 
> > I decided to use it too
> > stressapptest -s 300 -M 256 -m 8 -W
> > 
> > > 
> > > > I'm using Arcangeli's kernel only at the destination.
> > > > 
> > > > I got controversial results. Downtime for 1G hugepage is close to 2Mb
> > > > hugepage and it took around 7 ms (in 2Mb hugepage scenario downtime was
> > > > around 8 ms).
> > > > I made that opinion by query-migrate.
> > > > {"return": {"status": "completed", "setup-time": 6, "downtime": 6, "total-time": 9668, "ram": {"total": 1091379200, "postcopy-requests": 1, "dirty-sync-count": 2, "remaining": 0, "mbps": 879.786851, "transferred": 1063007296, "duplicate": 7449, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1060868096, "normal": 259001}}}
> > > > 
> > > > Documentation says about downtime field - measurement unit is ms.
> > > 
> > > The downtime measurement field is pretty meaningless for postcopy; it's only
> > > the time from stopping the VM until the point where we tell the destination it
> > > can start running.  Meaningful measurements are only from inside the guest
> > > really, or the place latencys.
> > >
> > 
> > Maybe improve it by receiving such information from destination?
> > I wish to do that.
> > > > So I traced it (I added additional trace into postcopy_place_page
> > > > trace_postcopy_place_page_start(host, from, pagesize); )
> > > > 
> > > > postcopy_ram_fault_thread_request Request for HVA=7f6dc0000000 rb=/objects/mem offset=0
> > > > postcopy_place_page_start host=0x7f6dc0000000 from=0x7f6d70000000, pagesize=40000000
> > > > postcopy_place_page_start host=0x7f6e0e800000 from=0x55b665969619, pagesize=1000
> > > > postcopy_place_page_start host=0x7f6e0e801000 from=0x55b6659684e8, pagesize=1000
> > > > several pages with 4Kb step ...
> > > > postcopy_place_page_start host=0x7f6e0e817000 from=0x55b6659694f0, pagesize=1000
> > > > 
> > > > 4K pages, started from 0x7f6e0e800000 address it's
> > > > vga.ram, /rom@etc/acpi/tables etc.
> > > > 
> > > > Frankly saying, right now, I don't have any ideas why hugepage wasn't
> > > > resent. Maybe my expectation of it is wrong as well as understanding )
> > > 
> > > That's pretty much what I expect to see - before you get into postcopy
> > > mode everything is sent as individual 4k pages (in order); once we're
> > > in postcopy mode we send each page no more than once.  So you're
> > > huge page comes across once - and there it is.
> > > 
> > > > stress utility also duplicated for me value into appropriate file:
> > > > sec_since_epoch.microsec:value
> > > > 1487003192.728493:22
> > > > 1487003197.335362:23
> > > > *1487003213.367260:24*
> > > > *1487003238.480379:25*
> > > > 1487003243.315299:26
> > > > 1487003250.775721:27
> > > > 1487003255.473792:28
> > > > 
> > > > It mean rewriting 256Mb of memory per byte took around 5 sec, but at
> > > > the moment of migration it took 25 sec.
> > > 
> > > right, now this is the thing that's more useful to measure.
> > > That's not too surprising; when it migrates that data is changing rapidly
> > > so it's going to have to pause and wait for that whole 1GB to be transferred.
> > > Your 1Gbps network is going to take about 10 seconds to transfer that
> > > 1GB page - and that's if you're lucky and it saturates the network.
> > > SO it's going to take at least 10 seconds longer than it normally
> > > would, plus any other overheads - so at least 15 seconds.
> > > This is why I say it's a bad idea to use 1GB host pages with postcopy.
> > > Of course it would be fun to find where the other 10 seconds went!
> > > 
> > > You might like to add timing to the tracing so you can see the time between the
> > > fault thread requesting the page and it arriving.
> > >
> > yes, sorry I forgot about timing
> > 20806@1487084818.270993:postcopy_ram_fault_thread_request Request for HVA=7f0280000000 rb=/objects/mem offset=0
> > 20806@1487084818.271038:qemu_loadvm_state_section 8
> > 20806@1487084818.271056:loadvm_process_command com=0x2 len=4
> > 20806@1487084818.271089:qemu_loadvm_state_section 2
> > 20806@1487084823.315919:postcopy_place_page_start host=0x7f0280000000 from=0x7f0240000000, pagesize=40000000
> > 
> > 1487084823.315919 - 1487084818.270993 = 5.044926 sec.
> > Machines connected w/o any routers, directly by cable.
> 
> OK, the fact it's only 5 seconds not 10 I think suggests a lot of the memory was all zero
> so didn't take up the whole bandwidth.
I decided to measure downtime as the sum of the intervals from the moment
a fault happened until the page was loaded. I didn't rely on ordering, so
I associated each interval with its fault address.

For a 2G RAM VM using 1G huge pages, the downtime measured on the destination
is around 12 sec, but for the same 2G RAM VM with 2MB huge pages the downtime
measured on the destination is around 20 sec; 320 page faults happened and
640 MB were transmitted.

My current method doesn't take multi-core vCPUs into account. I checked
only with 1 CPU, but that's not the proper case, so I think it's worth
counting downtime per vCPU, or calculating the overlap of the vCPU downtimes.
What do you think?
Also I haven't yet finished the IPC to provide that information to the
source host, where info_migrate is being called.
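
Roughly, the accounting is the following (an illustrative sketch only -
the helper names are made up and it is not the actual patch; the two
timestamps are the ones from the trace quoted earlier):

#include <stdint.h>
#include <stdio.h>

#define MAX_FAULTS 1024

struct fault_rec {
    uint64_t addr;       /* faulting HVA */
    double   req_time;   /* when the fault thread requested the page */
};

static struct fault_rec faults[MAX_FAULTS];
static int nr_faults;
static double total_downtime;    /* sum of per-fault intervals, seconds */

static void on_fault_request(uint64_t addr, double now)
{
    if (nr_faults < MAX_FAULTS) {
        faults[nr_faults].addr = addr;
        faults[nr_faults].req_time = now;
        nr_faults++;
    }
}

static void on_page_placed(uint64_t addr, double now)
{
    for (int i = 0; i < nr_faults; i++) {
        if (faults[i].addr == addr) {
            total_downtime += now - faults[i].req_time;
            faults[i] = faults[--nr_faults];    /* drop the record */
            break;
        }
    }
}

int main(void)
{
    on_fault_request(0x7f0280000000ULL, 1487084818.270993);
    on_page_placed(0x7f0280000000ULL, 1487084823.315919);
    printf("accumulated downtime: %f s\n", total_downtime);
    return 0;
}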


> 
> > > > Another one request.
> > > > QEMU could use mem_path in hugefs with share key simultaneously
> > > > (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> > > > in this case will start and will properly work (it will allocate memory
> > > > with mmap), but in case of destination for postcopy live migration
> > > > UFFDIO_COPY ioctl will fail for
> > > > such region, in Arcangeli's git tree there is such prevent check
> > > > (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> > > > Is it possible to handle such situation at qemu?
> > > 
> > > Imagine that you had shared memory; what semantics would you like
> > > to see ?  What happens to the other process?
> > 
> > Honestly, initially, I thought to handle such error, but I quit forgot
> > about vhost-user in ovs-dpdk.
> 
> Yes, I don't know much about vhost-user; but we'll have to think carefully
> about the way things behave when they're accessing memory that's shared
> with qemu during migration.  Writing to the source after we've started
> the postcopy phase is not allowed.  Accessing the destination memory
> during postcopy will produce pauses in the other processes accessing it
> (I think) and they mustn't do various types of madvise etc - so
> I'm sure there will be things we find out the hard way!
> 
> Dave
> 
> > > Dave
> > > 
> > > > On Mon, Feb 06, 2017 at 05:45:30PM +0000, Dr. David Alan Gilbert wrote:
> > > > > * Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > 
> > > > > > Hi,
> > > > > >   The existing postcopy code, and the userfault kernel
> > > > > > code that supports it, only works for normal anonymous memory.
> > > > > > Kernel support for userfault on hugetlbfs is working
> > > > > > it's way upstream; it's in the linux-mm tree,
> > > > > > You can get a version at:
> > > > > >    git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> > > > > > on the origin/userfault branch.
> > > > > > 
> > > > > > Note that while this code supports arbitrary sized hugepages,
> > > > > > it doesn't make sense with pages above the few-MB region,
> > > > > > so while 2MB is fine, 1GB is probably a bad idea;
> > > > > > this code waits for and transmits whole huge pages, and a
> > > > > > 1GB page would take about 1 second to transfer over a 10Gbps
> > > > > > link - which is way too long to pause the destination for.
> > > > > > 
> > > > > > Dave
> > > > > 
> > > > > Oops I missed the v2 changes from the message:
> > > > > 
> > > > > v2
> > > > >   Flip ram-size summary word/compare individual page size patches around
> > > > >   Individual page size comparison is done in ram_load if 'advise' has been
> > > > >     received rather than checking migrate_postcopy_ram()
> > > > >   Moved discard code into exec.c, reworked ram_discard_range
> > > > > 
> > > > > Dave
> > > > 
> > > > Thank your, right now it's not necessary to set
> > > > postcopy-ram capability on destination machine.
> > > > 
> > > > 
> > > > > 
> > > > > > Dr. David Alan Gilbert (16):
> > > > > >   postcopy: Transmit ram size summary word
> > > > > >   postcopy: Transmit and compare individual page sizes
> > > > > >   postcopy: Chunk discards for hugepages
> > > > > >   exec: ram_block_discard_range
> > > > > >   postcopy: enhance ram_block_discard_range for hugepages
> > > > > >   Fold postcopy_ram_discard_range into ram_discard_range
> > > > > >   postcopy: Record largest page size
> > > > > >   postcopy: Plumb pagesize down into place helpers
> > > > > >   postcopy: Use temporary for placing zero huge pages
> > > > > >   postcopy: Load huge pages in one go
> > > > > >   postcopy: Mask fault addresses to huge page boundary
> > > > > >   postcopy: Send whole huge pages
> > > > > >   postcopy: Allow hugepages
> > > > > >   postcopy: Update userfaultfd.h header
> > > > > >   postcopy: Check for userfault+hugepage feature
> > > > > >   postcopy: Add doc about hugepages and postcopy
> > > > > > 
> > > > > >  docs/migration.txt                |  13 ++++
> > > > > >  exec.c                            |  83 +++++++++++++++++++++++
> > > > > >  include/exec/cpu-common.h         |   2 +
> > > > > >  include/exec/memory.h             |   1 -
> > > > > >  include/migration/migration.h     |   3 +
> > > > > >  include/migration/postcopy-ram.h  |  13 ++--
> > > > > >  linux-headers/linux/userfaultfd.h |  81 +++++++++++++++++++---
> > > > > >  migration/migration.c             |   1 +
> > > > > >  migration/postcopy-ram.c          | 138 +++++++++++++++++---------------------
> > > > > >  migration/ram.c                   | 109 ++++++++++++++++++------------
> > > > > >  migration/savevm.c                |  32 ++++++---
> > > > > >  migration/trace-events            |   2 +-
> > > > > >  12 files changed, 328 insertions(+), 150 deletions(-)
> > > > > > 
> > > > > > -- 
> > > > > > 2.9.3
> > > > > > 
> > > > > > 
> > > > > --
> > > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > > 
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > 
> > 
> > -- 
> > 
> > BR
> > Alexey
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

-- 

BR
Alexey

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-21  7:31             ` Alexey Perevalov
@ 2017-02-21 10:03               ` Dr. David Alan Gilbert
  2017-02-27 11:05                 ` Alexey Perevalov
  0 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-21 10:03 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: aarcange, qemu-devel, quintela

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> 
> Hello David,

Hi Alexey,

> On Tue, Feb 14, 2017 at 07:34:26PM +0000, Dr. David Alan Gilbert wrote:
> > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > Hi David,
> > > 
> > > Thank your, now it's clear.
> > > 
> > > On Mon, Feb 13, 2017 at 06:16:02PM +0000, Dr. David Alan Gilbert wrote:
> > > > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > > >  Hello David!
> > > > 
> > > > Hi Alexey,
> > > > 
> > > > > I have checked you series with 1G hugepage, but only in 1 Gbit/sec network
> > > > > environment.
> > > > 
> > > > Can you show the qemu command line you're using?  I'm just trying
> > > > to make sure I understand where your hugepages are; running 1G hostpages
> > > > across a 1Gbit/sec network for postcopy would be pretty poor - it would take
> > > > ~10 seconds to transfer the page.
> > > 
> > > sure
> > > -hda ./Ubuntu.img -name PAU,debug-threads=on -boot d -net nic -net user
> > > -m 1024 -localtime -nographic -enable-kvm -incoming tcp:0:4444 -object
> > > memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages -mem-prealloc
> > > -numa node,memdev=mem -trace events=/tmp/events -chardev
> > > socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock,server,nowait
> > > -mon chardev=charmonitor,id=monitor,mode=control
> > 
> > OK, it's a pretty unusual setup - a 1G page guest with 1G of guest RAM.
> > 
> > > > 
> > > > > I started Ubuntu just with console interface and gave to it only 1G of
> > > > > RAM, inside Ubuntu I started stress command
> > > > 
> > > > > (stress --cpu 4 --io 4 --vm 4 --vm-bytes 256000000 &)
> > > > > in such environment precopy live migration was impossible, it never
> > > > > being finished, in this case it infinitely sends pages (it looks like
> > > > > dpkg scenario).
> > > > > 
> > > > > Also I modified stress utility
> > > > > http://people.seas.harvard.edu/~apw/stress/stress-1.0.4.tar.gz
> > > > > due to it wrote into memory every time the same value `Z`. My
> > > > > modified version writes every allocation new incremented value.
> > > > 
> > > > I use google's stressapptest normally; although remember to turn
> > > > off the bit where it pauses.
> > > 
> > > I decided to use it too
> > > stressapptest -s 300 -M 256 -m 8 -W
> > > 
> > > > 
> > > > > I'm using Arcangeli's kernel only at the destination.
> > > > > 
> > > > > I got controversial results. Downtime for 1G hugepage is close to 2Mb
> > > > > hugepage and it took around 7 ms (in 2Mb hugepage scenario downtime was
> > > > > around 8 ms).
> > > > > I made that opinion by query-migrate.
> > > > > {"return": {"status": "completed", "setup-time": 6, "downtime": 6, "total-time": 9668, "ram": {"total": 1091379200, "postcopy-requests": 1, "dirty-sync-count": 2, "remaining": 0, "mbps": 879.786851, "transferred": 1063007296, "duplicate": 7449, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1060868096, "normal": 259001}}}
> > > > > 
> > > > > Documentation says about downtime field - measurement unit is ms.
> > > > 
> > > > The downtime measurement field is pretty meaningless for postcopy; it's only
> > > > the time from stopping the VM until the point where we tell the destination it
> > > > can start running.  Meaningful measurements are only from inside the guest
> > > > really, or the place latencys.
> > > >
> > > 
> > > Maybe improve it by receiving such information from destination?
> > > I wish to do that.
> > > > > So I traced it (I added additional trace into postcopy_place_page
> > > > > trace_postcopy_place_page_start(host, from, pagesize); )
> > > > > 
> > > > > postcopy_ram_fault_thread_request Request for HVA=7f6dc0000000 rb=/objects/mem offset=0
> > > > > postcopy_place_page_start host=0x7f6dc0000000 from=0x7f6d70000000, pagesize=40000000
> > > > > postcopy_place_page_start host=0x7f6e0e800000 from=0x55b665969619, pagesize=1000
> > > > > postcopy_place_page_start host=0x7f6e0e801000 from=0x55b6659684e8, pagesize=1000
> > > > > several pages with 4Kb step ...
> > > > > postcopy_place_page_start host=0x7f6e0e817000 from=0x55b6659694f0, pagesize=1000
> > > > > 
> > > > > 4K pages, started from 0x7f6e0e800000 address it's
> > > > > vga.ram, /rom@etc/acpi/tables etc.
> > > > > 
> > > > > Frankly saying, right now, I don't have any ideas why hugepage wasn't
> > > > > resent. Maybe my expectation of it is wrong as well as understanding )
> > > > 
> > > > That's pretty much what I expect to see - before you get into postcopy
> > > > mode everything is sent as individual 4k pages (in order); once we're
> > > > in postcopy mode we send each page no more than once.  So you're
> > > > huge page comes across once - and there it is.
> > > > 
> > > > > stress utility also duplicated for me value into appropriate file:
> > > > > sec_since_epoch.microsec:value
> > > > > 1487003192.728493:22
> > > > > 1487003197.335362:23
> > > > > *1487003213.367260:24*
> > > > > *1487003238.480379:25*
> > > > > 1487003243.315299:26
> > > > > 1487003250.775721:27
> > > > > 1487003255.473792:28
> > > > > 
> > > > > It mean rewriting 256Mb of memory per byte took around 5 sec, but at
> > > > > the moment of migration it took 25 sec.
> > > > 
> > > > right, now this is the thing that's more useful to measure.
> > > > That's not too surprising; when it migrates that data is changing rapidly
> > > > so it's going to have to pause and wait for that whole 1GB to be transferred.
> > > > Your 1Gbps network is going to take about 10 seconds to transfer that
> > > > 1GB page - and that's if you're lucky and it saturates the network.
> > > > SO it's going to take at least 10 seconds longer than it normally
> > > > would, plus any other overheads - so at least 15 seconds.
> > > > This is why I say it's a bad idea to use 1GB host pages with postcopy.
> > > > Of course it would be fun to find where the other 10 seconds went!
> > > > 
> > > > You might like to add timing to the tracing so you can see the time between the
> > > > fault thread requesting the page and it arriving.
> > > >
> > > yes, sorry I forgot about timing
> > > 20806@1487084818.270993:postcopy_ram_fault_thread_request Request for HVA=7f0280000000 rb=/objects/mem offset=0
> > > 20806@1487084818.271038:qemu_loadvm_state_section 8
> > > 20806@1487084818.271056:loadvm_process_command com=0x2 len=4
> > > 20806@1487084818.271089:qemu_loadvm_state_section 2
> > > 20806@1487084823.315919:postcopy_place_page_start host=0x7f0280000000 from=0x7f0240000000, pagesize=40000000
> > > 
> > > 1487084823.315919 - 1487084818.270993 = 5.044926 sec.
> > > Machines connected w/o any routers, directly by cable.
> > 
> > OK, the fact it's only 5 seconds not 10 I think suggests a lot of the memory was all zero
> > so didn't take up the whole bandwidth.

> I decided to measure downtime as a sum of intervals since fault happened
> and till page was load. I didn't relay on order, so I associated that
> interval with fault address.

Don't forget the source will still be sending unrequested pages at the
same time as fault responses, so that simplification might be wrong.
My experience with 4k pages is that you'll often get pages arriving at
about the same time as you ask for them because of the background
transmission.

> For 2G ram vm - using 1G huge page, downtime measured on dst is around 12 sec,
> but for the same 2G ram vm with 2Mb huge page, downtime measured on dst
> is around 20 sec, and 320 page faults happened, 640 Mb was transmitted.

OK, so 20/320 * 1000 = 62.5 msec/page.  That's a bit high.
I think it takes about 16ms to transmit a 2MB page on your 1Gbps network;
you're probably also suffering from the requests being queued behind
background page transmissions, so if you try reducing your tcp_wmem
setting on the source it might get a bit better.  Once Juan Quintela's
multi-fd work goes in, my hope is to combine it with postcopy and then
be able to avoid that type of request blocking.
Generally I'd recommend 10Gbps for postcopy since it does pull the
latency down quite a bit.
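
Spelling out the arithmetic behind that estimate (a back-of-the-envelope
sketch; the 1Gbps link and the 20 sec / 320 faults figures are the ones
from this thread, the queued-data number is just an inference):

#include <stdio.h>

int main(void)
{
    double link_bps   = 1e9;                     /* 1Gbps link         */
    double page_bits  = 2.0 * 1024 * 1024 * 8;   /* one 2MB huge page  */
    double wire_s     = page_bits / link_bps;    /* ~16.8 ms           */
    double observed_s = 20.0 / 320;              /* ~62.5 ms per fault */
    double queued_mb  = (observed_s - wire_s) * link_bps / 8 / 1e6;

    printf("wire time %.1f ms, observed %.1f ms -> roughly %.1f MB of\n"
           "background data queued ahead of each request\n",
           wire_s * 1e3, observed_s * 1e3, queued_mb);
    return 0;
}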

> My current method doesn't take into account multi core vcpu. I checked
> only with 1 CPU, but it's not proper case. So I think it's worth to
> count downtime per CPU, or calculate overlap of CPU downtimes.
> How do your think?

Yes; one of the nice things about postcopy is that if one vCPU is blocked
waiting for a page, the other vCPUs will just be able to carry on.
Even with 1 vCPU, if you've got multiple runnable tasks the guest can
switch to a task that isn't blocked (see KVM asynchronous page faults).
Now, what the numbers mean when you calculate the total like that might be
a bit odd - for example, if you have 8 vCPUs and they're each blocked, do
you add the times together even though they're blocked at the same time?
What about if they're blocked on the same page?
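
One way to make that total well-defined is to treat each vCPU's blocked
time as an interval and merge the overlaps, so time where several vCPUs
are stalled simultaneously (or on the same page) is only counted once.
A small sketch of the idea - illustrative only, nothing from the series:

#include <stdio.h>
#include <stdlib.h>

struct interval { double start, end; };

static int cmp_start(const void *a, const void *b)
{
    const struct interval *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/* total length of the union of the intervals */
static double merged_downtime(struct interval *iv, int n)
{
    double total = 0, cur_start, cur_end;

    if (n == 0) return 0;
    qsort(iv, n, sizeof(*iv), cmp_start);
    cur_start = iv[0].start;
    cur_end   = iv[0].end;
    for (int i = 1; i < n; i++) {
        if (iv[i].start <= cur_end) {           /* overlaps current run */
            if (iv[i].end > cur_end) cur_end = iv[i].end;
        } else {                                /* gap: close the run   */
            total += cur_end - cur_start;
            cur_start = iv[i].start;
            cur_end   = iv[i].end;
        }
    }
    return total + (cur_end - cur_start);
}

int main(void)
{
    /* two vCPUs blocked at overlapping times, plus one later stall */
    struct interval iv[] = { { 0.0, 5.0 }, { 1.0, 5.0 }, { 7.0, 8.0 } };
    printf("sum = 10.0 s, union = %.1f s\n", merged_downtime(iv, 3));
    return 0;
}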

> Also I didn't yet finish IPC to provide such information to src host, where
> info_migrate is being called.

Dave

> 
> 
> > 
> > > > > Another one request.
> > > > > QEMU could use mem_path in hugefs with share key simultaneously
> > > > > (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> > > > > in this case will start and will properly work (it will allocate memory
> > > > > with mmap), but in case of destination for postcopy live migration
> > > > > UFFDIO_COPY ioctl will fail for
> > > > > such region, in Arcangeli's git tree there is such prevent check
> > > > > (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> > > > > Is it possible to handle such situation at qemu?
> > > > 
> > > > Imagine that you had shared memory; what semantics would you like
> > > > to see ?  What happens to the other process?
> > > 
> > > Honestly, initially, I thought to handle such error, but I quit forgot
> > > about vhost-user in ovs-dpdk.
> > 
> > Yes, I don't know much about vhost-user; but we'll have to think carefully
> > about the way things behave when they're accessing memory that's shared
> > with qemu during migration.  Writing to the source after we've started
> > the postcopy phase is not allowed.  Accessing the destination memory
> > during postcopy will produce pauses in the other processes accessing it
> > (I think) and they mustn't do various types of madvise etc - so
> > I'm sure there will be things we find out the hard way!
> > 
> > Dave
> > 
> > > > Dave
> > > > 
> > > > > On Mon, Feb 06, 2017 at 05:45:30PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > * Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> > > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > > 
> > > > > > > Hi,
> > > > > > >   The existing postcopy code, and the userfault kernel
> > > > > > > code that supports it, only works for normal anonymous memory.
> > > > > > > Kernel support for userfault on hugetlbfs is working
> > > > > > > it's way upstream; it's in the linux-mm tree,
> > > > > > > You can get a version at:
> > > > > > >    git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> > > > > > > on the origin/userfault branch.
> > > > > > > 
> > > > > > > Note that while this code supports arbitrary sized hugepages,
> > > > > > > it doesn't make sense with pages above the few-MB region,
> > > > > > > so while 2MB is fine, 1GB is probably a bad idea;
> > > > > > > this code waits for and transmits whole huge pages, and a
> > > > > > > 1GB page would take about 1 second to transfer over a 10Gbps
> > > > > > > link - which is way too long to pause the destination for.
> > > > > > > 
> > > > > > > Dave
> > > > > > 
> > > > > > Oops I missed the v2 changes from the message:
> > > > > > 
> > > > > > v2
> > > > > >   Flip ram-size summary word/compare individual page size patches around
> > > > > >   Individual page size comparison is done in ram_load if 'advise' has been
> > > > > >     received rather than checking migrate_postcopy_ram()
> > > > > >   Moved discard code into exec.c, reworked ram_discard_range
> > > > > > 
> > > > > > Dave
> > > > > 
> > > > > Thank your, right now it's not necessary to set
> > > > > postcopy-ram capability on destination machine.
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > Dr. David Alan Gilbert (16):
> > > > > > >   postcopy: Transmit ram size summary word
> > > > > > >   postcopy: Transmit and compare individual page sizes
> > > > > > >   postcopy: Chunk discards for hugepages
> > > > > > >   exec: ram_block_discard_range
> > > > > > >   postcopy: enhance ram_block_discard_range for hugepages
> > > > > > >   Fold postcopy_ram_discard_range into ram_discard_range
> > > > > > >   postcopy: Record largest page size
> > > > > > >   postcopy: Plumb pagesize down into place helpers
> > > > > > >   postcopy: Use temporary for placing zero huge pages
> > > > > > >   postcopy: Load huge pages in one go
> > > > > > >   postcopy: Mask fault addresses to huge page boundary
> > > > > > >   postcopy: Send whole huge pages
> > > > > > >   postcopy: Allow hugepages
> > > > > > >   postcopy: Update userfaultfd.h header
> > > > > > >   postcopy: Check for userfault+hugepage feature
> > > > > > >   postcopy: Add doc about hugepages and postcopy
> > > > > > > 
> > > > > > >  docs/migration.txt                |  13 ++++
> > > > > > >  exec.c                            |  83 +++++++++++++++++++++++
> > > > > > >  include/exec/cpu-common.h         |   2 +
> > > > > > >  include/exec/memory.h             |   1 -
> > > > > > >  include/migration/migration.h     |   3 +
> > > > > > >  include/migration/postcopy-ram.h  |  13 ++--
> > > > > > >  linux-headers/linux/userfaultfd.h |  81 +++++++++++++++++++---
> > > > > > >  migration/migration.c             |   1 +
> > > > > > >  migration/postcopy-ram.c          | 138 +++++++++++++++++---------------------
> > > > > > >  migration/ram.c                   | 109 ++++++++++++++++++------------
> > > > > > >  migration/savevm.c                |  32 ++++++---
> > > > > > >  migration/trace-events            |   2 +-
> > > > > > >  12 files changed, 328 insertions(+), 150 deletions(-)
> > > > > > > 
> > > > > > > -- 
> > > > > > > 2.9.3
> > > > > > > 
> > > > > > > 
> > > > > > --
> > > > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > > > 
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > 
> > > 
> > > -- 
> > > 
> > > BR
> > > Alexey
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
> -- 
> 
> BR
> Alexey
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-06 17:32 [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support Dr. David Alan Gilbert (git)
                   ` (16 preceding siblings ...)
  2017-02-06 17:45 ` [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support Dr. David Alan Gilbert
@ 2017-02-22 16:43 ` Laurent Vivier
  2017-02-24 10:04 ` Dr. David Alan Gilbert
  18 siblings, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-22 16:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Hi,
>   The existing postcopy code, and the userfault kernel
> code that supports it, only works for normal anonymous memory.
> Kernel support for userfault on hugetlbfs is working
> it's way upstream; it's in the linux-mm tree,
> You can get a version at:
>    git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> on the origin/userfault branch.
> 
> Note that while this code supports arbitrary sized hugepages,
> it doesn't make sense with pages above the few-MB region,
> so while 2MB is fine, 1GB is probably a bad idea;
> this code waits for and transmits whole huge pages, and a
> 1GB page would take about 1 second to transfer over a 10Gbps
> link - which is way too long to pause the destination for.
> 
> Dave
> 
> Dr. David Alan Gilbert (16):
>   postcopy: Transmit ram size summary word
>   postcopy: Transmit and compare individual page sizes
>   postcopy: Chunk discards for hugepages
>   exec: ram_block_discard_range
>   postcopy: enhance ram_block_discard_range for hugepages
>   Fold postcopy_ram_discard_range into ram_discard_range
>   postcopy: Record largest page size
>   postcopy: Plumb pagesize down into place helpers
>   postcopy: Use temporary for placing zero huge pages
>   postcopy: Load huge pages in one go
>   postcopy: Mask fault addresses to huge page boundary
>   postcopy: Send whole huge pages
>   postcopy: Allow hugepages
>   postcopy: Update userfaultfd.h header
>   postcopy: Check for userfault+hugepage feature
>   postcopy: Add doc about hugepages and postcopy
> 
>  docs/migration.txt                |  13 ++++
>  exec.c                            |  83 +++++++++++++++++++++++
>  include/exec/cpu-common.h         |   2 +
>  include/exec/memory.h             |   1 -
>  include/migration/migration.h     |   3 +
>  include/migration/postcopy-ram.h  |  13 ++--
>  linux-headers/linux/userfaultfd.h |  81 +++++++++++++++++++---
>  migration/migration.c             |   1 +
>  migration/postcopy-ram.c          | 138 +++++++++++++++++---------------------
>  migration/ram.c                   | 109 ++++++++++++++++++------------
>  migration/savevm.c                |  32 ++++++---
>  migration/trace-events            |   2 +-
>  12 files changed, 328 insertions(+), 150 deletions(-)
> 
Tested-by: Laurent Vivier <lvivier@redhat.com>

On ppc64le with 16MB hugepage size and kernel 4.10 from aa.git/userfault

Laurent

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-06 17:32 [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support Dr. David Alan Gilbert (git)
                   ` (17 preceding siblings ...)
  2017-02-22 16:43 ` Laurent Vivier
@ 2017-02-24 10:04 ` Dr. David Alan Gilbert
  18 siblings, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-24 10:04 UTC (permalink / raw)
  To: qemu-devel, quintela; +Cc: lvivier

* Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Hi,
>   The existing postcopy code, and the userfault kernel
> code that supports it, only works for normal anonymous memory.
> Kernel support for userfault on hugetlbfs is working
> it's way upstream; it's in the linux-mm tree,
> You can get a version at:
>    git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> on the origin/userfault branch.

This has now merged into Linus's tree as of commit
bc49a7831b1137ce1c2dda1c57e3631655f5d2ae on 
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Dave

> Note that while this code supports arbitrary sized hugepages,
> it doesn't make sense with pages above the few-MB region,
> so while 2MB is fine, 1GB is probably a bad idea;
> this code waits for and transmits whole huge pages, and a
> 1GB page would take about 1 second to transfer over a 10Gbps
> link - which is way too long to pause the destination for.
> 
> Dave
> 
> Dr. David Alan Gilbert (16):
>   postcopy: Transmit ram size summary word
>   postcopy: Transmit and compare individual page sizes
>   postcopy: Chunk discards for hugepages
>   exec: ram_block_discard_range
>   postcopy: enhance ram_block_discard_range for hugepages
>   Fold postcopy_ram_discard_range into ram_discard_range
>   postcopy: Record largest page size
>   postcopy: Plumb pagesize down into place helpers
>   postcopy: Use temporary for placing zero huge pages
>   postcopy: Load huge pages in one go
>   postcopy: Mask fault addresses to huge page boundary
>   postcopy: Send whole huge pages
>   postcopy: Allow hugepages
>   postcopy: Update userfaultfd.h header
>   postcopy: Check for userfault+hugepage feature
>   postcopy: Add doc about hugepages and postcopy
> 
>  docs/migration.txt                |  13 ++++
>  exec.c                            |  83 +++++++++++++++++++++++
>  include/exec/cpu-common.h         |   2 +
>  include/exec/memory.h             |   1 -
>  include/migration/migration.h     |   3 +
>  include/migration/postcopy-ram.h  |  13 ++--
>  linux-headers/linux/userfaultfd.h |  81 +++++++++++++++++++---
>  migration/migration.c             |   1 +
>  migration/postcopy-ram.c          | 138 +++++++++++++++++---------------------
>  migration/ram.c                   | 109 ++++++++++++++++++------------
>  migration/savevm.c                |  32 ++++++---
>  migration/trace-events            |   2 +-
>  12 files changed, 328 insertions(+), 150 deletions(-)
> 
> -- 
> 2.9.3
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 01/16] postcopy: Transmit ram size summary word
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 01/16] postcopy: Transmit ram size summary word Dr. David Alan Gilbert (git)
@ 2017-02-24 10:16   ` Laurent Vivier
  2017-02-24 13:10   ` Juan Quintela
  1 sibling, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 10:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Replace the host page-size in the 'advise' command by a pagesize
> summary bitmap; if the VM is just using normal RAM then
> this will be exactly the same as before, however if they're using
> huge pages they'll be different, and thus:
>    a) Migration from/to old qemu's that don't understand huge pages
>       will fail early.
>    b) Migrations with different size RAMBlocks will also fail early.
> 
> This catches it very early; earlier than the detailed per-block
> check in the next patch.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/migration.h |  1 +
>  migration/ram.c               | 17 +++++++++++++++++
>  migration/savevm.c            | 32 +++++++++++++++++++++-----------
>  3 files changed, 39 insertions(+), 11 deletions(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index af9135f..96c9d6e 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -366,6 +366,7 @@ void global_state_store_running(void);
>  void flush_page_queue(MigrationState *ms);
>  int ram_save_queue_pages(MigrationState *ms, const char *rbname,
>                           ram_addr_t start, ram_addr_t len);
> +uint64_t ram_pagesize_summary(void);
>  
>  PostcopyState postcopy_state_get(void);
>  /* Set the state and return the old state */
> diff --git a/migration/ram.c b/migration/ram.c
> index ef8fadf..b405e4a 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -600,6 +600,23 @@ static void migration_bitmap_sync_init(void)
>      iterations_prev = 0;
>  }
>  
> +/* Returns a summary bitmap of the page sizes of all RAMBlocks;
> + * for VMs with just normal pages this is equivalent to the
> + * host page size.  If it's got some huge pages then it's the OR
> + * of all the different page sizes.
> + */
> +uint64_t ram_pagesize_summary(void)
> +{
> +    RAMBlock *block;
> +    uint64_t summary = 0;
> +
> +    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> +        summary |= block->page_size;

It should be cleaner to use "qemu_ram_pagesize(block)".

It's only cosmetic, so:

Reviewed-by: Laurent Vivier <lvivier@redhat.com>
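
As a small standalone illustration of why OR-ing the sizes works as a
summary (each page size is a power of two, so the word records the set
of sizes in use and source and destination can compare a single value) -
not QEMU code; with the suggestion above the loop body would simply be
"summary |= qemu_ram_pagesize(block);":

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    /* e.g. source uses 4K RAM plus a 2MB hugepage backend, dest only 4K */
    uint64_t src_sizes[] = { 4096, 2 * 1024 * 1024 };
    uint64_t dst_sizes[] = { 4096 };
    uint64_t src = 0, dst = 0;

    for (unsigned i = 0; i < 2; i++) src |= src_sizes[i];
    for (unsigned i = 0; i < 1; i++) dst |= dst_sizes[i];

    printf("src summary 0x%" PRIx64 ", dst summary 0x%" PRIx64 " -> %s\n",
           src, dst, src == dst ? "match" : "mismatch, fail early");
    return 0;
}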

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/16] postcopy: Transmit and compare individual page sizes
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 02/16] postcopy: Transmit and compare individual page sizes Dr. David Alan Gilbert (git)
@ 2017-02-24 10:31   ` Laurent Vivier
  2017-02-24 10:48     ` Dr. David Alan Gilbert
  2017-02-24 13:13   ` Juan Quintela
  1 sibling, 1 reply; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 10:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> When using postcopy with hugepages, we require the source
> and destination page sizes for any RAMBlock to match; note
> that different RAMBlocks in the same VM can have different
> page sizes.
> 
> Transmit them as part of the RAM information header and
> fail if there's a difference.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/ram.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index b405e4a..5726563 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1979,6 +1979,9 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>          qemu_put_byte(f, strlen(block->idstr));
>          qemu_put_buffer(f, (uint8_t *)block->idstr, strlen(block->idstr));
>          qemu_put_be64(f, block->used_length);
> +        if (migrate_postcopy_ram() && block->page_size != qemu_host_page_size) {
> +            qemu_put_be64(f, block->page_size);
> +        }

I understand we don't break migration to previous machine types by adding
data to the migration stream, because migration is already broken when
block->page_size != qemu_host_page_size. Am I correct?

Laurent

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/16] postcopy: Transmit and compare individual page sizes
  2017-02-24 10:31   ` Laurent Vivier
@ 2017-02-24 10:48     ` Dr. David Alan Gilbert
  2017-02-24 10:50       ` Laurent Vivier
  0 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-24 10:48 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: qemu-devel, quintela, aarcange

* Laurent Vivier (lvivier@redhat.com) wrote:
> On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > When using postcopy with hugepages, we require the source
> > and destination page sizes for any RAMBlock to match; note
> > that different RAMBlocks in the same VM can have different
> > page sizes.
> > 
> > Transmit them as part of the RAM information header and
> > fail if there's a difference.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  migration/ram.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/migration/ram.c b/migration/ram.c
> > index b405e4a..5726563 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1979,6 +1979,9 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> >          qemu_put_byte(f, strlen(block->idstr));
> >          qemu_put_buffer(f, (uint8_t *)block->idstr, strlen(block->idstr));
> >          qemu_put_be64(f, block->used_length);
> > +        if (migrate_postcopy_ram() && block->page_size != qemu_host_page_size) {
> > +            qemu_put_be64(f, block->page_size);
> > +        }
> 
> I understand we don't break migration to previous machine type by adding
> data in the migration stream because migration is already broken when
> block->page_size != qemu_host_page_size. Am I correct?

Right, the previous patch - the one with the summary word - should have failed
the migration before we get to this point.

This patch can detect a more subtle problem - e.g. we still have huge pages used
by some of the RAM Blocks, but not all the same ones as the source.

Dave

> Laurent
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/16] postcopy: Transmit and compare individual page sizes
  2017-02-24 10:48     ` Dr. David Alan Gilbert
@ 2017-02-24 10:50       ` Laurent Vivier
  0 siblings, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 10:50 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: qemu-devel, quintela, aarcange

On 24/02/2017 11:48, Dr. David Alan Gilbert wrote:
> * Laurent Vivier (lvivier@redhat.com) wrote:
>> On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>
>>> When using postcopy with hugepages, we require the source
>>> and destination page sizes for any RAMBlock to match; note
>>> that different RAMBlocks in the same VM can have different
>>> page sizes.
>>>
>>> Transmit them as part of the RAM information header and
>>> fail if there's a difference.
>>>
>>> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>> ---
>>>  migration/ram.c | 17 +++++++++++++++++
>>>  1 file changed, 17 insertions(+)
>>>
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index b405e4a..5726563 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -1979,6 +1979,9 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>>          qemu_put_byte(f, strlen(block->idstr));
>>>          qemu_put_buffer(f, (uint8_t *)block->idstr, strlen(block->idstr));
>>>          qemu_put_be64(f, block->used_length);
>>> +        if (migrate_postcopy_ram() && block->page_size != qemu_host_page_size) {
>>> +            qemu_put_be64(f, block->page_size);
>>> +        }
>>
>> I understand we don't break migration to previous machine type by adding
>> data in the migration stream because migration is already broken when
>> block->page_size != qemu_host_page_size. Am I correct?
> 
> Right, the previous patch - the one with the summary word - should have failed
> the migration before we get to this point.
> 
> This patch can detect a more subtle problem - e.g. we still have huge pages used
> by some of the RAM Blocks, but not all the same ones as the source.

Thanks.

Reviewed-by: Laurent Vivier <lvivier@redhat.com>

Laurent

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 01/16] postcopy: Transmit ram size summary word
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 01/16] postcopy: Transmit ram size summary word Dr. David Alan Gilbert (git)
  2017-02-24 10:16   ` Laurent Vivier
@ 2017-02-24 13:10   ` Juan Quintela
  1 sibling, 0 replies; 73+ messages in thread
From: Juan Quintela @ 2017-02-24 13:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, aarcange

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Replace the host page-size in the 'advise' command by a pagesize
> summary bitmap; if the VM is just using normal RAM then
> this will be exactly the same as before, however if they're using
> huge pages they'll be different, and thus:
>    a) Migration from/to old qemu's that don't understand huge pages
>       will fail early.
>    b) Migrations with different size RAMBlocks will also fail early.
>
> This catches it very early; earlier than the detailed per-block
> check in the next patch.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/16] postcopy: Transmit and compare individual page sizes
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 02/16] postcopy: Transmit and compare individual page sizes Dr. David Alan Gilbert (git)
  2017-02-24 10:31   ` Laurent Vivier
@ 2017-02-24 13:13   ` Juan Quintela
  1 sibling, 0 replies; 73+ messages in thread
From: Juan Quintela @ 2017-02-24 13:13 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, aarcange

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> When using postcopy with hugepages, we require the source
> and destination page sizes for any RAMBlock to match; note
> that different RAMBlocks in the same VM can have different
> page sizes.
>
> Transmit them as part of the RAM information header and
> fail if there's a difference.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

I would prefer to add the field unconditionally for new machine types,
but I don't have a good idea about how to do it :-(

so...

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/16] exec: ram_block_discard_range
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 04/16] exec: ram_block_discard_range Dr. David Alan Gilbert (git)
@ 2017-02-24 13:14   ` Juan Quintela
  2017-02-24 14:04   ` Laurent Vivier
  2017-02-24 14:08   ` Laurent Vivier
  2 siblings, 0 replies; 73+ messages in thread
From: Juan Quintela @ 2017-02-24 13:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, aarcange

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Create ram_block_discard_range in exec.c to replace
> postcopy_ram_discard_range and most of ram_discard_range.
>
> Those two routines are a bit of a weird combination, and
> ram_discard_range is about to get more complex for hugepages.
> It's OS dependent code (so shouldn't be in migration/ram.c) but
> it needs quite a bit of the innards of RAMBlock so doesn't belong in
> the os*.c.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 05/16] postcopy: enhance ram_block_discard_range for hugepages
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 05/16] postcopy: enhance ram_block_discard_range for hugepages Dr. David Alan Gilbert (git)
@ 2017-02-24 13:20   ` Juan Quintela
  2017-02-24 13:44     ` Dr. David Alan Gilbert
  2017-02-24 14:20   ` Laurent Vivier
  1 sibling, 1 reply; 73+ messages in thread
From: Juan Quintela @ 2017-02-24 13:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, aarcange

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Unfortunately madvise DONTNEED doesn't work on hugepagetlb
> so use fallocate(FALLOC_FL_PUNCH_HOLE)
> qemu_fd_getpagesize only sets the page based off a file
> if the file is from hugetlbfs.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

But ...


> ---
>  exec.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/exec.c b/exec.c
> index e040cdf..c25f6b3 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -3324,9 +3324,20 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>  
>          errno = ENOTSUP; /* If we are missing MADVISE etc */
>  
> +        if (rb->page_size == qemu_host_page_size) {
>  #if defined(CONFIG_MADVISE)
> -        ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
> +            ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
>  #endif
> +        } else {
> +            /* Huge page case  - unfortunately it can't do DONTNEED, but
> +             * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
> +             * huge page file.
> +             */
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +            ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +                            start, length);


Why can't we use fallocate() when !CONFIG_MADVISE?

or even ...

         if (rb->page_size == qemu_host_page_size) {
#if defined(CONFIG_MADVISE)
            ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
#endif
          }

          if (ret == -1) {
              /* Huge page case  - unfortunately it can't do DONTNEED, but
               * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
               * huge page file.
               */
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
           ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            start, length);
#endif
          }

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/16] Fold postcopy_ram_discard_range into ram_discard_range
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 06/16] Fold postcopy_ram_discard_range into ram_discard_range Dr. David Alan Gilbert (git)
@ 2017-02-24 13:21   ` Juan Quintela
  2017-02-24 14:26   ` Laurent Vivier
  1 sibling, 0 replies; 73+ messages in thread
From: Juan Quintela @ 2017-02-24 13:21 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, aarcange

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Using the previously created ram_block_discard_range,
> kill off postcopy_ram_discard_range.
> ram_discard_range is just a wrapper that does the name lookup.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 07/16] postcopy: Record largest page size
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 07/16] postcopy: Record largest page size Dr. David Alan Gilbert (git)
@ 2017-02-24 13:22   ` Juan Quintela
  2017-02-24 14:37   ` Laurent Vivier
  1 sibling, 0 replies; 73+ messages in thread
From: Juan Quintela @ 2017-02-24 13:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, aarcange

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Record the largest page size in use; we'll need it soon for allocating
> temporary buffers.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 08/16] postcopy: Plumb pagesize down into place helpers
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 08/16] postcopy: Plumb pagesize down into place helpers Dr. David Alan Gilbert (git)
@ 2017-02-24 13:24   ` Juan Quintela
  2017-02-24 15:10   ` Laurent Vivier
  1 sibling, 0 replies; 73+ messages in thread
From: Juan Quintela @ 2017-02-24 13:24 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, aarcange

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Now we deal with normal size pages and huge pages we need
> to tell the place handlers the size we're dealing with
> and make sure the temporary page is large enough.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 16/16] postcopy: Add doc about hugepages and postcopy
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 16/16] postcopy: Add doc about hugepages and postcopy Dr. David Alan Gilbert (git)
@ 2017-02-24 13:25   ` Juan Quintela
  2017-02-24 16:12   ` Laurent Vivier
  1 sibling, 0 replies; 73+ messages in thread
From: Juan Quintela @ 2017-02-24 13:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, aarcange

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  docs/migration.txt | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/docs/migration.txt b/docs/migration.txt
> index 6503c17..b462ead 100644
> --- a/docs/migration.txt
> +++ b/docs/migration.txt
> @@ -482,3 +482,16 @@ request for a page that has already been sent is
>  ignored.  Duplicate requests
>  such as this can happen as a page is sent at about the same time the
>  destination accesses it.
>  
> +=== Postcopy with hugepages ===
> +
> +Postcopy now works with hugetlbfs backed memory:
> +  a) The linux kernel on the destination must support userfault on hugepages.
> +  b) The huge-page configuration on the source and destination VMs must be
> +     identical; i.e. RAMBlocks on both sides must use the same page size.
> +  c) Note that -mem-path /dev/hugepages  will fall back to allocating normal
> +     RAM if it doesn't have enough hugepages, triggering (b) to fail.
> +     Using -mem-prealloc enforces the allocation using hugepages.
> +  d) Care should be taken with the size of hugepage used; postcopy with 2MB
> +     hugepages works well, however 1GB hugepages are likely to be problematic
> +     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
> +     and until the full page is transferred the destination thread is blocked.
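
(As a quick sanity check of the ~1 second figure quoted above, assuming a
round 10 Gbit/s of usable bandwidth and ignoring protocol overhead:

    1 GiB = 8 * 2^30 bits ~= 8.6 Gbit;  8.6 Gbit / 10 Gbit/s ~= 0.86 s

of destination stall per faulted 1GB page, so "~1 second" is about right.)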

Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 05/16] postcopy: enhance ram_block_discard_range for hugepages
  2017-02-24 13:20   ` Juan Quintela
@ 2017-02-24 13:44     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-24 13:44 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel, aarcange

* Juan Quintela (quintela@redhat.com) wrote:
> "Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > Unfortunately madvise DONTNEED doesn't work on hugetlbfs,
> > so use fallocate(FALLOC_FL_PUNCH_HOLE).
> > qemu_fd_getpagesize only sets the page size based off a file
> > if the file is from hugetlbfs.
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> 
> But ...
> 
> 
> > ---
> >  exec.c | 13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/exec.c b/exec.c
> > index e040cdf..c25f6b3 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -3324,9 +3324,20 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
> >  
> >          errno = ENOTSUP; /* If we are missing MADVISE etc */
> >  
> > +        if (rb->page_size == qemu_host_page_size) {
> >  #if defined(CONFIG_MADVISE)
> > -        ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
> > +            ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
> >  #endif
> > +        } else {
> > +            /* Huge page case  - unfortunately it can't do DONTNEED, but
> > +             * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
> > +             * huge page file.
> > +             */
> > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > +            ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > +                            start, length);
> 
> 
> Why can't we use fallocate() when !CONFIG_MADVISE?
> 
> or even ...
> 
>          if (rb->page_size == qemu_host_page_size) {
> #if defined(CONFIG_MADVISE)
>             ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
> #endif
>           }
> 
>           if (ret == -1) {
>               /* Huge page case  - unfortunately it can't do DONTNEED, but
>                * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
>                * huge page file.
>                */
> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>            ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                             start, length);
> #endif
>           }

The fallocate only works where we have an fd, e.g. in the hugepage case;
the madvise only works where we have anonymous memory.  So if we don't have
madvise, we can't use fallocate for normal anonymous memory.

Actually, it's much more complicated than that - I've got another patch
that adds support for postcopy with memory that's backed by tmpfs with shared=true
and that also uses fallocate;  I'm trying to decide if it also needs
the madvise.
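
To spell the pairing out, here is a rough standalone sketch (illustrative
only - not the actual QEMU code; the struct and helper names are invented):

    #define _GNU_SOURCE          /* for fallocate() and FALLOC_FL_* */
    #include <errno.h>
    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Invented stand-in for the relevant bits of RAMBlock */
    struct ramblock_like {
        void   *host;       /* start of the mapping */
        int     fd;         /* -1 for anonymous memory */
        size_t  page_size;  /* host page size, or huge page size */
    };

    /* Make [start, start+length) read as zero and fault again later */
    static int discard_range(struct ramblock_like *rb,
                             uint64_t start, size_t length)
    {
        if (rb->fd == -1) {
            /* Anonymous memory: Linux MADV_DONTNEED zaps the pages */
            return madvise((char *)rb->host + start, length, MADV_DONTNEED);
        }
    #ifdef FALLOC_FL_PUNCH_HOLE
        /* fd-backed memory (hugetlbfs, shared tmpfs): punch a hole in
         * the backing file so the range reads as zero again */
        return fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         start, length);
    #else
        errno = ENOTSUP;
        return -1;
    #endif
    }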

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 03/16] postcopy: Chunk discards for hugepages
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 03/16] postcopy: Chunk discards for hugepages Dr. David Alan Gilbert (git)
@ 2017-02-24 13:48   ` Laurent Vivier
  0 siblings, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 13:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> At the start of the postcopy phase, partially sent huge pages
> must be discarded.  The code for dealing with host page sizes larger
> than the target page size can be reused for this case.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>

Reviewed-by: Laurent Vivier <lvivier@redhat.com>

> ---
>  migration/ram.c | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 5726563..d33bd21 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1627,12 +1627,17 @@ static void postcopy_chunk_hostpages_pass(MigrationState *ms, bool unsent_pass,
>  {
>      unsigned long *bitmap;
>      unsigned long *unsentmap;
> -    unsigned int host_ratio = qemu_host_page_size / TARGET_PAGE_SIZE;
> +    unsigned int host_ratio = block->page_size / TARGET_PAGE_SIZE;
>      unsigned long first = block->offset >> TARGET_PAGE_BITS;
>      unsigned long len = block->used_length >> TARGET_PAGE_BITS;
>      unsigned long last = first + (len - 1);
>      unsigned long run_start;
>  
> +    if (block->page_size == TARGET_PAGE_SIZE) {
> +        /* Easy case - TPS==HPS for a non-huge page RAMBlock */
> +        return;
> +    }
> +
>      bitmap = atomic_rcu_read(&migration_bitmap_rcu)->bmap;
>      unsentmap = atomic_rcu_read(&migration_bitmap_rcu)->unsentmap;
>  
> @@ -1736,7 +1741,8 @@ static void postcopy_chunk_hostpages_pass(MigrationState *ms, bool unsent_pass,
>   * Utility for the outgoing postcopy code.
>   *
>   * Discard any partially sent host-page size chunks, mark any partially
> - * dirty host-page size chunks as all dirty.
> + * dirty host-page size chunks as all dirty.  In this case the host-page
> + * is the host-page for the particular RAMBlock, i.e. it might be a huge page
>   *
>   * Returns: 0 on success
>   */
> @@ -1744,11 +1750,6 @@ static int postcopy_chunk_hostpages(MigrationState *ms)
>  {
>      struct RAMBlock *block;
>  
> -    if (qemu_host_page_size == TARGET_PAGE_SIZE) {
> -        /* Easy case - TPS==HPS - nothing to be done */
> -        return 0;
> -    }
> -
>      /* Easiest way to make sure we don't resume in the middle of a host-page */
>      last_seen_block = NULL;
>      last_sent_block = NULL;
> @@ -1804,7 +1805,7 @@ int ram_postcopy_send_discard_bitmap(MigrationState *ms)
>          return -EINVAL;
>      }
>  
> -    /* Deal with TPS != HPS */
> +    /* Deal with TPS != HPS and huge pages */
>      ret = postcopy_chunk_hostpages(ms);
>      if (ret) {
>          rcu_read_unlock();
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/16] exec: ram_block_discard_range
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 04/16] exec: ram_block_discard_range Dr. David Alan Gilbert (git)
  2017-02-24 13:14   ` Juan Quintela
@ 2017-02-24 14:04   ` Laurent Vivier
  2017-02-24 16:50     ` Dr. David Alan Gilbert
  2017-02-24 14:08   ` Laurent Vivier
  2 siblings, 1 reply; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 14:04 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Create ram_block_discard_range in exec.c to replace
> postcopy_ram_discard_range and most of ram_discard_range.
> 
> Those two routines are a bit of a weird combination, and
> ram_discard_range is about to get more complex for hugepages.
> It's OS dependent code (so shouldn't be in migration/ram.c) but
> it needs quite a bit of the innards of RAMBlock so doesn't belong in
> the os*.c.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  exec.c                    | 59 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/exec/cpu-common.h |  1 +
>  2 files changed, 60 insertions(+)
> 
> diff --git a/exec.c b/exec.c
> index 8b9ed73..e040cdf 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -45,6 +45,12 @@
>  #include "exec/address-spaces.h"
>  #include "sysemu/xen-mapcache.h"
>  #include "trace-root.h"
> +
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +#include <fcntl.h>
> +#include <linux/falloc.h>
> +#endif
> +
>  #endif
>  #include "exec/cpu-all.h"
>  #include "qemu/rcu_queue.h"
> @@ -3286,4 +3292,57 @@ int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
>      rcu_read_unlock();
>      return ret;
>  }
> +
> +/*
> + * Unmap pages of memory from start to start+length such that
> + * they a) read as 0, b) Trigger whatever fault mechanism
> + * the OS provides for postcopy.
> + * The pages must be unmapped by the end of the function.
> + * Returns: 0 on success, none-0 on failure
> + *
> + */
> +int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
> +{
> +    int ret = -1;
> +
> +    rcu_read_lock();
> +    uint8_t *host_startaddr = rb->host + start;
> +
> +    if ((uintptr_t)host_startaddr & (rb->page_size - 1)) {
> +        error_report("ram_block_discard_range: Unaligned start address: %p",
> +                     host_startaddr);
> +        goto err;
> +    }
> +
> +    if ((start + length) <= rb->used_length) {
> +        uint8_t *host_endaddr = host_startaddr + length;
> +        if ((uintptr_t)host_endaddr & (rb->page_size - 1)) {
> +            error_report("ram_block_discard_range: Unaligned end address: %p",
> +                         host_endaddr);
> +            goto err;
> +        }
> +
> +        errno = ENOTSUP; /* If we are missing MADVISE etc */
> +
> +#if defined(CONFIG_MADVISE)
> +        ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
> +#endif
> +        if (ret) {
> +            ret = -errno;
> +            error_report("ram_block_discard_range: Failed to discard range "
> +                         "%s:%" PRIx64 " +%zx (%d)",
> +                         rb->idstr, start, length, ret);
> +        }
> +    } else {
> +        error_report("ram_block_discard_range: Overrun block '%s' (%" PRIu64
> +                     "/%zx/" RAM_ADDR_FMT")",
> +                     rb->idstr, start, length, rb->used_length);
> +    }
> +
> +err:
> +    rcu_read_unlock();
> +
> +    return ret;
> +}

It really looks like a copy'n'paste from ram_discard_range(). It could be
clearer if you remove the code from ram_discard_range() and call this
function instead.

I think you don't need the "#if defined(CONFIG_MADVISE)" as you use
qemu_madvise() (or you should use madvise() directly if you want to
avoid the posix_madvise()).
[perhaps qemu_madvise() should set errno to ENOTSUP instead of EINVAL]
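
For what it's worth, a minimal sketch of what such a normalisation could
look like (hypothetical - not QEMU's actual qemu_madvise()):

    #include <errno.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Always report failure as -1 + errno; use ENOTSUP when no usable
     * madvise flavour exists.  Note POSIX_MADV_DONTNEED is only a hint
     * and doesn't guarantee Linux MADV_DONTNEED's discard semantics,
     * which is why the distinction matters for postcopy. */
    static int discard_advise(void *addr, size_t len)
    {
    #if defined(MADV_DONTNEED)
        return madvise(addr, len, MADV_DONTNEED);   /* -1 + errno on error */
    #elif defined(POSIX_MADV_DONTNEED)
        int err = posix_madvise(addr, len, POSIX_MADV_DONTNEED);
        if (err) {
            errno = err;    /* posix_madvise returns the error code */
            return -1;
        }
        return 0;
    #else
        errno = ENOTSUP;
        return -1;
    #endif
    }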

Laurent

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/16] exec: ram_block_discard_range
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 04/16] exec: ram_block_discard_range Dr. David Alan Gilbert (git)
  2017-02-24 13:14   ` Juan Quintela
  2017-02-24 14:04   ` Laurent Vivier
@ 2017-02-24 14:08   ` Laurent Vivier
  2017-02-24 15:35     ` Dr. David Alan Gilbert
  2 siblings, 1 reply; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 14:08 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Create ram_block_discard_range in exec.c to replace
> postcopy_ram_discard_range and most of ram_discard_range.
> 
> Those two routines are a bit of a weird combination, and
> ram_discard_range is about to get more complex for hugepages.
> It's OS dependent code (so shouldn't be in migration/ram.c) but
> it needs quite a bit of the innards of RAMBlock so doesn't belong in
> the os*.c.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  exec.c                    | 59 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/exec/cpu-common.h |  1 +
>  2 files changed, 60 insertions(+)
> 
> diff --git a/exec.c b/exec.c
> index 8b9ed73..e040cdf 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -45,6 +45,12 @@
>  #include "exec/address-spaces.h"
>  #include "sysemu/xen-mapcache.h"
>  #include "trace-root.h"
> +
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +#include <fcntl.h>
> +#include <linux/falloc.h>
> +#endif

Should it be in PATCH 05/16 instead?

Laurent

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 05/16] postcopy: enhance ram_block_discard_range for hugepages
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 05/16] postcopy: enhance ram_block_discard_range for hugepages Dr. David Alan Gilbert (git)
  2017-02-24 13:20   ` Juan Quintela
@ 2017-02-24 14:20   ` Laurent Vivier
  1 sibling, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 14:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Unfortunately madvise DONTNEED doesn't work on hugetlbfs,
> so use fallocate(FALLOC_FL_PUNCH_HOLE).
> qemu_fd_getpagesize only sets the page size based off a file
> if the file is from hugetlbfs.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  exec.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/exec.c b/exec.c
> index e040cdf..c25f6b3 100644
> --- a/exec.c
> +++ b/exec.c

You should move here the "#include" from PATCH 04/16

+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+#include <fcntl.h>
+#include <linux/falloc.h>
+#endif

> @@ -3324,9 +3324,20 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>  
>          errno = ENOTSUP; /* If we are missing MADVISE etc */
>  
> +        if (rb->page_size == qemu_host_page_size) {
>  #if defined(CONFIG_MADVISE)
> -        ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
> +            ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
>  #endif
> +        } else {
> +            /* Huge page case  - unfortunately it can't do DONTNEED, but
> +             * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
> +             * huge page file.
> +             */
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +            ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +                            start, length);
> +#endif
> +        }
>          if (ret) {
>              ret = -errno;
>              error_report("ram_block_discard_range: Failed to discard range "
> 

Reviewed-by: Laurent Vivier <lvivier@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/16] Fold postcopy_ram_discard_range into ram_discard_range
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 06/16] Fold postcopy_ram_discard_range into ram_discard_range Dr. David Alan Gilbert (git)
  2017-02-24 13:21   ` Juan Quintela
@ 2017-02-24 14:26   ` Laurent Vivier
  2017-02-24 16:02     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 14:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Using the previously created ram_block_discard_range,
> kill off postcopy_ram_discard_range.
> ram_discard_range is just a wrapper that does the name lookup.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/postcopy-ram.h |  7 -------
>  migration/postcopy-ram.c         | 30 +-----------------------------
>  migration/ram.c                  | 24 +++---------------------
>  migration/trace-events           |  2 +-
>  4 files changed, 5 insertions(+), 58 deletions(-)
> 
> diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> index b6a7491f..43bbbca 100644
> --- a/include/migration/postcopy-ram.h
> +++ b/include/migration/postcopy-ram.h
> @@ -35,13 +35,6 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages);
>  int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis);
>  
>  /*
> - * Discard the contents of 'length' bytes from 'start'
> - * We can assume that if we've been called postcopy_ram_hosttest returned true
> - */
> -int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> -                               size_t length);
> -
> -/*
>   * Userfault requires us to mark RAM as NOHUGEPAGE prior to discard
>   * however leaving it until after precopy means that most of the precopy
>   * data is still THPd
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index a40dddb..1e3d22f 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -200,27 +200,6 @@ out:
>      return ret;
>  }
>  
> -/**
> - * postcopy_ram_discard_range: Discard a range of memory.
> - * We can assume that if we've been called postcopy_ram_hosttest returned true.
> - *
> - * @mis: Current incoming migration state.
> - * @start, @length: range of memory to discard.
> - *
> - * returns: 0 on success.
> - */
> -int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> -                               size_t length)
> -{
> -    trace_postcopy_ram_discard_range(start, length);
> -    if (madvise(start, length, MADV_DONTNEED)) {
> -        error_report("%s MADV_DONTNEED: %s", __func__, strerror(errno));
> -        return -1;
> -    }
> -
> -    return 0;
> -}
> -
>  /*
>   * Setup an area of RAM so that it *can* be used for postcopy later; this
>   * must be done right at the start prior to pre-copy.
> @@ -239,7 +218,7 @@ static int init_range(const char *block_name, void *host_addr,
>       * - we're going to get the copy from the source anyway.
>       * (Precopy will just overwrite this data, so doesn't need the discard)
>       */
> -    if (postcopy_ram_discard_range(mis, host_addr, length)) {
> +    if (ram_discard_range(mis, block_name, 0, length)) {
>          return -1;
>      }
>  
> @@ -658,13 +637,6 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>      return -1;
>  }
>  
> -int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> -                               size_t length)
> -{
> -    assert(0);
> -    return -1;
> -}
> -
>  int postcopy_ram_prepare_discard(MigrationIncomingState *mis)
>  {
>      assert(0);
> diff --git a/migration/ram.c b/migration/ram.c
> index d33bd21..136996a 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1845,6 +1845,8 @@ int ram_discard_range(MigrationIncomingState *mis,
>  {
>      int ret = -1;
>  
> +    trace_ram_discard_range(block_name, start, length);
> +
>      rcu_read_lock();

I think you take the rcu_read_lock() twice: here and in
ram_block_discard_range().

I think you should merge this patch with PATCH 04/16, as it's just a code
copy.

Laurent

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 07/16] postcopy: Record largest page size
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 07/16] postcopy: Record largest page size Dr. David Alan Gilbert (git)
  2017-02-24 13:22   ` Juan Quintela
@ 2017-02-24 14:37   ` Laurent Vivier
  1 sibling, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 14:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Record the largest page size in use; we'll need it soon for allocating
> temporary buffers.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  exec.c                        | 13 +++++++++++++
>  include/exec/cpu-common.h     |  1 +
>  include/migration/migration.h |  1 +
>  migration/migration.c         |  1 +
>  4 files changed, 16 insertions(+)
> 
> diff --git a/exec.c b/exec.c
> index c25f6b3..59f3b6b 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1524,6 +1524,19 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
>      return rb->page_size;
>  }
>  
> +/* Returns the largest size of page in use */
> +size_t qemu_ram_pagesize_largest(void)
> +{
> +    RAMBlock *block;
> +    size_t largest = 0;
> +
> +    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> +        largest = MAX(largest, qemu_ram_pagesize(block));
> +    }
> +
> +    return largest;
> +}
> +
>  static int memory_try_enable_merging(void *addr, size_t len)
>  {
>      if (!machine_mem_merge(current_machine)) {
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 1350c2e..8c305aa 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -64,6 +64,7 @@ void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
>  void qemu_ram_unset_idstr(RAMBlock *block);
>  const char *qemu_ram_get_idstr(RAMBlock *rb);
>  size_t qemu_ram_pagesize(RAMBlock *block);
> +size_t qemu_ram_pagesize_largest(void);
>  
>  void cpu_physical_memory_rw(hwaddr addr, uint8_t *buf,
>                              int len, int is_write);
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 96c9d6e..c9c1d5f 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -92,6 +92,7 @@ struct MigrationIncomingState {
>       */
>      QemuEvent main_thread_load_event;
>  
> +    size_t         largest_page_size;
>      bool           have_fault_thread;
>      QemuThread     fault_thread;
>      QemuSemaphore  fault_thread_sem;
> diff --git a/migration/migration.c b/migration/migration.c
> index 283677c..e0fdafc 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -387,6 +387,7 @@ static void process_incoming_migration_co(void *opaque)
>      int ret;
>  
>      mis = migration_incoming_state_new(f);
> +    mis->largest_page_size = qemu_ram_pagesize_largest();
>      postcopy_state_set(POSTCOPY_INCOMING_NONE);
>      migrate_set_state(&mis->state, MIGRATION_STATUS_NONE,
>                        MIGRATION_STATUS_ACTIVE);
> 
Reviewed-by: Laurent Vivier <lvivier@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 08/16] postcopy: Plumb pagesize down into place helpers
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 08/16] postcopy: Plumb pagesize down into place helpers Dr. David Alan Gilbert (git)
  2017-02-24 13:24   ` Juan Quintela
@ 2017-02-24 15:10   ` Laurent Vivier
  2017-02-24 15:21     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 15:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Now we deal with normal size pages and huge pages we need
> to tell the place handlers the size we're dealing with
> and make sure the temporary page is large enough.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/postcopy-ram.h |  6 +++--
>  migration/postcopy-ram.c         | 47 ++++++++++++++++++++++++----------------
>  migration/ram.c                  | 15 +++++++------
>  3 files changed, 40 insertions(+), 28 deletions(-)
> 
> diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> index 43bbbca..8e036b9 100644
> --- a/include/migration/postcopy-ram.h
> +++ b/include/migration/postcopy-ram.h
> @@ -74,13 +74,15 @@ void postcopy_discard_send_finish(MigrationState *ms,
>   *    to use other postcopy_ routines to allocate.
>   * returns 0 on success
>   */
> -int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from);
> +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> +                        size_t pagesize);
>  
>  /*
>   * Place a zero page at (host) atomically
>   * returns 0 on success
>   */
> -int postcopy_place_page_zero(MigrationIncomingState *mis, void *host);
> +int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> +                             size_t pagesize);
>  
>  /*
>   * Allocate a page of memory that can be mapped at a later point in time
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 1e3d22f..a8b7fed 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -321,7 +321,7 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>      migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0);
>  
>      if (mis->postcopy_tmp_page) {
> -        munmap(mis->postcopy_tmp_page, getpagesize());
> +        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
>          mis->postcopy_tmp_page = NULL;
>      }
>      trace_postcopy_ram_incoming_cleanup_exit();
> @@ -543,13 +543,14 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
>   * Place a host page (from) at (host) atomically
>   * returns 0 on success
>   */
> -int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
> +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> +                        size_t pagesize)
>  {
>      struct uffdio_copy copy_struct;
>  
>      copy_struct.dst = (uint64_t)(uintptr_t)host;
>      copy_struct.src = (uint64_t)(uintptr_t)from;
> -    copy_struct.len = getpagesize();
> +    copy_struct.len = pagesize;
>      copy_struct.mode = 0;
>  
>      /* copy also acks to the kernel waking the stalled thread up
> @@ -559,8 +560,8 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
>       */
>      if (ioctl(mis->userfault_fd, UFFDIO_COPY, &copy_struct)) {
>          int e = errno;
> -        error_report("%s: %s copy host: %p from: %p",
> -                     __func__, strerror(e), host, from);
> +        error_report("%s: %s copy host: %p from: %p (size: %zd)",
> +                     __func__, strerror(e), host, from, pagesize);
>  
>          return -e;
>      }
> @@ -573,23 +574,29 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
>   * Place a zero page at (host) atomically
>   * returns 0 on success
>   */
> -int postcopy_place_page_zero(MigrationIncomingState *mis, void *host)
> +int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> +                             size_t pagesize)
>  {
> -    struct uffdio_zeropage zero_struct;
> +    trace_postcopy_place_page_zero(host);
>  
> -    zero_struct.range.start = (uint64_t)(uintptr_t)host;
> -    zero_struct.range.len = getpagesize();
> -    zero_struct.mode = 0;
> +    if (pagesize == getpagesize()) {
> +        struct uffdio_zeropage zero_struct;
> +        zero_struct.range.start = (uint64_t)(uintptr_t)host;
> +        zero_struct.range.len = getpagesize();
> +        zero_struct.mode = 0;
>  
> -    if (ioctl(mis->userfault_fd, UFFDIO_ZEROPAGE, &zero_struct)) {
> -        int e = errno;
> -        error_report("%s: %s zero host: %p",
> -                     __func__, strerror(e), host);
> +        if (ioctl(mis->userfault_fd, UFFDIO_ZEROPAGE, &zero_struct)) {
> +            int e = errno;
> +            error_report("%s: %s zero host: %p",
> +                         __func__, strerror(e), host);
>  
> -        return -e;
> +            return -e;
> +        }
> +    } else {
> +        /* TODO: The kernel can't use UFFDIO_ZEROPAGE for hugepages */
> +        assert(0);
>      }
>  
> -    trace_postcopy_place_page_zero(host);
>      return 0;
>  }
>  
> @@ -604,7 +611,7 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host)
>  void *postcopy_get_tmp_page(MigrationIncomingState *mis)
>  {
>      if (!mis->postcopy_tmp_page) {
> -        mis->postcopy_tmp_page = mmap(NULL, getpagesize(),
> +        mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
>                               PROT_READ | PROT_WRITE, MAP_PRIVATE |
>                               MAP_ANONYMOUS, -1, 0);
>          if (mis->postcopy_tmp_page == MAP_FAILED) {
> @@ -649,13 +656,15 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
>      return -1;
>  }
>  
> -int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
> +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> +                        size_t pagesize)
>  {
>      assert(0);
>      return -1;
>  }
>  
> -int postcopy_place_page_zero(MigrationIncomingState *mis, void *host)
> +int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> +                        size_t pagesize)
>  {
>      assert(0);
>      return -1;
> diff --git a/migration/ram.c b/migration/ram.c
> index 136996a..ff448ef 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2354,6 +2354,7 @@ static int ram_load_postcopy(QEMUFile *f)
>          void *host = NULL;
>          void *page_buffer = NULL;
>          void *place_source = NULL;
> +        RAMBlock *block = NULL;
>          uint8_t ch;
>  
>          addr = qemu_get_be64(f);
> @@ -2363,7 +2364,7 @@ static int ram_load_postcopy(QEMUFile *f)
>          trace_ram_load_postcopy_loop((uint64_t)addr, flags);
>          place_needed = false;
>          if (flags & (RAM_SAVE_FLAG_COMPRESS | RAM_SAVE_FLAG_PAGE)) {
> -            RAMBlock *block = ram_block_from_stream(f, flags);
> +            block = ram_block_from_stream(f, flags);
>  
>              host = host_from_ram_block_offset(block, addr);
>              if (!host) {
> @@ -2438,14 +2439,14 @@ static int ram_load_postcopy(QEMUFile *f)
>  
>          if (place_needed) {
>              /* This gets called at the last target page in the host page */
> +            void *place_dest = host + TARGET_PAGE_SIZE - block->page_size;
> +
>              if (all_zero) {
> -                ret = postcopy_place_page_zero(mis,
> -                                               host + TARGET_PAGE_SIZE -
> -                                               qemu_host_page_size);
> +                ret = postcopy_place_page_zero(mis, place_dest,
> +                                               block->page_size);
>              } else {
> -                ret = postcopy_place_page(mis, host + TARGET_PAGE_SIZE -
> -                                               qemu_host_page_size,
> -                                               place_source);
> +                ret = postcopy_place_page(mis, place_dest,
> +                                          place_source, block->page_size);
>              }
>          }
>          if (!ret) {
> 

I think the "postcopy_tmp_page" part should be better in PATCH 07/16, so
we know why you introduce the largest_page_size field, and this avoids
to mix two kinds of change in this one (to place page and adjust tmp_page).

Anyway:
Reviewed-by: Laurent Vivier <lvivier@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 08/16] postcopy: Plumb pagesize down into place helpers
  2017-02-24 15:10   ` Laurent Vivier
@ 2017-02-24 15:21     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-24 15:21 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: qemu-devel, quintela, aarcange

* Laurent Vivier (lvivier@redhat.com) wrote:
> On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Now we deal with normal size pages and huge pages we need
> > to tell the place handlers the size we're dealing with
> > and make sure the temporary page is large enough.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/postcopy-ram.h |  6 +++--
> >  migration/postcopy-ram.c         | 47 ++++++++++++++++++++++++----------------
> >  migration/ram.c                  | 15 +++++++------
> >  3 files changed, 40 insertions(+), 28 deletions(-)
> > 
> > diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> > index 43bbbca..8e036b9 100644
> > --- a/include/migration/postcopy-ram.h
> > +++ b/include/migration/postcopy-ram.h
> > @@ -74,13 +74,15 @@ void postcopy_discard_send_finish(MigrationState *ms,
> >   *    to use other postcopy_ routines to allocate.
> >   * returns 0 on success
> >   */
> > -int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from);
> > +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> > +                        size_t pagesize);
> >  
> >  /*
> >   * Place a zero page at (host) atomically
> >   * returns 0 on success
> >   */
> > -int postcopy_place_page_zero(MigrationIncomingState *mis, void *host);
> > +int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> > +                             size_t pagesize);
> >  
> >  /*
> >   * Allocate a page of memory that can be mapped at a later point in time
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 1e3d22f..a8b7fed 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -321,7 +321,7 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
> >      migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0);
> >  
> >      if (mis->postcopy_tmp_page) {
> > -        munmap(mis->postcopy_tmp_page, getpagesize());
> > +        munmap(mis->postcopy_tmp_page, mis->largest_page_size);
> >          mis->postcopy_tmp_page = NULL;
> >      }
> >      trace_postcopy_ram_incoming_cleanup_exit();
> > @@ -543,13 +543,14 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
> >   * Place a host page (from) at (host) atomically
> >   * returns 0 on success
> >   */
> > -int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
> > +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> > +                        size_t pagesize)
> >  {
> >      struct uffdio_copy copy_struct;
> >  
> >      copy_struct.dst = (uint64_t)(uintptr_t)host;
> >      copy_struct.src = (uint64_t)(uintptr_t)from;
> > -    copy_struct.len = getpagesize();
> > +    copy_struct.len = pagesize;
> >      copy_struct.mode = 0;
> >  
> >      /* copy also acks to the kernel waking the stalled thread up
> > @@ -559,8 +560,8 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
> >       */
> >      if (ioctl(mis->userfault_fd, UFFDIO_COPY, &copy_struct)) {
> >          int e = errno;
> > -        error_report("%s: %s copy host: %p from: %p",
> > -                     __func__, strerror(e), host, from);
> > +        error_report("%s: %s copy host: %p from: %p (size: %zd)",
> > +                     __func__, strerror(e), host, from, pagesize);
> >  
> >          return -e;
> >      }
> > @@ -573,23 +574,29 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
> >   * Place a zero page at (host) atomically
> >   * returns 0 on success
> >   */
> > -int postcopy_place_page_zero(MigrationIncomingState *mis, void *host)
> > +int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> > +                             size_t pagesize)
> >  {
> > -    struct uffdio_zeropage zero_struct;
> > +    trace_postcopy_place_page_zero(host);
> >  
> > -    zero_struct.range.start = (uint64_t)(uintptr_t)host;
> > -    zero_struct.range.len = getpagesize();
> > -    zero_struct.mode = 0;
> > +    if (pagesize == getpagesize()) {
> > +        struct uffdio_zeropage zero_struct;
> > +        zero_struct.range.start = (uint64_t)(uintptr_t)host;
> > +        zero_struct.range.len = getpagesize();
> > +        zero_struct.mode = 0;
> >  
> > -    if (ioctl(mis->userfault_fd, UFFDIO_ZEROPAGE, &zero_struct)) {
> > -        int e = errno;
> > -        error_report("%s: %s zero host: %p",
> > -                     __func__, strerror(e), host);
> > +        if (ioctl(mis->userfault_fd, UFFDIO_ZEROPAGE, &zero_struct)) {
> > +            int e = errno;
> > +            error_report("%s: %s zero host: %p",
> > +                         __func__, strerror(e), host);
> >  
> > -        return -e;
> > +            return -e;
> > +        }
> > +    } else {
> > +        /* TODO: The kernel can't use UFFDIO_ZEROPAGE for hugepages */
> > +        assert(0);
> >      }
> >  
> > -    trace_postcopy_place_page_zero(host);
> >      return 0;
> >  }
> >  
> > @@ -604,7 +611,7 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host)
> >  void *postcopy_get_tmp_page(MigrationIncomingState *mis)
> >  {
> >      if (!mis->postcopy_tmp_page) {
> > -        mis->postcopy_tmp_page = mmap(NULL, getpagesize(),
> > +        mis->postcopy_tmp_page = mmap(NULL, mis->largest_page_size,
> >                               PROT_READ | PROT_WRITE, MAP_PRIVATE |
> >                               MAP_ANONYMOUS, -1, 0);
> >          if (mis->postcopy_tmp_page == MAP_FAILED) {
> > @@ -649,13 +656,15 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
> >      return -1;
> >  }
> >  
> > -int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from)
> > +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> > +                        size_t pagesize)
> >  {
> >      assert(0);
> >      return -1;
> >  }
> >  
> > -int postcopy_place_page_zero(MigrationIncomingState *mis, void *host)
> > +int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> > +                        size_t pagesize)
> >  {
> >      assert(0);
> >      return -1;
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 136996a..ff448ef 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -2354,6 +2354,7 @@ static int ram_load_postcopy(QEMUFile *f)
> >          void *host = NULL;
> >          void *page_buffer = NULL;
> >          void *place_source = NULL;
> > +        RAMBlock *block = NULL;
> >          uint8_t ch;
> >  
> >          addr = qemu_get_be64(f);
> > @@ -2363,7 +2364,7 @@ static int ram_load_postcopy(QEMUFile *f)
> >          trace_ram_load_postcopy_loop((uint64_t)addr, flags);
> >          place_needed = false;
> >          if (flags & (RAM_SAVE_FLAG_COMPRESS | RAM_SAVE_FLAG_PAGE)) {
> > -            RAMBlock *block = ram_block_from_stream(f, flags);
> > +            block = ram_block_from_stream(f, flags);
> >  
> >              host = host_from_ram_block_offset(block, addr);
> >              if (!host) {
> > @@ -2438,14 +2439,14 @@ static int ram_load_postcopy(QEMUFile *f)
> >  
> >          if (place_needed) {
> >              /* This gets called at the last target page in the host page */
> > +            void *place_dest = host + TARGET_PAGE_SIZE - block->page_size;
> > +
> >              if (all_zero) {
> > -                ret = postcopy_place_page_zero(mis,
> > -                                               host + TARGET_PAGE_SIZE -
> > -                                               qemu_host_page_size);
> > +                ret = postcopy_place_page_zero(mis, place_dest,
> > +                                               block->page_size);
> >              } else {
> > -                ret = postcopy_place_page(mis, host + TARGET_PAGE_SIZE -
> > -                                               qemu_host_page_size,
> > -                                               place_source);
> > +                ret = postcopy_place_page(mis, place_dest,
> > +                                          place_source, block->page_size);
> >              }
> >          }
> >          if (!ret) {
> > 
> 
> I think the "postcopy_tmp_page" part should be better in PATCH 07/16, so
> we know why you introduce the largest_page_size field, and this avoids
> to mix two kinds of change in this one (to place page and adjust tmp_page).

Well I did mention it in the commit message at the top of 7.

> Anyway:
> Reviewed-by: Laurent Vivier <lvivier@redhat.com>

Thanks

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 09/16] postcopy: Use temporary for placing zero huge pages
  2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 09/16] postcopy: Use temporary for placing zero huge pages Dr. David Alan Gilbert (git)
@ 2017-02-24 15:31   ` Laurent Vivier
  2017-02-24 15:46     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 15:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The kernel can't do UFFDIO_ZEROPAGE for huge pages, so we have
> to allocate a temporary (always zero) page and use UFFDIO_COPYPAGE
> on it.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> ---
>  include/migration/migration.h |  1 +
>  migration/postcopy-ram.c      | 23 +++++++++++++++++++++--
>  2 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index c9c1d5f..bd399fc 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -108,6 +108,7 @@ struct MigrationIncomingState {
>      QEMUFile *to_src_file;
>      QemuMutex rp_mutex;    /* We send replies from multiple threads */
>      void     *postcopy_tmp_page;
> +    void     *postcopy_tmp_zero_page;
>  
>      QEMUBH *bh;
>  
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index a8b7fed..4c736d2 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -324,6 +324,10 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>          munmap(mis->postcopy_tmp_page, mis->largest_page_size);
>          mis->postcopy_tmp_page = NULL;
>      }
> +    if (mis->postcopy_tmp_zero_page) {
> +        munmap(mis->postcopy_tmp_zero_page, mis->largest_page_size);
> +        mis->postcopy_tmp_zero_page = NULL;
> +    }
>      trace_postcopy_ram_incoming_cleanup_exit();
>      return 0;
>  }
> @@ -593,8 +597,23 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
>              return -e;
>          }
>      } else {
> -        /* TODO: The kernel can't use UFFDIO_ZEROPAGE for hugepages */
> -        assert(0);
> +        /* The kernel can't use UFFDIO_ZEROPAGE for hugepages */
> +        if (!mis->postcopy_tmp_zero_page) {
> +            mis->postcopy_tmp_zero_page = mmap(NULL, mis->largest_page_size,
> +                                               PROT_READ | PROT_WRITE,
> +                                               MAP_PRIVATE | MAP_ANONYMOUS,
> +                                               -1, 0);
> +            if (mis->postcopy_tmp_zero_page == MAP_FAILED) {
> +                int e = errno;
> +                mis->postcopy_tmp_zero_page = NULL;
> +                error_report("%s: %s mapping large zero page",
> +                             __func__, strerror(e));
> +                return -e;
> +            }
> +            memset(mis->postcopy_tmp_zero_page, '\0', mis->largest_page_size);
> +        }
> +        return postcopy_place_page(mis, host, mis->postcopy_tmp_zero_page,
> +                                   pagesize);
>      }

It's sad to have to allocate 1 huge page just to zero them.

Are you sure the kernel doesn't support UFFDIO_ZEROPAGE for huge pages?
It seems __mcopy_atomic() handles HUGETLB VMAs (it is called by
mfill_zeropage(), which is called by userfaultfd_zeropage()).

Anyway, the code looks good:
Reviewed-by: Laurent Vivier <lvivier@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/16] exec: ram_block_discard_range
  2017-02-24 14:08   ` Laurent Vivier
@ 2017-02-24 15:35     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-24 15:35 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: qemu-devel, quintela, aarcange

* Laurent Vivier (lvivier@redhat.com) wrote:
> On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Create ram_block_discard_range in exec.c to replace
> > postcopy_ram_discard_range and most of ram_discard_range.
> > 
> > Those two routines are a bit of a weird combination, and
> > ram_discard_range is about to get more complex for hugepages.
> > It's OS dependent code (so shouldn't be in migration/ram.c) but
> > it needs quite a bit of the innards of RAMBlock so doesn't belong in
> > the os*.c.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  exec.c                    | 59 +++++++++++++++++++++++++++++++++++++++++++++++
> >  include/exec/cpu-common.h |  1 +
> >  2 files changed, 60 insertions(+)
> > 
> > diff --git a/exec.c b/exec.c
> > index 8b9ed73..e040cdf 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -45,6 +45,12 @@
> >  #include "exec/address-spaces.h"
> >  #include "sysemu/xen-mapcache.h"
> >  #include "trace-root.h"
> > +
> > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > +#include <fcntl.h>
> > +#include <linux/falloc.h>
> > +#endif
> 
> Should it be in PATCH 05/16 instead?

Ah, yes, it should.

Dave

> Laurent
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 09/16] postcopy: Use temporary for placing zero huge pages
  2017-02-24 15:31   ` Laurent Vivier
@ 2017-02-24 15:46     ` Dr. David Alan Gilbert
  2017-02-24 17:24       ` Laurent Vivier
  0 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-24 15:46 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: qemu-devel, quintela, aarcange

* Laurent Vivier (lvivier@redhat.com) wrote:
> On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The kernel can't do UFFDIO_ZEROPAGE for huge pages, so we have
> > to allocate a temporary (always zero) page and use UFFDIO_COPYPAGE
> > on it.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reviewed-by: Juan Quintela <quintela@redhat.com>
> > ---
> >  include/migration/migration.h |  1 +
> >  migration/postcopy-ram.c      | 23 +++++++++++++++++++++--
> >  2 files changed, 22 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index c9c1d5f..bd399fc 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -108,6 +108,7 @@ struct MigrationIncomingState {
> >      QEMUFile *to_src_file;
> >      QemuMutex rp_mutex;    /* We send replies from multiple threads */
> >      void     *postcopy_tmp_page;
> > +    void     *postcopy_tmp_zero_page;
> >  
> >      QEMUBH *bh;
> >  
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index a8b7fed..4c736d2 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -324,6 +324,10 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
> >          munmap(mis->postcopy_tmp_page, mis->largest_page_size);
> >          mis->postcopy_tmp_page = NULL;
> >      }
> > +    if (mis->postcopy_tmp_zero_page) {
> > +        munmap(mis->postcopy_tmp_zero_page, mis->largest_page_size);
> > +        mis->postcopy_tmp_zero_page = NULL;
> > +    }
> >      trace_postcopy_ram_incoming_cleanup_exit();
> >      return 0;
> >  }
> > @@ -593,8 +597,23 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
> >              return -e;
> >          }
> >      } else {
> > -        /* TODO: The kernel can't use UFFDIO_ZEROPAGE for hugepages */
> > -        assert(0);
> > +        /* The kernel can't use UFFDIO_ZEROPAGE for hugepages */
> > +        if (!mis->postcopy_tmp_zero_page) {
> > +            mis->postcopy_tmp_zero_page = mmap(NULL, mis->largest_page_size,
> > +                                               PROT_READ | PROT_WRITE,
> > +                                               MAP_PRIVATE | MAP_ANONYMOUS,
> > +                                               -1, 0);
> > +            if (mis->postcopy_tmp_zero_page == MAP_FAILED) {
> > +                int e = errno;
> > +                mis->postcopy_tmp_zero_page = NULL;
> > +                error_report("%s: %s mapping large zero page",
> > +                             __func__, strerror(e));
> > +                return -e;
> > +            }
> > +            memset(mis->postcopy_tmp_zero_page, '\0', mis->largest_page_size);
> > +        }
> > +        return postcopy_place_page(mis, host, mis->postcopy_tmp_zero_page,
> > +                                   pagesize);
> >      }
> 
> It's sad to have to allocate 1 huge page just to zero them.
> 
> Are you sure the kernel doesn't support UFFDIO_ZEROPAGE for huge pages?
> It seems __mcopy_atomic() handles HUGETLB VMAs (it is called by
> mfill_zeropage(), which is called by userfaultfd_zeropage()).

That's as I understand it from Andrea; and I think it does fail if you try it.

> Anyway, the code looks good:
> Reviewed-by: Laurent Vivier <lvivier@redhat.com>

Thanks.

Dave

> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/16] postcopy: Load huge pages in one go
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 10/16] postcopy: Load huge pages in one go Dr. David Alan Gilbert (git)
@ 2017-02-24 15:54   ` Laurent Vivier
  2017-02-24 16:32     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 15:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:33, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The existing postcopy RAM load loop already ensures that it
> glues together whole host-pages from the target page size chunks sent
> over the wire.  Modify the definition of host page that it uses
> to be the RAM block page size and thus be huge pages where appropriate.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> ---
>  migration/ram.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index ff448ef..88d9444 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2342,7 +2342,7 @@ static int ram_load_postcopy(QEMUFile *f)
>  {
>      int flags = 0, ret = 0;
>      bool place_needed = false;
> -    bool matching_page_sizes = qemu_host_page_size == TARGET_PAGE_SIZE;
> +    bool matching_page_sizes = false;

The false value is not obvious.
Is gcc smart enough to detect that "matching_page_sizes" is only used (in
the "switch ()") after it has really been initialized (in the "if ()")?

>      MigrationIncomingState *mis = migration_incoming_get_current();
>      /* Temporary page that is later 'placed' */
>      void *postcopy_host_page = postcopy_get_tmp_page(mis);
> @@ -2372,8 +2372,11 @@ static int ram_load_postcopy(QEMUFile *f)
>                  ret = -EINVAL;
>                  break;
>              }
> +            matching_page_sizes = block->page_size == TARGET_PAGE_SIZE;
>              /*
> -             * Postcopy requires that we place whole host pages atomically.
> +             * Postcopy requires that we place whole host pages atomically;
> +             * these may be huge pages for RAMBlocks that are backed by
> +             * hugetlbfs.
>               * To make it atomic, the data is read into a temporary page
>               * that's moved into place later.
>               * The migration protocol uses,  possibly smaller, target-pages
> @@ -2381,9 +2384,9 @@ static int ram_load_postcopy(QEMUFile *f)
>               * of a host page in order.
>               */
>              page_buffer = postcopy_host_page +
> -                          ((uintptr_t)host & ~qemu_host_page_mask);
> +                          ((uintptr_t)host & (block->page_size - 1));
>              /* If all TP are zero then we can optimise the place */
> -            if (!((uintptr_t)host & ~qemu_host_page_mask)) {
> +            if (!((uintptr_t)host & (block->page_size - 1))) {
>                  all_zero = true;
>              } else {
>                  /* not the 1st TP within the HP */
> @@ -2401,7 +2404,7 @@ static int ram_load_postcopy(QEMUFile *f)
>               * page
>               */
>              place_needed = (((uintptr_t)host + TARGET_PAGE_SIZE) &
> -                                     ~qemu_host_page_mask) == 0;
> +                                     (block->page_size - 1)) == 0;
>              place_source = postcopy_host_page;
>          }
>          last_host = host;
> 

Reviewed-by: Laurent Vivier <lvivier@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 11/16] postcopy: Mask fault addresses to huge page boundary
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 11/16] postcopy: Mask fault addresses to huge page boundary Dr. David Alan Gilbert (git)
@ 2017-02-24 15:59   ` Laurent Vivier
  2017-02-24 16:34     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 15:59 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:33, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Currently the fault address received by userfault is rounded to
> the host page boundary and a host page is requested from the source.
> Use the current RAMBlock page size instead of the general host page
> size so that for RAMBlocks backed by huge pages we request the whole
> huge page.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> ---
>  include/exec/memory.h    | 1 -
>  migration/postcopy-ram.c | 7 +++----
>  2 files changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 987f925..c428891 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -1614,7 +1614,6 @@ MemTxResult address_space_read_continue(AddressSpace *as, hwaddr addr,
>  MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
>                                      MemTxAttrs attrs, uint8_t *buf, int len);
>  void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
> -
>  static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
>  {
>      if (is_write) {

This hunk removing one blank line is strange...

> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 4c736d2..03cbd6e 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -403,7 +403,6 @@ static void *postcopy_ram_fault_thread(void *opaque)
>      MigrationIncomingState *mis = opaque;
>      struct uffd_msg msg;
>      int ret;
> -    size_t hostpagesize = getpagesize();
>      RAMBlock *rb = NULL;
>      RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
>  
> @@ -470,7 +469,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
>              break;
>          }
>  
> -        rb_offset &= ~(hostpagesize - 1);
> +        rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
>          trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
>                                                  qemu_ram_get_idstr(rb),
>                                                  rb_offset);
> @@ -482,11 +481,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
>          if (rb != last_rb) {
>              last_rb = rb;
>              migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
> -                                     rb_offset, hostpagesize);
> +                                     rb_offset, qemu_ram_pagesize(rb));
>          } else {
>              /* Save some space */
>              migrate_send_rp_req_pages(mis, NULL,
> -                                     rb_offset, hostpagesize);
> +                                     rb_offset, qemu_ram_pagesize(rb));
>          }
>      }
>      trace_postcopy_ram_fault_thread_exit();
> 
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
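A quick worked example of the new mask (numbers purely illustrative): with a
2MB RAMBlock page size, qemu_ram_pagesize(rb) - 1 is 0x1fffff, so a fault at
rb_offset 0x12345678 is rounded down to 0x12200000 and the whole 2MB page
covering the fault is requested from the source.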

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/16] Fold postcopy_ram_discard_range into ram_discard_range
  2017-02-24 14:26   ` Laurent Vivier
@ 2017-02-24 16:02     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-24 16:02 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: qemu-devel, quintela, aarcange

* Laurent Vivier (lvivier@redhat.com) wrote:
> On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Using the previously created ram_block_discard_range,
> > kill off postcopy_ram_discard_range.
> > ram_discard_range is just a wrapper that does the name lookup.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/postcopy-ram.h |  7 -------
> >  migration/postcopy-ram.c         | 30 +-----------------------------
> >  migration/ram.c                  | 24 +++---------------------
> >  migration/trace-events           |  2 +-
> >  4 files changed, 5 insertions(+), 58 deletions(-)
> > 
> > diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> > index b6a7491f..43bbbca 100644
> > --- a/include/migration/postcopy-ram.h
> > +++ b/include/migration/postcopy-ram.h
> > @@ -35,13 +35,6 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages);
> >  int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis);
> >  
> >  /*
> > - * Discard the contents of 'length' bytes from 'start'
> > - * We can assume that if we've been called postcopy_ram_hosttest returned true
> > - */
> > -int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> > -                               size_t length);
> > -
> > -/*
> >   * Userfault requires us to mark RAM as NOHUGEPAGE prior to discard
> >   * however leaving it until after precopy means that most of the precopy
> >   * data is still THPd
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index a40dddb..1e3d22f 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -200,27 +200,6 @@ out:
> >      return ret;
> >  }
> >  
> > -/**
> > - * postcopy_ram_discard_range: Discard a range of memory.
> > - * We can assume that if we've been called postcopy_ram_hosttest returned true.
> > - *
> > - * @mis: Current incoming migration state.
> > - * @start, @length: range of memory to discard.
> > - *
> > - * returns: 0 on success.
> > - */
> > -int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> > -                               size_t length)
> > -{
> > -    trace_postcopy_ram_discard_range(start, length);
> > -    if (madvise(start, length, MADV_DONTNEED)) {
> > -        error_report("%s MADV_DONTNEED: %s", __func__, strerror(errno));
> > -        return -1;
> > -    }
> > -
> > -    return 0;
> > -}
> > -
> >  /*
> >   * Setup an area of RAM so that it *can* be used for postcopy later; this
> >   * must be done right at the start prior to pre-copy.
> > @@ -239,7 +218,7 @@ static int init_range(const char *block_name, void *host_addr,
> >       * - we're going to get the copy from the source anyway.
> >       * (Precopy will just overwrite this data, so doesn't need the discard)
> >       */
> > -    if (postcopy_ram_discard_range(mis, host_addr, length)) {
> > +    if (ram_discard_range(mis, block_name, 0, length)) {
> >          return -1;
> >      }
> >  
> > @@ -658,13 +637,6 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
> >      return -1;
> >  }
> >  
> > -int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> > -                               size_t length)
> > -{
> > -    assert(0);
> > -    return -1;
> > -}
> > -
> >  int postcopy_ram_prepare_discard(MigrationIncomingState *mis)
> >  {
> >      assert(0);
> > diff --git a/migration/ram.c b/migration/ram.c
> > index d33bd21..136996a 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1845,6 +1845,8 @@ int ram_discard_range(MigrationIncomingState *mis,
> >  {
> >      int ret = -1;
> >  
> > +    trace_ram_discard_range(block_name, start, length);
> > +
> >      rcu_read_lock();
> 
> I think you take the rcu_read_lock() twice: here and in
> ram_block_discard_range().

Hmm, yes I can lose the one in ram_block_discard_range.
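Roughly, the shape that leaves things in (a sketch based on this exchange, not
the exact committed hunk): the wrapper holds the single rcu_read_lock() around
the name lookup and the call, and ram_block_discard_range() no longer takes it.

    int ram_discard_range(MigrationIncomingState *mis, const char *block_name,
                          uint64_t start, size_t length)
    {
        int ret = -1;

        trace_ram_discard_range(block_name, start, length);

        rcu_read_lock();
        RAMBlock *rb = qemu_ram_block_by_name(block_name);

        if (!rb) {
            error_report("ram_discard_range: Failed to find block '%s'",
                         block_name);
        } else {
            ret = ram_block_discard_range(rb, start, length);
        }
        rcu_read_unlock();

        return ret;
    }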

> I think you should merge this patch with PATCH 04/16, as it's just code
> copy.

OK, I'd done it as 'add the new one' and 'take the old one away';
but I can do that.

Dave

> 
> Laurent
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 12/16] postcopy: Send whole huge pages
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 12/16] postcopy: Send whole huge pages Dr. David Alan Gilbert (git)
@ 2017-02-24 16:06   ` Laurent Vivier
  0 siblings, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 16:06 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:33, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The RAM save code uses ram_save_host_page to send whole
> host pages at a time;  change this to use the host page size associated
> with the RAM Block which may be a huge page.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> ---
>  migration/ram.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 88d9444..2350f71 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1281,6 +1281,8 @@ static int ram_save_target_page(MigrationState *ms, QEMUFile *f,
>   *                     offset to point into the middle of a host page
>   *                     in which case the remainder of the hostpage is sent.
>   *                     Only dirty target pages are sent.
> + *                     Note that the host page size may be a huge page for this
> + *                     block.
>   *
>   * Returns: Number of pages written.
>   *
> @@ -1299,6 +1301,8 @@ static int ram_save_host_page(MigrationState *ms, QEMUFile *f,
>                                ram_addr_t dirty_ram_abs)
>  {
>      int tmppages, pages = 0;
> +    size_t pagesize = qemu_ram_pagesize(pss->block);
> +
>      do {
>          tmppages = ram_save_target_page(ms, f, pss, last_stage,
>                                          bytes_transferred, dirty_ram_abs);
> @@ -1309,7 +1313,7 @@ static int ram_save_host_page(MigrationState *ms, QEMUFile *f,
>          pages += tmppages;
>          pss->offset += TARGET_PAGE_SIZE;
>          dirty_ram_abs += TARGET_PAGE_SIZE;
> -    } while (pss->offset & (qemu_host_page_size - 1));
> +    } while (pss->offset & (pagesize - 1));
>  
>      /* The offset we leave with is the last one we looked at */
>      pss->offset -= TARGET_PAGE_SIZE;
> 
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
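For scale, a worked number (not from the patch itself): with 2MB host pages and
the usual 4KB TARGET_PAGE_SIZE, that do/while loop now runs 2MB / 4KB = 512
iterations per host page, so the whole huge page's worth of target pages is
sent before pss->offset leaves the page.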

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 13/16] postcopy: Allow hugepages
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 13/16] postcopy: Allow hugepages Dr. David Alan Gilbert (git)
@ 2017-02-24 16:07   ` Laurent Vivier
  0 siblings, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 16:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:33, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Allow huge pages in postcopy.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> ---
>  migration/postcopy-ram.c | 25 +------------------------
>  1 file changed, 1 insertion(+), 24 deletions(-)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 03cbd6e..6b30b43 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -85,24 +85,6 @@ static bool ufd_version_check(int ufd)
>  }
>  
>  /*
> - * Check for things that postcopy won't support; returns 0 if the block
> - * is fine.
> - */
> -static int check_range(const char *block_name, void *host_addr,
> -                      ram_addr_t offset, ram_addr_t length, void *opaque)
> -{
> -    RAMBlock *rb = qemu_ram_block_by_name(block_name);
> -
> -    if (qemu_ram_pagesize(rb) > getpagesize()) {
> -        error_report("Postcopy doesn't support large page sizes yet (%s)",
> -                     block_name);
> -        return -E2BIG;
> -    }
> -
> -    return 0;
> -}
> -
> -/*
>   * Note: This has the side effect of munlock'ing all of RAM, that's
>   * normally fine since if the postcopy succeeds it gets turned back on at the
>   * end.
> @@ -122,12 +104,6 @@ bool postcopy_ram_supported_by_host(void)
>          goto out;
>      }
>  
> -    /* Check for anything about the RAMBlocks we don't support */
> -    if (qemu_ram_foreach_block(check_range, NULL)) {
> -        /* check_range will have printed its own error */
> -        goto out;
> -    }
> -
>      ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
>      if (ufd == -1) {
>          error_report("%s: userfaultfd not available: %s", __func__,
> @@ -139,6 +115,7 @@ bool postcopy_ram_supported_by_host(void)
>      if (!ufd_version_check(ufd)) {
>          goto out;
>      }
> +    /* TODO: Only allow huge pages if the kernel supports it */
>  
>      /*
>       * userfault and mlock don't go together; we'll put it back later if
> 
Reviewed-by: Laurent Vivier <lvivier@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 14/16] postcopy: Update userfaultfd.h header
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 14/16] postcopy: Update userfaultfd.h header Dr. David Alan Gilbert (git)
@ 2017-02-24 16:09   ` Laurent Vivier
  0 siblings, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 16:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:33, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> We use a new userfaultfd define, so update the header.
> (Not needed if someone just runs the update script once it's
> gone into the main kernel).
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> ---
>  linux-headers/linux/userfaultfd.h | 81 ++++++++++++++++++++++++++++++++++-----
>  1 file changed, 71 insertions(+), 10 deletions(-)
> 
> diff --git a/linux-headers/linux/userfaultfd.h b/linux-headers/linux/userfaultfd.h
> index 19e8453..a7c1a62 100644
> --- a/linux-headers/linux/userfaultfd.h
> +++ b/linux-headers/linux/userfaultfd.h
> @@ -11,13 +11,19 @@
>  
>  #include <linux/types.h>
>  
> -#define UFFD_API ((__u64)0xAA)
>  /*
> - * After implementing the respective features it will become:
> - * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
> - *			      UFFD_FEATURE_EVENT_FORK)
> + * If the UFFDIO_API is upgraded someday, the UFFDIO_UNREGISTER and
> + * UFFDIO_WAKE ioctls should be defined as _IOW and not as _IOR.  In
> + * userfaultfd.h we assumed the kernel was reading (instead _IOC_READ
> + * means the userland is reading).
>   */
> -#define UFFD_API_FEATURES (0)
> +#define UFFD_API ((__u64)0xAA)
> +#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP |	\
> +			   UFFD_FEATURE_EVENT_FORK |		\
> +			   UFFD_FEATURE_EVENT_REMAP |		\
> +			   UFFD_FEATURE_EVENT_MADVDONTNEED |	\
> +			   UFFD_FEATURE_MISSING_HUGETLBFS |	\
> +			   UFFD_FEATURE_MISSING_SHMEM)
>  #define UFFD_API_IOCTLS				\
>  	((__u64)1 << _UFFDIO_REGISTER |		\
>  	 (__u64)1 << _UFFDIO_UNREGISTER |	\
> @@ -25,7 +31,11 @@
>  #define UFFD_API_RANGE_IOCTLS			\
>  	((__u64)1 << _UFFDIO_WAKE |		\
>  	 (__u64)1 << _UFFDIO_COPY |		\
> -	 (__u64)1 << _UFFDIO_ZEROPAGE)
> +	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
> +	 (__u64)1 << _UFFDIO_WRITEPROTECT)
> +#define UFFD_API_RANGE_IOCTLS_BASIC		\
> +	((__u64)1 << _UFFDIO_WAKE |		\
> +	 (__u64)1 << _UFFDIO_COPY)
>  
>  /*
>   * Valid ioctl command number range with this API is from 0x00 to
> @@ -40,6 +50,7 @@
>  #define _UFFDIO_WAKE			(0x02)
>  #define _UFFDIO_COPY			(0x03)
>  #define _UFFDIO_ZEROPAGE		(0x04)
> +#define _UFFDIO_WRITEPROTECT		(0x05)
>  #define _UFFDIO_API			(0x3F)
>  
>  /* userfaultfd ioctl ids */
> @@ -56,6 +67,8 @@
>  				      struct uffdio_copy)
>  #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
>  				      struct uffdio_zeropage)
> +#define UFFDIO_WRITEPROTECT	_IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
> +				      struct uffdio_writeprotect)
>  
>  /* read() structure */
>  struct uffd_msg {
> @@ -72,6 +85,21 @@ struct uffd_msg {
>  		} pagefault;
>  
>  		struct {
> +			__u32	ufd;
> +		} fork;
> +
> +		struct {
> +			__u64	from;
> +			__u64	to;
> +			__u64	len;
> +		} remap;
> +
> +		struct {
> +			__u64	start;
> +			__u64	end;
> +		} madv_dn;
> +
> +		struct {
>  			/* unused reserved fields */
>  			__u64	reserved1;
>  			__u64	reserved2;
> @@ -84,9 +112,9 @@ struct uffd_msg {
>   * Start at 0x12 and not at 0 to be more strict against bugs.
>   */
>  #define UFFD_EVENT_PAGEFAULT	0x12
> -#if 0 /* not available yet */
>  #define UFFD_EVENT_FORK		0x13
> -#endif
> +#define UFFD_EVENT_REMAP	0x14
> +#define UFFD_EVENT_MADVDONTNEED	0x15
>  
>  /* flags for UFFD_EVENT_PAGEFAULT */
>  #define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
> @@ -104,11 +132,37 @@ struct uffdio_api {
>  	 * Note: UFFD_EVENT_PAGEFAULT and UFFD_PAGEFAULT_FLAG_WRITE
>  	 * are to be considered implicitly always enabled in all kernels as
>  	 * long as the uffdio_api.api requested matches UFFD_API.
> +	 *
> +	 * UFFD_FEATURE_MISSING_HUGETLBFS means an UFFDIO_REGISTER
> +	 * with UFFDIO_REGISTER_MODE_MISSING mode will succeed on
> +	 * hugetlbfs virtual memory ranges. Adding or not adding
> +	 * UFFD_FEATURE_MISSING_HUGETLBFS to uffdio_api.features has
> +	 * no real functional effect after UFFDIO_API returns, but
> +	 * it's only useful for an initial feature set probe at
> +	 * UFFDIO_API time. There are two ways to use it:
> +	 *
> +	 * 1) by adding UFFD_FEATURE_MISSING_HUGETLBFS to the
> +	 *    uffdio_api.features before calling UFFDIO_API, an error
> +	 *    will be returned by UFFDIO_API on a kernel without
> +	 *    hugetlbfs missing support
> +	 *
> +	 * 2) the UFFD_FEATURE_MISSING_HUGETLBFS can not be added in
> +	 *    uffdio_api.features and instead it will be set by the
> +	 *    kernel in the uffdio_api.features if the kernel supports
> +	 *    it, so userland can later check if the feature flag is
> +	 *    present in uffdio_api.features after UFFDIO_API
> +	 *    succeeded.
> +	 *
> +	 * UFFD_FEATURE_MISSING_SHMEM works the same as
> +	 * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem
> +	 * (i.e. tmpfs and other shmem based APIs).
>  	 */
> -#if 0 /* not available yet */
>  #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
>  #define UFFD_FEATURE_EVENT_FORK			(1<<1)
> -#endif
> +#define UFFD_FEATURE_EVENT_REMAP		(1<<2)
> +#define UFFD_FEATURE_EVENT_MADVDONTNEED		(1<<3)
> +#define UFFD_FEATURE_MISSING_HUGETLBFS		(1<<4)
> +#define UFFD_FEATURE_MISSING_SHMEM		(1<<5)
>  	__u64 features;
>  
>  	__u64 ioctls;
> @@ -164,4 +218,11 @@ struct uffdio_zeropage {
>  	__s64 zeropage;
>  };
>  
> +struct uffdio_writeprotect {
> +	struct uffdio_range range;
> +	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
> +#define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
> +#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> +	__u64 mode;
> +};
>  #endif /* _LINUX_USERFAULTFD_H */
> 
Reviewed-by: Laurent Vivier <lvivier@redhat.com>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 15/16] postcopy: Check for userfault+hugepage feature
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 15/16] postcopy: Check for userfault+hugepage feature Dr. David Alan Gilbert (git)
@ 2017-02-24 16:12   ` Laurent Vivier
  0 siblings, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 16:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:33, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> We need extra Linux kernel support (~4.11) to support userfaults
> on hugetlbfs; check for them.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> ---
>  migration/postcopy-ram.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 6b30b43..102fb61 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -81,6 +81,17 @@ static bool ufd_version_check(int ufd)
>          return false;
>      }
>  
> +    if (getpagesize() != ram_pagesize_summary()) {
> +        bool have_hp = false;
> +        /* We've got a huge page */
> +#ifdef UFFD_FEATURE_MISSING_HUGETLBFS
> +        have_hp = api_struct.features & UFFD_FEATURE_MISSING_HUGETLBFS;
> +#endif
> +        if (!have_hp) {
> +            error_report("Userfault on this host does not support huge pages");
> +            return false;
> +        }
> +    }
>      return true;
>  }
>  
> @@ -115,7 +126,6 @@ bool postcopy_ram_supported_by_host(void)
>      if (!ufd_version_check(ufd)) {
>          goto out;
>      }
> -    /* TODO: Only allow huge pages if the kernel supports it */
>  
>      /*
>       * userfault and mlock don't go together; we'll put it back later if
> 
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
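For reference, a minimal sketch of the probe being done here (assuming ufd was
opened via syscall(__NR_userfaultfd, O_CLOEXEC); error handling trimmed):

    struct uffdio_api api_struct;
    bool have_hp = false;

    api_struct.api = UFFD_API;
    api_struct.features = 0;             /* probe, don't demand features */
    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
        return false;                    /* kernel rejected this API version */
    }
#ifdef UFFD_FEATURE_MISSING_HUGETLBFS
    have_hp = api_struct.features & UFFD_FEATURE_MISSING_HUGETLBFS;
#endif
    /* have_hp now says whether userfault works on hugetlbfs ranges */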

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 16/16] postcopy: Add doc about hugepages and postcopy
  2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 16/16] postcopy: Add doc about hugepages and postcopy Dr. David Alan Gilbert (git)
  2017-02-24 13:25   ` Juan Quintela
@ 2017-02-24 16:12   ` Laurent Vivier
  1 sibling, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 16:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, quintela; +Cc: aarcange

On 06/02/2017 18:33, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  docs/migration.txt | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/docs/migration.txt b/docs/migration.txt
> index 6503c17..b462ead 100644
> --- a/docs/migration.txt
> +++ b/docs/migration.txt
> @@ -482,3 +482,16 @@ request for a page that has already been sent is ignored.  Duplicate requests
>  such as this can happen as a page is sent at about the same time the
>  destination accesses it.
>  
> +=== Postcopy with hugepages ===
> +
> +Postcopy now works with hugetlbfs backed memory:
> +  a) The linux kernel on the destination must support userfault on hugepages.
> +  b) The huge-page configuration on the source and destination VMs must be
> +     identical; i.e. RAMBlocks on both sides must use the same page size.
> +  c) Note that -mem-path /dev/hugepages  will fall back to allocating normal
> +     RAM if it doesn't have enough hugepages, triggering (b) to fail.
> +     Using -mem-prealloc enforces the allocation using hugepages.
> +  d) Care should be taken with the size of hugepage used; postcopy with 2MB
> +     hugepages works well, however 1GB hugepages are likely to be problematic
> +     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
> +     and until the full page is transferred the destination thread is blocked.
> 
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
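As a concrete illustration of (b)-(d) above (sizes and paths here are only
examples, not from the patch; both source and destination must be started with
the same hugepage setup):

    qemu-system-x86_64 -m 4G -mem-path /dev/hugepages -mem-prealloc ...
    # or with an explicit backend object:
    qemu-system-x86_64 -m 4G \
        -object memory-backend-file,id=mem,size=4G,mem-path=/dev/hugepages,prealloc=on \
        -numa node,memdev=mem ...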

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/16] postcopy: Load huge pages in one go
  2017-02-24 15:54   ` Laurent Vivier
@ 2017-02-24 16:32     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-24 16:32 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: qemu-devel, quintela, aarcange

* Laurent Vivier (lvivier@redhat.com) wrote:
> On 06/02/2017 18:33, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The existing postcopy RAM load loop already ensures that it
> > glues together whole host-pages from the target page size chunks sent
> > over the wire.  Modify the definition of host page that it uses
> > to be the RAM block page size and thus be huge pages where appropriate.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reviewed-by: Juan Quintela <quintela@redhat.com>
> > ---
> >  migration/ram.c | 13 ++++++++-----
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> > 
> > diff --git a/migration/ram.c b/migration/ram.c
> > index ff448ef..88d9444 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -2342,7 +2342,7 @@ static int ram_load_postcopy(QEMUFile *f)
> >  {
> >      int flags = 0, ret = 0;
> >      bool place_needed = false;
> > -    bool matching_page_sizes = qemu_host_page_size == TARGET_PAGE_SIZE;
> > +    bool matching_page_sizes = false;
> 
> The false value is not obvious.
> Is gcc smart enough to detect you use "matching_page_sizes" (in the
> "switch ()") only when it has been really initialized (in the "if ()")?

4.8.5-8 on RHEL 7 doesn't like it if I drop the false.
(That took a bit of searching, RHEL 6 is OK, f25 is OK)
But generally I've found these ram-load loops are really good
for tickling gcc's paranoia.
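For anyone curious, the shape that trips it is roughly this (a simplified
sketch, not the real loop; read_flags() and the FLAG_* names are made-up
stand-ins):

    bool matching_page_sizes = false;   /* gcc 4.8 warns if this init is dropped */

    for (;;) {
        int flags = read_flags();

        if (flags & FLAG_BLOCK_HEADER) {
            matching_page_sizes = (page_size == TARGET_PAGE_SIZE);
        } else if ((flags & FLAG_PAGE) && !matching_page_sizes) {
            /* gcc can't prove a block header always precedes a page, so
             * without the initialiser it reports a maybe-uninitialized
             * use here */
        }
        if (flags & FLAG_EOS) {
            break;
        }
    }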

> >      MigrationIncomingState *mis = migration_incoming_get_current();
> >      /* Temporary page that is later 'placed' */
> >      void *postcopy_host_page = postcopy_get_tmp_page(mis);
> > @@ -2372,8 +2372,11 @@ static int ram_load_postcopy(QEMUFile *f)
> >                  ret = -EINVAL;
> >                  break;
> >              }
> > +            matching_page_sizes = block->page_size == TARGET_PAGE_SIZE;
> >              /*
> > -             * Postcopy requires that we place whole host pages atomically.
> > +             * Postcopy requires that we place whole host pages atomically;
> > +             * these may be huge pages for RAMBlocks that are backed by
> > +             * hugetlbfs.
> >               * To make it atomic, the data is read into a temporary page
> >               * that's moved into place later.
> >               * The migration protocol uses,  possibly smaller, target-pages
> > @@ -2381,9 +2384,9 @@ static int ram_load_postcopy(QEMUFile *f)
> >               * of a host page in order.
> >               */
> >              page_buffer = postcopy_host_page +
> > -                          ((uintptr_t)host & ~qemu_host_page_mask);
> > +                          ((uintptr_t)host & (block->page_size - 1));
> >              /* If all TP are zero then we can optimise the place */
> > -            if (!((uintptr_t)host & ~qemu_host_page_mask)) {
> > +            if (!((uintptr_t)host & (block->page_size - 1))) {
> >                  all_zero = true;
> >              } else {
> >                  /* not the 1st TP within the HP */
> > @@ -2401,7 +2404,7 @@ static int ram_load_postcopy(QEMUFile *f)
> >               * page
> >               */
> >              place_needed = (((uintptr_t)host + TARGET_PAGE_SIZE) &
> > -                                     ~qemu_host_page_mask) == 0;
> > +                                     (block->page_size - 1)) == 0;
> >              place_source = postcopy_host_page;
> >          }
> >          last_host = host;
> > 
> 
> Reviewed-by: Laurent Vivier <lvivier@redhat.com>

Thanks.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 11/16] postcopy: Mask fault addresses to huge page boundary
  2017-02-24 15:59   ` Laurent Vivier
@ 2017-02-24 16:34     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-24 16:34 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: qemu-devel, quintela, aarcange

* Laurent Vivier (lvivier@redhat.com) wrote:
> On 06/02/2017 18:33, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Currently the fault address received by userfault is rounded to
> > the host page boundary and a host page is requested from the source.
> > Use the current RAMBlock page size instead of the general host page
> > size so that for RAMBlocks backed by huge pages we request the whole
> > huge page.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reviewed-by: Juan Quintela <quintela@redhat.com>
> > ---
> >  include/exec/memory.h    | 1 -
> >  migration/postcopy-ram.c | 7 +++----
> >  2 files changed, 3 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index 987f925..c428891 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -1614,7 +1614,6 @@ MemTxResult address_space_read_continue(AddressSpace *as, hwaddr addr,
> >  MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
> >                                      MemTxAttrs attrs, uint8_t *buf, int len);
> >  void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
> > -
> >  static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
> >  {
> >      if (is_write) {
> 
> This hunk removing one blank line is strange...

Oops, cleaned up.

Dave

> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 4c736d2..03cbd6e 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -403,7 +403,6 @@ static void *postcopy_ram_fault_thread(void *opaque)
> >      MigrationIncomingState *mis = opaque;
> >      struct uffd_msg msg;
> >      int ret;
> > -    size_t hostpagesize = getpagesize();
> >      RAMBlock *rb = NULL;
> >      RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
> >  
> > @@ -470,7 +469,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
> >              break;
> >          }
> >  
> > -        rb_offset &= ~(hostpagesize - 1);
> > +        rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
> >          trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
> >                                                  qemu_ram_get_idstr(rb),
> >                                                  rb_offset);
> > @@ -482,11 +481,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
> >          if (rb != last_rb) {
> >              last_rb = rb;
> >              migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
> > -                                     rb_offset, hostpagesize);
> > +                                     rb_offset, qemu_ram_pagesize(rb));
> >          } else {
> >              /* Save some space */
> >              migrate_send_rp_req_pages(mis, NULL,
> > -                                     rb_offset, hostpagesize);
> > +                                     rb_offset, qemu_ram_pagesize(rb));
> >          }
> >      }
> >      trace_postcopy_ram_fault_thread_exit();
> > 
> Reviewed-by: Laurent Vivier <lvivier@redhat.com>
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/16] exec: ram_block_discard_range
  2017-02-24 14:04   ` Laurent Vivier
@ 2017-02-24 16:50     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-24 16:50 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: qemu-devel, quintela, aarcange

* Laurent Vivier (lvivier@redhat.com) wrote:
> On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Create ram_block_discard_range in exec.c to replace
> > postcopy_ram_discard_range and most of ram_discard_range.
> > 
> > Those two routines are a bit of a weird combination, and
> > ram_discard_range is about to get more complex for hugepages.
> > It's OS dependent code (so shouldn't be in migration/ram.c) but
> > it needs quite a bit of the innards of RAMBlock so doesn't belong in
> > the os*.c.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  exec.c                    | 59 +++++++++++++++++++++++++++++++++++++++++++++++
> >  include/exec/cpu-common.h |  1 +
> >  2 files changed, 60 insertions(+)
> > 
> > diff --git a/exec.c b/exec.c
> > index 8b9ed73..e040cdf 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -45,6 +45,12 @@
> >  #include "exec/address-spaces.h"
> >  #include "sysemu/xen-mapcache.h"
> >  #include "trace-root.h"
> > +
> > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > +#include <fcntl.h>
> > +#include <linux/falloc.h>
> > +#endif
> > +
> >  #endif
> >  #include "exec/cpu-all.h"
> >  #include "qemu/rcu_queue.h"
> > @@ -3286,4 +3292,57 @@ int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
> >      rcu_read_unlock();
> >      return ret;
> >  }
> > +
> > +/*
> > + * Unmap pages of memory from start to start+length such that
> > + * they a) read as 0, b) Trigger whatever fault mechanism
> > + * the OS provides for postcopy.
> > + * The pages must be unmapped by the end of the function.
> > + * Returns: 0 on success, none-0 on failure
> > + *
> > + */
> > +int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
> > +{
> > +    int ret = -1;
> > +
> > +    rcu_read_lock();
> > +    uint8_t *host_startaddr = rb->host + start;
> > +
> > +    if ((uintptr_t)host_startaddr & (rb->page_size - 1)) {
> > +        error_report("ram_block_discard_range: Unaligned start address: %p",
> > +                     host_startaddr);
> > +        goto err;
> > +    }
> > +
> > +    if ((start + length) <= rb->used_length) {
> > +        uint8_t *host_endaddr = host_startaddr + length;
> > +        if ((uintptr_t)host_endaddr & (rb->page_size - 1)) {
> > +            error_report("ram_block_discard_range: Unaligned end address: %p",
> > +                         host_endaddr);
> > +            goto err;
> > +        }
> > +
> > +        errno = ENOTSUP; /* If we are missing MADVISE etc */
> > +
> > +#if defined(CONFIG_MADVISE)
> > +        ret = qemu_madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
> > +#endif
> > +        if (ret) {
> > +            ret = -errno;
> > +            error_report("ram_block_discard_range: Failed to discard range "
> > +                         "%s:%" PRIx64 " +%zx (%d)",
> > +                         rb->idstr, start, length, ret);
> > +        }
> > +    } else {
> > +        error_report("ram_block_discard_range: Overrun block '%s' (%" PRIu64
> > +                     "/%zx/" RAM_ADDR_FMT")",
> > +                     rb->idstr, start, length, rb->used_length);
> > +    }
> > +
> > +err:
> > +    rcu_read_unlock();
> > +
> > +    return ret;
> > +}
> 
> It really looks like a copy'n'paste from ram_discard_range(). It could be
> clearer if you remove the code from ram_discard_range() and call this
> function instead.

Yes, flattened into the latter commit.

> I think you don't need the "#if defined(CONFIG_MADVISE)" as you use
> qemu_madvise() (or you should use madvise() directly if you want to
> avoid the posix_madvise()).

Yes, changed to CONFIG_MADVISE + madvise()

I need to avoid posix_madvise because it doesn't do the same thing.
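Roughly, that direction looks like this (a sketch, not the exact hunk):

    /* posix_madvise(POSIX_MADV_DONTNEED) is only a hint and doesn't
     * guarantee the range reads back as zero, so call madvise() directly. */
#if defined(CONFIG_MADVISE)
    ret = madvise(host_startaddr, length, MADV_DONTNEED);
#else
    ret = -1;
    errno = ENOTSUP;
#endif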

> [perhaps qemu_madvise() should set errno to ENOTSUP instead of EINVAL]

The difficulty is with how it fiddles with its QEMU_MADV_* macros:
when it finds one that doesn't exist it gets defined as -1 or the like,
which then fails that way.

Dave

> 
> Laurent
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 09/16] postcopy: Use temporary for placing zero huge pages
  2017-02-24 15:46     ` Dr. David Alan Gilbert
@ 2017-02-24 17:24       ` Laurent Vivier
  0 siblings, 0 replies; 73+ messages in thread
From: Laurent Vivier @ 2017-02-24 17:24 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: qemu-devel, quintela, aarcange

On 24/02/2017 16:46, Dr. David Alan Gilbert wrote:
> * Laurent Vivier (lvivier@redhat.com) wrote:
>> On 06/02/2017 18:32, Dr. David Alan Gilbert (git) wrote:
>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>
>>> The kernel can't do UFFDIO_ZEROPAGE for huge pages, so we have
>>> to allocate a temporary (always zero) page and use UFFDIO_COPYPAGE
>>> on it.
>>>
>>> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>> Reviewed-by: Juan Quintela <quintela@redhat.com>
>>> ---
...
>> Are you sure the kernel doesn't support UFFDIO_ZEROPAGE for huge page.
>> It seems __mcopy_atomic() manages HUGETLB vma (it is called by
>> mfill_zeropage(), called by userfaultfd_zeropage())?
> 
> That's as I understand it from Andrea; and I think it does fail if you try it.

Found the answer in kernel log:

    commit 7a0c4cf85b856430af62a907dd65dfc51438d24f
    Author: Andrea Arcangeli <aarcange@redhat.com>
    Date:   Wed Feb 22 15:44:10 2017 -0800

        userfaultfd: selftest: test UFFDIO_ZEROPAGE on all memory types

        This will verify -EINVAL is returned with hugetlbfs/shmem and
        it'll do a functional test of UFFDIO_ZEROPAGE on anonymous
        memory.

Thanks,
Laurent

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-21 10:03               ` Dr. David Alan Gilbert
@ 2017-02-27 11:05                 ` Alexey Perevalov
  2017-02-27 11:26                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 73+ messages in thread
From: Alexey Perevalov @ 2017-02-27 11:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Andrea Arcangeli; +Cc: qemu-devel, quintela

Hi David,


On Tue, Feb 21, 2017 at 10:03:14AM +0000, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > 
> > Hello David,
> 
> Hi Alexey,
> 
> > On Tue, Feb 14, 2017 at 07:34:26PM +0000, Dr. David Alan Gilbert wrote:
> > > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > > Hi David,
> > > > 
> > > > Thank your, now it's clear.
> > > > 
> > > > On Mon, Feb 13, 2017 at 06:16:02PM +0000, Dr. David Alan Gilbert wrote:
> > > > > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > > > >  Hello David!
> > > > > 
> > > > > Hi Alexey,
> > > > > 
> > > > > > I have checked you series with 1G hugepage, but only in 1 Gbit/sec network
> > > > > > environment.
> > > > > 
> > > > > Can you show the qemu command line you're using?  I'm just trying
> > > > > to make sure I understand where your hugepages are; running 1G hostpages
> > > > > across a 1Gbit/sec network for postcopy would be pretty poor - it would take
> > > > > ~10 seconds to transfer the page.
> > > > 
> > > > sure
> > > > -hda ./Ubuntu.img -name PAU,debug-threads=on -boot d -net nic -net user
> > > > -m 1024 -localtime -nographic -enable-kvm -incoming tcp:0:4444 -object
> > > > memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages -mem-prealloc
> > > > -numa node,memdev=mem -trace events=/tmp/events -chardev
> > > > socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock,server,nowait
> > > > -mon chardev=charmonitor,id=monitor,mode=control
> > > 
> > > OK, it's a pretty unusual setup - a 1G page guest with 1G of guest RAM.
> > > 
> > > > > 
> > > > > > I started Ubuntu just with console interface and gave to it only 1G of
> > > > > > RAM, inside Ubuntu I started stress command
> > > > > 
> > > > > > (stress --cpu 4 --io 4 --vm 4 --vm-bytes 256000000 &)
> > > > > > in such environment precopy live migration was impossible, it never
> > > > > > being finished, in this case it infinitely sends pages (it looks like
> > > > > > dpkg scenario).
> > > > > > 
> > > > > > Also I modified stress utility
> > > > > > http://people.seas.harvard.edu/~apw/stress/stress-1.0.4.tar.gz
> > > > > > due to it wrote into memory every time the same value `Z`. My
> > > > > > modified version writes every allocation new incremented value.
> > > > > 
> > > > > I use google's stressapptest normally; although remember to turn
> > > > > off the bit where it pauses.
> > > > 
> > > > I decided to use it too
> > > > stressapptest -s 300 -M 256 -m 8 -W
> > > > 
> > > > > 
> > > > > > I'm using Arcangeli's kernel only at the destination.
> > > > > > 
> > > > > > I got controversial results. Downtime for 1G hugepage is close to 2Mb
> > > > > > hugepage and it took around 7 ms (in 2Mb hugepage scenario downtime was
> > > > > > around 8 ms).
> > > > > > I made that opinion by query-migrate.
> > > > > > {"return": {"status": "completed", "setup-time": 6, "downtime": 6, "total-time": 9668, "ram": {"total": 1091379200, "postcopy-requests": 1, "dirty-sync-count": 2, "remaining": 0, "mbps": 879.786851, "transferred": 1063007296, "duplicate": 7449, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1060868096, "normal": 259001}}}
> > > > > > 
> > > > > > Documentation says about downtime field - measurement unit is ms.
> > > > > 
> > > > > The downtime measurement field is pretty meaningless for postcopy; it's only
> > > > > the time from stopping the VM until the point where we tell the destination it
> > > > > can start running.  Meaningful measurements are only from inside the guest
> > > > > really, or the place latencys.
> > > > >
> > > > 
> > > > Maybe improve it by receiving such information from destination?
> > > > I wish to do that.
> > > > > > So I traced it (I added additional trace into postcopy_place_page
> > > > > > trace_postcopy_place_page_start(host, from, pagesize); )
> > > > > > 
> > > > > > postcopy_ram_fault_thread_request Request for HVA=7f6dc0000000 rb=/objects/mem offset=0
> > > > > > postcopy_place_page_start host=0x7f6dc0000000 from=0x7f6d70000000, pagesize=40000000
> > > > > > postcopy_place_page_start host=0x7f6e0e800000 from=0x55b665969619, pagesize=1000
> > > > > > postcopy_place_page_start host=0x7f6e0e801000 from=0x55b6659684e8, pagesize=1000
> > > > > > several pages with 4Kb step ...
> > > > > > postcopy_place_page_start host=0x7f6e0e817000 from=0x55b6659694f0, pagesize=1000
> > > > > > 
> > > > > > 4K pages, started from 0x7f6e0e800000 address it's
> > > > > > vga.ram, /rom@etc/acpi/tables etc.
> > > > > > 
> > > > > > Frankly saying, right now, I don't have any ideas why hugepage wasn't
> > > > > > resent. Maybe my expectation of it is wrong as well as understanding )
> > > > > 
> > > > > That's pretty much what I expect to see - before you get into postcopy
> > > > > mode everything is sent as individual 4k pages (in order); once we're
> > > > > in postcopy mode we send each page no more than once.  So you're
> > > > > huge page comes across once - and there it is.
> > > > > 
> > > > > > stress utility also duplicated for me value into appropriate file:
> > > > > > sec_since_epoch.microsec:value
> > > > > > 1487003192.728493:22
> > > > > > 1487003197.335362:23
> > > > > > *1487003213.367260:24*
> > > > > > *1487003238.480379:25*
> > > > > > 1487003243.315299:26
> > > > > > 1487003250.775721:27
> > > > > > 1487003255.473792:28
> > > > > > 
> > > > > > It mean rewriting 256Mb of memory per byte took around 5 sec, but at
> > > > > > the moment of migration it took 25 sec.
> > > > > 
> > > > > right, now this is the thing that's more useful to measure.
> > > > > That's not too surprising; when it migrates that data is changing rapidly
> > > > > so it's going to have to pause and wait for that whole 1GB to be transferred.
> > > > > Your 1Gbps network is going to take about 10 seconds to transfer that
> > > > > 1GB page - and that's if you're lucky and it saturates the network.
> > > > > SO it's going to take at least 10 seconds longer than it normally
> > > > > would, plus any other overheads - so at least 15 seconds.
> > > > > This is why I say it's a bad idea to use 1GB host pages with postcopy.
> > > > > Of course it would be fun to find where the other 10 seconds went!
> > > > > 
> > > > > You might like to add timing to the tracing so you can see the time between the
> > > > > fault thread requesting the page and it arriving.
> > > > >
> > > > yes, sorry I forgot about timing
> > > > 20806@1487084818.270993:postcopy_ram_fault_thread_request Request for HVA=7f0280000000 rb=/objects/mem offset=0
> > > > 20806@1487084818.271038:qemu_loadvm_state_section 8
> > > > 20806@1487084818.271056:loadvm_process_command com=0x2 len=4
> > > > 20806@1487084818.271089:qemu_loadvm_state_section 2
> > > > 20806@1487084823.315919:postcopy_place_page_start host=0x7f0280000000 from=0x7f0240000000, pagesize=40000000
> > > > 
> > > > 1487084823.315919 - 1487084818.270993 = 5.044926 sec.
> > > > Machines connected w/o any routers, directly by cable.
> > > 
> > > OK, the fact it's only 5 seconds not 10 I think suggests a lot of the memory was all zero
> > > so didn't take up the whole bandwidth.
> 
> > I decided to measure downtime as a sum of intervals since fault happened
> > and till page was load. I didn't relay on order, so I associated that
> > interval with fault address.
> 
> Don't forget the source will still be sending unrequested pages at the
> same time as fault responses; so that simplification might be wrong.
> My experience with 4k pages is you'll often get pages that arrive
> at about the same time as you ask for them because of the background transmission.
> 
> > For 2G ram vm - using 1G huge page, downtime measured on dst is around 12 sec,
> > but for the same 2G ram vm with 2Mb huge page, downtime measured on dst
> > is around 20 sec, and 320 page faults happened, 640 Mb was transmitted.
> 
> OK, so 20/320 * 1000=62.5msec/ page.   That's a bit high.
> I think it takes about 16ms to transmit a 2MB page on your 1Gbps network,
Yes, you're right: the transfer of the first page doesn't wait for prefetched page
transmission, and the downtime for the first page was 25 ms.
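(As a quick cross-check of that estimate: a 2 MiB page is about 16.8 Mbit, so
at 1 Gbps the raw transfer alone is roughly 17 ms, before any protocol
overhead.)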

Subsequent requested pages are queued (FIFO), so the destination has to wait
for all the prefetched pages ahead of them; that's around 5-7 page transmissions.
So I have a question: why not put the requested page at the head of the
queue in that case? Then the destination qemu would only have to wait for the
page that is already in transmission.

Also, if I'm not mistaken, commands and pages are transferred over the same
socket. Why not use TCP out-of-band data for the commands in this case?

> you're probably also suffering from the requests being queued behind
> background requests; if you try reducing your tcp_wmem setting on the
> source it might get a bit better.  Once Juan Quintela's multi-fd work
> goes in my hope is to combine it with postcopy and then be able to
> avoid that type of request blocking.
> Generally I'd recommend 10Gbps for postcopy since it does pull
> down the latency quite a bit.
> 
> > My current method doesn't take into account multi core vcpu. I checked
> > only with 1 CPU, but it's not proper case. So I think it's worth to
> > count downtime per CPU, or calculate overlap of CPU downtimes.
> > How do your think?
> 
> Yes; one of the nice things about postcopy is that if one vCPU is blocked
> waiting for a page, the other vCPUs will just be able to carry on.
> Even with 1 vCPU if you've got multiple tasks that can run the guest can
> switch to a task that isn't blocked (See KVM asynchronous page faults).
> Now, what the numbers mean when you calculate the total like that might be a bit
> odd - for example if you have 8 vCPUs and they're each blocked do you
> add the times together even though they're blocked at the same time? What
> about if they're blocked on the same page?

I implemented downtime calculation for all CPUs; the approach is
as follows:

Initially the intervals are kept in a tree where the key is the
page-fault address and the values are:
    begin - page fault time
    end   - page load time
    cpus  - bit mask of the affected vCPUs

To calculate the overlap across all vCPUs, the intervals are converted
into an array of points in time (downtime_intervals); the size of the
array is 2 * the number of nodes in the interval tree (two array
elements per interval).
Each element is marked as either the end (E) or the start (S) of an
interval.
The overlap downtime is only accounted for an S..E pair when the
sequence S(0..N)E(M) covers every vCPU.

As an example, we have 3 CPUs
     S1        E1           S1               E1
-----***********------------xxx***************------------------------> CPU1

            S2                E2
------------****************xxx---------------------------------------> CPU2

                        S3            E3
------------------------****xxx********-------------------------------> CPU3
	        
We have the sequence S1,S2,E1,S3,S1,E2,E3,E1.
S2,E1 doesn't match the condition because the
sequence S1,S2,E1 doesn't include CPU3;
S3,S1,E2 is a sequence that includes all CPUs, so in
this case the overlap is S1,E2.
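A minimal sketch of that overlap computation (the Point type and helper names
here are assumptions for illustration, not the RFC code):

    typedef struct {
        uint64_t time;
        bool     is_end;    /* true: page loaded (E), false: fault start (S) */
        int      cpu;
    } Point;                /* points[] is sorted by time */

    static uint64_t overlap_downtime(const Point *points, int npoints, int ncpus)
    {
        int *pending = g_new0(int, ncpus);  /* outstanding faults per vCPU */
        int blocked_cpus = 0;               /* vCPUs with pending[cpu] > 0 */
        uint64_t sum = 0, all_blocked_since = 0;

        for (int i = 0; i < npoints; i++) {
            bool was_all = (blocked_cpus == ncpus);

            if (points[i].is_end) {
                if (--pending[points[i].cpu] == 0) {
                    blocked_cpus--;
                }
            } else {
                if (pending[points[i].cpu]++ == 0) {
                    blocked_cpus++;
                }
            }
            if (!was_all && blocked_cpus == ncpus) {
                all_blocked_since = points[i].time;         /* e.g. S1 above */
            } else if (was_all && blocked_cpus != ncpus) {
                sum += points[i].time - all_blocked_since;  /* e.g. E2 above */
            }
        }
        g_free(pending);
        return sum;   /* time during which every vCPU was blocked */
    }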


But I'm not sending an RFC yet,
because I've hit an issue: the kernel doesn't inform user space of a page's
owner in handle_userfault. So this is a question for Andrea: is it worth
adding such information?
Frankly, I don't know whether current (the task_struct) in
handle_userfault is the same as the mm_struct's owner.

> 
> > Also I didn't yet finish IPC to provide such information to src host, where
> > info_migrate is being called.
> 
> Dave
> 
> > 
> > 
> > > 
> > > > > > Another one request.
> > > > > > QEMU could use mem_path in hugefs with share key simultaneously
> > > > > > (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> > > > > > in this case will start and will properly work (it will allocate memory
> > > > > > with mmap), but in case of destination for postcopy live migration
> > > > > > UFFDIO_COPY ioctl will fail for
> > > > > > such region, in Arcangeli's git tree there is such prevent check
> > > > > > (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> > > > > > Is it possible to handle such situation at qemu?
> > > > > 
> > > > > Imagine that you had shared memory; what semantics would you like
> > > > > to see ?  What happens to the other process?
> > > > 
> > > > Honestly, initially, I thought to handle such error, but I quit forgot
> > > > about vhost-user in ovs-dpdk.
> > > 
> > > Yes, I don't know much about vhost-user; but we'll have to think carefully
> > > about the way things behave when they're accessing memory that's shared
> > > with qemu during migration.  Writing to the source after we've started
> > > the postcopy phase is not allowed.  Accessing the destination memory
> > > during postcopy will produce pauses in the other processes accessing it
> > > (I think) and they mustn't do various types of madvise etc - so
> > > I'm sure there will be things we find out the hard way!
> > > 
> > > Dave
> > > 
> > > > > Dave
> > > > > 
> > > > > > On Mon, Feb 06, 2017 at 05:45:30PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > > * Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> > > > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > > > 
> > > > > > > > Hi,
> > > > > > > >   The existing postcopy code, and the userfault kernel
> > > > > > > > code that supports it, only works for normal anonymous memory.
> > > > > > > > Kernel support for userfault on hugetlbfs is working
> > > > > > > > it's way upstream; it's in the linux-mm tree,
> > > > > > > > You can get a version at:
> > > > > > > >    git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> > > > > > > > on the origin/userfault branch.
> > > > > > > > 
> > > > > > > > Note that while this code supports arbitrary sized hugepages,
> > > > > > > > it doesn't make sense with pages above the few-MB region,
> > > > > > > > so while 2MB is fine, 1GB is probably a bad idea;
> > > > > > > > this code waits for and transmits whole huge pages, and a
> > > > > > > > 1GB page would take about 1 second to transfer over a 10Gbps
> > > > > > > > link - which is way too long to pause the destination for.
> > > > > > > > 
> > > > > > > > Dave
> > > > > > > 
> > > > > > > Oops I missed the v2 changes from the message:
> > > > > > > 
> > > > > > > v2
> > > > > > >   Flip ram-size summary word/compare individual page size patches around
> > > > > > >   Individual page size comparison is done in ram_load if 'advise' has been
> > > > > > >     received rather than checking migrate_postcopy_ram()
> > > > > > >   Moved discard code into exec.c, reworked ram_discard_range
> > > > > > > 
> > > > > > > Dave
> > > > > > 
> > > > > > Thank your, right now it's not necessary to set
> > > > > > postcopy-ram capability on destination machine.
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > Dr. David Alan Gilbert (16):
> > > > > > > >   postcopy: Transmit ram size summary word
> > > > > > > >   postcopy: Transmit and compare individual page sizes
> > > > > > > >   postcopy: Chunk discards for hugepages
> > > > > > > >   exec: ram_block_discard_range
> > > > > > > >   postcopy: enhance ram_block_discard_range for hugepages
> > > > > > > >   Fold postcopy_ram_discard_range into ram_discard_range
> > > > > > > >   postcopy: Record largest page size
> > > > > > > >   postcopy: Plumb pagesize down into place helpers
> > > > > > > >   postcopy: Use temporary for placing zero huge pages
> > > > > > > >   postcopy: Load huge pages in one go
> > > > > > > >   postcopy: Mask fault addresses to huge page boundary
> > > > > > > >   postcopy: Send whole huge pages
> > > > > > > >   postcopy: Allow hugepages
> > > > > > > >   postcopy: Update userfaultfd.h header
> > > > > > > >   postcopy: Check for userfault+hugepage feature
> > > > > > > >   postcopy: Add doc about hugepages and postcopy
> > > > > > > > 
> > > > > > > >  docs/migration.txt                |  13 ++++
> > > > > > > >  exec.c                            |  83 +++++++++++++++++++++++
> > > > > > > >  include/exec/cpu-common.h         |   2 +
> > > > > > > >  include/exec/memory.h             |   1 -
> > > > > > > >  include/migration/migration.h     |   3 +
> > > > > > > >  include/migration/postcopy-ram.h  |  13 ++--
> > > > > > > >  linux-headers/linux/userfaultfd.h |  81 +++++++++++++++++++---
> > > > > > > >  migration/migration.c             |   1 +
> > > > > > > >  migration/postcopy-ram.c          | 138 +++++++++++++++++---------------------
> > > > > > > >  migration/ram.c                   | 109 ++++++++++++++++++------------
> > > > > > > >  migration/savevm.c                |  32 ++++++---
> > > > > > > >  migration/trace-events            |   2 +-
> > > > > > > >  12 files changed, 328 insertions(+), 150 deletions(-)
> > > > > > > > 
> > > > > > > > -- 
> > > > > > > > 2.9.3
> > > > > > > > 
> > > > > > > > 
> > > > > > > --
> > > > > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > > > > 
> > > > > --
> > > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > > 
> > > > 
> > > > -- 
> > > > 
> > > > BR
> > > > Alexey
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > 
> > 
> > -- 
> > 
> > BR
> > Alexey
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>

BR
Alexey

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-27 11:05                 ` Alexey Perevalov
@ 2017-02-27 11:26                   ` Dr. David Alan Gilbert
  2017-02-27 15:00                     ` Andrea Arcangeli
  2017-02-27 19:04                     ` Alexey Perevalov
  0 siblings, 2 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2017-02-27 11:26 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: Andrea Arcangeli, qemu-devel, quintela

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> Hi David,
> 
> 
> On Tue, Feb 21, 2017 at 10:03:14AM +0000, Dr. David Alan Gilbert wrote:
> > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > 
> > > Hello David,
> > 
> > Hi Alexey,
> > 
> > > On Tue, Feb 14, 2017 at 07:34:26PM +0000, Dr. David Alan Gilbert wrote:
> > > > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > > > Hi David,
> > > > > 
> > > > > Thank your, now it's clear.
> > > > > 
> > > > > On Mon, Feb 13, 2017 at 06:16:02PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > > > > >  Hello David!
> > > > > > 
> > > > > > Hi Alexey,
> > > > > > 
> > > > > > > I have checked you series with 1G hugepage, but only in 1 Gbit/sec network
> > > > > > > environment.
> > > > > > 
> > > > > > Can you show the qemu command line you're using?  I'm just trying
> > > > > > to make sure I understand where your hugepages are; running 1G hostpages
> > > > > > across a 1Gbit/sec network for postcopy would be pretty poor - it would take
> > > > > > ~10 seconds to transfer the page.
> > > > > 
> > > > > sure
> > > > > -hda ./Ubuntu.img -name PAU,debug-threads=on -boot d -net nic -net user
> > > > > -m 1024 -localtime -nographic -enable-kvm -incoming tcp:0:4444 -object
> > > > > memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages -mem-prealloc
> > > > > -numa node,memdev=mem -trace events=/tmp/events -chardev
> > > > > socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock,server,nowait
> > > > > -mon chardev=charmonitor,id=monitor,mode=control
> > > > 
> > > > OK, it's a pretty unusual setup - a 1G page guest with 1G of guest RAM.
> > > > 
> > > > > > 
> > > > > > > I started Ubuntu just with console interface and gave to it only 1G of
> > > > > > > RAM, inside Ubuntu I started stress command
> > > > > > 
> > > > > > > (stress --cpu 4 --io 4 --vm 4 --vm-bytes 256000000 &)
> > > > > > > in such environment precopy live migration was impossible, it never
> > > > > > > being finished, in this case it infinitely sends pages (it looks like
> > > > > > > dpkg scenario).
> > > > > > > 
> > > > > > > Also I modified stress utility
> > > > > > > http://people.seas.harvard.edu/~apw/stress/stress-1.0.4.tar.gz
> > > > > > > due to it wrote into memory every time the same value `Z`. My
> > > > > > > modified version writes every allocation new incremented value.
> > > > > > 
> > > > > > I use google's stressapptest normally; although remember to turn
> > > > > > off the bit where it pauses.
> > > > > 
> > > > > I decided to use it too
> > > > > stressapptest -s 300 -M 256 -m 8 -W
> > > > > 
> > > > > > 
> > > > > > > I'm using Arcangeli's kernel only at the destination.
> > > > > > > 
> > > > > > > I got controversial results. Downtime for 1G hugepage is close to 2Mb
> > > > > > > hugepage and it took around 7 ms (in 2Mb hugepage scenario downtime was
> > > > > > > around 8 ms).
> > > > > > > I made that opinion by query-migrate.
> > > > > > > {"return": {"status": "completed", "setup-time": 6, "downtime": 6, "total-time": 9668, "ram": {"total": 1091379200, "postcopy-requests": 1, "dirty-sync-count": 2, "remaining": 0, "mbps": 879.786851, "transferred": 1063007296, "duplicate": 7449, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1060868096, "normal": 259001}}}
> > > > > > > 
> > > > > > > Documentation says about downtime field - measurement unit is ms.
> > > > > > 
> > > > > > The downtime measurement field is pretty meaningless for postcopy; it's only
> > > > > > the time from stopping the VM until the point where we tell the destination it
> > > > > > can start running.  Meaningful measurements are only from inside the guest
> > > > > > really, or the place latencys.
> > > > > >
> > > > > 
> > > > > Maybe improve it by receiving such information from destination?
> > > > > I wish to do that.
> > > > > > > So I traced it (I added additional trace into postcopy_place_page
> > > > > > > trace_postcopy_place_page_start(host, from, pagesize); )
> > > > > > > 
> > > > > > > postcopy_ram_fault_thread_request Request for HVA=7f6dc0000000 rb=/objects/mem offset=0
> > > > > > > postcopy_place_page_start host=0x7f6dc0000000 from=0x7f6d70000000, pagesize=40000000
> > > > > > > postcopy_place_page_start host=0x7f6e0e800000 from=0x55b665969619, pagesize=1000
> > > > > > > postcopy_place_page_start host=0x7f6e0e801000 from=0x55b6659684e8, pagesize=1000
> > > > > > > several pages with 4Kb step ...
> > > > > > > postcopy_place_page_start host=0x7f6e0e817000 from=0x55b6659694f0, pagesize=1000
> > > > > > > 
> > > > > > > 4K pages, started from 0x7f6e0e800000 address it's
> > > > > > > vga.ram, /rom@etc/acpi/tables etc.
> > > > > > > 
> > > > > > > Frankly saying, right now, I don't have any ideas why hugepage wasn't
> > > > > > > resent. Maybe my expectation of it is wrong as well as understanding )
> > > > > > 
> > > > > > That's pretty much what I expect to see - before you get into postcopy
> > > > > > mode everything is sent as individual 4k pages (in order); once we're
> > > > > > in postcopy mode we send each page no more than once.  So you're
> > > > > > huge page comes across once - and there it is.
> > > > > > 
> > > > > > > stress utility also duplicated for me value into appropriate file:
> > > > > > > sec_since_epoch.microsec:value
> > > > > > > 1487003192.728493:22
> > > > > > > 1487003197.335362:23
> > > > > > > *1487003213.367260:24*
> > > > > > > *1487003238.480379:25*
> > > > > > > 1487003243.315299:26
> > > > > > > 1487003250.775721:27
> > > > > > > 1487003255.473792:28
> > > > > > > 
> > > > > > > It mean rewriting 256Mb of memory per byte took around 5 sec, but at
> > > > > > > the moment of migration it took 25 sec.
> > > > > > 
> > > > > > right, now this is the thing that's more useful to measure.
> > > > > > That's not too surprising; when it migrates that data is changing rapidly
> > > > > > so it's going to have to pause and wait for that whole 1GB to be transferred.
> > > > > > Your 1Gbps network is going to take about 10 seconds to transfer that
> > > > > > 1GB page - and that's if you're lucky and it saturates the network.
> > > > > > SO it's going to take at least 10 seconds longer than it normally
> > > > > > would, plus any other overheads - so at least 15 seconds.
> > > > > > This is why I say it's a bad idea to use 1GB host pages with postcopy.
> > > > > > Of course it would be fun to find where the other 10 seconds went!
> > > > > > 
> > > > > > You might like to add timing to the tracing so you can see the time between the
> > > > > > fault thread requesting the page and it arriving.
> > > > > >
> > > > > yes, sorry I forgot about timing
> > > > > 20806@1487084818.270993:postcopy_ram_fault_thread_request Request for HVA=7f0280000000 rb=/objects/mem offset=0
> > > > > 20806@1487084818.271038:qemu_loadvm_state_section 8
> > > > > 20806@1487084818.271056:loadvm_process_command com=0x2 len=4
> > > > > 20806@1487084818.271089:qemu_loadvm_state_section 2
> > > > > 20806@1487084823.315919:postcopy_place_page_start host=0x7f0280000000 from=0x7f0240000000, pagesize=40000000
> > > > > 
> > > > > 1487084823.315919 - 1487084818.270993 = 5.044926 sec.
> > > > > Machines connected w/o any routers, directly by cable.
> > > > 
> > > > OK, the fact it's only 5 seconds not 10 I think suggests a lot of the memory was all zero
> > > > so didn't take up the whole bandwidth.
> > 
> > > I decided to measure downtime as a sum of intervals since fault happened
> > > and till page was load. I didn't relay on order, so I associated that
> > > interval with fault address.
> > 
> > Don't forget the source will still be sending unrequested pages at the
> > same time as fault responses; so that simplification might be wrong.
> > My experience with 4k pages is you'll often get pages that arrive
> > at about the same time as you ask for them because of the background transmission.
> > 
> > > For 2G ram vm - using 1G huge page, downtime measured on dst is around 12 sec,
> > > but for the same 2G ram vm with 2Mb huge page, downtime measured on dst
> > > is around 20 sec, and 320 page faults happened, 640 Mb was transmitted.
> > 
> > OK, so 20/320 * 1000=62.5msec/ page.   That's a bit high.
> > I think it takes about 16ms to transmit a 2MB page on your 1Gbps network,
> Yes, you right, transfer of the first page doesn't wait for prefetched page
> transmission, and downtime for first page was 25 ms.
> 
> Next requested pages are queued (FIFO) so dst is waiting all prefetched pages,
> it's around 5-7 pages transmission.
> So I have a question why not to put requested page into the head of
> queue in that case, and dst qemu will wait only lesser, only page which
> was already in transmission.

The problem is it's already in the source's network queue.

> Also if I'm not wrong, commands and pages are transferred over the same
> socket. Why not to use OOB TCP in this case for commands?

My understanding was that OOB was limited to quite small transfers
I think the right way is to use a separate FD for the requests, so I'll
do it after Juan's multifd series.
Although even then I'm not sure how it will behave; the other thing
might be to throttle the background page transfer so the FIFO isn't
as full.

> > you're probably also suffering from the requests being queued behind
> > background requests; if you try reducing your tcp_wmem setting on the
> > source it might get a bit better.  Once Juan Quintela's multi-fd work
> > goes in my hope is to combine it with postcopy and then be able to
> > avoid that type of request blocking.
> > Generally I'd not recommend 10Gbps for postcopy since it does pull
> > down the latency quite a bit.
> > 
> > > My current method doesn't take into account multi core vcpu. I checked
> > > only with 1 CPU, but it's not proper case. So I think it's worth to
> > > count downtime per CPU, or calculate overlap of CPU downtimes.
> > > How do your think?
> > 
> > Yes; one of the nice things about postcopy is that if one vCPU is blocked
> > waiting for a page, the other vCPUs will just be able to carry on.
> > Even with 1 vCPU if you've got multiple tasks that can run the guest can
> > switch to a task that isn't blocked (See KVM asynchronous page faults).
> > Now, what the numbers mean when you calculate the total like that might be a bit
> > odd - for example if you have 8 vCPUs and they're each blocked do you
> > add the times together even though they're blocked at the same time? What
> > about if they're blocked on the same page?
> 
> I implemented downtime calculation for all cpu's, the approach is
> following:
> 
> Initially intervals are represented in tree where key is
> pagefault address, and values:
>     begin - page fault time
>     end   - page load time
>     cpus  - bit mask shows affected cpus
> 
> To calculate overlap on all cpus, intervals converted into
> array of points in time (downtime_intervals), the size of
> array is 2 * number of nodes in tree of intervals (2 array
> elements per one in element of interval).
> Each element is marked as end (E) or not the end (S) of
> interval.
> The overlap downtime will be calculated for SE, only in
> case of sequence S(0..N)E(M) for every vCPU.
> 
> As example we have 3 CPU
>      S1        E1           S1               E1
> -----***********------------xxx***************------------------------> CPU1
> 
>             S2                E2
> ------------****************xxx---------------------------------------> CPU2
> 
>                         S3            E3
> ------------------------****xxx********-------------------------------> CPU3
> 	        
> We have sequence S1,S2,E1,S3,S1,E2,E3,E1
> S2,E1 - doesn't match condition due to
> sequence S1,S2,E1 doesn't include CPU3,
> S3,S1,E2 - sequenece includes all CPUs, in
> this case overlap will be S1,E2
> 
> 
> But I don't send RFC now,
> due to I faced an issue. Kernel doesn't inform user space about page's
> owner in handle_userfault. So it's the question to Andrea. Is it worth
> to add such information.
> Frankly saying, I don't know is current (task_struct) in
> handle_userfault equal to mm_struct's owner.

Is this so you can find which thread is waiting for it? I'm not sure it's
worth it; we don't normally need that, and anyway it doesn't help if multiple
CPUs need it, where the 2nd CPU hits it just after the 1st one.

Dave

> > 
> > > Also I didn't yet finish IPC to provide such information to src host, where
> > > info_migrate is being called.
> > 
> > Dave
> > 
> > > 
> > > 
> > > > 
> > > > > > > Another one request.
> > > > > > > QEMU could use mem_path in hugefs with share key simultaneously
> > > > > > > (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
> > > > > > > in this case will start and will properly work (it will allocate memory
> > > > > > > with mmap), but in case of destination for postcopy live migration
> > > > > > > UFFDIO_COPY ioctl will fail for
> > > > > > > such region, in Arcangeli's git tree there is such prevent check
> > > > > > > (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
> > > > > > > Is it possible to handle such situation at qemu?
> > > > > > 
> > > > > > Imagine that you had shared memory; what semantics would you like
> > > > > > to see ?  What happens to the other process?
> > > > > 
> > > > > Honestly, initially, I thought to handle such error, but I quit forgot
> > > > > about vhost-user in ovs-dpdk.
> > > > 
> > > > Yes, I don't know much about vhost-user; but we'll have to think carefully
> > > > about the way things behave when they're accessing memory that's shared
> > > > with qemu during migration.  Writing to the source after we've started
> > > > the postcopy phase is not allowed.  Accessing the destination memory
> > > > during postcopy will produce pauses in the other processes accessing it
> > > > (I think) and they mustn't do various types of madvise etc - so
> > > > I'm sure there will be things we find out the hard way!
> > > > 
> > > > Dave
> > > > 
> > > > > > Dave
> > > > > > 
> > > > > > > On Mon, Feb 06, 2017 at 05:45:30PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > > > * Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> > > > > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > > > > 
> > > > > > > > > Hi,
> > > > > > > > >   The existing postcopy code, and the userfault kernel
> > > > > > > > > code that supports it, only works for normal anonymous memory.
> > > > > > > > > Kernel support for userfault on hugetlbfs is working
> > > > > > > > > it's way upstream; it's in the linux-mm tree,
> > > > > > > > > You can get a version at:
> > > > > > > > >    git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> > > > > > > > > on the origin/userfault branch.
> > > > > > > > > 
> > > > > > > > > Note that while this code supports arbitrary sized hugepages,
> > > > > > > > > it doesn't make sense with pages above the few-MB region,
> > > > > > > > > so while 2MB is fine, 1GB is probably a bad idea;
> > > > > > > > > this code waits for and transmits whole huge pages, and a
> > > > > > > > > 1GB page would take about 1 second to transfer over a 10Gbps
> > > > > > > > > link - which is way too long to pause the destination for.
> > > > > > > > > 
> > > > > > > > > Dave
> > > > > > > > 
> > > > > > > > Oops I missed the v2 changes from the message:
> > > > > > > > 
> > > > > > > > v2
> > > > > > > >   Flip ram-size summary word/compare individual page size patches around
> > > > > > > >   Individual page size comparison is done in ram_load if 'advise' has been
> > > > > > > >     received rather than checking migrate_postcopy_ram()
> > > > > > > >   Moved discard code into exec.c, reworked ram_discard_range
> > > > > > > > 
> > > > > > > > Dave
> > > > > > > 
> > > > > > > Thank your, right now it's not necessary to set
> > > > > > > postcopy-ram capability on destination machine.
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > Dr. David Alan Gilbert (16):
> > > > > > > > >   postcopy: Transmit ram size summary word
> > > > > > > > >   postcopy: Transmit and compare individual page sizes
> > > > > > > > >   postcopy: Chunk discards for hugepages
> > > > > > > > >   exec: ram_block_discard_range
> > > > > > > > >   postcopy: enhance ram_block_discard_range for hugepages
> > > > > > > > >   Fold postcopy_ram_discard_range into ram_discard_range
> > > > > > > > >   postcopy: Record largest page size
> > > > > > > > >   postcopy: Plumb pagesize down into place helpers
> > > > > > > > >   postcopy: Use temporary for placing zero huge pages
> > > > > > > > >   postcopy: Load huge pages in one go
> > > > > > > > >   postcopy: Mask fault addresses to huge page boundary
> > > > > > > > >   postcopy: Send whole huge pages
> > > > > > > > >   postcopy: Allow hugepages
> > > > > > > > >   postcopy: Update userfaultfd.h header
> > > > > > > > >   postcopy: Check for userfault+hugepage feature
> > > > > > > > >   postcopy: Add doc about hugepages and postcopy
> > > > > > > > > 
> > > > > > > > >  docs/migration.txt                |  13 ++++
> > > > > > > > >  exec.c                            |  83 +++++++++++++++++++++++
> > > > > > > > >  include/exec/cpu-common.h         |   2 +
> > > > > > > > >  include/exec/memory.h             |   1 -
> > > > > > > > >  include/migration/migration.h     |   3 +
> > > > > > > > >  include/migration/postcopy-ram.h  |  13 ++--
> > > > > > > > >  linux-headers/linux/userfaultfd.h |  81 +++++++++++++++++++---
> > > > > > > > >  migration/migration.c             |   1 +
> > > > > > > > >  migration/postcopy-ram.c          | 138 +++++++++++++++++---------------------
> > > > > > > > >  migration/ram.c                   | 109 ++++++++++++++++++------------
> > > > > > > > >  migration/savevm.c                |  32 ++++++---
> > > > > > > > >  migration/trace-events            |   2 +-
> > > > > > > > >  12 files changed, 328 insertions(+), 150 deletions(-)
> > > > > > > > > 
> > > > > > > > > -- 
> > > > > > > > > 2.9.3
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > --
> > > > > > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > > > > > 
> > > > > > --
> > > > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > 
> > > > > BR
> > > > > Alexey
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > 
> > > 
> > > -- 
> > > 
> > > BR
> > > Alexey
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> 
> BR
> Alexey
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-27 11:26                   ` Dr. David Alan Gilbert
@ 2017-02-27 15:00                     ` Andrea Arcangeli
  2017-02-27 15:47                       ` Daniel P. Berrange
  2017-02-27 19:04                     ` Alexey Perevalov
  1 sibling, 1 reply; 73+ messages in thread
From: Andrea Arcangeli @ 2017-02-27 15:00 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Alexey Perevalov, qemu-devel, quintela

Hello,

On Mon, Feb 27, 2017 at 11:26:58AM +0000, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > Also if I'm not wrong, commands and pages are transferred over the same
> > socket. Why not to use OOB TCP in this case for commands?
> 
> My understanding was that OOB was limited to quite small transfers
> I think the right way is to use a separate FD for the requests, so I'll
> do it after Juan's multifd series.

OOB would do the trick and we considered it some time ago, but we need
this to work over any network pipe including TLS (out of qemu's
control but set up by libvirt), and since OOB is a protocol-level,
TCP-specific feature in the kernel, I don't think there's any way to
access it through the TLS API abstractions. Plus, as David said, there
are issues with the size of the transfer.

Currently, reducing the tcp_wmem sysctl to 3MiB sounds best (to give a
little room for the headers of the packets required to transfer a 2MB
page). For 4k pages it can perhaps be reduced to 6k/10k.

> Although even then I'm not sure how it will behave; the other thing
> might be to throttle the background page transfer so the FIFO isn't
> as full.

Yes, we didn't go in this direction because it would only be a
short-term solution.

The kernel already has optimal throttling in the TCP stack; trying to
throttle against it in qemu so that the tcp_wmem queue doesn't fill
doesn't look attractive.

With the multisocket implementation, a tc qdisc lets you further make
sure the userfault socket gets top priority and is delivered
immediately, but normally that won't be necessary: fq_codel (which
should be the userland post-boot default by now; the kernel still has
an obsolete default) should do a fine job by default. Having a proper
tc qdisc will matter once we switch to the multisocket implementation,
so you'll have to pay attention to it then, but it's something to pay
attention to regardless if you have significant network load from
multiple sockets in the equation - nothing out of the ordinary.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-27 15:00                     ` Andrea Arcangeli
@ 2017-02-27 15:47                       ` Daniel P. Berrange
  0 siblings, 0 replies; 73+ messages in thread
From: Daniel P. Berrange @ 2017-02-27 15:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Dr. David Alan Gilbert, quintela, Alexey Perevalov, qemu-devel

On Mon, Feb 27, 2017 at 04:00:15PM +0100, Andrea Arcangeli wrote:
> Hello,
> 
> On Mon, Feb 27, 2017 at 11:26:58AM +0000, Dr. David Alan Gilbert wrote:
> > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > Also if I'm not wrong, commands and pages are transferred over the same
> > > socket. Why not to use OOB TCP in this case for commands?
> > 
> > My understanding was that OOB was limited to quite small transfers
> > I think the right way is to use a separate FD for the requests, so I'll
> > do it after Juan's multifd series.
> 
> OOB would do the trick and we considered it some time ago, but we need
> this to work over any network pipe including TLS (out of qemu's
> control but set up by libvirt), and since OOB is a protocol-level,
> TCP-specific feature in the kernel, I don't think there's any way to
> access it through the TLS API abstractions. Plus, as David said, there
> are issues with the size of the transfer.

Correct, there's no facility for handling OOB data when a socket is
using TLS. Also note that QEMU might not even have a TCP socket:
when libvirt is tunnelling migration over the libvirtd connection,
QEMU will just be given a UNIX socket or even an anonymous pipe. So any
use of OOB data is pretty much out of the question.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support
  2017-02-27 11:26                   ` Dr. David Alan Gilbert
  2017-02-27 15:00                     ` Andrea Arcangeli
@ 2017-02-27 19:04                     ` Alexey Perevalov
  1 sibling, 0 replies; 73+ messages in thread
From: Alexey Perevalov @ 2017-02-27 19:04 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Andrea Arcangeli, qemu-devel, quintela

On 02/27/2017 02:26 PM, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
>> Hi David,
>>
>>
>> On Tue, Feb 21, 2017 at 10:03:14AM +0000, Dr. David Alan Gilbert wrote:
>>> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
>>>> Hello David,
>>> Hi Alexey,
>>>
>>>> On Tue, Feb 14, 2017 at 07:34:26PM +0000, Dr. David Alan Gilbert wrote:
>>>>> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> Thank your, now it's clear.
>>>>>>
>>>>>> On Mon, Feb 13, 2017 at 06:16:02PM +0000, Dr. David Alan Gilbert wrote:
>>>>>>> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
>>>>>>>>   Hello David!
>>>>>>> Hi Alexey,
>>>>>>>
>>>>>>>> I have checked you series with 1G hugepage, but only in 1 Gbit/sec network
>>>>>>>> environment.
>>>>>>> Can you show the qemu command line you're using?  I'm just trying
>>>>>>> to make sure I understand where your hugepages are; running 1G hostpages
>>>>>>> across a 1Gbit/sec network for postcopy would be pretty poor - it would take
>>>>>>> ~10 seconds to transfer the page.
>>>>>> sure
>>>>>> -hda ./Ubuntu.img -name PAU,debug-threads=on -boot d -net nic -net user
>>>>>> -m 1024 -localtime -nographic -enable-kvm -incoming tcp:0:4444 -object
>>>>>> memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages -mem-prealloc
>>>>>> -numa node,memdev=mem -trace events=/tmp/events -chardev
>>>>>> socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock,server,nowait
>>>>>> -mon chardev=charmonitor,id=monitor,mode=control
>>>>> OK, it's a pretty unusual setup - a 1G page guest with 1G of guest RAM.
>>>>>
>>>>>>>> I started Ubuntu just with console interface and gave to it only 1G of
>>>>>>>> RAM, inside Ubuntu I started stress command
>>>>>>>> (stress --cpu 4 --io 4 --vm 4 --vm-bytes 256000000 &)
>>>>>>>> in such environment precopy live migration was impossible, it never
>>>>>>>> being finished, in this case it infinitely sends pages (it looks like
>>>>>>>> dpkg scenario).
>>>>>>>>
>>>>>>>> Also I modified stress utility
>>>>>>>> http://people.seas.harvard.edu/~apw/stress/stress-1.0.4.tar.gz
>>>>>>>> due to it wrote into memory every time the same value `Z`. My
>>>>>>>> modified version writes every allocation new incremented value.
>>>>>>> I use google's stressapptest normally; although remember to turn
>>>>>>> off the bit where it pauses.
>>>>>> I decided to use it too
>>>>>> stressapptest -s 300 -M 256 -m 8 -W
>>>>>>
>>>>>>>> I'm using Arcangeli's kernel only at the destination.
>>>>>>>>
>>>>>>>> I got controversial results. Downtime for 1G hugepage is close to 2Mb
>>>>>>>> hugepage and it took around 7 ms (in 2Mb hugepage scenario downtime was
>>>>>>>> around 8 ms).
>>>>>>>> I made that opinion by query-migrate.
>>>>>>>> {"return": {"status": "completed", "setup-time": 6, "downtime": 6, "total-time": 9668, "ram": {"total": 1091379200, "postcopy-requests": 1, "dirty-sync-count": 2, "remaining": 0, "mbps": 879.786851, "transferred": 1063007296, "duplicate": 7449, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1060868096, "normal": 259001}}}
>>>>>>>>
>>>>>>>> Documentation says about downtime field - measurement unit is ms.
>>>>>>> The downtime measurement field is pretty meaningless for postcopy; it's only
>>>>>>> the time from stopping the VM until the point where we tell the destination it
>>>>>>> can start running.  Meaningful measurements are only from inside the guest
>>>>>>> really, or the place latencys.
>>>>>>>
>>>>>> Maybe improve it by receiving such information from destination?
>>>>>> I wish to do that.
>>>>>>>> So I traced it (I added additional trace into postcopy_place_page
>>>>>>>> trace_postcopy_place_page_start(host, from, pagesize); )
>>>>>>>>
>>>>>>>> postcopy_ram_fault_thread_request Request for HVA=7f6dc0000000 rb=/objects/mem offset=0
>>>>>>>> postcopy_place_page_start host=0x7f6dc0000000 from=0x7f6d70000000, pagesize=40000000
>>>>>>>> postcopy_place_page_start host=0x7f6e0e800000 from=0x55b665969619, pagesize=1000
>>>>>>>> postcopy_place_page_start host=0x7f6e0e801000 from=0x55b6659684e8, pagesize=1000
>>>>>>>> several pages with 4Kb step ...
>>>>>>>> postcopy_place_page_start host=0x7f6e0e817000 from=0x55b6659694f0, pagesize=1000
>>>>>>>>
>>>>>>>> 4K pages, started from 0x7f6e0e800000 address it's
>>>>>>>> vga.ram, /rom@etc/acpi/tables etc.
>>>>>>>>
>>>>>>>> Frankly saying, right now, I don't have any ideas why hugepage wasn't
>>>>>>>> resent. Maybe my expectation of it is wrong as well as understanding )
>>>>>>> That's pretty much what I expect to see - before you get into postcopy
>>>>>>> mode everything is sent as individual 4k pages (in order); once we're
>>>>>>> in postcopy mode we send each page no more than once.  So you're
>>>>>>> huge page comes across once - and there it is.
>>>>>>>
>>>>>>>> stress utility also duplicated for me value into appropriate file:
>>>>>>>> sec_since_epoch.microsec:value
>>>>>>>> 1487003192.728493:22
>>>>>>>> 1487003197.335362:23
>>>>>>>> *1487003213.367260:24*
>>>>>>>> *1487003238.480379:25*
>>>>>>>> 1487003243.315299:26
>>>>>>>> 1487003250.775721:27
>>>>>>>> 1487003255.473792:28
>>>>>>>>
>>>>>>>> It mean rewriting 256Mb of memory per byte took around 5 sec, but at
>>>>>>>> the moment of migration it took 25 sec.
>>>>>>> right, now this is the thing that's more useful to measure.
>>>>>>> That's not too surprising; when it migrates that data is changing rapidly
>>>>>>> so it's going to have to pause and wait for that whole 1GB to be transferred.
>>>>>>> Your 1Gbps network is going to take about 10 seconds to transfer that
>>>>>>> 1GB page - and that's if you're lucky and it saturates the network.
>>>>>>> SO it's going to take at least 10 seconds longer than it normally
>>>>>>> would, plus any other overheads - so at least 15 seconds.
>>>>>>> This is why I say it's a bad idea to use 1GB host pages with postcopy.
>>>>>>> Of course it would be fun to find where the other 10 seconds went!
>>>>>>>
>>>>>>> You might like to add timing to the tracing so you can see the time between the
>>>>>>> fault thread requesting the page and it arriving.
>>>>>>>
>>>>>> yes, sorry I forgot about timing
>>>>>> 20806@1487084818.270993:postcopy_ram_fault_thread_request Request for HVA=7f0280000000 rb=/objects/mem offset=0
>>>>>> 20806@1487084818.271038:qemu_loadvm_state_section 8
>>>>>> 20806@1487084818.271056:loadvm_process_command com=0x2 len=4
>>>>>> 20806@1487084818.271089:qemu_loadvm_state_section 2
>>>>>> 20806@1487084823.315919:postcopy_place_page_start host=0x7f0280000000 from=0x7f0240000000, pagesize=40000000
>>>>>>
>>>>>> 1487084823.315919 - 1487084818.270993 = 5.044926 sec.
>>>>>> Machines connected w/o any routers, directly by cable.
>>>>> OK, the fact it's only 5 seconds not 10 I think suggests a lot of the memory was all zero
>>>>> so didn't take up the whole bandwidth.
>>>> I decided to measure downtime as a sum of intervals since fault happened
>>>> and till page was load. I didn't relay on order, so I associated that
>>>> interval with fault address.
>>> Don't forget the source will still be sending unrequested pages at the
>>> same time as fault responses; so that simplification might be wrong.
>>> My experience with 4k pages is you'll often get pages that arrive
>>> at about the same time as you ask for them because of the background transmission.
>>>
>>>> For 2G ram vm - using 1G huge page, downtime measured on dst is around 12 sec,
>>>> but for the same 2G ram vm with 2Mb huge page, downtime measured on dst
>>>> is around 20 sec, and 320 page faults happened, 640 Mb was transmitted.
>>> OK, so 20/320 * 1000=62.5msec/ page.   That's a bit high.
>>> I think it takes about 16ms to transmit a 2MB page on your 1Gbps network,
>> Yes, you right, transfer of the first page doesn't wait for prefetched page
>> transmission, and downtime for first page was 25 ms.
>>
>> Next requested pages are queued (FIFO) so dst is waiting all prefetched pages,
>> it's around 5-7 pages transmission.
>> So I have a question why not to put requested page into the head of
>> queue in that case, and dst qemu will wait only lesser, only page which
>> was already in transmission.
> The problem is it's already in the source's network queue.
>
>> Also if I'm not wrong, commands and pages are transferred over the same
>> socket. Why not to use OOB TCP in this case for commands?
> My understanding was that OOB was limited to quite small transfers
> I think the right way is to use a separate FD for the requests, so I'll
> do it after Juan's multifd series.
> Although even then I'm not sure how it will behave; the other thing
> might be to throttle the background page transfer so the FIFO isn't
> as full.
>
>>> you're probably also suffering from the requests being queued behind
>>> background requests; if you try reducing your tcp_wmem setting on the
>>> source it might get a bit better.  Once Juan Quintela's multi-fd work
>>> goes in my hope is to combine it with postcopy and then be able to
>>> avoid that type of request blocking.
>>> Generally I'd not recommend 10Gbps for postcopy since it does pull
>>> down the latency quite a bit.
>>>
>>>> My current method doesn't take into account multi core vcpu. I checked
>>>> only with 1 CPU, but it's not proper case. So I think it's worth to
>>>> count downtime per CPU, or calculate overlap of CPU downtimes.
>>>> How do your think?
>>> Yes; one of the nice things about postcopy is that if one vCPU is blocked
>>> waiting for a page, the other vCPUs will just be able to carry on.
>>> Even with 1 vCPU if you've got multiple tasks that can run the guest can
>>> switch to a task that isn't blocked (See KVM asynchronous page faults).
>>> Now, what the numbers mean when you calculate the total like that might be a bit
>>> odd - for example if you have 8 vCPUs and they're each blocked do you
>>> add the times together even though they're blocked at the same time? What
>>> about if they're blocked on the same page?
>> I implemented downtime calculation for all cpu's, the approach is
>> following:
>>
>> Initially intervals are represented in tree where key is
>> pagefault address, and values:
>>      begin - page fault time
>>      end   - page load time
>>      cpus  - bit mask shows affected cpus
>>
>> To calculate overlap on all cpus, intervals converted into
>> array of points in time (downtime_intervals), the size of
>> array is 2 * number of nodes in tree of intervals (2 array
>> elements per one in element of interval).
>> Each element is marked as end (E) or not the end (S) of
>> interval.
>> The overlap downtime will be calculated for SE, only in
>> case of sequence S(0..N)E(M) for every vCPU.
>>
>> As example we have 3 CPU
>>       S1        E1           S1               E1
>> -----***********------------xxx***************------------------------> CPU1
>>
>>              S2                E2
>> ------------****************xxx---------------------------------------> CPU2
>>
>>                          S3            E3
>> ------------------------****xxx********-------------------------------> CPU3
>> 	
>> We have sequence S1,S2,E1,S3,S1,E2,E3,E1
>> S2,E1 - doesn't match condition due to
>> sequence S1,S2,E1 doesn't include CPU3,
>> S3,S1,E2 - sequenece includes all CPUs, in
>> this case overlap will be S1,E2
>>
>>
>> But I don't send RFC now,
>> due to I faced an issue. Kernel doesn't inform user space about page's
>> owner in handle_userfault. So it's the question to Andrea. Is it worth
>> to add such information.
>> Frankly saying, I don't know is current (task_struct) in
>> handle_userfault equal to mm_struct's owner.
> Is this so you can find which thread is waiting for it? I'm not sure it's
> worth it; we don't normally need that, and anyway it doesn't help if multiple
> CPUs need it, where the 2nd CPU hits it just after the 1st one.
I think in the case of multiple CPUs, e.g. 2 CPUs, the first page fault
will come from CPU0 for page ADDR and we store it with the proper CPU
index; then a second page fault comes from the just-started CPU1 for
the same page ADDR and we track it as well. Finally we calculate the
downtime as the overlap, and the sum of the overlaps will be the final
downtime.
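
Just to illustrate the idea, here is a minimal sketch of the overlap
pass (the names and the flat, time-sorted point array are hypothetical
simplifications of the interval tree described above - this is not the
RFC code itself):

/*
 * Sketch only: downtime is accumulated while *all* vCPUs are blocked
 * on a not-yet-placed page, i.e. the S(0..N)E(M) condition above.
 * Assumes points[] is sorted by time and smp_cpus <= 63.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t time;    /* timestamp of the event, e.g. in ms */
    unsigned cpu;     /* vCPU the event belongs to */
    bool is_end;      /* false = S (page fault), true = E (page placed) */
} DowntimePoint;

static uint64_t overlap_downtime(const DowntimePoint *points, int npoints,
                                 unsigned smp_cpus)
{
    uint64_t blocked = 0;                       /* one bit per blocked vCPU */
    uint64_t all = (1ULL << smp_cpus) - 1;
    uint64_t begin = 0, downtime = 0;
    int i;

    for (i = 0; i < npoints; i++) {
        bool was_all = (blocked == all);

        if (points[i].is_end) {
            blocked &= ~(1ULL << points[i].cpu);
        } else {
            blocked |= 1ULL << points[i].cpu;
        }

        if (!was_all && blocked == all) {
            begin = points[i].time;             /* all vCPUs now blocked */
        } else if (was_all && blocked != all) {
            downtime += points[i].time - begin; /* overlap interval ended */
        }
    }
    return downtime;
}

With something like this, two vCPUs faulting on the same page simply
contribute their own S/E events (or bits in the interval's cpu mask),
so the overlap still comes out right without knowing which task
actually faulted in the kernel.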

>
> Dave
>
>>>> Also I didn't yet finish IPC to provide such information to src host, where
>>>> info_migrate is being called.
>>> Dave
>>>
>>>>
>>>>>>>> Another one request.
>>>>>>>> QEMU could use mem_path in hugefs with share key simultaneously
>>>>>>>> (-object memory-backend-file,id=mem,size=${mem_size},mem-path=${mem_path},share=on) and vm
>>>>>>>> in this case will start and will properly work (it will allocate memory
>>>>>>>> with mmap), but in case of destination for postcopy live migration
>>>>>>>> UFFDIO_COPY ioctl will fail for
>>>>>>>> such region, in Arcangeli's git tree there is such prevent check
>>>>>>>> (if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED).
>>>>>>>> Is it possible to handle such situation at qemu?
>>>>>>> Imagine that you had shared memory; what semantics would you like
>>>>>>> to see ?  What happens to the other process?
>>>>>> Honestly, initially, I thought to handle such error, but I quit forgot
>>>>>> about vhost-user in ovs-dpdk.
>>>>> Yes, I don't know much about vhost-user; but we'll have to think carefully
>>>>> about the way things behave when they're accessing memory that's shared
>>>>> with qemu during migration.  Writing to the source after we've started
>>>>> the postcopy phase is not allowed.  Accessing the destination memory
>>>>> during postcopy will produce pauses in the other processes accessing it
>>>>> (I think) and they mustn't do various types of madvise etc - so
>>>>> I'm sure there will be things we find out the hard way!
>>>>>
>>>>> Dave
>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>>> On Mon, Feb 06, 2017 at 05:45:30PM +0000, Dr. David Alan Gilbert wrote:
>>>>>>>>> * Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
>>>>>>>>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>    The existing postcopy code, and the userfault kernel
>>>>>>>>>> code that supports it, only works for normal anonymous memory.
>>>>>>>>>> Kernel support for userfault on hugetlbfs is working
>>>>>>>>>> it's way upstream; it's in the linux-mm tree,
>>>>>>>>>> You can get a version at:
>>>>>>>>>>     git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
>>>>>>>>>> on the origin/userfault branch.
>>>>>>>>>>
>>>>>>>>>> Note that while this code supports arbitrary sized hugepages,
>>>>>>>>>> it doesn't make sense with pages above the few-MB region,
>>>>>>>>>> so while 2MB is fine, 1GB is probably a bad idea;
>>>>>>>>>> this code waits for and transmits whole huge pages, and a
>>>>>>>>>> 1GB page would take about 1 second to transfer over a 10Gbps
>>>>>>>>>> link - which is way too long to pause the destination for.
>>>>>>>>>>
>>>>>>>>>> Dave
>>>>>>>>> Oops I missed the v2 changes from the message:
>>>>>>>>>
>>>>>>>>> v2
>>>>>>>>>    Flip ram-size summary word/compare individual page size patches around
>>>>>>>>>    Individual page size comparison is done in ram_load if 'advise' has been
>>>>>>>>>      received rather than checking migrate_postcopy_ram()
>>>>>>>>>    Moved discard code into exec.c, reworked ram_discard_range
>>>>>>>>>
>>>>>>>>> Dave
>>>>>>>> Thank your, right now it's not necessary to set
>>>>>>>> postcopy-ram capability on destination machine.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> Dr. David Alan Gilbert (16):
>>>>>>>>>>    postcopy: Transmit ram size summary word
>>>>>>>>>>    postcopy: Transmit and compare individual page sizes
>>>>>>>>>>    postcopy: Chunk discards for hugepages
>>>>>>>>>>    exec: ram_block_discard_range
>>>>>>>>>>    postcopy: enhance ram_block_discard_range for hugepages
>>>>>>>>>>    Fold postcopy_ram_discard_range into ram_discard_range
>>>>>>>>>>    postcopy: Record largest page size
>>>>>>>>>>    postcopy: Plumb pagesize down into place helpers
>>>>>>>>>>    postcopy: Use temporary for placing zero huge pages
>>>>>>>>>>    postcopy: Load huge pages in one go
>>>>>>>>>>    postcopy: Mask fault addresses to huge page boundary
>>>>>>>>>>    postcopy: Send whole huge pages
>>>>>>>>>>    postcopy: Allow hugepages
>>>>>>>>>>    postcopy: Update userfaultfd.h header
>>>>>>>>>>    postcopy: Check for userfault+hugepage feature
>>>>>>>>>>    postcopy: Add doc about hugepages and postcopy
>>>>>>>>>>
>>>>>>>>>>   docs/migration.txt                |  13 ++++
>>>>>>>>>>   exec.c                            |  83 +++++++++++++++++++++++
>>>>>>>>>>   include/exec/cpu-common.h         |   2 +
>>>>>>>>>>   include/exec/memory.h             |   1 -
>>>>>>>>>>   include/migration/migration.h     |   3 +
>>>>>>>>>>   include/migration/postcopy-ram.h  |  13 ++--
>>>>>>>>>>   linux-headers/linux/userfaultfd.h |  81 +++++++++++++++++++---
>>>>>>>>>>   migration/migration.c             |   1 +
>>>>>>>>>>   migration/postcopy-ram.c          | 138 +++++++++++++++++---------------------
>>>>>>>>>>   migration/ram.c                   | 109 ++++++++++++++++++------------
>>>>>>>>>>   migration/savevm.c                |  32 ++++++---
>>>>>>>>>>   migration/trace-events            |   2 +-
>>>>>>>>>>   12 files changed, 328 insertions(+), 150 deletions(-)
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> 2.9.3
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>>>
>>>>>>> --
>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>
>>>>>> -- 
>>>>>>
>>>>>> BR
>>>>>> Alexey
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>
>>>> -- 
>>>>
>>>> BR
>>>> Alexey
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
>> BR
>> Alexey
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>


-- 
Best regards,
Alexey Perevalov

^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2017-02-27 19:04 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-06 17:32 [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support Dr. David Alan Gilbert (git)
2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 01/16] postcopy: Transmit ram size summary word Dr. David Alan Gilbert (git)
2017-02-24 10:16   ` Laurent Vivier
2017-02-24 13:10   ` Juan Quintela
2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 02/16] postcopy: Transmit and compare individual page sizes Dr. David Alan Gilbert (git)
2017-02-24 10:31   ` Laurent Vivier
2017-02-24 10:48     ` Dr. David Alan Gilbert
2017-02-24 10:50       ` Laurent Vivier
2017-02-24 13:13   ` Juan Quintela
2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 03/16] postcopy: Chunk discards for hugepages Dr. David Alan Gilbert (git)
2017-02-24 13:48   ` Laurent Vivier
2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 04/16] exec: ram_block_discard_range Dr. David Alan Gilbert (git)
2017-02-24 13:14   ` Juan Quintela
2017-02-24 14:04   ` Laurent Vivier
2017-02-24 16:50     ` Dr. David Alan Gilbert
2017-02-24 14:08   ` Laurent Vivier
2017-02-24 15:35     ` Dr. David Alan Gilbert
2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 05/16] postcopy: enhance ram_block_discard_range for hugepages Dr. David Alan Gilbert (git)
2017-02-24 13:20   ` Juan Quintela
2017-02-24 13:44     ` Dr. David Alan Gilbert
2017-02-24 14:20   ` Laurent Vivier
2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 06/16] Fold postcopy_ram_discard_range into ram_discard_range Dr. David Alan Gilbert (git)
2017-02-24 13:21   ` Juan Quintela
2017-02-24 14:26   ` Laurent Vivier
2017-02-24 16:02     ` Dr. David Alan Gilbert
2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 07/16] postcopy: Record largest page size Dr. David Alan Gilbert (git)
2017-02-24 13:22   ` Juan Quintela
2017-02-24 14:37   ` Laurent Vivier
2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 08/16] postcopy: Plumb pagesize down into place helpers Dr. David Alan Gilbert (git)
2017-02-24 13:24   ` Juan Quintela
2017-02-24 15:10   ` Laurent Vivier
2017-02-24 15:21     ` Dr. David Alan Gilbert
2017-02-06 17:32 ` [Qemu-devel] [PATCH v2 09/16] postcopy: Use temporary for placing zero huge pages Dr. David Alan Gilbert (git)
2017-02-24 15:31   ` Laurent Vivier
2017-02-24 15:46     ` Dr. David Alan Gilbert
2017-02-24 17:24       ` Laurent Vivier
2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 10/16] postcopy: Load huge pages in one go Dr. David Alan Gilbert (git)
2017-02-24 15:54   ` Laurent Vivier
2017-02-24 16:32     ` Dr. David Alan Gilbert
2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 11/16] postcopy: Mask fault addresses to huge page boundary Dr. David Alan Gilbert (git)
2017-02-24 15:59   ` Laurent Vivier
2017-02-24 16:34     ` Dr. David Alan Gilbert
2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 12/16] postcopy: Send whole huge pages Dr. David Alan Gilbert (git)
2017-02-24 16:06   ` Laurent Vivier
2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 13/16] postcopy: Allow hugepages Dr. David Alan Gilbert (git)
2017-02-24 16:07   ` Laurent Vivier
2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 14/16] postcopy: Update userfaultfd.h header Dr. David Alan Gilbert (git)
2017-02-24 16:09   ` Laurent Vivier
2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 15/16] postcopy: Check for userfault+hugepage feature Dr. David Alan Gilbert (git)
2017-02-24 16:12   ` Laurent Vivier
2017-02-06 17:33 ` [Qemu-devel] [PATCH v2 16/16] postcopy: Add doc about hugepages and postcopy Dr. David Alan Gilbert (git)
2017-02-24 13:25   ` Juan Quintela
2017-02-24 16:12   ` Laurent Vivier
2017-02-06 17:45 ` [Qemu-devel] [PATCH v2 00/16] Postcopy: Hugepage support Dr. David Alan Gilbert
     [not found]   ` <CGME20170213171108eucas1p147999fc8b6980ff89a67626b78b12e44@eucas1p1.samsung.com>
2017-02-13 17:11     ` Alexey Perevalov
2017-02-13 17:57       ` Andrea Arcangeli
2017-02-13 18:10         ` Andrea Arcangeli
2017-02-13 21:59           ` Mike Kravetz
2017-02-14 14:48         ` Alexey Perevalov
2017-02-17 16:47           ` Andrea Arcangeli
2017-02-20 16:01             ` Alexey Perevalov
2017-02-13 18:16       ` Dr. David Alan Gilbert
2017-02-14 16:22         ` Alexey Perevalov
2017-02-14 19:34           ` Dr. David Alan Gilbert
2017-02-21  7:31             ` Alexey Perevalov
2017-02-21 10:03               ` Dr. David Alan Gilbert
2017-02-27 11:05                 ` Alexey Perevalov
2017-02-27 11:26                   ` Dr. David Alan Gilbert
2017-02-27 15:00                     ` Andrea Arcangeli
2017-02-27 15:47                       ` Daniel P. Berrange
2017-02-27 19:04                     ` Alexey Perevalov
2017-02-22 16:43 ` Laurent Vivier
2017-02-24 10:04 ` Dr. David Alan Gilbert
