* [Qemu-devel] [PATCH V5 0/9] calculate blocktime for postcopy live migration
From: Alexey Perevalov @ 2017-05-12 13:31 UTC
  To: qemu-devel; +Cc: dgilbert, a.perevalov, i.maximets, peterx

The rationale for this idea is the following:
a vCPU can be suspended during postcopy live migration until the faulted
page has been copied into the kernel. Downtime on the source side is the
time interval from when the source turns the vCPUs off until the
destination starts running them. That value is adequate for precopy
migration, where it really shows how long the vCPUs are down, but not for
postcopy migration, because individual vCPU threads can still be suspended
after the vCPUs have been started. That is important for estimating packet
drop for SDN software.
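
An illustrative timeline (made-up numbers, for explanation only): with
precopy, downtime is the single interval between stopping the vCPUs on the
source and starting them on the destination. With postcopy, the vCPUs are
already running on the destination, but individual vCPU threads still
block on not-yet-copied pages:

    src vCPUs stopped    dst vCPUs started
          |---downtime---|
                         |--xx-----x------->  vCPU1 (x = blocked on fault)
                         |-----xxx----x---->  vCPU2

The blocktime introduced by this series accounts for those 'x' intervals.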

This is the 5th version of the patch set; the previous version had a build
error on mingw. The first version was tagged as RFC, the second had no
version tag, and the third was tagged V3.

This patch set doesn't include the improvements suggested by Peter Xu for
get_mem_fault_cpu_index, but I would prefer to do them: introducing a tree
for fast CPUState lookup by thread_id, or generalizing the code, since
there are places like qemu_get_cpu (cpus.c) with a similar lookup.
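
As a sketch of that idea (hypothetical, not part of this series; the names
below are made up), the lookup could be backed by a glib hash table filled
when each vCPU thread starts:

    /* hypothetical: O(1) CPUState lookup by thread_id, replacing the
     * linear CPU_FOREACH() scan in get_mem_fault_cpu_index() */
    static GHashTable *thread_id_to_cpu;

    static void register_cpu_thread(CPUState *cpu)
    {
        if (!thread_id_to_cpu) {
            thread_id_to_cpu = g_hash_table_new(g_direct_hash,
                                                g_direct_equal);
        }
        g_hash_table_insert(thread_id_to_cpu,
                            GINT_TO_POINTER(cpu->thread_id), cpu);
    }

    static CPUState *cpu_by_thread_id(uint32_t pid)
    {
        if (!thread_id_to_cpu) {
            return NULL;
        }
        return g_hash_table_lookup(thread_id_to_cpu, GINT_TO_POINTER(pid));
    }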

(V4 -> V5)
    - the empty fill_destination_postcopy_migration_info stub was missing
for non-Linux builds

(V3 -> V4)
    - got rid of "downtime" as the name for the vCPU waiting time during
postcopy migration
    - PostcopyBlocktimeContext renamed (it was just BlocktimeContext)
    - atomic operations are used for the fields of PostcopyBlocktimeContext
touched by both threads
    - hardcoded function names in error_report were replaced with %s and
__func__
    - this patch set includes the postcopy-blocktime capability, but it is
used on the destination; combined with the inability to return the
calculated blocktime back to the source to show it in query-migrate, that
looks like a big trade-off
    - UFFD_API has to be sent whether or not we need to ask the kernel for
a feature, because the kernel expects it in any case (see patch comment)
    - postcopy_blocktime included in the query-migrate output
    - this patch set also includes the trivial fix
"migration: fix hardcoded function name in error report";
maybe that is a candidate for the qemu-trivial mailing list, but I already
sent "migration: Fixed code style" and it went unclaimed.


(V2 -> V3)
    - the downtime calculation approach was changed, thanks to Peter Xu
    - due to the previous point there is no more need to keep a GTree or a
bitmap of CPUs, so the glib changes aren't included in this patch set;
they could be resent in another patch set if there is a good reason for it
    - no procfs traces in this patch set; if somebody wants them, they can
be fetched from the patchwork site to track down page fault initiators
    - UFFD_FEATURE_THREAD_ID is requested only when the kernel supports it
    - the downtime isn't sent back to the source, it is just traced

This patch set is based on the master branch of
git://git.qemu-project.org/qemu.git, base commit
ecc1f5adeec4e3324d1b695a7c54e3967c526949. That point is after the
postcopy-ram.h movement.

It contains a patch for a kernel header, just for the convenience of
applying and testing the current patch set until the kernel headers are
synced. At the moment of posting this patch set, "userfaultfd: provide pid
in userfault msg" hadn't yet been merged upstream.

Alexey Perevalov (9):
  userfault: add pid into uffd_msg & update UFFD_FEATURE_*
  migration: pass ptr to MigrationIncomingState into migration
    ufd_version_check & postcopy_ram_supported_by_host
  migration: fix hardcoded function name in error report
  migration: split ufd_version_check onto receive/request features part
  migration: introduce postcopy-blocktime capability
  migration: add postcopy vcpu blocktime context into
    MigrationIncomingState
  migration: calculate vCPU blocktime on dst side
  migration: add postcopy total blocktime into query-migrate
  migration: postcopy_blocktime documentation

 docs/migration.txt                |  10 ++
 include/migration/migration.h     |  13 ++
 include/migration/postcopy-ram.h  |   2 +-
 linux-headers/linux/userfaultfd.h |   5 +
 migration/migration.c             |  58 ++++++-
 migration/postcopy-ram.c          | 322 ++++++++++++++++++++++++++++++++++++--
 migration/savevm.c                |   2 +-
 migration/trace-events            |   6 +-
 qapi-schema.json                  |  11 +-
 9 files changed, 408 insertions(+), 21 deletions(-)

-- 
1.9.1


* [Qemu-devel] [PATCH V5 1/9] userfault: add pid into uffd_msg & update UFFD_FEATURE_*
From: Alexey Perevalov @ 2017-05-12 13:31 UTC
  To: qemu-devel; +Cc: dgilbert, a.perevalov, i.maximets, peterx

This commit duplicates the header changes of the kernel patch
"userfaultfd: provide pid in userfault msg" into the QEMU copy of the
Linux headers.

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 linux-headers/linux/userfaultfd.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/linux-headers/linux/userfaultfd.h b/linux-headers/linux/userfaultfd.h
index 2ed5dc3..e7c8898 100644
--- a/linux-headers/linux/userfaultfd.h
+++ b/linux-headers/linux/userfaultfd.h
@@ -77,6 +77,9 @@ struct uffd_msg {
 		struct {
 			__u64	flags;
 			__u64	address;
+			union {
+				__u32   ptid;
+			} feat;
 		} pagefault;
 
 		struct {
@@ -158,6 +161,8 @@ struct uffdio_api {
 #define UFFD_FEATURE_EVENT_MADVDONTNEED		(1<<3)
 #define UFFD_FEATURE_MISSING_HUGETLBFS		(1<<4)
 #define UFFD_FEATURE_MISSING_SHMEM		(1<<5)
+#define UFFD_FEATURE_EVENT_UNMAP		(1<<6)
+#define UFFD_FEATURE_THREAD_ID			(1<<7)
 	__u64 features;
 
 	__u64 ioctls;
-- 
1.9.1


* [Qemu-devel] [PATCH V5 2/9] migration: pass ptr to MigrationIncomingState into migration ufd_version_check & postcopy_ram_supported_by_host
From: Alexey Perevalov @ 2017-05-12 13:31 UTC
  To: qemu-devel; +Cc: dgilbert, a.perevalov, i.maximets, peterx

This tiny refactoring is necessary to be able to set
UFFD_FEATURE_THREAD_ID while requesting features, and then to create the
blocktime context when the kernel supports it.

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 include/migration/postcopy-ram.h |  2 +-
 migration/migration.c            |  2 +-
 migration/postcopy-ram.c         | 10 +++++-----
 migration/savevm.c               |  2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
index 8e036b9..809f6db 100644
--- a/include/migration/postcopy-ram.h
+++ b/include/migration/postcopy-ram.h
@@ -14,7 +14,7 @@
 #define QEMU_POSTCOPY_RAM_H
 
 /* Return true if the host supports everything we need to do postcopy-ram */
-bool postcopy_ram_supported_by_host(void);
+bool postcopy_ram_supported_by_host(MigrationIncomingState *mis);
 
 /*
  * Make all of RAM sensitive to accesses to areas that haven't yet been written
diff --git a/migration/migration.c b/migration/migration.c
index 353f272..569a7f6 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -804,7 +804,7 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
          * special support.
          */
         if (!old_postcopy_cap && runstate_check(RUN_STATE_INMIGRATE) &&
-            !postcopy_ram_supported_by_host()) {
+            !postcopy_ram_supported_by_host(NULL)) {
             /* postcopy_ram_supported_by_host will have emitted a more
              * detailed message
              */
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 85fd8d7..4c859b4 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -60,7 +60,7 @@ struct PostcopyDiscardState {
 #include <sys/eventfd.h>
 #include <linux/userfaultfd.h>
 
-static bool ufd_version_check(int ufd)
+static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
 {
     struct uffdio_api api_struct;
     uint64_t ioctl_mask;
@@ -113,7 +113,7 @@ static int test_range_shared(const char *block_name, void *host_addr,
  * normally fine since if the postcopy succeeds it gets turned back on at the
  * end.
  */
-bool postcopy_ram_supported_by_host(void)
+bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
 {
     long pagesize = getpagesize();
     int ufd = -1;
@@ -136,7 +136,7 @@ bool postcopy_ram_supported_by_host(void)
     }
 
     /* Version and features check */
-    if (!ufd_version_check(ufd)) {
+    if (!ufd_version_check(ufd, mis)) {
         goto out;
     }
 
@@ -513,7 +513,7 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
      * Although the host check already tested the API, we need to
      * do the check again as an ABI handshake on the new fd.
      */
-    if (!ufd_version_check(mis->userfault_fd)) {
+    if (!ufd_version_check(mis->userfault_fd, mis)) {
         return -1;
     }
 
@@ -651,7 +651,7 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
 
 #else
 /* No target OS support, stubs just fail */
-bool postcopy_ram_supported_by_host(void)
+bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
 {
     error_report("%s: No OS support", __func__);
     return false;
diff --git a/migration/savevm.c b/migration/savevm.c
index a00c1ab..27ab8b2 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1360,7 +1360,7 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
         return -1;
     }
 
-    if (!postcopy_ram_supported_by_host()) {
+    if (!postcopy_ram_supported_by_host(mis)) {
         postcopy_state_set(POSTCOPY_INCOMING_NONE);
         return -1;
     }
-- 
1.9.1


* [Qemu-devel] [PATCH V5 3/9] migration: fix hardcoded function name in error report
From: Alexey Perevalov @ 2017-05-12 13:31 UTC
  To: qemu-devel; +Cc: dgilbert, a.perevalov, i.maximets, peterx

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 migration/postcopy-ram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 4c859b4..0f75700 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -68,7 +68,7 @@ static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
     api_struct.api = UFFD_API;
     api_struct.features = 0;
     if (ioctl(ufd, UFFDIO_API, &api_struct)) {
-        error_report("postcopy_ram_supported_by_host: UFFDIO_API failed: %s",
+        error_report("%s: UFFDIO_API failed: %s", __func__,
                      strerror(errno));
         return false;
     }
-- 
1.9.1


* [Qemu-devel] [PATCH V5 4/9] migration: split ufd_version_check onto receive/request features part
From: Alexey Perevalov @ 2017-05-12 13:31 UTC
  To: qemu-devel; +Cc: dgilbert, a.perevalov, i.maximets, peterx

This modification is necessary for userfault fd features that have to
be requested from userspace. UFFD_FEATURE_THREAD_ID is one such
"on demand" feature; it will be introduced in the next patch.

QEMU needs to use a separate userfault file descriptor, because the
userfault context has internal state: after the first UFFD_API ioctl
it changes its state to UFFD_STATE_RUNNING (in case of success), while
the kernel expects UFFD_STATE_WAIT_API when handling the UFFD_API
ioctl. So only one UFFD_API ioctl is possible per ufd.
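
To illustrate (sketch only, not part of the patch; error handling
omitted):

    int ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };

    ioctl(ufd, UFFDIO_API, &api);   /* first call: WAIT_API -> RUNNING */
    ioctl(ufd, UFFDIO_API, &api);   /* second call fails: the context is
                                       no longer in UFFD_STATE_WAIT_API */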

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 migration/postcopy-ram.c | 82 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 73 insertions(+), 9 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 0f75700..c96d5f5 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -60,32 +60,96 @@ struct PostcopyDiscardState {
 #include <sys/eventfd.h>
 #include <linux/userfaultfd.h>
 
-static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
+
+/*
+ * Check userfault fd features, to request only supported features in
+ * future.
+ * __NR_userfaultfd - should be checked before
+ * Return obtained features
+ */
+static bool receive_ufd_features(uint64_t *features)
 {
-    struct uffdio_api api_struct;
-    uint64_t ioctl_mask;
+    struct uffdio_api api_struct = {0};
+    int ufd;
+    bool ret = true;
+
+    /* if we are here, __NR_userfaultfd should exist */
+    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    if (ufd == -1) {
+        return false;
+    }
 
+    /* ask features */
     api_struct.api = UFFD_API;
     api_struct.features = 0;
     if (ioctl(ufd, UFFDIO_API, &api_struct)) {
         error_report("%s: UFFDIO_API failed: %s", __func__,
                      strerror(errno));
+        ret = false;
+        goto release_ufd;
+    }
+
+    *features = api_struct.features;
+
+release_ufd:
+    close(ufd);
+    return ret;
+}
+
+static bool request_ufd_features(int ufd, uint64_t features)
+{
+    struct uffdio_api api_struct = {0};
+    uint64_t ioctl_mask;
+
+    api_struct.api = UFFD_API;
+    api_struct.features = features;
+    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
+        error_report("%s failed: UFFDIO_API failed: %s", __func__,
+                strerror(errno));
         return false;
     }
 
-    ioctl_mask = (__u64)1 << _UFFDIO_REGISTER |
-                 (__u64)1 << _UFFDIO_UNREGISTER;
+    ioctl_mask = 1 << _UFFDIO_REGISTER |
+                 1 << _UFFDIO_UNREGISTER;
     if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
         error_report("Missing userfault features: %" PRIx64,
                      (uint64_t)(~api_struct.ioctls & ioctl_mask));
         return false;
     }
 
+    return true;
+}
+
+static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
+{
+    uint64_t asked_features = 0;
+    uint64_t supported_features;
+
+    /*
+     * it's not possible to
+     * request UFFD_API twice per one fd
+     */
+    if (!receive_ufd_features(&supported_features)) {
+        error_report("%s failed", __func__);
+        return false;
+    }
+
+    /*
+     * request features, even if asked_features is 0, because the
+     * kernel expects UFFD_API before UFFDIO_REGISTER, once per
+     * userfault file descriptor
+     */
+    if (!request_ufd_features(ufd, asked_features)) {
+        error_report("%s failed: features %" PRIu64, __func__,
+                asked_features);
+        return false;
+    }
+
     if (getpagesize() != ram_pagesize_summary()) {
         bool have_hp = false;
         /* We've got a huge page */
 #ifdef UFFD_FEATURE_MISSING_HUGETLBFS
-        have_hp = api_struct.features & UFFD_FEATURE_MISSING_HUGETLBFS;
+        have_hp = supported_features & UFFD_FEATURE_MISSING_HUGETLBFS;
 #endif
         if (!have_hp) {
             error_report("Userfault on this host does not support huge pages");
@@ -136,7 +200,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
     }
 
     /* Version and features check */
-    if (!ufd_version_check(ufd, mis)) {
+    if (!ufd_check_and_apply(ufd, mis)) {
         goto out;
     }
 
@@ -513,7 +577,7 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
      * Although the host check already tested the API, we need to
      * do the check again as an ABI handshake on the new fd.
      */
-    if (!ufd_version_check(mis->userfault_fd, mis)) {
+    if (!ufd_check_and_apply(mis->userfault_fd, mis)) {
         return -1;
     }
 
-- 
1.9.1


* [Qemu-devel] [PATCH V5 5/9] migration: introduce postcopy-blocktime capability
From: Alexey Perevalov @ 2017-05-12 13:31 UTC
  To: qemu-devel; +Cc: dgilbert, a.perevalov, i.maximets, peterx

Right now it can be used on the destination side to enable vCPU
blocktime calculation for postcopy live migration. vCPU blocktime is
the time from when a vCPU thread was put into interruptible sleep
until the memory page was copied and the thread woken.
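
For example, the capability can be enabled on the destination via QMP
(the same command used in the example later in this series):

    { "execute": "migrate-set-capabilities",
      "arguments": { "capabilities": [
          { "capability": "postcopy-blocktime", "state": true } ] } }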

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 include/migration/migration.h | 1 +
 migration/migration.c         | 9 +++++++++
 qapi-schema.json              | 5 ++++-
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index ba1a16c..82bbcd8 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -315,6 +315,7 @@ int migrate_compress_level(void);
 int migrate_compress_threads(void);
 int migrate_decompress_threads(void);
 bool migrate_use_events(void);
+bool migrate_postcopy_blocktime(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_message(MigrationIncomingState *mis,
diff --git a/migration/migration.c b/migration/migration.c
index 569a7f6..c0443ce 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1371,6 +1371,15 @@ bool migrate_zero_blocks(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_ZERO_BLOCKS];
 }
 
+bool migrate_postcopy_blocktime(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME];
+}
+
 bool migrate_use_compression(void)
 {
     MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index 01b087f..fde6d63 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -894,11 +894,14 @@
 # @release-ram: if enabled, qemu will free the migrated ram pages on the source
 #        during postcopy-ram migration. (since 2.9)
 #
+# @postcopy-blocktime: Calculate vCPU blocktime for postcopy live migration (since 2.10)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
   'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
-           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram'] }
+           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
+           'postcopy-blocktime'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
1.9.1


* [Qemu-devel] [PATCH V5 6/9] migration: add postcopy vcpu blocktime context into MigrationIncomingState
From: Alexey Perevalov @ 2017-05-12 13:31 UTC
  To: qemu-devel; +Cc: dgilbert, a.perevalov, i.maximets, peterx

This patch adds a request to kernel space for UFFD_FEATURE_THREAD_ID, in
case this feature is provided by the kernel.

PostcopyBlocktimeContext is encapsulated inside postcopy-ram.c, since it
is a postcopy-only feature. This also defines the lifetime of a
PostcopyBlocktimeContext instance: information from the instance will be
provided well after the postcopy migration ends, so the instance lives
until QEMU exits, but the parts of it used only during calculation
(vcpu_addr, page_fault_vcpu_time) are released when postcopy ends or
fails.

To enable postcopy blocktime calculation on the destination, the proper
capability needs to be requested (a patch for the documentation is at the
tail of the patch set).

As an example, the following command enables that capability, assuming
QEMU was started with the
-chardev socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock
option to control it:

[root@host]#printf "{\"execute\" : \"qmp_capabilities\"}\r\n \
{\"execute\": \"migrate-set-capabilities\" , \"arguments\":   {
\"capabilities\": [ { \"capability\": \"postcopy-blocktime\", \"state\":
true } ] } }" | nc -U /var/lib/migrate-vm-monitor.sock

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 include/migration/migration.h |  8 +++++
 migration/postcopy-ram.c      | 76 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 82bbcd8..7e69a2d 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -83,6 +83,8 @@ typedef enum {
     POSTCOPY_INCOMING_END
 } PostcopyState;
 
+struct PostcopyBlocktimeContext;
+
 /* State for the incoming migration */
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
@@ -123,6 +125,12 @@ struct MigrationIncomingState {
 
     /* See savevm.c */
     LoadStateEntry_Head loadvm_handlers;
+
+    /*
+     * PostcopyBlocktimeContext to keep information for postcopy
+     * live migration, to calculate vCPU block time
+     */
+    struct PostcopyBlocktimeContext *blocktime_ctx;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index c96d5f5..a1f1705 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -60,6 +60,73 @@ struct PostcopyDiscardState {
 #include <sys/eventfd.h>
 #include <linux/userfaultfd.h>
 
+typedef struct PostcopyBlocktimeContext {
+    /* time when page fault initiated per vCPU */
+    int64_t *page_fault_vcpu_time;
+    /* page address per vCPU */
+    uint64_t *vcpu_addr;
+    int64_t total_blocktime;
+    /* blocktime per vCPU */
+    int64_t *vcpu_blocktime;
+    /* point in time when last page fault was initiated */
+    int64_t last_begin;
+    /* number of vCPUs currently suspended */
+    int smp_cpus_down;
+
+    /*
+     * Handler for exit event, necessary for
+     * releasing whole blocktime_ctx
+     */
+    Notifier exit_notifier;
+    /*
+     * Handler for postcopy event, necessary for
+     * releasing unnecessary part of blocktime_ctx
+     */
+    Notifier postcopy_notifier;
+} PostcopyBlocktimeContext;
+
+static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx)
+{
+    g_free(ctx->page_fault_vcpu_time);
+    g_free(ctx->vcpu_addr);
+    g_free(ctx->vcpu_blocktime);
+    g_free(ctx);
+}
+
+static void postcopy_migration_cb(Notifier *n, void *data)
+{
+    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
+                                               postcopy_notifier);
+    MigrationState *s = data;
+    if (migration_has_finished(s) || migration_has_failed(s)) {
+        g_free(ctx->page_fault_vcpu_time);
+        /* g_free is NULL robust */
+        ctx->page_fault_vcpu_time = NULL;
+        g_free(ctx->vcpu_addr);
+        ctx->vcpu_addr = NULL;
+    }
+}
+
+static void migration_exit_cb(Notifier *n, void *data)
+{
+    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
+                                               exit_notifier);
+    destroy_blocktime_context(ctx);
+}
+
+static struct PostcopyBlocktimeContext *blocktime_context_new(void)
+{
+    PostcopyBlocktimeContext *ctx = g_new0(PostcopyBlocktimeContext, 1);
+    ctx->page_fault_vcpu_time = g_new0(int64_t, smp_cpus);
+    ctx->vcpu_addr = g_new0(uint64_t, smp_cpus);
+    ctx->vcpu_blocktime = g_new0(int64_t, smp_cpus);
+
+    ctx->exit_notifier.notify = migration_exit_cb;
+    ctx->postcopy_notifier.notify = postcopy_migration_cb;
+    qemu_add_exit_notifier(&ctx->exit_notifier);
+    add_migration_state_change_notifier(&ctx->postcopy_notifier);
+    return ctx;
+}
 
 /*
  * Check userfault fd features, to request only supported features in
@@ -134,6 +201,15 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
         return false;
     }
 
+#ifdef UFFD_FEATURE_THREAD_ID
+    if (migrate_postcopy_blocktime() && mis &&
+            UFFD_FEATURE_THREAD_ID & supported_features) {
+        /* kernel supports that feature */
+        mis->blocktime_ctx = blocktime_context_new();
+        asked_features |= UFFD_FEATURE_THREAD_ID;
+    }
+#endif
+
     /*
      * request features, even if asked_features is 0, due to
      * kernel expects UFFD_API before UFFDIO_REGISTER, per
-- 
1.9.1


* [Qemu-devel] [PATCH V5 7/9] migration: calculate vCPU blocktime on dst side
From: Alexey Perevalov @ 2017-05-12 13:31 UTC
  To: qemu-devel; +Cc: dgilbert, a.perevalov, i.maximets, peterx

This patch provides blocktime calculation per vCPU, as a per-vCPU
summary and as an overlapped value across all vCPUs.

This approach was suggested by Peter Xu, as an improvement over the
previous approach, where QEMU kept a tree keyed by faulted page address
with a bitmask of CPUs in it. Now QEMU keeps an array with the faulted
page address as the value and the vCPU index as the index. That helps
to find the proper vCPU at UFFD_COPY time. It also keeps a list of
blocktime per vCPU (which can be traced with page_fault_addr).

Blocktime will not be calculated if the blocktime_ctx field of
MigrationIncomingState wasn't initialized.

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 migration/postcopy-ram.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++-
 migration/trace-events   |  5 ++-
 2 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index a1f1705..e2660ae 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -23,6 +23,7 @@
 #include "migration/postcopy-ram.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/balloon.h"
+#include <sys/param.h>
 #include "qemu/error-report.h"
 #include "trace.h"
 
@@ -542,6 +543,86 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
     return 0;
 }
 
+static int get_mem_fault_cpu_index(uint32_t pid)
+{
+    CPUState *cpu_iter;
+
+    CPU_FOREACH(cpu_iter) {
+        if (cpu_iter->thread_id == pid) {
+            return cpu_iter->cpu_index;
+        }
+    }
+    trace_get_mem_fault_cpu_index(pid);
+    return -1;
+}
+
+static void mark_postcopy_blocktime_begin(uint64_t addr, int cpu)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    PostcopyBlocktimeContext *dc;
+    int64_t now_ms;
+    if (!mis->blocktime_ctx || cpu < 0) {
+        return;
+    }
+    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    dc = mis->blocktime_ctx;
+    if (dc->vcpu_addr[cpu] == 0) {
+        atomic_inc(&dc->smp_cpus_down);
+    }
+
+    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
+    atomic_xchg__nocheck(&dc->last_begin, now_ms);
+    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
+
+    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
+            cpu);
+}
+
+static void mark_postcopy_blocktime_end(uint64_t addr)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    PostcopyBlocktimeContext *dc;
+    int i, affected_cpu = 0;
+    int64_t now_ms;
+    bool vcpu_total_blocktime = false;
+
+    if (!mis->blocktime_ctx) {
+        return;
+    }
+    dc = mis->blocktime_ctx;
+    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+    /* lookup cpu, to clear it,
+     * this algorithm looks straightforward, but it's not
+     * optimal; a more optimal algorithm would keep a tree or hash
+     * where the key is the address and the value is a list of vCPUs */
+    for (i = 0; i < smp_cpus; i++) {
+        uint64_t vcpu_blocktime = 0;
+        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
+            continue;
+        }
+        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
+        vcpu_blocktime = now_ms -
+            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
+        affected_cpu += 1;
+        /* we need to know whether this mark_postcopy_blocktime_end was
+         * due to a faulted page; another possible case is a prefetched
+         * page, and in that case we shouldn't be here */
+        if (!vcpu_total_blocktime &&
+            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
+            vcpu_total_blocktime = true;
+        }
+        /* continue cycle, due to one page could affect several vCPUs */
+        dc->vcpu_blocktime[i] += vcpu_blocktime;
+    }
+
+    atomic_sub(&dc->smp_cpus_down, affected_cpu);
+    if (vcpu_total_blocktime) {
+        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
+    }
+    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
+}
+
 /*
  * Handle faults detected by the USERFAULT markings
  */
@@ -619,8 +700,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
         rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
         trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
                                                 qemu_ram_get_idstr(rb),
-                                                rb_offset);
+                                                rb_offset,
+                                                msg.arg.pagefault.feat.ptid);
 
+        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
+                         get_mem_fault_cpu_index(msg.arg.pagefault.feat.ptid));
         /*
          * Send the request to the source - we want to request one
          * of our host page sizes (which is >= TPS)
@@ -715,6 +799,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
 
         return -e;
     }
+    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
 
     trace_postcopy_place_page(host);
     return 0;
diff --git a/migration/trace-events b/migration/trace-events
index b8f01a2..9424e3e 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -110,6 +110,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
 process_incoming_migration_co_postcopy_end_main(void) ""
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
 migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
+mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
+mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
 
 # migration/rdma.c
 qemu_rdma_accept_incoming_migration(void) ""
@@ -186,7 +188,7 @@ postcopy_ram_enable_notify(void) ""
 postcopy_ram_fault_thread_entry(void) ""
 postcopy_ram_fault_thread_exit(void) ""
 postcopy_ram_fault_thread_quit(void) ""
-postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
+postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
 postcopy_ram_incoming_cleanup_closeuf(void) ""
 postcopy_ram_incoming_cleanup_entry(void) ""
 postcopy_ram_incoming_cleanup_exit(void) ""
@@ -195,6 +197,7 @@ save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
 ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
 ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
+get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
 
 # migration/exec.c
 migration_exec_outgoing(const char *cmd) "cmd=%s"
-- 
1.9.1


* [Qemu-devel] [PATCH V5 8/9] migration: add postcopy total blocktime into query-migrate
From: Alexey Perevalov @ 2017-05-12 13:31 UTC
  To: qemu-devel; +Cc: dgilbert, a.perevalov, i.maximets, peterx

Postcopy total blocktime is available on the destination side only, but
query-migrate was possible only for the source. This patch adds the
ability to call query-migrate on the destination. To distinguish source
from destination, the state of the MigrationState is used: query-migrate
prepares the MigrationInfo for the source machine only when the migration
state is different from MIGRATION_STATUS_NONE.

To be able to see the postcopy blocktime, the postcopy-blocktime
capability needs to be requested.

The query-migrate command will show the following sample result:
{"return": {
    "postcopy_vcpu_blocktime": [115, 100],
    "status": "completed",
    "postcopy_blocktime": 100
}}

postcopy_vcpu_blocktime contains a list, where the first item corresponds
to the first vCPU in QEMU.
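
Once enabled, the command can be issued on the destination's QMP monitor
like any other query, e.g.:

    { "execute": "query-migrate" }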

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 include/migration/migration.h |  4 +++
 migration/migration.c         | 47 ++++++++++++++++++++++++++--
 migration/postcopy-ram.c      | 73 +++++++++++++++++++++++++++++++++++++++++++
 migration/trace-events        |  1 +
 qapi-schema.json              |  6 +++-
 5 files changed, 127 insertions(+), 4 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 7e69a2d..aba0535 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -135,6 +135,10 @@ struct MigrationIncomingState {
 
 MigrationIncomingState *migration_incoming_get_current(void);
 void migration_incoming_state_destroy(void);
+/*
+ * Functions to work with blocktime context
+ */
+void fill_destination_postcopy_migration_info(MigrationInfo *info);
 
 struct MigrationState
 {
diff --git a/migration/migration.c b/migration/migration.c
index c0443ce..7a4f33f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -666,9 +666,15 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
     }
 }
 
-MigrationInfo *qmp_query_migrate(Error **errp)
+/* TODO improve this assumption */
+static bool is_source_migration(void)
+{
+    MigrationState *ms = migrate_get_current();
+    return ms->state != MIGRATION_STATUS_NONE;
+}
+
+static void fill_source_migration_info(MigrationInfo *info)
 {
-    MigrationInfo *info = g_malloc0(sizeof(*info));
     MigrationState *s = migrate_get_current();
 
     switch (s->state) {
@@ -759,10 +765,45 @@ MigrationInfo *qmp_query_migrate(Error **errp)
         break;
     }
     info->status = s->state;
+}
+
+static void fill_destination_migration_info(MigrationInfo *info)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
 
-    return info;
+    switch (mis->state) {
+    case MIGRATION_STATUS_NONE:
+        break;
+    case MIGRATION_STATUS_SETUP:
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_ACTIVE:
+    case MIGRATION_STATUS_POSTCOPY_ACTIVE:
+    case MIGRATION_STATUS_FAILED:
+    case MIGRATION_STATUS_COLO:
+        info->has_status = true;
+        break;
+    case MIGRATION_STATUS_COMPLETED:
+        info->has_status = true;
+        fill_destination_postcopy_migration_info(info);
+        break;
+    }
+    info->status = mis->state;
 }
 
+MigrationInfo *qmp_query_migrate(Error **errp)
+{
+    MigrationInfo *info = g_malloc0(sizeof(*info));
+
+    if (is_source_migration()) {
+        fill_source_migration_info(info);
+    } else {
+        fill_destination_migration_info(info);
+    }
+
+    return info;
+}
+
 void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
                                   Error **errp)
 {
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index e2660ae..fe047c8 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -129,6 +129,71 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
     return ctx;
 }
 
+static int64List *get_vcpu_blocktime_list(PostcopyBlocktimeContext *ctx)
+{
+    int64List *list = NULL, *entry = NULL;
+    int i;
+
+    for (i = smp_cpus - 1; i >= 0; i--) {
+            entry = g_new0(int64List, 1);
+            entry->value = ctx->vcpu_blocktime[i];
+            entry->next = list;
+            list = entry;
+    }
+
+    return list;
+}
+
+/*
+ * This function just provide calculated blocktime per cpu and trace it.
+ * Total blocktime is calculated in mark_postcopy_blocktime_end.
+ *
+ *
+ * Assume we have 3 CPU
+ *
+ *      S1        E1           S1               E1
+ * -----***********------------xxx***************------------------------> CPU1
+ *
+ *             S2                E2
+ * ------------****************xxx---------------------------------------> CPU2
+ *
+ *                         S3            E3
+ * ------------------------****xxx********-------------------------------> CPU3
+ *
+ * We have sequence S1,S2,E1,S3,S1,E2,E3,E1
+ * S2,E1 - doesn't match the condition, because the sequence S1,S2,E1 doesn't include CPU3
+ * S3,S1,E2 - sequence includes all CPUs, in this case overlap will be S1,E2 -
+ *            it's a part of total blocktime.
+ * S1 - here is last_begin
+ * Legend of the picture is following:
+ *              * - means blocktime per vCPU
+ *              x - means overlapped blocktime (total blocktime)
+ */
+void fill_destination_postcopy_migration_info(MigrationInfo *info)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    if (!mis->blocktime_ctx) {
+        return;
+    }
+
+    info->has_postcopy_blocktime = true;
+    info->postcopy_blocktime = mis->blocktime_ctx->total_blocktime;
+    info->has_postcopy_vcpu_blocktime = true;
+    info->postcopy_vcpu_blocktime = get_vcpu_blocktime_list(mis->blocktime_ctx);
+}
+
+static uint64_t get_postcopy_total_blocktime(void)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    if (!mis->blocktime_ctx) {
+        return 0;
+    }
+
+    return mis->blocktime_ctx->total_blocktime;
+}
+
 /*
  * Check userfault fd features, to request only supported features in
  * future.
@@ -462,6 +527,9 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
     }
 
     postcopy_state_set(POSTCOPY_INCOMING_END);
+    /* here should be blocktime receiving back operation */
+    trace_postcopy_ram_incoming_cleanup_blocktime(
+            get_postcopy_total_blocktime());
     migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0);
 
     if (mis->postcopy_tmp_page) {
@@ -876,6 +944,11 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
 
 #else
 /* No target OS support, stubs just fail */
+void fill_destination_postcopy_migration_info(MigrationInfo *info)
+{
+    error_report("%s: No OS support", __func__);
+}
+
 bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
 {
     error_report("%s: No OS support", __func__);
diff --git a/migration/trace-events b/migration/trace-events
index 9424e3e..bdaca1d 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -193,6 +193,7 @@ postcopy_ram_incoming_cleanup_closeuf(void) ""
 postcopy_ram_incoming_cleanup_entry(void) ""
 postcopy_ram_incoming_cleanup_exit(void) ""
 postcopy_ram_incoming_cleanup_join(void) ""
+postcopy_ram_incoming_cleanup_blocktime(uint64_t total) "total blocktime %" PRIu64
 save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
 ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
diff --git a/qapi-schema.json b/qapi-schema.json
index fde6d63..e11c5f2 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -712,6 +712,8 @@
 #              @status is 'failed'. Clients should not attempt to parse the
 #              error strings. (Since 2.7)
 #
+# @postcopy_vcpu_blocktime: list of the postcopy blocktime per vCPU (Since 2.9)
+#
 # Since: 0.14.0
 ##
 { 'struct': 'MigrationInfo',
@@ -723,7 +725,9 @@
            '*downtime': 'int',
            '*setup-time': 'int',
            '*cpu-throttle-percentage': 'int',
-           '*error-desc': 'str'} }
+           '*error-desc': 'str',
+           '*postcopy_blocktime' : 'int64',
+           '*postcopy_vcpu_blocktime': ['int64']} }
 
 ##
 # @query-migrate:
-- 
1.9.1


* [Qemu-devel] [PATCH V5 9/9] migration: postcopy_blocktime documentation
From: Alexey Perevalov @ 2017-05-12 13:31 UTC
  To: qemu-devel; +Cc: dgilbert, a.perevalov, i.maximets, peterx

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 docs/migration.txt | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/docs/migration.txt b/docs/migration.txt
index 1b940a8..d0f5a6d 100644
--- a/docs/migration.txt
+++ b/docs/migration.txt
@@ -402,6 +402,16 @@ will now cause the transition from precopy to postcopy.
 It can be issued immediately after migration is started or any
 time later on.  Issuing it after the end of a migration is harmless.
 
+Blocktime is a postcopy live migration metric, intended to show how
+long a vCPU was in interruptible sleep due to a page fault.
+This value is calculated on the destination side.
+To enable postcopy blocktime calculation, enter the following command
+on the destination monitor:
+
+migrate_set_capability postcopy-blocktime on
+
+Postcopy blocktime can be retrieved with the query-migrate QMP command.
+
 Note: During the postcopy phase, the bandwidth limits set using
 migrate_set_speed is ignored (to avoid delaying requested pages that
 the destination is waiting for).
-- 
1.9.1


* Re: [Qemu-devel] [PATCH V5 0/9] calculate blocktime for postcopy live migration
From: Eric Blake @ 2017-05-12 20:09 UTC
  To: Alexey Perevalov, qemu-devel; +Cc: i.maximets, dgilbert, peterx


On 05/12/2017 08:31 AM, Alexey Perevalov wrote:
> The rationale for this idea is the following:
> a vCPU can be suspended during postcopy live migration until the faulted
> page has been copied into the kernel. Downtime on the source side is the
> time interval from when the source turns the vCPUs off until the
> destination starts running them. That value is adequate for precopy
> migration, where it really shows how long the vCPUs are down, but not for
> postcopy migration, because individual vCPU threads can still be suspended
> after the vCPUs have been started. That is important for estimating packet
> drop for SDN software.
> 
> This is the 5th version of the patch set; the previous version had a build
> error on mingw. The first version was tagged as RFC, the second had no
> version tag, and the third was tagged V3.
> 
> This patch set doesn't include the improvements suggested by Peter Xu for
> get_mem_fault_cpu_index, but I would prefer to do them: introducing a tree
> for fast CPUState lookup by thread_id, or generalizing the code, since
> there are places like qemu_get_cpu (cpus.c) with a similar lookup.
> 
> (V4 -> V5)
>     - the empty fill_destination_postcopy_migration_info stub was missing
> for non-Linux builds
> 
> (V3 -> V4)

I reviewed a couple of spots related to QMP in v4 before seeing that you
had already posted v5; those comments still apply.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [Qemu-devel] [PATCH V5 3/9] migration: fix hardcoded function name in error report
From: Dr. David Alan Gilbert @ 2017-05-16  9:46 UTC
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, peterx

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>



Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/postcopy-ram.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 4c859b4..0f75700 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -68,7 +68,7 @@ static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
>      api_struct.api = UFFD_API;
>      api_struct.features = 0;
>      if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> -        error_report("postcopy_ram_supported_by_host: UFFDIO_API failed: %s",
> +        error_report("%s: UFFDIO_API failed: %s", __func__,
>                       strerror(errno));
>          return false;
>      }
> -- 
> 1.9.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH V5 4/9] migration: split ufd_version_check onto receive/request features part
From: Dr. David Alan Gilbert @ 2017-05-16 10:32 UTC
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, peterx

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> This modification is necessary for userfault fd features that have to
> be requested from userspace. UFFD_FEATURE_THREAD_ID is one such
> "on demand" feature; it will be introduced in the next patch.
> 
> QEMU needs to use a separate userfault file descriptor, because the
> userfault context has internal state: after the first UFFD_API ioctl
> it changes its state to UFFD_STATE_RUNNING (in case of success), while
> the kernel expects UFFD_STATE_WAIT_API when handling the UFFD_API
> ioctl. So only one UFFD_API ioctl is possible per ufd.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> ---
>  migration/postcopy-ram.c | 82 ++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 73 insertions(+), 9 deletions(-)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 0f75700..c96d5f5 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -60,32 +60,96 @@ struct PostcopyDiscardState {
>  #include <sys/eventfd.h>
>  #include <linux/userfaultfd.h>
>  
> -static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
> +
> +/*
> + * Check userfault fd features, to request only supported features in
> + * future.
> + * __NR_userfaultfd - should be checked before
> + * Return obtained features

That's not quite right;
 * Returns: True on success, sets *features to supported features
            False on failure or if kernel doesn't support ufd

> + */
> +static bool receive_ufd_features(uint64_t *features)
>  {
> -    struct uffdio_api api_struct;
> -    uint64_t ioctl_mask;
> +    struct uffdio_api api_struct = {0};
> +    int ufd;
> +    bool ret = true;
> +
> +    /* if we are here, __NR_userfaultfd should exist */
> +    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
> +    if (ufd == -1) {
> +        return false;
> +    }
>  
> +    /* ask features */
>      api_struct.api = UFFD_API;
>      api_struct.features = 0;
>      if (ioctl(ufd, UFFDIO_API, &api_struct)) {
>          error_report("%s: UFFDIO_API failed: %s", __func__,
>                       strerror(errno));
> +        ret = false;
> +        goto release_ufd;
> +    }
> +
> +    *features = api_struct.features;
> +
> +release_ufd:
> +    close(ufd);
> +    return ret;
> +}

Needs a comment; perhaps something like:
  * Called once on a newly opened ufd, can request specific features.
  * Returns: True on success

> +static bool request_ufd_features(int ufd, uint64_t features)
> +{
> +    struct uffdio_api api_struct = {0};
> +    uint64_t ioctl_mask;
> +
> +    api_struct.api = UFFD_API;
> +    api_struct.features = features;
> +    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> +        error_report("%s failed: UFFDIO_API failed: %s", __func__,
> +                strerror(errno));
>          return false;
>      }
>  
> -    ioctl_mask = (__u64)1 << _UFFDIO_REGISTER |
> -                 (__u64)1 << _UFFDIO_UNREGISTER;
> +    ioctl_mask = 1 << _UFFDIO_REGISTER |
> +                 1 << _UFFDIO_UNREGISTER;
>      if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
>          error_report("Missing userfault features: %" PRIx64,
>                       (uint64_t)(~api_struct.ioctls & ioctl_mask));
>          return false;
>      }
>  
> +    return true;
> +}
> +
> +static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
> +{
> +    uint64_t asked_features = 0;
> +    uint64_t supported_features;
> +
> +    /*
> +     * it's not possible to
> +     * request UFFD_API twice per one fd
> +     */
> +    if (!receive_ufd_features(&supported_features)) {
> +        error_report("%s failed", __func__);
> +        return false;
> +    }
> +
> +    /*
> +     * request features, even if asked_features is 0, because the
> +     * kernel expects UFFD_API before UFFDIO_REGISTER, once per
> +     * userfault file descriptor
> +     */
> +    if (!request_ufd_features(ufd, asked_features)) {
> +        error_report("%s failed: features %" PRIu64, __func__,
> +                asked_features);
> +        return false;
> +    }
> +
>      if (getpagesize() != ram_pagesize_summary()) {
>          bool have_hp = false;
>          /* We've got a huge page */
>  #ifdef UFFD_FEATURE_MISSING_HUGETLBFS
> -        have_hp = api_struct.features & UFFD_FEATURE_MISSING_HUGETLBFS;
> +        have_hp = supported_features & UFFD_FEATURE_MISSING_HUGETLBFS;
>  #endif
>          if (!have_hp) {
>              error_report("Userfault on this host does not support huge pages");
> @@ -136,7 +200,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
>      }
>  
>      /* Version and features check */
> -    if (!ufd_version_check(ufd, mis)) {
> +    if (!ufd_check_and_apply(ufd, mis)) {
>          goto out;
>      }
>  
> @@ -513,7 +577,7 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
>       * Although the host check already tested the API, we need to
>       * do the check again as an ABI handshake on the new fd.
>       */
> -    if (!ufd_version_check(mis->userfault_fd, mis)) {
> +    if (!ufd_check_and_apply(mis->userfault_fd, mis)) {
>          return -1;
>      }
>  
> -- 
> 1.9.1

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH V5 5/9] migration: introduce postcopy-blocktime capability
From: Dr. David Alan Gilbert @ 2017-05-16 10:33 UTC
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, peterx

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> Right now it can be used on the destination side to enable vCPU
> blocktime calculation for postcopy live migration. vCPU blocktime is
> the time from when a vCPU thread was put into interruptible sleep
> until the memory page was copied and the thread woken.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  include/migration/migration.h | 1 +
>  migration/migration.c         | 9 +++++++++
>  qapi-schema.json              | 5 ++++-
>  3 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index ba1a16c..82bbcd8 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -315,6 +315,7 @@ int migrate_compress_level(void);
>  int migrate_compress_threads(void);
>  int migrate_decompress_threads(void);
>  bool migrate_use_events(void);
> +bool migrate_postcopy_blocktime(void);
>  
>  /* Sending on the return path - generic and then for each message type */
>  void migrate_send_rp_message(MigrationIncomingState *mis,
> diff --git a/migration/migration.c b/migration/migration.c
> index 569a7f6..c0443ce 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1371,6 +1371,15 @@ bool migrate_zero_blocks(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_ZERO_BLOCKS];
>  }
>  
> +bool migrate_postcopy_blocktime(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME];
> +}
> +
>  bool migrate_use_compression(void)
>  {
>      MigrationState *s;
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 01b087f..fde6d63 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -894,11 +894,14 @@
>  # @release-ram: if enabled, qemu will free the migrated ram pages on the source
>  #        during postcopy-ram migration. (since 2.9)
>  #
> +# @postcopy-blocktime: Calculate vCPU blocktime for postcopy live migration (since 2.10)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> -           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram'] }
> +           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
> +           'postcopy-blocktime'] }
>  
>  ##
>  # @MigrationCapabilityStatus:
> -- 
> 1.9.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH V5 7/9] migration: calculate vCPU blocktime on dst side
From: Dr. David Alan Gilbert @ 2017-05-16 11:34 UTC
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, peterx

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> This patch provides blocktime calculation per vCPU, as a per-vCPU
> summary and as an overlapped value across all vCPUs.
> 
> This approach was suggested by Peter Xu, as an improvement over the
> previous approach, where QEMU kept a tree keyed by faulted page address
> with a bitmask of CPUs in it. Now QEMU keeps an array with the faulted
> page address as the value and the vCPU index as the index. That helps
> to find the proper vCPU at UFFD_COPY time. It also keeps a list of
> blocktime per vCPU (which can be traced with page_fault_addr).
> 
> Blocktime will not be calculated if the blocktime_ctx field of
> MigrationIncomingState wasn't initialized.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>

I have some multi-threading/ordering worries still.

The fault thread receives faults over the ufd and calls
mark_postcopy_blocktime_begin.  That's fine.

The receiving thread receives pages, calls place page, and
calls mark_postcopy_blocktime_end.  That's also fine.

However, remember that we send pages from the source without
them being requested as background transfers; consider:


    Source           receive-thread          fault-thread

  1  Send A
  2                  Receive A
  3                                            Access A
  4                                            Report on UFD
  5                  Place
  6                                            Read UFD entry


 Placing and reading UFD race - and up till now that's been fine;
so we can read off the ufd an address that's already on its way from
the source, and which we might just be receiving, or that we might
have already placed.

In this code at (6) won't you call mark_postcopy_blocktime_begin
even though it's already been placed at (5)? Then that blocktime
will stay set until the end of the run?

Perhaps that's not a problem; if mark_postcopy_blocktime_end is called
for a different address it won't count the blocktime; and when
mark_postcopy_blocktime_begin is called for a different address it'll
remove the address that was a problem above - so perhaps that's fine?


> ---
>  migration/postcopy-ram.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++-
>  migration/trace-events   |  5 ++-
>  2 files changed, 90 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index a1f1705..e2660ae 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -23,6 +23,7 @@
>  #include "migration/postcopy-ram.h"
>  #include "sysemu/sysemu.h"
>  #include "sysemu/balloon.h"
> +#include <sys/param.h>
>  #include "qemu/error-report.h"
>  #include "trace.h"
>  
> @@ -542,6 +543,86 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
>      return 0;
>  }
>  
> +static int get_mem_fault_cpu_index(uint32_t pid)
> +{
> +    CPUState *cpu_iter;
> +
> +    CPU_FOREACH(cpu_iter) {
> +        if (cpu_iter->thread_id == pid) {
> +            return cpu_iter->cpu_index;
> +        }
> +    }
> +    trace_get_mem_fault_cpu_index(pid);
> +    return -1;
> +}
> +
> +static void mark_postcopy_blocktime_begin(uint64_t addr, int cpu)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +    PostcopyBlocktimeContext *dc;
> +    int64_t now_ms;
> +    if (!mis->blocktime_ctx || cpu < 0) {
> +        return;
> +    }

You might consider:

 PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
 int64_t now_ms;
 if (!dc || cpu < 0) {
     return;
 }

it gets rid of the two reads of mis->blocktime_ctx
(You do something similar in a few places)

> +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +    dc = mis->blocktime_ctx;
> +    if (dc->vcpu_addr[cpu] == 0) {
> +        atomic_inc(&dc->smp_cpus_down);
> +    }
> +
> +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
> +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
> +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
> +
> +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
> +            cpu);
> +}
> +
> +static void mark_postcopy_blocktime_end(uint64_t addr)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +    PostcopyBlocktimeContext *dc;
> +    int i, affected_cpu = 0;
> +    int64_t now_ms;
> +    bool vcpu_total_blocktime = false;
> +
> +    if (!mis->blocktime_ctx) {
> +        return;
> +    }
> +    dc = mis->blocktime_ctx;
> +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
> +    /* lookup cpu, to clear it,
> +     * that algorithm looks straighforward, but it's not
> +     * optimal, more optimal algorithm is keeping tree or hash
> +     * where key is address value is a list of  */
> +    for (i = 0; i < smp_cpus; i++) {
> +        uint64_t vcpu_blocktime = 0;
> +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
> +            continue;
> +        }
> +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
> +        vcpu_blocktime = now_ms -
> +            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
> +        affected_cpu += 1;
> +        /* we need to know is that mark_postcopy_end was due to
> +         * faulted page, another possible case it's prefetched
> +         * page and in that case we shouldn't be here */
> +        if (!vcpu_total_blocktime &&
> +            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
> +            vcpu_total_blocktime = true;
> +        }
> +        /* continue cycle, due to one page could affect several vCPUs */
> +        dc->vcpu_blocktime[i] += vcpu_blocktime;
> +    }
> +
> +    atomic_sub(&dc->smp_cpus_down, affected_cpu);
> +    if (vcpu_total_blocktime) {
> +        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);

This total_blocktime calculation is a little odd; the 'last_begin' is
not necessarily related to the same CPU or same block.

Dave

> +    }
> +    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
> +}
> +
>  /*
>   * Handle faults detected by the USERFAULT markings
>   */
> @@ -619,8 +700,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
>          rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
>          trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
>                                                  qemu_ram_get_idstr(rb),
> -                                                rb_offset);
> +                                                rb_offset,
> +                                                msg.arg.pagefault.feat.ptid);
>  
> +        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
> +                         get_mem_fault_cpu_index(msg.arg.pagefault.feat.ptid));
>          /*
>           * Send the request to the source - we want to request one
>           * of our host page sizes (which is >= TPS)
> @@ -715,6 +799,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>  
>          return -e;
>      }
> +    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
>  
>      trace_postcopy_place_page(host);
>      return 0;
> diff --git a/migration/trace-events b/migration/trace-events
> index b8f01a2..9424e3e 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -110,6 +110,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>  process_incoming_migration_co_postcopy_end_main(void) ""
>  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
>  migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
> +mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
> +mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
>  
>  # migration/rdma.c
>  qemu_rdma_accept_incoming_migration(void) ""
> @@ -186,7 +188,7 @@ postcopy_ram_enable_notify(void) ""
>  postcopy_ram_fault_thread_entry(void) ""
>  postcopy_ram_fault_thread_exit(void) ""
>  postcopy_ram_fault_thread_quit(void) ""
> -postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
> +postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
>  postcopy_ram_incoming_cleanup_closeuf(void) ""
>  postcopy_ram_incoming_cleanup_entry(void) ""
>  postcopy_ram_incoming_cleanup_exit(void) ""
> @@ -195,6 +197,7 @@ save_xbzrle_page_skipping(void) ""
>  save_xbzrle_page_overflow(void) ""
>  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
>  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> +get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
>  
>  # migration/exec.c
>  migration_exec_outgoing(const char *cmd) "cmd=%s"
> -- 
> 1.9.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 7/9] migration: calculate vCPU blocktime on dst side
  2017-05-16 11:34       ` Dr. David Alan Gilbert
@ 2017-05-16 15:19         ` Alexey
  2017-05-18  7:18         ` Alexey
  1 sibling, 0 replies; 28+ messages in thread
From: Alexey @ 2017-05-16 15:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: i.maximets, qemu-devel, peterx

On Tue, May 16, 2017 at 12:34:16PM +0100, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > This patch provides blocktime calculation per vCPU,
> > as a summary and as an overlapped value for all vCPUs.
> > 
> > This approach was suggested by Peter Xu as an improvement of the
> > previous approach, where QEMU kept a tree with the faulted page address
> > and a bitmask of cpus in it. Now QEMU keeps an array with the faulted
> > page address as value and the vCPU as index. It helps to find the
> > proper vCPU at UFFD_COPY time. It also keeps the blocktime per vCPU
> > (which could be traced with page_fault_addr).
> > 
> > Blocktime will not be calculated if the postcopy_blocktime field of
> > MigrationIncomingState wasn't initialized.
> > 
> > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> 
> I have some multi-threading/ordering worries still.
> 
> The fault thread receives faults over the ufd and calls
> mark_postcopy_blocktime_begin.  That's fine.
> 
> The receiving thread receives pages, calls place page, and
> calls mark_postcopy_blocktime_end.  That's also fine.
> 
> However, remember that we send pages from the source without
> them being requested as background transfers; consider:
> 
> 
>     Source           receive-thread          fault-thread
> 
>   1  Send A
>   2                  Receive A
>   3                                            Access A
>   4                                            Report on UFD
>   5                  Place
>   6                                            Read UFD entry
> 
> 
>  Placing and reading UFD race - and up till now that's been fine;
> so we can read off the ufd an address that's already on its way from
> the source, and which we might just be receiving, or that we might
> have already placed.
> 
> In this code at (6) won't you call mark_postcopy_blocktime_begin
> even though it's already been placed at (5)? Then that blocktime
> will stay set until the end of the run?
Could you clarify what "Read UFD entry" means?
"Place" is postcopy_place_page.


> 
> Perhaps that's not a problem; if mark_postcopy_blocktime_end is called
> for a different address it won't count the blocktime; and when
> mark_postcopy_blocktime_begin is called for a different address it'll
> remove the address that was a problem above - so perhaps that's fine?
mark_postcopy_blocktime_begin doesn't clear state, it only sets the
state.

Looks like I can imagine the nature of the race:
the kernel reports a pagefault for a page which is in the middle of the
copying process, so we will never copy it again, and there is a chance
it will stay in vcpu_addr forever.


     Source           receive-thread          fault-thread
 
   4                                            Report on UFD
   5                   Place ioctl(UFFD_COPY)
   5.1                 mark_postcopy_blocktime_end
   4.1                                          mark_postcopy_blocktime_begin

I think that is possible, but the probability is low; it increases for
small page sizes such as 4K, because ioctl(UFFD_COPY) copies memory
much like memcpy does, so the time spent inside the ioctl depends on
the page size.
I am thinking of adding logic so that a late *_blocktime_begin is
ignored when *_blocktime_end has already run for that page, just to
avoid keeping the addr in vcpu_addr forever; a sketch follows below.
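
Something like this minimal sketch is what I have in mind (untested;
last_placed_addr is a hypothetical new field of PostcopyBlocktimeContext
that mark_postcopy_blocktime_end() would set just before clearing
vcpu_addr):

    /* in mark_postcopy_blocktime_end(), after the vcpu_addr scan */
    atomic_xchg__nocheck(&dc->last_placed_addr, addr);

    /* in mark_postcopy_blocktime_begin(), before recording the fault */
    if (atomic_fetch_add(&dc->last_placed_addr, 0) == addr) {
        /* the page was already placed; this fault report is stale */
        return;
    }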

> 
> 
> > ---
> >  migration/postcopy-ram.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  migration/trace-events   |  5 ++-
> >  2 files changed, 90 insertions(+), 2 deletions(-)
> > 
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index a1f1705..e2660ae 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -23,6 +23,7 @@
> >  #include "migration/postcopy-ram.h"
> >  #include "sysemu/sysemu.h"
> >  #include "sysemu/balloon.h"
> > +#include <sys/param.h>
> >  #include "qemu/error-report.h"
> >  #include "trace.h"
> >  
> > @@ -542,6 +543,86 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> >      return 0;
> >  }
> >  
> > +static int get_mem_fault_cpu_index(uint32_t pid)
> > +{
> > +    CPUState *cpu_iter;
> > +
> > +    CPU_FOREACH(cpu_iter) {
> > +        if (cpu_iter->thread_id == pid) {
> > +            return cpu_iter->cpu_index;
> > +        }
> > +    }
> > +    trace_get_mem_fault_cpu_index(pid);
> > +    return -1;
> > +}
> > +
> > +static void mark_postcopy_blocktime_begin(uint64_t addr, int cpu)
> > +{
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +    PostcopyBlocktimeContext *dc;
> > +    int64_t now_ms;
> > +    if (!mis->blocktime_ctx || cpu < 0) {
> > +        return;
> > +    }
> 
> You might consider:
> 
>  PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
>  int64_t now_ms;
>  if (!dc || cpu < 0) {
>      return;
>  }
> 
> it gets rid of the two reads of mis->blocktime_ctx
> (You do something similar in a few places)
> 
> > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > +    dc = mis->blocktime_ctx;
> > +    if (dc->vcpu_addr[cpu] == 0) {
> > +        atomic_inc(&dc->smp_cpus_down);
> > +    }
> > +
> > +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
> > +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
> > +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
> > +
> > +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
> > +            cpu);
> > +}
> > +
> > +static void mark_postcopy_blocktime_end(uint64_t addr)
> > +{
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +    PostcopyBlocktimeContext *dc;
> > +    int i, affected_cpu = 0;
> > +    int64_t now_ms;
> > +    bool vcpu_total_blocktime = false;
> > +
> > +    if (!mis->blocktime_ctx) {
> > +        return;
> > +    }
> > +    dc = mis->blocktime_ctx;
> > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > +
> > +    /* lookup cpu, to clear it,
> > +     * that algorithm looks straighforward, but it's not
> > +     * optimal, more optimal algorithm is keeping tree or hash
> > +     * where key is address value is a list of  */
> > +    for (i = 0; i < smp_cpus; i++) {
> > +        uint64_t vcpu_blocktime = 0;
> > +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
> > +            continue;
> > +        }
> > +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
> > +        vcpu_blocktime = now_ms -
> > +            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
> > +        affected_cpu += 1;
> > +        /* we need to know is that mark_postcopy_end was due to
> > +         * faulted page, another possible case it's prefetched
> > +         * page and in that case we shouldn't be here */
> > +        if (!vcpu_total_blocktime &&
> > +            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
> > +            vcpu_total_blocktime = true;
> > +        }
> > +        /* continue cycle, due to one page could affect several vCPUs */
> > +        dc->vcpu_blocktime[i] += vcpu_blocktime;
> > +    }
> > +
> > +    atomic_sub(&dc->smp_cpus_down, affected_cpu);
> > +    if (vcpu_total_blocktime) {
> > +        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
> 
> This total_blocktime calculation is a little odd; the 'last_begin' is
> not necessarily related to the same CPU or same block.
last_begin need not be related to the same vCPU; the vCPU doesn't
matter in this case, because last_begin is the time when
mark_postcopy_blocktime_begin was last called (the last pagefault).
So if we are 100% sure here that all vCPUs are blocked, the interval
during which they were all blocked starts at last_begin, even if that
last_begin was on another vCPU.
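
For example, with two vCPUs that both fault on the same page: vCPU0
faults at t=100ms, vCPU1 at t=120ms (so last_begin is 120 and all vCPUs
are now down). When the page is placed at t=150ms, vcpu_blocktime[0]
grows by 50ms, vcpu_blocktime[1] by 30ms, and total_blocktime by
150 - 120 = 30ms - the interval during which all vCPUs were blocked
simultaneously.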

> 
> Dave
> 
> > +    }
> > +    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
> > +}
> > +
> >  /*
> >   * Handle faults detected by the USERFAULT markings
> >   */
> > @@ -619,8 +700,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
> >          rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
> >          trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
> >                                                  qemu_ram_get_idstr(rb),
> > -                                                rb_offset);
> > +                                                rb_offset,
> > +                                                msg.arg.pagefault.feat.ptid);
> >  
> > +        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
> > +                         get_mem_fault_cpu_index(msg.arg.pagefault.feat.ptid));
> >          /*
> >           * Send the request to the source - we want to request one
> >           * of our host page sizes (which is >= TPS)
> > @@ -715,6 +799,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> >  
> >          return -e;
> >      }
> > +    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
> >  
> >      trace_postcopy_place_page(host);
> >      return 0;
> > diff --git a/migration/trace-events b/migration/trace-events
> > index b8f01a2..9424e3e 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -110,6 +110,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> >  process_incoming_migration_co_postcopy_end_main(void) ""
> >  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
> >  migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
> > +mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
> > +mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
> >  
> >  # migration/rdma.c
> >  qemu_rdma_accept_incoming_migration(void) ""
> > @@ -186,7 +188,7 @@ postcopy_ram_enable_notify(void) ""
> >  postcopy_ram_fault_thread_entry(void) ""
> >  postcopy_ram_fault_thread_exit(void) ""
> >  postcopy_ram_fault_thread_quit(void) ""
> > -postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
> > +postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
> >  postcopy_ram_incoming_cleanup_closeuf(void) ""
> >  postcopy_ram_incoming_cleanup_entry(void) ""
> >  postcopy_ram_incoming_cleanup_exit(void) ""
> > @@ -195,6 +197,7 @@ save_xbzrle_page_skipping(void) ""
> >  save_xbzrle_page_overflow(void) ""
> >  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> >  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> > +get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
> >  
> >  # migration/exec.c
> >  migration_exec_outgoing(const char *cmd) "cmd=%s"
> > -- 
> > 1.9.1
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

-- 

BR
Alexey

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 4/9] migration: split ufd_version_check onto receive/request features part
  2017-05-16 10:32       ` Dr. David Alan Gilbert
@ 2017-05-18  6:55         ` Alexey
  2017-05-19 18:46           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 28+ messages in thread
From: Alexey @ 2017-05-18  6:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: i.maximets, qemu-devel, peterx

On Tue, May 16, 2017 at 11:32:51AM +0100, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > This modification is necessary for userfault fd features which are
> > required to be requested from userspace.
> > UFFD_FEATURE_THREAD_ID is one of such "on demand" features, which will
> > be introduced in the next patch.
> > 
> > QEMU needs to use a separate userfault file descriptor, because the
> > userfault context has internal state: after the first call of
> > ioctl UFFD_API it changes its state to UFFD_STATE_RUNNING (in case of
> > success), but the kernel, while handling ioctl UFFD_API, expects
> > UFFD_STATE_WAIT_API. So only one ioctl with UFFD_API is possible per
> > ufd.
> > 
> > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > ---
> >  migration/postcopy-ram.c | 82 ++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 73 insertions(+), 9 deletions(-)
> > 
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 0f75700..c96d5f5 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -60,32 +60,96 @@ struct PostcopyDiscardState {
> >  #include <sys/eventfd.h>
> >  #include <linux/userfaultfd.h>
> >  
> > -static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
> > +
> > +/*
> > + * Check userfault fd features, to request only supported features in
> > + * future.
> > + * __NR_userfaultfd - should be checked before
> > + * Return obtained features
> 
> That's not quite right;
>  * Returns: True on success, sets *features to supported features
>             False on failure or if kernel doesn't support ufd
> 
Yes, the obtained features are an out parameter,
but I want to keep the false case uncommented and just add an
error_report into the syscall check, because the possible reasons for
failure are:
1. No userfaultfd syscall, although the function expects that syscall;
this is reflected in the comment.
2. Failure within the syscall: fds exhausted or out of memory (the
kernel allocates a file there).
3. A problem in the ioctl due to the internal state of the UFFD, for
example UFFDIO_API after UFFDIO_REGISTER.

Also I would prefer to follow the migration/ram.c comment style; a
sketch of what I mean is below.
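
For concreteness, in the migration/ram.c style the header could look
roughly like this (wording is only a sketch, not final):

    /**
     * receive_ufd_features: check userfault fd and receive the
     * supported features
     *
     * Opens a temporary userfaultfd, asks the kernel for the feature
     * set it supports, and reports the failure reason via error_report
     * on each error path.
     *
     * Returns: true on success
     *
     * @features: out parameter, filled with the supported features
     */
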
> > + */
> > +static bool receive_ufd_features(uint64_t *features)
> >  {
> > -    struct uffdio_api api_struct;
> > -    uint64_t ioctl_mask;
> > +    struct uffdio_api api_struct = {0};
> > +    int ufd;
> > +    bool ret = true;
> > +
> > +    /* if we are here __NR_userfaultfd should exists */
> > +    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
> > +    if (ufd == -1) {
> > +        return false;
> > +    }
> >  
> > +    /* ask features */
> >      api_struct.api = UFFD_API;
> >      api_struct.features = 0;
> >      if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> > -        error_report("%s: UFFDIO_API failed: %s", __func__
> > +        error_report("%s: UFFDIO_API failed: %s", __func__,
> >                       strerror(errno));
> > +        ret = false;
> > +        goto release_ufd;
> > +    }
> > +
> > +    *features = api_struct.features;
> > +
> > +release_ufd:
> > +    close(ufd);
> > +    return ret;
> > +}
> 
> Needs a comment; perhaps something like:
>   * Called once on a newly opened ufd, can request specific features.
>   * Returns: True on success
> 
> > +static bool request_ufd_features(int ufd, uint64_t features)
> > +{
> > +    struct uffdio_api api_struct = {0};
> > +    uint64_t ioctl_mask;
> > +
> > +    api_struct.api = UFFD_API;
> > +    api_struct.features = features;
> > +    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> > +        error_report("%s failed: UFFDIO_API failed: %s", __func__,
> > +                strerror(errno));
> >          return false;
> >      }
> >  
> > -    ioctl_mask = (__u64)1 << _UFFDIO_REGISTER |
> > -                 (__u64)1 << _UFFDIO_UNREGISTER;
> > +    ioctl_mask = 1 << _UFFDIO_REGISTER |
> > +                 1 << _UFFDIO_UNREGISTER;
> >      if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
> >          error_report("Missing userfault features: %" PRIx64,
> >                       (uint64_t)(~api_struct.ioctls & ioctl_mask));
> >          return false;
> >      }
> >  
> > +    return true;
> > +}
> > +
> > +static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
> > +{
> > +    uint64_t asked_features = 0;
> > +    uint64_t supported_features;
> > +
> > +    /*
> > +     * it's not possible to
> > +     * request UFFD_API twice per one fd
> > +     */
> > +    if (!receive_ufd_features(&supported_features)) {
> > +        error_report("%s failed", __func__);
> > +        return false;
> > +    }
> > +
> > +    /*
> > +     * request features, even if asked_features is 0, due to
> > +     * kernel expects UFFD_API before UFFDIO_REGISTER, per
> > +     * userfault file descriptor
> > +     */
> > +    if (!request_ufd_features(ufd, asked_features)) {
> > +        error_report("%s failed: features %" PRIu64, __func__,
> > +                asked_features);
> > +        return false;
> > +    }
> > +
> >      if (getpagesize() != ram_pagesize_summary()) {
> >          bool have_hp = false;
> >          /* We've got a huge page */
> >  #ifdef UFFD_FEATURE_MISSING_HUGETLBFS
> > -        have_hp = api_struct.features & UFFD_FEATURE_MISSING_HUGETLBFS;
> > +        have_hp = supported_features & UFFD_FEATURE_MISSING_HUGETLBFS;
> >  #endif
> >          if (!have_hp) {
> >              error_report("Userfault on this host does not support huge pages");
> > @@ -136,7 +200,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
> >      }
> >  
> >      /* Version and features check */
> > -    if (!ufd_version_check(ufd, mis)) {
> > +    if (!ufd_check_and_apply(ufd, mis)) {
> >          goto out;
> >      }
> >  
> > @@ -513,7 +577,7 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
> >       * Although the host check already tested the API, we need to
> >       * do the check again as an ABI handshake on the new fd.
> >       */
> > -    if (!ufd_version_check(mis->userfault_fd, mis)) {
> > +    if (!ufd_check_and_apply(mis->userfault_fd, mis)) {
> >          return -1;
> >      }
> >  
> > -- 
> > 1.9.1
> 
> Dave
> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

-- 

BR
Alexey

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 7/9] migration: calculate vCPU blocktime on dst side
  2017-05-16 11:34       ` Dr. David Alan Gilbert
  2017-05-16 15:19         ` Alexey
@ 2017-05-18  7:18         ` Alexey
  2017-05-19 19:05           ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 28+ messages in thread
From: Alexey @ 2017-05-18  7:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: i.maximets, qemu-devel, peterx

On Tue, May 16, 2017 at 12:34:16PM +0100, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > This patch provides blocktime calculation per vCPU,
> > as a summary and as an overlapped value for all vCPUs.
> > 
> > This approach was suggested by Peter Xu as an improvement of the
> > previous approach, where QEMU kept a tree with the faulted page address
> > and a bitmask of cpus in it. Now QEMU keeps an array with the faulted
> > page address as value and the vCPU as index. It helps to find the
> > proper vCPU at UFFD_COPY time. It also keeps the blocktime per vCPU
> > (which could be traced with page_fault_addr).
> > 
> > Blocktime will not be calculated if the postcopy_blocktime field of
> > MigrationIncomingState wasn't initialized.
> > 
> > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> 
> I have some multi-threading/ordering worries still.
> 
> The fault thread receives faults over the ufd and calls
> mark_postcopy_blocktime_begin.  That's fine.
> 
> The receiving thread receives pages, calls place page, and
> calls mark_postcopy_blocktime_end.  That's also fine.
> 
> However, remember that we send pages from the source without
> them being requested as background transfers; consider:
> 
> 
>     Source           receive-thread          fault-thread
> 
>   1  Send A
>   2                  Receive A
>   3                                            Access A
>   4                                            Report on UFD
>   5                  Place
>   6                                            Read UFD entry
> 
> 
>  Placing and reading UFD race - and up till now that's been fine;
> so we can read off the ufd an address that's already on its way from
> the source, and which we might just be receiving, or that we might
> have already placed.
> 
> In this code at (6) won't you call mark_postcopy_blocktime_begin
> even though it's already been placed at (5)? Then that blocktime
> will stay set until the end of the run?
> 
> Perhaps that's not a problem; if mark_postcopy_blocktime_end is called
> for a different address it won't count the blocktime; and when
> mark_postcopy_blocktime_begin is called for a different address it'll
> remove the address that was a problem above - so perhaps that's fine?
It's not 100% fine, but I'm going to clarify my previous answer to that
email where I wrote "forever". That mechanism will think the vCPU is
blocked until the same vCPU blocks again or the page is copied again.
Unfortunately we don't know the vCPU index at *_end time, and I don't
want to extend struct uffdio_copy and add a pid to it. Right now I only
have solutions that are either expensive and robust, or cheap and not
robust, such as keeping a list of the page addresses which faulted (or
just one page address, the latest, on the assumption that the _end,
_begin sequence is quick and no other pages interpose).

BTW with the tree-based solution proposed in the first version it was
possible to look up the node by page address in _end and mark it as
populated.
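
For reference, that V1-style lookup was roughly like the following
(a from-memory sketch using glib, not the exact V1 code; FaultEntry is
a hypothetical per-page record):

    static gint addr_cmp(gconstpointer a, gconstpointer b)
    {
        return (a < b) ? -1 : (a > b) ? 1 : 0;
    }

    GTree *page_fault_tree = g_tree_new(addr_cmp);

    /* in mark_postcopy_blocktime_end(): */
    FaultEntry *e = g_tree_lookup(page_fault_tree,
                                  (gconstpointer)(uintptr_t)addr);
    if (e) {
        e->populated = true;   /* late faults for this page are stale */
    }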



> 
> 
> > ---
> >  migration/postcopy-ram.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  migration/trace-events   |  5 ++-
> >  2 files changed, 90 insertions(+), 2 deletions(-)
> > 
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index a1f1705..e2660ae 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -23,6 +23,7 @@
> >  #include "migration/postcopy-ram.h"
> >  #include "sysemu/sysemu.h"
> >  #include "sysemu/balloon.h"
> > +#include <sys/param.h>
> >  #include "qemu/error-report.h"
> >  #include "trace.h"
> >  
> > @@ -542,6 +543,86 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> >      return 0;
> >  }
> >  
> > +static int get_mem_fault_cpu_index(uint32_t pid)
> > +{
> > +    CPUState *cpu_iter;
> > +
> > +    CPU_FOREACH(cpu_iter) {
> > +        if (cpu_iter->thread_id == pid) {
> > +            return cpu_iter->cpu_index;
> > +        }
> > +    }
> > +    trace_get_mem_fault_cpu_index(pid);
> > +    return -1;
> > +}
> > +
> > +static void mark_postcopy_blocktime_begin(uint64_t addr, int cpu)
> > +{
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +    PostcopyBlocktimeContext *dc;
> > +    int64_t now_ms;
> > +    if (!mis->blocktime_ctx || cpu < 0) {
> > +        return;
> > +    }
> 
> You might consider:
> 
>  PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
>  int64_t now_ms;
>  if (!dc || cpu < 0) {
>      return;
>  }
> 
> it gets rid of the two reads of mis->blocktime_ctx
> (You do something similar in a few places)
> 
> > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > +    dc = mis->blocktime_ctx;
> > +    if (dc->vcpu_addr[cpu] == 0) {
> > +        atomic_inc(&dc->smp_cpus_down);
> > +    }
> > +
> > +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
> > +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
> > +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
> > +
> > +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
> > +            cpu);
> > +}
> > +
> > +static void mark_postcopy_blocktime_end(uint64_t addr)
> > +{
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +    PostcopyBlocktimeContext *dc;
> > +    int i, affected_cpu = 0;
> > +    int64_t now_ms;
> > +    bool vcpu_total_blocktime = false;
> > +
> > +    if (!mis->blocktime_ctx) {
> > +        return;
> > +    }
> > +    dc = mis->blocktime_ctx;
> > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > +
> > +    /* lookup cpu, to clear it,
> > +     * that algorithm looks straighforward, but it's not
> > +     * optimal, more optimal algorithm is keeping tree or hash
> > +     * where key is address value is a list of  */
> > +    for (i = 0; i < smp_cpus; i++) {
> > +        uint64_t vcpu_blocktime = 0;
> > +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
> > +            continue;
> > +        }
> > +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
> > +        vcpu_blocktime = now_ms -
> > +            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
> > +        affected_cpu += 1;
> > +        /* we need to know is that mark_postcopy_end was due to
> > +         * faulted page, another possible case it's prefetched
> > +         * page and in that case we shouldn't be here */
> > +        if (!vcpu_total_blocktime &&
> > +            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
> > +            vcpu_total_blocktime = true;
> > +        }
> > +        /* continue cycle, due to one page could affect several vCPUs */
> > +        dc->vcpu_blocktime[i] += vcpu_blocktime;
> > +    }
> > +
> > +    atomic_sub(&dc->smp_cpus_down, affected_cpu);
> > +    if (vcpu_total_blocktime) {
> > +        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
> 
> This total_blocktime calculation is a little odd; the 'last_begin' is
> not necessarily related to the same CPU or same block.
> 
> Dave
> 
> > +    }
> > +    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
> > +}
> > +
> >  /*
> >   * Handle faults detected by the USERFAULT markings
> >   */
> > @@ -619,8 +700,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
> >          rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
> >          trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
> >                                                  qemu_ram_get_idstr(rb),
> > -                                                rb_offset);
> > +                                                rb_offset,
> > +                                                msg.arg.pagefault.feat.ptid);
> >  
> > +        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
> > +                         get_mem_fault_cpu_index(msg.arg.pagefault.feat.ptid));
> >          /*
> >           * Send the request to the source - we want to request one
> >           * of our host page sizes (which is >= TPS)
> > @@ -715,6 +799,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> >  
> >          return -e;
> >      }
> > +    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
> >  
> >      trace_postcopy_place_page(host);
> >      return 0;
> > diff --git a/migration/trace-events b/migration/trace-events
> > index b8f01a2..9424e3e 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -110,6 +110,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> >  process_incoming_migration_co_postcopy_end_main(void) ""
> >  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
> >  migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
> > +mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
> > +mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
> >  
> >  # migration/rdma.c
> >  qemu_rdma_accept_incoming_migration(void) ""
> > @@ -186,7 +188,7 @@ postcopy_ram_enable_notify(void) ""
> >  postcopy_ram_fault_thread_entry(void) ""
> >  postcopy_ram_fault_thread_exit(void) ""
> >  postcopy_ram_fault_thread_quit(void) ""
> > -postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
> > +postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
> >  postcopy_ram_incoming_cleanup_closeuf(void) ""
> >  postcopy_ram_incoming_cleanup_entry(void) ""
> >  postcopy_ram_incoming_cleanup_exit(void) ""
> > @@ -195,6 +197,7 @@ save_xbzrle_page_skipping(void) ""
> >  save_xbzrle_page_overflow(void) ""
> >  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> >  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> > +get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
> >  
> >  # migration/exec.c
> >  migration_exec_outgoing(const char *cmd) "cmd=%s"
> > -- 
> > 1.9.1
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

-- 

BR
Alexey

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 2/9] migration: pass ptr to MigrationIncomingState into migration ufd_version_check & postcopy_ram_supported_by_host
  2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 2/9] migration: pass ptr to MigrationIncomingState into migration ufd_version_check & postcopy_ram_supported_by_host Alexey Perevalov
@ 2017-05-18 14:09       ` Eric Blake
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Blake @ 2017-05-18 14:09 UTC (permalink / raw)
  To: Alexey Perevalov, qemu-devel; +Cc: i.maximets, dgilbert, peterx

[-- Attachment #1: Type: text/plain, Size: 671 bytes --]

On 05/12/2017 08:31 AM, Alexey Perevalov wrote:

Long subject line. Try to keep things in the subject around 60
characters or less, in part so that 'git shortlog --oneline -30' still
fits in an 80-column screen.  Maybe:

migration: Refactor use of MigrationIncomingState pointer

> That tiny refactoring is necessary to be able to set
> UFFD_FEATURE_THREAD_ID while requesting features, and then
> to create the downtime context in case the kernel supports it.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> ---
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 4/9] migration: split ufd_version_check onto receive/request features part
  2017-05-18  6:55         ` Alexey
@ 2017-05-19 18:46           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 28+ messages in thread
From: Dr. David Alan Gilbert @ 2017-05-19 18:46 UTC (permalink / raw)
  To: Alexey; +Cc: i.maximets, qemu-devel, peterx

* Alexey (a.perevalov@samsung.com) wrote:
> On Tue, May 16, 2017 at 11:32:51AM +0100, Dr. David Alan Gilbert wrote:
> > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > This modification is necessary for userfault fd features which are
> > > required to be requested from userspace.
> > > UFFD_FEATURE_THREAD_ID is one of such "on demand" features, which will
> > > be introduced in the next patch.
> > > 
> > > QEMU needs to use a separate userfault file descriptor, because the
> > > userfault context has internal state: after the first call of
> > > ioctl UFFD_API it changes its state to UFFD_STATE_RUNNING (in case of
> > > success), but the kernel, while handling ioctl UFFD_API, expects
> > > UFFD_STATE_WAIT_API. So only one ioctl with UFFD_API is possible per
> > > ufd.
> > > 
> > > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > > ---
> > >  migration/postcopy-ram.c | 82 ++++++++++++++++++++++++++++++++++++++++++------
> > >  1 file changed, 73 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > > index 0f75700..c96d5f5 100644
> > > --- a/migration/postcopy-ram.c
> > > +++ b/migration/postcopy-ram.c
> > > @@ -60,32 +60,96 @@ struct PostcopyDiscardState {
> > >  #include <sys/eventfd.h>
> > >  #include <linux/userfaultfd.h>
> > >  
> > > -static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
> > > +
> > > +/*
> > > + * Check userfault fd features, to request only supported features in
> > > + * future.
> > > + * __NR_userfaultfd - should be checked before
> > > + * Return obtained features
> > 
> > That's not quite right;
> >  * Returns: True on success, sets *features to supported features
> >             False on failure or if kernel doesn't support ufd
> > 
> Yes, the obtained features are an out parameter,
> but I want to keep the false case uncommented and just add an
> error_report into the syscall check, because the possible reasons for
> failure are:
> 1. No userfaultfd syscall, although the function expects that syscall;
> this is reflected in the comment.
> 2. Failure within the syscall: fds exhausted or out of memory (the
> kernel allocates a file there).
> 3. A problem in the ioctl due to the internal state of the UFFD, for
> example UFFDIO_API after UFFDIO_REGISTER.

I don't think we're allowed to depend on error pointers, but either
way we should comment it to make sure it's clear, so if you have a
boolean return at least say it's true for success and explain features
etc.

> Also I would prefer to follow the migration/ram.c comment style.

Yes, that's fine - it's the content of the comment I was more
worried about (and the one below).

Dave

> > > + */
> > > +static bool receive_ufd_features(uint64_t *features)
> > >  {
> > > -    struct uffdio_api api_struct;
> > > -    uint64_t ioctl_mask;
> > > +    struct uffdio_api api_struct = {0};
> > > +    int ufd;
> > > +    bool ret = true;
> > > +
> > > +    /* if we are here __NR_userfaultfd should exists */
> > > +    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
> > > +    if (ufd == -1) {
> > > +        return false;
> > > +    }
> > >  
> > > +    /* ask features */
> > >      api_struct.api = UFFD_API;
> > >      api_struct.features = 0;
> > >      if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> > > -        error_report("%s: UFFDIO_API failed: %s", __func__
> > > +        error_report("%s: UFFDIO_API failed: %s", __func__,
> > >                       strerror(errno));
> > > +        ret = false;
> > > +        goto release_ufd;
> > > +    }
> > > +
> > > +    *features = api_struct.features;
> > > +
> > > +release_ufd:
> > > +    close(ufd);
> > > +    return ret;
> > > +}
> > 
> > Needs a comment; perhaps something like:
> >   * Called once on a newly opened ufd, can request specific features.
> >   * Returns: True on success
> > 
> > > +static bool request_ufd_features(int ufd, uint64_t features)
> > > +{
> > > +    struct uffdio_api api_struct = {0};
> > > +    uint64_t ioctl_mask;
> > > +
> > > +    api_struct.api = UFFD_API;
> > > +    api_struct.features = features;
> > > +    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> > > +        error_report("%s failed: UFFDIO_API failed: %s", __func__,
> > > +                strerror(errno));
> > >          return false;
> > >      }
> > >  
> > > -    ioctl_mask = (__u64)1 << _UFFDIO_REGISTER |
> > > -                 (__u64)1 << _UFFDIO_UNREGISTER;
> > > +    ioctl_mask = 1 << _UFFDIO_REGISTER |
> > > +                 1 << _UFFDIO_UNREGISTER;
> > >      if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
> > >          error_report("Missing userfault features: %" PRIx64,
> > >                       (uint64_t)(~api_struct.ioctls & ioctl_mask));
> > >          return false;
> > >      }
> > >  
> > > +    return true;
> > > +}
> > > +
> > > +static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
> > > +{
> > > +    uint64_t asked_features = 0;
> > > +    uint64_t supported_features;
> > > +
> > > +    /*
> > > +     * it's not possible to
> > > +     * request UFFD_API twice per one fd
> > > +     */
> > > +    if (!receive_ufd_features(&supported_features)) {
> > > +        error_report("%s failed", __func__);
> > > +        return false;
> > > +    }
> > > +
> > > +    /*
> > > +     * request features, even if asked_features is 0, due to
> > > +     * kernel expects UFFD_API before UFFDIO_REGISTER, per
> > > +     * userfault file descriptor
> > > +     */
> > > +    if (!request_ufd_features(ufd, asked_features)) {
> > > +        error_report("%s failed: features %" PRIu64, __func__,
> > > +                asked_features);
> > > +        return false;
> > > +    }
> > > +
> > >      if (getpagesize() != ram_pagesize_summary()) {
> > >          bool have_hp = false;
> > >          /* We've got a huge page */
> > >  #ifdef UFFD_FEATURE_MISSING_HUGETLBFS
> > > -        have_hp = api_struct.features & UFFD_FEATURE_MISSING_HUGETLBFS;
> > > +        have_hp = supported_features & UFFD_FEATURE_MISSING_HUGETLBFS;
> > >  #endif
> > >          if (!have_hp) {
> > >              error_report("Userfault on this host does not support huge pages");
> > > @@ -136,7 +200,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
> > >      }
> > >  
> > >      /* Version and features check */
> > > -    if (!ufd_version_check(ufd, mis)) {
> > > +    if (!ufd_check_and_apply(ufd, mis)) {
> > >          goto out;
> > >      }
> > >  
> > > @@ -513,7 +577,7 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
> > >       * Although the host check already tested the API, we need to
> > >       * do the check again as an ABI handshake on the new fd.
> > >       */
> > > -    if (!ufd_version_check(mis->userfault_fd, mis)) {
> > > +    if (!ufd_check_and_apply(mis->userfault_fd, mis)) {
> > >          return -1;
> > >      }
> > >  
> > > -- 
> > > 1.9.1
> > 
> > Dave
> > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
> -- 
> 
> BR
> Alexey
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 7/9] migration: calculate vCPU blocktime on dst side
  2017-05-18  7:18         ` Alexey
@ 2017-05-19 19:05           ` Dr. David Alan Gilbert
  2017-05-22  7:43             ` Alexey Perevalov
  0 siblings, 1 reply; 28+ messages in thread
From: Dr. David Alan Gilbert @ 2017-05-19 19:05 UTC (permalink / raw)
  To: Alexey; +Cc: i.maximets, qemu-devel, peterx

* Alexey (a.perevalov@samsung.com) wrote:
> On Tue, May 16, 2017 at 12:34:16PM +0100, Dr. David Alan Gilbert wrote:
> > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > This patch provides blocktime calculation per vCPU,
> > > as a summary and as an overlapped value for all vCPUs.
> > > 
> > > This approach was suggested by Peter Xu as an improvement of the
> > > previous approach, where QEMU kept a tree with the faulted page address
> > > and a bitmask of cpus in it. Now QEMU keeps an array with the faulted
> > > page address as value and the vCPU as index. It helps to find the
> > > proper vCPU at UFFD_COPY time. It also keeps the blocktime per vCPU
> > > (which could be traced with page_fault_addr).
> > > 
> > > Blocktime will not be calculated if the postcopy_blocktime field of
> > > MigrationIncomingState wasn't initialized.
> > > 
> > > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > 
> > I have some multi-threading/ordering worries still.
> > 
> > The fault thread receives faults over the ufd and calls
> > mark_postcopy_blocktime_begin.  That's fine.
> > 
> > The receiving thread receives pages, calls place page, and
> > calls mark_postcopy_blocktime_end.  That's also fine.
> > 
> > However, remember that we send pages from the source without
> > them being requested as background transfers; consider:
> > 
> > 
> >     Source           receive-thread          fault-thread
> > 
> >   1  Send A
> >   2                  Receive A
> >   3                                            Access A
> >   4                                            Report on UFD
> >   5                  Place
> >   6                                            Read UFD entry
> > 
> > 
> >  Placing and reading UFD race - and up till now that's been fine;
> > so we can read off the ufd an address that's already on its way from
> > the source, and which we might just be receiving, or that we might
> > have already placed.
> > 
> > In this code at (6) won't you call mark_postcopy_blocktime_begin
> > even though it's already been placed at (5)? Then that blocktime
> > will stay set until the end of the run?
> > 
> > Perhaps that's not a problem; if mark_postcopy_blocktime_end is called
> > for a different address it won't count the blocktime; and when
> > mark_postcopy_blocktime_begin is called for a different address it'll
> > remove the address that was a problem above - so perhaps that's fine?
> It's not 100% fine, but I'm going to clarify my previous answer to that
> email where I wrote "forever". That mechanism will think the vCPU is
> blocked until the same vCPU blocks again or the page is copied again.
> Unfortunately we don't know the vCPU index at *_end time, and I don't
> want to extend struct uffdio_copy and add a pid to it.

You couldn't anyway, one uffdio_copy might wake up multiple PIDs.

> Right now I only
> have solutions that are either expensive and robust, or cheap and not
> robust, such as keeping a list of the page addresses which faulted (or
> just one page address, the latest, on the assumption that the _end,
> _begin sequence is quick and no other pages interpose).
> 
> BTW with the tree-based solution proposed in the first version it was
> possible to look up the node by page address in _end and mark it as
> populated.

Yes, sorry, I hadn't realised at the time that this solution wasn't
robust.
Would this be fixed by a 'received' pages bitmap? i.e. a bitmap with one
bit per page (fixed 0.003% RAM overhead - tiny) that gets set by
mark_postcopy_blocktime_end (called before the 'place' operation)
and checked in mark_postcopy_blocktime_begin?
That would be interesting because that bitmap is potentially needed by
other projects (recovery from network failure in particular).
However, I'm not sure it really helps - you'd have to get the
ordering just right, and I'm not sure it's possible.
My thoughts are something like:

blocktime_end:
   set bitmap entry for 'arrived'
   read CPU stall address, if non-0 then zero it and update stats

blocktime_start:
   set CPU stall address
   check bitmap entry
     if set then zero stall-address

is that safe?
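
In C terms, that ordering might look roughly like the sketch below
(untested; received_bitmap, page_index() and set_bit_atomic() are
hypothetical helpers, the rest follows the names in the patch):

    /* blocktime_end, receive thread, called before the place op */
    set_bit_atomic(page_index(addr), received_bitmap);  /* 'arrived' */
    smp_mb();                          /* order bitmap vs. vcpu_addr */
    /* then the existing per-vCPU scan that zeroes matching vcpu_addr
     * entries and updates the stats runs as before */

    /* blocktime_begin, fault thread */
    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);  /* stall addr  */
    smp_mb();                          /* order vcpu_addr vs. bitmap */
    if (test_bit(page_index(addr), received_bitmap)) {
        /* the page raced with us and is already placed */
        atomic_xchg__nocheck(&dc->vcpu_addr[cpu], 0);
    }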

Dave

> 
> 
> > 
> > 
> > > ---
> > >  migration/postcopy-ram.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++-
> > >  migration/trace-events   |  5 ++-
> > >  2 files changed, 90 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > > index a1f1705..e2660ae 100644
> > > --- a/migration/postcopy-ram.c
> > > +++ b/migration/postcopy-ram.c
> > > @@ -23,6 +23,7 @@
> > >  #include "migration/postcopy-ram.h"
> > >  #include "sysemu/sysemu.h"
> > >  #include "sysemu/balloon.h"
> > > +#include <sys/param.h>
> > >  #include "qemu/error-report.h"
> > >  #include "trace.h"
> > >  
> > > @@ -542,6 +543,86 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> > >      return 0;
> > >  }
> > >  
> > > +static int get_mem_fault_cpu_index(uint32_t pid)
> > > +{
> > > +    CPUState *cpu_iter;
> > > +
> > > +    CPU_FOREACH(cpu_iter) {
> > > +        if (cpu_iter->thread_id == pid) {
> > > +            return cpu_iter->cpu_index;
> > > +        }
> > > +    }
> > > +    trace_get_mem_fault_cpu_index(pid);
> > > +    return -1;
> > > +}
> > > +
> > > +static void mark_postcopy_blocktime_begin(uint64_t addr, int cpu)
> > > +{
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > +    PostcopyBlocktimeContext *dc;
> > > +    int64_t now_ms;
> > > +    if (!mis->blocktime_ctx || cpu < 0) {
> > > +        return;
> > > +    }
> > 
> > You might consider:
> > 
> >  PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
> >  int64_t now_ms;
> >  if (!dc || cpu < 0) {
> >      return;
> >  }
> > 
> > it gets rid of the two reads of mis->blocktime_ctx
> > (You do something similar in a few places)
> > 
> > > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > > +    dc = mis->blocktime_ctx;
> > > +    if (dc->vcpu_addr[cpu] == 0) {
> > > +        atomic_inc(&dc->smp_cpus_down);
> > > +    }
> > > +
> > > +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
> > > +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
> > > +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
> > > +
> > > +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
> > > +            cpu);
> > > +}
> > > +
> > > +static void mark_postcopy_blocktime_end(uint64_t addr)
> > > +{
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > +    PostcopyBlocktimeContext *dc;
> > > +    int i, affected_cpu = 0;
> > > +    int64_t now_ms;
> > > +    bool vcpu_total_blocktime = false;
> > > +
> > > +    if (!mis->blocktime_ctx) {
> > > +        return;
> > > +    }
> > > +    dc = mis->blocktime_ctx;
> > > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > > +
> > > +    /* lookup cpu, to clear it,
> > > +     * that algorithm looks straighforward, but it's not
> > > +     * optimal, more optimal algorithm is keeping tree or hash
> > > +     * where key is address value is a list of  */
> > > +    for (i = 0; i < smp_cpus; i++) {
> > > +        uint64_t vcpu_blocktime = 0;
> > > +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
> > > +            continue;
> > > +        }
> > > +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
> > > +        vcpu_blocktime = now_ms -
> > > +            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
> > > +        affected_cpu += 1;
> > > +        /* we need to know is that mark_postcopy_end was due to
> > > +         * faulted page, another possible case it's prefetched
> > > +         * page and in that case we shouldn't be here */
> > > +        if (!vcpu_total_blocktime &&
> > > +            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
> > > +            vcpu_total_blocktime = true;
> > > +        }
> > > +        /* continue cycle, due to one page could affect several vCPUs */
> > > +        dc->vcpu_blocktime[i] += vcpu_blocktime;
> > > +    }
> > > +
> > > +    atomic_sub(&dc->smp_cpus_down, affected_cpu);
> > > +    if (vcpu_total_blocktime) {
> > > +        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
> > 
> > This total_blocktime calculation is a little odd; the 'last_begin' is
> > not necessarily related to the same CPU or same block.
> > 
> > Dave
> > 
> > > +    }
> > > +    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
> > > +}
> > > +
> > >  /*
> > >   * Handle faults detected by the USERFAULT markings
> > >   */
> > > @@ -619,8 +700,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
> > >          rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
> > >          trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
> > >                                                  qemu_ram_get_idstr(rb),
> > > -                                                rb_offset);
> > > +                                                rb_offset,
> > > +                                                msg.arg.pagefault.feat.ptid);
> > >  
> > > +        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
> > > +                         get_mem_fault_cpu_index(msg.arg.pagefault.feat.ptid));
> > >          /*
> > >           * Send the request to the source - we want to request one
> > >           * of our host page sizes (which is >= TPS)
> > > @@ -715,6 +799,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> > >  
> > >          return -e;
> > >      }
> > > +    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
> > >  
> > >      trace_postcopy_place_page(host);
> > >      return 0;
> > > diff --git a/migration/trace-events b/migration/trace-events
> > > index b8f01a2..9424e3e 100644
> > > --- a/migration/trace-events
> > > +++ b/migration/trace-events
> > > @@ -110,6 +110,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> > >  process_incoming_migration_co_postcopy_end_main(void) ""
> > >  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
> > >  migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
> > > +mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
> > > +mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
> > >  
> > >  # migration/rdma.c
> > >  qemu_rdma_accept_incoming_migration(void) ""
> > > @@ -186,7 +188,7 @@ postcopy_ram_enable_notify(void) ""
> > >  postcopy_ram_fault_thread_entry(void) ""
> > >  postcopy_ram_fault_thread_exit(void) ""
> > >  postcopy_ram_fault_thread_quit(void) ""
> > > -postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
> > > +postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
> > >  postcopy_ram_incoming_cleanup_closeuf(void) ""
> > >  postcopy_ram_incoming_cleanup_entry(void) ""
> > >  postcopy_ram_incoming_cleanup_exit(void) ""
> > > @@ -195,6 +197,7 @@ save_xbzrle_page_skipping(void) ""
> > >  save_xbzrle_page_overflow(void) ""
> > >  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> > >  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> > > +get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
> > >  
> > >  # migration/exec.c
> > >  migration_exec_outgoing(const char *cmd) "cmd=%s"
> > > -- 
> > > 1.9.1
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
> -- 
> 
> BR
> Alexey
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 8/9] migration: add postcopy total blocktime into query-migrate
  2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 8/9] migration: add postcopy total blocktime into query-migrate Alexey Perevalov
@ 2017-05-19 19:23       ` Dr. David Alan Gilbert
  2017-05-22 16:15         ` Eric Blake
  2017-05-22 16:14       ` Eric Blake
  1 sibling, 1 reply; 28+ messages in thread
From: Dr. David Alan Gilbert @ 2017-05-19 19:23 UTC (permalink / raw)
  To: Alexey Perevalov, eblake; +Cc: qemu-devel, i.maximets, peterx

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> Postcopy total blocktime is available on the destination side only,
> but query-migrate was possible only for the source. This patch
> adds the ability to call query-migrate on the destination. To distinguish
> src/dst, the state of the MigrationState is used: query-migrate prepares
> MigrationInfo for the source machine only when the migration state is
> different from MIGRATION_STATUS_NONE.
> 
> To be able to see postcopy blocktime, the postcopy-blocktime
> capability needs to be requested.
> 
> The query-migrate command will show the following sample result:
> {"return": {
>     "postcopy_vcpu_blocktime": [115, 100],
>     "status": "completed",
>     "postcopy_blocktime": 100
> }}
> 
> postcopy_vcpu_blocktime contains a list, where the first item
> corresponds to the first vCPU in QEMU.
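> 
> For example, on the destination (an illustrative QMP session):
> 
>     {"execute": "migrate-set-capabilities",
>      "arguments": {"capabilities": [
>          {"capability": "postcopy-blocktime", "state": true}]}}
>     {"execute": "query-migrate"}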

Let's just check that Eric is happy with the qapi side.
Please also update hmp.c:hmp_info_migrate.
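Something along these lines, perhaps (just a sketch, untested):

    if (info->has_postcopy_blocktime) {
        monitor_printf(mon, "postcopy blocktime: %" PRId64 "\n",
                       info->postcopy_blocktime);
    }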

A few comments below.

> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> ---
>  include/migration/migration.h |  4 +++
>  migration/migration.c         | 47 ++++++++++++++++++++++++++--
>  migration/postcopy-ram.c      | 73 +++++++++++++++++++++++++++++++++++++++++++
>  migration/trace-events        |  1 +
>  qapi-schema.json              |  6 +++-
>  5 files changed, 127 insertions(+), 4 deletions(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 7e69a2d..aba0535 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -135,6 +135,10 @@ struct MigrationIncomingState {
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
>  void migration_incoming_state_destroy(void);
> +/*
> + * Functions to work with blocktime context
> + */
> +void fill_destination_postcopy_migration_info(MigrationInfo *info);
>  
>  struct MigrationState
>  {
> diff --git a/migration/migration.c b/migration/migration.c
> index c0443ce..7a4f33f 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -666,9 +666,15 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
>      }
>  }
>  
> -MigrationInfo *qmp_query_migrate(Error **errp)
> +/* TODO improve this assumption */
> +static bool is_source_migration(void)
> +{
> +    MigrationState *ms = migrate_get_current();
> +    return ms->state != MIGRATION_STATUS_NONE;
> +}
> +
> +static void fill_source_migration_info(MigrationInfo *info)
>  {
> -    MigrationInfo *info = g_malloc0(sizeof(*info));
>      MigrationState *s = migrate_get_current();
>  
>      switch (s->state) {
> @@ -759,10 +765,45 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>          break;
>      }
>      info->status = s->state;
> +}
> +
> +static void fill_destination_migration_info(MigrationInfo *info)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
>  
> -    return info;
> +    switch (mis->state) {
> +    case MIGRATION_STATUS_NONE:
> +        break;
> +    case MIGRATION_STATUS_SETUP:
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +    case MIGRATION_STATUS_ACTIVE:
> +    case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> +    case MIGRATION_STATUS_FAILED:
> +    case MIGRATION_STATUS_COLO:
> +        info->has_status = true;
> +        break;
> +    case MIGRATION_STATUS_COMPLETED:
> +        info->has_status = true;
> +        fill_destination_postcopy_migration_info(info);
> +        break;
> +    }
> +    info->status = mis->state;
>  }
>  
> +MigrationInfo *qmp_query_migrate(Error **errp)
> +{
> +    MigrationInfo *info = g_malloc0(sizeof(*info));
> +
> +    if (is_source_migration()) {
> +        fill_source_migration_info(info);
> +    } else {
> +        fill_destination_migration_info(info);
> +    }

A VM that was migrated in can then later get migrated out;
so I think you need to give both sets of data.
That probably means you need a second status field,
since existing tooling might get confused if it's watching
an outbound migration after an inbound one.

Dave
 
> +
> +    return info;
> +}
> +
>  void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
>                                    Error **errp)
>  {
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index e2660ae..fe047c8 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -129,6 +129,71 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
>      return ctx;
>  }
>  
> +static int64List *get_vcpu_blocktime_list(PostcopyBlocktimeContext *ctx)
> +{
> +    int64List *list = NULL, *entry = NULL;
> +    int i;
> +
> +    for (i = smp_cpus - 1; i >= 0; i--) {
> +            entry = g_new0(int64List, 1);
> +            entry->value = ctx->vcpu_blocktime[i];
> +            entry->next = list;
> +            list = entry;
> +    }
> +
> +    return list;
> +}
> +
> +/*
> + * This function just provides the calculated blocktime per vCPU and traces it.
> + * Total blocktime is calculated in mark_postcopy_blocktime_end.
> + *
> + *
> + * Assume we have 3 CPUs
> + *
> + *      S1        E1           S1               E1
> + * -----***********------------xxx***************------------------------> CPU1
> + *
> + *             S2                E2
> + * ------------****************xxx---------------------------------------> CPU2
> + *
> + *                         S3            E3
> + * ------------------------****xxx********-------------------------------> CPU3
> + *
> + * We have the sequence S1,S2,E1,S3,S1,E2,E3,E1
> + * S2,E1 - doesn't match the condition, because the sequence S1,S2,E1 doesn't include CPU3
> + * S3,S1,E2 - this sequence includes all CPUs, so the overlap will be S1,E2 -
> + *            it's a part of the total blocktime.
> + * S1 - here is last_begin
> + * The legend of the picture is as follows:
> + *              * - means blocktime per vCPU
> + *              x - means overlapped blocktime (total blocktime)
> + */
> +void fill_destination_postcopy_migration_info(MigrationInfo *info)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +
> +    if (!mis->blocktime_ctx) {
> +        return;
> +    }
> +
> +    info->has_postcopy_blocktime = true;
> +    info->postcopy_blocktime = mis->blocktime_ctx->total_blocktime;
> +    info->has_postcopy_vcpu_blocktime = true;
> +    info->postcopy_vcpu_blocktime = get_vcpu_blocktime_list(mis->blocktime_ctx);
> +}
> +
> +static uint64_t get_postcopy_total_blocktime(void)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +
> +    if (!mis->blocktime_ctx) {
> +        return 0;
> +    }
> +
> +    return mis->blocktime_ctx->total_blocktime;
> +}
> +
>  /*
>   * Check userfault fd features, to request only supported features in
>   * future.
> @@ -462,6 +527,9 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>      }
>  
>      postcopy_state_set(POSTCOPY_INCOMING_END);
> +    /* the blocktime receive-back operation should go here */
> +    trace_postcopy_ram_incoming_cleanup_blocktime(
> +            get_postcopy_total_blocktime());
>      migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0);
>  
>      if (mis->postcopy_tmp_page) {
> @@ -876,6 +944,11 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
>  
>  #else
>  /* No target OS support, stubs just fail */
> +void fill_destination_postcopy_migration_info(MigrationInfo *info)
> +{
> +    error_report("%s: No OS support", __func__);
> +}
> +
>  bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
>  {
>      error_report("%s: No OS support", __func__);
> diff --git a/migration/trace-events b/migration/trace-events
> index 9424e3e..bdaca1d 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -193,6 +193,7 @@ postcopy_ram_incoming_cleanup_closeuf(void) ""
>  postcopy_ram_incoming_cleanup_entry(void) ""
>  postcopy_ram_incoming_cleanup_exit(void) ""
>  postcopy_ram_incoming_cleanup_join(void) ""
> +postcopy_ram_incoming_cleanup_blocktime(uint64_t total) "total blocktime %" PRIu64
>  save_xbzrle_page_skipping(void) ""
>  save_xbzrle_page_overflow(void) ""
>  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> diff --git a/qapi-schema.json b/qapi-schema.json
> index fde6d63..e11c5f2 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -712,6 +712,8 @@
>  #              @status is 'failed'. Clients should not attempt to parse the
>  #              error strings. (Since 2.7)
>  #
> +# @postcopy_vcpu_blocktime: list of the postcopy blocktime per vCPU (Since 2.9)
> +#
>  # Since: 0.14.0
>  ##
>  { 'struct': 'MigrationInfo',
> @@ -723,7 +725,9 @@
>             '*downtime': 'int',
>             '*setup-time': 'int',
>             '*cpu-throttle-percentage': 'int',
> -           '*error-desc': 'str'} }
> +           '*error-desc': 'str',
> +           '*postcopy_blocktime' : 'int64',
> +           '*postcopy_vcpu_blocktime': ['int64']} }
>  
>  ##
>  # @query-migrate:
> -- 
> 1.9.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 7/9] migration: calculate vCPU blocktime on dst side
  2017-05-19 19:05           ` Dr. David Alan Gilbert
@ 2017-05-22  7:43             ` Alexey Perevalov
  0 siblings, 0 replies; 28+ messages in thread
From: Alexey Perevalov @ 2017-05-22  7:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: i.maximets, qemu-devel, peterx

On 05/19/2017 10:05 PM, Dr. David Alan Gilbert wrote:
> * Alexey (a.perevalov@samsung.com) wrote:
>> On Tue, May 16, 2017 at 12:34:16PM +0100, Dr. David Alan Gilbert wrote:
>>> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
>>>> This patch provides blocktime calculation per vCPU,
>>>> as a summary and as an overlapped value for all vCPUs.
>>>>
>>>> This approach was suggested by Peter Xu as an improvement of the
>>>> previous approach, where QEMU kept a tree with the faulted page address
>>>> and a cpu bitmask in it. Now QEMU keeps an array with the faulted page
>>>> address as the value and the vCPU as the index. It helps to find the
>>>> proper vCPU at UFFD_COPY time. It also keeps a list of blocktime per
>>>> vCPU (which can be traced with page_fault_addr).
>>>>
>>>> Blocktime will not be calculated if the postcopy_blocktime field of
>>>> MigrationIncomingState wasn't initialized.
>>>>
>>>> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
>>> I have some multi-threading/ordering worries still.
>>>
>>> The fault thread receives faults over the ufd and calls
>>> mark_postcopy_blocktime_begin.  That's fine.
>>>
>>> The receiving thread receives pages, calls place page, and
>>> calls mark_postcopy_blocktime_end.  That's also fine.
>>>
>>> However, remember that we send pages from the source without
>>> them being requested as background transfers; consider:
>>>
>>>
>>>      Source           receive-thread          fault-thread
>>>
>>>    1  Send A
>>>    2                  Receive A
>>>    3                                            Access A
>>>    4                                            Report on UFD
>>>    5                  Place
>>>    6                                            Read UFD entry
>>>
>>>
>>>   Placing and reading the UFD race with each other - and up till now
>>> that's been fine; we can read off the ufd an address that's already on
>>> its way from the source, one which we might just be receiving, or that
>>> we might have already placed.
>>>
>>> In this code, at (6) won't you call mark_postcopy_blocktime_begin
>>> even though the page has already been placed at (5)? Then that blocktime
>>> will stay set until the end of the run?
>>>
>>> Perhaps that's not a problem; if mark_postcopy_blocktime_end is called
>>> for a different address it won't count the blocktime; and when
>>> mark_postcopy_blocktime_begin is called for a different address it'll
>>> remove the address that was a problem above - so perhaps that's fine?
>> It's not 100% fine, but I'm going to clarify my previous answer to that
>> email where I wrote "forever". That mechanism will think the vCPU is
>> blocked until the same vCPU blocks again or the page is copied again.
>> Unfortunately we don't know the vCPU index at *_end time, and I don't
>> want to extend struct uffdio_copy and add a pid into it.
> You couldn't anyway, one uffdio_copy might wake up multiple PIDs.
>
>> But right now I only have solutions that are either expensive and
>> robust, or cheap and not robust, like keeping a list of the page
>> addresses which were faulted (or just one page address, the latest,
>> assuming the _end/_start sequence is quick and no other pages
>> interpose, but that's an assumption).
>>
>> BTW, with the tree-based solution proposed in the first version, it was
>> possible to look up the node by page address in _end and mark it as populated.
> Yes, sorry, I hadn't realised at the time that this solution wasn't
> robust.
> Would this be fixed by a 'received' pages bitmap? i.e. a bitmap with one
> bit per page (a fixed 0.003% RAM overhead - tiny) that gets set by
> mark_postcopy_blocktime_end (called before the 'place' operation)
> and checked in mark_postcopy_blocktime_begin?
> That would be interesting because that bitmap is potentially needed by
> other projects (recovery from network failure in particular).
> However, I'm not sure it really helps - you'd have to get the
> ordering just right, and I'm not sure it's possible.
> My thoughts are something like:
>
> blocktime_end:
>     set bitmap entry for 'arrived'
>     read CPU stall address, if non-0 then zero it and update stats
>
> blocktime_start:
>     set CPU stall address
>     check bitmap entry
>       if set then zero stall-address
>
> is that safe?
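> 
> In C that ordering might look something like this (only a sketch;
> 'receivedmap' and 'page_idx' are made-up names here, reusing the atomic
> helpers the patch already uses):
> 
>     /* receive thread (blocktime_end), before placing the page */
>     set_bit_atomic(page_idx, receivedmap);
>     for (i = 0; i < smp_cpus; i++) {
>         if (atomic_fetch_add(&dc->vcpu_addr[i], 0) == addr) {
>             atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
>             /* ... account blocktime for vCPU i ... */
>         }
>     }
> 
>     /* fault thread (blocktime_begin) */
>     atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
>     atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
>     if (test_bit(page_idx, receivedmap)) {
>         /* the page already arrived: clear the stall address again */
>         atomic_xchg__nocheck(&dc->vcpu_addr[cpu], 0);
>     }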
Looks like yes, it's safe. It's a nice data structure, if we create the
bitmap based on the ramblock's offset rather than the host's virtual
address, because anonymous memory isn't contiguous.
So in the worst case, e.g. with a 4Kb page size, a 2Mb bitmap will be
required to cover an 8Gb address space. By my calculation
that's 0.024% RAM overhead. In the case of a 1G hugepage, it's just 8
entries for 8Gb. But we need to keep such a bitmap per RAMBlock;
as far as I know only /objects/mem can be mapped to hugetlbfs, while the
others involved in migration, such as vga.vram, are from anonymous memory.
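
Roughly (a sketch only; the 'receivedmap' field on RAMBlock is
hypothetical here):

    /* one bitmap per RAMBlock, indexed in units of the block's page size */
    rb->receivedmap = bitmap_new(rb->max_length / qemu_ram_pagesize(rb));

    /* host address -> bit index within the block */
    size_t idx = ((uint8_t *)host - rb->host) / qemu_ram_pagesize(rb);
    set_bit_atomic(idx, rb->receivedmap);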

>
> Dave
>
>>
>>>
>>>> ---
>>>>   migration/postcopy-ram.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++-
>>>>   migration/trace-events   |  5 ++-
>>>>   2 files changed, 90 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>>>> index a1f1705..e2660ae 100644
>>>> --- a/migration/postcopy-ram.c
>>>> +++ b/migration/postcopy-ram.c
>>>> @@ -23,6 +23,7 @@
>>>>   #include "migration/postcopy-ram.h"
>>>>   #include "sysemu/sysemu.h"
>>>>   #include "sysemu/balloon.h"
>>>> +#include <sys/param.h>
>>>>   #include "qemu/error-report.h"
>>>>   #include "trace.h"
>>>>   
>>>> @@ -542,6 +543,86 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
>>>>       return 0;
>>>>   }
>>>>   
>>>> +static int get_mem_fault_cpu_index(uint32_t pid)
>>>> +{
>>>> +    CPUState *cpu_iter;
>>>> +
>>>> +    CPU_FOREACH(cpu_iter) {
>>>> +        if (cpu_iter->thread_id == pid) {
>>>> +            return cpu_iter->cpu_index;
>>>> +        }
>>>> +    }
>>>> +    trace_get_mem_fault_cpu_index(pid);
>>>> +    return -1;
>>>> +}
>>>> +
>>>> +static void mark_postcopy_blocktime_begin(uint64_t addr, int cpu)
>>>> +{
>>>> +    MigrationIncomingState *mis = migration_incoming_get_current();
>>>> +    PostcopyBlocktimeContext *dc;
>>>> +    int64_t now_ms;
>>>> +    if (!mis->blocktime_ctx || cpu < 0) {
>>>> +        return;
>>>> +    }
>>> You might consider:
>>>
>>>   PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
>>>   int64_t now_ms;
>>>   if (!dc || cpu < 0) {
>>>       return;
>>>   }
>>>
>>> it gets rid of the two reads of mis->blocktime_ctx
>>> (You do something similar in a few places)
>>>
>>>> +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>>> +    dc = mis->blocktime_ctx;
>>>> +    if (dc->vcpu_addr[cpu] == 0) {
>>>> +        atomic_inc(&dc->smp_cpus_down);
>>>> +    }
>>>> +
>>>> +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
>>>> +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
>>>> +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
>>>> +
>>>> +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
>>>> +            cpu);
>>>> +}
>>>> +
>>>> +static void mark_postcopy_blocktime_end(uint64_t addr)
>>>> +{
>>>> +    MigrationIncomingState *mis = migration_incoming_get_current();
>>>> +    PostcopyBlocktimeContext *dc;
>>>> +    int i, affected_cpu = 0;
>>>> +    int64_t now_ms;
>>>> +    bool vcpu_total_blocktime = false;
>>>> +
>>>> +    if (!mis->blocktime_ctx) {
>>>> +        return;
>>>> +    }
>>>> +    dc = mis->blocktime_ctx;
>>>> +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>>> +
>>>> +    /* look up the cpu, to clear it;
>>>> +     * this algorithm looks straightforward, but it's not
>>>> +     * optimal; a more optimal algorithm would keep a tree or hash
>>>> +     * where the key is an address and the value is a list of  */
>>>> +    for (i = 0; i < smp_cpus; i++) {
>>>> +        uint64_t vcpu_blocktime = 0;
>>>> +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
>>>> +            continue;
>>>> +        }
>>>> +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
>>>> +        vcpu_blocktime = now_ms -
>>>> +            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
>>>> +        affected_cpu += 1;
>>>> +        /* we need to know whether mark_postcopy_end was due to a
>>>> +         * faulted page; another possible case is a prefetched
>>>> +         * page, and in that case we shouldn't be here */
>>>> +        if (!vcpu_total_blocktime &&
>>>> +            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
>>>> +            vcpu_total_blocktime = true;
>>>> +        }
>>>> +        /* continue cycle, due to one page could affect several vCPUs */
>>>> +        dc->vcpu_blocktime[i] += vcpu_blocktime;
>>>> +    }
>>>> +
>>>> +    atomic_sub(&dc->smp_cpus_down, affected_cpu);
>>>> +    if (vcpu_total_blocktime) {
>>>> +        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
>>> This total_blocktime calculation is a little odd; the 'last_begin' is
>>> not necessarily related to the same CPU or same block.
>>>
>>> Dave
>>>
>>>> +    }
>>>> +    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
>>>> +}
>>>> +
>>>>   /*
>>>>    * Handle faults detected by the USERFAULT markings
>>>>    */
>>>> @@ -619,8 +700,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
>>>>           rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
>>>>           trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
>>>>                                                   qemu_ram_get_idstr(rb),
>>>> -                                                rb_offset);
>>>> +                                                rb_offset,
>>>> +                                                msg.arg.pagefault.feat.ptid);
>>>>   
>>>> +        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
>>>> +                         get_mem_fault_cpu_index(msg.arg.pagefault.feat.ptid));
>>>>           /*
>>>>            * Send the request to the source - we want to request one
>>>>            * of our host page sizes (which is >= TPS)
>>>> @@ -715,6 +799,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>>>>   
>>>>           return -e;
>>>>       }
>>>> +    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
>>>>   
>>>>       trace_postcopy_place_page(host);
>>>>       return 0;
>>>> diff --git a/migration/trace-events b/migration/trace-events
>>>> index b8f01a2..9424e3e 100644
>>>> --- a/migration/trace-events
>>>> +++ b/migration/trace-events
>>>> @@ -110,6 +110,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>>>>   process_incoming_migration_co_postcopy_end_main(void) ""
>>>>   migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
>>>>   migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
>>>> +mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
>>>> +mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
>>>>   
>>>>   # migration/rdma.c
>>>>   qemu_rdma_accept_incoming_migration(void) ""
>>>> @@ -186,7 +188,7 @@ postcopy_ram_enable_notify(void) ""
>>>>   postcopy_ram_fault_thread_entry(void) ""
>>>>   postcopy_ram_fault_thread_exit(void) ""
>>>>   postcopy_ram_fault_thread_quit(void) ""
>>>> -postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
>>>> +postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
>>>>   postcopy_ram_incoming_cleanup_closeuf(void) ""
>>>>   postcopy_ram_incoming_cleanup_entry(void) ""
>>>>   postcopy_ram_incoming_cleanup_exit(void) ""
>>>> @@ -195,6 +197,7 @@ save_xbzrle_page_skipping(void) ""
>>>>   save_xbzrle_page_overflow(void) ""
>>>>   ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
>>>>   ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
>>>> +get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
>>>>   
>>>>   # migration/exec.c
>>>>   migration_exec_outgoing(const char *cmd) "cmd=%s"
>>>> -- 
>>>> 1.9.1
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
>> -- 
>>
>> BR
>> Alexey
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>
>

-- 
Best regards,
Alexey Perevalov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 8/9] migration: add postcopy total blocktime into query-migrate
  2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 8/9] migration: add postcopy total blocktime into query-migrate Alexey Perevalov
  2017-05-19 19:23       ` Dr. David Alan Gilbert
@ 2017-05-22 16:14       ` Eric Blake
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Blake @ 2017-05-22 16:14 UTC (permalink / raw)
  To: Alexey Perevalov, qemu-devel; +Cc: i.maximets, dgilbert, peterx


On 05/12/2017 08:31 AM, Alexey Perevalov wrote:
> Postcopy total blocktime is available on the destination side only,
> but query-migrate was possible only for the source. This patch
> adds the ability to call query-migrate on the destination. To distinguish
> src/dst, the state of the MigrationState is used: query-migrate prepares
> MigrationInfo for the source machine only when the migration state is
> different from MIGRATION_STATUS_NONE.
> 
> To be able to see postcopy blocktime, the postcopy-blocktime
> capability needs to be requested.
> 
> The query-migrate command will show the following sample result:
> {"return": {
>     "postcopy_vcpu_blocktime": [115, 100],
>     "status": "completed",
>     "postcopy_blocktime": 100
> }}
> 
> postcopy_vcpu_blocktime contains a list, where the first item
> corresponds to the first vCPU in QEMU.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> ---

> +++ b/qapi-schema.json
> @@ -712,6 +712,8 @@
>  #              @status is 'failed'. Clients should not attempt to parse the
>  #              error strings. (Since 2.7)
>  #
> +# @postcopy_vcpu_blocktime: list of the postcopy blocktime per vCPU (Since 2.9)

You've missed 2.9; this should be 2.10.

> +#
>  # Since: 0.14.0
>  ##
>  { 'struct': 'MigrationInfo',
> @@ -723,7 +725,9 @@
>             '*downtime': 'int',
>             '*setup-time': 'int',
>             '*cpu-throttle-percentage': 'int',
> -           '*error-desc': 'str'} }
> +           '*error-desc': 'str',
> +           '*postcopy_blocktime' : 'int64',
> +           '*postcopy_vcpu_blocktime': ['int64']} }

You're adding two fields, but only documented one of them
(postcopy_blocktime needs mention).

New fields should favor names with '-', not '_'; especially when part of
a struct that is already using '-' names.  So these should be
'postcopy-blocktime' and 'postcopy-vcpu-blocktime'.
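
That is, something like this (sketch):

           '*error-desc': 'str',
           '*postcopy-blocktime': 'int64',
           '*postcopy-vcpu-blocktime': ['int64']} }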

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 8/9] migration: add postcopy total blocktime into query-migrate
  2017-05-19 19:23       ` Dr. David Alan Gilbert
@ 2017-05-22 16:15         ` Eric Blake
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Blake @ 2017-05-22 16:15 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Alexey Perevalov; +Cc: qemu-devel, i.maximets, peterx


On 05/19/2017 02:23 PM, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
>> Postcopy total blocktime is available on the destination side only,
>> but query-migrate was possible only for the source. This patch
>> adds the ability to call query-migrate on the destination. To distinguish
>> src/dst, the state of the MigrationState is used: query-migrate prepares
>> MigrationInfo for the source machine only when the migration state is
>> different from MIGRATION_STATUS_NONE.
>>

> 
> Let's just check that Eric is happy with the qapi side.

I pointed out a couple of things that need to be fixed.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 5/9] migration: introduce postcopy-blocktime capability
  2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 5/9] migration: introduce postcopy-blocktime capability Alexey Perevalov
  2017-05-16 10:33       ` Dr. David Alan Gilbert
@ 2017-05-22 16:20       ` Eric Blake
  2017-05-22 16:42         ` Alexey
  2017-05-30 11:26         ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 28+ messages in thread
From: Eric Blake @ 2017-05-22 16:20 UTC (permalink / raw)
  To: Alexey Perevalov, qemu-devel; +Cc: i.maximets, dgilbert, peterx


On 05/12/2017 08:31 AM, Alexey Perevalov wrote:
> Right now it can be used on the destination side to
> enable vCPU blocktime calculation for postcopy live migration.
> vCPU blocktime is the time from when the vCPU thread was put into
> interruptible sleep until the memory page was copied and the thread woke.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> ---
>  include/migration/migration.h | 1 +
>  migration/migration.c         | 9 +++++++++
>  qapi-schema.json              | 5 ++++-
>  3 files changed, 14 insertions(+), 1 deletion(-)
> 

> +++ b/qapi-schema.json
> @@ -894,11 +894,14 @@
>  # @release-ram: if enabled, qemu will free the migrated ram pages on the source
>  #        during postcopy-ram migration. (since 2.9)
>  #
> +# @postcopy-blocktime: Calculate downtime for postcopy live migration (since 2.10)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> -           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram'] }
> +           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
> +           'postcopy-blocktime'] }

Why does this need to be a capability that we have to turn on, and not
something that is collected unconditionally? Is there a drawback to
having the stat collection always enabled without a capability?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 5/9] migration: introduce postcopy-blocktime capability
  2017-05-22 16:20       ` Eric Blake
@ 2017-05-22 16:42         ` Alexey
  2017-05-30 11:26         ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 28+ messages in thread
From: Alexey @ 2017-05-22 16:42 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, i.maximets, dgilbert, peterx

On Mon, May 22, 2017 at 11:20:13AM -0500, Eric Blake wrote:
> On 05/12/2017 08:31 AM, Alexey Perevalov wrote:
> > Right now it can be used on the destination side to
> > enable vCPU blocktime calculation for postcopy live migration.
> > vCPU blocktime is the time from when the vCPU thread was put into
> > interruptible sleep until the memory page was copied and the thread woke.
> > 
> > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > ---
> >  include/migration/migration.h | 1 +
> >  migration/migration.c         | 9 +++++++++
> >  qapi-schema.json              | 5 ++++-
> >  3 files changed, 14 insertions(+), 1 deletion(-)
> > 
> 
> > +++ b/qapi-schema.json
> > @@ -894,11 +894,14 @@
> >  # @release-ram: if enabled, qemu will free the migrated ram pages on the source
> >  #        during postcopy-ram migration. (since 2.9)
> >  #
> > +# @postcopy-blocktime: Calculate downtime for postcopy live migration (since 2.10)
> > +#
> >  # Since: 1.2
> >  ##
> >  { 'enum': 'MigrationCapability',
> >    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> > -           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram'] }
> > +           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
> > +           'postcopy-blocktime'] }
> 
> Why does this need to be a capability that we have to turn on, and not
> something that is collected unconditionally? Is there a drawback to
> having the stat collection always enabled without a capability?
Yes, it has a performance penalty
(runtime complexity O(n) + O(m), where n is the number of vCPUs and m is
the number of memory pages), but it's not so huge compared to network
latencies. There is also a memory cost, but no more than 0.03% of QEMU's
total memory.

> 
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
> 



-- 

BR
Alexey

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH V5 5/9] migration: introduce postcopy-blocktime capability
  2017-05-22 16:20       ` Eric Blake
  2017-05-22 16:42         ` Alexey
@ 2017-05-30 11:26         ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 28+ messages in thread
From: Dr. David Alan Gilbert @ 2017-05-30 11:26 UTC (permalink / raw)
  To: Eric Blake; +Cc: Alexey Perevalov, qemu-devel, i.maximets, peterx

* Eric Blake (eblake@redhat.com) wrote:
> On 05/12/2017 08:31 AM, Alexey Perevalov wrote:
> > Right now it can be used on the destination side to
> > enable vCPU blocktime calculation for postcopy live migration.
> > vCPU blocktime is the time from when the vCPU thread was put into
> > interruptible sleep until the memory page was copied and the thread woke.
> > 
> > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > ---
> >  include/migration/migration.h | 1 +
> >  migration/migration.c         | 9 +++++++++
> >  qapi-schema.json              | 5 ++++-
> >  3 files changed, 14 insertions(+), 1 deletion(-)
> > 
> 
> > +++ b/qapi-schema.json
> > @@ -894,11 +894,14 @@
> >  # @release-ram: if enabled, qemu will free the migrated ram pages on the source
> >  #        during postcopy-ram migration. (since 2.9)
> >  #
> > +# @postcopy-blocktime: Calculate downtime for postcopy live migration (since 2.10)
> > +#
> >  # Since: 1.2
> >  ##
> >  { 'enum': 'MigrationCapability',
> >    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> > -           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram'] }
> > +           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
> > +           'postcopy-blocktime'] }
> 
> Why does this need to be a capability that we have to turn on, and not
> something that is collected unconditionally? Is there a drawback to
> having the stat collection always enabled without a capability?

Yes, there was a reasonable CPU/memory overhead.
(Although it might be lower now).

Dave

> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
> 



--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2017-05-30 11:27 UTC | newest]

Thread overview: 28+ messages
     [not found] <CGME20170512133144eucas1p23502fd953ee73bda5b2afb25e65604f9@eucas1p2.samsung.com>
2017-05-12 13:31 ` [Qemu-devel] [PATCH V5 0/9] calculate blocktime for postcopy live migration Alexey Perevalov
     [not found]   ` <CGME20170512133144eucas1p288cde3bd6faefb16cbb0d3790885783d@eucas1p2.samsung.com>
2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 1/9] userfault: add pid into uffd_msg & update UFFD_FEATURE_* Alexey Perevalov
     [not found]   ` <CGME20170512133144eucas1p25b4275feb4126a21415242c5085382fd@eucas1p2.samsung.com>
2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 2/9] migration: pass ptr to MigrationIncomingState into migration ufd_version_check & postcopy_ram_supported_by_host Alexey Perevalov
2017-05-18 14:09       ` Eric Blake
     [not found]   ` <CGME20170512133145eucas1p2cad4c4efe46e6f1b757d97dd9d301dbe@eucas1p2.samsung.com>
2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 3/9] migration: fix hardcoded function name in error report Alexey Perevalov
2017-05-16  9:46       ` Dr. David Alan Gilbert
     [not found]   ` <CGME20170512133146eucas1p17df48bb6b5fcefe3717e18cd9afd84b7@eucas1p1.samsung.com>
2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 4/9] migration: split ufd_version_check onto receive/request features part Alexey Perevalov
2017-05-16 10:32       ` Dr. David Alan Gilbert
2017-05-18  6:55         ` Alexey
2017-05-19 18:46           ` Dr. David Alan Gilbert
     [not found]   ` <CGME20170512133146eucas1p2ba4841cabf508b66410fae6784952eaa@eucas1p2.samsung.com>
2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 5/9] migration: introduce postcopy-blocktime capability Alexey Perevalov
2017-05-16 10:33       ` Dr. David Alan Gilbert
2017-05-22 16:20       ` Eric Blake
2017-05-22 16:42         ` Alexey
2017-05-30 11:26         ` Dr. David Alan Gilbert
     [not found]   ` <CGME20170512133147eucas1p1aca0281fc864bf6f3beb610e7ce2695b@eucas1p1.samsung.com>
2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 6/9] migration: add postcopy vcpu blocktime context into MigrationIncomingState Alexey Perevalov
     [not found]   ` <CGME20170512133147eucas1p1eaa21aac3a0b9d45be0ef8ea903b6824@eucas1p1.samsung.com>
2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 7/9] migration: calculate vCPU blocktime on dst side Alexey Perevalov
2017-05-16 11:34       ` Dr. David Alan Gilbert
2017-05-16 15:19         ` Alexey
2017-05-18  7:18         ` Alexey
2017-05-19 19:05           ` Dr. David Alan Gilbert
2017-05-22  7:43             ` Alexey Perevalov
     [not found]   ` <CGME20170512133148eucas1p2c04111d415b1fbd6fb702cfc2a3ed6f9@eucas1p2.samsung.com>
2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 8/9] migration: add postcopy total blocktime into query-migrate Alexey Perevalov
2017-05-19 19:23       ` Dr. David Alan Gilbert
2017-05-22 16:15         ` Eric Blake
2017-05-22 16:14       ` Eric Blake
     [not found]   ` <CGME20170512133149eucas1p2b4c448fe763975cf11cf96801857d42e@eucas1p2.samsung.com>
2017-05-12 13:31     ` [Qemu-devel] [PATCH V5 9/9] migration: postcopy_blocktime documentation Alexey Perevalov
2017-05-12 20:09   ` [Qemu-devel] [PATCH V5 0/9] calculate blocktime for postcopy live migration Eric Blake
