* [Qemu-devel] [PATCH V6 00/10] calculate blocktime for postcopy live migration
       [not found] <CGME20170523113120eucas1p2032ace2121aa8627067b6d7f03fbf482@eucas1p2.samsung.com>
@ 2017-05-23 11:31 ` Alexey Perevalov
       [not found]   ` <CGME20170523113126eucas1p163c64fe50bd44026fdf4d36716bfc4f2@eucas1p1.samsung.com>
                     ` (9 more replies)
  0 siblings, 10 replies; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-23 11:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: i.maximets, peterx, eblake, dgilbert, Alexey Perevalov

Hello, this is the 6th version of the series.

The previous version of this series is at
https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg03117.html

(V5 -> V6)
    - blocktime was added to the HMP command. Comment from David.
    - a bitmap for copied pages was added, as well as checks in the
*_begin/*_end functions; the patch uses the just-introduced
RAMBLOCK_FOREACH. Comment from David.
    - added descriptions of receive_ufd_features/request_ufd_features. Comment from David.
    - commit message headers/@since references were modified. Comment from Eric.
    - fixed typos in the documentation. Comment from Eric.
    - style and description of the field in MigrationInfo. Comment from Eric.
    - ufd_check_and_apply (formerly ufd_version_check) is called twice,
so my previous patch allocated the blocktime context twice and, as a
result, leaked memory. That is fixed in this patch series.

The patch set is based on commit 56821559f0ba682fe6b367815572e6f974d329ab
"Merge remote-tracking branch 'dgilbert/tags/pull-hmp-20170517' into staging"

Alexey Perevalov (10):
  userfault: add pid into uffd_msg & update UFFD_FEATURE_*
  migration: pass MigrationIncomingState* into migration check functions
  migration: fix hardcoded function name in error report
  migration: split ufd_version_check onto receive/request features part
  migration: introduce postcopy-blocktime capability
  migration: add postcopy blocktime ctx into MigrationIncomingState
  migration: add bitmap for copied page
  migration: calculate vCPU blocktime on dst side
  migration: add postcopy total blocktime into query-migrate
  migration: postcopy_blocktime documentation

 docs/migration.txt                |  10 +
 hmp.c                             |  15 ++
 include/migration/migration.h     |  29 +++
 linux-headers/linux/userfaultfd.h |   5 +
 migration/migration.c             |  51 +++++-
 migration/postcopy-ram.c          | 375 ++++++++++++++++++++++++++++++++++++--
 migration/postcopy-ram.h          |   2 +-
 migration/ram.c                   |  63 +++++++
 migration/savevm.c                |   2 +-
 migration/trace-events            |   6 +-
 qapi-schema.json                  |  14 +-
 11 files changed, 551 insertions(+), 21 deletions(-)

-- 
1.8.3.1


* [Qemu-devel] [PATCH V6 01/10] userfault: add pid into uffd_msg & update UFFD_FEATURE_*
       [not found]   ` <CGME20170523113126eucas1p163c64fe50bd44026fdf4d36716bfc4f2@eucas1p1.samsung.com>
@ 2017-05-23 11:31     ` Alexey Perevalov
  0 siblings, 0 replies; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-23 11:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: i.maximets, peterx, eblake, dgilbert, Alexey Perevalov

This commit duplicates the header changes of the Linux kernel patch
"userfaultfd: provide pid in userfault msg".

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 linux-headers/linux/userfaultfd.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/linux-headers/linux/userfaultfd.h b/linux-headers/linux/userfaultfd.h
index 2ed5dc3..e7c8898 100644
--- a/linux-headers/linux/userfaultfd.h
+++ b/linux-headers/linux/userfaultfd.h
@@ -77,6 +77,9 @@ struct uffd_msg {
 		struct {
 			__u64	flags;
 			__u64	address;
+			union {
+				__u32   ptid;
+			} feat;
 		} pagefault;
 
 		struct {
@@ -158,6 +161,8 @@ struct uffdio_api {
 #define UFFD_FEATURE_EVENT_MADVDONTNEED		(1<<3)
 #define UFFD_FEATURE_MISSING_HUGETLBFS		(1<<4)
 #define UFFD_FEATURE_MISSING_SHMEM		(1<<5)
+#define UFFD_FEATURE_EVENT_UNMAP		(1<<6)
+#define UFFD_FEATURE_THREAD_ID			(1<<7)
 	__u64 features;
 
 	__u64 ioctls;
-- 
1.8.3.1


* [Qemu-devel] [PATCH V6 02/10] migration: pass MigrationIncomingState* into migration check functions
       [not found]   ` <CGME20170523113127eucas1p22dba0fddcc9bcf70e554bf659272f947@eucas1p2.samsung.com>
@ 2017-05-23 11:31     ` Alexey Perevalov
  2017-05-31 17:54       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-23 11:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: i.maximets, peterx, eblake, dgilbert, Alexey Perevalov

This tiny refactoring is necessary to be able to set
UFFD_FEATURE_THREAD_ID while requesting features, and then
to create the downtime context when the kernel supports it.

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 migration/migration.c    |  2 +-
 migration/postcopy-ram.c | 10 +++++-----
 migration/postcopy-ram.h |  2 +-
 migration/savevm.c       |  2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 0304c01..d735976 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -789,7 +789,7 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
          * special support.
          */
         if (!old_postcopy_cap && runstate_check(RUN_STATE_INMIGRATE) &&
-            !postcopy_ram_supported_by_host()) {
+            !postcopy_ram_supported_by_host(NULL)) {
             /* postcopy_ram_supported_by_host will have emitted a more
              * detailed message
              */
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index a0489f6..4adab36 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -59,7 +59,7 @@ struct PostcopyDiscardState {
 #include <sys/eventfd.h>
 #include <linux/userfaultfd.h>
 
-static bool ufd_version_check(int ufd)
+static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
 {
     struct uffdio_api api_struct;
     uint64_t ioctl_mask;
@@ -112,7 +112,7 @@ static int test_range_shared(const char *block_name, void *host_addr,
  * normally fine since if the postcopy succeeds it gets turned back on at the
  * end.
  */
-bool postcopy_ram_supported_by_host(void)
+bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
 {
     long pagesize = getpagesize();
     int ufd = -1;
@@ -135,7 +135,7 @@ bool postcopy_ram_supported_by_host(void)
     }
 
     /* Version and features check */
-    if (!ufd_version_check(ufd)) {
+    if (!ufd_version_check(ufd, mis)) {
         goto out;
     }
 
@@ -512,7 +512,7 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
      * Although the host check already tested the API, we need to
      * do the check again as an ABI handshake on the new fd.
      */
-    if (!ufd_version_check(mis->userfault_fd)) {
+    if (!ufd_version_check(mis->userfault_fd, mis)) {
         return -1;
     }
 
@@ -650,7 +650,7 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
 
 #else
 /* No target OS support, stubs just fail */
-bool postcopy_ram_supported_by_host(void)
+bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
 {
     error_report("%s: No OS support", __func__);
     return false;
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 52d51e8..587a8b8 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -14,7 +14,7 @@
 #define QEMU_POSTCOPY_RAM_H
 
 /* Return true if the host supports everything we need to do postcopy-ram */
-bool postcopy_ram_supported_by_host(void);
+bool postcopy_ram_supported_by_host(MigrationIncomingState *mis);
 
 /*
  * Make all of RAM sensitive to accesses to areas that haven't yet been written
diff --git a/migration/savevm.c b/migration/savevm.c
index f5e8194..61a6df7 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1356,7 +1356,7 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
         return -1;
     }
 
-    if (!postcopy_ram_supported_by_host()) {
+    if (!postcopy_ram_supported_by_host(mis)) {
         postcopy_state_set(POSTCOPY_INCOMING_NONE);
         return -1;
     }
-- 
1.8.3.1


* [Qemu-devel] [PATCH V6 03/10] migration: fix hardcoded function name in error report
       [not found]   ` <CGME20170523113127eucas1p1b6cebc0fc51a056b8c1a983d375f1012@eucas1p1.samsung.com>
@ 2017-05-23 11:31     ` Alexey Perevalov
  0 siblings, 0 replies; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-23 11:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: i.maximets, peterx, eblake, dgilbert, Alexey Perevalov

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 migration/postcopy-ram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 4adab36..3ed78bf 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -67,7 +67,7 @@ static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
     api_struct.api = UFFD_API;
     api_struct.features = 0;
     if (ioctl(ufd, UFFDIO_API, &api_struct)) {
-        error_report("postcopy_ram_supported_by_host: UFFDIO_API failed: %s",
+        error_report("%s: UFFDIO_API failed: %s", __func__,
                      strerror(errno));
         return false;
     }
-- 
1.8.3.1


* [Qemu-devel] [PATCH V6 04/10] migration: split ufd_version_check onto receive/request features part
       [not found]   ` <CGME20170523113128eucas1p17a89f8cb47d5731c50f94c3218ba155f@eucas1p1.samsung.com>
@ 2017-05-23 11:31     ` Alexey Perevalov
  2017-05-24  2:36       ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-23 11:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: i.maximets, peterx, eblake, dgilbert, Alexey Perevalov

This modification is necessary for userfault fd features which are
required to be requested from userspace.
UFFD_FEATURE_THREAD_ID is one such "on demand" feature; it will
be introduced in the next patch.

QEMU has to use a separate userfault file descriptor, because the
userfault context has internal state: after the first UFFD_API ioctl it
changes its state to UFFD_STATE_RUNNING (in case of success), but while
handling the UFFD_API ioctl the kernel expects UFFD_STATE_WAIT_API.
So only one UFFD_API ioctl is possible per ufd.

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 migration/postcopy-ram.c | 100 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 91 insertions(+), 9 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 3ed78bf..4f3f495 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -59,32 +59,114 @@ struct PostcopyDiscardState {
 #include <sys/eventfd.h>
 #include <linux/userfaultfd.h>
 
-static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
+
+/**
+ * receive_ufd_features: check userfault fd features, to request only supported
+ * features in the future.
+ *
+ * Returns: true on success
+ *
+ * __NR_userfaultfd - should be checked before
+ *  @features: out parameter will contain uffdio_api.features provided by kernel
+ *              in case of success
+ */
+static bool receive_ufd_features(uint64_t *features)
 {
-    struct uffdio_api api_struct;
-    uint64_t ioctl_mask;
+    struct uffdio_api api_struct = {0};
+    int ufd;
+    bool ret = true;
+
+    /* if we are here, __NR_userfaultfd should exist */
+    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    if (ufd == -1) {
+        error_report("%s: syscall __NR_userfaultfd failed: %s", __func__,
+                     strerror(errno));
+        return false;
+    }
 
+    /* ask features */
     api_struct.api = UFFD_API;
     api_struct.features = 0;
     if (ioctl(ufd, UFFDIO_API, &api_struct)) {
-        error_report("%s: UFFDIO_API failed: %s", __func__
+        error_report("%s: UFFDIO_API failed: %s", __func__,
                      strerror(errno));
+        ret = false;
+        goto release_ufd;
+    }
+
+    *features = api_struct.features;
+
+release_ufd:
+    close(ufd);
+    return ret;
+}
+
+/**
+ * request_ufd_features: this function should be called only once on a newly
+ * opened ufd, subsequent calls will lead to error.
+ *
+ * Returns: true on success
+ *
+ * @ufd: fd obtained from userfaultfd syscall
+ * @features: bit mask see UFFD_API_FEATURES
+ */
+static bool request_ufd_features(int ufd, uint64_t features)
+{
+    struct uffdio_api api_struct = {0};
+    uint64_t ioctl_mask;
+
+    api_struct.api = UFFD_API;
+    api_struct.features = features;
+    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
+        error_report("%s failed: UFFDIO_API failed: %s", __func__,
+                strerror(errno));
         return false;
     }
 
-    ioctl_mask = (__u64)1 << _UFFDIO_REGISTER |
-                 (__u64)1 << _UFFDIO_UNREGISTER;
+    ioctl_mask = 1 << _UFFDIO_REGISTER |
+                 1 << _UFFDIO_UNREGISTER;
     if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
         error_report("Missing userfault features: %" PRIx64,
                      (uint64_t)(~api_struct.ioctls & ioctl_mask));
         return false;
     }
 
+    return true;
+}
+
+static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
+{
+    uint64_t asked_features = 0;
+    static uint64_t supported_features;
+
+    /*
+     * it is not possible to request UFFD_API twice per fd;
+     * userfault fd features are persistent, so query them
+     * only once and cache the result
+     */
+    if (!supported_features) {
+        if (!receive_ufd_features(&supported_features)) {
+            error_report("%s failed", __func__);
+            return false;
+        }
+    }
+
+    /*
+     * request features even if asked_features is 0, because the
+     * kernel expects UFFD_API before UFFDIO_REGISTER on each
+     * userfault file descriptor
+     */
+    if (!request_ufd_features(ufd, asked_features)) {
+        error_report("%s failed: features %" PRIu64, __func__,
+                asked_features);
+        return false;
+    }
+
     if (getpagesize() != ram_pagesize_summary()) {
         bool have_hp = false;
         /* We've got a huge page */
 #ifdef UFFD_FEATURE_MISSING_HUGETLBFS
-        have_hp = api_struct.features & UFFD_FEATURE_MISSING_HUGETLBFS;
+        have_hp = supported_features & UFFD_FEATURE_MISSING_HUGETLBFS;
 #endif
         if (!have_hp) {
             error_report("Userfault on this host does not support huge pages");
@@ -135,7 +217,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
     }
 
     /* Version and features check */
-    if (!ufd_version_check(ufd, mis)) {
+    if (!ufd_check_and_apply(ufd, mis)) {
         goto out;
     }
 
@@ -512,7 +594,7 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
      * Although the host check already tested the API, we need to
      * do the check again as an ABI handshake on the new fd.
      */
-    if (!ufd_version_check(mis->userfault_fd, mis)) {
+    if (!ufd_check_and_apply(mis->userfault_fd, mis)) {
         return -1;
     }
 
-- 
1.8.3.1


* [Qemu-devel] [PATCH V6 05/10] migration: introduce postcopy-blocktime capability
       [not found]   ` <CGME20170523113129eucas1p2146e1018e660eed0b319cbe22adc2712@eucas1p2.samsung.com>
@ 2017-05-23 11:31     ` Alexey Perevalov
  0 siblings, 0 replies; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-23 11:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: i.maximets, peterx, eblake, dgilbert, Alexey Perevalov

Right now it can be used on the destination side to
enable vCPU blocktime calculation for postcopy live migration.
vCPU blocktime is the time from when a vCPU thread is put into
interruptible sleep until the memory page is copied and the thread wakes.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 include/migration/migration.h | 1 +
 migration/migration.c         | 9 +++++++++
 qapi-schema.json              | 5 ++++-
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 49ec501..2951253 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -270,6 +270,7 @@ int migrate_compress_level(void);
 int migrate_compress_threads(void);
 int migrate_decompress_threads(void);
 bool migrate_use_events(void);
+bool migrate_postcopy_blocktime(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_message(MigrationIncomingState *mis,
diff --git a/migration/migration.c b/migration/migration.c
index d735976..e10284e 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1341,6 +1341,15 @@ bool migrate_zero_blocks(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_ZERO_BLOCKS];
 }
 
+bool migrate_postcopy_blocktime(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME];
+}
+
 bool migrate_use_compression(void)
 {
     MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index 80603cf..78617fe 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -894,11 +894,14 @@
 # @release-ram: if enabled, qemu will free the migrated ram pages on the source
 #        during postcopy-ram migration. (since 2.9)
 #
+# @postcopy-blocktime: Calculate downtime for postcopy live migration (since 2.10)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
   'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
-           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram'] }
+           'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
+           'postcopy-blocktime'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
1.8.3.1


* [Qemu-devel] [PATCH V6 06/10] migration: add postcopy blocktime ctx into MigrationIncomingState
       [not found]   ` <CGME20170523113129eucas1p179082f20f41d1069f5fbd0f37535fae9@eucas1p1.samsung.com>
@ 2017-05-23 11:31     ` Alexey Perevalov
  2017-05-24  3:31       ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-23 11:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: i.maximets, peterx, eblake, dgilbert, Alexey Perevalov

This patch adds a request to kernel space for UFFD_FEATURE_THREAD_ID,
in case this feature is provided by the kernel.

PostcopyBlocktimeContext is encapsulated inside postcopy-ram.c,
since it is a postcopy-only feature.
This patch also defines the lifetime of a PostcopyBlocktimeContext
instance. Information from the instance will be queried long after the
postcopy migration ends, so the instance lives until QEMU exits, but the
parts of it used only during calculation (vcpu_addr,
page_fault_vcpu_time) are released when postcopy ends or fails.

To enable postcopy blocktime calculation on the destination, the proper
capability must be requested (the documentation patch is at the tail of
the patch set).

As an example, the following command enables that capability, assuming
QEMU was started with the
-chardev socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock
option to control it:

[root@host]#printf "{\"execute\" : \"qmp_capabilities\"}\r\n \
{\"execute\": \"migrate-set-capabilities\" , \"arguments\":   {
\"capabilities\": [ { \"capability\": \"postcopy-blocktime\", \"state\":
true } ] } }" | nc -U /var/lib/migrate-vm-monitor.sock

Or just with HMP
(qemu) migrate_set_capability postcopy-blocktime on

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 include/migration/migration.h |  8 +++++
 migration/postcopy-ram.c      | 80 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 2951253..449cb07 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -57,6 +57,8 @@ enum mig_rp_message_type {
 
 typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
 
+struct PostcopyBlocktimeContext;
+
 /* State for the incoming migration */
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
@@ -97,6 +99,12 @@ struct MigrationIncomingState {
 
     /* See savevm.c */
     LoadStateEntry_Head loadvm_handlers;
+
+    /*
+     * PostcopyBlocktimeContext to keep information for postcopy
+     * live migration, to calculate vCPU block time
+     * */
+    struct PostcopyBlocktimeContext *blocktime_ctx;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 4f3f495..5435a40 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -59,6 +59,73 @@ struct PostcopyDiscardState {
 #include <sys/eventfd.h>
 #include <linux/userfaultfd.h>
 
+typedef struct PostcopyBlocktimeContext {
+    /* time when page fault initiated per vCPU */
+    int64_t *page_fault_vcpu_time;
+    /* page address per vCPU */
+    uint64_t *vcpu_addr;
+    int64_t total_blocktime;
+    /* blocktime per vCPU */
+    int64_t *vcpu_blocktime;
+    /* point in time when last page fault was initiated */
+    int64_t last_begin;
+    /* number of vCPU are suspended */
+    int smp_cpus_down;
+
+    /*
+     * Handler for exit event, necessary for
+     * releasing whole blocktime_ctx
+     */
+    Notifier exit_notifier;
+    /*
+     * Handler for postcopy event, necessary for
+     * releasing unnecessary part of blocktime_ctx
+     */
+    Notifier postcopy_notifier;
+} PostcopyBlocktimeContext;
+
+static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx)
+{
+    g_free(ctx->page_fault_vcpu_time);
+    g_free(ctx->vcpu_addr);
+    g_free(ctx->vcpu_blocktime);
+    g_free(ctx);
+}
+
+static void postcopy_migration_cb(Notifier *n, void *data)
+{
+    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
+                                               postcopy_notifier);
+    MigrationState *s = data;
+    if (migration_has_finished(s) || migration_has_failed(s)) {
+        g_free(ctx->page_fault_vcpu_time);
+        /* g_free is NULL robust */
+        ctx->page_fault_vcpu_time = NULL;
+        g_free(ctx->vcpu_addr);
+        ctx->vcpu_addr = NULL;
+    }
+}
+
+static void migration_exit_cb(Notifier *n, void *data)
+{
+    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
+                                               exit_notifier);
+    destroy_blocktime_context(ctx);
+}
+
+static struct PostcopyBlocktimeContext *blocktime_context_new(void)
+{
+    PostcopyBlocktimeContext *ctx = g_new0(PostcopyBlocktimeContext, 1);
+    ctx->page_fault_vcpu_time = g_new0(int64_t, smp_cpus);
+    ctx->vcpu_addr = g_new0(uint64_t, smp_cpus);
+    ctx->vcpu_blocktime = g_new0(int64_t, smp_cpus);
+
+    ctx->exit_notifier.notify = migration_exit_cb;
+    ctx->postcopy_notifier.notify = postcopy_migration_cb;
+    qemu_add_exit_notifier(&ctx->exit_notifier);
+    add_migration_state_change_notifier(&ctx->postcopy_notifier);
+    return ctx;
+}
 
 /**
  * receive_ufd_features: check userfault fd features, to request only supported
@@ -151,6 +218,19 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
         }
     }
 
+#ifdef UFFD_FEATURE_THREAD_ID
+    if (migrate_postcopy_blocktime() && mis &&
+            UFFD_FEATURE_THREAD_ID & supported_features) {
+        /* kernel supports that feature */
+        /* don't create blocktime_context if it exists */
+        if (!mis->blocktime_ctx) {
+            mis->blocktime_ctx = blocktime_context_new();
+        }
+
+        asked_features |= UFFD_FEATURE_THREAD_ID;
+    }
+#endif
+
     /*
      * request features, even if asked_features is 0, due to
      * kernel expects UFFD_API before UFFDIO_REGISTER, per
-- 
1.8.3.1


* [Qemu-devel] [PATCH V6 07/10] migration: add bitmap for copied page
       [not found]   ` <CGME20170523113130eucas1p1babac9d8659c10abe22ddc7d5b9526ab@eucas1p1.samsung.com>
@ 2017-05-23 11:31     ` Alexey Perevalov
  2017-05-24  6:57       ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-23 11:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: i.maximets, peterx, eblake, dgilbert, Alexey Perevalov

This patch adds the ability to track down already-copied
pages; that is necessary for calculating vCPU block time in the
postcopy migration feature, and maybe for restore after a
postcopy migration failure.

Functions which work with RAMBlock are placed into ram.c,
because TARGET_WORDS_BIGENDIAN is poisoned in postcopy-ram.c,
which is hardware-independent code.

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 include/migration/migration.h | 16 +++++++++++
 migration/postcopy-ram.c      | 22 ++++++++++++---
 migration/ram.c               | 63 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+), 4 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 449cb07..4e05c83 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -101,6 +101,20 @@ struct MigrationIncomingState {
     LoadStateEntry_Head loadvm_handlers;
 
     /*
+     * bitmap indicating whether a page has been copied,
+     * indexed by ramblock offset;
+     * currently it is used only for blocktime calculation in
+     * postcopy migration, so the lifetime of this entry is
+     * from the moment the user requests blocktime calculation
+     * until the end of postcopy migration.
+     * As an example, it could represent the following memory map:
+     * ___________________________________
+     * |4k pages | hugepages | 4k pages
+     *
+     */
+    unsigned long *copied_pages;
+
+    /*
      * PostcopyBlocktimeContext to keep information for postcopy
      * live migration, to calculate vCPU block time
      * */
@@ -279,6 +293,8 @@ int migrate_compress_threads(void);
 int migrate_decompress_threads(void);
 bool migrate_use_events(void);
 bool migrate_postcopy_blocktime(void);
+unsigned long int get_copied_bit_offset(ram_addr_t addr);
+unsigned long int *copied_pages_bitmap_new(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_message(MigrationIncomingState *mis,
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 5435a40..d647769 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -94,23 +94,34 @@ static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx)
 
 static void postcopy_migration_cb(Notifier *n, void *data)
 {
-    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
-                                               postcopy_notifier);
     MigrationState *s = data;
     if (migration_has_finished(s) || migration_has_failed(s)) {
+        MigrationIncomingState *mis = migration_incoming_get_current();
+        PostcopyBlocktimeContext *ctx = mis->blocktime_ctx;
+
+        if (!ctx) {
+            return;
+        }
+
         g_free(ctx->page_fault_vcpu_time);
         /* g_free is NULL robust */
         ctx->page_fault_vcpu_time = NULL;
         g_free(ctx->vcpu_addr);
         ctx->vcpu_addr = NULL;
+        g_free(mis->copied_pages);
+        mis->copied_pages = NULL;
     }
 }
 
 static void migration_exit_cb(Notifier *n, void *data)
 {
-    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
-                                               exit_notifier);
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    PostcopyBlocktimeContext *ctx = mis->blocktime_ctx;
+    if (!ctx) {
+        return;
+    }
     destroy_blocktime_context(ctx);
+    mis->blocktime_ctx = NULL;
 }
 
 static struct PostcopyBlocktimeContext *blocktime_context_new(void)
@@ -227,6 +238,9 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
             mis->blocktime_ctx = blocktime_context_new();
         }
 
+        if (!mis->copied_pages) {
+            mis->copied_pages = copied_pages_bitmap_new();
+        }
         asked_features |= UFFD_FEATURE_THREAD_ID;
     }
 #endif
diff --git a/migration/ram.c b/migration/ram.c
index f59fdd4..1abb6bb 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2661,6 +2661,69 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
     return ret;
 }
 
+static unsigned long get_total_bits_per_page(ram_addr_t mem_length,
+        size_t page_size)
+{
+    unsigned long page_size_bit = find_last_bit((unsigned long *)&page_size,
+            BITS_PER_LONG);
+    unsigned long total_bits = mem_length >> page_size_bit;
+    if (mem_length % page_size) {
+        total_bits += 1;
+    }
+    return total_bits;
+}
+
+/*
+ * this function allocates the bitmap for copied pages;
+ * it also calculates
+ * how many entries are needed
+ */
+unsigned long int *copied_pages_bitmap_new(void)
+{
+    RAMBlock *block;
+    unsigned long int total_bits = 0;
+
+    rcu_read_lock();
+    RAMBLOCK_FOREACH(block) {
+        /* in general case used_length may not be aligned
+         * by page_size */
+
+        total_bits += get_total_bits_per_page(block->used_length,
+                block->page_size);
+    }
+    rcu_read_unlock();
+
+    return bitmap_new(total_bits);
+}
+
+unsigned long int get_copied_bit_offset(ram_addr_t addr)
+{
+    RAMBlock *block;
+    unsigned long int iter_bit = 0;
+
+    rcu_read_lock();
+    RAMBLOCK_FOREACH(block) {
+        /* in general case used_length may not be aligned
+         * by page_size */
+        if (block->host == NULL) {
+            continue;
+        }
+        if (addr - (ram_addr_t)block->host < block->max_length) {
+            unsigned long page_size_bit = find_last_bit(
+                    (unsigned long *)&block->page_size,
+                    BITS_PER_LONG);
+            ram_addr_t offset = addr - (ram_addr_t)block->host;
+            iter_bit += offset >> page_size_bit;
+            break;
+        }
+        iter_bit += get_total_bits_per_page(block->used_length,
+                block->page_size);
+    }
+    rcu_read_unlock();
+
+    return iter_bit;
+}
+
 static SaveVMHandlers savevm_ram_handlers = {
     .save_live_setup = ram_save_setup,
     .save_live_iterate = ram_save_iterate,
-- 
1.8.3.1


* [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side
       [not found]   ` <CGME20170523113131eucas1p24a041de6004237e437f97a24340507e2@eucas1p2.samsung.com>
@ 2017-05-23 11:31     ` Alexey Perevalov
  2017-05-24  7:53       ` Peter Xu
  2017-06-01 10:57       ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-23 11:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: i.maximets, peterx, eblake, dgilbert, Alexey Perevalov

This patch provides blocktime calculation per vCPU,
as a total and as an overlapped value across all vCPUs.

This approach was suggested by Peter Xu, as an improvement over the
previous approach where QEMU kept a tree with the faulted page address
and a cpu bitmask in it. Now QEMU keeps an array with the faulted page
address as the value and the vCPU index as the index. That helps to find
the proper vCPU at UFFD_COPY time. It also keeps a list of blocktimes
per vCPU (which can be traced with page_fault_addr).

Blocktime will not be calculated if the postcopy_blocktime field of
MigrationIncomingState wasn't initialized.

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 migration/postcopy-ram.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++-
 migration/trace-events   |   5 ++-
 2 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index d647769..e70c44b 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -23,6 +23,7 @@
 #include "postcopy-ram.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/balloon.h"
+#include <sys/param.h>
 #include "qemu/error-report.h"
 #include "trace.h"
 
@@ -577,6 +578,101 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
     return 0;
 }
 
+static int get_mem_fault_cpu_index(uint32_t pid)
+{
+    CPUState *cpu_iter;
+
+    CPU_FOREACH(cpu_iter) {
+        if (cpu_iter->thread_id == pid) {
+            return cpu_iter->cpu_index;
+        }
+    }
+    trace_get_mem_fault_cpu_index(pid);
+    return -1;
+}
+
+static void mark_postcopy_blocktime_begin(uint64_t addr, uint32_t ptid,
+        RAMBlock *rb)
+{
+    int cpu;
+    unsigned long int nr_bit;
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
+    int64_t now_ms;
+
+    if (!dc || ptid == 0) {
+        return;
+    }
+    cpu = get_mem_fault_cpu_index(ptid);
+    if (cpu < 0) {
+        return;
+    }
+    nr_bit = get_copied_bit_offset(addr);
+    if (test_bit(nr_bit, mis->copied_pages)) {
+        return;
+    }
+    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    if (dc->vcpu_addr[cpu] == 0) {
+        atomic_inc(&dc->smp_cpus_down);
+    }
+
+    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
+    atomic_xchg__nocheck(&dc->last_begin, now_ms);
+    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
+
+    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
+            cpu);
+}
+
+static void mark_postcopy_blocktime_end(uint64_t addr)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
+    int i, affected_cpu = 0;
+    int64_t now_ms;
+    bool vcpu_total_blocktime = false;
+    unsigned long int nr_bit;
+
+    if (!dc) {
+        return;
+    }
+    /* mark that page as copied */
+    nr_bit = get_copied_bit_offset(addr);
+    set_bit_atomic(nr_bit, mis->copied_pages);
+
+    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+    /* look up the vCPU to clear it;
+     * this algorithm looks straightforward, but it's not
+     * optimal; a more optimal algorithm would keep a tree or hash
+     * where the key is the address and the value is a list of vCPUs */
+    for (i = 0; i < smp_cpus; i++) {
+        uint64_t vcpu_blocktime = 0;
+        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
+            continue;
+        }
+        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
+        vcpu_blocktime = now_ms -
+            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
+        affected_cpu += 1;
+        /* we need to know whether mark_postcopy_blocktime_end was due to a
+         * faulted page; the other possible case is a prefetched
+         * page, and in that case we shouldn't be here */
+        if (!vcpu_total_blocktime &&
+            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
+            vcpu_total_blocktime = true;
+        }
+        /* continue the loop, since one page could affect several vCPUs */
+        dc->vcpu_blocktime[i] += vcpu_blocktime;
+    }
+
+    atomic_sub(&dc->smp_cpus_down, affected_cpu);
+    if (vcpu_total_blocktime) {
+        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
+    }
+    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
+}
+
 /*
  * Handle faults detected by the USERFAULT markings
  */
@@ -654,8 +750,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
         rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
         trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
                                                 qemu_ram_get_idstr(rb),
-                                                rb_offset);
+                                                rb_offset,
+                                                msg.arg.pagefault.feat.ptid);
 
+        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
+                msg.arg.pagefault.feat.ptid, rb);
         /*
          * Send the request to the source - we want to request one
          * of our host page sizes (which is >= TPS)
@@ -750,6 +849,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
 
         return -e;
     }
+    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
 
     trace_postcopy_place_page(host);
     return 0;
diff --git a/migration/trace-events b/migration/trace-events
index 5b8ccf3..7bdadbb 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -112,6 +112,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
 process_incoming_migration_co_postcopy_end_main(void) ""
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
 migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
+mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
+mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
 
 # migration/rdma.c
 qemu_rdma_accept_incoming_migration(void) ""
@@ -188,7 +190,7 @@ postcopy_ram_enable_notify(void) ""
 postcopy_ram_fault_thread_entry(void) ""
 postcopy_ram_fault_thread_exit(void) ""
 postcopy_ram_fault_thread_quit(void) ""
-postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
+postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
 postcopy_ram_incoming_cleanup_closeuf(void) ""
 postcopy_ram_incoming_cleanup_entry(void) ""
 postcopy_ram_incoming_cleanup_exit(void) ""
@@ -197,6 +199,7 @@ save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
 ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
 ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
+get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
 
 # migration/exec.c
 migration_exec_outgoing(const char *cmd) "cmd=%s"
-- 
1.8.3.1


* [Qemu-devel] [PATCH V6 09/10] migration: add postcopy total blocktime into query-migrate
       [not found]   ` <CGME20170523113131eucas1p1ec4e059c13ce977e3a3872c343e6b858@eucas1p1.samsung.com>
@ 2017-05-23 11:31     ` Alexey Perevalov
  2017-06-01 11:35       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-23 11:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: i.maximets, peterx, eblake, dgilbert, Alexey Perevalov

Postcopy total blocktime is available on the destination side only,
but query-migrate was possible only for the source. This patch
adds the ability to call query-migrate on the destination.
To be able to see postcopy blocktime, the postcopy-blocktime
capability has to be requested.

The query-migrate command will show the following sample result:
{"return": {
    "postcopy-vcpu-blocktime": [115, 100],
    "status": "completed",
    "postcopy-blocktime": 100
}}

postcopy-vcpu-blocktime contains a list whose first item corresponds to
the first vCPU in QEMU.

This patch has a drawback: it combines the states of incoming and
outgoing migration, and the outgoing migration state will overwrite the
incoming one. It looks better to separate query-migrate for incoming and
outgoing migration, or to add a parameter indicating the type of migration.

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 hmp.c                         | 15 +++++++++
 include/migration/migration.h |  4 +++
 migration/migration.c         | 40 +++++++++++++++++++++--
 migration/postcopy-ram.c      | 75 +++++++++++++++++++++++++++++++++++++++++++
 migration/trace-events        |  1 +
 qapi-schema.json              |  9 +++++-
 6 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/hmp.c b/hmp.c
index 3dceaf8..25135e7 100644
--- a/hmp.c
+++ b/hmp.c
@@ -260,6 +260,21 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
                        info->cpu_throttle_percentage);
     }
 
+    if (info->has_postcopy_blocktime) {
+        monitor_printf(mon, "postcopy blocktime: %" PRId64 "\n",
+                info->postcopy_blocktime);
+    }
+
+    if (info->has_postcopy_vcpu_blocktime) {
+        Visitor *v;
+        char *str;
+        v = string_output_visitor_new(false, &str);
+        visit_type_int64List(v, NULL, &info->postcopy_vcpu_blocktime, NULL);
+        visit_complete(v, &str);
+        monitor_printf(mon, "postcopy vcpu blocktime: %s\n", str);
+        g_free(str);
+        visit_free(v);
+    }
     qapi_free_MigrationInfo(info);
     qapi_free_MigrationCapabilityStatusList(caps);
 }
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 4e05c83..c9d4954 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -123,6 +123,10 @@ struct MigrationIncomingState {
 
 MigrationIncomingState *migration_incoming_get_current(void);
 void migration_incoming_state_destroy(void);
+/*
+ * Functions to work with blocktime context
+ */
+void fill_destination_postcopy_migration_info(MigrationInfo *info);
 
 struct MigrationState
 {
diff --git a/migration/migration.c b/migration/migration.c
index e10284e..4da0c20 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -651,14 +651,15 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
     }
 }
 
-MigrationInfo *qmp_query_migrate(Error **errp)
+static void fill_source_migration_info(MigrationInfo *info)
 {
-    MigrationInfo *info = g_malloc0(sizeof(*info));
     MigrationState *s = migrate_get_current();
 
     switch (s->state) {
     case MIGRATION_STATUS_NONE:
         /* no migration has happened ever */
+        /* do not overwrite destination migration status */
+        return;
         break;
     case MIGRATION_STATUS_SETUP:
         info->has_status = true;
@@ -744,10 +745,43 @@ MigrationInfo *qmp_query_migrate(Error **errp)
         break;
     }
     info->status = s->state;
+}
 
-    return info;
+static void fill_destination_migration_info(MigrationInfo *info)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    switch (mis->state) {
+    case MIGRATION_STATUS_NONE:
+        return;
+        break;
+    case MIGRATION_STATUS_SETUP:
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_ACTIVE:
+    case MIGRATION_STATUS_POSTCOPY_ACTIVE:
+    case MIGRATION_STATUS_FAILED:
+    case MIGRATION_STATUS_COLO:
+        info->has_status = true;
+        break;
+    case MIGRATION_STATUS_COMPLETED:
+        info->has_status = true;
+        fill_destination_postcopy_migration_info(info);
+        break;
+    }
+    info->status = mis->state;
 }
 
+MigrationInfo *qmp_query_migrate(Error **errp)
+{
+    MigrationInfo *info = g_malloc0(sizeof(*info));
+
+    fill_destination_migration_info(info);
+    fill_source_migration_info(info);
+
+    return info;
+}
+
 void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
                                   Error **errp)
 {
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index e70c44b..3dc3869 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -139,6 +139,73 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
     return ctx;
 }
 
+static int64List *get_vcpu_blocktime_list(PostcopyBlocktimeContext *ctx)
+{
+    int64List *list = NULL, *entry = NULL;
+    int i;
+
+    for (i = smp_cpus - 1; i >= 0; i--) {
+            entry = g_new0(int64List, 1);
+            entry->value = ctx->vcpu_blocktime[i];
+            entry->next = list;
+            list = entry;
+    }
+
+    return list;
+}
+
+/*
+ * This function just provides the calculated blocktime per vCPU and traces it.
+ * Total blocktime is calculated in mark_postcopy_blocktime_end.
+ *
+ *
+ * Assume we have 3 CPUs
+ *
+ *      S1        E1           S1               E1
+ * -----***********------------xxx***************------------------------> CPU1
+ *
+ *             S2                E2
+ * ------------****************xxx---------------------------------------> CPU2
+ *
+ *                         S3            E3
+ * ------------------------****xxx********-------------------------------> CPU3
+ *
+ * We have the sequence S1,S2,E1,S3,S1,E2,E3,E1
+ * S2,E1 - doesn't match the condition, because the sequence S1,S2,E1 doesn't include CPU3
+ * S3,S1,E2 - the sequence includes all CPUs, in this case the overlap is S1,E2 -
+ *            it's a part of the total blocktime.
+ * S1 - here is last_begin
+ * The legend of the picture is the following:
+ *              * - means blocktime per vCPU
+ *              x - means overlapped blocktime (total blocktime)
+ */
+void fill_destination_postcopy_migration_info(MigrationInfo *info)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    PostcopyBlocktimeContext *bc = mis->blocktime_ctx;
+
+    if (!bc) {
+        return;
+    }
+
+    info->has_postcopy_blocktime = true;
+    info->postcopy_blocktime = bc->total_blocktime;
+    info->has_postcopy_vcpu_blocktime = true;
+    info->postcopy_vcpu_blocktime = get_vcpu_blocktime_list(bc);
+}
+
+static uint64_t get_postcopy_total_blocktime(void)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    PostcopyBlocktimeContext *bc = mis->blocktime_ctx;
+
+    if (!bc) {
+        return 0;
+    }
+
+    return bc->total_blocktime;
+}
+
 /**
  * receive_ufd_features: check userfault fd features, to request only supported
  * features in the future.
@@ -497,6 +564,9 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
     }
 
     postcopy_state_set(POSTCOPY_INCOMING_END);
+    /* this is where the blocktime would be reported back to the source */
+    trace_postcopy_ram_incoming_cleanup_blocktime(
+            get_postcopy_total_blocktime());
     migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0);
 
     if (mis->postcopy_tmp_page) {
@@ -926,6 +996,11 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
 
 #else
 /* No target OS support, stubs just fail */
+void fill_destination_postcopy_migration_info(MigrationInfo *info)
+{
+    error_report("%s: No OS support", __func__);
+}
+
 bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
 {
     error_report("%s: No OS support", __func__);
diff --git a/migration/trace-events b/migration/trace-events
index 7bdadbb..55a3b6e 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -195,6 +195,7 @@ postcopy_ram_incoming_cleanup_closeuf(void) ""
 postcopy_ram_incoming_cleanup_entry(void) ""
 postcopy_ram_incoming_cleanup_exit(void) ""
 postcopy_ram_incoming_cleanup_join(void) ""
+postcopy_ram_incoming_cleanup_blocktime(uint64_t total) "total blocktime %" PRIu64
 save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
 ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
diff --git a/qapi-schema.json b/qapi-schema.json
index 78617fe..4be0b09 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -712,6 +712,11 @@
 #              @status is 'failed'. Clients should not attempt to parse the
 #              error strings. (Since 2.7)
 #
+# @postcopy-blocktime: total time when all vCPUs were blocked during postcopy
+#           live migration (Since 2.10)
+#
+# @postcopy-vcpu-blocktime: list of the postcopy blocktime per vCPU (Since 2.10)
+#
 # Since: 0.14.0
 ##
 { 'struct': 'MigrationInfo',
@@ -723,7 +728,9 @@
            '*downtime': 'int',
            '*setup-time': 'int',
            '*cpu-throttle-percentage': 'int',
-           '*error-desc': 'str'} }
+           '*error-desc': 'str',
+           '*postcopy-blocktime' : 'int64',
+           '*postcopy-vcpu-blocktime': ['int64']} }
 
 ##
 # @query-migrate:
-- 
1.8.3.1


* [Qemu-devel] [PATCH V6 10/10] migration: postcopy_blocktime documentation
       [not found]   ` <CGME20170523113132eucas1p19143aceccbb30a0051635cddcf376bb6@eucas1p1.samsung.com>
@ 2017-05-23 11:31     ` Alexey Perevalov
  2017-06-01 11:37       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-23 11:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: i.maximets, peterx, eblake, dgilbert, Alexey Perevalov

Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
---
 docs/migration.txt | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/docs/migration.txt b/docs/migration.txt
index 1b940a8..2207c6d 100644
--- a/docs/migration.txt
+++ b/docs/migration.txt
@@ -402,6 +402,16 @@ will now cause the transition from precopy to postcopy.
 It can be issued immediately after migration is started or any
 time later on.  Issuing it after the end of a migration is harmless.
 
+Blocktime is a postcopy live migration metric, intended to show
+how long the vCPUs were in an interruptible sleep due to page faults.
+This value is calculated on the destination side.
+To enable postcopy blocktime calculation, enter the following command on the
+destination monitor:
+
+migrate_set_capability postcopy-blocktime on
+
+Postcopy blocktime can be retrieved with the query-migrate QMP command.
+
 Note: During the postcopy phase, the bandwidth limits set using
 migrate_set_speed is ignored (to avoid delaying requested pages that
 the destination is waiting for).
-- 
1.8.3.1


* Re: [Qemu-devel] [PATCH V6 04/10] migration: split ufd_version_check onto receive/request features part
  2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 04/10] migration: split ufd_version_check onto receive/request features part Alexey Perevalov
@ 2017-05-24  2:36       ` Peter Xu
  2017-05-24  6:45         ` Alexey
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2017-05-24  2:36 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, eblake, dgilbert

On Tue, May 23, 2017 at 02:31:05PM +0300, Alexey Perevalov wrote:
> This modification is necessary for userfault fd features which are
> required to be requested from userspace.
> UFFD_FEATURE_THREAD_ID is one of such "on demand" features, which will
> be introduced in the next patch.
> 
> QEMU has to use a separate userfault file descriptor, because the
> userfault context has internal state: after the first call of the
> ioctl UFFD_API it changes its state to UFFD_STATE_RUNNING (in case of
> success), but while handling the ioctl UFFD_API the kernel expects
> UFFD_STATE_WAIT_API. So only one ioctl with UFFD_API is possible per ufd.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>

Hi, Alexey,

Mostly good to me, some nitpicks below.

> ---
>  migration/postcopy-ram.c | 100 ++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 91 insertions(+), 9 deletions(-)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 3ed78bf..4f3f495 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -59,32 +59,114 @@ struct PostcopyDiscardState {
>  #include <sys/eventfd.h>
>  #include <linux/userfaultfd.h>
>  
> -static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
> +
> +/**
> + * receive_ufd_features: check userfault fd features, to request only supported
> + * features in the future.
> + *
> + * Returns: true on success
> + *
> + * __NR_userfaultfd - should be checked before

I don't see this line necessary. After all we will detect the error no
matter what...

> + *  @features: out parameter will contain uffdio_api.features provided by kernel
> + *              in case of success
> + */
> +static bool receive_ufd_features(uint64_t *features)
>  {
> -    struct uffdio_api api_struct;
> -    uint64_t ioctl_mask;
> +    struct uffdio_api api_struct = {0};
> +    int ufd;
> +    bool ret = true;
> +
> > +    /* if we are here __NR_userfaultfd should exist */
> +    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
> +    if (ufd == -1) {
> +        error_report("%s: syscall __NR_userfaultfd failed: %s", __func__,
> +                     strerror(errno));
> +        return false;
> +    }
>  
> +    /* ask features */
>      api_struct.api = UFFD_API;
>      api_struct.features = 0;
>      if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> -        error_report("%s: UFFDIO_API failed: %s", __func__
> +        error_report("%s: UFFDIO_API failed: %s", __func__,
>                       strerror(errno));
> +        ret = false;
> +        goto release_ufd;
> +    }
> +
> +    *features = api_struct.features;
> +
> +release_ufd:
> +    close(ufd);
> +    return ret;
> +}
> +
> +/**
> + * request_ufd_features: this function should be called only once on a newly
> + * opened ufd, subsequent calls will lead to error.
> + *
> > + * Returns: true on success
> + *
> + * @ufd: fd obtained from userfaultfd syscall
> + * @features: bit mask see UFFD_API_FEATURES
> + */
> +static bool request_ufd_features(int ufd, uint64_t features)
> +{
> +    struct uffdio_api api_struct = {0};
> +    uint64_t ioctl_mask;
> +
> +    api_struct.api = UFFD_API;
> +    api_struct.features = features;
> +    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> +        error_report("%s failed: UFFDIO_API failed: %s", __func__,
> +                strerror(errno));

Maybe we can indent this line to follow this file's rule?

    error_report("%s failed: UFFDIO_API failed: %s", __func__,
                 strerror(errno));

>          return false;
>      }
>  
> -    ioctl_mask = (__u64)1 << _UFFDIO_REGISTER |
> -                 (__u64)1 << _UFFDIO_UNREGISTER;
> +    ioctl_mask = 1 << _UFFDIO_REGISTER |
> +                 1 << _UFFDIO_UNREGISTER;

Could I ask why we explicitly removed (__u64) here? Since I see the
old one better.

>      if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
>          error_report("Missing userfault features: %" PRIx64,
>                       (uint64_t)(~api_struct.ioctls & ioctl_mask));
>          return false;
>      }
>  
> +    return true;
> +}
> +
> +static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
> +{
> +    uint64_t asked_features = 0;
> +    static uint64_t supported_features;
> +
> +    /*
> +     * it's not possible to
> +     * request UFFD_API twice per one fd
> +     * userfault fd features is persistent
> +     */
> +    if (!supported_features) {

I would prefer not having this static variable. After all, this
function call is rare, and the receive_ufd_features() is not that slow
as well.

> +        if (!receive_ufd_features(&supported_features)) {
> +            error_report("%s failed", __func__);
> +            return false;
> +        }
> +    }
> +
> +    /*
> +     * request features, even if asked_features is 0, due to
> +     * kernel expects UFFD_API before UFFDIO_REGISTER, per
> +     * userfault file descriptor
> +     */
> +    if (!request_ufd_features(ufd, asked_features)) {
> +        error_report("%s failed: features %" PRIu64, __func__,
> +                asked_features);

Better indent?

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH V6 06/10] migration: add postcopy blocktime ctx into MigrationIncomingState
  2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 06/10] migration: add postcopy blocktime ctx into MigrationIncomingState Alexey Perevalov
@ 2017-05-24  3:31       ` Peter Xu
  2017-06-05  6:31         ` Alexey Perevalov
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2017-05-24  3:31 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, eblake, dgilbert

On Tue, May 23, 2017 at 02:31:07PM +0300, Alexey Perevalov wrote:
> This patch adds request to kernel space for UFFD_FEATURE_THREAD_ID,
> in case when this feature is provided by kernel.
> 
> PostcopyBlocktimeContext is encapsulated inside postcopy-ram.c,
> since it is a postcopy-only feature.
> It also defines the lifetime of a PostcopyBlocktimeContext instance.
> Information from the PostcopyBlocktimeContext instance will be provided
> long after the postcopy migration ends; the instance
> will live until QEMU exits, but the parts of it (vcpu_addr,
> page_fault_vcpu_time) that are used only during the calculation will be
> released when postcopy ends or fails.
> 
> To enable postcopy blocktime calculation on the destination, the proper
> capability has to be requested (a patch for the documentation is at the
> tail of the patch set).
> 
> As an example, the following command enables that capability, assuming QEMU
> was started with
> -chardev socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock
> option to control it
> 
> [root@host]#printf "{\"execute\" : \"qmp_capabilities\"}\r\n \
> {\"execute\": \"migrate-set-capabilities\" , \"arguments\":   {
> \"capabilities\": [ { \"capability\": \"postcopy-blocktime\", \"state\":
> true } ] } }" | nc -U /var/lib/migrate-vm-monitor.sock
> 
> Or just with HMP
> (qemu) migrate_set_capability postcopy-blocktime on
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> ---
>  include/migration/migration.h |  8 +++++
>  migration/postcopy-ram.c      | 80 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 88 insertions(+)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 2951253..449cb07 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -57,6 +57,8 @@ enum mig_rp_message_type {
>  
>  typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
>  
> +struct PostcopyBlocktimeContext;
> +
>  /* State for the incoming migration */
>  struct MigrationIncomingState {
>      QEMUFile *from_src_file;
> @@ -97,6 +99,12 @@ struct MigrationIncomingState {
>  
>      /* See savevm.c */
>      LoadStateEntry_Head loadvm_handlers;
> +
> +    /*
> +     * PostcopyBlocktimeContext to keep information for postcopy
> +     * live migration, to calculate vCPU block time
> +     * */
> +    struct PostcopyBlocktimeContext *blocktime_ctx;
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 4f3f495..5435a40 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -59,6 +59,73 @@ struct PostcopyDiscardState {
>  #include <sys/eventfd.h>
>  #include <linux/userfaultfd.h>
>  
> +typedef struct PostcopyBlocktimeContext {
> +    /* time when page fault initiated per vCPU */
> +    int64_t *page_fault_vcpu_time;
> +    /* page address per vCPU */
> +    uint64_t *vcpu_addr;
> +    int64_t total_blocktime;
> +    /* blocktime per vCPU */
> +    int64_t *vcpu_blocktime;
> +    /* point in time when last page fault was initiated */
> +    int64_t last_begin;
> +    /* number of vCPU are suspended */
> +    int smp_cpus_down;
> +
> +    /*
> +     * Handler for exit event, necessary for
> +     * releasing whole blocktime_ctx
> +     */
> +    Notifier exit_notifier;
> +    /*
> +     * Handler for postcopy event, necessary for
> +     * releasing unnecessary part of blocktime_ctx
> +     */
> +    Notifier postcopy_notifier;
> +} PostcopyBlocktimeContext;
> +
> +static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx)
> +{
> +    g_free(ctx->page_fault_vcpu_time);
> +    g_free(ctx->vcpu_addr);
> +    g_free(ctx->vcpu_blocktime);
> +    g_free(ctx);
> +}
> +
> +static void postcopy_migration_cb(Notifier *n, void *data)
> +{
> +    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
> +                                               postcopy_notifier);
> +    MigrationState *s = data;
> +    if (migration_has_finished(s) || migration_has_failed(s)) {
> +        g_free(ctx->page_fault_vcpu_time);
> +        /* g_free is NULL robust */
> +        ctx->page_fault_vcpu_time = NULL;
> +        g_free(ctx->vcpu_addr);
> +        ctx->vcpu_addr = NULL;
> +    }
> +}
> +
> +static void migration_exit_cb(Notifier *n, void *data)
> +{
> +    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
> +                                               exit_notifier);
> +    destroy_blocktime_context(ctx);
> +}
> +
> +static struct PostcopyBlocktimeContext *blocktime_context_new(void)
> +{
> +    PostcopyBlocktimeContext *ctx = g_new0(PostcopyBlocktimeContext, 1);
> +    ctx->page_fault_vcpu_time = g_new0(int64_t, smp_cpus);
> +    ctx->vcpu_addr = g_new0(uint64_t, smp_cpus);
> +    ctx->vcpu_blocktime = g_new0(int64_t, smp_cpus);
> +
> +    ctx->exit_notifier.notify = migration_exit_cb;
> +    ctx->postcopy_notifier.notify = postcopy_migration_cb;
> +    qemu_add_exit_notifier(&ctx->exit_notifier);
> +    add_migration_state_change_notifier(&ctx->postcopy_notifier);

I think we just need a global MAX_VCPUS macro, then we can just make
the whole struct static. But I admit this is out of topic for current
thread. The point is indeed I see no much point on such fine-grained
management of memory on this... only for a summary of, say a maximum
of 1K vcpus, no more than 8B*3*1K=24KB memory...

IMHO This just made things complicated.

> +    return ctx;
> +}
>  
>  /**
>   * receive_ufd_features: check userfault fd features, to request only supported
> @@ -151,6 +218,19 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
>          }
>      }
>  
> +#ifdef UFFD_FEATURE_THREAD_ID
> +    if (migrate_postcopy_blocktime() && mis &&
> +            UFFD_FEATURE_THREAD_ID & supported_features) {

(indent)

> +        /* kernel supports that feature */
> +        /* don't create blocktime_context if it exists */
> +        if (!mis->blocktime_ctx) {

If it does existed, then maybe we'll have problem - in the state
change notifier we cleaned up ->page_fault_vcpu_time and ->vcpu_addr,
but blocktime_ctx was there. So will it possible when we reached here
we'll get blocktime_ctx != NULL but
blocktime_ctx->page_fault_vcpu_time == NULL and also
blocktime_ctx->vcpu_addr == NULL?

Maybe we can just drop the state change notifier, and one single exit
notifier would be enough to cleanup everything?

> +            mis->blocktime_ctx = blocktime_context_new();
> +        }
> +
> +        asked_features |= UFFD_FEATURE_THREAD_ID;
> +    }
> +#endif
> +
>      /*
>       * request features, even if asked_features is 0, due to
>       * kernel expects UFFD_API before UFFDIO_REGISTER, per
> -- 
> 1.8.3.1
> 

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH V6 04/10] migration: split ufd_version_check onto receive/request features part
  2017-05-24  2:36       ` Peter Xu
@ 2017-05-24  6:45         ` Alexey
  2017-05-24 11:33           ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey @ 2017-05-24  6:45 UTC (permalink / raw)
  To: Peter Xu; +Cc: i.maximets, qemu-devel, dgilbert

Hi, Peter,

On Wed, May 24, 2017 at 10:36:29AM +0800, Peter Xu wrote:
> On Tue, May 23, 2017 at 02:31:05PM +0300, Alexey Perevalov wrote:
> > This modification is necessary for userfault fd features which are
> > required to be requested from userspace.
> > UFFD_FEATURE_THREAD_ID is one of such "on demand" features, which will
> > be introduced in the next patch.
> > 
> > QEMU has to use a separate userfault file descriptor, because the
> > userfault context has internal state: after the first call of the
> > ioctl UFFD_API it changes its state to UFFD_STATE_RUNNING (in case of
> > success), but while handling the ioctl UFFD_API the kernel expects
> > UFFD_STATE_WAIT_API. So only one ioctl with UFFD_API is possible per ufd.
> > 
> > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> 
> Hi, Alexey,
> 
> Mostly good to me, some nitpicks below.
> 
> > ---
> >  migration/postcopy-ram.c | 100 ++++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 91 insertions(+), 9 deletions(-)
> > 
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 3ed78bf..4f3f495 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -59,32 +59,114 @@ struct PostcopyDiscardState {
> >  #include <sys/eventfd.h>
> >  #include <linux/userfaultfd.h>
> >  
> > -static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
> > +
> > +/**
> > + * receive_ufd_features: check userfault fd features, to request only supported
> > + * features in the future.
> > + *
> > + * Returns: true on success
> > + *
> > + * __NR_userfaultfd - should be checked before
> 
> I don't see this line as necessary. After all, we will detect the error no
> matter what...
Yes, this function already performs that check itself, and that check is
not redundant; only the comment is.
So the comment will be removed.
> 
> > + *  @features: out parameter will contain uffdio_api.features provided by kernel
> > + *              in case of success
> > + */
> > +static bool receive_ufd_features(uint64_t *features)
> >  {
> > -    struct uffdio_api api_struct;
> > -    uint64_t ioctl_mask;
> > +    struct uffdio_api api_struct = {0};
> > +    int ufd;
> > +    bool ret = true;
> > +
> > +    /* if we are here __NR_userfaultfd should exist */
> > +    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
> > +    if (ufd == -1) {
> > +        error_report("%s: syscall __NR_userfaultfd failed: %s", __func__,
> > +                     strerror(errno));
> > +        return false;
> > +    }
> >  
> > +    /* ask features */
> >      api_struct.api = UFFD_API;
> >      api_struct.features = 0;
> >      if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> > -        error_report("%s: UFFDIO_API failed: %s", __func__
> > +        error_report("%s: UFFDIO_API failed: %s", __func__,
> >                       strerror(errno));
> > +        ret = false;
> > +        goto release_ufd;
> > +    }
> > +
> > +    *features = api_struct.features;
> > +
> > +release_ufd:
> > +    close(ufd);
> > +    return ret;
> > +}
> > +
> > +/**
> > + * request_ufd_features: this function should be called only once on a newly
> > + * opened ufd, subsequent calls will lead to error.
> > + *
> > + * Returns: true on success
> > + *
> > + * @ufd: fd obtained from userfaultfd syscall
> > + * @features: bit mask see UFFD_API_FEATURES
> > + */
> > +static bool request_ufd_features(int ufd, uint64_t features)
> > +{
> > +    struct uffdio_api api_struct = {0};
> > +    uint64_t ioctl_mask;
> > +
> > +    api_struct.api = UFFD_API;
> > +    api_struct.features = features;
> > +    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> > +        error_report("%s failed: UFFDIO_API failed: %s", __func__,
> > +                strerror(errno));
> 
> Maybe we can indent this line to follow this file's rule?
> 
>     error_report("%s failed: UFFDIO_API failed: %s", __func__,
>                  strerror(errno));
looks like I missed that rule.
> 
> >          return false;
> >      }
> >  
> > -    ioctl_mask = (__u64)1 << _UFFDIO_REGISTER |
> > -                 (__u64)1 << _UFFDIO_UNREGISTER;
> > +    ioctl_mask = 1 << _UFFDIO_REGISTER |
> > +                 1 << _UFFDIO_UNREGISTER;
> 
> Could I ask why we explicitly removed (__u64) here? Since I see the
> old one better.
Maybe my change wasn't robust; in any case, thanks for pointing it out.
Now I think a constant like UFFD_API_IOCTLS should be used here instead
of building ioctl_mask by hand. The whole meaning of that check is to
make sure the kernel returned no error and accepted the requested features.

OK, from the beginning:

if we request an unsupported feature (we check for that beforehand), or
the internal state of the userfault ctx inside the kernel isn't
UFFD_STATE_WAIT_API (for example, we are in the middle of the copying process)
	the ioctl will end with an EINVAL error and the ioctls field in
	uffdio_api will be empty

Right now I think the ioctls check for UFFD_API is not necessary.
Here we just say that we will use _UFFDIO_REGISTER and _UFFDIO_UNREGISTER,
but the kernel supports those unconditionally; by contrast, the
UFFDIO_REGISTER ioctl also returns an ioctls field in the uffdio_register
structure, and there the result can vary.
> 
> >      if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
> >          error_report("Missing userfault features: %" PRIx64,
> >                       (uint64_t)(~api_struct.ioctls & ioctl_mask));
> >          return false;
> >      }
> >  
> > +    return true;
> > +}
> > +
> > +static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
> > +{
> > +    uint64_t asked_features = 0;
> > +    static uint64_t supported_features;
> > +
> > +    /*
> > +     * it's not possible to
> > +     * request UFFD_API twice per one fd
> > +     * userfault fd features is persistent
> > +     */
> > +    if (!supported_features) {
> 
> I would prefer not having this static variable. After all, this
> function call is rare, and the receive_ufd_features() is not that slow
> as well.
ok ) for the sake of low code complexity
> 
> > +        if (!receive_ufd_features(&supported_features)) {
> > +            error_report("%s failed", __func__);
> > +            return false;
> > +        }
> > +    }
> > +
> > +    /*
> > +     * request features, even if asked_features is 0, because the
> > +     * kernel expects UFFD_API before UFFDIO_REGISTER, per
> > +     * userfault file descriptor
> > +     */
> > +    if (!request_ufd_features(ufd, asked_features)) {
> > +        error_report("%s failed: features %" PRIu64, __func__,
> > +                asked_features);
> 
> Better indent?
> 
> Thanks,
> 
> -- 
> Peter Xu
> 

-- 

BR
Alexey


* Re: [Qemu-devel] [PATCH V6 07/10] migration: add bitmap for copied page
  2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 07/10] migration: add bitmap for copied page Alexey Perevalov
@ 2017-05-24  6:57       ` Peter Xu
  2017-05-24  7:56         ` Alexey
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2017-05-24  6:57 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, eblake, dgilbert

On Tue, May 23, 2017 at 02:31:08PM +0300, Alexey Perevalov wrote:
> > This patch adds the ability to track down already copied
> > pages; it's necessary for calculating vCPU block time in
> > the postcopy migration feature and maybe for restore after
> > postcopy migration failure.
> > 
> > Functions which work with RAMBlock are placed into ram.c,
> > because TARGET_WORDS_BIGENDIAN is poisoned in postcopy-ram.c -
> > hardware independent code.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> ---
>  include/migration/migration.h | 16 +++++++++++
>  migration/postcopy-ram.c      | 22 ++++++++++++---
>  migration/ram.c               | 63 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 97 insertions(+), 4 deletions(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 449cb07..4e05c83 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -101,6 +101,20 @@ struct MigrationIncomingState {
>      LoadStateEntry_Head loadvm_handlers;
>  
>      /*
> > +     * bitmap indicates whether a page was copied,
> > +     * based on ramblock offset;
> > +     * now it is used only for blocktime calculation in
> > +     * postcopy migration, so the lifetime of this entry is
> > +     * from when the user requested blocktime calculation
> > +     * till the end of postcopy migration;
> > +     * as an example it could represent the following memory map
> > +     * ___________________________________
> > +     * |4k pages | hugepages | 4k pages

Can we really do this?

The problem is that, AFAIU, the migration stream sends pages only in target
page size, even for huge pages... so even for huge pages wouldn't we
still need a per-TARGET_PAGE_SIZE bitmap?

(Please refer to ram_state_init() on init of RAMBlock.bmap)

> +     *
> +     * */
> +    unsigned long *copied_pages;
> +
> +    /*
>       * PostcopyBlocktimeContext to keep information for postcopy
>       * live migration, to calculate vCPU block time
>       * */
> @@ -279,6 +293,8 @@ int migrate_compress_threads(void);
>  int migrate_decompress_threads(void);
>  bool migrate_use_events(void);
>  bool migrate_postcopy_blocktime(void);
> +unsigned long int get_copied_bit_offset(ram_addr_t addr);
> +unsigned long int *copied_pages_bitmap_new(void);
>  
>  /* Sending on the return path - generic and then for each message type */
>  void migrate_send_rp_message(MigrationIncomingState *mis,
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 5435a40..d647769 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -94,23 +94,34 @@ static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx)
>  
>  static void postcopy_migration_cb(Notifier *n, void *data)
>  {
> -    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
> -                                               postcopy_notifier);
>      MigrationState *s = data;
>      if (migration_has_finished(s) || migration_has_failed(s)) {
> +        MigrationIncomingState *mis = migration_incoming_get_current();
> +        PostcopyBlocktimeContext *ctx = mis->blocktime_ctx;
> +
> +        if (!ctx) {
> +            return;
> +        }

If we are in postcopy_migration_cb() already, then we have registered
the notifiers, then... could this (!ctx) really happen?

> +
>          g_free(ctx->page_fault_vcpu_time);
>          /* g_free is NULL robust */
>          ctx->page_fault_vcpu_time = NULL;
>          g_free(ctx->vcpu_addr);
>          ctx->vcpu_addr = NULL;
> +        g_free(mis->copied_pages);
> +        mis->copied_pages = NULL;
>      }
>  }
>  
>  static void migration_exit_cb(Notifier *n, void *data)
>  {
> -    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
> -                                               exit_notifier);
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +    PostcopyBlocktimeContext *ctx = mis->blocktime_ctx;
> +    if (!ctx) {
> +        return;
> +    }

I failed to figure out why this was touched up...

Thanks,

>      destroy_blocktime_context(ctx);
> +    mis->blocktime_ctx = NULL;
>  }
>  
>  static struct PostcopyBlocktimeContext *blocktime_context_new(void)
> @@ -227,6 +238,9 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
>              mis->blocktime_ctx = blocktime_context_new();
>          }
>  
> +        if (!mis->copied_pages) {
> +            mis->copied_pages = copied_pages_bitmap_new();
> +        }
>          asked_features |= UFFD_FEATURE_THREAD_ID;
>      }
>  #endif
> diff --git a/migration/ram.c b/migration/ram.c
> index f59fdd4..1abb6bb 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2661,6 +2661,69 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>      return ret;
>  }
>  
> +static unsigned long get_total_bits_per_page(ram_addr_t mem_length,
> +        size_t page_size)
> +{
> +    unsigned long page_size_bit = find_last_bit((unsigned long *)&page_size,
> +            BITS_PER_LONG);
> +    unsigned long total_bits = mem_length >> page_size_bit;
> +    if (mem_length % page_size) {
> +        total_bits += 1;
> +    }
> +    return total_bits;
> +}
> +
> +/*
> + * this function allocates bitmap for copied pages,
> + * also it calculates
> + * how many entries do we need
> + * */
> +unsigned long int *copied_pages_bitmap_new(void)
> +{
> +    RAMBlock *block;
> +    unsigned long int total_bits = 0;
> +
> +    rcu_read_lock();
> +    RAMBLOCK_FOREACH(block) {
> +        /* in general case used_length may not be aligned
> +         * by page_size */
> +
> +        total_bits += get_total_bits_per_page(block->used_length,
> +                block->page_size);
> +    }
> +    rcu_read_unlock();
> +
> +    return bitmap_new(total_bits);
> +}
> +
> +unsigned long int get_copied_bit_offset(ram_addr_t addr)
> +{
> +    RAMBlock *block;
> +    unsigned long int iter_bit = 0;
> +
> +    rcu_read_lock();
> +    RAMBLOCK_FOREACH(block) {
> +        /* in general case used_length may not be aligned
> +         * by page_size */
> +        if (block->host == NULL) {
> +            continue;
> +        }
> +        if (addr - (ram_addr_t)block->host < block->max_length) {
> +            unsigned long page_size_bit = find_last_bit(
> +                    (unsigned long *)&block->page_size,
> +                    BITS_PER_LONG);
> +            ram_addr_t offset = addr - (ram_addr_t)block->host;
> +            iter_bit += offset >> page_size_bit;
> +            break;
> +        }
> +        iter_bit += get_total_bits_per_page(block->used_length,
> +                block->page_size);
> +    }
> +    rcu_read_unlock();
> +
> +    return iter_bit;
> +}
> +
>  static SaveVMHandlers savevm_ram_handlers = {
>      .save_live_setup = ram_save_setup,
>      .save_live_iterate = ram_save_iterate,
> -- 
> 1.8.3.1
> 

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side
  2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side Alexey Perevalov
@ 2017-05-24  7:53       ` Peter Xu
  2017-05-24  9:37         ` Alexey
  2017-06-01 10:57       ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 39+ messages in thread
From: Peter Xu @ 2017-05-24  7:53 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, eblake, dgilbert

On Tue, May 23, 2017 at 02:31:09PM +0300, Alexey Perevalov wrote:
> This patch provides blocktime calculation per vCPU,
> as a summary and as an overlapped value for all vCPUs.
> 
> This approach was suggested by Peter Xu as an improvement of the
> previous approach, where QEMU kept a tree with the faulted page address and a cpu bitmask
> in it. Now QEMU keeps an array with the faulted page address as value and the vCPU
> as index. It helps to find the proper vCPU at UFFD_COPY time. It also keeps
> a list of blocktime per vCPU (which could be traced with page_fault_addr).
> 
> Blocktime will not be calculated if the postcopy_blocktime field of
> MigrationIncomingState wasn't initialized.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> ---
>  migration/postcopy-ram.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++-
>  migration/trace-events   |   5 ++-
>  2 files changed, 105 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index d647769..e70c44b 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -23,6 +23,7 @@
>  #include "postcopy-ram.h"
>  #include "sysemu/sysemu.h"
>  #include "sysemu/balloon.h"
> +#include <sys/param.h>
>  #include "qemu/error-report.h"
>  #include "trace.h"
>  
> @@ -577,6 +578,101 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
>      return 0;
>  }
>  
> +static int get_mem_fault_cpu_index(uint32_t pid)
> +{
> +    CPUState *cpu_iter;
> +
> +    CPU_FOREACH(cpu_iter) {
> +        if (cpu_iter->thread_id == pid) {
> +            return cpu_iter->cpu_index;
> +        }
> +    }
> +    trace_get_mem_fault_cpu_index(pid);
> +    return -1;
> +}
> +
> +static void mark_postcopy_blocktime_begin(uint64_t addr, uint32_t ptid,
> +        RAMBlock *rb)
> +{
> +    int cpu;
> +    unsigned long int nr_bit;
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
> +    int64_t now_ms;
> +
> +    if (!dc || ptid == 0) {
> +        return;
> +    }
> +    cpu = get_mem_fault_cpu_index(ptid);
> +    if (cpu < 0) {
> +        return;
> +    }
> +    nr_bit = get_copied_bit_offset(addr);
> +    if (test_bit(nr_bit, mis->copied_pages)) {
> +        return;
> +    }
> +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +    if (dc->vcpu_addr[cpu] == 0) {
> +        atomic_inc(&dc->smp_cpus_down);
> +    }
> +
> +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
> +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
> +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);

Looks like this is not what you and Dave have discussed?

(Btw, sorry to have not followed the thread recently, so I just went
 over the discussion again...)

What I see that Dave suggested is (I copied from Dave's email):

blocktime_start:
   set CPU stall address
   check bitmap entry
     if set then zero stall-address

While here it is:

blocktime_start:
   check bitmap entry
     if set then return
   set CPU stall address

I don't think the current version can really solve the race condition. See
this possible sequence:

       receive-thread             fault-thread 
       --------------             ------------
                                  blocktime_start
                                    check bitmap entry,
                                      if set then return
       blocktime_end
         set bitmap entry
         read CPU stall address,
           if none-0 then zero it
                                    set CPU stall address [1]
        
Then imho the address set at [1] will stay stalled forever.

I think we should follow exactly what Dave has suggested.

And.. on second thought, I am afraid even this would not cover all the
race conditions. What if we take the UFFDIO_COPY ioctl into account?
AFAIU after UFFDIO_COPY the faulted vcpu can be running again, then
the question is, can it quickly trigger another page fault?

Firstly, a workable sequence is (adding UFFDIO_COPY ioctl in, and
showing vcpu-thread X as well):

  receive-thread       fault-thread        vcpu-thread X
  --------------       ------------        -------------
                                           fault at addr A1
                       fault_addr[X]=A1
  UFFDIO_COPY page A1
  check fault_addr[X] with A1
    if match, clear fault_addr[X]
                                           vcpu X starts

This is fine.

While since "vcpu X starts" can be right after UFFDIO_COPY, can this
be possible?

  receive-thread       fault-thread        vcpu-thread X
  --------------       ------------        -------------
                                           fault at addr A1
                       fault_addr[X]=A1
  UFFDIO_COPY page A1
                                           vcpu X starts
                                           fault at addr A2
                       fault_addr[X]=A2
  check fault_addr[X] with A1
    if match, clear fault_addr[X]
        ^
        |
        +---------- here it will not match since now fault_addr[X]==A2

Then looks like fault_addr[X], which is currently A1, will stall
again?

(I feel like finally we may need something like a per-cpu lock... or I
 must have missed something)

> +
> +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
> +            cpu);
> +}
> +
> +static void mark_postcopy_blocktime_end(uint64_t addr)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
> +    int i, affected_cpu = 0;
> +    int64_t now_ms;
> +    bool vcpu_total_blocktime = false;
> +    unsigned long int nr_bit;
> +
> +    if (!dc) {
> +        return;
> +    }
> +    /* mark that page as copied */
> +    nr_bit = get_copied_bit_offset(addr);
> +    set_bit_atomic(nr_bit, mis->copied_pages);
> +
> +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
> +    /* lookup cpu, to clear it,
> +     * that algorithm looks straightforward, but it's not
> +     * optimal, more optimal algorithm is keeping tree or hash
> +     * where key is address value is a list of  */
> +    for (i = 0; i < smp_cpus; i++) {
> +        uint64_t vcpu_blocktime = 0;
> +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
> +            continue;
> +        }
> +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);

Why use *__nocheck() rather than atomic_xchg() or even atomic_read()?
Same thing happened a lot in current patch.

Thanks,

> +        vcpu_blocktime = now_ms -
> +            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
> +        affected_cpu += 1;
> +        /* we need to know is that mark_postcopy_end was due to
> +         * faulted page, another possible case it's prefetched
> +         * page and in that case we shouldn't be here */
> +        if (!vcpu_total_blocktime &&
> +            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
> +            vcpu_total_blocktime = true;
> +        }
> +        /* continue cycle, due to one page could affect several vCPUs */
> +        dc->vcpu_blocktime[i] += vcpu_blocktime;
> +    }
> +
> +    atomic_sub(&dc->smp_cpus_down, affected_cpu);
> +    if (vcpu_total_blocktime) {
> +        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
> +    }
> +    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
> +}
> +
>  /*
>   * Handle faults detected by the USERFAULT markings
>   */
> @@ -654,8 +750,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
>          rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
>          trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
>                                                  qemu_ram_get_idstr(rb),
> -                                                rb_offset);
> +                                                rb_offset,
> +                                                msg.arg.pagefault.feat.ptid);
>  
> +        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
> +                msg.arg.pagefault.feat.ptid, rb);
>          /*
>           * Send the request to the source - we want to request one
>           * of our host page sizes (which is >= TPS)
> @@ -750,6 +849,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>  
>          return -e;
>      }
> +    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
>  
>      trace_postcopy_place_page(host);
>      return 0;
> diff --git a/migration/trace-events b/migration/trace-events
> index 5b8ccf3..7bdadbb 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -112,6 +112,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>  process_incoming_migration_co_postcopy_end_main(void) ""
>  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
>  migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
> +mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
> +mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
>  
>  # migration/rdma.c
>  qemu_rdma_accept_incoming_migration(void) ""
> @@ -188,7 +190,7 @@ postcopy_ram_enable_notify(void) ""
>  postcopy_ram_fault_thread_entry(void) ""
>  postcopy_ram_fault_thread_exit(void) ""
>  postcopy_ram_fault_thread_quit(void) ""
> -postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
> +postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
>  postcopy_ram_incoming_cleanup_closeuf(void) ""
>  postcopy_ram_incoming_cleanup_entry(void) ""
>  postcopy_ram_incoming_cleanup_exit(void) ""
> @@ -197,6 +199,7 @@ save_xbzrle_page_skipping(void) ""
>  save_xbzrle_page_overflow(void) ""
>  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
>  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> +get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
>  
>  # migration/exec.c
>  migration_exec_outgoing(const char *cmd) "cmd=%s"
> -- 
> 1.8.3.1
> 

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH V6 07/10] migration: add bitmap for copied page
  2017-05-24  6:57       ` Peter Xu
@ 2017-05-24  7:56         ` Alexey
  2017-05-24 12:01           ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey @ 2017-05-24  7:56 UTC (permalink / raw)
  To: Peter Xu; +Cc: i.maximets, qemu-devel, dgilbert

On Wed, May 24, 2017 at 02:57:36PM +0800, Peter Xu wrote:
> On Tue, May 23, 2017 at 02:31:08PM +0300, Alexey Perevalov wrote:
> > > This patch adds the ability to track down already copied
> > > pages; it's necessary for calculating vCPU block time in
> > > the postcopy migration feature and maybe for restore after
> > > postcopy migration failure.
> > > 
> > > Functions which work with RAMBlock are placed into ram.c,
> > > because TARGET_WORDS_BIGENDIAN is poisoned in postcopy-ram.c -
> > > hardware independent code.
> > 
> > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > ---
> >  include/migration/migration.h | 16 +++++++++++
> >  migration/postcopy-ram.c      | 22 ++++++++++++---
> >  migration/ram.c               | 63 +++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 97 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index 449cb07..4e05c83 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -101,6 +101,20 @@ struct MigrationIncomingState {
> >      LoadStateEntry_Head loadvm_handlers;
> >  
> >      /*
> > > +     * bitmap indicates whether a page was copied,
> > > +     * based on ramblock offset;
> > > +     * now it is used only for blocktime calculation in
> > > +     * postcopy migration, so the lifetime of this entry is
> > > +     * from when the user requested blocktime calculation
> > > +     * till the end of postcopy migration;
> > > +     * as an example it could represent the following memory map
> > > +     * ___________________________________
> > > +     * |4k pages | hugepages | 4k pages
> 
> Can we really do this?

> 
> The problem is AFAIU migration stream is sending pages only in target
> page size, even for huge pages... so even for huge pages we should
> still need per TARGET_PAGE_SIZE bitmap?
Sending, maybe, but copying through the userfault fd is done in hugepage size
in the case of a hugetlbfs memory backend.
> 
> (Please refer to ram_state_init() on init of RAMBlock.bmap)
I thought about using a bitmap per ramblock, but decided not to,
for the following reasons:
	1. Some ramblocks are only 4k in size; for such a ramblock a
	per-block bitmap is not optimal.
	2. The copied pages bitmap is an attribute of incoming migration,
	but ramblock is a general structure. Although bmap is about
	dirty pages of precopy migration... hmm, RAMBlock also
	contains unsentmap, and that one is for postcopy.
> 
> > +     *
> > +     * */
> > +    unsigned long *copied_pages;
> > +
> > +    /*
> >       * PostcopyBlocktimeContext to keep information for postcopy
> >       * live migration, to calculate vCPU block time
> >       * */
> > @@ -279,6 +293,8 @@ int migrate_compress_threads(void);
> >  int migrate_decompress_threads(void);
> >  bool migrate_use_events(void);
> >  bool migrate_postcopy_blocktime(void);
> > +unsigned long int get_copied_bit_offset(ram_addr_t addr);
> > +unsigned long int *copied_pages_bitmap_new(void);
> >  
> >  /* Sending on the return path - generic and then for each message type */
> >  void migrate_send_rp_message(MigrationIncomingState *mis,
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 5435a40..d647769 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -94,23 +94,34 @@ static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx)
> >  
> >  static void postcopy_migration_cb(Notifier *n, void *data)
> >  {
> > -    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
> > -                                               postcopy_notifier);
> >      MigrationState *s = data;
> >      if (migration_has_finished(s) || migration_has_failed(s)) {
> > +        MigrationIncomingState *mis = migration_incoming_get_current();
> > +        PostcopyBlocktimeContext *ctx = mis->blocktime_ctx;
> > +
> > +        if (!ctx) {
> > +            return;
> > +        }
> 
> If we are in postcopy_migration_cb() already, then we have registered
> the notifiers, then... could this (!ctx) really happen?
> 
OK, maybe it's an unnecessary sanity check.

> > +
> >          g_free(ctx->page_fault_vcpu_time);
> >          /* g_free is NULL robust */
> >          ctx->page_fault_vcpu_time = NULL;
> >          g_free(ctx->vcpu_addr);
> >          ctx->vcpu_addr = NULL;
> > +        g_free(mis->copied_pages);
> > +        mis->copied_pages = NULL;
> >      }
> >  }
> >  
> >  static void migration_exit_cb(Notifier *n, void *data)
> >  {
> > -    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
> > -                                               exit_notifier);
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +    PostcopyBlocktimeContext *ctx = mis->blocktime_ctx;
> > +    if (!ctx) {
> > +        return;
> > +    }
> 
> I failed to find out why touched this up...
> 
> Thanks,
> 
> >      destroy_blocktime_context(ctx);
> > +    mis->blocktime_ctx = NULL;
> >  }
> >  
> >  static struct PostcopyBlocktimeContext *blocktime_context_new(void)
> > @@ -227,6 +238,9 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
> >              mis->blocktime_ctx = blocktime_context_new();
> >          }
> >  
> > +        if (!mis->copied_pages) {
> > +            mis->copied_pages = copied_pages_bitmap_new();
> > +        }
> >          asked_features |= UFFD_FEATURE_THREAD_ID;
> >      }
> >  #endif
> > diff --git a/migration/ram.c b/migration/ram.c
> > index f59fdd4..1abb6bb 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -2661,6 +2661,69 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> >      return ret;
> >  }
> >  
> > +static unsigned long get_total_bits_per_page(ram_addr_t mem_length,
> > +        size_t page_size)
> > +{
> > +    unsigned long page_size_bit = find_last_bit((unsigned long *)&page_size,
> > +            BITS_PER_LONG);
> > +    unsigned long total_bits = mem_length >> page_size_bit;
> > +    if (mem_length % page_size) {
> > +        total_bits += 1;
> > +    }
> > +    return total_bits;
> > +}
> > +
> > +/*
> > + * this function allocates bitmap for copied pages,
> > + * also it calculates
> > + * how many entries do we need
> > + * */
> > +unsigned long int *copied_pages_bitmap_new(void)
> > +{
> > +    RAMBlock *block;
> > +    unsigned long int total_bits = 0;
> > +
> > +    rcu_read_lock();
> > +    RAMBLOCK_FOREACH(block) {
> > +        /* in general case used_length may not be aligned
> > +         * by page_size */
> > +
> > +        total_bits += get_total_bits_per_page(block->used_length,
> > +                block->page_size);
> > +    }
> > +    rcu_read_unlock();
> > +
> > +    return bitmap_new(total_bits);
> > +}
> > +
> > +unsigned long int get_copied_bit_offset(ram_addr_t addr)
> > +{
> > +    RAMBlock *block;
> > +    unsigned long int iter_bit = 0;
> > +
> > +    rcu_read_lock();
> > +    RAMBLOCK_FOREACH(block) {
> > +        /* in general case used_length may not be aligned
> > +         * by page_size */
> > +        if (block->host == NULL) {
> > +            continue;
> > +        }
> > +        if (addr - (ram_addr_t)block->host < block->max_length) {
> > +            unsigned long page_size_bit = find_last_bit(
> > +                    (unsigned long *)&block->page_size,
> > +                    BITS_PER_LONG);
> > +            ram_addr_t offset = addr - (ram_addr_t)block->host;
> > +            iter_bit += offset >> page_size_bit;
> > +            break;
> > +        }
> > +        iter_bit += get_total_bits_per_page(block->used_length,
> > +                block->page_size);
> > +    }
> > +    rcu_read_unlock();
> > +
> > +    return iter_bit;
> > +}
> > +
> >  static SaveVMHandlers savevm_ram_handlers = {
> >      .save_live_setup = ram_save_setup,
> >      .save_live_iterate = ram_save_iterate,
> > -- 
> > 1.8.3.1
> > 
> 
> -- 
> Peter Xu
> 

-- 

BR
Alexey

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side
  2017-05-24  7:53       ` Peter Xu
@ 2017-05-24  9:37         ` Alexey
  2017-05-24 11:22           ` Peter Xu
  2017-06-01 10:50           ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 39+ messages in thread
From: Alexey @ 2017-05-24  9:37 UTC (permalink / raw)
  To: Peter Xu; +Cc: i.maximets, qemu-devel, dgilbert

On Wed, May 24, 2017 at 03:53:05PM +0800, Peter Xu wrote:
> On Tue, May 23, 2017 at 02:31:09PM +0300, Alexey Perevalov wrote:
> > This patch provides blocktime calculation per vCPU,
> > as a summary and as a overlapped value for all vCPUs.
> > 
> > This approach was suggested by Peter Xu, as an improvements of
> > previous approch where QEMU kept tree with faulted page address and cpus bitmask
> > in it. Now QEMU is keeping array with faulted page address as value and vCPU
> > as index. It helps to find proper vCPU at UFFD_COPY time. Also it keeps
> > list for blocktime per vCPU (could be traced with page_fault_addr)
> > 
> > Blocktime will not calculated if postcopy_blocktime field of
> > MigrationIncomingState wasn't initialized.
> > 
> > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > ---
> >  migration/postcopy-ram.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++-
> >  migration/trace-events   |   5 ++-
> >  2 files changed, 105 insertions(+), 2 deletions(-)
> > 
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index d647769..e70c44b 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -23,6 +23,7 @@
> >  #include "postcopy-ram.h"
> >  #include "sysemu/sysemu.h"
> >  #include "sysemu/balloon.h"
> > +#include <sys/param.h>
> >  #include "qemu/error-report.h"
> >  #include "trace.h"
> >  
> > @@ -577,6 +578,101 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> >      return 0;
> >  }
> >  
> > +static int get_mem_fault_cpu_index(uint32_t pid)
> > +{
> > +    CPUState *cpu_iter;
> > +
> > +    CPU_FOREACH(cpu_iter) {
> > +        if (cpu_iter->thread_id == pid) {
> > +            return cpu_iter->cpu_index;
> > +        }
> > +    }
> > +    trace_get_mem_fault_cpu_index(pid);
> > +    return -1;
> > +}
> > +
> > +static void mark_postcopy_blocktime_begin(uint64_t addr, uint32_t ptid,
> > +        RAMBlock *rb)
> > +{
> > +    int cpu;
> > +    unsigned long int nr_bit;
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
> > +    int64_t now_ms;
> > +
> > +    if (!dc || ptid == 0) {
> > +        return;
> > +    }
> > +    cpu = get_mem_fault_cpu_index(ptid);
> > +    if (cpu < 0) {
> > +        return;
> > +    }
> > +    nr_bit = get_copied_bit_offset(addr);
> > +    if (test_bit(nr_bit, mis->copied_pages)) {
> > +        return;
> > +    }
> > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > +    if (dc->vcpu_addr[cpu] == 0) {
> > +        atomic_inc(&dc->smp_cpus_down);
> > +    }
> > +
> > +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
> > +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
> > +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
> 
> Looks like this is not what you and Dave have discussed?
> 
> (Btw, sorry to have not followed the thread recently, so I just went
>  over the discussion again...)
> 
> What I see that Dave suggested is (I copied from Dave's email):
> 
> blocktime_start:
>    set CPU stall address
>    check bitmap entry
>      if set then zero stall-address
> 
> While here it is:
> 
> blocktime_start:
>    check bitmap entry
>      if set then return
>    set CPU stall address
> 
> I don't think current version can really solve the risk condition. See
> this possible sequence:
> 
>        receive-thread             fault-thread 
>        --------------             ------------
>                                   blocktime_start
>                                     check bitmap entry,
>                                       if set then return
>        blocktime_end
>          set bitmap entry
>          read CPU stall address,
>            if none-0 then zero it
>                                     set CPU stall address [1]
>         
> Then imho the address set at [1] will be stall again until forever.
> 
agree, my check is in the incorrect order

> I think we should follow exactly what Dave has suggested.
> 
> And.. after a second thought, I am afraid even this would not satisfy
> all risk conditions. What if we consider the UFFDIO_COPY ioctl in?
> AFAIU after UFFDIO_COPY the faulted vcpu can be running again, then
> the question is, can it quickly trigger another page fault?
>
yes, it can

> Firstly, a workable sequence is (adding UFFDIO_COPY ioctl in, and
> showing vcpu-thread X as well):
> 
>   receive-thread       fault-thread        vcpu-thread X
>   --------------       ------------        -------------
>                                            fault at addr A1
>                        fault_addr[X]=A1
>   UFFDIO_COPY page A1
>   check fault_addr[X] with A1
>     if match, clear fault_addr[X]
>                                            vcpu X starts
> 
> This is fine.
> 
> While since "vcpu X starts" can be right after UFFDIO_COPY, can this
> be possible?
The picture above isn't possible, because mark_postcopy_blocktime_end
is called right after the ioctl, and the vCPU wakes up inside the
ioctl, so the fault_addr check happens after vcpu X starts.

> 
>   receive-thread       fault-thread        vcpu-thread X
>   --------------       ------------        -------------
>                                            fault at addr A1
>                        fault_addr[X]=A1
>   UFFDIO_COPY page A1
>                                            vcpu X starts
>                                            fault at addr A2
>                        fault_addr[X]=A2
>   check fault_addr[X] with A1
>     if match, clear fault_addr[X]
>         ^
>         |
>         +---------- here it will not match since now fault_addr[X]==A2
> 
> Then looks like fault_addr[X], which is currently A1, will stall
> again?

It will be another address (A2), but probably the same vCPU, and if
blocktime_start is called before blocktime_end in that case, we lose
the block time for page A1. The page address is the unique key in this
process, not the vCPU ;)
Maybe it would be reasonable to wake the vCPU up only after blocktime_end.



> 
> (I feel like finally we may need something like a per-cpu lock... or I
>  must have missed something)
I think not, because any extra locking on this path directly extends the vCPU block time, which is critical in this process.
> 
> > +
> > +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
> > +            cpu);
> > +}
> > +
> > +static void mark_postcopy_blocktime_end(uint64_t addr)
> > +{
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
> > +    int i, affected_cpu = 0;
> > +    int64_t now_ms;
> > +    bool vcpu_total_blocktime = false;
> > +    unsigned long int nr_bit;
> > +
> > +    if (!dc) {
> > +        return;
> > +    }
> > +    /* mark that page as copied */
> > +    nr_bit = get_copied_bit_offset(addr);
> > +    set_bit_atomic(nr_bit, mis->copied_pages);
> > +
> > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > +
> > +    /* lookup cpu, to clear it,
> > +     * that algorithm looks straighforward, but it's not
> > +     * optimal, more optimal algorithm is keeping tree or hash
> > +     * where key is address value is a list of  */
> > +    for (i = 0; i < smp_cpus; i++) {
> > +        uint64_t vcpu_blocktime = 0;
> > +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
> > +            continue;
> > +        }
> > +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
> 
> Why use *__nocheck() rather than atomic_xchg() or even atomic_read()?
> Same thing happened a lot in current patch.
atomic_read/atomic_xchg have a build-time check (for mingw, and for
gcc on arm32):

QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *));

It prevents using 64-bit atomic operations on a 32-bit architecture. I
think the restriction is really needed only for mingw, and
postcopy-ram.c isn't compiled for mingw anyway. On other 32-bit
platforms, as far as I know, clang/gcc allow 8-byte variables in the
builtin atomic operations; arm32 allows it in the builtins too. But
QEMU on arm32 still has that sanity check, and I think that's a bug,
so I just worked around it. Maybe it would have been better to fix it.

I tested in docker, using the following command:
make docker-test-build@debian-armhf-cross

and when I used atomic_xchg, I got the following error:

/tmp/qemu-test/src/migration/postcopy-ram.c: In function
'mark_postcopy_blocktime_begin':
/tmp/qemu-test/src/include/qemu/compiler.h:86:30: error: static
assertion failed: "not expecting: sizeof(*&dc->vcpu_addr[cpu]) >
sizeof(void *)"
 #define QEMU_BUILD_BUG_ON(x) _Static_assert(!(x), "not expecting: " #x)

I agree with you, but I think atomic.h needs to be fixed first, with
an additional #ifdef there.

And I didn't want to split 64-bit values into 32-bit halves, though
I've seen people doing that on the mailing list.

> 
> Thanks,
> 
> > +        vcpu_blocktime = now_ms -
> > +            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
> > +        affected_cpu += 1;
> > +        /* we need to know is that mark_postcopy_end was due to
> > +         * faulted page, another possible case it's prefetched
> > +         * page and in that case we shouldn't be here */
> > +        if (!vcpu_total_blocktime &&
> > +            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
> > +            vcpu_total_blocktime = true;
> > +        }
> > +        /* continue cycle, due to one page could affect several vCPUs */
> > +        dc->vcpu_blocktime[i] += vcpu_blocktime;
> > +    }
> > +
> > +    atomic_sub(&dc->smp_cpus_down, affected_cpu);
> > +    if (vcpu_total_blocktime) {
> > +        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
> > +    }
> > +    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
> > +}
> > +
> >  /*
> >   * Handle faults detected by the USERFAULT markings
> >   */
> > @@ -654,8 +750,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
> >          rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
> >          trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
> >                                                  qemu_ram_get_idstr(rb),
> > -                                                rb_offset);
> > +                                                rb_offset,
> > +                                                msg.arg.pagefault.feat.ptid);
> >  
> > +        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
> > +                msg.arg.pagefault.feat.ptid, rb);
> >          /*
> >           * Send the request to the source - we want to request one
> >           * of our host page sizes (which is >= TPS)
> > @@ -750,6 +849,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> >  
> >          return -e;
> >      }
> > +    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
> >  
> >      trace_postcopy_place_page(host);
> >      return 0;
> > diff --git a/migration/trace-events b/migration/trace-events
> > index 5b8ccf3..7bdadbb 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -112,6 +112,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> >  process_incoming_migration_co_postcopy_end_main(void) ""
> >  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
> >  migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
> > +mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
> > +mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
> >  
> >  # migration/rdma.c
> >  qemu_rdma_accept_incoming_migration(void) ""
> > @@ -188,7 +190,7 @@ postcopy_ram_enable_notify(void) ""
> >  postcopy_ram_fault_thread_entry(void) ""
> >  postcopy_ram_fault_thread_exit(void) ""
> >  postcopy_ram_fault_thread_quit(void) ""
> > -postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
> > +postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
> >  postcopy_ram_incoming_cleanup_closeuf(void) ""
> >  postcopy_ram_incoming_cleanup_entry(void) ""
> >  postcopy_ram_incoming_cleanup_exit(void) ""
> > @@ -197,6 +199,7 @@ save_xbzrle_page_skipping(void) ""
> >  save_xbzrle_page_overflow(void) ""
> >  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> >  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> > +get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
> >  
> >  # migration/exec.c
> >  migration_exec_outgoing(const char *cmd) "cmd=%s"
> > -- 
> > 1.8.3.1
> > 
> 
> -- 
> Peter Xu
> 

-- 

BR
Alexey


* Re: [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side
  2017-05-24  9:37         ` Alexey
@ 2017-05-24 11:22           ` Peter Xu
  2017-05-24 11:37             ` Alexey Perevalov
  2017-06-01 10:07             ` Dr. David Alan Gilbert
  2017-06-01 10:50           ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 39+ messages in thread
From: Peter Xu @ 2017-05-24 11:22 UTC (permalink / raw)
  To: Alexey; +Cc: i.maximets, qemu-devel, dgilbert

On Wed, May 24, 2017 at 12:37:20PM +0300, Alexey wrote:
> On Wed, May 24, 2017 at 03:53:05PM +0800, Peter Xu wrote:
> > On Tue, May 23, 2017 at 02:31:09PM +0300, Alexey Perevalov wrote:
> > > This patch provides blocktime calculation per vCPU,
> > > as a summary and as a overlapped value for all vCPUs.
> > > 
> > > This approach was suggested by Peter Xu, as an improvements of
> > > previous approch where QEMU kept tree with faulted page address and cpus bitmask
> > > in it. Now QEMU is keeping array with faulted page address as value and vCPU
> > > as index. It helps to find proper vCPU at UFFD_COPY time. Also it keeps
> > > list for blocktime per vCPU (could be traced with page_fault_addr)
> > > 
> > > Blocktime will not calculated if postcopy_blocktime field of
> > > MigrationIncomingState wasn't initialized.
> > > 
> > > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > > ---
> > >  migration/postcopy-ram.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++-
> > >  migration/trace-events   |   5 ++-
> > >  2 files changed, 105 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > > index d647769..e70c44b 100644
> > > --- a/migration/postcopy-ram.c
> > > +++ b/migration/postcopy-ram.c
> > > @@ -23,6 +23,7 @@
> > >  #include "postcopy-ram.h"
> > >  #include "sysemu/sysemu.h"
> > >  #include "sysemu/balloon.h"
> > > +#include <sys/param.h>
> > >  #include "qemu/error-report.h"
> > >  #include "trace.h"
> > >  
> > > @@ -577,6 +578,101 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> > >      return 0;
> > >  }
> > >  
> > > +static int get_mem_fault_cpu_index(uint32_t pid)
> > > +{
> > > +    CPUState *cpu_iter;
> > > +
> > > +    CPU_FOREACH(cpu_iter) {
> > > +        if (cpu_iter->thread_id == pid) {
> > > +            return cpu_iter->cpu_index;
> > > +        }
> > > +    }
> > > +    trace_get_mem_fault_cpu_index(pid);
> > > +    return -1;
> > > +}
> > > +
> > > +static void mark_postcopy_blocktime_begin(uint64_t addr, uint32_t ptid,
> > > +        RAMBlock *rb)
> > > +{
> > > +    int cpu;
> > > +    unsigned long int nr_bit;
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
> > > +    int64_t now_ms;
> > > +
> > > +    if (!dc || ptid == 0) {
> > > +        return;
> > > +    }
> > > +    cpu = get_mem_fault_cpu_index(ptid);
> > > +    if (cpu < 0) {
> > > +        return;
> > > +    }
> > > +    nr_bit = get_copied_bit_offset(addr);
> > > +    if (test_bit(nr_bit, mis->copied_pages)) {
> > > +        return;
> > > +    }
> > > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > > +    if (dc->vcpu_addr[cpu] == 0) {
> > > +        atomic_inc(&dc->smp_cpus_down);
> > > +    }
> > > +
> > > +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
> > > +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
> > > +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
> > 
> > Looks like this is not what you and Dave have discussed?
> > 
> > (Btw, sorry to have not followed the thread recently, so I just went
> >  over the discussion again...)
> > 
> > What I see that Dave suggested is (I copied from Dave's email):
> > 
> > blocktime_start:
> >    set CPU stall address
> >    check bitmap entry
> >      if set then zero stall-address
> > 
> > While here it is:
> > 
> > blocktime_start:
> >    check bitmap entry
> >      if set then return
> >    set CPU stall address
> > 
> > I don't think current version can really solve the risk condition. See
> > this possible sequence:
> > 
> >        receive-thread             fault-thread 
> >        --------------             ------------
> >                                   blocktime_start
> >                                     check bitmap entry,
> >                                       if set then return
> >        blocktime_end
> >          set bitmap entry
> >          read CPU stall address,
> >            if none-0 then zero it
> >                                     set CPU stall address [1]
> >         
> > Then imho the address set at [1] will be stall again until forever.
> > 
> agree, I check is in incorrect order
> 
> > I think we should follow exactly what Dave has suggested.
> > 
> > And.. after a second thought, I am afraid even this would not satisfy
> > all risk conditions. What if we consider the UFFDIO_COPY ioctl in?
> > AFAIU after UFFDIO_COPY the faulted vcpu can be running again, then
> > the question is, can it quickly trigger another page fault?
> >
> yes, it can
> 
> > Firstly, a workable sequence is (adding UFFDIO_COPY ioctl in, and
> > showing vcpu-thread X as well):
> > 
> >   receive-thread       fault-thread        vcpu-thread X
> >   --------------       ------------        -------------
> >                                            fault at addr A1
> >                        fault_addr[X]=A1
> >   UFFDIO_COPY page A1
> >   check fault_addr[X] with A1
> >     if match, clear fault_addr[X]
> >                                            vcpu X starts
> > 
> > This is fine.
> > 
> > While since "vcpu X starts" can be right after UFFDIO_COPY, can this
> > be possible?
> Previous picture isn't possible, due to mark_postcopy_blocktime_end
> is being called right after ioctl, and vCPU is waking up
> inside ioctl, so check fault_addr will be after vcpu X starts.
> 
> > 
> >   receive-thread       fault-thread        vcpu-thread X
> >   --------------       ------------        -------------
> >                                            fault at addr A1
> >                        fault_addr[X]=A1
> >   UFFDIO_COPY page A1
> >                                            vcpu X starts
> >                                            fault at addr A2
> >                        fault_addr[X]=A2
> >   check fault_addr[X] with A1
> >     if match, clear fault_addr[X]
> >         ^
> >         |
> >         +---------- here it will not match since now fault_addr[X]==A2
> > 
> > Then looks like fault_addr[X], which is currently A1, will stall
> > again?
> 
> It will be another address(A2), but probably the same vCPU and if in
> this case blocktime_start will be called before blocktime_end we lost
> block time for page A1. Address of the page is unique key in this
> process, not vCPU ;)

I failed to understand the last sentence, anyway...

> Here maybe reasonable to wake up vCPU after blocktime_end.

... I guess this can solve the problem. :)

> 
> 
> 
> > 
> > (I feel like finally we may need something like a per-cpu lock... or I
> >  must have missed something)
> I think no, because locking time of the vCPU is critical in this process.

Yes.

> > 
> > > +
> > > +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
> > > +            cpu);
> > > +}
> > > +
> > > +static void mark_postcopy_blocktime_end(uint64_t addr)
> > > +{
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
> > > +    int i, affected_cpu = 0;
> > > +    int64_t now_ms;
> > > +    bool vcpu_total_blocktime = false;
> > > +    unsigned long int nr_bit;
> > > +
> > > +    if (!dc) {
> > > +        return;
> > > +    }
> > > +    /* mark that page as copied */
> > > +    nr_bit = get_copied_bit_offset(addr);
> > > +    set_bit_atomic(nr_bit, mis->copied_pages);
> > > +
> > > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > > +
> > > +    /* lookup cpu, to clear it,
> > > +     * that algorithm looks straighforward, but it's not
> > > +     * optimal, more optimal algorithm is keeping tree or hash
> > > +     * where key is address value is a list of  */
> > > +    for (i = 0; i < smp_cpus; i++) {
> > > +        uint64_t vcpu_blocktime = 0;
> > > +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
> > > +            continue;
> > > +        }
> > > +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
> > 
> > Why use *__nocheck() rather than atomic_xchg() or even atomic_read()?
> > Same thing happened a lot in current patch.
> atomic_read/atomic_xchg for mingw/(gcc on arm32) has build time check,
> 
> QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *));
> 
> it prevents using 64 atomic operation on 32 architecture, just mingw I
> think, but postcopy-ram.c isn't compiling for mingw.
> On other 32 platforms as I know clang/gcc allow to use 8 bytes
> long variables in built atomic operations. In arm32 it allows in
> builtin. But QEMU on arm32 still
> has that sanity check, and I think it's bug, so I just worked it around.
> Maybe better was to fix it.
> 
> I tested in docker, using follow command:
> make docker-test-build@debian-armhf-cross
> 
> And got following error
> 
> /tmp/qemu-test/src/migration/postcopy-ram.c: In function
> 'mark_postcopy_blocktime_begin':
> /tmp/qemu-test/src/include/qemu/compiler.h:86:30: error: static
> assertion failed: "not expecting: sizeof(*&dc->vcpu_addr[cpu]) >
> sizeof(void *)"
>  #define QEMU_BUILD_BUG_ON(x) _Static_assert(!(x), "not expecting: " #x)
> 
> when I used atomic_xchg,
> I agree with you, but I think need to fix atomic.h firstly and add additional
> #ifdef there.
> 
> And I didn't want to split 64 bit values onto 32 bit values, but I saw
> in mailing list people are doing it.

If this is a bug, then I guess the best way is to fix it. But before
that - can a 32bit system really do 64bit atomic ops? Is it really a
bug at all?

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH V6 04/10] migration: split ufd_version_check onto receive/request features part
  2017-05-24  6:45         ` Alexey
@ 2017-05-24 11:33           ` Peter Xu
  2017-05-24 11:47             ` Alexey Perevalov
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2017-05-24 11:33 UTC (permalink / raw)
  To: Alexey; +Cc: i.maximets, qemu-devel, dgilbert

On Wed, May 24, 2017 at 09:45:48AM +0300, Alexey wrote:

[...]

> > 
> > >          return false;
> > >      }
> > >  
> > > -    ioctl_mask = (__u64)1 << _UFFDIO_REGISTER |
> > > -                 (__u64)1 << _UFFDIO_UNREGISTER;
> > > +    ioctl_mask = 1 << _UFFDIO_REGISTER |
> > > +                 1 << _UFFDIO_UNREGISTER;
> > 
> > Could I ask why we explicitly removed (__u64) here? Since I see the
> > old one better.
> Maybe my change isn't robust; in any case, thanks for pointing it
> out. But now I think a constant such as UFFD_API_IOCTLS should be
> used here instead of ioctl_mask; the whole point of that check is to
> make sure the kernel returned no error and accepted the features.
> OK, from the beginning:
> 
> if we request an unsupported feature (we check that beforehand), or
> the internal state of the userfault ctx inside the kernel isn't
> UFFD_STATE_WAIT_API (for example, we are in the middle of the
> copying process)
> 	the ioctl should fail with EINVAL and the ioctls field in
> 	uffdio_api will be empty
> 
> Right now I think the ioctls check for UFFD_API is not necessary.
> We just say here that we will use _UFFDIO_REGISTER and
> _UFFDIO_UNREGISTER, but the kernel supports those unconditionally,
> in contrast with the UFFDIO_REGISTER ioctl, which also returns an
> ioctls field in the uffdio_register structure, where there can be
> variations.

Sorry I didn't get the point...

AFAIU, the (__u64) cast here makes the constant "1" a 64-bit value,
just like the "1ULL << 40" we normally write for bit shifts. I liked
it because even if _UFFDIO_REGISTER were defined as >= 32, the shift
would not overflow; by default the constant "1" is typed "int" (which
is 32 bits wide).

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side
  2017-05-24 11:22           ` Peter Xu
@ 2017-05-24 11:37             ` Alexey Perevalov
  2017-06-01 10:07             ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-24 11:37 UTC (permalink / raw)
  To: Peter Xu; +Cc: i.maximets, qemu-devel, dgilbert

On 05/24/2017 02:22 PM, Peter Xu wrote:
> On Wed, May 24, 2017 at 12:37:20PM +0300, Alexey wrote:
>> On Wed, May 24, 2017 at 03:53:05PM +0800, Peter Xu wrote:
>>> On Tue, May 23, 2017 at 02:31:09PM +0300, Alexey Perevalov wrote:
>>>> This patch provides blocktime calculation per vCPU,
>>>> as a summary and as a overlapped value for all vCPUs.
>>>>
>>>> This approach was suggested by Peter Xu, as an improvements of
>>>> previous approch where QEMU kept tree with faulted page address and cpus bitmask
>>>> in it. Now QEMU is keeping array with faulted page address as value and vCPU
>>>> as index. It helps to find proper vCPU at UFFD_COPY time. Also it keeps
>>>> list for blocktime per vCPU (could be traced with page_fault_addr)
>>>>
>>>> Blocktime will not calculated if postcopy_blocktime field of
>>>> MigrationIncomingState wasn't initialized.
>>>>
>>>> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
>>>> ---
>>>>   migration/postcopy-ram.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++-
>>>>   migration/trace-events   |   5 ++-
>>>>   2 files changed, 105 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>>>> index d647769..e70c44b 100644
>>>> --- a/migration/postcopy-ram.c
>>>> +++ b/migration/postcopy-ram.c
>>>> @@ -23,6 +23,7 @@
>>>>   #include "postcopy-ram.h"
>>>>   #include "sysemu/sysemu.h"
>>>>   #include "sysemu/balloon.h"
>>>> +#include <sys/param.h>
>>>>   #include "qemu/error-report.h"
>>>>   #include "trace.h"
>>>>   
>>>> @@ -577,6 +578,101 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
>>>>       return 0;
>>>>   }
>>>>   
>>>> +static int get_mem_fault_cpu_index(uint32_t pid)
>>>> +{
>>>> +    CPUState *cpu_iter;
>>>> +
>>>> +    CPU_FOREACH(cpu_iter) {
>>>> +        if (cpu_iter->thread_id == pid) {
>>>> +            return cpu_iter->cpu_index;
>>>> +        }
>>>> +    }
>>>> +    trace_get_mem_fault_cpu_index(pid);
>>>> +    return -1;
>>>> +}
>>>> +
>>>> +static void mark_postcopy_blocktime_begin(uint64_t addr, uint32_t ptid,
>>>> +        RAMBlock *rb)
>>>> +{
>>>> +    int cpu;
>>>> +    unsigned long int nr_bit;
>>>> +    MigrationIncomingState *mis = migration_incoming_get_current();
>>>> +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
>>>> +    int64_t now_ms;
>>>> +
>>>> +    if (!dc || ptid == 0) {
>>>> +        return;
>>>> +    }
>>>> +    cpu = get_mem_fault_cpu_index(ptid);
>>>> +    if (cpu < 0) {
>>>> +        return;
>>>> +    }
>>>> +    nr_bit = get_copied_bit_offset(addr);
>>>> +    if (test_bit(nr_bit, mis->copied_pages)) {
>>>> +        return;
>>>> +    }
>>>> +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>>> +    if (dc->vcpu_addr[cpu] == 0) {
>>>> +        atomic_inc(&dc->smp_cpus_down);
>>>> +    }
>>>> +
>>>> +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
>>>> +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
>>>> +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
>>> Looks like this is not what you and Dave have discussed?
>>>
>>> (Btw, sorry to have not followed the thread recently, so I just went
>>>   over the discussion again...)
>>>
>>> What I see that Dave suggested is (I copied from Dave's email):
>>>
>>> blocktime_start:
>>>     set CPU stall address
>>>     check bitmap entry
>>>       if set then zero stall-address
>>>
>>> While here it is:
>>>
>>> blocktime_start:
>>>     check bitmap entry
>>>       if set then return
>>>     set CPU stall address
>>>
>>> I don't think current version can really solve the risk condition. See
>>> this possible sequence:
>>>
>>>         receive-thread             fault-thread
>>>         --------------             ------------
>>>                                    blocktime_start
>>>                                      check bitmap entry,
>>>                                        if set then return
>>>         blocktime_end
>>>           set bitmap entry
>>>           read CPU stall address,
>>>             if none-0 then zero it
>>>                                      set CPU stall address [1]
>>>          
>>> Then imho the address set at [1] will be stall again until forever.
>>>
>> agree, I check is in incorrect order
>>
>>> I think we should follow exactly what Dave has suggested.
>>>
>>> And.. after a second thought, I am afraid even this would not satisfy
>>> all risk conditions. What if we consider the UFFDIO_COPY ioctl in?
>>> AFAIU after UFFDIO_COPY the faulted vcpu can be running again, then
>>> the question is, can it quickly trigger another page fault?
>>>
>> yes, it can
>>
>>> Firstly, a workable sequence is (adding UFFDIO_COPY ioctl in, and
>>> showing vcpu-thread X as well):
>>>
>>>    receive-thread       fault-thread        vcpu-thread X
>>>    --------------       ------------        -------------
>>>                                             fault at addr A1
>>>                         fault_addr[X]=A1
>>>    UFFDIO_COPY page A1
>>>    check fault_addr[X] with A1
>>>      if match, clear fault_addr[X]
>>>                                             vcpu X starts
>>>
>>> This is fine.
>>>
>>> While since "vcpu X starts" can be right after UFFDIO_COPY, can this
>>> be possible?
>> Previous picture isn't possible, due to mark_postcopy_blocktime_end
>> is being called right after ioctl, and vCPU is waking up
>> inside ioctl, so check fault_addr will be after vcpu X starts.
>>
>>>    receive-thread       fault-thread        vcpu-thread X
>>>    --------------       ------------        -------------
>>>                                             fault at addr A1
>>>                         fault_addr[X]=A1
>>>    UFFDIO_COPY page A1
>>>                                             vcpu X starts
>>>                                             fault at addr A2
>>>                         fault_addr[X]=A2
>>>    check fault_addr[X] with A1
>>>      if match, clear fault_addr[X]
>>>          ^
>>>          |
>>>          +---------- here it will not match since now fault_addr[X]==A2
>>>
>>> Then it looks like fault_addr[X], which is currently A1, will be stale
>>> again?
>> It will be another address (A2), but probably the same vCPU, and if in
>> that case blocktime_start is called before blocktime_end, we lose the
>> block time for page A1. The page address is the unique key in this
>> process, not the vCPU ;)
> I failed to understand the last sentence, anyway...
I mean that in the calculation process the page address is unique.
When I kept start/end times in a tree keyed by page address, I could
avoid this kind of problem, but the memory cost and runtime complexity
weren't ideal, and operations on the tree had to be guarded in both
blocktime_end and blocktime_begin.
>
>> Here it may be reasonable to wake up the vCPU after blocktime_end.
> ... I guess this can solve the problem. :)
>
>>
>>
>>> (I feel like finally we may need something like a per-cpu lock... or I
>>>   must have missed something)
>> I think not, because the time the vCPU is held blocked is critical in this process.
> Yes.
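The ordering this thread converges on (mark the page copied before scanning fault addresses, check the copied bitmap before recording a stall, wake the vCPU only after blocktime_end) can be sketched as a small standalone C model. This is illustrative only, not QEMU code; the names fault_addr, copied_bitmap and N_VCPU are invented for the sketch.

```c
/* Minimal single-file sketch of the race-free ordering discussed above. */
#include <stdint.h>
#include <stdbool.h>

#define N_VCPU 4
static uint64_t fault_addr[N_VCPU];  /* 0 means "not stalled" */
static uint64_t copied_bitmap;       /* 1 bit per page, toy-sized */

/* fault-thread: vCPU 'cpu' stalled on page 'addr' (bit 'page_bit') */
bool blocktime_start(int cpu, uint64_t addr, unsigned page_bit)
{
    /* If the page was already copied, the fault raced with UFFDIO_COPY:
     * do not record a stall that nobody will ever clear. */
    if (__atomic_load_n(&copied_bitmap, __ATOMIC_ACQUIRE) & (1ULL << page_bit)) {
        return false;
    }
    __atomic_store_n(&fault_addr[cpu], addr, __ATOMIC_RELEASE);
    return true;
}

/* receive-thread: page 'addr' has just been UFFDIO_COPY'd in */
void blocktime_end(uint64_t addr, unsigned page_bit)
{
    /* Mark the page copied BEFORE scanning fault addresses, so a
     * concurrent blocktime_start either sees the bit and backs off,
     * or has already published its fault_addr and is cleared below. */
    __atomic_or_fetch(&copied_bitmap, 1ULL << page_bit, __ATOMIC_ACQ_REL);
    for (int cpu = 0; cpu < N_VCPU; cpu++) {
        uint64_t expected = addr;
        __atomic_compare_exchange_n(&fault_addr[cpu], &expected, 0, false,
                                    __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    }
}
```

The compare-exchange clears only a fault address that still equals the copied page, so a newer fault (the A2 case above) is left untouched.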
>
>>>> +
>>>> +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
>>>> +            cpu);
>>>> +}
>>>> +
>>>> +static void mark_postcopy_blocktime_end(uint64_t addr)
>>>> +{
>>>> +    MigrationIncomingState *mis = migration_incoming_get_current();
>>>> +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
>>>> +    int i, affected_cpu = 0;
>>>> +    int64_t now_ms;
>>>> +    bool vcpu_total_blocktime = false;
>>>> +    unsigned long int nr_bit;
>>>> +
>>>> +    if (!dc) {
>>>> +        return;
>>>> +    }
>>>> +    /* mark that page as copied */
>>>> +    nr_bit = get_copied_bit_offset(addr);
>>>> +    set_bit_atomic(nr_bit, mis->copied_pages);
>>>> +
>>>> +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>>> +
>>>> +    /* lookup cpu, to clear it,
>>>> +     * that algorithm looks straightforward, but it's not
>>>> +     * optimal, more optimal algorithm is keeping tree or hash
>>>> +     * where key is address value is a list of  */
>>>> +    for (i = 0; i < smp_cpus; i++) {
>>>> +        uint64_t vcpu_blocktime = 0;
>>>> +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
>>>> +            continue;
>>>> +        }
>>>> +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
>>> Why use *__nocheck() rather than atomic_xchg() or even atomic_read()?
>>> Same thing happened a lot in current patch.
>> atomic_read/atomic_xchg for mingw/(gcc on arm32) has a build-time check,
>>
>> QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *));
>>
>> which prevents using 64-bit atomic operations on 32-bit architectures.
>> Just mingw, I think, but postcopy-ram.c isn't compiled for mingw anyway.
>> On other 32-bit platforms, as far as I know, clang/gcc allow 8-byte
>> variables in the builtin atomic operations; arm32 allows it in the
>> builtins too. But QEMU on arm32 still has that sanity check, which I
>> think is a bug, so I just worked around it. It might have been better
>> to fix it.
>>
>> I tested in docker, using the following command:
>> make docker-test-build@debian-armhf-cross
>>
>> and got the following error:
>>
>> /tmp/qemu-test/src/migration/postcopy-ram.c: In function
>> 'mark_postcopy_blocktime_begin':
>> /tmp/qemu-test/src/include/qemu/compiler.h:86:30: error: static
>> assertion failed: "not expecting: sizeof(*&dc->vcpu_addr[cpu]) >
>> sizeof(void *)"
>>   #define QEMU_BUILD_BUG_ON(x) _Static_assert(!(x), "not expecting: " #x)
>>
>> when I used atomic_xchg.
>> I agree with you, but I think we need to fix atomic.h first and add an
>> additional #ifdef there.
>>
>> And I didn't want to split 64-bit values into 32-bit values, though I've
>> seen people doing that on the mailing list.
> If this is a bug, then I guess the best way is to fix it. But before
> that - can a 32bit system really do 64bit atomic ops? Is it really a
> bug at all?
>
> Thanks,
>
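To Peter's question: current gcc/clang can emit genuine 64-bit atomics on common 32-bit targets (arm32 typically via ldrexd/strexd, otherwise through a libatomic helper), which is why the builtins build on armhf while QEMU's own atomic_xchg() sanity check rejects it. A minimal standalone sketch, not QEMU code:

```c
/* Standalone sketch showing the compiler's __atomic builtins working on a
 * 64-bit value. On 32-bit targets gcc/clang typically lower this to
 * ldrexd/strexd (arm32) or a libatomic call. QEMU's atomic_xchg() is
 * stricter only because of its own
 *   QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *));
 * check, which produces the build failure quoted above. */
#include <stdint.h>

/* Atomically publish new_val into *slot and return the previous value. */
uint64_t exchange_u64(uint64_t *slot, uint64_t new_val)
{
    return __atomic_exchange_n(slot, new_val, __ATOMIC_SEQ_CST);
}
```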

-- 
Best regards,
Alexey Perevalov

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH V6 04/10] migration: split ufd_version_check onto receive/request features part
  2017-05-24 11:33           ` Peter Xu
@ 2017-05-24 11:47             ` Alexey Perevalov
  2017-05-31 19:12               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-24 11:47 UTC (permalink / raw)
  To: Peter Xu; +Cc: i.maximets, qemu-devel, dgilbert

On 05/24/2017 02:33 PM, Peter Xu wrote:
> On Wed, May 24, 2017 at 09:45:48AM +0300, Alexey wrote:
>
> [...]
>
>>>>           return false;
>>>>       }
>>>>   
>>>> -    ioctl_mask = (__u64)1 << _UFFDIO_REGISTER |
>>>> -                 (__u64)1 << _UFFDIO_UNREGISTER;
>>>> +    ioctl_mask = 1 << _UFFDIO_REGISTER |
>>>> +                 1 << _UFFDIO_UNREGISTER;
>>> Could I ask why we explicitly removed (__u64) here? Since I see the
>>> old one better.
>> Maybe my change isn't robust; in any case, thanks for pointing it out. But
>> now I think there should be a constant here instead of ioctl_mask, like
>> UFFD_API_IOCTLS; the whole point of that check is to make sure the kernel
>> returned no error and accepted the features.
>> OK, from the beginning:
>>
>> if we request an unsupported feature (we check that beforehand), or the
>> internal state of the userfault ctx inside the kernel isn't
>> UFFD_STATE_WAIT_API (for example, we are in the middle of the copying
>> process)
>> 	the ioctl should fail with EINVAL, and the ioctls field in
>> 	uffdio_api will be empty
>>
>> Right now I think the ioctls check for UFFD_API is not necessary.
>> We just say here, we will use _UFFDIO_REGISTER, _UFFDIO_UNREGISTER,
>> but the kernel supports those unconditionally, in contrast with the
>> UFFDIO_REGISTER ioctl, which also returns an ioctls field in the
>> uffdio_register structure, where there can be variations.
> Sorry I didn't get the point...
I mistyped here:
 >We just say here, we will use _UFFDIO_REGISTER

s/_UFFDIO_REGISTER/_UFFDIO_API/g

But the point is that ioctl_mask is not necessary here; the kernel always
returns it. For _UFFDIO_UNREGISTER (later, not in this function), yes, that
check is required.

>
> AFAIU here (__u64) makes the constant "1" a 64-bit value, just like
> when we do a bit shift we normally write "1ULL << 40". I liked it since
> even if _UFFDIO_REGISTER were defined as >= 32 it would not overflow,
> because by default a constant "1" has type "int" (which is 32 bits wide).

> Thanks,
>
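The promotion hazard Peter describes can be shown concretely. In this sketch (not QEMU code) the bit position 40 is deliberately hypothetical, the real _UFFDIO_* numbers are small; it just makes the difference visible:

```c
/* With a plain int "1", "1 << n" is computed in 32-bit arithmetic and is
 * undefined once n >= 31; "(uint64_t)1 << n" keeps all 64 bits. The
 * narrow case is simulated by truncating the result to 32 bits rather
 * than executing the undefined 32-bit shift itself. */
#include <stdint.h>

uint64_t mask_wide(unsigned n)
{
    return (uint64_t)1 << n;             /* what (__u64)1 << _UFFDIO_x does */
}

uint64_t mask_as_if_int(unsigned n)
{
    return (uint32_t)((uint64_t)1 << n); /* high bits lost, as with 1 << n */
}
```

For small bit positions the two agree, so the cast only matters if a constant ever grows past 31.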

-- 
Best regards,
Alexey Perevalov


* Re: [Qemu-devel] [PATCH V6 07/10] migration: add bitmap for copied page
  2017-05-24  7:56         ` Alexey
@ 2017-05-24 12:01           ` Peter Xu
  2017-05-24 12:16             ` Alexey Perevalov
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2017-05-24 12:01 UTC (permalink / raw)
  To: Alexey; +Cc: i.maximets, qemu-devel, dgilbert

On Wed, May 24, 2017 at 10:56:37AM +0300, Alexey wrote:
> On Wed, May 24, 2017 at 02:57:36PM +0800, Peter Xu wrote:
> > On Tue, May 23, 2017 at 02:31:08PM +0300, Alexey Perevalov wrote:
> > > This patch adds the ability to track down already-copied
> > > pages, which is necessary for calculating vCPU block time in
> > > the postcopy migration feature, and maybe for restore after
> > > postcopy migration failure.
> > > 
> > > Functions which work with RAMBlock are placed into ram.c,
> > > because TARGET_WORDS_BIGENDIAN is poisoned in postcopy-ram.c -
> > > hardware-independent code.
> > > 
> > > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > > ---
> > >  include/migration/migration.h | 16 +++++++++++
> > >  migration/postcopy-ram.c      | 22 ++++++++++++---
> > >  migration/ram.c               | 63 +++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 97 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > index 449cb07..4e05c83 100644
> > > --- a/include/migration/migration.h
> > > +++ b/include/migration/migration.h
> > > @@ -101,6 +101,20 @@ struct MigrationIncomingState {
> > >      LoadStateEntry_Head loadvm_handlers;
> > >  
> > >      /*
> > > +     * bitmap indicates whether page copied,
> > > +     * based on ramblock offset
> > > +     * now it is using only for blocktime calculation in
> > > +     * postcopy migration, so livetime of this entry:
> > > +     * since user requested blocktime calculation,
> > > +     * till the end of postcopy migration
> > > +     * as an example it could represend following memory map
> > > +     * ___________________________________
> > > +     * |4k pages | hugepages | 4k pages
> > 
> > Can we really do this?
> 
> > 
> > The problem is AFAIU migration stream is sending pages only in target
> > page size, even for huge pages... so even for huge pages we should
> > still need per TARGET_PAGE_SIZE bitmap?
> Sending, maybe, but copying through the userfault fd is done in hugepage size

Yes you are right. :)

> in the case of a hugetlbfs memory backend.
> > 
> > (Please refer to ram_state_init() on init of RAMBlock.bmap)
> I thought about using a bitmap per ramblock, but I decided not to do it,
> for the following reasons:
> 	1. There is only 4k ramblocks, for such ramblock it's not
> 	optimal

Could you explain what you mean by "there is only 4k ramblocks"?

> 	2. The copied-pages bitmap is an attribute of incoming migration,
> 	but RAMBlock is a general structure, although bmap is about
> 	dirty pages of precopy migration... hmm, but RAMBlock also
> 	contains unsentmap, and that's for postcopy.

I thought you meant that the copied-pages bitmap is specific to incoming
migration and thus not suitable for the RAMBlock struct? But your last
sentence seems not to support that idea... Hmm... Actually it sounds like
a good idea to me to put it into RAMBlock (but I may be wrong).

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH V6 07/10] migration: add bitmap for copied page
  2017-05-24 12:01           ` Peter Xu
@ 2017-05-24 12:16             ` Alexey Perevalov
  2017-05-24 23:30               ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-24 12:16 UTC (permalink / raw)
  To: Peter Xu; +Cc: i.maximets, qemu-devel, dgilbert

On 05/24/2017 03:01 PM, Peter Xu wrote:
> On Wed, May 24, 2017 at 10:56:37AM +0300, Alexey wrote:
>> On Wed, May 24, 2017 at 02:57:36PM +0800, Peter Xu wrote:
>>> On Tue, May 23, 2017 at 02:31:08PM +0300, Alexey Perevalov wrote:
>>>> This patch adds ability to track down already copied
>>>> pages, it's necessary for calculation vCPU block time in
>>>> postcopy migration feature and maybe for restore after
>>>> postcopy migration failure.
>>>>
>>>> Functions which work with RAMBlock are placed into ram.c,
>>>> due to TARGET_WORDS_BIGENDIAN is poisoned int postcopy-ram.c -
>>>> hardware independed code.
>>>>
>>>> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
>>>> ---
>>>>   include/migration/migration.h | 16 +++++++++++
>>>>   migration/postcopy-ram.c      | 22 ++++++++++++---
>>>>   migration/ram.c               | 63 +++++++++++++++++++++++++++++++++++++++++++
>>>>   3 files changed, 97 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>>>> index 449cb07..4e05c83 100644
>>>> --- a/include/migration/migration.h
>>>> +++ b/include/migration/migration.h
>>>> @@ -101,6 +101,20 @@ struct MigrationIncomingState {
>>>>       LoadStateEntry_Head loadvm_handlers;
>>>>   
>>>>       /*
>>>> +     * bitmap indicates whether page copied,
>>>> +     * based on ramblock offset
>>>> +     * now it is using only for blocktime calculation in
>>>> +     * postcopy migration, so livetime of this entry:
>>>> +     * since user requested blocktime calculation,
>>>> +     * till the end of postcopy migration
>>>> +     * as an example it could represend following memory map
>>>> +     * ___________________________________
>>>> +     * |4k pages | hugepages | 4k pages
>>> Can we really do this?
>>> The problem is AFAIU migration stream is sending pages only in target
>>> page size, even for huge pages... so even for huge pages we should
>>> still need per TARGET_PAGE_SIZE bitmap?
>> sending maybe, but copying to userfault fd is doing in hugepage size
> Yes you are right. :)
>
>> in case of hugetlbfs memory backend.
>>> (Please refer to ram_state_init() on init of RAMBlock.bmap)
>> I thought to use bitmap per ramblock, but I decided to not do it,
>> because of following reasons:
>> 	1. There is only 4k ramblocks, for such ramblock it's not
>> 	optimal
> Could you explain what do you mean by "there is only 4k ramblocks"?
Sorry, I meant there can be ramblocks of 4k size.
As an example, QEMU's HMP "info ramblock" shows:
  Block Name                PSize               Offset               Used              Total
            /objects/mem    2 MiB  0x0000000000000000 0x0000000020000000 0x0000000020000000
                vga.vram    4 KiB  0x0000000020060000 0x0000000001000000 0x0000000001000000
    /rom@etc/acpi/tables    4 KiB  0x0000000021130000 0x0000000000020000 0x0000000000200000
                 pc.bios    4 KiB  0x0000000020000000 0x0000000000040000 0x0000000000040000
  0000:00:03.0/e1000.rom    4 KiB  0x0000000021070000 0x0000000000040000 0x0000000000040000
  0000:00:04.0/e1000.rom    4 KiB  0x00000000210b0000 0x0000000000040000 0x0000000000040000
  0000:00:05.0/e1000.rom    4 KiB  0x00000000210f0000 0x0000000000040000 0x0000000000040000
                  pc.rom    4 KiB  0x0000000020040000 0x0000000000020000 0x0000000000020000
    0000:00:02.0/vga.rom    4 KiB  0x0000000021060000 0x0000000000010000 0x0000000000010000
   /rom@etc/table-loader    4 KiB  0x0000000021330000 0x0000000000001000 0x0000000000001000
      /rom@etc/acpi/rsdp    4 KiB  0x0000000021331000 0x0000000000001000 0x0000000000001000

/rom@etc/table-loader and /rom@etc/acpi/rsdp have only one 4K page each.
>
>> 	2. copied pages bitmap - it's attribute of incoming migration,
>> 	but ramblock it's general structure, although bmap it's about
>> 	dirty pages of precopy migrations, hmm, but RAMBlock also
>> 	contains unsentmap and it's for postcopy.
> I thought you mean that copied pages are specific fields for incoming
> so not suitable for RAMBlock struct? But the last sentence seems to
> not support that idea... Hmm... Actually it sounds like a good idea to
> me to put it into RAMBlock (but I maybe wrong).
>
> Thanks,
>

-- 
Best regards,
Alexey Perevalov


* Re: [Qemu-devel] [PATCH V6 07/10] migration: add bitmap for copied page
  2017-05-24 12:16             ` Alexey Perevalov
@ 2017-05-24 23:30               ` Peter Xu
  2017-05-25  6:28                 ` Alexey Perevalov
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2017-05-24 23:30 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: i.maximets, qemu-devel, dgilbert

On Wed, May 24, 2017 at 03:16:23PM +0300, Alexey Perevalov wrote:
> On 05/24/2017 03:01 PM, Peter Xu wrote:
> >On Wed, May 24, 2017 at 10:56:37AM +0300, Alexey wrote:
> >>On Wed, May 24, 2017 at 02:57:36PM +0800, Peter Xu wrote:
> >>>On Tue, May 23, 2017 at 02:31:08PM +0300, Alexey Perevalov wrote:
> >>>>This patch adds ability to track down already copied
> >>>>pages, it's necessary for calculation vCPU block time in
> >>>>postcopy migration feature and maybe for restore after
> >>>>postcopy migration failure.
> >>>>
> >>>>Functions which work with RAMBlock are placed into ram.c,
> >>>>due to TARGET_WORDS_BIGENDIAN is poisoned int postcopy-ram.c -
> >>>>hardware independed code.
> >>>>
> >>>>Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> >>>>---
> >>>>  include/migration/migration.h | 16 +++++++++++
> >>>>  migration/postcopy-ram.c      | 22 ++++++++++++---
> >>>>  migration/ram.c               | 63 +++++++++++++++++++++++++++++++++++++++++++
> >>>>  3 files changed, 97 insertions(+), 4 deletions(-)
> >>>>
> >>>>diff --git a/include/migration/migration.h b/include/migration/migration.h
> >>>>index 449cb07..4e05c83 100644
> >>>>--- a/include/migration/migration.h
> >>>>+++ b/include/migration/migration.h
> >>>>@@ -101,6 +101,20 @@ struct MigrationIncomingState {
> >>>>      LoadStateEntry_Head loadvm_handlers;
> >>>>      /*
> >>>>+     * bitmap indicates whether page copied,
> >>>>+     * based on ramblock offset
> >>>>+     * now it is using only for blocktime calculation in
> >>>>+     * postcopy migration, so livetime of this entry:
> >>>>+     * since user requested blocktime calculation,
> >>>>+     * till the end of postcopy migration
> >>>>+     * as an example it could represend following memory map
> >>>>+     * ___________________________________
> >>>>+     * |4k pages | hugepages | 4k pages
> >>>Can we really do this?
> >>>The problem is AFAIU migration stream is sending pages only in target
> >>>page size, even for huge pages... so even for huge pages we should
> >>>still need per TARGET_PAGE_SIZE bitmap?
> >>sending maybe, but copying to userfault fd is doing in hugepage size
> >Yes you are right. :)
> >
> >>in case of hugetlbfs memory backend.
> >>>(Please refer to ram_state_init() on init of RAMBlock.bmap)
> >>I thought to use bitmap per ramblock, but I decided to not do it,
> >>because of following reasons:
> >>	1. There is only 4k ramblocks, for such ramblock it's not
> >>	optimal
> >Could you explain what do you mean by "there is only 4k ramblocks"?
> sorry, could be ramblocks with 4k size,
> as example, your's qemu hmp info ramblock shows
>  Block Name    PSize              Offset Used              Total
>             /objects/mem    2 MiB  0x0000000000000000 0x0000000020000000
> 0x0000000020000000
>                 vga.vram    4 KiB  0x0000000020060000 0x0000000001000000
> 0x0000000001000000
>     /rom@etc/acpi/tables    4 KiB  0x0000000021130000 0x0000000000020000
> 0x0000000000200000
>                  pc.bios    4 KiB  0x0000000020000000 0x0000000000040000
> 0x0000000000040000
>   0000:00:03.0/e1000.rom    4 KiB  0x0000000021070000 0x0000000000040000
> 0x0000000000040000
>   0000:00:04.0/e1000.rom    4 KiB  0x00000000210b0000 0x0000000000040000
> 0x0000000000040000
>   0000:00:05.0/e1000.rom    4 KiB  0x00000000210f0000 0x0000000000040000
> 0x0000000000040000
>                   pc.rom    4 KiB  0x0000000020040000 0x0000000000020000
> 0x0000000000020000
>     0000:00:02.0/vga.rom    4 KiB  0x0000000021060000 0x0000000000010000
> 0x0000000000010000
>    /rom@etc/table-loader    4 KiB  0x0000000021330000 0x0000000000001000
> 0x0000000000001000
>       /rom@etc/acpi/rsdp    4 KiB  0x0000000021331000 0x0000000000001000
> 0x0000000000001000
> 
> /rom@etc/table-loader and /rom@etc/acpi/rsdp has only one 4K page.

I see. Thanks for explaining this.

A 4k sized ramblock means the bitmap would be only one unsigned long
width (via bitmap_new(1)). I still don't see why it's not good... :-)
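Peter's point about bitmap_new(1) can be made concrete with a standalone analogue of the per-RAMBlock copied-pages bitmap under discussion. This mirrors the shape of QEMU's bitmap API but is not QEMU code; all names are illustrative:

```c
/* Standalone analogue (not QEMU's bitmap.h) of a per-RAMBlock
 * copied-pages bitmap: one bit per page of the block's own page size.
 * A 4K block with 4K pages needs a single bit, i.e. one unsigned long
 * of storage -- which is the bitmap_new(1) case. */
#include <stdlib.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

typedef struct {
    uint64_t page_size;     /* 4K, or the hugepage size of the backend */
    uint64_t nr_pages;
    unsigned long *copied;  /* 1 bit per page */
} BlockBitmap;

static BlockBitmap *block_bitmap_new(uint64_t block_size, uint64_t page_size)
{
    BlockBitmap *b = malloc(sizeof(*b));
    b->page_size = page_size;
    b->nr_pages = block_size / page_size;
    b->copied = calloc((b->nr_pages + BITS_PER_LONG - 1) / BITS_PER_LONG,
                       sizeof(unsigned long));
    return b;
}

static void block_bitmap_set(BlockBitmap *b, uint64_t offset)
{
    uint64_t nr = offset / b->page_size;
    /* QEMU would use set_bit_atomic(); a plain OR is enough for a sketch */
    b->copied[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static bool block_bitmap_test(BlockBitmap *b, uint64_t offset)
{
    uint64_t nr = offset / b->page_size;
    return b->copied[nr / BITS_PER_LONG] & (1UL << (nr % BITS_PER_LONG));
}
```

Because the bitmap is sized by the block's own page size, a hugetlbfs-backed block spends one bit per hugepage rather than one per 4K target page.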

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH V6 07/10] migration: add bitmap for copied page
  2017-05-24 23:30               ` Peter Xu
@ 2017-05-25  6:28                 ` Alexey Perevalov
  2017-05-25  7:25                   ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey Perevalov @ 2017-05-25  6:28 UTC (permalink / raw)
  To: Peter Xu; +Cc: i.maximets, qemu-devel, dgilbert

On 05/25/2017 02:30 AM, Peter Xu wrote:
> On Wed, May 24, 2017 at 03:16:23PM +0300, Alexey Perevalov wrote:
>> On 05/24/2017 03:01 PM, Peter Xu wrote:
>>> On Wed, May 24, 2017 at 10:56:37AM +0300, Alexey wrote:
>>>> On Wed, May 24, 2017 at 02:57:36PM +0800, Peter Xu wrote:
>>>>> On Tue, May 23, 2017 at 02:31:08PM +0300, Alexey Perevalov wrote:
>>>>>> This patch adds ability to track down already copied
>>>>>> pages, it's necessary for calculation vCPU block time in
>>>>>> postcopy migration feature and maybe for restore after
>>>>>> postcopy migration failure.
>>>>>>
>>>>>> Functions which work with RAMBlock are placed into ram.c,
>>>>>> due to TARGET_WORDS_BIGENDIAN is poisoned int postcopy-ram.c -
>>>>>> hardware independed code.
>>>>>>
>>>>>> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
>>>>>> ---
>>>>>>   include/migration/migration.h | 16 +++++++++++
>>>>>>   migration/postcopy-ram.c      | 22 ++++++++++++---
>>>>>>   migration/ram.c               | 63 +++++++++++++++++++++++++++++++++++++++++++
>>>>>>   3 files changed, 97 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>>>>>> index 449cb07..4e05c83 100644
>>>>>> --- a/include/migration/migration.h
>>>>>> +++ b/include/migration/migration.h
>>>>>> @@ -101,6 +101,20 @@ struct MigrationIncomingState {
>>>>>>       LoadStateEntry_Head loadvm_handlers;
>>>>>>       /*
>>>>>> +     * bitmap indicates whether page copied,
>>>>>> +     * based on ramblock offset
>>>>>> +     * now it is using only for blocktime calculation in
>>>>>> +     * postcopy migration, so livetime of this entry:
>>>>>> +     * since user requested blocktime calculation,
>>>>>> +     * till the end of postcopy migration
>>>>>> +     * as an example it could represend following memory map
>>>>>> +     * ___________________________________
>>>>>> +     * |4k pages | hugepages | 4k pages
>>>>> Can we really do this?
>>>>> The problem is AFAIU migration stream is sending pages only in target
>>>>> page size, even for huge pages... so even for huge pages we should
>>>>> still need per TARGET_PAGE_SIZE bitmap?
>>>> sending maybe, but copying to userfault fd is doing in hugepage size
>>> Yes you are right. :)
>>>
>>>> in case of hugetlbfs memory backend.
>>>>> (Please refer to ram_state_init() on init of RAMBlock.bmap)
>>>> I thought to use bitmap per ramblock, but I decided to not do it,
>>>> because of following reasons:
>>>> 	1. There is only 4k ramblocks, for such ramblock it's not
>>>> 	optimal
>>> Could you explain what do you mean by "there is only 4k ramblocks"?
>> sorry, could be ramblocks with 4k size,
>> as example, your's qemu hmp info ramblock shows
>>   Block Name    PSize              Offset Used              Total
>>              /objects/mem    2 MiB  0x0000000000000000 0x0000000020000000
>> 0x0000000020000000
>>                  vga.vram    4 KiB  0x0000000020060000 0x0000000001000000
>> 0x0000000001000000
>>      /rom@etc/acpi/tables    4 KiB  0x0000000021130000 0x0000000000020000
>> 0x0000000000200000
>>                   pc.bios    4 KiB  0x0000000020000000 0x0000000000040000
>> 0x0000000000040000
>>    0000:00:03.0/e1000.rom    4 KiB  0x0000000021070000 0x0000000000040000
>> 0x0000000000040000
>>    0000:00:04.0/e1000.rom    4 KiB  0x00000000210b0000 0x0000000000040000
>> 0x0000000000040000
>>    0000:00:05.0/e1000.rom    4 KiB  0x00000000210f0000 0x0000000000040000
>> 0x0000000000040000
>>                    pc.rom    4 KiB  0x0000000020040000 0x0000000000020000
>> 0x0000000000020000
>>      0000:00:02.0/vga.rom    4 KiB  0x0000000021060000 0x0000000000010000
>> 0x0000000000010000
>>     /rom@etc/table-loader    4 KiB  0x0000000021330000 0x0000000000001000
>> 0x0000000000001000
>>        /rom@etc/acpi/rsdp    4 KiB  0x0000000021331000 0x0000000000001000
>> 0x0000000000001000
>>
>> /rom@etc/table-loader and /rom@etc/acpi/rsdp has only one 4K page.
> I see. Thanks for explaining this.
>
> A 4k sized ramblock means the bitmap would be only one unsigned long
> width (via bitmap_new(1)). I still don't see why it's not good... :-)
>
OK, I'll take your comment into account and place the copied bitmap
into RAMBlock.


-- 
Best regards,
Alexey Perevalov


* Re: [Qemu-devel] [PATCH V6 07/10] migration: add bitmap for copied page
  2017-05-25  6:28                 ` Alexey Perevalov
@ 2017-05-25  7:25                   ` Peter Xu
  2017-05-31 19:25                     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2017-05-25  7:25 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: i.maximets, qemu-devel, dgilbert

On Thu, May 25, 2017 at 09:28:57AM +0300, Alexey Perevalov wrote:
> On 05/25/2017 02:30 AM, Peter Xu wrote:
> >On Wed, May 24, 2017 at 03:16:23PM +0300, Alexey Perevalov wrote:
> >>On 05/24/2017 03:01 PM, Peter Xu wrote:
> >>>On Wed, May 24, 2017 at 10:56:37AM +0300, Alexey wrote:
> >>>>On Wed, May 24, 2017 at 02:57:36PM +0800, Peter Xu wrote:
> >>>>>On Tue, May 23, 2017 at 02:31:08PM +0300, Alexey Perevalov wrote:
> >>>>>>This patch adds ability to track down already copied
> >>>>>>pages, it's necessary for calculation vCPU block time in
> >>>>>>postcopy migration feature and maybe for restore after
> >>>>>>postcopy migration failure.
> >>>>>>
> >>>>>>Functions which work with RAMBlock are placed into ram.c,
> >>>>>>due to TARGET_WORDS_BIGENDIAN is poisoned int postcopy-ram.c -
> >>>>>>hardware independed code.
> >>>>>>
> >>>>>>Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> >>>>>>---
> >>>>>>  include/migration/migration.h | 16 +++++++++++
> >>>>>>  migration/postcopy-ram.c      | 22 ++++++++++++---
> >>>>>>  migration/ram.c               | 63 +++++++++++++++++++++++++++++++++++++++++++
> >>>>>>  3 files changed, 97 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>>diff --git a/include/migration/migration.h b/include/migration/migration.h
> >>>>>>index 449cb07..4e05c83 100644
> >>>>>>--- a/include/migration/migration.h
> >>>>>>+++ b/include/migration/migration.h
> >>>>>>@@ -101,6 +101,20 @@ struct MigrationIncomingState {
> >>>>>>      LoadStateEntry_Head loadvm_handlers;
> >>>>>>      /*
> >>>>>>+     * bitmap indicates whether page copied,
> >>>>>>+     * based on ramblock offset
> >>>>>>+     * now it is using only for blocktime calculation in
> >>>>>>+     * postcopy migration, so livetime of this entry:
> >>>>>>+     * since user requested blocktime calculation,
> >>>>>>+     * till the end of postcopy migration
> >>>>>>+     * as an example it could represend following memory map
> >>>>>>+     * ___________________________________
> >>>>>>+     * |4k pages | hugepages | 4k pages
> >>>>>Can we really do this?
> >>>>>The problem is AFAIU migration stream is sending pages only in target
> >>>>>page size, even for huge pages... so even for huge pages we should
> >>>>>still need per TARGET_PAGE_SIZE bitmap?
> >>>>sending maybe, but copying to userfault fd is doing in hugepage size
> >>>Yes you are right. :)
> >>>
> >>>>in case of hugetlbfs memory backend.
> >>>>>(Please refer to ram_state_init() on init of RAMBlock.bmap)
> >>>>I thought to use bitmap per ramblock, but I decided to not do it,
> >>>>because of following reasons:
> >>>>	1. There is only 4k ramblocks, for such ramblock it's not
> >>>>	optimal
> >>>Could you explain what do you mean by "there is only 4k ramblocks"?
> >>sorry, could be ramblocks with 4k size,
> >>as example, your's qemu hmp info ramblock shows
> >>  Block Name    PSize              Offset Used              Total
> >>             /objects/mem    2 MiB  0x0000000000000000 0x0000000020000000
> >>0x0000000020000000
> >>                 vga.vram    4 KiB  0x0000000020060000 0x0000000001000000
> >>0x0000000001000000
> >>     /rom@etc/acpi/tables    4 KiB  0x0000000021130000 0x0000000000020000
> >>0x0000000000200000
> >>                  pc.bios    4 KiB  0x0000000020000000 0x0000000000040000
> >>0x0000000000040000
> >>   0000:00:03.0/e1000.rom    4 KiB  0x0000000021070000 0x0000000000040000
> >>0x0000000000040000
> >>   0000:00:04.0/e1000.rom    4 KiB  0x00000000210b0000 0x0000000000040000
> >>0x0000000000040000
> >>   0000:00:05.0/e1000.rom    4 KiB  0x00000000210f0000 0x0000000000040000
> >>0x0000000000040000
> >>                   pc.rom    4 KiB  0x0000000020040000 0x0000000000020000
> >>0x0000000000020000
> >>     0000:00:02.0/vga.rom    4 KiB  0x0000000021060000 0x0000000000010000
> >>0x0000000000010000
> >>    /rom@etc/table-loader    4 KiB  0x0000000021330000 0x0000000000001000
> >>0x0000000000001000
> >>       /rom@etc/acpi/rsdp    4 KiB  0x0000000021331000 0x0000000000001000
> >>0x0000000000001000
> >>
> >>/rom@etc/table-loader and /rom@etc/acpi/rsdp has only one 4K page.
> >I see. Thanks for explaining this.
> >
> >A 4k sized ramblock means the bitmap would be only one unsigned long
> >width (via bitmap_new(1)). I still don't see why it's not good... :-)
> >
> Ok, I'll take into account your comment, and I'll place copied
> 
> bitmap into RAMBlock.

Or we can wait for further comments, in case you'd otherwise do extra work. :-)

Just to mention that Dave won't be here until next week, so you may
need to wait a few days until he can provide further input.

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH V6 02/10] migration: pass MigrationIncomingState* into migration check functions
  2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 02/10] migration: pass MigrationIncomingState* into migration check functions Alexey Perevalov
@ 2017-05-31 17:54       ` Dr. David Alan Gilbert
  2017-06-05  5:59         ` Alexey Perevalov
  0 siblings, 1 reply; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2017-05-31 17:54 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, peterx, eblake

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> That tiny refactoring is necessary to be able to set
> UFFD_FEATURE_THREAD_ID while requesting features, and then
> to create downtime context in case when kernel supports it.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> ---
>  migration/migration.c    |  2 +-
>  migration/postcopy-ram.c | 10 +++++-----
>  migration/postcopy-ram.h |  2 +-
>  migration/savevm.c       |  2 +-
>  4 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 0304c01..d735976 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -789,7 +789,7 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
>           * special support.
>           */
>          if (!old_postcopy_cap && runstate_check(RUN_STATE_INMIGRATE) &&
> -            !postcopy_ram_supported_by_host()) {
> +            !postcopy_ram_supported_by_host(NULL)) {

Why is that NULL rather than migration_incoming_get_current()?
That's on the destination.

Dave

>              /* postcopy_ram_supported_by_host will have emitted a more
>               * detailed message
>               */
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index a0489f6..4adab36 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -59,7 +59,7 @@ struct PostcopyDiscardState {
>  #include <sys/eventfd.h>
>  #include <linux/userfaultfd.h>
>  
> -static bool ufd_version_check(int ufd)
> +static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
>  {
>      struct uffdio_api api_struct;
>      uint64_t ioctl_mask;
> @@ -112,7 +112,7 @@ static int test_range_shared(const char *block_name, void *host_addr,
>   * normally fine since if the postcopy succeeds it gets turned back on at the
>   * end.
>   */
> -bool postcopy_ram_supported_by_host(void)
> +bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
>  {
>      long pagesize = getpagesize();
>      int ufd = -1;
> @@ -135,7 +135,7 @@ bool postcopy_ram_supported_by_host(void)
>      }
>  
>      /* Version and features check */
> -    if (!ufd_version_check(ufd)) {
> +    if (!ufd_version_check(ufd, mis)) {
>          goto out;
>      }
>  
> @@ -512,7 +512,7 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
>       * Although the host check already tested the API, we need to
>       * do the check again as an ABI handshake on the new fd.
>       */
> -    if (!ufd_version_check(mis->userfault_fd)) {
> +    if (!ufd_version_check(mis->userfault_fd, mis)) {
>          return -1;
>      }
>  
> @@ -650,7 +650,7 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
>  
>  #else
>  /* No target OS support, stubs just fail */
> -bool postcopy_ram_supported_by_host(void)
> +bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
>  {
>      error_report("%s: No OS support", __func__);
>      return false;
> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> index 52d51e8..587a8b8 100644
> --- a/migration/postcopy-ram.h
> +++ b/migration/postcopy-ram.h
> @@ -14,7 +14,7 @@
>  #define QEMU_POSTCOPY_RAM_H
>  
>  /* Return true if the host supports everything we need to do postcopy-ram */
> -bool postcopy_ram_supported_by_host(void);
> +bool postcopy_ram_supported_by_host(MigrationIncomingState *mis);
>  
>  /*
>   * Make all of RAM sensitive to accesses to areas that haven't yet been written
> diff --git a/migration/savevm.c b/migration/savevm.c
> index f5e8194..61a6df7 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1356,7 +1356,7 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
>          return -1;
>      }
>  
> -    if (!postcopy_ram_supported_by_host()) {
> +    if (!postcopy_ram_supported_by_host(mis)) {
>          postcopy_state_set(POSTCOPY_INCOMING_NONE);
>          return -1;
>      }
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH V6 04/10] migration: split ufd_version_check onto receive/request features part
  2017-05-24 11:47             ` Alexey Perevalov
@ 2017-05-31 19:12               ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2017-05-31 19:12 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: Peter Xu, i.maximets, qemu-devel

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> On 05/24/2017 02:33 PM, Peter Xu wrote:
> > On Wed, May 24, 2017 at 09:45:48AM +0300, Alexey wrote:
> > 
> > [...]
> > 
> > > > >           return false;
> > > > >       }
> > > > > -    ioctl_mask = (__u64)1 << _UFFDIO_REGISTER |
> > > > > -                 (__u64)1 << _UFFDIO_UNREGISTER;
> > > > > +    ioctl_mask = 1 << _UFFDIO_REGISTER |
> > > > > +                 1 << _UFFDIO_UNREGISTER;
> > > > Could I ask why we explicitly removed (__u64) here? Since I see the
> > > > old one better.
> > > maybe my change is not robust; in any case, thanks for pointing it out. But
> > > now I think there should be a constant here instead of ioctl_mask, like
> > > UFFD_API_IOCTLS; the overall meaning of that check is to make sure the kernel
> > > returned no error to us and accepted the features.
> > > ok, from the beginning:
> > > 
> > > if we request an unsupported feature (we check it beforehand) or the internal
> > > state of the userfault ctx inside the kernel isn't UFFD_STATE_WAIT_API (for
> > > example we are in the middle of the copying process)
> > > 	the ioctl should end with an EINVAL error and the ioctls field in
> > > 	uffdio_api will be empty
> > > 
> > > Right now I think the ioctls check for UFFD_API is not necessary.
> > > We just say here, we will use _UFFDIO_REGISTER, _UFFDIO_UNREGISTER,
> > > but the kernel supports them unconditionally; by contrast, the
> > > UFFDIO_REGISTER ioctl also returns an ioctls field in the uffdio_register
> > > structure, where there can be variations.
> > Sorry I didn't get the point...
> I made a typo:
> >We just say here, we will use _UFFDIO_REGISTER
> 
> > s/_UFFDIO_REGISTER/_UFFDIO_API/g
> but the point is that ioctl_mask is not necessary here; the kernel always returns it.
> But for _UFFDIO_UNREGISTER, later, not in this function, yes, that check is required.

But Peter's only point was that to build the mask it's better to keep
the (__u64) cast for safety.

Dave

> > 
> > AFAIU here (__u64) makes the constant "1" a 64-bit value, just like
> > we normally write "1ULL<<40" when doing a bit shift. I liked it since
> > even if _UFFDIO_REGISTER is defined as >32 it will not overflow, since
> > by default the constant "1" is typed as "int" (which is 32 bits wide).
> 
> > Thanks,
> > 
> 
> -- 
> Best regards,
> Alexey Perevalov
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH V6 07/10] migration: add bitmap for copied page
  2017-05-25  7:25                   ` Peter Xu
@ 2017-05-31 19:25                     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2017-05-31 19:25 UTC (permalink / raw)
  To: Peter Xu; +Cc: Alexey Perevalov, i.maximets, qemu-devel

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, May 25, 2017 at 09:28:57AM +0300, Alexey Perevalov wrote:
> > On 05/25/2017 02:30 AM, Peter Xu wrote:
> > >On Wed, May 24, 2017 at 03:16:23PM +0300, Alexey Perevalov wrote:
> > >>On 05/24/2017 03:01 PM, Peter Xu wrote:
> > >>>On Wed, May 24, 2017 at 10:56:37AM +0300, Alexey wrote:
> > >>>>On Wed, May 24, 2017 at 02:57:36PM +0800, Peter Xu wrote:
> > >>>>>On Tue, May 23, 2017 at 02:31:08PM +0300, Alexey Perevalov wrote:
> > >>>>>>This patch adds the ability to track already copied
> > >>>>>>pages; it's necessary for calculating vCPU block time in the
> > >>>>>>postcopy migration feature and maybe for restore after
> > >>>>>>postcopy migration failure.
> > >>>>>>
> > >>>>>>Functions which work with RAMBlock are placed into ram.c,
> > >>>>>>because TARGET_WORDS_BIGENDIAN is poisoned in postcopy-ram.c -
> > >>>>>>hardware-independent code.
> > >>>>>>
> > >>>>>>Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > >>>>>>---
> > >>>>>>  include/migration/migration.h | 16 +++++++++++
> > >>>>>>  migration/postcopy-ram.c      | 22 ++++++++++++---
> > >>>>>>  migration/ram.c               | 63 +++++++++++++++++++++++++++++++++++++++++++
> > >>>>>>  3 files changed, 97 insertions(+), 4 deletions(-)
> > >>>>>>
> > >>>>>>diff --git a/include/migration/migration.h b/include/migration/migration.h
> > >>>>>>index 449cb07..4e05c83 100644
> > >>>>>>--- a/include/migration/migration.h
> > >>>>>>+++ b/include/migration/migration.h
> > >>>>>>@@ -101,6 +101,20 @@ struct MigrationIncomingState {
> > >>>>>>      LoadStateEntry_Head loadvm_handlers;
> > >>>>>>      /*
> > >>>>>>+     * bitmap indicates whether a page was copied,
> > >>>>>>+     * based on ramblock offset
> > >>>>>>+     * now it is used only for blocktime calculation in
> > >>>>>>+     * postcopy migration, so the lifetime of this entry is:
> > >>>>>>+     * from when the user requested blocktime calculation
> > >>>>>>+     * till the end of postcopy migration
> > >>>>>>+     * as an example it could represent the following memory map
> > >>>>>>+     * ___________________________________
> > >>>>>>+     * |4k pages | hugepages | 4k pages
> > >>>>>Can we really do this?
> > >>>>>The problem is AFAIU migration stream is sending pages only in target
> > >>>>>page size, even for huge pages... so even for huge pages we should
> > >>>>>still need per TARGET_PAGE_SIZE bitmap?
> > > >sending maybe, but copying via the userfault fd is done in hugepage size
> > >>>Yes you are right. :)
> > >>>
> > >>>>in case of hugetlbfs memory backend.
> > >>>>>(Please refer to ram_state_init() on init of RAMBlock.bmap)
> > > >I thought about using a bitmap per ramblock, but I decided not to do it,
> > > >for the following reasons:
> > >>>>	1. There is only 4k ramblocks, for such ramblock it's not
> > >>>>	optimal
> > >>>Could you explain what do you mean by "there is only 4k ramblocks"?
> > >>sorry, I meant there could be ramblocks of 4k size;
> > >>for example, your QEMU's hmp 'info ramblock' shows
> > >>  Block Name    PSize              Offset               Used              Total
> > >>             /objects/mem    2 MiB  0x0000000000000000 0x0000000020000000 0x0000000020000000
> > >>                 vga.vram    4 KiB  0x0000000020060000 0x0000000001000000 0x0000000001000000
> > >>     /rom@etc/acpi/tables    4 KiB  0x0000000021130000 0x0000000000020000 0x0000000000200000
> > >>                  pc.bios    4 KiB  0x0000000020000000 0x0000000000040000 0x0000000000040000
> > >>   0000:00:03.0/e1000.rom    4 KiB  0x0000000021070000 0x0000000000040000 0x0000000000040000
> > >>   0000:00:04.0/e1000.rom    4 KiB  0x00000000210b0000 0x0000000000040000 0x0000000000040000
> > >>   0000:00:05.0/e1000.rom    4 KiB  0x00000000210f0000 0x0000000000040000 0x0000000000040000
> > >>                   pc.rom    4 KiB  0x0000000020040000 0x0000000000020000 0x0000000000020000
> > >>     0000:00:02.0/vga.rom    4 KiB  0x0000000021060000 0x0000000000010000 0x0000000000010000
> > >>    /rom@etc/table-loader    4 KiB  0x0000000021330000 0x0000000000001000 0x0000000000001000
> > >>       /rom@etc/acpi/rsdp    4 KiB  0x0000000021331000 0x0000000000001000 0x0000000000001000
> > >>
> > >>/rom@etc/table-loader and /rom@etc/acpi/rsdp have only one 4K page.
> > >I see. Thanks for explaining this.
> > >
> > >A 4k-sized ramblock means the bitmap would be only one unsigned long
> > >wide (via bitmap_new(1)). I still don't see why it's not good... :-)
> > >
> > Ok, I'll take your comment into account, and I'll place the copied
> > bitmap into RAMBlock.
> 
> Or we can wait for further comments, so you don't end up doing extra work. :-)
> 
> Just to mention that Dave won't be here until next week, so you may
> need to wait for some days until he can provide some further input.

(Catches up with mail)

So I think I agree with Peter here; if we allocate the bitmap in the
same way as our other bitmaps it'll perhaps be easier for people to
understand both of them.
Especially now Juan has changed that to be individual bitmaps per
RAMBlock rather than one big bitmap; it avoids having to work out
offsets into the bitmap.

Dave

> Thanks,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side
  2017-05-24 11:22           ` Peter Xu
  2017-05-24 11:37             ` Alexey Perevalov
@ 2017-06-01 10:07             ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2017-06-01 10:07 UTC (permalink / raw)
  To: Peter Xu; +Cc: Alexey, i.maximets, qemu-devel

* Peter Xu (peterx@redhat.com) wrote:

> > > Why use *__nocheck() rather than atomic_xchg() or even atomic_read()?
> > > Same thing happened a lot in current patch.
> > atomic_read/atomic_xchg for mingw/(gcc on arm32) has a build-time check,
> > 
> > QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *));
> > 
> > which prevents using 64-bit atomic operations on 32-bit architectures; just
> > mingw, I think, but postcopy-ram.c isn't compiled for mingw.
> > On other 32-bit platforms, as far as I know, clang/gcc allow 8-byte
> > variables in the builtin atomic operations. arm32 allows it in the
> > builtins, but QEMU on arm32 still
> > has that sanity check, which I think is a bug, so I just worked around it.
> > Maybe it would have been better to fix it.
> > 
> > I tested in docker, using the following command:
> > make docker-test-build@debian-armhf-cross
> > 
> > And got following error
> > 
> > /tmp/qemu-test/src/migration/postcopy-ram.c: In function
> > 'mark_postcopy_blocktime_begin':
> > /tmp/qemu-test/src/include/qemu/compiler.h:86:30: error: static
> > assertion failed: "not expecting: sizeof(*&dc->vcpu_addr[cpu]) >
> > sizeof(void *)"
> >  #define QEMU_BUILD_BUG_ON(x) _Static_assert(!(x), "not expecting: " #x)
> > 
> > when I used atomic_xchg.
> > I agree with you, but I think we need to fix atomic.h first and add an
> > additional #ifdef there.
> > 
> > And I didn't want to split 64-bit values into 32-bit values, but I saw
> > on the mailing list that people are doing it.
> 
> If this is a bug, then I guess the best way is to fix it. But before
> that - can a 32-bit system really do 64-bit atomic ops? Is it really a
> bug at all?

Some can, but not all.
ARM has 'ldrexd' for 64-bit atomics, but I can't remember which subset
of ARMs has it; I think most recent stuff.
I don't think you can guarantee it's on all architectures, though it
might be on all the ones we care about.

Dave

> Thanks,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side
  2017-05-24  9:37         ` Alexey
  2017-05-24 11:22           ` Peter Xu
@ 2017-06-01 10:50           ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2017-06-01 10:50 UTC (permalink / raw)
  To: Alexey; +Cc: Peter Xu, i.maximets, qemu-devel

* Alexey (a.perevalov@samsung.com) wrote:
> On Wed, May 24, 2017 at 03:53:05PM +0800, Peter Xu wrote:
> > On Tue, May 23, 2017 at 02:31:09PM +0300, Alexey Perevalov wrote:
> > > This patch provides blocktime calculation per vCPU,
> > > as a summary and as an overlapped value for all vCPUs.
> > > 
> > > This approach was suggested by Peter Xu, as an improvement on the
> > > previous approach where QEMU kept a tree with faulted page address and cpus bitmask
> > > in it. Now QEMU keeps an array with faulted page address as value and vCPU
> > > as index. It helps to find the proper vCPU at UFFD_COPY time. It also keeps
> > > a list of blocktime per vCPU (which could be traced with page_fault_addr).
> > > 
> > > Blocktime will not be calculated if the postcopy_blocktime field of
> > > MigrationIncomingState wasn't initialized.
> > > 
> > > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > > ---
> > >  migration/postcopy-ram.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++-
> > >  migration/trace-events   |   5 ++-
> > >  2 files changed, 105 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > > index d647769..e70c44b 100644
> > > --- a/migration/postcopy-ram.c
> > > +++ b/migration/postcopy-ram.c
> > > @@ -23,6 +23,7 @@
> > >  #include "postcopy-ram.h"
> > >  #include "sysemu/sysemu.h"
> > >  #include "sysemu/balloon.h"
> > > +#include <sys/param.h>
> > >  #include "qemu/error-report.h"
> > >  #include "trace.h"
> > >  
> > > @@ -577,6 +578,101 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> > >      return 0;
> > >  }
> > >  
> > > +static int get_mem_fault_cpu_index(uint32_t pid)
> > > +{
> > > +    CPUState *cpu_iter;
> > > +
> > > +    CPU_FOREACH(cpu_iter) {
> > > +        if (cpu_iter->thread_id == pid) {
> > > +            return cpu_iter->cpu_index;
> > > +        }
> > > +    }
> > > +    trace_get_mem_fault_cpu_index(pid);
> > > +    return -1;
> > > +}
> > > +
> > > +static void mark_postcopy_blocktime_begin(uint64_t addr, uint32_t ptid,
> > > +        RAMBlock *rb)
> > > +{
> > > +    int cpu;
> > > +    unsigned long int nr_bit;
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
> > > +    int64_t now_ms;
> > > +
> > > +    if (!dc || ptid == 0) {
> > > +        return;
> > > +    }
> > > +    cpu = get_mem_fault_cpu_index(ptid);
> > > +    if (cpu < 0) {
> > > +        return;
> > > +    }
> > > +    nr_bit = get_copied_bit_offset(addr);
> > > +    if (test_bit(nr_bit, mis->copied_pages)) {
> > > +        return;
> > > +    }
> > > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > > +    if (dc->vcpu_addr[cpu] == 0) {
> > > +        atomic_inc(&dc->smp_cpus_down);
> > > +    }
> > > +
> > > +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
> > > +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
> > > +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
> > 
> > Looks like this is not what you and Dave have discussed?
> > 
> > (Btw, sorry to have not followed the thread recently, so I just went
> >  over the discussion again...)
> > 
> > What I see that Dave suggested is (I copied from Dave's email):
> > 
> > blocktime_start:
> >    set CPU stall address
> >    check bitmap entry
> >      if set then zero stall-address
> > 
> > While here it is:
> > 
> > blocktime_start:
> >    check bitmap entry
> >      if set then return
> >    set CPU stall address
> > 
> > I don't think the current version can really solve the race condition. See
> > this possible sequence:
> > 
> >        receive-thread             fault-thread 
> >        --------------             ------------
> >                                   blocktime_start
> >                                     check bitmap entry,
> >                                       if set then return
> >        blocktime_end
> >          set bitmap entry
> >          read CPU stall address,
> >            if none-0 then zero it
> >                                     set CPU stall address [1]
> >         
> > Then imho the address set at [1] will stay stalled forever.
> > 
> agreed, my check is in the incorrect order
> 
> > I think we should follow exactly what Dave has suggested.
> > 
> > And... on second thought, I am afraid even this would not cover all
> > the race conditions. What if we take the UFFDIO_COPY ioctl into account?
> > AFAIU after UFFDIO_COPY the faulted vcpu can be running again, then
> > the question is, can it quickly trigger another page fault?
> >
> yes, it can
> 
> > Firstly, a workable sequence is (adding UFFDIO_COPY ioctl in, and
> > showing vcpu-thread X as well):
> > 
> >   receive-thread       fault-thread        vcpu-thread X
> >   --------------       ------------        -------------
> >                                            fault at addr A1
> >                        fault_addr[X]=A1
> >   UFFDIO_COPY page A1
> >   check fault_addr[X] with A1
> >     if match, clear fault_addr[X]
> >                                            vcpu X starts
> > 
> > This is fine.
> > 
> > While since "vcpu X starts" can be right after UFFDIO_COPY, can this
> > be possible?
> The previous picture isn't possible, because mark_postcopy_blocktime_end
> is called right after the ioctl, and the vCPU wakes up
> inside the ioctl, so the fault_addr check happens after vcpu X starts.

It's the optimistic view when you get lucky.

> > 
> >   receive-thread       fault-thread        vcpu-thread X
> >   --------------       ------------        -------------
> >                                            fault at addr A1
> >                        fault_addr[X]=A1
> >   UFFDIO_COPY page A1
> >                                            vcpu X starts
> >                                            fault at addr A2
> >                        fault_addr[X]=A2
> >   check fault_addr[X] with A1
> >     if match, clear fault_addr[X]
> >         ^
> >         |
> >         +---------- here it will not match since now fault_addr[X]==A2
> > 
> > Then it looks like fault_addr[X], which is currently A1, will stall
> > again?
> 
> It will be another address (A2), but probably the same vCPU, and if in
> this case blocktime_start is called before blocktime_end, we lose the
> block time for page A1. The page address is the unique key in this
> process, not the vCPU ;)
> Here it may be reasonable to wake up the vCPU after blocktime_end.

Yes, that probably works better.
This ordering is really tricky!

One thing you might want to look out for: I suspect there are cases
where you get a fault from a vCPU but it then carries on running,
doing something else, because of KVM asynchronous page faults.
I'm not sure what you end up with when that happens.

Dave

> 
> 
> 
> > 
> > (I feel like finally we may need something like a per-cpu lock... or I
> >  must have missed something)
> I think not, because the vCPU's blocked time is critical in this process.
> > 
> > > +
> > > +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
> > > +            cpu);
> > > +}
> > > +
> > > +static void mark_postcopy_blocktime_end(uint64_t addr)
> > > +{
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
> > > +    int i, affected_cpu = 0;
> > > +    int64_t now_ms;
> > > +    bool vcpu_total_blocktime = false;
> > > +    unsigned long int nr_bit;
> > > +
> > > +    if (!dc) {
> > > +        return;
> > > +    }
> > > +    /* mark that page as copied */
> > > +    nr_bit = get_copied_bit_offset(addr);
> > > +    set_bit_atomic(nr_bit, mis->copied_pages);
> > > +
> > > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > > +
> > > +    /* lookup cpu, to clear it,
> > > +     * that algorithm looks straightforward, but it's not
> > > +     * optimal; a more optimal algorithm is keeping a tree or hash
> > > +     * where the key is the address and the value is a list of  */
> > > +    for (i = 0; i < smp_cpus; i++) {
> > > +        uint64_t vcpu_blocktime = 0;
> > > +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
> > > +            continue;
> > > +        }
> > > +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
> > 
> > Why use *__nocheck() rather than atomic_xchg() or even atomic_read()?
> > Same thing happened a lot in current patch.
> atomic_read/atomic_xchg for mingw/(gcc on arm32) has a build-time check,
> 
> QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *));
> 
> which prevents using 64-bit atomic operations on 32-bit architectures; just
> mingw, I think, but postcopy-ram.c isn't compiled for mingw.
> On other 32-bit platforms, as far as I know, clang/gcc allow 8-byte
> variables in the builtin atomic operations. arm32 allows it in the
> builtins, but QEMU on arm32 still
> has that sanity check, which I think is a bug, so I just worked around it.
> Maybe it would have been better to fix it.
> 
> I tested in docker, using the following command:
> make docker-test-build@debian-armhf-cross
> 
> And got following error
> 
> /tmp/qemu-test/src/migration/postcopy-ram.c: In function
> 'mark_postcopy_blocktime_begin':
> /tmp/qemu-test/src/include/qemu/compiler.h:86:30: error: static
> assertion failed: "not expecting: sizeof(*&dc->vcpu_addr[cpu]) >
> sizeof(void *)"
>  #define QEMU_BUILD_BUG_ON(x) _Static_assert(!(x), "not expecting: " #x)
> 
> when I used atomic_xchg.
> I agree with you, but I think we need to fix atomic.h first and add an
> additional #ifdef there.
> 
> And I didn't want to split 64-bit values into 32-bit values, but I saw
> on the mailing list that people are doing it.
> 
> > 
> > Thanks,
> > 
> > > +        vcpu_blocktime = now_ms -
> > > +            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
> > > +        affected_cpu += 1;
> > > +        /* we need to know whether mark_postcopy_end was due to a
> > > +         * faulted page; another possible case is a prefetched
> > > +         * page, and in that case we shouldn't be here */
> > > +        if (!vcpu_total_blocktime &&
> > > +            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
> > > +            vcpu_total_blocktime = true;
> > > +        }
> > > +        /* continue cycle, due to one page could affect several vCPUs */
> > > +        dc->vcpu_blocktime[i] += vcpu_blocktime;
> > > +    }
> > > +
> > > +    atomic_sub(&dc->smp_cpus_down, affected_cpu);
> > > +    if (vcpu_total_blocktime) {
> > > +        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
> > > +    }
> > > +    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
> > > +}
> > > +
> > >  /*
> > >   * Handle faults detected by the USERFAULT markings
> > >   */
> > > @@ -654,8 +750,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
> > >          rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
> > >          trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
> > >                                                  qemu_ram_get_idstr(rb),
> > > -                                                rb_offset);
> > > +                                                rb_offset,
> > > +                                                msg.arg.pagefault.feat.ptid);
> > >  
> > > +        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
> > > +                msg.arg.pagefault.feat.ptid, rb);
> > >          /*
> > >           * Send the request to the source - we want to request one
> > >           * of our host page sizes (which is >= TPS)
> > > @@ -750,6 +849,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> > >  
> > >          return -e;
> > >      }
> > > +    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
> > >  
> > >      trace_postcopy_place_page(host);
> > >      return 0;
> > > diff --git a/migration/trace-events b/migration/trace-events
> > > index 5b8ccf3..7bdadbb 100644
> > > --- a/migration/trace-events
> > > +++ b/migration/trace-events
> > > @@ -112,6 +112,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> > >  process_incoming_migration_co_postcopy_end_main(void) ""
> > >  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
> > >  migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
> > > +mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
> > > +mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
> > >  
> > >  # migration/rdma.c
> > >  qemu_rdma_accept_incoming_migration(void) ""
> > > @@ -188,7 +190,7 @@ postcopy_ram_enable_notify(void) ""
> > >  postcopy_ram_fault_thread_entry(void) ""
> > >  postcopy_ram_fault_thread_exit(void) ""
> > >  postcopy_ram_fault_thread_quit(void) ""
> > > -postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
> > > +postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
> > >  postcopy_ram_incoming_cleanup_closeuf(void) ""
> > >  postcopy_ram_incoming_cleanup_entry(void) ""
> > >  postcopy_ram_incoming_cleanup_exit(void) ""
> > > @@ -197,6 +199,7 @@ save_xbzrle_page_skipping(void) ""
> > >  save_xbzrle_page_overflow(void) ""
> > >  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> > >  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> > > +get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
> > >  
> > >  # migration/exec.c
> > >  migration_exec_outgoing(const char *cmd) "cmd=%s"
> > > -- 
> > > 1.8.3.1
> > > 
> > 
> > -- 
> > Peter Xu
> > 
> 
> -- 
> 
> BR
> Alexey
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side
  2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side Alexey Perevalov
  2017-05-24  7:53       ` Peter Xu
@ 2017-06-01 10:57       ` Dr. David Alan Gilbert
  2017-06-07  7:34         ` Alexey Perevalov
  1 sibling, 1 reply; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2017-06-01 10:57 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, peterx, eblake

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> This patch provides blocktime calculation per vCPU,
> as a summary and as an overlapped value for all vCPUs.
> 
> This approach was suggested by Peter Xu, as an improvement on the
> previous approach where QEMU kept a tree with faulted page address and cpus bitmask
> in it. Now QEMU keeps an array with faulted page address as value and vCPU
> as index. It helps to find the proper vCPU at UFFD_COPY time. It also keeps
> a list of blocktime per vCPU (which could be traced with page_fault_addr).
> 
> Blocktime will not be calculated if the postcopy_blocktime field of
> MigrationIncomingState wasn't initialized.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>

<snip>

> +    if (dc->vcpu_addr[cpu] == 0) {
> +        atomic_inc(&dc->smp_cpus_down);
> +    }
> +
> +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);

I was wondering if this could be done with atomic_cmpxchg with old=0,
but the behaviour would be different in the case where vcpu_addr[cpu]
was neither zero nor 'addr'; so I think allowing it to cope with that
case seems better.

Dave

> +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
> +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
> +
> +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
> +            cpu);
> +}
> +
> +static void mark_postcopy_blocktime_end(uint64_t addr)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
> +    int i, affected_cpu = 0;
> +    int64_t now_ms;
> +    bool vcpu_total_blocktime = false;
> +    unsigned long int nr_bit;
> +
> +    if (!dc) {
> +        return;
> +    }
> +    /* mark that page as copied */
> +    nr_bit = get_copied_bit_offset(addr);
> +    set_bit_atomic(nr_bit, mis->copied_pages);
> +
> +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
> +    /* lookup cpu, to clear it,
> +     * that algorithm looks straightforward, but it's not
> +     * optimal; a more optimal algorithm is keeping a tree or hash
> +     * where the key is the address and the value is a list of  */
> +    for (i = 0; i < smp_cpus; i++) {
> +        uint64_t vcpu_blocktime = 0;
> +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
> +            continue;
> +        }
> +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
> +        vcpu_blocktime = now_ms -
> +            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
> +        affected_cpu += 1;
> +        /* we need to know whether mark_postcopy_end was due to a
> +         * faulted page; another possible case is a prefetched
> +         * page, and in that case we shouldn't be here */
> +        if (!vcpu_total_blocktime &&
> +            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
> +            vcpu_total_blocktime = true;
> +        }
> +        /* continue cycle, due to one page could affect several vCPUs */
> +        dc->vcpu_blocktime[i] += vcpu_blocktime;
> +    }
> +
> +    atomic_sub(&dc->smp_cpus_down, affected_cpu);
> +    if (vcpu_total_blocktime) {
> +        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
> +    }
> +    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
> +}
> +
>  /*
>   * Handle faults detected by the USERFAULT markings
>   */
> @@ -654,8 +750,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
>          rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
>          trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
>                                                  qemu_ram_get_idstr(rb),
> -                                                rb_offset);
> +                                                rb_offset,
> +                                                msg.arg.pagefault.feat.ptid);
>  
> +        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
> +                msg.arg.pagefault.feat.ptid, rb);
>          /*
>           * Send the request to the source - we want to request one
>           * of our host page sizes (which is >= TPS)
> @@ -750,6 +849,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>  
>          return -e;
>      }
> +    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
>  
>      trace_postcopy_place_page(host);
>      return 0;
> diff --git a/migration/trace-events b/migration/trace-events
> index 5b8ccf3..7bdadbb 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -112,6 +112,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>  process_incoming_migration_co_postcopy_end_main(void) ""
>  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
>  migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
> +mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
> +mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
>  
>  # migration/rdma.c
>  qemu_rdma_accept_incoming_migration(void) ""
> @@ -188,7 +190,7 @@ postcopy_ram_enable_notify(void) ""
>  postcopy_ram_fault_thread_entry(void) ""
>  postcopy_ram_fault_thread_exit(void) ""
>  postcopy_ram_fault_thread_quit(void) ""
> -postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
> +postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
>  postcopy_ram_incoming_cleanup_closeuf(void) ""
>  postcopy_ram_incoming_cleanup_entry(void) ""
>  postcopy_ram_incoming_cleanup_exit(void) ""
> @@ -197,6 +199,7 @@ save_xbzrle_page_skipping(void) ""
>  save_xbzrle_page_overflow(void) ""
>  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
>  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> +get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
>  
>  # migration/exec.c
>  migration_exec_outgoing(const char *cmd) "cmd=%s"
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH V6 09/10] migration: add postcopy total blocktime into query-migrate
  2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 09/10] migration: add postcopy total blocktime into query-migrate Alexey Perevalov
@ 2017-06-01 11:35       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2017-06-01 11:35 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, peterx, eblake

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> Postcopy total blocktime is available on the destination side only,
> but query-migrate was previously possible only on the source. This
> patch adds the ability to call query-migrate on the destination.
> To be able to see the postcopy blocktime, the postcopy-blocktime
> capability needs to be requested.
> 
> The query-migrate command will show following sample result:
> {"return":
>     "postcopy-vcpu-blocktime": [115, 100],
>     "status": "completed",
>     "postcopy-blocktime": 100
> }}
> 
> postcopy-vcpu-blocktime contains a list whose first item corresponds to
> the first vCPU in QEMU.
> 
> This patch has a drawback: it combines the states of incoming and
> outgoing migration, so an ongoing outgoing migration state will
> overwrite the incoming state. It would be better to either separate
> query-migrate for incoming and outgoing migration, or to add a
> parameter indicating the type of migration.
> 
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> ---
>  hmp.c                         | 15 +++++++++
>  include/migration/migration.h |  4 +++
>  migration/migration.c         | 40 +++++++++++++++++++++--
>  migration/postcopy-ram.c      | 75 +++++++++++++++++++++++++++++++++++++++++++
>  migration/trace-events        |  1 +
>  qapi-schema.json              |  9 +++++-
>  6 files changed, 140 insertions(+), 4 deletions(-)
> 
> diff --git a/hmp.c b/hmp.c
> index 3dceaf8..25135e7 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -260,6 +260,21 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>                         info->cpu_throttle_percentage);
>      }
>  
> +    if (info->has_postcopy_blocktime) {
> +        monitor_printf(mon, "postcopy blocktime: %" PRId64 "\n",
> +                info->postcopy_blocktime);
> +    }
> +
> +    if (info->has_postcopy_vcpu_blocktime) {
> +        Visitor *v;
> +        char *str;
> +        v = string_output_visitor_new(false, &str);
> +        visit_type_int64List(v, NULL, &info->postcopy_vcpu_blocktime, NULL);
> +        visit_complete(v, &str);
> +        monitor_printf(mon, "postcopy vcpu blocktime: %s\n", str);
> +        g_free(str);
> +        visit_free(v);
> +    }
>      qapi_free_MigrationInfo(info);
>      qapi_free_MigrationCapabilityStatusList(caps);
>  }
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 4e05c83..c9d4954 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -123,6 +123,10 @@ struct MigrationIncomingState {
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
>  void migration_incoming_state_destroy(void);
> +/*
> + * Functions to work with blocktime context
> + */
> +void fill_destination_postcopy_migration_info(MigrationInfo *info);
>  
>  struct MigrationState
>  {
> diff --git a/migration/migration.c b/migration/migration.c
> index e10284e..4da0c20 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -651,14 +651,15 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
>      }
>  }
>  
> -MigrationInfo *qmp_query_migrate(Error **errp)
> +static void fill_source_migration_info(MigrationInfo *info)
>  {
> -    MigrationInfo *info = g_malloc0(sizeof(*info));
>      MigrationState *s = migrate_get_current();
>  
>      switch (s->state) {
>      case MIGRATION_STATUS_NONE:
>          /* no migration has happened ever */
> +        /* do not overwrite destination migration status */
> +        return;
>          break;
>      case MIGRATION_STATUS_SETUP:
>          info->has_status = true;
> @@ -744,10 +745,43 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>          break;
>      }
>      info->status = s->state;
> +}
>  
> -    return info;
> +static void fill_destination_migration_info(MigrationInfo *info)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +
> +    switch (mis->state) {
> +    case MIGRATION_STATUS_NONE:
> +        return;
> +        break;
> +    case MIGRATION_STATUS_SETUP:
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +    case MIGRATION_STATUS_ACTIVE:
> +    case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> +    case MIGRATION_STATUS_FAILED:
> +    case MIGRATION_STATUS_COLO:
> +        info->has_status = true;
> +        break;
> +    case MIGRATION_STATUS_COMPLETED:
> +        info->has_status = true;
> +        fill_destination_postcopy_migration_info(info);
> +        break;
> +    }
> +    info->status = mis->state;
>  }
>  
> +MigrationInfo *qmp_query_migrate(Error **errp)
> +{
> +    MigrationInfo *info = g_malloc0(sizeof(*info));
> +
> +    fill_destination_migration_info(info);
> +    fill_source_migration_info(info);
> +
> +    return info;
> +}
> +
>  void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
>                                    Error **errp)
>  {
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index e70c44b..3dc3869 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -139,6 +139,73 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
>      return ctx;
>  }
>  
> +static int64List *get_vcpu_blocktime_list(PostcopyBlocktimeContext *ctx)
> +{
> +    int64List *list = NULL, *entry = NULL;
> +    int i;
> +
> +    for (i = smp_cpus - 1; i >= 0; i--) {
> +            entry = g_new0(int64List, 1);
> +            entry->value = ctx->vcpu_blocktime[i];
> +            entry->next = list;
> +            list = entry;
> +    }
> +
> +    return list;
> +}
> +
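As an editorial illustration of the reverse iteration above: prepending while walking the array backwards yields a singly linked list in forward vCPU order, with no tail pointer needed. A standalone sketch, using a hypothetical plain-C Int64List in place of the QAPI-generated type:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Int64List {
    struct Int64List *next;
    int64_t value;
} Int64List;

/* Walk the array backwards and prepend, so index 0 ends up first. */
static Int64List *list_from_array(const int64_t *vals, int n)
{
    Int64List *list = NULL;
    int i;

    for (i = n - 1; i >= 0; i--) {
        Int64List *entry = calloc(1, sizeof(*entry));
        entry->value = vals[i];
        entry->next = list;   /* prepend in front of the current head */
        list = entry;
    }
    return list;
}
```

Feeding {115, 100} through such a helper produces a list whose head is 115, matching the "first item is the first vCPU" ordering described in the commit message.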
> +/*
> + * This function just provides the calculated blocktime per vCPU and
> + * traces it. Total blocktime is calculated in mark_postcopy_blocktime_end.
> + *
> + * Assume we have 3 vCPUs:
> + *
> + *      S1        E1           S1               E1
> + * -----***********------------xxx***************------------------------> CPU1
> + *
> + *             S2                E2
> + * ------------****************xxx---------------------------------------> CPU2
> + *
> + *                         S3            E3
> + * ------------------------****xxx********-------------------------------> CPU3
> + *
> + * We have the sequence S1,S2,E1,S3,S1,E2,E3,E1.
> + * S2,E1 - does not contribute to total blocktime, because the sequence
> + *         S1,S2,E1 does not include CPU3.
> + * S3,S1,E2 - this sequence includes all CPUs, so the overlap S1,E2 is
> + *            part of the total blocktime (the second S1 is last_begin).
> + * Legend:
> + *              * - blocktime per vCPU
> + *              x - overlapped blocktime (total blocktime)
> + */

That's probably too much detail for this function - since this function
now does very little except call the others.

However, other than the comment:


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
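As an editorial aside, the begin/end scheme in the diagram above can be modelled in a few lines of standalone C. This is a simplified sketch, not QEMU's implementation: all names are hypothetical, timestamps are plain integers, and the atomics of the real code are omitted. The overlap window opens when the last running vCPU faults (at last_begin) and closes at the first fault completion after that.

```c
#include <assert.h>
#include <stdint.h>

#define NCPUS 3

typedef struct {
    int64_t page_fault_time[NCPUS]; /* when each vCPU faulted */
    int64_t vcpu_blocktime[NCPUS];  /* accumulated blocktime per vCPU */
    int64_t last_begin;             /* time of the most recent fault */
    int64_t total_blocktime;        /* time with ALL vCPUs blocked */
    int cpus_down;                  /* vCPUs currently blocked */
    int all_down;                   /* latched: every vCPU blocked at once */
} BlocktimeCtx;

static void mark_begin(BlocktimeCtx *c, int cpu, int64_t now)
{
    c->page_fault_time[cpu] = now;
    c->last_begin = now;
    if (++c->cpus_down == NCPUS) {
        c->all_down = 1;            /* overlap window opens at last_begin */
    }
}

static void mark_end(BlocktimeCtx *c, int cpu, int64_t now)
{
    c->vcpu_blocktime[cpu] += now - c->page_fault_time[cpu];
    c->cpus_down--;
    if (c->all_down) {
        /* first completion after full overlap closes the window */
        c->total_blocktime += now - c->last_begin;
        c->all_down = 0;
    }
}
```

Replaying the diagram's sequence S1,S2,E1,S3,S1,E2,E3,E1 at times 1 through 8 gives per-vCPU blocktimes of 5, 4 and 3, and a total (overlapped) blocktime of 1 - the S1,E2 window.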

> +void fill_destination_postcopy_migration_info(MigrationInfo *info)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +    PostcopyBlocktimeContext *bc = mis->blocktime_ctx;
> +
> +    if (!bc) {
> +        return;
> +    }
> +
> +    info->has_postcopy_blocktime = true;
> +    info->postcopy_blocktime = bc->total_blocktime;
> +    info->has_postcopy_vcpu_blocktime = true;
> +    info->postcopy_vcpu_blocktime = get_vcpu_blocktime_list(bc);
> +}
> +
> +static uint64_t get_postcopy_total_blocktime(void)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +    PostcopyBlocktimeContext *bc = mis->blocktime_ctx;
> +
> +    if (!bc) {
> +        return 0;
> +    }
> +
> +    return bc->total_blocktime;
> +}
> +
>  /**
>   * receive_ufd_features: check userfault fd features, to request only supported
>   * features in the future.
> @@ -497,6 +564,9 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>      }
>  
>      postcopy_state_set(POSTCOPY_INCOMING_END);
> +    /* sending the blocktime back to the source should happen here */
> +    trace_postcopy_ram_incoming_cleanup_blocktime(
> +            get_postcopy_total_blocktime());
>      migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0);
>  
>      if (mis->postcopy_tmp_page) {
> @@ -926,6 +996,11 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
>  
>  #else
>  /* No target OS support, stubs just fail */
> +void fill_destination_postcopy_migration_info(MigrationInfo *info)
> +{
> +    error_report("%s: No OS support", __func__);
> +}
> +
>  bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
>  {
>      error_report("%s: No OS support", __func__);
> diff --git a/migration/trace-events b/migration/trace-events
> index 7bdadbb..55a3b6e 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -195,6 +195,7 @@ postcopy_ram_incoming_cleanup_closeuf(void) ""
>  postcopy_ram_incoming_cleanup_entry(void) ""
>  postcopy_ram_incoming_cleanup_exit(void) ""
>  postcopy_ram_incoming_cleanup_join(void) ""
> +postcopy_ram_incoming_cleanup_blocktime(uint64_t total) "total blocktime %" PRIu64
>  save_xbzrle_page_skipping(void) ""
>  save_xbzrle_page_overflow(void) ""
>  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 78617fe..4be0b09 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -712,6 +712,11 @@
>  #              @status is 'failed'. Clients should not attempt to parse the
>  #              error strings. (Since 2.7)
>  #
> +# @postcopy-blocktime: total time when all vCPUs were blocked during postcopy
> +#           live migration (Since 2.10)
> +#
> +# @postcopy-vcpu-blocktime: list of the postcopy blocktime per vCPU (Since 2.10)
> +#
>  # Since: 0.14.0
>  ##
>  { 'struct': 'MigrationInfo',
> @@ -723,7 +728,9 @@
>             '*downtime': 'int',
>             '*setup-time': 'int',
>             '*cpu-throttle-percentage': 'int',
> -           '*error-desc': 'str'} }
> +           '*error-desc': 'str',
> +           '*postcopy-blocktime' : 'int64',
> +           '*postcopy-vcpu-blocktime': ['int64']} }
>  
>  ##
>  # @query-migrate:
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH V6 10/10] migration: postcopy_blocktime documentation
  2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 10/10] migration: postcopy_blocktime documentation Alexey Perevalov
@ 2017-06-01 11:37       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2017-06-01 11:37 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, peterx, eblake

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> ---
>  docs/migration.txt | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/docs/migration.txt b/docs/migration.txt
> index 1b940a8..2207c6d 100644
> --- a/docs/migration.txt
> +++ b/docs/migration.txt
> @@ -402,6 +402,16 @@ will now cause the transition from precopy to postcopy.
>  It can be issued immediately after migration is started or any
>  time later on.  Issuing it after the end of a migration is harmless.
>  
> +Blocktime is a postcopy live migration metric, intended to show
> +how long the source vCPU was in a state of interruptible sleep due to a page fault.

                ^^^^^^
I think it's just 'how long the vCPU' - I don't think 'source' adds
anything.

> +This value is calculated on the destination side.
> +To enable postcopy blocktime calculation, enter the following command on the
> +destination monitor:
> +
> +migrate_set_capability postcopy-blocktime on
> +
> +Postcopy blocktime can be retrieved with the query-migrate QMP command.
> +

I think other than that it's OK.



Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

>  Note: During the postcopy phase, the bandwidth limits set using
>  migrate_set_speed is ignored (to avoid delaying requested pages that
>  the destination is waiting for).
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH V6 02/10] migration: pass MigrationIncomingState* into migration check functions
  2017-05-31 17:54       ` Dr. David Alan Gilbert
@ 2017-06-05  5:59         ` Alexey Perevalov
  2017-06-05  9:15           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 39+ messages in thread
From: Alexey Perevalov @ 2017-06-05  5:59 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: qemu-devel, i.maximets, peterx, eblake

On 05/31/2017 08:54 PM, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
>> This tiny refactoring is necessary to be able to set
>> UFFD_FEATURE_THREAD_ID while requesting features, and then
>> to create a downtime context when the kernel supports it.
>>
>> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
>> ---
>>   migration/migration.c    |  2 +-
>>   migration/postcopy-ram.c | 10 +++++-----
>>   migration/postcopy-ram.h |  2 +-
>>   migration/savevm.c       |  2 +-
>>   4 files changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 0304c01..d735976 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -789,7 +789,7 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
>>            * special support.
>>            */
>>           if (!old_postcopy_cap && runstate_check(RUN_STATE_INMIGRATE) &&
>> -            !postcopy_ram_supported_by_host()) {
>> +            !postcopy_ram_supported_by_host(NULL)) {
> Why is that NULL rather than migration_incoming_get_current()?
> That's on the destination.
In the current implementation, AFAIU, migrate-set-capabilities with the
postcopy-ram argument should be done on the source machine only;
it is not required on the destination. Requiring the source to support
userfaultfd, and checking for it there, is unnecessary.

>
> Dave
>
>>               /* postcopy_ram_supported_by_host will have emitted a more
>>                * detailed message
>>                */
>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> index a0489f6..4adab36 100644
>> --- a/migration/postcopy-ram.c
>> +++ b/migration/postcopy-ram.c
>> @@ -59,7 +59,7 @@ struct PostcopyDiscardState {
>>   #include <sys/eventfd.h>
>>   #include <linux/userfaultfd.h>
>>   
>> -static bool ufd_version_check(int ufd)
>> +static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
>>   {
>>       struct uffdio_api api_struct;
>>       uint64_t ioctl_mask;
>> @@ -112,7 +112,7 @@ static int test_range_shared(const char *block_name, void *host_addr,
>>    * normally fine since if the postcopy succeeds it gets turned back on at the
>>    * end.
>>    */
>> -bool postcopy_ram_supported_by_host(void)
>> +bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
>>   {
>>       long pagesize = getpagesize();
>>       int ufd = -1;
>> @@ -135,7 +135,7 @@ bool postcopy_ram_supported_by_host(void)
>>       }
>>   
>>       /* Version and features check */
>> -    if (!ufd_version_check(ufd)) {
>> +    if (!ufd_version_check(ufd, mis)) {
>>           goto out;
>>       }
>>   
>> @@ -512,7 +512,7 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
>>        * Although the host check already tested the API, we need to
>>        * do the check again as an ABI handshake on the new fd.
>>        */
>> -    if (!ufd_version_check(mis->userfault_fd)) {
>> +    if (!ufd_version_check(mis->userfault_fd, mis)) {
>>           return -1;
>>       }
>>   
>> @@ -650,7 +650,7 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
>>   
>>   #else
>>   /* No target OS support, stubs just fail */
>> -bool postcopy_ram_supported_by_host(void)
>> +bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
>>   {
>>       error_report("%s: No OS support", __func__);
>>       return false;
>> diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
>> index 52d51e8..587a8b8 100644
>> --- a/migration/postcopy-ram.h
>> +++ b/migration/postcopy-ram.h
>> @@ -14,7 +14,7 @@
>>   #define QEMU_POSTCOPY_RAM_H
>>   
>>   /* Return true if the host supports everything we need to do postcopy-ram */
>> -bool postcopy_ram_supported_by_host(void);
>> +bool postcopy_ram_supported_by_host(MigrationIncomingState *mis);
>>   
>>   /*
>>    * Make all of RAM sensitive to accesses to areas that haven't yet been written
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index f5e8194..61a6df7 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -1356,7 +1356,7 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
>>           return -1;
>>       }
>>   
>> -    if (!postcopy_ram_supported_by_host()) {
>> +    if (!postcopy_ram_supported_by_host(mis)) {
>>           postcopy_state_set(POSTCOPY_INCOMING_NONE);
>>           return -1;
>>       }
>> -- 
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>
>

-- 
Best regards,
Alexey Perevalov

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH V6 06/10] migration: add postcopy blocktime ctx into MigrationIncomingState
  2017-05-24  3:31       ` Peter Xu
@ 2017-06-05  6:31         ` Alexey Perevalov
  0 siblings, 0 replies; 39+ messages in thread
From: Alexey Perevalov @ 2017-06-05  6:31 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, i.maximets, eblake, dgilbert

On 05/24/2017 06:31 AM, Peter Xu wrote:
> On Tue, May 23, 2017 at 02:31:07PM +0300, Alexey Perevalov wrote:
>> This patch adds request to kernel space for UFFD_FEATURE_THREAD_ID,
>> in case when this feature is provided by kernel.
>>
>> PostcopyBlocktimeContext is encapsulated inside postcopy-ram.c,
>> since it is a postcopy-only feature.
>> It also defines the lifetime of the PostcopyBlocktimeContext instance.
>> Information from the PostcopyBlocktimeContext instance will be queried
>> long after postcopy migration ends; the instance itself lives until
>> QEMU exits, but the parts of it (vcpu_addr, page_fault_vcpu_time)
>> used only during calculation are released when postcopy ends or fails.
>>
>> To enable postcopy blocktime calculation on the destination, the
>> proper capability needs to be requested (a documentation patch is at
>> the tail of the patch set).
>>
>> As an example, the following command enables that capability, assuming
>> QEMU was started with the
>> -chardev socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock
>> option to control it:
>>
>> [root@host]#printf "{\"execute\" : \"qmp_capabilities\"}\r\n \
>> {\"execute\": \"migrate-set-capabilities\" , \"arguments\":   {
>> \"capabilities\": [ { \"capability\": \"postcopy-blocktime\", \"state\":
>> true } ] } }" | nc -U /var/lib/migrate-vm-monitor.sock
>>
>> Or just with HMP
>> (qemu) migrate_set_capability postcopy-blocktime on
>>
>> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
>> ---
>>   include/migration/migration.h |  8 +++++
>>   migration/postcopy-ram.c      | 80 +++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 88 insertions(+)
>>
>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>> index 2951253..449cb07 100644
>> --- a/include/migration/migration.h
>> +++ b/include/migration/migration.h
>> @@ -57,6 +57,8 @@ enum mig_rp_message_type {
>>   
>>   typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
>>   
>> +struct PostcopyBlocktimeContext;
>> +
>>   /* State for the incoming migration */
>>   struct MigrationIncomingState {
>>       QEMUFile *from_src_file;
>> @@ -97,6 +99,12 @@ struct MigrationIncomingState {
>>   
>>       /* See savevm.c */
>>       LoadStateEntry_Head loadvm_handlers;
>> +
>> +    /*
>> +     * PostcopyBlocktimeContext keeps information for postcopy
>> +     * live migration, used to calculate vCPU block time
>> +     */
>> +    struct PostcopyBlocktimeContext *blocktime_ctx;
>>   };
>>   
>>   MigrationIncomingState *migration_incoming_get_current(void);
>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> index 4f3f495..5435a40 100644
>> --- a/migration/postcopy-ram.c
>> +++ b/migration/postcopy-ram.c
>> @@ -59,6 +59,73 @@ struct PostcopyDiscardState {
>>   #include <sys/eventfd.h>
>>   #include <linux/userfaultfd.h>
>>   
>> +typedef struct PostcopyBlocktimeContext {
>> +    /* time when page fault initiated per vCPU */
>> +    int64_t *page_fault_vcpu_time;
>> +    /* page address per vCPU */
>> +    uint64_t *vcpu_addr;
>> +    int64_t total_blocktime;
>> +    /* blocktime per vCPU */
>> +    int64_t *vcpu_blocktime;
>> +    /* point in time when last page fault was initiated */
>> +    int64_t last_begin;
>> +    /* number of vCPUs currently suspended */
>> +    int smp_cpus_down;
>> +
>> +    /*
>> +     * Handler for exit event, necessary for
>> +     * releasing whole blocktime_ctx
>> +     */
>> +    Notifier exit_notifier;
>> +    /*
>> +     * Handler for postcopy event, necessary for
>> +     * releasing unnecessary part of blocktime_ctx
>> +     */
>> +    Notifier postcopy_notifier;
>> +} PostcopyBlocktimeContext;
>> +
>> +static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx)
>> +{
>> +    g_free(ctx->page_fault_vcpu_time);
>> +    g_free(ctx->vcpu_addr);
>> +    g_free(ctx->vcpu_blocktime);
>> +    g_free(ctx);
>> +}
>> +
>> +static void postcopy_migration_cb(Notifier *n, void *data)
>> +{
>> +    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
>> +                                               postcopy_notifier);
>> +    MigrationState *s = data;
>> +    if (migration_has_finished(s) || migration_has_failed(s)) {
>> +        g_free(ctx->page_fault_vcpu_time);
>> +        /* g_free is NULL robust */
>> +        ctx->page_fault_vcpu_time = NULL;
>> +        g_free(ctx->vcpu_addr);
>> +        ctx->vcpu_addr = NULL;
>> +    }
>> +}
>> +
>> +static void migration_exit_cb(Notifier *n, void *data)
>> +{
>> +    PostcopyBlocktimeContext *ctx = container_of(n, PostcopyBlocktimeContext,
>> +                                               exit_notifier);
>> +    destroy_blocktime_context(ctx);
>> +}
>> +
>> +static struct PostcopyBlocktimeContext *blocktime_context_new(void)
>> +{
>> +    PostcopyBlocktimeContext *ctx = g_new0(PostcopyBlocktimeContext, 1);
>> +    ctx->page_fault_vcpu_time = g_new0(int64_t, smp_cpus);
>> +    ctx->vcpu_addr = g_new0(uint64_t, smp_cpus);
>> +    ctx->vcpu_blocktime = g_new0(int64_t, smp_cpus);
>> +
>> +    ctx->exit_notifier.notify = migration_exit_cb;
>> +    ctx->postcopy_notifier.notify = postcopy_migration_cb;
>> +    qemu_add_exit_notifier(&ctx->exit_notifier);
>> +    add_migration_state_change_notifier(&ctx->postcopy_notifier);
> I think we just need a global MAX_VCPUS macro, then we can just make
> the whole struct static. But I admit this is off topic for the current
> thread. The point is that I see little point in such fine-grained
> management of memory here... for a summary of, say, a maximum of
> 1K vCPUs, it is no more than 8B*3*1K = 24KB of memory...
>
> IMHO this just makes things complicated.
>
>> +    return ctx;
>> +}
>>   
>>   /**
>>    * receive_ufd_features: check userfault fd features, to request only supported
>> @@ -151,6 +218,19 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
>>           }
>>       }
>>   
>> +#ifdef UFFD_FEATURE_THREAD_ID
>> +    if (migrate_postcopy_blocktime() && mis &&
>> +            UFFD_FEATURE_THREAD_ID & supported_features) {
> (indent)
>
>> +        /* kernel supports that feature */
>> +        /* don't create blocktime_context if it exists */
>> +        if (!mis->blocktime_ctx) {
> If it did exist, then maybe we'll have a problem - in the state
> change notifier we cleaned up ->page_fault_vcpu_time and ->vcpu_addr,
> but blocktime_ctx was still there. So is it possible that when we reach
> here we'll get blocktime_ctx != NULL but
> blocktime_ctx->page_fault_vcpu_time == NULL and also
> blocktime_ctx->vcpu_addr == NULL?
>
> Maybe we can just drop the state change notifier, and a single exit
> notifier would be enough to clean up everything?
When I did that I had the same doubt.
On one hand there is some implementation complexity (though not so
high) and a possibility of use-after-free, which I tried to prevent;
on the other hand I didn't want to keep that auxiliary memory around.
But as you rightly noticed, the memory overhead isn't big, so for the
sake of robustness and simplicity I'll free that memory at QEMU's exit.

>
>> +            mis->blocktime_ctx = blocktime_context_new();
>> +        }
>> +
>> +        asked_features |= UFFD_FEATURE_THREAD_ID;
>> +    }
>> +#endif
>> +
>>       /*
>>        * request features, even if asked_features is 0, due to
>>        * kernel expects UFFD_API before UFFDIO_REGISTER, per
>> -- 
>> 1.8.3.1
>>
> Thanks,
>

-- 
Best regards,
Alexey Perevalov

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH V6 02/10] migration: pass MigrationIncomingState* into migration check functions
  2017-06-05  5:59         ` Alexey Perevalov
@ 2017-06-05  9:15           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2017-06-05  9:15 UTC (permalink / raw)
  To: Alexey Perevalov; +Cc: qemu-devel, i.maximets, peterx, eblake

* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> On 05/31/2017 08:54 PM, Dr. David Alan Gilbert wrote:
> > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > That tiny refactoring is necessary to be able to set
> > > UFFD_FEATURE_THREAD_ID while requesting features, and then
> > > to create downtime context in case when kernel supports it.
> > > 
> > > Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> > > ---
> > >   migration/migration.c    |  2 +-
> > >   migration/postcopy-ram.c | 10 +++++-----
> > >   migration/postcopy-ram.h |  2 +-
> > >   migration/savevm.c       |  2 +-
> > >   4 files changed, 8 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 0304c01..d735976 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -789,7 +789,7 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
> > >            * special support.
> > >            */
> > >           if (!old_postcopy_cap && runstate_check(RUN_STATE_INMIGRATE) &&
> > > -            !postcopy_ram_supported_by_host()) {
> > > +            !postcopy_ram_supported_by_host(NULL)) {
> > Why is that NULL rather than migration_incoming_get_current()?
> > That's on the destination.
> In the current implementation, AFAIU, migrate-set-capabilities with the
> postcopy-ram argument should be done on the source machine only;
> it is not required on the destination. Requiring the source to support
> userfaultfd, and checking for it there, is unnecessary.

Note the 'runstate_check(RUN_STATE_INMIGRATE)' - this check is made on
the destination.
While it's not strictly necessary for the management layer to do a
migrate_set_capability postcopy-ram on the destination, it does it
anyway to check that the destination is capable of postcopy before it
starts the migration.

Dave

> 
> > 
> > Dave
> > 
> > >               /* postcopy_ram_supported_by_host will have emitted a more
> > >                * detailed message
> > >                */
> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > > index a0489f6..4adab36 100644
> > > --- a/migration/postcopy-ram.c
> > > +++ b/migration/postcopy-ram.c
> > > @@ -59,7 +59,7 @@ struct PostcopyDiscardState {
> > >   #include <sys/eventfd.h>
> > >   #include <linux/userfaultfd.h>
> > > -static bool ufd_version_check(int ufd)
> > > +static bool ufd_version_check(int ufd, MigrationIncomingState *mis)
> > >   {
> > >       struct uffdio_api api_struct;
> > >       uint64_t ioctl_mask;
> > > @@ -112,7 +112,7 @@ static int test_range_shared(const char *block_name, void *host_addr,
> > >    * normally fine since if the postcopy succeeds it gets turned back on at the
> > >    * end.
> > >    */
> > > -bool postcopy_ram_supported_by_host(void)
> > > +bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
> > >   {
> > >       long pagesize = getpagesize();
> > >       int ufd = -1;
> > > @@ -135,7 +135,7 @@ bool postcopy_ram_supported_by_host(void)
> > >       }
> > >       /* Version and features check */
> > > -    if (!ufd_version_check(ufd)) {
> > > +    if (!ufd_version_check(ufd, mis)) {
> > >           goto out;
> > >       }
> > > @@ -512,7 +512,7 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
> > >        * Although the host check already tested the API, we need to
> > >        * do the check again as an ABI handshake on the new fd.
> > >        */
> > > -    if (!ufd_version_check(mis->userfault_fd)) {
> > > +    if (!ufd_version_check(mis->userfault_fd, mis)) {
> > >           return -1;
> > >       }
> > > @@ -650,7 +650,7 @@ void *postcopy_get_tmp_page(MigrationIncomingState *mis)
> > >   #else
> > >   /* No target OS support, stubs just fail */
> > > -bool postcopy_ram_supported_by_host(void)
> > > +bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
> > >   {
> > >       error_report("%s: No OS support", __func__);
> > >       return false;
> > > diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
> > > index 52d51e8..587a8b8 100644
> > > --- a/migration/postcopy-ram.h
> > > +++ b/migration/postcopy-ram.h
> > > @@ -14,7 +14,7 @@
> > >   #define QEMU_POSTCOPY_RAM_H
> > >   /* Return true if the host supports everything we need to do postcopy-ram */
> > > -bool postcopy_ram_supported_by_host(void);
> > > +bool postcopy_ram_supported_by_host(MigrationIncomingState *mis);
> > >   /*
> > >    * Make all of RAM sensitive to accesses to areas that haven't yet been written
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index f5e8194..61a6df7 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -1356,7 +1356,7 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis)
> > >           return -1;
> > >       }
> > > -    if (!postcopy_ram_supported_by_host()) {
> > > +    if (!postcopy_ram_supported_by_host(mis)) {
> > >           postcopy_state_set(POSTCOPY_INCOMING_NONE);
> > >           return -1;
> > >       }
> > > -- 
> > > 1.8.3.1
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> > 
> > 
> 
> -- 
> Best regards,
> Alexey Perevalov
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side
  2017-06-01 10:57       ` Dr. David Alan Gilbert
@ 2017-06-07  7:34         ` Alexey Perevalov
  0 siblings, 0 replies; 39+ messages in thread
From: Alexey Perevalov @ 2017-06-07  7:34 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: qemu-devel, i.maximets, peterx, eblake

On 06/01/2017 01:57 PM, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
>> This patch provides per-vCPU blocktime calculation,
>> both as a sum and as an overlapped value across all vCPUs.
>>
>> This approach was suggested by Peter Xu as an improvement over the
>> previous approach, in which QEMU kept a tree with the faulted page
>> address and a cpu bitmask in it. Now QEMU keeps an array with the
>> faulted page address as the value and the vCPU as the index. That
>> helps to find the proper vCPU at UFFD_COPY time. It also keeps a
>> per-vCPU blocktime list (which can be traced via page_fault_addr).
>>
>> Blocktime will not be calculated if the postcopy_blocktime field of
>> MigrationIncomingState wasn't initialized.
>>
>> Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
> <snip>
>
>> +    if (dc->vcpu_addr[cpu] == 0) {
>> +        atomic_inc(&dc->smp_cpus_down);
>> +    }
>> +
>> +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
> I was wondering if this could be done with atomic_cmpxchg with old=0,
> but the behaviour would be different in the case where vcpu_addr[cpu]
> wasn't zero  or the 'addr'; so I think allowing it to cope with that
> case seems better.

atomic_xchg__nocheck isn't atomic_cmpxchg; it is based on
__atomic_exchange_n (from the reference: "It writes val into *ptr, and
returns the previous contents of *ptr"), so I leave it as is.

>
> Dave
>
>> +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
>> +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
>> +
>> +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
>> +            cpu);
>> +}
>> +
>> +static void mark_postcopy_blocktime_end(uint64_t addr)
>> +{
>> +    MigrationIncomingState *mis = migration_incoming_get_current();
>> +    PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
>> +    int i, affected_cpu = 0;
>> +    int64_t now_ms;
>> +    bool vcpu_total_blocktime = false;
>> +    unsigned long int nr_bit;
>> +
>> +    if (!dc) {
>> +        return;
>> +    }
>> +    /* mark that page as copied */
>> +    nr_bit = get_copied_bit_offset(addr);
>> +    set_bit_atomic(nr_bit, mis->copied_pages);
>> +
>> +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>> +
>> +    /* lookup the cpu, to clear it;
>> +     * this algorithm looks straightforward, but it's not
>> +     * optimal: a more optimal algorithm would keep a tree or hash
>> +     * where the key is an address and the value is a list of  */
>> +    for (i = 0; i < smp_cpus; i++) {
>> +        uint64_t vcpu_blocktime = 0;
>> +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
>> +            continue;
>> +        }
>> +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
>> +        vcpu_blocktime = now_ms -
>> +            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
>> +        affected_cpu += 1;
>> +        /* we need to know whether mark_postcopy_end was due to a
>> +         * faulted page; the other possible case is a prefetched
>> +         * page, and in that case we shouldn't be here */
>> +        if (!vcpu_total_blocktime &&
>> +            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
>> +            vcpu_total_blocktime = true;
>> +        }
>> +        /* continue cycle, due to one page could affect several vCPUs */
>> +        dc->vcpu_blocktime[i] += vcpu_blocktime;
>> +    }
>> +
>> +    atomic_sub(&dc->smp_cpus_down, affected_cpu);
>> +    if (vcpu_total_blocktime) {
>> +        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
>> +    }
>> +    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
>> +}
>> +
>>   /*
>>    * Handle faults detected by the USERFAULT markings
>>    */
>> @@ -654,8 +750,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
>>           rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
>>           trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
>>                                                   qemu_ram_get_idstr(rb),
>> -                                                rb_offset);
>> +                                                rb_offset,
>> +                                                msg.arg.pagefault.feat.ptid);
>>   
>> +        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
>> +                msg.arg.pagefault.feat.ptid, rb);
>>           /*
>>            * Send the request to the source - we want to request one
>>            * of our host page sizes (which is >= TPS)
>> @@ -750,6 +849,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>>   
>>           return -e;
>>       }
>> +    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
>>   
>>       trace_postcopy_place_page(host);
>>       return 0;
>> diff --git a/migration/trace-events b/migration/trace-events
>> index 5b8ccf3..7bdadbb 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -112,6 +112,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>>   process_incoming_migration_co_postcopy_end_main(void) ""
>>   migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
>>   migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname)  "ioc=%p ioctype=%s hostname=%s"
>> +mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
>> +mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
>>   
>>   # migration/rdma.c
>>   qemu_rdma_accept_incoming_migration(void) ""
>> @@ -188,7 +190,7 @@ postcopy_ram_enable_notify(void) ""
>>   postcopy_ram_fault_thread_entry(void) ""
>>   postcopy_ram_fault_thread_exit(void) ""
>>   postcopy_ram_fault_thread_quit(void) ""
>> -postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
>> +postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
>>   postcopy_ram_incoming_cleanup_closeuf(void) ""
>>   postcopy_ram_incoming_cleanup_entry(void) ""
>>   postcopy_ram_incoming_cleanup_exit(void) ""
>> @@ -197,6 +199,7 @@ save_xbzrle_page_skipping(void) ""
>>   save_xbzrle_page_overflow(void) ""
>>   ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
>>   ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
>> +get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
>>   
>>   # migration/exec.c
>>   migration_exec_outgoing(const char *cmd) "cmd=%s"
>> -- 
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>
>

-- 
Best regards,
Alexey Perevalov


Thread overview: 39+ messages
     [not found] <CGME20170523113120eucas1p2032ace2121aa8627067b6d7f03fbf482@eucas1p2.samsung.com>
2017-05-23 11:31 ` [Qemu-devel] [PATCH V6 00/10] calculate blocktime for postcopy live migration Alexey Perevalov
     [not found]   ` <CGME20170523113126eucas1p163c64fe50bd44026fdf4d36716bfc4f2@eucas1p1.samsung.com>
2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 01/10] userfault: add pid into uffd_msg & update UFFD_FEATURE_* Alexey Perevalov
     [not found]   ` <CGME20170523113127eucas1p22dba0fddcc9bcf70e554bf659272f947@eucas1p2.samsung.com>
2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 02/10] migration: pass MigrationIncomingState* into migration check functions Alexey Perevalov
2017-05-31 17:54       ` Dr. David Alan Gilbert
2017-06-05  5:59         ` Alexey Perevalov
2017-06-05  9:15           ` Dr. David Alan Gilbert
     [not found]   ` <CGME20170523113127eucas1p1b6cebc0fc51a056b8c1a983d375f1012@eucas1p1.samsung.com>
2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 03/10] migration: fix hardcoded function name in error report Alexey Perevalov
     [not found]   ` <CGME20170523113128eucas1p17a89f8cb47d5731c50f94c3218ba155f@eucas1p1.samsung.com>
2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 04/10] migration: split ufd_version_check onto receive/request features part Alexey Perevalov
2017-05-24  2:36       ` Peter Xu
2017-05-24  6:45         ` Alexey
2017-05-24 11:33           ` Peter Xu
2017-05-24 11:47             ` Alexey Perevalov
2017-05-31 19:12               ` Dr. David Alan Gilbert
     [not found]   ` <CGME20170523113129eucas1p2146e1018e660eed0b319cbe22adc2712@eucas1p2.samsung.com>
2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 05/10] migration: introduce postcopy-blocktime capability Alexey Perevalov
     [not found]   ` <CGME20170523113129eucas1p179082f20f41d1069f5fbd0f37535fae9@eucas1p1.samsung.com>
2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 06/10] migration: add postcopy blocktime ctx into MigrationIncomingState Alexey Perevalov
2017-05-24  3:31       ` Peter Xu
2017-06-05  6:31         ` Alexey Perevalov
     [not found]   ` <CGME20170523113130eucas1p1babac9d8659c10abe22ddc7d5b9526ab@eucas1p1.samsung.com>
2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 07/10] migration: add bitmap for copied page Alexey Perevalov
2017-05-24  6:57       ` Peter Xu
2017-05-24  7:56         ` Alexey
2017-05-24 12:01           ` Peter Xu
2017-05-24 12:16             ` Alexey Perevalov
2017-05-24 23:30               ` Peter Xu
2017-05-25  6:28                 ` Alexey Perevalov
2017-05-25  7:25                   ` Peter Xu
2017-05-31 19:25                     ` Dr. David Alan Gilbert
     [not found]   ` <CGME20170523113131eucas1p24a041de6004237e437f97a24340507e2@eucas1p2.samsung.com>
2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 08/10] migration: calculate vCPU blocktime on dst side Alexey Perevalov
2017-05-24  7:53       ` Peter Xu
2017-05-24  9:37         ` Alexey
2017-05-24 11:22           ` Peter Xu
2017-05-24 11:37             ` Alexey Perevalov
2017-06-01 10:07             ` Dr. David Alan Gilbert
2017-06-01 10:50           ` Dr. David Alan Gilbert
2017-06-01 10:57       ` Dr. David Alan Gilbert
2017-06-07  7:34         ` Alexey Perevalov
     [not found]   ` <CGME20170523113131eucas1p1ec4e059c13ce977e3a3872c343e6b858@eucas1p1.samsung.com>
2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 09/10] migration: add postcopy total blocktime into query-migrate Alexey Perevalov
2017-06-01 11:35       ` Dr. David Alan Gilbert
     [not found]   ` <CGME20170523113132eucas1p19143aceccbb30a0051635cddcf376bb6@eucas1p1.samsung.com>
2017-05-23 11:31     ` [Qemu-devel] [PATCH V6 10/10] migration: postcopy_blocktime documentation Alexey Perevalov
2017-06-01 11:37       ` Dr. David Alan Gilbert
