* [Qemu-devel] [RFC PATCH v5 0/3] Throttle-down guest to help with live migration convergence.
@ 2013-05-09 19:43 Chegu Vinod
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 1/3] Introduce async_run_on_cpu() Chegu Vinod
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Chegu Vinod @ 2013-05-09 19:43 UTC (permalink / raw)
  To: eblake, anthony, quintela, owasserm, pbonzini, qemu-devel; +Cc: Chegu Vinod

Busy enterprise workloads hosted on large VMs tend to dirty memory
faster than it can be transferred by live guest migration. Despite
some good recent improvements (and the use of dedicated 10Gig NICs
between hosts), live migration does not converge.

If a user chooses to force convergence of their migration via the new
"auto-converge" migration capability, this change auto-detects the lack
of convergence and slows the workload down by explicitly disallowing
the VCPUs from spending much time in the VM context.

The migration thread can then catch up, which eventually leads to
convergence in some "deterministic" amount of time. This does impact
the performance of all the VCPUs, but in my observation it lasts only
for a short duration: the migration enters stage 3 (the downtime
phase) soon afterwards. No external trigger is required.
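
For reference, a minimal usage sketch from the HMP monitor (the capability
name is the one proposed in patch 2/3; the speed/downtime values and the
destination URI below are just placeholders):

  (qemu) migrate_set_capability auto-converge on
  (qemu) migrate_set_speed 20G
  (qemu) migrate_set_downtime 4
  (qemu) migrate -d tcp:<dest-host>:4444
  (qemu) info migrate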

Thanks to Juan and Paolo for their useful suggestions.

---

Changes from v4:
- incorporated feedback from Paolo.
- split into 3 patches.

Changes from v3:
- incorporated feedback from Paolo and Eric
- rebased to latest qemu.git

Changes from v2:
- incorporated feedback from Orit, Juan and Eric
- stop the throttling thread at the start of stage 3
- rebased to latest qemu.git

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori &
                                                Eric Blake.

Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>

Chegu Vinod (3):
 Introduce async_run_on_cpu()
 Add 'auto-converge' migration capability
 Force auto-convergence of live migration

 arch_init.c                   |   68 +++++++++++++++++++++++++++++++++++++++++
 cpus.c                        |   29 +++++++++++++++++
 include/migration/migration.h |    6 +++
 include/qemu-common.h         |    1 +
 include/qom/cpu.h             |   10 ++++++
 migration.c                   |   10 ++++++
 qapi-schema.json              |    5 ++-
 7 files changed, 128 insertions(+), 1 deletions(-)

* [Qemu-devel] [RFC PATCH v5 1/3] Introduce async_run_on_cpu()
  2013-05-09 19:43 [Qemu-devel] [RFC PATCH v5 0/3] Throttle-down guest to help with live migration convergence Chegu Vinod
@ 2013-05-09 19:43 ` Chegu Vinod
  2013-05-10  7:43   ` Paolo Bonzini
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 2/3] Add 'auto-converge' migration capability Chegu Vinod
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration Chegu Vinod
  2 siblings, 1 reply; 21+ messages in thread
From: Chegu Vinod @ 2013-05-09 19:43 UTC (permalink / raw)
  To: eblake, anthony, quintela, owasserm, pbonzini, qemu-devel; +Cc: Chegu Vinod

 Introduce an asynchronous version of run_on_cpu(), i.e. the caller
 doesn't have to block until the callback routine finishes execution
 on the target vcpu.
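
 As an illustration (not part of this patch), a caller would typically hand
 the callback a heap-allocated payload that the callback itself frees, since
 async_run_on_cpu() returns before the work item runs; the function names
 below are hypothetical:

     static void do_work(void *data)
     {
         /* runs later in the target vcpu thread, with the BQL held */
         g_free(data);
     }

     static void schedule_work(CPUState *cpu)
     {
         int *payload = g_malloc(sizeof(*payload));

         *payload = 42;
         async_run_on_cpu(cpu, do_work, payload); /* returns immediately */
     }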

Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
---
 cpus.c                |   29 +++++++++++++++++++++++++++++
 include/qemu-common.h |    1 +
 include/qom/cpu.h     |   10 ++++++++++
 3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index c232265..8cd4eab 100644
--- a/cpus.c
+++ b/cpus.c
@@ -653,6 +653,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
 
     wi.func = func;
     wi.data = data;
+    wi.free = false;
     if (cpu->queued_work_first == NULL) {
         cpu->queued_work_first = &wi;
     } else {
@@ -671,6 +672,31 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     }
 }
 
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
+{
+    struct qemu_work_item *wi;
+
+    if (qemu_cpu_is_self(cpu)) {
+        func(data);
+        return;
+    }
+
+    wi = g_malloc0(sizeof(struct qemu_work_item));
+    wi->func = func;
+    wi->data = data;
+    wi->free = true;
+    if (cpu->queued_work_first == NULL) {
+        cpu->queued_work_first = wi;
+    } else {
+        cpu->queued_work_last->next = wi;
+    }
+    cpu->queued_work_last = wi;
+    wi->next = NULL;
+    wi->done = false;
+
+    qemu_cpu_kick(cpu);
+}
+
 static void flush_queued_work(CPUState *cpu)
 {
     struct qemu_work_item *wi;
@@ -683,6 +709,9 @@ static void flush_queued_work(CPUState *cpu)
         cpu->queued_work_first = wi->next;
         wi->func(wi->data);
         wi->done = true;
+        if (wi->free) {
+            g_free(wi);
+        }
     }
     cpu->queued_work_last = NULL;
     qemu_cond_broadcast(&qemu_work_cond);
diff --git a/include/qemu-common.h b/include/qemu-common.h
index b399d85..bad6e1f 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -286,6 +286,7 @@ struct qemu_work_item {
     void (*func)(void *data);
     void *data;
     int done;
+    bool free;
 };
 
 #ifdef CONFIG_USER_ONLY
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 7cd9442..46465e9 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -265,6 +265,16 @@ bool cpu_is_stopped(CPUState *cpu);
 void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 
 /**
+ * async_run_on_cpu:
+ * @cpu: The vCPU to run on.
+ * @func: The function to be executed.
+ * @data: Data to pass to the function.
+ *
+ * Schedules the function @func for execution on the vCPU @cpu asynchronously.
+ */
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
+
+/**
  * qemu_for_each_cpu:
  * @func: The function to be executed.
  * @data: Data to pass to the function.
-- 
1.7.1

* [Qemu-devel] [RFC PATCH v5 2/3] Add 'auto-converge' migration capability
  2013-05-09 19:43 [Qemu-devel] [RFC PATCH v5 0/3] Throttle-down guest to help with live migration convergence Chegu Vinod
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 1/3] Introduce async_run_on_cpu() Chegu Vinod
@ 2013-05-09 19:43 ` Chegu Vinod
  2013-05-10  7:43   ` Paolo Bonzini
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration Chegu Vinod
  2 siblings, 1 reply; 21+ messages in thread
From: Chegu Vinod @ 2013-05-09 19:43 UTC (permalink / raw)
  To: eblake, anthony, quintela, owasserm, pbonzini, qemu-devel; +Cc: Chegu Vinod

 The auto-converge migration capability allows the user to choose whether
 the live migration sequence should automatically detect a lack of
 convergence and force it.
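
 For illustration only (not part of this patch), the capability would then
 be enabled over QMP with the existing migrate-set-capabilities command:

     { "execute": "migrate-set-capabilities",
       "arguments": { "capabilities": [
           { "capability": "auto-converge", "state": true } ] } }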

Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
---
 include/migration/migration.h |    2 ++
 migration.c                   |    9 +++++++++
 qapi-schema.json              |    5 ++++-
 3 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index e2acec6..ace91b0 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -127,4 +127,6 @@ int migrate_use_xbzrle(void);
 int64_t migrate_xbzrle_cache_size(void);
 
 int64_t xbzrle_cache_resize(int64_t new_size);
+
+bool migrate_auto_converge(void);
 #endif
diff --git a/migration.c b/migration.c
index 3eb0fad..570cee5 100644
--- a/migration.c
+++ b/migration.c
@@ -474,6 +474,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
     max_downtime = (uint64_t)value;
 }
 
+bool migrate_auto_converge(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
+}
+
 int migrate_use_xbzrle(void)
 {
     MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index 199744a..b33839c 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -602,10 +602,13 @@
 #          This feature allows us to minimize migration traffic for certain work
 #          loads, by sending compressed difference of the pages
 #
+# @auto-converge: Migration supports automatic throttling down of guest
+#          to force convergence. (since 1.6)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle'] }
+  'data': ['xbzrle', 'auto-converge'] }
 
 ##
 # @MigrationCapabilityStatus
-- 
1.7.1

* [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
  2013-05-09 19:43 [Qemu-devel] [RFC PATCH v5 0/3] Throttle-down guest to help with live migration convergence Chegu Vinod
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 1/3] Introduce async_run_on_cpu() Chegu Vinod
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 2/3] Add 'auto-converge' migration capability Chegu Vinod
@ 2013-05-09 19:43 ` Chegu Vinod
  2013-05-09 20:05   ` Igor Mammedov
                     ` (3 more replies)
  2 siblings, 4 replies; 21+ messages in thread
From: Chegu Vinod @ 2013-05-09 19:43 UTC (permalink / raw)
  To: eblake, anthony, quintela, owasserm, pbonzini, qemu-devel; +Cc: Chegu Vinod

 If a user chooses to turn on the auto-converge migration capability
 these changes detect the lack of convergence and throttle down the
 guest. i.e. force the VCPUs out of the guest for some duration
 and let the migration thread catchup and help converge.
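
 As a rough worked illustration of the trigger condition used in this patch
 (dirtied bytes in the last ~1 second sample vs. half the bytes transferred
 in that sample): with 4 KiB pages, dirtying ~67,000 pages (~274 MB) while
 only ~300 MB were transferred gives 274 MB > 300 MB / 2, so the sample
 counts as "too dirty"; after more than 5 such samples the throttle is
 switched on.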

 Verified the convergence using the following :
 - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
 - OLTP like workload running on a 80VCPU/512G guest (~80% busy)

 Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
 migrate downtime set to 4seconds).

 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: off  <----
 Migration status: active
 total time: 1487503 milliseconds
 expected downtime: 519 milliseconds
 transferred ram: 383749347 kbytes
 remaining ram: 2753372 kbytes
 total ram: 268444224 kbytes
 duplicate: 65461532 pages
 skipped: 64901568 pages
 normal: 95750218 pages
 normal bytes: 383000872 kbytes
 dirty pages rate: 67551 pages

 ---
 
 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: on   <----
 Migration status: completed
 total time: 241161 milliseconds
 downtime: 6373 milliseconds
 transferred ram: 28235307 kbytes
 remaining ram: 0 kbytes
 total ram: 268444224 kbytes
 duplicate: 64946416 pages
 skipped: 64903523 pages
 normal: 7044971 pages
 normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
---
 arch_init.c                   |   68 +++++++++++++++++++++++++++++++++++++++++
 include/migration/migration.h |    4 ++
 migration.c                   |    1 +
 3 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 49c5dc2..29788d6 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -49,6 +49,7 @@
 #include "trace.h"
 #include "exec/cpu-all.h"
 #include "hw/acpi/acpi.h"
+#include "sysemu/cpus.h"
 
 #ifdef DEBUG_ARCH_INIT
 #define DPRINTF(fmt, ...) \
@@ -104,6 +105,8 @@ int graphic_depth = 15;
 #endif
 
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
+
 
 /***********************************************************/
 /* ram save/restore */
@@ -378,8 +381,15 @@ static void migration_bitmap_sync(void)
     uint64_t num_dirty_pages_init = migration_dirty_pages;
     MigrationState *s = migrate_get_current();
     static int64_t start_time;
+    static int64_t bytes_xfer_prev;
     static int64_t num_dirty_pages_period;
     int64_t end_time;
+    int64_t bytes_xfer_now;
+    static int dirty_rate_high_cnt;
+
+    if (!bytes_xfer_prev) {
+        bytes_xfer_prev = ram_bytes_transferred();
+    }
 
     if (!start_time) {
         start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +414,23 @@ static void migration_bitmap_sync(void)
 
     /* more than 1 second = 1000 millisecons */
     if (end_time > start_time + 1000) {
+        if (migrate_auto_converge()) {
+            /* The following detection logic can be refined later. For now:
+               Check to see if the dirtied bytes is 50% more than the approx.
+               amount of bytes that just got transferred since the last time we
+               were in this routine. If that happens N times (for now N==5)
+               we turn on the throttle down logic */
+            bytes_xfer_now = ram_bytes_transferred();
+            if (s->dirty_pages_rate &&
+                ((num_dirty_pages_period*TARGET_PAGE_SIZE) >
+                ((bytes_xfer_now - bytes_xfer_prev)/2))) {
+                if (dirty_rate_high_cnt++ > 5) {
+                    DPRINTF("Unable to converge. Throtting down guest\n");
+                    mig_throttle_on = true;
+                }
+             }
+             bytes_xfer_prev = bytes_xfer_now;
+        }
         s->dirty_pages_rate = num_dirty_pages_period * 1000
             / (end_time - start_time);
         s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -496,6 +523,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
     return bytes_sent;
 }
 
+bool throttling_needed(void)
+{
+    if (!migrate_auto_converge()) {
+        return false;
+    }
+
+    return mig_throttle_on;
+}
+
 static uint64_t bytes_transferred;
 
 static ram_addr_t ram_save_remaining(void)
@@ -1098,3 +1134,35 @@ TargetInfo *qmp_query_target(Error **errp)
 
     return info;
 }
+
+static void mig_delay_vcpu(void)
+{
+    qemu_mutex_unlock_iothread();
+    g_usleep(50*1000);
+    qemu_mutex_lock_iothread();
+}
+
+/* Stub used for getting the vcpu out of VM and into qemu via
+   run_on_cpu()*/
+static void mig_kick_cpu(void *opq)
+{
+    mig_delay_vcpu();
+    return;
+}
+
+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catchup.
+   Workload will experience a performance drop.
+*/
+void migration_throttle_down(void)
+{
+    if (throttling_needed()) {
+        CPUArchState *penv = first_cpu;
+        while (penv) {
+            qemu_mutex_lock_iothread();
+            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
+            qemu_mutex_unlock_iothread();
+            penv = penv->next_cpu;
+        }
+    }
+}
diff --git a/include/migration/migration.h b/include/migration/migration.h
index ace91b0..68b65c6 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -129,4 +129,8 @@ int64_t migrate_xbzrle_cache_size(void);
 int64_t xbzrle_cache_resize(int64_t new_size);
 
 bool migrate_auto_converge(void);
+bool throttling_needed(void);
+void stop_throttling(void);
+void migration_throttle_down(void);
+
 #endif
diff --git a/migration.c b/migration.c
index 570cee5..d3673a6 100644
--- a/migration.c
+++ b/migration.c
@@ -526,6 +526,7 @@ static void *migration_thread(void *opaque)
             DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
             if (pending_size && pending_size >= max_size) {
                 qemu_savevm_state_iterate(s->file);
+                migration_throttle_down();
             } else {
                 DPRINTF("done iterating\n");
                 qemu_mutex_lock_iothread();
-- 
1.7.1

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration Chegu Vinod
@ 2013-05-09 20:05   ` Igor Mammedov
  2013-05-09 22:26     ` Chegu Vinod
  2013-05-09 20:24   ` Igor Mammedov
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Igor Mammedov @ 2013-05-09 20:05 UTC (permalink / raw)
  To: Chegu Vinod; +Cc: quintela, qemu-devel, owasserm, anthony, pbonzini

On Thu,  9 May 2013 12:43:20 -0700
Chegu Vinod <chegu_vinod@hp.com> wrote:

>  If a user chooses to turn on the auto-converge migration capability
>  these changes detect the lack of convergence and throttle down the
>  guest. i.e. force the VCPUs out of the guest for some duration
>  and let the migration thread catchup and help converge.
> 
[...]
> +void migration_throttle_down(void)
> +{
> +    if (throttling_needed()) {
> +        CPUArchState *penv = first_cpu;
> +        while (penv) {
> +            qemu_mutex_lock_iothread();
> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
> +            qemu_mutex_unlock_iothread();
> +            penv = penv->next_cpu;
could you replace the open-coded loop with qemu_for_each_cpu()?

> +        }
> +    }
> +}
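
A hypothetical sketch of that shape, assuming qemu_for_each_cpu(func, data)
invokes func(cpu, data) once for every CPU:

    static void mig_kick_one_cpu(CPUState *cpu, void *data)
    {
        async_run_on_cpu(cpu, mig_kick_cpu, NULL);
    }

    qemu_mutex_lock_iothread();
    qemu_for_each_cpu(mig_kick_one_cpu, NULL);
    qemu_mutex_unlock_iothread();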

-- 
Regards,
  Igor

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration Chegu Vinod
  2013-05-09 20:05   ` Igor Mammedov
@ 2013-05-09 20:24   ` Igor Mammedov
  2013-05-09 23:00     ` Chegu Vinod
  2013-05-10  7:41   ` Paolo Bonzini
  2013-05-10 13:07   ` Anthony Liguori
  3 siblings, 1 reply; 21+ messages in thread
From: Igor Mammedov @ 2013-05-09 20:24 UTC (permalink / raw)
  To: Chegu Vinod; +Cc: quintela, qemu-devel, owasserm, anthony, pbonzini

On Thu,  9 May 2013 12:43:20 -0700
Chegu Vinod <chegu_vinod@hp.com> wrote:

>  If a user chooses to turn on the auto-converge migration capability
>  these changes detect the lack of convergence and throttle down the
>  guest. i.e. force the VCPUs out of the guest for some duration
>  and let the migration thread catchup and help converge.
> 
[...]
> +
> +static void mig_delay_vcpu(void)
> +{
> +    qemu_mutex_unlock_iothread();
> +    g_usleep(50*1000);
> +    qemu_mutex_lock_iothread();
> +}
> +
> +/* Stub used for getting the vcpu out of VM and into qemu via
> +   run_on_cpu()*/
> +static void mig_kick_cpu(void *opq)
> +{
> +    mig_delay_vcpu();
> +    return;
> +}
> +
> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
> +   much time in the VM. The migration thread will try to catchup.
> +   Workload will experience a performance drop.
> +*/
> +void migration_throttle_down(void)
> +{
> +    if (throttling_needed()) {
> +        CPUArchState *penv = first_cpu;
> +        while (penv) {
> +            qemu_mutex_lock_iothread();
Locking it here and then unlocking it inside of the queued work doesn't look nice.
What exactly are you protecting with this lock?


> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
> +            qemu_mutex_unlock_iothread();
> +            penv = penv->next_cpu;
> +        }
> +    }
> +}



-- 
Regards,
  Igor

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
  2013-05-09 20:05   ` Igor Mammedov
@ 2013-05-09 22:26     ` Chegu Vinod
  0 siblings, 0 replies; 21+ messages in thread
From: Chegu Vinod @ 2013-05-09 22:26 UTC (permalink / raw)
  To: Igor Mammedov; +Cc: quintela, qemu-devel, owasserm, anthony, pbonzini

On 5/9/2013 1:05 PM, Igor Mammedov wrote:
> On Thu,  9 May 2013 12:43:20 -0700
> Chegu Vinod <chegu_vinod@hp.com> wrote:
>
>>   If a user chooses to turn on the auto-converge migration capability
>>   these changes detect the lack of convergence and throttle down the
>>   guest. i.e. force the VCPUs out of the guest for some duration
>>   and let the migration thread catchup and help converge.
>>
> [...]
>> +void migration_throttle_down(void)
>> +{
>> +    if (throttling_needed()) {
>> +        CPUArchState *penv = first_cpu;
>> +        while (penv) {
>> +            qemu_mutex_lock_iothread();
>> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
>> +            qemu_mutex_unlock_iothread();
>> +            penv = penv->next_cpu;
> could you replace the open-coded loop with qemu_for_each_cpu()?

Yes will try to replace it in the next version.
Vinod
>
>> +        }
>> +    }
>> +}

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
  2013-05-09 20:24   ` Igor Mammedov
@ 2013-05-09 23:00     ` Chegu Vinod
  2013-05-10  7:47       ` Paolo Bonzini
  0 siblings, 1 reply; 21+ messages in thread
From: Chegu Vinod @ 2013-05-09 23:00 UTC (permalink / raw)
  To: Igor Mammedov; +Cc: quintela, qemu-devel, owasserm, anthony, pbonzini

On 5/9/2013 1:24 PM, Igor Mammedov wrote:
> On Thu,  9 May 2013 12:43:20 -0700
> Chegu Vinod <chegu_vinod@hp.com> wrote:
>
>>   If a user chooses to turn on the auto-converge migration capability
>>   these changes detect the lack of convergence and throttle down the
>>   guest. i.e. force the VCPUs out of the guest for some duration
>>   and let the migration thread catchup and help converge.
>>
> [...]
>> +
>> +static void mig_delay_vcpu(void)
>> +{
>> +    qemu_mutex_unlock_iothread();
>> +    g_usleep(50*1000);
>> +    qemu_mutex_lock_iothread();
>> +}
>> +
>> +/* Stub used for getting the vcpu out of VM and into qemu via
>> +   run_on_cpu()*/
>> +static void mig_kick_cpu(void *opq)
>> +{
>> +    mig_delay_vcpu();
>> +    return;
>> +}
>> +
>> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
>> +   much time in the VM. The migration thread will try to catchup.
>> +   Workload will experience a performance drop.
>> +*/
>> +void migration_throttle_down(void)
>> +{
>> +    if (throttling_needed()) {
>> +        CPUArchState *penv = first_cpu;
>> +        while (penv) {
>> +            qemu_mutex_lock_iothread();
> Locking it here and then unlocking it inside of the queued work doesn't look nice.
Yes...but see below.
> What exactly are you protecting with this lock?
It was my understanding that the BQL is supposed to be held when the vcpu
threads start entering and executing in the qemu context (as qemu is not
MP safe). Is that still true?

In this specific use case I was concerned about the fraction of the time
when a given vcpu thread is in the qemu context but not executing the
callback routine... and was hence holding the BQL. Holding the BQL and
g_usleep'ing is not only bad but would also slow down the migration
thread... hence the "doesn't look nice" stuff :(

For this specific use case, if it's not really required to even bother
with the BQL then please do let me know.

Also please refer to version 3 of my patch... I was doing a g_usleep() in
kvm_cpu_exec() and was not messing much with the BQL, but that was
deemed not a good thing either.

Thanks
Vinod

>
>> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
>> +            qemu_mutex_unlock_iothread();
>> +            penv = penv->next_cpu;
>> +        }
>> +    }
>> +}
>
>

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration Chegu Vinod
  2013-05-09 20:05   ` Igor Mammedov
  2013-05-09 20:24   ` Igor Mammedov
@ 2013-05-10  7:41   ` Paolo Bonzini
  2013-05-10 13:07   ` Anthony Liguori
  3 siblings, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2013-05-10  7:41 UTC (permalink / raw)
  To: Chegu Vinod; +Cc: owasserm, qemu-devel, anthony, quintela

Il 09/05/2013 21:43, Chegu Vinod ha scritto:
>  If a user chooses to turn on the auto-converge migration capability
>  these changes detect the lack of convergence and throttle down the
>  guest. i.e. force the VCPUs out of the guest for some duration
>  and let the migration thread catchup and help converge.
> 
>  Verified the convergence using the following :
>  - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
>  - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
> 
>  Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
>  migrate downtime set to 4seconds).
> 
>  (qemu) info migrate
>  capabilities: xbzrle: off auto-converge: off  <----
>  Migration status: active
>  total time: 1487503 milliseconds
>  expected downtime: 519 milliseconds
>  transferred ram: 383749347 kbytes
>  remaining ram: 2753372 kbytes
>  total ram: 268444224 kbytes
>  duplicate: 65461532 pages
>  skipped: 64901568 pages
>  normal: 95750218 pages
>  normal bytes: 383000872 kbytes
>  dirty pages rate: 67551 pages
> 
>  ---
>  
>  (qemu) info migrate
>  capabilities: xbzrle: off auto-converge: on   <----
>  Migration status: completed
>  total time: 241161 milliseconds
>  downtime: 6373 milliseconds
>  transferred ram: 28235307 kbytes
>  remaining ram: 0 kbytes
>  total ram: 268444224 kbytes
>  duplicate: 64946416 pages
>  skipped: 64903523 pages
>  normal: 7044971 pages
>  normal bytes: 28179884 kbytes

Almost there, and certainly much better than the previous patches.

Just a couple of comments.

> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
> ---
>  arch_init.c                   |   68 +++++++++++++++++++++++++++++++++++++++++
>  include/migration/migration.h |    4 ++
>  migration.c                   |    1 +
>  3 files changed, 73 insertions(+), 0 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index 49c5dc2..29788d6 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -49,6 +49,7 @@
>  #include "trace.h"
>  #include "exec/cpu-all.h"
>  #include "hw/acpi/acpi.h"
> +#include "sysemu/cpus.h"
>  
>  #ifdef DEBUG_ARCH_INIT
>  #define DPRINTF(fmt, ...) \
> @@ -104,6 +105,8 @@ int graphic_depth = 15;
>  #endif
>  
>  const uint32_t arch_type = QEMU_ARCH;
> +static bool mig_throttle_on;
> +
>  
>  /***********************************************************/
>  /* ram save/restore */
> @@ -378,8 +381,15 @@ static void migration_bitmap_sync(void)
>      uint64_t num_dirty_pages_init = migration_dirty_pages;
>      MigrationState *s = migrate_get_current();
>      static int64_t start_time;
> +    static int64_t bytes_xfer_prev;
>      static int64_t num_dirty_pages_period;
>      int64_t end_time;
> +    int64_t bytes_xfer_now;
> +    static int dirty_rate_high_cnt;
> +
> +    if (!bytes_xfer_prev) {
> +        bytes_xfer_prev = ram_bytes_transferred();
> +    }
>  
>      if (!start_time) {
>          start_time = qemu_get_clock_ms(rt_clock);
> @@ -404,6 +414,23 @@ static void migration_bitmap_sync(void)
>  
>      /* more than 1 second = 1000 millisecons */
>      if (end_time > start_time + 1000) {
> +        if (migrate_auto_converge()) {
> +            /* The following detection logic can be refined later. For now:
> +               Check to see if the dirtied bytes is 50% more than the approx.
> +               amount of bytes that just got transferred since the last time we
> +               were in this routine. If that happens N times (for now N==5)
> +               we turn on the throttle down logic */
> +            bytes_xfer_now = ram_bytes_transferred();
> +            if (s->dirty_pages_rate &&
> +                ((num_dirty_pages_period*TARGET_PAGE_SIZE) >
> +                ((bytes_xfer_now - bytes_xfer_prev)/2))) {
> +                if (dirty_rate_high_cnt++ > 5) {
> +                    DPRINTF("Unable to converge. Throtting down guest\n");
> +                    mig_throttle_on = true;
> +                }
> +             }
> +             bytes_xfer_prev = bytes_xfer_now;
> +        }
>          s->dirty_pages_rate = num_dirty_pages_period * 1000
>              / (end_time - start_time);
>          s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
> @@ -496,6 +523,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>      return bytes_sent;
>  }
>  
> +bool throttling_needed(void)
> +{
> +    if (!migrate_auto_converge()) {
> +        return false;
> +    }

Also return false if !runstate_is_running() please.
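i.e., hypothetically:

    if (!migrate_auto_converge() || !runstate_is_running()) {
        return false;
    }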

> +    return mig_throttle_on;
> +}
> +
>  static uint64_t bytes_transferred;
>  
>  static ram_addr_t ram_save_remaining(void)
> @@ -1098,3 +1134,35 @@ TargetInfo *qmp_query_target(Error **errp)
>  
>      return info;
>  }
> +
> +static void mig_delay_vcpu(void)
> +{
> +    qemu_mutex_unlock_iothread();
> +    g_usleep(50*1000);
> +    qemu_mutex_lock_iothread();
> +}
> +
> +/* Stub used for getting the vcpu out of VM and into qemu via
> +   run_on_cpu()*/
> +static void mig_kick_cpu(void *opq)
> +{
> +    mig_delay_vcpu();
> +    return;
> +}

Just inline mig_delay_vcpu in here, delete the extra return, and call
this function mig_delay_vcpu
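That is, hypothetically, the merged callback would look like:

    /* run_on_cpu() callback: briefly take the vcpu out of guest context */
    static void mig_delay_vcpu(void *opq)
    {
        qemu_mutex_unlock_iothread();
        g_usleep(50 * 1000);
        qemu_mutex_lock_iothread();
    }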

> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
> +   much time in the VM. The migration thread will try to catchup.
> +   Workload will experience a performance drop.
> +*/
> +void migration_throttle_down(void)
> +{
> +    if (throttling_needed()) {
> +        CPUArchState *penv = first_cpu;
> +        while (penv) {
> +            qemu_mutex_lock_iothread();
> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
> +            qemu_mutex_unlock_iothread();

Please hoist the lock/unlock outside the while loop.

> +            penv = penv->next_cpu;
> +        }
> +    }
> +}
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index ace91b0..68b65c6 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -129,4 +129,8 @@ int64_t migrate_xbzrle_cache_size(void);
>  int64_t xbzrle_cache_resize(int64_t new_size);
>  
>  bool migrate_auto_converge(void);
> +bool throttling_needed(void);
> +void stop_throttling(void);
> +void migration_throttle_down(void);
> +
>  #endif
> diff --git a/migration.c b/migration.c
> index 570cee5..d3673a6 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -526,6 +526,7 @@ static void *migration_thread(void *opaque)
>              DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>              if (pending_size && pending_size >= max_size) {
>                  qemu_savevm_state_iterate(s->file);
> +                migration_throttle_down();

Did you try the approach of calling migration_throttle_down from
ram_save_iterate, based on how much time passed from the last occurrence?

I would like that a bit more because in principle (especially with large
bandwidth) qemu_savevm_state_iterate() can take a long time, thus the
"duty cycle" of the auto-convergence is not predictable.

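Something along these lines (a hypothetical sketch reusing the names from
this series, with an arbitrary ~100 ms minimum period):

    static int64_t last_throttle_time;
    int64_t now = qemu_get_clock_ms(rt_clock);

    if (now - last_throttle_time >= 100) {
        migration_throttle_down();
        last_throttle_time = now;
    }
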
Paolo

>              } else {
>                  DPRINTF("done iterating\n");
>                  qemu_mutex_lock_iothread();
> 

* Re: [Qemu-devel] [RFC PATCH v5 2/3] Add 'auto-converge' migration capability
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 2/3] Add 'auto-converge' migration capability Chegu Vinod
@ 2013-05-10  7:43   ` Paolo Bonzini
  2013-05-10 14:26     ` Eric Blake
  0 siblings, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2013-05-10  7:43 UTC (permalink / raw)
  To: Chegu Vinod; +Cc: owasserm, qemu-devel, anthony, quintela

Il 09/05/2013 21:43, Chegu Vinod ha scritto:
>  The auto-converge migration capability allows the user to choose whether
>  the live migration sequence should automatically detect a lack of
>  convergence and force it.
> 
> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
> ---
>  include/migration/migration.h |    2 ++
>  migration.c                   |    9 +++++++++
>  qapi-schema.json              |    5 ++++-
>  3 files changed, 15 insertions(+), 1 deletions(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index e2acec6..ace91b0 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -127,4 +127,6 @@ int migrate_use_xbzrle(void);
>  int64_t migrate_xbzrle_cache_size(void);
>  
>  int64_t xbzrle_cache_resize(int64_t new_size);
> +
> +bool migrate_auto_converge(void);
>  #endif
> diff --git a/migration.c b/migration.c
> index 3eb0fad..570cee5 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -474,6 +474,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>      max_downtime = (uint64_t)value;
>  }
>  
> +bool migrate_auto_converge(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
> +}
> +
>  int migrate_use_xbzrle(void)
>  {
>      MigrationState *s;
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 199744a..b33839c 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -602,10 +602,13 @@
>  #          This feature allows us to minimize migration traffic for certain work
>  #          loads, by sending compressed difference of the pages
>  #
> +# @auto-converge: Migration supports automatic throttling down of guest
> +#          to force convergence. (since 1.6)

If enabled, QEMU will automatically throttle down the guest to speed up
convergence of RAM migration.

> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
> -  'data': ['xbzrle'] }
> +  'data': ['xbzrle', 'auto-converge'] }
>  
>  ##
>  # @MigrationCapabilityStatus
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

* Re: [Qemu-devel] [RFC PATCH v5 1/3] Introduce async_run_on_cpu()
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 1/3] Introduce async_run_on_cpu() Chegu Vinod
@ 2013-05-10  7:43   ` Paolo Bonzini
  0 siblings, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2013-05-10  7:43 UTC (permalink / raw)
  To: Chegu Vinod; +Cc: owasserm, qemu-devel, anthony, quintela

Il 09/05/2013 21:43, Chegu Vinod ha scritto:
>  Introduce an asynchronous version of run_on_cpu(), i.e. the caller
>  doesn't have to block until the callback routine finishes execution
>  on the target vcpu.
> 
> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
> ---
>  cpus.c                |   29 +++++++++++++++++++++++++++++
>  include/qemu-common.h |    1 +
>  include/qom/cpu.h     |   10 ++++++++++
>  3 files changed, 40 insertions(+), 0 deletions(-)
> 
> diff --git a/cpus.c b/cpus.c
> index c232265..8cd4eab 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -653,6 +653,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
>  
>      wi.func = func;
>      wi.data = data;
> +    wi.free = false;
>      if (cpu->queued_work_first == NULL) {
>          cpu->queued_work_first = &wi;
>      } else {
> @@ -671,6 +672,31 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
>      }
>  }
>  
> +void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
> +{
> +    struct qemu_work_item *wi;
> +
> +    if (qemu_cpu_is_self(cpu)) {
> +        func(data);
> +        return;
> +    }
> +
> +    wi = g_malloc0(sizeof(struct qemu_work_item));
> +    wi->func = func;
> +    wi->data = data;
> +    wi->free = true;
> +    if (cpu->queued_work_first == NULL) {
> +        cpu->queued_work_first = wi;
> +    } else {
> +        cpu->queued_work_last->next = wi;
> +    }
> +    cpu->queued_work_last = wi;
> +    wi->next = NULL;
> +    wi->done = false;
> +
> +    qemu_cpu_kick(cpu);
> +}
> +
>  static void flush_queued_work(CPUState *cpu)
>  {
>      struct qemu_work_item *wi;
> @@ -683,6 +709,9 @@ static void flush_queued_work(CPUState *cpu)
>          cpu->queued_work_first = wi->next;
>          wi->func(wi->data);
>          wi->done = true;
> +        if (wi->free) {
> +            g_free(wi);
> +        }
>      }
>      cpu->queued_work_last = NULL;
>      qemu_cond_broadcast(&qemu_work_cond);
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index b399d85..bad6e1f 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -286,6 +286,7 @@ struct qemu_work_item {
>      void (*func)(void *data);
>      void *data;
>      int done;
> +    bool free;
>  };
>  
>  #ifdef CONFIG_USER_ONLY
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 7cd9442..46465e9 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -265,6 +265,16 @@ bool cpu_is_stopped(CPUState *cpu);
>  void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
>  
>  /**
> + * async_run_on_cpu:
> + * @cpu: The vCPU to run on.
> + * @func: The function to be executed.
> + * @data: Data to pass to the function.
> + *
> + * Schedules the function @func for execution on the vCPU @cpu asynchronously.
> + */
> +void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
> +
> +/**
>   * qemu_for_each_cpu:
>   * @func: The function to be executed.
>   * @data: Data to pass to the function.
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
  2013-05-09 23:00     ` Chegu Vinod
@ 2013-05-10  7:47       ` Paolo Bonzini
  0 siblings, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2013-05-10  7:47 UTC (permalink / raw)
  To: Chegu Vinod; +Cc: Igor Mammedov, owasserm, qemu-devel, anthony, quintela

Il 10/05/2013 01:00, Chegu Vinod ha scritto:
> On 5/9/2013 1:24 PM, Igor Mammedov wrote:
>> On Thu,  9 May 2013 12:43:20 -0700
>> Chegu Vinod <chegu_vinod@hp.com> wrote:
>>
>>>   If a user chooses to turn on the auto-converge migration capability
>>>   these changes detect the lack of convergence and throttle down the
>>>   guest. i.e. force the VCPUs out of the guest for some duration
>>>   and let the migration thread catchup and help converge.
>>>
>> [...]
>>> +
>>> +static void mig_delay_vcpu(void)
>>> +{
>>> +    qemu_mutex_unlock_iothread();
>>> +    g_usleep(50*1000);
>>> +    qemu_mutex_lock_iothread();
>>> +}
>>> +
>>> +/* Stub used for getting the vcpu out of VM and into qemu via
>>> +   run_on_cpu()*/
>>> +static void mig_kick_cpu(void *opq)
>>> +{
>>> +    mig_delay_vcpu();
>>> +    return;
>>> +}
>>> +
>>> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
>>> +   much time in the VM. The migration thread will try to catchup.
>>> +   Workload will experience a performance drop.
>>> +*/
>>> +void migration_throttle_down(void)
>>> +{
>>> +    if (throttling_needed()) {
>>> +        CPUArchState *penv = first_cpu;
>>> +        while (penv) {
>>> +            qemu_mutex_lock_iothread();
>> Locking it here and then unlocking it inside of the queued work doesn't
>> look nice.
> Yes...but see below.

Actually, no. :)  It looks strange, but it is correct and perfectly fine.

The queued work is running in a completely different thread.  run_on_cpu
work items run under the BQL, thus mig_delay_vcpu needs to unlock.

On the other hand, migration_throttle_down runs in the migration thread,
outside the BQL.  It needs to lock because the first_cpu list can change
through hotplug at any time.  qemu_for_each_cpu would also need the BQL
for the same reason.

Paolo

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
  2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration Chegu Vinod
                     ` (2 preceding siblings ...)
  2013-05-10  7:41   ` Paolo Bonzini
@ 2013-05-10 13:07   ` Anthony Liguori
  2013-05-10 14:14     ` Chegu Vinod
  2013-05-10 14:17     ` Daniel P. Berrange
  3 siblings, 2 replies; 21+ messages in thread
From: Anthony Liguori @ 2013-05-10 13:07 UTC (permalink / raw)
  To: Chegu Vinod, eblake, quintela, owasserm, pbonzini, qemu-devel

Chegu Vinod <chegu_vinod@hp.com> writes:

>  If a user chooses to turn on the auto-converge migration capability
>  these changes detect the lack of convergence and throttle down the
>  guest. i.e. force the VCPUs out of the guest for some duration
>  and let the migration thread catchup and help converge.
>
>  Verified the convergence using the following :
>  - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
>  - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>
>  Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
>  migrate downtime set to 4seconds).

Would it make sense to separate out the "slow the VCPU down" part of
this?

That would give a management tool more flexibility to create policies
around slowing the VCPU down to encourage migration.

In fact, I wonder if we need anything in the migration path if we just
expose the "slow the VCPU down" bit as a feature.

Slow the VCPU down is not quite the same as setting priority of the VCPU
thread largely because of the QBL so I recognize the need to have
something for this in QEMU.

Regards,

Anthony Liguori

>
>  (qemu) info migrate
>  capabilities: xbzrle: off auto-converge: off  <----
>  Migration status: active
>  total time: 1487503 milliseconds
>  expected downtime: 519 milliseconds
>  transferred ram: 383749347 kbytes
>  remaining ram: 2753372 kbytes
>  total ram: 268444224 kbytes
>  duplicate: 65461532 pages
>  skipped: 64901568 pages
>  normal: 95750218 pages
>  normal bytes: 383000872 kbytes
>  dirty pages rate: 67551 pages
>
>  ---
>  
>  (qemu) info migrate
>  capabilities: xbzrle: off auto-converge: on   <----
>  Migration status: completed
>  total time: 241161 milliseconds
>  downtime: 6373 milliseconds
>  transferred ram: 28235307 kbytes
>  remaining ram: 0 kbytes
>  total ram: 268444224 kbytes
>  duplicate: 64946416 pages
>  skipped: 64903523 pages
>  normal: 7044971 pages
>  normal bytes: 28179884 kbytes
>
> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
> ---
>  arch_init.c                   |   68 +++++++++++++++++++++++++++++++++++++++++
>  include/migration/migration.h |    4 ++
>  migration.c                   |    1 +
>  3 files changed, 73 insertions(+), 0 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index 49c5dc2..29788d6 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -49,6 +49,7 @@
>  #include "trace.h"
>  #include "exec/cpu-all.h"
>  #include "hw/acpi/acpi.h"
> +#include "sysemu/cpus.h"
>  
>  #ifdef DEBUG_ARCH_INIT
>  #define DPRINTF(fmt, ...) \
> @@ -104,6 +105,8 @@ int graphic_depth = 15;
>  #endif
>  
>  const uint32_t arch_type = QEMU_ARCH;
> +static bool mig_throttle_on;
> +
>  
>  /***********************************************************/
>  /* ram save/restore */
> @@ -378,8 +381,15 @@ static void migration_bitmap_sync(void)
>      uint64_t num_dirty_pages_init = migration_dirty_pages;
>      MigrationState *s = migrate_get_current();
>      static int64_t start_time;
> +    static int64_t bytes_xfer_prev;
>      static int64_t num_dirty_pages_period;
>      int64_t end_time;
> +    int64_t bytes_xfer_now;
> +    static int dirty_rate_high_cnt;
> +
> +    if (!bytes_xfer_prev) {
> +        bytes_xfer_prev = ram_bytes_transferred();
> +    }
>  
>      if (!start_time) {
>          start_time = qemu_get_clock_ms(rt_clock);
> @@ -404,6 +414,23 @@ static void migration_bitmap_sync(void)
>  
>      /* more than 1 second = 1000 millisecons */
>      if (end_time > start_time + 1000) {
> +        if (migrate_auto_converge()) {
> +            /* The following detection logic can be refined later. For now:
> +               Check to see if the dirtied bytes is 50% more than the approx.
> +               amount of bytes that just got transferred since the last time we
> +               were in this routine. If that happens N times (for now N==5)
> +               we turn on the throttle down logic */
> +            bytes_xfer_now = ram_bytes_transferred();
> +            if (s->dirty_pages_rate &&
> +                ((num_dirty_pages_period*TARGET_PAGE_SIZE) >
> +                ((bytes_xfer_now - bytes_xfer_prev)/2))) {
> +                if (dirty_rate_high_cnt++ > 5) {
> +                    DPRINTF("Unable to converge. Throtting down guest\n");
> +                    mig_throttle_on = true;
> +                }
> +             }
> +             bytes_xfer_prev = bytes_xfer_now;
> +        }
>          s->dirty_pages_rate = num_dirty_pages_period * 1000
>              / (end_time - start_time);
>          s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
> @@ -496,6 +523,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>      return bytes_sent;
>  }
>  
> +bool throttling_needed(void)
> +{
> +    if (!migrate_auto_converge()) {
> +        return false;
> +    }
> +
> +    return mig_throttle_on;
> +}
> +
>  static uint64_t bytes_transferred;
>  
>  static ram_addr_t ram_save_remaining(void)
> @@ -1098,3 +1134,35 @@ TargetInfo *qmp_query_target(Error **errp)
>  
>      return info;
>  }
> +
> +static void mig_delay_vcpu(void)
> +{
> +    qemu_mutex_unlock_iothread();
> +    g_usleep(50*1000);
> +    qemu_mutex_lock_iothread();
> +}
> +
> +/* Stub used for getting the vcpu out of VM and into qemu via
> +   run_on_cpu()*/
> +static void mig_kick_cpu(void *opq)
> +{
> +    mig_delay_vcpu();
> +    return;
> +}
> +
> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
> +   much time in the VM. The migration thread will try to catchup.
> +   Workload will experience a performance drop.
> +*/
> +void migration_throttle_down(void)
> +{
> +    if (throttling_needed()) {
> +        CPUArchState *penv = first_cpu;
> +        while (penv) {
> +            qemu_mutex_lock_iothread();
> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
> +            qemu_mutex_unlock_iothread();
> +            penv = penv->next_cpu;
> +        }
> +    }
> +}
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index ace91b0..68b65c6 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -129,4 +129,8 @@ int64_t migrate_xbzrle_cache_size(void);
>  int64_t xbzrle_cache_resize(int64_t new_size);
>  
>  bool migrate_auto_converge(void);
> +bool throttling_needed(void);
> +void stop_throttling(void);
> +void migration_throttle_down(void);
> +
>  #endif
> diff --git a/migration.c b/migration.c
> index 570cee5..d3673a6 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -526,6 +526,7 @@ static void *migration_thread(void *opaque)
>              DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>              if (pending_size && pending_size >= max_size) {
>                  qemu_savevm_state_iterate(s->file);
> +                migration_throttle_down();
>              } else {
>                  DPRINTF("done iterating\n");
>                  qemu_mutex_lock_iothread();
> -- 
> 1.7.1

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
  2013-05-10 13:07   ` Anthony Liguori
@ 2013-05-10 14:14     ` Chegu Vinod
  2013-05-10 15:11       ` Anthony Liguori
  2013-05-10 14:17     ` Daniel P. Berrange
  1 sibling, 1 reply; 21+ messages in thread
From: Chegu Vinod @ 2013-05-10 14:14 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: owasserm, pbonzini, qemu-devel, quintela

On 5/10/2013 6:07 AM, Anthony Liguori wrote:
> Chegu Vinod <chegu_vinod@hp.com> writes:
>
>>   If a user chooses to turn on the auto-converge migration capability
>>   these changes detect the lack of convergence and throttle down the
>>   guest. i.e. force the VCPUs out of the guest for some duration
>>   and let the migration thread catchup and help converge.
>>
>>   Verified the convergence using the following :
>>   - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
>>   - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>>
>>   Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
>>   migrate downtime set to 4seconds).
> Would it make sense to separate out the "slow the VCPU down" part of
> this?
>
> That would give a management tool more flexibility to create policies
> around slowing the VCPU down to encourage migration.

I believe one can always enhance the libvirt tools to monitor the migration
statistics and control the shares/entitlements of the vcpus via cgroups,
thereby slowing the guest down to allow for convergence. (I had that listed
in my earlier versions of the patches as an option, and also noted that it
requires external, i.e. tool-driven, monitoring and triggers, whereas this
alternative is kind of automatic after the initial setting of the capability.)

Is that what you meant by your comment above, or are you talking about
something outside the scope of cgroups and, from an implementation point
of view, also outside the migration code path... i.e. a new knob that an
external tool can use to just throttle down the vcpus of a guest?

Thanks
Vinod



>
> In fact, I wonder if we need anything in the migration path if we just
> expose the "slow the VCPU down" bit as a feature.
>
> Slow the VCPU down is not quite the same as setting priority of the VCPU
> thread largely because of the QBL so I recognize the need to have
> something for this in QEMU.
>
> Regards,
>
> Anthony Liguori
>
>>   (qemu) info migrate
>>   capabilities: xbzrle: off auto-converge: off  <----
>>   Migration status: active
>>   total time: 1487503 milliseconds
>>   expected downtime: 519 milliseconds
>>   transferred ram: 383749347 kbytes
>>   remaining ram: 2753372 kbytes
>>   total ram: 268444224 kbytes
>>   duplicate: 65461532 pages
>>   skipped: 64901568 pages
>>   normal: 95750218 pages
>>   normal bytes: 383000872 kbytes
>>   dirty pages rate: 67551 pages
>>
>>   ---
>>   
>>   (qemu) info migrate
>>   capabilities: xbzrle: off auto-converge: on   <----
>>   Migration status: completed
>>   total time: 241161 milliseconds
>>   downtime: 6373 milliseconds
>>   transferred ram: 28235307 kbytes
>>   remaining ram: 0 kbytes
>>   total ram: 268444224 kbytes
>>   duplicate: 64946416 pages
>>   skipped: 64903523 pages
>>   normal: 7044971 pages
>>   normal bytes: 28179884 kbytes
>>
>> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
>> ---
>>   arch_init.c                   |   68 +++++++++++++++++++++++++++++++++++++++++
>>   include/migration/migration.h |    4 ++
>>   migration.c                   |    1 +
>>   3 files changed, 73 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index 49c5dc2..29788d6 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -49,6 +49,7 @@
>>   #include "trace.h"
>>   #include "exec/cpu-all.h"
>>   #include "hw/acpi/acpi.h"
>> +#include "sysemu/cpus.h"
>>   
>>   #ifdef DEBUG_ARCH_INIT
>>   #define DPRINTF(fmt, ...) \
>> @@ -104,6 +105,8 @@ int graphic_depth = 15;
>>   #endif
>>   
>>   const uint32_t arch_type = QEMU_ARCH;
>> +static bool mig_throttle_on;
>> +
>>   
>>   /***********************************************************/
>>   /* ram save/restore */
>> @@ -378,8 +381,15 @@ static void migration_bitmap_sync(void)
>>       uint64_t num_dirty_pages_init = migration_dirty_pages;
>>       MigrationState *s = migrate_get_current();
>>       static int64_t start_time;
>> +    static int64_t bytes_xfer_prev;
>>       static int64_t num_dirty_pages_period;
>>       int64_t end_time;
>> +    int64_t bytes_xfer_now;
>> +    static int dirty_rate_high_cnt;
>> +
>> +    if (!bytes_xfer_prev) {
>> +        bytes_xfer_prev = ram_bytes_transferred();
>> +    }
>>   
>>       if (!start_time) {
>>           start_time = qemu_get_clock_ms(rt_clock);
>> @@ -404,6 +414,23 @@ static void migration_bitmap_sync(void)
>>   
>>       /* more than 1 second = 1000 millisecons */
>>       if (end_time > start_time + 1000) {
>> +        if (migrate_auto_converge()) {
>> +            /* The following detection logic can be refined later. For now:
>> +               Check to see if the dirtied bytes is 50% more than the approx.
>> +               amount of bytes that just got transferred since the last time we
>> +               were in this routine. If that happens N times (for now N==5)
>> +               we turn on the throttle down logic */
>> +            bytes_xfer_now = ram_bytes_transferred();
>> +            if (s->dirty_pages_rate &&
>> +                ((num_dirty_pages_period*TARGET_PAGE_SIZE) >
>> +                ((bytes_xfer_now - bytes_xfer_prev)/2))) {
>> +                if (dirty_rate_high_cnt++ > 5) {
>> +                    DPRINTF("Unable to converge. Throtting down guest\n");
>> +                    mig_throttle_on = true;
>> +                }
>> +             }
>> +             bytes_xfer_prev = bytes_xfer_now;
>> +        }
>>           s->dirty_pages_rate = num_dirty_pages_period * 1000
>>               / (end_time - start_time);
>>           s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
>> @@ -496,6 +523,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>>       return bytes_sent;
>>   }
>>   
>> +bool throttling_needed(void)
>> +{
>> +    if (!migrate_auto_converge()) {
>> +        return false;
>> +    }
>> +
>> +    return mig_throttle_on;
>> +}
>> +
>>   static uint64_t bytes_transferred;
>>   
>>   static ram_addr_t ram_save_remaining(void)
>> @@ -1098,3 +1134,35 @@ TargetInfo *qmp_query_target(Error **errp)
>>   
>>       return info;
>>   }
>> +
>> +static void mig_delay_vcpu(void)
>> +{
>> +    qemu_mutex_unlock_iothread();
>> +    g_usleep(50*1000);
>> +    qemu_mutex_lock_iothread();
>> +}
>> +
>> +/* Stub used for getting the vcpu out of VM and into qemu via
>> +   run_on_cpu()*/
>> +static void mig_kick_cpu(void *opq)
>> +{
>> +    mig_delay_vcpu();
>> +    return;
>> +}
>> +
>> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
>> +   much time in the VM. The migration thread will try to catchup.
>> +   Workload will experience a performance drop.
>> +*/
>> +void migration_throttle_down(void)
>> +{
>> +    if (throttling_needed()) {
>> +        CPUArchState *penv = first_cpu;
>> +        while (penv) {
>> +            qemu_mutex_lock_iothread();
>> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
>> +            qemu_mutex_unlock_iothread();
>> +            penv = penv->next_cpu;
>> +        }
>> +    }
>> +}
>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>> index ace91b0..68b65c6 100644
>> --- a/include/migration/migration.h
>> +++ b/include/migration/migration.h
>> @@ -129,4 +129,8 @@ int64_t migrate_xbzrle_cache_size(void);
>>   int64_t xbzrle_cache_resize(int64_t new_size);
>>   
>>   bool migrate_auto_converge(void);
>> +bool throttling_needed(void);
>> +void stop_throttling(void);
>> +void migration_throttle_down(void);
>> +
>>   #endif
>> diff --git a/migration.c b/migration.c
>> index 570cee5..d3673a6 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -526,6 +526,7 @@ static void *migration_thread(void *opaque)
>>               DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>>               if (pending_size && pending_size >= max_size) {
>>                   qemu_savevm_state_iterate(s->file);
>> +                migration_throttle_down();
>>               } else {
>>                   DPRINTF("done iterating\n");
>>                   qemu_mutex_lock_iothread();
>> -- 
>> 1.7.1
> .
>

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
  2013-05-10 13:07   ` Anthony Liguori
  2013-05-10 14:14     ` Chegu Vinod
@ 2013-05-10 14:17     ` Daniel P. Berrange
  2013-05-10 15:08       ` Anthony Liguori
  1 sibling, 1 reply; 21+ messages in thread
From: Daniel P. Berrange @ 2013-05-10 14:17 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: quintela, Chegu Vinod, qemu-devel, owasserm, pbonzini

On Fri, May 10, 2013 at 08:07:51AM -0500, Anthony Liguori wrote:
> Chegu Vinod <chegu_vinod@hp.com> writes:
> 
> >  If a user chooses to turn on the auto-converge migration capability
> >  these changes detect the lack of convergence and throttle down the
> >  guest. i.e. force the VCPUs out of the guest for some duration
> >  and let the migration thread catchup and help converge.
> >
> >  Verified the convergence using the following :
> >  - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
> >  - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
> >
> >  Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
> >  migrate downtime set to 4seconds).
> 
> Would it make sense to separate out the "slow the VCPU down" part of
> this?
> 
> That would give a management tool more flexibility to create policies
> around slowing the VCPU down to encourage migration.
> 
> In fact, I wonder if we need anything in the migration path if we just
> expose the "slow the VCPU down" bit as a feature.
> 
> Slow the VCPU down is not quite the same as setting priority of the VCPU
> thread largely because of the QBL so I recognize the need to have
> something for this in QEMU.

Rather than the priority, could you perhaps do the VCPU slow-down
using the cfs_quota_us + cfs_period_us settings though? These let you
place hard caps on scheduler time afforded to vCPUs, and we can already
control those via libvirt + cgroups.
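
For example (illustrative only; the exact cgroup paths depend on how
libvirt laid out the per-vcpu cgroups for the guest), capping one vcpu to
roughly 20% of a host CPU:

  echo 100000 > /sys/fs/cgroup/cpu/libvirt/qemu/<guest>/vcpu0/cpu.cfs_period_us
  echo  20000 > /sys/fs/cgroup/cpu/libvirt/qemu/<guest>/vcpu0/cpu.cfs_quota_us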

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v5 2/3] Add 'auto-converge' migration capability
  2013-05-10  7:43   ` Paolo Bonzini
@ 2013-05-10 14:26     ` Eric Blake
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Blake @ 2013-05-10 14:26 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: owasserm, Chegu Vinod, qemu-devel, anthony, quintela


On 05/10/2013 01:43 AM, Paolo Bonzini wrote:
>> +++ b/qapi-schema.json
>> @@ -602,10 +602,13 @@
>>  #          This feature allows us to minimize migration traffic for certain work
>>  #          loads, by sending compressed difference of the pages
>>  #
>> +# @auto-converge: Migration supports automatic throttling down of guest
>> +#          to force convergence. (since 1.6)
> 
> If enabled, QEMU will automatically throttle down the guest to speed up
> convergence of RAM migration.

Ooh, I do like Paolo's wording better than mine.  But either one is
reasonable, so feel free to add:

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convegence of live migration
  2013-05-10 14:17     ` Daniel P. Berrange
@ 2013-05-10 15:08       ` Anthony Liguori
  2013-05-13 12:33         ` Daniel P. Berrange
  0 siblings, 1 reply; 21+ messages in thread
From: Anthony Liguori @ 2013-05-10 15:08 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: quintela, Chegu Vinod, qemu-devel, owasserm, pbonzini

"Daniel P. Berrange" <berrange@redhat.com> writes:

> On Fri, May 10, 2013 at 08:07:51AM -0500, Anthony Liguori wrote:
>> Chegu Vinod <chegu_vinod@hp.com> writes:
>> 
>> >  If a user chooses to turn on the auto-converge migration capability
>> >  these changes detect the lack of convergence and throttle down the
>> >  guest. i.e. force the VCPUs out of the guest for some duration
>> >  and let the migration thread catchup and help converge.
>> >
>> >  Verified the convergence using the following :
>> >  - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
>> >  - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>> >
>> >  Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
>> >  migrate downtime set to 4seconds).
>> 
>> Would it make sense to separate out the "slow the VCPU down" part of
>> this?
>> 
>> That would give a management tool more flexibility to create policies
>> around slowing the VCPU down to encourage migration.
>> 
>> In fact, I wonder if we need anything in the migration path if we just
>> expose the "slow the VCPU down" bit as a feature.
>> 
>> Slow the VCPU down is not quite the same as setting priority of the VCPU
>> thread largely because of the QBL so I recognize the need to have
>> something for this in QEMU.
>
> Rather than the priority, could you perhaps do the VCPU slow down
> using  cfs_quota_us + cfs_period_us settings though ? These let you
> place hard caps on scheduler time afforded to vCPUs and we can already
> control those via libvirt + cgroups.

The problem with the bandwidth controller is the same as with priorities.
You can end up causing lock holder pre-emption which would negatively
impact migration performance.

It's far better for QEMU to voluntarily give up some time knowing that
it's not holding the QBL since then migration can continue without
impact.

Regards,

Anthony Liguori

>
> Daniel
> -- 
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convegence of live migration
  2013-05-10 14:14     ` Chegu Vinod
@ 2013-05-10 15:11       ` Anthony Liguori
  2013-05-12 17:19         ` Paolo Bonzini
  0 siblings, 1 reply; 21+ messages in thread
From: Anthony Liguori @ 2013-05-10 15:11 UTC (permalink / raw)
  To: Chegu Vinod; +Cc: owasserm, pbonzini, qemu-devel, quintela

Chegu Vinod <chegu_vinod@hp.com> writes:

> On 5/10/2013 6:07 AM, Anthony Liguori wrote:
>> Chegu Vinod <chegu_vinod@hp.com> writes:
>>
>>>   If a user chooses to turn on the auto-converge migration capability
>>>   these changes detect the lack of convergence and throttle down the
>>>   guest. i.e. force the VCPUs out of the guest for some duration
>>>   and let the migration thread catchup and help converge.
>>>
>>>   Verified the convergence using the following :
>>>   - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
>>>   - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>>>
>>>   Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
>>>   migrate downtime set to 4 seconds).
>> Would it make sense to separate out the "slow the VCPU down" part of
>> this?
>>
>> That would give a management tool more flexibility to create policies
>> around slowing the VCPU down to encourage migration.
>
> I believe one can always enhance libvirt tools to monitor the migration 
> statistics and control the shares/entitlements of the vcpus via 
> cgroups..thereby slowing the guest down to allow for convergence  (I had 
> that listed in my earlier versions of the patches as an option and also 
> noted that it requires external (i.e. tool driven) monitoring and 
> triggers...and that this alternative was kind of automatic after the 
> initial setting of the capability).
>
> Is that what you meant by your comment above (or) are you talking about 
> something outside the scope of cgroups and from an implementation point 
> of view also outside the migration code path...i.e. a new knob that an 
> external tool can use to just throttle down the vcpus of a guest ?

I'm saying, a knob to throttle the guest vcpus within QEMU that could be
used by management tools to encourage convergence.

For instance, consider an imaginary "vcpu_throttle" command that took a
number between 0 and 1 that throttled VCPU performance accordingly.

Then migration would look like:

0) throttle = 1.0
1) call migrate command to start migration
2) query progress until you decide you aren't converging
3) throttle *= 0.75; call vcpu_throttle $throttle
4) goto (2)
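
A minimal sketch of that loop from the management side, assuming a
hypothetical "vcpu_throttle" QMP command and assumed helpers for issuing
commands and reading migration progress (none of these are existing QEMU
or libvirt APIs):

/* Hypothetical management-side policy loop around an imaginary
 * "vcpu_throttle" command.  qmp_command(), migration_is_complete()
 * and migration_is_converging() are assumed helpers, not real APIs. */
#include <stdbool.h>
#include <unistd.h>

extern void qmp_command(const char *fmt, ...);
extern bool migration_is_complete(void);
extern bool migration_is_converging(void);

void migrate_with_throttle(const char *uri)
{
    double throttle = 1.0;                 /* 0) start at full speed */

    qmp_command("migrate %s", uri);        /* 1) start the migration */

    while (!migration_is_complete()) {     /* 2) query progress */
        sleep(1);
        if (!migration_is_converging()) {
            throttle *= 0.75;              /* 3) back off by another 25% */
            qmp_command("vcpu_throttle %f", throttle);
        }
    }                                      /* 4) and keep polling */
}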

Now I'm not opposed to a series like this that adds this sort of policy
to QEMU itself too but I want to make sure the pieces are exposed for a
management tool to implement its own policies too.

Regards,

Anthony Liguori

>
> Thanks
> Vinod
>
>
>
>>
>> In fact, I wonder if we need anything in the migration path if we just
>> expose the "slow the VCPU down" bit as a feature.
>>
>> Slow the VCPU down is not quite the same as setting priority of the VCPU
>> thread largely because of the QBL so I recognize the need to have
>> something for this in QEMU.
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>>   (qemu) info migrate
>>>   capabilities: xbzrle: off auto-converge: off  <----
>>>   Migration status: active
>>>   total time: 1487503 milliseconds
>>>   expected downtime: 519 milliseconds
>>>   transferred ram: 383749347 kbytes
>>>   remaining ram: 2753372 kbytes
>>>   total ram: 268444224 kbytes
>>>   duplicate: 65461532 pages
>>>   skipped: 64901568 pages
>>>   normal: 95750218 pages
>>>   normal bytes: 383000872 kbytes
>>>   dirty pages rate: 67551 pages
>>>
>>>   ---
>>>   
>>>   (qemu) info migrate
>>>   capabilities: xbzrle: off auto-converge: on   <----
>>>   Migration status: completed
>>>   total time: 241161 milliseconds
>>>   downtime: 6373 milliseconds
>>>   transferred ram: 28235307 kbytes
>>>   remaining ram: 0 kbytes
>>>   total ram: 268444224 kbytes
>>>   duplicate: 64946416 pages
>>>   skipped: 64903523 pages
>>>   normal: 7044971 pages
>>>   normal bytes: 28179884 kbytes
>>>
>>> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
>>> ---
>>>   arch_init.c                   |   68 +++++++++++++++++++++++++++++++++++++++++
>>>   include/migration/migration.h |    4 ++
>>>   migration.c                   |    1 +
>>>   3 files changed, 73 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/arch_init.c b/arch_init.c
>>> index 49c5dc2..29788d6 100644
>>> --- a/arch_init.c
>>> +++ b/arch_init.c
>>> @@ -49,6 +49,7 @@
>>>   #include "trace.h"
>>>   #include "exec/cpu-all.h"
>>>   #include "hw/acpi/acpi.h"
>>> +#include "sysemu/cpus.h"
>>>   
>>>   #ifdef DEBUG_ARCH_INIT
>>>   #define DPRINTF(fmt, ...) \
>>> @@ -104,6 +105,8 @@ int graphic_depth = 15;
>>>   #endif
>>>   
>>>   const uint32_t arch_type = QEMU_ARCH;
>>> +static bool mig_throttle_on;
>>> +
>>>   
>>>   /***********************************************************/
>>>   /* ram save/restore */
>>> @@ -378,8 +381,15 @@ static void migration_bitmap_sync(void)
>>>       uint64_t num_dirty_pages_init = migration_dirty_pages;
>>>       MigrationState *s = migrate_get_current();
>>>       static int64_t start_time;
>>> +    static int64_t bytes_xfer_prev;
>>>       static int64_t num_dirty_pages_period;
>>>       int64_t end_time;
>>> +    int64_t bytes_xfer_now;
>>> +    static int dirty_rate_high_cnt;
>>> +
>>> +    if (!bytes_xfer_prev) {
>>> +        bytes_xfer_prev = ram_bytes_transferred();
>>> +    }
>>>   
>>>       if (!start_time) {
>>>           start_time = qemu_get_clock_ms(rt_clock);
>>> @@ -404,6 +414,23 @@ static void migration_bitmap_sync(void)
>>>   
>>>       /* more than 1 second = 1000 millisecons */
>>>       if (end_time > start_time + 1000) {
>>> +        if (migrate_auto_converge()) {
>>> +            /* The following detection logic can be refined later. For now:
>>> +               Check to see if the dirtied bytes is 50% more than the approx.
>>> +               amount of bytes that just got transferred since the last time we
>>> +               were in this routine. If that happens N times (for now N==5)
>>> +               we turn on the throttle down logic */
>>> +            bytes_xfer_now = ram_bytes_transferred();
>>> +            if (s->dirty_pages_rate &&
>>> +                ((num_dirty_pages_period*TARGET_PAGE_SIZE) >
>>> +                ((bytes_xfer_now - bytes_xfer_prev)/2))) {
>>> +                if (dirty_rate_high_cnt++ > 5) {
>>> +                    DPRINTF("Unable to converge. Throttling down guest\n");
>>> +                    mig_throttle_on = true;
>>> +                }
>>> +             }
>>> +             bytes_xfer_prev = bytes_xfer_now;
>>> +        }
>>>           s->dirty_pages_rate = num_dirty_pages_period * 1000
>>>               / (end_time - start_time);
>>>           s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
>>> @@ -496,6 +523,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>>>       return bytes_sent;
>>>   }
>>>   
>>> +bool throttling_needed(void)
>>> +{
>>> +    if (!migrate_auto_converge()) {
>>> +        return false;
>>> +    }
>>> +
>>> +    return mig_throttle_on;
>>> +}
>>> +
>>>   static uint64_t bytes_transferred;
>>>   
>>>   static ram_addr_t ram_save_remaining(void)
>>> @@ -1098,3 +1134,35 @@ TargetInfo *qmp_query_target(Error **errp)
>>>   
>>>       return info;
>>>   }
>>> +
>>> +static void mig_delay_vcpu(void)
>>> +{
>>> +    qemu_mutex_unlock_iothread();
>>> +    g_usleep(50*1000);
>>> +    qemu_mutex_lock_iothread();
>>> +}
>>> +
>>> +/* Stub used for getting the vcpu out of VM and into qemu via
>>> +   run_on_cpu()*/
>>> +static void mig_kick_cpu(void *opq)
>>> +{
>>> +    mig_delay_vcpu();
>>> +    return;
>>> +}
>>> +
>>> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
>>> +   much time in the VM. The migration thread will try to catchup.
>>> +   Workload will experience a performance drop.
>>> +*/
>>> +void migration_throttle_down(void)
>>> +{
>>> +    if (throttling_needed()) {
>>> +        CPUArchState *penv = first_cpu;
>>> +        while (penv) {
>>> +            qemu_mutex_lock_iothread();
>>> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
>>> +            qemu_mutex_unlock_iothread();
>>> +            penv = penv->next_cpu;
>>> +        }
>>> +    }
>>> +}
>>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>>> index ace91b0..68b65c6 100644
>>> --- a/include/migration/migration.h
>>> +++ b/include/migration/migration.h
>>> @@ -129,4 +129,8 @@ int64_t migrate_xbzrle_cache_size(void);
>>>   int64_t xbzrle_cache_resize(int64_t new_size);
>>>   
>>>   bool migrate_auto_converge(void);
>>> +bool throttling_needed(void);
>>> +void stop_throttling(void);
>>> +void migration_throttle_down(void);
>>> +
>>>   #endif
>>> diff --git a/migration.c b/migration.c
>>> index 570cee5..d3673a6 100644
>>> --- a/migration.c
>>> +++ b/migration.c
>>> @@ -526,6 +526,7 @@ static void *migration_thread(void *opaque)
>>>               DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>>>               if (pending_size && pending_size >= max_size) {
>>>                   qemu_savevm_state_iterate(s->file);
>>> +                migration_throttle_down();
>>>               } else {
>>>                   DPRINTF("done iterating\n");
>>>                   qemu_mutex_lock_iothread();
>>> -- 
>>> 1.7.1
>> .
>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convegence of live migration
  2013-05-10 15:11       ` Anthony Liguori
@ 2013-05-12 17:19         ` Paolo Bonzini
  2013-05-13 12:18           ` Anthony Liguori
  0 siblings, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2013-05-12 17:19 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: owasserm, Chegu Vinod, qemu-devel, quintela

Il 10/05/2013 17:11, Anthony Liguori ha scritto:
> Chegu Vinod <chegu_vinod@hp.com> writes:
> 
>> On 5/10/2013 6:07 AM, Anthony Liguori wrote:
>>> Chegu Vinod <chegu_vinod@hp.com> writes:
>>>
>>>>   If a user chooses to turn on the auto-converge migration capability
>>>>   these changes detect the lack of convergence and throttle down the
>>>>   guest. i.e. force the VCPUs out of the guest for some duration
>>>>   and let the migration thread catchup and help converge.
>>>>
>>>>   Verified the convergence using the following :
>>>>   - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
>>>>   - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>>>>
>>>>   Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
>>>>   migrate downtime set to 4 seconds).
>>> Would it make sense to separate out the "slow the VCPU down" part of
>>> this?
>>>
>>> That would give a management tool more flexibility to create policies
>>> around slowing the VCPU down to encourage migration.
>>
>> I believe one can always enhance libvirt tools to monitor the migration 
>> statistics and control the shares/entitlements of the vcpus via 
>> cgroups..thereby slowing the guest down to allow for convergence  (I had 
>> that listed in my earlier versions of the patches as an option and also 
>> noted that it requires external (i.e. tool driven) monitoring and 
>> triggers...and that this alternative was kind of automatic after the 
>> initial setting of the capability).
>>
>> Is that what you meant by your comment above (or) are you talking about 
>> something outside the scope of cgroups and from an implementation point 
>> of view also outside the migration code path...i.e. a new knob that an 
>> external tool can use to just throttle down the vcpus of a guest ?
> 
> I'm saying, a knob to throttle the guest vcpus within QEMU that could be
> used by management tools to encourage convergence.
> 
> For instance, consider an imaginary "vcpu_throttle" command that took a
> number between 0 and 1 that throttled VCPU performance accordingly.
> 
> Then migration would look like:
> 
> 0) throttle = 1.0
> 1) call migrate command to start migration
> 2) query progress until you decide you aren't converging
> 3) throttle *= 0.75; call vcpu_throttle $throttle
> 4) goto (2)
> 
> Now I'm not opposed to a series like this that adds this sort of policy
> to QEMU itself too but I want to make sure the pieces are exposed for a
> management tool to implement its own policies too.

Note that QEMU can also throttle VCPUs as they dirty guest memory,
rather than based on CPU time.  That's something that management
cannot do (it can approximate it from recent history if QEMU provides
dirtying statistics, but it's not the same thing).
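
As a rough sketch of that idea, in the spirit of the mig_delay_vcpu()
helper from this series (the per-vCPU dirty-page count and the delay
factor are invented for the illustration, and the function assumes
QEMU's usual headers rather than being standalone):

/* Illustration only: delay a vCPU in proportion to how much it has
 * dirtied since the last check, rather than capping its CPU time.
 * The caller-supplied counter and the 10ms-per-1000-pages factor are
 * assumptions for the example, not QEMU's actual dirty accounting. */
static void mig_throttle_vcpu_by_dirty_rate(uint64_t pages_dirtied)
{
    /* 10ms of delay per 1000 pages dirtied since the last check. */
    uint64_t delay_us = (pages_dirtied / 1000) * 10000;

    if (delay_us) {
        /* Drop the big lock while sleeping, as mig_delay_vcpu() does. */
        qemu_mutex_unlock_iothread();
        g_usleep(delay_us);
        qemu_mutex_lock_iothread();
    }
}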

Paolo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convegence of live migration
  2013-05-12 17:19         ` Paolo Bonzini
@ 2013-05-13 12:18           ` Anthony Liguori
  0 siblings, 0 replies; 21+ messages in thread
From: Anthony Liguori @ 2013-05-13 12:18 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: owasserm, Chegu Vinod, qemu-devel, quintela

Paolo Bonzini <pbonzini@redhat.com> writes:

> Il 10/05/2013 17:11, Anthony Liguori ha scritto:
>> Chegu Vinod <chegu_vinod@hp.com> writes:
>> 
>>> On 5/10/2013 6:07 AM, Anthony Liguori wrote:
>>>> Chegu Vinod <chegu_vinod@hp.com> writes:
>>>>
>>>>>   If a user chooses to turn on the auto-converge migration capability
>>>>>   these changes detect the lack of convergence and throttle down the
>>>>>   guest. i.e. force the VCPUs out of the guest for some duration
>>>>>   and let the migration thread catchup and help converge.
>>>>>
>>>>>   Verified the convergence using the following :
>>>>>   - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
>>>>>   - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>>>>>
>>>>>   Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
>>>>>   migrate downtime set to 4 seconds).
>>>> Would it make sense to separate out the "slow the VCPU down" part of
>>>> this?
>>>>
>>>> That would give a management tool more flexibility to create policies
>>>> around slowing the VCPU down to encourage migration.
>>>
>>> I believe one can always enhance libvirt tools to monitor the migration 
>>> statistics and control the shares/entitlements of the vcpus via 
>>> cgroups..thereby slowing the guest down to allow for convergence  (I had 
>>> that listed in my earlier versions of the patches as an option and also 
>>> noted that it requires external (i.e. tool driven) monitoring and 
>>> triggers...and that this alternative was kind of automatic after the 
>>> initial setting of the capability).
>>>
>>> Is that what you meant by your comment above (or) are you talking about 
>>> something outside the scope of cgroups and from an implementation point 
>>> of view also outside the migration code path...i.e. a new knob that an 
>>> external tool can use to just throttle down the vcpus of a guest ?
>> 
>> I'm saying, a knob to throttle the guest vcpus within QEMU that could be
>> used by management tools to encourage convergence.
>> 
>> For instance, consider an imaginary "vcpu_throttle" command that took a
>> number between 0 and 1 that throttled VCPU performance accordingly.
>> 
>> Then migration would look like:
>> 
>> 0) throttle = 1.0
>> 1) call migrate command to start migration
>> 2) query progress until you decide you aren't converging
>> 3) throttle *= 0.75; call vcpu_throttle $throttle
>> 4) goto (2)
>> 
>> Now I'm not opposed to a series like this that adds this sort of policy
>> to QEMU itself too but I want to make sure the pieces are exposed for a
>> management tool to implement its own policies too.
>
> Note that QEMU can also throttle VCPUs as they dirty guest memory,
> rather than based on CPU time.  That's something that management
> cannot do (it can approximate it from recent history if QEMU provides
> dirtying statistics, but it's not the same thing).

Sure but in that case, I'd argue you would want to expose that as a
command that libvirt could invoke too.

Regards,

Anthony Liguori

>
> Paolo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convegence of live migration
  2013-05-10 15:08       ` Anthony Liguori
@ 2013-05-13 12:33         ` Daniel P. Berrange
  0 siblings, 0 replies; 21+ messages in thread
From: Daniel P. Berrange @ 2013-05-13 12:33 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: quintela, Chegu Vinod, qemu-devel, owasserm, pbonzini

On Fri, May 10, 2013 at 10:08:05AM -0500, Anthony Liguori wrote:
> "Daniel P. Berrange" <berrange@redhat.com> writes:
> 
> > On Fri, May 10, 2013 at 08:07:51AM -0500, Anthony Liguori wrote:
> >> Chegu Vinod <chegu_vinod@hp.com> writes:
> >> 
> >> >  If a user chooses to turn on the auto-converge migration capability
> >> >  these changes detect the lack of convergence and throttle down the
> >> >  guest. i.e. force the VCPUs out of the guest for some duration
> >> >  and let the migration thread catchup and help converge.
> >> >
> >> >  Verified the convergence using the following :
> >> >  - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
> >> >  - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
> >> >
> >> >  Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
> >> >  migrate downtime set to 4 seconds).
> >> 
> >> Would it make sense to separate out the "slow the VCPU down" part of
> >> this?
> >> 
> >> That would give a management tool more flexibility to create policies
> >> around slowing the VCPU down to encourage migration.
> >> 
> >> In fact, I wonder if we need anything in the migration path if we just
> >> expose the "slow the VCPU down" bit as a feature.
> >> 
> >> Slow the VCPU down is not quite the same as setting priority of the VCPU
> >> thread largely because of the QBL so I recognize the need to have
> >> something for this in QEMU.
> >
> > Rather than the priority, could you perhaps do the VCPU slow down
> > using  cfs_quota_us + cfs_period_us settings though ? These let you
> >> place hard caps on scheduler time afforded to vCPUs and we can already
> > control those via libvirt + cgroups.
> 
> The problem with the bandwidth controller is the same as with priorities.
> You can end up causing lock holder pre-emption which would negatively
> impact migration performance.
> 
> It's far better for QEMU to voluntarily give up some time knowing that
> it's not holding the QBL since then migration can continue without
> impact.

IMHO it'd be nice to get some clear benchmark numbers of just how big
the lock holder pre-emption problem is when using cgroup hard caps,
before we invent another mechanism for throttling the CPUs that has
to be plumbed into the whole stack.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2013-05-13 12:34 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-09 19:43 [Qemu-devel] [RFC PATCH v5 0/3] Throttle-down guest to help with live migration convergence Chegu Vinod
2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 1/3] Introduce async_run_on_cpu() Chegu Vinod
2013-05-10  7:43   ` Paolo Bonzini
2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 2/3] Add 'auto-converge' migration capability Chegu Vinod
2013-05-10  7:43   ` Paolo Bonzini
2013-05-10 14:26     ` Eric Blake
2013-05-09 19:43 ` [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convegence of live migration Chegu Vinod
2013-05-09 20:05   ` Igor Mammedov
2013-05-09 22:26     ` Chegu Vinod
2013-05-09 20:24   ` Igor Mammedov
2013-05-09 23:00     ` Chegu Vinod
2013-05-10  7:47       ` Paolo Bonzini
2013-05-10  7:41   ` Paolo Bonzini
2013-05-10 13:07   ` Anthony Liguori
2013-05-10 14:14     ` Chegu Vinod
2013-05-10 15:11       ` Anthony Liguori
2013-05-12 17:19         ` Paolo Bonzini
2013-05-13 12:18           ` Anthony Liguori
2013-05-10 14:17     ` Daniel P. Berrange
2013-05-10 15:08       ` Anthony Liguori
2013-05-13 12:33         ` Daniel P. Berrange
