Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration

From: Anthony Liguori <anthony@codemonkey.ws>
To: Chegu Vinod <chegu_vinod@hp.com>
Cc: owasserm@redhat.com, pbonzini@redhat.com, qemu-devel@nongnu.org,
	quintela@redhat.com
Subject: Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
Date: Fri, 10 May 2013 10:11:31 -0500
Message-ID: <87ip2qzx7g.fsf@codemonkey.ws>
In-Reply-To: <518D00B6.6040305@hp.com>

Chegu Vinod <chegu_vinod@hp.com> writes:

> On 5/10/2013 6:07 AM, Anthony Liguori wrote:
>> Chegu Vinod <chegu_vinod@hp.com> writes:
>>
>>>   If a user chooses to turn on the auto-converge migration capability,
>>>   these changes detect the lack of convergence and throttle down the
>>>   guest, i.e. force the VCPUs out of the guest for some duration
>>>   and let the migration thread catch up and help converge.
>>>
>>>   Verified the convergence using the following:
>>>   - SpecJbb2005 workload running on a 20VCPU/256G guest (~80% busy)
>>>   - OLTP-like workload running on an 80VCPU/512G guest (~80% busy)
>>>
>>>   Sample results with SpecJbb2005 workload (migrate speed set to 20Gb and
>>>   migrate downtime set to 4 seconds):
>> Would it make sense to separate out the "slow the VCPU down" part of
>> this?
>>
>> That would give a management tool more flexibility to create policies
>> around slowing the VCPU down to encourage migration.
>
> I believe one can always enhance libvirt tools to monitor the migration
> statistics and control the shares/entitlements of the vcpus via
> cgroups, thereby slowing the guest down to allow for convergence. (I had
> that listed in earlier versions of the patches as an option, and also
> noted that it requires external, i.e. tool-driven, monitoring and
> triggers, whereas this alternative is more or less automatic after the
> initial setting of the capability.)
>
> Is that what you meant by your comment above, or are you talking about
> something outside the scope of cgroups and, from an implementation point
> of view, also outside the migration code path, i.e. a new knob that an
> external tool can use to just throttle down the vcpus of a guest?

I mean a knob within QEMU that throttles the guest VCPUs and that
management tools could use to encourage convergence.

For instance, consider an imaginary "vcpu_throttle" command that took a
number between 0 and 1 that throttled VCPU performance accordingly.

Then migration would look like:

0) throttle = 1.0
1) call migrate command to start migration
2) query progress until you decide you aren't converging
3) throttle *= 0.75; call vcpu_throttle $throttle
4) goto (2)
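
To make the loop above concrete, here is a rough sketch of the policy a
management tool might run. Everything in it is hypothetical: SimGuest,
sim_poll() and migrate_with_throttle() are made-up names, the guest is a toy
model rather than a real QEMU instance, and it assumes the dirty rate scales
linearly with the throttle value, which a real workload need not do.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a migrating guest (hypothetical, for illustration only). */
typedef struct {
    double dirty_rate;  /* MB/s dirtied when the guest runs at full speed */
    double bandwidth;   /* MB/s the migration stream can carry */
    double remaining;   /* MB still to transfer */
} SimGuest;

/* Simulate one 1-second polling interval at the given throttle.
 * Assumption: dirtying scales linearly with the throttle value.
 * Returns 1 if the remaining amount shrank, 0 otherwise. */
static int sim_poll(SimGuest *g, double throttle)
{
    double before = g->remaining;

    g->remaining += g->dirty_rate * throttle - g->bandwidth;
    if (g->remaining < 0.0) {
        g->remaining = 0.0;
    }
    return g->remaining < before;
}

/* Run the policy from the numbered steps. Returns the throttle value in
 * effect when migration completed, or -1.0 if it never completed. */
double migrate_with_throttle(SimGuest *g, int max_polls)
{
    double throttle = 1.0;                      /* step 0 */

    assert(g != NULL && max_polls > 0);
    /* step 1: the "migrate" command would be issued here */
    for (int i = 0; i < max_polls; i++) {
        int converging = sim_poll(g, throttle); /* step 2 */

        if (g->remaining == 0.0) {
            return throttle;
        }
        if (!converging) {
            throttle *= 0.75;                   /* step 3 */
            /* the imaginary "vcpu_throttle" command would be sent here */
        }
    }                                           /* step 4: goto (2) */
    return -1.0;
}
```

The point of the sketch is only that the "am I converging?" decision lives
entirely in the tool; QEMU would just need to expose the throttle knob and
the migration statistics.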

Now, I'm not opposed to a series like this that adds this sort of policy
to QEMU itself, but I want to make sure the pieces are also exposed so that
a management tool can implement its own policies.
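
As for what the knob itself could mean, one possible reading (an assumption
on my part, not something the patch implements) is a duty cycle: within each
fixed period the VCPU runs for a fraction "throttle" of the time and is kept
out of the guest for the rest. The helper name throttle_sleep_us() and the
period parameter are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: how long (in microseconds) a VCPU would be held out
 * of the guest during each period, given a throttle value in [0, 1] where
 * 1.0 means full speed and 0.0 means fully stopped. */
long throttle_sleep_us(double throttle, long period_us)
{
    assert(throttle >= 0.0 && throttle <= 1.0);
    assert(period_us > 0);

    /* Round to the nearest microsecond. */
    return (long)((1.0 - throttle) * (double)period_us + 0.5);
}
```

With a 100 ms period, a throttle of 0.75 would translate into 25 ms of sleep
per period, and a throttle of 1.0 into none.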

Regards,

Anthony Liguori

>
> Thanks
> Vinod
>
>
>
>>
>> In fact, I wonder if we need anything in the migration path if we just
>> expose the "slow the VCPU down" bit as a feature.
>>
>> Slowing the VCPU down is not quite the same as setting the priority of the
>> VCPU thread, largely because of the BQL, so I recognize the need to have
>> something for this in QEMU.
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>>   (qemu) info migrate
>>>   capabilities: xbzrle: off auto-converge: off  <----
>>>   Migration status: active
>>>   total time: 1487503 milliseconds
>>>   expected downtime: 519 milliseconds
>>>   transferred ram: 383749347 kbytes
>>>   remaining ram: 2753372 kbytes
>>>   total ram: 268444224 kbytes
>>>   duplicate: 65461532 pages
>>>   skipped: 64901568 pages
>>>   normal: 95750218 pages
>>>   normal bytes: 383000872 kbytes
>>>   dirty pages rate: 67551 pages
>>>
>>>   ---
>>>   
>>>   (qemu) info migrate
>>>   capabilities: xbzrle: off auto-converge: on   <----
>>>   Migration status: completed
>>>   total time: 241161 milliseconds
>>>   downtime: 6373 milliseconds
>>>   transferred ram: 28235307 kbytes
>>>   remaining ram: 0 kbytes
>>>   total ram: 268444224 kbytes
>>>   duplicate: 64946416 pages
>>>   skipped: 64903523 pages
>>>   normal: 7044971 pages
>>>   normal bytes: 28179884 kbytes
>>>
>>> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
>>> ---
>>>   arch_init.c                   |   68 +++++++++++++++++++++++++++++++++++++++++
>>>   include/migration/migration.h |    4 ++
>>>   migration.c                   |    1 +
>>>   3 files changed, 73 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/arch_init.c b/arch_init.c
>>> index 49c5dc2..29788d6 100644
>>> --- a/arch_init.c
>>> +++ b/arch_init.c
>>> @@ -49,6 +49,7 @@
>>>   #include "trace.h"
>>>   #include "exec/cpu-all.h"
>>>   #include "hw/acpi/acpi.h"
>>> +#include "sysemu/cpus.h"
>>>   
>>>   #ifdef DEBUG_ARCH_INIT
>>>   #define DPRINTF(fmt, ...) \
>>> @@ -104,6 +105,8 @@ int graphic_depth = 15;
>>>   #endif
>>>   
>>>   const uint32_t arch_type = QEMU_ARCH;
>>> +static bool mig_throttle_on;
>>> +
>>>   
>>>   /***********************************************************/
>>>   /* ram save/restore */
>>> @@ -378,8 +381,15 @@ static void migration_bitmap_sync(void)
>>>       uint64_t num_dirty_pages_init = migration_dirty_pages;
>>>       MigrationState *s = migrate_get_current();
>>>       static int64_t start_time;
>>> +    static int64_t bytes_xfer_prev;
>>>       static int64_t num_dirty_pages_period;
>>>       int64_t end_time;
>>> +    int64_t bytes_xfer_now;
>>> +    static int dirty_rate_high_cnt;
>>> +
>>> +    if (!bytes_xfer_prev) {
>>> +        bytes_xfer_prev = ram_bytes_transferred();
>>> +    }
>>>   
>>>       if (!start_time) {
>>>           start_time = qemu_get_clock_ms(rt_clock);
>>> @@ -404,6 +414,23 @@ static void migration_bitmap_sync(void)
>>>   
>>>       /* more than 1 second = 1000 millisecons */
>>>       if (end_time > start_time + 1000) {
>>> +        if (migrate_auto_converge()) {
>>> +            /* The following detection logic can be refined later. For now:
>>> +               Check to see if the dirtied bytes exceed 50% of the approx.
>>> +               amount of bytes that just got transferred since the last time we
>>> +               were in this routine. If that happens N times (for now N==5)
>>> +               we turn on the throttle down logic */
>>> +            bytes_xfer_now = ram_bytes_transferred();
>>> +            if (s->dirty_pages_rate &&
>>> +                ((num_dirty_pages_period*TARGET_PAGE_SIZE) >
>>> +                ((bytes_xfer_now - bytes_xfer_prev)/2))) {
>>> +                if (dirty_rate_high_cnt++ > 5) {
>>> +                    DPRINTF("Unable to converge. Throttling down guest\n");
>>> +                    mig_throttle_on = true;
>>> +                }
>>> +            }
>>> +            bytes_xfer_prev = bytes_xfer_now;
>>> +        }
>>>           s->dirty_pages_rate = num_dirty_pages_period * 1000
>>>               / (end_time - start_time);
>>>           s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
>>> @@ -496,6 +523,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>>>       return bytes_sent;
>>>   }
>>>   
>>> +bool throttling_needed(void)
>>> +{
>>> +    if (!migrate_auto_converge()) {
>>> +        return false;
>>> +    }
>>> +
>>> +    return mig_throttle_on;
>>> +}
>>> +
>>>   static uint64_t bytes_transferred;
>>>   
>>>   static ram_addr_t ram_save_remaining(void)
>>> @@ -1098,3 +1134,35 @@ TargetInfo *qmp_query_target(Error **errp)
>>>   
>>>       return info;
>>>   }
>>> +
>>> +static void mig_delay_vcpu(void)
>>> +{
>>> +    qemu_mutex_unlock_iothread();
>>> +    g_usleep(50*1000);
>>> +    qemu_mutex_lock_iothread();
>>> +}
>>> +
>>> +/* Stub used for getting the vcpu out of VM and into qemu via
>>> +   async_run_on_cpu() */
>>> +static void mig_kick_cpu(void *opq)
>>> +{
>>> +    mig_delay_vcpu();
>>> +    return;
>>> +}
>>> +
>>> +/* To reduce the dirty rate, explicitly disallow the VCPUs from spending
>>> +   much time in the VM. The migration thread will try to catch up.
>>> +   Workload will experience a performance drop.
>>> +*/
>>> +void migration_throttle_down(void)
>>> +{
>>> +    if (throttling_needed()) {
>>> +        CPUArchState *penv = first_cpu;
>>> +        while (penv) {
>>> +            qemu_mutex_lock_iothread();
>>> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
>>> +            qemu_mutex_unlock_iothread();
>>> +            penv = penv->next_cpu;
>>> +        }
>>> +    }
>>> +}
>>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>>> index ace91b0..68b65c6 100644
>>> --- a/include/migration/migration.h
>>> +++ b/include/migration/migration.h
>>> @@ -129,4 +129,8 @@ int64_t migrate_xbzrle_cache_size(void);
>>>   int64_t xbzrle_cache_resize(int64_t new_size);
>>>   
>>>   bool migrate_auto_converge(void);
>>> +bool throttling_needed(void);
>>> +void stop_throttling(void);
>>> +void migration_throttle_down(void);
>>> +
>>>   #endif
>>> diff --git a/migration.c b/migration.c
>>> index 570cee5..d3673a6 100644
>>> --- a/migration.c
>>> +++ b/migration.c
>>> @@ -526,6 +526,7 @@ static void *migration_thread(void *opaque)
>>>               DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>>>               if (pending_size && pending_size >= max_size) {
>>>                   qemu_savevm_state_iterate(s->file);
>>> +                migration_throttle_down();
>>>               } else {
>>>                   DPRINTF("done iterating\n");
>>>                   qemu_mutex_lock_iothread();
>>> -- 
>>> 1.7.1
>> .
>>
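
For reference, the detection heuristic in the patch can be pulled out into a
small standalone function, which is roughly what a management tool would have
to reimplement if it only had the statistics. This is a sketch under
assumptions: ConvergeState and converge_check() are hypothetical names, and
unlike the patch it trips as soon as the count reaches the threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state carried between sync periods. */
typedef struct {
    uint64_t bytes_xfer_prev;  /* total bytes transferred at the last check */
    int high_cnt;              /* periods in which dirtying outpaced transfer */
    bool throttle_on;          /* latched once the threshold is reached */
} ConvergeState;

/* Call once per sync period. A period counts as "high" when the bytes
 * dirtied in it exceed half of the bytes transferred since the previous
 * call. After `threshold` high periods the throttle flag is latched. */
bool converge_check(ConvergeState *st, uint64_t dirty_bytes_period,
                    uint64_t bytes_xfer_now, int threshold)
{
    uint64_t xfer;

    assert(st && threshold > 0);
    assert(bytes_xfer_now >= st->bytes_xfer_prev);

    xfer = bytes_xfer_now - st->bytes_xfer_prev;
    if (dirty_bytes_period > xfer / 2) {
        if (++st->high_cnt >= threshold) {
            st->throttle_on = true;
        }
    }
    st->bytes_xfer_prev = bytes_xfer_now;
    return st->throttle_on;
}
```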