[Qemu-devel] [PATCH 0/1] spapr: Do not re-read the clock on pre

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH 0/1] spapr: Do not re-read the clock on pre_save handler on migration
@ 2019-05-20 20:43 Maxiwell S. Garcia
  2019-05-20 20:43 ` [Qemu-devel] [PATCH 1/1] " Maxiwell S. Garcia
  0 siblings, 1 reply; 7+ messages in thread
From: Maxiwell S. Garcia @ 2019-05-20 20:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-ppc, Maxiwell S. Garcia, david

I suggest to remove the pre_save handler that saves the timebase before
migrate. The commit that added this was ported from x86:
  6053a86fe7bd: kvmclock: reduce kvmclock difference on migration

The review [1] had a discussion about it. The author says that a VM
already paused 10 minutes ago should re-read the clock just before
migrate. But a reviewer question was not answered:

"Is it really valid to make the clock move on an already-paused
VM, only because it was migrated?"

This clock move makes the guest know about the pause between the stop
and migrate commands. Many side effects could happen after migration.

[1] http://lists.nongnu.org/archive/html/qemu-devel/2016-12/msg00610.html

Maxiwell S. Garcia (1):
  spapr: Do not re-read the clock on pre_save handler on migration

 hw/ppc/ppc.c | 24 ------------------------
 1 file changed, 24 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Qemu-devel] [PATCH 1/1] spapr: Do not re-read the clock on pre_save handler on migration
  2019-05-20 20:43 [Qemu-devel] [PATCH 0/1] spapr: Do not re-read the clock on pre_save handler on migration Maxiwell S. Garcia
@ 2019-05-20 20:43 ` Maxiwell S. Garcia
  2019-05-22 23:29   ` David Gibson
  0 siblings, 1 reply; 7+ messages in thread
From: Maxiwell S. Garcia @ 2019-05-20 20:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-ppc, Maxiwell S. Garcia, david

This handler was added in the commit:
  42043e4f1241: spapr: clock should count only if vm is running

In a scenario without migration, this pre_save handler is not
triggered, so the 'stop/cont' commands save and restore the clock
in the function 'cpu_ppc_clock_vm_state_change.' The SW clock
in the guest doesn't know about this pause.

If the command 'migrate' is called between 'stop' and 'cont',
the pre_save handler re-read the clock, and the SW clock in the
guest will know about the pause between 'stop' and 'migrate.'
If the guest is running a workload like HTC, a side-effect of
this is a lot of process stall messages (with call traces) in
the kernel guest.

Signed-off-by: Maxiwell S. Garcia <maxiwell@linux.ibm.com>
---
 hw/ppc/ppc.c | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
index ad20584f26..3fb50cbeee 100644
--- a/hw/ppc/ppc.c
+++ b/hw/ppc/ppc.c
@@ -1056,35 +1056,11 @@ void cpu_ppc_clock_vm_state_change(void *opaque, int running,
     }
 }
 
-/*
- * When migrating, read the clock just before migration,
- * so that the guest clock counts during the events
- * between:
- *
- *  * vm_stop()
- *  *
- *  * pre_save()
- *
- *  This reduces clock difference on migration from 5s
- *  to 0.1s (when max_downtime == 5s), because sending the
- *  final pages of memory (which happens between vm_stop()
- *  and pre_save()) takes max_downtime.
- */
-static int timebase_pre_save(void *opaque)
-{
-    PPCTimebase *tb = opaque;
-
-    timebase_save(tb);
-
-    return 0;
-}
-
 const VMStateDescription vmstate_ppc_timebase = {
     .name = "timebase",
     .version_id = 1,
     .minimum_version_id = 1,
     .minimum_version_id_old = 1,
-    .pre_save = timebase_pre_save,
     .fields      = (VMStateField []) {
         VMSTATE_UINT64(guest_timebase, PPCTimebase),
         VMSTATE_INT64(time_of_the_day_ns, PPCTimebase),
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [PATCH 1/1] spapr: Do not re-read the clock on pre_save handler on migration
  2019-05-20 20:43 ` [Qemu-devel] [PATCH 1/1] " Maxiwell S. Garcia
@ 2019-05-22 23:29   ` David Gibson
  2019-05-23 20:18     ` [Qemu-devel] [Qemu-ppc] " Maxiwell S. Garcia
  0 siblings, 1 reply; 7+ messages in thread
From: David Gibson @ 2019-05-22 23:29 UTC (permalink / raw)
  To: Maxiwell S. Garcia; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2831 bytes --]

On Mon, May 20, 2019 at 05:43:40PM -0300, Maxiwell S. Garcia wrote:
> This handler was added in the commit:
>   42043e4f1241: spapr: clock should count only if vm is running
> 
> In a scenario without migration, this pre_save handler is not
> triggered, so the 'stop/cont' commands save and restore the clock
> in the function 'cpu_ppc_clock_vm_state_change.' The SW clock
> in the guest doesn't know about this pause.
> 
> If the command 'migrate' is called between 'stop' and 'cont',
> the pre_save handler re-read the clock, and the SW clock in the
> guest will know about the pause between 'stop' and 'migrate.'
> If the guest is running a workload like HTC, a side-effect of
> this is a lot of process stall messages (with call traces) in
> the kernel guest.
> 
> Signed-off-by: Maxiwell S. Garcia <maxiwell@linux.ibm.com>

What affect will this have on the clock for the case of migrations
without a stop/cont around?  The complicated thing here is that for
*explicit* stops/continues we want to freeze the clock, however for
the implicit stop/continue during migration downtime, we want to keep
the clock running (logically), so that the guest time of day doesn't
get out of sync on migration.

> ---
>  hw/ppc/ppc.c | 24 ------------------------
>  1 file changed, 24 deletions(-)
> 
> diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
> index ad20584f26..3fb50cbeee 100644
> --- a/hw/ppc/ppc.c
> +++ b/hw/ppc/ppc.c
> @@ -1056,35 +1056,11 @@ void cpu_ppc_clock_vm_state_change(void *opaque, int running,
>      }
>  }
>  
> -/*
> - * When migrating, read the clock just before migration,
> - * so that the guest clock counts during the events
> - * between:
> - *
> - *  * vm_stop()
> - *  *
> - *  * pre_save()
> - *
> - *  This reduces clock difference on migration from 5s
> - *  to 0.1s (when max_downtime == 5s), because sending the
> - *  final pages of memory (which happens between vm_stop()
> - *  and pre_save()) takes max_downtime.

Urgh.. this comment is confusing - 5s would be a ludicrously long
max_downtime by modern standards.

> - */
> -static int timebase_pre_save(void *opaque)
> -{
> -    PPCTimebase *tb = opaque;
> -
> -    timebase_save(tb);
> -
> -    return 0;
> -}
> -
>  const VMStateDescription vmstate_ppc_timebase = {
>      .name = "timebase",
>      .version_id = 1,
>      .minimum_version_id = 1,
>      .minimum_version_id_old = 1,
> -    .pre_save = timebase_pre_save,
>      .fields      = (VMStateField []) {
>          VMSTATE_UINT64(guest_timebase, PPCTimebase),
>          VMSTATE_INT64(time_of_the_day_ns, PPCTimebase),

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH 1/1] spapr: Do not re-read the clock on pre_save handler on migration
  2019-05-22 23:29   ` David Gibson
@ 2019-05-23 20:18     ` Maxiwell S. Garcia
  2019-05-30  1:13       ` David Gibson
  0 siblings, 1 reply; 7+ messages in thread
From: Maxiwell S. Garcia @ 2019-05-23 20:18 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel

On Thu, May 23, 2019 at 09:29:52AM +1000, David Gibson wrote:
> On Mon, May 20, 2019 at 05:43:40PM -0300, Maxiwell S. Garcia wrote:
> > This handler was added in the commit:
> >   42043e4f1241: spapr: clock should count only if vm is running
> > 
> > In a scenario without migration, this pre_save handler is not
> > triggered, so the 'stop/cont' commands save and restore the clock
> > in the function 'cpu_ppc_clock_vm_state_change.' The SW clock
> > in the guest doesn't know about this pause.
> > 
> > If the command 'migrate' is called between 'stop' and 'cont',
> > the pre_save handler re-read the clock, and the SW clock in the
> > guest will know about the pause between 'stop' and 'migrate.'
> > If the guest is running a workload like HTC, a side-effect of
> > this is a lot of process stall messages (with call traces) in
> > the kernel guest.
> > 
> > Signed-off-by: Maxiwell S. Garcia <maxiwell@linux.ibm.com>
> 
> What affect will this have on the clock for the case of migrations
> without a stop/cont around?

The guest timebase is saved when the VM stop running and restored when
the VM starts running again (cpu_ppc_clock_vm_state_change handler).
Migrations without stop/cont save the clock when the VM go to the
FINISH_MIGRATE state.

> The complicated thing here is that for
> *explicit* stops/continues we want to freeze the clock, however for
> the implicit stop/continue during migration downtime, we want to keep
> the clock running (logically), so that the guest time of day doesn't
> get out of sync on migration.
> 

Not sure if the *implicit* word here is about commands from the libvirt
or any other orchestrator. QEMU itself doesn't know the intent behind the
command stop/cont. So, If we are using a guest to process a workload and
the manager tool decide to migrate our VM transparently, it's unpleasant
to see a lot of process stalls with call traces in the kernel log.
The high-level tools could sync the SW clock with the HW clock if this
behavior is required, keeping the QEMU stop/cont and stop/migrate/cont
consistent.

> > ---
> >  hw/ppc/ppc.c | 24 ------------------------
> >  1 file changed, 24 deletions(-)
> > 
> > diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
> > index ad20584f26..3fb50cbeee 100644
> > --- a/hw/ppc/ppc.c
> > +++ b/hw/ppc/ppc.c
> > @@ -1056,35 +1056,11 @@ void cpu_ppc_clock_vm_state_change(void *opaque, int running,
> >      }
> >  }
> >  
> > -/*
> > - * When migrating, read the clock just before migration,
> > - * so that the guest clock counts during the events
> > - * between:
> > - *
> > - *  * vm_stop()
> > - *  *
> > - *  * pre_save()
> > - *
> > - *  This reduces clock difference on migration from 5s
> > - *  to 0.1s (when max_downtime == 5s), because sending the
> > - *  final pages of memory (which happens between vm_stop()
> > - *  and pre_save()) takes max_downtime.
> 
> Urgh.. this comment is confusing - 5s would be a ludicrously long
> max_downtime by modern standards.
> 
> > - */
> > -static int timebase_pre_save(void *opaque)
> > -{
> > -    PPCTimebase *tb = opaque;
> > -
> > -    timebase_save(tb);
> > -
> > -    return 0;
> > -}
> > -
> >  const VMStateDescription vmstate_ppc_timebase = {
> >      .name = "timebase",
> >      .version_id = 1,
> >      .minimum_version_id = 1,
> >      .minimum_version_id_old = 1,
> > -    .pre_save = timebase_pre_save,
> >      .fields      = (VMStateField []) {
> >          VMSTATE_UINT64(guest_timebase, PPCTimebase),
> >          VMSTATE_INT64(time_of_the_day_ns, PPCTimebase),
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH 1/1] spapr: Do not re-read the clock on pre_save handler on migration
  2019-05-23 20:18     ` [Qemu-devel] [Qemu-ppc] " Maxiwell S. Garcia
@ 2019-05-30  1:13       ` David Gibson
  2019-06-05 19:39         ` Maxiwell S. Garcia
  0 siblings, 1 reply; 7+ messages in thread
From: David Gibson @ 2019-05-30  1:13 UTC (permalink / raw)
  To: Maxiwell S. Garcia; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4608 bytes --]

On Thu, May 23, 2019 at 05:18:51PM -0300, Maxiwell S. Garcia wrote:
> On Thu, May 23, 2019 at 09:29:52AM +1000, David Gibson wrote:
> > On Mon, May 20, 2019 at 05:43:40PM -0300, Maxiwell S. Garcia wrote:
> > > This handler was added in the commit:
> > >   42043e4f1241: spapr: clock should count only if vm is running
> > > 
> > > In a scenario without migration, this pre_save handler is not
> > > triggered, so the 'stop/cont' commands save and restore the clock
> > > in the function 'cpu_ppc_clock_vm_state_change.' The SW clock
> > > in the guest doesn't know about this pause.
> > > 
> > > If the command 'migrate' is called between 'stop' and 'cont',
> > > the pre_save handler re-read the clock, and the SW clock in the
> > > guest will know about the pause between 'stop' and 'migrate.'
> > > If the guest is running a workload like HTC, a side-effect of
> > > this is a lot of process stall messages (with call traces) in
> > > the kernel guest.
> > > 
> > > Signed-off-by: Maxiwell S. Garcia <maxiwell@linux.ibm.com>
> > 
> > What affect will this have on the clock for the case of migrations
> > without a stop/cont around?
> 
> The guest timebase is saved when the VM stop running and restored when
> the VM starts running again (cpu_ppc_clock_vm_state_change handler).
> Migrations without stop/cont save the clock when the VM go to the
> FINISH_MIGRATE state.

Right... which means the clock is effectively stopped for the
migration downtime window while we transfer the final state.  That
means the guest clock will drift from wall clock by a couple of
hundred ms across the migration which is not correct.

> > The complicated thing here is that for
> > *explicit* stops/continues we want to freeze the clock, however for
> > the implicit stop/continue during migration downtime, we want to keep
> > the clock running (logically), so that the guest time of day doesn't
> > get out of sync on migration.
> > 
> 
> Not sure if the *implicit* word here is about commands from the libvirt
> or any other orchestrator.

By implicit I mean the stopping of the VM which qemu does to transfer
the final part of the state, rather than because of an explicit
stop/cont command.

> QEMU itself doesn't know the intent behind the
> command stop/cont. So, If we are using a guest to process a workload and
> the manager tool decide to migrate our VM transparently, it's unpleasant
> to see a lot of process stalls with call traces in the kernel log.

If you have a lot of process stalls across a migration, that suggests
your permitted downtime window is *way* too long.

> The high-level tools could sync the SW clock with the HW clock if this
> behavior is required, keeping the QEMU stop/cont and stop/migrate/cont
> consistent.
> 
> > > ---
> > >  hw/ppc/ppc.c | 24 ------------------------
> > >  1 file changed, 24 deletions(-)
> > > 
> > > diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
> > > index ad20584f26..3fb50cbeee 100644
> > > --- a/hw/ppc/ppc.c
> > > +++ b/hw/ppc/ppc.c
> > > @@ -1056,35 +1056,11 @@ void cpu_ppc_clock_vm_state_change(void *opaque, int running,
> > >      }
> > >  }
> > >  
> > > -/*
> > > - * When migrating, read the clock just before migration,
> > > - * so that the guest clock counts during the events
> > > - * between:
> > > - *
> > > - *  * vm_stop()
> > > - *  *
> > > - *  * pre_save()
> > > - *
> > > - *  This reduces clock difference on migration from 5s
> > > - *  to 0.1s (when max_downtime == 5s), because sending the
> > > - *  final pages of memory (which happens between vm_stop()
> > > - *  and pre_save()) takes max_downtime.
> > 
> > Urgh.. this comment is confusing - 5s would be a ludicrously long
> > max_downtime by modern standards.
> > 
> > > - */
> > > -static int timebase_pre_save(void *opaque)
> > > -{
> > > -    PPCTimebase *tb = opaque;
> > > -
> > > -    timebase_save(tb);
> > > -
> > > -    return 0;
> > > -}
> > > -
> > >  const VMStateDescription vmstate_ppc_timebase = {
> > >      .name = "timebase",
> > >      .version_id = 1,
> > >      .minimum_version_id = 1,
> > >      .minimum_version_id_old = 1,
> > > -    .pre_save = timebase_pre_save,
> > >      .fields      = (VMStateField []) {
> > >          VMSTATE_UINT64(guest_timebase, PPCTimebase),
> > >          VMSTATE_INT64(time_of_the_day_ns, PPCTimebase),
> > 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH 1/1] spapr: Do not re-read the clock on pre_save handler on migration
  2019-05-30  1:13       ` David Gibson
@ 2019-06-05 19:39         ` Maxiwell S. Garcia
  2019-06-12  5:14           ` David Gibson
  0 siblings, 1 reply; 7+ messages in thread
From: Maxiwell S. Garcia @ 2019-06-05 19:39 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel

On Thu, May 30, 2019 at 11:13:41AM +1000, David Gibson wrote:
> On Thu, May 23, 2019 at 05:18:51PM -0300, Maxiwell S. Garcia wrote:
> > On Thu, May 23, 2019 at 09:29:52AM +1000, David Gibson wrote:
> > > On Mon, May 20, 2019 at 05:43:40PM -0300, Maxiwell S. Garcia wrote:
> > > > This handler was added in the commit:
> > > >   42043e4f1241: spapr: clock should count only if vm is running
> > > > 
> > > > In a scenario without migration, this pre_save handler is not
> > > > triggered, so the 'stop/cont' commands save and restore the clock
> > > > in the function 'cpu_ppc_clock_vm_state_change.' The SW clock
> > > > in the guest doesn't know about this pause.
> > > > 
> > > > If the command 'migrate' is called between 'stop' and 'cont',
> > > > the pre_save handler re-read the clock, and the SW clock in the
> > > > guest will know about the pause between 'stop' and 'migrate.'
> > > > If the guest is running a workload like HTC, a side-effect of
> > > > this is a lot of process stall messages (with call traces) in
> > > > the kernel guest.
> > > > 
> > > > Signed-off-by: Maxiwell S. Garcia <maxiwell@linux.ibm.com>
> > > 
> > > What affect will this have on the clock for the case of migrations
> > > without a stop/cont around?
> > 
> > The guest timebase is saved when the VM stop running and restored when
> > the VM starts running again (cpu_ppc_clock_vm_state_change handler).
> > Migrations without stop/cont save the clock when the VM go to the
> > FINISH_MIGRATE state.
> 
> Right... which means the clock is effectively stopped for the
> migration downtime window while we transfer the final state.  That
> means the guest clock will drift from wall clock by a couple of
> hundred ms across the migration which is not correct.
> 
> > > The complicated thing here is that for
> > > *explicit* stops/continues we want to freeze the clock, however for
> > > the implicit stop/continue during migration downtime, we want to keep
> > > the clock running (logically), so that the guest time of day doesn't
> > > get out of sync on migration.
> > > 
> > 
> > Not sure if the *implicit* word here is about commands from the libvirt
> > or any other orchestrator.
> 
> By implicit I mean the stopping of the VM which qemu does to transfer
> the final part of the state, rather than because of an explicit
> stop/cont command.
> 
> > QEMU itself doesn't know the intent behind the
> > command stop/cont. So, If we are using a guest to process a workload and
> > the manager tool decide to migrate our VM transparently, it's unpleasant
> > to see a lot of process stalls with call traces in the kernel log.
> 
> If you have a lot of process stalls across a migration, that suggests
> your permitted downtime window is *way* too long.
> 

I see a difference between live migration and 'cold' migration. In
a cold migration scenario (where a user 'stop' the machine, wait an
arbitrary time, move it to another server, and run it), the behavior
should be the same as executing 'stop/cont' in a guest without migration.

This problem also emerges when the 'timeout' flag is used with 'virsh'
tool to live-migrate a guest. After the 'timeout', the libvirt sends a
'stop' command to QEMU to suspend the guest before migrate. If the NFS
is slow, for example, this guest will wait many minutes to run again.

Maybe a solution is to modify the timebase_pre_save handler to know the
current vm_state and only save the timebase again when vm_state is not
in 'stop' state. What do you think?


> > The high-level tools could sync the SW clock with the HW clock if this
> > behavior is required, keeping the QEMU stop/cont and stop/migrate/cont
> > consistent.
> > 
> > > > ---
> > > >  hw/ppc/ppc.c | 24 ------------------------
> > > >  1 file changed, 24 deletions(-)
> > > > 
> > > > diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
> > > > index ad20584f26..3fb50cbeee 100644
> > > > --- a/hw/ppc/ppc.c
> > > > +++ b/hw/ppc/ppc.c
> > > > @@ -1056,35 +1056,11 @@ void cpu_ppc_clock_vm_state_change(void *opaque, int running,
> > > >      }
> > > >  }
> > > >  
> > > > -/*
> > > > - * When migrating, read the clock just before migration,
> > > > - * so that the guest clock counts during the events
> > > > - * between:
> > > > - *
> > > > - *  * vm_stop()
> > > > - *  *
> > > > - *  * pre_save()
> > > > - *
> > > > - *  This reduces clock difference on migration from 5s
> > > > - *  to 0.1s (when max_downtime == 5s), because sending the
> > > > - *  final pages of memory (which happens between vm_stop()
> > > > - *  and pre_save()) takes max_downtime.
> > > 
> > > Urgh.. this comment is confusing - 5s would be a ludicrously long
> > > max_downtime by modern standards.
> > > 
> > > > - */
> > > > -static int timebase_pre_save(void *opaque)
> > > > -{
> > > > -    PPCTimebase *tb = opaque;
> > > > -
> > > > -    timebase_save(tb);
> > > > -
> > > > -    return 0;
> > > > -}
> > > > -
> > > >  const VMStateDescription vmstate_ppc_timebase = {
> > > >      .name = "timebase",
> > > >      .version_id = 1,
> > > >      .minimum_version_id = 1,
> > > >      .minimum_version_id_old = 1,
> > > > -    .pre_save = timebase_pre_save,
> > > >      .fields      = (VMStateField []) {
> > > >          VMSTATE_UINT64(guest_timebase, PPCTimebase),
> > > >          VMSTATE_INT64(time_of_the_day_ns, PPCTimebase),
> > > 
> > 
> > 
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH 1/1] spapr: Do not re-read the clock on pre_save handler on migration
  2019-06-05 19:39         ` Maxiwell S. Garcia
@ 2019-06-12  5:14           ` David Gibson
  0 siblings, 0 replies; 7+ messages in thread
From: David Gibson @ 2019-06-12  5:14 UTC (permalink / raw)
  To: Maxiwell S. Garcia; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4484 bytes --]

On Wed, Jun 05, 2019 at 04:39:27PM -0300, Maxiwell S. Garcia wrote:
> On Thu, May 30, 2019 at 11:13:41AM +1000, David Gibson wrote:
> > On Thu, May 23, 2019 at 05:18:51PM -0300, Maxiwell S. Garcia wrote:
> > > On Thu, May 23, 2019 at 09:29:52AM +1000, David Gibson wrote:
> > > > On Mon, May 20, 2019 at 05:43:40PM -0300, Maxiwell S. Garcia wrote:
> > > > > This handler was added in the commit:
> > > > >   42043e4f1241: spapr: clock should count only if vm is running
> > > > > 
> > > > > In a scenario without migration, this pre_save handler is not
> > > > > triggered, so the 'stop/cont' commands save and restore the clock
> > > > > in the function 'cpu_ppc_clock_vm_state_change.' The SW clock
> > > > > in the guest doesn't know about this pause.
> > > > > 
> > > > > If the command 'migrate' is called between 'stop' and 'cont',
> > > > > the pre_save handler re-read the clock, and the SW clock in the
> > > > > guest will know about the pause between 'stop' and 'migrate.'
> > > > > If the guest is running a workload like HTC, a side-effect of
> > > > > this is a lot of process stall messages (with call traces) in
> > > > > the kernel guest.
> > > > > 
> > > > > Signed-off-by: Maxiwell S. Garcia <maxiwell@linux.ibm.com>
> > > > 
> > > > What affect will this have on the clock for the case of migrations
> > > > without a stop/cont around?
> > > 
> > > The guest timebase is saved when the VM stop running and restored when
> > > the VM starts running again (cpu_ppc_clock_vm_state_change handler).
> > > Migrations without stop/cont save the clock when the VM go to the
> > > FINISH_MIGRATE state.
> > 
> > Right... which means the clock is effectively stopped for the
> > migration downtime window while we transfer the final state.  That
> > means the guest clock will drift from wall clock by a couple of
> > hundred ms across the migration which is not correct.
> > 
> > > > The complicated thing here is that for
> > > > *explicit* stops/continues we want to freeze the clock, however for
> > > > the implicit stop/continue during migration downtime, we want to keep
> > > > the clock running (logically), so that the guest time of day doesn't
> > > > get out of sync on migration.
> > > > 
> > > 
> > > Not sure if the *implicit* word here is about commands from the libvirt
> > > or any other orchestrator.
> > 
> > By implicit I mean the stopping of the VM which qemu does to transfer
> > the final part of the state, rather than because of an explicit
> > stop/cont command.
> > 
> > > QEMU itself doesn't know the intent behind the
> > > command stop/cont. So, If we are using a guest to process a workload and
> > > the manager tool decide to migrate our VM transparently, it's unpleasant
> > > to see a lot of process stalls with call traces in the kernel log.
> > 
> > If you have a lot of process stalls across a migration, that suggests
> > your permitted downtime window is *way* too long.
> 
> I see a difference between live migration and 'cold' migration. In
> a cold migration scenario (where a user 'stop' the machine, wait an
> arbitrary time, move it to another server, and run it), the behavior
> should be the same as executing 'stop/cont' in a guest without migration.

Agreed.  The difficulty is distinguishing those two cases and handling
them both correctly.

> This problem also emerges when the 'timeout' flag is used with 'virsh'
> tool to live-migrate a guest. After the 'timeout', the libvirt sends a
> 'stop' command to QEMU to suspend the guest before migrate. If the NFS
> is slow, for example, this guest will wait many minutes to run again.

Hm, in this case it's pretty unclear what the right action is.
Failing the migration outright is bad, but shifting the guest's real
time clock by minutes is pretty bad too.

> Maybe a solution is to modify the timebase_pre_save handler to know the
> current vm_state and only save the timebase again when vm_state is not
> in 'stop' state. What do you think?

Sounds reasonable, but the problem is I'm not sure there's a clear way
to tell that the guest has been explicitly stopped - that is with a
"stop" monitor instruction - rather that just the automatic stop due
to the vm_stop() in the migration path.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-06-12  5:27 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-20 20:43 [Qemu-devel] [PATCH 0/1] spapr: Do not re-read the clock on pre_save handler on migration Maxiwell S. Garcia
2019-05-20 20:43 ` [Qemu-devel] [PATCH 1/1] " Maxiwell S. Garcia
2019-05-22 23:29   ` David Gibson
2019-05-23 20:18     ` [Qemu-devel] [Qemu-ppc] " Maxiwell S. Garcia
2019-05-30  1:13       ` David Gibson
2019-06-05 19:39         ` Maxiwell S. Garcia
2019-06-12  5:14           ` David Gibson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.