* HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
       [not found] <1223417765.8633857.1368537033873.JavaMail.root@zimbra002>
@ 2013-05-14 13:11 ` Diana Crisan
  2013-05-14 16:09   ` George Dunlap
                     ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Diana Crisan @ 2013-05-14 13:11 UTC (permalink / raw)
  To: xen-devel

This is problem 1 of 3 problems we are having with live migration and/or ACPI on Xen-4.3 and Xen-4.2.

Any help would be appreciated.

Detailed description of problem:

We are using Xen-4.3-rc1 with dom0 running Ubuntu Precise and the 3.5.0-23-generic kernel, and domU running Ubuntu Precise (12.04) cloud images with the 3.2.0-39-virtual kernel. We are using the xl.conf below, with qemu-upstream-dm and HVM, on two identical sending and receiving machines (identical hardware and software).

When live migration is initiated between two identical hardware configurations using 'xl migrate', the migration completes but the system clock in domU appears to be stuck when the domU resumes on the receiving side. For instance, running 'top', 'date', or 'uptime' will constantly report the same result. The clocks in dom0 were synchronized before migration using ntpdate. Modifying the clock with the date command in the migrated domU solves the problem; migrating back to the original machine works, but after a third migration the problem reappears.

Sometimes the clock is not stuck on the first migrate, but the problem is reproducible after several migrations.

How to replicate:

1. Take two machines with identical hardware and software, running the xen-4.3-rc1 version of Xen on Ubuntu Precise with 3.5.0-23-generic kernel.
2. Use the xl.conf below as a configuration file.
3. Create a VM using Ubuntu Precise and 3.5.0-23 generic.
4. Start the VM.
5. Run 'xl migrate' to move the VM from one host to the other.
6. Wait until it resumes on the receiving side.
7. Determine whether the clock is updating (run 'top'). Determine whether ping works (ping is broken if the clock is stopped).
8. Repeat steps 5, 6 and 7 until the clock is stuck (this usually happens within 3 migrations); a scripted version of this loop is sketched below.
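
For reference, a minimal sketch of such a loop (the guest IP 10.0.0.2 and the target host name are placeholders; '416-vm' is the domain name from the xl.conf below, and ping is used as the check because it breaks once the clock stops):

count=0
while ping -c 5 -w 10 10.0.0.2 ; do       # placeholder guest IP
    count=$(($count+1))
    echo "== migrate $count =="
    xl migrate 416-vm target-host         # or 'localhost' for a local-host migrate
done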

Expected results:

The clock is never stuck

Actual results:

The clock becomes 'stuck' after one or more migrations

Notes:

If the lines 'acpi=0', 'acpi_s3=0', 'acpi_s4=0' are added to xl.conf, I cannot reproduce this problem. I thus believe it may be something to do with ACPI. While investigating this, we found problem (2), which is that live migration does not carry across all the acpi entries within xenstore - this is handled in a separate email.
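
For reference, the workaround amounts to adding the following three lines to the domain configuration below (note that this disables ACPI for the guest, so it is a diagnostic data point rather than a fix):

acpi=0
acpi_s3=0
acpi_s4=0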

On xen-4.2, a similar thing happens. However, if the clock does become stuck, the subsequent migration fails.

--xl.conf--

builder='hvm'
memory = 512
name = "416-vm"
vcpus=1
disk = [ 'tap:qcow2:/root/diana.qcow2,xvda,w' ]
vif = ['mac=00:16:3f:1d:6a:c0, bridge=defaultbr']
sdl=0
opengl=1
vnc=1
vnclisten="0.0.0.0"
vncdisplay=0
vncunused=0
vncpasswd='p'
stdvga=0
serial='pty'


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-14 13:11 ` HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI Diana Crisan
@ 2013-05-14 16:09   ` George Dunlap
  2013-05-15 10:05     ` Diana Crisan
  2013-05-15 13:46   ` Alex Bligh
  2013-05-30 14:32   ` George Dunlap
  2 siblings, 1 reply; 60+ messages in thread
From: George Dunlap @ 2013-05-14 16:09 UTC (permalink / raw)
  To: Diana Crisan; +Cc: xen-devel

On Tue, May 14, 2013 at 2:11 PM, Diana Crisan <dcrisan@flexiant.com> wrote:
> This is problem 1 of 3 problems we are having with live migration and/or ACPI on Xen-4.3 and Xen-4.2.
>
> Any help would be appreciated.
>
> Detailed description of problem:
>
> We are using Xen-4.3-rc1 with dom0 running Ubuntu Precise and 3.5.0-23-generic kernel, and domU running Ubuntu Precise (12.04) cloud images running 3.2.0-39-virtual. We are using the xl.conf below on qemu-upstream-dm and HVM and two identical sending and receiving machines (hardware and software)

Thanks for your description -- I've put them on my list of bugs to track.

One thing that's not clear from the report: Did you try this with
qemu-traditional or not?

Thanks,
 -George


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-14 16:09   ` George Dunlap
@ 2013-05-15 10:05     ` Diana Crisan
  0 siblings, 0 replies; 60+ messages in thread
From: Diana Crisan @ 2013-05-15 10:05 UTC (permalink / raw)
  To: George Dunlap; +Cc: alex, xen-devel

George,

I only tried this with qemu-upstream.

Thanks for your help,
Diana

----- Original Message -----
From: "George Dunlap" <George.Dunlap@eu.citrix.com>
To: "Diana Crisan" <dcrisan@flexiant.com>
Cc: xen-devel@lists.xen.org
Sent: Tuesday, 14 May, 2013 5:09:29 PM
Subject: Re: [Xen-devel] HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI

On Tue, May 14, 2013 at 2:11 PM, Diana Crisan <dcrisan@flexiant.com> wrote:
> This is problem 1 of 3 problems we are having with live migration and/or ACPI on Xen-4.3 and Xen-4.2.
>
> Any help would be appreciated.
>
> Detailed description of problem:
>
> We are using Xen-4.3-rc1 with dom0 running Ubuntu Precise and 3.5.0-23-generic kernel, and domU running Ubuntu Precise (12.04) cloud images running 3.2.0-39-virtual. We are using the xl.conf below on qemu-upstream-dm and HVM and two identical sending and receiving machines (hardware and software)

Thanks for your description -- I've put them on my list of bugs to track.

One thing that's not clear from the report: Did you try this with
qemu-traditional or not?

Thanks,
 -George

-- 
Diana Alexandra Crisan 
Developer 
Flexiant 


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-14 13:11 ` HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI Diana Crisan
  2013-05-14 16:09   ` George Dunlap
@ 2013-05-15 13:46   ` Alex Bligh
  2013-05-20 11:11     ` George Dunlap
  2013-05-30 14:32   ` George Dunlap
  2 siblings, 1 reply; 60+ messages in thread
From: Alex Bligh @ 2013-05-15 13:46 UTC (permalink / raw)
  To: Diana Crisan, xen-devel; +Cc: Alex Bligh



--On 14 May 2013 14:11:20 +0100 Diana Crisan <dcrisan@flexiant.com> wrote:

> We are using Xen-4.3-rc1 with dom0 running Ubuntu Precise and
> 3.5.0-23-generic kernel, and domU running Ubuntu Precise (12.04) cloud
> images running 3.2.0-39-virtual. We are using the xl.conf below on
> qemu-upstream-dm and HVM and two identical sending and receiving machines
> (hardware and software)
>
> When live migration is instigated between two identical hardware
> configurations using 'xl migrate', the migrate completes but the system
> clock in domU appears to be stuck when the domU resumes on the receiving
> side. For instance, running 'top', 'date', or 'uptime' will constantly
> report the same result. The clocks in dom0 were synchronized before
> migration using ntpdate. A modification of the clock using the date
> command in the migrated domU solves the problem; migrating back to the
> original machine works, but after a third migration the problem
> reappears.

Stranger and stranger.

What happens after this migrate is that 'uptime' reports increasing uptime
(as in seconds since boot), but the date/time is always the same. Setting
the date fixes this.

It appears that whatever process updates wall time in the kernel is not
running until it is nudged by the wall time being set manually.
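
A quick way to see this from inside the guest is to compare the monotonic uptime with the wall clock over a few seconds, for example:

# seconds since boot keep increasing...
cat /proc/uptime ; sleep 5 ; cat /proc/uptime
# ...while the wall clock keeps reporting the same time
date ; sleep 5 ; date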

-- 
Alex Bligh


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-15 13:46   ` Alex Bligh
@ 2013-05-20 11:11     ` George Dunlap
  2013-05-20 19:28       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 60+ messages in thread
From: George Dunlap @ 2013-05-20 11:11 UTC (permalink / raw)
  To: Alex Bligh; +Cc: Anthony PERARD, Diana Crisan, Konrad Rzeszutek Wilk, xen-devel

On Wed, May 15, 2013 at 2:46 PM, Alex Bligh <alex@alex.org.uk> wrote:
> Stranger and stranger.
>
> What happens after this migrate is that 'uptime' reports increasing uptime
> (as in seconds since boot), but the date/time is always the same. Setting
> date fixes this.
>
> It appears that whatever the process is that updates wall time in the kernel
> is not running, until tweaked by wall time being set manually.

Konrad, is this related to any of the recent Linux wallclock changes
you were trying to address?

 -George


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-20 11:11     ` George Dunlap
@ 2013-05-20 19:28       ` Konrad Rzeszutek Wilk
  2013-05-20 22:38         ` Alex Bligh
  0 siblings, 1 reply; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-05-20 19:28 UTC (permalink / raw)
  To: George Dunlap, david.vrabel
  Cc: Anthony PERARD, Diana Crisan, Alex Bligh, xen-devel

On Mon, May 20, 2013 at 12:11:51PM +0100, George Dunlap wrote:
> On Wed, May 15, 2013 at 2:46 PM, Alex Bligh <alex@alex.org.uk> wrote:
> > Stranger and stranger.
> >
> > What happens after this migrate is that 'uptime' reports increasing uptime
> > (as in seconds since boot), but the date/time is always the same. Setting
> > date fixes this.
> >
> > It appears that whatever the process is that updates wall time in the kernel
> > is not running, until tweaked by wall time being set manually.
> 
> Konrad, is this related to any of the recent Linux wallclock changes
> you were trying to address?

It was actually David (CC-ing him here). Alex, when you boot the hosts, are
the RTC times the same? (date?)


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-20 19:28       ` Konrad Rzeszutek Wilk
@ 2013-05-20 22:38         ` Alex Bligh
  2013-05-21  1:04           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Bligh @ 2013-05-20 22:38 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, George Dunlap, david.vrabel
  Cc: Anthony PERARD, Diana Crisan, Alex Bligh, xen-devel

Konrad,

--On 20 May 2013 15:28:52 -0400 Konrad Rzeszutek Wilk 
<konrad.wilk@oracle.com> wrote:

> It was actually David (CC-ing him here). Alex, when you boot the hosts,
> are the RTC times the same? (date?)

I believe they boot with ntpdate and run ntp, so the wallclock times are
the same. I haven't specifically checked the CMOS clock times if that's
what you meant - I'm not even sure how one does that - but I believe
ntp writes to the CMOS RTC these days.

-- 
Alex Bligh


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-20 22:38         ` Alex Bligh
@ 2013-05-21  1:04           ` Konrad Rzeszutek Wilk
  2013-05-21 10:22             ` Diana Crisan
  0 siblings, 1 reply; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-05-21  1:04 UTC (permalink / raw)
  To: Alex Bligh
  Cc: Anthony PERARD, Diana Crisan, George Dunlap, david.vrabel, xen-devel

On Mon, May 20, 2013 at 11:38:45PM +0100, Alex Bligh wrote:
> Konrad,
> 
> --On 20 May 2013 15:28:52 -0400 Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> 
> >It was actually David (CC-ing him here). Alex, when you boot the hosts,
> >are the RTC times the same? (date?)
> 
> I believe they boot with ntpdate and run ntp, so the wallclock times are
> the same. I haven't specifically checked the CMOS clock times if that's
> what you meant - I'm not even sure how one does that - but I believe
> ntp writes to the CMOS RTC these days.

11 minutes after ntpd has started.
> 
> -- 
> Alex Bligh


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-21  1:04           ` Konrad Rzeszutek Wilk
@ 2013-05-21 10:22             ` Diana Crisan
  2013-05-21 10:47               ` David Vrabel
  0 siblings, 1 reply; 60+ messages in thread
From: Diana Crisan @ 2013-05-21 10:22 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Anthony PERARD, George Dunlap, david vrabel, Alex Bligh, xen-devel

Hello,

>On Mon, May 20, 2013 at 11:38:45PM +0100, Alex Bligh wrote:
>> Konrad,
>> 
>> --On 20 May 2013 15:28:52 -0400 Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>> 
>> >It was actually David (CC-ing him here). Alex, when you boot the hosts,
>> >are the RTC times the same? (date?)
>> 
>> I believe they boot with ntpdate and run ntp, so the wallclock times are
>> the same. I haven't specifically checked the CMOS clock times if that's
>> what you meant - I'm not even sure how one does that - but I believe
>> ntp writes to the CMOS RTC these days.

>11 minutes after ntpd has started.


I have checked our machines and they aren't running ntpd, but they were synchronised with ntpdate. The date command showed the wallclock was in sync and clock -r showed the rtc was also in sync. Both cases have a delay of at most 1 second.

I ran my tests again to ensure they are still in sync, which they are.


-- 
Diana Crisan 


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-21 10:22             ` Diana Crisan
@ 2013-05-21 10:47               ` David Vrabel
  2013-05-21 11:16                 ` Diana Crisan
  0 siblings, 1 reply; 60+ messages in thread
From: David Vrabel @ 2013-05-21 10:47 UTC (permalink / raw)
  To: Diana Crisan
  Cc: Anthony PERARD, George Dunlap, xen-devel, Alex Bligh,
	Konrad Rzeszutek Wilk

On 21/05/13 11:22, Diana Crisan wrote:
> Hello,
> 
>> On Mon, May 20, 2013 at 11:38:45PM +0100, Alex Bligh wrote:
>>> Konrad,
>>>
>>> --On 20 May 2013 15:28:52 -0400 Konrad Rzeszutek Wilk
>>> <konrad.wilk@oracle.com> wrote:
>>>
>>>> It was actually David (CC-ing him here). Alex, when you boot the hosts,
>>>> are the RTC times the same? (date?)
>>>
>>> I believe they boot with ntpdate and run ntp, so the wallclock times are
>>> the same. I haven't specifically checked the CMOS clock times if that's
>>> what you meant - I'm not even sure how one does that - but I believe
>>> ntp writes to the CMOS RTC these days.
> 
>> 11 minutes after ntpd has started.

NTP updates the persistent clock when it is first synchronized and then
every 11 minutes.

> I have checked our machines and they aren't running ntpd, but they
> were synchronised with ntpdate. The date command showed the wallclock
> was in sync and clock -r showed the rtc was also in sync. Both cases
> have a delay of at most 1 second.

Running ntpdate does /not/ set the persistent wallclock, so the
persistent clock may be incorrect. However, if the CMOS RTC is correct on
both machines then the persistent clocks will be approximately the same
anyway.
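
As an aside, the persistent CMOS clock can be inspected and set from the command line; a sketch, assuming the util-linux hwclock tool is installed:

hwclock --show      # read the CMOS RTC
hwclock --systohc   # copy the current system time into the CMOS RTC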

So, you could try the wallclock[1] series but I think it is unlikely to
help here.

It looks like the guests may not be properly notified of a resume (after
the migration) and therefore are not properly accounting for the step
change in their clock source.  This means migrations one way look like
the clock source has stepped backwards, and I guess this is confusing the
kernel, as the clock source is supposed to be monotonic.

Migrations the other way then work fine as the clock source steps
forwards, not backwards.

Can you restart the hosts, say, 5 minutes apart and then see if the
guest's time recovers (without manually setting the time etc.) within ~5
mins?

David

[1] http://lists.xen.org/archives/html/xen-devel/2013-05/msg01399.html


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-21 10:47               ` David Vrabel
@ 2013-05-21 11:16                 ` Diana Crisan
  2013-05-21 12:49                   ` David Vrabel
  0 siblings, 1 reply; 60+ messages in thread
From: Diana Crisan @ 2013-05-21 11:16 UTC (permalink / raw)
  To: David Vrabel
  Cc: Anthony PERARD, George Dunlap, xen-devel, Alex Bligh,
	Konrad Rzeszutek Wilk

On 21/05/13 11:47, David Vrabel wrote:
> On 21/05/13 11:22, Diana Crisan wrote:
>> Hello,
>>
>>> On Mon, May 20, 2013 at 11:38:45PM +0100, Alex Bligh wrote:
>>>> Konrad,
>>>>
>>>> --On 20 May 2013 15:28:52 -0400 Konrad Rzeszutek Wilk
>>>> <konrad.wilk@oracle.com> wrote:
>>>>
>>>>> It was actually David (CC-ing him here). Alex, when you boot the hosts,
>>>>> are the RTC times the same? (date?)
>>>> I believe they boot with ntpdate and run ntp, so the wallclock times are
>>>> the same. I haven't specifically checked the CMOS clock times if that's
>>>> what you meant - I'm not even sure how one does that - but I believe
>>>> ntp writes to the CMOS RTC these days.
>>> 11 minutes after ntpd has started.
> NTP updates the persistent clock when it is first synchronized and then
> every 11 minutes.
>
>> I have checked our machines and they aren't running ntpd, but they
>> were synchronised with ntpdate. The date command showed the wallclock
>> was in sync and clock -r showed the rtc was also in sync. Both cases
>> have a delay of at most 1 second.
> Running ntpdate does /not/ set the persistent wallclock so the
> persistent clock may be incorrect however if the CMOS RTC is correct on
> both machines then the persistent clocks will be approximately the same
> anyway.
> So, you could try the wallclock[1] series but I think it is unlikely to
> help here.
>
> It looks like the guests may not be properly notified of a resume (after
> the migration) and therefore are not properly accounting for the step
> change in their clock source.  This means migrations one way look like
> the clock source has stepped backwards, and I guess this is confusing the
> kernel, as the clock source is supposed to be monotonic.
>
> Migrations the other way then work fine as the clock source steps
> forwards not backwards.
Actually, I have seen the problem replicated when the vm migrates back
to the original host (the one it was started on), as well as on a first
migrate. I can confirm I alternated between the hosts when creating the
vm to ensure I covered all the possible scenarios. This was done without
re-synchronising the hosts.
> Can you restart the hosts say, 5 minutes apart and then see if the
> guest's time recovers (without manually setting the time etc.) within ~5
> mins?
I restarted the hosts as you asked 5 minutes apart and the wallclocks 
and the hardware clocks are still in sync without having run any manual 
update/ ntpdate on the clock.

> David
>
> [1] http://lists.xen.org/archives/html/xen-devel/2013-05/msg01399.html


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-21 11:16                 ` Diana Crisan
@ 2013-05-21 12:49                   ` David Vrabel
  2013-05-21 13:16                     ` Alex Bligh
  0 siblings, 1 reply; 60+ messages in thread
From: David Vrabel @ 2013-05-21 12:49 UTC (permalink / raw)
  To: Diana Crisan
  Cc: Anthony PERARD, George Dunlap, xen-devel, Alex Bligh,
	Konrad Rzeszutek Wilk

On 21/05/13 12:16, Diana Crisan wrote:
> 
> Actually, I have seen the problem replicated when the vm migrates back
> to the original host ( the one it was started on), as well as on a first
> migrate. I can confirm I alternated between the hosts on the creation of
> the vm to ensure I get all the possible scenarios. This was done without
> re-synchronising the hosts.

I'm out of ideas then.

David


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-21 12:49                   ` David Vrabel
@ 2013-05-21 13:16                     ` Alex Bligh
  2013-05-24 16:16                       ` George Dunlap
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Bligh @ 2013-05-21 13:16 UTC (permalink / raw)
  To: David Vrabel, Diana Crisan
  Cc: Anthony PERARD, George Dunlap, xen-devel, Alex Bligh,
	Konrad Rzeszutek Wilk



--On 21 May 2013 13:49:21 +0100 David Vrabel <david.vrabel@citrix.com> 
wrote:

>> Actually, I have seen the problem replicated when the vm migrates back
>> to the original host ( the one it was started on), as well as on a first
>> migrate. I can confirm I alternated between the hosts on the creation of
>> the vm to ensure I get all the possible scenarios. This was done without
>> re-synchronising the hosts.
>
> I'm out of ideas then.

FWIW it's reproducible on every host h/w platform we've tried
(a total of 2).

-- 
Alex Bligh


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-21 13:16                     ` Alex Bligh
@ 2013-05-24 16:16                       ` George Dunlap
  2013-05-25 10:18                         ` Alex Bligh
  0 siblings, 1 reply; 60+ messages in thread
From: George Dunlap @ 2013-05-24 16:16 UTC (permalink / raw)
  To: Alex Bligh
  Cc: Anthony PERARD, Diana Crisan, xen-devel, David Vrabel,
	Konrad Rzeszutek Wilk

On Tue, May 21, 2013 at 2:16 PM, Alex Bligh <alex@alex.org.uk> wrote:
>
>
> --On 21 May 2013 13:49:21 +0100 David Vrabel <david.vrabel@citrix.com>
> wrote:
>
>>> Actually, I have seen the problem replicated when the vm migrates back
>>> to the original host ( the one it was started on), as well as on a first
>>> migrate. I can confirm I alternated between the hosts on the creation of
>>> the vm to ensure I get all the possible scenarios. This was done without
>>> re-synchronising the hosts.
>>
>>
>> I'm out of ideas then.
>
>
> FWIW it's reproducible on every host h/w platform we've tried
> (a total of 2).

Do you see the same effects if you do a local-host migrate?

 -George


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-24 16:16                       ` George Dunlap
@ 2013-05-25 10:18                         ` Alex Bligh
  2013-05-26  8:38                           ` Ian Campbell
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Bligh @ 2013-05-25 10:18 UTC (permalink / raw)
  To: George Dunlap
  Cc: Konrad Rzeszutek Wilk, xen-devel, David Vrabel, Alex Bligh,
	Anthony PERARD, Diana Crisan

George,

--On 24 May 2013 17:16:07 +0100 George Dunlap <George.Dunlap@eu.citrix.com> 
wrote:

>> FWIW it's reproducible on every host h/w platform we've tried
>> (a total of 2).
>
> Do you see the same effects if you do a local-host migrate?

I hadn't even realised that was possible. That would have made testing live 
migrate easier!

How do you avoid the name clash in xen-store? We'll try this next week.

-- 
Alex Bligh


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-25 10:18                         ` Alex Bligh
@ 2013-05-26  8:38                           ` Ian Campbell
  2013-05-28 15:06                             ` Diana Crisan
  0 siblings, 1 reply; 60+ messages in thread
From: Ian Campbell @ 2013-05-26  8:38 UTC (permalink / raw)
  To: Alex Bligh
  Cc: Konrad Rzeszutek Wilk, George Dunlap, xen-devel, David Vrabel,
	Anthony PERARD, Diana Crisan

On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
> George,
> 
> --On 24 May 2013 17:16:07 +0100 George Dunlap <George.Dunlap@eu.citrix.com> 
> wrote:
> 
> >> FWIW it's reproducible on every host h/w platform we've tried
> >> (a total of 2).
> >
> > Do you see the same effects if you do a local-host migrate?
> 
> I hadn't even realised that was possible. That would have made testing live 
> migrate easier!

That's basically the whole reason it is supported ;-)

> How do you avoid the name clash in xen-store?

Most toolstacks receive the incoming migration into a domain named
FOO-incoming or some such and then rename it to FOO upon completion. Some
also rename the outgoing domain "FOO-migratedaway" towards the end, so
that the bits of the final teardown which can safely happen after the
target has started can be done then.

Ian.


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-26  8:38                           ` Ian Campbell
@ 2013-05-28 15:06                             ` Diana Crisan
  2013-05-29 16:16                               ` Alex Bligh
  2013-05-30 15:26                               ` George Dunlap
  0 siblings, 2 replies; 60+ messages in thread
From: Diana Crisan @ 2013-05-28 15:06 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Konrad Rzeszutek Wilk, George Dunlap, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD

Hi,

On 26/05/13 09:38, Ian Campbell wrote:
> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>> George,
>>
>> --On 24 May 2013 17:16:07 +0100 George Dunlap <George.Dunlap@eu.citrix.com>
>> wrote:
>>
>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>> (a total of 2).
>>> Do you see the same effects if you do a local-host migrate?
>> I hadn't even realised that was possible. That would have made testing live
>> migrate easier!
> That's basically the whole reason it is supported ;-)
>
>> How do you avoid the name clash in xen-store?
> Most toolstacks receive the incoming migration into a domain named
> FOO-incoming or some such and then rename to FOO upon completion. Some
> also rename the outgoing domain "FOO-migratedaway" towards the end so
> that the bits of the final teardown which can safely happen after the
> target have start can be done so.
>
> Ian.
>
>

I am unsure what I am doing wrong, but I cannot seem to do a
localhost migrate.

I created a domU using "xl create xl.conf" and once it fully booted I 
issued an "xl migrate 11 localhost". This fails and gives the output below.

Would you please advise on how to get this working?

Thanks,
Diana


root@ubuntu:~# xl migrate 11 localhost
root@localhost's password:
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/2344)
Loading new save file <incoming migration stream> (new xl fmt info 
0x0/0x0/2344)
  Savefile contains xl domain config
xc: progress: Reloading memory pages: 53248/1048575    5%
xc: progress: Reloading memory pages: 105472/1048575   10%
libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12 
device model: spawn failed (rc=-3)
libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device 
model did not start: -3
libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model 
already exited
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream 
truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration 
target process [10934] exited with error status 3
Migration failed, resuming at sender.
xc: error: Cannot resume uncooperative HVM guests: Internal error
libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume failed 
for domain 11: Success
root@ubuntu:~# xl list
Name                                        ID   Mem VCPUs    State   Time(s)
Domain-0                                     0 14722 2     r-----   20161.3
416-vm                                      11   512 1     ---ss-      17.6


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-28 15:06                             ` Diana Crisan
@ 2013-05-29 16:16                               ` Alex Bligh
  2013-05-29 19:04                                 ` Ian Campbell
  2013-05-30 15:39                                 ` Frediano Ziglio
  2013-05-30 15:26                               ` George Dunlap
  1 sibling, 2 replies; 60+ messages in thread
From: Alex Bligh @ 2013-05-29 16:16 UTC (permalink / raw)
  To: Diana Crisan, Ian Campbell
  Cc: Konrad Rzeszutek Wilk, George Dunlap, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD

--On 28 May 2013 16:06:19 +0100 Diana Crisan <dcrisan@flexiant.com> wrote:

>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>> (a total of 2).
>>>> Do you see the same effects if you do a local-host migrate?
>>> I hadn't even realised that was possible. That would have made testing
>>> live migrate easier!
>> That's basically the whole reason it is supported ;-)
>>
>>> How do you avoid the name clash in xen-store?
>> Most toolstacks receive the incoming migration into a domain named
>> FOO-incoming or some such and then rename to FOO upon completion. Some
>> also rename the outgoing domain "FOO-migratedaway" towards the end so
>> that the bits of the final teardown which can safely happen after the
>> target have start can be done so.
>>
>
> I am unsure what I am doing wrong, but I cannot seem to be able to do a
> localhost migrate.
>
> I created a domU using "xl create xl.conf" and once it fully booted I
> issued an "xl migrate 11 localhost". This fails and gives the output
> below.
>
> Would you please advise on how to get this working?

Any ideas?

-- 
Alex Bligh


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-29 16:16                               ` Alex Bligh
@ 2013-05-29 19:04                                 ` Ian Campbell
  2013-05-30 14:30                                   ` George Dunlap
  2013-05-30 15:39                                 ` Frediano Ziglio
  1 sibling, 1 reply; 60+ messages in thread
From: Ian Campbell @ 2013-05-29 19:04 UTC (permalink / raw)
  To: Alex Bligh
  Cc: Konrad Rzeszutek Wilk, George Dunlap, xen-devel, David Vrabel,
	Anthony PERARD, Diana Crisan

On Wed, 2013-05-29 at 17:16 +0100, Alex Bligh wrote:
> --On 28 May 2013 16:06:19 +0100 Diana Crisan <dcrisan@flexiant.com> wrote:
> 
> >>>>> FWIW it's reproducible on every host h/w platform we've tried
> >>>>> (a total of 2).
> >>>> Do you see the same effects if you do a local-host migrate?
> >>> I hadn't even realised that was possible. That would have made testing
> >>> live migrate easier!
> >> That's basically the whole reason it is supported ;-)
> >>
> >>> How do you avoid the name clash in xen-store?
> >> Most toolstacks receive the incoming migration into a domain named
> >> FOO-incoming or some such and then rename to FOO upon completion. Some
> >> also rename the outgoing domain "FOO-migratedaway" towards the end so
> >> that the bits of the final teardown which can safely happen after the
> >> target have start can be done so.
> >>
> >
> > I am unsure what I am doing wrong, but I cannot seem to be able to do a
> > localhost migrate.
> >
> > I created a domU using "xl create xl.conf" and once it fully booted I
> > issued an "xl migrate 11 localhost". This fails and gives the output
> > below.
> >
> > Would you please advise on how to get this working?
> 
> Any ideas?

The usual questions: What does "xl -vvv ... " say? Is there anything of
interest under /var/log/xen/* (any variant of the domain's name)?

It might also be useful to patch tools/libxl/xl_cmdimpl.c:main_migrate()
to add -vvv to the receive side as well (yes, we should plumb that in
properly).

Ian


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-29 19:04                                 ` Ian Campbell
@ 2013-05-30 14:30                                   ` George Dunlap
  0 siblings, 0 replies; 60+ messages in thread
From: George Dunlap @ 2013-05-30 14:30 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Konrad Rzeszutek Wilk, xen-devel, David Vrabel, Alex Bligh,
	Anthony PERARD, Diana Crisan

On Wed, May 29, 2013 at 8:04 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> The usual questions: What does "xl -vvv ... " say? Is there anything of
> interest under /var/log/xen/* (any variant of the domain's name)?

In particular, look for errors in qemu-related files, since that's
what seems to have failed.

Since you said that pings failed when the clock stopped working, I'm
currently running the following mini-script in a Debian PVHVM guest.
So far it's made 18 iterations:

count=0; while ping -c 5 -w 10 10.80.238.151 ; do count=$(($count+1))
; echo "== migrate $count ==" ; xl migrate h0 localhost ; done

(It did once hang with all the domU's vcpus pegged at 100%, completely
unresponsive; but that looks like a different bug to the one you're
describing.)

I'll be playing around with the config a bit, and also trying a
Precise VM here in a bit.

 -George


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-14 13:11 ` HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI Diana Crisan
  2013-05-14 16:09   ` George Dunlap
  2013-05-15 13:46   ` Alex Bligh
@ 2013-05-30 14:32   ` George Dunlap
  2013-05-30 14:42     ` Diana Crisan
  2 siblings, 1 reply; 60+ messages in thread
From: George Dunlap @ 2013-05-30 14:32 UTC (permalink / raw)
  To: Diana Crisan; +Cc: xen-devel

On Tue, May 14, 2013 at 2:11 PM, Diana Crisan <dcrisan@flexiant.com> wrote:
> This is problem 1 of 3 problems we are having with live migration and/or ACPI on Xen-4.3 and Xen-4.2.
>
> Any help would be appreciated.
>
> Detailed description of problem:
>
> We are using Xen-4.3-rc1 with dom0 running Ubuntu Precise and 3.5.0-23-generic kernel, and domU running Ubuntu Precise (12.04) cloud images running 3.2.0-39-virtual. We are using the xl.conf below on qemu-upstream-dm and HVM and two identical sending and receiving machines (hardware and software)
>
> When live migration is instigated between two identical hardware configurations using 'xl migrate', the migrate completes but the system clock in domU appears to be stuck when the domU resumes on the receiving side. For instance, running 'top', 'date', or 'uptime' will constantly report the same result. The clocks in dom0 were synchronized before migration using ntpdate. A modification of the clock using the date command in the migrated domU solves the problem; migrating back to the original machine works, but after a third migration the problem reappears.
>
> Sometimes the clock is not stuck on the first migrate, but the problem is reproducible after several migrations.
>
> How to replicate:
>
> 1. Take two machines with identical hardware and software, running the xen-4.3-rc1 version of Xen on Ubuntu Precise with 3.5.0-23-generic kernel.
> 2. Use the xl.conf below as a configuration file.
> 3. Create a VM using Ubuntu Precise and 3.5.0-23 generic.

Sorry, one question -- here you have 3.5.0-23, but above you say that
you're using 3.2.0-39-virtual?  What OS is the guest running?

 -George


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-30 14:32   ` George Dunlap
@ 2013-05-30 14:42     ` Diana Crisan
  0 siblings, 0 replies; 60+ messages in thread
From: Diana Crisan @ 2013-05-30 14:42 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel

On 30/05/13 15:32, George Dunlap wrote:
> On Tue, May 14, 2013 at 2:11 PM, Diana Crisan <dcrisan@flexiant.com> wrote:
>> This is problem 1 of 3 problems we are having with live migration and/or ACPI on Xen-4.3 and Xen-4.2.
>>
>> Any help would be appreciated.
>>
>> Detailed description of problem:
>>
>> We are using Xen-4.3-rc1 with dom0 running Ubuntu Precise and 3.5.0-23-generic kernel, and domU running Ubuntu Precise (12.04) cloud images running 3.2.0-39-virtual. We are using the xl.conf below on qemu-upstream-dm and HVM and two identical sending and receiving machines (hardware and software)
>>
>> When live migration is instigated between two identical hardware configurations using 'xl migrate', the migrate completes but the system clock in domU appears to be stuck when the domU resumes on the receiving side. For instance, running 'top', 'date', or 'uptime' will constantly report the same result. The clocks in dom0 were synchronized before migration using ntpdate. A modification of the clock using the date command in the migrated domU solves the problem; migrating back to the original machine works, but after a third migration the problem reappears.
>>
>> Sometimes the clock is not stuck on the first migrate, but the problem is reproducible after several migrations.
>>
>> How to replicate:
>>
>> 1. Take two machines with identical hardware and software, running the xen-4.3-rc1 version of Xen on Ubuntu Precise with 3.5.0-23-generic kernel.
>> 2. Use the xl.conf below as a configuration file.
>> 3. Create a VM using Ubuntu Precise and 3.5.0-23 generic.
> Sorry, one question -- here you have 3.5.0-23, but above you say that
> you're using 3.2.0-39-virtual?  What OS is the guest running?
Sorry for the misunderstanding.
The host is 3.5.0-23-generic and the guest is 3.2.0-39-virtual.
>   -George
--
Diana


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-28 15:06                             ` Diana Crisan
  2013-05-29 16:16                               ` Alex Bligh
@ 2013-05-30 15:26                               ` George Dunlap
  2013-05-30 15:55                                 ` Diana Crisan
  1 sibling, 1 reply; 60+ messages in thread
From: George Dunlap @ 2013-05-30 15:26 UTC (permalink / raw)
  To: Diana Crisan
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD

On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com> wrote:
> Hi,
>
>
> On 26/05/13 09:38, Ian Campbell wrote:
>>
>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>
>>> George,
>>>
>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>> <George.Dunlap@eu.citrix.com>
>>> wrote:
>>>
>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>> (a total of 2).
>>>>
>>>> Do you see the same effects if you do a local-host migrate?
>>>
>>> I hadn't even realised that was possible. That would have made testing
>>> live
>>> migrate easier!
>>
>> That's basically the whole reason it is supported ;-)
>>
>>> How do you avoid the name clash in xen-store?
>>
>> Most toolstacks receive the incoming migration into a domain named
>> FOO-incoming or some such and then rename to FOO upon completion. Some
>> also rename the outgoing domain "FOO-migratedaway" towards the end so
>> that the bits of the final teardown which can safely happen after the
>> target have start can be done so.
>>
>> Ian.
>>
>>
>
> I am unsure what I am doing wrong, but I cannot seem to be able to do a
> localhost migrate.
>
> I created a domU using "xl create xl.conf" and once it fully booted I issued
> an "xl migrate 11 localhost". This fails and gives the output below.
>
> Would you please advise on how to get this working?
>
> Thanks,
> Diana
>
>
> root@ubuntu:~# xl migrate 11 localhost
> root@localhost's password:
> migration target: Ready to receive domain.
> Saving to migration stream new xl format (info 0x0/0x0/2344)
> Loading new save file <incoming migration stream> (new xl fmt info
> 0x0/0x0/2344)
>  Savefile contains xl domain config
> xc: progress: Reloading memory pages: 53248/1048575    5%
> xc: progress: Reloading memory pages: 105472/1048575   10%
> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12 device
> model: spawn failed (rc=-3)
> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device model
> did not start: -3
> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model
> already exited
> migration target: Domain creation failed (code -3).
> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream truncated
> reading ready message from migration receiver stream
> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration
> target process [10934] exited with error status 3
> Migration failed, resuming at sender.
> xc: error: Cannot resume uncooperative HVM guests: Internal error
> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume failed for
> domain 11: Success

Aha -- I managed to reproduce this one as well.

Your problem is the "vncunused=0" -- that's instructing qemu "You must
use this exact port for the vnc server".  But when you do the migrate,
that port is still in use by the "from" domain; so the qemu for the
"to" domain can't get it, and fails.

Obviously this should fail a lot more gracefully, but that's a bit of
a lower-priority bug I think.
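
A likely workaround, if a fixed VNC port is not actually required, is to let qemu pick a free port instead (a sketch; vncunused is described in the xl.cfg documentation):

vncunused=1    # search for a free VNC port rather than insisting on a fixed one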

 -George


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-29 16:16                               ` Alex Bligh
  2013-05-29 19:04                                 ` Ian Campbell
@ 2013-05-30 15:39                                 ` Frediano Ziglio
  1 sibling, 0 replies; 60+ messages in thread
From: Frediano Ziglio @ 2013-05-30 15:39 UTC (permalink / raw)
  To: alex
  Cc: Ian Campbell, konrad.wilk, George Dunlap, xen-devel,
	David Vrabel, Anthony Perard, dcrisan

On Wed, 2013-05-29 at 17:16 +0100, Alex Bligh wrote:
> --On 28 May 2013 16:06:19 +0100 Diana Crisan <dcrisan@flexiant.com> wrote:
> 
> >>>>> FWIW it's reproducible on every host h/w platform we've tried
> >>>>> (a total of 2).
> >>>> Do you see the same effects if you do a local-host migrate?
> >>> I hadn't even realised that was possible. That would have made testing
> >>> live migrate easier!
> >> That's basically the whole reason it is supported ;-)
> >>
> >>> How do you avoid the name clash in xen-store?
> >> Most toolstacks receive the incoming migration into a domain named
> >> FOO-incoming or some such and then rename to FOO upon completion. Some
> >> also rename the outgoing domain "FOO-migratedaway" towards the end so
> >> that the bits of the final teardown which can safely happen after the
> >> target have start can be done so.
> >>
> >
> > I am unsure what I am doing wrong, but I cannot seem to be able to do a
> > localhost migrate.
> >
> > I created a domU using "xl create xl.conf" and once it fully booted I
> > issued an "xl migrate 11 localhost". This fails and gives the output
> > below.
> >
> > Would you please advise on how to get this working?
> 
> Any ideas?
> 

I hardly think it's a problem with a missing timer interrupt, as the timer
is used for scheduling and I don't think a normal system can live without
it. That leaves the rdtsc and rtc clocks. rdtsc should be emulated but, at
the least, you should see a warning about a big drift and then it should
work again (rdtsc is widely used in newer OSes; for instance 64-bit Linux
uses it instead of syscalls). I suspect the rtc timer, as its implementation
is quite different with Xen and without. I can't actually try it.
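
As an aside, one way to see which clock source the guest kernel is actually using (run inside the guest, before and after a migrate; a sketch):

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
cat /sys/devices/system/clocksource/clocksource0/available_clocksource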

Frediano


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-30 15:26                               ` George Dunlap
@ 2013-05-30 15:55                                 ` Diana Crisan
  2013-05-30 16:06                                   ` George Dunlap
  0 siblings, 1 reply; 60+ messages in thread
From: Diana Crisan @ 2013-05-30 15:55 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD

On 30/05/13 16:26, George Dunlap wrote:
> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com> wrote:
>> Hi,
>>
>>
>> On 26/05/13 09:38, Ian Campbell wrote:
>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>> George,
>>>>
>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>> <George.Dunlap@eu.citrix.com>
>>>> wrote:
>>>>
>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>> (a total of 2).
>>>>> Do you see the same effects if you do a local-host migrate?
>>>> I hadn't even realised that was possible. That would have made testing
>>>> live
>>>> migrate easier!
>>> That's basically the whole reason it is supported ;-)
>>>
>>>> How do you avoid the name clash in xen-store?
>>> Most toolstacks receive the incoming migration into a domain named
>>> FOO-incoming or some such and then rename to FOO upon completion. Some
>>> also rename the outgoing domain "FOO-migratedaway" towards the end so
>>> that the bits of the final teardown which can safely happen after the
>>> target have start can be done so.
>>>
>>> Ian.
>>>
>>>
>> I am unsure what I am doing wrong, but I cannot seem to be able to do a
>> localhost migrate.
>>
>> I created a domU using "xl create xl.conf" and once it fully booted I issued
>> an "xl migrate 11 localhost". This fails and gives the output below.
>>
>> Would you please advise on how to get this working?
>>
>> Thanks,
>> Diana
>>
>>
>> root@ubuntu:~# xl migrate 11 localhost
>> root@localhost's password:
>> migration target: Ready to receive domain.
>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>> Loading new save file <incoming migration stream> (new xl fmt info
>> 0x0/0x0/2344)
>>   Savefile contains xl domain config
>> xc: progress: Reloading memory pages: 53248/1048575    5%
>> xc: progress: Reloading memory pages: 105472/1048575   10%
>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12 device
>> model: spawn failed (rc=-3)
>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device model
>> did not start: -3
>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model
>> already exited
>> migration target: Domain creation failed (code -3).
>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream truncated
>> reading ready message from migration receiver stream
>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration
>> target process [10934] exited with error status 3
>> Migration failed, resuming at sender.
>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume failed for
>> domain 11: Success
> Aha -- I managed to reproduce this one as well.
>
> Your problem is the "vncunused=0" -- that's instructing qemu "You must
> use this exact port for the vnc server".  But when you do the migrate,
> that port is still in use by the "from" domain; so the qemu for the
> "to" domain can't get it, and fails.
>
> Obviously this should fail a lot more gracefully, but that's a bit of
> a lower-priority bug I think.
>
>   -George
Yes, I managed to get to the bottom of it too and got vms migrating on 
localhost on our end.

I can confirm I did get the clock stuck problem while doing a localhost 
migrate.


--
Diana


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-30 15:55                                 ` Diana Crisan
@ 2013-05-30 16:06                                   ` George Dunlap
  2013-05-30 17:02                                     ` Diana Crisan
  2013-05-31  8:34                                     ` Diana Crisan
  0 siblings, 2 replies; 60+ messages in thread
From: George Dunlap @ 2013-05-30 16:06 UTC (permalink / raw)
  To: Diana Crisan
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD

On 05/30/2013 04:55 PM, Diana Crisan wrote:
> On 30/05/13 16:26, George Dunlap wrote:
>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
>> wrote:
>>> Hi,
>>>
>>>
>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>> George,
>>>>>
>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>>> <George.Dunlap@eu.citrix.com>
>>>>> wrote:
>>>>>
>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>> (a total of 2).
>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>> I hadn't even realised that was possible. That would have made testing
>>>>> live
>>>>> migrate easier!
>>>> That's basically the whole reason it is supported ;-)
>>>>
>>>>> How do you avoid the name clash in xen-store?
>>>> Most toolstacks receive the incoming migration into a domain named
>>>> FOO-incoming or some such and then rename to FOO upon completion. Some
>>>> also rename the outgoing domain "FOO-migratedaway" towards the end so
>>>> that the bits of the final teardown which can safely happen after the
>>>> target have start can be done so.
>>>>
>>>> Ian.
>>>>
>>>>
>>> I am unsure what I am doing wrong, but I cannot seem to be able to do a
>>> localhost migrate.
>>>
>>> I created a domU using "xl create xl.conf" and once it fully booted I
>>> issued
>>> an "xl migrate 11 localhost". This fails and gives the output below.
>>>
>>> Would you please advise on how to get this working?
>>>
>>> Thanks,
>>> Diana
>>>
>>>
>>> root@ubuntu:~# xl migrate 11 localhost
>>> root@localhost's password:
>>> migration target: Ready to receive domain.
>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>> Loading new save file <incoming migration stream> (new xl fmt info
>>> 0x0/0x0/2344)
>>>   Savefile contains xl domain config
>>> xc: progress: Reloading memory pages: 53248/1048575    5%
>>> xc: progress: Reloading memory pages: 105472/1048575   10%
>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>> device
>>> model: spawn failed (rc=-3)
>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>> model
>>> did not start: -3
>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model
>>> already exited
>>> migration target: Domain creation failed (code -3).
>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>> truncated
>>> reading ready message from migration receiver stream
>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration
>>> target process [10934] exited with error status 3
>>> Migration failed, resuming at sender.
>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>> failed for
>>> domain 11: Success
>> Aha -- I managed to reproduce this one as well.
>>
>> Your problem is the "vncunused=0" -- that's instructing qemu "You must
>> use this exact port for the vnc server".  But when you do the migrate,
>> that port is still in use by the "from" domain; so the qemu for the
>> "to" domain can't get it, and fails.
>>
>> Obviously this should fail a lot more gracefully, but that's a bit of
>> a lower-priority bug I think.
>>
>>   -George
> Yes, I managed to get to the bottom of it too and got vms migrating on
> localhost on our end.
>
> I can confirm I did get the clock stuck problem while doing a localhost
> migrate.

Does the script I posted earlier "work" for you (i.e., does it fail 
after some number of migrations)?

I've been using it to do a localhost migrate, using a nearly identical 
config as the one you posted (only difference, I'm using blkback rather 
than blktap), with an Ubuntu Precise VM using the 3.2.0-39-virtual 
kernel, and I'm up to 20 migrates with no problems.

Differences between my setup and yours at this point:
  - probably hardware (I've got an old AMD box)
  - dom0 kernel is Debian 2.6.32-5-xen
  - not using blktap

I've also been testing this on an Intel box, with the Debian 
3.2.0-4-686-pae kernel, with a Debian distro, and it's up to 103 
successful migrates.

It's possible that it's a model-specific issue, but it's sort of hard to 
see how the dom0 kernel, or blktap, could cause this.

Do you have any special kernel config parameters you're passing in to 
the guest?

Also, could you try a generic Debian Wheezy install, just to see if it's 
got something to do with the kernel?

  -George


* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-30 16:06                                   ` George Dunlap
@ 2013-05-30 17:02                                     ` Diana Crisan
  2013-05-31  8:34                                     ` Diana Crisan
  1 sibling, 0 replies; 60+ messages in thread
From: Diana Crisan @ 2013-05-30 17:02 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD

On 30/05/13 17:06, George Dunlap wrote:
> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>> On 30/05/13 16:26, George Dunlap wrote:
>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
>>> wrote:
>>>> Hi,
>>>>
>>>>
>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>> George,
>>>>>>
>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>>>> <George.Dunlap@eu.citrix.com>
>>>>>> wrote:
>>>>>>
>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>> (a total of 2).
>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>> I hadn't even realised that was possible. That would have made 
>>>>>> testing
>>>>>> live
>>>>>> migrate easier!
>>>>> That's basically the whole reason it is supported ;-)
>>>>>
>>>>>> How do you avoid the name clash in xen-store?
>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>> FOO-incoming or some such and then rename to FOO upon completion. 
>>>>> Some
>>>>> also rename the outgoing domain "FOO-migratedaway" towards the end so
>>>>> that the bits of the final teardown which can safely happen after the
>>>>> target have start can be done so.
>>>>>
>>>>> Ian.
>>>>>
>>>>>
>>>> I am unsure what I am doing wrong, but I cannot seem to be able to 
>>>> do a
>>>> localhost migrate.
>>>>
>>>> I created a domU using "xl create xl.conf" and once it fully booted I
>>>> issued
>>>> an "xl migrate 11 localhost". This fails and gives the output below.
>>>>
>>>> Would you please advise on how to get this working?
>>>>
>>>> Thanks,
>>>> Diana
>>>>
>>>>
>>>> root@ubuntu:~# xl migrate 11 localhost
>>>> root@localhost's password:
>>>> migration target: Ready to receive domain.
>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>> Loading new save file <incoming migration stream> (new xl fmt info
>>>> 0x0/0x0/2344)
>>>>   Savefile contains xl domain config
>>>> xc: progress: Reloading memory pages: 53248/1048575    5%
>>>> xc: progress: Reloading memory pages: 105472/1048575   10%
>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>>> device
>>>> model: spawn failed (rc=-3)
>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>>> model
>>>> did not start: -3
>>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device 
>>>> Model
>>>> already exited
>>>> migration target: Domain creation failed (code -3).
>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>>> truncated
>>>> reading ready message from migration receiver stream
>>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration
>>>> target process [10934] exited with error status 3
>>>> Migration failed, resuming at sender.
>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>>> failed for
>>>> domain 11: Success
>>> Aha -- I managed to reproduce this one as well.
>>>
>>> Your problem is the "vncunused=0" -- that's instructing qemu "You must
>>> use this exact port for the vnc server".  But when you do the migrate,
>>> that port is still in use by the "from" domain; so the qemu for the
>>> "to" domain can't get it, and fails.
>>>
>>> Obviously this should fail a lot more gracefully, but that's a bit of
>>> a lower-priority bug I think.
>>>
>>>   -George
>> Yes, I managed to get to the bottom of it too and got vms migrating on
>> localhost on our end.
>>
>> I can confirm I did get the clock stuck problem while doing a localhost
>> migrate.
>
> Does the script I posted earlier "work" for you (i.e., does it fail 
> after some number of migrations)?
>

Yes, it does "work". I got the vm to break down within 32 localhost 
migrations.

> I've been using it to do a localhost migrate, using a nearly identical 
> config as the one you posted (only difference, I'm using blkback 
> rather than blktap), with an Ubuntu Precise VM using the 
> 3.2.0-39-virtual kernel, and I'm up to 20 migrates with no problems.
>
> Differences between my setup and yours at this point:
>  - probably hardware (I've got an old AMD box)
>  - dom0 kernel is Debian 2.6.32-5-xen
>  - not using blktap
>
> I've also been testing this on an Intel box, with the Debian 
> 3.2.0-4-686-pae kernel, with a Debian distro, and it's up to 103 
> successful migrates.
The different hardware platforms I tried are:
16 core AMD Turion(tm) II Neo N40L Dual-Core Processor
and
16 core AMD Opteron(TM) Processor 6212
>
> It's possible that it's a model-specific issue, but it's sort of hard 
> to see how the dom0 kernel, or blktap, could cause this.
>
> Do you have any special kernel config parameters you're passing in to 
> the guest?
>
No.

> Also, could you try a generic Debian Wheezy install, just to see if 
> it's got something to do with the kernel?
>

I take it you mean for the guest?
Will try tomorrow and let you know.
>  -George

--
Diana

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-30 16:06                                   ` George Dunlap
  2013-05-30 17:02                                     ` Diana Crisan
@ 2013-05-31  8:34                                     ` Diana Crisan
  2013-05-31 10:54                                       ` George Dunlap
  1 sibling, 1 reply; 60+ messages in thread
From: Diana Crisan @ 2013-05-31  8:34 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD

George,
On 30/05/13 17:06, George Dunlap wrote:
> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>> On 30/05/13 16:26, George Dunlap wrote:
>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
>>> wrote:
>>>> Hi,
>>>>
>>>>
>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>> George,
>>>>>>
>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>>>> <George.Dunlap@eu.citrix.com>
>>>>>> wrote:
>>>>>>
>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>> (a total of 2).
>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>> I hadn't even realised that was possible. That would have made 
>>>>>> testing
>>>>>> live
>>>>>> migrate easier!
>>>>> That's basically the whole reason it is supported ;-)
>>>>>
>>>>>> How do you avoid the name clash in xen-store?
>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>> FOO-incoming or some such and then rename to FOO upon completion. 
>>>>> Some
>>>>> also rename the outgoing domain "FOO-migratedaway" towards the end so
>>>>> that the bits of the final teardown which can safely happen after the
>>>>> target have start can be done so.
>>>>>
>>>>> Ian.
>>>>>
>>>>>
>>>> I am unsure what I am doing wrong, but I cannot seem to be able to 
>>>> do a
>>>> localhost migrate.
>>>>
>>>> I created a domU using "xl create xl.conf" and once it fully booted I
>>>> issued
>>>> an "xl migrate 11 localhost". This fails and gives the output below.
>>>>
>>>> Would you please advise on how to get this working?
>>>>
>>>> Thanks,
>>>> Diana
>>>>
>>>>
>>>> root@ubuntu:~# xl migrate 11 localhost
>>>> root@localhost's password:
>>>> migration target: Ready to receive domain.
>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>> Loading new save file <incoming migration stream> (new xl fmt info
>>>> 0x0/0x0/2344)
>>>>   Savefile contains xl domain config
>>>> xc: progress: Reloading memory pages: 53248/1048575    5%
>>>> xc: progress: Reloading memory pages: 105472/1048575   10%
>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>>> device
>>>> model: spawn failed (rc=-3)
>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>>> model
>>>> did not start: -3
>>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device 
>>>> Model
>>>> already exited
>>>> migration target: Domain creation failed (code -3).
>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>>> truncated
>>>> reading ready message from migration receiver stream
>>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration
>>>> target process [10934] exited with error status 3
>>>> Migration failed, resuming at sender.
>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>>> failed for
>>>> domain 11: Success
>>> Aha -- I managed to reproduce this one as well.
>>>
>>> Your problem is the "vncunused=0" -- that's instructing qemu "You must
>>> use this exact port for the vnc server".  But when you do the migrate,
>>> that port is still in use by the "from" domain; so the qemu for the
>>> "to" domain can't get it, and fails.
>>>
>>> Obviously this should fail a lot more gracefully, but that's a bit of
>>> a lower-priority bug I think.
>>>
>>>   -George
>> Yes, I managed to get to the bottom of it too and got vms migrating on
>> localhost on our end.
>>
>> I can confirm I did get the clock stuck problem while doing a localhost
>> migrate.
>
> Does the script I posted earlier "work" for you (i.e., does it fail 
> after some number of migrations)?
>

I left your script running throughout the night and it seems that it 
does not always catch the problem. I see the following:

1. the VM has its clock stuck
2. the script is still running, as it seems the VM is still ping-able
3. migration fails on the basis that the VM does not ack the suspend 
request (see below).

libxl: error: libxl_dom.c:1063:libxl__domain_suspend_common_callback: 
guest didn't acknowledge suspend, cancelling request
libxl: error: libxl_dom.c:1085:libxl__domain_suspend_common_callback: 
guest didn't acknowledge suspend, request cancelled
xc: error: Suspend request failed: Internal error
xc: error: Domain appears not to have suspended: Internal error
libxl: error: libxl_dom.c:1370:libxl__xc_domain_save_done: saving 
domain: domain did not respond to suspend request: Invalid argument
migration sender: libxl_domain_suspend failed (rc=-8)
xc: error: 0-length read: Internal error
xc: error: read_exact_timed failed (read rc: 0, errno: 0): Internal error
xc: error: Error when reading batch size (0 = Success): Internal error
xc: error: Error when reading batch (0 = Success): Internal error
libxl: error: libxl_create.c:834:libxl__xc_domain_restore_done: 
restoring domain: Resource temporarily unavailable
libxl: error: libxl_create.c:916:domcreate_rebuild_done: cannot 
(re-)build domain: -3
libxl: error: libxl.c:1378:libxl__destroy_domid: non-existant domain 111
libxl: error: libxl.c:1342:domain_destroy_callback: unable to destroy 
guest with domid 111
libxl: error: libxl_create.c:1225:domcreate_destruction_cb: unable to 
destroy domain 111 following failed creation
migration target: Domain creation failed (code -3).
libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration 
target process [7849] exited with error status 3
Migration failed, failed to suspend at sender.
PING 172.16.1.223 (172.16.1.223) 56(84) bytes of data.
64 bytes from 172.16.1.223: icmp_req=1 ttl=64 time=0.339 ms
64 bytes from 172.16.1.223: icmp_req=2 ttl=64 time=0.569 ms
64 bytes from 172.16.1.223: icmp_req=3 ttl=64 time=0.535 ms
64 bytes from 172.16.1.223: icmp_req=4 ttl=64 time=0.544 ms
64 bytes from 172.16.1.223: icmp_req=5 ttl=64 time=0.529 ms


> I've been using it to do a localhost migrate, using a nearly identical 
> config as the one you posted (only difference, I'm using blkback 
> rather than blktap), with an Ubuntu Precise VM using the 
> 3.2.0-39-virtual kernel, and I'm up to 20 migrates with no problems.
>
> Differences between my setup and yours at this point:
>  - probably hardware (I've got an old AMD box)
>  - dom0 kernel is Debian 2.6.32-5-xen
>  - not using blktap
>
> I've also been testing this on an Intel box, with the Debian 
> 3.2.0-4-686-pae kernel, with a Debian distro, and it's up to 103 
> successful migrates.
>
> It's possible that it's a model-specific issue, but it's sort of hard 
> to see how the dom0 kernel, or blktap, could cause this.
>
> Do you have any special kernel config parameters you're passing in to 
> the guest?
>
> Also, could you try a generic Debian Wheezy install, just to see if 
> it's got something to do with the kernel?
>
>  -George


I reckon our code caught a separate problem alongside this issue: whenever 
the VM got its clock stuck, the network interface wasn't coming back up, 
and I would see NO-CARRIER for the guest, which made it unreachable.

--
Diana

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31  8:34                                     ` Diana Crisan
@ 2013-05-31 10:54                                       ` George Dunlap
  2013-05-31 10:59                                         ` George Dunlap
                                                           ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: George Dunlap @ 2013-05-31 10:54 UTC (permalink / raw)
  To: Diana Crisan
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD

On 31/05/13 09:34, Diana Crisan wrote:
> George,
> On 30/05/13 17:06, George Dunlap wrote:
>> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>>> On 30/05/13 16:26, George Dunlap wrote:
>>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>>> George,
>>>>>>>
>>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>>>>> <George.Dunlap@eu.citrix.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>>> (a total of 2).
>>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>>> I hadn't even realised that was possible. That would have made 
>>>>>>> testing
>>>>>>> live
>>>>>>> migrate easier!
>>>>>> That's basically the whole reason it is supported ;-)
>>>>>>
>>>>>>> How do you avoid the name clash in xen-store?
>>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>>> FOO-incoming or some such and then rename to FOO upon completion. 
>>>>>> Some
>>>>>> also rename the outgoing domain "FOO-migratedaway" towards the 
>>>>>> end so
>>>>>> that the bits of the final teardown which can safely happen after 
>>>>>> the
>>>>>> target have start can be done so.
>>>>>>
>>>>>> Ian.
>>>>>>
>>>>>>
>>>>> I am unsure what I am doing wrong, but I cannot seem to be able to 
>>>>> do a
>>>>> localhost migrate.
>>>>>
>>>>> I created a domU using "xl create xl.conf" and once it fully booted I
>>>>> issued
>>>>> an "xl migrate 11 localhost". This fails and gives the output below.
>>>>>
>>>>> Would you please advise on how to get this working?
>>>>>
>>>>> Thanks,
>>>>> Diana
>>>>>
>>>>>
>>>>> root@ubuntu:~# xl migrate 11 localhost
>>>>> root@localhost's password:
>>>>> migration target: Ready to receive domain.
>>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>>> Loading new save file <incoming migration stream> (new xl fmt info
>>>>> 0x0/0x0/2344)
>>>>>   Savefile contains xl domain config
>>>>> xc: progress: Reloading memory pages: 53248/1048575    5%
>>>>> xc: progress: Reloading memory pages: 105472/1048575   10%
>>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>>>> device
>>>>> model: spawn failed (rc=-3)
>>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>>>> model
>>>>> did not start: -3
>>>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device 
>>>>> Model
>>>>> already exited
>>>>> migration target: Domain creation failed (code -3).
>>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>>>> truncated
>>>>> reading ready message from migration receiver stream
>>>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: 
>>>>> migration
>>>>> target process [10934] exited with error status 3
>>>>> Migration failed, resuming at sender.
>>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>>>> failed for
>>>>> domain 11: Success
>>>> Aha -- I managed to reproduce this one as well.
>>>>
>>>> Your problem is the "vncunused=0" -- that's instructing qemu "You must
>>>> use this exact port for the vnc server".  But when you do the migrate,
>>>> that port is still in use by the "from" domain; so the qemu for the
>>>> "to" domain can't get it, and fails.
>>>>
>>>> Obviously this should fail a lot more gracefully, but that's a bit of
>>>> a lower-priority bug I think.
>>>>
>>>>   -George
>>> Yes, I managed to get to the bottom of it too and got vms migrating on
>>> localhost on our end.
>>>
>>> I can confirm I did get the clock stuck problem while doing a localhost
>>> migrate.
>>
>> Does the script I posted earlier "work" for you (i.e., does it fail 
>> after some number of migrations)?
>>
>
> I left your script running throughout the night and it seems that it 
> does not always catch the problem. I see the following:
>
> 1. vm has the clock stuck
> 2. script is still running as it seems the vm is still ping-able.
> 3. migration fails on the basis that the vm is does not ack the 
> suspend request (see below).

So I wrote a script to run "date", sleep for 2 seconds, and run "date" a 
second time -- and eventually the *sleep* hung.

The VM is still responsive, and I can log in; if I type "date" manually 
successive times then I get an advancing clock, but if I type "sleep 1" 
it just hangs.

If you run "dmesg" in the guest, do you see the following line?

CE: Reprogramming failure. Giving up

  -George

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 10:54                                       ` George Dunlap
@ 2013-05-31 10:59                                         ` George Dunlap
  2013-05-31 11:41                                           ` George Dunlap
  2013-05-31 21:30                                           ` Konrad Rzeszutek Wilk
  2013-05-31 11:18                                         ` Alex Bligh
  2013-05-31 11:36                                         ` Diana Crisan
  2 siblings, 2 replies; 60+ messages in thread
From: George Dunlap @ 2013-05-31 10:59 UTC (permalink / raw)
  To: Diana Crisan
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, Stefano Stabellini,
	xen-devel, David Vrabel, Alex Bligh, Anthony PERARD

On 31/05/13 11:54, George Dunlap wrote:
> On 31/05/13 09:34, Diana Crisan wrote:
>> George,
>> On 30/05/13 17:06, George Dunlap wrote:
>>> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>>>> On 30/05/13 16:26, George Dunlap wrote:
>>>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
>>>>> wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>>>> George,
>>>>>>>>
>>>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>>>>>> <George.Dunlap@eu.citrix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>>>> (a total of 2).
>>>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>>>> I hadn't even realised that was possible. That would have made 
>>>>>>>> testing
>>>>>>>> live
>>>>>>>> migrate easier!
>>>>>>> That's basically the whole reason it is supported ;-)
>>>>>>>
>>>>>>>> How do you avoid the name clash in xen-store?
>>>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>>>> FOO-incoming or some such and then rename to FOO upon 
>>>>>>> completion. Some
>>>>>>> also rename the outgoing domain "FOO-migratedaway" towards the 
>>>>>>> end so
>>>>>>> that the bits of the final teardown which can safely happen 
>>>>>>> after the
>>>>>>> target have start can be done so.
>>>>>>>
>>>>>>> Ian.
>>>>>>>
>>>>>>>
>>>>>> I am unsure what I am doing wrong, but I cannot seem to be able 
>>>>>> to do a
>>>>>> localhost migrate.
>>>>>>
>>>>>> I created a domU using "xl create xl.conf" and once it fully 
>>>>>> booted I
>>>>>> issued
>>>>>> an "xl migrate 11 localhost". This fails and gives the output below.
>>>>>>
>>>>>> Would you please advise on how to get this working?
>>>>>>
>>>>>> Thanks,
>>>>>> Diana
>>>>>>
>>>>>>
>>>>>> root@ubuntu:~# xl migrate 11 localhost
>>>>>> root@localhost's password:
>>>>>> migration target: Ready to receive domain.
>>>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>>>> Loading new save file <incoming migration stream> (new xl fmt info
>>>>>> 0x0/0x0/2344)
>>>>>>   Savefile contains xl domain config
>>>>>> xc: progress: Reloading memory pages: 53248/1048575 5%
>>>>>> xc: progress: Reloading memory pages: 105472/1048575 10%
>>>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>>>>> device
>>>>>> model: spawn failed (rc=-3)
>>>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>>>>> model
>>>>>> did not start: -3
>>>>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device 
>>>>>> Model
>>>>>> already exited
>>>>>> migration target: Domain creation failed (code -3).
>>>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>>>>> truncated
>>>>>> reading ready message from migration receiver stream
>>>>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: 
>>>>>> migration
>>>>>> target process [10934] exited with error status 3
>>>>>> Migration failed, resuming at sender.
>>>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>>>>> failed for
>>>>>> domain 11: Success
>>>>> Aha -- I managed to reproduce this one as well.
>>>>>
>>>>> Your problem is the "vncunused=0" -- that's instructing qemu "You 
>>>>> must
>>>>> use this exact port for the vnc server".  But when you do the 
>>>>> migrate,
>>>>> that port is still in use by the "from" domain; so the qemu for the
>>>>> "to" domain can't get it, and fails.
>>>>>
>>>>> Obviously this should fail a lot more gracefully, but that's a bit of
>>>>> a lower-priority bug I think.
>>>>>
>>>>>   -George
>>>> Yes, I managed to get to the bottom of it too and got vms migrating on
>>>> localhost on our end.
>>>>
>>>> I can confirm I did get the clock stuck problem while doing a 
>>>> localhost
>>>> migrate.
>>>
>>> Does the script I posted earlier "work" for you (i.e., does it fail 
>>> after some number of migrations)?
>>>
>>
>> I left your script running throughout the night and it seems that it 
>> does not always catch the problem. I see the following:
>>
>> 1. vm has the clock stuck
>> 2. script is still running as it seems the vm is still ping-able.
>> 3. migration fails on the basis that the vm is does not ack the 
>> suspend request (see below).
>
> So I wrote a script to run "date", sleep for 2 seconds, and run "date" 
> a second time -- and eventually the *sleep* hung.
>
> The VM is still responsive, and I can log in; if I type "date" 
> manually successive times then I get an advancing clock, but if I type 
> "sleep 1" it just hangs.
>
> If you run "dmesg" in the guest, do you see the following line?
>
> CE: Reprogramming failure. Giving up

I think this must be it; on my other box, I got the following messages:

[  224.732083] PM: late freeze of devices complete after 3.787 msecs
[  224.736062] Xen HVM callback vector for event delivery is enabled
[  224.736062] Xen Platform PCI: I/O protocol version 1
[  224.736062] xen: --> irq=8, pirq=16
[  224.736062] xen: --> irq=12, pirq=17
[  224.736062] xen: --> irq=1, pirq=18
[  224.736062] xen: --> irq=6, pirq=19
[  224.736062] xen: --> irq=4, pirq=20
[  224.736062] xen: --> irq=7, pirq=21
[  224.736062] xen: --> irq=28, pirq=22
[  224.736062] ata_piix 0000:00:01.1: restoring config space at offset 
0x1 (was 0x2800001, writing 0x2800005)
[  224.736062] PM: early restore of devices complete after 5.854 msecs
[  224.739692] ata_piix 0000:00:01.1: setting latency timer to 64
[  224.739782] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28 
(level, low) -> IRQ 28
[  224.746900] PM: restore of devices complete after 7.540 msecs
[  224.758612] Setting capacity to 16777216
[  224.758749] Setting capacity to 16777216
[  224.898426] ata2.01: NODEV after polling detection
[  224.900941] ata2.00: configured for MWDMA2
[  231.055978] CE: xen increased min_delta_ns to 150000 nsec
[  231.055986] hrtimer: interrupt took 14460 ns
[  247.893303] PM: freeze of devices complete after 2.168 msecs
[  247.893306] suspending xenstore...
[  247.896977] PM: late freeze of devices complete after 3.666 msecs
[  247.900067] Xen HVM callback vector for event delivery is enabled
[  247.900067] Xen Platform PCI: I/O protocol version 1
[  247.900067] xen: --> irq=8, pirq=16
[  247.900067] xen: --> irq=12, pirq=17
[  247.900067] xen: --> irq=1, pirq=18
[  247.900067] xen: --> irq=6, pirq=19
[  247.900067] xen: --> irq=4, pirq=20
[  247.900067] xen: --> irq=7, pirq=21
[  247.900067] xen: --> irq=28, pirq=22
[  247.900067] ata_piix 0000:00:01.1: restoring config space at offset 
0x1 (was 0x2800001, writing 0x2800005)
[  247.900067] PM: early restore of devices complete after 4.612 msecs
[  247.906454] ata_piix 0000:00:01.1: setting latency timer to 64
[  247.906558] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28 
(level, low) -> IRQ 28
[  247.914770] PM: restore of devices complete after 8.762 msecs
[  247.926557] Setting capacity to 16777216
[  247.926661] Setting capacity to 16777216
[  248.066661] ata2.01: NODEV after polling detection
[  248.067326] CE: xen increased min_delta_ns to 225000 nsec
[  248.067344] CE: xen increased min_delta_ns to 337500 nsec
[  248.067361] CE: xen increased min_delta_ns to 506250 nsec
[  248.067378] CE: xen increased min_delta_ns to 759375 nsec
[  248.067396] CE: xen increased min_delta_ns to 1139062 nsec
[  248.067413] CE: xen increased min_delta_ns to 1708593 nsec
[  248.067428] CE: xen increased min_delta_ns to 2562889 nsec
[  248.067441] CE: xen increased min_delta_ns to 3844333 nsec
[  248.067453] CE: xen increased min_delta_ns to 4000000 nsec
[  248.067466] CE: Reprogramming failure. Giving up
[  248.068075] ata2.00: configured for MWDMA2

Note the "CE: xen increased min_delta_ns to 150000nsec" at 231 for the 
previous suspend, and now it's increasing it up to 4 milliseconds before 
giving up for this suspend.

Konrad, stefano, any idea what's going on here?

  -George

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 10:54                                       ` George Dunlap
  2013-05-31 10:59                                         ` George Dunlap
@ 2013-05-31 11:18                                         ` Alex Bligh
  2013-05-31 11:36                                         ` Diana Crisan
  2 siblings, 0 replies; 60+ messages in thread
From: Alex Bligh @ 2013-05-31 11:18 UTC (permalink / raw)
  To: George Dunlap, Diana Crisan
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD

George,

--On 31 May 2013 11:54:09 +0100 George Dunlap <george.dunlap@eu.citrix.com> 
wrote:

> So I wrote a script to run "date", sleep for 2 seconds, and run "date" a
> second time -- and eventually the *sleep* hung.
>
> The VM is still responsive, and I can log in; if I type "date" manually
> successive times then I get an advancing clock, but if I type "sleep 1"
> it just hangs.

Yes that's exactly what we're seeing. Anything that uses the wallclock
hangs. IE you can ping the VM, but the VM can't ping as the ping command
uses the wallclock timer.

I'll leave the point about dmesg to Diana.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 10:54                                       ` George Dunlap
  2013-05-31 10:59                                         ` George Dunlap
  2013-05-31 11:18                                         ` Alex Bligh
@ 2013-05-31 11:36                                         ` Diana Crisan
  2013-05-31 11:41                                           ` Diana Crisan
  2 siblings, 1 reply; 60+ messages in thread
From: Diana Crisan @ 2013-05-31 11:36 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD



On 31 May 2013, at 11:54, George Dunlap <george.dunlap@eu.citrix.com> wrote:

> On 31/05/13 09:34, Diana Crisan wrote:
>> George,
>> On 30/05/13 17:06, George Dunlap wrote:
>>> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>>>> On 30/05/13 16:26, George Dunlap wrote:
>>>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
>>>>> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> 
>>>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>>>> George,
>>>>>>>> 
>>>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>>>>>> <George.Dunlap@eu.citrix.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>>>> (a total of 2).
>>>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>>>> I hadn't even realised that was possible. That would have made testing
>>>>>>>> live
>>>>>>>> migrate easier!
>>>>>>> That's basically the whole reason it is supported ;-)
>>>>>>> 
>>>>>>>> How do you avoid the name clash in xen-store?
>>>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>>>> FOO-incoming or some such and then rename to FOO upon completion. Some
>>>>>>> also rename the outgoing domain "FOO-migratedaway" towards the end so
>>>>>>> that the bits of the final teardown which can safely happen after the
>>>>>>> target have start can be done so.
>>>>>>> 
>>>>>>> Ian.
>>>>>> I am unsure what I am doing wrong, but I cannot seem to be able to do a
>>>>>> localhost migrate.
>>>>>> 
>>>>>> I created a domU using "xl create xl.conf" and once it fully booted I
>>>>>> issued
>>>>>> an "xl migrate 11 localhost". This fails and gives the output below.
>>>>>> 
>>>>>> Would you please advise on how to get this working?
>>>>>> 
>>>>>> Thanks,
>>>>>> Diana
>>>>>> 
>>>>>> 
>>>>>> root@ubuntu:~# xl migrate 11 localhost
>>>>>> root@localhost's password:
>>>>>> migration target: Ready to receive domain.
>>>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>>>> Loading new save file <incoming migration stream> (new xl fmt info
>>>>>> 0x0/0x0/2344)
>>>>>>  Savefile contains xl domain config
>>>>>> xc: progress: Reloading memory pages: 53248/1048575    5%
>>>>>> xc: progress: Reloading memory pages: 105472/1048575   10%
>>>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>>>>> device
>>>>>> model: spawn failed (rc=-3)
>>>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>>>>> model
>>>>>> did not start: -3
>>>>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model
>>>>>> already exited
>>>>>> migration target: Domain creation failed (code -3).
>>>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>>>>> truncated
>>>>>> reading ready message from migration receiver stream
>>>>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration
>>>>>> target process [10934] exited with error status 3
>>>>>> Migration failed, resuming at sender.
>>>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>>>>> failed for
>>>>>> domain 11: Success
>>>>> Aha -- I managed to reproduce this one as well.
>>>>> 
>>>>> Your problem is the "vncunused=0" -- that's instructing qemu "You must
>>>>> use this exact port for the vnc server".  But when you do the migrate,
>>>>> that port is still in use by the "from" domain; so the qemu for the
>>>>> "to" domain can't get it, and fails.
>>>>> 
>>>>> Obviously this should fail a lot more gracefully, but that's a bit of
>>>>> a lower-priority bug I think.
>>>>> 
>>>>>  -George
>>>> Yes, I managed to get to the bottom of it too and got vms migrating on
>>>> localhost on our end.
>>>> 
>>>> I can confirm I did get the clock stuck problem while doing a localhost
>>>> migrate.
>>> 
>>> Does the script I posted earlier "work" for you (i.e., does it fail after some number of migrations)?
>> 
>> I left your script running throughout the night and it seems that it does not always catch the problem. I see the following:
>> 
>> 1. vm has the clock stuck
>> 2. script is still running as it seems the vm is still ping-able.
>> 3. migration fails on the basis that the vm is does not ack the suspend request (see below).
> 
> So I wrote a script to run "date", sleep for 2 seconds, and run "date" a second time -- and eventually the *sleep* hung.
> 
> The VM is still responsive, and I can log in; if I type "date" manually successive times then I get an advancing clock, but if I type "sleep 1" it just hangs.
> 
> If you run "dmesg" in the guest, do you see the following line?
> 
> CE: Reprogramming failure. Giving up
> 

I do. It is preceded by:
CE: xen increased min_delta_ns to 4000000 nsec

> -George

--
Diana

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 11:36                                         ` Diana Crisan
@ 2013-05-31 11:41                                           ` Diana Crisan
  2013-05-31 11:49                                             ` George Dunlap
  0 siblings, 1 reply; 60+ messages in thread
From: Diana Crisan @ 2013-05-31 11:41 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD



On 31 May 2013, at 12:36, Diana Crisan <dcrisan@flexiant.com> wrote:

> 
> 
> On 31 May 2013, at 11:54, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> 
>> On 31/05/13 09:34, Diana Crisan wrote:
>>> George,
>>> On 30/05/13 17:06, George Dunlap wrote:
>>>> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>>>>> On 30/05/13 16:26, George Dunlap wrote:
>>>>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> 
>>>>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>>>>> George,
>>>>>>>>> 
>>>>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>>>>>>> <George.Dunlap@eu.citrix.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>>>>> (a total of 2).
>>>>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>>>>> I hadn't even realised that was possible. That would have made testing
>>>>>>>>> live
>>>>>>>>> migrate easier!
>>>>>>>> That's basically the whole reason it is supported ;-)
>>>>>>>> 
>>>>>>>>> How do you avoid the name clash in xen-store?
>>>>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>>>>> FOO-incoming or some such and then rename to FOO upon completion. Some
>>>>>>>> also rename the outgoing domain "FOO-migratedaway" towards the end so
>>>>>>>> that the bits of the final teardown which can safely happen after the
>>>>>>>> target have start can be done so.
>>>>>>>> 
>>>>>>>> Ian.
>>>>>>> I am unsure what I am doing wrong, but I cannot seem to be able to do a
>>>>>>> localhost migrate.
>>>>>>> 
>>>>>>> I created a domU using "xl create xl.conf" and once it fully booted I
>>>>>>> issued
>>>>>>> an "xl migrate 11 localhost". This fails and gives the output below.
>>>>>>> 
>>>>>>> Would you please advise on how to get this working?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Diana
>>>>>>> 
>>>>>>> 
>>>>>>> root@ubuntu:~# xl migrate 11 localhost
>>>>>>> root@localhost's password:
>>>>>>> migration target: Ready to receive domain.
>>>>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>>>>> Loading new save file <incoming migration stream> (new xl fmt info
>>>>>>> 0x0/0x0/2344)
>>>>>>> Savefile contains xl domain config
>>>>>>> xc: progress: Reloading memory pages: 53248/1048575    5%
>>>>>>> xc: progress: Reloading memory pages: 105472/1048575   10%
>>>>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>>>>>> device
>>>>>>> model: spawn failed (rc=-3)
>>>>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>>>>>> model
>>>>>>> did not start: -3
>>>>>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model
>>>>>>> already exited
>>>>>>> migration target: Domain creation failed (code -3).
>>>>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>>>>>> truncated
>>>>>>> reading ready message from migration receiver stream
>>>>>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration
>>>>>>> target process [10934] exited with error status 3
>>>>>>> Migration failed, resuming at sender.
>>>>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>>>>>> failed for
>>>>>>> domain 11: Success
>>>>>> Aha -- I managed to reproduce this one as well.
>>>>>> 
>>>>>> Your problem is the "vncunused=0" -- that's instructing qemu "You must
>>>>>> use this exact port for the vnc server".  But when you do the migrate,
>>>>>> that port is still in use by the "from" domain; so the qemu for the
>>>>>> "to" domain can't get it, and fails.
>>>>>> 
>>>>>> Obviously this should fail a lot more gracefully, but that's a bit of
>>>>>> a lower-priority bug I think.
>>>>>> 
>>>>>> -George
>>>>> Yes, I managed to get to the bottom of it too and got vms migrating on
>>>>> localhost on our end.
>>>>> 
>>>>> I can confirm I did get the clock stuck problem while doing a localhost
>>>>> migrate.
>>>> 
>>>> Does the script I posted earlier "work" for you (i.e., does it fail after some number of migrations)?
>>> 
>>> I left your script running throughout the night and it seems that it does not always catch the problem. I see the following:
>>> 
>>> 1. vm has the clock stuck
>>> 2. script is still running as it seems the vm is still ping-able.
>>> 3. migration fails on the basis that the vm is does not ack the suspend request (see below).
>> 
>> So I wrote a script to run "date", sleep for 2 seconds, and run "date" a second time -- and eventually the *sleep* hung.
>> 
>> The VM is still responsive, and I can log in; if I type "date" manually successive times then I get an advancing clock, but if I type "sleep 1" it just hangs.
>> 
>> If you run "dmesg" in the guest, do you see the following line?
>> 
>> CE: Reprogramming failure. Giving up
> 
> I do. It is preceded by:
> CE: xen increased min_delta_ns to 4000000 nsec
> 

It seems that it always gets stuck when min_delta_ns reaches 4,000,000 nsec. Could this be it? Overflow, perhaps?


>> -George
> 
> --
> Diana
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 10:59                                         ` George Dunlap
@ 2013-05-31 11:41                                           ` George Dunlap
  2013-05-31 21:30                                           ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 60+ messages in thread
From: George Dunlap @ 2013-05-31 11:41 UTC (permalink / raw)
  To: Diana Crisan
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, Stefano Stabellini,
	xen-devel, David Vrabel, Alex Bligh, Anthony PERARD

On 31/05/13 11:59, George Dunlap wrote:
> [ 248.067326] CE: xen increased min_delta_ns to 225000 nsec
> [  248.067344] CE: xen increased min_delta_ns to 337500 nsec
> [  248.067361] CE: xen increased min_delta_ns to 506250 nsec
> [  248.067378] CE: xen increased min_delta_ns to 759375 nsec
> [  248.067396] CE: xen increased min_delta_ns to 1139062 nsec
> [  248.067413] CE: xen increased min_delta_ns to 1708593 nsec
> [  248.067428] CE: xen increased min_delta_ns to 2562889 nsec
> [  248.067441] CE: xen increased min_delta_ns to 3844333 nsec
> [  248.067453] CE: xen increased min_delta_ns to 4000000 nsec
> [  248.067466] CE: Reprogramming failure. Giving up
> [  248.068075] ata2.00: configured for MWDMA2
>
> Note the "CE: xen increased min_delta_ns to 150000nsec" at 231 for the 
> previous suspend, and now it's increasing it up to 4 milliseconds 
> before giving up for this suspend.
>
> Konrad, stefano, any idea what's going on here?

So it looks like those messages are coming from 
linux.git/kernel/time/clockevents.c.

clockevents_program_event() calls clockevents_program_min_delta(), 
which calls the Xen clock's set_next_event hook (in 
linux.git/arch/x86/xen/time.c), which issues VCPUOP_set_singleshot_timer 
(which is handled in xen.git/xen/common/domain.c).
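
For reference, the guest-side hook looks roughly like this (paraphrased 
from memory of arch/x86/xen/time.c of that vintage, so names and details 
may differ slightly):

static int xen_vcpuop_set_next_event(unsigned long delta,
                                     struct clock_event_device *evt)
{
    int cpu = smp_processor_id();
    struct vcpu_set_singleshot_timer single;
    int ret;

    /* Fresh clock read on every call: pvclock "now" plus the delta. */
    single.timeout_abs_ns = get_abs_timeout(delta);
    /* Ask Xen to fail rather than fire if the deadline is already past. */
    single.flags = VCPU_SSHOTTMR_future;

    ret = HYPERVISOR_vcpu_op(VCPUOP_set_singleshot_timer, cpu, &single);

    BUG_ON(ret != 0 && ret != -ETIME);

    /* -ETIME is what the clockevents core sees as a reprogramming failure. */
    return ret;
}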

If set_next_event() returns an error, it tries again a couple of times, 
then increases the "min_delta" and tries again; eventually it 
will give up.  So set_next_event() must be returning an error consistently.
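
The escalation and the final "Giving up" come from logic along these lines 
(again a paraphrase, not the exact source; MIN_DELTA_LIMIT works out to 
roughly one jiffy, i.e. the 4ms seen in the log at HZ=250):

static int clockevents_increase_min_delta(struct clock_event_device *dev)
{
    /* Already at the ceiling: stop programming timers altogether. */
    if (dev->min_delta_ns >= MIN_DELTA_LIMIT) {
        printk(KERN_WARNING "CE: Reprogramming failure. Giving up\n");
        dev->next_event.tv64 = KTIME_MAX;
        return -ETIME;
    }

    /* Otherwise grow min_delta_ns by 50% and let the caller retry. */
    if (dev->min_delta_ns < 5000)
        dev->min_delta_ns = 5000;
    else
        dev->min_delta_ns += dev->min_delta_ns >> 1;

    if (dev->min_delta_ns > MIN_DELTA_LIMIT)
        dev->min_delta_ns = MIN_DELTA_LIMIT;

    printk(KERN_WARNING "CE: %s increased min_delta_ns to %llu nsec\n",
           dev->name ? dev->name : "?",
           (unsigned long long) dev->min_delta_ns);
    return 0;
}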

The only time VCPUOP_set_singleshot_timer should return an error is 
if the requested time is in the past *and* the VCPU_SSHOTTMR_future 
flag is set (which it apparently is).
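
On the Xen side, the check is essentially the following (a sketch: the real 
code sits in a switch inside do_vcpu_op() in xen/common/domain.c, and the 
wrapper function name here is just for illustration):

/* Illustrative wrapper around the VCPUOP_set_singleshot_timer handling. */
static long set_singleshot_timer(struct vcpu *v,
                                 struct vcpu_set_singleshot_timer *set)
{
    /* The only failure mode: the guest asked for "future only" and the
     * requested absolute time has already passed. */
    if ( (set->flags & VCPU_SSHOTTMR_future) &&
         (set->timeout_abs_ns < NOW()) )
        return -ETIME;

    set_timer(&v->singleshot_timer, set->timeout_abs_ns);
    return 0;
}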

So it would appear that the VM is down over the period that some event 
wants to happen; and Linux does not contemplate the idea that we may 
have been unable to hit an event within 4ms.

Overall it looks like something we should fix in Linux.  Completely 
giving up on all timers seems much too extreme.  At worst it should just 
drop timers.  Probably what it should do is, on each iteration, check 
whether any events are currently in the past and just fire them 
immediately, taking them off the queue.

  -George

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 11:41                                           ` Diana Crisan
@ 2013-05-31 11:49                                             ` George Dunlap
  2013-05-31 11:57                                               ` Alex Bligh
  2013-05-31 12:34                                               ` Ian Campbell
  0 siblings, 2 replies; 60+ messages in thread
From: George Dunlap @ 2013-05-31 11:49 UTC (permalink / raw)
  To: Diana Crisan
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD

On 31/05/13 12:41, Diana Crisan wrote:
>
> On 31 May 2013, at 12:36, Diana Crisan <dcrisan@flexiant.com> wrote:
>
>>
>> On 31 May 2013, at 11:54, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>>
>>> On 31/05/13 09:34, Diana Crisan wrote:
>>>> George,
>>>> On 30/05/13 17:06, George Dunlap wrote:
>>>>> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>>>>>> On 30/05/13 16:26, George Dunlap wrote:
>>>>>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>>>>>> George,
>>>>>>>>>>
>>>>>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>>>>>>>> <George.Dunlap@eu.citrix.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>>>>>> (a total of 2).
>>>>>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>>>>>> I hadn't even realised that was possible. That would have made testing
>>>>>>>>>> live
>>>>>>>>>> migrate easier!
>>>>>>>>> That's basically the whole reason it is supported ;-)
>>>>>>>>>
>>>>>>>>>> How do you avoid the name clash in xen-store?
>>>>>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>>>>>> FOO-incoming or some such and then rename to FOO upon completion. Some
>>>>>>>>> also rename the outgoing domain "FOO-migratedaway" towards the end so
>>>>>>>>> that the bits of the final teardown which can safely happen after the
>>>>>>>>> target have start can be done so.
>>>>>>>>>
>>>>>>>>> Ian.
>>>>>>>> I am unsure what I am doing wrong, but I cannot seem to be able to do a
>>>>>>>> localhost migrate.
>>>>>>>>
>>>>>>>> I created a domU using "xl create xl.conf" and once it fully booted I
>>>>>>>> issued
>>>>>>>> an "xl migrate 11 localhost". This fails and gives the output below.
>>>>>>>>
>>>>>>>> Would you please advise on how to get this working?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Diana
>>>>>>>>
>>>>>>>>
>>>>>>>> root@ubuntu:~# xl migrate 11 localhost
>>>>>>>> root@localhost's password:
>>>>>>>> migration target: Ready to receive domain.
>>>>>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>>>>>> Loading new save file <incoming migration stream> (new xl fmt info
>>>>>>>> 0x0/0x0/2344)
>>>>>>>> Savefile contains xl domain config
>>>>>>>> xc: progress: Reloading memory pages: 53248/1048575    5%
>>>>>>>> xc: progress: Reloading memory pages: 105472/1048575   10%
>>>>>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>>>>>>> device
>>>>>>>> model: spawn failed (rc=-3)
>>>>>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>>>>>>> model
>>>>>>>> did not start: -3
>>>>>>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model
>>>>>>>> already exited
>>>>>>>> migration target: Domain creation failed (code -3).
>>>>>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>>>>>>> truncated
>>>>>>>> reading ready message from migration receiver stream
>>>>>>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration
>>>>>>>> target process [10934] exited with error status 3
>>>>>>>> Migration failed, resuming at sender.
>>>>>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>>>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>>>>>>> failed for
>>>>>>>> domain 11: Success
>>>>>>> Aha -- I managed to reproduce this one as well.
>>>>>>>
>>>>>>> Your problem is the "vncunused=0" -- that's instructing qemu "You must
>>>>>>> use this exact port for the vnc server".  But when you do the migrate,
>>>>>>> that port is still in use by the "from" domain; so the qemu for the
>>>>>>> "to" domain can't get it, and fails.
>>>>>>>
>>>>>>> Obviously this should fail a lot more gracefully, but that's a bit of
>>>>>>> a lower-priority bug I think.
>>>>>>>
>>>>>>> -George
>>>>>> Yes, I managed to get to the bottom of it too and got vms migrating on
>>>>>> localhost on our end.
>>>>>>
>>>>>> I can confirm I did get the clock stuck problem while doing a localhost
>>>>>> migrate.
>>>>> Does the script I posted earlier "work" for you (i.e., does it fail after some number of migrations)?
>>>> I left your script running throughout the night and it seems that it does not always catch the problem. I see the following:
>>>>
>>>> 1. vm has the clock stuck
>>>> 2. script is still running as it seems the vm is still ping-able.
>>>> 3. migration fails on the basis that the vm is does not ack the suspend request (see below).
>>> So I wrote a script to run "date", sleep for 2 seconds, and run "date" a second time -- and eventually the *sleep* hung.
>>>
>>> The VM is still responsive, and I can log in; if I type "date" manually successive times then I get an advancing clock, but if I type "sleep 1" it just hangs.
>>>
>>> If you run "dmesg" in the guest, do you see the following line?
>>>
>>> CE: Reprogramming failure. Giving up
>> I do. It is preceded by:
>> CE: xen increased min_delta_ns to 4000000 nsec
>>
> It seems that it is always getting stuck when the min_delta_ns is set to 4mil nsec. Could this be it? Overflow perhaps?

No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is 
saying, "No".  So Linux is saying, "OK, how about 5us?  10us? 20us?"  By 
the time it reaches 4ms, Linux has had enough, and says, "If this timer 
is so bad that it can't give me an event within 4ms it just won't use 
timers at all, thank you very much."

The problem appears to be that Linux thinks it's asking for something in 
the future, but is actually asking for something in the past.  It must 
look at its watch just before the final domain pause, and then ask for 
the timer just after the migration resumes on the other side.  So it 
doesn't realize that 10ms (or something) has already passed, and that 
it's actually asking for a timer in the past.  The Xen timer driver in 
Linux specifically asks Xen to return an error for times set in the 
past.  Xen is returning an error because the time is in the past; Linux 
thinks it's getting an error because the time is too close in the future, 
and tries asking a little further away.

Unfortunately I think this is something which needs to be fixed on the 
Linux side; I don't really see how we can work around it in Xen.

  -George

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 11:49                                             ` George Dunlap
@ 2013-05-31 11:57                                               ` Alex Bligh
  2013-05-31 12:40                                                 ` Ian Campbell
  2013-05-31 12:34                                               ` Ian Campbell
  1 sibling, 1 reply; 60+ messages in thread
From: Alex Bligh @ 2013-05-31 11:57 UTC (permalink / raw)
  To: George Dunlap, Diana Crisan
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD



--On 31 May 2013 12:49:18 +0100 George Dunlap <george.dunlap@eu.citrix.com> 
wrote:

> No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is
> saying, "No".  So Linux is saying, "OK, how about 5us?  10us? 20us?"  By
> the time it reaches 4ms, Linux has had enough, and says, "If this timer
> is so bad that it can't give me an event within 4ms it just won't use
> timers at all, thank you very much."
>
> The problem appears to be that Linux thinks it's asking for something in
> the future, but is actually asking for something in the past.  It must
> look at its watch just before the final domain pause, and then asks for
> the time just after the migration resumes on the other side.  So it
> doesn't realize that 10ms (or something) has already passed, and that
> it's actually asking for a timer in the past.  The Xen timer driver in
> Linux specifically asks Xen for times set in the past to return an error.
> Xen is returning an error because the time is in the past, Linux thinks
> it's getting an error because the time is too close in the future and
> tries asking a little further away.
>
> Unfortunately I think this is something which needs to be fixed on the
> Linux side; I don't really see how we can work around it in Xen.

I don't think fixing it only on the Linux side is a great idea, not least 
because it means any current Linux image cannot be reliably live migrated. 
That's pretty horrible.

What would happen if Xen simply lied, then delivered the event a bit late? 
(obviously post migrate only). Obviously this would mean events wouldn't be 
delivered at exactly the right time, but in most circumstances that would 
be better than the guest wallclock simply dying.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 11:49                                             ` George Dunlap
  2013-05-31 11:57                                               ` Alex Bligh
@ 2013-05-31 12:34                                               ` Ian Campbell
  1 sibling, 0 replies; 60+ messages in thread
From: Ian Campbell @ 2013-05-31 12:34 UTC (permalink / raw)
  To: George Dunlap
  Cc: Konrad Rzeszutek Wilk, xen-devel, David Vrabel, Alex Bligh,
	Anthony PERARD, Diana Crisan

On Fri, 2013-05-31 at 12:49 +0100, George Dunlap wrote:
> The problem appears to be that Linux thinks it's asking for something in 
> the future, but is actually asking for something in the past.  It must 
> look at its watch just before the final domain pause, and then asks for 
> the time just after the migration resumes on the other side.  So it 
> doesn't realize that 10ms (or something) has already passed, and that 
> it's actually asking for a timer in the past.

I suppose the root cause of this is that the type of migration being
used is completely transparent to the guest? Unlike, say, a PV aware
migration or an S3 suspend where the guest is aware and will quiesce
things such that it won't look at the time before and use it after.

TBH I don't know how you either determine or control which type of
migration is done...

>   The Xen timer driver in 
> Linux specifically asks Xen for times set in the past to return an 
> error.  Xen is returning an error because the time is in the past, Linux 
> thinks it's getting an error because the time is too close in the future 
> and tries asking a little further away.
> 
> Unfortunately I think this is something which needs to be fixed on the 
> Linux side; I don't really see how we can work around it in Xen.
> 
>   -George

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 11:57                                               ` Alex Bligh
@ 2013-05-31 12:40                                                 ` Ian Campbell
  2013-05-31 13:07                                                   ` George Dunlap
  2013-05-31 13:16                                                   ` Alex Bligh
  0 siblings, 2 replies; 60+ messages in thread
From: Ian Campbell @ 2013-05-31 12:40 UTC (permalink / raw)
  To: Alex Bligh
  Cc: Konrad Rzeszutek Wilk, George Dunlap, xen-devel, David Vrabel,
	Anthony PERARD, Diana Crisan

On Fri, 2013-05-31 at 12:57 +0100, Alex Bligh wrote:
> 
> --On 31 May 2013 12:49:18 +0100 George Dunlap <george.dunlap@eu.citrix.com> 
> wrote:
> 
> > No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is
> > saying, "No".  So Linux is saying, "OK, how about 5us?  10us? 20us?"  By
> > the time it reaches 4ms, Linux has had enough, and says, "If this timer
> > is so bad that it can't give me an event within 4ms it just won't use
> > timers at all, thank you very much."
> >
> > The problem appears to be that Linux thinks it's asking for something in
> > the future, but is actually asking for something in the past.  It must
> > look at its watch just before the final domain pause, and then asks for
> > the time just after the migration resumes on the other side.  So it
> > doesn't realize that 10ms (or something) has already passed, and that
> > it's actually asking for a timer in the past.  The Xen timer driver in
> > Linux specifically asks Xen for times set in the past to return an error.
> > Xen is returning an error because the time is in the past, Linux thinks
> > it's getting an error because the time is too close in the future and
> > tries asking a little further away.
> >
> > Unfortunately I think this is something which needs to be fixed on the
> > Linux side; I don't really see how we can work around it in Xen.
> 
> I don't think fixing it only on the Linux side is a great idea, not least 
> as it makes any current Linux image not live migrateable reliably. That's 
> pretty horrible.

Ultimately though a guest bug is a guest bug, we don't really want to be
filling the hypervisor with lots of quirky exceptions to interfaces in
order to work around them, otherwise where does it end?

A kernel side fix can be pushed to the distros fairly aggressively (it's
mostly just a case of getting an upstream stable backport then filing
bugs with the main ones, we've done it before) and for users upgrading
the kernel via the distros is really not so hard and mostly reuses the
process they must have in place for guest kernel security updates and
other important kernel bugs anyway.

Ian.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 12:40                                                 ` Ian Campbell
@ 2013-05-31 13:07                                                   ` George Dunlap
  2013-05-31 15:10                                                     ` Roger Pau Monné
  2013-05-31 13:16                                                   ` Alex Bligh
  1 sibling, 1 reply; 60+ messages in thread
From: George Dunlap @ 2013-05-31 13:07 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Konrad Rzeszutek Wilk, xen-devel, David Vrabel, Alex Bligh,
	Anthony PERARD, Diana Crisan

On 31/05/13 13:40, Ian Campbell wrote:
> On Fri, 2013-05-31 at 12:57 +0100, Alex Bligh wrote:
>> --On 31 May 2013 12:49:18 +0100 George Dunlap <george.dunlap@eu.citrix.com>
>> wrote:
>>
>>> No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is
>>> saying, "No".  So Linux is saying, "OK, how about 5us?  10us? 20us?"  By
>>> the time it reaches 4ms, Linux has had enough, and says, "If this timer
>>> is so bad that it can't give me an event within 4ms it just won't use
>>> timers at all, thank you very much."
>>>
>>> The problem appears to be that Linux thinks it's asking for something in
>>> the future, but is actually asking for something in the past.  It must
>>> look at its watch just before the final domain pause, and then asks for
>>> the time just after the migration resumes on the other side.  So it
>>> doesn't realize that 10ms (or something) has already passed, and that
>>> it's actually asking for a timer in the past.  The Xen timer driver in
>>> Linux specifically asks Xen for times set in the past to return an error.
>>> Xen is returning an error because the time is in the past, Linux thinks
>>> it's getting an error because the time is too close in the future and
>>> tries asking a little further away.
>>>
>>> Unfortunately I think this is something which needs to be fixed on the
>>> Linux side; I don't really see how we can work around it in Xen.
>> I don't think fixing it only on the Linux side is a great idea, not least
>> as it makes any current Linux image not live migrateable reliably. That's
>> pretty horrible.
> Ultimately though a guest bug is a guest bug, we don't really want to be
> filling the hypervisor with lots of quirky exceptions to interfaces in
> order to work around them, otherwise where does it end?
>
> A kernel side fix can be pushed to the distros fairly aggressively (it's
> mostly just a case of getting an upstream stable backport then filing
> bugs with the main ones, we've done it before) and for users upgrading
> the kernel via the distros is really not so hard and mostly reuses the
> process they must have in place for guest kernel security updates and
> other important kernel bugs anyway.

In any case, it seems I was wrong -- Linux does "look at its watch" 
every time it asks.

The generic timer interface is "set me a timer N nanoseconds in the 
future"; the Xen timer implementation executes 
pvclock_clocksource_read() and adds the delta.  So it may well actually 
be a bug in Xen.
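
In other words, the deadline handed to Xen is computed from a fresh clock 
read on every call, roughly like this (helper name from memory; 
xen_clocksource_read() is a wrapper around pvclock_clocksource_read()):

/* Absolute deadline = pvclock "now" + requested delta. */
static s64 get_abs_timeout(unsigned long delta)
{
    return xen_clocksource_read() + delta;
}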

Stand by for further investigation...

  -George

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 12:40                                                 ` Ian Campbell
  2013-05-31 13:07                                                   ` George Dunlap
@ 2013-05-31 13:16                                                   ` Alex Bligh
  2013-05-31 14:36                                                     ` Ian Campbell
  1 sibling, 1 reply; 60+ messages in thread
From: Alex Bligh @ 2013-05-31 13:16 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Anthony, Konrad Rzeszutek Wilk, George Dunlap, xen-devel,
	David Vrabel, Alex Bligh, PERARD, Diana Crisan

Ian,

--On 31 May 2013 13:40:31 +0100 Ian Campbell <Ian.Campbell@citrix.com> 
wrote:

> Ultimately though a guest bug is a guest bug, we don't really want to be
> filling the hypervisor with lots of quirky exceptions to interfaces in
> order to work around them, otherwise where does it end?

I'm presuming you're saying this is solely a problem with the Xen timer
driver under Linux, rather than a general timer bug under Linux. If
it were a general Linux guest bug, wouldn't it be appearing in
circumstances other than Xen live-migrate? E.g. on physical hardware or
KVM live migrate?

If that's correct, and I've understood what George said, then
I /think/ the only quirky fix that needs doing is to change
the API between kernel driver and Xen so that 'don't give me a time
in the past' means 'don't give me a time in the past unless you've
just done a live migrate'. If you really want giving a time in the
past to error under some circumstances, you can signal that another
way ('really don't give me a time in the past').

I.e. make VCPU_SSHOTTMR_future mean 'error if the time is in the past,
unless we have just done a migrate', and VCPU_SSHOTTMR_reallyfuture mean
'error if the time is in the past under any circumstances'.

A less messy change might just be a VM config value to ignore
VCPU_SSHOTTMR_future.

> A kernel side fix can be pushed to the distros fairly aggressively (it's
> mostly just a case of getting an upstream stable backport then filing
> bugs with the main ones, we've done it before) and for users upgrading
> the kernel via the distros is really not so hard and mostly reuses the
> process they must have in place for guest kernel security updates and
> other important kernel bugs anyway.

The issue with this from a practical point of view is that in a service
provider environment you need to support guests of all shapes and sizes,
and the migrate is often done by the SP with no notice to the guests.
Guests that partly hang as a result do not make people popular with their
customers. Yes, it would be lovely if everyone always applied the latest
patches to their kernel and rebooted, but they don't.

Otherwise the net result will be that Xen 4.3 does not reliably live
migrate a pile of Linux OSes unless they are running a patched kernel.
That is not a great conclusion.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 13:16                                                   ` Alex Bligh
@ 2013-05-31 14:36                                                     ` Ian Campbell
  2013-05-31 15:18                                                       ` Alex Bligh
  0 siblings, 1 reply; 60+ messages in thread
From: Ian Campbell @ 2013-05-31 14:36 UTC (permalink / raw)
  To: Alex Bligh
  Cc: Anthony, Konrad Rzeszutek Wilk, George Dunlap, xen-devel,
	David Vrabel, PERARD, Diana Crisan

On Fri, 2013-05-31 at 14:16 +0100, Alex Bligh wrote:
> Ian,
> 
> --On 31 May 2013 13:40:31 +0100 Ian Campbell <Ian.Campbell@citrix.com> 
> wrote:
> 
> > Ultimately though a guest bug is a guest bug, we don't really want to be
> > filling the hypervisor with lots of quirky exceptions to interfaces in
> > order to work around them, otherwise where does it end?
> 
> I'm presuming you're saying this solely a problem with the Xen timer
> driver under Linux, rather than a general timer bug under Linux. If
> it were a general Linux guest bug, wouldn't it be appearing in
> other circumstances than Xen live-migrate? EG on physical hardware

There's no such thing as a "migration" on physical hardware, and a
save/restore etc. is under kernel control, so the kernel knows not to
cache timer values and so on.

> or kvm live migrate?

I'm not sure how this works. *If* this is a kernel bug then I don't know
why they wouldn't also be affected.

Note that my comments were predicated on this being a guest kernel bug.
Obviously if this turns out to be a Xen bug then we should fix Xen.

> If that's correct, and I've understood what George said, then
> I /think/ the only quirky fix that needs doing is this is to change
> the API between kernel driver and xen so that 'don't give me a time
> in the past' means 'don't give me a time in the past unless you've
> just done a live migrate'.

What does "just" mean here? How do you determine it?

I said "filling the hypervisor with lots of quirky exceptions", this is
just one and in isolation maybe it isn't too bad. Now imagine we'd
accumulated a dozen over the last 10 years, the semantics of our timer
operation would be impossible to understand, do this unless A, otherwise
if not B do something else, etc etc.

>  If you really want giving a time in the
> past to error under some circumstances, you can signal that another
> way ('really don't give me a time in the past).

That would be changing the behaviour of an existing ABI AFAICT, which is
right out -- what if some other guest is relying on the current
behaviour?

But in any case until George (or someone else) has actually diagnosed
what is going on this entire discussion is premature.

>  Yes, it would be lovely if everyone always applied the latest
> patches to their kernel and rebooted, but they don't.
> 
> Otherwise the net result will be Xen4.3 does not reliably live migrate
> a pile of Linux OS's unless running with a patched kernel. That is not
> a great conclusion.

Are you saying this didn't happen with Xen 4.2 and earlier? That would
tend to lean towards this being a Xen bug.

Ian.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 13:07                                                   ` George Dunlap
@ 2013-05-31 15:10                                                     ` Roger Pau Monné
  2013-06-03  8:37                                                       ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2013-05-31 15:10 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD, Diana Crisan

On 31/05/13 15:07, George Dunlap wrote:
> On 31/05/13 13:40, Ian Campbell wrote:
>> On Fri, 2013-05-31 at 12:57 +0100, Alex Bligh wrote:
>>> --On 31 May 2013 12:49:18 +0100 George Dunlap
>>> <george.dunlap@eu.citrix.com>
>>> wrote:
>>>
>>>> No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is
>>>> saying, "No".  So Linux is saying, "OK, how about 5us?  10us?
>>>> 20us?"  By
>>>> the time it reaches 4ms, Linux has had enough, and says, "If this timer
>>>> is so bad that it can't give me an event within 4ms it just won't use
>>>> timers at all, thank you very much."
>>>>
>>>> The problem appears to be that Linux thinks it's asking for
>>>> something in
>>>> the future, but is actually asking for something in the past.  It must
>>>> look at its watch just before the final domain pause, and then asks for
>>>> the time just after the migration resumes on the other side.  So it
>>>> doesn't realize that 10ms (or something) has already passed, and that
>>>> it's actually asking for a timer in the past.  The Xen timer driver in
>>>> Linux specifically asks Xen for times set in the past to return an
>>>> error.
>>>> Xen is returning an error because the time is in the past, Linux thinks
>>>> it's getting an error because the time is too close in the future and
>>>> tries asking a little further away.
>>>>
>>>> Unfortunately I think this is something which needs to be fixed on the
>>>> Linux side; I don't really see how we can work around it in Xen.
>>> I don't think fixing it only on the Linux side is a great idea, not
>>> least
>>> as it makes any current Linux image not live migrateable reliably.
>>> That's
>>> pretty horrible.
>> Ultimately though a guest bug is a guest bug, we don't really want to be
>> filling the hypervisor with lots of quirky exceptions to interfaces in
>> order to work around them, otherwise where does it end?
>>
>> A kernel side fix can be pushed to the distros fairly aggressively (it's
>> mostly just a case of getting an upstream stable backport then filing
>> bugs with the main ones, we've done it before) and for users upgrading
>> the kernel via the distros is really not so hard and mostly reuses the
>> process they must have in place for guest kernel security updates and
>> other important kernel bugs anyway.
> 
> In any case, it seems I was wrong -- Linux does "look at its watch"
> every time it asks.
> 
> The generic timer interface is "set me a timer N nanoseconds in the
> future"; the Xen timer implementation executes
> pvclock_clocksource_read() and adds the delta.  So it may well actually
> be a bug in Xen.
> 
> Stand by for further investigation...

I've also seen this on FreeBSD PVHVM when doing live migration, which
also uses the single-shot timer. It seems like the values in
vcpu_info->time are not updated as often as they should be after the
migration. I've implemented a back-off mechanism to cope with that, but
this clearly looks like a bug in Xen.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 14:36                                                     ` Ian Campbell
@ 2013-05-31 15:18                                                       ` Alex Bligh
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Bligh @ 2013-05-31 15:18 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Anthony, Konrad Rzeszutek Wilk, George Dunlap, xen-devel,
	David Vrabel, Alex Bligh, PERARD, Diana Crisan

Ian,

> There's no such thing as a "migration" on physical hardware and a
> save/restore etc is under kernel control so it knows not to cache timer
> values etc.

Indeed, so it's the live migrate which is causing it!

>> If that's correct, and I've understood what George said, then
>> I /think/ the only quirky fix that needs doing is this is to change
>> the API between kernel driver and xen so that 'don't give me a time
>> in the past' means 'don't give me a time in the past unless you've
>> just done a live migrate'.
>
> What does "just" mean here? How do you determine it?

I'd suggest whatever time interval is required to resync. If you said
1 second, for instance, that would be a bodge, but would presumably
work unless the clocks were out by more than a second.

> I said "filling the hypervisor with lots of quirky exceptions", this is
> just one and in isolation maybe it isn't too bad. Now imagine we'd
> accumulated a dozen over the last 10 years, the semantics of our timer
> operation would be impossible to understand, do this unless A, otherwise
> if not B do something else, etc etc.
>
>>  If you really want giving a time in the
>> past to error under some circumstances, you can signal that another
>> way ('really don't give me a time in the past).
>
> That would be changing the behaviour of an existing ABI AFAICT, which is
> right out -- what if some other guest is relying on the current
> behaviour?

Well Linux is sort of relying on it - so we might fix those guests too :-)

I suppose the result would be that if anyone relied on the failure of
the timer event in the one second following migration, then sometimes
that failure would not happen.

> But in any case until George (or someone else) has actually diagnosed
> what is going on this entire discussion is premature.
>
>>  Yes, it would be lovely if everyone always applied the latest
>> patches to their kernel and rebooted, but they don't.
>>
>> Otherwise the net result will be Xen4.3 does not reliably live migrate
>> a pile of Linux OS's unless running with a patched kernel. That is not
>> a great conclusion.
>
> Are you saying this didn't happen with Xen 4.2 and earlier? That would
> tend to lean towards this being a Xen bug.

It happens in 4.2.

We did not discover it in 4.1, but have not retested so comprehensively.
And in 4.1 we were using a different device model (if that's relevant).

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 10:59                                         ` George Dunlap
  2013-05-31 11:41                                           ` George Dunlap
@ 2013-05-31 21:30                                           ` Konrad Rzeszutek Wilk
  2013-05-31 22:51                                             ` Alex Bligh
  2013-06-03  9:43                                             ` George Dunlap
  1 sibling, 2 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-05-31 21:30 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Stefano Stabellini, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD, Diana Crisan

On Fri, May 31, 2013 at 11:59:22AM +0100, George Dunlap wrote:
> On 31/05/13 11:54, George Dunlap wrote:
> >On 31/05/13 09:34, Diana Crisan wrote:
> >>George,
> >>On 30/05/13 17:06, George Dunlap wrote:
> >>>On 05/30/2013 04:55 PM, Diana Crisan wrote:
> >>>>On 30/05/13 16:26, George Dunlap wrote:
> >>>>>On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
> >>>>>wrote:
> >>>>>>Hi,
> >>>>>>
> >>>>>>
> >>>>>>On 26/05/13 09:38, Ian Campbell wrote:
> >>>>>>>On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
> >>>>>>>>George,
> >>>>>>>>
> >>>>>>>>--On 24 May 2013 17:16:07 +0100 George Dunlap
> >>>>>>>><George.Dunlap@eu.citrix.com>
> >>>>>>>>wrote:
> >>>>>>>>
> >>>>>>>>>>FWIW it's reproducible on every host h/w platform we've tried
> >>>>>>>>>>(a total of 2).
> >>>>>>>>>Do you see the same effects if you do a local-host migrate?
> >>>>>>>>I hadn't even realised that was possible. That would
> >>>>>>>>have made testing
> >>>>>>>>live
> >>>>>>>>migrate easier!
> >>>>>>>That's basically the whole reason it is supported ;-)
> >>>>>>>
> >>>>>>>>How do you avoid the name clash in xen-store?
> >>>>>>>Most toolstacks receive the incoming migration into a domain named
> >>>>>>>FOO-incoming or some such and then rename to FOO upon
> >>>>>>>completion. Some
> >>>>>>>also rename the outgoing domain "FOO-migratedaway"
> >>>>>>>towards the end so
> >>>>>>>that the bits of the final teardown which can safely
> >>>>>>>happen after the
> >>>>>>>target have start can be done so.
> >>>>>>>
> >>>>>>>Ian.
> >>>>>>>
> >>>>>>>
> >>>>>>I am unsure what I am doing wrong, but I cannot seem to
> >>>>>>be able to do a
> >>>>>>localhost migrate.
> >>>>>>
> >>>>>>I created a domU using "xl create xl.conf" and once it
> >>>>>>fully booted I
> >>>>>>issued
> >>>>>>an "xl migrate 11 localhost". This fails and gives the output below.
> >>>>>>
> >>>>>>Would you please advise on how to get this working?
> >>>>>>
> >>>>>>Thanks,
> >>>>>>Diana
> >>>>>>
> >>>>>>
> >>>>>>root@ubuntu:~# xl migrate 11 localhost
> >>>>>>root@localhost's password:
> >>>>>>migration target: Ready to receive domain.
> >>>>>>Saving to migration stream new xl format (info 0x0/0x0/2344)
> >>>>>>Loading new save file <incoming migration stream> (new xl fmt info
> >>>>>>0x0/0x0/2344)
> >>>>>>  Savefile contains xl domain config
> >>>>>>xc: progress: Reloading memory pages: 53248/1048575 5%
> >>>>>>xc: progress: Reloading memory pages: 105472/1048575 10%
> >>>>>>libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
> >>>>>>device
> >>>>>>model: spawn failed (rc=-3)
> >>>>>>libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
> >>>>>>model
> >>>>>>did not start: -3
> >>>>>>libxl: error:
> >>>>>>libxl_dm.c:1311:libxl__destroy_device_model: Device
> >>>>>>Model
> >>>>>>already exited
> >>>>>>migration target: Domain creation failed (code -3).
> >>>>>>libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
> >>>>>>truncated
> >>>>>>reading ready message from migration receiver stream
> >>>>>>libxl: info:
> >>>>>>libxl_exec.c:118:libxl_report_child_exitstatus:
> >>>>>>migration
> >>>>>>target process [10934] exited with error status 3
> >>>>>>Migration failed, resuming at sender.
> >>>>>>xc: error: Cannot resume uncooperative HVM guests: Internal error
> >>>>>>libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
> >>>>>>failed for
> >>>>>>domain 11: Success
> >>>>>Aha -- I managed to reproduce this one as well.
> >>>>>
> >>>>>Your problem is the "vncunused=0" -- that's instructing
> >>>>>qemu "You must
> >>>>>use this exact port for the vnc server".  But when you do
> >>>>>the migrate,
> >>>>>that port is still in use by the "from" domain; so the qemu for the
> >>>>>"to" domain can't get it, and fails.
> >>>>>
> >>>>>Obviously this should fail a lot more gracefully, but that's a bit of
> >>>>>a lower-priority bug I think.
> >>>>>
> >>>>>  -George
> >>>>Yes, I managed to get to the bottom of it too and got vms migrating on
> >>>>localhost on our end.
> >>>>
> >>>>I can confirm I did get the clock stuck problem while doing
> >>>>a localhost
> >>>>migrate.
> >>>
> >>>Does the script I posted earlier "work" for you (i.e., does it
> >>>fail after some number of migrations)?
> >>>
> >>
> >>I left your script running throughout the night and it seems
> >>that it does not always catch the problem. I see the following:
> >>
> >>1. vm has the clock stuck
> >>2. script is still running as it seems the vm is still ping-able.
> >>3. migration fails on the basis that the vm is does not ack the
> >>suspend request (see below).
> >
> >So I wrote a script to run "date", sleep for 2 seconds, and run
> >"date" a second time -- and eventually the *sleep* hung.
> >
> >The VM is still responsive, and I can log in; if I type "date"
> >manually successive times then I get an advancing clock, but if I
> >type "sleep 1" it just hangs.
> >
> >If you run "dmesg" in the guest, do you see the following line?
> >
> >CE: Reprogramming failure. Giving up
> 
> I think this must be it; on my other box, I got the following messages:
> 
> [  224.732083] PM: late freeze of devices complete after 3.787 msecs
> [  224.736062] Xen HVM callback vector for event delivery is enabled
> [  224.736062] Xen Platform PCI: I/O protocol version 1
> [  224.736062] xen: --> irq=8, pirq=16
> [  224.736062] xen: --> irq=12, pirq=17
> [  224.736062] xen: --> irq=1, pirq=18
> [  224.736062] xen: --> irq=6, pirq=19
> [  224.736062] xen: --> irq=4, pirq=20
> [  224.736062] xen: --> irq=7, pirq=21
> [  224.736062] xen: --> irq=28, pirq=22
> [  224.736062] ata_piix 0000:00:01.1: restoring config space at
> offset 0x1 (was 0x2800001, writing 0x2800005)
> [  224.736062] PM: early restore of devices complete after 5.854 msecs
> [  224.739692] ata_piix 0000:00:01.1: setting latency timer to 64
> [  224.739782] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28
> (level, low) -> IRQ 28
> [  224.746900] PM: restore of devices complete after 7.540 msecs
> [  224.758612] Setting capacity to 16777216
> [  224.758749] Setting capacity to 16777216
> [  224.898426] ata2.01: NODEV after polling detection
> [  224.900941] ata2.00: configured for MWDMA2
> [  231.055978] CE: xen increased min_delta_ns to 150000 nsec
> [  231.055986] hrtimer: interrupt took 14460 ns
> [  247.893303] PM: freeze of devices complete after 2.168 msecs
> [  247.893306] suspending xenstore...
> [  247.896977] PM: late freeze of devices complete after 3.666 msecs
> [  247.900067] Xen HVM callback vector for event delivery is enabled
> [  247.900067] Xen Platform PCI: I/O protocol version 1
> [  247.900067] xen: --> irq=8, pirq=16
> [  247.900067] xen: --> irq=12, pirq=17
> [  247.900067] xen: --> irq=1, pirq=18
> [  247.900067] xen: --> irq=6, pirq=19
> [  247.900067] xen: --> irq=4, pirq=20
> [  247.900067] xen: --> irq=7, pirq=21
> [  247.900067] xen: --> irq=28, pirq=22
> [  247.900067] ata_piix 0000:00:01.1: restoring config space at
> offset 0x1 (was 0x2800001, writing 0x2800005)
> [  247.900067] PM: early restore of devices complete after 4.612 msecs
> [  247.906454] ata_piix 0000:00:01.1: setting latency timer to 64
> [  247.906558] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28
> (level, low) -> IRQ 28
> [  247.914770] PM: restore of devices complete after 8.762 msecs
> [  247.926557] Setting capacity to 16777216
> [  247.926661] Setting capacity to 16777216
> [  248.066661] ata2.01: NODEV after polling detection
> [  248.067326] CE: xen increased min_delta_ns to 225000 nsec
> [  248.067344] CE: xen increased min_delta_ns to 337500 nsec
> [  248.067361] CE: xen increased min_delta_ns to 506250 nsec
> [  248.067378] CE: xen increased min_delta_ns to 759375 nsec
> [  248.067396] CE: xen increased min_delta_ns to 1139062 nsec
> [  248.067413] CE: xen increased min_delta_ns to 1708593 nsec
> [  248.067428] CE: xen increased min_delta_ns to 2562889 nsec
> [  248.067441] CE: xen increased min_delta_ns to 3844333 nsec
> [  248.067453] CE: xen increased min_delta_ns to 4000000 nsec
> [  248.067466] CE: Reprogramming failure. Giving up
> [  248.068075] ata2.00: configured for MWDMA2
> 
> Note the "CE: xen increased min_delta_ns to 150000nsec" at 231 for
> the previous suspend, and now it's increasing it up to 4
> milliseconds before giving up for this suspend.
> 
> Konrad, stefano, any idea what's going on here?

VIRQ_TIMER not being delivered. I.e. this commit

bee980d9e9642e96351fa3ca9077b853ecf62f57
xen/events: Handle VIRQ_TIMER before any other hardirq in event loop.

should be back-ported but hasn't been yet. Let me put that
on my TODO list.
> 
>  -George

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 21:30                                           ` Konrad Rzeszutek Wilk
@ 2013-05-31 22:51                                             ` Alex Bligh
  2013-06-03  9:43                                             ` George Dunlap
  1 sibling, 0 replies; 60+ messages in thread
From: Alex Bligh @ 2013-05-31 22:51 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, George Dunlap
  Cc: Ian Campbell, Stefano Stabellini, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD, Diana Crisan

Konrad,

--On 31 May 2013 17:30:41 -0400 Konrad Rzeszutek Wilk 
<konrad.wilk@oracle.com> wrote:

> VIRQ_TIMER not being delievered. Aka this commit
>
> bee980d9e9642e96351fa3ca9077b853ecf62f57
> xen/events: Handle VIRQ_TIMER before any other hardirq in event loop.
>
> should be back-ported but didn't yet. Let me put that
> on my TODO list.

Last time I played seriously with x86 interrupts was with real hardware
circa 23 years ago, so feel free to say this is a stupid idea, but assuming
the interrupts are level triggered, would it be possible for Xen to delay
delivery of any hard irqs until the first timer irq has been handled
after a migration? And/or accelerate delivery of a timer IRQ?

I'm still hoping for a workaround that doesn't involve patching every
guest OS.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 15:10                                                     ` Roger Pau Monné
@ 2013-06-03  8:37                                                       ` Roger Pau Monné
  2013-06-03 10:05                                                         ` Stefano Stabellini
  2013-06-03 10:25                                                         ` George Dunlap
  0 siblings, 2 replies; 60+ messages in thread
From: Roger Pau Monné @ 2013-06-03  8:37 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD, Diana Crisan

On 31/05/13 17:10, Roger Pau Monné wrote:
> On 31/05/13 15:07, George Dunlap wrote:
>> On 31/05/13 13:40, Ian Campbell wrote:
>>> On Fri, 2013-05-31 at 12:57 +0100, Alex Bligh wrote:
>>>> --On 31 May 2013 12:49:18 +0100 George Dunlap
>>>> <george.dunlap@eu.citrix.com>
>>>> wrote:
>>>>
>>>>> No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is
>>>>> saying, "No".  So Linux is saying, "OK, how about 5us?  10us?
>>>>> 20us?"  By
>>>>> the time it reaches 4ms, Linux has had enough, and says, "If this timer
>>>>> is so bad that it can't give me an event within 4ms it just won't use
>>>>> timers at all, thank you very much."
>>>>>
>>>>> The problem appears to be that Linux thinks it's asking for
>>>>> something in
>>>>> the future, but is actually asking for something in the past.  It must
>>>>> look at its watch just before the final domain pause, and then asks for
>>>>> the time just after the migration resumes on the other side.  So it
>>>>> doesn't realize that 10ms (or something) has already passed, and that
>>>>> it's actually asking for a timer in the past.  The Xen timer driver in
>>>>> Linux specifically asks Xen for times set in the past to return an
>>>>> error.
>>>>> Xen is returning an error because the time is in the past, Linux thinks
>>>>> it's getting an error because the time is too close in the future and
>>>>> tries asking a little further away.
>>>>>
>>>>> Unfortunately I think this is something which needs to be fixed on the
>>>>> Linux side; I don't really see how we can work around it in Xen.
>>>> I don't think fixing it only on the Linux side is a great idea, not
>>>> least
>>>> as it makes any current Linux image not live migrateable reliably.
>>>> That's
>>>> pretty horrible.
>>> Ultimately though a guest bug is a guest bug, we don't really want to be
>>> filling the hypervisor with lots of quirky exceptions to interfaces in
>>> order to work around them, otherwise where does it end?
>>>
>>> A kernel side fix can be pushed to the distros fairly aggressively (it's
>>> mostly just a case of getting an upstream stable backport then filing
>>> bugs with the main ones, we've done it before) and for users upgrading
>>> the kernel via the distros is really not so hard and mostly reuses the
>>> process they must have in place for guest kernel security updates and
>>> other important kernel bugs anyway.
>>
>> In any case, it seems I was wrong -- Linux does "look at its watch"
>> every time it asks.
>>
>> The generic timer interface is "set me a timer N nanoseconds in the
>> future"; the Xen timer implementation executes
>> pvclock_clocksource_read() and adds the delta.  So it may well actually
>> be a bug in Xen.
>>
>> Stand by for further investigation...

I've been investigating further over the weekend, and although I'm not
familiar with the timer code in Xen, I think the problem comes from the
fact that in __update_vcpu_system_time, when Xen detects that the guest
is using a vtsc, it adds offsets to the time it passes to the guest,
while in VCPUOP_set_singleshot_timer Xen compares the time passed back
from the guest against NOW(), which is just the Xen uptime, without
taking any offsets into account.

This only happens after migration because Xen automatically switches to
vtsc when it detects that the guest has been migrated. I'm currently
setting up a Linux PVHVM on shared storage to perform some testing, but
one possible solution might be to add tsc_mode="native_paravirt" to the
PVHVM config file, and another one would be fixing
VCPUOP_set_singleshot_timer to take into account the vtsc offsets and
correctly translate the time passed from the guest.
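
(For context, the check in question -- a rough sketch of the
VCPUOP_set_singleshot_timer case in do_vcpu_op(), xen/common/domain.c;
simplified from memory, the exact code differs across Xen versions:)

    case VCPUOP_set_singleshot_timer:
    {
        struct vcpu_set_singleshot_timer set;

        if ( v != current )
            return -EINVAL;

        if ( copy_from_guest(&set, arg, 1) )
            return -EFAULT;

        /*
         * set.timeout_abs_ns was computed by the guest from its
         * (offset) PV clock, but NOW() is plain Xen system time, so
         * after a migration that enabled vtsc the comparison can
         * spuriously report the timeout as being in the past.
         */
        if ( (set.flags & VCPU_SSHOTTMR_future) &&
             (set.timeout_abs_ns < NOW()) )
            return -ETIME;

        set_timer(&v->singleshot_timer, set.timeout_abs_ns);
        break;
    }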

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-05-31 21:30                                           ` Konrad Rzeszutek Wilk
  2013-05-31 22:51                                             ` Alex Bligh
@ 2013-06-03  9:43                                             ` George Dunlap
  1 sibling, 0 replies; 60+ messages in thread
From: George Dunlap @ 2013-06-03  9:43 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Ian Campbell, Stefano Stabellini, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD, Diana Crisan

On 31/05/13 22:30, Konrad Rzeszutek Wilk wrote:
> On Fri, May 31, 2013 at 11:59:22AM +0100, George Dunlap wrote:
>> On 31/05/13 11:54, George Dunlap wrote:
>>> On 31/05/13 09:34, Diana Crisan wrote:
>>>> George,
>>>> On 30/05/13 17:06, George Dunlap wrote:
>>>>> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>>>>>> On 30/05/13 16:26, George Dunlap wrote:
>>>>>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>>>>>> George,
>>>>>>>>>>
>>>>>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>>>>>>>> <George.Dunlap@eu.citrix.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>>>>>> (a total of 2).
>>>>>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>>>>>> I hadn't even realised that was possible. That would
>>>>>>>>>> have made testing
>>>>>>>>>> live
>>>>>>>>>> migrate easier!
>>>>>>>>> That's basically the whole reason it is supported ;-)
>>>>>>>>>
>>>>>>>>>> How do you avoid the name clash in xen-store?
>>>>>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>>>>>> FOO-incoming or some such and then rename to FOO upon
>>>>>>>>> completion. Some
>>>>>>>>> also rename the outgoing domain "FOO-migratedaway"
>>>>>>>>> towards the end so
>>>>>>>>> that the bits of the final teardown which can safely
>>>>>>>>> happen after the
>>>>>>>>> target have start can be done so.
>>>>>>>>>
>>>>>>>>> Ian.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I am unsure what I am doing wrong, but I cannot seem to
>>>>>>>> be able to do a
>>>>>>>> localhost migrate.
>>>>>>>>
>>>>>>>> I created a domU using "xl create xl.conf" and once it
>>>>>>>> fully booted I
>>>>>>>> issued
>>>>>>>> an "xl migrate 11 localhost". This fails and gives the output below.
>>>>>>>>
>>>>>>>> Would you please advise on how to get this working?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Diana
>>>>>>>>
>>>>>>>>
>>>>>>>> root@ubuntu:~# xl migrate 11 localhost
>>>>>>>> root@localhost's password:
>>>>>>>> migration target: Ready to receive domain.
>>>>>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>>>>>> Loading new save file <incoming migration stream> (new xl fmt info
>>>>>>>> 0x0/0x0/2344)
>>>>>>>>   Savefile contains xl domain config
>>>>>>>> xc: progress: Reloading memory pages: 53248/1048575 5%
>>>>>>>> xc: progress: Reloading memory pages: 105472/1048575 10%
>>>>>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>>>>>>> device
>>>>>>>> model: spawn failed (rc=-3)
>>>>>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>>>>>>> model
>>>>>>>> did not start: -3
>>>>>>>> libxl: error:
>>>>>>>> libxl_dm.c:1311:libxl__destroy_device_model: Device
>>>>>>>> Model
>>>>>>>> already exited
>>>>>>>> migration target: Domain creation failed (code -3).
>>>>>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>>>>>>> truncated
>>>>>>>> reading ready message from migration receiver stream
>>>>>>>> libxl: info:
>>>>>>>> libxl_exec.c:118:libxl_report_child_exitstatus:
>>>>>>>> migration
>>>>>>>> target process [10934] exited with error status 3
>>>>>>>> Migration failed, resuming at sender.
>>>>>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>>>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>>>>>>> failed for
>>>>>>>> domain 11: Success
>>>>>>> Aha -- I managed to reproduce this one as well.
>>>>>>>
>>>>>>> Your problem is the "vncunused=0" -- that's instructing
>>>>>>> qemu "You must
>>>>>>> use this exact port for the vnc server".  But when you do
>>>>>>> the migrate,
>>>>>>> that port is still in use by the "from" domain; so the qemu for the
>>>>>>> "to" domain can't get it, and fails.
>>>>>>>
>>>>>>> Obviously this should fail a lot more gracefully, but that's a bit of
>>>>>>> a lower-priority bug I think.
>>>>>>>
>>>>>>>   -George
>>>>>> Yes, I managed to get to the bottom of it too and got vms migrating on
>>>>>> localhost on our end.
>>>>>>
>>>>>> I can confirm I did get the clock stuck problem while doing
>>>>>> a localhost
>>>>>> migrate.
>>>>> Does the script I posted earlier "work" for you (i.e., does it
>>>>> fail after some number of migrations)?
>>>>>
>>>> I left your script running throughout the night and it seems
>>>> that it does not always catch the problem. I see the following:
>>>>
>>>> 1. vm has the clock stuck
>>>> 2. script is still running as it seems the vm is still ping-able.
>>>> 3. migration fails on the basis that the vm is does not ack the
>>>> suspend request (see below).
>>> So I wrote a script to run "date", sleep for 2 seconds, and run
>>> "date" a second time -- and eventually the *sleep* hung.
>>>
>>> The VM is still responsive, and I can log in; if I type "date"
>>> manually successive times then I get an advancing clock, but if I
>>> type "sleep 1" it just hangs.
>>>
>>> If you run "dmesg" in the guest, do you see the following line?
>>>
>>> CE: Reprogramming failure. Giving up
>> I think this must be it; on my other box, I got the following messages:
>>
>> [  224.732083] PM: late freeze of devices complete after 3.787 msecs
>> [  224.736062] Xen HVM callback vector for event delivery is enabled
>> [  224.736062] Xen Platform PCI: I/O protocol version 1
>> [  224.736062] xen: --> irq=8, pirq=16
>> [  224.736062] xen: --> irq=12, pirq=17
>> [  224.736062] xen: --> irq=1, pirq=18
>> [  224.736062] xen: --> irq=6, pirq=19
>> [  224.736062] xen: --> irq=4, pirq=20
>> [  224.736062] xen: --> irq=7, pirq=21
>> [  224.736062] xen: --> irq=28, pirq=22
>> [  224.736062] ata_piix 0000:00:01.1: restoring config space at
>> offset 0x1 (was 0x2800001, writing 0x2800005)
>> [  224.736062] PM: early restore of devices complete after 5.854 msecs
>> [  224.739692] ata_piix 0000:00:01.1: setting latency timer to 64
>> [  224.739782] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28
>> (level, low) -> IRQ 28
>> [  224.746900] PM: restore of devices complete after 7.540 msecs
>> [  224.758612] Setting capacity to 16777216
>> [  224.758749] Setting capacity to 16777216
>> [  224.898426] ata2.01: NODEV after polling detection
>> [  224.900941] ata2.00: configured for MWDMA2
>> [  231.055978] CE: xen increased min_delta_ns to 150000 nsec
>> [  231.055986] hrtimer: interrupt took 14460 ns
>> [  247.893303] PM: freeze of devices complete after 2.168 msecs
>> [  247.893306] suspending xenstore...
>> [  247.896977] PM: late freeze of devices complete after 3.666 msecs
>> [  247.900067] Xen HVM callback vector for event delivery is enabled
>> [  247.900067] Xen Platform PCI: I/O protocol version 1
>> [  247.900067] xen: --> irq=8, pirq=16
>> [  247.900067] xen: --> irq=12, pirq=17
>> [  247.900067] xen: --> irq=1, pirq=18
>> [  247.900067] xen: --> irq=6, pirq=19
>> [  247.900067] xen: --> irq=4, pirq=20
>> [  247.900067] xen: --> irq=7, pirq=21
>> [  247.900067] xen: --> irq=28, pirq=22
>> [  247.900067] ata_piix 0000:00:01.1: restoring config space at
>> offset 0x1 (was 0x2800001, writing 0x2800005)
>> [  247.900067] PM: early restore of devices complete after 4.612 msecs
>> [  247.906454] ata_piix 0000:00:01.1: setting latency timer to 64
>> [  247.906558] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28
>> (level, low) -> IRQ 28
>> [  247.914770] PM: restore of devices complete after 8.762 msecs
>> [  247.926557] Setting capacity to 16777216
>> [  247.926661] Setting capacity to 16777216
>> [  248.066661] ata2.01: NODEV after polling detection
>> [  248.067326] CE: xen increased min_delta_ns to 225000 nsec
>> [  248.067344] CE: xen increased min_delta_ns to 337500 nsec
>> [  248.067361] CE: xen increased min_delta_ns to 506250 nsec
>> [  248.067378] CE: xen increased min_delta_ns to 759375 nsec
>> [  248.067396] CE: xen increased min_delta_ns to 1139062 nsec
>> [  248.067413] CE: xen increased min_delta_ns to 1708593 nsec
>> [  248.067428] CE: xen increased min_delta_ns to 2562889 nsec
>> [  248.067441] CE: xen increased min_delta_ns to 3844333 nsec
>> [  248.067453] CE: xen increased min_delta_ns to 4000000 nsec
>> [  248.067466] CE: Reprogramming failure. Giving up
>> [  248.068075] ata2.00: configured for MWDMA2
>>
>> Note the "CE: xen increased min_delta_ns to 150000nsec" at 231 for
>> the previous suspend, and now it's increasing it up to 4
>> milliseconds before giving up for this suspend.
>>
>> Konrad, stefano, any idea what's going on here?
> VIRQ_TIMER not being delievered. Aka this commit
>
> bee980d9e9642e96351fa3ca9077b853ecf62f57
> xen/events: Handle VIRQ_TIMER before any other hardirq in event loop.
>
> should be back-ported but didn't yet. Let me put that
> on my TODO list.

Konrad,

I don't understand how the VIRQ timer can be the issue.

As far as I can tell, what's happening is this:

1. The kernel asks the Xen clockevent for a timer N ns in the future.

2. The Xen timer code in Linux calculates the current time from the 
shared info page, adds N ns, and asks Xen for an event to trigger at 
that absolute time.

3. Unfortunately, that new time is in the past, and Xen returns an error.

So how is the VIRQ_TIMER not being delivered causing the calculation to 
come up with a time in the past?
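
(For completeness, the "CE: xen increased min_delta_ns ..." /
"Reprogramming failure. Giving up" lines come from the generic
clockevents retry path; a rough sketch of the relevant helper in
kernel/time/clockevents.c, from memory -- constants and names may not
match this exact kernel version:)

static int clockevents_increase_min_delta(struct clock_event_device *dev)
{
        /* Give up once the retry window reaches MIN_DELTA_LIMIT (~4 ms). */
        if (dev->min_delta_ns >= MIN_DELTA_LIMIT) {
                printk(KERN_WARNING "CE: Reprogramming failure. Giving up\n");
                dev->next_event.tv64 = KTIME_MAX;
                return -ETIME;
        }

        /* Otherwise grow min_delta_ns by 50% and let the caller retry
         * with a timeout further in the future. */
        if (dev->min_delta_ns < 5000)
                dev->min_delta_ns = 5000;
        else
                dev->min_delta_ns += dev->min_delta_ns >> 1;

        if (dev->min_delta_ns > MIN_DELTA_LIMIT)
                dev->min_delta_ns = MIN_DELTA_LIMIT;

        printk(KERN_WARNING "CE: %s increased min_delta_ns to %llu nsec\n",
               dev->name ? dev->name : "?",
               (unsigned long long) dev->min_delta_ns);
        return 0;
}

That back-off is what produces the escalating deltas in the dmesg quoted
above; once the 4 ms limit is hit the clockevent is effectively
abandoned and the guest clock appears stuck.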

  -George

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03  8:37                                                       ` Roger Pau Monné
@ 2013-06-03 10:05                                                         ` Stefano Stabellini
  2013-06-03 10:23                                                           ` Roger Pau Monné
  2013-06-03 10:25                                                         ` George Dunlap
  1 sibling, 1 reply; 60+ messages in thread
From: Stefano Stabellini @ 2013-06-03 10:05 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, George Dunlap, xen-devel,
	David Vrabel, Alex Bligh, Anthony PERARD, Diana Crisan

[-- Attachment #1: Type: text/plain, Size: 3994 bytes --]

On Mon, 3 Jun 2013, Roger Pau Monné wrote:
> On 31/05/13 17:10, Roger Pau Monné wrote:
> > On 31/05/13 15:07, George Dunlap wrote:
> >> On 31/05/13 13:40, Ian Campbell wrote:
> >>> On Fri, 2013-05-31 at 12:57 +0100, Alex Bligh wrote:
> >>>> --On 31 May 2013 12:49:18 +0100 George Dunlap
> >>>> <george.dunlap@eu.citrix.com>
> >>>> wrote:
> >>>>
> >>>>> No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is
> >>>>> saying, "No".  So Linux is saying, "OK, how about 5us?  10us?
> >>>>> 20us?"  By
> >>>>> the time it reaches 4ms, Linux has had enough, and says, "If this timer
> >>>>> is so bad that it can't give me an event within 4ms it just won't use
> >>>>> timers at all, thank you very much."
> >>>>>
> >>>>> The problem appears to be that Linux thinks it's asking for
> >>>>> something in
> >>>>> the future, but is actually asking for something in the past.  It must
> >>>>> look at its watch just before the final domain pause, and then asks for
> >>>>> the time just after the migration resumes on the other side.  So it
> >>>>> doesn't realize that 10ms (or something) has already passed, and that
> >>>>> it's actually asking for a timer in the past.  The Xen timer driver in
> >>>>> Linux specifically asks Xen for times set in the past to return an
> >>>>> error.
> >>>>> Xen is returning an error because the time is in the past, Linux thinks
> >>>>> it's getting an error because the time is too close in the future and
> >>>>> tries asking a little further away.
> >>>>>
> >>>>> Unfortunately I think this is something which needs to be fixed on the
> >>>>> Linux side; I don't really see how we can work around it in Xen.
> >>>> I don't think fixing it only on the Linux side is a great idea, not
> >>>> least
> >>>> as it makes any current Linux image not live migrateable reliably.
> >>>> That's
> >>>> pretty horrible.
> >>> Ultimately though a guest bug is a guest bug, we don't really want to be
> >>> filling the hypervisor with lots of quirky exceptions to interfaces in
> >>> order to work around them, otherwise where does it end?
> >>>
> >>> A kernel side fix can be pushed to the distros fairly aggressively (it's
> >>> mostly just a case of getting an upstream stable backport then filing
> >>> bugs with the main ones, we've done it before) and for users upgrading
> >>> the kernel via the distros is really not so hard and mostly reuses the
> >>> process they must have in place for guest kernel security updates and
> >>> other important kernel bugs anyway.
> >>
> >> In any case, it seems I was wrong -- Linux does "look at its watch"
> >> every time it asks.
> >>
> >> The generic timer interface is "set me a timer N nanoseconds in the
> >> future"; the Xen timer implementation executes
> >> pvclock_clocksource_read() and adds the delta.  So it may well actually
> >> be a bug in Xen.
> >>
> >> Stand by for further investigation...
> 
> I've been investigating further during the weekend, and although I'm not
> familiar with the timer code in Xen, I think the problem comes from the
> fact that in __update_vcpu_system_time when Xen detects that the guest
> is using a vtsc it adds offsets to the time passed to the guest, while
> in VCPUOP_set_singleshot_timer Xen compares the time passed from the
> guest using NOW(), which is just the Xen uptime, without taking into
> account any offsets.
> 
> This only happens after migration because Xen automatically switches to
> vtsc when it detects that the guest has been migrated. I'm currently
> setting up a Linux PVHVM on shared storage to perform some testing, but
> one possible solution might be to add tsc_mode="native_paravirt" to the
> PVHVM config file, and another one would be fixing
> VCPUOP_set_singleshot_timer to take into account the vtsc offsets and
> correctly translate the time passed from the guest.

Good analysis!
I think that the right solution would be to fix
VCPUOP_set_singleshot_timer.

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03 10:05                                                         ` Stefano Stabellini
@ 2013-06-03 10:23                                                           ` Roger Pau Monné
  2013-06-03 10:30                                                             ` Stefano Stabellini
  2013-06-03 11:16                                                             ` George Dunlap
  0 siblings, 2 replies; 60+ messages in thread
From: Roger Pau Monné @ 2013-06-03 10:23 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, George Dunlap, xen-devel,
	David Vrabel, Alex Bligh, Anthony PERARD, Diana Crisan

On 03/06/13 12:05, Stefano Stabellini wrote:
> On Mon, 3 Jun 2013, Roger Pau Monné wrote:
>> On 31/05/13 17:10, Roger Pau Monné wrote:
>>> On 31/05/13 15:07, George Dunlap wrote:
>>>> On 31/05/13 13:40, Ian Campbell wrote:
>>>>> On Fri, 2013-05-31 at 12:57 +0100, Alex Bligh wrote:
>>>>>> --On 31 May 2013 12:49:18 +0100 George Dunlap
>>>>>> <george.dunlap@eu.citrix.com>
>>>>>> wrote:
>>>>>>
>>>>>>> No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is
>>>>>>> saying, "No".  So Linux is saying, "OK, how about 5us?  10us?
>>>>>>> 20us?"  By
>>>>>>> the time it reaches 4ms, Linux has had enough, and says, "If this timer
>>>>>>> is so bad that it can't give me an event within 4ms it just won't use
>>>>>>> timers at all, thank you very much."
>>>>>>>
>>>>>>> The problem appears to be that Linux thinks it's asking for
>>>>>>> something in
>>>>>>> the future, but is actually asking for something in the past.  It must
>>>>>>> look at its watch just before the final domain pause, and then asks for
>>>>>>> the time just after the migration resumes on the other side.  So it
>>>>>>> doesn't realize that 10ms (or something) has already passed, and that
>>>>>>> it's actually asking for a timer in the past.  The Xen timer driver in
>>>>>>> Linux specifically asks Xen for times set in the past to return an
>>>>>>> error.
>>>>>>> Xen is returning an error because the time is in the past, Linux thinks
>>>>>>> it's getting an error because the time is too close in the future and
>>>>>>> tries asking a little further away.
>>>>>>>
>>>>>>> Unfortunately I think this is something which needs to be fixed on the
>>>>>>> Linux side; I don't really see how we can work around it in Xen.
>>>>>> I don't think fixing it only on the Linux side is a great idea, not
>>>>>> least
>>>>>> as it makes any current Linux image not live migrateable reliably.
>>>>>> That's
>>>>>> pretty horrible.
>>>>> Ultimately though a guest bug is a guest bug, we don't really want to be
>>>>> filling the hypervisor with lots of quirky exceptions to interfaces in
>>>>> order to work around them, otherwise where does it end?
>>>>>
>>>>> A kernel side fix can be pushed to the distros fairly aggressively (it's
>>>>> mostly just a case of getting an upstream stable backport then filing
>>>>> bugs with the main ones, we've done it before) and for users upgrading
>>>>> the kernel via the distros is really not so hard and mostly reuses the
>>>>> process they must have in place for guest kernel security updates and
>>>>> other important kernel bugs anyway.
>>>>
>>>> In any case, it seems I was wrong -- Linux does "look at its watch"
>>>> every time it asks.
>>>>
>>>> The generic timer interface is "set me a timer N nanoseconds in the
>>>> future"; the Xen timer implementation executes
>>>> pvclock_clocksource_read() and adds the delta.  So it may well actually
>>>> be a bug in Xen.
>>>>
>>>> Stand by for further investigation...
>>
>> I've been investigating further during the weekend, and although I'm not
>> familiar with the timer code in Xen, I think the problem comes from the
>> fact that in __update_vcpu_system_time when Xen detects that the guest
>> is using a vtsc it adds offsets to the time passed to the guest, while
>> in VCPUOP_set_singleshot_timer Xen compares the time passed from the
>> guest using NOW(), which is just the Xen uptime, without taking into
>> account any offsets.
>>
>> This only happens after migration because Xen automatically switches to
>> vtsc when it detects that the guest has been migrated. I'm currently
>> setting up a Linux PVHVM on shared storage to perform some testing, but
>> one possible solution might be to add tsc_mode="native_paravirt" to the
>> PVHVM config file, and another one would be fixing
>> VCPUOP_set_singleshot_timer to take into account the vtsc offsets and
>> correctly translate the time passed from the guest.
> 
> Good analisys!
> I think that the right solution would be to fix
> VCPUOP_set_singleshot_timer.

As a band-aid I can confirm that adding tsc_mode="native_paravirt" seems
to be working fine (with the tests I've done so far), but it requires
the admin to know whether a given HVM guest will be using the PV timer
or not, which I guess is not possible in every case.
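
(I.e. adding something like the following to the guest's xl config file,
assuming the xl.cfg spelling of the option:

# band-aid only; see discussion above
tsc_mode = "native_paravirt"
)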

Xen could also force the TSC mode to native_paravirt when it detects
that an HVM guest is using the PV timer, but I don't think that's the
right approach. Is this something we aim to fix before the 4.3 release?


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03  8:37                                                       ` Roger Pau Monné
  2013-06-03 10:05                                                         ` Stefano Stabellini
@ 2013-06-03 10:25                                                         ` George Dunlap
  1 sibling, 0 replies; 60+ messages in thread
From: George Dunlap @ 2013-06-03 10:25 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel, David Vrabel,
	Alex Bligh, Anthony PERARD, Diana Crisan

On 03/06/13 09:37, Roger Pau Monné wrote:
> On 31/05/13 17:10, Roger Pau Monné wrote:
>> On 31/05/13 15:07, George Dunlap wrote:
>>> On 31/05/13 13:40, Ian Campbell wrote:
>>>> On Fri, 2013-05-31 at 12:57 +0100, Alex Bligh wrote:
>>>>> --On 31 May 2013 12:49:18 +0100 George Dunlap
>>>>> <george.dunlap@eu.citrix.com>
>>>>> wrote:
>>>>>
>>>>>> No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is
>>>>>> saying, "No".  So Linux is saying, "OK, how about 5us?  10us?
>>>>>> 20us?"  By
>>>>>> the time it reaches 4ms, Linux has had enough, and says, "If this timer
>>>>>> is so bad that it can't give me an event within 4ms it just won't use
>>>>>> timers at all, thank you very much."
>>>>>>
>>>>>> The problem appears to be that Linux thinks it's asking for
>>>>>> something in
>>>>>> the future, but is actually asking for something in the past.  It must
>>>>>> look at its watch just before the final domain pause, and then asks for
>>>>>> the time just after the migration resumes on the other side.  So it
>>>>>> doesn't realize that 10ms (or something) has already passed, and that
>>>>>> it's actually asking for a timer in the past.  The Xen timer driver in
>>>>>> Linux specifically asks Xen for times set in the past to return an
>>>>>> error.
>>>>>> Xen is returning an error because the time is in the past, Linux thinks
>>>>>> it's getting an error because the time is too close in the future and
>>>>>> tries asking a little further away.
>>>>>>
>>>>>> Unfortunately I think this is something which needs to be fixed on the
>>>>>> Linux side; I don't really see how we can work around it in Xen.
>>>>> I don't think fixing it only on the Linux side is a great idea, not
>>>>> least
>>>>> as it makes any current Linux image not live migrateable reliably.
>>>>> That's
>>>>> pretty horrible.
>>>> Ultimately though a guest bug is a guest bug, we don't really want to be
>>>> filling the hypervisor with lots of quirky exceptions to interfaces in
>>>> order to work around them, otherwise where does it end?
>>>>
>>>> A kernel side fix can be pushed to the distros fairly aggressively (it's
>>>> mostly just a case of getting an upstream stable backport then filing
>>>> bugs with the main ones, we've done it before) and for users upgrading
>>>> the kernel via the distros is really not so hard and mostly reuses the
>>>> process they must have in place for guest kernel security updates and
>>>> other important kernel bugs anyway.
>>> In any case, it seems I was wrong -- Linux does "look at its watch"
>>> every time it asks.
>>>
>>> The generic timer interface is "set me a timer N nanoseconds in the
>>> future"; the Xen timer implementation executes
>>> pvclock_clocksource_read() and adds the delta.  So it may well actually
>>> be a bug in Xen.
>>>
>>> Stand by for further investigation...
> I've been investigating further during the weekend, and although I'm not
> familiar with the timer code in Xen, I think the problem comes from the
> fact that in __update_vcpu_system_time when Xen detects that the guest
> is using a vtsc it adds offsets to the time passed to the guest, while
> in VCPUOP_set_singleshot_timer Xen compares the time passed from the
> guest using NOW(), which is just the Xen uptime, without taking into
> account any offsets.

All the code is really complicated, but it seems like the offset is 
added because the offset is *subtracted* by the hardware when the HVM 
guest does an RDTSC instruction -- and subtracted in a different way by 
Xen when emulating the RDTSC instruction, if you've set tsc_mode 
"always_emulate".

Just to test some of this, I set the TSC mode to "always_emulate", 
and it has the exact same effect -- even though "always_emulate" will 
emulate a 1GHz clock.

> This only happens after migration because Xen automatically switches to
> vtsc when it detects that the guest has been migrated. I'm currently
> setting up a Linux PVHVM on shared storage to perform some testing, but
> one possible solution might be to add tsc_mode="native_paravirt" to the
> PVHVM config file, and another one would be fixing
> VCPUOP_set_singleshot_timer to take into account the vtsc offsets and
> correctly translate the time passed from the guest.

So have you tested it with native_paravirt?  Does it work around the 
problem?

  -George

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03 10:23                                                           ` Roger Pau Monné
@ 2013-06-03 10:30                                                             ` Stefano Stabellini
  2013-06-03 11:16                                                             ` George Dunlap
  1 sibling, 0 replies; 60+ messages in thread
From: Stefano Stabellini @ 2013-06-03 10:30 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, George Dunlap,
	Stefano Stabellini, xen-devel, David Vrabel, Alex Bligh,
	Anthony PERARD, Diana Crisan

[-- Attachment #1: Type: text/plain, Size: 4901 bytes --]

On Mon, 3 Jun 2013, Roger Pau Monné wrote:
> On 03/06/13 12:05, Stefano Stabellini wrote:
> > On Mon, 3 Jun 2013, Roger Pau Monné wrote:
> >> On 31/05/13 17:10, Roger Pau Monné wrote:
> >>> On 31/05/13 15:07, George Dunlap wrote:
> >>>> On 31/05/13 13:40, Ian Campbell wrote:
> >>>>> On Fri, 2013-05-31 at 12:57 +0100, Alex Bligh wrote:
> >>>>>> --On 31 May 2013 12:49:18 +0100 George Dunlap
> >>>>>> <george.dunlap@eu.citrix.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is
> >>>>>>> saying, "No".  So Linux is saying, "OK, how about 5us?  10us?
> >>>>>>> 20us?"  By
> >>>>>>> the time it reaches 4ms, Linux has had enough, and says, "If this timer
> >>>>>>> is so bad that it can't give me an event within 4ms it just won't use
> >>>>>>> timers at all, thank you very much."
> >>>>>>>
> >>>>>>> The problem appears to be that Linux thinks it's asking for
> >>>>>>> something in
> >>>>>>> the future, but is actually asking for something in the past.  It must
> >>>>>>> look at its watch just before the final domain pause, and then asks for
> >>>>>>> the time just after the migration resumes on the other side.  So it
> >>>>>>> doesn't realize that 10ms (or something) has already passed, and that
> >>>>>>> it's actually asking for a timer in the past.  The Xen timer driver in
> >>>>>>> Linux specifically asks Xen for times set in the past to return an
> >>>>>>> error.
> >>>>>>> Xen is returning an error because the time is in the past, Linux thinks
> >>>>>>> it's getting an error because the time is too close in the future and
> >>>>>>> tries asking a little further away.
> >>>>>>>
> >>>>>>> Unfortunately I think this is something which needs to be fixed on the
> >>>>>>> Linux side; I don't really see how we can work around it in Xen.
> >>>>>> I don't think fixing it only on the Linux side is a great idea, not
> >>>>>> least
> >>>>>> as it makes any current Linux image not live migrateable reliably.
> >>>>>> That's
> >>>>>> pretty horrible.
> >>>>> Ultimately though a guest bug is a guest bug, we don't really want to be
> >>>>> filling the hypervisor with lots of quirky exceptions to interfaces in
> >>>>> order to work around them, otherwise where does it end?
> >>>>>
> >>>>> A kernel side fix can be pushed to the distros fairly aggressively (it's
> >>>>> mostly just a case of getting an upstream stable backport then filing
> >>>>> bugs with the main ones, we've done it before) and for users upgrading
> >>>>> the kernel via the distros is really not so hard and mostly reuses the
> >>>>> process they must have in place for guest kernel security updates and
> >>>>> other important kernel bugs anyway.
> >>>>
> >>>> In any case, it seems I was wrong -- Linux does "look at its watch"
> >>>> every time it asks.
> >>>>
> >>>> The generic timer interface is "set me a timer N nanoseconds in the
> >>>> future"; the Xen timer implementation executes
> >>>> pvclock_clocksource_read() and adds the delta.  So it may well actually
> >>>> be a bug in Xen.
> >>>>
> >>>> Stand by for further investigation...
> >>
> >> I've been investigating further during the weekend, and although I'm not
> >> familiar with the timer code in Xen, I think the problem comes from the
> >> fact that in __update_vcpu_system_time when Xen detects that the guest
> >> is using a vtsc it adds offsets to the time passed to the guest, while
> >> in VCPUOP_set_singleshot_timer Xen compares the time passed from the
> >> guest using NOW(), which is just the Xen uptime, without taking into
> >> account any offsets.
> >>
> >> This only happens after migration because Xen automatically switches to
> >> vtsc when it detects that the guest has been migrated. I'm currently
> >> setting up a Linux PVHVM on shared storage to perform some testing, but
> >> one possible solution might be to add tsc_mode="native_paravirt" to the
> >> PVHVM config file, and another one would be fixing
> >> VCPUOP_set_singleshot_timer to take into account the vtsc offsets and
> >> correctly translate the time passed from the guest.
> > 
> > Good analysis!
> > I think that the right solution would be to fix
> > VCPUOP_set_singleshot_timer.
> 
> As a band-aid I can confirm that adding tsc_mode="native_paravirt" seems
> to be working fine (with the tests I've done so far), but it requires
> the admin to know whether a certain HVM will be using the PV timer or
> not, which I guess is not possible in every case.

It needs to work out of the box.


> Xen could also force the TSC mode to native_paravirt when it detects
> that a HVM guest is using the PV timer, but I don't think that's the
> right approach. Is this something we aim to fix before the 4.3 release?

I think it should be fixed, and probably backported to any release where we
claim XENFEAT_hvm_safe_pvclock.
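
To make the mismatch concrete, here is a toy model of it (a standalone C
sketch with made-up numbers; the sign of the offset is just one possible
case after a migration, and none of this is the actual Xen code):

/* Toy model: the guest computes an absolute deadline from its own
 * (offset) view of system time, but the hypercall check compares it
 * against raw Xen uptime.  Purely illustrative. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t xen_now_ns = 500000000000ULL;    /* NOW() on the new host */
    int64_t  stime_offset_ns = -10000000LL;   /* guest time 10ms behind NOW() */
    uint64_t guest_now_ns = xen_now_ns + stime_offset_ns;

    /* Guest: "give me an event 5us from (my) now". */
    uint64_t timeout_abs_ns = guest_now_ns + 5000;

    /* What the singleshot timer hypercall effectively checks. */
    if (timeout_abs_ns < xen_now_ns)
        printf("-ETIME: deadline is %lld ns in Xen's past\n",
               (long long)(xen_now_ns - timeout_abs_ns));
    else
        printf("timer armed\n");
    return 0;
}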

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03 10:23                                                           ` Roger Pau Monné
  2013-06-03 10:30                                                             ` Stefano Stabellini
@ 2013-06-03 11:16                                                             ` George Dunlap
  2013-06-03 11:24                                                               ` Diana Crisan
                                                                                 ` (2 more replies)
  1 sibling, 3 replies; 60+ messages in thread
From: George Dunlap @ 2013-06-03 11:16 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, Stefano Stabellini,
	xen-devel, David Vrabel, Alex Bligh, Anthony PERARD,
	Diana Crisan

On 03/06/13 11:23, Roger Pau Monné wrote:
> On 03/06/13 12:05, Stefano Stabellini wrote:
>> On Mon, 3 Jun 2013, Roger Pau Monné wrote:
>>> On 31/05/13 17:10, Roger Pau Monné wrote:
>>>> On 31/05/13 15:07, George Dunlap wrote:
>>>>> On 31/05/13 13:40, Ian Campbell wrote:
>>>>>> On Fri, 2013-05-31 at 12:57 +0100, Alex Bligh wrote:
>>>>>>> --On 31 May 2013 12:49:18 +0100 George Dunlap
>>>>>>> <george.dunlap@eu.citrix.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is
>>>>>>>> saying, "No".  So Linux is saying, "OK, how about 5us?  10us?
>>>>>>>> 20us?"  By
>>>>>>>> the time it reaches 4ms, Linux has had enough, and says, "If this timer
>>>>>>>> is so bad that it can't give me an event within 4ms, I just won't use
>>>>>>>> timers at all, thank you very much."
>>>>>>>>
>>>>>>>> The problem appears to be that Linux thinks it's asking for
>>>>>>>> something in
>>>>>>>> the future, but is actually asking for something in the past.  It must
>>>>>>>> look at its watch just before the final domain pause, and then ask for
>>>>>>>> the timer just after the migration resumes on the other side.  So it
>>>>>>>> doesn't realize that 10ms (or something) has already passed, and that
>>>>>>>> it's actually asking for a timer in the past.  The Xen timer driver in
>>>>>>>> Linux specifically asks Xen to return an error for times set in the
>>>>>>>> past.
>>>>>>>> Xen is returning an error because the time is in the past, Linux thinks
>>>>>>>> it's getting an error because the time is too close in the future and
>>>>>>>> tries asking a little further away.
>>>>>>>>
>>>>>>>> Unfortunately I think this is something which needs to be fixed on the
>>>>>>>> Linux side; I don't really see how we can work around it in Xen.
>>>>>>> I don't think fixing it only on the Linux side is a great idea, not
>>>>>>> least because it would leave current Linux images unable to live
>>>>>>> migrate reliably.  That's pretty horrible.
>>>>>> Ultimately though a guest bug is a guest bug, we don't really want to be
>>>>>> filling the hypervisor with lots of quirky exceptions to interfaces in
>>>>>> order to work around them, otherwise where does it end?
>>>>>>
>>>>>> A kernel-side fix can be pushed to the distros fairly aggressively (it's
>>>>>> mostly just a case of getting an upstream stable backport and then filing
>>>>>> bugs with the main ones; we've done it before), and for users, upgrading
>>>>>> the kernel via the distros is really not so hard: it mostly reuses the
>>>>>> process they must have in place for guest kernel security updates and
>>>>>> other important kernel fixes anyway.
>>>>> In any case, it seems I was wrong -- Linux does "look at its watch"
>>>>> every time it asks.
>>>>>
>>>>> The generic timer interface is "set me a timer N nanoseconds in the
>>>>> future"; the Xen timer implementation executes
>>>>> pvclock_clocksource_read() and adds the delta.  So it may well actually
>>>>> be a bug in Xen.
>>>>>
>>>>> Stand by for further investigation...
>>> I've been investigating further during the weekend, and although I'm not
>>> familiar with the timer code in Xen, I think the problem comes from the
>>> fact that in __update_vcpu_system_time when Xen detects that the guest
>>> is using a vtsc it adds offsets to the time passed to the guest, while
>>> in VCPUOP_set_singleshot_timer Xen compares the time passed from the
>>> guest using NOW(), which is just the Xen uptime, without taking into
>>> account any offsets.
>>>
>>> This only happens after migration because Xen automatically switches to
>>> vtsc when it detects that the guest has been migrated. I'm currently
>>> setting up a Linux PVHVM on shared storage to perform some testing, but
>>> one possible solution might be to add tsc_mode="native_paravirt" to the
>>> PVHVM config file, and another one would be fixing
>>> VCPUOP_set_singleshot_timer to take into account the vtsc offsets and
>>> correctly translate the time passed from the guest.
>> Good analysis!
>> I think that the right solution would be to fix
>> VCPUOP_set_singleshot_timer.
> As a band-aid I can confirm that adding tsc_mode="native_paravirt" seems
> to be working fine (with the tests I've done so far), but it requires
> the admin to know whether a certain HVM will be using the PV timer or
> not, which I guess is not possible in every case.

Right -- it seems to be working for me too.  I'm on migrate #52, and 
there hasn't been a single warning from Linux's clock code about 
increasing min_delay.
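
For anyone following along, the behaviour that warning comes from is
roughly the pattern below -- a simplified, self-contained sketch of how
the guest reacts when the singleshot timer keeps being refused
(illustrative names and numbers only, not the actual Linux clockevents
source):

/* Simplified sketch: the guest asks for an event a tiny delta in the
 * future, Xen says it is already in the past, and the guest assumes
 * the timer just cannot do short deltas, doubling its minimum until
 * it gives up.  All names and numbers are made up for illustration. */
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_MSEC 1000000ULL
#define GIVE_UP_NS (4 * NSEC_PER_MSEC)      /* assumed ~4ms cutoff */

/* Stand-in for the set_next_event hook: fails while the requested
 * delta still lands in Xen's past because of the skew introduced by
 * the migration. */
static int set_next_event(uint64_t delta_ns, uint64_t skew_ns)
{
    return delta_ns <= skew_ns ? -1 : 0;
}

int main(void)
{
    uint64_t skew_ns = 10 * NSEC_PER_MSEC;  /* ~10ms lost across migration */
    uint64_t min_delta_ns = 100;            /* starts tiny */

    while (set_next_event(min_delta_ns, skew_ns) < 0) {
        min_delta_ns *= 2;
        printf("increasing minimum delta to %llu ns\n",
               (unsigned long long)min_delta_ns);
        if (min_delta_ns > GIVE_UP_NS) {
            printf("giving up on one-shot timers\n");
            return 1;
        }
    }
    printf("timer armed with delta %llu ns\n",
           (unsigned long long)min_delta_ns);
    return 0;
}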

Diana / Alex: It looks like we might have a work-around for you. Diana, 
would it be possible for you to do the following:
  1. Add tsc_mode="native_paravirt" to your config file
  2. Try a localhost migrate again to see if you run into any problems
  3. Try migrating between machines again, just to make sure we've 
solved the important problem

Thanks,
  -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03 11:16                                                             ` George Dunlap
@ 2013-06-03 11:24                                                               ` Diana Crisan
  2013-06-03 14:01                                                               ` Diana Crisan
  2013-06-03 17:09                                                               ` Alex Bligh
  2 siblings, 0 replies; 60+ messages in thread
From: Diana Crisan @ 2013-06-03 11:24 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Stefano Stabellini, Konrad Rzeszutek Wilk,
	xen-devel, David Vrabel, Alex Bligh, Anthony PERARD,
	Roger Pau Monné

On 03/06/13 12:16, George Dunlap wrote:
> Right -- it seems to be working for me too.  I'm on migrate #52, and 
> there hasn't been a single warning from Linux's clock code about 
> increasing min_delay.
>
> Diana / Alex: It looks like we might have a work-around for you. 
> Diana, would it be possible for you to do the following:
>  1. Add tsc_mode="native_paravirt" to your config file
>  2. Try a localhost migrate again to see if you run into any problems
>  3. Try migrating between machines again, just to make sure we've 
> solved the important problem
>

Sure, will run our tests and update with the results.

Thanks,
Diana

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03 11:16                                                             ` George Dunlap
  2013-06-03 11:24                                                               ` Diana Crisan
@ 2013-06-03 14:01                                                               ` Diana Crisan
  2013-06-03 17:09                                                               ` Alex Bligh
  2 siblings, 0 replies; 60+ messages in thread
From: Diana Crisan @ 2013-06-03 14:01 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Stefano Stabellini, Konrad Rzeszutek Wilk,
	xen-devel, David Vrabel, Alex Bligh, Anthony PERARD,
	Roger Pau Monné


> Right -- it seems to be working for me too.  I'm on migrate #52, and 
> there hasn't been a single warning from Linux's clock code about 
> increasing min_delay.
>
> Diana / Alex: It looks like we might have a work-around for you. 
> Diana, would it be possible for you to do the following:
>  1. Add tsc_mode="native_paravirt" to your config file
>  2. Try a localhost migrate again to see if you run into any problems
>  3. Try migrating between machines again, just to make sure we've 
> solved the important problem

This seems to fix migration for us too.
I have tested both of your scenarios, as well as within our own code, and 
I can confirm the problem has not reappeared yet.
I will continue testing over the next couple of days and will update if 
anything comes up.

Thank you for your help,
Diana

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03 11:16                                                             ` George Dunlap
  2013-06-03 11:24                                                               ` Diana Crisan
  2013-06-03 14:01                                                               ` Diana Crisan
@ 2013-06-03 17:09                                                               ` Alex Bligh
  2013-06-03 17:12                                                                 ` George Dunlap
  2 siblings, 1 reply; 60+ messages in thread
From: Alex Bligh @ 2013-06-03 17:09 UTC (permalink / raw)
  To: George Dunlap, Roger Pau Monné
  Cc: Ian Campbell, Konrad Rzeszutek Wilk, Stefano Stabellini,
	xen-devel, David Vrabel, Alex Bligh, Anthony PERARD,
	Diana Crisan

George,

--On 3 June 2013 12:16:45 +0100 George Dunlap <george.dunlap@eu.citrix.com> 
wrote:

> Right -- it seems to be working for me too.  I'm on migrate #52, and
> there hasn't been a single warning from Linux's clock code about
> increasing min_delay.
>
> Diana / Alex: It looks like we might have a work-around for you. Diana,
> would it be possible for you to do the following:
>   1. Add tsc_mode="native_paravirt" to your config file
>   2. Try a localhost migrate again to see if you run into any problems
>   3. Try migrating between machines again, just to make sure we've
> solved the important problem

Thanks for this.

Will this work with all guests? We (unfortunately) do not know whether the
guest is Linux with or without PV, Windows with or without PV or what. I
/think/ this is going to give problems when the guest is not using
the PV timer (including Windows). Is that right?

I'd be happiest if it selected native_paravirt whenever the guest was
using the PV timer, but not otherwise. Happy to test patches for that.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03 17:09                                                               ` Alex Bligh
@ 2013-06-03 17:12                                                                 ` George Dunlap
  2013-06-03 17:18                                                                   ` Alex Bligh
  0 siblings, 1 reply; 60+ messages in thread
From: George Dunlap @ 2013-06-03 17:12 UTC (permalink / raw)
  To: Alex Bligh
  Cc: Ian Campbell, Stefano Stabellini, Konrad Rzeszutek Wilk,
	xen-devel, David Vrabel, Anthony PERARD, Diana Crisan,
	Roger Pau Monné

On 03/06/13 18:09, Alex Bligh wrote:
> George,
>
> --On 3 June 2013 12:16:45 +0100 George Dunlap 
> <george.dunlap@eu.citrix.com> wrote:
>
>> Right -- it seems to be working for me too.  I'm on migrate #52, and
>> there hasn't been a single warning from Linux's clock code about
>> increasing min_delay.
>>
>> Diana / Alex: It looks like we might have a work-around for you. Diana,
>> would it be possible for you to do the following:
>>   1. Add tsc_mode="native_paravirt" to your config file
>>   2. Try a localhost migrate again to see if you run into any problems
>>   3. Try migrating between machines again, just to make sure we've
>> solved the important problem
>
> Thanks for this.
>
> Will this work with all guests? We (unfortunately) do not know whether 
> the
> guest is Linux with or without PV, Windows with or without PV or what. I
> /think/ this is going to give problems when the guest is not using
> the PV timer (including Windows). Is that right?
>
> I'd be happiest if it selected native_paravirt whenever the guest was
> using the PV timer, but not otherwise. Happy to test patches for that.

I'm not sure of the effect, but I suspect it probably won't work quite 
right.  Might be worth a quick test.

We do want to fix this properly (which means, "just works w/o having to 
use a work-around"); I'm trying to understand the exact nature of the 
error now.

  -George

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03 17:12                                                                 ` George Dunlap
@ 2013-06-03 17:18                                                                   ` Alex Bligh
  2013-06-03 17:25                                                                     ` George Dunlap
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Bligh @ 2013-06-03 17:18 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Stefano Stabellini, Konrad Rzeszutek Wilk,
	xen-devel, David Vrabel, Alex Bligh, Anthony PERARD,
	Diana Crisan, Roger Pau Monné

George,

--On 3 June 2013 18:12:47 +0100 George Dunlap <george.dunlap@eu.citrix.com> 
wrote:

>> Will this work with all guests? We (unfortunately) do not know whether
>> the
>> guest is Linux with or without PV, Windows with or without PV or what. I
>> /think/ this is going to give problems when the guest is not using
>> the PV timer (including Windows). Is that right?
>>
>> I'd be happiest if it selected native_paravirt whenever the guest was
>> using the PV timer, but not otherwise. Happy to test patches for that.
>
> I'm not sure of the effect, but I suspect it probably won't work quite
> right.  Might be worth a quick test.
>
> We do want to fix this properly (which means, "just works w/o having to
> use a work-around"); I'm trying to understand the exact nature of the
> error now.

Sorry, did you mean using native_paravirt on Linux guests with PV support
wouldn't work quite right, or using that on all guests wouldn't work quite
right, or that switching automatically if use of the PV timer was
detected wouldn't work quite right?

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03 17:18                                                                   ` Alex Bligh
@ 2013-06-03 17:25                                                                     ` George Dunlap
  2013-06-03 17:42                                                                       ` Alex Bligh
  0 siblings, 1 reply; 60+ messages in thread
From: George Dunlap @ 2013-06-03 17:25 UTC (permalink / raw)
  To: Alex Bligh
  Cc: Ian Campbell, Stefano Stabellini, Konrad Rzeszutek Wilk,
	xen-devel, David Vrabel, Anthony PERARD, Diana Crisan,
	Roger Pau Monné

On 03/06/13 18:18, Alex Bligh wrote:
> George,
>
> --On 3 June 2013 18:12:47 +0100 George Dunlap 
> <george.dunlap@eu.citrix.com> wrote:
>
>>> Will this work with all guests? We (unfortunately) do not know whether
>>> the
>>> guest is Linux with or without PV, Windows with or without PV or 
>>> what. I
>>> /think/ this is going to give problems when the guest is not using
>>> the PV timer (including Windows). Is that right?
>>>
>>> I'd be happiest if it selected native_paravirt whenever the guest was
>>> using the PV timer, but not otherwise. Happy to test patches for that.
>>
>> I'm not sure of the effect, but I suspect it probably won't work quite
>> right.  Might be worth a quick test.
>>
>> We do want to fix this properly (which means, "just works w/o having to
>> use a work-around"); I'm trying to understand the exact nature of the
>> error now.
>
> Sorry, did you mean using native_paravirt on Linux guests with PV support
> wouldn't work quite right, or using that on all guests wouldn't work 
> quite
> right, or that switching automatically if use of the PV timer was
> detected wouldn't work quite right?

I mean that if you enable it for guests that aren't using a PV 
clocksource (e.g., Windows guests), I'm not sure what will happen. It 
might be fine, but you should probably test it. :-)

Switching TSC modes isn't the right thing to do -- the right thing is to 
figure out where the math is getting messed up and make it work properly 
no matter which mode you're in.
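
Very roughly, the direction I have in mind is sketched below: undo
whatever offset the guest's view of system time carries before doing
the in-the-past check.  guest_time_to_xen_time() is a hypothetical
helper standing in for the inverse of the translation Xen already
applies when it builds the guest's time info; this is a standalone
illustration, not a patch against the real hypercall code:

/* Sketch of the fix direction: translate the guest-supplied absolute
 * deadline into Xen system time before comparing it with NOW().
 * The helper and the offset value are hypothetical. */
#include <stdint.h>
#include <stdio.h>

static uint64_t xen_now_ns = 500000000000ULL;   /* NOW() on the new host */
static int64_t  stime_offset_ns = -10000000LL;  /* made-up 10ms of skew */

/* Inverse of the offset applied when exposing time to the guest. */
static uint64_t guest_time_to_xen_time(uint64_t guest_abs_ns)
{
    return guest_abs_ns - stime_offset_ns;
}

int main(void)
{
    /* The guest asks for "5us from now" in its own view of time. */
    uint64_t guest_deadline = (xen_now_ns + stime_offset_ns) + 5000;
    uint64_t xen_deadline = guest_time_to_xen_time(guest_deadline);

    if (xen_deadline < xen_now_ns)
        printf("-ETIME\n");                     /* no longer triggers */
    else
        printf("timer armed for NOW()+%llu ns\n",
               (unsigned long long)(xen_deadline - xen_now_ns));
    return 0;
}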

  -George

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
  2013-06-03 17:25                                                                     ` George Dunlap
@ 2013-06-03 17:42                                                                       ` Alex Bligh
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Bligh @ 2013-06-03 17:42 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Stefano Stabellini, Konrad Rzeszutek Wilk,
	xen-devel, David Vrabel, Alex Bligh, Anthony PERARD,
	Diana Crisan, Roger Pau Monné



--On 3 June 2013 18:25:37 +0100 George Dunlap <george.dunlap@eu.citrix.com> 
wrote:

> Switching TSC modes isn't the right thing to do -- the right thing is to
> figure out where the math is getting messed up and make it work properly
> no matter which mode you're in.

That would be my preference too!

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2013-06-03 17:42 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1223417765.8633857.1368537033873.JavaMail.root@zimbra002>
2013-05-14 13:11 ` HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI Diana Crisan
2013-05-14 16:09   ` George Dunlap
2013-05-15 10:05     ` Diana Crisan
2013-05-15 13:46   ` Alex Bligh
2013-05-20 11:11     ` George Dunlap
2013-05-20 19:28       ` Konrad Rzeszutek Wilk
2013-05-20 22:38         ` Alex Bligh
2013-05-21  1:04           ` Konrad Rzeszutek Wilk
2013-05-21 10:22             ` Diana Crisan
2013-05-21 10:47               ` David Vrabel
2013-05-21 11:16                 ` Diana Crisan
2013-05-21 12:49                   ` David Vrabel
2013-05-21 13:16                     ` Alex Bligh
2013-05-24 16:16                       ` George Dunlap
2013-05-25 10:18                         ` Alex Bligh
2013-05-26  8:38                           ` Ian Campbell
2013-05-28 15:06                             ` Diana Crisan
2013-05-29 16:16                               ` Alex Bligh
2013-05-29 19:04                                 ` Ian Campbell
2013-05-30 14:30                                   ` George Dunlap
2013-05-30 15:39                                 ` Frediano Ziglio
2013-05-30 15:26                               ` George Dunlap
2013-05-30 15:55                                 ` Diana Crisan
2013-05-30 16:06                                   ` George Dunlap
2013-05-30 17:02                                     ` Diana Crisan
2013-05-31  8:34                                     ` Diana Crisan
2013-05-31 10:54                                       ` George Dunlap
2013-05-31 10:59                                         ` George Dunlap
2013-05-31 11:41                                           ` George Dunlap
2013-05-31 21:30                                           ` Konrad Rzeszutek Wilk
2013-05-31 22:51                                             ` Alex Bligh
2013-06-03  9:43                                             ` George Dunlap
2013-05-31 11:18                                         ` Alex Bligh
2013-05-31 11:36                                         ` Diana Crisan
2013-05-31 11:41                                           ` Diana Crisan
2013-05-31 11:49                                             ` George Dunlap
2013-05-31 11:57                                               ` Alex Bligh
2013-05-31 12:40                                                 ` Ian Campbell
2013-05-31 13:07                                                   ` George Dunlap
2013-05-31 15:10                                                     ` Roger Pau Monné
2013-06-03  8:37                                                       ` Roger Pau Monné
2013-06-03 10:05                                                         ` Stefano Stabellini
2013-06-03 10:23                                                           ` Roger Pau Monné
2013-06-03 10:30                                                             ` Stefano Stabellini
2013-06-03 11:16                                                             ` George Dunlap
2013-06-03 11:24                                                               ` Diana Crisan
2013-06-03 14:01                                                               ` Diana Crisan
2013-06-03 17:09                                                               ` Alex Bligh
2013-06-03 17:12                                                                 ` George Dunlap
2013-06-03 17:18                                                                   ` Alex Bligh
2013-06-03 17:25                                                                     ` George Dunlap
2013-06-03 17:42                                                                       ` Alex Bligh
2013-06-03 10:25                                                         ` George Dunlap
2013-05-31 13:16                                                   ` Alex Bligh
2013-05-31 14:36                                                     ` Ian Campbell
2013-05-31 15:18                                                       ` Alex Bligh
2013-05-31 12:34                                               ` Ian Campbell
2013-05-30 14:32   ` George Dunlap
2013-05-30 14:42     ` Diana Crisan
