* pv 2.6.31 (kernel.org) and save/migrate
@ 2009-11-06 18:37 Dan Magenheimer
  2009-11-06 20:37 ` Pasi Kärkkäinen
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Magenheimer @ 2009-11-06 18:37 UTC (permalink / raw)
  To: Xen-Devel (E-mail)

Sorry for another possibly stupid question:

I've observed that for a pv domain that's been updated
to a 2.6.31 kernel (straight from kernel.org), "xm save"
never completes.  When the older kernel (2.6.18)
is booted, "xm save" works fine.  Is this a known problem...
or perhaps xm save has never worked with an upstream pv
kernel and I've never noticed?

I'd assume migrate and live migrate would fail also but
haven't tried them.

Thanks,
Dan

P.S. This is with very recent xen-unstable, c/s 20399.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate
  2009-11-06 18:37 pv 2.6.31 (kernel.org) and save/migrate Dan Magenheimer
@ 2009-11-06 20:37 ` Pasi Kärkkäinen
  2009-11-06 22:27   ` Dan Magenheimer
  2009-11-07  0:19   ` pv 2.6.31 (kernel.org) and save/migrate Jeremy Fitzhardinge
  0 siblings, 2 replies; 30+ messages in thread
From: Pasi Kärkkäinen @ 2009-11-06 20:37 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail)

On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote:
> Sorry for another possibly stupid question:
> 
> I've observed that for a pv domain that's been updated
> to a 2.6.31 kernel (straight from kernel.org), "xm save"
> never completes.  When the older kernel (2.6.18)
> is booted, "xm save" works fine.  Is this a known problem...
> or perhaps xm save has never worked with an upstream pv
> kernel and I've never noticed?
> 
> I'd assume migrate and live migrate would fail also but
> haven't tried them.
> 

Just checking: are you running the latest 2.6.31.5? I think there have
been multiple Xen-related bugfixes in the 2.6.31.x releases.

-- Pasi


* RE: pv 2.6.31 (kernel.org) and save/migrate
  2009-11-06 20:37 ` Pasi Kärkkäinen
@ 2009-11-06 22:27   ` Dan Magenheimer
  2009-11-06 22:30     ` Pasi Kärkkäinen
  2009-11-07  0:19   ` pv 2.6.31 (kernel.org) and save/migrate Jeremy Fitzhardinge
  1 sibling, 1 reply; 30+ messages in thread
From: Dan Magenheimer @ 2009-11-06 22:27 UTC (permalink / raw)
  To: "Pasi Kärkkäinen"; +Cc: Xen-Devel (E-mail)

> On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote:
> > Sorry for another possibly stupid question:
> > 
> > I've observed that for a pv domain that's been updated
> > to a 2.6.31 kernel (straight from kernel.org), "xm save"
> > never completes.  When the older kernel (2.6.18)
> > is booted, "xm save" works fine.  Is this a known problem...
> > or perhaps xm save has never worked with an upstream pv
> > kernel and I've never noticed?
> > 
> > I'd assume migrate and live migrate would fail also but
> > haven't tried them.
> > 
> 
> Just checking.. are you running the latest 2.6.31.5 ? I think 
> there has
> been multiple xen related bugfixes in the 2.6.31.X releases.
> 
> -- Pasi

No, it was plain 2.6.31.  But I downloaded/built 2.6.31.5 and
can't even get it to boot (and no console or VNC output at
all).  Are CONFIG changes required between 2.6.31 and 2.6.31.5
for Xen?  (I checked and I am using the same .config.)

Trying to reproduce on a different machine, just to verify.

Dan


* Re: pv 2.6.31 (kernel.org) and save/migrate
  2009-11-06 22:27   ` Dan Magenheimer
@ 2009-11-06 22:30     ` Pasi Kärkkäinen
  2009-11-07  0:08       ` Dan Magenheimer
  0 siblings, 1 reply; 30+ messages in thread
From: Pasi Kärkkäinen @ 2009-11-06 22:30 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail)

On Fri, Nov 06, 2009 at 02:27:27PM -0800, Dan Magenheimer wrote:
> > On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote:
> > > Sorry for another possibly stupid question:
> > > 
> > > I've observed that for a pv domain that's been updated
> > > to a 2.6.31 kernel (straight from kernel.org), "xm save"
> > > never completes.  When the older kernel (2.6.18)
> > > is booted, "xm save" works fine.  Is this a known problem...
> > > or perhaps xm save has never worked with an upstream pv
> > > kernel and I've never noticed?
> > > 
> > > I'd assume migrate and live migrate would fail also but
> > > haven't tried them.
> > > 
> > 
> > Just checking.. are you running the latest 2.6.31.5 ? I think 
> > there has
> > been multiple xen related bugfixes in the 2.6.31.X releases.
> > 
> > -- Pasi
> 
> No it was plain 2.6.31.  But I downloaded/built 2.6.31.5 and
> can't even get it to boot (and no console or VNC output at
> all).  Are CONFIG changes required between 2.6.31 and 2.6.31.5
> for Xen?  (I checked and I am using the same .config.)
> 
> Trying to reproduce on a different machine, just to verify.
> 

There shouldn't be any .config changes needed.

Can you paste the full domU console output? Does it crash, or what?

-- Pasi


* RE: pv 2.6.31 (kernel.org) and save/migrate
  2009-11-06 22:30     ` Pasi Kärkkäinen
@ 2009-11-07  0:08       ` Dan Magenheimer
  2009-11-07 11:09         ` Pasi Kärkkäinen
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Magenheimer @ 2009-11-07  0:08 UTC (permalink / raw)
  To: "Pasi Kärkkäinen"; +Cc: Xen-Devel (E-mail)

> On Fri, Nov 06, 2009 at 02:27:27PM -0800, Dan Magenheimer wrote:
> > > On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote:
> > > > Sorry for another possibly stupid question:
> > > > 
> > > > I've observed that for a pv domain that's been updated
> > > > to a 2.6.31 kernel (straight from kernel.org), "xm save"
> > > > never completes.  When the older kernel (2.6.18)
> > > > is booted, "xm save" works fine.  Is this a known problem...
> > > > or perhaps xm save has never worked with an upstream pv
> > > > kernel and I've never noticed?
> > > > 
> > > > I'd assume migrate and live migrate would fail also but
> > > > haven't tried them.
> > > > 
> > > 
> > > Just checking.. are you running the latest 2.6.31.5 ? I think 
> > > there has
> > > been multiple xen related bugfixes in the 2.6.31.X releases.
> > > 
> > > -- Pasi
> > 
> > No it was plain 2.6.31.  But I downloaded/built 2.6.31.5 and
> > can't even get it to boot (and no console or VNC output at
> > all).  Are CONFIG changes required between 2.6.31 and 2.6.31.5
> > for Xen?  (I checked and I am using the same .config.)
> > 
> > Trying to reproduce on a different machine, just to verify.
> 
> There shouldn't be any .config changes needed.
> 
> Can you paste the full domU console output? Does it crash or? 
> 
> -- Pasi

Well, first, I got 2.6.31.5 to boot in a PV guest in another
machine and it fails to save also.  Are you able to save
2.6.31{,.5} successfully?  On latest xen-unstable?
(NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't
know if that is important.)

(On the machine I couldn't boot 2.6.31.5 as a PV guest, there
was absolutely no console output.  However, I think tools
are out-of-date on that machine so ignore that.)
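(As an aside, the relevant options can be double-checked straight from the
build's .config. The sketch below fabricates a stand-in .config purely so the
commands are self-contained — in practice, point the variable at the real
.config in your kernel build tree.)

```shell
# Sketch: listing the Xen save/restore-related options in a kernel .config.
# The stand-in .config generated here is for illustration only; in real use,
# set $config to the .config in your kernel build tree.
config=$(mktemp)
printf 'CONFIG_XEN=y\nCONFIG_XEN_SAVE_RESTORE=y\nCONFIG_HVC_XEN=y\n' > "$config"
# Lines starting with CONFIG_XEN= or CONFIG_XEN_SAVE_RESTORE= are printed:
grep -E '^CONFIG_XEN(_SAVE_RESTORE)?=' "$config"
```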


* Re: pv 2.6.31 (kernel.org) and save/migrate
  2009-11-06 20:37 ` Pasi Kärkkäinen
  2009-11-06 22:27   ` Dan Magenheimer
@ 2009-11-07  0:19   ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 30+ messages in thread
From: Jeremy Fitzhardinge @ 2009-11-07  0:19 UTC (permalink / raw)
  To: Pasi Kärkkäinen; +Cc: Dan Magenheimer, Xen-Devel (E-mail)

On 11/06/09 12:37, Pasi Kärkkäinen wrote:
> On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote:
>   
>> Sorry for another possibly stupid question:
>>
>> I've observed that for a pv domain that's been updated
>> to a 2.6.31 kernel (straight from kernel.org), "xm save"
>> never completes.  When the older kernel (2.6.18)
>> is booted, "xm save" works fine.  Is this a known problem...
>> or perhaps xm save has never worked with an upstream pv
>> kernel and I've never noticed?
>>
>> I'd assume migrate and live migrate would fail also but
>> haven't tried them.
>>
>>     
> Just checking.. are you running the latest 2.6.31.5 ? I think there has
> been multiple xen related bugfixes in the 2.6.31.X releases.
>   

Nothing relating to save/restore.  Does it work for you?

    J


* Re: pv 2.6.31 (kernel.org) and save/migrate
  2009-11-07  0:08       ` Dan Magenheimer
@ 2009-11-07 11:09         ` Pasi Kärkkäinen
  2009-11-07 15:32           ` Dan Magenheimer
  0 siblings, 1 reply; 30+ messages in thread
From: Pasi Kärkkäinen @ 2009-11-07 11:09 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail)

On Fri, Nov 06, 2009 at 04:08:26PM -0800, Dan Magenheimer wrote:
> > On Fri, Nov 06, 2009 at 02:27:27PM -0800, Dan Magenheimer wrote:
> > > > On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote:
> > > > > Sorry for another possibly stupid question:
> > > > > 
> > > > > I've observed that for a pv domain that's been updated
> > > > > to a 2.6.31 kernel (straight from kernel.org), "xm save"
> > > > > never completes.  When the older kernel (2.6.18)
> > > > > is booted, "xm save" works fine.  Is this a known problem...
> > > > > or perhaps xm save has never worked with an upstream pv
> > > > > kernel and I've never noticed?
> > > > > 
> > > > > I'd assume migrate and live migrate would fail also but
> > > > > haven't tried them.
> > > > > 
> > > > 
> > > > Just checking.. are you running the latest 2.6.31.5 ? I think 
> > > > there has
> > > > been multiple xen related bugfixes in the 2.6.31.X releases.
> > > > 
> > > > -- Pasi
> > > 
> > > No it was plain 2.6.31.  But I downloaded/built 2.6.31.5 and
> > > can't even get it to boot (and no console or VNC output at
> > > all).  Are CONFIG changes required between 2.6.31 and 2.6.31.5
> > > for Xen?  (I checked and I am using the same .config.)
> > > 
> > > Trying to reproduce on a different machine, just to verify.
> > 
> > There shouldn't be any .config changes needed.
> > 
> > Can you paste the full domU console output? Does it crash or? 
> > 
> > -- Pasi
> 
> Well, first, I got 2.6.31.5 to boot in a PV guest in another
> machine and it fails to save also.  Are you able to save
> 2.6.31{,.5} successfully?  On latest xen-unstable?
> (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't
> know if that is important.)
> 

I'll have to try it later today..

> (On the machine I couldn't boot 2.6.31.5 as a PV guest, there
> was absolutely no console output.  However, I think tools
> are out-of-date on that machine so ignore that.)

Did you have "console=hvc0 earlyprintk=xen" in the domU kernel
parameters?

You might also change the xen guest cfgfile so that you have
on_crash=preserve and then when the PV guest is crashed run this:

/usr/lib/xen/bin/xenctx -s System.map-domUkernelversion <domid>

(if you have a 64-bit host, the xenctx binary might be under /usr/lib64/)

to get a stack trace..
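Putting that together, the cfgfile change and the xenctx invocation look
roughly like the fragment below (a sketch: the System.map path and the
32/64-bit library directory depend on your install, as noted above):

```
# Guest cfgfile: keep a crashed domain in memory instead of destroying it,
# so it can be inspected afterwards.
on_crash = 'preserve'

# Then, from dom0, after the guest has crashed (get <domid> from "xm list"):
#   /usr/lib/xen/bin/xenctx -s /boot/System.map-2.6.31.5 <domid>
# -s hands xenctx the guest kernel's System.map so the EIP and stack words
# can be resolved to symbol names (on 64-bit hosts the binary may live
# under /usr/lib64/xen/bin/ instead).
```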

-- Pasi


* RE: pv 2.6.31 (kernel.org) and save/migrate
  2009-11-07 11:09         ` Pasi Kärkkäinen
@ 2009-11-07 15:32           ` Dan Magenheimer
  2009-11-08 14:17             ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Pasi Kärkkäinen
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Magenheimer @ 2009-11-07 15:32 UTC (permalink / raw)
  To: "Pasi Kärkkäinen"; +Cc: Xen-Devel (E-mail)

> > Well, first, I got 2.6.31.5 to boot in a PV guest in another
> > machine and it fails to save also.  Are you able to save
> > 2.6.31{,.5} successfully?  On latest xen-unstable?
> > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't
> > know if that is important.)
> 
> I'll have to try it later today..

Let me know.

> > (On the machine I couldn't boot 2.6.31.5 as a PV guest, there
> > was absolutely no console output.  However, I think tools
> > are out-of-date on that machine so ignore that.)
> 
> Did you have "console=hvc0 earlyprintk=xen" in the domU kernel
> parameters?

No, but that didn't work either.

> You might also change the xen guest cfgfile so that you have
> on_crash=preserve and then when the PV guest is crashed run this:
> 
> /usr/lib/xen/bin/xenctx -s System.map-domUkernelversion <domid>
> 
> (if you have 64b host the xenctx binary might be under /usr/lib64/)
> 
> to get a stack trace..

Very interesting and useful!  I was completely unaware of
xenctx and could have used it many times in tmem development!

The results explain why I can get it to run on
one machine (an older laptop) and not run on another
machine (a Nehalem system)... looks like this is maybe
related to the cpuid-extended-topology-leaf bug that Jeremy
sent a fix for upstream recently.

cs:eip: e019:c040342d xen_cpuid+0x46 
flags: 00001206 i nz p
ss:esp: e021:c0779ee4
eax: 00000001	ebx: 00000002	ecx: 00000100	edx: 00000001
esi: c0779f1c	edi: c0779f18	ebp: c0779f24
 ds:     e021	 es:     e021	 fs:     00d8	 gs:     0000
Code (instr addr c040342d)
24 04 8b 15 a4 02 7c c0 89 54 24 08 8b 0e 0f 0b 78 65 6e 0f a2 <89> 45 00 8b 04 24 89 18 89 0e 89 


Stack:
 c0779f20 ffffffff ffffffff c07c0360 c0779f18 c0779f1c c0779f20 c066fd0f
 c0779f18 c0779f24 00000002 16aee301 00000001 00000001 16aee301 00000002
 0000000b c07c03cc c07c0360 c07c0360 c07c03d8 c0670ed8 c0779f58 00000001
 c07c0360 c0779f60 c066fe6a c0779f60 c0779f60 00000003 00000001 00000000

Call Trace:
  [<c040342d>] xen_cpuid+0x46  <--
  [<c066fd0f>] detect_extended_topology+0xae 
  [<c0670ed8>] init_intel+0x140 
  [<c066fe6a>] init_scattered_cpuid_features+0x82 
  [<c06705e2>] identify_cpu+0x22d 
  [<c040584c>] xen_force_evtchn_callback+0xc 
  [<c0405e78>] check_events+0x8 
  [<c07c9dec>] identify_boot_cpu+0xa 
  [<c07c9e9a>] check_bugs+0x8 
  [<c07c27bd>] start_kernel+0x2a0 
  [<c07c5206>] xen_start_kernel+0x340 


* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG()
  2009-11-07 15:32           ` Dan Magenheimer
@ 2009-11-08 14:17             ` Pasi Kärkkäinen
  2009-11-08 14:20               ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen
  2009-11-08 15:29               ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Dan Magenheimer
  0 siblings, 2 replies; 30+ messages in thread
From: Pasi Kärkkäinen @ 2009-11-08 14:17 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail)

On Sat, Nov 07, 2009 at 07:32:49AM -0800, Dan Magenheimer wrote:
> > > Well, first, I got 2.6.31.5 to boot in a PV guest in another
> > > machine and it fails to save also.  Are you able to save
> > > 2.6.31{,.5} successfully?  On latest xen-unstable?
> > > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't
> > > know if that is important.)
> > 
> > I'll have to try it later today..
> 
> Let me know.
> 

Ok. I just tried with a Fedora 12 (rawhide) PV guest. I was able to 
"xm save" and "xm restore" it without problems. 

But I noticed there was a BUG printed on the guest console:
http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86_64-saverestore.txt

BUG: sleeping function called from invalid context at kernel/mutex.c:94
in_atomic(): 0, irqs_disabled(): 1, pid: 1052, name: kstop/0
Pid: 1052, comm: kstop/0 Not tainted 2.6.31.5-122.fc12.x86_64 #1
Call Trace:
 [<ffffffff8104021f>] __might_sleep+0xe6/0xe8
 [<ffffffff81419c84>] mutex_lock+0x22/0x4e
 [<ffffffff812afdce>] dpm_resume_noirq+0x21/0x11f
 [<ffffffff81272b05>] xen_suspend+0xca/0xd1
 [<ffffffff8108c172>] stop_cpu+0x8c/0xd2
 [<ffffffff8106350c>] worker_thread+0x18a/0x224
 [<ffffffff81067ae7>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff8141ab29>] ? _spin_unlock_irqrestore+0x19/0x1b
 [<ffffffff81063382>] ? worker_thread+0x0/0x224
 [<ffffffff81067765>] kthread+0x91/0x99
 [<ffffffff81012daa>] child_rip+0xa/0x20
 [<ffffffff81011f97>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101271d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81012da0>] ? child_rip+0x0/0x20


More information about my setup:

Host/dom0: Fedora 12 (latest rawhide) with included Xen 3.4.1-5 and
custom 2.6.31.5 x86_64 pv_ops dom0 kernel (a couple of days old).

Guest/domU: Fedora 12 (latest rawhide) with the included/default
2.6.31.5-122.fc12.x86_64 kernel.

> > > (On the machine I couldn't boot 2.6.31.5 as a PV guest, there
> > > was absolutely no console output.  However, I think tools
> > > are out-of-date on that machine so ignore that.)
> > 
> > Did you have "console=hvc0 earlyprintk=xen" in the domU kernel
> > parameters?
> 
> No, but that didn't work either.
> 

Ok.. then it crashes really early.

> > You might also change the xen guest cfgfile so that you have
> > on_crash=preserve and then when the PV guest is crashed run this:
> > 
> > /usr/lib/xen/bin/xenctx -s System.map-domUkernelversion <domid>
> > 
> > (if you have 64b host the xenctx binary might be under /usr/lib64/)
> > 
> > to get a stack trace..
> 
> Very interesting and useful!  I was completely unaware of
> xenctx and could have used it many times in tmem development!
> 
> The results explain why I can get it to run on
> one machine (an older laptop) and not run on another
> machine (a Nehalem system)... looks like this is maybe
> related to the cpuid-extended-topology-leaf bug that Jeremy
> sent a fix for upstream recently.
> 

Did you try with that patch applied? 

-- Pasi

> cs:eip: e019:c040342d xen_cpuid+0x46 
> flags: 00001206 i nz p
> ss:esp: e021:c0779ee4
> eax: 00000001	ebx: 00000002	ecx: 00000100	edx: 00000001
> esi: c0779f1c	edi: c0779f18	ebp: c0779f24
>  ds:     e021	 es:     e021	 fs:     00d8	 gs:     0000
> Code (instr addr c040342d)
> 24 04 8b 15 a4 02 7c c0 89 54 24 08 8b 0e 0f 0b 78 65 6e 0f a2 <89> 45 00 8b 04 24 89 18 89 0e 89 
> 
> 
> Stack:
>  c0779f20 ffffffff ffffffff c07c0360 c0779f18 c0779f1c c0779f20 c066fd0f
>  c0779f18 c0779f24 00000002 16aee301 00000001 00000001 16aee301 00000002
>  0000000b c07c03cc c07c0360 c07c0360 c07c03d8 c0670ed8 c0779f58 00000001
>  c07c0360 c0779f60 c066fe6a c0779f60 c0779f60 00000003 00000001 00000000
> 
> Call Trace:
>   [<c040342d>] xen_cpuid+0x46  <--
>   [<c066fd0f>] detect_extended_topology+0xae 
>   [<c0670ed8>] init_intel+0x140 
>   [<c066fe6a>] init_scattered_cpuid_features+0x82 
>   [<c06705e2>] identify_cpu+0x22d 
>   [<c040584c>] xen_force_evtchn_callback+0xc 
>   [<c0405e78>] check_events+0x8 
>   [<c07c9dec>] identify_boot_cpu+0xa 
>   [<c07c9e9a>] check_bugs+0x8 
>   [<c07c27bd>] start_kernel+0x2a0 
>   [<c07c5206>] xen_start_kernel+0x340 


* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG
  2009-11-08 14:17             ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Pasi Kärkkäinen
@ 2009-11-08 14:20               ` Pasi Kärkkäinen
  2009-11-08 15:29               ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Dan Magenheimer
  1 sibling, 0 replies; 30+ messages in thread
From: Pasi Kärkkäinen @ 2009-11-08 14:20 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail)

On Sun, Nov 08, 2009 at 04:17:43PM +0200, Pasi Kärkkäinen wrote:
> On Sat, Nov 07, 2009 at 07:32:49AM -0800, Dan Magenheimer wrote:
> > > > Well, first, I got 2.6.31.5 to boot in a PV guest in another
> > > > machine and it fails to save also.  Are you able to save
> > > > 2.6.31{,.5} successfully?  On latest xen-unstable?
> > > > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't
> > > > know if that is important.)
> > > 
> > > I'll have to try it later today..
> > 
> > Let me know.
> > 
> 
> Ok. I just tried with a Fedora 12 (rawhide) PV guest. I was able to 
> "xm save" and "xm restore" it without problems. 
> 
> But I noticed there was a BUG printed on the guest console:
> http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86_64-saverestore.txt
> 
> BUG: sleeping function called from invalid context at kernel/mutex.c:94
> in_atomic(): 0, irqs_disabled(): 1, pid: 1052, name: kstop/0
> Pid: 1052, comm: kstop/0 Not tainted 2.6.31.5-122.fc12.x86_64 #1
> Call Trace:
>  [<ffffffff8104021f>] __might_sleep+0xe6/0xe8
>  [<ffffffff81419c84>] mutex_lock+0x22/0x4e
>  [<ffffffff812afdce>] dpm_resume_noirq+0x21/0x11f
>  [<ffffffff81272b05>] xen_suspend+0xca/0xd1
>  [<ffffffff8108c172>] stop_cpu+0x8c/0xd2
>  [<ffffffff8106350c>] worker_thread+0x18a/0x224
>  [<ffffffff81067ae7>] ? autoremove_wake_function+0x0/0x39
>  [<ffffffff8141ab29>] ? _spin_unlock_irqrestore+0x19/0x1b
>  [<ffffffff81063382>] ? worker_thread+0x0/0x224
>  [<ffffffff81067765>] kthread+0x91/0x99
>  [<ffffffff81012daa>] child_rip+0xa/0x20
>  [<ffffffff81011f97>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff8101271d>] ? retint_restore_args+0x5/0x6
>  [<ffffffff81012da0>] ? child_rip+0x0/0x20
> 

Oh, I forgot to mention that this BUG is non-fatal. The guest still
works after that..

-- Pasi

> 
> More information about my setup:
> 
> Host/dom0: Fedora 12 (latest rawhide) with included Xen 3.4.1-5 and
> custom 2.6.31.5 x86_64 pv_ops dom0 kernel (a couple of days old).
> 
> Guest/domU: Fedora 12 (latest rawhide) with the included/default
> 2.6.31.5-122.fc12.x86_64 kernel.
> 
> > > > (On the machine I couldn't boot 2.6.31.5 as a PV guest, there
> > > > was absolutely no console output.  However, I think tools
> > > > are out-of-date on that machine so ignore that.)
> > > 
> > > Did you have "console=hvc0 earlyprintk=xen" in the domU kernel
> > > parameters?
> > 
> > No, but that didn't work either.
> > 
> 
> Ok.. then it crashes really early.
> 
> > > You might also change the xen guest cfgfile so that you have
> > > on_crash=preserve and then when the PV guest is crashed run this:
> > > 
> > > /usr/lib/xen/bin/xenctx -s System.map-domUkernelversion <domid>
> > > 
> > > (if you have 64b host the xenctx binary might be under /usr/lib64/)
> > > 
> > > to get a stack trace..
> > 
> > Very interesting and useful!  I was completely unaware of
> > xenctx and could have used it many times in tmem development!
> > 
> > The results explain why I can get it to run on
> > one machine (an older laptop) and not run on another
> > machine (a Nehalem system)... looks like this is maybe
> > related to the cpuid-extended-topology-leaf bug that Jeremy
> > sent a fix for upstream recently.
> > 
> 
> Did you try with that patch applied? 
> 
> -- Pasi
> 
> > cs:eip: e019:c040342d xen_cpuid+0x46 
> > flags: 00001206 i nz p
> > ss:esp: e021:c0779ee4
> > eax: 00000001	ebx: 00000002	ecx: 00000100	edx: 00000001
> > esi: c0779f1c	edi: c0779f18	ebp: c0779f24
> >  ds:     e021	 es:     e021	 fs:     00d8	 gs:     0000
> > Code (instr addr c040342d)
> > 24 04 8b 15 a4 02 7c c0 89 54 24 08 8b 0e 0f 0b 78 65 6e 0f a2 <89> 45 00 8b 04 24 89 18 89 0e 89 
> > 
> > 
> > Stack:
> >  c0779f20 ffffffff ffffffff c07c0360 c0779f18 c0779f1c c0779f20 c066fd0f
> >  c0779f18 c0779f24 00000002 16aee301 00000001 00000001 16aee301 00000002
> >  0000000b c07c03cc c07c0360 c07c0360 c07c03d8 c0670ed8 c0779f58 00000001
> >  c07c0360 c0779f60 c066fe6a c0779f60 c0779f60 00000003 00000001 00000000
> > 
> > Call Trace:
> >   [<c040342d>] xen_cpuid+0x46  <--
> >   [<c066fd0f>] detect_extended_topology+0xae 
> >   [<c0670ed8>] init_intel+0x140 
> >   [<c066fe6a>] init_scattered_cpuid_features+0x82 
> >   [<c06705e2>] identify_cpu+0x22d 
> >   [<c040584c>] xen_force_evtchn_callback+0xc 
> >   [<c0405e78>] check_events+0x8 
> >   [<c07c9dec>] identify_boot_cpu+0xa 
> >   [<c07c9e9a>] check_bugs+0x8 
> >   [<c07c27bd>] start_kernel+0x2a0 
> >   [<c07c5206>] xen_start_kernel+0x340 
> 


* RE: pv 2.6.31 (kernel.org) and save/migrate, domU BUG()
  2009-11-08 14:17             ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Pasi Kärkkäinen
  2009-11-08 14:20               ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen
@ 2009-11-08 15:29               ` Dan Magenheimer
  2009-11-08 15:41                 ` Pasi Kärkkäinen
  1 sibling, 1 reply; 30+ messages in thread
From: Dan Magenheimer @ 2009-11-08 15:29 UTC (permalink / raw)
  To: "Pasi Kärkkäinen"; +Cc: Xen-Devel (E-mail)

> > > > machine and it fails to save also.  Are you able to save
> > > > 2.6.31{,.5} successfully?  On latest xen-unstable?
> > > > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't
> > > > know if that is important.)
> 
> Ok. I just tried with a Fedora 12 (rawhide) PV guest. I was able to 
> "xm save" and "xm restore" it without problems. 
> 
> But I noticed there was a BUG printed on the guest console:
> http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86
> _64-saverestore.txt
> BUG: sleeping function called from invalid context at 
> kernel/mutex.c:94
> in_atomic(): 0, irqs_disabled(): 1, pid: 1052, name: kstop/0
> Pid: 1052, comm: kstop/0 Not tainted 2.6.31.5-122.fc12.x86_64 #1

Ok, so it appears there is something problematic with
saving an upstream kernel.  It might be (partially) fixed
in Fedora 12 or maybe there is some other environmental
difference which makes save fail entirely on my system.

> > The results explain why I can get it to run on
> > one machine (an older laptop) and not run on another
> > machine (a Nehalem system)... looks like this is maybe
> > related to the cpuid-extended-topology-leaf bug that Jeremy
> > sent a fix for upstream recently.
> 
> Did you try with that patch applied? 

No, the patch wasn't posted, just a pull request to Linus,
so I don't have the patch (and am not a git expert so
am not sure how to get it).

http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00182.html

So I'll try it again when .6 or .7 is available.


* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG()
  2009-11-08 15:29               ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Dan Magenheimer
@ 2009-11-08 15:41                 ` Pasi Kärkkäinen
  2009-11-08 16:48                   ` Pasi Kärkkäinen
  2009-11-08 16:54                   ` Dan Magenheimer
  0 siblings, 2 replies; 30+ messages in thread
From: Pasi Kärkkäinen @ 2009-11-08 15:41 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail)

On Sun, Nov 08, 2009 at 07:29:58AM -0800, Dan Magenheimer wrote:
> > > > > machine and it fails to save also.  Are you able to save
> > > > > 2.6.31{,.5} successfully?  On latest xen-unstable?
> > > > > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't
> > > > > know if that is important.)
> > 
> > Ok. I just tried with a Fedora 12 (rawhide) PV guest. I was able to 
> > "xm save" and "xm restore" it without problems. 
> > 
> > But I noticed there was a BUG printed on the guest console:
> > http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86
> > _64-saverestore.txt
> > BUG: sleeping function called from invalid context at 
> > kernel/mutex.c:94
> > in_atomic(): 0, irqs_disabled(): 1, pid: 1052, name: kstop/0
> > Pid: 1052, comm: kstop/0 Not tainted 2.6.31.5-122.fc12.x86_64 #1
> 
> Ok, so it appears there is something problematic with
> saving an upstream kernel.  It might be (partially) fixed
> in Fedora 12 or maybe there is some other environmental
> difference which makes save fail entirely on my system.
> 

Yeah, the Fedora kernel has some patches, but it should be pretty
close to the upstream kernel..

btw was your guest UP or SMP? Mine was UP..

> > > The results explain why I can get it to run on
> > > one machine (an older laptop) and not run on another
> > > machine (a Nehalem system)... looks like this is maybe
> > > related to the cpuid-extended-topology-leaf bug that Jeremy
> > > sent a fix for upstream recently.
> > 
> > Did you try with that patch applied? 
> 
> No, the patch wasn't posted, just a pull request to Linus,
> so I don't have the patch (and am not a git expert so
> am not sure how to get it).
> 
> http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00182.html
> 
> So I'll try it again when .6 or .7 is available.

See here for changelog:
http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=shortlog;h=bugfix

You can get the diffs/patches from there using the links..
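(For what it's worth, a sketch of the git mechanics for turning a commit into
a patch file. The clone URL and branch name below mirror the shortlog link
above and need network access, so the runnable part demonstrates the same
format-patch step on a throwaway local repository instead.)

```shell
# To fetch Jeremy's bugfix branch you would do something like (network needed):
#   git clone -b bugfix git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git
#   cd xen && git format-patch -1 <commit-id>     # writes 0001-*.patch
#
# The same format-patch mechanics, on a throwaway local repository:
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.invalid
git config user.name demo
echo 'fix' > xen.c
git add xen.c
git commit -q -m 'xen: example fix'
# Export the top commit as a mailable patch file:
git format-patch -1 HEAD
```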

-- Pasi


* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG()
  2009-11-08 15:41                 ` Pasi Kärkkäinen
@ 2009-11-08 16:48                   ` Pasi Kärkkäinen
  2009-11-12 23:16                     ` Jeremy Fitzhardinge
  2009-11-08 16:54                   ` Dan Magenheimer
  1 sibling, 1 reply; 30+ messages in thread
From: Pasi Kärkkäinen @ 2009-11-08 16:48 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail)

On Sun, Nov 08, 2009 at 05:41:53PM +0200, Pasi Kärkkäinen wrote:
> On Sun, Nov 08, 2009 at 07:29:58AM -0800, Dan Magenheimer wrote:
> > > > > > machine and it fails to save also.  Are you able to save
> > > > > > 2.6.31{,.5} successfully?  On latest xen-unstable?
> > > > > > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't
> > > > > > know if that is important.)
> > > 
> > > Ok. I just tried with a Fedora 12 (rawhide) PV guest. I was able to 
> > > "xm save" and "xm restore" it without problems. 
> > > 
> > > But I noticed there was a BUG printed on the guest console:
> > > http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86
> > > _64-saverestore.txt
> > > BUG: sleeping function called from invalid context at 
> > > kernel/mutex.c:94
> > > in_atomic(): 0, irqs_disabled(): 1, pid: 1052, name: kstop/0
> > > Pid: 1052, comm: kstop/0 Not tainted 2.6.31.5-122.fc12.x86_64 #1
> > 
> > Ok, so it appears there is something problematic with
> > saving an upstream kernel.  It might be (partially) fixed
> > in Fedora 12 or maybe there is some other environmental
> > difference which makes save fail entirely on my system.
> > 
> 
> Yeah, fedora kernel has some patches, but it should be pretty 
> close to upstream kernel..
> 
> btw was your guest UP or SMP? Mine was UP..
> 

Ok.. saving SMP guest fails for me too:

[2009-11-09 23:44:38 1353] DEBUG (XendCheckpoint:110) [xc_save]: /usr/lib64/xen/bin/xc_save 28 2 0 0 0
[2009-11-09 23:44:38 1353] INFO (XendCheckpoint:417) xc_save: failed to get the suspend evtchn port

Jeremy: Ideas what's causing that? "xm save" for UP 2.6.31.5 guest works
OK, but for SMP guest it fails with the error above.
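(If it helps the debugging: "failed to get the suspend evtchn port" suggests
checking the xenstore node the tools read. A sketch — the exact path is from
memory and may differ between versions:)

```
# From dom0, check whether the guest advertised its suspend event channel
# (path is an assumption; get <domid> from "xm list"):
#   xenstore-read /local/domain/<domid>/device/suspend/event-channel
# If the node is missing, the guest kernel never registered a dedicated
# suspend event channel with the toolstack.
```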

-- Pasi


* RE: pv 2.6.31 (kernel.org) and save/migrate, domU BUG()
  2009-11-08 15:41                 ` Pasi Kärkkäinen
  2009-11-08 16:48                   ` Pasi Kärkkäinen
@ 2009-11-08 16:54                   ` Dan Magenheimer
  2009-11-08 17:27                     ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen
  2009-11-12 23:21                     ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Jeremy Fitzhardinge
  1 sibling, 2 replies; 30+ messages in thread
From: Dan Magenheimer @ 2009-11-08 16:54 UTC (permalink / raw)
  To: "Pasi Kärkkäinen"; +Cc: Xen-Devel (E-mail)

[-- Attachment #1: Type: text/plain, Size: 1688 bytes --]

> > Ok, so it appears there is something problematic with
> > saving an upstream kernel.  It might be (partially) fixed
> > in Fedora 12 or maybe there is some other environmental
> > difference which makes save fail entirely on my system.
> > 
> 
> Yeah, fedora kernel has some patches, but it should be pretty 
> close to upstream kernel..
> 
> btw was your guest UP or SMP? Mine was UP..

Mine was SMP... switching to UP I can now save.  BUT...
restore doesn't seem to quite work.  The restore completes
but I get no response from the VNC console.  When I
use a tty console, after restore, I am getting
an infinite dump of

WARNING: at arch/x86/xen/time.c:180 xen_sched_clock+0x2b

(see attached).

Did you try restore on Fedora 12?
 
> > > > The results explain why I can get it to run on
> > > > one machine (an older laptop) and not run on another
> > > > machine (a Nehalem system)... looks like this is maybe
> > > > related to the cpuid-extended-topology-leaf bug that Jeremy
> > > > sent a fix for upstream recently.
> > > 
> > > Did you try with that patch applied? 
> > 
> > No, the patch wasn't posted, just a pull request to Linus,
> > so I don't have the patch (and am not a git expert so
> > am not sure how to get it).
> > 
> > http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00182.html
> >
> > So I'll try it again when .6 or .7 is available.
> 
> See here for changelog:
> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=shortlog;h=bugfix
> 
> You can get the diffs/patches from there using the links..

Thanks.  Yes, Jeremy's patch allows 2.6.31.5 (in a PV domain)
to completely boot on my Nehalem box.

[-- Attachment #2: restore.out --]
[-- Type: application/octet-stream, Size: 9696 bytes --]

------------[ cut here ]------------
WARNING: at arch/x86/xen/time.c:180 xen_sched_clock+0x2b/0x5a()
Modules linked in: autofs4 hidp nfs lockd nfs_acl auth_rpcgss rfcomm l2cap bluetooth rfkill sunrpc iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_multipath parport_pc lp parport rtc_core rtc_lib pcspkr joydev dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 16, comm: xenwatch Tainted: G        W  2.6.31.5 #4
Call Trace:
 [<c0405cc3>] ? xen_sched_clock+0x2b/0x5a
 [<c0430540>] ? warn_slowpath_common+0x5e/0x71
 [<c043055d>] ? warn_slowpath_null+0xa/0xc
 [<c0405cc3>] ? xen_sched_clock+0x2b/0x5a
 [<c040b4b4>] ? sched_clock+0x8/0x18
 [<c0444de2>] ? cpu_clock+0x1d/0x33
 [<c0405e9c>] ? check_events+0x8/0xc
 [<c045b2ae>] ? get_timestamp+0x5/0xd
 [<c045b2cf>] ? __touch_softlockup_watchdog+0x19/0x1f
 [<c0438a2b>] ? update_process_times+0x21/0x49
 [<c0449a5b>] ? tick_periodic+0x60/0x6a
 [<c0449ab9>] ? tick_handle_periodic+0x54/0x5b
 [<c0405ad1>] ? xen_timer_interrupt+0x26/0x17f
 [<c04223da>] ? __wake_up_common+0x2e/0x58
 [<c0405870>] ? xen_force_evtchn_callback+0xc/0x10
 [<c0405e9c>] ? check_events+0x8/0xc
 [<c05714bd>] ? unmask_evtchn+0x2c/0xc6
 [<c0405e93>] ? xen_restore_fl_direct_end+0x0/0x1
 [<c045b8d7>] ? handle_IRQ_event+0x4e/0xf1
 [<c045d010>] ? handle_level_irq+0x69/0xad
 [<c045cfa7>] ? handle_level_irq+0x0/0xad
 <IRQ>  [<c0571b64>] ? xen_evtchn_do_upcall+0xa2/0x11b
 [<c0408387>] ? xen_do_upcall+0x7/0xc
 [<c0402227>] ? hypercall_page+0x227/0x1001
 [<c0405870>] ? xen_force_evtchn_callback+0xc/0x10
 [<c0405e9c>] ? check_events+0x8/0xc
 [<c0405e5b>] ? xen_irq_enable_direct_end+0x0/0x1
 [<c0427777>] ? finish_task_switch+0x52/0xa4
 [<c0676e53>] ? schedule+0x764/0x7c9
 [<c0405870>] ? xen_force_evtchn_callback+0xc/0x10
 [<c0405e93>] ? xen_restore_fl_direct_end+0x0/0x1
 [<c067824e>] ? _spin_unlock_irqrestore+0xe/0x10
 [<c0405e9c>] ? check_events+0x8/0xc
 [<c042b237>] ? __cond_resched+0x13/0x2f
 [<c0676f0b>] ? _cond_resched+0x18/0x21
 [<c043e17d>] ? flush_workqueue+0x1d/0x4a
 [<c0572433>] ? xen_suspend+0x0/0xc4
 [<c0452c0e>] ? __stop_machine+0xbf/0xd6
 [<c0572433>] ? xen_suspend+0x0/0xc4
 [<c0452da8>] ? stop_machine+0x25/0x39
 [<c0572663>] ? shutdown_handler+0x16c/0x1d9
 [<c0573c2e>] ? xenwatch_thread+0xc8/0xee
 [<c04414f4>] ? autoremove_wake_function+0x0/0x2d
 [<c0573b66>] ? xenwatch_thread+0x0/0xee
 [<c0441450>] ? kthread+0x6e/0x76
 [<c04413e2>] ? kthread+0x0/0x76
 [<c0408337>] ? kernel_thread_helper+0x7/0x10
---[ end trace bb4cac02c28c9de1 ]---
------------[ cut here ]------------
WARNING: at arch/x86/xen/time.c:180 xen_sched_clock+0x2b/0x5a()
Modules linked in: autofs4 hidp nfs lockd nfs_acl auth_rpcgss rfcomm l2cap bluetooth rfkill sunrpc iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_multipath parport_pc lp parport rtc_core rtc_lib pcspkr joydev dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 16, comm: xenwatch Tainted: G        W  2.6.31.5 #4
Call Trace:
 [<c0405cc3>] ? xen_sched_clock+0x2b/0x5a
 [<c0430540>] ? warn_slowpath_common+0x5e/0x71
 [<c043055d>] ? warn_slowpath_null+0xa/0xc
 [<c0405cc3>] ? xen_sched_clock+0x2b/0x5a
 [<c040b4b4>] ? sched_clock+0x8/0x18
 [<c042e372>] ? scheduler_tick+0x44/0x10c
 [<c0438a49>] ? update_process_times+0x3f/0x49
 [<c0449a5b>] ? tick_periodic+0x60/0x6a
 [<c0449ab9>] ? tick_handle_periodic+0x54/0x5b
 [<c0405ad1>] ? xen_timer_interrupt+0x26/0x17f
 [<c04223da>] ? __wake_up_common+0x2e/0x58
 [<c0405870>] ? xen_force_evtchn_callback+0xc/0x10
 [<c0405e9c>] ? check_events+0x8/0xc
 [<c05714bd>] ? unmask_evtchn+0x2c/0xc6
 [<c0405e93>] ? xen_restore_fl_direct_end+0x0/0x1
 [<c045b8d7>] ? handle_IRQ_event+0x4e/0xf1
 [<c045d010>] ? handle_level_irq+0x69/0xad
 [<c045cfa7>] ? handle_level_irq+0x0/0xad
 <IRQ>  [<c0571b64>] ? xen_evtchn_do_upcall+0xa2/0x11b
 [<c0408387>] ? xen_do_upcall+0x7/0xc
 [<c0402227>] ? hypercall_page+0x227/0x1001
 [<c0405870>] ? xen_force_evtchn_callback+0xc/0x10
 [<c0405e9c>] ? check_events+0x8/0xc
 [<c0405e5b>] ? xen_irq_enable_direct_end+0x0/0x1
 [<c0427777>] ? finish_task_switch+0x52/0xa4
 [<c0676e53>] ? schedule+0x764/0x7c9
 [<c0405870>] ? xen_force_evtchn_callback+0xc/0x10
 [<c0405e93>] ? xen_restore_fl_direct_end+0x0/0x1
 [<c067824e>] ? _spin_unlock_irqrestore+0xe/0x10
 [<c0405e9c>] ? check_events+0x8/0xc
 [<c042b237>] ? __cond_resched+0x13/0x2f
 [<c0676f0b>] ? _cond_resched+0x18/0x21
 [<c043e17d>] ? flush_workqueue+0x1d/0x4a
 [<c0572433>] ? xen_suspend+0x0/0xc4
 [<c0452c0e>] ? __stop_machine+0xbf/0xd6
 [<c0572433>] ? xen_suspend+0x0/0xc4
 [<c0452da8>] ? stop_machine+0x25/0x39
 [<c0572663>] ? shutdown_handler+0x16c/0x1d9
 [<c0573c2e>] ? xenwatch_thread+0xc8/0xee
 [<c04414f4>] ? autoremove_wake_function+0x0/0x2d
 [<c0573b66>] ? xenwatch_thread+0x0/0xee
 [<c0441450>] ? kthread+0x6e/0x76
 [<c04413e2>] ? kthread+0x0/0x76
 [<c0408337>] ? kernel_thread_helper+0x7/0x10
---[ end trace bb4cac02c28c9de2 ]---
[... the same two warnings repeated indefinitely; remaining console output truncated ...]
[root@dmagenhe-nsvpn-dhcp-141-144-22-8 OVM_EL5U2_X86_PVM_10GB]# 


[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG
  2009-11-08 16:54                   ` Dan Magenheimer
@ 2009-11-08 17:27                     ` Pasi Kärkkäinen
  2009-11-10 10:08                       ` pv 2.6.31 (kernel.org) and save/migrate fails, " Pasi Kärkkäinen
  2009-11-12 23:21                     ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Jeremy Fitzhardinge
  1 sibling, 1 reply; 30+ messages in thread
From: Pasi Kärkkäinen @ 2009-11-08 17:27 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail)

On Sun, Nov 08, 2009 at 08:54:23AM -0800, Dan Magenheimer wrote:
> > > Ok, so it appears there is something problematic with
> > > saving an upstream kernel.  It might be (partially) fixed
> > > in Fedora 12 or maybe there is some other environmental
> > > difference which makes save fail entirely on my system.
> > > 
> > 
> > Yeah, fedora kernel has some patches, but it should be pretty 
> > close to upstream kernel..
> > 
> > btw was your guest UP or SMP? Mine was UP..
> 
> Mine was SMP... switching to UP I can now save.  BUT...
> restore doesn't seem to quite work.  The restore completes
> but I get no response from the VNC console.  When I
> use a tty console, after restore, I am getting
> an infinite dump of
> 
> WARNING: at arch/x86/xen/time.c:180 xen_sched_clock+0x2b
> 
> (see attached).
> 
> Did you try restore on Fedora 12?
>  

Yeah. save+restore for UP F12 guest works for me 
(except I get that non-fatal BUG on the guest).

SMP guest doesn't work.. save crashes it.

> > > > > The results explain why I can get it to run on
> > > > > one machine (an older laptop) and not run on another
> > > > > machine (a Nehalem system)... looks like this is maybe
> > > > > related to the cpuid-extended-topology-leaf bug that Jeremy
> > > > > sent a fix for upstream recently.
> > > > 
> > > > Did you try with that patch applied? 
> > > 
> > > No, the patch wasn't posted, just a pull request to Linus,
> > > so I don't have the patch (and am not a git expert so
> > > am not sure how to get it).
> > > 
> > > http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00182.html
> > >
> > > So I'll try it again when .6 or .7 is available.
> > 
> > See here for changelog:
> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=shortlog;h=bugfix
> > 
> > You can get the diffs/patches from there using the links..
> 
> Thanks.  Yes, Jeremy's patch allows 2.6.31.5 (in a PV domain)
> to completely boot on my Nehalem box.

Ok. But I guess those don't help with the save+restore problem..

-- Pasi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG
  2009-11-08 17:27                     ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen
@ 2009-11-10 10:08                       ` Pasi Kärkkäinen
  2009-11-12 23:36                         ` Jeremy Fitzhardinge
  2009-11-23 16:44                         ` pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG Ian Campbell
  0 siblings, 2 replies; 30+ messages in thread
From: Pasi Kärkkäinen @ 2009-11-10 10:08 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail)

Hello,

Jeremy: Here's a summary of the save/restore problems
with an upstream Linux 2.6.31.5 PV guest.

For me:
	- I can "xm save" + "xm restore" UP guest, but I get non-fatal
	  BUG in the guest kernel, see [1].
	- "xm save" fails for SMP guest with "failed to get the suspend evtchn port", see [2].

For Dan:
	- "xm save" works for UP guest, but "xm restore" doesn't, giving
	  infinite xen_sched_clock related dumps in the guest kernel, see [3].
	- "xm save" for SMP guest fails, it never ends. I suspect this
	  is the same problem I'm seeing.


[1] non-fatal BUG on the guest kernel after "xm restore":
http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86_64-saverestore.txt

[2] "xm log" contains:
[2009-11-09 23:44:38 1353] DEBUG (XendCheckpoint:110) [xc_save]: /usr/lib64/xen/bin/xc_save 28 2 0 0 0
[2009-11-09 23:44:38 1353] INFO (XendCheckpoint:417) xc_save: failed to get the suspend evtchn port

[3] See the attachment in this email:
http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00391.html


Any tips on how to debug these? 

-- Pasi


On Sun, Nov 08, 2009 at 07:27:47PM +0200, Pasi Kärkkäinen wrote:
> On Sun, Nov 08, 2009 at 08:54:23AM -0800, Dan Magenheimer wrote:
> > > > Ok, so it appears there is something problematic with
> > > > saving an upstream kernel.  It might be (partially) fixed
> > > > in Fedora 12 or maybe there is some other environmental
> > > > difference which makes save fail entirely on my system.
> > > > 
> > > 
> > > Yeah, fedora kernel has some patches, but it should be pretty 
> > > close to upstream kernel..
> > > 
> > > btw was your guest UP or SMP? Mine was UP..
> > 
> > Mine was SMP... switching to UP I can now save.  BUT...
> > restore doesn't seem to quite work.  The restore completes
> > but I get no response from the VNC console.  When I
> > use a tty console, after restore, I am getting
> > an infinite dump of
> > 
> > WARNING: at arch/x86/xen/time.c:180 xen_sched_clock+0x2b
> > 
> > (see attached).
> > 
> > Did you try restore on Fedora 12?
> >  
> 
> Yeah. save+restore for UP F12 guest works for me 
> (except I get that non-fatal BUG on the guest).
> 
> SMP guest doesn't work.. save crashes it.
> 
> > > > > > The results explain why I can get it to run on
> > > > > > one machine (an older laptop) and not run on another
> > > > > > machine (a Nehalem system)... looks like this is maybe
> > > > > > related to the cpuid-extended-topology-leaf bug that Jeremy
> > > > > > sent a fix for upstream recently.
> > > > > 
> > > > > Did you try with that patch applied? 
> > > > 
> > > > No, the patch wasn't posted, just a pull request to Linus,
> > > > so I don't have the patch (and am not a git expert so
> > > > am not sure how to get it).
> > > > 
> > > > http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00182.html
> > > >
> > > > So I'll try it again when .6 or .7 is available.
> > > 
> > > See here for changelog:
> > > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=shortlog;h=bugfix
> > > 
> > > You can get the diffs/patches from there using the links..
> > 
> > Thanks.  Yes, Jeremy's patch allows 2.6.31.5 (in a PV domain)
> > to completely boot on my Nehalem box.
> 
> Ok. But I guess those don't help with the save+restore problem..
> 
> -- Pasi
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG()
  2009-11-08 16:48                   ` Pasi Kärkkäinen
@ 2009-11-12 23:16                     ` Jeremy Fitzhardinge
  2009-11-12 23:22                       ` Brendan Cully
  0 siblings, 1 reply; 30+ messages in thread
From: Jeremy Fitzhardinge @ 2009-11-12 23:16 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Dan Magenheimer, Xen-Devel (E-mail), Brendan Cully

On 11/08/09 08:48, Pasi Kärkkäinen wrote:
> Ok.. saving SMP guest fails for me too:
>
> [2009-11-09 23:44:38 1353] DEBUG (XendCheckpoint:110) [xc_save]: /usr/lib64/xen/bin/xc_save 28 2 0 0 0
> [2009-11-09 23:44:38 1353] INFO (XendCheckpoint:417) xc_save: failed to get the suspend evtchn port
>
> Jeremy: Ideas what's causing that? "xm save" for UP 2.6.31.5 guest works
> OK, but for SMP guest it fails with the error above.

There's no "suspend evtchn port" in a pvops kernel.  That looks like a
Remus thing.  I think.

    J

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG()
  2009-11-08 16:54                   ` Dan Magenheimer
  2009-11-08 17:27                     ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen
@ 2009-11-12 23:21                     ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 30+ messages in thread
From: Jeremy Fitzhardinge @ 2009-11-12 23:21 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail), Jan Beulich, Keir Fraser

On 11/08/09 08:54, Dan Magenheimer wrote:
> Mine was SMP... switching to UP I can now save.  BUT...
> restore doesn't seem to quite work.  The restore completes
> but I get no response from the VNC console.  When I
> use a tty console, after restore, I am getting
> an infinite dump of
>
> WARNING: at arch/x86/xen/time.c:180 xen_sched_clock+0x2b
>   

That means the check found that the CPU it's currently running on is
not currently running according to Xen...  It's hard to imagine how it
got into that state...
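
For reference, the check that fires is roughly the following (paraphrased
from 2.6.31's arch/x86/xen/time.c; the surrounding accounting code is
omitted):

```c
	/* xen_sched_clock(), excerpt: sample this VCPU's runstate as
	 * reported by Xen and warn if Xen claims it is not running. */
	get_runstate_snapshot(&state);
	WARN_ON(state.state != RUNSTATE_running);	/* line 180 */
```

So an infinite stream of these after restore suggests the per-VCPU
runstate area was not re-registered correctly when the domain resumed.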

    J

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG()
  2009-11-12 23:16                     ` Jeremy Fitzhardinge
@ 2009-11-12 23:22                       ` Brendan Cully
  0 siblings, 0 replies; 30+ messages in thread
From: Brendan Cully @ 2009-11-12 23:22 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Dan Magenheimer, Xen-Devel (E-mail)

On Thursday, 12 November 2009 at 15:16, Jeremy Fitzhardinge wrote:
> On 11/08/09 08:48, Pasi Kärkkäinen wrote:
> > Ok.. saving SMP guest fails for me too:
> >
> > [2009-11-09 23:44:38 1353] DEBUG (XendCheckpoint:110) [xc_save]: /usr/lib64/xen/bin/xc_save 28 2 0 0 0
> > [2009-11-09 23:44:38 1353] INFO (XendCheckpoint:417) xc_save: failed to get the suspend evtchn port
> >
> > Jeremy: Ideas what's causing that? "xm save" for UP 2.6.31.5 guest works
> > OK, but for SMP guest it fails with the error above.
> 
> There's no "suspend evtchn port" in a pvops kernel.  That looks like a
> Remus thing.  I think.

This is only an INFO-level message, because xc_save falls back to the
old xenstore method if it can't find a suspend event channel. I don't
know the context here, but this particular message ought to be
harmless.

The event channel was made for Remus, but regular xc_save also uses it
to reduce the downtime at the end of live migration.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG
  2009-11-10 10:08                       ` pv 2.6.31 (kernel.org) and save/migrate fails, " Pasi Kärkkäinen
@ 2009-11-12 23:36                         ` Jeremy Fitzhardinge
  2009-11-24 14:27                           ` Ian Campbell
  2009-11-23 16:44                         ` pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG Ian Campbell
  1 sibling, 1 reply; 30+ messages in thread
From: Jeremy Fitzhardinge @ 2009-11-12 23:36 UTC (permalink / raw)
  To: Pasi Kärkkäinen; +Cc: Dan Magenheimer, Xen-Devel (E-mail)

On 11/10/09 02:08, Pasi Kärkkäinen wrote:
> Hello,
>
> Jeremy: Here's a summary of the save/restore problems
> with an upstream Linux 2.6.31.5 PV guest.
>
> For me:
> 	- I can "xm save" + "xm restore" UP guest, but I get non-fatal
> 	  BUG in the guest kernel, see [1].
> 	- "xm save" fails for SMP guest with "failed to get the suspend evtchn port", see [2].
>
> For Dan:
> 	- "xm save" works for UP guest, but "xm restore" doesn't, giving
> 	  infinite xen_sched_clock related dumps in the guest kernel, see [3].
> 	- "xm save" for SMP guest fails, it never ends. I suspect this
> 	  is the same problem I'm seeing.
>
>
> [1] non-fatal BUG on the guest kernel after "xm restore":
> http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86_64-saverestore.txt
>   

Does this help:

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 10d03d7..da57ea1 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -43,7 +43,6 @@ static int xen_suspend(void *data)
 	if (err) {
 		printk(KERN_ERR "xen_suspend: sysdev_suspend failed: %d\n",
 			err);
-		dpm_resume_noirq(PMSG_RESUME);
 		return err;
 	}
 
@@ -69,7 +68,6 @@ static int xen_suspend(void *data)
 	}
 
 	sysdev_resume();
-	dpm_resume_noirq(PMSG_RESUME);
 
 	return 0;
 }
@@ -108,6 +106,9 @@ static void do_suspend(void)
 	}
 
 	err = stop_machine(xen_suspend, &cancelled, cpumask_of(0));
+
+	dpm_resume_noirq(PMSG_RESUME);
+
 	if (err) {
 		printk(KERN_ERR "failed to start xen_suspend: %d\n", err);
 		goto out;


> [2] "xm log" contains:
> [2009-11-09 23:44:38 1353] DEBUG (XendCheckpoint:110) [xc_save]: /usr/lib64/xen/bin/xc_save 28 2 0 0 0
> [2009-11-09 23:44:38 1353] INFO (XendCheckpoint:417) xc_save: failed to get the suspend evtchn port
>   

I think this may be a Remus side-effect.

> [3] See the attachment in this email:
> http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00391.html
>   

No idea about this one.  Needs a closer look.

    J

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG
  2009-11-10 10:08                       ` pv 2.6.31 (kernel.org) and save/migrate fails, " Pasi Kärkkäinen
  2009-11-12 23:36                         ` Jeremy Fitzhardinge
@ 2009-11-23 16:44                         ` Ian Campbell
  2009-11-24 10:27                           ` Ian Campbell
  1 sibling, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2009-11-23 16:44 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Dan Magenheimer, Xen-Devel (E-mail), Jeremy Fitzhardinge

On Tue, 2009-11-10 at 10:08 +0000, Pasi Kärkkäinen wrote:
> Hello,
> 
> Jeremy: Here's a summary of the save/restore problems
> with an upstream Linux 2.6.31.5 PV guest.
> 
> For me:
> 	- I can "xm save" + "xm restore" UP guest, but I get non-fatal
> 	  BUG in the guest kernel, see [1].
> 	- "xm save" fails for SMP guest with "failed to get the suspend evtchn port", see [2].
> 
> For Dan:
> 	- "xm save" works for UP guest, but "xm restore" doesn't, giving
> 	  infinite xen_sched_clock related dumps in the guest kernel, see [3].

The runstate fix I sent to the list last week should help with this one.

> 	- "xm save" for SMP guest fails, it never ends. I suspect this
> 	  is the same problem I'm seeing.

I'm seeing this (or something very like it) too. At the moment it looks
as if drivers/xen/manage.c:do_suspend is getting as far as the
stop_machine() call, but I never see the xen_suspend() callback run.

Ian.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG
  2009-11-23 16:44                         ` pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG Ian Campbell
@ 2009-11-24 10:27                           ` Ian Campbell
  0 siblings, 0 replies; 30+ messages in thread
From: Ian Campbell @ 2009-11-24 10:27 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Dan Magenheimer, Xen-Devel (E-mail), Jeremy Fitzhardinge

On Mon, 2009-11-23 at 16:44 +0000, Ian Campbell wrote:
> On Tue, 2009-11-10 at 10:08 +0000, Pasi Kärkkäinen wrote:
> > Hello,
> > 
> > Jeremy: Here's a summary of the save/restore problems
> > with an upstream Linux 2.6.31.5 PV guest.
> > 
> > For me:
> > 	- I can "xm save" + "xm restore" UP guest, but I get non-fatal
> > 	  BUG in the guest kernel, see [1].
> > 	- "xm save" fails for SMP guest with "failed to get the suspend evtchn port", see [2].
> > 
> > For Dan:
> > 	- "xm save" works for UP guest, but "xm restore" doesn't, giving
> > 	  infinite xen_sched_clock related dumps in the guest kernel, see [3].
> 
> The runstate fix I sent to the list last week should help with this one.
> 
> > 	- "xm save" for SMP guest fails, it never ends. I suspect this
> > 	  is the same problem I'm seeing.
> 
> I'm seeing this (or something very like it) too. At the moment it looks
> as if drivers/xen/manage.c:do_suspend is getting as far as the
> stop_machine() call, but I never see the xen_suspend() callback run.

See "xen: register timer interrupt with IRQF_TIMER" that I just sent to
the list for the fix.
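
For context, the gist of that fix (paraphrased; the exact flag set in
the committed patch may differ) is to register the per-VCPU timer
interrupt with IRQF_TIMER, so the generic suspend path's
suspend_device_irqs() does not disable it and stop_machine() can still
schedule its workers:

```c
	/* arch/x86/xen/time.c, xen_setup_timer() -- paraphrased change:
	 * add IRQF_TIMER so the timer keeps ticking across suspend. */
	irq = bind_virq_to_irqhandler(VIRQ_TIMER, cpu, xen_timer_interrupt,
				      IRQF_DISABLED | IRQF_PERCPU |
				      IRQF_NOBALANCING | IRQF_TIMER,
				      name, NULL);
```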

Ian.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG
  2009-11-12 23:36                         ` Jeremy Fitzhardinge
@ 2009-11-24 14:27                           ` Ian Campbell
  2009-11-25 14:12                             ` Ian Campbell
  0 siblings, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2009-11-24 14:27 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Dan Magenheimer, Xen-Devel (E-mail)

On Thu, 2009-11-12 at 23:36 +0000, Jeremy Fitzhardinge wrote:
> On 11/10/09 02:08, Pasi Kärkkäinen wrote:
> > Hello,
> >
> > Jeremy: Here's a summary of the save/restore problems
> > with an upstream Linux 2.6.31.5 PV guest.
> >
> > For me:
> > 	- I can "xm save" + "xm restore" UP guest, but I get non-fatal
> > 	  BUG in the guest kernel, see [1].
> > 	- "xm save" fails for SMP guest with "failed to get the suspend evtchn port", see [2].
> >
> > For Dan:
> > 	- "xm save" works for UP guest, but "xm restore" doesn't, giving
> > 	  infinite xen_sched_clock related dumps in the guest kernel, see [3].
> > 	- "xm save" for SMP guest fails, it never ends. I suspect this
> > 	  is the same problem I'm seeing.
> >
> >
> > [1] non-fatal BUG on the guest kernel after "xm restore":
> > http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86_64-saverestore.txt
> >   
> 
> Does this help:

It does for me. There's another dpm_resume_noirq(PMSG_RESUME) a little
later in do_suspend() which I think needs to be dropped as well.

I'm still seeing other problems with resume: the system hangs on
restore and the RCU stall detection logic triggers. Unfortunately
arch_trigger_all_cpu_backtrace is not Xen-compatible (it uses the APIC
directly), so I don't get much useful info out of it. It's most likely
a symptom of the actual problem rather than a problem with RCU per se
anyhow.

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 10d03d7..7b69a1a 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -43,7 +43,6 @@ static int xen_suspend(void *data)
 	if (err) {
 		printk(KERN_ERR "xen_suspend: sysdev_suspend failed: %d\n",
 			err);
-		dpm_resume_noirq(PMSG_RESUME);
 		return err;
 	}
 
@@ -69,7 +68,6 @@ static int xen_suspend(void *data)
 	}
 
 	sysdev_resume();
-	dpm_resume_noirq(PMSG_RESUME);
 
 	return 0;
 }
@@ -108,6 +106,9 @@ static void do_suspend(void)
 	}
 
 	err = stop_machine(xen_suspend, &cancelled, cpumask_of(0));
+
+	dpm_resume_noirq(PMSG_RESUME);
+
 	if (err) {
 		printk(KERN_ERR "failed to start xen_suspend: %d\n", err);
 		goto out;
@@ -119,8 +120,6 @@ static void do_suspend(void)
 	} else
 		xs_suspend_cancel();
 
-	dpm_resume_noirq(PMSG_RESUME);
-
 resume_devices:
 	dpm_resume_end(PMSG_RESUME);
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG
  2009-11-24 14:27                           ` Ian Campbell
@ 2009-11-25 14:12                             ` Ian Campbell
  2009-11-25 19:28                               ` Jeremy Fitzhardinge
                                                 ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Ian Campbell @ 2009-11-25 14:12 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Dan Magenheimer, Xen-Devel (E-mail)

On Tue, 2009-11-24 at 14:27 +0000, Ian Campbell wrote:
> 
> I'm still seeing other problems with resume: the system hangs on
> restore and the RCU stall detection logic triggers. Unfortunately
> arch_trigger_all_cpu_backtrace is not Xen-compatible (it uses the APIC
> directly), so I don't get much useful info out of it. It's most likely
> a symptom of the actual problem rather than a problem with RCU per se
> anyhow. 

tick_resume() is never called on secondary processors. Presumably this
is because they are offlined for suspend on native and so this is
normally taken care of in the CPU onlining path. Under Xen we keep all
CPUs online over a suspend.

This patch papers over the issue for me, but I will investigate a more
generic, less hacky way of doing the same.

tick_suspend() is also only called on the boot CPU, which I presume
should be fixed too.

Ian.

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 6343a5d..cdfeed2 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -1,4 +1,5 @@
 #include <linux/types.h>
+#include <linux/clockchips.h>
 
 #include <xen/interface/xen.h>
 #include <xen/grant_table.h>
@@ -46,7 +50,19 @@ void xen_post_suspend(int suspend_cancelled)
 
 }
 
+static void xen_vcpu_notify_restore(void *data)
+{
+	unsigned long reason = (unsigned long)data;
+
+	/* Boot processor notified via generic timekeeping_resume() */
+	if (smp_processor_id() == 0)
+		return;
+
+	clockevents_notify(reason, NULL);
+}
+
 void xen_arch_resume(void)
 {
-	/* nothing */
+	smp_call_function_many(cpu_online_mask, xen_vcpu_notify_restore,
+			       (void *)CLOCK_EVT_NOTIFY_RESUME, 1);
 }

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG
  2009-11-25 14:12                             ` Ian Campbell
@ 2009-11-25 19:28                               ` Jeremy Fitzhardinge
  2009-11-25 20:03                                 ` Ian Campbell
  2009-12-01 11:47                               ` [PATCH] xen: improve error handling in do_suspend Ian Campbell
  2009-12-01 11:47                               ` [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region Ian Campbell
  2 siblings, 1 reply; 30+ messages in thread
From: Jeremy Fitzhardinge @ 2009-11-25 19:28 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Rafael J. Wysocki, Dan Magenheimer, Xen-Devel (E-mail), Thomas Gleixner

On 11/25/09 06:12, Ian Campbell wrote:
> tick_resume() is never called on secondary processors. Presumably this
> is because they are offlined for suspend on native and so this is
> normally taken care of in the CPU onlining path. Under Xen we keep all
> CPUs online over a suspend.
>
> This patch papers over the issue for me but I will investigate a more
> generic, less hacky way of doing the same.
>
> tick_suspend is also only called on the boot CPU which I presume should
> be fixed too.
>   

Yep.  I wonder how it ever worked?  There's been a fair amount of change
in the PM code, so that could have changed things.  I don't know if
there's a deep reason for not calling tick_resume() on all processors.

Rafael, tglx: suspend/resume under Xen doesn't need to hot unplug all
the CPUs, so we don't; the hypervisor can manage the context
save/restore for all CPUs.  Is there a deep reason why
timekeeping_resume() can't call the CLOCK_EVT_NOTIFY_RESUME notifier on
all online CPUs?

>  void xen_arch_resume(void)
>  {
> -	/* nothing */
> +	smp_call_function_many(cpu_online_mask, xen_vcpu_notify_restore,
> +			       (void *)CLOCK_EVT_NOTIFY_RESUME, 1);
>  }
>   

This is equivalent to smp_call_function().

    J

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG
  2009-11-25 19:28                               ` Jeremy Fitzhardinge
@ 2009-11-25 20:03                                 ` Ian Campbell
  2009-11-25 20:32                                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2009-11-25 20:03 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Rafael J. Wysocki, Dan Magenheimer, Xen-Devel (E-mail), Gleixner, Thomas

On Wed, 2009-11-25 at 19:28 +0000, Jeremy Fitzhardinge wrote: 
> On 11/25/09 06:12, Ian Campbell wrote:
> > tick_resume() is never called on secondary processors. Presumably this
> > is because they are offlined for suspend on native and so this is
> > normally taken care of in the CPU onlining path. Under Xen we keep all
> > CPUs online over a suspend.
> >
> > This patch papers over the issue for me but I will investigate a more
> > generic, less hacky way of doing the same.
> >
> > tick_suspend is also only called on the boot CPU which I presume should
> > be fixed too.
> >   
> 
> Yep.  I wonder how it ever worked?  There's been a fair amount of change
> in the PM code, so that could have changed things.  I don't know if
> there's a deep reason for not calling tick_resume() on all processors.
> 
> Rafael, tglx: suspend/resume under Xen doesn't need to hot unplug all
> the CPUs, so we don't; the hypervisor can manage the context
> save/restore for all CPUs.  Is there a deep reason why
> timekeeping_resume() can't call the CLOCK_EVT_NOTIFY_RESUME notifier on
> all online CPUs?

Interrupts are disabled at that point where it currently calls the
notifier, so none of the SMP function call primitives work.

> >  void xen_arch_resume(void)
> >  {
> > -	/* nothing */
> > +	smp_call_function_many(cpu_online_mask, xen_vcpu_notify_restore,
> > +			       (void *)CLOCK_EVT_NOTIFY_RESUME, 1);
> >  }
> >   
> 
> This is equivalent to smp_call_function().

Oh yeah.

Ian.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG
  2009-11-25 20:03                                 ` Ian Campbell
@ 2009-11-25 20:32                                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 30+ messages in thread
From: Jeremy Fitzhardinge @ 2009-11-25 20:32 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Rafael J. Wysocki, Dan Magenheimer, Xen-Devel (E-mail), Thomas Gleixner

On 11/25/09 12:03, Ian Campbell wrote:
>> Yep.  I wonder how it ever worked?  There's been a fair amount of change
>> in the PM code, so that could have changed things.  I don't know if
>> there's a deep reason for not calling tick_resume() on all processors.
>>
>> Rafael, tglx: suspend/resume under Xen doesn't need to hot unplug all
>> the CPUs, so we don't; the hypervisor can manage the context
>> save/restore for all CPUs.  Is there a deep reason why
>> timekeeping_resume() can't call the CLOCK_EVT_NOTIFY_RESUME notifier on
>> all online CPUs?
>>     
> Interrupts are disabled at that point where it currently calls the
> notifier, so none of the SMP function call primitives work.
>   

That does make it pretty awkward.


    J

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH] xen: improve error handling in do_suspend.
  2009-11-25 14:12                             ` Ian Campbell
  2009-11-25 19:28                               ` Jeremy Fitzhardinge
@ 2009-12-01 11:47                               ` Ian Campbell
  2009-12-01 11:47                               ` [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region Ian Campbell
  2 siblings, 0 replies; 30+ messages in thread
From: Ian Campbell @ 2009-12-01 11:47 UTC (permalink / raw)
  To: xen-devel; +Cc: Jeremy Fitzhardinge, Ian Campbell

The existing error handling has a few issues:
- If freeze_processes() fails it exits with shutting_down = SHUTDOWN_SUSPEND.
- If dpm_suspend_noirq() fails it exits without resuming xenbus.
- If stop_machine() fails it exits without resuming xenbus or calling
  dpm_resume_end().
- xs_suspend()/xs_resume() and dpm_suspend_noirq()/dpm_resume_noirq() were not
  nested in the obvious way.

Fix by ensuring each failure case goto's the correct label. Treat a failure of
stop_machine() as a cancelled suspend in order to follow the correct resume
path.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
---
 drivers/xen/manage.c |   20 +++++++++++---------
 1 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 7b69a1a..2fb7d39 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -86,32 +86,32 @@ static void do_suspend(void)
 	err = freeze_processes();
 	if (err) {
 		printk(KERN_ERR "xen suspend: freeze failed %d\n", err);
-		return;
+		goto out;
 	}
 #endif
 
 	err = dpm_suspend_start(PMSG_SUSPEND);
 	if (err) {
 		printk(KERN_ERR "xen suspend: dpm_suspend_start %d\n", err);
-		goto out;
+		goto out_thaw;
 	}
 
-	printk(KERN_DEBUG "suspending xenstore...\n");
-	xs_suspend();
-
 	err = dpm_suspend_noirq(PMSG_SUSPEND);
 	if (err) {
 		printk(KERN_ERR "dpm_suspend_noirq failed: %d\n", err);
-		goto resume_devices;
+		goto out_resume;
 	}
 
+	printk(KERN_DEBUG "suspending xenstore...\n");
+	xs_suspend();
+
 	err = stop_machine(xen_suspend, &cancelled, cpumask_of(0));
 
 	dpm_resume_noirq(PMSG_RESUME);
 
 	if (err) {
 		printk(KERN_ERR "failed to start xen_suspend: %d\n", err);
-		goto out;
+		cancelled = 1;
 	}
 
 	if (!cancelled) {
@@ -120,15 +120,17 @@ static void do_suspend(void)
 	} else
 		xs_suspend_cancel();
 
-resume_devices:
+out_resume:
 	dpm_resume_end(PMSG_RESUME);
 
 	/* Make sure timer events get retriggered on all CPUs */
 	clock_was_set();
-out:
+
+out_thaw:
 #ifdef CONFIG_PREEMPT
 	thaw_processes();
 #endif
+out:
 	shutting_down = SHUTDOWN_INVALID;
 }
 #endif	/* CONFIG_PM_SLEEP */
-- 
1.5.6.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region.
  2009-11-25 14:12                             ` Ian Campbell
  2009-11-25 19:28                               ` Jeremy Fitzhardinge
  2009-12-01 11:47                               ` [PATCH] xen: improve error handling in do_suspend Ian Campbell
@ 2009-12-01 11:47                               ` Ian Campbell
  2009-12-01 22:50                                 ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2009-12-01 11:47 UTC (permalink / raw)
  To: xen-devel; +Cc: Jeremy Fitzhardinge, Ian Campbell

I have observed cases where the implicit stop_machine_destroy() done by
stop_machine() hangs while destroying the workqueues, specifically in
kthread_stop(). This seems to be because timer ticks are not restarted
until after stop_machine() returns.

Fortunately stop_machine provides a facility to pre-create/post-destroy the
workqueues so use this to ensure that workqueues are only destroyed after
everything is really up and running again.

I only actually observed this failure with 2.6.30. It seems that newer kernels
are somehow more robust against doing kthread_stop() without timer interrupts
(I tried backports of some likely-looking candidates but did not track down
the commit which added this robustness). However this change seems like a
reasonable belt-and-braces thing to do.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
---
 drivers/xen/manage.c |   12 +++++++++++-
 1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 2fb7d39..c499793 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -79,6 +79,12 @@ static void do_suspend(void)
 
 	shutting_down = SHUTDOWN_SUSPEND;
 
+	err = stop_machine_create();
+	if (err) {
+		printk(KERN_ERR "xen suspend: failed to setup stop_machine %d\n", err);
+		goto out;
+	}
+
 #ifdef CONFIG_PREEMPT
 	/* If the kernel is preemptible, we need to freeze all the processes
 	   to prevent them from being in the middle of a pagetable update
@@ -86,7 +92,7 @@ static void do_suspend(void)
 	err = freeze_processes();
 	if (err) {
 		printk(KERN_ERR "xen suspend: freeze failed %d\n", err);
-		goto out;
+		goto out_destroy_sm;
 	}
 #endif
 
@@ -129,7 +135,11 @@ out_resume:
 out_thaw:
 #ifdef CONFIG_PREEMPT
 	thaw_processes();
+
+out_destroy_sm:
 #endif
+	stop_machine_destroy();
+
 out:
 	shutting_down = SHUTDOWN_INVALID;
 }
-- 
1.5.6.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region.
  2009-12-01 11:47                               ` [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region Ian Campbell
@ 2009-12-01 22:50                                 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 30+ messages in thread
From: Jeremy Fitzhardinge @ 2009-12-01 22:50 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel

On 12/01/09 03:47, Ian Campbell wrote:
> I have observed cases where the implicit stop_machine_destroy() done by
> stop_machine() hangs while destroying the workqueues, specifically in
> kthread_stop(). This seems to be because timer ticks are not restarted
> until after stop_machine() returns.
>   

Thanks for these - applied.

    J

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2009-12-01 22:50 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
2009-11-06 18:37 pv 2.6.31 (kernel.org) and save/migrate Dan Magenheimer
2009-11-06 20:37 ` Pasi Kärkkäinen
2009-11-06 22:27   ` Dan Magenheimer
2009-11-06 22:30     ` Pasi Kärkkäinen
2009-11-07  0:08       ` Dan Magenheimer
2009-11-07 11:09         ` Pasi Kärkkäinen
2009-11-07 15:32           ` Dan Magenheimer
2009-11-08 14:17             ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Pasi Kärkkäinen
2009-11-08 14:20               ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen
2009-11-08 15:29               ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Dan Magenheimer
2009-11-08 15:41                 ` Pasi Kärkkäinen
2009-11-08 16:48                   ` Pasi Kärkkäinen
2009-11-12 23:16                     ` Jeremy Fitzhardinge
2009-11-12 23:22                       ` Brendan Cully
2009-11-08 16:54                   ` Dan Magenheimer
2009-11-08 17:27                     ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen
2009-11-10 10:08                       ` pv 2.6.31 (kernel.org) and save/migrate fails, " Pasi Kärkkäinen
2009-11-12 23:36                         ` Jeremy Fitzhardinge
2009-11-24 14:27                           ` Ian Campbell
2009-11-25 14:12                             ` Ian Campbell
2009-11-25 19:28                               ` Jeremy Fitzhardinge
2009-11-25 20:03                                 ` Ian Campbell
2009-11-25 20:32                                   ` Jeremy Fitzhardinge
2009-12-01 11:47                               ` [PATCH] xen: improve error handling in do_suspend Ian Campbell
2009-12-01 11:47                               ` [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region Ian Campbell
2009-12-01 22:50                                 ` Jeremy Fitzhardinge
2009-11-23 16:44                         ` pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG Ian Campbell
2009-11-24 10:27                           ` Ian Campbell
2009-11-12 23:21                     ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Jeremy Fitzhardinge
2009-11-07  0:19   ` pv 2.6.31 (kernel.org) and save/migrate Jeremy Fitzhardinge
