* pv 2.6.31 (kernel.org) and save/migrate
@ 2009-11-06 18:37 Dan Magenheimer
  2009-11-06 20:37 ` Pasi Kärkkäinen
  0 siblings, 1 reply; 30+ messages in thread

From: Dan Magenheimer @ 2009-11-06 18:37 UTC (permalink / raw)
To: Xen-Devel (E-mail)

Sorry for another possibly stupid question:

I've observed that for a pv domain that's been updated to a 2.6.31
kernel (straight from kernel.org), "xm save" never completes. When the
older kernel (2.6.18) is booted, "xm save" works fine. Is this a known
problem... or perhaps xm save has never worked with an upstream pv
kernel and I've never noticed?

I'd assume migrate and live migrate would fail also but haven't tried
them.

Thanks,
Dan

P.S. This is with very recent xen-unstable, c/s 20399.

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate
  2009-11-06 18:37 pv 2.6.31 (kernel.org) and save/migrate Dan Magenheimer
@ 2009-11-06 20:37 ` Pasi Kärkkäinen
  2009-11-06 22:27   ` Dan Magenheimer
  2009-11-07  0:19   ` pv 2.6.31 (kernel.org) and save/migrate Jeremy Fitzhardinge
  0 siblings, 2 replies; 30+ messages in thread

From: Pasi Kärkkäinen @ 2009-11-06 20:37 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: Xen-Devel (E-mail)

On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote:
> Sorry for another possibly stupid question:
>
> I've observed that for a pv domain that's been updated
> to a 2.6.31 kernel (straight from kernel.org), "xm save"
> never completes. When the older kernel (2.6.18)
> is booted, "xm save" works fine. Is this a known problem...
> or perhaps xm save has never worked with an upstream pv
> kernel and I've never noticed?
>
> I'd assume migrate and live migrate would fail also but
> haven't tried them.
>

Just checking.. are you running the latest 2.6.31.5? I think there have
been multiple Xen-related bugfixes in the 2.6.31.X releases.

-- Pasi

^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: pv 2.6.31 (kernel.org) and save/migrate
  2009-11-06 20:37 ` Pasi Kärkkäinen
@ 2009-11-06 22:27 ` Dan Magenheimer
  2009-11-06 22:30   ` Pasi Kärkkäinen
  2009-11-07  0:19   ` pv 2.6.31 (kernel.org) and save/migrate Jeremy Fitzhardinge
  1 sibling, 1 reply; 30+ messages in thread

From: Dan Magenheimer @ 2009-11-06 22:27 UTC (permalink / raw)
To: "Pasi Kärkkäinen"; +Cc: Xen-Devel (E-mail)

> On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote:
> > Sorry for another possibly stupid question:
> >
> > I've observed that for a pv domain that's been updated
> > to a 2.6.31 kernel (straight from kernel.org), "xm save"
> > never completes. When the older kernel (2.6.18)
> > is booted, "xm save" works fine. Is this a known problem...
> > or perhaps xm save has never worked with an upstream pv
> > kernel and I've never noticed?
> >
> > I'd assume migrate and live migrate would fail also but
> > haven't tried them.
> >
>
> Just checking.. are you running the latest 2.6.31.5 ? I think
> there has
> been multiple xen related bugfixes in the 2.6.31.X releases.
>
> -- Pasi

No it was plain 2.6.31. But I downloaded/built 2.6.31.5 and can't even
get it to boot (and no console or VNC output at all). Are CONFIG
changes required between 2.6.31 and 2.6.31.5 for Xen? (I checked and I
am using the same .config.)

Trying to reproduce on a different machine, just to verify.

Dan

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate 2009-11-06 22:27 ` Dan Magenheimer @ 2009-11-06 22:30 ` Pasi Kärkkäinen 2009-11-07 0:08 ` Dan Magenheimer 0 siblings, 1 reply; 30+ messages in thread From: Pasi Kärkkäinen @ 2009-11-06 22:30 UTC (permalink / raw) To: Dan Magenheimer; +Cc: Xen-Devel (E-mail) On Fri, Nov 06, 2009 at 02:27:27PM -0800, Dan Magenheimer wrote: > > On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote: > > > Sorry for another possibly stupid question: > > > > > > I've observed that for a pv domain that's been updated > > > to a 2.6.31 kernel (straight from kernel.org), "xm save" > > > never completes. When the older kernel (2.6.18) > > > is booted, "xm save" works fine. Is this a known problem... > > > or perhaps xm save has never worked with an upstream pv > > > kernel and I've never noticed? > > > > > > I'd assume migrate and live migrate would fail also but > > > haven't tried them. > > > > > > > Just checking.. are you running the latest 2.6.31.5 ? I think > > there has > > been multiple xen related bugfixes in the 2.6.31.X releases. > > > > -- Pasi > > No it was plain 2.6.31. But I downloaded/built 2.6.31.5 and > can't even get it to boot (and no console or VNC output at > all). Are CONFIG changes required betwen 2.6.31 and 2.6.31.5 > for Xen? (I checked and I am using the same .config.) > > Trying to reproduce on a different machine, just to verify. > There shouldn't be any .config changes needed. Can you paste the full domU console output? Does it crash or? -- Pasi ^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: pv 2.6.31 (kernel.org) and save/migrate 2009-11-06 22:30 ` Pasi Kärkkäinen @ 2009-11-07 0:08 ` Dan Magenheimer 2009-11-07 11:09 ` Pasi Kärkkäinen 0 siblings, 1 reply; 30+ messages in thread From: Dan Magenheimer @ 2009-11-07 0:08 UTC (permalink / raw) To: "Pasi Kärkkäinen"; +Cc: Xen-Devel (E-mail) > On Fri, Nov 06, 2009 at 02:27:27PM -0800, Dan Magenheimer wrote: > > > On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote: > > > > Sorry for another possibly stupid question: > > > > > > > > I've observed that for a pv domain that's been updated > > > > to a 2.6.31 kernel (straight from kernel.org), "xm save" > > > > never completes. When the older kernel (2.6.18) > > > > is booted, "xm save" works fine. Is this a known problem... > > > > or perhaps xm save has never worked with an upstream pv > > > > kernel and I've never noticed? > > > > > > > > I'd assume migrate and live migrate would fail also but > > > > haven't tried them. > > > > > > > > > > Just checking.. are you running the latest 2.6.31.5 ? I think > > > there has > > > been multiple xen related bugfixes in the 2.6.31.X releases. > > > > > > -- Pasi > > > > No it was plain 2.6.31. But I downloaded/built 2.6.31.5 and > > can't even get it to boot (and no console or VNC output at > > all). Are CONFIG changes required betwen 2.6.31 and 2.6.31.5 > > for Xen? (I checked and I am using the same .config.) > > > > Trying to reproduce on a different machine, just to verify. > > There shouldn't be any .config changes needed. > > Can you paste the full domU console output? Does it crash or? > > -- Pasi Well, first, I got 2.6.31.5 to boot in a PV guest in another machine and it fails to save also. Are you able to save 2.6.31{,.5} successfully? On latest xen-unstable? (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't know if that is important.) (On the machine I couldn't boot 2.6.31.5 as a PV guest, there was absolutely no console output. 
However, I think tools are out-of-date on that machine so ignore that.) ^ permalink raw reply [flat|nested] 30+ messages in thread
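[Editorial aside: the CONFIG_XEN_SAVE_RESTORE option mentioned above is one of a handful of 2.6.31-era Kconfig options that gate PV console output and the suspend/resume path. The fragment below is a sketch with assumed-typical domU values, not taken from Dan's actual .config:]

```
# Sketch of 2.6.31-era domU .config options relevant to this thread
# (values assumed typical; check your own tree's Kconfig help text):
CONFIG_PARAVIRT=y
CONFIG_XEN=y
CONFIG_HVC_XEN=y            # Xen paravirt console (hvc0)
CONFIG_XEN_SAVE_RESTORE=y   # gates the suspend/resume ("xm save") path
```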
* Re: pv 2.6.31 (kernel.org) and save/migrate 2009-11-07 0:08 ` Dan Magenheimer @ 2009-11-07 11:09 ` Pasi Kärkkäinen 2009-11-07 15:32 ` Dan Magenheimer 0 siblings, 1 reply; 30+ messages in thread From: Pasi Kärkkäinen @ 2009-11-07 11:09 UTC (permalink / raw) To: Dan Magenheimer; +Cc: Xen-Devel (E-mail) On Fri, Nov 06, 2009 at 04:08:26PM -0800, Dan Magenheimer wrote: > > On Fri, Nov 06, 2009 at 02:27:27PM -0800, Dan Magenheimer wrote: > > > > On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote: > > > > > Sorry for another possibly stupid question: > > > > > > > > > > I've observed that for a pv domain that's been updated > > > > > to a 2.6.31 kernel (straight from kernel.org), "xm save" > > > > > never completes. When the older kernel (2.6.18) > > > > > is booted, "xm save" works fine. Is this a known problem... > > > > > or perhaps xm save has never worked with an upstream pv > > > > > kernel and I've never noticed? > > > > > > > > > > I'd assume migrate and live migrate would fail also but > > > > > haven't tried them. > > > > > > > > > > > > > Just checking.. are you running the latest 2.6.31.5 ? I think > > > > there has > > > > been multiple xen related bugfixes in the 2.6.31.X releases. > > > > > > > > -- Pasi > > > > > > No it was plain 2.6.31. But I downloaded/built 2.6.31.5 and > > > can't even get it to boot (and no console or VNC output at > > > all). Are CONFIG changes required betwen 2.6.31 and 2.6.31.5 > > > for Xen? (I checked and I am using the same .config.) > > > > > > Trying to reproduce on a different machine, just to verify. > > > > There shouldn't be any .config changes needed. > > > > Can you paste the full domU console output? Does it crash or? > > > > -- Pasi > > Well, first, I got 2.6.31.5 to boot in a PV guest in another > machine and it fails to save also. Are you able to save > 2.6.31{,.5} successfully? On latest xen-unstable? > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't > know if that is important.) 
> I'll have to try it later today.. > (On the machine I couldn't boot 2.6.31.5 as a PV guest, there > was absolutely no console output. However, I think tools > are out-of-date on that machine so ignore that.) Did you have "console=hvc0 earlyprintk=xen" in the domU kernel parameters? You might also change the xen guest cfgfile so that you have on_crash=preserve and then when the PV guest is crashed run this: /usr/lib/xen/bin/xenctx -s System.map-domUkernelversion <domid> (if you have 64b host the xenctx binary might be under /usr/lib64/) to get a stack trace.. -- Pasi ^ permalink raw reply [flat|nested] 30+ messages in thread
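[Editorial aside: Pasi's crash-debugging recipe above can be sketched as a small dom0 script. The domain name "mydomu" and the System.map path are placeholders; the helper simply picks whichever xenctx install path exists, since 32-bit and 64-bit hosts differ:]

```shell
#!/bin/sh
# Sketch of the recipe above. The guest cfgfile needs
#   on_crash = "preserve"
# so the crashed domain stays resident for inspection.

# Return the first directory argument containing an executable xenctx.
find_xenctx() {
    for dir in "$@"; do
        if [ -x "$dir/xenctx" ]; then
            echo "$dir/xenctx"
            return 0
        fi
    done
    return 1
}

# "mydomu" and the System.map path are placeholders for your guest.
if xenctx_bin=$(find_xenctx /usr/lib/xen/bin /usr/lib64/xen/bin); then
    domid=$(xm domid mydomu)
    "$xenctx_bin" -s /boot/System.map-2.6.31.5 "$domid"
fi
```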
* RE: pv 2.6.31 (kernel.org) and save/migrate 2009-11-07 11:09 ` Pasi Kärkkäinen @ 2009-11-07 15:32 ` Dan Magenheimer 2009-11-08 14:17 ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Pasi Kärkkäinen 0 siblings, 1 reply; 30+ messages in thread From: Dan Magenheimer @ 2009-11-07 15:32 UTC (permalink / raw) To: "Pasi Kärkkäinen"; +Cc: Xen-Devel (E-mail) > > Well, first, I got 2.6.31.5 to boot in a PV guest in another > > machine and it fails to save also. Are you able to save > > 2.6.31{,.5} successfully? On latest xen-unstable? > > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't > > know if that is important.) > > I'll have to try it later today.. Let me know. > > (On the machine I couldn't boot 2.6.31.5 as a PV guest, there > > was absolutely no console output. However, I think tools > > are out-of-date on that machine so ignore that.) > > Did you have "console=hvc0 earlyprintk=xen" in the domU kernel > parameters? No, but that didn't work either. > You might also change the xen guest cfgfile so that you have > on_crash=preserve and then when the PV guest is crashed run this: > > /usr/lib/xen/bin/xenctx -s System.map-domUkernelversion <domid> > > (if you have 64b host the xenctx binary might be under /usr/lib64/) > > to get a stack trace.. Very interesting and useful! I was completely unaware of xenctx and could have used it many times in tmem development! The results explain why I can get it to run on one machine (an older laptop) and not run on another machine (a Nehalem system)... looks like this is maybe related to the cpuid-extended-topology-leaf bug that Jeremy sent a fix for upstream recently. 
cs:eip: e019:c040342d xen_cpuid+0x46
flags: 00001206 i nz p
ss:esp: e021:c0779ee4
eax: 00000001 ebx: 00000002 ecx: 00000100 edx: 00000001
esi: c0779f1c edi: c0779f18 ebp: c0779f24
ds: e021 es: e021 fs: 00d8 gs: 0000
Code (instr addr c040342d)
24 04 8b 15 a4 02 7c c0 89 54 24 08 8b 0e 0f 0b 78 65 6e 0f a2 <89> 45 00 8b 04 24 89 18 89 0e 89

Stack:
 c0779f20 ffffffff ffffffff c07c0360 c0779f18 c0779f1c c0779f20 c066fd0f
 c0779f18 c0779f24 00000002 16aee301 00000001 00000001 16aee301 00000002
 0000000b c07c03cc c07c0360 c07c0360 c07c03d8 c0670ed8 c0779f58 00000001
 c07c0360 c0779f60 c066fe6a c0779f60 c0779f60 00000003 00000001 00000000

Call Trace:
 [<c040342d>] xen_cpuid+0x46 <--
 [<c066fd0f>] detect_extended_topology+0xae
 [<c0670ed8>] init_intel+0x140
 [<c066fe6a>] init_scattered_cpuid_features+0x82
 [<c06705e2>] identify_cpu+0x22d
 [<c040584c>] xen_force_evtchn_callback+0xc
 [<c0405e78>] check_events+0x8
 [<c07c9dec>] identify_boot_cpu+0xa
 [<c07c9e9a>] check_bugs+0x8
 [<c07c27bd>] start_kernel+0x2a0
 [<c07c5206>] xen_start_kernel+0x340

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG()
  2009-11-07 15:32 ` Dan Magenheimer
@ 2009-11-08 14:17 ` Pasi Kärkkäinen
  2009-11-08 14:20   ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen
  2009-11-08 15:29   ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Dan Magenheimer
  0 siblings, 2 replies; 30+ messages in thread

From: Pasi Kärkkäinen @ 2009-11-08 14:17 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: Xen-Devel (E-mail)

On Sat, Nov 07, 2009 at 07:32:49AM -0800, Dan Magenheimer wrote:
> > > Well, first, I got 2.6.31.5 to boot in a PV guest in another
> > > machine and it fails to save also. Are you able to save
> > > 2.6.31{,.5} successfully? On latest xen-unstable?
> > > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't
> > > know if that is important.)
> >
> > I'll have to try it later today..
>
> Let me know.
>

Ok. I just tried with a Fedora 12 (rawhide) PV guest. I was able to
"xm save" and "xm restore" it without problems.

But I noticed there was a BUG printed on the guest console:
http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86_64-saverestore.txt

BUG: sleeping function called from invalid context at kernel/mutex.c:94
in_atomic(): 0, irqs_disabled(): 1, pid: 1052, name: kstop/0
Pid: 1052, comm: kstop/0 Not tainted 2.6.31.5-122.fc12.x86_64 #1
Call Trace:
 [<ffffffff8104021f>] __might_sleep+0xe6/0xe8
 [<ffffffff81419c84>] mutex_lock+0x22/0x4e
 [<ffffffff812afdce>] dpm_resume_noirq+0x21/0x11f
 [<ffffffff81272b05>] xen_suspend+0xca/0xd1
 [<ffffffff8108c172>] stop_cpu+0x8c/0xd2
 [<ffffffff8106350c>] worker_thread+0x18a/0x224
 [<ffffffff81067ae7>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff8141ab29>] ? _spin_unlock_irqrestore+0x19/0x1b
 [<ffffffff81063382>] ? worker_thread+0x0/0x224
 [<ffffffff81067765>] kthread+0x91/0x99
 [<ffffffff81012daa>] child_rip+0xa/0x20
 [<ffffffff81011f97>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101271d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81012da0>] ? child_rip+0x0/0x20

More information about my setup:

Host/dom0: Fedora 12 (latest rawhide) with included Xen 3.4.1-5 and
custom 2.6.31.5 x86_64 pv_ops dom0 kernel (a couple of days old).

Guest/domU: Fedora 12 (latest rawhide) with the included/default
2.6.31.5-122.fc12.x86_64 kernel.

> > > (On the machine I couldn't boot 2.6.31.5 as a PV guest, there
> > > was absolutely no console output. However, I think tools
> > > are out-of-date on that machine so ignore that.)
> >
> > Did you have "console=hvc0 earlyprintk=xen" in the domU kernel
> > parameters?
>
> No, but that didn't work either.
>

Ok.. then it crashes really early.

> > You might also change the xen guest cfgfile so that you have
> > on_crash=preserve and then when the PV guest is crashed run this:
> >
> > /usr/lib/xen/bin/xenctx -s System.map-domUkernelversion <domid>
> >
> > (if you have 64b host the xenctx binary might be under /usr/lib64/)
> >
> > to get a stack trace..
>
> Very interesting and useful! I was completely unaware of
> xenctx and could have used it many times in tmem development!
>
> The results explain why I can get it to run on
> one machine (an older laptop) and not run on another
> machine (a Nehalem system)... looks like this is maybe
> related to the cpuid-extended-topology-leaf bug that Jeremy
> sent a fix for upstream recently.
>

Did you try with that patch applied?

-- Pasi

> cs:eip: e019:c040342d xen_cpuid+0x46
> flags: 00001206 i nz p
> ss:esp: e021:c0779ee4
> eax: 00000001 ebx: 00000002 ecx: 00000100 edx: 00000001
> esi: c0779f1c edi: c0779f18 ebp: c0779f24
> ds: e021 es: e021 fs: 00d8 gs: 0000
> Code (instr addr c040342d)
> 24 04 8b 15 a4 02 7c c0 89 54 24 08 8b 0e 0f 0b 78 65 6e 0f a2 <89> 45 00 8b 04 24 89 18 89 0e 89
>
>
> Stack:
>  c0779f20 ffffffff ffffffff c07c0360 c0779f18 c0779f1c c0779f20 c066fd0f
>  c0779f18 c0779f24 00000002 16aee301 00000001 00000001 16aee301 00000002
>  0000000b c07c03cc c07c0360 c07c0360 c07c03d8 c0670ed8 c0779f58 00000001
>  c07c0360 c0779f60 c066fe6a c0779f60 c0779f60 00000003 00000001 00000000
>
> Call Trace:
>  [<c040342d>] xen_cpuid+0x46 <--
>  [<c066fd0f>] detect_extended_topology+0xae
>  [<c0670ed8>] init_intel+0x140
>  [<c066fe6a>] init_scattered_cpuid_features+0x82
>  [<c06705e2>] identify_cpu+0x22d
>  [<c040584c>] xen_force_evtchn_callback+0xc
>  [<c0405e78>] check_events+0x8
>  [<c07c9dec>] identify_boot_cpu+0xa
>  [<c07c9e9a>] check_bugs+0x8
>  [<c07c27bd>] start_kernel+0x2a0
>  [<c07c5206>] xen_start_kernel+0x340

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG 2009-11-08 14:17 ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Pasi Kärkkäinen @ 2009-11-08 14:20 ` Pasi Kärkkäinen 2009-11-08 15:29 ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Dan Magenheimer 1 sibling, 0 replies; 30+ messages in thread From: Pasi Kärkkäinen @ 2009-11-08 14:20 UTC (permalink / raw) To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail) On Sun, Nov 08, 2009 at 04:17:43PM +0200, Pasi Kärkkäinen wrote: > On Sat, Nov 07, 2009 at 07:32:49AM -0800, Dan Magenheimer wrote: > > > > Well, first, I got 2.6.31.5 to boot in a PV guest in another > > > > machine and it fails to save also. Are you able to save > > > > 2.6.31{,.5} successfully? On latest xen-unstable? > > > > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't > > > > know if that is important.) > > > > > > I'll have to try it later today.. > > > > Let me know. > > > > Ok. I just tried with a Fedora 12 (rawhide) PV guest. I was able to > "xm save" and "xm restore" it without problems. > > But I noticed there was a BUG printed on the guest console: > http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86_64-saverestore.txt > > BUG: sleeping function called from invalid context at kernel/mutex.c:94 > in_atomic(): 0, irqs_disabled(): 1, pid: 1052, name: kstop/0 > Pid: 1052, comm: kstop/0 Not tainted 2.6.31.5-122.fc12.x86_64 #1 > Call Trace: > [<ffffffff8104021f>] __might_sleep+0xe6/0xe8 > [<ffffffff81419c84>] mutex_lock+0x22/0x4e > [<ffffffff812afdce>] dpm_resume_noirq+0x21/0x11f > [<ffffffff81272b05>] xen_suspend+0xca/0xd1 > [<ffffffff8108c172>] stop_cpu+0x8c/0xd2 > [<ffffffff8106350c>] worker_thread+0x18a/0x224 > [<ffffffff81067ae7>] ? autoremove_wake_function+0x0/0x39 > [<ffffffff8141ab29>] ? _spin_unlock_irqrestore+0x19/0x1b > [<ffffffff81063382>] ? worker_thread+0x0/0x224 > [<ffffffff81067765>] kthread+0x91/0x99 > [<ffffffff81012daa>] child_rip+0xa/0x20 > [<ffffffff81011f97>] ? 
int_ret_from_sys_call+0x7/0x1b > [<ffffffff8101271d>] ? retint_restore_args+0x5/0x6 > [<ffffffff81012da0>] ? child_rip+0x0/0x20 > Oh, I forgot to mention that this BUG is non-fatal. The guest still works after that.. -- Pasi > > More information about my setup: > > Host/dom0: Fedora 12 (latest rawhide) with included Xen 3.4.1-5 and > custom 2.6.31.5 x86_64 pv_ops dom0 kernel (a couple of days old). > > Guest/domU: Fedora 12 (latest rawhide) with the included/default > 2.6.31.5-122.fc12.x86_64 kernel. > > > > > (On the machine I couldn't boot 2.6.31.5 as a PV guest, there > > > > was absolutely no console output. However, I think tools > > > > are out-of-date on that machine so ignore that.) > > > > > > Did you have "console=hvc0 earlyprintk=xen" in the domU kernel > > > parameters? > > > > No, but that didn't work either. > > > > Ok.. then it crashes really early. > > > > You might also change the xen guest cfgfile so that you have > > > on_crash=preserve and then when the PV guest is crashed run this: > > > > > > /usr/lib/xen/bin/xenctx -s System.map-domUkernelversion <domid> > > > > > > (if you have 64b host the xenctx binary might be under /usr/lib64/) > > > > > > to get a stack trace.. > > > > Very interesting and useful! I was completely unaware of > > xenctx and could have used it many times in tmem development! > > > > The results explain why I can get it to run on > > one machine (an older laptop) and not run on another > > machine (a Nehalem system)... looks like this is maybe > > related to the cpuid-extended-topology-leaf bug that Jeremy > > sent a fix for upstream recently. > > > > Did you try with that patch applied? 
> > -- Pasi > > > cs:eip: e019:c040342d xen_cpuid+0x46 > > flags: 00001206 i nz p > > ss:esp: e021:c0779ee4 > > eax: 00000001 ebx: 00000002 ecx: 00000100 edx: 00000001 > > esi: c0779f1c edi: c0779f18 ebp: c0779f24 > > ds: e021 es: e021 fs: 00d8 gs: 0000 > > Code (instr addr c040342d) > > 24 04 8b 15 a4 02 7c c0 89 54 24 08 8b 0e 0f 0b 78 65 6e 0f a2 <89> 45 00 8b 04 24 89 18 89 0e 89 > > > > > > Stack: > > c0779f20 ffffffff ffffffff c07c0360 c0779f18 c0779f1c c0779f20 c066fd0f > > c0779f18 c0779f24 00000002 16aee301 00000001 00000001 16aee301 00000002 > > 0000000b c07c03cc c07c0360 c07c0360 c07c03d8 c0670ed8 c0779f58 00000001 > > c07c0360 c0779f60 c066fe6a c0779f60 c0779f60 00000003 00000001 00000000 > > > > Call Trace: > > [<c040342d>] xen_cpuid+0x46 <-- > > [<c066fd0f>] detect_extended_topology+0xae > > [<c0670ed8>] init_intel+0x140 > > [<c066fe6a>] init_scattered_cpuid_features+0x82 > > [<c06705e2>] identify_cpu+0x22d > > [<c040584c>] xen_force_evtchn_callback+0xc > > [<c0405e78>] check_events+0x8 > > [<c07c9dec>] identify_boot_cpu+0xa > > [<c07c9e9a>] check_bugs+0x8 > > [<c07c27bd>] start_kernel+0x2a0 > > [<c07c5206>] xen_start_kernel+0x340 > > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel ^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: pv 2.6.31 (kernel.org) and save/migrate, domU BUG() 2009-11-08 14:17 ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Pasi Kärkkäinen 2009-11-08 14:20 ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen @ 2009-11-08 15:29 ` Dan Magenheimer 2009-11-08 15:41 ` Pasi Kärkkäinen 1 sibling, 1 reply; 30+ messages in thread From: Dan Magenheimer @ 2009-11-08 15:29 UTC (permalink / raw) To: "Pasi Kärkkäinen"; +Cc: Xen-Devel (E-mail) > > > > machine and it fails to save also. Are you able to save > > > > 2.6.31{,.5} successfully? On latest xen-unstable? > > > > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't > > > > know if that is important.) > > Ok. I just tried with a Fedora 12 (rawhide) PV guest. I was able to > "xm save" and "xm restore" it without problems. > > But I noticed there was a BUG printed on the guest console: > http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86 > _64-saverestore.txt > BUG: sleeping function called from invalid context at > kernel/mutex.c:94 > in_atomic(): 0, irqs_disabled(): 1, pid: 1052, name: kstop/0 > Pid: 1052, comm: kstop/0 Not tainted 2.6.31.5-122.fc12.x86_64 #1 Ok, so it appears there is something problematic with saving an upstream kernel. It might be (partially) fixed in Fedora 12 or maybe there is some other environmental difference which makes save fail entirely on my system. > > The results explain why I can get it to run on > > one machine (an older laptop) and not run on another > > machine (a Nehalem system)... looks like this is maybe > > related to the cpuid-extended-topology-leaf bug that Jeremy > > sent a fix for upstream recently. > > Did you try with that patch applied? No, the patch wasn't posted, just a pull request to Linus, so I don't have the patch (and am not a git expert so am not sure how to get it). http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00182.html So I'll try it again when .6 or .7 is available. 
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG() 2009-11-08 15:29 ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Dan Magenheimer @ 2009-11-08 15:41 ` Pasi Kärkkäinen 2009-11-08 16:48 ` Pasi Kärkkäinen 2009-11-08 16:54 ` Dan Magenheimer 0 siblings, 2 replies; 30+ messages in thread From: Pasi Kärkkäinen @ 2009-11-08 15:41 UTC (permalink / raw) To: Dan Magenheimer; +Cc: Xen-Devel (E-mail) On Sun, Nov 08, 2009 at 07:29:58AM -0800, Dan Magenheimer wrote: > > > > > machine and it fails to save also. Are you able to save > > > > > 2.6.31{,.5} successfully? On latest xen-unstable? > > > > > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't > > > > > know if that is important.) > > > > Ok. I just tried with a Fedora 12 (rawhide) PV guest. I was able to > > "xm save" and "xm restore" it without problems. > > > > But I noticed there was a BUG printed on the guest console: > > http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86 > > _64-saverestore.txt > > BUG: sleeping function called from invalid context at > > kernel/mutex.c:94 > > in_atomic(): 0, irqs_disabled(): 1, pid: 1052, name: kstop/0 > > Pid: 1052, comm: kstop/0 Not tainted 2.6.31.5-122.fc12.x86_64 #1 > > Ok, so it appears there is something problematic with > saving an upstream kernel. It might be (partially) fixed > in Fedora 12 or maybe there is some other environmental > difference which makes save fail entirely on my system. > Yeah, fedora kernel has some patches, but it should be pretty close to upstream kernel.. btw was your guest UP or SMP? Mine was UP.. > > > The results explain why I can get it to run on > > > one machine (an older laptop) and not run on another > > > machine (a Nehalem system)... looks like this is maybe > > > related to the cpuid-extended-topology-leaf bug that Jeremy > > > sent a fix for upstream recently. > > > > Did you try with that patch applied? 
> > No, the patch wasn't posted, just a pull request to Linus, > so I don't have the patch (and am not a git expert so > am not sure how to get it). > > http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00182.html > > So I'll try it again when .6 or .7 is available. See here for changelog: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=shortlog;h=bugfix You can get the diffs/patches from there using the links.. -- Pasi ^ permalink raw reply [flat|nested] 30+ messages in thread
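[Editorial aside: gitweb instances such as git.kernel.org can also serve an individual commit as a raw patch via the "a=patch" action, so no git clone is needed. The helper below just builds such a URL; the commit hash shown is a placeholder, not Jeremy's actual fix:]

```shell
#!/bin/sh
# Build a gitweb URL that serves a single commit as a raw patch
# (gitweb's "a=patch" action). Repo path and hash are examples only.
gitweb_patch_url() {
    repo=$1
    commit=$2
    echo "http://git.kernel.org/?p=${repo};a=patch;h=${commit}"
}

url=$(gitweb_patch_url "linux/kernel/git/jeremy/xen.git" "abc1234")
echo "$url"
# then, for instance:  wget -O fix.patch "$url" && patch -p1 < fix.patch
```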
* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG() 2009-11-08 15:41 ` Pasi Kärkkäinen @ 2009-11-08 16:48 ` Pasi Kärkkäinen 2009-11-12 23:16 ` Jeremy Fitzhardinge 2009-11-08 16:54 ` Dan Magenheimer 1 sibling, 1 reply; 30+ messages in thread From: Pasi Kärkkäinen @ 2009-11-08 16:48 UTC (permalink / raw) To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail) On Sun, Nov 08, 2009 at 05:41:53PM +0200, Pasi Kärkkäinen wrote: > On Sun, Nov 08, 2009 at 07:29:58AM -0800, Dan Magenheimer wrote: > > > > > > machine and it fails to save also. Are you able to save > > > > > > 2.6.31{,.5} successfully? On latest xen-unstable? > > > > > > (NOTE: Yes, I do have CONFIG_XEN_SAVE_RESTORE=y... don't > > > > > > know if that is important.) > > > > > > Ok. I just tried with a Fedora 12 (rawhide) PV guest. I was able to > > > "xm save" and "xm restore" it without problems. > > > > > > But I noticed there was a BUG printed on the guest console: > > > http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86 > > > _64-saverestore.txt > > > BUG: sleeping function called from invalid context at > > > kernel/mutex.c:94 > > > in_atomic(): 0, irqs_disabled(): 1, pid: 1052, name: kstop/0 > > > Pid: 1052, comm: kstop/0 Not tainted 2.6.31.5-122.fc12.x86_64 #1 > > > > Ok, so it appears there is something problematic with > > saving an upstream kernel. It might be (partially) fixed > > in Fedora 12 or maybe there is some other environmental > > difference which makes save fail entirely on my system. > > > > Yeah, fedora kernel has some patches, but it should be pretty > close to upstream kernel.. > > btw was your guest UP or SMP? Mine was UP.. > Ok.. saving SMP guest fails for me too: [2009-11-09 23:44:38 1353] DEBUG (XendCheckpoint:110) [xc_save]: /usr/lib64/xen/bin/xc_save 28 2 0 0 0 [2009-11-09 23:44:38 1353] INFO (XendCheckpoint:417) xc_save: failed to get the suspend evtchn port Jeremy: Ideas what's causing that? 
"xm save" for UP 2.6.31.5 guest works OK, but for SMP guest it fails with the error above. -- Pasi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG() 2009-11-08 16:48 ` Pasi Kärkkäinen @ 2009-11-12 23:16 ` Jeremy Fitzhardinge 2009-11-12 23:22 ` Brendan Cully 0 siblings, 1 reply; 30+ messages in thread From: Jeremy Fitzhardinge @ 2009-11-12 23:16 UTC (permalink / raw) To: Pasi Kärkkäinen Cc: Dan Magenheimer, Xen-Devel (E-mail), Brendan Cully On 11/08/09 08:48, Pasi Kärkkäinen wrote: > Ok.. saving SMP guest fails for me too: > > [2009-11-09 23:44:38 1353] DEBUG (XendCheckpoint:110) [xc_save]: /usr/lib64/xen/bin/xc_save 28 2 0 0 0 > [2009-11-09 23:44:38 1353] INFO (XendCheckpoint:417) xc_save: failed to get the suspend evtchn port > > Jeremy: Ideas what's causing that? "xm save" for UP 2.6.31.5 guest works > OK, but for SMP guest it fails with the error above. There's no "suspend evtchn port" in a pvops kernel. That looks like a Remus thing. I think. J ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG() 2009-11-12 23:16 ` Jeremy Fitzhardinge @ 2009-11-12 23:22 ` Brendan Cully 0 siblings, 0 replies; 30+ messages in thread From: Brendan Cully @ 2009-11-12 23:22 UTC (permalink / raw) To: Jeremy Fitzhardinge; +Cc: Dan Magenheimer, Xen-Devel (E-mail) On Thursday, 12 November 2009 at 15:16, Jeremy Fitzhardinge wrote: > On 11/08/09 08:48, Pasi Kärkkäinen wrote: > > Ok.. saving SMP guest fails for me too: > > > > [2009-11-09 23:44:38 1353] DEBUG (XendCheckpoint:110) [xc_save]: /usr/lib64/xen/bin/xc_save 28 2 0 0 0 > > [2009-11-09 23:44:38 1353] INFO (XendCheckpoint:417) xc_save: failed to get the suspend evtchn port > > > > Jeremy: Ideas what's causing that? "xm save" for UP 2.6.31.5 guest works > > OK, but for SMP guest it fails with the error above. > > There's no "suspend evtchn port" in a pvops kernel. That looks like a > Remus thing. I think. This is only an INFO-level message, because xc_save falls back to the old xenstore method if it can't find a suspend event channel. I don't know the context here, but this particular message ought to be harmless. The event channel was made for Remus, but regular xc_save also uses it to reduce the downtime at the end of live migration. ^ permalink raw reply [flat|nested] 30+ messages in thread
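[Editorial aside: Brendan's point can be checked from dom0 before saving: a guest that supports the fast path advertises the event channel in xenstore. The xenstore path below is an assumption based on Remus-era tools, and the domid is a placeholder; `xenstore-read` must be installed in dom0:]

```shell
#!/bin/sh
# Probe whether a guest advertises the fast suspend event channel.
# The xenstore path is an assumption from Remus-era tools; if it is
# absent, xc_save falls back to the older xenstore suspend request.
check_suspend_evtchn() {
    domid=$1
    if port=$(xenstore-read "/local/domain/${domid}/device/suspend/event-channel" 2>/dev/null); then
        echo "suspend evtchn port: $port"
    else
        echo "no suspend evtchn; tools fall back to the xenstore suspend request"
    fi
}

check_suspend_evtchn 2   # "2" is a placeholder domid
```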
* RE: pv 2.6.31 (kernel.org) and save/migrate, domU BUG() 2009-11-08 15:41 ` Pasi Kärkkäinen 2009-11-08 16:48 ` Pasi Kärkkäinen @ 2009-11-08 16:54 ` Dan Magenheimer 2009-11-08 17:27 ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen 2009-11-12 23:21 ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Jeremy Fitzhardinge 1 sibling, 2 replies; 30+ messages in thread From: Dan Magenheimer @ 2009-11-08 16:54 UTC (permalink / raw) To: "Pasi Kärkkäinen"; +Cc: Xen-Devel (E-mail) [-- Attachment #1: Type: text/plain, Size: 1688 bytes --] > > Ok, so it appears there is something problematic with > > saving an upstream kernel. It might be (partially) fixed > > in Fedora 12 or maybe there is some other environmental > > difference which makes save fail entirely on my system. > > > > Yeah, fedora kernel has some patches, but it should be pretty > close to upstream kernel.. > > btw was your guest UP or SMP? Mine was UP.. Mine was SMP... switching to UP I can now save. BUT... restore doesn't seem to quite work. The restore completes but I get no response from the VNC console. When I use a tty console, after restore, I am getting an infinite dump of WARNING: at arch/x86/time.c:180 xen_sched_clock+0x2b (see attached). Did you try restore on Fedora 12? > > > > The results explain why I can get it to run on > > > > one machine (an older laptop) and not run on another > > > > machine (a Nehalem system)... looks like this is maybe > > > > related to the cpuid-extended-topology-leaf bug that Jeremy > > > > sent a fix for upstream recently. > > > > > > Did you try with that patch applied? > > > > No, the patch wasn't posted, just a pull request to Linus, > > so I don't have the patch (and am not a git expert so > > am not sure how to get it). > > > > http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00182.html > > > > So I'll try it again when .6 or .7 is available. 
> > See here for changelog: > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=shortlog;h=bugfix > > You can get the diffs/patches from there using the links.. Thanks. Yes, Jeremy's patch allows 2.6.31.5 (in a PV domain) to completely boot on my Nehalem box.

[-- Attachment #2: restore.out --]
[-- Type: application/octet-stream, Size: 9696 bytes --]

------------[ cut here ]------------
WARNING: at arch/x86/xen/time.c:180 xen_sched_clock+0x2b/0x5a()
Modules linked in: autofs4 hidp nfs lockd nfs_acl auth_rpcgss rfcomm l2cap bluetooth rfkill sunrpc iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_multipath parport_pc lp parport rtc_core rtc_lib pcspkr joydev dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 16, comm: xenwatch Tainted: G W 2.6.31.5 #4
Call Trace:
 [<c0405cc3>] ? xen_sched_clock+0x2b/0x5a
 [<c0430540>] ? warn_slowpath_common+0x5e/0x71
 [<c043055d>] ? warn_slowpath_null+0xa/0xc
 [<c0405cc3>] ? xen_sched_clock+0x2b/0x5a
 [<c040b4b4>] ? sched_clock+0x8/0x18
 [<c0444de2>] ? cpu_clock+0x1d/0x33
 [<c0405e9c>] ? check_events+0x8/0xc
 [<c045b2ae>] ? get_timestamp+0x5/0xd
 [<c045b2cf>] ? __touch_softlockup_watchdog+0x19/0x1f
 [<c0438a2b>] ? update_process_times+0x21/0x49
 [<c0449a5b>] ? tick_periodic+0x60/0x6a
 [<c0449ab9>] ? tick_handle_periodic+0x54/0x5b
 [<c0405ad1>] ? xen_timer_interrupt+0x26/0x17f
 [<c04223da>] ? __wake_up_common+0x2e/0x58
 [<c0405870>] ? xen_force_evtchn_callback+0xc/0x10
 [<c0405e9c>] ? check_events+0x8/0xc
 [<c05714bd>] ? unmask_evtchn+0x2c/0xc6
 [<c0405e93>] ? xen_restore_fl_direct_end+0x0/0x1
 [<c045b8d7>] ? handle_IRQ_event+0x4e/0xf1
 [<c045d010>] ? handle_level_irq+0x69/0xad
 [<c045cfa7>] ? handle_level_irq+0x0/0xad
 <IRQ>  [<c0571b64>] ? xen_evtchn_do_upcall+0xa2/0x11b
 [<c0408387>] ? xen_do_upcall+0x7/0xc
 [<c0402227>] ? hypercall_page+0x227/0x1001
 [<c0405870>] ? xen_force_evtchn_callback+0xc/0x10
 [<c0405e9c>] ? check_events+0x8/0xc
 [<c0405e5b>] ? xen_irq_enable_direct_end+0x0/0x1
 [<c0427777>] ? finish_task_switch+0x52/0xa4
 [<c0676e53>] ? schedule+0x764/0x7c9
 [<c0405870>] ? xen_force_evtchn_callback+0xc/0x10
 [<c0405e93>] ? xen_restore_fl_direct_end+0x0/0x1
 [<c067824e>] ? _spin_unlock_irqrestore+0xe/0x10
 [<c0405e9c>] ? check_events+0x8/0xc
 [<c042b237>] ? __cond_resched+0x13/0x2f
 [<c0676f0b>] ? _cond_resched+0x18/0x21
 [<c043e17d>] ? flush_workqueue+0x1d/0x4a
 [<c0572433>] ? xen_suspend+0x0/0xc4
 [<c0452c0e>] ? __stop_machine+0xbf/0xd6
 [<c0572433>] ? xen_suspend+0x0/0xc4
 [<c0452da8>] ? stop_machine+0x25/0x39
 [<c0572663>] ? shutdown_handler+0x16c/0x1d9
 [<c0573c2e>] ? xenwatch_thread+0xc8/0xee
 [<c04414f4>] ? autoremove_wake_function+0x0/0x2d
 [<c0573b66>] ? xenwatch_thread+0x0/0xee
 [<c0441450>] ? kthread+0x6e/0x76
 [<c04413e2>] ? kthread+0x0/0x76
 [<c0408337>] ? kernel_thread_helper+0x7/0x10
---[ end trace bb4cac02c28c9de1 ]---

[three further, near-identical WARNING dumps trimmed: end traces bb4cac02c28c9de2 and bb4cac02c28c9de3, plus a fourth cut off at the console prompt; they differ only in the frames between sched_clock and tick_periodic (scheduler_tick vs. the softlockup-watchdog path)]

[root@dmagenhe-nsvpn-dhcp-141-144-22-8 OVM_EL5U2_X86_PVM_10GB]#

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG 2009-11-08 16:54 ` Dan Magenheimer @ 2009-11-08 17:27 ` Pasi Kärkkäinen 2009-11-10 10:08 ` pv 2.6.31 (kernel.org) and save/migrate fails, " Pasi Kärkkäinen 2009-11-12 23:21 ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Jeremy Fitzhardinge 1 sibling, 1 reply; 30+ messages in thread From: Pasi Kärkkäinen @ 2009-11-08 17:27 UTC (permalink / raw) To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail) On Sun, Nov 08, 2009 at 08:54:23AM -0800, Dan Magenheimer wrote: > > > Ok, so it appears there is something problematic with > > > saving an upstream kernel. It might be (partially) fixed > > > in Fedora 12 or maybe there is some other environmental > > > difference which makes save fail entirely on my system. > > > > > > > Yeah, fedora kernel has some patches, but it should be pretty > > close to upstream kernel.. > > > > btw was your guest UP or SMP? Mine was UP.. > > Mine was SMP... switching to UP I can now save. BUT... > restore doesn't seem to quite work. The restore completes > but I get no response from the VNC console. When I > use a tty console, after restore, I am getting > an infinite dump of > > WARNING: at arch/x86/time.c:180 xen_sched_clock+0x2b > > (see attached). > > Did you try restore on Fedora 12? > Yeah. save+restore for UP F12 guest works for me (except I get that non-fatal BUG on the guest). SMP guest doesn't work.. save crashes it. > > > > > The results explain why I can get it to run on > > > > > one machine (an older laptop) and not run on another > > > > > machine (a Nehalem system)... looks like this is maybe > > > > > related to the cpuid-extended-topology-leaf bug that Jeremy > > > > > sent a fix for upstream recently. > > > > > > > > Did you try with that patch applied? > > > > > > No, the patch wasn't posted, just a pull request to Linus, > > > so I don't have the patch (and am not a git expert so > > > am not sure how to get it). 
> > > > > > http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00182.html > > > > > > So I'll try it again when .6 or .7 is available. > > > > See here for changelog: > > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=shortlog;h=bugfix > > > > You can get the diffs/patches from there using the links.. > > Thanks. Yes, Jeremy's patch allows 2.6.31.5 (in a PV domain) > > to completely boot on my Nehalem box. Ok. But I guess those don't help with the save+restore problem.. -- Pasi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG 2009-11-08 17:27 ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen @ 2009-11-10 10:08 ` Pasi Kärkkäinen 2009-11-12 23:36 ` Jeremy Fitzhardinge 2009-11-23 16:44 ` pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG Ian Campbell 0 siblings, 2 replies; 30+ messages in thread From: Pasi Kärkkäinen @ 2009-11-10 10:08 UTC (permalink / raw) To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail) Hello, Jeremy: Here's a summary of these save/restore problems using an upstream Linux 2.6.31.5 PV guest. For me: - I can "xm save" + "xm restore" a UP guest, but I get a non-fatal BUG in the guest kernel, see [1]. - "xm save" fails for an SMP guest with "failed to get the suspend evtchn port", see [2]. For Dan: - "xm save" works for a UP guest, but "xm restore" doesn't, giving infinite xen_sched_clock related dumps in the guest kernel, see [3]. - "xm save" for an SMP guest fails, it never ends. I suspect this is the same problem I'm seeing. [1] non-fatal BUG on the guest kernel after "xm restore": http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86_64-saverestore.txt [2] "xm log" contains: [2009-11-09 23:44:38 1353] DEBUG (XendCheckpoint:110) [xc_save]: /usr/lib64/xen/bin/xc_save 28 2 0 0 0 [2009-11-09 23:44:38 1353] INFO (XendCheckpoint:417) xc_save: failed to get the suspend evtchn port [3] See the attachment in this email: http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00391.html Any tips on how to debug these? -- Pasi On Sun, Nov 08, 2009 at 07:27:47PM +0200, Pasi Kärkkäinen wrote: > On Sun, Nov 08, 2009 at 08:54:23AM -0800, Dan Magenheimer wrote: > > > > Ok, so it appears there is something problematic with > > > > saving an upstream kernel. It might be (partially) fixed > > > > in Fedora 12 or maybe there is some other environmental > > > > difference which makes save fail entirely on my system. 
> > > > > > > > > > Yeah, fedora kernel has some patches, but it should be pretty > > > close to upstream kernel.. > > > > > > btw was your guest UP or SMP? Mine was UP.. > > > > Mine was SMP... switching to UP I can now save. BUT... > > restore doesn't seem to quite work. The restore completes > > but I get no response from the VNC console. When I > > use a tty console, after restore, I am getting > > an infinite dump of > > > > WARNING: at arch/x86/time.c:180 xen_sched_clock+0x2b > > > > (see attached). > > > > Did you try restore on Fedora 12? > > > > Yeah. save+restore for UP F12 guest works for me > (except I get that non-fatal BUG on the guest). > > SMP guest doesn't work.. save crashes it. > > > > > > > The results explain why I can get it to run on > > > > > > one machine (an older laptop) and not run on another > > > > > > machine (a Nehalem system)... looks like this is maybe > > > > > > related to the cpuid-extended-topology-leaf bug that Jeremy > > > > > > sent a fix for upstream recently. > > > > > > > > > > Did you try with that patch applied? > > > > > > > > No, the patch wasn't posted, just a pull request to Linus, > > > > so I don't have the patch (and am not a git expert so > > > > am not sure how to get it). > > > > > > > > http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00182.html > > > > > > > > So I'll try it again when .6 or .7 is available. > > > > > > See here for changelog: > > > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=shortlog;h=bugfix > > > > > > You can get the diffs/patches from there using the links.. > > > > Thanks. Yes, Jeremy's patch allows 2.6.31.5 (in a PV domain) > > to completely boot on my Nehalem box. > > Ok. But I guess those doesn't help for the save+restore problem.. > > -- Pasi > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG 2009-11-10 10:08 ` pv 2.6.31 (kernel.org) and save/migrate fails, " Pasi Kärkkäinen @ 2009-11-12 23:36 ` Jeremy Fitzhardinge 2009-11-24 14:27 ` Ian Campbell 2009-11-23 16:44 ` pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG Ian Campbell 1 sibling, 1 reply; 30+ messages in thread From: Jeremy Fitzhardinge @ 2009-11-12 23:36 UTC (permalink / raw) To: Pasi Kärkkäinen; +Cc: Dan Magenheimer, Xen-Devel (E-mail) On 11/10/09 02:08, Pasi Kärkkäinen wrote: > Hello, > > Jeremy: Here's summary about these save/restore problems > using upstream Linux 2.6.31.5 PV guest. > > For me: > - I can "xm save" + "xm restore" UP guest, but I get non-fatal > BUG in the guest kernel, see [1]. > - "xm save" fails for SMP guest with "failed to get the suspend evtchn port", see [2]. > > For Dan: > - "xm save" works for UP guest, but "xm restore" doesn't, giving > infinite xen_sched_clock related dumps in the guest kernel, see [3]. > - "xm save" for SMP guest fails, it never ends. I suspect this > is the same problem I'm seeing. 
>
>
> [1] non-fatal BUG on the guest kernel after "xm restore":
> http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86_64-saverestore.txt
>

Does this help:

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 10d03d7..da57ea1 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -43,7 +43,6 @@ static int xen_suspend(void *data)
 	if (err) {
 		printk(KERN_ERR "xen_suspend: sysdev_suspend failed: %d\n",
 			err);
-		dpm_resume_noirq(PMSG_RESUME);
 		return err;
 	}
 
@@ -69,7 +68,6 @@ static int xen_suspend(void *data)
 	}
 
 	sysdev_resume();
-	dpm_resume_noirq(PMSG_RESUME);
 
 	return 0;
 }
@@ -108,6 +106,9 @@ static void do_suspend(void)
 	}
 
 	err = stop_machine(xen_suspend, &cancelled, cpumask_of(0));
+
+	dpm_resume_noirq(PMSG_RESUME);
+
 	if (err) {
 		printk(KERN_ERR "failed to start xen_suspend: %d\n", err);
 		goto out;

> [2] "xm log" contains:
> [2009-11-09 23:44:38 1353] DEBUG (XendCheckpoint:110) [xc_save]: /usr/lib64/xen/bin/xc_save 28 2 0 0 0
> [2009-11-09 23:44:38 1353] INFO (XendCheckpoint:417) xc_save: failed to get the suspend evtchn port
>

I think this may be a Remus side-effect.

> [3] See the attachment in this email:
> http://lists.xensource.com/archives/html/xen-devel/2009-11/msg00391.html
>

No idea about this one. Needs a closer look.

    J

^ permalink raw reply related [flat|nested] 30+ messages in thread
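[Editor's aside: the control-flow change in Jeremy's patch can be reduced to a tiny sketch. The `*_stub` functions are hypothetical stand-ins for the kernel functions, not the manage.c code; the counter just makes the invariant visible: the "resume devices" step moves out of the inner routine's error paths and runs unconditionally in the caller, so it executes exactly once however the suspend attempt ends.]

```c
static int resume_noirq_calls;

/* Stand-in for dpm_resume_noirq(PMSG_RESUME). */
static void dpm_resume_noirq_stub(void)
{
    resume_noirq_calls++;
}

/* Runs inside stop_machine() with interrupts off; after the patch it
 * no longer calls the resume step on its own error path. */
static int xen_suspend_stub(int fail)
{
    return fail ? -1 : 0;
}

int do_suspend_stub(int fail_inner)
{
    int err = xen_suspend_stub(fail_inner); /* stand-in for stop_machine() */

    dpm_resume_noirq_stub(); /* unconditionally, right after it returns */

    return err;
}
```

Whether the inner routine succeeds or fails, the resume step runs once; before the patch it could run twice (inner error path plus caller) or not at all.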
* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG 2009-11-12 23:36 ` Jeremy Fitzhardinge @ 2009-11-24 14:27 ` Ian Campbell 2009-11-25 14:12 ` Ian Campbell 0 siblings, 1 reply; 30+ messages in thread From: Ian Campbell @ 2009-11-24 14:27 UTC (permalink / raw) To: Jeremy Fitzhardinge; +Cc: Dan Magenheimer, Xen-Devel (E-mail) On Thu, 2009-11-12 at 23:36 +0000, Jeremy Fitzhardinge wrote: > On 11/10/09 02:08, Pasi Kärkkäinen wrote: > > Hello, > > > > Jeremy: Here's summary about these save/restore problems > > using upstream Linux 2.6.31.5 PV guest. > > > > For me: > > - I can "xm save" + "xm restore" UP guest, but I get non-fatal > > BUG in the guest kernel, see [1]. > > - "xm save" fails for SMP guest with "failed to get the suspend evtchn port", see [2]. > > > > For Dan: > > - "xm save" works for UP guest, but "xm restore" doesn't, giving > > infinite xen_sched_clock related dumps in the guest kernel, see [3]. > > - "xm save" for SMP guest fails, it never ends. I suspect this > > is the same problem I'm seeing. > > > > > > [1] non-fatal BUG on the guest kernel after "xm restore": > > http://pasik.reaktio.net/xen/debug/dmesg-2.6.31.5-122.fc12.x86_64-saverestore.txt > > > > Does this help: It does for me. There's another dpm_resume_noirq(PMSG_RESUME) a little later in do_suspend() which I think needs to be dropped as well. I'm still seeing other problems with resume, the system is hung on restore and the RCU stall detection logic is triggering, unfortunately arch_trigger_all_cpu_backtrace is not Xen compatible (uses APIC directly) so I don't get much useful info out of it. It's most likely a symptom of the actual problem rather than a problem with RCU per se anyhow. 
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 10d03d7..7b69a1a 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -43,7 +43,6 @@ static int xen_suspend(void *data)
 	if (err) {
 		printk(KERN_ERR "xen_suspend: sysdev_suspend failed: %d\n",
 			err);
-		dpm_resume_noirq(PMSG_RESUME);
 		return err;
 	}
 
@@ -69,7 +68,6 @@ static int xen_suspend(void *data)
 	}
 
 	sysdev_resume();
-	dpm_resume_noirq(PMSG_RESUME);
 
 	return 0;
 }
@@ -108,6 +106,9 @@ static void do_suspend(void)
 	}
 
 	err = stop_machine(xen_suspend, &cancelled, cpumask_of(0));
+
+	dpm_resume_noirq(PMSG_RESUME);
+
 	if (err) {
 		printk(KERN_ERR "failed to start xen_suspend: %d\n", err);
 		goto out;
@@ -119,8 +120,6 @@ static void do_suspend(void)
 	} else
 		xs_suspend_cancel();
 
-	dpm_resume_noirq(PMSG_RESUME);
-
 resume_devices:
 	dpm_resume_end(PMSG_RESUME);
 

^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG 2009-11-24 14:27 ` Ian Campbell @ 2009-11-25 14:12 ` Ian Campbell 2009-11-25 19:28 ` Jeremy Fitzhardinge ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Ian Campbell @ 2009-11-25 14:12 UTC (permalink / raw) To: Jeremy Fitzhardinge; +Cc: Dan Magenheimer, Xen-Devel (E-mail) On Tue, 2009-11-24 at 14:27 +0000, Ian Campbell wrote: > > I'm still seeing other problems with resume, the system is hung on > restore and the RCU stall detection logic is triggering, unfortunately > arch_trigger_all_cpu_backtrace is not Xen compatible (uses APIC > directly) so I don't get much useful info out of it. It's most likely > a symptom of the actual problem rather than a problem with RCU per se > anyhow. tick_resume() is never called on secondary processors. Presumably this is because they are offlined for suspend on native and so this is normally taken care of in the CPU onlining path. Under Xen we keep all CPUs online over a suspend. This patch papers over the issue for me but I will investigate a more generic, less hacky, way of doing the same. tick_suspend is also only called on the boot CPU, which I presume should be fixed too. Ian. 
diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 6343a5d..cdfeed2 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -1,4 +1,5 @@
 #include <linux/types.h>
+#include <linux/clockchips.h>
 
 #include <xen/interface/xen.h>
 #include <xen/grant_table.h>
@@ -46,7 +50,19 @@ void xen_post_suspend(int suspend_cancelled)
 	}
 }
 
+static void xen_vcpu_notify_restore(void *data)
+{
+	unsigned long reason = (unsigned long)data;
+
+	/* Boot processor notified via generic timekeeping_resume() */
+	if (smp_processor_id() == 0)
+		return;
+
+	clockevents_notify(reason, NULL);
+}
+
 void xen_arch_resume(void)
 {
-	/* nothing */
+	smp_call_function_many(cpu_online_mask, xen_vcpu_notify_restore,
+			       (void *)CLOCK_EVT_NOTIFY_RESUME, 1);
 }

^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG 2009-11-25 14:12 ` Ian Campbell @ 2009-11-25 19:28 ` Jeremy Fitzhardinge 2009-11-25 20:03 ` Ian Campbell 2009-12-01 11:47 ` [PATCH] xen: improve error handling in do_suspend Ian Campbell 2009-12-01 11:47 ` [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region Ian Campbell 2 siblings, 1 reply; 30+ messages in thread From: Jeremy Fitzhardinge @ 2009-11-25 19:28 UTC (permalink / raw) To: Ian Campbell Cc: Rafael J. Wysocki, Dan Magenheimer, Xen-Devel (E-mail), Thomas Gleixner On 11/25/09 06:12, Ian Campbell wrote: > tick_resume() is never called on secondary processors. Presumably this > is because they are offlined for suspend on native and so this is > normally taken care of in the CPU onlining path. Under Xen we keep all > CPUs online over a suspend. > > This patch papers over the issue for me but I will investigate a more > generic, less hacky, way of doing to the same. > > tick_suspend is also only called on the boot CPU which I presume should > be fixed too. > Yep. I wonder how it ever worked? There's been a fair amount of change in the PM code, so that could have changed things. I don't know if there's a deep reason for not calling tick_resume() on all processors. Rafael, tglx: suspend/resume under Xen doesn't need to hot unplug all the CPUs, so we don't; the hypervisor can manage the context save/restore for all CPUs. Is there a deep reason why timekeeping_resume() can't call the CLOCK_EVT_NOTIFY_RESUME notifier on all online CPUs? > void xen_arch_resume(void) > { > - /* nothing */ > + smp_call_function_many(cpu_online_mask, xen_vcpu_notify_restore, > + (void *)CLOCK_EVT_NOTIFY_RESUME, 1); > } > This is equivalent to smp_call_function(). J ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG 2009-11-25 19:28 ` Jeremy Fitzhardinge @ 2009-11-25 20:03 ` Ian Campbell 2009-11-25 20:32 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 30+ messages in thread From: Ian Campbell @ 2009-11-25 20:03 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Rafael J. Wysocki, Dan Magenheimer, Xen-Devel (E-mail), Gleixner, Thomas On Wed, 2009-11-25 at 19:28 +0000, Jeremy Fitzhardinge wrote: > On 11/25/09 06:12, Ian Campbell wrote: > > tick_resume() is never called on secondary processors. Presumably this > > is because they are offlined for suspend on native and so this is > > normally taken care of in the CPU onlining path. Under Xen we keep all > > CPUs online over a suspend. > > > > This patch papers over the issue for me but I will investigate a more > > generic, less hacky, way of doing to the same. > > > > tick_suspend is also only called on the boot CPU which I presume should > > be fixed too. > > > > Yep. I wonder how it ever worked? There's been a fair amount of change > in the PM code, so that could have changed things. I don't know if > there's a deep reason for not calling tick_resume() on all processors. > > Rafael, tglx: suspend/resume under Xen doesn't need to hot unplug all > the CPUs, so we don't; the hypervisor can manage the context > save/restore for all CPUs. Is there a deep reason why > timekeeping_resume() can't call the CLOCK_EVT_NOTIFY_RESUME notifier on > all online CPUs? Interrupts are disabled at that point where it currently calls the notifier, so none of the SMP function call primitives work. > > void xen_arch_resume(void) > > { > > - /* nothing */ > > + smp_call_function_many(cpu_online_mask, xen_vcpu_notify_restore, > > + (void *)CLOCK_EVT_NOTIFY_RESUME, 1); > > } > > > > This is equivalent to smp_call_function(). Oh yeah. Ian. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG 2009-11-25 20:03 ` Ian Campbell @ 2009-11-25 20:32 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 30+ messages in thread From: Jeremy Fitzhardinge @ 2009-11-25 20:32 UTC (permalink / raw) To: Ian Campbell Cc: Rafael J. Wysocki, Dan Magenheimer, Xen-Devel (E-mail), Thomas Gleixner On 11/25/09 12:03, Ian Campbell wrote: >> Yep. I wonder how it ever worked? There's been a fair amount of change >> in the PM code, so that could have changed things. I don't know if >> there's a deep reason for not calling tick_resume() on all processors. >> >> Rafael, tglx: suspend/resume under Xen doesn't need to hot unplug all >> the CPUs, so we don't; the hypervisor can manage the context >> save/restore for all CPUs. Is there a deep reason why >> timekeeping_resume() can't call the CLOCK_EVT_NOTIFY_RESUME notifier on >> all online CPUs? >> > Interrupts are disabled at that point where it currently calls the > notifier, so none of the SMP function call primitives work. > That does make it pretty awkward. J ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] xen: improve error handling in do_suspend. 2009-11-25 14:12 ` Ian Campbell 2009-11-25 19:28 ` Jeremy Fitzhardinge @ 2009-12-01 11:47 ` Ian Campbell 2009-12-01 11:47 ` [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region Ian Campbell 2 siblings, 0 replies; 30+ messages in thread From: Ian Campbell @ 2009-12-01 11:47 UTC (permalink / raw) To: xen-devel; +Cc: Jeremy Fitzhardinge, Ian Campbell

The existing error handling has a few issues:

- If freeze_processes() fails it exits with shutting_down = SHUTDOWN_SUSPEND.
- If dpm_suspend_noirq() fails it exits without resuming xenbus.
- If stop_machine() fails it exits without resuming xenbus or calling dpm_resume_end().
- xs_suspend()/xs_resume() and dpm_suspend_noirq()/dpm_resume_noirq() were not nested in the obvious way.

Fix by ensuring each failure case goto's the correct label. Treat a failure of stop_machine() as a cancelled suspend in order to follow the correct resume path.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
---
 drivers/xen/manage.c |   20 +++++++++++---------
 1 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 7b69a1a..2fb7d39 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -86,32 +86,32 @@ static void do_suspend(void)
 	err = freeze_processes();
 	if (err) {
 		printk(KERN_ERR "xen suspend: freeze failed %d\n", err);
-		return;
+		goto out;
 	}
 #endif
 
 	err = dpm_suspend_start(PMSG_SUSPEND);
 	if (err) {
 		printk(KERN_ERR "xen suspend: dpm_suspend_start %d\n", err);
-		goto out;
+		goto out_thaw;
 	}
 
-	printk(KERN_DEBUG "suspending xenstore...\n");
-	xs_suspend();
-
 	err = dpm_suspend_noirq(PMSG_SUSPEND);
 	if (err) {
 		printk(KERN_ERR "dpm_suspend_noirq failed: %d\n", err);
-		goto resume_devices;
+		goto out_resume;
 	}
 
+	printk(KERN_DEBUG "suspending xenstore...\n");
+	xs_suspend();
+
 	err = stop_machine(xen_suspend, &cancelled, cpumask_of(0));
 
 	dpm_resume_noirq(PMSG_RESUME);
 
 	if (err) {
 		printk(KERN_ERR "failed to start xen_suspend: %d\n", err);
-		goto out;
+		cancelled = 1;
 	}
 
 	if (!cancelled) {
@@ -120,15 +120,17 @@ static void do_suspend(void)
 	} else
 		xs_suspend_cancel();
 
-resume_devices:
+out_resume:
 	dpm_resume_end(PMSG_RESUME);
 
 	/* Make sure timer events get retriggered on all CPUs */
 	clock_was_set();
-out:
+
+out_thaw:
 #ifdef CONFIG_PREEMPT
 	thaw_processes();
 #endif
+out:
 	shutting_down = SHUTDOWN_INVALID;
 }
 #endif /* CONFIG_PM_SLEEP */
-- 
1.5.6.5

^ permalink raw reply related [flat|nested] 30+ messages in thread
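[Editor's aside: the labelled-goto unwinding idiom the patch above restores is worth spelling out. This is a generic sketch with invented names, not the manage.c code itself: setup steps run in order, a failure at step i jumps to the label that undoes only the steps before i, and the success path falls through the same labels so teardown always happens in reverse order of setup.]

```c
static int released[3];
static int nreleased;

static int acquire(int step, int fail_at)
{
    return step == fail_at ? -1 : 0; /* pretend step 'fail_at' fails */
}

static void release(int step)
{
    released[nreleased++] = step; /* record teardown order */
}

int do_op(int fail_at)
{
    int err;

    err = acquire(0, fail_at);
    if (err)
        goto out;
    err = acquire(1, fail_at);
    if (err)
        goto out_release0;
    err = acquire(2, fail_at);
    if (err)
        goto out_release1;

    /* ... the actual work would happen here ... */

    release(2);
out_release1:
    release(1);
out_release0:
    release(0);
out:
    return err;
}
```

The patch's complaint was exactly that some error paths jumped to the wrong label, skipping teardown steps (xenbus resume, dpm_resume_end) that had matching setup already done.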
* [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region.
  2009-11-25 14:12         ` Ian Campbell
  2009-11-25 19:28           ` Jeremy Fitzhardinge
  2009-12-01 11:47         ` [PATCH] xen: improve error handling in do_suspend Ian Campbell
@ 2009-12-01 11:47         ` Ian Campbell
  2009-12-01 22:50           ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2009-12-01 11:47 UTC (permalink / raw)
  To: xen-devel; +Cc: Jeremy Fitzhardinge, Ian Campbell

I have observed cases where the implicit stop_machine_destroy() done by
stop_machine() hangs while destroying the workqueues, specifically in
kthread_stop(). This seems to be because timer ticks are not restarted
until after stop_machine() returns.

Fortunately stop_machine provides a facility to pre-create/post-destroy
the workqueues, so use this to ensure that workqueues are only destroyed
after everything is really up and running again.

I only actually observed this failure with 2.6.30. It seems that newer
kernels are somehow more robust against doing kthread_stop() without
timer interrupts (I tried some backports of some likely looking
candidates but did not track down the commit which added this
robustness). However this change seems like a reasonable belt-and-braces
thing to do.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
---
 drivers/xen/manage.c |   12 +++++++++++-
 1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 2fb7d39..c499793 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -79,6 +79,12 @@ static void do_suspend(void)

 	shutting_down = SHUTDOWN_SUSPEND;

+	err = stop_machine_create();
+	if (err) {
+		printk(KERN_ERR "xen suspend: failed to setup stop_machine %d\n", err);
+		goto out;
+	}
+
 #ifdef CONFIG_PREEMPT
 	/* If the kernel is preemptible, we need to freeze all the processes
 	   to prevent them from being in the middle of a pagetable update
@@ -86,7 +92,7 @@
 	err = freeze_processes();
 	if (err) {
 		printk(KERN_ERR "xen suspend: freeze failed %d\n", err);
-		goto out;
+		goto out_destroy_sm;
 	}
 #endif
@@ -129,7 +135,11 @@ out_resume:

 out_thaw:
 #ifdef CONFIG_PREEMPT
 	thaw_processes();
+
+out_destroy_sm:
 #endif
+	stop_machine_destroy();
+
 out:
 	shutting_down = SHUTDOWN_INVALID;
 }
--
1.5.6.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread
* Re: [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region.
  2009-12-01 11:47         ` [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region Ian Campbell
@ 2009-12-01 22:50           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 30+ messages in thread
From: Jeremy Fitzhardinge @ 2009-12-01 22:50 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel

On 12/01/09 03:47, Ian Campbell wrote:
> I have observed cases where the implicit stop_machine_destroy() done by
> stop_machine() hangs while destroying the workqueues, specifically in
> kthread_stop(). This seems to be because timer ticks are not restarted
> until after stop_machine() returns.
>

Thanks for these - applied.

    J

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG
  2009-11-10 10:08       ` pv 2.6.31 (kernel.org) and save/migrate fails, " Pasi Kärkkäinen
  2009-11-12 23:36         ` Jeremy Fitzhardinge
@ 2009-11-23 16:44         ` Ian Campbell
  2009-11-24 10:27           ` Ian Campbell
  1 sibling, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2009-11-23 16:44 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Dan Magenheimer, Xen-Devel (E-mail), Jeremy Fitzhardinge

On Tue, 2009-11-10 at 10:08 +0000, Pasi Kärkkäinen wrote:
> Hello,
>
> Jeremy: Here's a summary of these save/restore problems
> using an upstream Linux 2.6.31.5 PV guest.
>
> For me:
> - I can "xm save" + "xm restore" a UP guest, but I get a non-fatal
>   BUG in the guest kernel, see [1].
> - "xm save" fails for an SMP guest with "failed to get the suspend evtchn port", see [2].
>
> For Dan:
> - "xm save" works for a UP guest, but "xm restore" doesn't, giving
>   infinite xen_sched_clock related dumps in the guest kernel, see [3].

The runstate fix I sent to the list last week should help with this one.

> - "xm save" for an SMP guest fails; it never ends. I suspect this
>   is the same problem I'm seeing.

I'm seeing this (or something very like it) too. At the moment it looks
as if drivers/xen/manage.c:do_suspend is getting as far as the
stop_machine() call but I am never seeing the xen_suspend() callback run.

Ian.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG
  2009-11-23 16:44         ` pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG Ian Campbell
@ 2009-11-24 10:27           ` Ian Campbell
  0 siblings, 0 replies; 30+ messages in thread
From: Ian Campbell @ 2009-11-24 10:27 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Dan Magenheimer, Xen-Devel (E-mail), Jeremy Fitzhardinge

On Mon, 2009-11-23 at 16:44 +0000, Ian Campbell wrote:
> On Tue, 2009-11-10 at 10:08 +0000, Pasi Kärkkäinen wrote:
> > Hello,
> >
> > Jeremy: Here's a summary of these save/restore problems
> > using an upstream Linux 2.6.31.5 PV guest.
> >
> > For me:
> > - I can "xm save" + "xm restore" a UP guest, but I get a non-fatal
> >   BUG in the guest kernel, see [1].
> > - "xm save" fails for an SMP guest with "failed to get the suspend evtchn port", see [2].
> >
> > For Dan:
> > - "xm save" works for a UP guest, but "xm restore" doesn't, giving
> >   infinite xen_sched_clock related dumps in the guest kernel, see [3].
>
> The runstate fix I sent to the list last week should help with this one.
>
> > - "xm save" for an SMP guest fails; it never ends. I suspect this
> >   is the same problem I'm seeing.
>
> I'm seeing this (or something very like it) too. At the moment it looks
> as if drivers/xen/manage.c:do_suspend is getting as far as the
> stop_machine() call but I am never seeing the xen_suspend() callback run.

See "xen: register timer interrupt with IRQF_TIMER" that I just sent to
the list for the fix.

Ian.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate, domU BUG()
  2009-11-08 16:54     ` Dan Magenheimer
  2009-11-08 17:27       ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen
@ 2009-11-12 23:21       ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 30+ messages in thread
From: Jeremy Fitzhardinge @ 2009-11-12 23:21 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail), Jan Beulich, Keir Fraser

On 11/08/09 08:54, Dan Magenheimer wrote:
> Mine was SMP... switching to UP I can now save. BUT...
> restore doesn't seem to quite work. The restore completes
> but I get no response from the VNC console. When I
> use a tty console, after restore, I am getting
> an infinite dump of
>
> WARNING: at arch/x86/time.c:180 xen_sched_clock+0x2b
>

That means the check saw that the CPU it's currently running on is not
currently running according to Xen... It's hard to imagine how it got
into that state...

    J

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: pv 2.6.31 (kernel.org) and save/migrate
  2009-11-06 20:37 ` Pasi Kärkkäinen
  2009-11-06 22:27   ` Dan Magenheimer
@ 2009-11-07  0:19   ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 30+ messages in thread
From: Jeremy Fitzhardinge @ 2009-11-07 0:19 UTC (permalink / raw)
  To: Pasi Kärkkäinen; +Cc: Dan Magenheimer, Xen-Devel (E-mail)

On 11/06/09 12:37, Pasi Kärkkäinen wrote:
> On Fri, Nov 06, 2009 at 10:37:49AM -0800, Dan Magenheimer wrote:
>
>> Sorry for another possibly stupid question:
>>
>> I've observed that for a pv domain that's been updated
>> to a 2.6.31 kernel (straight from kernel.org), "xm save"
>> never completes. When the older kernel (2.6.18)
>> is booted, "xm save" works fine. Is this a known problem...
>> or perhaps xm save has never worked with an upstream pv
>> kernel and I've never noticed?
>>
>> I'd assume migrate and live migrate would fail also but
>> haven't tried them.
>>
>
> Just checking.. are you running the latest 2.6.31.5? I think there have
> been multiple Xen-related bugfixes in the 2.6.31.x releases.
>

Nothing relating to save/restore. Does it work for you?

    J

^ permalink raw reply	[flat|nested] 30+ messages in thread
end of thread, other threads:[~2009-12-01 22:50 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-06 18:37 pv 2.6.31 (kernel.org) and save/migrate Dan Magenheimer
2009-11-06 20:37 ` Pasi Kärkkäinen
2009-11-06 22:27   ` Dan Magenheimer
2009-11-06 22:30     ` Pasi Kärkkäinen
2009-11-07  0:08       ` Dan Magenheimer
2009-11-07 11:09         ` Pasi Kärkkäinen
2009-11-07 15:32           ` Dan Magenheimer
2009-11-08 14:17             ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Pasi Kärkkäinen
2009-11-08 14:20               ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen
2009-11-08 15:29                 ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Dan Magenheimer
2009-11-08 15:41                   ` Pasi Kärkkäinen
2009-11-08 16:48                     ` Pasi Kärkkäinen
2009-11-12 23:16                       ` Jeremy Fitzhardinge
2009-11-12 23:22                         ` Brendan Cully
2009-11-08 16:54                   ` Dan Magenheimer
2009-11-08 17:27                     ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG Pasi Kärkkäinen
2009-11-10 10:08                       ` pv 2.6.31 (kernel.org) and save/migrate fails, " Pasi Kärkkäinen
2009-11-12 23:36                         ` Jeremy Fitzhardinge
2009-11-24 14:27                           ` Ian Campbell
2009-11-25 14:12                             ` Ian Campbell
2009-11-25 19:28                               ` Jeremy Fitzhardinge
2009-11-25 20:03                                 ` Ian Campbell
2009-11-25 20:32                                   ` Jeremy Fitzhardinge
2009-12-01 11:47                             ` [PATCH] xen: improve error handling in do_suspend Ian Campbell
2009-12-01 11:47                             ` [PATCH] xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region Ian Campbell
2009-12-01 22:50                               ` Jeremy Fitzhardinge
2009-11-23 16:44                         ` pv 2.6.31 (kernel.org) and save/migrate fails, domU BUG Ian Campbell
2009-11-24 10:27                           ` Ian Campbell
2009-11-12 23:21                     ` pv 2.6.31 (kernel.org) and save/migrate, domU BUG() Jeremy Fitzhardinge
2009-11-07  0:19   ` pv 2.6.31 (kernel.org) and save/migrate Jeremy Fitzhardinge