From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andre Przywara
Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split
Date: Fri, 28 Jan 2011 14:05:38 +0100
Message-ID: <4D42BF22.60201@amd.com>
References: <4D41FD3A.5090506@amd.com> <4D426673.7020200@ts.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: George Dunlap
Cc: Keir Fraser, Juergen Gross, "xen-devel@lists.xensource.com", Ian Jackson
List-Id: xen-devel@lists.xenproject.org

George Dunlap wrote:
> Hmm, strange... looks like it has something to do with the code which
> keeps track of which vcpus are earning credits. You say this is done
> immediately after boot, with no VMs running other than dom0?

Right, after Dom0's prompt I just start xl cpupool-numa-split and the
machine crashes.

>
> What are the dom0_max_vcpus and dom0_mem settings required to make it work?

dom0_mem=8192M dom0_max_vcpus=6: works
dom0_mem=8192M: works
dom0_max_vcpus=6: works
(no settings): crashes
dom0_mem=20480M dom0_max_vcpus=8: works

The machine has 8 nodes with 6 CPUs each; the nodes have alternating
16GB and 8GB of memory (four 12-core (MCM aka dual-node) Opterons with
96GB RAM in total).

If I try to reproduce the actions of xl cpupool-numa-split via a shell
script, it also crashes, just before the creation of the last pool.
I will insert some instrumentation into the code to find the offending
action.

Regards,
Andre.

> On Fri, Jan 28, 2011 at 6:47 AM, Juergen Gross wrote:
>> On 01/28/11 00:18, Andre Przywara wrote:
>>> Hi,
>>>
>>> when I boot my machine without restricting Dom0 (dom0_mem=
>>> dom0_max_vcpus=) I get a _hypervisor_ crash when I run
>>> # xl cpupool-numa-split
>>> If Dom0's resources are limited on the Xen cmdline, everything works fine.
>>> The crashdump points to a scheduling problem with weights, so I assume
>>> the NUMA distribution algorithm somehow fools the hypervisor completely.
>>>
>>> I will investigate this further tomorrow, but maybe someone has a
>>> good idea.
>> I've seen this once with an older cpupool version on a 24-processor machine.
>> It was NOT related to NUMA, but occurred only on reboot after a Dom0 panic.
>> The machine had an init script creating a cpupool and populating it with
>> cpus. The machine was then in a panic loop due to the BUG in sched_acct
>> until it was reset manually. After the reset the problem was gone.
>>
>> As I was never able to reproduce the problem later (the same software is
>> running on dozens of machines!), I assumed there was a problem related to
>> the first Dom0 panic, maybe some destroyed BIOS tables.
>>
>> Can the crash be reproduced easily?
>>
>>
>> Juergen
>>
>>> Regards,
>>> Andre.
>>>
>>> root@dosorca:/data/images# xl cpupool-numa-split
>>> (XEN) Xen BUG at sched_credit.c:990
>>> (XEN) ----[ Xen-4.1.0-rc2-pre x86_64 debug=y Not tainted ]----
>>> (XEN) CPU: 0
>>> (XEN) RIP: e008:[] csched_acct+0x11f/0x419
>>> (XEN) RFLAGS: 0000000000010006 CONTEXT: hypervisor
>>> (XEN) rax: 0000000000000010 rbx: 0000000000000f00 rcx: 0000000000000100
>>> (XEN) rdx: 0000000000001000 rsi: ffff830437ffa600 rdi: 0000000000000010
>>> (XEN) rbp: ffff82c480297e10 rsp: ffff82c480297d80 r8: 0000000000000100
>>> (XEN) r9: 0000000000000006 r10: ffff82c4802d4100 r11: 000000afc7df0edf
>>> (XEN) r12: ffff830437ffa5e0 r13: ffff82c480117fd9 r14: ffff830437f9f2e8
>>> (XEN) r15: ffff830434321ec0 cr0: 000000008005003b cr4: 00000000000006f0
>>> (XEN) cr3: 000000080df4e000 cr2: ffff88179af79618
>>> (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
>>> (XEN) Xen stack trace from rsp=ffff82c480297d80:
>>> (XEN) 0000000000000282 fffffed4802d3f80 0000000000000eff ffff830437ffa5e0
>>> (XEN) ffff830437ffa5e8 ffff830437ffa870 ffff830437ffa5e0 0000000000000282
>>> (XEN) ffff830437ffa5e8 00002a3037ffa870 00000f0000000f00 0000000000000000
>>> (XEN) ffff82c400000000 ffff82c4802d3f80 ffff830437ffa5e0 ffff82c480117fd9
>>> (XEN) ffff830437f9f2e8 ffff830437f9f2e0 ffff82c480297e40 ffff82c480125f34
>>> (XEN) 0000000000000002 ffff830437ffa600 ffff82c4802d3f80 000000afb6f8667f
>>> (XEN) ffff82c480297e90 ffff82c480126259 ffff82c48024ae20 ffff82c4802d3f80
>>> (XEN) ffff830437f9f2e0 0000000000000000 0000000000000000 ffff82c4802b0880
>>> (XEN) ffff82c480297f18 ffffffffffffffff ffff82c480297ed0 ffff82c480123327
>>> (XEN) ffff82c4802d4a00 ffff82c480297f18 ffff82c48024ae20 ffff82c480297f18
>>> (XEN) 000000afb6abd652 ffff82c4802d3ec0 ffff82c480297ee0 ffff82c4801233a2
>>> (XEN) ffff82c480297f10 ffff82c4801563f5 0000000000000000 ffff8300c7cd6000
>>> (XEN) 0000000000000000 ffff8300c7ad4000 ffff82c480297d48 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 ffffffff81a69060 ffff8817a8503f10
>>> (XEN) ffff8817a8503fd8 0000000000000246 ffff8817a8503e80 ffff880000000001
>>> (XEN) 0000000000000000 0000000000000000 ffffffff810093aa 000000aafab2f86e
>>> (XEN) 00000000deadbeef 00000000deadbeef 0000010000000000 ffffffff810093aa
>>> (XEN) 000000000000e033 0000000000000246 ffff8817a8503ef8 000000000000e02b
>>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 ffff8300c7cd6000 0000000000000000 0000000000000000
>>> (XEN) Xen call trace:
>>> (XEN) [] csched_acct+0x11f/0x419
>>> (XEN) [] execute_timer+0x4e/0x6c
>>> (XEN) [] timer_softirq_action+0xf2/0x245
>>> (XEN) [] __do_softirq+0x88/0x99
>>> (XEN) [] do_softirq+0x6a/0x7a
>>> (XEN) [] idle_loop+0x6a/0x6f
>>> (XEN)
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 0:
>>> (XEN) Xen BUG at sched_credit.c:990
>>> (XEN) ****************************************
>>> (XEN)
>>> (XEN) Reboot in five seconds...
>>>
>>>
>>
>> --
>> Juergen Gross                 Principal Developer Operating Systems
>> TSP ES&S SWE OS6              Telephone: +49 (0) 89 3222 2967
>> Fujitsu Technology Solutions  e-mail: juergen.gross@ts.fujitsu.com
>> Domagkstr. 28                 Internet: ts.fujitsu.com
>> D-80807 Muenchen              Company details: ts.fujitsu.com/imprint.html
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>
>

-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712