* Hypervisor crash(!) on xl cpupool-numa-split
@ 2011-01-27 23:18 Andre Przywara
  2011-01-28  6:47 ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-01-27 23:18 UTC (permalink / raw)
  To: Keir Fraser, Ian Jackson, Juergen Gross; +Cc: xen-devel

Hi,

when I boot my machine without restricting Dom0 (dom0_mem= 
dom0_max_vcpus=) I get a _hypervisor_ crash when I run
# xl cpupool-numa-split
If Dom0's resources are limited on the Xen cmdline, everything works fine.
The crash dump points to a scheduling problem with weights, so I assume 
the NUMA distribution algorithm somehow fools the hypervisor completely.

I will investigate this further tomorrow, but maybe someone has a good 
idea.

Regards,
Andre.

root@dosorca:/data/images# xl cpupool-numa-split
(XEN) Xen BUG at sched_credit.c:990
(XEN) ----[ Xen-4.1.0-rc2-pre  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c4801180f8>] csched_acct+0x11f/0x419
(XEN) RFLAGS: 0000000000010006   CONTEXT: hypervisor
(XEN) rax: 0000000000000010   rbx: 0000000000000f00   rcx: 0000000000000100
(XEN) rdx: 0000000000001000   rsi: ffff830437ffa600   rdi: 0000000000000010
(XEN) rbp: ffff82c480297e10   rsp: ffff82c480297d80   r8:  0000000000000100
(XEN) r9:  0000000000000006   r10: ffff82c4802d4100   r11: 000000afc7df0edf
(XEN) r12: ffff830437ffa5e0   r13: ffff82c480117fd9   r14: ffff830437f9f2e8
(XEN) r15: ffff830434321ec0   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 000000080df4e000   cr2: ffff88179af79618
(XEN) ds: 002b   es: 002b   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c480297d80:
(XEN)    0000000000000282 fffffed4802d3f80 0000000000000eff ffff830437ffa5e0
(XEN)    ffff830437ffa5e8 ffff830437ffa870 ffff830437ffa5e0 0000000000000282
(XEN)    ffff830437ffa5e8 00002a3037ffa870 00000f0000000f00 0000000000000000
(XEN)    ffff82c400000000 ffff82c4802d3f80 ffff830437ffa5e0 ffff82c480117fd9
(XEN)    ffff830437f9f2e8 ffff830437f9f2e0 ffff82c480297e40 ffff82c480125f34
(XEN)    0000000000000002 ffff830437ffa600 ffff82c4802d3f80 000000afb6f8667f
(XEN)    ffff82c480297e90 ffff82c480126259 ffff82c48024ae20 ffff82c4802d3f80
(XEN)    ffff830437f9f2e0 0000000000000000 0000000000000000 ffff82c4802b0880
(XEN)    ffff82c480297f18 ffffffffffffffff ffff82c480297ed0 ffff82c480123327
(XEN)    ffff82c4802d4a00 ffff82c480297f18 ffff82c48024ae20 ffff82c480297f18
(XEN)    000000afb6abd652 ffff82c4802d3ec0 ffff82c480297ee0 ffff82c4801233a2
(XEN)    ffff82c480297f10 ffff82c4801563f5 0000000000000000 ffff8300c7cd6000
(XEN)    0000000000000000 ffff8300c7ad4000 ffff82c480297d48 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffffffff81a69060 ffff8817a8503f10
(XEN)    ffff8817a8503fd8 0000000000000246 ffff8817a8503e80 ffff880000000001
(XEN)    0000000000000000 0000000000000000 ffffffff810093aa 000000aafab2f86e
(XEN)    00000000deadbeef 00000000deadbeef 0000010000000000 ffffffff810093aa
(XEN)    000000000000e033 0000000000000246 ffff8817a8503ef8 000000000000e02b
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 ffff8300c7cd6000 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c4801180f8>] csched_acct+0x11f/0x419
(XEN)    [<ffff82c480125f34>] execute_timer+0x4e/0x6c
(XEN)    [<ffff82c480126259>] timer_softirq_action+0xf2/0x245
(XEN)    [<ffff82c480123327>] __do_softirq+0x88/0x99
(XEN)    [<ffff82c4801233a2>] do_softirq+0x6a/0x7a
(XEN)    [<ffff82c4801563f5>] idle_loop+0x6a/0x6f
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at sched_credit.c:990
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...


-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712


* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-01-27 23:18 Hypervisor crash(!) on xl cpupool-numa-split Andre Przywara
@ 2011-01-28  6:47 ` Juergen Gross
  2011-01-28 11:07   ` Andre Przywara
  2011-01-28 11:13   ` George Dunlap
  0 siblings, 2 replies; 53+ messages in thread
From: Juergen Gross @ 2011-01-28  6:47 UTC (permalink / raw)
  To: Andre Przywara; +Cc: xen-devel, Ian Jackson, Keir Fraser

On 01/28/11 00:18, Andre Przywara wrote:
> Hi,
>
> when I boot my machine without restricting Dom0 (dom0_mem=
> dom0_max_vcpus=) I get an _hypervisor_ crash when I run
> # xl cpupool-numa-split
> If Dom0's resources are limited on the Xen cmdline, everything works fine.
> The crashdump points to a scheduling problem with weights, so I assume
> the NUMA distribution algorithm some fools the hypervisor completely.
>
> I will investigate this further tomorrow, but maybe someone has some
> good idea.

I've seen this once with an older cpupool version on a 24 processor machine.
It was NOT related to NUMA, but did occur only on reboot after a Dom0 panic.
The machine had an init script creating a cpupool and populating it with
cpus. The machine was then stuck in a panic loop due to the BUG in sched_acct
until it was reset manually. After the reset the problem was gone.

As I was never able to reproduce the problem later (the same software is
running on dozens of machines!), I assumed there was a problem related to
the first Dom0 panic, maybe some corrupted BIOS tables.

Can the crash be reproduced easily?


Juergen

>
> Regards,
> Andre.
>
> root@dosorca:/data/images# xl cpupool-numa-split
> (XEN) Xen BUG at sched_credit.c:990
> (XEN) ----[ Xen-4.1.0-rc2-pre x86_64 debug=y Not tainted ]----
> (XEN) CPU: 0
> (XEN) RIP: e008:[<ffff82c4801180f8>] csched_acct+0x11f/0x419
> (XEN) RFLAGS: 0000000000010006 CONTEXT: hypervisor
> (XEN) rax: 0000000000000010 rbx: 0000000000000f00 rcx: 0000000000000100
> (XEN) rdx: 0000000000001000 rsi: ffff830437ffa600 rdi: 0000000000000010
> (XEN) rbp: ffff82c480297e10 rsp: ffff82c480297d80 r8: 0000000000000100
> (XEN) r9: 0000000000000006 r10: ffff82c4802d4100 r11: 000000afc7df0edf
> (XEN) r12: ffff830437ffa5e0 r13: ffff82c480117fd9 r14: ffff830437f9f2e8
> (XEN) r15: ffff830434321ec0 cr0: 000000008005003b cr4: 00000000000006f0
> (XEN) cr3: 000000080df4e000 cr2: ffff88179af79618
> (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) Xen stack trace from rsp=ffff82c480297d80:
> (XEN) 0000000000000282 fffffed4802d3f80 0000000000000eff ffff830437ffa5e0
> (XEN) ffff830437ffa5e8 ffff830437ffa870 ffff830437ffa5e0 0000000000000282
> (XEN) ffff830437ffa5e8 00002a3037ffa870 00000f0000000f00 0000000000000000
> (XEN) ffff82c400000000 ffff82c4802d3f80 ffff830437ffa5e0 ffff82c480117fd9
> (XEN) ffff830437f9f2e8 ffff830437f9f2e0 ffff82c480297e40 ffff82c480125f34
> (XEN) 0000000000000002 ffff830437ffa600 ffff82c4802d3f80 000000afb6f8667f
> (XEN) ffff82c480297e90 ffff82c480126259 ffff82c48024ae20 ffff82c4802d3f80
> (XEN) ffff830437f9f2e0 0000000000000000 0000000000000000 ffff82c4802b0880
> (XEN) ffff82c480297f18 ffffffffffffffff ffff82c480297ed0 ffff82c480123327
> (XEN) ffff82c4802d4a00 ffff82c480297f18 ffff82c48024ae20 ffff82c480297f18
> (XEN) 000000afb6abd652 ffff82c4802d3ec0 ffff82c480297ee0 ffff82c4801233a2
> (XEN) ffff82c480297f10 ffff82c4801563f5 0000000000000000 ffff8300c7cd6000
> (XEN) 0000000000000000 ffff8300c7ad4000 ffff82c480297d48 0000000000000000
> (XEN) 0000000000000000 0000000000000000 ffffffff81a69060 ffff8817a8503f10
> (XEN) ffff8817a8503fd8 0000000000000246 ffff8817a8503e80 ffff880000000001
> (XEN) 0000000000000000 0000000000000000 ffffffff810093aa 000000aafab2f86e
> (XEN) 00000000deadbeef 00000000deadbeef 0000010000000000 ffffffff810093aa
> (XEN) 000000000000e033 0000000000000246 ffff8817a8503ef8 000000000000e02b
> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) 0000000000000000 ffff8300c7cd6000 0000000000000000 0000000000000000
> (XEN) Xen call trace:
> (XEN) [<ffff82c4801180f8>] csched_acct+0x11f/0x419
> (XEN) [<ffff82c480125f34>] execute_timer+0x4e/0x6c
> (XEN) [<ffff82c480126259>] timer_softirq_action+0xf2/0x245
> (XEN) [<ffff82c480123327>] __do_softirq+0x88/0x99
> (XEN) [<ffff82c4801233a2>] do_softirq+0x6a/0x7a
> (XEN) [<ffff82c4801563f5>] idle_loop+0x6a/0x6f
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) Xen BUG at sched_credit.c:990
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...
>
>


-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html


* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-01-28  6:47 ` Juergen Gross
@ 2011-01-28 11:07   ` Andre Przywara
  2011-01-28 11:44     ` Juergen Gross
  2011-01-28 11:13   ` George Dunlap
  1 sibling, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-01-28 11:07 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel, Ian Jackson, Keir Fraser

Juergen Gross wrote:
> On 01/28/11 00:18, Andre Przywara wrote:
>> Hi,
>>
>> when I boot my machine without restricting Dom0 (dom0_mem=
>> dom0_max_vcpus=) I get an _hypervisor_ crash when I run
>> # xl cpupool-numa-split
>> If Dom0's resources are limited on the Xen cmdline, everything works fine.
>> The crashdump points to a scheduling problem with weights, so I assume
>> the NUMA distribution algorithm some fools the hypervisor completely.
>>
>> I will investigate this further tomorrow, but maybe someone has some
>> good idea.
> 
> I've seen this once with an older cpupool version on a 24 processor machine.
> It was NOT related to NUMA, but did occur only on reboot after a Dom0 panic.
> The machine had an init script creating a cpupool and populating it with
> cpus. The machine was in a panic loop due to the BUG in sched_acct then until
> it was resetted manually. After the reset the problem was gone.
> 
> As I was never able to reproduce the problem later (the same software is
> running on dozens of machines!), I assumed there was a problem related to
> the first Dom0 panic, may be some destroyed BIOS tables.
> 
> Can the crash be reproduced easily?
Yes.
If I don't specify dom0_max_vcpus= and dom0_mem= on the Xen cmdline, I 
can reliably trigger the crash with xl cpupool-numa-split.
Omitting only dom0_max_vcpus= does not suffice to trigger it.

Will continue after lunch-break ;-)

Regards,
Andre.

> 
> 
> Juergen
> 
>> Regards,
>> Andre.
>>
>> root@dosorca:/data/images# xl cpupool-numa-split
>> (XEN) Xen BUG at sched_credit.c:990
>> (XEN) ----[ Xen-4.1.0-rc2-pre x86_64 debug=y Not tainted ]----
>> (XEN) CPU: 0
>> (XEN) RIP: e008:[<ffff82c4801180f8>] csched_acct+0x11f/0x419
>> (XEN) RFLAGS: 0000000000010006 CONTEXT: hypervisor
>> (XEN) rax: 0000000000000010 rbx: 0000000000000f00 rcx: 0000000000000100
>> (XEN) rdx: 0000000000001000 rsi: ffff830437ffa600 rdi: 0000000000000010
>> (XEN) rbp: ffff82c480297e10 rsp: ffff82c480297d80 r8: 0000000000000100
>> (XEN) r9: 0000000000000006 r10: ffff82c4802d4100 r11: 000000afc7df0edf
>> (XEN) r12: ffff830437ffa5e0 r13: ffff82c480117fd9 r14: ffff830437f9f2e8
>> (XEN) r15: ffff830434321ec0 cr0: 000000008005003b cr4: 00000000000006f0
>> (XEN) cr3: 000000080df4e000 cr2: ffff88179af79618
>> (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
>> (XEN) Xen stack trace from rsp=ffff82c480297d80:
>> (XEN) 0000000000000282 fffffed4802d3f80 0000000000000eff ffff830437ffa5e0
>> (XEN) ffff830437ffa5e8 ffff830437ffa870 ffff830437ffa5e0 0000000000000282
>> (XEN) ffff830437ffa5e8 00002a3037ffa870 00000f0000000f00 0000000000000000
>> (XEN) ffff82c400000000 ffff82c4802d3f80 ffff830437ffa5e0 ffff82c480117fd9
>> (XEN) ffff830437f9f2e8 ffff830437f9f2e0 ffff82c480297e40 ffff82c480125f34
>> (XEN) 0000000000000002 ffff830437ffa600 ffff82c4802d3f80 000000afb6f8667f
>> (XEN) ffff82c480297e90 ffff82c480126259 ffff82c48024ae20 ffff82c4802d3f80
>> (XEN) ffff830437f9f2e0 0000000000000000 0000000000000000 ffff82c4802b0880
>> (XEN) ffff82c480297f18 ffffffffffffffff ffff82c480297ed0 ffff82c480123327
>> (XEN) ffff82c4802d4a00 ffff82c480297f18 ffff82c48024ae20 ffff82c480297f18
>> (XEN) 000000afb6abd652 ffff82c4802d3ec0 ffff82c480297ee0 ffff82c4801233a2
>> (XEN) ffff82c480297f10 ffff82c4801563f5 0000000000000000 ffff8300c7cd6000
>> (XEN) 0000000000000000 ffff8300c7ad4000 ffff82c480297d48 0000000000000000
>> (XEN) 0000000000000000 0000000000000000 ffffffff81a69060 ffff8817a8503f10
>> (XEN) ffff8817a8503fd8 0000000000000246 ffff8817a8503e80 ffff880000000001
>> (XEN) 0000000000000000 0000000000000000 ffffffff810093aa 000000aafab2f86e
>> (XEN) 00000000deadbeef 00000000deadbeef 0000010000000000 ffffffff810093aa
>> (XEN) 000000000000e033 0000000000000246 ffff8817a8503ef8 000000000000e02b
>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> (XEN) 0000000000000000 ffff8300c7cd6000 0000000000000000 0000000000000000
>> (XEN) Xen call trace:
>> (XEN) [<ffff82c4801180f8>] csched_acct+0x11f/0x419
>> (XEN) [<ffff82c480125f34>] execute_timer+0x4e/0x6c
>> (XEN) [<ffff82c480126259>] timer_softirq_action+0xf2/0x245
>> (XEN) [<ffff82c480123327>] __do_softirq+0x88/0x99
>> (XEN) [<ffff82c4801233a2>] do_softirq+0x6a/0x7a
>> (XEN) [<ffff82c4801563f5>] idle_loop+0x6a/0x6f
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) Xen BUG at sched_credit.c:990
>> (XEN) ****************************************
>> (XEN)
>> (XEN) Reboot in five seconds...
>>
>>
> 
> 


-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712


* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-01-28  6:47 ` Juergen Gross
  2011-01-28 11:07   ` Andre Przywara
@ 2011-01-28 11:13   ` George Dunlap
  2011-01-28 13:05     ` Andre Przywara
  1 sibling, 1 reply; 53+ messages in thread
From: George Dunlap @ 2011-01-28 11:13 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Andre Przywara, xen-devel, Ian Jackson, Keir Fraser

Hmm, strange... looks like it has something to do with the code which
keeps track of which vcpus are earning credits.  You say this is done
immediately after boot, with no VMs running other than dom0?

What are the dom0_max_vcpus and dom0_mem settings required to make it work?

 -George

On Fri, Jan 28, 2011 at 6:47 AM, Juergen Gross
<juergen.gross@ts.fujitsu.com> wrote:
> On 01/28/11 00:18, Andre Przywara wrote:
>>
>> Hi,
>>
>> when I boot my machine without restricting Dom0 (dom0_mem=
>> dom0_max_vcpus=) I get an _hypervisor_ crash when I run
>> # xl cpupool-numa-split
>> If Dom0's resources are limited on the Xen cmdline, everything works fine.
>> The crashdump points to a scheduling problem with weights, so I assume
>> the NUMA distribution algorithm some fools the hypervisor completely.
>>
>> I will investigate this further tomorrow, but maybe someone has some
>> good idea.
>
> I've seen this once with an older cpupool version on a 24 processor machine.
> It was NOT related to NUMA, but did occur only on reboot after a Dom0 panic.
> The machine had an init script creating a cpupool and populating it with
> cpus. The machine was in a panic loop due to the BUG in sched_acct then
> until
> it was resetted manually. After the reset the problem was gone.
>
> As I was never able to reproduce the problem later (the same software is
> running on dozens of machines!), I assumed there was a problem related to
> the first Dom0 panic, may be some destroyed BIOS tables.
>
> Can the crash be reproduced easily?
>
>
> Juergen
>
>>
>> Regards,
>> Andre.
>>
>> root@dosorca:/data/images# xl cpupool-numa-split
>> (XEN) Xen BUG at sched_credit.c:990
>> (XEN) ----[ Xen-4.1.0-rc2-pre x86_64 debug=y Not tainted ]----
>> (XEN) CPU: 0
>> (XEN) RIP: e008:[<ffff82c4801180f8>] csched_acct+0x11f/0x419
>> (XEN) RFLAGS: 0000000000010006 CONTEXT: hypervisor
>> (XEN) rax: 0000000000000010 rbx: 0000000000000f00 rcx: 0000000000000100
>> (XEN) rdx: 0000000000001000 rsi: ffff830437ffa600 rdi: 0000000000000010
>> (XEN) rbp: ffff82c480297e10 rsp: ffff82c480297d80 r8: 0000000000000100
>> (XEN) r9: 0000000000000006 r10: ffff82c4802d4100 r11: 000000afc7df0edf
>> (XEN) r12: ffff830437ffa5e0 r13: ffff82c480117fd9 r14: ffff830437f9f2e8
>> (XEN) r15: ffff830434321ec0 cr0: 000000008005003b cr4: 00000000000006f0
>> (XEN) cr3: 000000080df4e000 cr2: ffff88179af79618
>> (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
>> (XEN) Xen stack trace from rsp=ffff82c480297d80:
>> (XEN) 0000000000000282 fffffed4802d3f80 0000000000000eff ffff830437ffa5e0
>> (XEN) ffff830437ffa5e8 ffff830437ffa870 ffff830437ffa5e0 0000000000000282
>> (XEN) ffff830437ffa5e8 00002a3037ffa870 00000f0000000f00 0000000000000000
>> (XEN) ffff82c400000000 ffff82c4802d3f80 ffff830437ffa5e0 ffff82c480117fd9
>> (XEN) ffff830437f9f2e8 ffff830437f9f2e0 ffff82c480297e40 ffff82c480125f34
>> (XEN) 0000000000000002 ffff830437ffa600 ffff82c4802d3f80 000000afb6f8667f
>> (XEN) ffff82c480297e90 ffff82c480126259 ffff82c48024ae20 ffff82c4802d3f80
>> (XEN) ffff830437f9f2e0 0000000000000000 0000000000000000 ffff82c4802b0880
>> (XEN) ffff82c480297f18 ffffffffffffffff ffff82c480297ed0 ffff82c480123327
>> (XEN) ffff82c4802d4a00 ffff82c480297f18 ffff82c48024ae20 ffff82c480297f18
>> (XEN) 000000afb6abd652 ffff82c4802d3ec0 ffff82c480297ee0 ffff82c4801233a2
>> (XEN) ffff82c480297f10 ffff82c4801563f5 0000000000000000 ffff8300c7cd6000
>> (XEN) 0000000000000000 ffff8300c7ad4000 ffff82c480297d48 0000000000000000
>> (XEN) 0000000000000000 0000000000000000 ffffffff81a69060 ffff8817a8503f10
>> (XEN) ffff8817a8503fd8 0000000000000246 ffff8817a8503e80 ffff880000000001
>> (XEN) 0000000000000000 0000000000000000 ffffffff810093aa 000000aafab2f86e
>> (XEN) 00000000deadbeef 00000000deadbeef 0000010000000000 ffffffff810093aa
>> (XEN) 000000000000e033 0000000000000246 ffff8817a8503ef8 000000000000e02b
>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> (XEN) 0000000000000000 ffff8300c7cd6000 0000000000000000 0000000000000000
>> (XEN) Xen call trace:
>> (XEN) [<ffff82c4801180f8>] csched_acct+0x11f/0x419
>> (XEN) [<ffff82c480125f34>] execute_timer+0x4e/0x6c
>> (XEN) [<ffff82c480126259>] timer_softirq_action+0xf2/0x245
>> (XEN) [<ffff82c480123327>] __do_softirq+0x88/0x99
>> (XEN) [<ffff82c4801233a2>] do_softirq+0x6a/0x7a
>> (XEN) [<ffff82c4801563f5>] idle_loop+0x6a/0x6f
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) Xen BUG at sched_credit.c:990
>> (XEN) ****************************************
>> (XEN)
>> (XEN) Reboot in five seconds...
>>
>>
>
>
> --
> Juergen Gross                 Principal Developer Operating Systems
> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
> Fujitsu Technology Solutions              e-mail:
> juergen.gross@ts.fujitsu.com
> Domagkstr. 28                           Internet: ts.fujitsu.com
> D-80807 Muenchen                 Company details:
> ts.fujitsu.com/imprint.html
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>


* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-01-28 11:07   ` Andre Przywara
@ 2011-01-28 11:44     ` Juergen Gross
  2011-01-28 13:14       ` Andre Przywara
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-01-28 11:44 UTC (permalink / raw)
  To: Andre Przywara; +Cc: xen-devel, Ian Jackson, Keir Fraser

On 01/28/11 12:07, Andre Przywara wrote:
> Juergen Gross wrote:
>> On 01/28/11 00:18, Andre Przywara wrote:
>>> Hi,
>>>
>>> when I boot my machine without restricting Dom0 (dom0_mem=
>>> dom0_max_vcpus=) I get an _hypervisor_ crash when I run
>>> # xl cpupool-numa-split
>>> If Dom0's resources are limited on the Xen cmdline, everything works
>>> fine.
>>> The crashdump points to a scheduling problem with weights, so I assume
>>> the NUMA distribution algorithm some fools the hypervisor completely.
>>>
>>> I will investigate this further tomorrow, but maybe someone has some
>>> good idea.
>>
>> I've seen this once with an older cpupool version on a 24 processor
>> machine.
>> It was NOT related to NUMA, but did occur only on reboot after a Dom0
>> panic.
>> The machine had an init script creating a cpupool and populating it with
>> cpus. The machine was in a panic loop due to the BUG in sched_acct
>> then until
>> it was resetted manually. After the reset the problem was gone.
>>
>> As I was never able to reproduce the problem later (the same software is
>> running on dozens of machines!), I assumed there was a problem related to
>> the first Dom0 panic, may be some destroyed BIOS tables.
>>
>> Can the crash be reproduced easily?
> Yes.
> If I don't specify dom0_max_vcpus= and dom0_mem= on the Xen cmdline, I
> can reliably trigger the crash with xl cpupool-numa-split.
> Omitting dom0_max_vcpus only does not suffice.

Do I understand correctly?
No crash with only dom0_max_vcpus= and no crash with only dom0_mem= ?

Could you try this patch?

diff -r b59f04eb8978 xen/common/schedule.c
--- a/xen/common/schedule.c     Fri Jan 21 18:06:23 2011 +0000
+++ b/xen/common/schedule.c     Fri Jan 28 12:42:46 2011 +0100
@@ -1301,7 +1301,9 @@ void schedule_cpu_switch(unsigned int cp

      idle = idle_vcpu[cpu];
      ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
+    BUG_ON(ppriv == NULL);
      vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
+    BUG_ON(vpriv == NULL);

      pcpu_schedule_lock_irqsave(cpu, flags);



-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html


* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-01-28 11:13   ` George Dunlap
@ 2011-01-28 13:05     ` Andre Przywara
  0 siblings, 0 replies; 53+ messages in thread
From: Andre Przywara @ 2011-01-28 13:05 UTC (permalink / raw)
  To: George Dunlap; +Cc: Keir Fraser, Juergen Gross, xen-devel, Ian Jackson

George Dunlap wrote:
> Hmm, strange... looks like it has something to do with the code which
> keeps track of which vcpus are earning credits.  You say this is done
> immediately after boot, with no VMs running other than dom0?
Right, after Dom0's prompt I just start xl cpupool-numa-split and the 
machine crashes.
> 
> What are the dom0_max_vcpus and dom0_mem settings required to make it work?
dom0_mem=8192M dom0_max_vcpus=6: works
dom0_mem=8192M: works
dom0_max_vcpus=6: works
(no settings): crashes
dom0_mem=20480M dom0_max_vcpus=8: works
The machine has 8 nodes with 6 CPUs each; the nodes have alternating 16 GB 
and 8 GB of memory (4 twelve-core (MCM, aka dual-node) Opterons with 96 GB 
RAM in total).
If I try to reproduce the actions of xl cpupool-numa-split via a shell 
script, it also crashes, just before the creation of the last pool. I will 
insert some instrumentation into the code to find the offending action.

Regards,
Andre.

> On Fri, Jan 28, 2011 at 6:47 AM, Juergen Gross
> <juergen.gross@ts.fujitsu.com> wrote:
>> On 01/28/11 00:18, Andre Przywara wrote:
>>> Hi,
>>>
>>> when I boot my machine without restricting Dom0 (dom0_mem=
>>> dom0_max_vcpus=) I get an _hypervisor_ crash when I run
>>> # xl cpupool-numa-split
>>> If Dom0's resources are limited on the Xen cmdline, everything works fine.
>>> The crashdump points to a scheduling problem with weights, so I assume
>>> the NUMA distribution algorithm some fools the hypervisor completely.
>>>
>>> I will investigate this further tomorrow, but maybe someone has some
>>> good idea.
>> I've seen this once with an older cpupool version on a 24 processor machine.
>> It was NOT related to NUMA, but did occur only on reboot after a Dom0 panic.
>> The machine had an init script creating a cpupool and populating it with
>> cpus. The machine was in a panic loop due to the BUG in sched_acct then
>> until
>> it was resetted manually. After the reset the problem was gone.
>>
>> As I was never able to reproduce the problem later (the same software is
>> running on dozens of machines!), I assumed there was a problem related to
>> the first Dom0 panic, may be some destroyed BIOS tables.
>>
>> Can the crash be reproduced easily?
>>
>>
>> Juergen
>>
>>> Regards,
>>> Andre.
>>>
>>> root@dosorca:/data/images# xl cpupool-numa-split
>>> (XEN) Xen BUG at sched_credit.c:990
>>> (XEN) ----[ Xen-4.1.0-rc2-pre x86_64 debug=y Not tainted ]----
>>> (XEN) CPU: 0
>>> (XEN) RIP: e008:[<ffff82c4801180f8>] csched_acct+0x11f/0x419
>>> (XEN) RFLAGS: 0000000000010006 CONTEXT: hypervisor
>>> (XEN) rax: 0000000000000010 rbx: 0000000000000f00 rcx: 0000000000000100
>>> (XEN) rdx: 0000000000001000 rsi: ffff830437ffa600 rdi: 0000000000000010
>>> (XEN) rbp: ffff82c480297e10 rsp: ffff82c480297d80 r8: 0000000000000100
>>> (XEN) r9: 0000000000000006 r10: ffff82c4802d4100 r11: 000000afc7df0edf
>>> (XEN) r12: ffff830437ffa5e0 r13: ffff82c480117fd9 r14: ffff830437f9f2e8
>>> (XEN) r15: ffff830434321ec0 cr0: 000000008005003b cr4: 00000000000006f0
>>> (XEN) cr3: 000000080df4e000 cr2: ffff88179af79618
>>> (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
>>> (XEN) Xen stack trace from rsp=ffff82c480297d80:
>>> (XEN) 0000000000000282 fffffed4802d3f80 0000000000000eff ffff830437ffa5e0
>>> (XEN) ffff830437ffa5e8 ffff830437ffa870 ffff830437ffa5e0 0000000000000282
>>> (XEN) ffff830437ffa5e8 00002a3037ffa870 00000f0000000f00 0000000000000000
>>> (XEN) ffff82c400000000 ffff82c4802d3f80 ffff830437ffa5e0 ffff82c480117fd9
>>> (XEN) ffff830437f9f2e8 ffff830437f9f2e0 ffff82c480297e40 ffff82c480125f34
>>> (XEN) 0000000000000002 ffff830437ffa600 ffff82c4802d3f80 000000afb6f8667f
>>> (XEN) ffff82c480297e90 ffff82c480126259 ffff82c48024ae20 ffff82c4802d3f80
>>> (XEN) ffff830437f9f2e0 0000000000000000 0000000000000000 ffff82c4802b0880
>>> (XEN) ffff82c480297f18 ffffffffffffffff ffff82c480297ed0 ffff82c480123327
>>> (XEN) ffff82c4802d4a00 ffff82c480297f18 ffff82c48024ae20 ffff82c480297f18
>>> (XEN) 000000afb6abd652 ffff82c4802d3ec0 ffff82c480297ee0 ffff82c4801233a2
>>> (XEN) ffff82c480297f10 ffff82c4801563f5 0000000000000000 ffff8300c7cd6000
>>> (XEN) 0000000000000000 ffff8300c7ad4000 ffff82c480297d48 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 ffffffff81a69060 ffff8817a8503f10
>>> (XEN) ffff8817a8503fd8 0000000000000246 ffff8817a8503e80 ffff880000000001
>>> (XEN) 0000000000000000 0000000000000000 ffffffff810093aa 000000aafab2f86e
>>> (XEN) 00000000deadbeef 00000000deadbeef 0000010000000000 ffffffff810093aa
>>> (XEN) 000000000000e033 0000000000000246 ffff8817a8503ef8 000000000000e02b
>>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 ffff8300c7cd6000 0000000000000000 0000000000000000
>>> (XEN) Xen call trace:
>>> (XEN) [<ffff82c4801180f8>] csched_acct+0x11f/0x419
>>> (XEN) [<ffff82c480125f34>] execute_timer+0x4e/0x6c
>>> (XEN) [<ffff82c480126259>] timer_softirq_action+0xf2/0x245
>>> (XEN) [<ffff82c480123327>] __do_softirq+0x88/0x99
>>> (XEN) [<ffff82c4801233a2>] do_softirq+0x6a/0x7a
>>> (XEN) [<ffff82c4801563f5>] idle_loop+0x6a/0x6f
>>> (XEN)
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 0:
>>> (XEN) Xen BUG at sched_credit.c:990
>>> (XEN) ****************************************
>>> (XEN)
>>> (XEN) Reboot in five seconds...
>>>
>>>
>>
>> --
>> Juergen Gross                 Principal Developer Operating Systems
>> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
>> Fujitsu Technology Solutions              e-mail:
>> juergen.gross@ts.fujitsu.com
>> Domagkstr. 28                           Internet: ts.fujitsu.com
>> D-80807 Muenchen                 Company details:
>> ts.fujitsu.com/imprint.html
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>
> 


-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712


* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-01-28 11:44     ` Juergen Gross
@ 2011-01-28 13:14       ` Andre Przywara
  2011-01-31  7:04         ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-01-28 13:14 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Ian Jackson, Keir Fraser, xen-devel

> 
> Do I understand correctly?
> No crash with only dom0_max_vcpus= and no crash with only dom0_mem= ?
Yes, see my previous mail to George.

> 
> Could you try this patch?
Ok, the crash dump is as follows:
(XEN) Xen BUG at sched_credit.c:384
(XEN) ----[ Xen-4.1.0-rc2-pre  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff82c480117fa0>] csched_alloc_pdata+0x146/0x17f
(XEN) RFLAGS: 0000000000010093   CONTEXT: hypervisor
(XEN) rax: ffff830434322000   rbx: ffff830434418748   rcx: 0000000000000024
(XEN) rdx: ffff82c4802d3ec0   rsi: 0000000000000003   rdi: ffff8304343c9100
(XEN) rbp: ffff83043457fce8   rsp: ffff83043457fca8   r8:  0000000000000001
(XEN) r9:  ffff830434418748   r10: ffff82c48021a0a0   r11: 0000000000000286
(XEN) r12: 0000000000000024   r13: ffff83123a3b2b60   r14: ffff830434418730
(XEN) r15: 0000000000000024   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 00000008061df000   cr2: ffff8817a21f87a0
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83043457fca8:
(XEN)    ffff83043457fcb8 ffff83123a3b2b60 0000000000000286 0000000000000024
(XEN)    ffff830434418820 ffff83123a3b2a70 0000000000000024 ffff82c4802b0880
(XEN)    ffff83043457fd58 ffff82c48011fa63 ffff82f60102aa80 0000000000081554
(XEN)    ffff8300c7cfa000 0000000000000000 0000400000000000 ffff82c480248e00
(XEN)    0000000000000002 0000000000000024 ffff830434418820 0000000000305000
(XEN)    ffff82c4802550e4 ffff82c4802b0880 ffff83043457fd78 ffff82c48010188c
(XEN)    ffff83043457fe40 0000000000000024 ffff83043457fdb8 ffff82c480101b94
(XEN)    ffff83043457fdb8 ffff82c4801836f2 fffffffe00000286 ffff83043457ff18
(XEN)    0000000002170004 0000000000305000 ffff83043457fef8 ffff82c480125281
(XEN)    ffff83043457fdd8 0000000180153c9d 0000000000000000 ffff82c4801068f8
(XEN)    0000000000000296 ffff8300c7e0a1c8 aaaaaaaaaaaaaaaa 0000000000000000
(XEN)    ffff88007d1ac170 ffff88007d1ac170 ffff83043457fef8 ffff82c480113d8a
(XEN)    ffff83043457fe78 ffff83043457fe88 0000000800000012 0000000600000004
(XEN)    0000000000000000 ffffffff00000024 0000000000000000 00007fac2e0e5a00
(XEN)    0000000002170000 0000000000000000 0000000000000000 ffffffffffffffff
(XEN)    0000000000000000 0000000000000080 000000000000002f 0000000002170004
(XEN)    0000000002172004 0000000002174004 00007fff878f1c80 0000000000000033
(XEN)    ffff83043457fed8 ffff8300c7e0a000 00007fff878f1b30 0000000000305000
(XEN)    0000000000000003 0000000000000003 00007cfbcba800c7 ffff82c480207dd8
(XEN)    ffffffff8100946a 0000000000000023 0000000000000003 0000000000000003
(XEN) Xen call trace:
(XEN)    [<ffff82c480117fa0>] csched_alloc_pdata+0x146/0x17f
(XEN)    [<ffff82c48011fa63>] schedule_cpu_switch+0x75/0x1eb
(XEN)    [<ffff82c48010188c>] cpupool_assign_cpu_locked+0x44/0x8b
(XEN)    [<ffff82c480101b94>] cpupool_do_sysctl+0x1fb/0x461
(XEN)    [<ffff82c480125281>] do_sysctl+0x921/0xa30
(XEN)    [<ffff82c480207dd8>] syscall_enter+0xc8/0x122
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) Xen BUG at sched_credit.c:384
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

Regards,
Andre.

> 
> diff -r b59f04eb8978 xen/common/schedule.c
> --- a/xen/common/schedule.c     Fri Jan 21 18:06:23 2011 +0000
> +++ b/xen/common/schedule.c     Fri Jan 28 12:42:46 2011 +0100
> @@ -1301,7 +1301,9 @@ void schedule_cpu_switch(unsigned int cp
> 
>       idle = idle_vcpu[cpu];
>       ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
> +    BUG_ON(ppriv == NULL);
>       vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
> +    BUG_ON(vpriv == NULL);
> 
>       pcpu_schedule_lock_irqsave(cpu, flags);
> 
> 
> 


-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712


* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-01-28 13:14       ` Andre Przywara
@ 2011-01-31  7:04         ` Juergen Gross
  2011-01-31 14:59           ` Andre Przywara
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-01-31  7:04 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Ian Jackson, xen-devel, Keir Fraser

On 01/28/11 14:14, Andre Przywara wrote:
>>
>> Do I understand correctly?
>> No crash with only dom0_max_vcpus= and no crash with only dom0_mem= ?
> Yes, see my previous mail to George.
>
>>
>> Could you try this patch?
> Ok, the crash dump is as follows:

Hmm, is the new crash reproducible as well?
Seems not to be directly related to my diagnosis patch...

Currently I have no NUMA machine available. I tried to use the numa=fake=...
boot parameter, but this seems to fake only NUMA memory nodes; all cpus are
still in node 0:

(XEN) 'u' pressed -> dumping numa info (now-0x120:5D5E0203)
(XEN) idx0 -> NODE0 start->0 size->524288
(XEN) phys_to_nid(0000000000001000) -> 0 should be 0
(XEN) idx1 -> NODE1 start->524288 size->524288
(XEN) phys_to_nid(0000000080001000) -> 1 should be 1
(XEN) idx2 -> NODE2 start->1048576 size->524288
(XEN) phys_to_nid(0000000100001000) -> 2 should be 2
(XEN) idx3 -> NODE3 start->1572864 size->1835008
(XEN) phys_to_nid(0000000180001000) -> 3 should be 3
(XEN) CPU0 -> NODE0
(XEN) CPU1 -> NODE0
(XEN) CPU2 -> NODE0
(XEN) CPU3 -> NODE0
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 3003121):
(XEN)     Node 0: 433864
(XEN)     Node 1: 258522
(XEN)     Node 2: 514315
(XEN)     Node 3: 1796420

I suspect a problem with the __cpuinit stuff overwriting some node info.
Andre, could you check this? I hope to reproduce your problem on my machine.

> (XEN) Xen BUG at sched_credit.c:384
> (XEN) ----[ Xen-4.1.0-rc2-pre x86_64 debug=y Not tainted ]----
> (XEN) CPU: 2
> (XEN) RIP: e008:[<ffff82c480117fa0>] csched_alloc_pdata+0x146/0x17f
> (XEN) RFLAGS: 0000000000010093 CONTEXT: hypervisor
> (XEN) rax: ffff830434322000 rbx: ffff830434418748 rcx: 0000000000000024
> (XEN) rdx: ffff82c4802d3ec0 rsi: 0000000000000003 rdi: ffff8304343c9100
> (XEN) rbp: ffff83043457fce8 rsp: ffff83043457fca8 r8: 0000000000000001
> (XEN) r9: ffff830434418748 r10: ffff82c48021a0a0 r11: 0000000000000286
> (XEN) r12: 0000000000000024 r13: ffff83123a3b2b60 r14: ffff830434418730
> (XEN) r15: 0000000000000024 cr0: 000000008005003b cr4: 00000000000006f0
> (XEN) cr3: 00000008061df000 cr2: ffff8817a21f87a0
> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) Xen stack trace from rsp=ffff83043457fca8:
> (XEN) ffff83043457fcb8 ffff83123a3b2b60 0000000000000286 0000000000000024
> (XEN) ffff830434418820 ffff83123a3b2a70 0000000000000024 ffff82c4802b0880
> (XEN) ffff83043457fd58 ffff82c48011fa63 ffff82f60102aa80 0000000000081554
> (XEN) ffff8300c7cfa000 0000000000000000 0000400000000000 ffff82c480248e00
> (XEN) 0000000000000002 0000000000000024 ffff830434418820 0000000000305000
> (XEN) ffff82c4802550e4 ffff82c4802b0880 ffff83043457fd78 ffff82c48010188c
> (XEN) ffff83043457fe40 0000000000000024 ffff83043457fdb8 ffff82c480101b94
> (XEN) ffff83043457fdb8 ffff82c4801836f2 fffffffe00000286 ffff83043457ff18
> (XEN) 0000000002170004 0000000000305000 ffff83043457fef8 ffff82c480125281
> (XEN) ffff83043457fdd8 0000000180153c9d 0000000000000000 ffff82c4801068f8
> (XEN) 0000000000000296 ffff8300c7e0a1c8 aaaaaaaaaaaaaaaa 0000000000000000
> (XEN) ffff88007d1ac170 ffff88007d1ac170 ffff83043457fef8 ffff82c480113d8a
> (XEN) ffff83043457fe78 ffff83043457fe88 0000000800000012 0000000600000004
> (XEN) 0000000000000000 ffffffff00000024 0000000000000000 00007fac2e0e5a00
> (XEN) 0000000002170000 0000000000000000 0000000000000000 ffffffffffffffff
> (XEN) 0000000000000000 0000000000000080 000000000000002f 0000000002170004
> (XEN) 0000000002172004 0000000002174004 00007fff878f1c80 0000000000000033
> (XEN) ffff83043457fed8 ffff8300c7e0a000 00007fff878f1b30 0000000000305000
> (XEN) 0000000000000003 0000000000000003 00007cfbcba800c7 ffff82c480207dd8
> (XEN) ffffffff8100946a 0000000000000023 0000000000000003 0000000000000003
> (XEN) Xen call trace:
> (XEN) [<ffff82c480117fa0>] csched_alloc_pdata+0x146/0x17f
> (XEN) [<ffff82c48011fa63>] schedule_cpu_switch+0x75/0x1eb
> (XEN) [<ffff82c48010188c>] cpupool_assign_cpu_locked+0x44/0x8b
> (XEN) [<ffff82c480101b94>] cpupool_do_sysctl+0x1fb/0x461
> (XEN) [<ffff82c480125281>] do_sysctl+0x921/0xa30
> (XEN) [<ffff82c480207dd8>] syscall_enter+0xc8/0x122
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 2:
> (XEN) Xen BUG at sched_credit.c:384
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html


* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-01-31  7:04         ` Juergen Gross
@ 2011-01-31 14:59           ` Andre Przywara
  2011-01-31 15:28             ` George Dunlap
  0 siblings, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-01-31 14:59 UTC (permalink / raw)
  To: Juergen Gross, George Dunlap; +Cc: Ian Jackson, xen-devel, Keir Fraser

Juergen Gross wrote:
> On 01/28/11 14:14, Andre Przywara wrote:
>>> Do I understand correctly?
>>> No crash with only dom0_max_vcpus= and no crash with only dom0_mem= ?
>> Yes, see my previous mail to George.
>>
>>> Could you try this patch?
>> Ok, the crash dump is as follows:
> 
> Hmm, is the new crash reproducable as well?
> Seems not to be directly related to my diagnosis patch...
Right, that was also my impression.

I seem to have gotten a bit further, though:
By accident I found that in c/s 22846 the issue is fixed; it now works 
without crashing. I bisected it down to my own patch, which disables the 
NODEID_MSR in Dom0. I could confirm this theory by a) applying this 
single line (clear_bit(NODEID_MSR)) to 22799 and _not_ seeing it crash, 
and b) removing this line from 22846 and seeing it crash.

So my theory is that Dom0 sees different nodes on its virtual CPUs via 
the physical NodeID MSR, but this association can (and will) be changed 
at any moment by the Xen scheduler. So Dom0 will build a bogus topology 
based upon these values. As soon as all vCPUs of Dom0 are confined to 
one node (node 0; this is caused by the cpupool-numa-split call), the 
Xen scheduler somehow hiccups.
So it seems to be a bad combination of the NodeID MSR (on newer AMD 
platforms: sockets C32 and G34) and a NodeID-MSR-aware Dom0 (2.6.32.27).
Since this is a hypervisor crash, I assume that the bug is still there; 
the current tip only makes it much less likely to be triggered.

Hope that helps; I will dig deeper now.

Regards,
Andre.

> 
> Currently I have no NUMA machine available. I tried to use numa=fake=...
> boot parameter, but this seems to fake only NUMA memory nodes, all cpus are
> still in node 0:
> 
> (XEN) 'u' pressed -> dumping numa info (now-0x120:5D5E0203)
> (XEN) idx0 -> NODE0 start->0 size->524288
> (XEN) phys_to_nid(0000000000001000) -> 0 should be 0
> (XEN) idx1 -> NODE1 start->524288 size->524288
> (XEN) phys_to_nid(0000000080001000) -> 1 should be 1
> (XEN) idx2 -> NODE2 start->1048576 size->524288
> (XEN) phys_to_nid(0000000100001000) -> 2 should be 2
> (XEN) idx3 -> NODE3 start->1572864 size->1835008
> (XEN) phys_to_nid(0000000180001000) -> 3 should be 3
> (XEN) CPU0 -> NODE0
> (XEN) CPU1 -> NODE0
> (XEN) CPU2 -> NODE0
> (XEN) CPU3 -> NODE0
> (XEN) Memory location of each domain:
> (XEN) Domain 0 (total: 3003121):
> (XEN)     Node 0: 433864
> (XEN)     Node 1: 258522
> (XEN)     Node 2: 514315
> (XEN)     Node 3: 1796420
> 
> I suspect a problem with the __cpuinit stuff overwriting some node info.
> Andre, could you check this? I hope to reproduce your problem on my machine.
> 
>> (XEN) Xen BUG at sched_credit.c:384
>> (XEN) ----[ Xen-4.1.0-rc2-pre x86_64 debug=y Not tainted ]----
>> (XEN) CPU: 2
>> (XEN) RIP: e008:[<ffff82c480117fa0>] csched_alloc_pdata+0x146/0x17f
>> (XEN) RFLAGS: 0000000000010093 CONTEXT: hypervisor
>> (XEN) rax: ffff830434322000 rbx: ffff830434418748 rcx: 0000000000000024
>> (XEN) rdx: ffff82c4802d3ec0 rsi: 0000000000000003 rdi: ffff8304343c9100
>> (XEN) rbp: ffff83043457fce8 rsp: ffff83043457fca8 r8: 0000000000000001
>> (XEN) r9: ffff830434418748 r10: ffff82c48021a0a0 r11: 0000000000000286
>> (XEN) r12: 0000000000000024 r13: ffff83123a3b2b60 r14: ffff830434418730
>> (XEN) r15: 0000000000000024 cr0: 000000008005003b cr4: 00000000000006f0
>> (XEN) cr3: 00000008061df000 cr2: ffff8817a21f87a0
>> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
>> (XEN) Xen stack trace from rsp=ffff83043457fca8:
>> (XEN) ffff83043457fcb8 ffff83123a3b2b60 0000000000000286 0000000000000024
>> (XEN) ffff830434418820 ffff83123a3b2a70 0000000000000024 ffff82c4802b0880
>> (XEN) ffff83043457fd58 ffff82c48011fa63 ffff82f60102aa80 0000000000081554
>> (XEN) ffff8300c7cfa000 0000000000000000 0000400000000000 ffff82c480248e00
>> (XEN) 0000000000000002 0000000000000024 ffff830434418820 0000000000305000
>> (XEN) ffff82c4802550e4 ffff82c4802b0880 ffff83043457fd78 ffff82c48010188c
>> (XEN) ffff83043457fe40 0000000000000024 ffff83043457fdb8 ffff82c480101b94
>> (XEN) ffff83043457fdb8 ffff82c4801836f2 fffffffe00000286 ffff83043457ff18
>> (XEN) 0000000002170004 0000000000305000 ffff83043457fef8 ffff82c480125281
>> (XEN) ffff83043457fdd8 0000000180153c9d 0000000000000000 ffff82c4801068f8
>> (XEN) 0000000000000296 ffff8300c7e0a1c8 aaaaaaaaaaaaaaaa 0000000000000000
>> (XEN) ffff88007d1ac170 ffff88007d1ac170 ffff83043457fef8 ffff82c480113d8a
>> (XEN) ffff83043457fe78 ffff83043457fe88 0000000800000012 0000000600000004
>> (XEN) 0000000000000000 ffffffff00000024 0000000000000000 00007fac2e0e5a00
>> (XEN) 0000000002170000 0000000000000000 0000000000000000 ffffffffffffffff
>> (XEN) 0000000000000000 0000000000000080 000000000000002f 0000000002170004
>> (XEN) 0000000002172004 0000000002174004 00007fff878f1c80 0000000000000033
>> (XEN) ffff83043457fed8 ffff8300c7e0a000 00007fff878f1b30 0000000000305000
>> (XEN) 0000000000000003 0000000000000003 00007cfbcba800c7 ffff82c480207dd8
>> (XEN) ffffffff8100946a 0000000000000023 0000000000000003 0000000000000003
>> (XEN) Xen call trace:
>> (XEN) [<ffff82c480117fa0>] csched_alloc_pdata+0x146/0x17f
>> (XEN) [<ffff82c48011fa63>] schedule_cpu_switch+0x75/0x1eb
>> (XEN) [<ffff82c48010188c>] cpupool_assign_cpu_locked+0x44/0x8b
>> (XEN) [<ffff82c480101b94>] cpupool_do_sysctl+0x1fb/0x461
>> (XEN) [<ffff82c480125281>] do_sysctl+0x921/0xa30
>> (XEN) [<ffff82c480207dd8>] syscall_enter+0xc8/0x122
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 2:
>> (XEN) Xen BUG at sched_credit.c:384
>> (XEN) ****************************************
>> (XEN)
>> (XEN) Reboot in five seconds...
> 
> 
> Juergen
> 


-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany


* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-01-31 14:59           ` Andre Przywara
@ 2011-01-31 15:28             ` George Dunlap
  2011-02-01 16:32               ` Andre Przywara
  0 siblings, 1 reply; 53+ messages in thread
From: George Dunlap @ 2011-01-31 15:28 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Keir Fraser, Juergen Gross, xen-devel, Ian Jackson

On Mon, Jan 31, 2011 at 2:59 PM, Andre Przywara <andre.przywara@amd.com> wrote:
> Right, that was also my impression.
>
> I seemed to get a bit further, though:
> By accident I found that in c/s 22846 the issue is fixed, it works now
> without crashing. I bisected it down to my own patch, which disables the
> NODEID_MSR in Dom0. I could confirm this theory by a) applying this single
> line (clear_bit(NODEID_MSR)) to 22799 and _not_ seeing it crash and b) by
> removing this line from 22846 and seeing it crash.
>
> So my theory is that Dom0 sees different nodes on its virtual CPUs via the
> physical NodeID MSR, but this association can (and will) be changed every
> moment by the Xen scheduler. So Dom0 will build a bogus topology based upon
> these values. As soon as all vCPUs of Dom0 are contained into one node (node
> 0, this is caused by the cpupool-numa-split call), the Xen scheduler somehow
> hicks up.
> So it seems to be bad combination caused by the NodeID-MSR (on newer AMD
> platforms: sockets C32 and G34) and a NodeID MSR aware Dom0 (2.6.32.27).
> Since this is a hypervisor crash, I assume that the bug is still there, only
> the current tip will make it much less likely to be triggered.
>
> Hope that help, I will dig deeper now.

Thanks.  The crashes you're getting are in fact very strange.  They
have to do with assumptions that the credit scheduler makes as part of
its accounting process.  It would only make sense for those to be
triggered if a vcpu was moved from one pool to another pool without
the proper accounting being done.  (Specifically, each vcpu is
classified as either "active" or "inactive"; and each scheduler
instance keeps track of the total weight of all "active" vcpus.  The
BUGs you're tripping over are saying that this invariant has been
violated.)  However, I've looked at the cpupools vcpu-migrate code,
and it looks like it does everything right.  So I'm a bit mystified.
My only thought is that possibly a cpumask somewhere wasn't getting
set properly, such that a vcpu was being run on a cpu from another
pool.
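
For reference, the bookkeeping in question looks roughly like this (a
simplified sketch of the 4.1-era xen/common/sched_credit.c; structure and
field names are recalled from that code and may not match the tree exactly,
locking and statistics omitted):

/* Sketch: when a vcpu becomes "active", its domain's weight is added to
 * the scheduler instance's running total.  csched_acct() later assumes
 * this total is consistent with the per-domain active_vcpu_count values;
 * that is the invariant the tripped BUG_ONs check. */
static void sketch_vcpu_acct_start(struct csched_private *prv,
                                   struct csched_vcpu *svc)
{
    struct csched_dom * const sdom = svc->sdom;

    if ( list_empty(&svc->active_vcpu_elem) )
    {
        sdom->active_vcpu_count++;
        list_add(&svc->active_vcpu_elem, &sdom->active_vcpu);
        prv->weight += sdom->weight;   /* weight is accounted per active vcpu */
        if ( list_empty(&sdom->active_sdom_elem) )
            list_add(&sdom->active_sdom_elem, &prv->active_sdom);
    }
}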

Unfortunately I can't take a good look at this right now; hopefully
I'll be able to take a look next week.

Andre, if you were keen, you might go through the credit code and put
in a bunch of ASSERTs that the current pcpu is in the mask of the
current vcpu; and that the current vcpu is assigned to the pool of the
current pcpu, and so on.
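
A minimal sketch of such an ASSERT helper (illustrative only; the vcpu
fields and the per-cpu cpupool pointer are assumed from the 4.1-era tree
and may need adjusting):

/* Sketch: check that a vcpu only ever runs on a pcpu it is allowed on,
 * and that this pcpu actually belongs to the vcpu's own cpupool. */
static inline void check_vcpu_placement(struct vcpu *vc)
{
    unsigned int cpu = vc->processor;

    /* The current pcpu must be in the vcpu's affinity mask... */
    ASSERT(cpu_isset(cpu, vc->cpu_affinity));
    /* ...and the vcpu's domain must belong to the pool owning this pcpu. */
    ASSERT(vc->domain->cpupool == per_cpu(cpupool, cpu));
}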

 -George


* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-01-31 15:28             ` George Dunlap
@ 2011-02-01 16:32               ` Andre Przywara
  2011-02-02  6:27                 ` Juergen Gross
  2011-02-02 14:39                 ` Stephan Diestelhorst
  0 siblings, 2 replies; 53+ messages in thread
From: Andre Przywara @ 2011-02-01 16:32 UTC (permalink / raw)
  To: George Dunlap
  Cc: xen-devel, Keir Fraser, Juergen Gross, Ian Jackson, Stephan Diestelhorst

Hi folks,

I asked Stephan Diestelhorst for help, and after I convinced him that 
removing credit and making SEDF the default again is not an option, he 
worked together with me on this ;-) Many thanks for that!
We haven't come to a final solution yet, but we could gather some debug data.
I will simply dump some data here; maybe somebody has a clue. We 
will work further on this tomorrow.

First I replaced the BUG_ON with some printks to get some insight:
(XEN) sdom->active_vcpu_count: 18
(XEN) sdom->weight: 256
(XEN) weight_left: 4096, weight_total: 4096
(XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
(XEN) Xen BUG at sched_credit.c:591
(XEN) ----[ Xen-4.1.0-rc2-pre  x86_64  debug=y  Not tainted ]----

So this shows that the number of VCPUs is not in sync with the 
computed weight sum; we have seen a difference of one or two VCPUs (in 
this case the weight has been computed from 16 VCPUs). It also 
shows that the assertion triggers in the first iteration of the loop, 
where weight_left and weight_total are still equal.
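
For completeness, the instrumentation behind the output above is roughly
the following (a sketch standing in for the weight-consistency BUG_ON() in
csched_acct(); the exact types and format specifiers in the real tree may
differ):

if ( (sdom->weight * sdom->active_vcpu_count) > weight_left )
{
    /* All values printed are locals / fields used by csched_acct(). */
    printk("sdom->active_vcpu_count: %d\n", sdom->active_vcpu_count);
    printk("sdom->weight: %d\n", sdom->weight);
    printk("weight_left: %d, weight_total: %d\n", weight_left, weight_total);
    printk("credit_balance: %d, credit_xtra: %d, credit_cap: %d\n",
           credit_balance, credit_xtra, credit_cap);
    BUG();   /* keep or drop the BUG(), depending on whether the run should continue */
}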

So I additionally instrumented alloc_pdata and free_pdata; the 
unprefixed lines below come from a shell script mimicking the functionality 
of cpupool-numa-split.
------------
Removing CPUs from Pool 0
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node6
scheduler:      credit
number of cpus: 1
(XEN) adding CPU 36, now 1 CPUs
(XEN) removing CPU 36, remaining: 17
Populating new pool
(XEN) sdom->active_vcpu_count: 9
(XEN) sdom->weight: 256
(XEN) weight_left: 2048, weight_total: 2048
(XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
(XEN) adding CPU 37, now 2 CPUs
(XEN) removing CPU 37, remaining: 16
(XEN) adding CPU 38, now 3 CPUs
(XEN) removing CPU 38, remaining: 15
(XEN) adding CPU 39, now 4 CPUs
(XEN) removing CPU 39, remaining: 14
(XEN) adding CPU 40, now 5 CPUs
(XEN) removing CPU 40, remaining: 13
(XEN) sdom->active_vcpu_count: 17
(XEN) sdom->weight: 256
(XEN) weight_left: 4096, weight_total: 4096
(XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
(XEN) adding CPU 41, now 6 CPUs
(XEN) removing CPU 41, remaining: 12
...
Two things startled me:
1) There is quite some delay between the "Removing CPUs" message from the 
script and the actual HV printk showing it's done; why is that not 
synchronous? Looking at the code, it seems that 
__csched_vcpu_acct_start() is eventually triggered by a timer (see the 
sketch below); shouldn't that be triggered synchronously by add/removal 
events?
2) It clearly shows that each CPU gets added to the new pool _before_ it 
gets removed from the old one (Pool-0); isn't that violating the "only 
one pool per CPU" rule? Even if that is fine for a short period of time, 
maybe the timer kicks in at this very moment, resulting in violated 
invariants?
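
For reference, the timer-driven path mentioned in 1) looks roughly like
this (recalled from the 4.1-era sched_credit.c; details and names may
differ):

/* Sketch: each pcpu runs a periodic ticker; accounting activation happens
 * from here rather than from the pool add/remove operations themselves. */
static void csched_tick(void *_cpu)
{
    unsigned int cpu = (unsigned long)_cpu;
    struct csched_pcpu *spc = CSCHED_PCPU(cpu);
    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));

    spc->tick++;

    /* Accounting for non-idle vcpus only -- this is where
     * __csched_vcpu_acct_start() can eventually be reached. */
    if ( !is_idle_vcpu(current) )
        csched_vcpu_acct(prv, cpu);

    /* ... credit ticking / housekeeping elided ... */

    set_timer(&spc->ticker, NOW() + MILLISECS(CSCHED_MSECS_PER_TICK));
}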

Yours confused,
Andre.

George Dunlap wrote:
> On Mon, Jan 31, 2011 at 2:59 PM, Andre Przywara <andre.przywara@amd.com> wrote:
>> Right, that was also my impression.
>>
>> I seemed to get a bit further, though:
>> By accident I found that in c/s 22846 the issue is fixed, it works now
>> without crashing. I bisected it down to my own patch, which disables the
>> NODEID_MSR in Dom0. I could confirm this theory by a) applying this single
>> line (clear_bit(NODEID_MSR)) to 22799 and _not_ seeing it crash and b) by
>> removing this line from 22846 and seeing it crash.
>>
>> So my theory is that Dom0 sees different nodes on its virtual CPUs via the
>> physical NodeID MSR, but this association can (and will) be changed every
>> moment by the Xen scheduler. So Dom0 will build a bogus topology based upon
>> these values. As soon as all vCPUs of Dom0 are contained into one node (node
>> 0, this is caused by the cpupool-numa-split call), the Xen scheduler somehow
>> hicks up.
>> So it seems to be bad combination caused by the NodeID-MSR (on newer AMD
>> platforms: sockets C32 and G34) and a NodeID MSR aware Dom0 (2.6.32.27).
>> Since this is a hypervisor crash, I assume that the bug is still there, only
>> the current tip will make it much less likely to be triggered.
>>
>> Hope that help, I will dig deeper now.
> 
> Thanks.  The crashes you're getting are in fact very strange.  They
> have to do with assumptions that the credit scheduler makes as part of
> its accounting process.  It would only make sense for those to be
> triggered if a vcpu was moved from one pool to another pool without
> the proper accounting being done.  (Specifically, each vcpu is
> classified as either "active" or "inactive"; and each scheduler
> instance keeps track of the total weight of all "active" vcpus.  The
> BUGs you're tripping over are saying that this invariant has been
> violated.)  However, I've looked at the cpupools vcpu-migrate code,
> and it looks like it does everything right.  So I'm a bit mystified.
> My only thought is if possibly a cpumask somewhere that wasn't getting
> set properly, such that a vcpu was being run on a cpu from another
> pool.
> 
> Unfortunately I can't take a good look at this right now; hopefully
> I'll be able to take a look next week.
> 
> Andre, if you were keen, you might go through the credit code and put
> in a bunch of ASSERTs that the current pcpu is in the mask of the
> current vcpu; and that the current vcpu is assigned to the pool of the
> current pcpu, and so on.
> 
>  -George
> 


-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany


* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-01 16:32               ` Andre Przywara
@ 2011-02-02  6:27                 ` Juergen Gross
  2011-02-02  8:49                   ` Juergen Gross
  2011-02-02 14:39                 ` Stephan Diestelhorst
  1 sibling, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-02  6:27 UTC (permalink / raw)
  To: Andre Przywara
  Cc: George Dunlap, Ian Jackson, xen-devel, Keir Fraser, Stephan Diestelhorst

On 02/01/11 17:32, Andre Przywara wrote:
> Hi folks,
>
> I asked Stephan Diestelhorst for help and after I convinced him that
> removing credit and making SEDF the default again is not an option he
> worked together with me on that ;-) Many thanks for that!
> We haven't come to a final solution but could gather some debug data.
> I will simply dump some data here, maybe somebody has got a clue. We
> will work further on this tomorrow.
>
> First I replaced the BUG_ON with some printks to get some insight:
> (XEN) sdom->active_vcpu_count: 18
> (XEN) sdom->weight: 256
> (XEN) weight_left: 4096, weight_total: 4096
> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
> (XEN) Xen BUG at sched_credit.c:591
> (XEN) ----[ Xen-4.1.0-rc2-pre x86_64 debug=y Not tainted ]----
>
> So that one shows that the number of VCPUs is not up-to-date with the
> computed weight sum, we have seen a difference of one or two VCPUs (in
> this case here the weight has been computed from 16 VCPUs). Also it
> shows that the assertion kicks in in the first iteration of the loop,
> where weight_left and weight_total are still equal.
>
> So I additionally instrumented alloc_pdata and free_pdata, the
> unprefixed lines come from a shell script mimicking the functionality of
> cpupool-numa-split.
> ------------
> Removing CPUs from Pool 0
> Creating new pool
> Using config file "cpupool.test"
> cpupool name: Pool-node6
> scheduler: credit
> number of cpus: 1
> (XEN) adding CPU 36, now 1 CPUs
> (XEN) removing CPU 36, remaining: 17
> Populating new pool
> (XEN) sdom->active_vcpu_count: 9
> (XEN) sdom->weight: 256
> (XEN) weight_left: 2048, weight_total: 2048
> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
> (XEN) adding CPU 37, now 2 CPUs
> (XEN) removing CPU 37, remaining: 16
> (XEN) adding CPU 38, now 3 CPUs
> (XEN) removing CPU 38, remaining: 15
> (XEN) adding CPU 39, now 4 CPUs
> (XEN) removing CPU 39, remaining: 14
> (XEN) adding CPU 40, now 5 CPUs
> (XEN) removing CPU 40, remaining: 13
> (XEN) sdom->active_vcpu_count: 17
> (XEN) sdom->weight: 256
> (XEN) weight_left: 4096, weight_total: 4096
> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
> (XEN) adding CPU 41, now 6 CPUs
> (XEN) removing CPU 41, remaining: 12
> ...
> Two thing startled me:
> 1) There is quite some between the "Removing CPUs" message from the
> script and the actual HV printk showing it's done, why is that not
> synchronous?

Removing cpus from Pool-0 requires no switching of the scheduler, so you
see no calls of alloc/free_pdata here.

 > Looking at the code it shows that
> __csched_vcpu_acct_start() is eventually triggered by a timer, shouldn't
> that be triggered synchronously by add/removal events?

The vcpus are not moved explicitly; they are migrated by the normal
scheduler mechanisms, the same as for vcpu-pin.

> 2) It clearly shows that each CPU gets added to the new pool _before_ it
> gets removed from the old one (Pool-0), isn't that violating the "only
> one pool per CPU" rule? Even it that is fine for a short period of time,
> maybe the timer kicks in in this very moment resulting in violated
> invariants?

The sequence you are seeing seems to be okay. The alloc_pdata for the new pool
is called before the free_pdata for the old pool.
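
For illustration, the relevant part of schedule_cpu_switch()
(xen/common/schedule.c, 4.1-era) is roughly the following, heavily
abridged; the first lines match the context of the diagnosis patch above:

idle  = idle_vcpu[cpu];
ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);        /* new pool: allocate first */
vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);

pcpu_schedule_lock_irqsave(cpu, flags);
/* ... install new_ops and the new private pointers, saving the old ones
 *     into ppriv_old / vpriv_old ... */
pcpu_schedule_unlock_irqrestore(cpu, flags);

SCHED_OP(old_ops, free_vdata, vpriv_old);           /* old pool: free afterwards */
SCHED_OP(old_ops, free_pdata, ppriv_old, cpu);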

And the timer is not relevant, as only the idle vcpu should be running on the
moving cpu and the accounting stuff is never called during idle.



Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-02  6:27                 ` Juergen Gross
@ 2011-02-02  8:49                   ` Juergen Gross
  2011-02-02 10:05                     ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-02  8:49 UTC (permalink / raw)
  To: Andre Przywara
  Cc: George Dunlap, Keir Fraser, xen-devel, Ian Jackson, Stephan Diestelhorst

On 02/02/11 07:27, Juergen Gross wrote:
> On 02/01/11 17:32, Andre Przywara wrote:
>> Hi folks,
>>
>> I asked Stephan Diestelhorst for help and after I convinced him that
>> removing credit and making SEDF the default again is not an option he
>> worked together with me on that ;-) Many thanks for that!
>> We haven't come to a final solution but could gather some debug data.
>> I will simply dump some data here, maybe somebody has got a clue. We
>> will work further on this tomorrow.
>>
>> First I replaced the BUG_ON with some printks to get some insight:
>> (XEN) sdom->active_vcpu_count: 18
>> (XEN) sdom->weight: 256
>> (XEN) weight_left: 4096, weight_total: 4096
>> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
>> (XEN) Xen BUG at sched_credit.c:591
>> (XEN) ----[ Xen-4.1.0-rc2-pre x86_64 debug=y Not tainted ]----
>>
>> So that one shows that the number of VCPUs is not up-to-date with the
>> computed weight sum, we have seen a difference of one or two VCPUs (in
>> this case here the weight has been computed from 16 VCPUs). Also it
>> shows that the assertion kicks in in the first iteration of the loop,
>> where weight_left and weight_total are still equal.
>>
>> So I additionally instrumented alloc_pdata and free_pdata, the
>> unprefixed lines come from a shell script mimicking the functionality of
>> cpupool-numa-split.
>> ------------
>> Removing CPUs from Pool 0
>> Creating new pool
>> Using config file "cpupool.test"
>> cpupool name: Pool-node6
>> scheduler: credit
>> number of cpus: 1
>> (XEN) adding CPU 36, now 1 CPUs
>> (XEN) removing CPU 36, remaining: 17
>> Populating new pool
>> (XEN) sdom->active_vcpu_count: 9
>> (XEN) sdom->weight: 256
>> (XEN) weight_left: 2048, weight_total: 2048
>> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
>> (XEN) adding CPU 37, now 2 CPUs
>> (XEN) removing CPU 37, remaining: 16
>> (XEN) adding CPU 38, now 3 CPUs
>> (XEN) removing CPU 38, remaining: 15
>> (XEN) adding CPU 39, now 4 CPUs
>> (XEN) removing CPU 39, remaining: 14
>> (XEN) adding CPU 40, now 5 CPUs
>> (XEN) removing CPU 40, remaining: 13
>> (XEN) sdom->active_vcpu_count: 17
>> (XEN) sdom->weight: 256
>> (XEN) weight_left: 4096, weight_total: 4096
>> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
>> (XEN) adding CPU 41, now 6 CPUs
>> (XEN) removing CPU 41, remaining: 12
>> ...
>> Two things startled me:
>> 1) There is quite some delay between the "Removing CPUs" message from the
>> script and the actual HV printk showing it's done, why is that not
>> synchronous?
>
> Removing cpus from Pool-0 requires no switching of the scheduler, so you
> see no calls of alloc/free_pdata here.
>
>  > Looking at the code it shows that
>> __csched_vcpu_acct_start() is eventually triggered by a timer, shouldn't
>> that be triggered synchronously by add/removal events?
>
> The vcpus are not moved explicitly, they are migrated by the normal
> scheduler mechanisms, same as for vcpu-pin.
>
>> 2) It clearly shows that each CPU gets added to the new pool _before_ it
>> gets removed from the old one (Pool-0), isn't that violating the "only
>> one pool per CPU" rule? Even if that is fine for a short period of time,
>> maybe the timer kicks in in this very moment resulting in violated
>> invariants?
>
> The sequence you are seeing seems to be okay. The alloc_pdata for the
> new pool
> is called before the free_pdata for the old pool.
>
> And the timer is not relevant, as only the idle vcpu should be running
> on the
> moving cpu and the accounting stuff is never called during idle.

Uhh, this could be wrong!
The normal ticker doesn't call accounting in idle and it is stopped during
cpu move. The master_ticker is handled wrong, perhaps. I'll check this and
prepare a patch if necessary.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-02  8:49                   ` Juergen Gross
@ 2011-02-02 10:05                     ` Juergen Gross
  2011-02-02 10:59                       ` Andre Przywara
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-02 10:05 UTC (permalink / raw)
  To: Andre Przywara
  Cc: George Dunlap, Ian Jackson, xen-devel, Keir Fraser, Stephan Diestelhorst

[-- Attachment #1: Type: text/plain, Size: 4658 bytes --]

Hi Andre,

could you try the attached patch?
It should verify if your problems are due to the master ticker
kicking in at a time when the cpu is already gone from the cpupool.

I'm not sure if the patch is complete - Disabling the master ticker
in csched_tick_suspend might lead to problems with cstates. The
functionality is different, at least.

George, do you think this is correct?


Juergen

On 02/02/11 09:49, Juergen Gross wrote:
> On 02/02/11 07:27, Juergen Gross wrote:
>> On 02/01/11 17:32, Andre Przywara wrote:
>>> Hi folks,
>>>
>>> I asked Stephan Diestelhorst for help and after I convinced him that
>>> removing credit and making SEDF the default again is not an option he
>>> worked together with me on that ;-) Many thanks for that!
>>> We haven't come to a final solution but could gather some debug data.
>>> I will simply dump some data here, maybe somebody has got a clue. We
>>> will work further on this tomorrow.
>>>
>>> First I replaced the BUG_ON with some printks to get some insight:
>>> (XEN) sdom->active_vcpu_count: 18
>>> (XEN) sdom->weight: 256
>>> (XEN) weight_left: 4096, weight_total: 4096
>>> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
>>> (XEN) Xen BUG at sched_credit.c:591
>>> (XEN) ----[ Xen-4.1.0-rc2-pre x86_64 debug=y Not tainted ]----
>>>
>>> So that one shows that the number of VCPUs is not up-to-date with the
>>> computed weight sum, we have seen a difference of one or two VCPUs (in
>>> this case here the weight has been computed from 16 VCPUs). Also it
>>> shows that the assertion kicks in in the first iteration of the loop,
>>> where weight_left and weight_total are still equal.
>>>
>>> So I additionally instrumented alloc_pdata and free_pdata, the
>>> unprefixed lines come from a shell script mimicking the functionality of
>>> cpupool-numa-split.
>>> ------------
>>> Removing CPUs from Pool 0
>>> Creating new pool
>>> Using config file "cpupool.test"
>>> cpupool name: Pool-node6
>>> scheduler: credit
>>> number of cpus: 1
>>> (XEN) adding CPU 36, now 1 CPUs
>>> (XEN) removing CPU 36, remaining: 17
>>> Populating new pool
>>> (XEN) sdom->active_vcpu_count: 9
>>> (XEN) sdom->weight: 256
>>> (XEN) weight_left: 2048, weight_total: 2048
>>> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
>>> (XEN) adding CPU 37, now 2 CPUs
>>> (XEN) removing CPU 37, remaining: 16
>>> (XEN) adding CPU 38, now 3 CPUs
>>> (XEN) removing CPU 38, remaining: 15
>>> (XEN) adding CPU 39, now 4 CPUs
>>> (XEN) removing CPU 39, remaining: 14
>>> (XEN) adding CPU 40, now 5 CPUs
>>> (XEN) removing CPU 40, remaining: 13
>>> (XEN) sdom->active_vcpu_count: 17
>>> (XEN) sdom->weight: 256
>>> (XEN) weight_left: 4096, weight_total: 4096
>>> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
>>> (XEN) adding CPU 41, now 6 CPUs
>>> (XEN) removing CPU 41, remaining: 12
>>> ...
>>> Two things startled me:
>>> 1) There is quite some delay between the "Removing CPUs" message from the
>>> script and the actual HV printk showing it's done, why is that not
>>> synchronous?
>>
>> Removing cpus from Pool-0 requires no switching of the scheduler, so you
>> see no calls of alloc/free_pdata here.
>>
>> > Looking at the code it shows that
>>> __csched_vcpu_acct_start() is eventually triggered by a timer, shouldn't
>>> that be triggered synchronously by add/removal events?
>>
>> The vcpus are not moved explicitly, they are migrated by the normal
>> scheduler mechanisms, same as for vcpu-pin.
>>
>>> 2) It clearly shows that each CPU gets added to the new pool _before_ it
>>> gets removed from the old one (Pool-0), isn't that violating the "only
>>> one pool per CPU" rule? Even if that is fine for a short period of time,
>>> maybe the timer kicks in in this very moment resulting in violated
>>> invariants?
>>
>> The sequence you are seeing seems to be okay. The alloc_pdata for the
>> new pool
>> is called before the free_pdata for the old pool.
>>
>> And the timer is not relevant, as only the idle vcpu should be running
>> on the
>> moving cpu and the accounting stuff is never called during idle.
>
> Uhh, this could be wrong!
> The normal ticker doesn't call accounting in idle and it is stopped during
> cpu move. The master_ticker is handled wrong, perhaps. I'll check this and
> prepare a patch if necessary.
>
>
> Juergen
>


-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

[-- Attachment #2: patch.txt --]
[-- Type: text/plain, Size: 2720 bytes --]

diff -r 76e1f7018b01 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Mon Jan 31 08:10:00 2011 +0100
+++ b/xen/common/sched_credit.c Wed Feb 02 10:59:44 2011 +0100
@@ -50,6 +50,8 @@
     (CSCHED_CREDITS_PER_MSEC * CSCHED_MSECS_PER_TSLICE)
 #define CSCHED_CREDITS_PER_ACCT     \
     (CSCHED_CREDITS_PER_MSEC * CSCHED_MSECS_PER_TICK * CSCHED_TICKS_PER_ACCT)
+#define CSCHED_ACCT_TSLICE          \
+    (MILLISECS(CSCHED_MSECS_PER_TICK) * CSCHED_TICKS_PER_ACCT)
 
 
 /*
@@ -320,6 +322,7 @@ csched_free_pdata(const struct scheduler
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_pcpu *spc = pcpu;
     unsigned long flags;
+    uint64_t now = NOW();
 
     if ( spc == NULL )
         return;
@@ -334,6 +337,8 @@ csched_free_pdata(const struct scheduler
     {
         prv->master = first_cpu(prv->cpus);
         migrate_timer(&prv->master_ticker, prv->master);
+        set_timer(&prv->master_ticker, now + CSCHED_ACCT_TSLICE
+            - now % CSCHED_ACCT_TSLICE);
     }
     kill_timer(&spc->ticker);
     if ( prv->ncpus == 0 )
@@ -367,8 +372,7 @@ csched_alloc_pdata(const struct schedule
     {
         prv->master = cpu;
         init_timer(&prv->master_ticker, csched_acct, prv, cpu);
-        set_timer(&prv->master_ticker, NOW() +
-                  MILLISECS(CSCHED_MSECS_PER_TICK) * CSCHED_TICKS_PER_ACCT);
+        set_timer(&prv->master_ticker, NOW() + CSCHED_ACCT_TSLICE);
     }
 
     init_timer(&spc->ticker, csched_tick, (void *)(unsigned long)cpu, cpu);
@@ -1138,8 +1142,7 @@ csched_acct(void* dummy)
     prv->runq_sort++;
 
 out:
-    set_timer( &prv->master_ticker, NOW() +
-            MILLISECS(CSCHED_MSECS_PER_TICK) * CSCHED_TICKS_PER_ACCT );
+    set_timer( &prv->master_ticker, NOW() + CSCHED_ACCT_TSLICE );
 }
 
 static void
@@ -1531,22 +1534,31 @@ csched_deinit(const struct scheduler *op
 
 static void csched_tick_suspend(const struct scheduler *ops, unsigned int cpu)
 {
+    struct csched_private *prv;
     struct csched_pcpu *spc;
 
+    prv = CSCHED_PRIV(ops);
     spc = CSCHED_PCPU(cpu);
 
     stop_timer(&spc->ticker);
+    if ( prv->master == cpu )
+        stop_timer(&prv->master_ticker);
 }
 
 static void csched_tick_resume(const struct scheduler *ops, unsigned int cpu)
 {
+    struct csched_private *prv;
     struct csched_pcpu *spc;
     uint64_t now = NOW();
 
+    prv = CSCHED_PRIV(ops);
     spc = CSCHED_PCPU(cpu);
 
     set_timer(&spc->ticker, now + MILLISECS(CSCHED_MSECS_PER_TICK)
             - now % MILLISECS(CSCHED_MSECS_PER_TICK) );
+    if ( prv->master == cpu )
+        set_timer(&prv->master_ticker, now + CSCHED_ACCT_TSLICE
+            - now % CSCHED_ACCT_TSLICE);
 }
 
 static struct csched_private _csched_priv;

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-02 10:05                     ` Juergen Gross
@ 2011-02-02 10:59                       ` Andre Przywara
  0 siblings, 0 replies; 53+ messages in thread
From: Andre Przywara @ 2011-02-02 10:59 UTC (permalink / raw)
  To: Juergen Gross
  Cc: George Dunlap, Ian Jackson, xen-devel, Keir Fraser, Diestelhorst,
	Stephan

Juergen Gross wrote:
> Hi Andre,
> 
> could you try the attached patch?
> It should verify if your problems are due to the master ticker
> kicking in at a time when the cpu is already gone from the cpupool.
That's what we also found yesterday. If the timer routine fires just
before the timer is stopped but is still _running_ afterwards, this
could lead to problems.

Anyway, the hypervisor still crashes, now at a different BUG_ON():

     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(per_cpu(schedule_data, cpu).curr));
     cpu_set(cpu, prv->idlers);

The complete crash dump was this:

(XEN) Xen BUG at sched_credit.c:389
(XEN) ----[ Xen-4.1.0-rc2-pre  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    3
(XEN) RIP:    e008:[<ffff82c480118020>] csched_alloc_pdata+0x146/0x197
(XEN) RFLAGS: 0000000000010093   CONTEXT: hypervisor
(XEN) rax: ffff830434322000   rbx: ffff830434492478   rcx: 0000000000000018
(XEN) rdx: ffff82c4802d3ec0   rsi: 0000000000000006   rdi: ffff83043445e100
(XEN) rbp: ffff83043456fce8   rsp: ffff83043456fca8   r8:  00000000deadbeef
(XEN) r9:  ffff830434492478   r10: ffff82c48021a1c0   r11: 0000000000000286
(XEN) r12: 0000000000000018   r13: ffff830a3c70c780   r14: ffff830434492460
(XEN) r15: 0000000000000018   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000805bac000   cr2: 00007fbbdaf71116
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83043456fca8:
(XEN)    ffff83043456fcb8 ffff830a3c70c780 0000000000000286 0000000000000018
(XEN)    ffff830434492550 ffff830a3c70c690 0000000000000018 ffff82c4802b0880
(XEN)    ffff83043456fd58 ffff82c48011fbb3 ffff82f601020900 0000000000081048
(XEN)    ffff8300c7e42000 0000000000000000 0000800000000000 ffff82c480249000
(XEN)    0000000000000002 0000000000000018 ffff830434492550 0000000000305000
(XEN)    ffff82c4802550e4 ffff82c4802b0880 ffff83043456fd78 ffff82c48010188c
(XEN)    ffff83043456fe40 0000000000000018 ffff83043456fdb8 ffff82c480101b94
(XEN)    ffff83043456fdb8 ffff82c48018380a fffffffe00000286 ffff83043456ff18
(XEN)    0000000001669004 0000000000305000 ffff83043456fef8 ffff82c4801253c1
(XEN)    ffff83043456fde8 ffff8300c7ac0000 0000000000000000 0000000000000246
(XEN)    ffff83043456fe18 ffff82c480106c7f ffff830434577100 ffff8300c7ac0000
(XEN)    ffff83043456fe28 ffff82c480125de4 0000000000000003 ffff82c4802d3f80
(XEN)    ffff83043456fe78 0000000000000282 0000000800000012 0000000400000004
(XEN)    0000000000000000 ffffffff00000018 0000000000000000 00007f7e6a549a00
(XEN)    0000000001669000 0000000000000000 0000000000000000 ffffffffffffffff
(XEN)    0000000000000000 0000000000000080 000000000000002f 0000000001669004
(XEN)    000000000166b004 000000000166d004 00007fffa59ff250 0000000000000033
(XEN)    ffff83043456fed8 ffff8300c7ac0000 00007fffa59ff100 0000000000305000
(XEN)    0000000000000003 0000000000000003 00007cfbcba900c7 ffff82c480207ee8
(XEN)    ffffffff8100946a 0000000000000023 0000000000000003 0000000000000003
(XEN) Xen call trace:
(XEN)    [<ffff82c480118020>] csched_alloc_pdata+0x146/0x197
(XEN)    [<ffff82c48011fbb3>] schedule_cpu_switch+0x75/0x1cd
(XEN)    [<ffff82c48010188c>] cpupool_assign_cpu_locked+0x44/0x8b
(XEN)    [<ffff82c480101b94>] cpupool_do_sysctl+0x1fb/0x461
(XEN)    [<ffff82c4801253c1>] do_sysctl+0x921/0xa30
(XEN)    [<ffff82c480207ee8>] syscall_enter+0xc8/0x122
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 3:
(XEN) Xen BUG at sched_credit.c:389
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...


Regards,
Andre.

> 
> I'm not sure if the patch is complete - Disabling the master ticker
> in csched_tick_suspend might lead to problems with cstates. The
> functionality is different, at least.
> 
> George, do you think this is correct?
> 
> 
> Juergen
> 
> On 02/02/11 09:49, Juergen Gross wrote:
>> On 02/02/11 07:27, Juergen Gross wrote:
>>> On 02/01/11 17:32, Andre Przywara wrote:
>>>> Hi folks,
>>>>
>>>> I asked Stephan Diestelhorst for help and after I convinced him that
>>>> removing credit and making SEDF the default again is not an option he
>>>> worked together with me on that ;-) Many thanks for that!
>>>> We haven't come to a final solution but could gather some debug data.
>>>> I will simply dump some data here, maybe somebody has got a clue. We
>>>> will work further on this tomorrow.
>>>>
>>>> First I replaced the BUG_ON with some printks to get some insight:
>>>> (XEN) sdom->active_vcpu_count: 18
>>>> (XEN) sdom->weight: 256
>>>> (XEN) weight_left: 4096, weight_total: 4096
>>>> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
>>>> (XEN) Xen BUG at sched_credit.c:591
>>>> (XEN) ----[ Xen-4.1.0-rc2-pre x86_64 debug=y Not tainted ]----
>>>>
>>>> So that one shows that the number of VCPUs is not up-to-date with the
>>>> computed weight sum, we have seen a difference of one or two VCPUs (in
>>>> this case here the weight has been computed from 16 VCPUs). Also it
>>>> shows that the assertion kicks in in the first iteration of the loop,
>>>> where weight_left and weight_total are still equal.
>>>>
>>>> So I additionally instrumented alloc_pdata and free_pdata, the
>>>> unprefixed lines come from a shell script mimicking the functionality of
>>>> cpupool-numa-split.
>>>> ------------
>>>> Removing CPUs from Pool 0
>>>> Creating new pool
>>>> Using config file "cpupool.test"
>>>> cpupool name: Pool-node6
>>>> scheduler: credit
>>>> number of cpus: 1
>>>> (XEN) adding CPU 36, now 1 CPUs
>>>> (XEN) removing CPU 36, remaining: 17
>>>> Populating new pool
>>>> (XEN) sdom->active_vcpu_count: 9
>>>> (XEN) sdom->weight: 256
>>>> (XEN) weight_left: 2048, weight_total: 2048
>>>> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
>>>> (XEN) adding CPU 37, now 2 CPUs
>>>> (XEN) removing CPU 37, remaining: 16
>>>> (XEN) adding CPU 38, now 3 CPUs
>>>> (XEN) removing CPU 38, remaining: 15
>>>> (XEN) adding CPU 39, now 4 CPUs
>>>> (XEN) removing CPU 39, remaining: 14
>>>> (XEN) adding CPU 40, now 5 CPUs
>>>> (XEN) removing CPU 40, remaining: 13
>>>> (XEN) sdom->active_vcpu_count: 17
>>>> (XEN) sdom->weight: 256
>>>> (XEN) weight_left: 4096, weight_total: 4096
>>>> (XEN) credit_balance: 0, credit_xtra: 0, credit_cap: 0
>>>> (XEN) adding CPU 41, now 6 CPUs
>>>> (XEN) removing CPU 41, remaining: 12
>>>> ...
>>>> Two things startled me:
>>>> 1) There is quite some delay between the "Removing CPUs" message from the
>>>> script and the actual HV printk showing it's done, why is that not
>>>> synchronous?
>>> Removing cpus from Pool-0 requires no switching of the scheduler, so you
>>> see no calls of alloc/free_pdata here.
>>>
>>>> Looking at the code it shows that
>>>> __csched_vcpu_acct_start() is eventually triggered by a timer, shouldn't
>>>> that be triggered synchronously by add/removal events?
>>> The vcpus are not moved explicitly, they are migrated by the normal
>>> scheduler mechanisms, same as for vcpu-pin.
>>>
>>>> 2) It clearly shows that each CPU gets added to the new pool _before_ it
>>>> gets removed from the old one (Pool-0), isn't that violating the "only
>>>> one pool per CPU" rule? Even if that is fine for a short period of time,
>>>> maybe the timer kicks in in this very moment resulting in violated
>>>> invariants?
>>> The sequence you are seeing seems to be okay. The alloc_pdata for the
>>> new pool
>>> is called before the free_pdata for the old pool.
>>>
>>> And the timer is not relevant, as only the idle vcpu should be running
>>> on the
>>> moving cpu and the accounting stuff is never called during idle.
>> Uhh, this could be wrong!
>> The normal ticker doesn't call accounting in idle and it is stopped during
>> cpu move. The master_ticker is handled wrong, perhaps. I'll check this and
>> prepare a patch if necessary.
>>
>>
>> Juergen
>>
> 
> 

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-01 16:32               ` Andre Przywara
  2011-02-02  6:27                 ` Juergen Gross
@ 2011-02-02 14:39                 ` Stephan Diestelhorst
  2011-02-02 15:14                   ` Juergen Gross
  1 sibling, 1 reply; 53+ messages in thread
From: Stephan Diestelhorst @ 2011-02-02 14:39 UTC (permalink / raw)
  To: Przywara, Andre
  Cc: George Dunlap, Keir Fraser, Juergen Gross, xen-devel, Ian Jackson

Hi folks,
  long time no see. :-)

On Tuesday 01 February 2011 17:32:25 Andre Przywara wrote:
> I asked Stephan Diestelhorst for help and after I convinced him that 
> removing credit and making SEDF the default again is not an option he 
> worked together with me on that ;-) Many thanks for that!
> We haven't come to a final solution but could gather some debug data.
> I will simply dump some data here, maybe somebody has got a clue. We 
> will work further on this tomorrow.

Andre and I have been looking through this further, in particular sanity
checking the invariant

prv->weight >= sdom->weight * sdom->active_vcpu_count

each time someone tweaks the active vcpu count. This happens only in
__csched_vcpu_acct_start and __csched_vcpu_acct_stop_locked. We managed
to observe the broken invariant when splitting cpupools.
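
Concretely, the check amounts to something like this (an illustrative
sketch only, not the exact hunk we used; the fields are the ones from
sched_credit.c, and it would be called wherever active_vcpu_count
changes):

  static void check_weight_invariant(const struct csched_private *prv,
                                     const struct csched_dom *sdom)
  {
      /* Print instead of BUG()ing so the box stays alive for more data. */
      if ( prv->weight < sdom->weight * sdom->active_vcpu_count )
          printk("credit invariant violated: prv->weight=%u "
                 "sdom->weight=%u active_vcpu_count=%u\n",
                 prv->weight, (unsigned int)sdom->weight,
                 (unsigned int)sdom->active_vcpu_count);
  }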

We have the following theory of what happens:
* some vcpus of a particular domain are currently in the process of
  being moved to the new pool

* some are still left on the old pool (vcpus_old) and some are already
  in the new pool (vcpus_new)

* we now have vcpus_old->sdom = vcpus_new->sdom and following from this
  * vcpus_old->sdom->weight = vcpus_new->sdom->weight
  * vcpus_old->sdom->active_vcpu_count = vcpus_new->sdom->active_vcpu_count

* active_vcpu_count thus does not represent the separation of the
  actual vcpus (may be the sum, only the old or new ones, does not
  matter)

* however, sched_old != sched_new, and thus 
  * sched_old->prv != sched_new->prv
  * sched_old->prv->weight != sched_new->prv->weight

* the prv->weight field hence sees the incremental move of VCPUs
  (through modifications in *acct_start and *acct_stop_locked)

* if at any point in this half-way migration, the scheduler wants to
  csched_acct, it erroneously checks the wrong active_vcpu_count

Workarounds / fixes (none tried):
* disable scheduler accounting while half-way migrating a domain
  (dom->pool_migrating flag and then checking in csched_acct; sketched below)
* temporarily split the sdom structures while migrating to account for
  transient split of vcpus
* synchronously disable all vcpus, migrate and then re-enable
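
The first of these could be as simple as the following untested sketch;
"pool_has_migrating_domain" is a hypothetical helper that would test the
proposed flag for every domain in the pool:

  /* at the top of csched_acct(): skip the accounting pass while a
   * domain is half-way between pools, just re-arm the master ticker */
  if ( pool_has_migrating_domain(prv) )
  {
      set_timer(&prv->master_ticker, NOW() +
                MILLISECS(CSCHED_MSECS_PER_TICK) * CSCHED_TICKS_PER_ACCT);
      return;
  }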

Caveats:
* prv->lock does not guarantee mutual exclusion between (same)
  schedulers of different pools

<rant>
The general locking policy vs the comment situation is a nightmare.
I know that we have some advanced data-structure folks here, but
intuitively reasoning about when specific things are atomic and
mutually excluded is a pain in the scheduler / cpupool code, see the
issue with the separate prv->locks above.

E.g. cpupool_unassign_cpu and cpupool_unassign_cpu_helper interplay:
* cpupool_unassign_cpu unlocks cpupool_lock
* sets up the continuation calling cpupool_unassign_cpu_helper
* cpupool_unassign_cpu_helper locks cpupool_lock
* while intuitively, one would think that both should see a consistent
  snapshot and hence freeing the lock in the middle is a bad idea
* also communicating continuation-local state through global variables
  mandates that only a single global continuation can be pending

* reading cpu outside of the lock protection in
  cpupool_unassign_cpu_helper also smells
</rant>

Despite the rant, it is amazing to see the ability to move running
things around through this remote continuation trick! In my (ancient)
balancer experiments I added hypervisor-threads just for side-
stepping this issue..

Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst@amd.com
Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany
Geschaeftsfuehrer: Alberto Bozzo u. Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-02 14:39                 ` Stephan Diestelhorst
@ 2011-02-02 15:14                   ` Juergen Gross
  2011-02-02 16:01                     ` Stephan Diestelhorst
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-02 15:14 UTC (permalink / raw)
  To: Stephan Diestelhorst
  Cc: George Dunlap, Przywara, Andre, Keir Fraser, xen-devel, Ian Jackson

On 02/02/11 15:39, Stephan Diestelhorst wrote:
> Hi folks,
>    long time no see. :-)
>
> On Tuesday 01 February 2011 17:32:25 Andre Przywara wrote:
>> I asked Stephan Diestelhorst for help and after I convinced him that
>> removing credit and making SEDF the default again is not an option he
>> worked together with me on that ;-) Many thanks for that!
>> We haven't come to a final solution but could gather some debug data.
>> I will simply dump some data here, maybe somebody has got a clue. We
>> will work further on this tomorrow.
>
> Andre and I have been looking through this further, in particular sanity
> checking the invariant
>
> prv->weight >= sdom->weight * sdom->active_vcpu_count
>
> each time someone tweaks the active vcpu count. This happens only in
> __csched_vcpu_acct_start and __csched_vcpu_acct_stop_locked. We managed
> to observe the broken invariant when splitting cpupools.
>
> We have the following theory of what happens:
> * some vcpus of a particular domain are currently in the process of
>    being moved to the new pool

The only _vcpus_ to be moved between pools are the idle vcpus. And those
never contribute to accounting in credit scheduler.

We are moving _pcpus_ only (well, moving a domain between pools actually
moves vcpus as well, but then the domain is paused).
On the pcpu to be moved the idle vcpu should be running. Obviously you
have found a scenario where this isn't true. I have no idea how this could
happen, as vcpus other than the idle vcpu are taken into account for scheduling
only if the pcpu is valid in the cpupool. And the pcpu is set valid after the
BUG_ON you have triggered in your tests.

>
> * some are still left on the old pool (vcpus_old) and some are already
>    in the new pool (vcpus_new)
>
> * we now have vcpus_old->sdom = vcpus_new->sdom and following from this
>    * vcpus_old->sdom->weight = vcpus_new->sdom->weight
>    * vcpus_old->sdom->active_vcpu_count = vcpus_new->sdom->active_vcpu_count
>
> * active_vcpu_count thus does not represent the separation of the
>    actual vcpus (may be the sum, only the old or new ones, does not
>    matter)
>
> * however, sched_old != sched_new, and thus
>    * sched_old->prv != sched_new->prv
>    * sched_old->prv->weight != sched_new->prv->weight
>
> * the prv->weight field hence sees the incremental move of VCPUs
>    (through modifications in *acct_start and *acct_stop_locked)
>
> * if at any point in this half-way migration, the scheduler wants to
>    csched_acct, it erroneously checks the wrong active_vcpu_count
>
> Workarounds / fixes (none tried):
> * disable scheduler accounting while half-way migrating a domain
>    (dom->pool_migrating flag and then checking in csched_acct)
> * temporarily split the sdom structures while migrating to account for
>    transient split of vcpus
> * synchronously disable all vcpus, migrate and then re-enable
>
> Caveats:
> * prv->lock does not guarantee mutual exclusion between (same)
>    schedulers of different pools
>
> <rant>
> The general locking policy vs the comment situation is a nightmare.
> I know that we have some advanced data-structure folks here, but
> intuitively reasoning about when specific things are atomic and
> mutually excluded is a pain in the scheduler / cpupool code, see the
> issue with the separate prv->locks above.
>
> E.g. cpupool_unassign_cpu and cpupool_unassign_cpu_helper interplay:
> * cpupool_unassign_cpu unlocks cpupool_lock
> * sets up the continuation calling cpupool_unassign_cpu_helper
> * cpupool_unassign_cpu_helper locks cpupool_lock
> * while intuitively, one would think that both should see a consistent
>    snapshot and hence freeing the lock in the middle is a bad idea
> * also communicating continuation-local state through global variables
>    mandates that only a single global continuation can be pending
>
> * reading cpu outside of the lock protection in
>    cpupool_unassign_cpu_helper also smells
> </rant>
>
> Despite the rant, it is amazing to see the ability to move running
> things around through this remote continuation trick! In my (ancient)
> balancer experiments I added hypervisor-threads just for side-
> stepping this issue..

I think the easiest way to solve the problem would be to move the cpu to the
new pool in a tasklet. This is possible now, because tasklets are always
executed in the idle vcpus.
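
Roughly like this (an untested sketch just to illustrate the idea; the
real code would of course live in cpupool.c):

  static struct tasklet cpupool_move_tasklet;

  static void cpupool_move_fn(unsigned long cpu)
  {
      /* Runs in the idle vcpu of the cpu it was scheduled on. */
      ASSERT(is_idle_vcpu(current));
      /* ... do the actual assign/unassign work for 'cpu' here ... */
  }

  /* caller side: */
  tasklet_init(&cpupool_move_tasklet, cpupool_move_fn, cpu);
  tasklet_schedule_on_cpu(&cpupool_move_tasklet, cpu);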

OTOH I'd like to understand what is wrong with my current approach...


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-02 15:14                   ` Juergen Gross
@ 2011-02-02 16:01                     ` Stephan Diestelhorst
  2011-02-03  5:57                       ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: Stephan Diestelhorst @ 2011-02-02 16:01 UTC (permalink / raw)
  To: Juergen Gross
  Cc: George Dunlap, Przywara, Andre, Keir Fraser, xen-devel, Ian Jackson

On Wednesday 02 February 2011 16:14:25 Juergen Gross wrote:
> On 02/02/11 15:39, Stephan Diestelhorst wrote:
> > We have the following theory of what happens:
> > * some vcpus of a particular domain are currently in the process of
> >    being moved to the new pool
> 
> The only _vcpus_ to be moved between pools are the idle vcpus. And those
> never contribute to accounting in credit scheduler.
> 
> We are moving _pcpus_ only (well, moving a domain between pools actually
> moves vcpus as well, but then the domain is paused).

How do you ensure that the domain is paused and stays that way? Pausing
the domain was what I had in mind, too...

> > Despite the rant, it is amazing to see the ability to move running
> > things around through this remote continuation trick! In my (ancient)
> > balancer experiments I added hypervisor-threads just for side-
> > stepping this issue..
> 
> I think the easiest way to solve the problem would be to move the cpu to the
> new pool in a tasklet. This is possible now, because tasklets are always
> executed in the idle vcpus.

Yep. That was exactly what I built. At the time stuff like that did
not exist (2005).

> OTOH I'd like to understand what is wrong with my current approach...

Nothing, in fact I like it. In my rant I complained about the fact
that splitting the critical section across this continuation looks
scary, basically causing some generic red lights to turn on :-) And
making reasoning about the correctness a little complicated, but that
may well be a local issue ;-)

Stephan

-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst@amd.com
Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany

Geschaeftsfuehrer: Alberto Bozzo u. Andrew Bowd; 
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-02 16:01                     ` Stephan Diestelhorst
@ 2011-02-03  5:57                       ` Juergen Gross
  2011-02-03  9:18                         ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-03  5:57 UTC (permalink / raw)
  To: Stephan Diestelhorst
  Cc: George Dunlap, Przywara, Andre, xen-devel, Keir Fraser, Ian Jackson

On 02/02/11 17:01, Stephan Diestelhorst wrote:
> On Wednesday 02 February 2011 16:14:25 Juergen Gross wrote:
>> On 02/02/11 15:39, Stephan Diestelhorst wrote:
>>> We have the following theory of what happens:
>>> * some vcpus of a particular domain are currently in the process of
>>>     being moved to the new pool
>>
>> The only _vcpus_ to be moved between pools are the idle vcpus. And those
>> never contribute to accounting in credit scheduler.
>>
>> We are moving _pcpus_ only (well, moving a domain between pools actually
>> moves vcpus as well, but then the domain is paused).
>
> How do you ensure that the domain is paused and stays that way? Pausing
> the domain was what I had in mind, too...

Look at sched_move_domain() in schedule.c: I'm calling domain_pause()
before moving the vcpus and domain_unpause() after that.
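
In outline (heavily simplified, see schedule.c for the real thing):

  int sched_move_domain(struct domain *d, struct cpupool *c)
  {
      domain_pause(d);     /* no vcpu of d can be scheduled from here on */

      /* detach the vcpus from the old pool's scheduler, set up per-vcpu
       * data for the new pool's scheduler, fix up v->processor, ... */

      d->cpupool = c;
      domain_unpause(d);   /* vcpus come back up on the new pool's cpus */
      return 0;
  }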

>
>>> Despite the rant, it is amazing to see the ability to move running
>>> things around through this remote continuation trick! In my (ancient)
>>> balancer experiments I added hypervisor-threads just for side-
>>> stepping this issue..
>>
>> I think the easiest way to solve the problem would be to move the cpu to the
>> new pool in a tasklet. This is possible now, because tasklets are always
>> executed in the idle vcpus.
>
> Yep. That was exactly what I built. At the time stuff like that did
> not exist (2005).
>
>> OTOH I'd like to understand what is wrong with my current approach...
>
> Nothing, in fact I like it. In my rant I complained about the fact
> that splitting the critical section across this continuation looks
> scary, basically causing some generic red lights to turn on :-) And
> making reasoning about the correctness a little complicated, but that
> may well be a local issue ;-)

Perhaps you can help solve the mystery:

Could you replace the BUG_ON in sched_credit.c:389 with something like this:

if (!is_idle_vcpu(per_cpu(schedule_data, cpu).curr)) {
   extern void dump_runq(unsigned char key);
   struct vcpu *vc = per_cpu(schedule_data, cpu).curr;

   printk("+++ (%d.%d) instead idle vcpu on cpu %d\n", vc->domain->domain_id,
           vc->vcpu_id, cpu);
   dump_runq('q');
   BUG();
}


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-03  5:57                       ` Juergen Gross
@ 2011-02-03  9:18                         ` Juergen Gross
  2011-02-04 14:09                           ` Andre Przywara
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-03  9:18 UTC (permalink / raw)
  To: Stephan Diestelhorst
  Cc: George Dunlap, Przywara, Andre, xen-devel, Keir Fraser, Ian Jackson

[-- Attachment #1: Type: text/plain, Size: 968 bytes --]

Andre, Stephan,

could you give the attached patch a try?
It moves the cpu assigning/unassigning into a tasklet always executed on the
cpu to be moved. This should avoid critical races.

Regarding Stephan's rant:
You should be aware that the main critical sections are only in the tasklets.
The locking in the main routines is needed only to avoid the cpupool being
destroyed in between.

I'm not sure whether the master_ticker patch is still needed. It seems to
break something, as my machine hung up after several 100 cpu moves (without
the new patch). I'm still investigating this problem.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

[-- Attachment #2: cpupool-idle.patch --]
[-- Type: text/x-patch, Size: 4879 bytes --]

diff -r 4bdb78db22b6 xen/common/cpupool.c
--- a/xen/common/cpupool.c	Wed Feb 02 17:06:36 2011 +0000
+++ b/xen/common/cpupool.c	Thu Feb 03 10:09:53 2011 +0100
@@ -217,14 +217,30 @@ static int cpupool_assign_cpu_locked(str
     return 0;
 }
 
+static long cpupool_assign_cpu_helper(void *info)
+{
+    int cpu = cpupool_moving_cpu;
+    long ret;
+
+    cpupool_dprintk("cpupool_assign_cpu(pool=%d,cpu=%d) ret %ld\n",
+                    cpupool_cpu_moving->cpupool_id, cpu, ret);
+    BUG_ON(!is_idle_vcpu(current));
+    BUG_ON(cpu != smp_processor_id());
+    spin_lock(&cpupool_lock);
+    ret = cpupool_assign_cpu_locked(cpupool_cpu_moving, cpu);
+    spin_unlock(&cpupool_lock);
+    return ret;
+}
+
 static long cpupool_unassign_cpu_helper(void *info)
 {
     int cpu = cpupool_moving_cpu;
     long ret;
 
     cpupool_dprintk("cpupool_unassign_cpu(pool=%d,cpu=%d) ret %ld\n",
-                    cpupool_id, cpu, ret);
-
+                    cpupool_cpu_moving->cpupool_id, cpu, ret);
+    BUG_ON(!is_idle_vcpu(current));
+    BUG_ON(cpu != smp_processor_id());
     spin_lock(&cpupool_lock);
     ret = cpu_disable_scheduler(cpu);
     cpu_set(cpu, cpupool_free_cpus);
@@ -241,9 +257,51 @@ static long cpupool_unassign_cpu_helper(
 }
 
 /*
+ * assign a specific cpu to a cpupool
+ * we must be sure to run on the cpu to be assigned in idle! to achieve this
+ * the main functionality is performed via continue_hypercall_on_cpu on the
+ * specific cpu.
+ * possible failures:
+ * - cpu not free
+ * - cpu just being unplugged
+ */
+int cpupool_assign_cpu(struct cpupool *c, unsigned int cpu)
+{
+    int ret;
+
+    cpupool_dprintk("cpupool_assign_cpu(pool=%d,cpu=%d)\n",
+                    c->cpupool_id, cpu);
+
+    spin_lock(&cpupool_lock);
+    ret = -EBUSY;
+    if ( (cpupool_moving_cpu != -1) && (cpu != cpupool_moving_cpu) )
+        goto out;
+    if ( cpu_isset(cpu, cpupool_locked_cpus) )
+        goto out;
+
+    ret = 0;
+    if ( !cpu_isset(cpu, cpupool_free_cpus) && (cpu != cpupool_moving_cpu) )
+        goto out;
+
+    cpupool_moving_cpu = cpu;
+    atomic_inc(&c->refcnt);
+    cpupool_cpu_moving = c;
+    cpu_clear(cpu, c->cpu_valid);
+    spin_unlock(&cpupool_lock);
+
+    return continue_hypercall_on_cpu(cpu, cpupool_assign_cpu_helper, c);
+
+out:
+    spin_unlock(&cpupool_lock);
+    cpupool_dprintk("cpupool_assign_cpu(pool=%d,cpu=%d) ret %d\n",
+                    cpupool_id, cpu, ret);
+    return ret;
+}
+
+/*
  * unassign a specific cpu from a cpupool
- * we must be sure not to run on the cpu to be unassigned! to achieve this
- * the main functionality is performed via continue_hypercall_on_cpu on a
+ * we must be sure to run on the cpu to be unassigned in idle! to achieve this
+ * the main functionality is performed via continue_hypercall_on_cpu on the
  * specific cpu.
  * if the cpu to be removed is the last one of the cpupool no active domain
  * must be bound to the cpupool. dying domains are moved to cpupool0 as they
@@ -254,7 +312,6 @@ static long cpupool_unassign_cpu_helper(
  */
 int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
 {
-    int work_cpu;
     int ret;
     struct domain *d;
 
@@ -302,14 +359,7 @@ int cpupool_unassign_cpu(struct cpupool 
     cpu_clear(cpu, c->cpu_valid);
     spin_unlock(&cpupool_lock);
 
-    work_cpu = smp_processor_id();
-    if ( work_cpu == cpu )
-    {
-        work_cpu = first_cpu(cpupool0->cpu_valid);
-        if ( work_cpu == cpu )
-            work_cpu = next_cpu(cpu, cpupool0->cpu_valid);
-    }
-    return continue_hypercall_on_cpu(work_cpu, cpupool_unassign_cpu_helper, c);
+    return continue_hypercall_on_cpu(cpu, cpupool_unassign_cpu_helper, c);
 
 out:
     spin_unlock(&cpupool_lock);
@@ -455,27 +505,15 @@ int cpupool_do_sysctl(struct xen_sysctl_
     {
         unsigned cpu;
 
+        c = __cpupool_get_by_id(op->cpupool_id, 0);
+        ret = -ENOENT;
+        if ( c == NULL )
+            break;
         cpu = op->cpu;
-        cpupool_dprintk("cpupool_assign_cpu(pool=%d,cpu=%d)\n",
-                        op->cpupool_id, cpu);
-        spin_lock(&cpupool_lock);
         if ( cpu == XEN_SYSCTL_CPUPOOL_PAR_ANY )
             cpu = first_cpu(cpupool_free_cpus);
-        ret = -EINVAL;
-        if ( cpu >= NR_CPUS )
-            goto addcpu_out;
-        ret = -EBUSY;
-        if ( !cpu_isset(cpu, cpupool_free_cpus) )
-            goto addcpu_out;
-        c = cpupool_find_by_id(op->cpupool_id, 0);
-        ret = -ENOENT;
-        if ( c == NULL )
-            goto addcpu_out;
-        ret = cpupool_assign_cpu_locked(c, cpu);
-    addcpu_out:
-        spin_unlock(&cpupool_lock);
-        cpupool_dprintk("cpupool_assign_cpu(pool=%d,cpu=%d) ret %d\n",
-                        op->cpupool_id, cpu, ret);
+        ret = (cpu < NR_CPUS) ? cpupool_assign_cpu(c, cpu) : -EINVAL;
+        cpupool_put(c);
     }
     break;
 

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-03  9:18                         ` Juergen Gross
@ 2011-02-04 14:09                           ` Andre Przywara
  2011-02-07 12:38                             ` Andre Przywara
  0 siblings, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-02-04 14:09 UTC (permalink / raw)
  To: Juergen Gross
  Cc: George Dunlap, Ian Jackson, xen-devel, Keir Fraser, Diestelhorst,
	Stephan

Juergen Gross wrote:
> Andre, Stephan,
> 
> could you give the attached patch a try?
> It moves the cpu assigning/unassigning into a tasklet always executed on the
> cpu to be moved. This should avoid critical races.

Done. I checked it twice, but sadly it does not fix the issue. It still 
BUGs:
(XEN) Xen BUG at sched_credit.c:990
(XEN) ----[ Xen-4.1.0-rc3-pre  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c480118208>] csched_acct+0x11f/0x419
(XEN) RFLAGS: 0000000000010006   CONTEXT: hypervisor
(XEN) rax: 0000000000000010   rbx: 0000000000000f00   rcx: 0000000000000100
(XEN) rdx: 0000000000001000   rsi: ffff830437ffa600   rdi: 0000000000000010
(XEN) rbp: ffff82c480297e10   rsp: ffff82c480297d80   r8:  0000000000000100
(XEN) r9:  0000000000000006   r10: ffff82c4802d4100   r11: 0000017322fea49a
(XEN) r12: ffff830437ffa5e0   r13: ffff82c4801180e9   r14: ffff83043399f018
(XEN) r15: ffff830434321ec0   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 00000000c7c9c000   cr2: 0000000001ec8048
(XEN) ds: 002b   es: 002b   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c480297d80:
(XEN)    ffff82c480297f18 fffffed4c7cd6000 ffff830000000eff ffff830437ffa5e0
(XEN)    ffff830437ffa5e8 ffff82c480297df8 ffff830437ffa5e0 0000000000000282
(XEN)    ffff830437ffa5e8 00001c200000000f 00000f0000000f00 0000000000000000
(XEN)    ffff82c400000000 ffff82c4802d3f80 ffff830437ffa5e0 ffff82c4801180e9
(XEN)    ffff83043399f018 ffff83043399f010 ffff82c480297e40 ffff82c480126044
(XEN)    0000000000000002 ffff830437ffa600 ffff82c4802d3f80 00000173010849b7
(XEN)    ffff82c480297e90 ffff82c480126369 ffff82c48024aea0 ffff82c4802d3f80
(XEN)    ffff83043399f010 0000000000000000 0000000000000000 ffff82c4802b0880
(XEN)    ffff82c480297f18 ffffffffffffffff ffff82c480297ed0 ffff82c480123437
(XEN)    ffff8300c7e1e0f8 ffff82c480297f18 ffff82c48024aea0 ffff82c480297f18
(XEN)    0000017301008665 ffff82c4802d3ec0 ffff82c480297ee0 ffff82c4801234b2
(XEN)    ffff82c480297f10 ffff82c4801564f5 0000000000000000 ffff8300c7cd6000
(XEN)    0000000000000000 ffff8300c7e1e000 ffff82c480297d48 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffffffff81a69060 ffff8817a8553f10
(XEN)    ffff8817a8553fd8 0000000000000246 ffff8817a8553e80 ffff880000000001
(XEN)    0000000000000000 0000000000000000 ffffffff810093aa 000000000000e030
(XEN)    00000000deadbeef 00000000deadbeef 0000010000000000 ffffffff810093aa
(XEN)    000000000000e033 0000000000000246 ffff8817a8553ef8 000000000000e02b
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 ffff8300c7cd6000 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c480118208>] csched_acct+0x11f/0x419
(XEN)    [<ffff82c480126044>] execute_timer+0x4e/0x6c
(XEN)    [<ffff82c480126369>] timer_softirq_action+0xf2/0x245
(XEN)    [<ffff82c480123437>] __do_softirq+0x88/0x99
(XEN)    [<ffff82c4801234b2>] do_softirq+0x6a/0x7a
(XEN)    [<ffff82c4801564f5>] idle_loop+0x6a/0x6f
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at sched_credit.c:990
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...


Stephan had created more printk debug patches; we will summarize the
results soon.

Regards,
Andre.


> 
> Regarding Stephan's rant:
> You should be aware that the main critical sections are only in the tasklets.
> The locking in the main routines is needed only to avoid the cpupool being
> destroyed in between.
> 
> I'm not sure whether the master_ticker patch is still needed. It seems to
> break something, as my machine hung up after several 100 cpu moves (without
> the new patch). I'm still investigating this problem.
> 
> 
> Juergen
> 
> 


-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-04 14:09                           ` Andre Przywara
@ 2011-02-07 12:38                             ` Andre Przywara
  2011-02-07 13:32                               ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-02-07 12:38 UTC (permalink / raw)
  To: Juergen Gross; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 985 bytes --]

Juergen,

as promised, some more debug data. This is from c/s 22858 with Stephan's
debug patch (attached).
We get the following dump when the hypervisor crashes; note that the
first lock address is different from the second and subsequent ones:

(XEN) sched_credit.c, 572: prv: ffff831836df2970 &prv->lock: ffff831836df2970 prv->weight: 256 sdom->active_vcpu_count: 3 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 768 sdom->active_vcpu_count: 4 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 1024 sdom->active_vcpu_count: 5 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 1280 sdom->active_vcpu_count: 6 sdom->weight: 256

....

Hope that gives you an idea. I attach the whole log for your reference.
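
(For reference, the printk behind these lines is roughly the following;
reconstructed from the output here, the attached patch is authoritative:)

  printk("%s, %d: prv: %p &prv->lock: %p prv->weight: %u "
         "sdom->active_vcpu_count: %u sdom->weight: %u\n",
         __FILE__, __LINE__, prv, &prv->lock, prv->weight,
         (unsigned int)sdom->active_vcpu_count, (unsigned int)sdom->weight);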

Regards,
Andre

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

[-- Attachment #2: hv_cpupools_crash.txt --]
[-- Type: text/plain, Size: 12532 bytes --]

Welcome to Linux 2.6.32.27-pvops (hvc0)

dosorca login: root
Password: 
Linux 2.6.32.27-pvops.
Last login: Fri Jan 28 00:15:40 +0100 2011 on hvc0.
You have mail.
root@dosorca:~# sync
root@dosorca:~# cd /data/images/
root@dosorca:/data/images# sh numasplit.sh 
Removing CPUs from Pool 0
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node1
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node2
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node3
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node4
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node5
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node6
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node7
scheduler:      credit
number of cpus: 1
Populating new pool
root@dosorca:/data/images# sh numasplit.sh revert
Destroying Pool 1
adding freed CPUs to pool 0
Destroying Pool 2
adding freed CPUs to pool 0
Destroying Pool 3
adding freed CPUs to pool 0
Destroying Pool 4
adding freed CPUs to pool 0
Destroying Pool 5
adding freed CPUs to pool 0
Destroying Pool 6
adding freed CPUs to pool 0
Destroying Pool 7
adding freed CPUs to pool 0
root@dosorca:/data/images# sh numasplit.sh
Removing CPUs from Pool 0
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node1
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node2
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) sched_credit.c, 572: prv: ffff831836df2970 &prv->lock: ffff831836df2970 prv->weight: 256 sdom->active_vcpu_count: 3 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 768 sdom->active_vcpu_count: 4 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 1024 sdom->active_vcpu_count: 5 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 1280 sdom->active_vcpu_count: 6 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 1536 sdom->active_vcpu_count: 7 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 1792 sdom->active_vcpu_count: 8 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 2048 sdom->active_vcpu_count: 9 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 2304 sdom->active_vcpu_count: 10 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 2560 sdom->active_vcpu_count: 11 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 2816 sdom->active_vcpu_count: 12 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 3072 sdom->active_vcpu_count: 13 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 3328 sdom->active_vcpu_count: 14 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 3584 sdom->active_vcpu_count: 15 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 3840 sdom->active_vcpu_count: 16 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 4096 sdom->active_vcpu_count: 17 sdom->weight: 256
(XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 4352 sdom->active_vcpu_count: 18 sdom->weight: 256
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 0 on processor: 33 with state 0 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 1 on processor: 35 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 2 on processor: 20 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 3 on processor: 26 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 4 on processor: 37 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 5 on processor: 36 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 6 on processor: 2 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 7 on processor: 24 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 8 on processor: 28 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 9 on processor: 40 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 10 on processor: 4 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 11 on processor: 44 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 12 on processor: 36 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 13 on processor: 29 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 14 on processor: 3 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 15 on processor: 13 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 16 on processor: 21 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 17 on processor: 1 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 18 on processor: 20 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 19 on processor: 28 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 20 on processor: 39 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 21 on processor: 34 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 22 on processor: 41 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 23 on processor: 0 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 24 on processor: 2 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 25 on processor: 22 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 26 on processor: 42 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 27 on processor: 43 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 28 on processor: 30 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 29 on processor: 27 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 30 on processor: 23 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 31 on processor: 32 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 32 on processor: 25 with state 0 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 33 on processor: 46 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 34 on processor: 38 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 35 on processor: 4 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 36 on processor: 45 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 37 on processor: 34 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 38 on processor: 5 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 39 on processor: 1 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 40 on processor: 30 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 41 on processor: 28 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 42 on processor: 31 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 43 on processor: 0 with state 1 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 44 on processor: 47 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 45 on processor: 29 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 46 on processor: 44 with state 2 violates invariant!
(XEN) BUG in sched_credit.c,1008: Domain 0 VCPU: 47 on processor: 20 with state 2 violates invariant!
(XEN) Xen BUG at sched_credit.c:1013
(XEN) ----[ Xen-4.1.0-rc3-pre  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c4801182f3>] csched_acct+0x197/0x51d
(XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
(XEN) rax: 0000000000000012   rbx: ffff830434321ec0   rcx: 0000000000000000
(XEN) rdx: 0000000000001200   rsi: 0000000000000012   rdi: 0000000000000100
(XEN) rbp: ffff82c480297e10   rsp: ffff82c480297d70   r8:  0000000000000100
(XEN) r9:  ffff82c480214a20   r10: 00000000fffffffc   r11: 0000000000000001
(XEN) r12: ffff830434322000   r13: ffff82c48011815c   r14: ffff83043399f018
(XEN) r15: ffff83043399f010   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000621001000   cr2: 00007f3818efa000
(XEN) ds: 002b   es: 002b   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c480297d70:
(XEN)    ffff830400000002 ffff82c480297e38 fffffed480118b9e 00000000000010ff
(XEN)    ffff830437ffa5e0 ffff830437ffa5e8 ffff82c4802d3ec0 ffff830437ffa5e0
(XEN)    0000000000000282 ffff830437ffa5e8 ffff830434321ec0 00002a309695b272
(XEN)    0000110000001100 0000000000000000 ffff82c400000000 ffff82c4802d3f80
(XEN)    ffff830437ffa5e0 ffff82c48011815c ffff83043399f018 ffff83043399f010
(XEN)    ffff82c480297e40 ffff82c480126144 0000000000000002 ffff830437ffa600
(XEN)    ffff82c4802d3f80 0000001de513cb60 ffff82c480297e90 ffff82c480126469
(XEN)    ffff82c48024b020 ffff82c4802d3f80 ffff83043399f010 0000000000000000
(XEN)    0000000000000000 ffff82c4802b0880 ffff82c480297f18 ffffffffffffffff
(XEN)    ffff82c480297ed0 ffff82c480123537 ffff8300c7e340f8 ffff82c480297f18
(XEN)    ffff82c48024b020 ffff82c480297f18 0000001de5129a7f ffff82c4802d3ec0
(XEN)    ffff82c480297ee0 ffff82c4801235b2 ffff82c480297f10 ffff82c4801565f5
(XEN)    0000000000000000 ffff8300c7cd6000 0000000000000000 ffff8300c7e34000
(XEN)    ffff82c480297d48 0000000000000000 0000000000000000 0000000000000000
(XEN)    ffffffff81a69060 ffff8817a8535f10 ffff8817a8535fd8 0000000000000246
(XEN)    ffff8817a8535e80 ffff880000000001 0000000000000000 0000000000000000
(XEN)    ffffffff810093aa 000000193592cbd4 00000000deadbeef 00000000deadbeef
(XEN)    0000010000000000 ffffffff810093aa 000000000000e033 0000000000000246
(XEN)    ffff8817a8535ef8 000000000000e02b 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 ffff8300c7cd6000
(XEN) Xen call trace:
(XEN)    [<ffff82c4801182f3>] csched_acct+0x197/0x51d
(XEN)    [<ffff82c480126144>] execute_timer+0x4e/0x6c
(XEN)    [<ffff82c480126469>] timer_softirq_action+0xf2/0x245
(XEN)    [<ffff82c480123537>] __do_softirq+0x88/0x99
(XEN)    [<ffff82c4801235b2>] do_softirq+0x6a/0x7a
(XEN)    [<ffff82c4801565f5>] idle_loop+0x6a/0x6f
(XEN)    
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at sched_credit.c:1013
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

[-- Attachment #3: sd_xen_caution_04.patch --]
[-- Type: text/x-patch, Size: 3991 bytes --]

diff -r 9a6458e0c3f5 xen/common/cpupool.c
--- a/xen/common/cpupool.c	Tue Feb 01 19:26:36 2011 +0000
+++ b/xen/common/cpupool.c	Thu Feb 03 18:51:40 2011 +0100
@@ -30,6 +30,7 @@
 static int cpupool_moving_cpu = -1;
 static struct cpupool *cpupool_cpu_moving = NULL;
 static cpumask_t cpupool_locked_cpus = CPU_MASK_NONE;
+static int cpupool_debug_move_continue = 0;
 
 static DEFINE_SPINLOCK(cpupool_lock);
 
@@ -226,6 +227,8 @@
                     cpupool_id, cpu, ret);
 
     spin_lock(&cpupool_lock);
+	BUG_ON(!cpupool_debug_move_continue); // Continuation still flagged?
+	BUG_ON(cpu != *((volatile int*)&cpupool_moving_cpu));
     ret = cpu_disable_scheduler(cpu);
     cpu_set(cpu, cpupool_free_cpus);
     if ( !ret )
@@ -236,6 +239,7 @@
         cpupool_put(cpupool_cpu_moving);
         cpupool_cpu_moving = NULL;
     }
+	cpupool_debug_move_continue = 0; // Continuation done.
     spin_unlock(&cpupool_lock);
     return ret;
 }
@@ -300,6 +304,8 @@
     atomic_inc(&c->refcnt);
     cpupool_cpu_moving = c;
     cpu_clear(cpu, c->cpu_valid);
+	BUG_ON(cpupool_debug_move_continue); // Only one outstanding continuation!
+	cpupool_debug_move_continue = 1;
     spin_unlock(&cpupool_lock);
 
     work_cpu = smp_processor_id();
@@ -309,6 +315,7 @@
         if ( work_cpu == cpu )
             work_cpu = next_cpu(cpu, cpupool0->cpu_valid);
     }
+	// SD NOTE:  Why not keep the protection through cpupool_lock until here?
     return continue_hypercall_on_cpu(work_cpu, cpupool_unassign_cpu_helper, c);
 
 out:
diff -r 9a6458e0c3f5 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Tue Feb 01 19:26:36 2011 +0000
+++ b/xen/common/sched_credit.c	Thu Feb 03 18:51:40 2011 +0100
@@ -567,6 +567,14 @@
         list_add(&svc->active_vcpu_elem, &sdom->active_vcpu);
         /* Make weight per-vcpu */
         prv->weight += sdom->weight;
+        if (prv->weight < sdom->active_vcpu_count * sdom->weight) {
+            printk("%s, %i: Dom: %i VCPU: %i prv: %p &prv->lock: %p prv->weight: %i "\
+                   "sdom->active_vcpu_count: %i sdom->weight: %i\n",
+                   __FILE__, __LINE__, sdom->dom->domain_id, svc->vcpu->vcpu_id,
+                   (void*) prv, &(prv->lock), prv->weight,
+                   sdom->active_vcpu_count, sdom->weight);
+        }
+        //BUG_ON(prv->weight < sdom->active_vcpu_count * sdom->weight);
         if ( list_empty(&sdom->active_sdom_elem) )
         {
             list_add(&sdom->active_sdom_elem, &prv->active_sdom);
@@ -591,6 +599,14 @@
     sdom->active_vcpu_count--;
     list_del_init(&svc->active_vcpu_elem);
     prv->weight -= sdom->weight;
+    if (prv->weight < sdom->active_vcpu_count * sdom->weight) {
+         printk("%s, %i: Dom: %i VCPU: %i prv: %p &prv->lock: %p prv->weight: %i "\
+                "sdom->active_vcpu_count: %i sdom->weight: %i\n",
+                __FILE__, __LINE__, sdom->dom->domain_id, svc->vcpu->vcpu_id,
+                (void*) prv, &(prv->lock), prv->weight,
+                sdom->active_vcpu_count, sdom->weight);
+    }
+    //BUG_ON(prv->weight < sdom->active_vcpu_count * sdom->weight);
     if ( list_empty(&sdom->active_vcpu) )
     {
         list_del_init(&sdom->active_sdom_elem);
@@ -987,6 +1003,17 @@
         BUG_ON( is_idle_domain(sdom->dom) );
         BUG_ON( sdom->active_vcpu_count == 0 );
         BUG_ON( sdom->weight == 0 );
+        if ( (sdom->weight * sdom->active_vcpu_count) > weight_left ) {
+            struct domain *d = sdom->dom;
+            struct vcpu   *v;
+            for_each_vcpu ( d, v ) {
+                printk("BUG in %s,%i: Domain %i VCPU: %i on processor: %i with "\
+                       "state %i violates invariant!\n",
+                       __FILE__,__LINE__, d->domain_id, v->vcpu_id, v->processor,
+                       v->runstate.state);
+            }
+        }
+
         BUG_ON( (sdom->weight * sdom->active_vcpu_count) > weight_left );
 
         weight_left -= ( sdom->weight * sdom->active_vcpu_count );

[-- Attachment #4: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-07 12:38                             ` Andre Przywara
@ 2011-02-07 13:32                               ` Juergen Gross
  2011-02-07 15:55                                 ` George Dunlap
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-07 13:32 UTC (permalink / raw)
  To: Andre Przywara; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 1842 bytes --]

On 02/07/11 13:38, Andre Przywara wrote:
> Juergen,
>
> as promised some more debug data. This is from c/s 22858 with Stephans
> debug patch (attached).
> We get the following dump when the hypervisor crashes, note that the
> first lock is different from the second and subsequent ones:
>
> (XEN) sched_credit.c, 572: prv: ffff831836df2970 &prv->lock:
> ffff831836df2970 prv->weight: 256 sdom->active_vcpu_count: 3
> sdom->weight: 256
> (XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock:
> ffff830437ffa5e0 prv->weight: 768 sdom->active_vcpu_count: 4
> sdom->weight: 256
> (XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock:
> ffff830437ffa5e0 prv->weight: 1024 sdom->active_vcpu_count: 5
> sdom->weight: 256
> (XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock:
> ffff830437ffa5e0 prv->weight: 1280 sdom->active_vcpu_count: 6
> sdom->weight: 256
>
> ....
>
> Hope that gives you an idea. I attach the whole log for your reference.

Hmm, could it be that your log wasn't created with the attached patch? The
Dom-Id and VCPU are missing from the printk() above, and those would be
interesting (at least I hope so)...
Additionally, printing the local pcpu number would help, too.
And could you add a printk for the new prv address in csched_init()?

It would be nice if you could enable cpupool diag output. Please use the
attached patch (includes the previous patch for executing the cpu move on the
cpu to be moved, plus some diag printk corrections).


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

[-- Attachment #2: diag.patch --]
[-- Type: text/x-patch, Size: 5480 bytes --]

diff -r 7ada6faef565 xen/common/cpupool.c
--- a/xen/common/cpupool.c	Sun Feb 06 17:26:31 2011 +0000
+++ b/xen/common/cpupool.c	Mon Feb 07 14:26:50 2011 +0100
@@ -35,7 +35,7 @@ static DEFINE_SPINLOCK(cpupool_lock);
 
 DEFINE_PER_CPU(struct cpupool *, cpupool);
 
-#define cpupool_dprintk(x...) ((void)0)
+#define cpupool_dprintk(x...) printk(x)
 
 static struct cpupool *alloc_cpupool_struct(void)
 {
@@ -227,14 +227,30 @@ static int cpupool_assign_cpu_locked(str
     return 0;
 }
 
+static long cpupool_assign_cpu_helper(void *info)
+{
+    int cpu = cpupool_moving_cpu;
+    long ret;
+
+    cpupool_dprintk("cpupool_assign_cpu(pool=%d,cpu=%d)\n",
+                    cpupool_cpu_moving->cpupool_id, cpu);
+    BUG_ON(!is_idle_vcpu(current));
+    BUG_ON(cpu != smp_processor_id());
+    spin_lock(&cpupool_lock);
+    ret = cpupool_assign_cpu_locked(cpupool_cpu_moving, cpu);
+    spin_unlock(&cpupool_lock);
+    return ret;
+}
+
 static long cpupool_unassign_cpu_helper(void *info)
 {
     int cpu = cpupool_moving_cpu;
     long ret;
 
-    cpupool_dprintk("cpupool_unassign_cpu(pool=%d,cpu=%d) ret %ld\n",
-                    cpupool_id, cpu, ret);
-
+    cpupool_dprintk("cpupool_unassign_cpu(pool=%d,cpu=%d)\n",
+                    cpupool_cpu_moving->cpupool_id, cpu);
+    BUG_ON(!is_idle_vcpu(current));
+    BUG_ON(cpu != smp_processor_id());
     spin_lock(&cpupool_lock);
     ret = cpu_disable_scheduler(cpu);
     cpu_set(cpu, cpupool_free_cpus);
@@ -258,9 +274,51 @@ out:
 }
 
 /*
+ * assign a specific cpu to a cpupool
+ * we must be sure to run on the cpu to be assigned in idle! to achieve this
+ * the main functionality is performed via continue_hypercall_on_cpu on the
+ * specific cpu.
+ * possible failures:
+ * - cpu not free
+ * - cpu just being unplugged
+ */
+int cpupool_assign_cpu(struct cpupool *c, unsigned int cpu)
+{
+    int ret;
+
+    cpupool_dprintk("cpupool_assign_cpu(pool=%d,cpu=%d)\n",
+                    c->cpupool_id, cpu);
+
+    spin_lock(&cpupool_lock);
+    ret = -EBUSY;
+    if ( (cpupool_moving_cpu != -1) && (cpu != cpupool_moving_cpu) )
+        goto out;
+    if ( cpu_isset(cpu, cpupool_locked_cpus) )
+        goto out;
+
+    ret = 0;
+    if ( !cpu_isset(cpu, cpupool_free_cpus) && (cpu != cpupool_moving_cpu) )
+        goto out;
+
+    cpupool_moving_cpu = cpu;
+    atomic_inc(&c->refcnt);
+    cpupool_cpu_moving = c;
+    cpu_clear(cpu, c->cpu_valid);
+    spin_unlock(&cpupool_lock);
+
+    return continue_hypercall_on_cpu(cpu, cpupool_assign_cpu_helper, c);
+
+out:
+    spin_unlock(&cpupool_lock);
+    cpupool_dprintk("cpupool_assign_cpu(pool=%d,cpu=%d) ret %d\n",
+                    c->cpupool_id, cpu, ret);
+    return ret;
+}
+
+/*
  * unassign a specific cpu from a cpupool
- * we must be sure not to run on the cpu to be unassigned! to achieve this
- * the main functionality is performed via continue_hypercall_on_cpu on a
+ * we must be sure to run on the cpu to be unassigned in idle! to achieve this
+ * the main functionality is performed via continue_hypercall_on_cpu on the
  * specific cpu.
  * if the cpu to be removed is the last one of the cpupool no active domain
  * must be bound to the cpupool. dying domains are moved to cpupool0 as they
@@ -271,7 +329,6 @@ out:
  */
 int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
 {
-    int work_cpu;
     int ret;
     struct domain *d;
 
@@ -319,19 +376,12 @@ int cpupool_unassign_cpu(struct cpupool 
     cpu_clear(cpu, c->cpu_valid);
     spin_unlock(&cpupool_lock);
 
-    work_cpu = smp_processor_id();
-    if ( work_cpu == cpu )
-    {
-        work_cpu = first_cpu(cpupool0->cpu_valid);
-        if ( work_cpu == cpu )
-            work_cpu = next_cpu(cpu, cpupool0->cpu_valid);
-    }
-    return continue_hypercall_on_cpu(work_cpu, cpupool_unassign_cpu_helper, c);
+    return continue_hypercall_on_cpu(cpu, cpupool_unassign_cpu_helper, c);
 
 out:
     spin_unlock(&cpupool_lock);
     cpupool_dprintk("cpupool_unassign_cpu(pool=%d,cpu=%d) ret %d\n",
-                    cpupool_id, cpu, ret);
+                    c->cpupool_id, cpu, ret);
     return ret;
 }
 
@@ -345,7 +395,7 @@ int cpupool_add_domain(struct domain *d,
 {
     struct cpupool *c;
     int rc = 1;
-    int n_dom;
+    int n_dom = 0;
 
     if ( poolid == CPUPOOLID_NONE )
         return 0;
@@ -472,27 +522,15 @@ int cpupool_do_sysctl(struct xen_sysctl_
     {
         unsigned cpu;
 
+        c = __cpupool_get_by_id(op->cpupool_id, 0);
+        ret = -ENOENT;
+        if ( c == NULL )
+            break;
         cpu = op->cpu;
-        cpupool_dprintk("cpupool_assign_cpu(pool=%d,cpu=%d)\n",
-                        op->cpupool_id, cpu);
-        spin_lock(&cpupool_lock);
         if ( cpu == XEN_SYSCTL_CPUPOOL_PAR_ANY )
             cpu = first_cpu(cpupool_free_cpus);
-        ret = -EINVAL;
-        if ( cpu >= NR_CPUS )
-            goto addcpu_out;
-        ret = -EBUSY;
-        if ( !cpu_isset(cpu, cpupool_free_cpus) )
-            goto addcpu_out;
-        c = cpupool_find_by_id(op->cpupool_id, 0);
-        ret = -ENOENT;
-        if ( c == NULL )
-            goto addcpu_out;
-        ret = cpupool_assign_cpu_locked(c, cpu);
-    addcpu_out:
-        spin_unlock(&cpupool_lock);
-        cpupool_dprintk("cpupool_assign_cpu(pool=%d,cpu=%d) ret %d\n",
-                        op->cpupool_id, cpu, ret);
+        ret = (cpu < NR_CPUS) ? cpupool_assign_cpu(c, cpu) : -EINVAL;
+        cpupool_put(c);
     }
     break;
 

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-07 13:32                               ` Juergen Gross
@ 2011-02-07 15:55                                 ` George Dunlap
  2011-02-08  5:43                                   ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: George Dunlap @ 2011-02-07 15:55 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 3406 bytes --]

Juergen,

What is supposed to happen if a domain is in cpupool0, and then all of
the cpus are taken out of cpupool0?  Is that possible?

It looks like there's code in cpupool.c:cpupool_unassign_cpu() which
will move all VMs in a cpupool to cpupool0 before removing the last
cpu.  But what happens if cpupool0 is the pool that has become empty?
It seems like that breaks a lot of the assumptions; e.g.,
sched_move_domain() seems to assume that the pool we're moving a VM to
actually has cpus.

While we're at it, what's with the "(cpu != cpupool_moving_cpu)" check in
the first half of cpupool_unassign_cpu()?  Under what conditions are you
anticipating cpupool_unassign_cpu() being called a second time before
the first completes?  If you have to abort the move because
schedule_cpu_switch() failed, wouldn't it be better just to roll the
whole transaction back, rather than leaving it hanging in the middle?
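
(By "roll the whole transaction back" I mean something like undoing the
state that was set up before the continuation.  A hypothetical sketch only;
the helper name is made up and the exact fields may differ:)

/* Hypothetical rollback helper -- for illustration only */
static void cpupool_unassign_cpu_rollback(struct cpupool *c, unsigned int cpu)
{
    spin_lock(&cpupool_lock);
    cpu_set(cpu, c->cpu_valid);   /* let the pool schedule on this cpu again */
    cpupool_moving_cpu = -1;      /* no move in flight any more */
    cpupool_cpu_moving = NULL;
    cpupool_put(c);               /* drop the reference taken for the move */
    spin_unlock(&cpupool_lock);
}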

Hmm, and why does RMCPU call cpupool_get_by_id() with exact==0?  What
could possibly be the use of grabbing a random cpupool and then trying
to remove the specified cpu from it?

Andre, you might think about folding the attached patch into your debug patch.

 -George

On Mon, Feb 7, 2011 at 1:32 PM, Juergen Gross
<juergen.gross@ts.fujitsu.com> wrote:
> On 02/07/11 13:38, Andre Przywara wrote:
>>
>> Juergen,
>>
>> as promised some more debug data. This is from c/s 22858 with Stephans
>> debug patch (attached).
>> We get the following dump when the hypervisor crashes, note that the
>> first lock is different from the second and subsequent ones:
>>
>> (XEN) sched_credit.c, 572: prv: ffff831836df2970 &prv->lock:
>> ffff831836df2970 prv->weight: 256 sdom->active_vcpu_count: 3
>> sdom->weight: 256
>> (XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock:
>> ffff830437ffa5e0 prv->weight: 768 sdom->active_vcpu_count: 4
>> sdom->weight: 256
>> (XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock:
>> ffff830437ffa5e0 prv->weight: 1024 sdom->active_vcpu_count: 5
>> sdom->weight: 256
>> (XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock:
>> ffff830437ffa5e0 prv->weight: 1280 sdom->active_vcpu_count: 6
>> sdom->weight: 256
>>
>> ....
>>
>> Hope that gives you an idea. I attach the whole log for your reference.
>
> Hmm, could it be your log wasn't created with the attached patch? I'm
> missing
> Dom-Id and VCPU from the printk() above, which would be interesting (at
> least
> I hope so)...
> Additionally printing the local pcpu number would help, too.
> And could you add a printk for the new prv address in csched_init()?
>
> It would be nice if you could enable cpupool diag output. Please use the
> attached patch (includes the previous patch for executing the cpu move on
> the
> cpu to be moved, plus some diag printk corrections).
>
>
> Juergen
>
> --
> Juergen Gross                 Principal Developer Operating Systems
> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
> Fujitsu Technology Solutions              e-mail:
> juergen.gross@ts.fujitsu.com
> Domagkstr. 28                           Internet: ts.fujitsu.com
> D-80807 Muenchen                 Company details:
> ts.fujitsu.com/imprint.html
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>
>

[-- Attachment #2: cpupools-bug-on-move-to-self.diff --]
[-- Type: text/x-diff, Size: 364 bytes --]

diff -r 0be7c0cd27ad xen/common/schedule.c
--- a/xen/common/schedule.c	Mon Feb 07 14:50:21 2011 +0000
+++ b/xen/common/schedule.c	Mon Feb 07 15:53:56 2011 +0000
@@ -234,6 +234,8 @@
     void **vcpu_priv;
     void *domdata;
 
+    BUG_ON(d->cpupool == c);
+
     domdata = SCHED_OP(c->sched, alloc_domdata, d);
     if ( domdata == NULL )
         return -ENOMEM;

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-07 15:55                                 ` George Dunlap
@ 2011-02-08  5:43                                   ` Juergen Gross
  2011-02-08 12:08                                     ` George Dunlap
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-08  5:43 UTC (permalink / raw)
  To: George Dunlap; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

On 02/07/11 16:55, George Dunlap wrote:
> Juergen,
>
> What is supposed to happen if a domain is in cpupool0, and then all of
> the cpus are taken out of cpupool0?  Is that possible?

No. Cpupool0 can't be without any cpu, as Dom0 is always a member of cpupool0.

>
> It looks like there's code in cpupools.c:cpupool_unassign_cpu() which
> will move all VMs in a cpupool to cpupool0 before removing the last
> cpu.  But what happens if cpupool0 is the pool that has become empty?
> It seems like that breaks a lot of the assumptions; e.g.,
> sched_move_domain() seems to assume that the pool we're moving a VM to
> actually has cpus.

The move of VMs to cpupool0 is done only for domains which are dying.
If there are any active domains in the cpupool, removing the last cpu from
it will be denied.
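
For reference, the relevant part of cpupool_unassign_cpu() looks roughly
like this (simplified; the "last cpu of the pool" condition is paraphrased):

/* only reached when the last cpu of the pool is about to be removed */
for_each_domain(d)
{
    if ( d->cpupool != c )
        continue;
    if ( !d->is_dying )
    {
        ret = -EBUSY;                         /* active domain left -> refuse */
        break;
    }
    ret = sched_move_domain(d, cpupool0);     /* dying domains go to cpupool0 */
    if ( ret )
        break;
}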

>
> While we're at it, what's with the "(cpu != cpu_moving_cpu)" in the
> first half of cpupool_unassign_cpu()?  Under what conditions are you
> anticipating cpupool_unassign_cpu() being called a second time before
> the first completes?  If you have to abort the move because
> schedule_cpu_switch() failed, wouldn't it be better just to roll the
> whole transaction back, rather than leaving it hanging in the middle?

Not really. It can take some time until all vcpus have been migrated to
another cpu. In this case -EAGAIN is returned, and the cpu has already been
removed from the cpumask of valid cpus for that cpupool so that no other
vcpus get scheduled on it. Without cpupool_moving_cpu, forward progress
would not be guaranteed.
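
To make that concrete, the guard at the top of cpupool_unassign_cpu() is
essentially (simplified):

spin_lock(&cpupool_lock);
ret = -EBUSY;
if ( (cpupool_moving_cpu != -1) && (cpu != cpupool_moving_cpu) )
    goto out;                     /* another move is still in flight */
/* ... further checks ... */
cpupool_moving_cpu = cpu;         /* remember the move in progress */
atomic_inc(&c->refcnt);
cpupool_cpu_moving = c;
cpu_clear(cpu, c->cpu_valid);     /* nothing gets scheduled on it any more */
spin_unlock(&cpupool_lock);

So a retry for the same cpu is allowed (and needed to finish the -EAGAIN
case), while moving a different cpu is refused until the pending move has
completed.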

>
> Hmm, and why does RMCPU call cpupool_get_by_id() with exact==0?  What
> could possibly be the use of grabbing a random cpupool and then trying
> to remove the specified cpu from it?

This is a very good question :-)
I think this should be fixed. Seems to be a copy and paste error. I'll send a
patch.


Thanks for your thoughts,


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-08  5:43                                   ` Juergen Gross
@ 2011-02-08 12:08                                     ` George Dunlap
  2011-02-08 12:14                                       ` George Dunlap
  2011-02-08 12:23                                       ` Juergen Gross
  0 siblings, 2 replies; 53+ messages in thread
From: George Dunlap @ 2011-02-08 12:08 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

On Tue, Feb 8, 2011 at 5:43 AM, Juergen Gross
<juergen.gross@ts.fujitsu.com> wrote:
> On 02/07/11 16:55, George Dunlap wrote:
>>
>> Juergen,
>>
>> What is supposed to happen if a domain is in cpupool0, and then all of
>> the cpus are taken out of cpupool0?  Is that possible?
>
> No. Cpupool0 can't be without any cpu, as Dom0 is always member of cpupool0.

If that's the case, then since Andre is running this immediately after
boot, he shouldn't be seeing any vcpus in the new pools; and all of
the dom0 vcpus should be migrated to cpupool0, right?  Is it possible
that the migration process isn't happening properly?

It looks like schedule.c:cpu_disable_scheduler() will try to migrate
all vcpus, and if it fails to migrate, it returns -EAGAIN so that the
tools will try again.  It's probably worth instrumenting that whole
code-path to make sure it actually happens as we expect.  Are we
certain, for example, that a hypercall continued on another cpu
will actually return the new error value properly?
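
(For reference, the shape of the per-vcpu code in cpu_disable_scheduler()
is roughly the following -- simplified, from memory:)

if ( v->processor == cpu )
{
    set_bit(_VPF_migrating, &v->pause_flags);
    vcpu_schedule_unlock_irq(v);
    vcpu_sleep_nosync(v);
    vcpu_migrate(v);
}
/* if the vcpu is still on this cpu afterwards, ask the caller to retry */
if ( v->processor == cpu )
    ret = -EAGAIN;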

Another minor thing: In cpupool.c:cpupool_unassign_cpu_helper(), why
is the cpu's bit set in cpupool_free_cpus without checking to see if
the cpu_disable_scheduler() call actually worked?  Shouldn't that also
be inside the if() statement?

 -George

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-08 12:08                                     ` George Dunlap
@ 2011-02-08 12:14                                       ` George Dunlap
  2011-02-08 16:33                                         ` Andre Przywara
  2011-02-08 12:23                                       ` Juergen Gross
  1 sibling, 1 reply; 53+ messages in thread
From: George Dunlap @ 2011-02-08 12:14 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 1498 bytes --]

Andre,

Can you try again with the attached patch?

Thanks,
 -George

On Tue, Feb 8, 2011 at 12:08 PM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> On Tue, Feb 8, 2011 at 5:43 AM, Juergen Gross
> <juergen.gross@ts.fujitsu.com> wrote:
>> On 02/07/11 16:55, George Dunlap wrote:
>>>
>>> Juergen,
>>>
>>> What is supposed to happen if a domain is in cpupool0, and then all of
>>> the cpus are taken out of cpupool0?  Is that possible?
>>
>> No. Cpupool0 can't be without any cpu, as Dom0 is always member of cpupool0.
>
> If that's the case, then since Andre is running this immediately after
> boot, he shouldn't be seeing any vcpus in the new pools; and all of
> the dom0 vcpus should be migrated to cpupool0, right?  Is it possible
> that migration process isn't happening properly?
>
> It looks like schedule.c:cpu_disable_scheduler() will try to migrate
> all vcpus, and if it fails to migrate, it returns -EAGAIN so that the
> tools will try again.  It's probably worth instrumenting that whole
> code-path to make sure it actually happens as we expect.  Are we
> certain, for example, that if a hypercall continued on another cpu
> will actually return the new error value properly?
>
> Another minor thing: In cpupool.c:cpupool_unassign_cpu_helper(), why
> is the cpu's bit set in cpupool_free_cpus without checking to see if
> the cpu_disable_scheduler() call actually worked?  Shouldn't that also
> be inside the if() statement?
>
>  -George
>

[-- Attachment #2: cpupools-vcpu-migrate-debug.diff --]
[-- Type: text/x-diff, Size: 1231 bytes --]

diff -r 9e463cb15658 xen/common/cpupool.c
--- a/xen/common/cpupool.c	Mon Feb 07 17:02:46 2011 +0000
+++ b/xen/common/cpupool.c	Tue Feb 08 12:13:35 2011 +0000
@@ -297,6 +297,8 @@
         {
             if ( d->cpupool != c )
                 continue;
+            /* Don't allow a cpu to be moved if there's a live
+             * domain still running on it */
             if ( !d->is_dying )
             {
                 ret = -EBUSY;
diff -r 9e463cb15658 xen/common/schedule.c
--- a/xen/common/schedule.c	Mon Feb 07 17:02:46 2011 +0000
+++ b/xen/common/schedule.c	Tue Feb 08 12:13:35 2011 +0000
@@ -495,6 +495,8 @@
 
             if ( v->processor == cpu )
             {
+                printk("%s: Migrating d%dv%d from cpu %d\n",
+                       __func__, d->domain_id, v->vcpu_id, cpu);
                 set_bit(_VPF_migrating, &v->pause_flags);
                 vcpu_schedule_unlock_irq(v);
                 vcpu_sleep_nosync(v);
@@ -511,7 +513,10 @@
              * all locks.
              */
             if ( v->processor == cpu )
+            {
+                printk("  Migration failed, must retry later.\n");
                 ret = -EAGAIN;
+            }
         }
 
         if ( affinity_broken )

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-08 12:08                                     ` George Dunlap
  2011-02-08 12:14                                       ` George Dunlap
@ 2011-02-08 12:23                                       ` Juergen Gross
  1 sibling, 0 replies; 53+ messages in thread
From: Juergen Gross @ 2011-02-08 12:23 UTC (permalink / raw)
  To: George Dunlap; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

On 02/08/11 13:08, George Dunlap wrote:
> On Tue, Feb 8, 2011 at 5:43 AM, Juergen Gross
> <juergen.gross@ts.fujitsu.com>  wrote:
>> On 02/07/11 16:55, George Dunlap wrote:
>>>
>>> Juergen,
>>>
>>> What is supposed to happen if a domain is in cpupool0, and then all of
>>> the cpus are taken out of cpupool0?  Is that possible?
>>
>> No. Cpupool0 can't be without any cpu, as Dom0 is always member of cpupool0.
>
> If that's the case, then since Andre is running this immediately after
> boot, he shouldn't be seeing any vcpus in the new pools; and all of
> the dom0 vcpus should be migrated to cpupool0, right?  Is it possible
> that migration process isn't happening properly?

Again: it is not the vcpus that are migrated to cpupool0, but the physical
cpus that are taken away from it, so the vcpus active on the cpu to be moved
MUST be migrated to other cpus of cpupool0.
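
In other words, after a cpu has been removed from cpupool0 the following
must hold for every Dom0 vcpu (just an illustration of the invariant, not a
proposed patch):

for_each_vcpu ( dom0, v )
{
    BUG_ON( v->processor == cpu );                           /* nobody left behind */
    BUG_ON( !cpu_isset(v->processor, cpupool0->cpu_valid) ); /* still in cpupool0 */
}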

>
> It looks like schedule.c:cpu_disable_scheduler() will try to migrate
> all vcpus, and if it fails to migrate, it returns -EAGAIN so that the
> tools will try again.  It's probably worth instrumenting that whole
> code-path to make sure it actually happens as we expect.  Are we
> certain, for example, that if a hypercall continued on another cpu
> will actually return the new error value properly?

I have checked that and never saw any problem. And yes, I did see
the EAGAIN case happen.
With my test patch, which always executes cpu_disable_scheduler() on the
cpu to be moved, this should not be a problem at all, since the tasklet
always runs in the idle vcpu.
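
(That is exactly what the two BUG_ONs at the top of the helper in the diag
patch assert:)

BUG_ON(!is_idle_vcpu(current));      /* continuation runs in the idle vcpu... */
BUG_ON(cpu != smp_processor_id());   /* ...and on the cpu that is being moved */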

>
> Another minor thing: In cpupool.c:cpupool_unassign_cpu_helper(), why
> is the cpu's bit set in cpupool_free_cpus without checking to see if
> the cpu_disable_scheduler() call actually worked?  Shouldn't that also
> be inside the if() statement?

No, I don't think so. If removing a cpu fails permanently after having
returned -EAGAIN before, it should be easy to add it back to the original
cpupool. That is only possible if it is flagged as free. Adding it to
another cpupool will still be denied, as cpupool_cpu_moving is set.
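
For reference, the helper does roughly this (simplified; the
schedule_cpu_switch() call and some bookkeeping are elided):

spin_lock(&cpupool_lock);
ret = cpu_disable_scheduler(cpu);
cpu_set(cpu, cpupool_free_cpus);    /* marked free even when ret == -EAGAIN */
if ( !ret )
{
    /* only on success is the move fully completed */
    cpupool_moving_cpu = -1;
    cpupool_put(cpupool_cpu_moving);
    cpupool_cpu_moving = NULL;
}
spin_unlock(&cpupool_lock);
return ret;

With the cpu flagged as free, a removal that ultimately fails can simply be
undone by adding the cpu back to the original pool.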


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-08 12:14                                       ` George Dunlap
@ 2011-02-08 16:33                                         ` Andre Przywara
  2011-02-09 12:27                                           ` George Dunlap
  0 siblings, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-02-08 16:33 UTC (permalink / raw)
  To: George Dunlap; +Cc: Juergen Gross, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 1935 bytes --]

George Dunlap wrote:
> Andre,
> 
> Can you try again with the attached patch?
Sure. Unfortunately (or is this a good sign?) the "Migration failed"
message didn't trigger; I only saw various instances of the other
printk, see the attached log file.
Migration happens quite often, because Dom0 has 48 vCPUs and in the
end they are squashed into fewer and fewer pCPUs. I guess that is the
reason why I see it on my machine.

Regards,
Andre.

> 
> Thanks,
>  -George
> 
> On Tue, Feb 8, 2011 at 12:08 PM, George Dunlap
> <George.Dunlap@eu.citrix.com> wrote:
>> On Tue, Feb 8, 2011 at 5:43 AM, Juergen Gross
>> <juergen.gross@ts.fujitsu.com> wrote:
>>> On 02/07/11 16:55, George Dunlap wrote:
>>>> Juergen,
>>>>
>>>> What is supposed to happen if a domain is in cpupool0, and then all of
>>>> the cpus are taken out of cpupool0?  Is that possible?
>>> No. Cpupool0 can't be without any cpu, as Dom0 is always member of cpupool0.
>> If that's the case, then since Andre is running this immediately after
>> boot, he shouldn't be seeing any vcpus in the new pools; and all of
>> the dom0 vcpus should be migrated to cpupool0, right?  Is it possible
>> that migration process isn't happening properly?
>>
>> It looks like schedule.c:cpu_disable_scheduler() will try to migrate
>> all vcpus, and if it fails to migrate, it returns -EAGAIN so that the
>> tools will try again.  It's probably worth instrumenting that whole
>> code-path to make sure it actually happens as we expect.  Are we
>> certain, for example, that if a hypercall continued on another cpu
>> will actually return the new error value properly?
>>
>> Another minor thing: In cpupool.c:cpupool_unassign_cpu_helper(), why
>> is the cpu's bit set in cpupool_free_cpus without checking to see if
>> the cpu_disable_scheduler() call actually worked?  Shouldn't that also
>> be inside the if() statement?
>>
>>  -George
>>


-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712

[-- Attachment #2: george_debug.log --]
[-- Type: text/plain, Size: 8076 bytes --]

root@dosorca:/data/images# sh numasplit.sh
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 6
(XEN) cpu_disable_scheduler: Migrating d0v26 from cpu 6
(XEN) cpu_disable_scheduler: Migrating d0v9 from cpu 7
(XEN) cpu_disable_scheduler: Migrating d0v23 from cpu 7
(XEN) cpu_disable_scheduler: Migrating d0v9 from cpu 8
(XEN) cpu_disable_scheduler: Migrating d0v19 from cpu 8
(XEN) cpu_disable_scheduler: Migrating d0v0 from cpu 9
(XEN) cpu_disable_scheduler: Migrating d0v9 from cpu 9
(XEN) cpu_disable_scheduler: Migrating d0v19 from cpu 9
(XEN) cpu_disable_scheduler: Migrating d0v0 from cpu 10
(XEN) cpu_disable_scheduler: Migrating d0v9 from cpu 10
(XEN) cpu_disable_scheduler: Migrating d0v19 from cpu 10
(XEN) cpu_disable_scheduler: Migrating d0v0 from cpu 11
(XEN) cpu_disable_scheduler: Migrating d0v9 from cpu 11
(XEN) cpu_disable_scheduler: Migrating d0v19 from cpu 11
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 11
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node1
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v44 from cpu 12
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 13
(XEN) cpu_disable_scheduler: Migrating d0v33 from cpu 13
(XEN) cpu_disable_scheduler: Migrating d0v44 from cpu 13
(XEN) cpu_disable_scheduler: Migrating d0v10 from cpu 14
(XEN) cpu_disable_scheduler: Migrating d0v33 from cpu 14
(XEN) cpu_disable_scheduler: Migrating d0v44 from cpu 14
(XEN) cpu_disable_scheduler: Migrating d0v10 from cpu 15
(XEN) cpu_disable_scheduler: Migrating d0v33 from cpu 15
(XEN) cpu_disable_scheduler: Migrating d0v44 from cpu 15
(XEN) cpu_disable_scheduler: Migrating d0v10 from cpu 16
(XEN) cpu_disable_scheduler: Migrating d0v33 from cpu 16
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 16
(XEN) cpu_disable_scheduler: Migrating d0v10 from cpu 17
(XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 17
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 17
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node2
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v10 from cpu 18
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 18
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 18
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 19
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 19
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 20
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 20
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 20
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 21
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 21
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 21
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 21
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 22
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 22
(XEN) cpu_disable_scheduler: Migrating d0v23 from cpu 22
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 22
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 22
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 23
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 23
(XEN) cpu_disable_scheduler: Migrating d0v23 from cpu 23
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 23
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node3
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 24
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 24
(XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 24
(XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 25
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 25
(XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 25
(XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 26
(XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 26
(XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 26
(XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 29
(XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 29
(XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 29
(XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 29
(XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 29
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node4
scheduler:      credit
number of cpus: 1
(XEN) Xen BUG at sched_credit.c:384
(XEN) ----[ Xen-4.1.0-rc3-pre  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    32
(XEN) RIP:    e008:[<ffff82c480117fa0>] csched_alloc_pdata+0x146/0x17f
(XEN) RFLAGS: 0000000000010093   CONTEXT: hypervisor
(XEN) rax: ffff830434322000   rbx: ffff830a3800f1e8   rcx: 0000000000000018
(XEN) rdx: ffff82c4802d3ec0   rsi: 0000000000000002   rdi: ffff83043445e100
(XEN) rbp: ffff8304343efce8   rsp: ffff8304343efca8   r8:  0000000000000001
(XEN) r9:  ffff830a3800f1e8   r10: ffff82c480219dc0   r11: 0000000000000286
(XEN) r12: 0000000000000018   r13: ffff8310341a7d50   r14: ffff830a3800f1d0
(XEN) r15: 0000000000000018   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000806aed000   cr2: 00007f50c671def5
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff8304343efca8:
(XEN)    ffff8304343efcb8 ffff8310341a7d50 0000000000000282 0000000000000018
(XEN)    ffff830a3800f460 ffff8310341a7c60 0000000000000018 ffff82c4802b0880
(XEN)    ffff8304343efd58 ffff82c48011fa63 ffff82f601024d80 000000000008126c
(XEN)    ffff8300c7e42000 0000000000000000 0000080000000000 ffff82c480248b80
(XEN)    0000000000000002 0000000000000018 ffff830a3800f460 0000000000305000
(XEN)    ffff82c4802550e4 ffff82c4802b0880 ffff8304343efd78 ffff82c48010188c
(XEN)    ffff8304343efe40 0000000000000018 ffff8304343efdb8 ffff82c480101b94
(XEN)    ffff8304343efdb8 ffff82c480183562 fffffffe00000286 ffff8304343eff18
(XEN)    000000000066e004 0000000000305000 ffff8304343efef8 ffff82c4801252a1
(XEN)    ffff8304343efdd8 0000000180153c8d 0000000000000000 ffff82c4801068f8
(XEN)    0000000000000296 ffff8300c7e1e1c8 aaaaaaaaaaaaaaaa 0000000000000000
(XEN)    ffff88007d094170 ffff88007d094170 ffff8304343efef8 ffff82c480113d8a
(XEN)    ffff8304343efe78 ffff8304343efe88 0000000800000012 0000000400000004
(XEN)    00007fff00000001 0000000000000018 00000000000000b3 0000000000000072
(XEN)    00007f50c64e5960 0000000000000018 00007fff85f117c0 00007f50c6b48342
(XEN)    0000000000000001 0000000000000000 0000000000000018 0000000000000004
(XEN)    000000000066d050 000000000066e000 85f1189c00000000 0000000000000033
(XEN)    ffff8304343efed8 ffff8300c7e1e000 00007fff85f11600 0000000000305000
(XEN)    0000000000000003 0000000000000003 00007cfbcbc100c7 ffff82c480207be8
(XEN)    ffffffff8100946a 0000000000000023 0000000000000003 0000000000000003
(XEN) Xen call trace:
(XEN)    [<ffff82c480117fa0>] csched_alloc_pdata+0x146/0x17f
(XEN)    [<ffff82c48011fa63>] schedule_cpu_switch+0x75/0x1cd
(XEN)    [<ffff82c48010188c>] cpupool_assign_cpu_locked+0x44/0x8b
(XEN)    [<ffff82c480101b94>] cpupool_do_sysctl+0x1fb/0x461
(XEN)    [<ffff82c4801252a1>] do_sysctl+0x921/0xa30
(XEN)    [<ffff82c480207be8>] syscall_enter+0xc8/0x122
(XEN)    
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 32:
(XEN) Xen BUG at sched_credit.c:384
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-08 16:33                                         ` Andre Przywara
@ 2011-02-09 12:27                                           ` George Dunlap
  2011-02-09 12:27                                             ` George Dunlap
  0 siblings, 1 reply; 53+ messages in thread
From: George Dunlap @ 2011-02-09 12:27 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Juergen Gross, xen-devel, Diestelhorst, Stephan

On Tue, Feb 8, 2011 at 4:33 PM, Andre Przywara <andre.przywara@amd.com> wrote:
> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 24
> (XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 24
> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 24
> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 25
> (XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 25
> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 25
> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 26
> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 26
> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 26
> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 27
> (XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 27
> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 27
> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 27
> (XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 28
> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 28
> (XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 28
> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 28
> (XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 28
> (XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 29

Interesting -- what seems to happen here is that as cpus are disabled,
vcpus are "shovelled" in an accumulative fashion from one cpu to the
next:
* v18,34,42 start on cpu 24.
* When 24 is brought down, they're all migrated to 25; then when 25 is
brought down, to 26, then to 27.
* v24 is running on cpu 27, so when 27 is brought down, v24 is added to the mix.
* v3 is running on cpu 28, so all of them plus v3 are shovelled onto cpu 29.

While that behavior may not be ideal, it should certainly be bug-free.

Another interesting thing to note is that the bug happened on pcpu 32,
but there were no advertised migrations from that cpu.

Andre, can you fold the attached patch into your testing?

Thanks for all your work on this.

 -George

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-09 12:27                                           ` George Dunlap
@ 2011-02-09 12:27                                             ` George Dunlap
  2011-02-09 13:04                                               ` Juergen Gross
  2011-02-09 13:51                                               ` Andre Przywara
  0 siblings, 2 replies; 53+ messages in thread
From: George Dunlap @ 2011-02-09 12:27 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Juergen Gross, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 2155 bytes --]

Sorry, forgot the patch...
 -G

On Wed, Feb 9, 2011 at 12:27 PM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> On Tue, Feb 8, 2011 at 4:33 PM, Andre Przywara <andre.przywara@amd.com> wrote:
>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 24
>> (XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 24
>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 24
>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 25
>> (XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 25
>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 25
>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 26
>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 26
>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 26
>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 27
>> (XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 27
>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 27
>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 27
>> (XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 28
>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 28
>> (XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 28
>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 28
>> (XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 28
>> (XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 29
>
> Interesting -- what seems to happen here is that as cpus are disabled,
> vcpus are "shovelled" in an accumulative fashion from one cpu to the
> next:
> * v18,34,42 start on cpu 24.
> * When 24 is brought down, they're all migrated to 25; then when 25 is
> brougth down, to 26, then to 27
> * v24 is running on cpu 27, so when 27 is brought down, v24 is added to the mix
> * v3 is running on cpu 28, so all of them plus v3 are shoveled onto cpu 29.
>
> While that behavior may not be ideal, it should certainly be bug-free.
>
> Another interesting thing to note is that the bug happened on pcpu 32,
> but there were no advertised migrations from that cpu.
>
> Andre, can you fold the attached patch into your testing?
>
> Thanks for all your work on this.
>
>  -George
>

[-- Attachment #2: cpupools-debug-curr-not-idle.diff --]
[-- Type: text/x-diff, Size: 673 bytes --]

diff -r 9ddf07022b3f xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Wed Feb 09 10:29:53 2011 +0000
+++ b/xen/common/sched_credit.c	Wed Feb 09 10:51:05 2011 +0000
@@ -381,6 +381,14 @@
         per_cpu(schedule_data, cpu).sched_priv = spc;
 
     /* Start off idling... */
+    if ( !is_idle_vcpu(per_cpu(schedule_data, cpu).curr) )
+    {
+        printk("%s: curr d%dv%d on p%d!\n",
+               __func__,
+               per_cpu(schedule_data, cpu).curr->domain->domain_id,
+               per_cpu(schedule_data, cpu).curr->vcpu_id,
+               cpu);
+    }
     BUG_ON(!is_idle_vcpu(per_cpu(schedule_data, cpu).curr));
     cpu_set(cpu, prv->idlers);
 

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-09 12:27                                             ` George Dunlap
@ 2011-02-09 13:04                                               ` Juergen Gross
  2011-02-09 13:39                                                 ` Andre Przywara
  2011-02-09 13:51                                               ` Andre Przywara
  1 sibling, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-09 13:04 UTC (permalink / raw)
  To: George Dunlap; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

On 02/09/11 13:27, George Dunlap wrote:
> Sorry, forgot the patch...
>   -G
>
> On Wed, Feb 9, 2011 at 12:27 PM, George Dunlap
> <George.Dunlap@eu.citrix.com>  wrote:
>> On Tue, Feb 8, 2011 at 4:33 PM, Andre Przywara<andre.przywara@amd.com>  wrote:
>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 24
>>> (XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 24
>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 24
>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 25
>>> (XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 25
>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 25
>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 26
>>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 26
>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 26
>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 27
>>> (XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 27
>>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 27
>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 27
>>> (XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 28
>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 28
>>> (XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 28
>>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 28
>>> (XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 28
>>> (XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 29
>>
>> Interesting -- what seems to happen here is that as cpus are disabled,
>> vcpus are "shovelled" in an accumulative fashion from one cpu to the
>> next:
>> * v18,34,42 start on cpu 24.
>> * When 24 is brought down, they're all migrated to 25; then when 25 is
>> brougth down, to 26, then to 27
>> * v24 is running on cpu 27, so when 27 is brought down, v24 is added to the mix
>> * v3 is running on cpu 28, so all of them plus v3 are shoveled onto cpu 29.
>>
>> While that behavior may not be ideal, it should certainly be bug-free.
>>
>> Another interesting thing to note is that the bug happened on pcpu 32,
>> but there were no advertised migrations from that cpu.

If I understand the configuration of Andre's machine correctly, pcpu 32
will be the target of the next migrations. This pcpu is a member of the
next NUMA node, correct?

Could it be that there is a problem with the call of
domain_update_node_affinity() from cpu_disable_scheduler()?

Hmm, I think this could really be the problem.
Andre, could you try the following patch?

diff -r f1fac30a531b xen/common/schedule.c
--- a/xen/common/schedule.c     Wed Feb 09 08:58:11 2011 +0000
+++ b/xen/common/schedule.c     Wed Feb 09 14:02:12 2011 +0100
@@ -491,6 +491,10 @@ int cpu_disable_scheduler(unsigned int c
                          v->domain->domain_id, v->vcpu_id);
                  cpus_setall(v->cpu_affinity);
                  affinity_broken = 1;
+            }
+            if ( cpus_weight(v->cpu_affinity) < NR_CPUS )
+            {
+                cpu_clear(cpu, v->cpu_affinity);
              }

              if ( v->processor == cpu )


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-09 13:04                                               ` Juergen Gross
@ 2011-02-09 13:39                                                 ` Andre Przywara
  0 siblings, 0 replies; 53+ messages in thread
From: Andre Przywara @ 2011-02-09 13:39 UTC (permalink / raw)
  To: Juergen Gross; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

Juergen Gross wrote:
>>> Another interesting thing to note is that the bug happened on pcpu 32,
>>> but there were no advertised migrations from that cpu.
> 
> If I understand the configuration of Andre's machine correctly, pcpu32 will
> be the target of the next migrations. This pcpu is member of the next numa
> node, correct?
No, this is a box with 6 cores per node, so the NUMA node spans pcpus 30-35.
> 
> Could it be there is a problem with the call of domain_update_node_affinity()
> from cpu_disable_scheduler() ?
> 
> Hmm, I think this could really be the problem.
> Andre, could you try the following patch?
Sorry, but that one didn't help. It crashed with the well-known BUG_ON:
(XEN) Xen BUG at sched_credit.c:990
(which is the weight assert in csched_acct (c/s 22858))

Regards,
Andre.

> 
> diff -r f1fac30a531b xen/common/schedule.c
> --- a/xen/common/schedule.c     Wed Feb 09 08:58:11 2011 +0000
> +++ b/xen/common/schedule.c     Wed Feb 09 14:02:12 2011 +0100
> @@ -491,6 +491,10 @@ int cpu_disable_scheduler(unsigned int c
>                           v->domain->domain_id, v->vcpu_id);
>                   cpus_setall(v->cpu_affinity);
>                   affinity_broken = 1;
> +            }
> +            if ( cpus_weight(v->cpu_affinity) < NR_CPUS )
> +            {
> +                cpu_clear(cpu, v->cpu_affinity);
>               }
> 
>               if ( v->processor == cpu )
> 
> 
> Juergen
> 


-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-09 12:27                                             ` George Dunlap
  2011-02-09 13:04                                               ` Juergen Gross
@ 2011-02-09 13:51                                               ` Andre Przywara
  2011-02-09 14:21                                                 ` Juergen Gross
  1 sibling, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-02-09 13:51 UTC (permalink / raw)
  To: George Dunlap; +Cc: Juergen Gross, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 2741 bytes --]

George Dunlap wrote:
> <George.Dunlap@eu.citrix.com> wrote:
>> On Tue, Feb 8, 2011 at 4:33 PM, Andre Przywara <andre.przywara@amd.com> wrote:
>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 24
>>> (XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 24
>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 24
>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 25
>>> (XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 25
>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 25
>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 26
>>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 26
>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 26
>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 27
>>> (XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 27
>>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 27
>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 27
>>> (XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 28
>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 28
>>> (XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 28
>>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 28
>>> (XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 28
>>> (XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 29
>> Interesting -- what seems to happen here is that as cpus are disabled,
>> vcpus are "shovelled" in an accumulative fashion from one cpu to the
>> next:
>> * v18,34,42 start on cpu 24.
>> * When 24 is brought down, they're all migrated to 25; then when 25 is
>> brougth down, to 26, then to 27
>> * v24 is running on cpu 27, so when 27 is brought down, v24 is added to the mix
>> * v3 is running on cpu 28, so all of them plus v3 are shoveled onto cpu 29.
>>
>> While that behavior may not be ideal, it should certainly be bug-free.
>>
>> Another interesting thing to note is that the bug happened on pcpu 32,
>> but there were no advertised migrations from that cpu.
>>
>> Andre, can you fold the attached patch into your testing?
Sorry, but that bug (and its output) didn't trigger on two tries.
Instead I now saw two occurrences of the "Migration failed, must retry
later" message. Interestingly enough, it does not seem to be fatal: the
first time it triggers, the numa-split even completes; after I roll it
back and repeat the split, the message shows up again, but the run
crashes later on that old BUG_ON().

See the attached log for more details.

Thanks for the try, anyway.

Regards,
Andre.


>>
>> Thanks for all your work on this.
I am glad for all your help. I am only starting to really understand
the scheduler, so your support is much appreciated.

>>
>>  -George
>>


-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

[-- Attachment #2: george_2_debug.log --]
[-- Type: text/plain, Size: 20965 bytes --]

root@dosorca:/data/images# sh numasplit.sh create
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 6
(XEN) cpu_disable_scheduler: Migrating d0v7 from cpu 7
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 7
(XEN) cpu_disable_scheduler: Migrating d0v7 from cpu 8
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 8
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 9
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 10
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 11
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node1
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 12
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 12
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 13
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 13
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 14
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 14
(XEN) cpu_disable_scheduler: Migrating d0v0 from cpu 15
(XEN)   Migration failed, must retry later.
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 15
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 15
(XEN) cpu_disable_scheduler: Migrating d0v15 from cpu 16
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 16
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 16
(XEN) cpu_disable_scheduler: Migrating d0v44 from cpu 16
(XEN) cpu_disable_scheduler: Migrating d0v8 from cpu 17
(XEN) cpu_disable_scheduler: Migrating d0v15 from cpu 17
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 17
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 17
(XEN) cpu_disable_scheduler: Migrating d0v44 from cpu 17
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node2
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v7 from cpu 18
(XEN) cpu_disable_scheduler: Migrating d0v1 from cpu 19
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 19
(XEN) cpu_disable_scheduler: Migrating d0v1 from cpu 20
(XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 20
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 20
(XEN) cpu_disable_scheduler: Migrating d0v1 from cpu 21
(XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 21
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 21
(XEN) cpu_disable_scheduler: Migrating d0v1 from cpu 22
(XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 22
(XEN) cpu_disable_scheduler: Migrating d0v38 from cpu 22
(XEN) cpu_disable_scheduler: Migrating d0v1 from cpu 23
(XEN) cpu_disable_scheduler: Migrating d0v20 from cpu 23
(XEN) cpu_disable_scheduler: Migrating d0v38 from cpu 23
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node3
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v12 from cpu 24
(XEN) cpu_disable_scheduler: Migrating d0v30 from cpu 24
(XEN) cpu_disable_scheduler: Migrating d0v5 from cpu 25
(XEN) cpu_disable_scheduler: Migrating d0v30 from cpu 25
(XEN) cpu_disable_scheduler: Migrating d0v5 from cpu 26
(XEN) cpu_disable_scheduler: Migrating d0v16 from cpu 26
(XEN) cpu_disable_scheduler: Migrating d0v28 from cpu 26
(XEN) cpu_disable_scheduler: Migrating d0v44 from cpu 26
(XEN) cpu_disable_scheduler: Migrating d0v5 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v16 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v28 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v5 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v16 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v22 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v38 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v5 from cpu 29
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 29
(XEN) cpu_disable_scheduler: Migrating d0v26 from cpu 29
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 29
(XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 29
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node4
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v22 from cpu 30
(XEN) cpu_disable_scheduler: Migrating d0v40 from cpu 30
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 31
(XEN) cpu_disable_scheduler: Migrating d0v20 from cpu 31
(XEN) cpu_disable_scheduler: Migrating d0v40 from cpu 31
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 32
(XEN) cpu_disable_scheduler: Migrating d0v20 from cpu 32
(XEN) cpu_disable_scheduler: Migrating d0v40 from cpu 32
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 33
(XEN) cpu_disable_scheduler: Migrating d0v20 from cpu 33
(XEN) cpu_disable_scheduler: Migrating d0v35 from cpu 33
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 34
(XEN) cpu_disable_scheduler: Migrating d0v20 from cpu 34
(XEN) cpu_disable_scheduler: Migrating d0v35 from cpu 34
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 35
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 35
(XEN) cpu_disable_scheduler: Migrating d0v26 from cpu 35
(XEN) cpu_disable_scheduler: Migrating d0v35 from cpu 35
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node5
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 36
(XEN) cpu_disable_scheduler: Migrating d0v45 from cpu 36
(XEN) cpu_disable_scheduler: Migrating d0v5 from cpu 37
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 37
(XEN) cpu_disable_scheduler: Migrating d0v22 from cpu 37
(XEN) cpu_disable_scheduler: Migrating d0v45 from cpu 37
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 38
(XEN) cpu_disable_scheduler: Migrating d0v13 from cpu 38
(XEN) cpu_disable_scheduler: Migrating d0v22 from cpu 38
(XEN) cpu_disable_scheduler: Migrating d0v28 from cpu 38
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 38
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v13 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v26 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v38 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v13 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v38 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v13 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v20 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v38 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 41
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node6
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v8 from cpu 42
(XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 42
(XEN) cpu_disable_scheduler: Migrating d0v35 from cpu 42
(XEN) cpu_disable_scheduler: Migrating d0v46 from cpu 42
(XEN) cpu_disable_scheduler: Migrating d0v0 from cpu 43
(XEN) cpu_disable_scheduler: Migrating d0v8 from cpu 43
(XEN) cpu_disable_scheduler: Migrating d0v12 from cpu 43
(XEN) cpu_disable_scheduler: Migrating d0v19 from cpu 43
(XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 43
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 43
(XEN) cpu_disable_scheduler: Migrating d0v43 from cpu 43
(XEN) cpu_disable_scheduler: Migrating d0v0 from cpu 44
(XEN) cpu_disable_scheduler: Migrating d0v8 from cpu 44
(XEN) cpu_disable_scheduler: Migrating d0v15 from cpu 44
(XEN) cpu_disable_scheduler: Migrating d0v23 from cpu 44
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 44
(XEN) cpu_disable_scheduler: Migrating d0v40 from cpu 44
(XEN) cpu_disable_scheduler: Migrating d0v0 from cpu 45
(XEN) cpu_disable_scheduler: Migrating d0v8 from cpu 45
(XEN) cpu_disable_scheduler: Migrating d0v13 from cpu 45
(XEN) cpu_disable_scheduler: Migrating d0v21 from cpu 45
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 45
(XEN) cpu_disable_scheduler: Migrating d0v37 from cpu 45
(XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 45
(XEN) cpu_disable_scheduler: Migrating d0v8 from cpu 46
(XEN) cpu_disable_scheduler: Migrating d0v21 from cpu 46
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 46
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 46
(XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 46
(XEN) cpu_disable_scheduler: Migrating d0v12 from cpu 47
(XEN) cpu_disable_scheduler: Migrating d0v16 from cpu 47
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 47
(XEN) cpu_disable_scheduler: Migrating d0v33 from cpu 47
(XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 47
(XEN) cpu_disable_scheduler: Migrating d0v43 from cpu 47
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node7
scheduler:      credit
number of cpus: 1
Populating new pool
root@dosorca:/data/images# sh numasplit.sh create revert
Destroying Pool 1
adding freed CPUs to pool 0
Destroying Pool 2
adding freed CPUs to pool 0
Destroying Pool 3
adding freed CPUs to pool 0
Destroying Pool 4
adding freed CPUs to pool 0
Destroying Pool 5
adding freed CPUs to pool 0
Destroying Pool 6
adding freed CPUs to pool 0
Destroying Pool 7
adding freed CPUs to pool 0
root@dosorca:/data/images# sh numasplit.sh create
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 6
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 6
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 7
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 7
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 7
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 8
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 8
(XEN) cpu_disable_scheduler: Migrating d0v22 from cpu 8
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 8
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 9
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 9
(XEN) cpu_disable_scheduler: Migrating d0v22 from cpu 9
(XEN) cpu_disable_scheduler: Migrating d0v31 from cpu 9
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 10
(XEN) cpu_disable_scheduler: Migrating d0v22 from cpu 10
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 10
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 11
(XEN) cpu_disable_scheduler: Migrating d0v17 from cpu 11
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 11
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node1
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 12
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 12
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 13
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 13
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 13
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 14
(XEN) cpu_disable_scheduler: Migrating d0v19 from cpu 14
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 14
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 14
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 15
(XEN) cpu_disable_scheduler: Migrating d0v19 from cpu 15
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 15
(XEN) cpu_disable_scheduler: Migrating d0v37 from cpu 15
(XEN) cpu_disable_scheduler: Migrating d0v41 from cpu 15
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 16
(XEN) cpu_disable_scheduler: Migrating d0v19 from cpu 16
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 16
(XEN) cpu_disable_scheduler: Migrating d0v37 from cpu 16
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 17
(XEN) cpu_disable_scheduler: Migrating d0v19 from cpu 17
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 17
(XEN) cpu_disable_scheduler: Migrating d0v37 from cpu 17
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node2
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 18
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 18
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 19
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 19
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 20
(XEN) cpu_disable_scheduler: Migrating d0v21 from cpu 20
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 20
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 21
(XEN) cpu_disable_scheduler: Migrating d0v21 from cpu 21
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 21
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 22
(XEN) cpu_disable_scheduler: Migrating d0v21 from cpu 22
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 22
(XEN) cpu_disable_scheduler: Migrating d0v6 from cpu 23
(XEN) cpu_disable_scheduler: Migrating d0v21 from cpu 23
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 23
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node3
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 24
(XEN) cpu_disable_scheduler: Migrating d0v7 from cpu 25
(XEN) cpu_disable_scheduler: Migrating d0v17 from cpu 25
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 25
(XEN) cpu_disable_scheduler: Migrating d0v7 from cpu 26
(XEN) cpu_disable_scheduler: Migrating d0v17 from cpu 26
(XEN) cpu_disable_scheduler: Migrating d0v29 from cpu 26
(XEN) cpu_disable_scheduler: Migrating d0v43 from cpu 26
(XEN) cpu_disable_scheduler: Migrating d0v7 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v17 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 27
(XEN) cpu_disable_scheduler: Migrating d0v7 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v17 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 28
(XEN) cpu_disable_scheduler: Migrating d0v7 from cpu 29
(XEN) cpu_disable_scheduler: Migrating d0v17 from cpu 29
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 29
(XEN) cpu_disable_scheduler: Migrating d0v38 from cpu 29
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node4
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v21 from cpu 31
(XEN) cpu_disable_scheduler: Migrating d0v21 from cpu 32
(XEN) cpu_disable_scheduler: Migrating d0v46 from cpu 32
(XEN)   Migration failed, must retry later.
(XEN) cpu_disable_scheduler: Migrating d0v14 from cpu 33
(XEN) cpu_disable_scheduler: Migrating d0v8 from cpu 34
(XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 34
(XEN) cpu_disable_scheduler: Migrating d0v28 from cpu 34
(XEN) cpu_disable_scheduler: Migrating d0v4 from cpu 35
(XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 35
(XEN) cpu_disable_scheduler: Migrating d0v28 from cpu 35
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node5
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) cpu_disable_scheduler: Migrating d0v1 from cpu 36
(XEN) cpu_disable_scheduler: Migrating d0v15 from cpu 36
(XEN) cpu_disable_scheduler: Migrating d0v35 from cpu 36
(XEN) cpu_disable_scheduler: Migrating d0v44 from cpu 36
(XEN) cpu_disable_scheduler: Migrating d0v1 from cpu 37
(XEN) cpu_disable_scheduler: Migrating d0v15 from cpu 37
(XEN) cpu_disable_scheduler: Migrating d0v35 from cpu 37
(XEN) cpu_disable_scheduler: Migrating d0v44 from cpu 37
(XEN) cpu_disable_scheduler: Migrating d0v1 from cpu 38
(XEN) cpu_disable_scheduler: Migrating d0v15 from cpu 38
(XEN) cpu_disable_scheduler: Migrating d0v28 from cpu 38
(XEN) cpu_disable_scheduler: Migrating d0v35 from cpu 38
(XEN) cpu_disable_scheduler: Migrating d0v44 from cpu 38
(XEN) cpu_disable_scheduler: Migrating d0v1 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v13 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v28 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 39
(XEN) cpu_disable_scheduler: Migrating d0v0 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v13 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v47 from cpu 40
(XEN) cpu_disable_scheduler: Migrating d0v0 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v8 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v13 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v19 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v26 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 41
(XEN) cpu_disable_scheduler: Migrating d0v45 from cpu 41
Rewriting config file
Creating new pool
Using config file "cpupool.test"
cpupool name:   Pool-node6
scheduler:      credit
number of cpus: 1
Populating new pool
Removing CPUs from Pool 0
(XEN) Xen BUG at sched_credit.c:998
(XEN) ----[ Xen-4.1.0-rc3-pre  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c48011814d>] csched_acct+0x11f/0x41c
(XEN) RFLAGS: 0000000000010006   CONTEXT: hypervisor
(XEN) rax: 0000000000000010   rbx: 0000000000000f00   rcx: 0000000000000100
(XEN) rdx: 0000000000001000   rsi: ffff830437ffa600   rdi: 0000000000000010
(XEN) rbp: ffff82c480297e38   rsp: ffff82c480297da8   r8:  0000000000000100
(XEN) r9:  0000000000000007   r10: ffff82c4802cbfe0   r11: 0000009d467684b5
(XEN) r12: ffff830437ffa5e0   r13: ffff82c48011802e   r14: ffff830433af2018
(XEN) r15: ffff830434321ec0   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 00000008067e3000   cr2: 00007f5a56dce590
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c480297da8:
(XEN)    ffff82c480297dc8 fffffed480153c5e ffff830400000eff ffff830437ffa5e0
(XEN)    ffff830437ffa5e8 ffff82c480153ced ffff830437ffa5e0 0000000000000292
(XEN)    ffff830437ffa5e8 00000e1034322000 00000f0000000f00 0000000000000000
(XEN)    ffff82c400000000 ffff82c4802d3f80 ffff830437ffa5e0 ffff82c48011802e
(XEN)    ffff830433af2018 ffff830433af2010 ffff82c480297e68 ffff82c480125fc4
(XEN)    0000000000000002 ffff830437ffa600 ffff82c4802d3f80 0000009d44e55c65
(XEN)    ffff82c480297eb8 ffff82c4801262e9 0000000000000001 ffff82c4802d3f80
(XEN)    ffff830433af2010 0000000000000000 0000000000000000 ffff82c4802b0880
(XEN)    ffff82c480297f18 ffffffffffffffff ffff82c480297ef8 ffff82c4801233b7
(XEN)    ffff82c480297ed8 ffff8300c7e0a000 ffff88007ce88f40 ffff8817a7c47000
(XEN)    0000000000000001 000000000000002c ffff82c480297f08 ffff82c480123432
(XEN)    00007d3b7fd680c7 ffff82c480207d16 000000000000002c 0000000000000001
(XEN)    ffff8817a7c47000 ffff88007ce88f40 ffff88179f9ffce8 ffff8817a84df000
(XEN)    0000000000000286 000000000000000f ffff88179f9ffcf8 ffff88007ce898c0
(XEN)    0000000000000000 ffffffff8100940a ffff88007ce75000 00000000deadbeef
(XEN)    00000000deadbeef 0000010000000000 ffffffff8100940a 000000000000e033
(XEN)    0000000000000286 ffff88179f9ffca0 000000000000e02b 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    ffff8300c7e0a000 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c48011814d>] csched_acct+0x11f/0x41c
(XEN)    [<ffff82c480125fc4>] execute_timer+0x4e/0x6c
(XEN)    [<ffff82c4801262e9>] timer_softirq_action+0xf2/0x245
(XEN)    [<ffff82c4801233b7>] __do_softirq+0x88/0x99
(XEN)    [<ffff82c480123432>] do_softirq+0x6a/0x7a
(XEN)    
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at sched_credit.c:998
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-09 13:51                                               ` Andre Przywara
@ 2011-02-09 14:21                                                 ` Juergen Gross
  2011-02-10  6:42                                                   ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-09 14:21 UTC (permalink / raw)
  To: Andre Przywara; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

Andre, George,


What seems interesting: I think the problem always occurred when a new
cpupool was created and the first cpu was moved to it.

I think my previous assumption regarding the master_ticker was not too bad.
Somehow the master_ticker of the new cpupool seems to become active before
the scheduler is really initialized properly. This could happen if enough
time passes between alloc_pdata for the cpu to be moved and the critical
section in schedule_cpu_switch().

The solution should be to activate the timers only once the scheduler is
ready for them.

George, do you think the master_ticker should be stopped in suspend_ticker
as well? I still see potential problems when entering deep C-states. I think
I'll prepare a patch which keeps the master_ticker active for the C-state
case and migrates it for the schedule_cpu_switch() case.
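Roughly, this is the direction I have in mind (just a sketch, not the final
patch - alloc_pdata would no longer arm the timers itself, the resume path
would do it once the cpu is really taken over, and the free_pdata path still
has to migrate the master ticker):

    /* Sketch only: master_active would be a new field in csched_private. */
    static void csched_tick_resume(const struct scheduler *ops, unsigned int cpu)
    {
        struct csched_private *prv = CSCHED_PRIV(ops);
        struct csched_pcpu *spc = CSCHED_PCPU(cpu);
        uint64_t now = NOW();

        /* Re-arm the per-cpu tick as today. */
        set_timer(&spc->ticker, now + MILLISECS(CSCHED_MSECS_PER_TICK)
                  - now % MILLISECS(CSCHED_MSECS_PER_TICK));

        /* Only the master cpu arms the accounting timer, and only once
         * the scheduler really owns this cpu. */
        if ( (prv->master == cpu) && !prv->master_active )
        {
            set_timer(&prv->master_ticker, now +
                      MILLISECS(CSCHED_MSECS_PER_TICK) * CSCHED_TICKS_PER_ACCT);
            prv->master_active = 1;
        }
    }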


Juergen

On 02/09/11 14:51, Andre Przywara wrote:
> George Dunlap wrote:
>> <George.Dunlap@eu.citrix.com> wrote:
>>> On Tue, Feb 8, 2011 at 4:33 PM, Andre Przywara
>>> <andre.przywara@amd.com> wrote:
>>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 24
>>>> (XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 24
>>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 24
>>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 25
>>>> (XEN) cpu_disable_scheduler: Migrating d0v34 from cpu 25
>>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 25
>>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 26
>>>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 26
>>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 26
>>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 27
>>>> (XEN) cpu_disable_scheduler: Migrating d0v24 from cpu 27
>>>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 27
>>>> (XEN) cpu_disable_scheduler: Migrating d0v42 from cpu 27
>>>> (XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 28
>>>> (XEN) cpu_disable_scheduler: Migrating d0v18 from cpu 28
>>>> (XEN) cpu_disable_scheduler: Migrating d0v25 from cpu 28
>>>> (XEN) cpu_disable_scheduler: Migrating d0v32 from cpu 28
>>>> (XEN) cpu_disable_scheduler: Migrating d0v39 from cpu 28
>>>> (XEN) cpu_disable_scheduler: Migrating d0v3 from cpu 29
>>> Interesting -- what seems to happen here is that as cpus are disabled,
>>> vcpus are "shovelled" in an accumulative fashion from one cpu to the
>>> next:
>>> * v18,34,42 start on cpu 24.
>>> * When 24 is brought down, they're all migrated to 25; then when 25 is
>>> brougth down, to 26, then to 27
>>> * v24 is running on cpu 27, so when 27 is brought down, v24 is added
>>> to the mix
>>> * v3 is running on cpu 28, so all of them plus v3 are shoveled onto
>>> cpu 29.
>>>
>>> While that behavior may not be ideal, it should certainly be bug-free.
>>>
>>> Another interesting thing to note is that the bug happened on pcpu 32,
>>> but there were no advertised migrations from that cpu.
>>>
>>> Andre, can you fold the attached patch into your testing?
> Sorry, but that bug (and its output) didn't trigger on two tries.
> Instead I now saw two occasions of the "migration failed, must retry
> later" message. Interestingly enough is does not seem to be fatal. The
> first time it triggers, the numa-split even completes, then after I roll
> it back and repeat it it shows again, but crashes later on that old
> BUG_ON().
>
> See the attached log for more details.
>
> Thanks for the try, anyway.
>
> Regards,
> Andre.
>
>
>>>
>>> Thanks for all your work on this.
> I am glad for all your help. I only start to really understand the
> scheduler, so your support is much appreciated.
>
>>>
>>> -George
>>>
>
>


-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-09 14:21                                                 ` Juergen Gross
@ 2011-02-10  6:42                                                   ` Juergen Gross
  2011-02-10  9:25                                                     ` Andre Przywara
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-10  6:42 UTC (permalink / raw)
  To: Andre Przywara; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 1428 bytes --]

On 02/09/11 15:21, Juergen Gross wrote:
> Andre, George,
>
>
> What seems to be interesting: I think the problem did always occur when
> a new cpupool was created and the first cpu was moved to it.
>
> I think my previous assumption regarding the master_ticker was not too bad.
> I think somehow the master_ticker of the new cpupool is becoming active
> before the scheduler is really initialized properly. This could happen, if
> enough time is spent between alloc_pdata for the cpu to be moved and the
> critical section in schedule_cpu_switch().
>
> The solution should be to activate the timers only if the scheduler is
> ready for them.
>
> George, do you think the master_ticker should be stopped in suspend_ticker
> as well? I still see potential problems for entering deep C-States. I think
> I'll prepare a patch which will keep the master_ticker active for the
> C-State case and migrate it for the schedule_cpu_switch() case.

Okay, here is a patch for this. It ran on my 4-core machine without any
problems.
Andre, could you give it a try?


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

[-- Attachment #2: ticker.patch --]
[-- Type: text/x-patch, Size: 5447 bytes --]

diff -r 1967c7c290eb xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Wed Feb 09 12:03:09 2011 +0000
+++ b/xen/common/sched_credit.c	Thu Feb 10 07:39:27 2011 +0100
@@ -50,6 +50,8 @@
     (CSCHED_CREDITS_PER_MSEC * CSCHED_MSECS_PER_TSLICE)
 #define CSCHED_CREDITS_PER_ACCT     \
     (CSCHED_CREDITS_PER_MSEC * CSCHED_MSECS_PER_TICK * CSCHED_TICKS_PER_ACCT)
+#define CSCHED_ACCT_TSLICE          \
+    (MILLISECS(CSCHED_MSECS_PER_TICK) * CSCHED_TICKS_PER_ACCT)
 
 
 /*
@@ -170,6 +172,7 @@ struct csched_private {
     uint32_t ncpus;
     struct timer  master_ticker;
     unsigned int master;
+    int master_active;
     cpumask_t idlers;
     cpumask_t cpus;
     uint32_t weight;
@@ -320,6 +323,7 @@ csched_free_pdata(const struct scheduler
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_pcpu *spc = pcpu;
     unsigned long flags;
+    uint64_t now = NOW();
 
     if ( spc == NULL )
         return;
@@ -334,10 +338,16 @@ csched_free_pdata(const struct scheduler
     {
         prv->master = first_cpu(prv->cpus);
         migrate_timer(&prv->master_ticker, prv->master);
+        if ( prv->master_active )
+            set_timer(&prv->master_ticker, now + CSCHED_ACCT_TSLICE
+                - now % CSCHED_ACCT_TSLICE);
     }
     kill_timer(&spc->ticker);
     if ( prv->ncpus == 0 )
+    {
         kill_timer(&prv->master_ticker);
+        prv->master_active = 0;
+    }
 
     spin_unlock_irqrestore(&prv->lock, flags);
 
@@ -367,12 +377,10 @@ csched_alloc_pdata(const struct schedule
     {
         prv->master = cpu;
         init_timer(&prv->master_ticker, csched_acct, prv, cpu);
-        set_timer(&prv->master_ticker, NOW() +
-                  MILLISECS(CSCHED_MSECS_PER_TICK) * CSCHED_TICKS_PER_ACCT);
+        prv->master_active = 0;
     }
 
     init_timer(&spc->ticker, csched_tick, (void *)(unsigned long)cpu, cpu);
-    set_timer(&spc->ticker, NOW() + MILLISECS(CSCHED_MSECS_PER_TICK));
 
     INIT_LIST_HEAD(&spc->runq);
     spc->runq_sort_last = prv->runq_sort;
@@ -1138,8 +1146,7 @@ csched_acct(void* dummy)
     prv->runq_sort++;
 
 out:
-    set_timer( &prv->master_ticker, NOW() +
-            MILLISECS(CSCHED_MSECS_PER_TICK) * CSCHED_TICKS_PER_ACCT );
+    set_timer( &prv->master_ticker, NOW() + CSCHED_ACCT_TSLICE );
 }
 
 static void
@@ -1529,24 +1536,39 @@ csched_deinit(const struct scheduler *op
         xfree(prv);
 }
 
-static void csched_tick_suspend(const struct scheduler *ops, unsigned int cpu)
+static void csched_tick_suspend(const struct scheduler *ops, unsigned int cpu, int temp)
 {
+    struct csched_private *prv;
     struct csched_pcpu *spc;
 
+    prv = CSCHED_PRIV(ops);
     spc = CSCHED_PCPU(cpu);
 
     stop_timer(&spc->ticker);
+    if ( (prv->master == cpu) && !temp )
+    {
+        prv->master = cycle_cpu(prv->master, prv->cpus);
+        migrate_timer(&prv->master_ticker, prv->master);
+    }
 }
 
 static void csched_tick_resume(const struct scheduler *ops, unsigned int cpu)
 {
+    struct csched_private *prv;
     struct csched_pcpu *spc;
     uint64_t now = NOW();
 
+    prv = CSCHED_PRIV(ops);
     spc = CSCHED_PCPU(cpu);
 
     set_timer(&spc->ticker, now + MILLISECS(CSCHED_MSECS_PER_TICK)
             - now % MILLISECS(CSCHED_MSECS_PER_TICK) );
+    if ( (prv->master == cpu) && !prv->master_active )
+    {
+        set_timer(&prv->master_ticker, now + CSCHED_ACCT_TSLICE
+            - now % CSCHED_ACCT_TSLICE);
+        prv->master_active = 1;
+    }
 }
 
 static struct csched_private _csched_priv;
diff -r 1967c7c290eb xen/common/schedule.c
--- a/xen/common/schedule.c	Wed Feb 09 12:03:09 2011 +0000
+++ b/xen/common/schedule.c	Thu Feb 10 07:39:27 2011 +0100
@@ -1208,6 +1208,8 @@ static int cpu_schedule_up(unsigned int 
     if ( (ops.alloc_pdata != NULL) &&
          ((sd->sched_priv = ops.alloc_pdata(&ops, cpu)) == NULL) )
         return -ENOMEM;
+    if ( ops.tick_resume != NULL )
+        ops.tick_resume(&ops, cpu);
 
     return 0;
 }
@@ -1286,6 +1288,8 @@ void __init scheduler_init(void)
     if ( ops.alloc_pdata &&
          !(this_cpu(schedule_data).sched_priv = ops.alloc_pdata(&ops, 0)) )
         BUG();
+    if ( ops.tick_resume != NULL )
+        ops.tick_resume(&ops, 0);
 }
 
 int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
@@ -1312,7 +1316,7 @@ int schedule_cpu_switch(unsigned int cpu
 
     pcpu_schedule_lock_irqsave(cpu, flags);
 
-    SCHED_OP(old_ops, tick_suspend, cpu);
+    SCHED_OP(old_ops, tick_suspend, cpu, 0);
     vpriv_old = idle->sched_priv;
     idle->sched_priv = vpriv;
     per_cpu(scheduler, cpu) = new_ops;
@@ -1392,7 +1396,7 @@ void sched_tick_suspend(void)
     unsigned int cpu = smp_processor_id();
 
     sched = per_cpu(scheduler, cpu);
-    SCHED_OP(sched, tick_suspend, cpu);
+    SCHED_OP(sched, tick_suspend, cpu, 1);
 }
 
 void sched_tick_resume(void)
diff -r 1967c7c290eb xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h	Wed Feb 09 12:03:09 2011 +0000
+++ b/xen/include/xen/sched-if.h	Thu Feb 10 07:39:27 2011 +0100
@@ -175,7 +175,7 @@ struct scheduler {
     void         (*dump_settings)  (const struct scheduler *);
     void         (*dump_cpu_state) (const struct scheduler *, int);
 
-    void         (*tick_suspend)    (const struct scheduler *, unsigned int);
+    void         (*tick_suspend)    (const struct scheduler *, unsigned int, int);
     void         (*tick_resume)     (const struct scheduler *, unsigned int);
 };
 

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-10  6:42                                                   ` Juergen Gross
@ 2011-02-10  9:25                                                     ` Andre Przywara
  2011-02-10 14:18                                                       ` Andre Przywara
  0 siblings, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-02-10  9:25 UTC (permalink / raw)
  To: Juergen Gross; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

On 02/10/2011 07:42 AM, Juergen Gross wrote:
> On 02/09/11 15:21, Juergen Gross wrote:
>> Andre, George,
>>
>>
>> What seems to be interesting: I think the problem did always occur when
>> a new cpupool was created and the first cpu was moved to it.
>>
>> I think my previous assumption regarding the master_ticker was not too bad.
>> I think somehow the master_ticker of the new cpupool is becoming active
>> before the scheduler is really initialized properly. This could happen, if
>> enough time is spent between alloc_pdata for the cpu to be moved and the
>> critical section in schedule_cpu_switch().
>>
>> The solution should be to activate the timers only if the scheduler is
>> ready for them.
>>
>> George, do you think the master_ticker should be stopped in suspend_ticker
>> as well? I still see potential problems for entering deep C-States. I think
>> I'll prepare a patch which will keep the master_ticker active for the
>> C-State case and migrate it for the schedule_cpu_switch() case.
>
> Okay, here is a patch for this. It ran on my 4-core machine without any
> problems.
> Andre, could you give it a try?
Did, but unfortunately it crashed as always. Tried twice and made sure I 
booted the right kernel. Sorry.
The idea of a race between the timer and the state change sounded very
appealing; actually, that was suspicious to me from the beginning.

I will add some code that dumps the state of all cpupools when the BUG_ON
triggers, so we can see which situation we are in when the bug hits.

Regards,
Andre.

-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-10  9:25                                                     ` Andre Przywara
@ 2011-02-10 14:18                                                       ` Andre Przywara
  2011-02-11  6:17                                                         ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-02-10 14:18 UTC (permalink / raw)
  To: Juergen Gross; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

Andre Przywara wrote:
> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>> On 02/09/11 15:21, Juergen Gross wrote:
>>> Andre, George,
>>>
>>>
>>> What seems to be interesting: I think the problem did always occur when
>>> a new cpupool was created and the first cpu was moved to it.
>>>
>>> I think my previous assumption regarding the master_ticker was not too bad.
>>> I think somehow the master_ticker of the new cpupool is becoming active
>>> before the scheduler is really initialized properly. This could happen, if
>>> enough time is spent between alloc_pdata for the cpu to be moved and the
>>> critical section in schedule_cpu_switch().
>>>
>>> The solution should be to activate the timers only if the scheduler is
>>> ready for them.
>>>
>>> George, do you think the master_ticker should be stopped in suspend_ticker
>>> as well? I still see potential problems for entering deep C-States. I think
>>> I'll prepare a patch which will keep the master_ticker active for the
>>> C-State case and migrate it for the schedule_cpu_switch() case.
>> Okay, here is a patch for this. It ran on my 4-core machine without any
>> problems.
>> Andre, could you give it a try?
> Did, but unfortunately it crashed as always. Tried twice and made sure I 
> booted the right kernel. Sorry.
> The idea with the race between the timer and the state changing sounded 
> very appealing, actually that was suspicious to me from the beginning.
> 
> I will add some code to dump the state of all cpupools to the BUG_ON to 
> see in which situation we are when the bug triggers.
OK, here is a first try of this: the patch iterates over all CPU pools
and prints some data when the BUG_ON condition
((sdom->weight * sdom->active_vcpu_count) > weight_left) triggers:
(XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
(XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
(XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
(XEN) Xen BUG at sched_credit.c:1010
....
The masks look proper (6 cores per node); the bug triggers when the
first CPU is about to be(?) inserted.
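The dump itself is nothing special; it roughly does the following right
before the assert (treat the snippet as a sketch: the field names are from
memory, and IIRC cpupool_list had to be made reachable from sched_credit.c,
since it is not exported in the stock tree):

    static void dump_all_cpupools(void)
    {
        struct cpupool *c;

        /* Walk all pools; print id, #domains, scheduler name and cpu mask. */
        for ( c = cpupool_list; c != NULL; c = c->next )
            printk("CPU pool #%d: %u domains (%s), mask: %lx\n",
                   c->cpupool_id, c->n_dom, c->sched->name,
                   cpus_addr(c->cpu_valid)[0]);
    }

    /* in csched_acct(), just before the existing check: */
    if ( (sdom->weight * sdom->active_vcpu_count) > weight_left )
        dump_all_cpupools();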

HTH,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-10 14:18                                                       ` Andre Przywara
@ 2011-02-11  6:17                                                         ` Juergen Gross
  2011-02-11  7:39                                                           ` Andre Przywara
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-11  6:17 UTC (permalink / raw)
  To: Andre Przywara; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

On 02/10/11 15:18, Andre Przywara wrote:
> Andre Przywara wrote:
>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>> Andre, George,
>>>>
>>>>
>>>> What seems to be interesting: I think the problem did always occur when
>>>> a new cpupool was created and the first cpu was moved to it.
>>>>
>>>> I think my previous assumption regarding the master_ticker was not
>>>> too bad.
>>>> I think somehow the master_ticker of the new cpupool is becoming active
>>>> before the scheduler is really initialized properly. This could
>>>> happen, if
>>>> enough time is spent between alloc_pdata for the cpu to be moved and
>>>> the
>>>> critical section in schedule_cpu_switch().
>>>>
>>>> The solution should be to activate the timers only if the scheduler is
>>>> ready for them.
>>>>
>>>> George, do you think the master_ticker should be stopped in
>>>> suspend_ticker
>>>> as well? I still see potential problems for entering deep C-States.
>>>> I think
>>>> I'll prepare a patch which will keep the master_ticker active for the
>>>> C-State case and migrate it for the schedule_cpu_switch() case.
>>> Okay, here is a patch for this. It ran on my 4-core machine without any
>>> problems.
>>> Andre, could you give it a try?
>> Did, but unfortunately it crashed as always. Tried twice and made sure
>> I booted the right kernel. Sorry.
>> The idea with the race between the timer and the state changing
>> sounded very appealing, actually that was suspicious to me from the
>> beginning.
>>
>> I will add some code to dump the state of all cpupools to the BUG_ON
>> to see in which situation we are when the bug triggers.
> OK, here is a first try of this, the patch iterates over all CPU pools
> and outputs some data if the BUG_ON
> ((sdom->weight * sdom->active_vcpu_count) > weight_left) condition
> triggers:
> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
> (XEN) Xen BUG at sched_credit.c:1010
> ....
> The masks look proper (6 cores per node), the bug triggers when the
> first CPU is about to be(?) inserted.

Sure? I'm missing the cpu with mask 2000.
I'll try to reproduce the problem on a larger machine here (24 cores, 4 numa
nodes).
Andre, can you give me your xen boot parameters? Which xen changeset are you
running, and do you have any additional patches in use?


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-11  6:17                                                         ` Juergen Gross
@ 2011-02-11  7:39                                                           ` Andre Przywara
  2011-02-14 17:57                                                             ` George Dunlap
  0 siblings, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-02-11  7:39 UTC (permalink / raw)
  To: Juergen Gross; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

Juergen Gross wrote:
> On 02/10/11 15:18, Andre Przywara wrote:
>> Andre Przywara wrote:
>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>> Andre, George,
>>>>>
>>>>>
>>>>> What seems to be interesting: I think the problem did always occur when
>>>>> a new cpupool was created and the first cpu was moved to it.
>>>>>
>>>>> I think my previous assumption regarding the master_ticker was not
>>>>> too bad.
>>>>> I think somehow the master_ticker of the new cpupool is becoming active
>>>>> before the scheduler is really initialized properly. This could
>>>>> happen, if
>>>>> enough time is spent between alloc_pdata for the cpu to be moved and
>>>>> the
>>>>> critical section in schedule_cpu_switch().
>>>>>
>>>>> The solution should be to activate the timers only if the scheduler is
>>>>> ready for them.
>>>>>
>>>>> George, do you think the master_ticker should be stopped in
>>>>> suspend_ticker
>>>>> as well? I still see potential problems for entering deep C-States.
>>>>> I think
>>>>> I'll prepare a patch which will keep the master_ticker active for the
>>>>> C-State case and migrate it for the schedule_cpu_switch() case.
>>>> Okay, here is a patch for this. It ran on my 4-core machine without any
>>>> problems.
>>>> Andre, could you give it a try?
>>> Did, but unfortunately it crashed as always. Tried twice and made sure
>>> I booted the right kernel. Sorry.
>>> The idea with the race between the timer and the state changing
>>> sounded very appealing, actually that was suspicious to me from the
>>> beginning.
>>>
>>> I will add some code to dump the state of all cpupools to the BUG_ON
>>> to see in which situation we are when the bug triggers.
>> OK, here is a first try of this, the patch iterates over all CPU pools
>> and outputs some data if the BUG_ON
>> ((sdom->weight * sdom->active_vcpu_count) > weight_left) condition
>> triggers:
>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>> (XEN) Xen BUG at sched_credit.c:1010
>> ....
>> The masks look proper (6 cores per node), the bug triggers when the
>> first CPU is about to be(?) inserted.
> 
> Sure? I'm missing the cpu with mask 2000.
> I'll try to reproduce the problem on a larger machine here (24 cores, 4 numa
> nodes).
> Andre, can you give me your xen boot parameters? Which xen changeset are you
> running, and do you have any additional patches in use?

The grub lines:
kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0 
console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0

All of my experiments use c/s 22858 as a base.
If you use an AMD Magny-Cours box for your experiments (socket C32 or 
G34), you should add the following patch (removing the line):
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
          __clear_bit(X86_FEATURE_SKINIT % 32, &c);
          __clear_bit(X86_FEATURE_WDT % 32, &c);
          __clear_bit(X86_FEATURE_LWP % 32, &c);
-        __clear_bit(X86_FEATURE_NODEID_MSR % 32, &c);
          __clear_bit(X86_FEATURE_TOPOEXT % 32, &c);
          break;
      case 5: /* MONITOR/MWAIT */

This is not necessary (in fact it reverts my patch c/s 22815), but it 
raises the probability of triggering the bug, probably because it increases 
the pressure on the Dom0 scheduler. If you cannot trigger it with Dom0, 
try to create a guest with many VCPUs and squeeze it into a small CPU pool.

Good luck ;-)
Andre.

-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-11  7:39                                                           ` Andre Przywara
@ 2011-02-14 17:57                                                             ` George Dunlap
  2011-02-15  7:22                                                               ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: George Dunlap @ 2011-02-14 17:57 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Juergen Gross, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 5299 bytes --]

The good news is, I've managed to reproduce this on my local test
hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
attached script.  It's time to go home now, but I should be able to
dig something up tomorrow.

To use the script:
* Rename cpupool0 to "p0", and create an empty second pool, "p1"
* You can modify parameters by adding "arg=val" as arguments.
* Arguments are:
 + dryrun={true,false} Do the work, but don't actually execute any xl
arguments.  Default false.
 + left: Number commands to execute.  Default 10.
 + maxcpus: highest numerical value for a cpu.  Default 7 (i.e., 0-7 is 8 cpus).
 + verbose={true,false} Print what you're doing.  Default is true.

The script sometimes attempts to remove the last cpu from cpupool0; in
this case, libxl will print an error.  If the script gets an error
under that condition, it will ignore it; under any other condition, it
will print diagnostic information.

What finally crashed it for me was this command:
# ./cpupool-test.sh verbose=false left=1000

 -George

On Fri, Feb 11, 2011 at 7:39 AM, Andre Przywara <andre.przywara@amd.com> wrote:
> Juergen Gross wrote:
>>
>> On 02/10/11 15:18, Andre Przywara wrote:
>>>
>>> Andre Przywara wrote:
>>>>
>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>
>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>
>>>>>> Andre, George,
>>>>>>
>>>>>>
>>>>>> What seems to be interesting: I think the problem did always occur
>>>>>> when
>>>>>> a new cpupool was created and the first cpu was moved to it.
>>>>>>
>>>>>> I think my previous assumption regarding the master_ticker was not
>>>>>> too bad.
>>>>>> I think somehow the master_ticker of the new cpupool is becoming
>>>>>> active
>>>>>> before the scheduler is really initialized properly. This could
>>>>>> happen, if
>>>>>> enough time is spent between alloc_pdata for the cpu to be moved and
>>>>>> the
>>>>>> critical section in schedule_cpu_switch().
>>>>>>
>>>>>> The solution should be to activate the timers only if the scheduler is
>>>>>> ready for them.
>>>>>>
>>>>>> George, do you think the master_ticker should be stopped in
>>>>>> suspend_ticker
>>>>>> as well? I still see potential problems for entering deep C-States.
>>>>>> I think
>>>>>> I'll prepare a patch which will keep the master_ticker active for the
>>>>>> C-State case and migrate it for the schedule_cpu_switch() case.
>>>>>
>>>>> Okay, here is a patch for this. It ran on my 4-core machine without any
>>>>> problems.
>>>>> Andre, could you give it a try?
>>>>
>>>> Did, but unfortunately it crashed as always. Tried twice and made sure
>>>> I booted the right kernel. Sorry.
>>>> The idea with the race between the timer and the state changing
>>>> sounded very appealing, actually that was suspicious to me from the
>>>> beginning.
>>>>
>>>> I will add some code to dump the state of all cpupools to the BUG_ON
>>>> to see in which situation we are when the bug triggers.
>>>
>>> OK, here is a first try of this, the patch iterates over all CPU pools
>>> and outputs some data if the BUG_ON
>>> ((sdom->weight * sdom->active_vcpu_count) > weight_left) condition
>>> triggers:
>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>> (XEN) Xen BUG at sched_credit.c:1010
>>> ....
>>> The masks look proper (6 cores per node), the bug triggers when the
>>> first CPU is about to be(?) inserted.
>>
>> Sure? I'm missing the cpu with mask 2000.
>> I'll try to reproduce the problem on a larger machine here (24 cores, 4
>> numa
>> nodes).
>> Andre, can you give me your xen boot parameters? Which xen changeset are
>> you
>> running, and do you have any additional patches in use?
>
> The grub lines:
> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>
> All of my experiments are use c/s 22858 as a base.
> If you use a AMD Magny-Cours box for your experiments (socket C32 or G34),
> you should add the following patch (removing the line)
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>         __clear_bit(X86_FEATURE_SKINIT % 32, &c);
>         __clear_bit(X86_FEATURE_WDT % 32, &c);
>         __clear_bit(X86_FEATURE_LWP % 32, &c);
> -        __clear_bit(X86_FEATURE_NODEID_MSR % 32, &c);
>         __clear_bit(X86_FEATURE_TOPOEXT % 32, &c);
>         break;
>     case 5: /* MONITOR/MWAIT */
>
> This is not necessary (in fact that reverts my patch c/s 22815), but raises
> the probability to trigger the bug, probably because it increases the
> pressure of the Dom0 scheduler. If you cannot trigger it with Dom0, try to
> create a guest with many VCPUs and squeeze it into a small CPU-pool.
>
> Good luck ;-)
> Andre.
>
> --
> Andre Przywara
> AMD-OSRC (Dresden)
> Tel: x29712
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>

[-- Attachment #2: cpupool-test.sh --]
[-- Type: application/x-sh, Size: 3043 bytes --]

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-14 17:57                                                             ` George Dunlap
@ 2011-02-15  7:22                                                               ` Juergen Gross
  2011-02-16  9:47                                                                 ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-15  7:22 UTC (permalink / raw)
  To: George Dunlap; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

On 02/14/11 18:57, George Dunlap wrote:
> The good news is, I've managed to reproduce this on my local test
> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
> attached script.  It's time to go home now, but I should be able to
> dig something up tomorrow.
>
> To use the script:
> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
> * You can modify elements by adding "arg=val" as arguments.
> * Arguments are:
>   + dryrun={true,false} Do the work, but don't actually execute any xl
> arguments.  Default false.
>   + left: Number commands to execute.  Default 10.
>   + maxcpus: highest numerical value for a cpu.  Default 7 (i.e., 0-7 is 8 cpus).
>   + verbose={true,false} Print what you're doing.  Default is true.
>
> The script sometimes attempts to remove the last cpu from cpupool0; in
> this case, libxl will print an error.  If the script gets an error
> under that condition, it will ignore it; under any other condition, it
> will print diagnostic information.
>
> What finally crashed it for me was this command:
> # ./cpupool-test.sh verbose=false left=1000

Nice!
With your script I finally managed to get the error, too. On my box (2 sockets
with 6 cores each) I had to use

./cpupool-test.sh verbose=false left=10000 maxcpus=11

to trigger it.
Looking for more data now...


Juergen

>
>   -George
>
> On Fri, Feb 11, 2011 at 7:39 AM, Andre Przywara<andre.przywara@amd.com>  wrote:
>> Juergen Gross wrote:
>>>
>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>
>>>> Andre Przywara wrote:
>>>>>
>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>
>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>
>>>>>>> Andre, George,
>>>>>>>
>>>>>>>
>>>>>>> What seems to be interesting: I think the problem did always occur
>>>>>>> when
>>>>>>> a new cpupool was created and the first cpu was moved to it.
>>>>>>>
>>>>>>> I think my previous assumption regarding the master_ticker was not
>>>>>>> too bad.
>>>>>>> I think somehow the master_ticker of the new cpupool is becoming
>>>>>>> active
>>>>>>> before the scheduler is really initialized properly. This could
>>>>>>> happen, if
>>>>>>> enough time is spent between alloc_pdata for the cpu to be moved and
>>>>>>> the
>>>>>>> critical section in schedule_cpu_switch().
>>>>>>>
>>>>>>> The solution should be to activate the timers only if the scheduler is
>>>>>>> ready for them.
>>>>>>>
>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>> suspend_ticker
>>>>>>> as well? I still see potential problems for entering deep C-States.
>>>>>>> I think
>>>>>>> I'll prepare a patch which will keep the master_ticker active for the
>>>>>>> C-State case and migrate it for the schedule_cpu_switch() case.
>>>>>>
>>>>>> Okay, here is a patch for this. It ran on my 4-core machine without any
>>>>>> problems.
>>>>>> Andre, could you give it a try?
>>>>>
>>>>> Did, but unfortunately it crashed as always. Tried twice and made sure
>>>>> I booted the right kernel. Sorry.
>>>>> The idea with the race between the timer and the state changing
>>>>> sounded very appealing, actually that was suspicious to me from the
>>>>> beginning.
>>>>>
>>>>> I will add some code to dump the state of all cpupools to the BUG_ON
>>>>> to see in which situation we are when the bug triggers.
>>>>
>>>> OK, here is a first try of this, the patch iterates over all CPU pools
>>>> and outputs some data if the BUG_ON
>>>> ((sdom->weight * sdom->active_vcpu_count)>  weight_left) condition
>>>> triggers:
>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>> ....
>>>> The masks look proper (6 cores per node), the bug triggers when the
>>>> first CPU is about to be(?) inserted.
>>>
>>> Sure? I'm missing the cpu with mask 2000.
>>> I'll try to reproduce the problem on a larger machine here (24 cores, 4
>>> numa
>>> nodes).
>>> Andre, can you give me your xen boot parameters? Which xen changeset are
>>> you
>>> running, and do you have any additional patches in use?
>>
>> The grub lines:
>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>
>> All of my experiments are use c/s 22858 as a base.
>> If you use a AMD Magny-Cours box for your experiments (socket C32 or G34),
>> you should add the following patch (removing the line)
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>          __clear_bit(X86_FEATURE_SKINIT % 32,&c);
>>          __clear_bit(X86_FEATURE_WDT % 32,&c);
>>          __clear_bit(X86_FEATURE_LWP % 32,&c);
>> -        __clear_bit(X86_FEATURE_NODEID_MSR % 32,&c);
>>          __clear_bit(X86_FEATURE_TOPOEXT % 32,&c);
>>          break;
>>      case 5: /* MONITOR/MWAIT */
>>
>> This is not necessary (in fact that reverts my patch c/s 22815), but raises
>> the probability to trigger the bug, probably because it increases the
>> pressure of the Dom0 scheduler. If you cannot trigger it with Dom0, try to
>> create a guest with many VCPUs and squeeze it into a small CPU-pool.
>>
>> Good luck ;-)
>> Andre.
>>
>> --
>> Andre Przywara
>> AMD-OSRC (Dresden)
>> Tel: x29712
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel


-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-15  7:22                                                               ` Juergen Gross
@ 2011-02-16  9:47                                                                 ` Juergen Gross
  2011-02-16 13:54                                                                   ` George Dunlap
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-16  9:47 UTC (permalink / raw)
  To: George Dunlap; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

Okay, I have some more data.

I activated cpupool_dprintk() and added checks in sched_credit.c to
test for weight inconsistencies. To reduce the window for races I've
also applied my patch that always executes cpu assigning/unassigning
in a tasklet on the cpu to be moved.
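
The check is essentially of the following shape (a sketch only, not the
literal hunk; the helper name is made up, the fields are the ones
csched_acct() already works with):

    /* Hypothetical debug helper, called from csched_tick(): complain as
     * soon as one domain's active vcpus account for more weight than the
     * pool-wide total recorded in csched_private. */
    static void csched_check_weight(struct csched_private *prv, int cpu)
    {
        struct csched_dom *sdom;

        list_for_each_entry( sdom, &prv->active_sdom, active_sdom_elem )
        {
            if ( sdom->weight * sdom->active_vcpu_count > prv->weight )
            {
                printk("cpu %d, weight %u, prv %p, dom %d:\n",
                       cpu, prv->weight, prv, sdom->dom->domain_id);
                printk("sdom->weight: %d, sdom->active_vcpu_count: %d\n",
                       sdom->weight, sdom->active_vcpu_count);
                BUG();
            }
        }
    }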

Here is the result:

(XEN) cpupool_unassign_cpu(pool=0,cpu=6)
(XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
(XEN) cpupool_unassign_cpu(pool=0,cpu=6)
(XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
(XEN) cpupool_assign_cpu(pool=0,cpu=1)
(XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
(XEN) cpupool_assign_cpu(cpu=1) ret 0
(XEN) cpupool_assign_cpu(pool=1,cpu=4)
(XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
(XEN) cpupool_assign_cpu(cpu=4) ret 0
(XEN) cpu 4, weight 0,prv ffff831002ad5e40, dom 0:
(XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
(XEN) Xen BUG at sched_credit.c:570
(XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    4
(XEN) RIP:    e008:[<ffff82c4801197d7>] csched_tick+0x186/0x37f
(XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff830839d3ec30   rcx: 0000000000000000
(XEN) rdx: ffff830839dcff18   rsi: 000000000000000a   rdi: ffff82c4802542e8
(XEN) rbp: ffff830839dcfe38   rsp: ffff830839dcfde8   r8:  0000000000000004
(XEN) r9:  ffff82c480213520   r10: 00000000fffffffc   r11: 0000000000000001
(XEN) r12: 0000000000000004   r13: ffff830839d3ec40   r14: ffff831002ad5e40
(XEN) r15: ffff830839d66f90   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000001020a98000   cr2: 00007fc5e9b79d98
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff830839dcfde8:
(XEN)    ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
(XEN)    0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
(XEN)    ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
(XEN)    0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
(XEN)    ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
(XEN)    ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
(XEN)    ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
(XEN)    ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
(XEN)    0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
(XEN)    00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
(XEN)    00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
(XEN)    0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
(XEN)    0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
(XEN)    00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
(XEN)    0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000004
(XEN)    ffff830077eee000 00000043b9afd180 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c4801197d7>] csched_tick+0x186/0x37f
(XEN)    [<ffff82c480126204>] execute_timer+0x4e/0x6c
(XEN)    [<ffff82c480126539>] timer_softirq_action+0xf6/0x239
(XEN)    [<ffff82c480123647>] __do_softirq+0x88/0x99
(XEN)    [<ffff82c4801236c2>] do_softirq+0x6a/0x7a
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 4:
(XEN) Xen BUG at sched_credit.c:570
(XEN) ****************************************

As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The BUG_ON
triggered in csched_acct() is a logical consequence of this.

How this can happen I don't know yet.
Does anyone have an idea? I'll keep searching...
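
(For context: csched_acct() starts with weight_left = prv->weight and then,
for each active domain of its pool, checks

    BUG_ON( (sdom->weight * sdom->active_vcpu_count) > weight_left );

before subtracting that product. So a vcpu whose weight is accounted in a
pool whose own weight is still 0, as in the dump above, trips it right away.
This is only quoted from memory, not a verbatim excerpt.)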


Juergen

On 02/15/11 08:22, Juergen Gross wrote:
> On 02/14/11 18:57, George Dunlap wrote:
>> The good news is, I've managed to reproduce this on my local test
>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>> attached script. It's time to go home now, but I should be able to
>> dig something up tomorrow.
>>
>> To use the script:
>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>> * You can modify elements by adding "arg=val" as arguments.
>> * Arguments are:
>> + dryrun={true,false} Do the work, but don't actually execute any xl
>> arguments. Default false.
>> + left: Number commands to execute. Default 10.
>> + maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>> 8 cpus).
>> + verbose={true,false} Print what you're doing. Default is true.
>>
>> The script sometimes attempts to remove the last cpu from cpupool0; in
>> this case, libxl will print an error. If the script gets an error
>> under that condition, it will ignore it; under any other condition, it
>> will print diagnostic information.
>>
>> What finally crashed it for me was this command:
>> # ./cpupool-test.sh verbose=false left=1000
>
> Nice!
> With your script I finally managed to get the error, too. On my box (2
> sockets
> a 6 cores) I had to use
>
> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>
> to trigger it.
> Looking for more data now...
>
>
> Juergen
>
>>
>> -George
>>
>> On Fri, Feb 11, 2011 at 7:39 AM, Andre
>> Przywara<andre.przywara@amd.com> wrote:
>>> Juergen Gross wrote:
>>>>
>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>
>>>>> Andre Przywara wrote:
>>>>>>
>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>
>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>
>>>>>>>> Andre, George,
>>>>>>>>
>>>>>>>>
>>>>>>>> What seems to be interesting: I think the problem did always occur
>>>>>>>> when
>>>>>>>> a new cpupool was created and the first cpu was moved to it.
>>>>>>>>
>>>>>>>> I think my previous assumption regarding the master_ticker was not
>>>>>>>> too bad.
>>>>>>>> I think somehow the master_ticker of the new cpupool is becoming
>>>>>>>> active
>>>>>>>> before the scheduler is really initialized properly. This could
>>>>>>>> happen, if
>>>>>>>> enough time is spent between alloc_pdata for the cpu to be moved
>>>>>>>> and
>>>>>>>> the
>>>>>>>> critical section in schedule_cpu_switch().
>>>>>>>>
>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>> scheduler is
>>>>>>>> ready for them.
>>>>>>>>
>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>> suspend_ticker
>>>>>>>> as well? I still see potential problems for entering deep C-States.
>>>>>>>> I think
>>>>>>>> I'll prepare a patch which will keep the master_ticker active
>>>>>>>> for the
>>>>>>>> C-State case and migrate it for the schedule_cpu_switch() case.
>>>>>>>
>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine
>>>>>>> without any
>>>>>>> problems.
>>>>>>> Andre, could you give it a try?
>>>>>>
>>>>>> Did, but unfortunately it crashed as always. Tried twice and made
>>>>>> sure
>>>>>> I booted the right kernel. Sorry.
>>>>>> The idea with the race between the timer and the state changing
>>>>>> sounded very appealing, actually that was suspicious to me from the
>>>>>> beginning.
>>>>>>
>>>>>> I will add some code to dump the state of all cpupools to the BUG_ON
>>>>>> to see in which situation we are when the bug triggers.
>>>>>
>>>>> OK, here is a first try of this, the patch iterates over all CPU pools
>>>>> and outputs some data if the BUG_ON
>>>>> ((sdom->weight * sdom->active_vcpu_count)> weight_left) condition
>>>>> triggers:
>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask:
>>>>> fffffffc003f
>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>> ....
>>>>> The masks look proper (6 cores per node), the bug triggers when the
>>>>> first CPU is about to be(?) inserted.
>>>>
>>>> Sure? I'm missing the cpu with mask 2000.
>>>> I'll try to reproduce the problem on a larger machine here (24 cores, 4
>>>> numa
>>>> nodes).
>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>> are
>>>> you
>>>> running, and do you have any additional patches in use?
>>>
>>> The grub lines:
>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>
>>> All of my experiments are use c/s 22858 as a base.
>>> If you use a AMD Magny-Cours box for your experiments (socket C32 or
>>> G34),
>>> you should add the following patch (removing the line)
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>> __clear_bit(X86_FEATURE_SKINIT % 32,&c);
>>> __clear_bit(X86_FEATURE_WDT % 32,&c);
>>> __clear_bit(X86_FEATURE_LWP % 32,&c);
>>> - __clear_bit(X86_FEATURE_NODEID_MSR % 32,&c);
>>> __clear_bit(X86_FEATURE_TOPOEXT % 32,&c);
>>> break;
>>> case 5: /* MONITOR/MWAIT */
>>>
>>> This is not necessary (in fact that reverts my patch c/s 22815), but
>>> raises
>>> the probability to trigger the bug, probably because it increases the
>>> pressure of the Dom0 scheduler. If you cannot trigger it with Dom0,
>>> try to
>>> create a guest with many VCPUs and squeeze it into a small CPU-pool.
>>>
>>> Good luck ;-)
>>> Andre.
>>>
>>> --
>>> Andre Przywara
>>> AMD-OSRC (Dresden)
>>> Tel: x29712
>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>
>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>
>


-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-16  9:47                                                                 ` Juergen Gross
@ 2011-02-16 13:54                                                                   ` George Dunlap
       [not found]                                                                     ` <4D6237C6.1050206@amd.com>
                                                                                       ` (3 more replies)
  0 siblings, 4 replies; 53+ messages in thread
From: George Dunlap @ 2011-02-16 13:54 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 11892 bytes --]

Andre (and Juergen), can you try again with the attached patch?

What the patch basically does is try to make "cpu_disable_scheduler()"
do what it seems to say it does. :-)  Namely, the various
scheduler-related interrupts (both the per-cpu ticks and the master tick)
are part of the scheduler, so disable them before doing anything, and
don't enable them until the cpu is really ready to go again.

To be precise:
* cpu_disable_scheduler() disables ticks
* schedule_cpu_switch() only enables ticks if adding a cpu to a pool,
and does it after inserting the idle vcpu
* Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
stop tickers
 + Call tick_{resume,suspend} in cpu_{up,down}, respectively
* Modify credit1's tick_{suspend,resume} to handle the master ticker as well.

With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
on one pcpu), I can perform thousands of operations successfully.
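
The core of it, in schedule_cpu_switch(), then looks roughly like this
(condensed from the attached diff; allocation, error paths and the old-ops
cleanup are left out):

    pcpu_schedule_lock_irqsave(cpu, flags);

    /* Switch the idle vcpu's private data, the scheduler ops and the
     * pcpu private data while no tick can fire... */
    idle->sched_priv = vpriv;
    per_cpu(scheduler, cpu) = new_ops;
    per_cpu(schedule_data, cpu).sched_priv = ppriv;
    SCHED_OP(new_ops, insert_vcpu, idle);

    /* ...and only start ticking again as the very last step, and only
     * if the cpu is actually being put into a pool. */
    if ( c != NULL )
        SCHED_OP(new_ops, tick_resume, cpu);

    pcpu_schedule_unlock_irqrestore(cpu, flags);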

(NB this is not ready for application yet, I just wanted to check to
see if it fixes Andre's problem)

 -George

On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross
<juergen.gross@ts.fujitsu.com> wrote:
> Okay, I have some more data.
>
> I activated cpupool_dprintk() and included checks in sched_credit.c to
> test for weight inconsistencies. To reduce race possibilities I've added
> my patch to execute cpu assigning/unassigning always in a tasklet on the
> cpu to be moved.
>
> Here is the result:
>
> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
> (XEN) cpupool_assign_cpu(pool=0,cpu=1)
> (XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
> (XEN) cpupool_assign_cpu(cpu=1) ret 0
> (XEN) cpupool_assign_cpu(pool=1,cpu=4)
> (XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
> (XEN) cpupool_assign_cpu(cpu=4) ret 0
> (XEN) cpu 4, weight 0,prv ffff831002ad5e40, dom 0:
> (XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
> (XEN) Xen BUG at sched_credit.c:570
> (XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted:    C ]----
> (XEN) CPU:    4
> (XEN) RIP:    e008:[<ffff82c4801197d7>] csched_tick+0x186/0x37f
> (XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
> (XEN) rax: 0000000000000000   rbx: ffff830839d3ec30   rcx: 0000000000000000
> (XEN) rdx: ffff830839dcff18   rsi: 000000000000000a   rdi: ffff82c4802542e8
> (XEN) rbp: ffff830839dcfe38   rsp: ffff830839dcfde8   r8:  0000000000000004
> (XEN) r9:  ffff82c480213520   r10: 00000000fffffffc   r11: 0000000000000001
> (XEN) r12: 0000000000000004   r13: ffff830839d3ec40   r14: ffff831002ad5e40
> (XEN) r15: ffff830839d66f90   cr0: 000000008005003b   cr4: 00000000000026f0
> (XEN) cr3: 0000001020a98000   cr2: 00007fc5e9b79d98
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff830839dcfde8:
> (XEN)    ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
> (XEN)    0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
> (XEN)    ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
> (XEN)    0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
> (XEN)    ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
> (XEN)    ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
> (XEN)    ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
> (XEN)    ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
> (XEN)    0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
> (XEN)    00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
> (XEN)    00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
> (XEN)    0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
> (XEN)    0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
> (XEN)    00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
> (XEN)    0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000004
> (XEN)    ffff830077eee000 00000043b9afd180 0000000000000000
> (XEN) Xen call trace:
> (XEN)    [<ffff82c4801197d7>] csched_tick+0x186/0x37f
> (XEN)    [<ffff82c480126204>] execute_timer+0x4e/0x6c
> (XEN)    [<ffff82c480126539>] timer_softirq_action+0xf6/0x239
> (XEN)    [<ffff82c480123647>] __do_softirq+0x88/0x99
> (XEN)    [<ffff82c4801236c2>] do_softirq+0x6a/0x7a
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 4:
> (XEN) Xen BUG at sched_credit.c:570
> (XEN) ****************************************
>
> As you can see, a Dom0 vcpus is becoming active on a pool 1 cpu. The BUG_ON
> triggered in csched_acct() is a logical result of this.
>
> How this can happen I don't know yet.
> Anyone any idea? I'll keep searching...
>
>
> Juergen
>
> On 02/15/11 08:22, Juergen Gross wrote:
>>
>> On 02/14/11 18:57, George Dunlap wrote:
>>>
>>> The good news is, I've managed to reproduce this on my local test
>>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>>> attached script. It's time to go home now, but I should be able to
>>> dig something up tomorrow.
>>>
>>> To use the script:
>>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>>> * You can modify elements by adding "arg=val" as arguments.
>>> * Arguments are:
>>> + dryrun={true,false} Do the work, but don't actually execute any xl
>>> arguments. Default false.
>>> + left: Number commands to execute. Default 10.
>>> + maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>>> 8 cpus).
>>> + verbose={true,false} Print what you're doing. Default is true.
>>>
>>> The script sometimes attempts to remove the last cpu from cpupool0; in
>>> this case, libxl will print an error. If the script gets an error
>>> under that condition, it will ignore it; under any other condition, it
>>> will print diagnostic information.
>>>
>>> What finally crashed it for me was this command:
>>> # ./cpupool-test.sh verbose=false left=1000
>>
>> Nice!
>> With your script I finally managed to get the error, too. On my box (2
>> sockets
>> a 6 cores) I had to use
>>
>> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>>
>> to trigger it.
>> Looking for more data now...
>>
>>
>> Juergen
>>
>>>
>>> -George
>>>
>>> On Fri, Feb 11, 2011 at 7:39 AM, Andre
>>> Przywara<andre.przywara@amd.com> wrote:
>>>>
>>>> Juergen Gross wrote:
>>>>>
>>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>>
>>>>>> Andre Przywara wrote:
>>>>>>>
>>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>>
>>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>>
>>>>>>>>> Andre, George,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What seems to be interesting: I think the problem did always occur
>>>>>>>>> when
>>>>>>>>> a new cpupool was created and the first cpu was moved to it.
>>>>>>>>>
>>>>>>>>> I think my previous assumption regarding the master_ticker was not
>>>>>>>>> too bad.
>>>>>>>>> I think somehow the master_ticker of the new cpupool is becoming
>>>>>>>>> active
>>>>>>>>> before the scheduler is really initialized properly. This could
>>>>>>>>> happen, if
>>>>>>>>> enough time is spent between alloc_pdata for the cpu to be moved
>>>>>>>>> and
>>>>>>>>> the
>>>>>>>>> critical section in schedule_cpu_switch().
>>>>>>>>>
>>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>>> scheduler is
>>>>>>>>> ready for them.
>>>>>>>>>
>>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>>> suspend_ticker
>>>>>>>>> as well? I still see potential problems for entering deep C-States.
>>>>>>>>> I think
>>>>>>>>> I'll prepare a patch which will keep the master_ticker active
>>>>>>>>> for the
>>>>>>>>> C-State case and migrate it for the schedule_cpu_switch() case.
>>>>>>>>
>>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine
>>>>>>>> without any
>>>>>>>> problems.
>>>>>>>> Andre, could you give it a try?
>>>>>>>
>>>>>>> Did, but unfortunately it crashed as always. Tried twice and made
>>>>>>> sure
>>>>>>> I booted the right kernel. Sorry.
>>>>>>> The idea with the race between the timer and the state changing
>>>>>>> sounded very appealing, actually that was suspicious to me from the
>>>>>>> beginning.
>>>>>>>
>>>>>>> I will add some code to dump the state of all cpupools to the BUG_ON
>>>>>>> to see in which situation we are when the bug triggers.
>>>>>>
>>>>>> OK, here is a first try of this, the patch iterates over all CPU pools
>>>>>> and outputs some data if the BUG_ON
>>>>>> ((sdom->weight * sdom->active_vcpu_count)> weight_left) condition
>>>>>> triggers:
>>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask:
>>>>>> fffffffc003f
>>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>>> ....
>>>>>> The masks look proper (6 cores per node), the bug triggers when the
>>>>>> first CPU is about to be(?) inserted.
>>>>>
>>>>> Sure? I'm missing the cpu with mask 2000.
>>>>> I'll try to reproduce the problem on a larger machine here (24 cores, 4
>>>>> numa
>>>>> nodes).
>>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>>> are
>>>>> you
>>>>> running, and do you have any additional patches in use?
>>>>
>>>> The grub lines:
>>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
>>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>>
>>>> All of my experiments are use c/s 22858 as a base.
>>>> If you use a AMD Magny-Cours box for your experiments (socket C32 or
>>>> G34),
>>>> you should add the following patch (removing the line)
>>>> --- a/xen/arch/x86/traps.c
>>>> +++ b/xen/arch/x86/traps.c
>>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>>> __clear_bit(X86_FEATURE_SKINIT % 32,&c);
>>>> __clear_bit(X86_FEATURE_WDT % 32,&c);
>>>> __clear_bit(X86_FEATURE_LWP % 32,&c);
>>>> - __clear_bit(X86_FEATURE_NODEID_MSR % 32,&c);
>>>> __clear_bit(X86_FEATURE_TOPOEXT % 32,&c);
>>>> break;
>>>> case 5: /* MONITOR/MWAIT */
>>>>
>>>> This is not necessary (in fact that reverts my patch c/s 22815), but
>>>> raises
>>>> the probability to trigger the bug, probably because it increases the
>>>> pressure of the Dom0 scheduler. If you cannot trigger it with Dom0,
>>>> try to
>>>> create a guest with many VCPUs and squeeze it into a small CPU-pool.
>>>>
>>>> Good luck ;-)
>>>> Andre.
>>>>
>>>> --
>>>> Andre Przywara
>>>> AMD-OSRC (Dresden)
>>>> Tel: x29712
>>>>
>>>>
>>>> _______________________________________________
>>>> Xen-devel mailing list
>>>> Xen-devel@lists.xensource.com
>>>> http://lists.xensource.com/xen-devel
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Xen-devel mailing list
>>>> Xen-devel@lists.xensource.com
>>>> http://lists.xensource.com/xen-devel
>>
>>
>
>
> --
> Juergen Gross                 Principal Developer Operating Systems
> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
> Fujitsu Technology Solutions              e-mail:
> juergen.gross@ts.fujitsu.com
> Domagkstr. 28                           Internet: ts.fujitsu.com
> D-80807 Muenchen                 Company details:
> ts.fujitsu.com/imprint.html
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>

[-- Attachment #2: cpupools-tick-rearrange.diff --]
[-- Type: text/x-diff, Size: 8999 bytes --]

diff -r 4ea36cce2519 xen/common/cpupool.c
--- a/xen/common/cpupool.c	Mon Feb 14 09:10:22 2011 +0000
+++ b/xen/common/cpupool.c	Wed Feb 16 12:21:37 2011 +0000
@@ -291,49 +291,76 @@
 
     spin_lock(&cpupool_lock);
     ret = -EBUSY;
-    if ( (cpupool_moving_cpu != -1) && (cpu != cpupool_moving_cpu) )
-        goto out;
-    if ( cpu_isset(cpu, cpupool_locked_cpus) )
-        goto out;
+    if ( cpu != cpupool_moving_cpu )
+    {
+        /* Don't start a second operation until the first has completed */
+        if ( (cpupool_moving_cpu != -1) )
+            goto out;
 
-    ret = 0;
-    if ( !cpu_isset(cpu, c->cpu_valid) && (cpu != cpupool_moving_cpu) )
-        goto out;
+        /* Don't start an op on a locked cpu (?)*/
+        if ( cpu_isset(cpu, cpupool_locked_cpus) )
+            goto out;
 
-    if ( (c->n_dom > 0) && (cpus_weight(c->cpu_valid) == 1) &&
-         (cpu != cpupool_moving_cpu) )
+        /* Can't take the last cpu out of cpupool0 */
+        ret = -EINVAL;
+        if ( cpus_weight(c->cpu_valid) == 1
+             && c == cpupool0)
+            goto out;
+
+        ret = 0;
+        if ( !cpu_isset(cpu, c->cpu_valid) )
+            goto out;
+
+        if ( (c->n_dom > 0) && (cpus_weight(c->cpu_valid) == 1) )
+        {
+            for_each_domain(d)
+            {
+                if ( d->cpupool != c )
+                    continue;
+                /* Don't allow the last cpu from a pool to be moved if there's a live
+                 * domain still running on it */
+                if ( !d->is_dying )
+                {
+                    printk("%s: cpu %d pool %p: d%d still present\n",
+                           __func__, cpu, c, d->domain_id);
+                    ret = -EBUSY;
+                    break;
+                }
+                c->n_dom--;
+                ret = sched_move_domain(d, cpupool0);
+                if ( ret )
+                {
+                    c->n_dom++;
+                    break;
+                }
+                cpupool0->n_dom++;
+            }
+            if ( ret )
+                goto out;
+        }
+        cpupool_moving_cpu = cpu;
+        cpupool_cpu_moving = c;
+        cpu_clear(cpu, c->cpu_valid);
+    }
+    else
     {
-        for_each_domain(d)
-        {
-            if ( d->cpupool != c )
-                continue;
-            if ( !d->is_dying )
-            {
-                ret = -EBUSY;
-                break;
-            }
-            c->n_dom--;
-            ret = sched_move_domain(d, cpupool0);
-            if ( ret )
-            {
-                c->n_dom++;
-                break;
-            }
-            cpupool0->n_dom++;
-        }
-        if ( ret )
-            goto out;
+        /* Make sure all the things we did last time still hold */
+        BUG_ON(cpupool_cpu_moving != c);
+        BUG_ON(cpu_isset(cpu, c->cpu_valid));
+        /* compare cpu_valid to 0, since we cleared cpu in cpu_valid above */
+        BUG_ON((c->n_dom > 0) && (cpus_weight(c->cpu_valid) == 0));
     }
-    cpupool_moving_cpu = cpu;
+    /* Increase the refcount both times through, because the return path released
+     * the reference. */
     atomic_inc(&c->refcnt);
-    cpupool_cpu_moving = c;
-    cpu_clear(cpu, c->cpu_valid);
+
     spin_unlock(&cpupool_lock);
 
     work_cpu = smp_processor_id();
     if ( work_cpu == cpu )
     {
         work_cpu = first_cpu(cpupool0->cpu_valid);
+        /* If cpu is in cpupool0, then cpupool0 must contain at least one other cpu */
         if ( work_cpu == cpu )
             work_cpu = next_cpu(cpu, cpupool0->cpu_valid);
     }
diff -r 4ea36cce2519 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Mon Feb 14 09:10:22 2011 +0000
+++ b/xen/common/sched_credit.c	Wed Feb 16 12:21:37 2011 +0000
@@ -330,11 +330,14 @@
     prv->ncpus--;
     cpu_clear(cpu, prv->idlers);
     cpu_clear(cpu, prv->cpus);
+#if 0
+    /* This should have been disabled already */
     if ( (prv->master == cpu) && (prv->ncpus > 0) )
     {
         prv->master = first_cpu(prv->cpus);
         migrate_timer(&prv->master_ticker, prv->master);
     }
+#endif
     kill_timer(&spc->ticker);
     if ( prv->ncpus == 0 )
         kill_timer(&prv->master_ticker);
@@ -367,12 +370,16 @@
     {
         prv->master = cpu;
         init_timer(&prv->master_ticker, csched_acct, prv, cpu);
+#if 0
         set_timer(&prv->master_ticker, NOW() +
                   MILLISECS(CSCHED_MSECS_PER_TICK) * CSCHED_TICKS_PER_ACCT);
+#endif
     }
 
     init_timer(&spc->ticker, csched_tick, (void *)(unsigned long)cpu, cpu);
+#if 0
     set_timer(&spc->ticker, NOW() + MILLISECS(CSCHED_MSECS_PER_TICK));
+#endif
 
     INIT_LIST_HEAD(&spc->runq);
     spc->runq_sort_last = prv->runq_sort;
@@ -1531,15 +1538,28 @@
 
 static void csched_tick_suspend(const struct scheduler *ops, unsigned int cpu)
 {
+    struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_pcpu *spc;
 
     spc = CSCHED_PCPU(cpu);
 
+    if (prv->master == cpu)
+    {
+        if ( (prv->ncpus > 0) )
+        {
+            prv->master = first_cpu(prv->cpus);
+            migrate_timer(&prv->master_ticker, prv->master);
+        }
+        else
+            stop_timer(&prv->master_ticker);
+    }
+
     stop_timer(&spc->ticker);
 }
 
 static void csched_tick_resume(const struct scheduler *ops, unsigned int cpu)
 {
+    struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_pcpu *spc;
     uint64_t now = NOW();
 
@@ -1547,6 +1567,12 @@
 
     set_timer(&spc->ticker, now + MILLISECS(CSCHED_MSECS_PER_TICK)
             - now % MILLISECS(CSCHED_MSECS_PER_TICK) );
+
+    if (prv->master == cpu)
+    {
+        set_timer(&prv->master_ticker, NOW() +
+                  MILLISECS(CSCHED_MSECS_PER_TICK) * CSCHED_TICKS_PER_ACCT);
+    }
 }
 
 static struct csched_private _csched_priv;
diff -r 4ea36cce2519 xen/common/schedule.c
--- a/xen/common/schedule.c	Mon Feb 14 09:10:22 2011 +0000
+++ b/xen/common/schedule.c	Wed Feb 16 12:21:37 2011 +0000
@@ -469,10 +469,21 @@
     cpumask_t online_affinity;
     int    ret = 0;
     bool_t affinity_broken;
+    struct scheduler *cpu_ops;
 
+    cpu_ops = per_cpu(scheduler, cpu);
     c = per_cpu(cpupool, cpu);
-    if ( c == NULL )
+    if ( c == NULL || cpu_ops == NULL )
+    {
+        printk("%s: no scheduler for cpu %d\n",
+               __func__, cpu);
         return ret;
+    }
+
+    pcpu_schedule_lock_irq(cpu);
+    SCHED_OP(cpu_ops, tick_suspend, cpu);
+    pcpu_schedule_unlock_irq(cpu);
+
 
     for_each_domain ( d )
     {
@@ -1211,6 +1222,8 @@
          ((sd->sched_priv = ops.alloc_pdata(&ops, cpu)) == NULL) )
         return -ENOMEM;
 
+    ops.tick_resume(&ops, cpu);
+
     return 0;
 }
 
@@ -1219,7 +1232,11 @@
     struct schedule_data *sd = &per_cpu(schedule_data, cpu);
 
     if ( sd->sched_priv != NULL )
+    {
+        /* FIXME: What if scheduler has different ops? */
+        SCHED_OP(&ops, tick_suspend, cpu);
         SCHED_OP(&ops, free_pdata, sd->sched_priv, cpu);
+    }
 
     kill_timer(&sd->s_timer);
 }
@@ -1288,6 +1305,8 @@
     if ( ops.alloc_pdata &&
          !(this_cpu(schedule_data).sched_priv = ops.alloc_pdata(&ops, 0)) )
         BUG();
+
+    ops.tick_resume(&ops, 0);
 }
 
 int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
@@ -1298,31 +1317,51 @@
     struct scheduler *old_ops = per_cpu(scheduler, cpu);
     struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
 
+    BUG_ON(!is_idle_vcpu(per_cpu(schedule_data, cpu).curr));
+
     if ( old_ops == new_ops )
+    {
+        //printk("%s: cpu %d pool %p: no change\n",
+        //     __func__,  cpu, c);
         return 0;
+    }
 
     idle = idle_vcpu[cpu];
     ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
     if ( ppriv == NULL )
+    {
+        printk("%s: cpu %d pool %p: alloc_pdata failed\n",
+               __func__, cpu, c);
         return -ENOMEM;
+    }
     vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
     if ( vpriv == NULL )
     {
+        printk("%s: cpu %d pool %p: alloc_vdata(idle) failed\n",
+               __func__, cpu, c);
         SCHED_OP(new_ops, free_pdata, ppriv, cpu);
         return -ENOMEM;
     }
 
     pcpu_schedule_lock_irqsave(cpu, flags);
-
-    SCHED_OP(old_ops, tick_suspend, cpu);
+    //SCHED_OP(old_ops, tick_suspend, cpu);
+    /* Switch idle private */
     vpriv_old = idle->sched_priv;
     idle->sched_priv = vpriv;
+
+    /* Switch ops */
     per_cpu(scheduler, cpu) = new_ops;
+
+    /* Switch pcpu private */
     ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
     per_cpu(schedule_data, cpu).sched_priv = ppriv;
-    SCHED_OP(new_ops, tick_resume, cpu);
+
     SCHED_OP(new_ops, insert_vcpu, idle);
 
+    /* Enabling ticks should be the last thing done,
+     * and only if moving it to a cpu pool */
+    if ( c != NULL )
+        SCHED_OP(new_ops, tick_resume, cpu);
     pcpu_schedule_unlock_irqrestore(cpu, flags);
 
     SCHED_OP(old_ops, free_vdata, vpriv_old);

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-16 13:54                                                                   ` George Dunlap
       [not found]                                                                     ` <4D6237C6.1050206@amd.com>
@ 2011-02-16 14:11                                                                     ` Juergen Gross
  2011-02-16 14:28                                                                       ` Juergen Gross
  2011-02-17  0:05                                                                       ` André Przywara
  2011-02-17  7:05                                                                     ` Juergen Gross
  2011-02-21 10:00                                                                     ` Andre Przywara
  3 siblings, 2 replies; 53+ messages in thread
From: Juergen Gross @ 2011-02-16 14:11 UTC (permalink / raw)
  To: George Dunlap; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 12728 bytes --]

On 02/16/11 14:54, George Dunlap wrote:
> Andre (and Juergen), can you try again with the attached patch?
>
> What the patch basically does is try to make "cpu_disable_scheduler()"
> do what it seems to say it does. :-)  Namely, the various
> scheduler-related interrutps (both per-cpu ticks and the master tick)
> is a part of the scheduler, so disable them before doing anything, and
> don't enable them until the cpu is really ready to go again.
>
> To be precise:
> * cpu_disable_scheduler() disables ticks
> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
> and does it after inserting the idle vcpu
> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
> stop tickers
>   + Call tick_{resume,suspend} in cpu_{up,down}, respectively

I tried this before :-)
It didn't work for Andre, but maybe there were some bits missing.

> * Modify credit1's tick_{suspend,resume} to handle the master ticker as well.
>
> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
> on one pcpu), I can perform thousands of operations successfully.

Nice. I'll try it later. At the moment I'm testing another patch (attached
for review, if you like). I think I've identified two possible races.
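
In short (only a sketch, condensed from the attached diff):
csched_load_balance() must not steal work from a peer cpu that is no longer
in the pool's online map, and vcpu_migrate() must take the new cpu's
schedule lock and re-check the pool's cpu_valid mask before it switches
v->processor. The second part becomes roughly:

    /* Entry of vcpu_migrate(), simplified: pick the new cpu and take its
     * schedule lock before committing, retrying whenever the pick no
     * longer matches the pool. */
    for ( ; ; )
    {
        vcpu_schedule_lock_irqsave(v, flags);

        old_cpu = v->processor;
        new_cpu = SCHED_OP(VCPU2OP(v), pick_cpu, v);
        if ( new_cpu == old_cpu )
            break;

        if ( pcpu_schedule_trylock(new_cpu) )
        {
            if ( cpu_isset(new_cpu, v->domain->cpupool->cpu_valid) )
                break;
            pcpu_schedule_unlock(new_cpu);
        }
        vcpu_schedule_unlock_irqrestore(v, flags);
    }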


Juergen

>
> (NB this is not ready for application yet, I just wanted to check to
> see if it fixes Andre's problem)
>
>   -George
>
> On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross
> <juergen.gross@ts.fujitsu.com>  wrote:
>> Okay, I have some more data.
>>
>> I activated cpupool_dprintk() and included checks in sched_credit.c to
>> test for weight inconsistencies. To reduce race possibilities I've added
>> my patch to execute cpu assigning/unassigning always in a tasklet on the
>> cpu to be moved.
>>
>> Here is the result:
>>
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>> (XEN) cpupool_assign_cpu(pool=0,cpu=1)
>> (XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
>> (XEN) cpupool_assign_cpu(cpu=1) ret 0
>> (XEN) cpupool_assign_cpu(pool=1,cpu=4)
>> (XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
>> (XEN) cpupool_assign_cpu(cpu=4) ret 0
>> (XEN) cpu 4, weight 0,prv ffff831002ad5e40, dom 0:
>> (XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
>> (XEN) Xen BUG at sched_credit.c:570
>> (XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted:    C ]----
>> (XEN) CPU:    4
>> (XEN) RIP:    e008:[<ffff82c4801197d7>] csched_tick+0x186/0x37f
>> (XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
>> (XEN) rax: 0000000000000000   rbx: ffff830839d3ec30   rcx: 0000000000000000
>> (XEN) rdx: ffff830839dcff18   rsi: 000000000000000a   rdi: ffff82c4802542e8
>> (XEN) rbp: ffff830839dcfe38   rsp: ffff830839dcfde8   r8:  0000000000000004
>> (XEN) r9:  ffff82c480213520   r10: 00000000fffffffc   r11: 0000000000000001
>> (XEN) r12: 0000000000000004   r13: ffff830839d3ec40   r14: ffff831002ad5e40
>> (XEN) r15: ffff830839d66f90   cr0: 000000008005003b   cr4: 00000000000026f0
>> (XEN) cr3: 0000001020a98000   cr2: 00007fc5e9b79d98
>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>> (XEN) Xen stack trace from rsp=ffff830839dcfde8:
>> (XEN)    ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
>> (XEN)    0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
>> (XEN)    ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
>> (XEN)    0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
>> (XEN)    ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
>> (XEN)    ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
>> (XEN)    ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
>> (XEN)    ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
>> (XEN)    0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
>> (XEN)    00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
>> (XEN)    00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
>> (XEN)    0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
>> (XEN)    0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
>> (XEN)    00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
>> (XEN)    0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
>> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000004
>> (XEN)    ffff830077eee000 00000043b9afd180 0000000000000000
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82c4801197d7>] csched_tick+0x186/0x37f
>> (XEN)    [<ffff82c480126204>] execute_timer+0x4e/0x6c
>> (XEN)    [<ffff82c480126539>] timer_softirq_action+0xf6/0x239
>> (XEN)    [<ffff82c480123647>] __do_softirq+0x88/0x99
>> (XEN)    [<ffff82c4801236c2>] do_softirq+0x6a/0x7a
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 4:
>> (XEN) Xen BUG at sched_credit.c:570
>> (XEN) ****************************************
>>
>> As you can see, a Dom0 vcpus is becoming active on a pool 1 cpu. The BUG_ON
>> triggered in csched_acct() is a logical result of this.
>>
>> How this can happen I don't know yet.
>> Anyone any idea? I'll keep searching...
>>
>>
>> Juergen
>>
>> On 02/15/11 08:22, Juergen Gross wrote:
>>>
>>> On 02/14/11 18:57, George Dunlap wrote:
>>>>
>>>> The good news is, I've managed to reproduce this on my local test
>>>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>>>> attached script. It's time to go home now, but I should be able to
>>>> dig something up tomorrow.
>>>>
>>>> To use the script:
>>>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>>>> * You can modify elements by adding "arg=val" as arguments.
>>>> * Arguments are:
>>>> + dryrun={true,false} Do the work, but don't actually execute any xl
>>>> arguments. Default false.
>>>> + left: Number commands to execute. Default 10.
>>>> + maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>>>> 8 cpus).
>>>> + verbose={true,false} Print what you're doing. Default is true.
>>>>
>>>> The script sometimes attempts to remove the last cpu from cpupool0; in
>>>> this case, libxl will print an error. If the script gets an error
>>>> under that condition, it will ignore it; under any other condition, it
>>>> will print diagnostic information.
>>>>
>>>> What finally crashed it for me was this command:
>>>> # ./cpupool-test.sh verbose=false left=1000
>>>
>>> Nice!
>>> With your script I finally managed to get the error, too. On my box (2
>>> sockets
>>> a 6 cores) I had to use
>>>
>>> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>>>
>>> to trigger it.
>>> Looking for more data now...
>>>
>>>
>>> Juergen
>>>
>>>>
>>>> -George
>>>>
>>>> On Fri, Feb 11, 2011 at 7:39 AM, Andre
>>>> Przywara<andre.przywara@amd.com>  wrote:
>>>>>
>>>>> Juergen Gross wrote:
>>>>>>
>>>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>>>
>>>>>>> Andre Przywara wrote:
>>>>>>>>
>>>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>>>
>>>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>>>
>>>>>>>>>> Andre, George,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What seems to be interesting: I think the problem did always occur
>>>>>>>>>> when
>>>>>>>>>> a new cpupool was created and the first cpu was moved to it.
>>>>>>>>>>
>>>>>>>>>> I think my previous assumption regarding the master_ticker was not
>>>>>>>>>> too bad.
>>>>>>>>>> I think somehow the master_ticker of the new cpupool is becoming
>>>>>>>>>> active
>>>>>>>>>> before the scheduler is really initialized properly. This could
>>>>>>>>>> happen, if
>>>>>>>>>> enough time is spent between alloc_pdata for the cpu to be moved
>>>>>>>>>> and
>>>>>>>>>> the
>>>>>>>>>> critical section in schedule_cpu_switch().
>>>>>>>>>>
>>>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>>>> scheduler is
>>>>>>>>>> ready for them.
>>>>>>>>>>
>>>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>>>> suspend_ticker
>>>>>>>>>> as well? I still see potential problems for entering deep C-States.
>>>>>>>>>> I think
>>>>>>>>>> I'll prepare a patch which will keep the master_ticker active
>>>>>>>>>> for the
>>>>>>>>>> C-State case and migrate it for the schedule_cpu_switch() case.
>>>>>>>>>
>>>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine
>>>>>>>>> without any
>>>>>>>>> problems.
>>>>>>>>> Andre, could you give it a try?
>>>>>>>>
>>>>>>>> Did, but unfortunately it crashed as always. Tried twice and made
>>>>>>>> sure
>>>>>>>> I booted the right kernel. Sorry.
>>>>>>>> The idea with the race between the timer and the state changing
>>>>>>>> sounded very appealing, actually that was suspicious to me from the
>>>>>>>> beginning.
>>>>>>>>
>>>>>>>> I will add some code to dump the state of all cpupools to the BUG_ON
>>>>>>>> to see in which situation we are when the bug triggers.
>>>>>>>
>>>>>>> OK, here is a first try of this, the patch iterates over all CPU pools
>>>>>>> and outputs some data if the BUG_ON
>>>>>>> ((sdom->weight * sdom->active_vcpu_count)>  weight_left) condition
>>>>>>> triggers:
>>>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask:
>>>>>>> fffffffc003f
>>>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>>>> ....
>>>>>>> The masks look proper (6 cores per node), the bug triggers when the
>>>>>>> first CPU is about to be(?) inserted.
>>>>>>
>>>>>> Sure? I'm missing the cpu with mask 2000.
>>>>>> I'll try to reproduce the problem on a larger machine here (24 cores, 4
>>>>>> numa
>>>>>> nodes).
>>>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>>>> are
>>>>>> you
>>>>>> running, and do you have any additional patches in use?
>>>>>
>>>>> The grub lines:
>>>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
>>>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>>>
>>>>> All of my experiments are use c/s 22858 as a base.
>>>>> If you use a AMD Magny-Cours box for your experiments (socket C32 or
>>>>> G34),
>>>>> you should add the following patch (removing the line)
>>>>> --- a/xen/arch/x86/traps.c
>>>>> +++ b/xen/arch/x86/traps.c
>>>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>>>> __clear_bit(X86_FEATURE_SKINIT % 32,&c);
>>>>> __clear_bit(X86_FEATURE_WDT % 32,&c);
>>>>> __clear_bit(X86_FEATURE_LWP % 32,&c);
>>>>> - __clear_bit(X86_FEATURE_NODEID_MSR % 32,&c);
>>>>> __clear_bit(X86_FEATURE_TOPOEXT % 32,&c);
>>>>> break;
>>>>> case 5: /* MONITOR/MWAIT */
>>>>>
>>>>> This is not necessary (in fact that reverts my patch c/s 22815), but
>>>>> raises
>>>>> the probability to trigger the bug, probably because it increases the
>>>>> pressure of the Dom0 scheduler. If you cannot trigger it with Dom0,
>>>>> try to
>>>>> create a guest with many VCPUs and squeeze it into a small CPU-pool.
>>>>>
>>>>> Good luck ;-)
>>>>> Andre.
>>>>>
>>>>> --
>>>>> Andre Przywara
>>>>> AMD-OSRC (Dresden)
>>>>> Tel: x29712
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Xen-devel mailing list
>>>>> Xen-devel@lists.xensource.com
>>>>> http://lists.xensource.com/xen-devel
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Xen-devel mailing list
>>>>> Xen-devel@lists.xensource.com
>>>>> http://lists.xensource.com/xen-devel
>>>
>>>
>>
>>
>> --
>> Juergen Gross                 Principal Developer Operating Systems
>> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
>> Fujitsu Technology Solutions              e-mail:
>> juergen.gross@ts.fujitsu.com
>> Domagkstr. 28                           Internet: ts.fujitsu.com
>> D-80807 Muenchen                 Company details:
>> ts.fujitsu.com/imprint.html
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel


-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

[-- Attachment #2: cpupool-race.patch --]
[-- Type: text/x-patch, Size: 2617 bytes --]

diff -r 72470de157ce xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Wed Feb 16 09:49:33 2011 +0000
+++ b/xen/common/sched_credit.c	Wed Feb 16 15:09:54 2011 +0100
@@ -1268,7 +1268,8 @@ csched_load_balance(struct csched_privat
         /*
          * Any work over there to steal?
          */
-        speer = csched_runq_steal(peer_cpu, cpu, snext->pri);
+        speer = cpu_isset(peer_cpu, *online) ?
+            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
         pcpu_schedule_unlock(peer_cpu);
         if ( speer != NULL )
         {
diff -r 72470de157ce xen/common/schedule.c
--- a/xen/common/schedule.c	Wed Feb 16 09:49:33 2011 +0000
+++ b/xen/common/schedule.c	Wed Feb 16 15:09:54 2011 +0100
@@ -395,7 +395,28 @@ static void vcpu_migrate(struct vcpu *v)
     unsigned long flags;
     int old_cpu, new_cpu;
 
-    vcpu_schedule_lock_irqsave(v, flags);
+    for (;;)
+    {
+        vcpu_schedule_lock_irqsave(v, flags);
+
+        /* Select new CPU. */
+        old_cpu = v->processor;
+        new_cpu = SCHED_OP(VCPU2OP(v), pick_cpu, v);
+
+        if ( new_cpu == old_cpu )
+            break;
+
+        if ( !pcpu_schedule_trylock(new_cpu) )
+        {
+            vcpu_schedule_unlock_irqrestore(v, flags);
+            continue;
+        }
+        if ( cpu_isset(new_cpu, v->domain->cpupool->cpu_valid) )
+            break;
+
+        pcpu_schedule_unlock(new_cpu);
+        vcpu_schedule_unlock_irqrestore(v, flags);
+    }
 
     /*
      * NB. Check of v->running happens /after/ setting migration flag
@@ -405,13 +426,12 @@ static void vcpu_migrate(struct vcpu *v)
     if ( v->is_running ||
          !test_and_clear_bit(_VPF_migrating, &v->pause_flags) )
     {
+        if ( old_cpu != new_cpu )
+            pcpu_schedule_unlock(new_cpu);
+
         vcpu_schedule_unlock_irqrestore(v, flags);
         return;
     }
-
-    /* Select new CPU. */
-    old_cpu = v->processor;
-    new_cpu = SCHED_OP(VCPU2OP(v), pick_cpu, v);
 
     /*
      * Transfer urgency status to new CPU before switching CPUs, as once
@@ -424,9 +444,13 @@ static void vcpu_migrate(struct vcpu *v)
         atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count);
     }
 
-    /* Switch to new CPU, then unlock old CPU.  This is safe because
+    /* Switch to new CPU, then unlock new and old CPU.  This is safe because
      * the lock pointer cant' change while the current lock is held. */
     v->processor = new_cpu;
+
+    if ( old_cpu != new_cpu )
+        pcpu_schedule_unlock(new_cpu);
+
     spin_unlock_irqrestore(
         per_cpu(schedule_data, old_cpu).schedule_lock, flags);
 

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-16 14:11                                                                     ` Juergen Gross
@ 2011-02-16 14:28                                                                       ` Juergen Gross
  2011-02-17  0:05                                                                       ` André Przywara
  1 sibling, 0 replies; 53+ messages in thread
From: Juergen Gross @ 2011-02-16 14:28 UTC (permalink / raw)
  To: George Dunlap; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

On 02/16/11 15:11, Juergen Gross wrote:
> On 02/16/11 14:54, George Dunlap wrote:
>> Andre (and Juergen), can you try again with the attached patch?
>>
>> What the patch basically does is try to make "cpu_disable_scheduler()"
>> do what it seems to say it does. :-) Namely, the various
>> scheduler-related interrutps (both per-cpu ticks and the master tick)
>> is a part of the scheduler, so disable them before doing anything, and
>> don't enable them until the cpu is really ready to go again.
>>
>> To be precise:
>> * cpu_disable_scheduler() disables ticks
>> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
>> and does it after inserting the idle vcpu
>> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
>> stop tickers
>> + Call tick_{resume,suspend} in cpu_{up,down}, respectively
>
> I tried this before :-)
> It didn't work for Andre, but may be there were some bits missing.
>
>> * Modify credit1's tick_{suspend,resume} to handle the master ticker
>> as well.
>>
>> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
>> on one pcpu), I can perform thousands of operations successfully.
>
> Nice. I'll try later. In the moment I'm testing another patch (attached
> for review, if you like). I think I've identified two possible races.

My patch works for me. I think I have to rework the locking for credit1, but
that shouldn't be too hard.

My machine survived 10000 iterations of your script with additional
consistency checks in the scheduler. Without my patch the machine crashed
after less than 500 iterations.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-16 14:11                                                                     ` Juergen Gross
  2011-02-16 14:28                                                                       ` Juergen Gross
@ 2011-02-17  0:05                                                                       ` André Przywara
  1 sibling, 0 replies; 53+ messages in thread
From: André Przywara @ 2011-02-17  0:05 UTC (permalink / raw)
  To: Juergen Gross; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

On 02/16/11 15:11, Juergen Gross wrote:
> On 02/16/11 14:54, George Dunlap wrote:
>> Andre (and Juergen), can you try again with the attached patch?
George, Juergen, thanks for all your work on this!
I will try the patch as soon as I am back in the office this afternoon.

Regards,
Andre.

>>
>> What the patch basically does is try to make "cpu_disable_scheduler()"
>> do what it seems to say it does. :-)  Namely, the various
>> scheduler-related interrupts (both per-cpu ticks and the master tick)
>> are part of the scheduler, so disable them before doing anything, and
>> don't enable them until the cpu is really ready to go again.
>>
>> To be precise:
>> * cpu_disable_scheduler() disables ticks
>> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
>> and does it after inserting the idle vcpu
>> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
>> stop tickers
>>    + Call tick_{resume,suspend} in cpu_{up,down}, respectively
>
> I tried this before :-)
>> It didn't work for Andre, but maybe there were some bits missing.
>
>> * Modify credit1's tick_{suspend,resume} to handle the master ticker as well.
>>
>> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
>> on one pcpu), I can perform thousands of operations successfully.
>
>> Nice. I'll try later. At the moment I'm testing another patch (attached
> for review, if you like). I think I've identified two possible races.
>
>
> Juergen
>
>>
>> (NB this is not ready for application yet, I just wanted to check to
>> see if it fixes Andre's problem)
>>
>>    -George
>>
>> On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross
>> <juergen.gross@ts.fujitsu.com>   wrote:
>>> Okay, I have some more data.
>>>
>>> I activated cpupool_dprintk() and included checks in sched_credit.c to
>>> test for weight inconsistencies. To reduce race possibilities I've added
>>> my patch to execute cpu assigning/unassigning always in a tasklet on the
>>> cpu to be moved.
>>>
>>> Here is the result:
>>>
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1)
>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
>>> (XEN) cpupool_assign_cpu(cpu=1) ret 0
>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4)
>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
>>> (XEN) cpupool_assign_cpu(cpu=4) ret 0
>>> (XEN) cpu 4, weight 0,prv ffff831002ad5e40, dom 0:
>>> (XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
>>> (XEN) Xen BUG at sched_credit.c:570
>>> (XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted:    C ]----
>>> (XEN) CPU:    4
>>> (XEN) RIP:    e008:[<ffff82c4801197d7>] csched_tick+0x186/0x37f
>>> (XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
>>> (XEN) rax: 0000000000000000   rbx: ffff830839d3ec30   rcx: 0000000000000000
>>> (XEN) rdx: ffff830839dcff18   rsi: 000000000000000a   rdi: ffff82c4802542e8
>>> (XEN) rbp: ffff830839dcfe38   rsp: ffff830839dcfde8   r8:  0000000000000004
>>> (XEN) r9:  ffff82c480213520   r10: 00000000fffffffc   r11: 0000000000000001
>>> (XEN) r12: 0000000000000004   r13: ffff830839d3ec40   r14: ffff831002ad5e40
>>> (XEN) r15: ffff830839d66f90   cr0: 000000008005003b   cr4: 00000000000026f0
>>> (XEN) cr3: 0000001020a98000   cr2: 00007fc5e9b79d98
>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>>> (XEN) Xen stack trace from rsp=ffff830839dcfde8:
>>> (XEN)    ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
>>> (XEN)    0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
>>> (XEN)    ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
>>> (XEN)    0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
>>> (XEN)    ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
>>> (XEN)    ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
>>> (XEN)    ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
>>> (XEN)    ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
>>> (XEN)    0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
>>> (XEN)    00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
>>> (XEN)    00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
>>> (XEN)    0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
>>> (XEN)    0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
>>> (XEN)    00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
>>> (XEN)    0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
>>> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000004
>>> (XEN)    ffff830077eee000 00000043b9afd180 0000000000000000
>>> (XEN) Xen call trace:
>>> (XEN)    [<ffff82c4801197d7>] csched_tick+0x186/0x37f
>>> (XEN)    [<ffff82c480126204>] execute_timer+0x4e/0x6c
>>> (XEN)    [<ffff82c480126539>] timer_softirq_action+0xf6/0x239
>>> (XEN)    [<ffff82c480123647>] __do_softirq+0x88/0x99
>>> (XEN)    [<ffff82c4801236c2>] do_softirq+0x6a/0x7a
>>> (XEN)
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 4:
>>> (XEN) Xen BUG at sched_credit.c:570
>>> (XEN) ****************************************
>>>
>>> As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The BUG_ON
>>> triggered in csched_acct() is a logical result of this.
>>>
>>> How this can happen I don't know yet.
>>> Anyone any idea? I'll keep searching...
>>>
>>>
>>> Juergen
>>>
>>> On 02/15/11 08:22, Juergen Gross wrote:
>>>>
>>>> On 02/14/11 18:57, George Dunlap wrote:
>>>>>
>>>>> The good news is, I've managed to reproduce this on my local test
>>>>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>>>>> attached script. It's time to go home now, but I should be able to
>>>>> dig something up tomorrow.
>>>>>
>>>>> To use the script:
>>>>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>>>>> * You can modify elements by adding "arg=val" as arguments.
>>>>> * Arguments are:
>>>>> + dryrun={true,false} Do the work, but don't actually execute any xl
>>>>> arguments. Default false.
>>>>> + left: Number of commands to execute. Default 10.
>>>>> + maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>>>>> 8 cpus).
>>>>> + verbose={true,false} Print what you're doing. Default is true.
>>>>>
>>>>> The script sometimes attempts to remove the last cpu from cpupool0; in
>>>>> this case, libxl will print an error. If the script gets an error
>>>>> under that condition, it will ignore it; under any other condition, it
>>>>> will print diagnostic information.
>>>>>
>>>>> What finally crashed it for me was this command:
>>>>> # ./cpupool-test.sh verbose=false left=1000
>>>>
>>>> Nice!
>>>> With your script I finally managed to get the error, too. On my box (2
>>>> sockets with 6 cores each) I had to use
>>>>
>>>> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>>>>
>>>> to trigger it.
>>>> Looking for more data now...
>>>>
>>>>
>>>> Juergen
>>>>
>>>>>
>>>>> -George
>>>>>
>>>>> On Fri, Feb 11, 2011 at 7:39 AM, Andre
>>>>> Przywara<andre.przywara@amd.com>   wrote:
>>>>>>
>>>>>> Juergen Gross wrote:
>>>>>>>
>>>>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>>>>
>>>>>>>> Andre Przywara wrote:
>>>>>>>>>
>>>>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>>>>
>>>>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>>>>
>>>>>>>>>>> Andre, George,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What seems to be interesting: I think the problem did always occur
>>>>>>>>>>> when
>>>>>>>>>>> a new cpupool was created and the first cpu was moved to it.
>>>>>>>>>>>
>>>>>>>>>>> I think my previous assumption regarding the master_ticker was not
>>>>>>>>>>> too bad.
>>>>>>>>>>> I think somehow the master_ticker of the new cpupool is becoming
>>>>>>>>>>> active
>>>>>>>>>>> before the scheduler is really initialized properly. This could
>>>>>>>>>>> happen, if
>>>>>>>>>>> enough time is spent between alloc_pdata for the cpu to be moved
>>>>>>>>>>> and
>>>>>>>>>>> the
>>>>>>>>>>> critical section in schedule_cpu_switch().
>>>>>>>>>>>
>>>>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>>>>> scheduler is
>>>>>>>>>>> ready for them.
>>>>>>>>>>>
>>>>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>>>>> suspend_ticker
>>>>>>>>>>> as well? I still see potential problems for entering deep C-States.
>>>>>>>>>>> I think
>>>>>>>>>>> I'll prepare a patch which will keep the master_ticker active
>>>>>>>>>>> for the
>>>>>>>>>>> C-State case and migrate it for the schedule_cpu_switch() case.
>>>>>>>>>>
>>>>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine
>>>>>>>>>> without any
>>>>>>>>>> problems.
>>>>>>>>>> Andre, could you give it a try?
>>>>>>>>>
>>>>>>>>> Did, but unfortunately it crashed as always. Tried twice and made
>>>>>>>>> sure
>>>>>>>>> I booted the right kernel. Sorry.
>>>>>>>>> The idea with the race between the timer and the state changing
>>>>>>>>> sounded very appealing, actually that was suspicious to me from the
>>>>>>>>> beginning.
>>>>>>>>>
>>>>>>>>> I will add some code to dump the state of all cpupools to the BUG_ON
>>>>>>>>> to see in which situation we are when the bug triggers.
>>>>>>>>
>>>>>>>> OK, here is a first try of this, the patch iterates over all CPU pools
>>>>>>>> and outputs some data if the BUG_ON
>>>>>>>> ((sdom->weight * sdom->active_vcpu_count)>   weight_left) condition
>>>>>>>> triggers:
>>>>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask:
>>>>>>>> fffffffc003f
>>>>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>>>>> ....
>>>>>>>> The masks look proper (6 cores per node), the bug triggers when the
>>>>>>>> first CPU is about to be(?) inserted.
>>>>>>>
>>>>>>> Sure? I'm missing the cpu with mask 2000.
>>>>>>> I'll try to reproduce the problem on a larger machine here (24 cores, 4
>>>>>>> numa
>>>>>>> nodes).
>>>>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>>>>> are
>>>>>>> you
>>>>>>> running, and do you have any additional patches in use?
>>>>>>
>>>>>> The grub lines:
>>>>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
>>>>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>>>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>>>>
>>>>>> All of my experiments use c/s 22858 as a base.
>>>>>> If you use an AMD Magny-Cours box for your experiments (socket C32 or
>>>>>> G34),
>>>>>> you should add the following patch (removing the line)
>>>>>> --- a/xen/arch/x86/traps.c
>>>>>> +++ b/xen/arch/x86/traps.c
>>>>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>>>>> __clear_bit(X86_FEATURE_SKINIT % 32,&c);
>>>>>> __clear_bit(X86_FEATURE_WDT % 32,&c);
>>>>>> __clear_bit(X86_FEATURE_LWP % 32,&c);
>>>>>> - __clear_bit(X86_FEATURE_NODEID_MSR % 32,&c);
>>>>>> __clear_bit(X86_FEATURE_TOPOEXT % 32,&c);
>>>>>> break;
>>>>>> case 5: /* MONITOR/MWAIT */
>>>>>>
>>>>>> This is not necessary (in fact that reverts my patch c/s 22815), but
>>>>>> raises
>>>>>> the probability of triggering the bug, probably because it increases the
>>>>>> pressure on the Dom0 scheduler. If you cannot trigger it with Dom0,
>>>>>> try to
>>>>>> create a guest with many VCPUs and squeeze it into a small CPU-pool.
>>>>>>
>>>>>> Good luck ;-)
>>>>>> Andre.
>>>>>>
>>>>>> --
>>>>>> Andre Przywara
>>>>>> AMD-OSRC (Dresden)
>>>>>> Tel: x29712
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Xen-devel mailing list
>>>>>> Xen-devel@lists.xensource.com
>>>>>> http://lists.xensource.com/xen-devel
>>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Juergen Gross                 Principal Developer Operating Systems
>>> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
>>> Fujitsu Technology Solutions              e-mail:
>>> juergen.gross@ts.fujitsu.com
>>> Domagkstr. 28                           Internet: ts.fujitsu.com
>>> D-80807 Muenchen                 Company details:
>>> ts.fujitsu.com/imprint.html
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>
>
>
> --
> Juergen Gross                 Principal Developer Operating Systems
> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
> Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
> Domagkstr. 28                           Internet: ts.fujitsu.com
> D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html


-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-16 13:54                                                                   ` George Dunlap
       [not found]                                                                     ` <4D6237C6.1050206@amd.com>
  2011-02-16 14:11                                                                     ` Juergen Gross
@ 2011-02-17  7:05                                                                     ` Juergen Gross
  2011-02-17  9:11                                                                       ` Juergen Gross
  2011-02-21 10:00                                                                     ` Andre Przywara
  3 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-17  7:05 UTC (permalink / raw)
  To: George Dunlap; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

On 02/16/11 14:54, George Dunlap wrote:
> Andre (and Juergen), can you try again with the attached patch?
>
> What the patch basically does is try to make "cpu_disable_scheduler()"
> do what it seems to say it does. :-)  Namely, the various
> scheduler-related interrupts (both per-cpu ticks and the master tick)
> are part of the scheduler, so disable them before doing anything, and
> don't enable them until the cpu is really ready to go again.
>
> To be precise:
> * cpu_disable_scheduler() disables ticks
> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
> and does it after inserting the idle vcpu
> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
> stop tickers
>   + Call tick_{resume,suspend} in cpu_{up,down}, respectively
> * Modify credit1's tick_{suspend,resume} to handle the master ticker as well.
>
> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
> on one pcpu), I can perform thousands of operations successfully.
>
> (NB this is not ready for application yet, I just wanted to check to
> see if it fixes Andre's problem)

After some thousand iterations the machine hung; after dumping Dom0
registers to the console it continued running and crashed about a second later:

(XEN) cpupool_unassign_cpu(pool=0,cpu=9)
(XEN) cpupool_unassign_cpu(pool=0,cpu=9) ffff83083fff74c0
(XEN) cpupool_unassign_cpu ret=0
(XEN) cpupool_unassign_cpu(pool=0,cpu=4)
(XEN) cpupool_unassign_cpu(pool=0,cpu=4) ffff83083fff74c0
(XEN) cpupool_unassign_cpu ret=0
(XEN) cpupool_assign_cpu(pool=1,cpu=9)
(XEN) cpupool_assign_cpu(pool=1,cpu=9) ffff83083002de40
(XEN) Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:279
(XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    9
(XEN) RIP:    e008:[<ffff82c480126100>] active_timer+0xc/0x37
(XEN) RFLAGS: 0000000000010046   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: 0000000000000000   rcx: 0000000000000000
(XEN) rdx: ffff830839d8ff18   rsi: 0000010dbb628a80   rdi: ffff83083ffbcf98
(XEN) rbp: ffff830839d8fd50   rsp: ffff830839d8fd50   r8:  ffff83083ffbcf90
(XEN) r9:  ffff82c480213680   r10: 00000000ffffffff   r11: 0000000000000010
(XEN) r12: ffff82c4802d3f80   r13: ffff82c4802d3f80   r14: ffff83083ffbcf98
(XEN) r15: ffff83083ffbcfc0   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 000000007809c000   cr2: 0000000000620048
(XEN) ds: 002b   es: 002b   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff830839d8fd50:
(XEN)    ffff830839d8fda0 ffff82c480126ef9 0000000000000000 0000010dbb628a80
(XEN)    0000000000000086 0000000000000009 ffff83083002de40 ffff83083002dd50
(XEN)    0000000000000009 0000000000000009 ffff830839d8fdc0 ffff82c480117906
(XEN)    ffff83083ffa3b40 ffff83083ffa5d70 ffff830839d8fe30 ffff82c4801214fa
(XEN)    ffff83083002dd00 0000000900000100 0000000000000286 ffff8300780da000
(XEN)    ffff83083ffbcf80 ffff83083ffbcf90 ffff82c480247e00 0000000000000009
(XEN)    00000000fffffff0 ffff83083002dd00 0000000000000000 ffff8300781cc198
(XEN)    ffff830839d8fe60 ffff82c4801019ff 0000000000000009 0000000000000009
(XEN)    ffff8300781cc198 ffff830839d990d0 ffff830839d8fe80 ffff82c480101bd9
(XEN)    ffff83107e80c5b0 ffff8300781cc000 ffff830839d8fea0 ffff82c480104f21
(XEN)    0000000000000009 ffff830839d990e0 ffff830839d8fee0 ffff82c480125b6c
(XEN)    ffff82c48024a020 ffff830839d8ff18 ffff82c48024a020 ffff830839d8ff18
(XEN)    ffff830839d99060 ffff830839d99040 ffff830839d8ff10 ffff82c48015645a
(XEN)    0000000000000000 ffff8300780da000 ffff8300780da000 ffffffffffffffff
(XEN)    ffff830839d8fe00 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 ffffffff8062bda0 ffff880fbb1e5fd8 0000000000000246
(XEN)    0000000000000000 000000010003347d 0000000000000000 0000000000000000
(XEN)    ffffffff800033aa 00000000deadbeef 00000000deadbeef 00000000deadbeef
(XEN)    0000010000000000 ffffffff800033aa 000000000000e033 0000000000000246
(XEN)    ffff880fbb1e5f08 000000000000e02b 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c480126100>] active_timer+0xc/0x37
(XEN)    [<ffff82c480126ef9>] set_timer+0x102/0x218
(XEN)    [<ffff82c480117906>] csched_tick_resume+0x53/0x75
(XEN)    [<ffff82c4801214fa>] schedule_cpu_switch+0x1f1/0x25c
(XEN)    [<ffff82c4801019ff>] cpupool_assign_cpu_locked+0x61/0xd6
(XEN)    [<ffff82c480101bd9>] cpupool_assign_cpu_helper+0x9f/0xcd
(XEN)    [<ffff82c480104f21>] continue_hypercall_tasklet_handler+0x51/0xc3
(XEN)    [<ffff82c480125b6c>] do_tasklet+0xe1/0x155
(XEN)    [<ffff82c48015645a>] idle_loop+0x5f/0x67
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 9:
(XEN) Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:279
(XEN) ****************************************


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-17  7:05                                                                     ` Juergen Gross
@ 2011-02-17  9:11                                                                       ` Juergen Gross
  0 siblings, 0 replies; 53+ messages in thread
From: Juergen Gross @ 2011-02-17  9:11 UTC (permalink / raw)
  To: George Dunlap; +Cc: Andre Przywara, xen-devel, Diestelhorst, Stephan

On 02/17/11 08:05, Juergen Gross wrote:
> On 02/16/11 14:54, George Dunlap wrote:
>> Andre (and Juergen), can you try again with the attached patch?
>>
>> What the patch basically does is try to make "cpu_disable_scheduler()"
>> do what it seems to say it does. :-) Namely, the various
>> scheduler-related interrupts (both per-cpu ticks and the master tick)
>> are part of the scheduler, so disable them before doing anything, and
>> don't enable them until the cpu is really ready to go again.
>>
>> To be precise:
>> * cpu_disable_scheduler() disables ticks
>> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
>> and does it after inserting the idle vcpu
>> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
>> stop tickers
>> + Call tick_{resume,suspend} in cpu_{up,down}, respectively
>> * Modify credit1's tick_{suspend,resume} to handle the master ticker
>> as well.
>>
>> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
>> on one pcpu), I can perform thousands of operations successfully.
>>
>> (NB this is not ready for application yet, I just wanted to check to
>> see if it fixes Andre's problem)

Tried again, this time with the following patch:

diff -r 72470de157ce xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Wed Feb 16 09:49:33 2011 +0000
+++ b/xen/common/sched_credit.c Wed Feb 16 15:09:54 2011 +0100
@@ -1268,7 +1268,8 @@ csched_load_balance(struct csched_privat
          /*
           * Any work over there to steal?
           */
-        speer = csched_runq_steal(peer_cpu, cpu, snext->pri);
+        speer = cpu_isset(peer_cpu, *online) ?
+            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
          pcpu_schedule_unlock(peer_cpu);
          if ( speer != NULL )
          {


Worked without any flaw for 30000 iterations.
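
[Editor's note: restating the two changed lines with a comment, using only
names visible in the hunk; this annotation is not part of the patch.

        /* Only steal work from a peer cpu that is still in this scheduler's
         * online mask, i.e. still in the same cpupool as this cpu; otherwise
         * a vcpu from a foreign pool (e.g. Dom0 in Pool-0) could be pulled
         * onto a cpu that has just been moved to another pool. */
        speer = cpu_isset(peer_cpu, *online) ?
            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
        pcpu_schedule_unlock(peer_cpu);]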


Juergen

>
> After some thousand iterations the machine hung; after dumping Dom0
> registers to the console it continued running and crashed about a second later:
>
> (XEN) cpupool_unassign_cpu(pool=0,cpu=9)
> (XEN) cpupool_unassign_cpu(pool=0,cpu=9) ffff83083fff74c0
> (XEN) cpupool_unassign_cpu ret=0
> (XEN) cpupool_unassign_cpu(pool=0,cpu=4)
> (XEN) cpupool_unassign_cpu(pool=0,cpu=4) ffff83083fff74c0
> (XEN) cpupool_unassign_cpu ret=0
> (XEN) cpupool_assign_cpu(pool=1,cpu=9)
> (XEN) cpupool_assign_cpu(pool=1,cpu=9) ffff83083002de40
> (XEN) Assertion 'timer->status >= TIMER_STATUS_inactive' failed at
> timer.c:279
> (XEN) ----[ Xen-4.1.0-rc5-pre x86_64 debug=y Tainted: C ]----
> (XEN) CPU: 9
> (XEN) RIP: e008:[<ffff82c480126100>] active_timer+0xc/0x37
> (XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor
> (XEN) rax: 0000000000000000 rbx: 0000000000000000 rcx: 0000000000000000
> (XEN) rdx: ffff830839d8ff18 rsi: 0000010dbb628a80 rdi: ffff83083ffbcf98
> (XEN) rbp: ffff830839d8fd50 rsp: ffff830839d8fd50 r8: ffff83083ffbcf90
> (XEN) r9: ffff82c480213680 r10: 00000000ffffffff r11: 0000000000000010
> (XEN) r12: ffff82c4802d3f80 r13: ffff82c4802d3f80 r14: ffff83083ffbcf98
> (XEN) r15: ffff83083ffbcfc0 cr0: 000000008005003b cr4: 00000000000026f0
> (XEN) cr3: 000000007809c000 cr2: 0000000000620048
> (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) Xen stack trace from rsp=ffff830839d8fd50:
> (XEN) ffff830839d8fda0 ffff82c480126ef9 0000000000000000 0000010dbb628a80
> (XEN) 0000000000000086 0000000000000009 ffff83083002de40 ffff83083002dd50
> (XEN) 0000000000000009 0000000000000009 ffff830839d8fdc0 ffff82c480117906
> (XEN) ffff83083ffa3b40 ffff83083ffa5d70 ffff830839d8fe30 ffff82c4801214fa
> (XEN) ffff83083002dd00 0000000900000100 0000000000000286 ffff8300780da000
> (XEN) ffff83083ffbcf80 ffff83083ffbcf90 ffff82c480247e00 0000000000000009
> (XEN) 00000000fffffff0 ffff83083002dd00 0000000000000000 ffff8300781cc198
> (XEN) ffff830839d8fe60 ffff82c4801019ff 0000000000000009 0000000000000009
> (XEN) ffff8300781cc198 ffff830839d990d0 ffff830839d8fe80 ffff82c480101bd9
> (XEN) ffff83107e80c5b0 ffff8300781cc000 ffff830839d8fea0 ffff82c480104f21
> (XEN) 0000000000000009 ffff830839d990e0 ffff830839d8fee0 ffff82c480125b6c
> (XEN) ffff82c48024a020 ffff830839d8ff18 ffff82c48024a020 ffff830839d8ff18
> (XEN) ffff830839d99060 ffff830839d99040 ffff830839d8ff10 ffff82c48015645a
> (XEN) 0000000000000000 ffff8300780da000 ffff8300780da000 ffffffffffffffff
> (XEN) ffff830839d8fe00 0000000000000000 0000000000000000 0000000000000000
> (XEN) 0000000000000000 ffffffff8062bda0 ffff880fbb1e5fd8 0000000000000246
> (XEN) 0000000000000000 000000010003347d 0000000000000000 0000000000000000
> (XEN) ffffffff800033aa 00000000deadbeef 00000000deadbeef 00000000deadbeef
> (XEN) 0000010000000000 ffffffff800033aa 000000000000e033 0000000000000246
> (XEN) ffff880fbb1e5f08 000000000000e02b 0000000000000000 0000000000000000
> (XEN) Xen call trace:
> (XEN) [<ffff82c480126100>] active_timer+0xc/0x37
> (XEN) [<ffff82c480126ef9>] set_timer+0x102/0x218
> (XEN) [<ffff82c480117906>] csched_tick_resume+0x53/0x75
> (XEN) [<ffff82c4801214fa>] schedule_cpu_switch+0x1f1/0x25c
> (XEN) [<ffff82c4801019ff>] cpupool_assign_cpu_locked+0x61/0xd6
> (XEN) [<ffff82c480101bd9>] cpupool_assign_cpu_helper+0x9f/0xcd
> (XEN) [<ffff82c480104f21>] continue_hypercall_tasklet_handler+0x51/0xc3
> (XEN) [<ffff82c480125b6c>] do_tasklet+0xe1/0x155
> (XEN) [<ffff82c48015645a>] idle_loop+0x5f/0x67
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 9:
> (XEN) Assertion 'timer->status >= TIMER_STATUS_inactive' failed at
> timer.c:279
> (XEN) ****************************************
>
>
> Juergen
>


-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-16 13:54                                                                   ` George Dunlap
                                                                                       ` (2 preceding siblings ...)
  2011-02-17  7:05                                                                     ` Juergen Gross
@ 2011-02-21 10:00                                                                     ` Andre Przywara
  2011-02-21 13:19                                                                       ` Juergen Gross
  3 siblings, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-02-21 10:00 UTC (permalink / raw)
  To: George Dunlap; +Cc: Juergen Gross, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 12198 bytes --]

George Dunlap wrote:
> Andre (and Juergen), can you try again with the attached patch?

I applied this patch on top of 22931 and it did _not_ work.
The crash occurred almost immediately after I started my script, so it is the
same behaviour as without the patch.
(attached my script for reference, though it will most likely only make 
sense on bigger NUMA machines)

Regards,
Andre.


> What the patch basically does is try to make "cpu_disable_scheduler()"
> do what it seems to say it does. :-)  Namely, the various
> scheduler-related interrupts (both per-cpu ticks and the master tick)
> are part of the scheduler, so disable them before doing anything, and
> don't enable them until the cpu is really ready to go again.
> 
> To be precise:
> * cpu_disable_scheduler() disables ticks
> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
> and does it after inserting the idle vcpu
> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
> stop tickers
>  + Call tick_{resume,suspend} in cpu_{up,down}, respectively
> * Modify credit1's tick_{suspend,resume} to handle the master ticker as well.
> 
> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
> on one pcpu), I can perform thousands of operations successfully.
> 
> (NB this is not ready for application yet, I just wanted to check to
> see if it fixes Andre's problem)
> 
>  -George
> 
> On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross
> <juergen.gross@ts.fujitsu.com> wrote:
>> Okay, I have some more data.
>>
>> I activated cpupool_dprintk() and included checks in sched_credit.c to
>> test for weight inconsistencies. To reduce race possibilities I've added
>> my patch to execute cpu assigning/unassigning always in a tasklet on the
>> cpu to be moved.
>>
>> Here is the result:
>>
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>> (XEN) cpupool_assign_cpu(pool=0,cpu=1)
>> (XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
>> (XEN) cpupool_assign_cpu(cpu=1) ret 0
>> (XEN) cpupool_assign_cpu(pool=1,cpu=4)
>> (XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
>> (XEN) cpupool_assign_cpu(cpu=4) ret 0
>> (XEN) cpu 4, weight 0,prv ffff831002ad5e40, dom 0:
>> (XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
>> (XEN) Xen BUG at sched_credit.c:570
>> (XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted:    C ]----
>> (XEN) CPU:    4
>> (XEN) RIP:    e008:[<ffff82c4801197d7>] csched_tick+0x186/0x37f
>> (XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
>> (XEN) rax: 0000000000000000   rbx: ffff830839d3ec30   rcx: 0000000000000000
>> (XEN) rdx: ffff830839dcff18   rsi: 000000000000000a   rdi: ffff82c4802542e8
>> (XEN) rbp: ffff830839dcfe38   rsp: ffff830839dcfde8   r8:  0000000000000004
>> (XEN) r9:  ffff82c480213520   r10: 00000000fffffffc   r11: 0000000000000001
>> (XEN) r12: 0000000000000004   r13: ffff830839d3ec40   r14: ffff831002ad5e40
>> (XEN) r15: ffff830839d66f90   cr0: 000000008005003b   cr4: 00000000000026f0
>> (XEN) cr3: 0000001020a98000   cr2: 00007fc5e9b79d98
>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>> (XEN) Xen stack trace from rsp=ffff830839dcfde8:
>> (XEN)    ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
>> (XEN)    0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
>> (XEN)    ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
>> (XEN)    0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
>> (XEN)    ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
>> (XEN)    ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
>> (XEN)    ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
>> (XEN)    ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
>> (XEN)    0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
>> (XEN)    00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
>> (XEN)    00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
>> (XEN)    0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
>> (XEN)    0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
>> (XEN)    00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
>> (XEN)    0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
>> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000004
>> (XEN)    ffff830077eee000 00000043b9afd180 0000000000000000
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82c4801197d7>] csched_tick+0x186/0x37f
>> (XEN)    [<ffff82c480126204>] execute_timer+0x4e/0x6c
>> (XEN)    [<ffff82c480126539>] timer_softirq_action+0xf6/0x239
>> (XEN)    [<ffff82c480123647>] __do_softirq+0x88/0x99
>> (XEN)    [<ffff82c4801236c2>] do_softirq+0x6a/0x7a
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 4:
>> (XEN) Xen BUG at sched_credit.c:570
>> (XEN) ****************************************
>>
>> As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The BUG_ON
>> triggered in csched_acct() is a logical result of this.
>>
>> How this can happen I don't know yet.
>> Anyone any idea? I'll keep searching...
>>
>>
>> Juergen
>>
>> On 02/15/11 08:22, Juergen Gross wrote:
>>> On 02/14/11 18:57, George Dunlap wrote:
>>>> The good news is, I've managed to reproduce this on my local test
>>>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>>>> attached script. It's time to go home now, but I should be able to
>>>> dig something up tomorrow.
>>>>
>>>> To use the script:
>>>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>>>> * You can modify elements by adding "arg=val" as arguments.
>>>> * Arguments are:
>>>> + dryrun={true,false} Do the work, but don't actually execute any xl
>>>> arguments. Default false.
>>>> + left: Number of commands to execute. Default 10.
>>>> + maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>>>> 8 cpus).
>>>> + verbose={true,false} Print what you're doing. Default is true.
>>>>
>>>> The script sometimes attempts to remove the last cpu from cpupool0; in
>>>> this case, libxl will print an error. If the script gets an error
>>>> under that condition, it will ignore it; under any other condition, it
>>>> will print diagnostic information.
>>>>
>>>> What finally crashed it for me was this command:
>>>> # ./cpupool-test.sh verbose=false left=1000
>>> Nice!
>>> With your script I finally managed to get the error, too. On my box (2
>>> sockets with 6 cores each) I had to use
>>>
>>> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>>>
>>> to trigger it.
>>> Looking for more data now...
>>>
>>>
>>> Juergen
>>>
>>>> -George
>>>>
>>>> On Fri, Feb 11, 2011 at 7:39 AM, Andre
>>>> Przywara<andre.przywara@amd.com> wrote:
>>>>> Juergen Gross wrote:
>>>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>>> Andre Przywara wrote:
>>>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>>> Andre, George,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What seems to be interesting: I think the problem did always occur
>>>>>>>>>> when
>>>>>>>>>> a new cpupool was created and the first cpu was moved to it.
>>>>>>>>>>
>>>>>>>>>> I think my previous assumption regarding the master_ticker was not
>>>>>>>>>> too bad.
>>>>>>>>>> I think somehow the master_ticker of the new cpupool is becoming
>>>>>>>>>> active
>>>>>>>>>> before the scheduler is really initialized properly. This could
>>>>>>>>>> happen, if
>>>>>>>>>> enough time is spent between alloc_pdata for the cpu to be moved
>>>>>>>>>> and
>>>>>>>>>> the
>>>>>>>>>> critical section in schedule_cpu_switch().
>>>>>>>>>>
>>>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>>>> scheduler is
>>>>>>>>>> ready for them.
>>>>>>>>>>
>>>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>>>> suspend_ticker
>>>>>>>>>> as well? I still see potential problems for entering deep C-States.
>>>>>>>>>> I think
>>>>>>>>>> I'll prepare a patch which will keep the master_ticker active
>>>>>>>>>> for the
>>>>>>>>>> C-State case and migrate it for the schedule_cpu_switch() case.
>>>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine
>>>>>>>>> without any
>>>>>>>>> problems.
>>>>>>>>> Andre, could you give it a try?
>>>>>>>> Did, but unfortunately it crashed as always. Tried twice and made
>>>>>>>> sure
>>>>>>>> I booted the right kernel. Sorry.
>>>>>>>> The idea with the race between the timer and the state changing
>>>>>>>> sounded very appealing, actually that was suspicious to me from the
>>>>>>>> beginning.
>>>>>>>>
>>>>>>>> I will add some code to dump the state of all cpupools to the BUG_ON
>>>>>>>> to see in which situation we are when the bug triggers.
>>>>>>> OK, here is a first try of this, the patch iterates over all CPU pools
>>>>>>> and outputs some data if the BUG_ON
>>>>>>> ((sdom->weight * sdom->active_vcpu_count)> weight_left) condition
>>>>>>> triggers:
>>>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask:
>>>>>>> fffffffc003f
>>>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>>>> ....
>>>>>>> The masks look proper (6 cores per node), the bug triggers when the
>>>>>>> first CPU is about to be(?) inserted.
>>>>>> Sure? I'm missing the cpu with mask 2000.
>>>>>> I'll try to reproduce the problem on a larger machine here (24 cores, 4
>>>>>> numa
>>>>>> nodes).
>>>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>>>> are
>>>>>> you
>>>>>> running, and do you have any additional patches in use?
>>>>> The grub lines:
>>>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
>>>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>>>
>>>>> All of my experiments use c/s 22858 as a base.
>>>>> If you use an AMD Magny-Cours box for your experiments (socket C32 or
>>>>> G34),
>>>>> you should add the following patch (removing the line)
>>>>> --- a/xen/arch/x86/traps.c
>>>>> +++ b/xen/arch/x86/traps.c
>>>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>>>> __clear_bit(X86_FEATURE_SKINIT % 32,&c);
>>>>> __clear_bit(X86_FEATURE_WDT % 32,&c);
>>>>> __clear_bit(X86_FEATURE_LWP % 32,&c);
>>>>> - __clear_bit(X86_FEATURE_NODEID_MSR % 32,&c);
>>>>> __clear_bit(X86_FEATURE_TOPOEXT % 32,&c);
>>>>> break;
>>>>> case 5: /* MONITOR/MWAIT */
>>>>>
>>>>> This is not necessary (in fact that reverts my patch c/s 22815), but
>>>>> raises
>>>>> the probability of triggering the bug, probably because it increases the
>>>>> pressure on the Dom0 scheduler. If you cannot trigger it with Dom0,
>>>>> try to
>>>>> create a guest with many VCPUs and squeeze it into a small CPU-pool.
>>>>>
>>>>> Good luck ;-)
>>>>> Andre.
>>>>>
>>>>> --
>>>>> Andre Przywara
>>>>> AMD-OSRC (Dresden)
>>>>> Tel: x29712
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Xen-devel mailing list
>>>>> Xen-devel@lists.xensource.com
>>>>> http://lists.xensource.com/xen-devel
>>>>>
>>>
>>
>> --
>> Juergen Gross                 Principal Developer Operating Systems
>> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
>> Fujitsu Technology Solutions              e-mail:
>> juergen.gross@ts.fujitsu.com
>> Domagkstr. 28                           Internet: ts.fujitsu.com
>> D-80807 Muenchen                 Company details:
>> ts.fujitsu.com/imprint.html
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>


-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

[-- Attachment #2: numasplit.sh --]
[-- Type: text/plain, Size: 1778 bytes --]

#!/bin/sh
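# Split the machine into one cpupool per NUMA node ("create", or "create2"
# which removes all other nodes' CPUs from Pool-0 first), undo the split
# ("revert"), or move every CPU except CPU 0 out of / back into Pool-0
# ("remove" / "add").  Expects a cpupool config file named cpupool.test
# whose pool name matches ${NUMAPREFIX}<n>; uses ./ldxl if present,
# plain xl otherwise.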

XL=./ldxl
ROOTPOOL=Pool-0
NUMAPREFIX=Pool-node
numnodes=`xl info | sed -e 's/^nr_nodes *: \([0-9]*\)/\1/;t;d'`
numcores=`xl info | sed -e 's/^nr_cpus *: \([0-9]*\)/\1/;t;d'`

if [ ! -x ${XL} ]
then
	XL=xl
fi

if [ $# -gt 0 ];
then
	action=$1
else
	action=create
fi


if [ "$action" = "create" ]
then
	$XL cpupool-rename $ROOTPOOL ${NUMAPREFIX}0
	for i in `seq 1 $((numnodes-1))`
	do
		echo "Removing CPUs from Pool 0"
		$XL cpupool-cpu-remove ${NUMAPREFIX}0 node:$i
		echo "Rewriting config file"
		sed -i -e "s/${NUMAPREFIX}./${NUMAPREFIX}${i}/" cpupool.test
		echo "Creating new pool"
		$XL cpupool-create cpupool.test
		echo "Populating new pool"
		$XL cpupool-cpu-add ${NUMAPREFIX}${i} node:$i
	done
elif [ "$action" = "create2" ]
then
	$XL cpupool-rename $ROOTPOOL ${NUMAPREFIX}0
	echo "Removing CPUs from Pool 0"
	for i in `seq 1 $((numnodes-1))`
	do
		$XL cpupool-cpu-remove ${NUMAPREFIX}0 node:$i
	done
	for i in `seq 1 $((numnodes-1))`
	do
		echo "Rewriting config file"
		sed -i -e "s/${NUMAPREFIX}./${NUMAPREFIX}${i}/" cpupool.test
		echo "Creating new pool"
		$XL cpupool-create cpupool.test
		echo "Populating new pool"
		$XL cpupool-cpu-add ${NUMAPREFIX}${i} node:$i
	done
elif [ "$action" = "revert" ]
then
	for i in `seq 1 $((numnodes-1))`
	do
		echo "Destroying Pool $i"
		$XL cpupool-destroy ${NUMAPREFIX}${i}
		echo "adding freed CPUs to pool 0"
		$XL cpupool-cpu-add ${NUMAPREFIX}0 node:$i
	done
	$XL cpupool-rename ${NUMAPREFIX}0 $ROOTPOOL
elif [ "$action" = "remove" ]
then
	for i in `seq 1 $((numcores-1))`
	do
		echo "Removing CPU $i from Pool-0"
		$XL cpupool-cpu-remove $ROOTPOOL $i
	done
elif [ "$action" = "add" ]
then
	for i in `seq 1 $((numcores-1))`
	do
		echo "Removing CPU $i from Pool-0"
		$XL cpupool-cpu-add $ROOTPOOL $i
	done
fi


[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-21 10:00                                                                     ` Andre Przywara
@ 2011-02-21 13:19                                                                       ` Juergen Gross
  2011-02-21 14:45                                                                         ` Andre Przywara
  0 siblings, 1 reply; 53+ messages in thread
From: Juergen Gross @ 2011-02-21 13:19 UTC (permalink / raw)
  To: Andre Przywara; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

[-- Attachment #1: Type: text/plain, Size: 13351 bytes --]

On 02/21/11 11:00, Andre Przywara wrote:
> George Dunlap wrote:
>> Andre (and Juergen), can you try again with the attached patch?
>
> I applied this patch on top of 22931 and it did _not_ work.
> The crash occurred almost immediately after I started my script, so it is the
> same behaviour as without the patch.

Did you try my patch addressing races in the scheduler when moving cpus
between cpupools?
I've attached it again. For me it works quite well, while George's patch
seems not to be enough (machine hanging after some tests with cpupools).
OTOH I can't reproduce an error as fast as you even without any patch :-)

> (attached my script for reference, though it will most likely only make
> sense on bigger NUMA machines)

Yeah, on my 2-node system I need several hundred tries to get an error.
But it seems to be more effective than George's script.


Juergen

>
> Regards,
> Andre.
>
>
>> What the patch basically does is try to make "cpu_disable_scheduler()"
>> do what it seems to say it does. :-) Namely, the various
>> scheduler-related interrupts (both per-cpu ticks and the master tick)
>> are part of the scheduler, so disable them before doing anything, and
>> don't enable them until the cpu is really ready to go again.
>>
>> To be precise:
>> * cpu_disable_scheduler() disables ticks
>> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
>> and does it after inserting the idle vcpu
>> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
>> stop tickers
>> + Call tick_{resume,suspend} in cpu_{up,down}, respectively
>> * Modify credit1's tick_{suspend,resume} to handle the master ticker
>> as well.
>>
>> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
>> on one pcpu), I can perform thousands of operations successfully.
>>
>> (NB this is not ready for application yet, I just wanted to check to
>> see if it fixes Andre's problem)
>>
>> -George
>>
>> On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross
>> <juergen.gross@ts.fujitsu.com> wrote:
>>> Okay, I have some more data.
>>>
>>> I activated cpupool_dprintk() and included checks in sched_credit.c to
>>> test for weight inconsistencies. To reduce race possibilities I've added
>>> my patch to execute cpu assigning/unassigning always in a tasklet on the
>>> cpu to be moved.
>>>
>>> Here is the result:
>>>
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1)
>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
>>> (XEN) cpupool_assign_cpu(cpu=1) ret 0
>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4)
>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
>>> (XEN) cpupool_assign_cpu(cpu=4) ret 0
>>> (XEN) cpu 4, weight 0,prv ffff831002ad5e40, dom 0:
>>> (XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
>>> (XEN) Xen BUG at sched_credit.c:570
>>> (XEN) ----[ Xen-4.1.0-rc5-pre x86_64 debug=y Tainted: C ]----
>>> (XEN) CPU: 4
>>> (XEN) RIP: e008:[<ffff82c4801197d7>] csched_tick+0x186/0x37f
>>> (XEN) RFLAGS: 0000000000010086 CONTEXT: hypervisor
>>> (XEN) rax: 0000000000000000 rbx: ffff830839d3ec30 rcx: 0000000000000000
>>> (XEN) rdx: ffff830839dcff18 rsi: 000000000000000a rdi: ffff82c4802542e8
>>> (XEN) rbp: ffff830839dcfe38 rsp: ffff830839dcfde8 r8: 0000000000000004
>>> (XEN) r9: ffff82c480213520 r10: 00000000fffffffc r11: 0000000000000001
>>> (XEN) r12: 0000000000000004 r13: ffff830839d3ec40 r14: ffff831002ad5e40
>>> (XEN) r15: ffff830839d66f90 cr0: 000000008005003b cr4: 00000000000026f0
>>> (XEN) cr3: 0000001020a98000 cr2: 00007fc5e9b79d98
>>> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
>>> (XEN) Xen stack trace from rsp=ffff830839dcfde8:
>>> (XEN) ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246
>>> ffff830839d6c000
>>> (XEN) 0000000000000000 ffff830839dd1100 0000000000000004
>>> ffff82c480119651
>>> (XEN) ffff831002b28018 ffff831002b28010 ffff830839dcfe68
>>> ffff82c480126204
>>> (XEN) 0000000000000002 ffff83083ffa3bb8 ffff830839dd1100
>>> 000000cae439ea7e
>>> (XEN) ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20
>>> ffff830839dd1100
>>> (XEN) ffff831002b28010 0000000000000004 0000000000000004
>>> ffff82c4802b0880
>>> (XEN) ffff830839dcff18 ffffffffffffffff ffff830839dcfef8
>>> ffff82c480123647
>>> (XEN) ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98
>>> 00007fc5e9fa5b20
>>> (XEN) 0000000000000002 00007fff46826f20 ffff830839dcff08
>>> ffff82c4801236c2
>>> (XEN) 00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20
>>> 0000000000000002
>>> (XEN) 00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260
>>> 00007fff46826f50
>>> (XEN) 0000000000000246 0000000000000032 0000000000000000
>>> 00000000ffffffff
>>> (XEN) 0000000000000009 00007fc5e9d9de1a 0000000000000003
>>> 0000000000004848
>>> (XEN) 00007fc5e9b7a000 0000010000000000 ffffffff800073f0
>>> 000000000000e033
>>> (XEN) 0000000000000246 ffff880f97b51fc8 000000000000e02b
>>> 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 0000000000000000
>>> 0000000000000004
>>> (XEN) ffff830077eee000 00000043b9afd180 0000000000000000
>>> (XEN) Xen call trace:
>>> (XEN) [<ffff82c4801197d7>] csched_tick+0x186/0x37f
>>> (XEN) [<ffff82c480126204>] execute_timer+0x4e/0x6c
>>> (XEN) [<ffff82c480126539>] timer_softirq_action+0xf6/0x239
>>> (XEN) [<ffff82c480123647>] __do_softirq+0x88/0x99
>>> (XEN) [<ffff82c4801236c2>] do_softirq+0x6a/0x7a
>>> (XEN)
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 4:
>>> (XEN) Xen BUG at sched_credit.c:570
>>> (XEN) ****************************************
>>>
>>> As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The
>>> BUG_ON
>>> triggered in csched_acct() is a logical result of this.
>>>
>>> How this can happen I don't know yet.
>>> Anyone any idea? I'll keep searching...
>>>
>>>
>>> Juergen
>>>
>>> On 02/15/11 08:22, Juergen Gross wrote:
>>>> On 02/14/11 18:57, George Dunlap wrote:
>>>>> The good news is, I've managed to reproduce this on my local test
>>>>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>>>>> attached script. It's time to go home now, but I should be able to
>>>>> dig something up tomorrow.
>>>>>
>>>>> To use the script:
>>>>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>>>>> * You can modify elements by adding "arg=val" as arguments.
>>>>> * Arguments are:
>>>>> + dryrun={true,false} Do the work, but don't actually execute any xl
>>>>> arguments. Default false.
>>>>> + left: Number of commands to execute. Default 10.
>>>>> + maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>>>>> 8 cpus).
>>>>> + verbose={true,false} Print what you're doing. Default is true.
>>>>>
>>>>> The script sometimes attempts to remove the last cpu from cpupool0; in
>>>>> this case, libxl will print an error. If the script gets an error
>>>>> under that condition, it will ignore it; under any other condition, it
>>>>> will print diagnostic information.
>>>>>
>>>>> What finally crashed it for me was this command:
>>>>> # ./cpupool-test.sh verbose=false left=1000
>>>> Nice!
>>>> With your script I finally managed to get the error, too. On my box (2
>>>> sockets with 6 cores each) I had to use
>>>>
>>>> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>>>>
>>>> to trigger it.
>>>> Looking for more data now...
>>>>
>>>>
>>>> Juergen
>>>>
>>>>> -George
>>>>>
>>>>> On Fri, Feb 11, 2011 at 7:39 AM, Andre
>>>>> Przywara<andre.przywara@amd.com> wrote:
>>>>>> Juergen Gross wrote:
>>>>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>>>> Andre Przywara wrote:
>>>>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>>>> Andre, George,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What seems to be interesting: I think the problem did always
>>>>>>>>>>> occur
>>>>>>>>>>> when
>>>>>>>>>>> a new cpupool was created and the first cpu was moved to it.
>>>>>>>>>>>
>>>>>>>>>>> I think my previous assumption regarding the master_ticker
>>>>>>>>>>> was not
>>>>>>>>>>> too bad.
>>>>>>>>>>> I think somehow the master_ticker of the new cpupool is becoming
>>>>>>>>>>> active
>>>>>>>>>>> before the scheduler is really initialized properly. This could
>>>>>>>>>>> happen, if
>>>>>>>>>>> enough time is spent between alloc_pdata for the cpu to be moved
>>>>>>>>>>> and
>>>>>>>>>>> the
>>>>>>>>>>> critical section in schedule_cpu_switch().
>>>>>>>>>>>
>>>>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>>>>> scheduler is
>>>>>>>>>>> ready for them.
>>>>>>>>>>>
>>>>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>>>>> suspend_ticker
>>>>>>>>>>> as well? I still see potential problems for entering deep
>>>>>>>>>>> C-States.
>>>>>>>>>>> I think
>>>>>>>>>>> I'll prepare a patch which will keep the master_ticker active
>>>>>>>>>>> for the
>>>>>>>>>>> C-State case and migrate it for the schedule_cpu_switch() case.
>>>>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine
>>>>>>>>>> without any
>>>>>>>>>> problems.
>>>>>>>>>> Andre, could you give it a try?
>>>>>>>>> Did, but unfortunately it crashed as always. Tried twice and made
>>>>>>>>> sure
>>>>>>>>> I booted the right kernel. Sorry.
>>>>>>>>> The idea with the race between the timer and the state changing
>>>>>>>>> sounded very appealing, actually that was suspicious to me from
>>>>>>>>> the
>>>>>>>>> beginning.
>>>>>>>>>
>>>>>>>>> I will add some code to dump the state of all cpupools to the
>>>>>>>>> BUG_ON
>>>>>>>>> to see in which situation we are when the bug triggers.
>>>>>>>> OK, here is a first try of this, the patch iterates over all CPU
>>>>>>>> pools
>>>>>>>> and outputs some data if the BUG_ON
>>>>>>>> ((sdom->weight * sdom->active_vcpu_count)> weight_left) condition
>>>>>>>> triggers:
>>>>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask:
>>>>>>>> fffffffc003f
>>>>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>>>>> ....
>>>>>>>> The masks look proper (6 cores per node), the bug triggers when the
>>>>>>>> first CPU is about to be(?) inserted.
>>>>>>> Sure? I'm missing the cpu with mask 2000.
>>>>>>> I'll try to reproduce the problem on a larger machine here (24
>>>>>>> cores, 4
>>>>>>> numa
>>>>>>> nodes).
>>>>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>>>>> are
>>>>>>> you
>>>>>>> running, and do you have any additional patches in use?
>>>>>> The grub lines:
>>>>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga
>>>>>> com1=115200
>>>>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>>>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>>>>
>>>>>> All of my experiments use c/s 22858 as a base.
>>>>>> If you use an AMD Magny-Cours box for your experiments (socket C32 or
>>>>>> G34),
>>>>>> you should add the following patch (removing the line)
>>>>>> --- a/xen/arch/x86/traps.c
>>>>>> +++ b/xen/arch/x86/traps.c
>>>>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>>>>> __clear_bit(X86_FEATURE_SKINIT % 32,&c);
>>>>>> __clear_bit(X86_FEATURE_WDT % 32,&c);
>>>>>> __clear_bit(X86_FEATURE_LWP % 32,&c);
>>>>>> - __clear_bit(X86_FEATURE_NODEID_MSR % 32,&c);
>>>>>> __clear_bit(X86_FEATURE_TOPOEXT % 32,&c);
>>>>>> break;
>>>>>> case 5: /* MONITOR/MWAIT */
>>>>>>
>>>>>> This is not necessary (in fact that reverts my patch c/s 22815), but
>>>>>> raises
>>>>>> the probability of triggering the bug, probably because it increases the
>>>>>> pressure on the Dom0 scheduler. If you cannot trigger it with Dom0,
>>>>>> try to
>>>>>> create a guest with many VCPUs and squeeze it into a small CPU-pool.
>>>>>>
>>>>>> Good luck ;-)
>>>>>> Andre.
>>>>>>
>>>>>> --
>>>>>> Andre Przywara
>>>>>> AMD-OSRC (Dresden)
>>>>>> Tel: x29712
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Xen-devel mailing list
>>>>>> Xen-devel@lists.xensource.com
>>>>>> http://lists.xensource.com/xen-devel
>>>>>>
>>>>
>>>
>>> --
>>> Juergen Gross Principal Developer Operating Systems
>>> TSP ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967
>>> Fujitsu Technology Solutions e-mail:
>>> juergen.gross@ts.fujitsu.com
>>> Domagkstr. 28 Internet: ts.fujitsu.com
>>> D-80807 Muenchen Company details:
>>> ts.fujitsu.com/imprint.html
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>
>
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel


-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

[-- Attachment #2: cpupool-race.patch --]
[-- Type: text/x-patch, Size: 2617 bytes --]

diff -r 72470de157ce xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Wed Feb 16 09:49:33 2011 +0000
+++ b/xen/common/sched_credit.c	Wed Feb 16 15:09:54 2011 +0100
@@ -1268,7 +1268,8 @@ csched_load_balance(struct csched_privat
         /*
          * Any work over there to steal?
          */
-        speer = csched_runq_steal(peer_cpu, cpu, snext->pri);
+        speer = cpu_isset(peer_cpu, *online) ?
+            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
         pcpu_schedule_unlock(peer_cpu);
         if ( speer != NULL )
         {
diff -r 72470de157ce xen/common/schedule.c
--- a/xen/common/schedule.c	Wed Feb 16 09:49:33 2011 +0000
+++ b/xen/common/schedule.c	Wed Feb 16 15:09:54 2011 +0100
@@ -395,7 +395,28 @@ static void vcpu_migrate(struct vcpu *v)
     unsigned long flags;
     int old_cpu, new_cpu;
 
-    vcpu_schedule_lock_irqsave(v, flags);
+    for (;;)
+    {
+        vcpu_schedule_lock_irqsave(v, flags);
+
+        /* Select new CPU. */
+        old_cpu = v->processor;
+        new_cpu = SCHED_OP(VCPU2OP(v), pick_cpu, v);
+
+        if ( new_cpu == old_cpu )
+            break;
+
+        if ( !pcpu_schedule_trylock(new_cpu) )
+        {
+            vcpu_schedule_unlock_irqrestore(v, flags);
+            continue;
+        }
+        if ( cpu_isset(new_cpu, v->domain->cpupool->cpu_valid) )
+            break;
+
+        pcpu_schedule_unlock(new_cpu);
+        vcpu_schedule_unlock_irqrestore(v, flags);
+    }
 
     /*
      * NB. Check of v->running happens /after/ setting migration flag
@@ -405,13 +426,12 @@ static void vcpu_migrate(struct vcpu *v)
     if ( v->is_running ||
          !test_and_clear_bit(_VPF_migrating, &v->pause_flags) )
     {
+        if ( old_cpu != new_cpu )
+            pcpu_schedule_unlock(new_cpu);
+
         vcpu_schedule_unlock_irqrestore(v, flags);
         return;
     }
-
-    /* Select new CPU. */
-    old_cpu = v->processor;
-    new_cpu = SCHED_OP(VCPU2OP(v), pick_cpu, v);
 
     /*
      * Transfer urgency status to new CPU before switching CPUs, as once
@@ -424,9 +444,13 @@ static void vcpu_migrate(struct vcpu *v)
         atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count);
     }
 
-    /* Switch to new CPU, then unlock old CPU.  This is safe because
+    /* Switch to new CPU, then unlock new and old CPU.  This is safe because
      * the lock pointer cant' change while the current lock is held. */
     v->processor = new_cpu;
+
+    if ( old_cpu != new_cpu )
+        pcpu_schedule_unlock(new_cpu);
+
     spin_unlock_irqrestore(
         per_cpu(schedule_data, old_cpu).schedule_lock, flags);
 

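A note on the two hunks above: the sched_credit.c change simply refuses to
steal work from a peer cpu that is not (or no longer) online in the pool,
while the schedule.c change replaces vcpu_migrate()'s single lock acquisition
with a retry loop. The new cpu is picked while the vcpu's schedule lock is
held, the target pcpu's schedule lock is then try-locked, and the choice is
only accepted if that cpu is still valid for the domain's cpupool. A
condensed, commented restatement of that loop follows; it is illustrative
only, the patch itself is authoritative:

    for ( ;; )
    {
        /* The cpu choice below must not race with a cpupool change
         * affecting this domain, so hold the vcpu's schedule lock. */
        vcpu_schedule_lock_irqsave(v, flags);

        old_cpu = v->processor;
        new_cpu = SCHED_OP(VCPU2OP(v), pick_cpu, v);

        /* Staying on the same cpu: nothing further to lock. */
        if ( new_cpu == old_cpu )
            break;

        /* Try-lock the target pcpu; blocking on it while already holding
         * the vcpu lock risks a lock-order deadlock, so back off and
         * retry instead. */
        if ( !pcpu_schedule_trylock(new_cpu) )
        {
            vcpu_schedule_unlock_irqrestore(v, flags);
            continue;
        }

        /* Accept new_cpu only if it is still part of the domain's
         * cpupool; otherwise drop both locks and pick again. */
        if ( cpu_isset(new_cpu, v->domain->cpupool->cpu_valid) )
            break;

        pcpu_schedule_unlock(new_cpu);
        vcpu_schedule_unlock_irqrestore(v, flags);
    }

    /* From here on the vcpu lock and, if new_cpu != old_cpu, the target
     * pcpu lock are held; every exit path has to drop both, which is what
     * the remaining hunks add. */
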
[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-21 13:19                                                                       ` Juergen Gross
@ 2011-02-21 14:45                                                                         ` Andre Przywara
  2011-02-21 14:50                                                                           ` Juergen Gross
  0 siblings, 1 reply; 53+ messages in thread
From: Andre Przywara @ 2011-02-21 14:45 UTC (permalink / raw)
  To: Juergen Gross; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

Juergen Gross wrote:
> On 02/21/11 11:00, Andre Przywara wrote:
>> George Dunlap wrote:
>>> Andre (and Juergen), can you try again with the attached patch?
>> I applied this patch on top of 22931 and it did _not_ work.
>> The crash occurred almost immediately after I started my script, so the
>> same behaviour as without the patch.
> 
> Did you try my patch addressing races in the scheduler when moving cpus
> between cpupools?
Sorry, I tried yours first, but it didn't apply cleanly to my particular 
tree (sched_jg_fix ;-), so I ended up testing George's instead.

> I've attached it again. For me it works quite well, while George's patch
> seems not to be enough (machine hanging after some tests with cpupools).
OK, it now applied after a rebase.
And yes, I didn't see a crash! At least not until the script stopped while 
a lot of these messages appeared:
(XEN) do_IRQ: 0.89 No irq handler for vector (irq -1)

That is what I reported before and is most probably totally unrelated to 
this issue.
So I consider this fix working!
I will try to match my recent theories and debug results with your patch 
to see whether this fits.

> OTOH I can't reproduce an error as fast as you even without any patch :-)
> 
>> (attached my script for reference, though it will most likely only make
>> sense on bigger NUMA machines)
> 
> Yeah, on my 2-node system I need several hundred tries to get an error.
> But it seems to be more effective than George's script.
I consider the large over-provisioning to be the reason. With Dom0's 48 
VCPUs squashed together onto just 6 pCPUs, my script triggered the bug by 
the second run at the latest.
With your patch it made 24 iterations before the other bug kicked in.

Thanks very much!
Andre.

> 
> 
> Juergen
> 
>> Regards,
>> Andre.
>>
>>
>>> What the patch basically does is try to make "cpu_disable_scheduler()"
>>> do what it seems to say it does. :-) Namely, the various
>>> scheduler-related interrupts (both per-cpu ticks and the master tick)
>>> are part of the scheduler, so disable them before doing anything, and
>>> don't enable them until the cpu is really ready to go again.
>>>
>>> To be precise:
>>> * cpu_disable_scheduler() disables ticks
>>> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
>>> and does it after inserting the idle vcpu
>>> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
>>> stop tickers
>>> + Call tick_{resume,suspend} in cpu_{up,down}, respectively
>>> * Modify credit1's tick_{suspend,resume} to handle the master ticker
>>> as well.
>>>
>>> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
>>> on one pcpu), I can perform thousands of operations successfully.
>>>
>>> (NB this is not ready for application yet, I just wanted to check to
>>> see if it fixes Andre's problem)
>>>
>>> -George
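Sketched as code, the ordering George describes above would look roughly
like the following. SCHED_OP and the tick_suspend/tick_resume hooks are
named in the thread; everything else here (function signatures, parameter
names) is an assumption for illustration, not the actual patch:

    /* cpu_disable_scheduler(): quiesce the scheduler's interrupt sources
     * for this cpu before any vcpu is migrated away from it. */
    static void cpu_disable_scheduler_sketch(unsigned int cpu,
                                             const struct scheduler *old_ops)
    {
        SCHED_OP(old_ops, tick_suspend, cpu);   /* per-cpu and master tick */
        /* ... move the vcpus off the cpu as before ... */
    }

    /* schedule_cpu_switch(): the tickers stay suspended across the switch
     * and are re-enabled only when the cpu is added to a pool, after its
     * idle vcpu is in place, so no tick can run against a half-initialized
     * scheduler instance. */
    static void schedule_cpu_switch_sketch(unsigned int cpu,
                                           struct cpupool *c,
                                           const struct scheduler *new_ops)
    {
        /* ... swap per-cpu scheduler data and the idle vcpu under the
         *     schedule lock ... */
        if ( c != NULL )                        /* adding, not removing */
            SCHED_OP(new_ops, tick_resume, cpu);
    }
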
>>>
>>> On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross
>>> <juergen.gross@ts.fujitsu.com> wrote:
>>>> Okay, I have some more data.
>>>>
>>>> I activated cpupool_dprintk() and included checks in sched_credit.c to
>>>> test for weight inconsistencies. To reduce race possibilities I've added
>>>> my patch that always executes cpu assigning/unassigning in a tasklet on
>>>> the cpu to be moved.
>>>>
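Such a weight check could be sketched roughly as below. sdom->weight and
sdom->active_vcpu_count are the fields shown in the output that follows; the
active-domain list and the cached prv->weight total are assumptions about the
credit scheduler's private data, and this is not the actual debug patch:

    /* Recompute the aggregate weight of all active domains of one credit
     * scheduler instance and compare it with the incrementally maintained
     * total; a mismatch means the weight accounting got out of sync. */
    static void csched_check_weights(struct csched_private *prv)
    {
        struct csched_dom *sdom;
        unsigned int total = 0;

        list_for_each_entry( sdom, &prv->active_sdom, active_sdom_elem )
            total += sdom->weight * sdom->active_vcpu_count;

        if ( total != prv->weight )
        {
            printk("credit: weight mismatch: computed %u, cached %u\n",
                   total, prv->weight);
            BUG();
        }
    }
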
>>>> Here is the result:
>>>>
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1)
>>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
>>>> (XEN) cpupool_assign_cpu(cpu=1) ret 0
>>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4)
>>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
>>>> (XEN) cpupool_assign_cpu(cpu=4) ret 0
>>>> (XEN) cpu 4, weight 0,prv ffff831002ad5e40, dom 0:
>>>> (XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
>>>> (XEN) Xen BUG at sched_credit.c:570
>>>> (XEN) ----[ Xen-4.1.0-rc5-pre x86_64 debug=y Tainted: C ]----
>>>> (XEN) CPU: 4
>>>> (XEN) RIP: e008:[<ffff82c4801197d7>] csched_tick+0x186/0x37f
>>>> (XEN) RFLAGS: 0000000000010086 CONTEXT: hypervisor
>>>> (XEN) rax: 0000000000000000 rbx: ffff830839d3ec30 rcx: 0000000000000000
>>>> (XEN) rdx: ffff830839dcff18 rsi: 000000000000000a rdi: ffff82c4802542e8
>>>> (XEN) rbp: ffff830839dcfe38 rsp: ffff830839dcfde8 r8: 0000000000000004
>>>> (XEN) r9: ffff82c480213520 r10: 00000000fffffffc r11: 0000000000000001
>>>> (XEN) r12: 0000000000000004 r13: ffff830839d3ec40 r14: ffff831002ad5e40
>>>> (XEN) r15: ffff830839d66f90 cr0: 000000008005003b cr4: 00000000000026f0
>>>> (XEN) cr3: 0000001020a98000 cr2: 00007fc5e9b79d98
>>>> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
>>>> (XEN) Xen stack trace from rsp=ffff830839dcfde8:
>>>> (XEN) ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246
>>>> ffff830839d6c000
>>>> (XEN) 0000000000000000 ffff830839dd1100 0000000000000004
>>>> ffff82c480119651
>>>> (XEN) ffff831002b28018 ffff831002b28010 ffff830839dcfe68
>>>> ffff82c480126204
>>>> (XEN) 0000000000000002 ffff83083ffa3bb8 ffff830839dd1100
>>>> 000000cae439ea7e
>>>> (XEN) ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20
>>>> ffff830839dd1100
>>>> (XEN) ffff831002b28010 0000000000000004 0000000000000004
>>>> ffff82c4802b0880
>>>> (XEN) ffff830839dcff18 ffffffffffffffff ffff830839dcfef8
>>>> ffff82c480123647
>>>> (XEN) ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98
>>>> 00007fc5e9fa5b20
>>>> (XEN) 0000000000000002 00007fff46826f20 ffff830839dcff08
>>>> ffff82c4801236c2
>>>> (XEN) 00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20
>>>> 0000000000000002
>>>> (XEN) 00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260
>>>> 00007fff46826f50
>>>> (XEN) 0000000000000246 0000000000000032 0000000000000000
>>>> 00000000ffffffff
>>>> (XEN) 0000000000000009 00007fc5e9d9de1a 0000000000000003
>>>> 0000000000004848
>>>> (XEN) 00007fc5e9b7a000 0000010000000000 ffffffff800073f0
>>>> 000000000000e033
>>>> (XEN) 0000000000000246 ffff880f97b51fc8 000000000000e02b
>>>> 0000000000000000
>>>> (XEN) 0000000000000000 0000000000000000 0000000000000000
>>>> 0000000000000004
>>>> (XEN) ffff830077eee000 00000043b9afd180 0000000000000000
>>>> (XEN) Xen call trace:
>>>> (XEN) [<ffff82c4801197d7>] csched_tick+0x186/0x37f
>>>> (XEN) [<ffff82c480126204>] execute_timer+0x4e/0x6c
>>>> (XEN) [<ffff82c480126539>] timer_softirq_action+0xf6/0x239
>>>> (XEN) [<ffff82c480123647>] __do_softirq+0x88/0x99
>>>> (XEN) [<ffff82c4801236c2>] do_softirq+0x6a/0x7a
>>>> (XEN)
>>>> (XEN)
>>>> (XEN) ****************************************
>>>> (XEN) Panic on CPU 4:
>>>> (XEN) Xen BUG at sched_credit.c:570
>>>> (XEN) ****************************************
>>>>
>>>> As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The
>>>> BUG_ON triggered in csched_acct() is a logical result of this.
>>>>
>>>> How this can happen I don't know yet.
>>>> Anyone any idea? I'll keep searching...
>>>>
>>>>
>>>> Juergen
>>>>
>>>> On 02/15/11 08:22, Juergen Gross wrote:
>>>>> On 02/14/11 18:57, George Dunlap wrote:
>>>>>> The good news is, I've managed to reproduce this on my local test
>>>>>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>>>>>> attached script. It's time to go home now, but I should be able to
>>>>>> dig something up tomorrow.
>>>>>>
>>>>>> To use the script:
>>>>>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>>>>>> * You can modify elements by adding "arg=val" as arguments.
>>>>>> * Arguments are:
>>>>>> + dryrun={true,false} Do the work, but don't actually execute any xl
>>>>>> commands. Default false.
>>>>>> + left: Number of commands to execute. Default 10.
>>>>>> + maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>>>>>> 8 cpus).
>>>>>> + verbose={true,false} Print what you're doing. Default is true.
>>>>>>
>>>>>> The script sometimes attempts to remove the last cpu from cpupool0; in
>>>>>> this case, libxl will print an error. If the script gets an error
>>>>>> under that condition, it will ignore it; under any other condition, it
>>>>>> will print diagnostic information.
>>>>>>
>>>>>> What finally crashed it for me was this command:
>>>>>> # ./cpupool-test.sh verbose=false left=1000
>>>>> Nice!
>>>>> With your script I finally managed to get the error, too. On my box
>>>>> (2 sockets with 6 cores each) I had to use
>>>>>
>>>>> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>>>>>
>>>>> to trigger it.
>>>>> Looking for more data now...
>>>>>
>>>>>
>>>>> Juergen
>>>>>
>>>>>> -George
>>>>>>
>>>>>> On Fri, Feb 11, 2011 at 7:39 AM, Andre
>>>>>> Przywara<andre.przywara@amd.com> wrote:
>>>>>>> Juergen Gross wrote:
>>>>>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>>>>> Andre Przywara wrote:
>>>>>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>>>>> Andre, George,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> What seems to be interesting: I think the problem always occurred
>>>>>>>>>>>> when a new cpupool was created and the first cpu was moved to it.
>>>>>>>>>>>>
>>>>>>>>>>>> I think my previous assumption regarding the master_ticker was
>>>>>>>>>>>> not too bad.
>>>>>>>>>>>> I think somehow the master_ticker of the new cpupool is becoming
>>>>>>>>>>>> active before the scheduler is really initialized properly. This
>>>>>>>>>>>> could happen if enough time is spent between alloc_pdata for the
>>>>>>>>>>>> cpu to be moved and the critical section in schedule_cpu_switch().
>>>>>>>>>>>>
>>>>>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>>>>>> scheduler is ready for them.
>>>>>>>>>>>>
>>>>>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>>>>>> suspend_ticker as well? I still see potential problems for
>>>>>>>>>>>> entering deep C-States. I think I'll prepare a patch which will
>>>>>>>>>>>> keep the master_ticker active for the C-State case and migrate
>>>>>>>>>>>> it for the schedule_cpu_switch() case.
>>>>>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine
>>>>>>>>>>> without any
>>>>>>>>>>> problems.
>>>>>>>>>>> Andre, could you give it a try?
>>>>>>>>>> Did, but unfortunately it crashed as always. Tried twice and made
>>>>>>>>>> sure I booted the right kernel. Sorry.
>>>>>>>>>> The idea of a race between the timer and the state change sounded
>>>>>>>>>> very appealing; actually, that was suspicious to me from the
>>>>>>>>>> beginning.
>>>>>>>>>>
>>>>>>>>>> I will add some code to dump the state of all cpupools at the
>>>>>>>>>> BUG_ON to see which situation we are in when the bug triggers.
>>>>>>>>> OK, here is a first try at this: the patch iterates over all CPU
>>>>>>>>> pools and outputs some data if the BUG_ON
>>>>>>>>> ((sdom->weight * sdom->active_vcpu_count) > weight_left) condition
>>>>>>>>> triggers:
>>>>>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask:
>>>>>>>>> fffffffc003f
>>>>>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>>>>>> ....
>>>>>>>>> The masks look proper (6 cores per node); the bug triggers when the
>>>>>>>>> first CPU is about to be(?) inserted.
>>>>>>>> Sure? I'm missing the cpu with mask 2000.
>>>>>>>> I'll try to reproduce the problem on a larger machine here (24
>>>>>>>> cores, 4
>>>>>>>> numa
>>>>>>>> nodes).
>>>>>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>>>>>> are
>>>>>>>> you
>>>>>>>> running, and do you have any additional patches in use?
>>>>>>> The grub lines:
>>>>>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga
>>>>>>> com1=115200
>>>>>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>>>>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>>>>>
>>>>>>> All of my experiments use c/s 22858 as a base.
>>>>>>> If you use an AMD Magny-Cours box for your experiments (socket C32 or
>>>>>>> G34),
>>>>>>> you should add the following patch (removing the line)
>>>>>>> --- a/xen/arch/x86/traps.c
>>>>>>> +++ b/xen/arch/x86/traps.c
>>>>>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>>>>>> __clear_bit(X86_FEATURE_SKINIT % 32,&c);
>>>>>>> __clear_bit(X86_FEATURE_WDT % 32,&c);
>>>>>>> __clear_bit(X86_FEATURE_LWP % 32,&c);
>>>>>>> - __clear_bit(X86_FEATURE_NODEID_MSR % 32,&c);
>>>>>>> __clear_bit(X86_FEATURE_TOPOEXT % 32,&c);
>>>>>>> break;
>>>>>>> case 5: /* MONITOR/MWAIT */
>>>>>>>
>>>>>>> This is not necessary (in fact it reverts my patch c/s 22815), but it
>>>>>>> raises the probability of triggering the bug, probably because it
>>>>>>> increases the pressure on the Dom0 scheduler. If you cannot trigger it
>>>>>>> with Dom0, try to create a guest with many VCPUs and squeeze it into a
>>>>>>> small CPU-pool.
>>>>>>>
>>>>>>> Good luck ;-)
>>>>>>> Andre.
>>>>>>>
>>>>>>> --
>>>>>>> Andre Przywara
>>>>>>> AMD-OSRC (Dresden)
>>>>>>> Tel: x29712
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Xen-devel mailing list
>>>>>>> Xen-devel@lists.xensource.com
>>>>>>> http://lists.xensource.com/xen-devel
>>>> --
>>>> Juergen Gross Principal Developer Operating Systems
>>>> TSP ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967
>>>> Fujitsu Technology Solutions e-mail:
>>>> juergen.gross@ts.fujitsu.com
>>>> Domagkstr. 28 Internet: ts.fujitsu.com
>>>> D-80807 Muenchen Company details:
>>>> ts.fujitsu.com/imprint.html
>>>>
>>>> _______________________________________________
>>>> Xen-devel mailing list
>>>> Xen-devel@lists.xensource.com
>>>> http://lists.xensource.com/xen-devel
>>>>
>>
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
> 
> 
> --
> Juergen Gross                 Principal Developer Operating Systems
> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
> Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
> Domagkstr. 28                           Internet: ts.fujitsu.com
> D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
> 


-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Hypervisor crash(!) on xl cpupool-numa-split
  2011-02-21 14:45                                                                         ` Andre Przywara
@ 2011-02-21 14:50                                                                           ` Juergen Gross
  0 siblings, 0 replies; 53+ messages in thread
From: Juergen Gross @ 2011-02-21 14:50 UTC (permalink / raw)
  To: Andre Przywara; +Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

On 02/21/11 15:45, Andre Przywara wrote:
> Juergen Gross wrote:
>> On 02/21/11 11:00, Andre Przywara wrote:
>>> George Dunlap wrote:
>>>> Andre (and Juergen), can you try again with the attached patch?
>>> I applied this patch on top of 22931 and it did _not_ work.
>>> The crash occurred almost immediately after I started my script, so the
>>> same behaviour as without the patch.
>>
>> Did you try my patch addressing races in the scheduler when moving cpus
>> between cpupools?
> Sorry, I tried yours first, but it didn't apply cleanly to my particular
> tree (sched_jg_fix ;-), so I ended up testing George's instead.
>
>> I've attached it again. For me it works quite well, while George's patch
>> seems not to be enough (machine hanging after some tests with cpupools).
> OK, it now applied after a rebase.
> And yes, I didn't see a crash! At least not until the script stopped while
> a lot of these messages appeared:
> (XEN) do_IRQ: 0.89 No irq handler for vector (irq -1)
>
> That is what I reported before and is most probably totally unrelated to
> this issue.
> So I consider this fix working!
> I will try to match my recent theories and debug results with your patch
> to see whether this fits.
>
>> OTOH I can't reproduce an error as fast as you even without any patch :-)
>>
>>> (attached my script for reference, though it will most likely only make
>>> sense on bigger NUMA machines)
>>
>> Yeah, on my 2-node system I need several hundred tries to get an error.
>> But it seems to be more effective than George's script.
> I consider the large over-provisioning to be the reason. With Dom0's 48
> VCPUs squashed together onto just 6 pCPUs, my script triggered the bug by
> the second run at the latest.
> With your patch it made 24 iterations before the other bug kicked in.

Okay, I'll prepare an official patch. It might take a few days, as I'm not in
the office until Thursday.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2011-02-21 14:50 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-27 23:18 Hypervisor crash(!) on xl cpupool-numa-split Andre Przywara
2011-01-28  6:47 ` Juergen Gross
2011-01-28 11:07   ` Andre Przywara
2011-01-28 11:44     ` Juergen Gross
2011-01-28 13:14       ` Andre Przywara
2011-01-31  7:04         ` Juergen Gross
2011-01-31 14:59           ` Andre Przywara
2011-01-31 15:28             ` George Dunlap
2011-02-01 16:32               ` Andre Przywara
2011-02-02  6:27                 ` Juergen Gross
2011-02-02  8:49                   ` Juergen Gross
2011-02-02 10:05                     ` Juergen Gross
2011-02-02 10:59                       ` Andre Przywara
2011-02-02 14:39                 ` Stephan Diestelhorst
2011-02-02 15:14                   ` Juergen Gross
2011-02-02 16:01                     ` Stephan Diestelhorst
2011-02-03  5:57                       ` Juergen Gross
2011-02-03  9:18                         ` Juergen Gross
2011-02-04 14:09                           ` Andre Przywara
2011-02-07 12:38                             ` Andre Przywara
2011-02-07 13:32                               ` Juergen Gross
2011-02-07 15:55                                 ` George Dunlap
2011-02-08  5:43                                   ` Juergen Gross
2011-02-08 12:08                                     ` George Dunlap
2011-02-08 12:14                                       ` George Dunlap
2011-02-08 16:33                                         ` Andre Przywara
2011-02-09 12:27                                           ` George Dunlap
2011-02-09 12:27                                             ` George Dunlap
2011-02-09 13:04                                               ` Juergen Gross
2011-02-09 13:39                                                 ` Andre Przywara
2011-02-09 13:51                                               ` Andre Przywara
2011-02-09 14:21                                                 ` Juergen Gross
2011-02-10  6:42                                                   ` Juergen Gross
2011-02-10  9:25                                                     ` Andre Przywara
2011-02-10 14:18                                                       ` Andre Przywara
2011-02-11  6:17                                                         ` Juergen Gross
2011-02-11  7:39                                                           ` Andre Przywara
2011-02-14 17:57                                                             ` George Dunlap
2011-02-15  7:22                                                               ` Juergen Gross
2011-02-16  9:47                                                                 ` Juergen Gross
2011-02-16 13:54                                                                   ` George Dunlap
     [not found]                                                                     ` <4D6237C6.1050206@amd.com>
2011-02-16 14:11                                                                     ` Juergen Gross
2011-02-16 14:28                                                                       ` Juergen Gross
2011-02-17  0:05                                                                       ` André Przywara
2011-02-17  7:05                                                                     ` Juergen Gross
2011-02-17  9:11                                                                       ` Juergen Gross
2011-02-21 10:00                                                                     ` Andre Przywara
2011-02-21 13:19                                                                       ` Juergen Gross
2011-02-21 14:45                                                                         ` Andre Przywara
2011-02-21 14:50                                                                           ` Juergen Gross
2011-02-08 12:23                                       ` Juergen Gross
2011-01-28 11:13   ` George Dunlap
2011-01-28 13:05     ` Andre Przywara
