* [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
@ 2011-07-07 10:22 Mahesh J Salgaonkar
  2011-07-07 10:59 ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Mahesh J Salgaonkar @ 2011-07-07 10:22 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev; +Cc: a.p.zijlstra, mingo, anton, benh, torvalds

Hi,

linux-3.0-rc fails to boot on a POWER7 system with 1TB of RAM and 896 CPUs.
The initial boot log shows a soft lockup [1], after which the machine hangs.
Dropping into xmon shows that all the CPUs are stuck at:
--------------------
cpu 0xa: Vector: 100 (System Reset) at [c000000fae51fae0]
    pc: c0000000000596b8: .plpar_hcall_norets+0x80/0xd0
    lr: c00000000005b9a4: .pseries_dedicated_idle_sleep+0x194/0x210
    sp: c000000fae51fd60
   msr: 8000000000089032
  current = 0xc000000fae49d990
  paca    = 0xc00000000ebb1900
    pid   = 0, comm = kworker/0:1
cpu 0x41: Vector: 100 (System Reset) at [c000000fac01bae0]
    pc: c0000000000596b8: .plpar_hcall_norets+0x80/0xd0
    lr: c00000000005b9a4: .pseries_dedicated_idle_sleep+0x194/0x210
    sp: c000000fac01bd60
   msr: 8000000000089032
  current = 0xc000000faefbf210
  paca    = 0xc00000000ebba280
    pid   = 0, comm = kworker/0:1
cpu 0x21: Vector: 100 (System Reset) at [c000000fae9abae0]
    pc: c0000000000596b8: .plpar_hcall_norets+0x80/0xd0
    lr: c00000000005b9a4: .pseries_dedicated_idle_sleep+0x194/0x210
    sp: c000000fae9abd60
   msr: 8000000000089032
  current = 0xc000000fae998590
  paca    = 0xc00000000ebb5280
    pid   = 0, comm = kworker/0:1
cpu 0xb8: Vector: 100 (System Reset) at [c000000fab3dbae0]
    pc: c0000000000596b8: .plpar_hcall_norets+0x80/0xd0
    lr: c00000000005b9a4: .pseries_dedicated_idle_sleep+0x194/0x210
    sp: c000000fab3dbd60
   msr: 8000000000089032
  current = 0xc000000fab3a2710
  paca    = 0xc00000000ebccc00
    pid   = 0, comm = kworker/0:1
......
......
The same is shown for all the CPUs.
a:mon> t
[link register   ] c00000000005b9a4 .pseries_dedicated_idle_sleep+0x194/0x210
[c000000fae51fd60] 00000000134d0000 (unreliable)
[c000000fae51fe20] c000000000018b64 .cpu_idle+0x164/0x210
[c000000fae51fed0] c0000000005d55b0 .start_secondary+0x348/0x354
[c000000fae51ff90] c000000000009268 .start_secondary_prolog+0x10/0x14
a:mon> S
msr  = 8000000000001032  sprg0= 0000000000000000
pvr  = 00000000003f0201  sprg1= c00000000ebb1900
dec  = 0000000030fb5b4f  sprg2= c00000000ebb1900
sp   = c000000fae51f440  sprg3= 000000000000000a
toc  = c000000000e21f90  dar  = c000011aee0c20e8
a:mon>
--------------------

2.6.39 booted fine on the system and a git bisect shows commit cd4ea6ae -
"sched: Change NODE sched_domain group creation" as the cause.

Thanks,
-Mahesh.

[1]:
POWER7 performance monitor hardware support registered
Brought up 896 CPUs
Enabling Asymmetric SMT scheduling
BUG: soft lockup - CPU#0 stuck for 22s! [swapper:1]
Modules linked in:
NIP: c000000000074b90 LR: c00000000008a1c4 CTR: 0000000000000000
REGS: c000000fae25f9c0 TRAP: 0901   Not tainted  (3.0.0-rc6)
MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24000088  XER: 00000004
TASK = c000000fae248490[1] 'swapper' THREAD: c000000fae25c000 CPU: 0
GPR00: 0000e2a55cbeec50 c000000fae25fc40 c000000000e21f90 c000007b2b34cb00
GPR04: 0000000000000100 0000000000000100 c000011adcf23418 0000000000000000
GPR08: 0000000000000000 c000008b2b7d4480 c000007b2b35ef80 00000000000024ac
GPR12: 0000000044000042 c00000000ebb0000
NIP [c000000000074b90] .update_group_power+0x50/0x190
LR [c00000000008a1c4] .build_sched_domains+0x434/0x490
Call Trace:
[c000000fae25fc40] [c000000fae25fce0] 0xc000000fae25fce0 (unreliable)
[c000000fae25fce0] [c00000000008a1c4] .build_sched_domains+0x434/0x490
[c000000fae25fdd0] [c000000000867370] .sched_init_smp+0xa8/0x224
[c000000fae25fee0] [c000000000850274] .kernel_init+0x10c/0x1fc
[c000000fae25ff90] [c000000000023884] .kernel_thread+0x54/0x70
Instruction dump:
f821ff61 ebc2b1a0 7c7f1b78 7c9c2378 e9230008 eba30010 2fa90000 419e0054
e9490010 38000000 7d495378 60000000 <8169000c> e9290000 7faa4800 7c005a14


^ permalink raw reply	[flat|nested] 43+ messages in thread


* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-07 10:22 [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 Mahesh J Salgaonkar
@ 2011-07-07 10:59 ` Peter Zijlstra
  2011-07-07 11:55   ` Mahesh J Salgaonkar
  2011-07-14  0:34   ` Anton Blanchard
  0 siblings, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-07 10:59 UTC (permalink / raw)
  To: mahesh; +Cc: linux-kernel, linuxppc-dev, mingo, anton, benh, torvalds

On Thu, 2011-07-07 at 15:52 +0530, Mahesh J Salgaonkar wrote:
> 
> 2.6.39 booted fine on the system and a git bisect shows commit cd4ea6ae -
> "sched: Change NODE sched_domain group creation" as the cause.

Weird, there's no locking anywhere around there. The typical problems
with this patch set were massive explosions due to bad pointers etc.,
but not silent hangs.

The code it's stuck at:

> [1]:
> POWER7 performance monitor hardware support registered
> Brought up 896 CPUs
> Enabling Asymmetric SMT scheduling
> BUG: soft lockup - CPU#0 stuck for 22s! [swapper:1]
> Modules linked in:
> NIP: c000000000074b90 LR: c00000000008a1c4 CTR: 0000000000000000
> REGS: c000000fae25f9c0 TRAP: 0901   Not tainted  (3.0.0-rc6)
> MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24000088  XER: 00000004
> TASK = c000000fae248490[1] 'swapper' THREAD: c000000fae25c000 CPU: 0
> GPR00: 0000e2a55cbeec50 c000000fae25fc40 c000000000e21f90 c000007b2b34cb00
> GPR04: 0000000000000100 0000000000000100 c000011adcf23418 0000000000000000
> GPR08: 0000000000000000 c000008b2b7d4480 c000007b2b35ef80 00000000000024ac
> GPR12: 0000000044000042 c00000000ebb0000
> NIP [c000000000074b90] .update_group_power+0x50/0x190
> LR [c00000000008a1c4] .build_sched_domains+0x434/0x490
> Call Trace:
> [c000000fae25fc40] [c000000fae25fce0] 0xc000000fae25fce0 (unreliable)
> [c000000fae25fce0] [c00000000008a1c4] .build_sched_domains+0x434/0x490
> [c000000fae25fdd0] [c000000000867370] .sched_init_smp+0xa8/0x224
> [c000000fae25fee0] [c000000000850274] .kernel_init+0x10c/0x1fc
> [c000000fae25ff90] [c000000000023884] .kernel_thread+0x54/0x70
> Instruction dump:
> f821ff61 ebc2b1a0 7c7f1b78 7c9c2378 e9230008 eba30010 2fa90000 419e0054
> e9490010 38000000 7d495378 60000000 <8169000c> e9290000 7faa4800 7c005a14

doesn't contain any locks; it's simply looping over all the CPUs, and
with that many I can imagine it takes a while, but getting 'stuck' there
is unexpected to say the least.

Surely this isn't the first multi-node P7 to boot a kernel with this
patch? If my git foo is any good it hit -next on 23rd of May.

I guess what I'm asking is: do smaller P7 machines boot? And if so, is
there any difference other than size?

How many nodes does the thing have anyway, 28? Hmm, that could mean it's
the first machine with >16 nodes to boot this, which would make it
trigger the magic ALL_NODES crap.
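
The >16-node suspicion can be illustrated with a toy model. The sketch below is a hypothetical simplification, not the kernel's actual sched_domain_node_span() (which picks the nearest nodes by distance rather than consecutive node numbers): with 28 nodes and a fixed 16-node span, spans starting at different nodes necessarily overlap.

```c
#include <stdbool.h>

/* Toy model of the NODE-level domain span on a 28-node box.
 * Hypothetical simplification: the real sched_domain_node_span()
 * selects SD_NODES_PER_DOMAIN *nearest* nodes by distance; taking
 * consecutive node numbers with wraparound is enough to show why
 * spans overlap once the machine has more than 16 nodes. */
#define NR_MODEL_NODES      28
#define SD_NODES_PER_DOMAIN 16

static void node_span(int node, bool span[NR_MODEL_NODES])
{
    for (int i = 0; i < NR_MODEL_NODES; i++)
        span[i] = false;
    for (int i = 0; i < SD_NODES_PER_DOMAIN; i++)
        span[(node + i) % NR_MODEL_NODES] = true;
}

/* Number of nodes two spans have in common. */
static int span_overlap(int a, int b)
{
    bool sa[NR_MODEL_NODES], sb[NR_MODEL_NODES];
    int common = 0;

    node_span(a, sa);
    node_span(b, sb);
    for (int i = 0; i < NR_MODEL_NODES; i++)
        common += sa[i] && sb[i];
    return common;
}
```

In this model span_overlap(0, 16) is 4: the span starting at node 16 wraps around and re-covers nodes 0-3, so some groups fall inside two different NODE-level domains.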

Let me dig around there.



* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-07 10:59 ` Peter Zijlstra
@ 2011-07-07 11:55   ` Mahesh J Salgaonkar
  2011-07-07 12:28     ` Peter Zijlstra
  2011-07-14  0:34   ` Anton Blanchard
  1 sibling, 1 reply; 43+ messages in thread
From: Mahesh J Salgaonkar @ 2011-07-07 11:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linuxppc-dev, linux-kernel, anton, mingo, torvalds

On 2011-07-07 12:59:35 Thu, Peter Zijlstra wrote:
> On Thu, 2011-07-07 at 15:52 +0530, Mahesh J Salgaonkar wrote:
> > 
> > 2.6.39 booted fine on the system and a git bisect shows commit cd4ea6ae -
> > "sched: Change NODE sched_domain group creation" as the cause.
> 
> Weird, there's no locking anywhere around there. The typical problems
> with this patch-set were massive explosions due to bad pointers etc..
> But not silent hangs.
> 
> The code its stuck at:
> 
> > [1]:
> > POWER7 performance monitor hardware support registered
> > Brought up 896 CPUs
> > Enabling Asymmetric SMT scheduling
> > BUG: soft lockup - CPU#0 stuck for 22s! [swapper:1]
> > Modules linked in:
> > NIP: c000000000074b90 LR: c00000000008a1c4 CTR: 0000000000000000
> > REGS: c000000fae25f9c0 TRAP: 0901   Not tainted  (3.0.0-rc6)
> > MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24000088  XER: 00000004
> > TASK = c000000fae248490[1] 'swapper' THREAD: c000000fae25c000 CPU: 0
> > GPR00: 0000e2a55cbeec50 c000000fae25fc40 c000000000e21f90 c000007b2b34cb00
> > GPR04: 0000000000000100 0000000000000100 c000011adcf23418 0000000000000000
> > GPR08: 0000000000000000 c000008b2b7d4480 c000007b2b35ef80 00000000000024ac
> > GPR12: 0000000044000042 c00000000ebb0000
> > NIP [c000000000074b90] .update_group_power+0x50/0x190
> > LR [c00000000008a1c4] .build_sched_domains+0x434/0x490
> > Call Trace:
> > [c000000fae25fc40] [c000000fae25fce0] 0xc000000fae25fce0 (unreliable)
> > [c000000fae25fce0] [c00000000008a1c4] .build_sched_domains+0x434/0x490
> > [c000000fae25fdd0] [c000000000867370] .sched_init_smp+0xa8/0x224
> > [c000000fae25fee0] [c000000000850274] .kernel_init+0x10c/0x1fc
> > [c000000fae25ff90] [c000000000023884] .kernel_thread+0x54/0x70
> > Instruction dump:
> > f821ff61 ebc2b1a0 7c7f1b78 7c9c2378 e9230008 eba30010 2fa90000 419e0054
> > e9490010 38000000 7d495378 60000000 <8169000c> e9290000 7faa4800 7c005a14
> 
> doesn't contains any locks, its simply looping over all the cpus, and
> with that many I can imagine it takes a while, but getting 'stuck' there
> is unexpected to say the least.
> 
> Surely this isn't the first multi-node P7 to boot a kernel with this
> patch? If my git foo is any good it hit -next on 23rd of May.
> 
> I guess I'm asking is, do smaller P7 machines boot? And if so, is there
> any difference except size?

Yes, the smaller P7 machine that I have with 20 CPUs and 2GB ram boots
fine with 3.0.0-rc.

> 
> How many nodes does the thing have anyway, 28? Hmm, that could mean its
> the first machine with >16 nodes to boot this, which would make it
> trigger the magic ALL_NODES crap.

The P7 machine where the kernel fails to boot shows the following dmesg log
w.r.t. the node map:
---------------------------
Zone PFN ranges:
  DMA      0x00000000 -> 0x01229000
  Normal   empty
Movable zone start PFN for each node
early_node_map[12] active PFN ranges
    0: 0x00000000 -> 0x000fd000
    4: 0x000fd000 -> 0x002fb000
    5: 0x002fb000 -> 0x004b9000
    6: 0x004b9000 -> 0x006b9000
    8: 0x006b9000 -> 0x007b5000
   12: 0x007b5000 -> 0x008b5000
   16: 0x008b5000 -> 0x009b1000
   20: 0x009b1000 -> 0x00bb1000
   21: 0x00bb1000 -> 0x00db1000
   22: 0x00db1000 -> 0x00fb1000
   23: 0x00fb1000 -> 0x011b1000
   28: 0x011b1000 -> 0x01229000
Could not find start_pfn for node 1
Could not find start_pfn for node 2
Could not find start_pfn for node 3
Could not find start_pfn for node 7
Could not find start_pfn for node 9
Could not find start_pfn for node 10
Could not find start_pfn for node 11
Could not find start_pfn for node 13
Could not find start_pfn for node 14
Could not find start_pfn for node 15
Could not find start_pfn for node 17
Could not find start_pfn for node 18
Could not find start_pfn for node 19
Could not find start_pfn for node 29
Could not find start_pfn for node 30
Could not find start_pfn for node 31
[boot]0015 Setup Done
PERCPU: Embedded 1 pages/cpu @c000000013c00000 s31488 r0 d34048 u65536
Built 28 zonelists in Node order, mobility grouping on.  Total pages:
19026032
Policy zone: DMA
Kernel command line: root=/dev/mapper/vg_nish1-lv_root ro
rd_LVM_LV=vg_nish1/lv_root rd_LVM_LV=VolGroup/lv_swap
rd_LVM_LV=vg_nish1/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8
SYSFONT=latarcyrheb-sun16 KEYTABLE=us console=hvc0i memblock=debug 
PID hash table entries: 4096 (order: -1, 32768 bytes)
freeing bootmem node 0
freeing bootmem node 4
freeing bootmem node 5
freeing bootmem node 6
freeing bootmem node 8
freeing bootmem node 12
freeing bootmem node 16
freeing bootmem node 20
freeing bootmem node 21
freeing bootmem node 22
freeing bootmem node 23
freeing bootmem node 28
Memory: 1213775296k/1218707456k available (13312k kernel code, 4932160k
reserved, 1600k data, 2727k bss, 4928k init)
---------------------------

Thanks,
-Mahesh.

> 
> Let me dig around there.
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

-- 
Mahesh J Salgaonkar



* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-07 11:55   ` Mahesh J Salgaonkar
@ 2011-07-07 12:28     ` Peter Zijlstra
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-07 12:28 UTC (permalink / raw)
  To: mahesh; +Cc: linuxppc-dev, linux-kernel, anton, mingo, torvalds

On Thu, 2011-07-07 at 17:25 +0530, Mahesh J Salgaonkar wrote:
> > I guess I'm asking is, do smaller P7 machines boot? And if so, is there
> > any difference except size?
> 
> Yes, the smaller P7 machine that I have with 20 CPUs and 2GB ram boots
> fine with 3.0.0-rc. 

That sounds like a single node machine. P7 comes as {4,6,8}*4 (16,24,32
cpus) per socket. And that 2G doesn't sound like much either.



* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-07 10:59 ` Peter Zijlstra
  2011-07-07 11:55   ` Mahesh J Salgaonkar
@ 2011-07-14  0:34   ` Anton Blanchard
  2011-07-14  4:35     ` Anton Blanchard
  1 sibling, 1 reply; 43+ messages in thread
From: Anton Blanchard @ 2011-07-14  0:34 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds


Hi Peter,

> Surely this isn't the first multi-node P7 to boot a kernel with this
> patch? If my git foo is any good it hit -next on 23rd of May.
> 
> I guess I'm asking is, do smaller P7 machines boot? And if so, is
> there any difference except size?
> 
> How many nodes does the thing have anyway, 28? Hmm, that could mean
> its the first machine with >16 nodes to boot this, which would make it
> trigger the magic ALL_NODES crap.

We haven't tested a box with more than 16 nodes in quite a while, so it
may be this.

I took a quick look and we are stuck in update_group_power:

        do {
                power += group->cpu_power;
                group = group->next;
        } while (group != child->groups);

I looked at the linked list:

child->groups = c000007b2f74ff00

and dumping group as we go:

c000007b2f74ff00 c000007b2f760000 c000007b2fb60000 c000007b2ff60000

at this point we end up in a cycle and never make it back to
child->groups:

c000008b2e68ff00 c000008b2e6a0000 c000008b2eaa0000 c000008b2eea0000
c000009aee77ff00 c000009aee790000 c000009aeeb90000 c000009aeef90000
c00000bafde91800 c00000dafdf81800 c00000fafce81800 c000011afdf71800
c00001226e70ff00 c00001226e720000 c00001226eb20000 c00001226ef20000
c000008b2e68ff00
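
The hang is easy to reproduce in miniature. The sketch below is a toy model, not kernel code: struct group stands in for struct sched_group, and the walk mirrors the do/while above, with a Floyd tortoise-and-hare check added so a ring whose ->next chain drops into a foreign cycle is detected instead of spinning forever.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct sched_group: just the circular
 * link and the power value the loop above accumulates. */
struct group {
    struct group *next;
    unsigned long cpu_power;
};

/* Walk the ring the way update_group_power() does, but with a
 * tortoise-and-hare guard: if the list never returns to 'head',
 * the plain do/while never terminates. Returns true when such a
 * foreign cycle is detected; on a well-formed ring the two
 * pointers can only meet at 'head', so the guard never fires. */
static bool sum_power_checked(struct group *head, unsigned long *sum)
{
    struct group *slow = head, *fast = head;

    *sum = 0;
    do {
        *sum += slow->cpu_power;
        slow = slow->next;
        fast = fast->next->next;    /* advances two steps per turn */
        if (fast == slow && slow != head)
            return true;            /* trapped in a foreign cycle */
    } while (slow != head);
    return false;                   /* ring is well formed */
}
```

On a healthy four-group ring with cpu_power 1024 each this returns false with a sum of 4096; repointing one group's ->next into the middle of the ring, as in the dump above, makes it return true.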

Still investigating

Anton




* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-14  0:34   ` Anton Blanchard
@ 2011-07-14  4:35     ` Anton Blanchard
  2011-07-14 13:16       ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Anton Blanchard @ 2011-07-14  4:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds


> I took a quick look and we are stuck in update_group_power:
> 
>         do {
>                 power += group->cpu_power;
>                 group = group->next;
>         } while (group != child->groups);
> 
> I looked at the linked list:
> 
> child->groups = c000007b2f74ff00
> 
> and dumping group as we go:
> 
> c000007b2f74ff00 c000007b2f760000 c000007b2fb60000 c000007b2ff60000
> 
> at this point we end up in a cycle and never make it back to
> child->groups:
> 
> c000008b2e68ff00 c000008b2e6a0000 c000008b2eaa0000 c000008b2eea0000
> c000009aee77ff00 c000009aee790000 c000009aeeb90000 c000009aeef90000
> c00000bafde91800 c00000dafdf81800 c00000fafce81800 c000011afdf71800
> c00001226e70ff00 c00001226e720000 c00001226eb20000 c00001226ef20000
> c000008b2e68ff00

It looks like the group ends up in two lists. I added a BUG_ON to
ensure we never link a group twice, and it hits.

I also printed out the cpu spans as we walk through build_sched_groups:

0 1 2 3
0 4 8 12 16 20 24 28
0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
0 128 256 384
4 5 6 7
8 9 10 11
12 13 14 15
16 17 18 19
20 21 22 23
24 25 26 27
28 29 30 31
32 33 34 35
32 36 40 44 48 52 56 60
36 37 38 39
40 41 42 43
44 45 46 47
48 49 50 51
52 53 54 55
56 57 58 59
60 61 62 63
64 65 66 67
64 68 72 76 80 84 88 92
68 69 70 71
72 73 74 75
76 77 78 79
80 81 82 83
84 85 86 87
88 89 90 91
92 93 94 95
96 97 98 99
96 100 104 108 112 116 120 124
100 101 102 103
104 105 106 107
108 109 110 111
112 113 114 115
116 117 118 119
120 121 122 123
124 125 126 127
128 129 130 131
128 132 136 140 144 148 152 156

Duplicates start appearing in this span:
128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608

So it looks like the overlap of the 16-entry spans
(SD_NODES_PER_DOMAIN) is causing our problem.
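
The "group linked twice" failure can be shown with toy types. The helper below is a hypothetical stand-in for the chaining build_sched_groups() does, not the kernel's actual code: because linking a group overwrites its ->next unconditionally, inserting the same group into a second ring splices the first ring into the second one's cycle, producing exactly the never-returning walk in the quoted dump.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy group carrying only the circular link. */
struct group {
    struct group *next;
};

/* Hypothetical stand-in for the chaining build_sched_groups() does:
 * append sg to the circular list starting at *first, keeping the
 * ring closed after every insertion. Note that sg->next is
 * overwritten unconditionally, so linking sg into a second ring
 * silently removes it from the first. */
static void link_group(struct group **first, struct group **last,
                       struct group *sg)
{
    if (*first == NULL)
        *first = sg;
    else
        (*last)->next = sg;
    *last = sg;
    sg->next = *first;
}

/* Follow ->next at most 'limit' times; true iff we return to head. */
static bool ring_closes(struct group *head, int limit)
{
    struct group *g = head;

    while (limit-- > 0) {
        g = g->next;
        if (g == head)
            return true;
    }
    return false;
}
```

After a shared group is linked into a second ring, walking the first ring from its own head drops into the second ring's cycle and never comes back, which is the shape of the pointer dump quoted above.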

Anton

Index: linux-2.6-work/kernel/sched.c
===================================================================
--- linux-2.6-work.orig/kernel/sched.c	2011-07-11 12:48:48.251087767 +1000
+++ linux-2.6-work/kernel/sched.c	2011-07-14 14:19:45.867094044 +1000
@@ -7021,6 +7021,7 @@ build_sched_groups(struct sched_domain *
 		cpumask_clear(sched_group_cpus(sg));
 		sg->cpu_power = 0;
+		BUG_ON(sg->next);
 
 		for_each_cpu(j, span) {
 			if (get_group(j, sdd, NULL) != group)


Anton

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-14  4:35     ` Anton Blanchard
@ 2011-07-14 13:16       ` Peter Zijlstra
  2011-07-15  0:45         ` Anton Blanchard
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-14 13:16 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

On Thu, 2011-07-14 at 14:35 +1000, Anton Blanchard wrote:

> I also printed out the cpu spans as we walk through build_sched_groups:

> 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480

> Duplicates start appearing in this span:
> 128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608
> 
> So it looks like the overlap of the 16 entry spans
> (SD_NODES_PER_DOMAIN) is causing our problem.

Urgh.. so those spans are generated by sched_domain_node_span(), and it
looks like that simply picks the 15 nearest nodes to the one we've got
without consideration for overlap with previously generated spans.

Now that used to work because it used to simply allocate a new group
instead of using the existing one.

The thing is, we want to track state unique to a group of cpus, so
duplicating that is iffy.

Otoh, making these masks non-overlapping is probably sub-optimal from a
NUMA pov.

Looking at a slightly simpler set-up (4 socket AMD magny-cours):

$ cat /sys/devices/system/node/node*/distance
10 16 16 22 16 22 16 22
16 10 22 16 22 16 22 16
16 22 10 16 16 22 16 22
22 16 16 10 22 16 22 16
16 22 16 22 10 16 16 22
22 16 22 16 16 10 22 16
16 22 16 22 16 22 10 16
22 16 22 16 22 16 16 10

We can translate that into groups like

{0} {0,1,2,4,6} {0-7}
{1} {1,0,3,5,7} {0-7}
...

and we can easily see there's overlap there as well in the NUMA layout
itself.

This seems to suggest we need to separate the unique state from the
sched_group.

Now all I need is a way to not consume gobs of memory.. /me goes prod

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-14 13:16       ` Peter Zijlstra
@ 2011-07-15  0:45         ` Anton Blanchard
  2011-07-15  8:37           ` Peter Zijlstra
  2011-07-18 21:35           ` Peter Zijlstra
  0 siblings, 2 replies; 43+ messages in thread
From: Anton Blanchard @ 2011-07-15  0:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds


Hi,

> Urgh.. so those spans are generated by sched_domain_node_span(), and
> it looks like that simply picks the 15 nearest nodes to the one we've
> got without consideration for overlap with previously generated spans.

I do wonder if we need this extra level at all on ppc64. From memory
SGI added it for their massive setups, but our largest setup is 32 nodes
and breaking that down into 16-node chunks seems overkill.

I just realised we were setting NEWIDLE on our node definition and that
was causing large amounts of rebalance work even with
SD_NODES_PER_DOMAIN=16.

After removing it and bumping SD_NODES_PER_DOMAIN to 32, things look
pretty good.

Perhaps we should allow an arch to override SD_NODES_PER_DOMAIN so this
extra level is only used by SGI boxes.

Anton

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-15  0:45         ` Anton Blanchard
@ 2011-07-15  8:37           ` Peter Zijlstra
  2011-07-18 21:35           ` Peter Zijlstra
  1 sibling, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-15  8:37 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

On Fri, 2011-07-15 at 10:45 +1000, Anton Blanchard wrote:
> Hi,
> 
> > Urgh.. so those spans are generated by sched_domain_node_span(), and
> > it looks like that simply picks the 15 nearest nodes to the one we've
> > got without consideration for overlap with previously generated spans.
> 
> I do wonder if we need this extra level at all on ppc64. From memory
> SGI added it for their massive setups, but our largest setup is 32 nodes
> and breaking that down into 16 node chunks seems overkill.
> 
> I just realised we were setting NEWIDLE on our node definition and that
> was causing large amounts of rebalance work even with
> SD_NODES_PER_DOMAIN=16.
> 
> After removing it and bumping SD_NODES_PER_DOMAIN to 32, things look
> pretty good.
> 
> Perhaps we should allow an arch to override SD_NODES_PER_DOMAIN so this
> extra level is only used by SGI boxes.

We can certainly remove the whole topology layer that causes this
problem for 3.0 and try to fix up for 3.1 again.

But I was rather hoping to introduce more of those layers in the near
future, I was hoping to create a layer per node_distance() value, such
that the load-balancing is aware of the interconnects.

Now for that I ran into the exact same problem, and at the time didn't
come up with a solution, but I think I now see a way out.

Something like the below ought to avoid the problem.. makes SGI sad
though :-)

---
 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8fb4245..877b9f1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7203,7 +7203,7 @@ static struct sched_domain_topology_level default_topology[] = {
 #endif
 	{ sd_init_CPU, cpu_cpu_mask, },
 #ifdef CONFIG_NUMA
-	{ sd_init_NODE, cpu_node_mask, },
+//	{ sd_init_NODE, cpu_node_mask, },
 	{ sd_init_ALLNODES, cpu_allnodes_mask, },
 #endif
 	{ NULL, },


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-15  0:45         ` Anton Blanchard
  2011-07-15  8:37           ` Peter Zijlstra
@ 2011-07-18 21:35           ` Peter Zijlstra
  2011-07-19  4:44             ` Anton Blanchard
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-18 21:35 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

[-- Attachment #1: Type: text/plain, Size: 442 bytes --]

Anton, could you test the below two patches on that machine?

It should make things boot again. While I don't have a machine nearly
big enough to trigger any of this, I tested the new code paths by
setting FORCE_SD_OVERLAP in /debug/sched_features. Any review
of the error paths would be much appreciated.

Also, could you send me the node_distance table for that machine? I'm
curious what the interconnects look like on that thing.

[-- Attachment #2: sched-domain-foo-1.patch --]
[-- Type: text/x-patch, Size: 9787 bytes --]

Subject: sched: Break out cpu_power from the sched_group structure
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Thu Jul 14 13:00:06 CEST 2011

In order to prepare for non-unique sched_groups per domain, we need to
carry the cpu_power elsewhere, so put a level of indirection in.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
---
 include/linux/sched.h |   14 +++++++++-----
 kernel/sched.c        |   32 ++++++++++++++++++++++++++------
 kernel/sched_fair.c   |   46 +++++++++++++++++++++++-----------------------
 3 files changed, 58 insertions(+), 34 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -6550,7 +6550,7 @@ static int sched_domain_debug_one(struct
 			break;
 		}
 
-		if (!group->cpu_power) {
+		if (!group->sgp->power) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: domain->cpu_power not "
 					"set\n");
@@ -6574,9 +6574,9 @@ static int sched_domain_debug_one(struct
 		cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
 
 		printk(KERN_CONT " %s", str);
-		if (group->cpu_power != SCHED_POWER_SCALE) {
+		if (group->sgp->power != SCHED_POWER_SCALE) {
 			printk(KERN_CONT " (cpu_power = %d)",
-				group->cpu_power);
+				group->sgp->power);
 		}
 
 		group = group->next;
@@ -6770,8 +6770,10 @@ static struct root_domain *alloc_rootdom
 static void free_sched_domain(struct rcu_head *rcu)
 {
 	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-	if (atomic_dec_and_test(&sd->groups->ref))
+	if (atomic_dec_and_test(&sd->groups->ref)) {
+		kfree(sd->groups->sgp);
 		kfree(sd->groups);
+	}
 	kfree(sd);
 }
 
@@ -6938,6 +6940,7 @@ int sched_smt_power_savings = 0, sched_m
 struct sd_data {
 	struct sched_domain **__percpu sd;
 	struct sched_group **__percpu sg;
+	struct sched_group_power **__percpu sgp;
 };
 
 struct s_data {
@@ -6974,8 +6977,10 @@ static int get_group(int cpu, struct sd_
 	if (child)
 		cpu = cpumask_first(sched_domain_span(child));
 
-	if (sg)
+	if (sg) {
 		*sg = *per_cpu_ptr(sdd->sg, cpu);
+		(*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu);
+	}
 
 	return cpu;
 }
@@ -7013,7 +7018,7 @@ build_sched_groups(struct sched_domain *
 			continue;
 
 		cpumask_clear(sched_group_cpus(sg));
-		sg->cpu_power = 0;
+		sg->sgp->power = 0;
 
 		for_each_cpu(j, span) {
 			if (get_group(j, sdd, NULL) != group)
@@ -7178,6 +7183,7 @@ static void claim_allocations(int cpu, s
 	if (cpu == cpumask_first(sched_group_cpus(sg))) {
 		WARN_ON_ONCE(*per_cpu_ptr(sdd->sg, cpu) != sg);
 		*per_cpu_ptr(sdd->sg, cpu) = NULL;
+		*per_cpu_ptr(sdd->sgp, cpu) = NULL;
 	}
 }
 
@@ -7227,9 +7233,14 @@ static int __sdt_alloc(const struct cpum
 		if (!sdd->sg)
 			return -ENOMEM;
 
+		sdd->sgp = alloc_percpu(struct sched_group_power *);
+		if (!sdd->sgp)
+			return -ENOMEM;
+
 		for_each_cpu(j, cpu_map) {
 			struct sched_domain *sd;
 			struct sched_group *sg;
+			struct sched_group_power *sgp;
 
 		       	sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
@@ -7244,6 +7255,13 @@ static int __sdt_alloc(const struct cpum
 				return -ENOMEM;
 
 			*per_cpu_ptr(sdd->sg, j) = sg;
+
+			sgp = kzalloc_node(sizeof(struct sched_group_power),
+					GFP_KERNEL, cpu_to_node(j));
+			if (!sgp)
+				return -ENOMEM;
+
+			*per_cpu_ptr(sdd->sgp, j) = sgp;
 		}
 	}
 
@@ -7261,9 +7279,11 @@ static void __sdt_free(const struct cpum
 		for_each_cpu(j, cpu_map) {
 			kfree(*per_cpu_ptr(sdd->sd, j));
 			kfree(*per_cpu_ptr(sdd->sg, j));
+			kfree(*per_cpu_ptr(sdd->sgp, j));
 		}
 		free_percpu(sdd->sd);
 		free_percpu(sdd->sg);
+		free_percpu(sdd->sgp);
 	}
 }
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1583,7 +1583,7 @@ find_idlest_group(struct sched_domain *s
 		}
 
 		/* Adjust by relative CPU power of the group */
-		avg_load = (avg_load * SCHED_POWER_SCALE) / group->cpu_power;
+		avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;
 
 		if (local_group) {
 			this_load = avg_load;
@@ -2629,7 +2629,7 @@ static void update_cpu_power(struct sche
 		power >>= SCHED_POWER_SHIFT;
 	}
 
-	sdg->cpu_power_orig = power;
+	sdg->sgp->power_orig = power;
 
 	if (sched_feat(ARCH_POWER))
 		power *= arch_scale_freq_power(sd, cpu);
@@ -2645,7 +2645,7 @@ static void update_cpu_power(struct sche
 		power = 1;
 
 	cpu_rq(cpu)->cpu_power = power;
-	sdg->cpu_power = power;
+	sdg->sgp->power = power;
 }
 
 static void update_group_power(struct sched_domain *sd, int cpu)
@@ -2663,11 +2663,11 @@ static void update_group_power(struct sc
 
 	group = child->groups;
 	do {
-		power += group->cpu_power;
+		power += group->sgp->power;
 		group = group->next;
 	} while (group != child->groups);
 
-	sdg->cpu_power = power;
+	sdg->sgp->power = power;
 }
 
 /*
@@ -2689,7 +2689,7 @@ fix_small_capacity(struct sched_domain *
 	/*
 	 * If ~90% of the cpu_power is still there, we're good.
 	 */
-	if (group->cpu_power * 32 > group->cpu_power_orig * 29)
+	if (group->sgp->power * 32 > group->sgp->power_orig * 29)
 		return 1;
 
 	return 0;
@@ -2769,7 +2769,7 @@ static inline void update_sg_lb_stats(st
 	}
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->cpu_power;
+	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -2786,7 +2786,7 @@ static inline void update_sg_lb_stats(st
 	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1)
 		sgs->group_imb = 1;
 
-	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power,
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
 						SCHED_POWER_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
@@ -2875,7 +2875,7 @@ static inline void update_sd_lb_stats(st
 			return;
 
 		sds->total_load += sgs.group_load;
-		sds->total_pwr += sg->cpu_power;
+		sds->total_pwr += sg->sgp->power;
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
@@ -2960,7 +2960,7 @@ static int check_asym_packing(struct sch
 	if (this_cpu > busiest_cpu)
 		return 0;
 
-	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
+	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->sgp->power,
 				       SCHED_POWER_SCALE);
 	return 1;
 }
@@ -2991,7 +2991,7 @@ static inline void fix_small_imbalance(s
 
 	scaled_busy_load_per_task = sds->busiest_load_per_task
 					 * SCHED_POWER_SCALE;
-	scaled_busy_load_per_task /= sds->busiest->cpu_power;
+	scaled_busy_load_per_task /= sds->busiest->sgp->power;
 
 	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
 			(scaled_busy_load_per_task * imbn)) {
@@ -3005,28 +3005,28 @@ static inline void fix_small_imbalance(s
 	 * moving them.
 	 */
 
-	pwr_now += sds->busiest->cpu_power *
+	pwr_now += sds->busiest->sgp->power *
 			min(sds->busiest_load_per_task, sds->max_load);
-	pwr_now += sds->this->cpu_power *
+	pwr_now += sds->this->sgp->power *
 			min(sds->this_load_per_task, sds->this_load);
 	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
 	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-		sds->busiest->cpu_power;
+		sds->busiest->sgp->power;
 	if (sds->max_load > tmp)
-		pwr_move += sds->busiest->cpu_power *
+		pwr_move += sds->busiest->sgp->power *
 			min(sds->busiest_load_per_task, sds->max_load - tmp);
 
 	/* Amount of load we'd add */
-	if (sds->max_load * sds->busiest->cpu_power <
+	if (sds->max_load * sds->busiest->sgp->power <
 		sds->busiest_load_per_task * SCHED_POWER_SCALE)
-		tmp = (sds->max_load * sds->busiest->cpu_power) /
-			sds->this->cpu_power;
+		tmp = (sds->max_load * sds->busiest->sgp->power) /
+			sds->this->sgp->power;
 	else
 		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-			sds->this->cpu_power;
-	pwr_move += sds->this->cpu_power *
+			sds->this->sgp->power;
+	pwr_move += sds->this->sgp->power *
 			min(sds->this_load_per_task, sds->this_load + tmp);
 	pwr_move /= SCHED_POWER_SCALE;
 
@@ -3072,7 +3072,7 @@ static inline void calculate_imbalance(s
 
 		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
 
-		load_above_capacity /= sds->busiest->cpu_power;
+		load_above_capacity /= sds->busiest->sgp->power;
 	}
 
 	/*
@@ -3088,8 +3088,8 @@ static inline void calculate_imbalance(s
 	max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
 
 	/* How much load to actually move to equalise the imbalance */
-	*imbalance = min(max_pull * sds->busiest->cpu_power,
-		(sds->avg_load - sds->this_load) * sds->this->cpu_power)
+	*imbalance = min(max_pull * sds->busiest->sgp->power,
+		(sds->avg_load - sds->this_load) * sds->this->sgp->power)
 			/ SCHED_POWER_SCALE;
 
 	/*
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -893,16 +893,20 @@ static inline int sd_power_saving_flags(
 	return 0;
 }
 
-struct sched_group {
-	struct sched_group *next;	/* Must be a circular list */
-	atomic_t ref;
-
+struct sched_group_power {
 	/*
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 	 * single CPU.
 	 */
-	unsigned int cpu_power, cpu_power_orig;
+	unsigned int power, power_orig;
+};
+
+struct sched_group {
+	struct sched_group *next;	/* Must be a circular list */
+	atomic_t ref;
+
 	unsigned int group_weight;
+	struct sched_group_power *sgp;
 
 	/*
 	 * The CPUs this group covers.

[-- Attachment #3: sched-domain-foo-2.patch --]
[-- Type: text/x-patch, Size: 8956 bytes --]

Subject: sched: Allow for overlapping sched_domain spans
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri Jul 15 10:35:52 CEST 2011

Allow for sched_domain spans that overlap by giving such domains their
own sched_group list instead of sharing the sched_groups amongst
each-other.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-yr71izj2souh2dbifdh6j68y@git.kernel.org
---
 include/linux/sched.h   |    2 
 kernel/sched.c          |  157 +++++++++++++++++++++++++++++++++++++++---------
 kernel/sched_features.h |    2 
 3 files changed, 132 insertions(+), 29 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -844,6 +844,7 @@ enum cpu_idle_type {
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
+#define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 
 enum powersavings_balance_level {
 	POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
@@ -894,6 +895,7 @@ static inline int sd_power_saving_flags(
 }
 
 struct sched_group_power {
+	atomic_t ref;
 	/*
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 	 * single CPU.
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -6767,10 +6767,36 @@ static struct root_domain *alloc_rootdom
 	return rd;
 }
 
+static void free_sched_groups(struct sched_group *sg, int free_sgp)
+{
+	struct sched_group *tmp, *first;
+
+	if (!sg)
+		return;
+
+	first = sg;
+	do {
+		tmp = sg->next;
+
+		if (free_sgp && atomic_dec_and_test(&sg->sgp->ref))
+			kfree(sg->sgp);
+
+		kfree(sg);
+		sg = tmp;
+	} while (sg != first);
+}
+
 static void free_sched_domain(struct rcu_head *rcu)
 {
 	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-	if (atomic_dec_and_test(&sd->groups->ref)) {
+
+	/*
+	 * If its an overlapping domain it has private groups, iterate and
+	 * nuke them all.
+	 */
+	if (sd->flags & SD_OVERLAP) {
+		free_sched_groups(sd->groups, 1);
+	} else if (atomic_dec_and_test(&sd->groups->ref)) {
 		kfree(sd->groups->sgp);
 		kfree(sd->groups);
 	}
@@ -6960,15 +6986,73 @@ struct sched_domain_topology_level;
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
+#define SDTL_OVERLAP	0x01
+
 struct sched_domain_topology_level {
 	sched_domain_init_f init;
 	sched_domain_mask_f mask;
+	int		    flags;
 	struct sd_data      data;
 };
 
-/*
- * Assumes the sched_domain tree is fully constructed
- */
+static int
+build_overlap_sched_groups(struct sched_domain *sd, int cpu)
+{
+	struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+	const struct cpumask *span = sched_domain_span(sd);
+	struct cpumask *covered = sched_domains_tmpmask;
+	struct sd_data *sdd = sd->private;
+	struct sched_domain *child;
+	int i;
+
+	cpumask_clear(covered);
+
+	for_each_cpu(i, span) {
+		struct cpumask *sg_span;
+
+		if (cpumask_test_cpu(i, covered))
+			continue;
+
+		sg = kzalloc_node(sizeof(struct sched_group), GFP_KERNEL,
+				cpu_to_node(i));
+
+		if (!sg)
+			goto fail;
+
+		sg_span = sched_group_cpus(sg);
+
+		child = *per_cpu_ptr(sdd->sd, i);
+		if (child->child) {
+			child = child->child;
+			*sg_span = *sched_domain_span(child);
+		} else
+			cpumask_set_cpu(i, sg_span);
+
+		cpumask_or(covered, covered, sg_span);
+
+		sg->sgp = *per_cpu_ptr(sdd->sgp, cpumask_first(sg_span));
+		atomic_inc(&sg->sgp->ref);
+
+		if (cpumask_test_cpu(cpu, sg_span))
+			groups = sg;
+
+		if (!first)
+			first = sg;
+		if (last)
+			last->next = sg;
+		last = sg;
+		last->next = first;
+	}
+	sd->groups = groups;
+
+	return 0;
+
+fail:
+	free_sched_groups(first, 0);
+
+	return -ENOMEM;
+}
+
 static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
 {
 	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
@@ -6980,23 +7064,21 @@ static int get_group(int cpu, struct sd_
 	if (sg) {
 		*sg = *per_cpu_ptr(sdd->sg, cpu);
 		(*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu);
+		atomic_set(&(*sg)->sgp->ref, 1); /* for claim_allocations */
 	}
 
 	return cpu;
 }
 
 /*
- * build_sched_groups takes the cpumask we wish to span, and a pointer
- * to a function which identifies what group(along with sched group) a CPU
- * belongs to. The return value of group_fn must be a >= 0 and < nr_cpu_ids
- * (due to the fact that we keep track of groups covered with a struct cpumask).
- *
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
@ 2011-07-18 21:35           ` Peter Zijlstra
  2011-07-19  4:44             ` Anton Blanchard
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-18 21:35 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds

[-- Attachment #1: Type: text/plain, Size: 442 bytes --]

Anton, could you test the below two patches on that machine?

It should make things boot again. I don't have a machine nearly
big enough to trigger any of this, so I tested the new code paths by
setting FORCE_SD_OVERLAP in /debug/sched_features; any review
of the error paths would be much appreciated.

Also, could you send me the node_distance table for that machine? I'm
curious what the interconnects look like on that thing.

[-- Attachment #2: sched-domain-foo-1.patch --]
[-- Type: text/x-patch, Size: 9787 bytes --]

Subject: sched: Break out cpu_power from the sched_group structure
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Thu Jul 14 13:00:06 CEST 2011

In order to prepare for non-unique sched_groups per domain, we need to
carry the cpu_power elsewhere, so put a level of indirection in.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
---
 include/linux/sched.h |   14 +++++++++-----
 kernel/sched.c        |   32 ++++++++++++++++++++++++++------
 kernel/sched_fair.c   |   46 +++++++++++++++++++++++-----------------------
 3 files changed, 58 insertions(+), 34 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -6550,7 +6550,7 @@ static int sched_domain_debug_one(struct
 			break;
 		}
 
-		if (!group->cpu_power) {
+		if (!group->sgp->power) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: domain->cpu_power not "
 					"set\n");
@@ -6574,9 +6574,9 @@ static int sched_domain_debug_one(struct
 		cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
 
 		printk(KERN_CONT " %s", str);
-		if (group->cpu_power != SCHED_POWER_SCALE) {
+		if (group->sgp->power != SCHED_POWER_SCALE) {
 			printk(KERN_CONT " (cpu_power = %d)",
-				group->cpu_power);
+				group->sgp->power);
 		}
 
 		group = group->next;
@@ -6770,8 +6770,10 @@ static struct root_domain *alloc_rootdom
 static void free_sched_domain(struct rcu_head *rcu)
 {
 	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-	if (atomic_dec_and_test(&sd->groups->ref))
+	if (atomic_dec_and_test(&sd->groups->ref)) {
+		kfree(sd->groups->sgp);
 		kfree(sd->groups);
+	}
 	kfree(sd);
 }
 
@@ -6938,6 +6940,7 @@ int sched_smt_power_savings = 0, sched_m
 struct sd_data {
 	struct sched_domain **__percpu sd;
 	struct sched_group **__percpu sg;
+	struct sched_group_power **__percpu sgp;
 };
 
 struct s_data {
@@ -6974,8 +6977,10 @@ static int get_group(int cpu, struct sd_
 	if (child)
 		cpu = cpumask_first(sched_domain_span(child));
 
-	if (sg)
+	if (sg) {
 		*sg = *per_cpu_ptr(sdd->sg, cpu);
+		(*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu);
+	}
 
 	return cpu;
 }
@@ -7013,7 +7018,7 @@ build_sched_groups(struct sched_domain *
 			continue;
 
 		cpumask_clear(sched_group_cpus(sg));
-		sg->cpu_power = 0;
+		sg->sgp->power = 0;
 
 		for_each_cpu(j, span) {
 			if (get_group(j, sdd, NULL) != group)
@@ -7178,6 +7183,7 @@ static void claim_allocations(int cpu, s
 	if (cpu == cpumask_first(sched_group_cpus(sg))) {
 		WARN_ON_ONCE(*per_cpu_ptr(sdd->sg, cpu) != sg);
 		*per_cpu_ptr(sdd->sg, cpu) = NULL;
+		*per_cpu_ptr(sdd->sgp, cpu) = NULL;
 	}
 }
 
@@ -7227,9 +7233,14 @@ static int __sdt_alloc(const struct cpum
 		if (!sdd->sg)
 			return -ENOMEM;
 
+		sdd->sgp = alloc_percpu(struct sched_group_power *);
+		if (!sdd->sgp)
+			return -ENOMEM;
+
 		for_each_cpu(j, cpu_map) {
 			struct sched_domain *sd;
 			struct sched_group *sg;
+			struct sched_group_power *sgp;
 
 		       	sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
@@ -7244,6 +7255,13 @@ static int __sdt_alloc(const struct cpum
 				return -ENOMEM;
 
 			*per_cpu_ptr(sdd->sg, j) = sg;
+
+			sgp = kzalloc_node(sizeof(struct sched_group_power),
+					GFP_KERNEL, cpu_to_node(j));
+			if (!sgp)
+				return -ENOMEM;
+
+			*per_cpu_ptr(sdd->sgp, j) = sgp;
 		}
 	}
 
@@ -7261,9 +7279,11 @@ static void __sdt_free(const struct cpum
 		for_each_cpu(j, cpu_map) {
 			kfree(*per_cpu_ptr(sdd->sd, j));
 			kfree(*per_cpu_ptr(sdd->sg, j));
+			kfree(*per_cpu_ptr(sdd->sgp, j));
 		}
 		free_percpu(sdd->sd);
 		free_percpu(sdd->sg);
+		free_percpu(sdd->sgp);
 	}
 }
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1583,7 +1583,7 @@ find_idlest_group(struct sched_domain *s
 		}
 
 		/* Adjust by relative CPU power of the group */
-		avg_load = (avg_load * SCHED_POWER_SCALE) / group->cpu_power;
+		avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;
 
 		if (local_group) {
 			this_load = avg_load;
@@ -2629,7 +2629,7 @@ static void update_cpu_power(struct sche
 		power >>= SCHED_POWER_SHIFT;
 	}
 
-	sdg->cpu_power_orig = power;
+	sdg->sgp->power_orig = power;
 
 	if (sched_feat(ARCH_POWER))
 		power *= arch_scale_freq_power(sd, cpu);
@@ -2645,7 +2645,7 @@ static void update_cpu_power(struct sche
 		power = 1;
 
 	cpu_rq(cpu)->cpu_power = power;
-	sdg->cpu_power = power;
+	sdg->sgp->power = power;
 }
 
 static void update_group_power(struct sched_domain *sd, int cpu)
@@ -2663,11 +2663,11 @@ static void update_group_power(struct sc
 
 	group = child->groups;
 	do {
-		power += group->cpu_power;
+		power += group->sgp->power;
 		group = group->next;
 	} while (group != child->groups);
 
-	sdg->cpu_power = power;
+	sdg->sgp->power = power;
 }
 
 /*
@@ -2689,7 +2689,7 @@ fix_small_capacity(struct sched_domain *
 	/*
 	 * If ~90% of the cpu_power is still there, we're good.
 	 */
-	if (group->cpu_power * 32 > group->cpu_power_orig * 29)
+	if (group->sgp->power * 32 > group->sgp->power_orig * 29)
 		return 1;
 
 	return 0;
@@ -2769,7 +2769,7 @@ static inline void update_sg_lb_stats(st
 	}
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->cpu_power;
+	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -2786,7 +2786,7 @@ static inline void update_sg_lb_stats(st
 	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1)
 		sgs->group_imb = 1;
 
-	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power,
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
 						SCHED_POWER_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
@@ -2875,7 +2875,7 @@ static inline void update_sd_lb_stats(st
 			return;
 
 		sds->total_load += sgs.group_load;
-		sds->total_pwr += sg->cpu_power;
+		sds->total_pwr += sg->sgp->power;
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
@@ -2960,7 +2960,7 @@ static int check_asym_packing(struct sch
 	if (this_cpu > busiest_cpu)
 		return 0;
 
-	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
+	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->sgp->power,
 				       SCHED_POWER_SCALE);
 	return 1;
 }
@@ -2991,7 +2991,7 @@ static inline void fix_small_imbalance(s
 
 	scaled_busy_load_per_task = sds->busiest_load_per_task
 					 * SCHED_POWER_SCALE;
-	scaled_busy_load_per_task /= sds->busiest->cpu_power;
+	scaled_busy_load_per_task /= sds->busiest->sgp->power;
 
 	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
 			(scaled_busy_load_per_task * imbn)) {
@@ -3005,28 +3005,28 @@ static inline void fix_small_imbalance(s
 	 * moving them.
 	 */
 
-	pwr_now += sds->busiest->cpu_power *
+	pwr_now += sds->busiest->sgp->power *
 			min(sds->busiest_load_per_task, sds->max_load);
-	pwr_now += sds->this->cpu_power *
+	pwr_now += sds->this->sgp->power *
 			min(sds->this_load_per_task, sds->this_load);
 	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
 	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-		sds->busiest->cpu_power;
+		sds->busiest->sgp->power;
 	if (sds->max_load > tmp)
-		pwr_move += sds->busiest->cpu_power *
+		pwr_move += sds->busiest->sgp->power *
 			min(sds->busiest_load_per_task, sds->max_load - tmp);
 
 	/* Amount of load we'd add */
-	if (sds->max_load * sds->busiest->cpu_power <
+	if (sds->max_load * sds->busiest->sgp->power <
 		sds->busiest_load_per_task * SCHED_POWER_SCALE)
-		tmp = (sds->max_load * sds->busiest->cpu_power) /
-			sds->this->cpu_power;
+		tmp = (sds->max_load * sds->busiest->sgp->power) /
+			sds->this->sgp->power;
 	else
 		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-			sds->this->cpu_power;
-	pwr_move += sds->this->cpu_power *
+			sds->this->sgp->power;
+	pwr_move += sds->this->sgp->power *
 			min(sds->this_load_per_task, sds->this_load + tmp);
 	pwr_move /= SCHED_POWER_SCALE;
 
@@ -3072,7 +3072,7 @@ static inline void calculate_imbalance(s
 
 		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
 
-		load_above_capacity /= sds->busiest->cpu_power;
+		load_above_capacity /= sds->busiest->sgp->power;
 	}
 
 	/*
@@ -3088,8 +3088,8 @@ static inline void calculate_imbalance(s
 	max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
 
 	/* How much load to actually move to equalise the imbalance */
-	*imbalance = min(max_pull * sds->busiest->cpu_power,
-		(sds->avg_load - sds->this_load) * sds->this->cpu_power)
+	*imbalance = min(max_pull * sds->busiest->sgp->power,
+		(sds->avg_load - sds->this_load) * sds->this->sgp->power)
 			/ SCHED_POWER_SCALE;
 
 	/*
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -893,16 +893,20 @@ static inline int sd_power_saving_flags(
 	return 0;
 }
 
-struct sched_group {
-	struct sched_group *next;	/* Must be a circular list */
-	atomic_t ref;
-
+struct sched_group_power {
 	/*
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 	 * single CPU.
 	 */
-	unsigned int cpu_power, cpu_power_orig;
+	unsigned int power, power_orig;
+};
+
+struct sched_group {
+	struct sched_group *next;	/* Must be a circular list */
+	atomic_t ref;
+
 	unsigned int group_weight;
+	struct sched_group_power *sgp;
 
 	/*
 	 * The CPUs this group covers.

[-- Attachment #3: sched-domain-foo-2.patch --]
[-- Type: text/x-patch, Size: 8956 bytes --]

Subject: sched: Allow for overlapping sched_domain spans
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri Jul 15 10:35:52 CEST 2011

Allow for sched_domain spans that overlap by giving such domains their
own sched_group list instead of sharing the sched_groups amongst
each other.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-yr71izj2souh2dbifdh6j68y@git.kernel.org
---
 include/linux/sched.h   |    2 
 kernel/sched.c          |  157 +++++++++++++++++++++++++++++++++++++++---------
 kernel/sched_features.h |    2 
 3 files changed, 132 insertions(+), 29 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -844,6 +844,7 @@ enum cpu_idle_type {
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
+#define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 
 enum powersavings_balance_level {
 	POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
@@ -894,6 +895,7 @@ static inline int sd_power_saving_flags(
 }
 
 struct sched_group_power {
+	atomic_t ref;
 	/*
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 	 * single CPU.
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -6767,10 +6767,36 @@ static struct root_domain *alloc_rootdom
 	return rd;
 }
 
+static void free_sched_groups(struct sched_group *sg, int free_sgp)
+{
+	struct sched_group *tmp, *first;
+
+	if (!sg)
+		return;
+
+	first = sg;
+	do {
+		tmp = sg->next;
+
+		if (free_sgp && atomic_dec_and_test(&sg->sgp->ref))
+			kfree(sg->sgp);
+
+		kfree(sg);
+		sg = tmp;
+	} while (sg != first);
+}
+
 static void free_sched_domain(struct rcu_head *rcu)
 {
 	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-	if (atomic_dec_and_test(&sd->groups->ref)) {
+
+	/*
+	 * If it's an overlapping domain it has private groups, iterate and
+	 * nuke them all.
+	 */
+	if (sd->flags & SD_OVERLAP) {
+		free_sched_groups(sd->groups, 1);
+	} else if (atomic_dec_and_test(&sd->groups->ref)) {
 		kfree(sd->groups->sgp);
 		kfree(sd->groups);
 	}
@@ -6960,15 +6986,73 @@ struct sched_domain_topology_level;
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
+#define SDTL_OVERLAP	0x01
+
 struct sched_domain_topology_level {
 	sched_domain_init_f init;
 	sched_domain_mask_f mask;
+	int		    flags;
 	struct sd_data      data;
 };
 
-/*
- * Assumes the sched_domain tree is fully constructed
- */
+static int
+build_overlap_sched_groups(struct sched_domain *sd, int cpu)
+{
+	struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+	const struct cpumask *span = sched_domain_span(sd);
+	struct cpumask *covered = sched_domains_tmpmask;
+	struct sd_data *sdd = sd->private;
+	struct sched_domain *child;
+	int i;
+
+	cpumask_clear(covered);
+
+	for_each_cpu(i, span) {
+		struct cpumask *sg_span;
+
+		if (cpumask_test_cpu(i, covered))
+			continue;
+
+		sg = kzalloc_node(sizeof(struct sched_group), GFP_KERNEL,
+				cpu_to_node(i));
+
+		if (!sg)
+			goto fail;
+
+		sg_span = sched_group_cpus(sg);
+
+		child = *per_cpu_ptr(sdd->sd, i);
+		if (child->child) {
+			child = child->child;
+			*sg_span = *sched_domain_span(child);
+		} else
+			cpumask_set_cpu(i, sg_span);
+
+		cpumask_or(covered, covered, sg_span);
+
+		sg->sgp = *per_cpu_ptr(sdd->sgp, cpumask_first(sg_span));
+		atomic_inc(&sg->sgp->ref);
+
+		if (cpumask_test_cpu(cpu, sg_span))
+			groups = sg;
+
+		if (!first)
+			first = sg;
+		if (last)
+			last->next = sg;
+		last = sg;
+		last->next = first;
+	}
+	sd->groups = groups;
+
+	return 0;
+
+fail:
+	free_sched_groups(first, 0);
+
+	return -ENOMEM;
+}
+
 static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
 {
 	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
@@ -6980,23 +7064,21 @@ static int get_group(int cpu, struct sd_
 	if (sg) {
 		*sg = *per_cpu_ptr(sdd->sg, cpu);
 		(*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu);
+		atomic_set(&(*sg)->sgp->ref, 1); /* for claim_allocations */
 	}
 
 	return cpu;
 }
 
 /*
- * build_sched_groups takes the cpumask we wish to span, and a pointer
- * to a function which identifies what group(along with sched group) a CPU
- * belongs to. The return value of group_fn must be a >= 0 and < nr_cpu_ids
- * (due to the fact that we keep track of groups covered with a struct cpumask).
- *
  * build_sched_groups will build a circular linked list of the groups
  * covered by the given span, and will set each group's ->cpumask correctly,
  * and ->cpu_power to 0.
+ *
+ * Assumes the sched_domain tree is fully constructed
  */
-static void
-build_sched_groups(struct sched_domain *sd)
+static int
+build_sched_groups(struct sched_domain *sd, int cpu)
 {
 	struct sched_group *first = NULL, *last = NULL;
 	struct sd_data *sdd = sd->private;
@@ -7004,6 +7086,12 @@ build_sched_groups(struct sched_domain *
 	struct cpumask *covered;
 	int i;
 
+	get_group(cpu, sdd, &sd->groups);
+	atomic_inc(&sd->groups->ref);
+
+	if (cpu != cpumask_first(sched_domain_span(sd)))
+		return 0;
+
 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
 
@@ -7035,6 +7123,8 @@ build_sched_groups(struct sched_domain *
 		last = sg;
 	}
 	last->next = first;
+
+	return 0;
 }
 
 /*
@@ -7049,12 +7139,17 @@ build_sched_groups(struct sched_domain *
  */
 static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 {
-	WARN_ON(!sd || !sd->groups);
+	struct sched_group *sg = sd->groups;
 
-	if (cpu != group_first_cpu(sd->groups))
-		return;
+	WARN_ON(!sd || !sg);
 
-	sd->groups->group_weight = cpumask_weight(sched_group_cpus(sd->groups));
+	do {
+		sg->group_weight = cpumask_weight(sched_group_cpus(sg));
+		sg = sg->next;
+	} while (sg != sd->groups);
+
+	if (cpu != group_first_cpu(sg))
+		return;
 
 	update_group_power(sd, cpu);
 }
@@ -7175,16 +7270,15 @@ static enum s_alloc __visit_domain_alloc
 static void claim_allocations(int cpu, struct sched_domain *sd)
 {
 	struct sd_data *sdd = sd->private;
-	struct sched_group *sg = sd->groups;
 
 	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
 	*per_cpu_ptr(sdd->sd, cpu) = NULL;
 
-	if (cpu == cpumask_first(sched_group_cpus(sg))) {
-		WARN_ON_ONCE(*per_cpu_ptr(sdd->sg, cpu) != sg);
+	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
 		*per_cpu_ptr(sdd->sg, cpu) = NULL;
+
+	if (atomic_read(&(*per_cpu_ptr(sdd->sgp, cpu))->ref))
 		*per_cpu_ptr(sdd->sgp, cpu) = NULL;
-	}
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -7209,7 +7303,7 @@ static struct sched_domain_topology_leve
 #endif
 	{ sd_init_CPU, cpu_cpu_mask, },
 #ifdef CONFIG_NUMA
-	{ sd_init_NODE, cpu_node_mask, },
+	{ sd_init_NODE, cpu_node_mask, SDTL_OVERLAP, },
 	{ sd_init_ALLNODES, cpu_allnodes_mask, },
 #endif
 	{ NULL, },
@@ -7277,7 +7371,9 @@ static void __sdt_free(const struct cpum
 		struct sd_data *sdd = &tl->data;
 
 		for_each_cpu(j, cpu_map) {
-			kfree(*per_cpu_ptr(sdd->sd, j));
+			struct sched_domain *sd = *per_cpu_ptr(sdd->sd, j);
+			if (sd && (sd->flags & SD_OVERLAP))
+				free_sched_groups(sd->groups, 0);
 			kfree(*per_cpu_ptr(sdd->sg, j));
 			kfree(*per_cpu_ptr(sdd->sgp, j));
 		}
@@ -7329,8 +7425,11 @@ static int build_sched_domains(const str
 		struct sched_domain_topology_level *tl;
 
 		sd = NULL;
-		for (tl = sched_domain_topology; tl->init; tl++)
+		for (tl = sched_domain_topology; tl->init; tl++) {
 			sd = build_sched_domain(tl, &d, cpu_map, attr, sd, i);
+			if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
+				sd->flags |= SD_OVERLAP;
+		}
 
 		while (sd->child)
 			sd = sd->child;
@@ -7342,13 +7441,13 @@ static int build_sched_domains(const str
 	for_each_cpu(i, cpu_map) {
 		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
 			sd->span_weight = cpumask_weight(sched_domain_span(sd));
-			get_group(i, sd->private, &sd->groups);
-			atomic_inc(&sd->groups->ref);
-
-			if (i != cpumask_first(sched_domain_span(sd)))
-				continue;
-
-			build_sched_groups(sd);
+			if (sd->flags & SD_OVERLAP) {
+				if (build_overlap_sched_groups(sd, i))
+					goto error;
+			} else {
+				if (build_sched_groups(sd, i))
+					goto error;
+			}
 		}
 	}
 
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -70,3 +70,5 @@ SCHED_FEAT(NONIRQ_POWER, 1)
  * using the scheduler IPI. Reduces rq->lock contention/bounces.
  */
 SCHED_FEAT(TTWU_QUEUE, 1)
+
+SCHED_FEAT(FORCE_SD_OVERLAP, 0)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-18 21:35           ` Peter Zijlstra
@ 2011-07-19  4:44             ` Anton Blanchard
  2011-07-19 10:21               ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Anton Blanchard @ 2011-07-19  4:44 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

On Mon, 18 Jul 2011 23:35:56 +0200
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Anton, could you test the below two patches on that machine?
> 
> It should make things boot again. I don't have a machine nearly
> big enough to trigger any of this, so I tested the new code paths by
> setting FORCE_SD_OVERLAP in /debug/sched_features; any review
> of the error paths would be much appreciated.

I get an oops in slub code:

NIP [c000000000197d30] .deactivate_slab+0x1b0/0x200
LR [c000000000199d94] .__slab_alloc+0xb4/0x5a0
[c000000000199d94] .__slab_alloc+0xb4/0x5a0
[c00000000019ac98] .kmem_cache_alloc_node_trace+0xa8/0x260
[c00000000007eb70] .build_sched_domains+0xa60/0xb90
[c000000000a16a98] .sched_init_smp+0xa8/0x228
[c000000000a00274] .kernel_init+0x10c/0x1fc
[c00000000002324c] .kernel_thread+0x54/0x70

I'm guessing it's a result of some nodes not having any local memory,
but I'm a bit surprised I'm not seeing it elsewhere.

Investigating.

> Also, could you send me the node_distance table for that machine? I'm
> curious what the interconnects look like on that thing.

Our node distances are a bit arbitrary (I make them up based on
information given to us in the device tree). In terms of memory we have
a maximum of three levels. To give some gross estimates, on-chip memory
might be 30GB/sec, on-node memory 10-15GB/sec, and off-node memory
5GB/sec.

The only thing we tweak with node distances is to make sure we go into
node reclaim before going off node:

/*
 * Before going off node we want the VM to try and reclaim from the local
 * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
 * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
 * 20, we never reclaim and go off node straight away.
 *
 * To fix this we choose a smaller value of RECLAIM_DISTANCE.
 */
#define RECLAIM_DISTANCE 10

Anton

node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31 
  0:  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  1:  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  2:  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  3:  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  4:  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  5:  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  6:  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  7:  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  8:  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  9:  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 10:  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 11:  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 12:  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 13:  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 14:  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 15:  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 16:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40   0   0   0   0 
 17:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40   0   0   0   0 
 18:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40   0   0   0   0 
 19:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40   0   0   0   0 
 20:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40   0   0   0   0 
 21:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40   0   0   0   0 
 22:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40   0   0   0   0 
 23:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40   0   0   0   0 
 24:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 25:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 26:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 27:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 28:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20   0   0   0   0 
 29:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20   0   0   0   0 
 30:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20   0   0   0   0 
 31:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10   0   0   0   0 


^ permalink raw reply	[flat|nested] 43+ messages in thread

 22:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40   0   0   0   0 
 23:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40   0   0   0   0 
 24:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 25:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 26:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 27:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 28:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20   0   0   0   0 
 29:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20   0   0   0   0 
 30:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20   0   0   0   0 
 31:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10   0   0   0   0 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-19  4:44             ` Anton Blanchard
@ 2011-07-19 10:21               ` Peter Zijlstra
  2011-07-20  2:03                 ` Anton Blanchard
  2011-07-20 10:14                 ` Anton Blanchard
  0 siblings, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-19 10:21 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

On Tue, 2011-07-19 at 14:44 +1000, Anton Blanchard wrote:
> 
> Our node distances are a bit arbitrary (I make them up based on
> information given to us in the device tree). In terms of memory we have
> a maximum of three levels. To give some gross estimates, on chip memory
> might be 30GB/sec, on node memory 10-15GB/sec and off node memory
> 5GB/sec.
> 
> The only thing we tweak with node distances is to make sure we go into
> node reclaim before going off node:
> 
> /*
>  * Before going off node we want the VM to try and reclaim from the local
>  * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
>  * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
>  * 20, we never reclaim and go off node straight away.
>  *
>  * To fix this we choose a smaller value of RECLAIM_DISTANCE.
>  */
> #define RECLAIM_DISTANCE 10

> node distances:
> node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31 
>   0:  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>   1:  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>   2:  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>   3:  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>   4:  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>   5:  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>   6:  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>   7:  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>   8:  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>   9:  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>  10:  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>  11:  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>  12:  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>  13:  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>  14:  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>  15:  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
>  16:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40   0   0   0   0 
>  17:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40   0   0   0   0 
>  18:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40   0   0   0   0 
>  19:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40   0   0   0   0 
>  20:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40   0   0   0   0 
>  21:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40   0   0   0   0 
>  22:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40   0   0   0   0 
>  23:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40   0   0   0   0 
>  24:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
>  25:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
>  26:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
>  27:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
>  28:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20   0   0   0   0 
>  29:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20   0   0   0   0 
>  30:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20   0   0   0   0 
>  31:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10   0   0   0   0 


That looks very strange indeed... up to node 23 there is the normal
symmetric matrix with all the diagonal elements at 10 (as we would
expect for local access), 4x4 sub-matrices stacked around the diagonal
at 20, suggesting a single-hop distance, and the rest at 40 for
everything further out.

But rows 24-27 and columns 28-31 are way weird; how can that ever be?
Aren't the interconnects symmetric, thus mandating a fully symmetric
matrix? That is, how can traffic from node 23 (row) to node 28 (column)
have infinite bandwidth (0) yet traffic from node 28 (row) to node 23
(column) have a multi-hop distance of 40?

So the idea I had to generate numa sched domains from the node distance
( http://marc.info/?l=linux-kernel&m=130218515520540 ), would that still
work for you? [it does assume a symmetric matrix ]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-19 10:21               ` Peter Zijlstra
@ 2011-07-20  2:03                 ` Anton Blanchard
  2011-07-20 10:14                 ` Anton Blanchard
  1 sibling, 0 replies; 43+ messages in thread
From: Anton Blanchard @ 2011-07-20  2:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds


Hi,

> That looks very strange indeed.. up to node 23 there is the normal
> symmetric matrix with all the trace elements on 10 (as we would expect
> for local access), and some 4x4 sub-matrix stacked around the trace
> with 20, suggesting a single hop distance, and the rest on 40 being
> out-there.
> 
> But row 24-27 and column 28-31 are way weird, how can that ever be?
> Aren't the inter-connects symmetric and thus mandating a fully
> symmetric matrix? That is, how can traffic from node 23 (row) to node
> 28 (column) have inf bandwidth (0) yet traffic from node 28 (row) to
> node 23 (column) have a multi-hop distance of 40.

Good point, it definitely makes no sense. It looks like a bug in
numactl, the raw data looks reasonable:

# cat /sys/devices/system/node/node?/distance node??/distance
10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10

Yet another bug to track down :(

> So the idea I had to generate numa sched domains from the node
> distance ( http://marc.info/?l=linux-kernel&m=130218515520540 ),
> would that still work for you? [it does assume a symmetric matrix ]

It should work for us and it makes our NUMA memory and scheduler
domains more consistent. Nice!

Anton

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-19 10:21               ` Peter Zijlstra
  2011-07-20  2:03                 ` Anton Blanchard
@ 2011-07-20 10:14                 ` Anton Blanchard
  2011-07-20 10:45                   ` Peter Zijlstra
  1 sibling, 1 reply; 43+ messages in thread
From: Anton Blanchard @ 2011-07-20 10:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds


Hi Peter,

> That looks very strange indeed.. up to node 23 there is the normal
> symmetric matrix with all the trace elements on 10 (as we would expect
> for local access), and some 4x4 sub-matrix stacked around the trace
> with 20, suggesting a single hop distance, and the rest on 40 being
> out-there.

I retested with the latest version of numactl, and get correct results.

I worked out why the patches don't boot: we weren't allocating any
space for the cpumask, so we ran off the end of the allocation.

Should we also use cpumask_copy instead of open coding it? I added that
too.

Anton

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2011-07-20 01:54:08.191668781 -0500
+++ linux-2.6/kernel/sched.c	2011-07-20 04:45:36.203750525 -0500
@@ -7020,8 +7020,8 @@
 		if (cpumask_test_cpu(i, covered))
 			continue;
 
-		sg = kzalloc_node(sizeof(struct sched_group), GFP_KERNEL,
-				cpu_to_node(i));
+		sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
+				  GFP_KERNEL, cpu_to_node(i));
 
 		if (!sg)
 			goto fail;
@@ -7031,7 +7031,7 @@
 		child = *per_cpu_ptr(sdd->sd, i);
 		if (child->child) {
 			child = child->child;
-			*sg_span = *sched_domain_span(child);
+			cpumask_copy(sg_span, sched_domain_span(child));
 		} else
 			cpumask_set_cpu(i, sg_span);
 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-20 10:14                 ` Anton Blanchard
@ 2011-07-20 10:45                   ` Peter Zijlstra
  2011-07-20 12:14                     ` Anton Blanchard
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-20 10:45 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

On Wed, 2011-07-20 at 20:14 +1000, Anton Blanchard wrote:

> > That looks very strange indeed.. up to node 23 there is the normal
> > symmetric matrix with all the trace elements on 10 (as we would expect
> > for local access), and some 4x4 sub-matrix stacked around the trace
> > with 20, suggesting a single hop distance, and the rest on 40 being
> > out-there.
> 
> I retested with the latest version of numactl, and get correct results.

One less thing to worry about ;-)

> I worked out why the patches don't boot, we weren't allocating any
> space for the cpumask and ran off the end of the allocation.

Gah! that's not the first time I made that particular mistake :/

> Should we also use cpumask_copy instead of open coding it? I added that
> too.

Probably, I looked for cpumask_assign() and on failing to find that used
the direct assignment.

So with that fix the patch makes the machine happy again?

Thanks!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-20 10:45                   ` Peter Zijlstra
@ 2011-07-20 12:14                     ` Anton Blanchard
  2011-07-20 14:40                       ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Anton Blanchard @ 2011-07-20 12:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds


Hi Peter,

> So with that fix the patch makes the machine happy again?

Yes, the machine looks fine with the patches applied. Thanks!

Anton

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-20 12:14                     ` Anton Blanchard
@ 2011-07-20 14:40                       ` Linus Torvalds
  2011-07-20 14:58                         ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2011-07-20 14:40 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Peter Zijlstra, mahesh, linux-kernel, linuxppc-dev, mingo, benh

On Wed, Jul 20, 2011 at 5:14 AM, Anton Blanchard <anton@samba.org> wrote:
>
>> So with that fix the patch makes the machine happy again?
>
> Yes, the machine looks fine with the patches applied. Thanks!

Ok, so what's the situation for 3.0 (I'm waiting for some RCU
resolution now)? Anton's patch may be small, but that's just the tiny
fixup patch to Peter's much scarier one ;)

                            Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-20 14:40                       ` Linus Torvalds
@ 2011-07-20 14:58                         ` Peter Zijlstra
  2011-07-20 16:04                           ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-20 14:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Anton Blanchard, mahesh, linux-kernel, linuxppc-dev, mingo, benh

On Wed, 2011-07-20 at 07:40 -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2011 at 5:14 AM, Anton Blanchard <anton@samba.org> wrote:
> >
> >> So with that fix the patch makes the machine happy again?
> >
> > Yes, the machine looks fine with the patches applied. Thanks!
> 
> Ok, so what's the situation for 3.0 (I'm waiting for some RCU
> resolution now)? Anton's patch may be small, but that's just the tiny
> fixup patch to Peter's much scarier one ;)

Right, so we can either merge my scary patches now and have 3.0 boot on
16+ node machines (and risk breaking something), or delay them until
3.0.1 and have 16+ node machines suffer a little.

The alternative quick hack is simply to disable the node domain, but
that'll be detrimental to regular machines: the top domain that used to
have NODE sd_flags will now have ALL_NODE sd_flags, which are much less
aggressive.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-20 14:58                         ` Peter Zijlstra
@ 2011-07-20 16:04                           ` Linus Torvalds
  2011-07-20 16:42                             ` Ingo Molnar
  2011-07-20 16:42                             ` Peter Zijlstra
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2011-07-20 16:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Anton Blanchard, mahesh, linux-kernel, linuxppc-dev, mingo, benh

On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> Right, so we can either merge my scary patches now and have 3.0 boot on
> 16+ node machines (and risk breaking something), or delay them until
> 3.0.1 and have 16+ node machines suffer a little.

So how much impact does your scary patch have on machines that don't
have multiple nodes? If it's a "the code isn't even called by normal
machines" kind of setup, I don't think I care a lot.

                  Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-20 16:04                           ` Linus Torvalds
@ 2011-07-20 16:42                             ` Ingo Molnar
  2011-07-20 16:42                             ` Peter Zijlstra
  1 sibling, 0 replies; 43+ messages in thread
From: Ingo Molnar @ 2011-07-20 16:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Anton Blanchard, mahesh, linux-kernel,
	linuxppc-dev, benh


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > Right, so we can either merge my scary patches now and have 3.0 
> > boot on 16+ node machines (and risk breaking something), or delay 
> > them until 3.0.1 and have 16+ node machines suffer a little.
> 
> So how much impact does your scary patch have on machines that 
> don't have multiple nodes? If it's a "the code isn't even called by 
> normal machines" kind of setup, I don't think I care a lot.

NUMA systems will trigger the new code - not just 'weird NUMA 
systems' - but i still think we could try the patches, the code looks 
straightforward and i booted them on NUMA systems and it all seems 
fine so far.

Anyway, i'll push the new sched/urgent branch out in a few minutes 
and then you'll see the full patches in the commit notifications.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-20 16:04                           ` Linus Torvalds
  2011-07-20 16:42                             ` Ingo Molnar
@ 2011-07-20 16:42                             ` Peter Zijlstra
  2011-07-20 17:29                               ` [tip:sched/urgent] sched: Avoid creating superfluous NUMA domains on non-NUMA systems tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-20 16:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Anton Blanchard, mahesh, linux-kernel, linuxppc-dev, mingo, benh

On Wed, 2011-07-20 at 09:04 -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > Right, so we can either merge my scary patches now and have 3.0 boot on
> > 16+ node machines (and risk breaking something), or delay them until
> > 3.0.1 and have 16+ node machines suffer a little.
> 
> So how much impact does your scary patch have on machines that don't
> have multiple nodes? If it's a "the code isn't even called by normal
> machines" kind of setup, I don't think I care a lot.

Hmm, it does get called, but it looks relatively straightforward to
make it so that it doesn't. Let me try that.

Yes, the below works nicely (on top of the previous two).

Built and boot tested on a single-node and multi-node x86_64.

---
Subject: sched: Avoid creating superfluous domains
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Jul 20 18:34:30 CEST 2011

When creating sched_domains, stop when we've covered the entire target
span instead of continuing to create domains, only to later find
they're redundant and throw them away again.

This prevents single node systems from touching the funny NUMA
sched_domain creation code.

Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -7436,6 +7436,8 @@ static int build_sched_domains(const str
 			sd = build_sched_domain(tl, &d, cpu_map, attr, sd, i);
 			if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
 				sd->flags |= SD_OVERLAP;
+			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
+				break;
 		}
 
 		while (sd->child)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [tip:sched/urgent] sched: Avoid creating superfluous NUMA domains on non-NUMA systems
  2011-07-20 16:42                             ` Peter Zijlstra
@ 2011-07-20 17:29                               ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 43+ messages in thread
From: tip-bot for Peter Zijlstra @ 2011-07-20 17:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, anton, hpa, mingo, torvalds, a.p.zijlstra, tglx, mingo

Commit-ID:  d110235d2c331c4f79e0879f51104be79e17a469
Gitweb:     http://git.kernel.org/tip/d110235d2c331c4f79e0879f51104be79e17a469
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Wed, 20 Jul 2011 18:42:57 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 20 Jul 2011 18:54:33 +0200

sched: Avoid creating superfluous NUMA domains on non-NUMA systems

When creating sched_domains, stop when we've covered the entire
target span instead of continuing to create domains, only to
later find they're redundant and throw them away again.

This prevents single node systems from touching the funny NUMA
sched_domain creation code and reduces the risks of the new
SD_OVERLAP code.

Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Anton Blanchard <anton@samba.org>
Cc: mahesh@linux.vnet.ibm.com
Cc: benh@kernel.crashing.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1311180177.29152.57.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 921adf6..14168c4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7436,6 +7436,8 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 			sd = build_sched_domain(tl, &d, cpu_map, attr, sd, i);
 			if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
 				sd->flags |= SD_OVERLAP;
+			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
+				break;
 		}
 
 		while (sd->child)

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2011-07-20 17:30 UTC | newest]

Thread overview: 43+ messages
-- links below jump to the message on this page --
2011-07-07 10:22 [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 Mahesh J Salgaonkar
2011-07-07 10:59 ` Peter Zijlstra
2011-07-07 11:55   ` Mahesh J Salgaonkar
2011-07-07 12:28     ` Peter Zijlstra
2011-07-14  0:34   ` Anton Blanchard
2011-07-14  4:35     ` Anton Blanchard
2011-07-14 13:16       ` Peter Zijlstra
2011-07-15  0:45         ` Anton Blanchard
2011-07-15  8:37           ` Peter Zijlstra
2011-07-18 21:35           ` Peter Zijlstra
2011-07-19  4:44             ` Anton Blanchard
2011-07-19 10:21               ` Peter Zijlstra
2011-07-20  2:03                 ` Anton Blanchard
2011-07-20 10:14                 ` Anton Blanchard
2011-07-20 10:45                   ` Peter Zijlstra
2011-07-20 12:14                     ` Anton Blanchard
2011-07-20 14:40                       ` Linus Torvalds
2011-07-20 14:58                         ` Peter Zijlstra
2011-07-20 16:04                           ` Linus Torvalds
2011-07-20 16:42                             ` Ingo Molnar
2011-07-20 16:42                             ` Peter Zijlstra
2011-07-20 17:29                               ` [tip:sched/urgent] sched: Avoid creating superfluous NUMA domains on non-NUMA systems tip-bot for Peter Zijlstra
