* [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 (43+ messages in thread)

From: Mahesh J Salgaonkar @ 2011-07-07 10:22 UTC
To: linux-kernel, linuxppc-dev
Cc: a.p.zijlstra, mingo, anton, benh, torvalds

Hi,

linux-3.0-rc fails to boot on a POWER7 system with 1 TB of RAM and 896
CPUs. While the initial boot log shows a soft lockup [1], the machine
hangs afterwards. Dropping into xmon shows the CPUs are all stuck at:

--------------------
cpu 0xa: Vector: 100 (System Reset) at [c000000fae51fae0]
    pc: c0000000000596b8: .plpar_hcall_norets+0x80/0xd0
    lr: c00000000005b9a4: .pseries_dedicated_idle_sleep+0x194/0x210
    sp: c000000fae51fd60
   msr: 8000000000089032
  current = 0xc000000fae49d990
  paca    = 0xc00000000ebb1900
    pid   = 0, comm = kworker/0:1

cpu 0x41: Vector: 100 (System Reset) at [c000000fac01bae0]
    pc: c0000000000596b8: .plpar_hcall_norets+0x80/0xd0
    lr: c00000000005b9a4: .pseries_dedicated_idle_sleep+0x194/0x210
    sp: c000000fac01bd60
   msr: 8000000000089032
  current = 0xc000000faefbf210
  paca    = 0xc00000000ebba280
    pid   = 0, comm = kworker/0:1

cpu 0x21: Vector: 100 (System Reset) at [c000000fae9abae0]
    pc: c0000000000596b8: .plpar_hcall_norets+0x80/0xd0
    lr: c00000000005b9a4: .pseries_dedicated_idle_sleep+0x194/0x210
    sp: c000000fae9abd60
   msr: 8000000000089032
  current = 0xc000000fae998590
  paca    = 0xc00000000ebb5280
    pid   = 0, comm = kworker/0:1

cpu 0xb8: Vector: 100 (System Reset) at [c000000fab3dbae0]
    pc: c0000000000596b8: .plpar_hcall_norets+0x80/0xd0
    lr: c00000000005b9a4: .pseries_dedicated_idle_sleep+0x194/0x210
    sp: c000000fab3dbd60
   msr: 8000000000089032
  current = 0xc000000fab3a2710
  paca    = 0xc00000000ebccc00
    pid   = 0, comm = kworker/0:1
......
......

The same backtrace shows up on all the CPUs:

a:mon> t
[link register   ] c00000000005b9a4 .pseries_dedicated_idle_sleep+0x194/0x210
[c000000fae51fd60] 00000000134d0000 (unreliable)
[c000000fae51fe20] c000000000018b64 .cpu_idle+0x164/0x210
[c000000fae51fed0] c0000000005d55b0 .start_secondary+0x348/0x354
[c000000fae51ff90] c000000000009268 .start_secondary_prolog+0x10/0x14
a:mon> S
msr  = 8000000000001032   sprg0 = 0000000000000000
pvr  = 00000000003f0201   sprg1 = c00000000ebb1900
dec  = 0000000030fb5b4f   sprg2 = c00000000ebb1900
sp   = c000000fae51f440   sprg3 = 000000000000000a
toc  = c000000000e21f90   dar   = c000011aee0c20e8
a:mon>
--------------------

2.6.39 booted fine on the system, and a git bisect shows commit
cd4ea6ae ("sched: Change NODE sched_domain group creation") as the
cause.

Thanks,
-Mahesh.

[1]:
POWER7 performance monitor hardware support registered
Brought up 896 CPUs
Enabling Asymmetric SMT scheduling
BUG: soft lockup - CPU#0 stuck for 22s! [swapper:1]
Modules linked in:
NIP: c000000000074b90 LR: c00000000008a1c4 CTR: 0000000000000000
REGS: c000000fae25f9c0 TRAP: 0901   Not tainted  (3.0.0-rc6)
MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24000088  XER: 00000004
TASK = c000000fae248490[1] 'swapper' THREAD: c000000fae25c000 CPU: 0
GPR00: 0000e2a55cbeec50 c000000fae25fc40 c000000000e21f90 c000007b2b34cb00
GPR04: 0000000000000100 0000000000000100 c000011adcf23418 0000000000000000
GPR08: 0000000000000000 c000008b2b7d4480 c000007b2b35ef80 00000000000024ac
GPR12: 0000000044000042 c00000000ebb0000
NIP [c000000000074b90] .update_group_power+0x50/0x190
LR [c00000000008a1c4] .build_sched_domains+0x434/0x490
Call Trace:
[c000000fae25fc40] [c000000fae25fce0] 0xc000000fae25fce0 (unreliable)
[c000000fae25fce0] [c00000000008a1c4] .build_sched_domains+0x434/0x490
[c000000fae25fdd0] [c000000000867370] .sched_init_smp+0xa8/0x224
[c000000fae25fee0] [c000000000850274] .kernel_init+0x10c/0x1fc
[c000000fae25ff90] [c000000000023884] .kernel_thread+0x54/0x70
Instruction dump:
f821ff61 ebc2b1a0 7c7f1b78 7c9c2378 e9230008 eba30010 2fa90000 419e0054
e9490010 38000000 7d495378 60000000 <8169000c> e9290000 7faa4800 7c005a14
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

From: Peter Zijlstra @ 2011-07-07 10:59 UTC
To: mahesh
Cc: linux-kernel, linuxppc-dev, mingo, anton, benh, torvalds

On Thu, 2011-07-07 at 15:52 +0530, Mahesh J Salgaonkar wrote:
>
> 2.6.39 booted fine on the system and a git bisect shows commit cd4ea6ae -
> "sched: Change NODE sched_domain group creation" as the cause.

Weird, there's no locking anywhere around there. The typical problems
with this patch set were massive explosions due to bad pointers etc.,
but not silent hangs.

The code it's stuck at:

> [1]:
> POWER7 performance monitor hardware support registered
> Brought up 896 CPUs
> Enabling Asymmetric SMT scheduling
> BUG: soft lockup - CPU#0 stuck for 22s! [swapper:1]
> Modules linked in:
> NIP: c000000000074b90 LR: c00000000008a1c4 CTR: 0000000000000000
> REGS: c000000fae25f9c0 TRAP: 0901   Not tainted  (3.0.0-rc6)
> MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24000088  XER: 00000004
> TASK = c000000fae248490[1] 'swapper' THREAD: c000000fae25c000 CPU: 0
> GPR00: 0000e2a55cbeec50 c000000fae25fc40 c000000000e21f90 c000007b2b34cb00
> GPR04: 0000000000000100 0000000000000100 c000011adcf23418 0000000000000000
> GPR08: 0000000000000000 c000008b2b7d4480 c000007b2b35ef80 00000000000024ac
> GPR12: 0000000044000042 c00000000ebb0000
> NIP [c000000000074b90] .update_group_power+0x50/0x190
> LR [c00000000008a1c4] .build_sched_domains+0x434/0x490
> Call Trace:
> [c000000fae25fc40] [c000000fae25fce0] 0xc000000fae25fce0 (unreliable)
> [c000000fae25fce0] [c00000000008a1c4] .build_sched_domains+0x434/0x490
> [c000000fae25fdd0] [c000000000867370] .sched_init_smp+0xa8/0x224
> [c000000fae25fee0] [c000000000850274] .kernel_init+0x10c/0x1fc
> [c000000fae25ff90] [c000000000023884] .kernel_thread+0x54/0x70
> Instruction dump:
> f821ff61 ebc2b1a0 7c7f1b78 7c9c2378 e9230008 eba30010 2fa90000 419e0054
> e9490010 38000000 7d495378 60000000 <8169000c> e9290000 7faa4800 7c005a14

doesn't contain any locks; it's simply looping over all the CPUs, and
with that many I can imagine it takes a while, but getting 'stuck' there
is unexpected to say the least.

Surely this isn't the first multi-node P7 to boot a kernel with this
patch? If my git foo is any good it hit -next on the 23rd of May.

I guess what I'm asking is: do smaller P7 machines boot? And if so, is
there any difference except size?

How many nodes does the thing have anyway, 28? Hmm, that could mean it's
the first machine with >16 nodes to boot this, which would make it
trigger the magic ALL_NODES crap.

Let me dig around there.
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

From: Mahesh J Salgaonkar @ 2011-07-07 11:55 UTC
To: Peter Zijlstra
Cc: linuxppc-dev, linux-kernel, anton, mingo, torvalds

On 2011-07-07 12:59:35 Thu, Peter Zijlstra wrote:
> Weird, there's no locking anywhere around there. The typical problems
> with this patch set were massive explosions due to bad pointers etc.,
> but not silent hangs.
>
> [...]
>
> Surely this isn't the first multi-node P7 to boot a kernel with this
> patch? If my git foo is any good it hit -next on the 23rd of May.
>
> I guess what I'm asking is: do smaller P7 machines boot? And if so, is
> there any difference except size?

Yes, the smaller P7 machine that I have with 20 CPUs and 2 GB of RAM
boots fine with 3.0.0-rc.

> How many nodes does the thing have anyway, 28? Hmm, that could mean
> it's the first machine with >16 nodes to boot this, which would make
> it trigger the magic ALL_NODES crap.

The P7 machine where the kernel fails to boot shows the following dmesg
output w.r.t. the node map:

---------------------------
Zone PFN ranges:
  DMA      0x00000000 -> 0x01229000
  Normal   empty
Movable zone start PFN for each node
early_node_map[12] active PFN ranges
   0: 0x00000000 -> 0x000fd000
   4: 0x000fd000 -> 0x002fb000
   5: 0x002fb000 -> 0x004b9000
   6: 0x004b9000 -> 0x006b9000
   8: 0x006b9000 -> 0x007b5000
  12: 0x007b5000 -> 0x008b5000
  16: 0x008b5000 -> 0x009b1000
  20: 0x009b1000 -> 0x00bb1000
  21: 0x00bb1000 -> 0x00db1000
  22: 0x00db1000 -> 0x00fb1000
  23: 0x00fb1000 -> 0x011b1000
  28: 0x011b1000 -> 0x01229000
Could not find start_pfn for node 1
Could not find start_pfn for node 2
Could not find start_pfn for node 3
Could not find start_pfn for node 7
Could not find start_pfn for node 9
Could not find start_pfn for node 10
Could not find start_pfn for node 11
Could not find start_pfn for node 13
Could not find start_pfn for node 14
Could not find start_pfn for node 15
Could not find start_pfn for node 17
Could not find start_pfn for node 18
Could not find start_pfn for node 19
Could not find start_pfn for node 29
Could not find start_pfn for node 30
Could not find start_pfn for node 31
[boot]0015 Setup Done
PERCPU: Embedded 1 pages/cpu @c000000013c00000 s31488 r0 d34048 u65536
Built 28 zonelists in Node order, mobility grouping on.  Total pages: 19026032
Policy zone: DMA
Kernel command line: root=/dev/mapper/vg_nish1-lv_root ro rd_LVM_LV=vg_nish1/lv_root rd_LVM_LV=VolGroup/lv_swap rd_LVM_LV=vg_nish1/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us console=hvc0i memblock=debug
PID hash table entries: 4096 (order: -1, 32768 bytes)
freeing bootmem node 0
freeing bootmem node 4
freeing bootmem node 5
freeing bootmem node 6
freeing bootmem node 8
freeing bootmem node 12
freeing bootmem node 16
freeing bootmem node 20
freeing bootmem node 21
freeing bootmem node 22
freeing bootmem node 23
freeing bootmem node 28
Memory: 1213775296k/1218707456k available (13312k kernel code, 4932160k reserved, 1600k data, 2727k bss, 4928k init)
---------------------------

Thanks,
-Mahesh.

> Let me dig around there.
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

--
Mahesh J Salgaonkar
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

From: Peter Zijlstra @ 2011-07-07 12:28 UTC
To: mahesh
Cc: linuxppc-dev, linux-kernel, anton, mingo, torvalds

On Thu, 2011-07-07 at 17:25 +0530, Mahesh J Salgaonkar wrote:
> > I guess what I'm asking is: do smaller P7 machines boot? And if so,
> > is there any difference except size?
>
> Yes, the smaller P7 machine that I have with 20 CPUs and 2 GB of RAM
> boots fine with 3.0.0-rc.

That sounds like a single-node machine. P7 comes as {4,6,8}*4 (16, 24
or 32 CPUs) per socket. And that 2G doesn't sound like much either.
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

From: Anton Blanchard @ 2011-07-14 0:34 UTC
To: Peter Zijlstra
Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

Hi Peter,

> Surely this isn't the first multi-node P7 to boot a kernel with this
> patch? If my git foo is any good it hit -next on the 23rd of May.
>
> I guess what I'm asking is: do smaller P7 machines boot? And if so,
> is there any difference except size?
>
> How many nodes does the thing have anyway, 28? Hmm, that could mean
> it's the first machine with >16 nodes to boot this, which would make
> it trigger the magic ALL_NODES crap.

We haven't tested a box with more than 16 nodes in quite a while, so it
may be this.

I took a quick look and we are stuck in update_group_power:

	do {
		power += group->cpu_power;
		group = group->next;
	} while (group != child->groups);

I looked at the linked list:

child->groups = c000007b2f74ff00

and dumping group as we go:

c000007b2f74ff00 c000007b2f760000 c000007b2fb60000 c000007b2ff60000

at this point we end up in a cycle and never make it back to
child->groups:

c000008b2e68ff00 c000008b2e6a0000 c000008b2eaa0000 c000008b2eea0000
c000009aee77ff00 c000009aee790000 c000009aeeb90000 c000009aeef90000
c00000bafde91800 c00000dafdf81800 c00000fafce81800 c000011afdf71800
c00001226e70ff00 c00001226e720000 c00001226eb20000 c00001226ef20000
c000008b2e68ff00

Still investigating

Anton
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

From: Anton Blanchard @ 2011-07-14 4:35 UTC
To: Peter Zijlstra
Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

> I took a quick look and we are stuck in update_group_power:
>
> 	do {
> 		power += group->cpu_power;
> 		group = group->next;
> 	} while (group != child->groups);
>
> I looked at the linked list:
>
> child->groups = c000007b2f74ff00
>
> and dumping group as we go:
>
> c000007b2f74ff00 c000007b2f760000 c000007b2fb60000 c000007b2ff60000
>
> at this point we end up in a cycle and never make it back to
> child->groups:
>
> c000008b2e68ff00 c000008b2e6a0000 c000008b2eaa0000 c000008b2eea0000
> c000009aee77ff00 c000009aee790000 c000009aeeb90000 c000009aeef90000
> c00000bafde91800 c00000dafdf81800 c00000fafce81800 c000011afdf71800
> c00001226e70ff00 c00001226e720000 c00001226eb20000 c00001226ef20000
> c000008b2e68ff00

It looks like the group ends up in two lists. I added a BUG_ON to
ensure we never link a group twice, and it hits.

I also printed out the cpu spans as we walk through build_sched_groups:

0 1 2 3
0 4 8 12 16 20 24 28
0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
0 128 256 384
4 5 6 7
8 9 10 11
12 13 14 15
16 17 18 19
20 21 22 23
24 25 26 27
28 29 30 31
32 33 34 35
32 36 40 44 48 52 56 60
36 37 38 39
40 41 42 43
44 45 46 47
48 49 50 51
52 53 54 55
56 57 58 59
60 61 62 63
64 65 66 67
64 68 72 76 80 84 88 92
68 69 70 71
72 73 74 75
76 77 78 79
80 81 82 83
84 85 86 87
88 89 90 91
92 93 94 95
96 97 98 99
96 100 104 108 112 116 120 124
100 101 102 103
104 105 106 107
108 109 110 111
112 113 114 115
116 117 118 119
120 121 122 123
124 125 126 127
128 129 130 131
128 132 136 140 144 148 152 156

Duplicates start appearing in this span:

128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608

So it looks like the overlap of the 16 entry spans
(SD_NODES_PER_DOMAIN) is causing our problem.

Anton

Index: linux-2.6-work/kernel/sched.c
===================================================================
--- linux-2.6-work.orig/kernel/sched.c	2011-07-11 12:48:48.251087767 +1000
+++ linux-2.6-work/kernel/sched.c	2011-07-14 14:19:45.867094044 +1000
@@ -7021,6 +7021,7 @@ build_sched_groups(struct sched_domain *
 
 		cpumask_clear(sched_group_cpus(sg));
 		sg->cpu_power = 0;
+		BUG_ON(sg->next);
 
 		for_each_cpu(j, span) {
 			if (get_group(j, sdd, NULL) != group)

Anton
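[Editor's note] The hang is easier to see in isolation: update_group_power()'s exit condition is "we got back to child->groups", which a corrupted ring can never satisfy once a group has been linked into two lists. The sketch below is a toy stand-in, not kernel code (the struct and function names are invented for illustration): it runs the same do/while walk but bounds the number of steps, turning the infinite loop into a detectable error.

```c
#include <assert.h>

/* Minimal stand-in for the scheduler's sched_group ring. */
struct group {
	struct group *next;
	unsigned long cpu_power;
};

/*
 * Sum cpu_power around the ring, the way update_group_power() does,
 * but give up after max_groups steps. Returns the total power, or -1
 * if the walk never returns to 'head' -- i.e. 'head' sits on a tail
 * that leads into a cycle excluding it, which is exactly the list
 * corruption Anton dumped above.
 */
static long walk_power(struct group *head, int max_groups)
{
	unsigned long power = 0;
	struct group *g = head;
	int steps = 0;

	do {
		power += g->cpu_power;
		g = g->next;
		if (++steps > max_groups)
			return -1; /* never got back to head: corrupt ring */
	} while (g != head);

	return (long)power;
}
```

The BUG_ON added in the patch above is the construction-time version of the same check: a group whose ->next is already non-NULL is about to be linked into a second list.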
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-14  4:35 ` Anton Blanchard
@ 2011-07-14 13:16 ` Peter Zijlstra
  -1 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-14 13:16 UTC (permalink / raw)
To: Anton Blanchard; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

On Thu, 2011-07-14 at 14:35 +1000, Anton Blanchard wrote:
> I also printed out the cpu spans as we walk through build_sched_groups:
> 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
> Duplicates start appearing in this span:
> 128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608
>
> So it looks like the overlap of the 16 entry spans
> (SD_NODES_PER_DOMAIN) is causing our problem.

Urgh.. so those spans are generated by sched_domain_node_span(), and it
looks like that simply picks the 15 nearest nodes to the one we've got
without consideration for overlap with previously generated spans.

Now that used to work because it used to simply allocate a new group
instead of using the existing one. The thing is, we want to track state
unique to a group of cpus, so duplicating that is iffy. Otoh, making
these masks non-overlapping is probably sub-optimal from a NUMA pov.

Looking at a slightly simpler set-up (4 socket AMD magny-cours):

$ cat /sys/devices/system/node/node*/distance
10 16 16 22 16 22 16 22
16 10 22 16 22 16 22 16
16 22 10 16 16 22 16 22
22 16 16 10 22 16 22 16
16 22 16 22 10 16 16 22
22 16 22 16 16 10 22 16
16 22 16 22 16 22 10 16
22 16 22 16 22 16 16 10

We can translate that into groups like:

{0} {0,1,2,4,6} {0-7}
{1} {1,0,3,5,7} {0-7}
...

and we can easily see there's overlap there as well in the NUMA layout
itself.

This seems to suggest we need to separate the unique state from the
sched_group. Now all I need is a way to not consume gobs of memory..

/me goes prod

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-14 13:16 ` Peter Zijlstra
@ 2011-07-15  0:45 ` Anton Blanchard
  -1 siblings, 0 replies; 43+ messages in thread
From: Anton Blanchard @ 2011-07-15 0:45 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

Hi,

> Urgh.. so those spans are generated by sched_domain_node_span(), and
> it looks like that simply picks the 15 nearest nodes to the one we've
> got without consideration for overlap with previously generated spans.

I do wonder if we need this extra level at all on ppc64. From memory
SGI added it for their massive setups, but our largest setup is 32 nodes
and breaking that down into 16 node chunks seems overkill.

I just realised we were setting NEWIDLE on our node definition and that
was causing large amounts of rebalance work even with
SD_NODES_PER_DOMAIN=16. After removing it and bumping
SD_NODES_PER_DOMAIN to 32, things look pretty good.

Perhaps we should allow an arch to override SD_NODES_PER_DOMAIN so this
extra level is only used by SGI boxes.

Anton

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-15  0:45 ` Anton Blanchard
@ 2011-07-15  8:37 ` Peter Zijlstra
  -1 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-15 8:37 UTC (permalink / raw)
To: Anton Blanchard; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

On Fri, 2011-07-15 at 10:45 +1000, Anton Blanchard wrote:
> Hi,
>
> > Urgh.. so those spans are generated by sched_domain_node_span(), and
> > it looks like that simply picks the 15 nearest nodes to the one we've
> > got without consideration for overlap with previously generated spans.
>
> I do wonder if we need this extra level at all on ppc64. From memory
> SGI added it for their massive setups, but our largest setup is 32 nodes
> and breaking that down into 16 node chunks seems overkill.
>
> I just realised we were setting NEWIDLE on our node definition and that
> was causing large amounts of rebalance work even with
> SD_NODES_PER_DOMAIN=16.
>
> After removing it and bumping SD_NODES_PER_DOMAIN to 32, things look
> pretty good.
>
> Perhaps we should allow an arch to override SD_NODES_PER_DOMAIN so this
> extra level is only used by SGI boxes.

We can certainly remove the whole topology layer that causes this
problem for 3.0 and try to fix up for 3.1 again.

But I was rather hoping to introduce more of those layers in the near
future, I was hoping to create a layer per node_distance() value, such
that the load-balancing is aware of the interconnects.

Now for that I ran into the exact same problem, and at the time didn't
come up with a solution, but I think I now see a way out.

Something like the below ought to avoid the problem.. makes SGI sad
though :-)

---
 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8fb4245..877b9f1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7203,7 +7203,7 @@ static struct sched_domain_topology_level default_topology[] = {
 #endif
 	{ sd_init_CPU, cpu_cpu_mask, },
 #ifdef CONFIG_NUMA
-	{ sd_init_NODE, cpu_node_mask, },
+//	{ sd_init_NODE, cpu_node_mask, },
 	{ sd_init_ALLNODES, cpu_allnodes_mask, },
 #endif
 	{ NULL, },

^ permalink raw reply related	[flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
  2011-07-15  0:45 ` Anton Blanchard
@ 2011-07-18 21:35 ` Peter Zijlstra
  -1 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-18 21:35 UTC (permalink / raw)
To: Anton Blanchard; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds

[-- Attachment #1: Type: text/plain, Size: 442 bytes --]

Anton, could you test the below two patches on that machine? It should
make things boot again, while I don't have a machine nearly big enough
to trigger any of this, I tested the new code paths by setting
FORCE_SD_OVERLAP in /debug/sched_features. Although any review of the
error paths would be much appreciated.

Also, could you send me the node_distance table for that machine? I'm
curious what the interconnects look like on that thing.

[-- Attachment #2: sched-domain-foo-1.patch --]
[-- Type: text/x-patch, Size: 9787 bytes --]

Subject: sched: Break out cpu_power from the sched_group structure
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Thu Jul 14 13:00:06 CEST 2011

In order to prepare for non-unique sched_groups per domain, we need to
carry the cpu_power elsewhere, so put a level of indirection in.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
---
 include/linux/sched.h |   14 +++++++++-----
 kernel/sched.c        |   32 ++++++++++++++++++++++++++------
 kernel/sched_fair.c   |   46 +++++++++++++++++++++++-----------------------
 3 files changed, 58 insertions(+), 34 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -6550,7 +6550,7 @@ static int sched_domain_debug_one(struct
 			break;
 		}
 
-		if (!group->cpu_power) {
+		if (!group->sgp->power) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: domain->cpu_power not "
 					"set\n");
@@ -6574,9 +6574,9 @@ static int sched_domain_debug_one(struct
 		cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
 		printk(KERN_CONT " %s", str);
 
-		if (group->cpu_power != SCHED_POWER_SCALE) {
+		if (group->sgp->power != SCHED_POWER_SCALE) {
 			printk(KERN_CONT " (cpu_power = %d)",
-				group->cpu_power);
+				group->sgp->power);
 		}
 
 		group = group->next;
@@ -6770,8 +6770,10 @@ static struct root_domain *alloc_rootdom
 static void free_sched_domain(struct rcu_head *rcu)
 {
 	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
 
-	if (atomic_dec_and_test(&sd->groups->ref))
+	if (atomic_dec_and_test(&sd->groups->ref)) {
+		kfree(sd->groups->sgp);
 		kfree(sd->groups);
+	}
 	kfree(sd);
 }
@@ -6938,6 +6940,7 @@ int sched_smt_power_savings = 0, sched_m
 struct sd_data {
 	struct sched_domain **__percpu sd;
 	struct sched_group **__percpu sg;
+	struct sched_group_power **__percpu sgp;
 };
 
 struct s_data {
@@ -6974,8 +6977,10 @@ static int get_group(int cpu, struct sd_
 	if (child)
 		cpu = cpumask_first(sched_domain_span(child));
 
-	if (sg)
+	if (sg) {
 		*sg = *per_cpu_ptr(sdd->sg, cpu);
+		(*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu);
+	}
 
 	return cpu;
 }
@@ -7013,7 +7018,7 @@ build_sched_groups(struct sched_domain *
 			continue;
 
 		cpumask_clear(sched_group_cpus(sg));
-		sg->cpu_power = 0;
+		sg->sgp->power = 0;
 
 		for_each_cpu(j, span) {
 			if (get_group(j, sdd, NULL) != group)
@@ -7178,6 +7183,7 @@ static void claim_allocations(int cpu, s
 	if (cpu == cpumask_first(sched_group_cpus(sg))) {
 		WARN_ON_ONCE(*per_cpu_ptr(sdd->sg, cpu) != sg);
 		*per_cpu_ptr(sdd->sg, cpu) = NULL;
+		*per_cpu_ptr(sdd->sgp, cpu) = NULL;
 	}
 }
@@ -7227,9 +7233,14 @@ static int __sdt_alloc(const struct cpum
 		if (!sdd->sg)
 			return -ENOMEM;
 
+		sdd->sgp = alloc_percpu(struct sched_group_power *);
+		if (!sdd->sgp)
+			return -ENOMEM;
+
 		for_each_cpu(j, cpu_map) {
 			struct sched_domain *sd;
 			struct sched_group *sg;
+			struct sched_group_power *sgp;
 
 			sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
@@ -7244,6 +7255,13 @@ static int __sdt_alloc(const struct cpum
 				return -ENOMEM;
 
 			*per_cpu_ptr(sdd->sg, j) = sg;
+
+			sgp = kzalloc_node(sizeof(struct sched_group_power),
+					GFP_KERNEL, cpu_to_node(j));
+			if (!sgp)
+				return -ENOMEM;
+
+			*per_cpu_ptr(sdd->sgp, j) = sgp;
 		}
 	}
@@ -7261,9 +7279,11 @@ static void __sdt_free(const struct cpum
 		for_each_cpu(j, cpu_map) {
 			kfree(*per_cpu_ptr(sdd->sd, j));
 			kfree(*per_cpu_ptr(sdd->sg, j));
+			kfree(*per_cpu_ptr(sdd->sgp, j));
 		}
 		free_percpu(sdd->sd);
 		free_percpu(sdd->sg);
+		free_percpu(sdd->sgp);
 	}
 }

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1583,7 +1583,7 @@ find_idlest_group(struct sched_domain *s
 		}
 
 		/* Adjust by relative CPU power of the group */
-		avg_load = (avg_load * SCHED_POWER_SCALE) / group->cpu_power;
+		avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;
 
 		if (local_group) {
 			this_load = avg_load;
@@ -2629,7 +2629,7 @@ static void update_cpu_power(struct sche
 		power >>= SCHED_POWER_SHIFT;
 	}
 
-	sdg->cpu_power_orig = power;
+	sdg->sgp->power_orig = power;
 
 	if (sched_feat(ARCH_POWER))
 		power *= arch_scale_freq_power(sd, cpu);
@@ -2645,7 +2645,7 @@ static void update_cpu_power(struct sche
 		power = 1;
 
 	cpu_rq(cpu)->cpu_power = power;
-	sdg->cpu_power = power;
+	sdg->sgp->power = power;
 }
 
 static void update_group_power(struct sched_domain *sd, int cpu)
@@ -2663,11 +2663,11 @@ static void update_group_power(struct sc
 
 	group = child->groups;
 	do {
-		power += group->cpu_power;
+		power += group->sgp->power;
 		group = group->next;
 	} while (group != child->groups);
 
-	sdg->cpu_power = power;
+	sdg->sgp->power = power;
 }
 
 /*
@@ -2689,7 +2689,7 @@ fix_small_capacity(struct sched_domain *
 	/*
 	 * If ~90% of the cpu_power is still there, we're good.
 	 */
-	if (group->cpu_power * 32 > group->cpu_power_orig * 29)
+	if (group->sgp->power * 32 > group->sgp->power_orig * 29)
 		return 1;
 
 	return 0;
@@ -2769,7 +2769,7 @@ static inline void update_sg_lb_stats(st
 	}
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->cpu_power;
+	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -2786,7 +2786,7 @@ static inline void update_sg_lb_stats(st
 	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1)
 		sgs->group_imb = 1;
 
-	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power,
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
 						SCHED_POWER_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
@@ -2875,7 +2875,7 @@ static inline void update_sd_lb_stats(st
 			return;
 
 		sds->total_load += sgs.group_load;
-		sds->total_pwr += sg->cpu_power;
+		sds->total_pwr += sg->sgp->power;
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
@@ -2960,7 +2960,7 @@ static int check_asym_packing(struct sch
 	if (this_cpu > busiest_cpu)
 		return 0;
 
-	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
+	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->sgp->power,
 				       SCHED_POWER_SCALE);
 	return 1;
 }
@@ -2991,7 +2991,7 @@ static inline void fix_small_imbalance(s
 	scaled_busy_load_per_task = sds->busiest_load_per_task
 					 * SCHED_POWER_SCALE;
-	scaled_busy_load_per_task /= sds->busiest->cpu_power;
+	scaled_busy_load_per_task /= sds->busiest->sgp->power;
 
 	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
 			(scaled_busy_load_per_task * imbn)) {
@@ -3005,28 +3005,28 @@ static inline void fix_small_imbalance(s
 	 * moving them.
 	 */
 
-	pwr_now += sds->busiest->cpu_power *
+	pwr_now += sds->busiest->sgp->power *
 			min(sds->busiest_load_per_task, sds->max_load);
-	pwr_now += sds->this->cpu_power *
+	pwr_now += sds->this->sgp->power *
 			min(sds->this_load_per_task, sds->this_load);
 	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
 	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-		sds->busiest->cpu_power;
+		sds->busiest->sgp->power;
 	if (sds->max_load > tmp)
-		pwr_move += sds->busiest->cpu_power *
+		pwr_move += sds->busiest->sgp->power *
 			min(sds->busiest_load_per_task, sds->max_load - tmp);
 
 	/* Amount of load we'd add */
-	if (sds->max_load * sds->busiest->cpu_power <
+	if (sds->max_load * sds->busiest->sgp->power <
 		sds->busiest_load_per_task * SCHED_POWER_SCALE)
-		tmp = (sds->max_load * sds->busiest->cpu_power) /
-			sds->this->cpu_power;
+		tmp = (sds->max_load * sds->busiest->sgp->power) /
+			sds->this->sgp->power;
 	else
 		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-			sds->this->cpu_power;
-	pwr_move += sds->this->cpu_power *
+			sds->this->sgp->power;
+	pwr_move += sds->this->sgp->power *
 		min(sds->this_load_per_task, sds->this_load + tmp);
 	pwr_move /= SCHED_POWER_SCALE;
@@ -3072,7 +3072,7 @@ static inline void calculate_imbalance(s
 		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
 
-		load_above_capacity /= sds->busiest->cpu_power;
+		load_above_capacity /= sds->busiest->sgp->power;
 	}
 
 	/*
@@ -3088,8 +3088,8 @@ static inline void calculate_imbalance(s
 	max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
 
 	/* How much load to actually move to equalise the imbalance */
-	*imbalance = min(max_pull * sds->busiest->cpu_power,
-		(sds->avg_load - sds->this_load) * sds->this->cpu_power)
+	*imbalance = min(max_pull * sds->busiest->sgp->power,
+		(sds->avg_load - sds->this_load) * sds->this->sgp->power)
 			/ SCHED_POWER_SCALE;
 
 	/*

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -893,16 +893,20 @@ static inline int sd_power_saving_flags(
 	return 0;
 }
 
-struct sched_group {
-	struct sched_group *next;	/* Must be a circular list */
-	atomic_t ref;
-
+struct sched_group_power {
 	/*
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 	 * single CPU.
 	 */
-	unsigned int cpu_power, cpu_power_orig;
+	unsigned int power, power_orig;
+};
+
+struct sched_group {
+	struct sched_group *next;	/* Must be a circular list */
+	atomic_t ref;
+
 	unsigned int group_weight;
+	struct sched_group_power *sgp;
 
 	/*
 	 * The CPUs this group covers.

[-- Attachment #3: sched-domain-foo-2.patch --]
[-- Type: text/x-patch, Size: 8956 bytes --]

Subject: sched: Allow for overlapping sched_domain spans
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri Jul 15 10:35:52 CEST 2011

Allow for sched_domain spans that overlap by giving such domains their
own sched_group list instead of sharing the sched_groups amongst
each-other.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-yr71izj2souh2dbifdh6j68y@git.kernel.org
---
 include/linux/sched.h   |    2 
 kernel/sched.c          |  157 +++++++++++++++++++++++++++++++++++++++---------
 kernel/sched_features.h |    2 
 3 files changed, 132 insertions(+), 29 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -844,6 +844,7 @@ enum cpu_idle_type {
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800	/* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
+#define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 
 enum powersavings_balance_level {
 	POWERSAVINGS_BALANCE_NONE = 0,	/* No power saving load balance */
@@ -894,6 +895,7 @@ static inline int sd_power_saving_flags(
 }
 
 struct sched_group_power {
+	atomic_t ref;
 	/*
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 	 * single CPU.

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -6767,10 +6767,36 @@ static struct root_domain *alloc_rootdom
 	return rd;
 }
 
+static void free_sched_groups(struct sched_group *sg, int free_sgp)
+{
+	struct sched_group *tmp, *first;
+
+	if (!sg)
+		return;
+
+	first = sg;
+	do {
+		tmp = sg->next;
+
+		if (free_sgp && atomic_dec_and_test(&sg->sgp->ref))
+			kfree(sg->sgp);
+
+		kfree(sg);
+		sg = tmp;
+	} while (sg != first);
+}
+
 static void free_sched_domain(struct rcu_head *rcu)
 {
 	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-	if (atomic_dec_and_test(&sd->groups->ref)) {
+
+	/*
+	 * If its an overlapping domain it has private groups, iterate and
+	 * nuke them all.
+	 */
+	if (sd->flags & SD_OVERLAP) {
+		free_sched_groups(sd->groups, 1);
+	} else if (atomic_dec_and_test(&sd->groups->ref)) {
 		kfree(sd->groups->sgp);
 		kfree(sd->groups);
 	}
@@ -6960,15 +6986,73 @@ struct sched_domain_topology_level;
 
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
+#define SDTL_OVERLAP	0x01
+
 struct sched_domain_topology_level {
 	sched_domain_init_f init;
 	sched_domain_mask_f mask;
+	int		    flags;
 	struct sd_data      data;
 };
 
-/*
- * Assumes the sched_domain tree is fully constructed
- */
+static int
+build_overlap_sched_groups(struct sched_domain *sd, int cpu)
+{
+	struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+	const struct cpumask *span = sched_domain_span(sd);
+	struct cpumask *covered = sched_domains_tmpmask;
+	struct sd_data *sdd = sd->private;
+	struct sched_domain *child;
+	int i;
+
+	cpumask_clear(covered);
+
+	for_each_cpu(i, span) {
+		struct cpumask *sg_span;
+
+		if (cpumask_test_cpu(i, covered))
+			continue;
+
+		sg = kzalloc_node(sizeof(struct sched_group), GFP_KERNEL,
+				cpu_to_node(i));
+
+		if (!sg)
+			goto fail;
+
+		sg_span = sched_group_cpus(sg);
+
+		child = *per_cpu_ptr(sdd->sd, i);
+		if (child->child) {
+			child = child->child;
+			*sg_span = *sched_domain_span(child);
+		} else
+			cpumask_set_cpu(i, sg_span);
+
+		cpumask_or(covered, covered, sg_span);
+
+		sg->sgp = *per_cpu_ptr(sdd->sgp, cpumask_first(sg_span));
+		atomic_inc(&sg->sgp->ref);
+
+		if (cpumask_test_cpu(cpu, sg_span))
+			groups = sg;
+
+		if (!first)
+			first = sg;
+		if (last)
+			last->next = sg;
+		last = sg;
+		last->next = first;
+	}
+	sd->groups = groups;
+
+	return 0;
+
+fail:
+	free_sched_groups(first, 0);
+
+	return -ENOMEM;
+}
+
 static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
 {
 	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
@@ -6980,23 +7064,21 @@ static int get_group(int cpu, struct sd_
 	if (sg) {
 		*sg = *per_cpu_ptr(sdd->sg, cpu);
 		(*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu);
+		atomic_set(&(*sg)->sgp->ref, 1); /* for claim_allocations */
 	}
 
 	return cpu;
 }
 
 /*
- * build_sched_groups takes the cpumask we wish to span, and a pointer
- * to a function which identifies what group(along with sched group) a CPU
- * belongs to. The return value of group_fn must be a >= 0 and < nr_cpu_ids
- * (due to the fact that we keep track of groups covered with a struct cpumask).
- *
  * build_sched_groups will build a circular linked list of the groups
  * covered by the given span, and will set each group's ->cpumask correctly,
  * and ->cpu_power to 0.
+ *
+ * Assumes the sched_domain tree is fully constructed
  */
-static void
-build_sched_groups(struct sched_domain *sd)
+static int
+build_sched_groups(struct sched_domain *sd, int cpu)
 {
 	struct sched_group *first = NULL, *last = NULL;
 	struct sd_data *sdd = sd->private;
@@ -7004,6 +7086,12 @@ build_sched_groups(struct sched_domain *
 	struct cpumask *covered;
 	int i;
 
+	get_group(cpu, sdd, &sd->groups);
+	atomic_inc(&sd->groups->ref);
+
+	if (cpu != cpumask_first(sched_domain_span(sd)))
+		return 0;
+
 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
@@ -7035,6 +7123,8 @@ build_sched_groups(struct sched_domain *
 		last = sg;
 	}
 	last->next = first;
+
+	return 0;
 }
@@ -7049,12 +7139,17 @@ build_sched_groups(struct sched_domain *
  */
 static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 {
-	WARN_ON(!sd || !sd->groups);
+	struct sched_group *sg = sd->groups;
 
-	if (cpu != group_first_cpu(sd->groups))
-		return;
+	WARN_ON(!sd || !sg);
 
-	sd->groups->group_weight = cpumask_weight(sched_group_cpus(sd->groups));
+	do {
+		sg->group_weight = cpumask_weight(sched_group_cpus(sg));
+		sg = sg->next;
+	} while (sg != sd->groups);
+
+	if (cpu != group_first_cpu(sg))
+		return;
 
 	update_group_power(sd, cpu);
 }
@@ -7175,16 +7270,15 @@ static enum s_alloc __visit_domain_alloc
 static void claim_allocations(int cpu, struct sched_domain *sd)
 {
 	struct sd_data *sdd = sd->private;
-	struct sched_group *sg = sd->groups;
 
 	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
 	*per_cpu_ptr(sdd->sd, cpu) = NULL;
 
-	if (cpu == cpumask_first(sched_group_cpus(sg))) {
-		WARN_ON_ONCE(*per_cpu_ptr(sdd->sg, cpu) != sg);
+	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
 		*per_cpu_ptr(sdd->sg, cpu) = NULL;
+
+	if (atomic_read(&(*per_cpu_ptr(sdd->sgp, cpu))->ref))
 		*per_cpu_ptr(sdd->sgp, cpu) = NULL;
-	}
 }
@@ -7209,7 +7303,7 @@ static struct sched_domain_topology_leve
 #endif
 	{ sd_init_CPU, cpu_cpu_mask, },
 #ifdef CONFIG_NUMA
-	{ sd_init_NODE, cpu_node_mask, },
+	{ sd_init_NODE, cpu_node_mask, SDTL_OVERLAP, },
 	{ sd_init_ALLNODES, cpu_allnodes_mask, },
 #endif
 	{ NULL, },
@@ -7277,7 +7371,9 @@ static void __sdt_free(const struct cpum
 		struct sd_data *sdd = &tl->data;
 
 		for_each_cpu(j, cpu_map) {
-			kfree(*per_cpu_ptr(sdd->sd, j));
+			struct sched_domain *sd = *per_cpu_ptr(sdd->sd, j);
+			if (sd && (sd->flags & SD_OVERLAP))
+				free_sched_groups(sd->groups, 0);
 			kfree(*per_cpu_ptr(sdd->sg, j));
 			kfree(*per_cpu_ptr(sdd->sgp, j));
 		}
@@ -7329,8 +7425,11 @@ static int build_sched_domains(const str
 		struct sched_domain_topology_level *tl;
 
 		sd = NULL;
-		for (tl = sched_domain_topology; tl->init; tl++)
+		for (tl = sched_domain_topology; tl->init; tl++) {
 			sd = build_sched_domain(tl, &d, cpu_map, attr, sd, i);
+			if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
+				sd->flags |= SD_OVERLAP;
+		}
 
 		while (sd->child)
 			sd = sd->child;
@@ -7342,13 +7441,13 @@ static int build_sched_domains(const str
 	for_each_cpu(i, cpu_map) {
 		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
 			sd->span_weight = cpumask_weight(sched_domain_span(sd));
-			get_group(i, sd->private, &sd->groups);
-			atomic_inc(&sd->groups->ref);
-
-			if (i != cpumask_first(sched_domain_span(sd)))
-				continue;
-
-			build_sched_groups(sd);
+			if (sd->flags & SD_OVERLAP) {
+				if (build_overlap_sched_groups(sd, i))
+					goto error;
+			} else {
+				if (build_sched_groups(sd, i))
+					goto error;
+			}
 		}
 	}

Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -70,3 +70,5 @@ SCHED_FEAT(NONIRQ_POWER, 1)
  * using the scheduler IPI. Reduces rq->lock contention/bounces.
  */
 SCHED_FEAT(TTWU_QUEUE, 1)
+
+SCHED_FEAT(FORCE_SD_OVERLAP, 0)

^ permalink raw reply	[flat|nested] 43+ messages in thread
0; + sg->sgp->power = 0; for_each_cpu(j, span) { if (get_group(j, sdd, NULL) != group) @@ -7178,6 +7183,7 @@ static void claim_allocations(int cpu, s if (cpu == cpumask_first(sched_group_cpus(sg))) { WARN_ON_ONCE(*per_cpu_ptr(sdd->sg, cpu) != sg); *per_cpu_ptr(sdd->sg, cpu) = NULL; + *per_cpu_ptr(sdd->sgp, cpu) = NULL; } } @@ -7227,9 +7233,14 @@ static int __sdt_alloc(const struct cpum if (!sdd->sg) return -ENOMEM; + sdd->sgp = alloc_percpu(struct sched_group_power *); + if (!sdd->sgp) + return -ENOMEM; + for_each_cpu(j, cpu_map) { struct sched_domain *sd; struct sched_group *sg; + struct sched_group_power *sgp; sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(), GFP_KERNEL, cpu_to_node(j)); @@ -7244,6 +7255,13 @@ static int __sdt_alloc(const struct cpum return -ENOMEM; *per_cpu_ptr(sdd->sg, j) = sg; + + sgp = kzalloc_node(sizeof(struct sched_group_power), + GFP_KERNEL, cpu_to_node(j)); + if (!sgp) + return -ENOMEM; + + *per_cpu_ptr(sdd->sgp, j) = sgp; } } @@ -7261,9 +7279,11 @@ static void __sdt_free(const struct cpum for_each_cpu(j, cpu_map) { kfree(*per_cpu_ptr(sdd->sd, j)); kfree(*per_cpu_ptr(sdd->sg, j)); + kfree(*per_cpu_ptr(sdd->sgp, j)); } free_percpu(sdd->sd); free_percpu(sdd->sg); + free_percpu(sdd->sgp); } } Index: linux-2.6/kernel/sched_fair.c =================================================================== --- linux-2.6.orig/kernel/sched_fair.c +++ linux-2.6/kernel/sched_fair.c @@ -1583,7 +1583,7 @@ find_idlest_group(struct sched_domain *s } /* Adjust by relative CPU power of the group */ - avg_load = (avg_load * SCHED_POWER_SCALE) / group->cpu_power; + avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power; if (local_group) { this_load = avg_load; @@ -2629,7 +2629,7 @@ static void update_cpu_power(struct sche power >>= SCHED_POWER_SHIFT; } - sdg->cpu_power_orig = power; + sdg->sgp->power_orig = power; if (sched_feat(ARCH_POWER)) power *= arch_scale_freq_power(sd, cpu); @@ -2645,7 +2645,7 @@ static void update_cpu_power(struct 
sche power = 1; cpu_rq(cpu)->cpu_power = power; - sdg->cpu_power = power; + sdg->sgp->power = power; } static void update_group_power(struct sched_domain *sd, int cpu) @@ -2663,11 +2663,11 @@ static void update_group_power(struct sc group = child->groups; do { - power += group->cpu_power; + power += group->sgp->power; group = group->next; } while (group != child->groups); - sdg->cpu_power = power; + sdg->sgp->power = power; } /* @@ -2689,7 +2689,7 @@ fix_small_capacity(struct sched_domain * /* * If ~90% of the cpu_power is still there, we're good. */ - if (group->cpu_power * 32 > group->cpu_power_orig * 29) + if (group->sgp->power * 32 > group->sgp->power_orig * 29) return 1; return 0; @@ -2769,7 +2769,7 @@ static inline void update_sg_lb_stats(st } /* Adjust by relative CPU power of the group */ - sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->cpu_power; + sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power; /* * Consider the group unbalanced when the imbalance is larger @@ -2786,7 +2786,7 @@ static inline void update_sg_lb_stats(st if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1) sgs->group_imb = 1; - sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power, + sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power, SCHED_POWER_SCALE); if (!sgs->group_capacity) sgs->group_capacity = fix_small_capacity(sd, group); @@ -2875,7 +2875,7 @@ static inline void update_sd_lb_stats(st return; sds->total_load += sgs.group_load; - sds->total_pwr += sg->cpu_power; + sds->total_pwr += sg->sgp->power; /* * In case the child domain prefers tasks go to siblings @@ -2960,7 +2960,7 @@ static int check_asym_packing(struct sch if (this_cpu > busiest_cpu) return 0; - *imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power, + *imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE); return 1; } @@ -2991,7 +2991,7 @@ static inline void fix_small_imbalance(s 
scaled_busy_load_per_task = sds->busiest_load_per_task * SCHED_POWER_SCALE; - scaled_busy_load_per_task /= sds->busiest->cpu_power; + scaled_busy_load_per_task /= sds->busiest->sgp->power; if (sds->max_load - sds->this_load + scaled_busy_load_per_task >= (scaled_busy_load_per_task * imbn)) { @@ -3005,28 +3005,28 @@ static inline void fix_small_imbalance(s * moving them. */ - pwr_now += sds->busiest->cpu_power * + pwr_now += sds->busiest->sgp->power * min(sds->busiest_load_per_task, sds->max_load); - pwr_now += sds->this->cpu_power * + pwr_now += sds->this->sgp->power * min(sds->this_load_per_task, sds->this_load); pwr_now /= SCHED_POWER_SCALE; /* Amount of load we'd subtract */ tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) / - sds->busiest->cpu_power; + sds->busiest->sgp->power; if (sds->max_load > tmp) - pwr_move += sds->busiest->cpu_power * + pwr_move += sds->busiest->sgp->power * min(sds->busiest_load_per_task, sds->max_load - tmp); /* Amount of load we'd add */ - if (sds->max_load * sds->busiest->cpu_power < + if (sds->max_load * sds->busiest->sgp->power < sds->busiest_load_per_task * SCHED_POWER_SCALE) - tmp = (sds->max_load * sds->busiest->cpu_power) / - sds->this->cpu_power; + tmp = (sds->max_load * sds->busiest->sgp->power) / + sds->this->sgp->power; else tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) / - sds->this->cpu_power; - pwr_move += sds->this->cpu_power * + sds->this->sgp->power; + pwr_move += sds->this->sgp->power * min(sds->this_load_per_task, sds->this_load + tmp); pwr_move /= SCHED_POWER_SCALE; @@ -3072,7 +3072,7 @@ static inline void calculate_imbalance(s load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE); - load_above_capacity /= sds->busiest->cpu_power; + load_above_capacity /= sds->busiest->sgp->power; } /* @@ -3088,8 +3088,8 @@ static inline void calculate_imbalance(s max_pull = min(sds->max_load - sds->avg_load, load_above_capacity); /* How much load to actually move to equalise the imbalance */ - *imbalance = 
min(max_pull * sds->busiest->cpu_power, - (sds->avg_load - sds->this_load) * sds->this->cpu_power) + *imbalance = min(max_pull * sds->busiest->sgp->power, + (sds->avg_load - sds->this_load) * sds->this->sgp->power) / SCHED_POWER_SCALE; /* Index: linux-2.6/include/linux/sched.h =================================================================== --- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -893,16 +893,20 @@ static inline int sd_power_saving_flags( return 0; } -struct sched_group { - struct sched_group *next; /* Must be a circular list */ - atomic_t ref; - +struct sched_group_power { /* * CPU power of this group, SCHED_LOAD_SCALE being max power for a * single CPU. */ - unsigned int cpu_power, cpu_power_orig; + unsigned int power, power_orig; +}; + +struct sched_group { + struct sched_group *next; /* Must be a circular list */ + atomic_t ref; + unsigned int group_weight; + struct sched_group_power *sgp; /* * The CPUs this group covers. [-- Attachment #3: sched-domain-foo-2.patch --] [-- Type: text/x-patch, Size: 8956 bytes --] Subject: sched: Allow for overlapping sched_domain spans From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Fri Jul 15 10:35:52 CEST 2011 Allow for sched_domain spans that overlap by giving such domains their own sched_group list instead of sharing the sched_groups amongst each-other. 
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-yr71izj2souh2dbifdh6j68y@git.kernel.org --- include/linux/sched.h | 2 kernel/sched.c | 157 +++++++++++++++++++++++++++++++++++++++--------- kernel/sched_features.h | 2 3 files changed, 132 insertions(+), 29 deletions(-) Index: linux-2.6/include/linux/sched.h =================================================================== --- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -844,6 +844,7 @@ enum cpu_idle_type { #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ #define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */ +#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */ enum powersavings_balance_level { POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */ @@ -894,6 +895,7 @@ static inline int sd_power_saving_flags( } struct sched_group_power { + atomic_t ref; /* * CPU power of this group, SCHED_LOAD_SCALE being max power for a * single CPU. Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -6767,10 +6767,36 @@ static struct root_domain *alloc_rootdom return rd; } +static void free_sched_groups(struct sched_group *sg, int free_sgp) +{ + struct sched_group *tmp, *first; + + if (!sg) + return; + + first = sg; + do { + tmp = sg->next; + + if (free_sgp && atomic_dec_and_test(&sg->sgp->ref)) + kfree(sg->sgp); + + kfree(sg); + sg = tmp; + } while (sg != first); +} + static void free_sched_domain(struct rcu_head *rcu) { struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu); - if (atomic_dec_and_test(&sd->groups->ref)) { + + /* + * If its an overlapping domain it has private groups, iterate and + * nuke them all. 
+ */ + if (sd->flags & SD_OVERLAP) { + free_sched_groups(sd->groups, 1); + } else if (atomic_dec_and_test(&sd->groups->ref)) { kfree(sd->groups->sgp); kfree(sd->groups); } @@ -6960,15 +6986,73 @@ struct sched_domain_topology_level; typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu); typedef const struct cpumask *(*sched_domain_mask_f)(int cpu); +#define SDTL_OVERLAP 0x01 + struct sched_domain_topology_level { sched_domain_init_f init; sched_domain_mask_f mask; + int flags; struct sd_data data; }; -/* - * Assumes the sched_domain tree is fully constructed - */ +static int +build_overlap_sched_groups(struct sched_domain *sd, int cpu) +{ + struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg; + const struct cpumask *span = sched_domain_span(sd); + struct cpumask *covered = sched_domains_tmpmask; + struct sd_data *sdd = sd->private; + struct sched_domain *child; + int i; + + cpumask_clear(covered); + + for_each_cpu(i, span) { + struct cpumask *sg_span; + + if (cpumask_test_cpu(i, covered)) + continue; + + sg = kzalloc_node(sizeof(struct sched_group), GFP_KERNEL, + cpu_to_node(i)); + + if (!sg) + goto fail; + + sg_span = sched_group_cpus(sg); + + child = *per_cpu_ptr(sdd->sd, i); + if (child->child) { + child = child->child; + *sg_span = *sched_domain_span(child); + } else + cpumask_set_cpu(i, sg_span); + + cpumask_or(covered, covered, sg_span); + + sg->sgp = *per_cpu_ptr(sdd->sgp, cpumask_first(sg_span)); + atomic_inc(&sg->sgp->ref); + + if (cpumask_test_cpu(cpu, sg_span)) + groups = sg; + + if (!first) + first = sg; + if (last) + last->next = sg; + last = sg; + last->next = first; + } + sd->groups = groups; + + return 0; + +fail: + free_sched_groups(first, 0); + + return -ENOMEM; +} + static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg) { struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu); @@ -6980,23 +7064,21 @@ static int get_group(int cpu, struct sd_ if (sg) { *sg = 
*per_cpu_ptr(sdd->sg, cpu); (*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu); + atomic_set(&(*sg)->sgp->ref, 1); /* for claim_allocations */ } return cpu; } /* - * build_sched_groups takes the cpumask we wish to span, and a pointer - * to a function which identifies what group(along with sched group) a CPU - * belongs to. The return value of group_fn must be a >= 0 and < nr_cpu_ids - * (due to the fact that we keep track of groups covered with a struct cpumask). - * * build_sched_groups will build a circular linked list of the groups * covered by the given span, and will set each group's ->cpumask correctly, * and ->cpu_power to 0. + * + * Assumes the sched_domain tree is fully constructed */ -static void -build_sched_groups(struct sched_domain *sd) +static int +build_sched_groups(struct sched_domain *sd, int cpu) { struct sched_group *first = NULL, *last = NULL; struct sd_data *sdd = sd->private; @@ -7004,6 +7086,12 @@ build_sched_groups(struct sched_domain * struct cpumask *covered; int i; + get_group(cpu, sdd, &sd->groups); + atomic_inc(&sd->groups->ref); + + if (cpu != cpumask_first(sched_domain_span(sd))) + return 0; + lockdep_assert_held(&sched_domains_mutex); covered = sched_domains_tmpmask; @@ -7035,6 +7123,8 @@ build_sched_groups(struct sched_domain * last = sg; } last->next = first; + + return 0; } /* @@ -7049,12 +7139,17 @@ build_sched_groups(struct sched_domain * */ static void init_sched_groups_power(int cpu, struct sched_domain *sd) { - WARN_ON(!sd || !sd->groups); + struct sched_group *sg = sd->groups; - if (cpu != group_first_cpu(sd->groups)) - return; + WARN_ON(!sd || !sg); - sd->groups->group_weight = cpumask_weight(sched_group_cpus(sd->groups)); + do { + sg->group_weight = cpumask_weight(sched_group_cpus(sg)); + sg = sg->next; + } while (sg != sd->groups); + + if (cpu != group_first_cpu(sg)) + return; update_group_power(sd, cpu); } @@ -7175,16 +7270,15 @@ static enum s_alloc __visit_domain_alloc static void claim_allocations(int cpu, struct sched_domain 
*sd) { struct sd_data *sdd = sd->private; - struct sched_group *sg = sd->groups; WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd); *per_cpu_ptr(sdd->sd, cpu) = NULL; - if (cpu == cpumask_first(sched_group_cpus(sg))) { - WARN_ON_ONCE(*per_cpu_ptr(sdd->sg, cpu) != sg); + if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref)) *per_cpu_ptr(sdd->sg, cpu) = NULL; + + if (atomic_read(&(*per_cpu_ptr(sdd->sgp, cpu))->ref)) *per_cpu_ptr(sdd->sgp, cpu) = NULL; - } } #ifdef CONFIG_SCHED_SMT @@ -7209,7 +7303,7 @@ static struct sched_domain_topology_leve #endif { sd_init_CPU, cpu_cpu_mask, }, #ifdef CONFIG_NUMA - { sd_init_NODE, cpu_node_mask, }, + { sd_init_NODE, cpu_node_mask, SDTL_OVERLAP, }, { sd_init_ALLNODES, cpu_allnodes_mask, }, #endif { NULL, }, @@ -7277,7 +7371,9 @@ static void __sdt_free(const struct cpum struct sd_data *sdd = &tl->data; for_each_cpu(j, cpu_map) { - kfree(*per_cpu_ptr(sdd->sd, j)); + struct sched_domain *sd = *per_cpu_ptr(sdd->sd, j); + if (sd && (sd->flags & SD_OVERLAP)) + free_sched_groups(sd->groups, 0); kfree(*per_cpu_ptr(sdd->sg, j)); kfree(*per_cpu_ptr(sdd->sgp, j)); } @@ -7329,8 +7425,11 @@ static int build_sched_domains(const str struct sched_domain_topology_level *tl; sd = NULL; - for (tl = sched_domain_topology; tl->init; tl++) + for (tl = sched_domain_topology; tl->init; tl++) { sd = build_sched_domain(tl, &d, cpu_map, attr, sd, i); + if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP)) + sd->flags |= SD_OVERLAP; + } while (sd->child) sd = sd->child; @@ -7342,13 +7441,13 @@ static int build_sched_domains(const str for_each_cpu(i, cpu_map) { for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { sd->span_weight = cpumask_weight(sched_domain_span(sd)); - get_group(i, sd->private, &sd->groups); - atomic_inc(&sd->groups->ref); - - if (i != cpumask_first(sched_domain_span(sd))) - continue; - - build_sched_groups(sd); + if (sd->flags & SD_OVERLAP) { + if (build_overlap_sched_groups(sd, i)) + goto error; + } else { + if 
(build_sched_groups(sd, i)) + goto error; + } } } Index: linux-2.6/kernel/sched_features.h =================================================================== --- linux-2.6.orig/kernel/sched_features.h +++ linux-2.6/kernel/sched_features.h @@ -70,3 +70,5 @@ SCHED_FEAT(NONIRQ_POWER, 1) * using the scheduler IPI. Reduces rq->lock contention/bounces. */ SCHED_FEAT(TTWU_QUEUE, 1) + +SCHED_FEAT(FORCE_SD_OVERLAP, 0) ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 2011-07-18 21:35 ` Peter Zijlstra @ 2011-07-19 4:44 ` Anton Blanchard -1 siblings, 0 replies; 43+ messages in thread From: Anton Blanchard @ 2011-07-19 4:44 UTC (permalink / raw) To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds On Mon, 18 Jul 2011 23:35:56 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > Anton, could you test the below two patches on that machine? > > It should make things boot again, while I don't have a machine nearly > big enough to trigger any of this, I tested the new code paths by > setting FORCE_SD_OVERLAP in /debug/sched_features. Although any review > of the error paths would be much appreciated. I get an oops in slub code: NIP [c000000000197d30] .deactivate_slab+0x1b0/0x200 LR [c000000000199d94] .__slab_alloc+0xb4/0x5a0 [c000000000199d94] .__slab_alloc+0xb4/0x5a0 [c00000000019ac98] .kmem_cache_alloc_node_trace+0xa8/0x260 [c00000000007eb70] .build_sched_domains+0xa60/0xb90 [c000000000a16a98] .sched_init_smp+0xa8/0x228 [c000000000a00274] .kernel_init+0x10c/0x1fc [c00000000002324c] .kernel_thread+0x54/0x70 I'm guessing it's a result of some nodes not having any local memory. but a bit surprised I'm not seeing it elsewhere. Investigating. > Also, could you send me the node_distance table for that machine? I'm > curious what the interconnects look like on that thing. Our node distances are a bit arbitrary (I make them up based on information given to us in the device tree). In terms of memory we have a maximum of three levels. To give some gross estimates, on chip memory might be 30GB/sec, on node memory 10-15GB/sec and off node memory 5GB/sec. The only thing we tweak with node distances is to make sure we go into node reclaim before going off node: /* * Before going off node we want the VM to try and reclaim from the local * node. It does this if the remote distance is larger than RECLAIM_DISTANCE. 
* With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of * 20, we never reclaim and go off node straight away. * * To fix this we choose a smaller value of RECLAIM_DISTANCE. */ #define RECLAIM_DISTANCE 10 Anton node distances: node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0: 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 1: 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 2: 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 3: 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 4: 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 5: 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 6: 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 7: 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 8: 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 9: 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 10: 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 11: 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 12: 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 13: 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 14: 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 15: 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0 16: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 0 0 0 0 17: 40 40 40 40 40 40 40 40 40 40 40 40 
40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 0 0 0 0 18: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 0 0 0 0 19: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 0 0 0 0 20: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 0 0 0 0 21: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 0 0 0 0 22: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 0 0 0 0 23: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 0 0 0 0 24: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 0 0 0 0 29: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 0 0 0 0 30: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 0 0 0 0 31: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 0 0 0 0 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 2011-07-19 4:44 ` Anton Blanchard @ 2011-07-19 10:21 ` Peter Zijlstra -1 siblings, 0 replies; 43+ messages in thread From: Peter Zijlstra @ 2011-07-19 10:21 UTC (permalink / raw) To: Anton Blanchard; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds On Tue, 2011-07-19 at 14:44 +1000, Anton Blanchard wrote: > > Our node distances are a bit arbitrary (I make them up based on > information given to us in the device tree). In terms of memory we have > a maximum of three levels. To give some gross estimates, on chip memory > might be 30GB/sec, on node memory 10-15GB/sec and off node memory > 5GB/sec. > > The only thing we tweak with node distances is to make sure we go into > node reclaim before going off node: > > /* > * Before going off node we want the VM to try and reclaim from the local > * node. It does this if the remote distance is larger than RECLAIM_DISTANCE. > * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of > * 20, we never reclaim and go off node straight away. > * > * To fix this we choose a smaller value of RECLAIM_DISTANCE. 
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 @ 2011-07-19 10:21 ` Peter Zijlstra 0 siblings, 0 replies; 43+ messages in thread From: Peter Zijlstra @ 2011-07-19 10:21 UTC (permalink / raw) To: Anton Blanchard; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds On Tue, 2011-07-19 at 14:44 +1000, Anton Blanchard wrote:
>
> Our node distances are a bit arbitrary (I make them up based on
> information given to us in the device tree). In terms of memory we have
> a maximum of three levels. To give some gross estimates, on chip memory
> might be 30GB/sec, on node memory 10-15GB/sec and off node memory
> 5GB/sec.
>
> The only thing we tweak with node distances is to make sure we go into
> node reclaim before going off node:
>
> /*
>  * Before going off node we want the VM to try and reclaim from the local
>  * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
>  * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
>  * 20, we never reclaim and go off node straight away.
>  *
>  * To fix this we choose a smaller value of RECLAIM_DISTANCE.
>  */
> #define RECLAIM_DISTANCE 10
> node distances:
> node  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
>  0: 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
>  1: 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
>  2: 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
>  3: 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
>  4: 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
>  5: 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
>  6: 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
>  7: 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
>  8: 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
>  9: 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
> 10: 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
> 11: 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
> 12: 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
> 13: 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
> 14: 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
> 15: 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
> 16: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 0 0 0 0
> 17: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 0 0 0 0
> 18: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 0 0 0 0
> 19: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 0 0 0 0
> 20: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 0 0 0 0
> 21: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 0 0 0 0
> 22: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 0 0 0 0
> 23: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 0 0 0 0
> 24: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 25: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 26: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 27: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 28: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 0 0 0 0
> 29: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 0 0 0 0
> 30: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 0 0 0 0
> 31: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 0 0 0 0

That looks very strange indeed.. up to node 23 there is the normal symmetric matrix with all the trace elements on 10 (as we would expect for local access), and some 4x4 sub-matrix stacked around the trace with 20, suggesting a single hop distance, and the rest on 40 being out-there.

But row 24-27 and column 28-31 are way weird, how can that ever be? Aren't the inter-connects symmetric and thus mandating a fully symmetric matrix? That is, how can traffic from node 23 (row) to node 28 (column) have inf bandwidth (0) yet traffic from node 28 (row) to node 23 (column) have a multi-hop distance of 40.

So the idea I had to generate numa sched domains from the node distance ( http://marc.info/?l=linux-kernel&m=130218515520540 ), would that still work for you? [it does assume a symmetric matrix ]

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 2011-07-19 10:21 ` Peter Zijlstra @ 2011-07-20 2:03 ` Anton Blanchard -1 siblings, 0 replies; 43+ messages in thread From: Anton Blanchard @ 2011-07-20 2:03 UTC (permalink / raw) To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds Hi, > That looks very strange indeed.. up to node 23 there is the normal > symmetric matrix with all the trace elements on 10 (as we would expect > for local access), and some 4x4 sub-matrix stacked around the trace > with 20, suggesting a single hop distance, and the rest on 40 being > out-there. > > But row 24-27 and column 28-31 are way weird, how can that ever be? > Aren't the inter-connects symmetric and thus mandating a fully > symmetric matrix? That is, how can traffic from node 23 (row) to node > 28 (column) have inf bandwidth (0) yet traffic from node 28 (row) to > node 23 (column) have a multi-hop distance of 40. Good point, it definitely makes no sense. 
It looks like a bug in numactl, the raw data looks reasonable: # cat /sys/devices/system/node/node?/distance node??/distance 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 
40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 Yet another bug to track down :( > So the idea I had to generate numa sched domains from the node > distance ( http://marc.info/?l=linux-kernel&m=130218515520540 ), > would that still work for you? [it does assume a symmetric matrix ] It should work for us and it makes our NUMA memory and scheduler domains more consistent. Nice! Anton ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 2011-07-19 10:21 ` Peter Zijlstra @ 2011-07-20 10:14 ` Anton Blanchard -1 siblings, 0 replies; 43+ messages in thread From: Anton Blanchard @ 2011-07-20 10:14 UTC (permalink / raw) To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds Hi Peter, > That looks very strange indeed.. up to node 23 there is the normal > symmetric matrix with all the trace elements on 10 (as we would expect > for local access), and some 4x4 sub-matrix stacked around the trace > with 20, suggesting a single hop distance, and the rest on 40 being > out-there. I retested with the latest version of numactl, and get correct results. I worked out why the patches don't boot, we weren't allocating any space for the cpumask and ran off the end of the allocation. Should we also use cpumask_copy instead of open coding it? I added that too. Anton Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c 2011-07-20 01:54:08.191668781 -0500 +++ linux-2.6/kernel/sched.c 2011-07-20 04:45:36.203750525 -0500 @@ -7020,8 +7020,8 @@ if (cpumask_test_cpu(i, covered)) continue; - sg = kzalloc_node(sizeof(struct sched_group), GFP_KERNEL, - cpu_to_node(i)); + sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), + GFP_KERNEL, cpu_to_node(i)); if (!sg) goto fail; @@ -7031,7 +7031,7 @@ child = *per_cpu_ptr(sdd->sd, i); if (child->child) { child = child->child; - *sg_span = *sched_domain_span(child); + cpumask_copy(sg_span, sched_domain_span(child)); } else cpumask_set_cpu(i, sg_span); ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 2011-07-20 10:14 ` Anton Blanchard @ 2011-07-20 10:45 ` Peter Zijlstra -1 siblings, 0 replies; 43+ messages in thread From: Peter Zijlstra @ 2011-07-20 10:45 UTC (permalink / raw) To: Anton Blanchard; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds On Wed, 2011-07-20 at 20:14 +1000, Anton Blanchard wrote: > > That looks very strange indeed.. up to node 23 there is the normal > > symmetric matrix with all the trace elements on 10 (as we would expect > > for local access), and some 4x4 sub-matrix stacked around the trace > > with 20, suggesting a single hop distance, and the rest on 40 being > > out-there. > > I retested with the latest version of numactl, and get correct results. One less thing to worry about ;-) > I worked out why the patches don't boot, we weren't allocating any > space for the cpumask and ran off the end of the allocation. Gah! that's not the first time I made that particular mistake :/ > Should we also use cpumask_copy instead of open coding it? I added that > too. Probably, I looked for cpumask_assign() and on failing to find that used the direct assignment. So with that fix the patch makes the machine happy again? Thanks! ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 2011-07-20 10:45 ` Peter Zijlstra @ 2011-07-20 12:14 ` Anton Blanchard -1 siblings, 0 replies; 43+ messages in thread From: Anton Blanchard @ 2011-07-20 12:14 UTC (permalink / raw) To: Peter Zijlstra; +Cc: mahesh, linux-kernel, linuxppc-dev, mingo, benh, torvalds Hi Peter, > So with that fix the patch makes the machine happy again? Yes, the machine looks fine with the patches applied. Thanks! Anton ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 2011-07-20 12:14 ` Anton Blanchard @ 2011-07-20 14:40 ` Linus Torvalds -1 siblings, 0 replies; 43+ messages in thread From: Linus Torvalds @ 2011-07-20 14:40 UTC (permalink / raw) To: Anton Blanchard Cc: Peter Zijlstra, mahesh, linux-kernel, linuxppc-dev, mingo, benh On Wed, Jul 20, 2011 at 5:14 AM, Anton Blanchard <anton@samba.org> wrote: > >> So with that fix the patch makes the machine happy again? > > Yes, the machine looks fine with the patches applied. Thanks! Ok, so what's the situation for 3.0 (I'm waiting for some RCU resolution now)? Anton's patch may be small, but that's just the tiny fixup patch to Peter's much scarier one ;) Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 2011-07-20 14:40 ` Linus Torvalds @ 2011-07-20 14:58 ` Peter Zijlstra -1 siblings, 0 replies; 43+ messages in thread From: Peter Zijlstra @ 2011-07-20 14:58 UTC (permalink / raw) To: Linus Torvalds Cc: Anton Blanchard, mahesh, linux-kernel, linuxppc-dev, mingo, benh On Wed, 2011-07-20 at 07:40 -0700, Linus Torvalds wrote: > On Wed, Jul 20, 2011 at 5:14 AM, Anton Blanchard <anton@samba.org> wrote: > > > >> So with that fix the patch makes the machine happy again? > > > > Yes, the machine looks fine with the patches applied. Thanks! > > Ok, so what's the situation for 3.0 (I'm waiting for some RCU > resolution now)? Anton's patch may be small, but that's just the tiny > fixup patch to Peter's much scarier one ;) Right, so we can either merge my scary patches now and have 3.0 boot on 16+ node machines (and risk breaking something), or delay them until 3.0.1 and have 16+ node machines suffer a little. The alternative quick hack is simply to disable the node domain, but that'll be detrimental to regular machines in that the top domain used to have NODE sd_flags will now have ALL_NODE sd_flags which are much less aggressive. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 2011-07-20 14:58 ` Peter Zijlstra @ 2011-07-20 16:04 ` Linus Torvalds -1 siblings, 0 replies; 43+ messages in thread From: Linus Torvalds @ 2011-07-20 16:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Anton Blanchard, mahesh, linux-kernel, linuxppc-dev, mingo, benh On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > Right, so we can either merge my scary patches now and have 3.0 boot on > 16+ node machines (and risk breaking something), or delay them until > 3.0.1 and have 16+ node machines suffer a little. So how much impact does your scary patch have on machines that don't have multiple nodes? If it's a "the code isn't even called by normal machines" kind of setup, I don't think I care a lot. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 2011-07-20 16:04 ` Linus Torvalds @ 2011-07-20 16:42 ` Ingo Molnar -1 siblings, 0 replies; 43+ messages in thread From: Ingo Molnar @ 2011-07-20 16:42 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Anton Blanchard, mahesh, linux-kernel, linuxppc-dev, benh * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > > Right, so we can either merge my scary patches now and have 3.0 > > boot on 16+ node machines (and risk breaking something), or delay > > them until 3.0.1 and have 16+ node machines suffer a little. > > So how much impact does your scary patch have on machines that > don't have multiple nodes? If it's a "the code isn't even called by > normal machines" kind of setup, I don't think I care a lot. NUMA systems will trigger the new code - not just 'weird NUMA systems' - but i still think we could try the patches, the code looks straightforward and i booted them on NUMA systems and it all seems fine so far. Anyway, i'll push the new sched/urgent branch out in a few minutes and then you'll see the full patches in the commit notifications. Thanks, Ingo ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 2011-07-20 16:04 ` Linus Torvalds @ 2011-07-20 16:42 ` Peter Zijlstra -1 siblings, 0 replies; 43+ messages in thread From: Peter Zijlstra @ 2011-07-20 16:42 UTC (permalink / raw) To: Linus Torvalds Cc: Anton Blanchard, mahesh, linux-kernel, linuxppc-dev, mingo, benh On Wed, 2011-07-20 at 09:04 -0700, Linus Torvalds wrote: > On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > > Right, so we can either merge my scary patches now and have 3.0 boot on > > 16+ node machines (and risk breaking something), or delay them until > > 3.0.1 and have 16+ node machines suffer a little. > > So how much impact does your scary patch have on machines that don't > have multiple nodes? If it's a "the code isn't even called by normal > machines" kind of setup, I don't think I care a lot. Hmm, it does get called, but it looks relatively straight forward to make it so that it doesn't. Let me try that. Yes, the below works nicely (on top of the previous two). Built and boot tested on a single-node and multi-node x86_64. --- Subject: sched: Avoid creating superfluous domains From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Wed Jul 20 18:34:30 CEST 2011 When creating sched_domains, stop when we've covered the entire target span instead of continuing to create domains, only to later find they're redundant and throw them away again. This avoids single node systems from touching funny NUMA sched_domain creation code. 
Requested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/sched.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -7436,6 +7436,8 @@ static int build_sched_domains(const str sd = build_sched_domain(tl, &d, cpu_map, attr, sd, i); if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP)) sd->flags |= SD_OVERLAP; + if (cpumask_equal(cpu_map, sched_domain_span(sd))) + break; } while (sd->child) ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 @ 2011-07-20 16:42 ` Peter Zijlstra 0 siblings, 0 replies; 43+ messages in thread From: Peter Zijlstra @ 2011-07-20 16:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: mahesh, linux-kernel, Anton Blanchard, mingo, linuxppc-dev On Wed, 2011-07-20 at 09:04 -0700, Linus Torvalds wrote: > On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> = wrote: > > > > Right, so we can either merge my scary patches now and have 3.0 boot on > > 16+ node machines (and risk breaking something), or delay them until > > 3.0.1 and have 16+ node machines suffer a little. >=20 > So how much impact does your scary patch have on machines that don't > have multiple nodes? If it's a "the code isn't even called by normal > machines" kind of setup, I don't think I care a lot. Hmm, it does get called, but it looks relatively straight forward to make it so that it doesn't. Let me try that. Yes, the below works nicely (on top of the previous two). Built and boot tested on a single-node and multi-node x86_64. --- Subject: sched: Avoid creating superfluous domains From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Wed Jul 20 18:34:30 CEST 2011 When creating sched_domains, stop when we've covered the entire target span instead of continuing to create domains, only to later find they're redundant and throw them away again. This avoids single node systems from touching funny NUMA sched_domain creation code. 
Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c | 2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -7436,6 +7436,8 @@ static int build_sched_domains(const str
 		sd = build_sched_domain(tl, &d, cpu_map, attr, sd, i);
 		if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
 			sd->flags |= SD_OVERLAP;
+		if (cpumask_equal(cpu_map, sched_domain_span(sd)))
+			break;
 	}

 	while (sd->child)

^ permalink raw reply	[flat|nested] 43+ messages in thread
* [tip:sched/urgent] sched: Avoid creating superfluous NUMA domains on non-NUMA systems
2011-07-20 17:29 ` tip-bot for Peter Zijlstra
0 siblings, 0 replies; 43+ messages in thread
From: tip-bot for Peter Zijlstra @ 2011-07-20 17:29 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, anton, hpa, mingo, torvalds, a.p.zijlstra, tglx, mingo

Commit-ID:  d110235d2c331c4f79e0879f51104be79e17a469
Gitweb:     http://git.kernel.org/tip/d110235d2c331c4f79e0879f51104be79e17a469
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Wed, 20 Jul 2011 18:42:57 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 20 Jul 2011 18:54:33 +0200

sched: Avoid creating superfluous NUMA domains on non-NUMA systems

When creating sched_domains, stop when we've covered the entire target
span instead of continuing to create domains, only to later find they're
redundant and throw them away again.

This avoids single node systems from touching funny NUMA sched_domain
creation code and reduces the risks of the new SD_OVERLAP code.

Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Anton Blanchard <anton@samba.org>
Cc: mahesh@linux.vnet.ibm.com
Cc: benh@kernel.crashing.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1311180177.29152.57.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 921adf6..14168c4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7436,6 +7436,8 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 		sd = build_sched_domain(tl, &d, cpu_map, attr, sd, i);
 		if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
 			sd->flags |= SD_OVERLAP;
+		if (cpumask_equal(cpu_map, sched_domain_span(sd)))
+			break;
 	}

 	while (sd->child)

^ permalink raw reply related	[flat|nested] 43+ messages in thread
end of thread, other threads: [~2011-07-20 17:30 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-07 10:22 [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982 Mahesh J Salgaonkar
2011-07-07 10:59 ` Peter Zijlstra
2011-07-07 11:55 ` Mahesh J Salgaonkar
2011-07-07 12:28 ` Peter Zijlstra
2011-07-14  0:34 ` Anton Blanchard
2011-07-14  4:35 ` Anton Blanchard
2011-07-14 13:16 ` Peter Zijlstra
2011-07-15  0:45 ` Anton Blanchard
2011-07-15  8:37 ` Peter Zijlstra
2011-07-18 21:35 ` Peter Zijlstra
2011-07-19  4:44 ` Anton Blanchard
2011-07-19 10:21 ` Peter Zijlstra
2011-07-20  2:03 ` Anton Blanchard
2011-07-20 10:14 ` Anton Blanchard
2011-07-20 10:45 ` Peter Zijlstra
2011-07-20 12:14 ` Anton Blanchard
2011-07-20 14:40 ` Linus Torvalds
2011-07-20 14:58 ` Peter Zijlstra
2011-07-20 16:04 ` Linus Torvalds
2011-07-20 16:42 ` Ingo Molnar
2011-07-20 16:42 ` Peter Zijlstra
2011-07-20 17:29 ` [tip:sched/urgent] sched: Avoid creating superfluous NUMA domains on non-NUMA systems tip-bot for Peter Zijlstra