From: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	anton@samba.org, mingo@elte.hu, torvalds@linux-foundation.org
Subject: Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
Date: Thu, 7 Jul 2011 17:25:31 +0530
Message-ID: <20110707115531.GA21737@in.ibm.com>
In-Reply-To: <1310036375.3282.509.camel@twins>

On 2011-07-07 12:59:35 Thu, Peter Zijlstra wrote:
> On Thu, 2011-07-07 at 15:52 +0530, Mahesh J Salgaonkar wrote:
> > 
> > 2.6.39 booted fine on the system and a git bisect shows commit cd4ea6ae -
> > "sched: Change NODE sched_domain group creation" as the cause.
> 
> Weird, there's no locking anywhere around there. The typical problems
> with this patch-set were massive explosions due to bad pointers etc..
> But not silent hangs.
> 
> The code its stuck at:
> 
> > [1]:
> > POWER7 performance monitor hardware support registered
> > Brought up 896 CPUs
> > Enabling Asymmetric SMT scheduling
> > BUG: soft lockup - CPU#0 stuck for 22s! [swapper:1]
> > Modules linked in:
> > NIP: c000000000074b90 LR: c00000000008a1c4 CTR: 0000000000000000
> > REGS: c000000fae25f9c0 TRAP: 0901   Not tainted  (3.0.0-rc6)
> > MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24000088  XER: 00000004
> > TASK = c000000fae248490[1] 'swapper' THREAD: c000000fae25c000 CPU: 0
> > GPR00: 0000e2a55cbeec50 c000000fae25fc40 c000000000e21f90 c000007b2b34cb00
> > GPR04: 0000000000000100 0000000000000100 c000011adcf23418 0000000000000000
> > GPR08: 0000000000000000 c000008b2b7d4480 c000007b2b35ef80 00000000000024ac
> > GPR12: 0000000044000042 c00000000ebb0000
> > NIP [c000000000074b90] .update_group_power+0x50/0x190
> > LR [c00000000008a1c4] .build_sched_domains+0x434/0x490
> > Call Trace:
> > [c000000fae25fc40] [c000000fae25fce0] 0xc000000fae25fce0 (unreliable)
> > [c000000fae25fce0] [c00000000008a1c4] .build_sched_domains+0x434/0x490
> > [c000000fae25fdd0] [c000000000867370] .sched_init_smp+0xa8/0x224
> > [c000000fae25fee0] [c000000000850274] .kernel_init+0x10c/0x1fc
> > [c000000fae25ff90] [c000000000023884] .kernel_thread+0x54/0x70
> > Instruction dump:
> > f821ff61 ebc2b1a0 7c7f1b78 7c9c2378 e9230008 eba30010 2fa90000 419e0054
> > e9490010 38000000 7d495378 60000000 <8169000c> e9290000 7faa4800 7c005a14
> 
> doesn't contains any locks, its simply looping over all the cpus, and
> with that many I can imagine it takes a while, but getting 'stuck' there
> is unexpected to say the least.
> 
> Surely this isn't the first multi-node P7 to boot a kernel with this
> patch? If my git foo is any good it hit -next on 23rd of May.
> 
> I guess I'm asking is, do smaller P7 machines boot? And if so, is there
> any difference except size?

Yes, the smaller P7 machine that I have, with 20 CPUs and 2GB of RAM, boots
fine with 3.0.0-rc.

> 
> How many nodes does the thing have anyway, 28? Hmm, that could mean its
> the first machine with >16 nodes to boot this, which would make it
> trigger the magic ALL_NODES crap.

The P7 machine on which the kernel fails to boot shows the following dmesg
log for the node map:
---------------------------
Zone PFN ranges:
  DMA      0x00000000 -> 0x01229000
  Normal   empty
Movable zone start PFN for each node
early_node_map[12] active PFN ranges
    0: 0x00000000 -> 0x000fd000
    4: 0x000fd000 -> 0x002fb000
    5: 0x002fb000 -> 0x004b9000
    6: 0x004b9000 -> 0x006b9000
    8: 0x006b9000 -> 0x007b5000
   12: 0x007b5000 -> 0x008b5000
   16: 0x008b5000 -> 0x009b1000
   20: 0x009b1000 -> 0x00bb1000
   21: 0x00bb1000 -> 0x00db1000
   22: 0x00db1000 -> 0x00fb1000
   23: 0x00fb1000 -> 0x011b1000
   28: 0x011b1000 -> 0x01229000
Could not find start_pfn for node 1
Could not find start_pfn for node 2
Could not find start_pfn for node 3
Could not find start_pfn for node 7
Could not find start_pfn for node 9
Could not find start_pfn for node 10
Could not find start_pfn for node 11
Could not find start_pfn for node 13
Could not find start_pfn for node 14
Could not find start_pfn for node 15
Could not find start_pfn for node 17
Could not find start_pfn for node 18
Could not find start_pfn for node 19
Could not find start_pfn for node 29
Could not find start_pfn for node 30
Could not find start_pfn for node 31
[boot]0015 Setup Done
PERCPU: Embedded 1 pages/cpu @c000000013c00000 s31488 r0 d34048 u65536
Built 28 zonelists in Node order, mobility grouping on.  Total pages:
19026032
Policy zone: DMA
Kernel command line: root=/dev/mapper/vg_nish1-lv_root ro
rd_LVM_LV=vg_nish1/lv_root rd_LVM_LV=VolGroup/lv_swap
rd_LVM_LV=vg_nish1/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8
SYSFONT=latarcyrheb-sun16 KEYTABLE=us console=hvc0i memblock=debug 
PID hash table entries: 4096 (order: -1, 32768 bytes)
freeing bootmem node 0
freeing bootmem node 4
freeing bootmem node 5
freeing bootmem node 6
freeing bootmem node 8
freeing bootmem node 12
freeing bootmem node 16
freeing bootmem node 20
freeing bootmem node 21
freeing bootmem node 22
freeing bootmem node 23
freeing bootmem node 28
Memory: 1213775296k/1218707456k available (13312k kernel code, 4932160k
reserved, 1600k data, 2727k bss, 4928k init)
---------------------------
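
A minimal sketch for tallying the node map in a dmesg excerpt like the one
above, in case it helps when comparing against other machines. The helper
name summarize_node_map and both regular expressions are assumptions made
for illustration; none of this is kernel code, and the embedded sample is
only a shortened copy of the log lines above.

```python
import re

# Hypothetical helper (not from the kernel tree): tally node IDs that appear
# in the "early_node_map" section of a dmesg excerpt.
NODE_RANGE = re.compile(
    r"^\s*(\d+):\s+0x([0-9a-fA-F]+)\s+->\s+0x([0-9a-fA-F]+)\s*$"
)
MISSING = re.compile(r"^Could not find start_pfn for node (\d+)\s*$")


def summarize_node_map(dmesg_text):
    """Return (populated, missing).

    populated maps node id -> number of page frames in its PFN ranges;
    missing lists node ids reported as having no start_pfn.
    """
    populated = {}
    missing = []
    for line in dmesg_text.splitlines():
        m = NODE_RANGE.match(line)
        if m:
            node = int(m.group(1))
            start = int(m.group(2), 16)
            end = int(m.group(3), 16)
            # A node may own several ranges, so accumulate.
            populated[node] = populated.get(node, 0) + (end - start)
            continue
        m = MISSING.match(line)
        if m:
            missing.append(int(m.group(1)))
    return populated, missing


# Shortened copy of the log above, used only to exercise the parser.
SAMPLE = """\
early_node_map[12] active PFN ranges
    0: 0x00000000 -> 0x000fd000
    4: 0x000fd000 -> 0x002fb000
   28: 0x011b1000 -> 0x01229000
Could not find start_pfn for node 1
Could not find start_pfn for node 2
"""

if __name__ == "__main__":
    populated, missing = summarize_node_map(SAMPLE)
    print("nodes with memory:", sorted(populated))
    print("nodes without memory:", missing)
```

Feeding it the full excerpt instead of the sample separates the node IDs
that own PFN ranges from the ones listed as "Could not find start_pfn",
which makes the sparse numbering (highest populated ID is 28) easier to see.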

Thanks,
-Mahesh.

> 
> Let me dig around there.
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

-- 
Mahesh J Salgaonkar
