linuxppc-dev.lists.ozlabs.org archive mirror
* Kernel crash caused by cpufreq
@ 2014-07-25  1:07 Gavin Shan
  2014-07-28  7:03 ` Michael Ellerman
  0 siblings, 1 reply; 3+ messages in thread
From: Gavin Shan @ 2014-07-25  1:07 UTC
  To: linuxppc-dev; +Cc: gwshan


I'm tracing an LSI interrupt issue on a P8 box and eventually ran into the
following kernel crash. Not sure if there is already a fix for this? :-)

Starting Linux PPC64 #401 SMP Fri Jul 25 10:52:28 EST 2014
-----------------------------------------------------
ppc64_pft_size                = 0x0
physicalMemorySize            = 0x800000000
htab_address                  = 0xc0000007fe000000
htab_hash_mask                = 0x3ffff
-----------------------------------------------------
 <- setup_system()
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpuacct
Linux version 3.16.0-rc6-00076-g4226dbe-dirty (shangw@shangw) (gcc version 4.5.2 (crosstool-NG 1.19.0) ) #401 SMP Fri Jul 25 10:52:28 EST 2014
	:
< Unrelated log stripped >
	:
------------[ cut here ]------------
kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
cpu 0x1: Vector: 700 (Program Check) at [c0000007f8483370]
    pc: c00000000096f32c: .pstate_id_to_freq+0x2c/0x50
    lr: c00000000096f37c: .powernv_read_cpu_freq+0x2c/0x50
    sp: c0000007f84835f0
   msr: 9000000000029032
  current = 0xc0000007f8400000
  paca    = 0xc00000000ff00400	 softe: 0	 irq_happened: 0x01
    pid   = 1, comm = swapper/0
kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
enter ? for help
[link register   ] c00000000096f37c .powernv_read_cpu_freq+0x2c/0x50
[c0000007f84835f0] c000000001076070 key_type_dns_resolver+0xef20/0x40d20 (unreliable)
[c0000007f8483670] c00000000010812c .generic_exec_single+0x8c/0x270
[c0000007f8483730] c0000000001083d0 .smp_call_function_single+0x90/0xb0
[c0000007f84837b0] c000000000108cec .smp_call_function_any+0x15c/0x200
[c0000007f8483860] c00000000096f27c .powernv_cpufreq_get+0x3c/0x60
[c0000007f84838f0] c000000000969dc4 .__cpufreq_add_dev.clone.11+0x574/0xa20
[c0000007f84839e0] c0000000005f7a4c .subsys_interface_register+0xec/0x130
[c0000007f8483a90] c000000000967af8 .cpufreq_register_driver+0x168/0x2d0
[c0000007f8483b30] c000000000f774cc .powernv_cpufreq_init+0x210/0x244
[c0000007f8483be0] c00000000000bc08 .do_one_initcall+0xc8/0x240
[c0000007f8483ce0] c000000000f44054 .kernel_init_freeable+0x268/0x33c
[c0000007f8483db0] c00000000000c4dc .kernel_init+0x1c/0x110
[c0000007f8483e30] c00000000000a428 .ret_from_kernel_thread+0x58/0xb0
1:mon> e
cpu 0x1: Vector: 700 (Program Check) at [c0000007f8483370]
    pc: c00000000096f32c: .pstate_id_to_freq+0x2c/0x50
    lr: c00000000096f37c: .powernv_read_cpu_freq+0x2c/0x50
    sp: c0000007f84835f0
   msr: 9000000000029032
  current = 0xc0000007f8400000
  paca    = 0xc00000000ff00400	 softe: 0	 irq_happened: 0x01
    pid   = 1, comm = swapper/0
kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
1:mon> r
R00 = 0000000000000042   R16 = 0000000000000000
R01 = c0000007f84835f0   R17 = 0000000000000000
R02 = c00000000116c430   R18 = 0000000000000000
R03 = ffffffffffffffbe   R19 = 0000000000000001
R04 = 0000000000000000   R20 = c00000078bd61e58
R05 = c00000000114ca78   R21 = c000000001008910
R06 = c0000007f84838d0   R22 = c0000000011c1b74
R07 = 0000000000000001   R23 = c000000002200030
R08 = c0000000011c1910   R24 = 0000000000000001
R09 = c000000001f2970c   R25 = c000000001008730
R10 = 0000000000000001   R26 = c000000001f29478
R11 = 0000000000000042   R27 = c0000007f84838d0
R12 = 0000000044002084   R28 = c00000000114ca78
R13 = c00000000ff00400   R29 = 0000000000000000
R14 = c00000000000c4c0   R30 = c0000000010a90f0
R15 = 0000000000000000   R31 = c0000007f84838d0
pc  = c00000000096f32c .pstate_id_to_freq+0x2c/0x50
cfar= c00000000096f324 .pstate_id_to_freq+0x24/0x50
lr  = c00000000096f37c .powernv_read_cpu_freq+0x2c/0x50
msr = 9000000000029032   cr  = 44002082
ctr = c00000000096f350   xer = 0000000020000000   trap =  700
1:mon>

Thanks,
Gavin


* Re: Kernel crash caused by cpufreq
  2014-07-25  1:07 Kernel crash caused by cpufreq Gavin Shan
@ 2014-07-28  7:03 ` Michael Ellerman
  2014-07-28 10:28   ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 3+ messages in thread
From: Michael Ellerman @ 2014-07-28  7:03 UTC
  To: Gavin Shan, svaidy; +Cc: linuxppc-dev

On Fri, 2014-07-25 at 11:07 +1000, Gavin Shan wrote:
> I'm tracing an LSI interrupt issue on a P8 box and eventually ran into the
> following kernel crash. Not sure if there is already a fix for this? :-)
 
I'm pretty sure Vaidy wrote that driver (he's on CC).

> Starting Linux PPC64 #401 SMP Fri Jul 25 10:52:28 EST 2014
> -----------------------------------------------------
> ppc64_pft_size                = 0x0
> physicalMemorySize            = 0x800000000
> htab_address                  = 0xc0000007fe000000
> htab_hash_mask                = 0x3ffff
> -----------------------------------------------------
>  <- setup_system()
> Initializing cgroup subsys cpuset
> Initializing cgroup subsys cpuacct
> Linux version 3.16.0-rc6-00076-g4226dbe-dirty (shangw@shangw) (gcc version 4.5.2 (crosstool-NG 1.19.0) ) #401 SMP Fri Jul 25 10:52:28 EST 2014
> 	:
> < Unrelated log stripped >

Are you sure there's nothing else in the log that might be related? See the
messages in init_powernv_pstates() for example.

> ------------[ cut here ]------------
> kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
> cpu 0x1: Vector: 700 (Program Check) at [c0000007f8483370]
>     pc: c00000000096f32c: .pstate_id_to_freq+0x2c/0x50
>     lr: c00000000096f37c: .powernv_read_cpu_freq+0x2c/0x50
>     sp: c0000007f84835f0
>    msr: 9000000000029032
>   current = 0xc0000007f8400000
>   paca    = 0xc00000000ff00400	 softe: 0	 irq_happened: 0x01
>     pid   = 1, comm = swapper/0
> kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
> enter ? for help
> [link register   ] c00000000096f37c .powernv_read_cpu_freq+0x2c/0x50
> [c0000007f84835f0] c000000001076070 key_type_dns_resolver+0xef20/0x40d20 (unreliable)
> [c0000007f8483670] c00000000010812c .generic_exec_single+0x8c/0x270
> [c0000007f8483730] c0000000001083d0 .smp_call_function_single+0x90/0xb0
> [c0000007f84837b0] c000000000108cec .smp_call_function_any+0x15c/0x200
> [c0000007f8483860] c00000000096f27c .powernv_cpufreq_get+0x3c/0x60
> [c0000007f84838f0] c000000000969dc4 .__cpufreq_add_dev.clone.11+0x574/0xa20
> [c0000007f84839e0] c0000000005f7a4c .subsys_interface_register+0xec/0x130
> [c0000007f8483a90] c000000000967af8 .cpufreq_register_driver+0x168/0x2d0
> [c0000007f8483b30] c000000000f774cc .powernv_cpufreq_init+0x210/0x244
> [c0000007f8483be0] c00000000000bc08 .do_one_initcall+0xc8/0x240
> [c0000007f8483ce0] c000000000f44054 .kernel_init_freeable+0x268/0x33c
> [c0000007f8483db0] c00000000000c4dc .kernel_init+0x1c/0x110
> [c0000007f8483e30] c00000000000a428 .ret_from_kernel_thread+0x58/0xb0
> 1:mon> e
> cpu 0x1: Vector: 700 (Program Check) at [c0000007f8483370]
>     pc: c00000000096f32c: .pstate_id_to_freq+0x2c/0x50
>     lr: c00000000096f37c: .powernv_read_cpu_freq+0x2c/0x50
>     sp: c0000007f84835f0
>    msr: 9000000000029032
>   current = 0xc0000007f8400000
>   paca    = 0xc00000000ff00400	 softe: 0	 irq_happened: 0x01
>     pid   = 1, comm = swapper/0
> kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
> 1:mon> r
> R00 = 0000000000000042   R16 = 0000000000000000
> R01 = c0000007f84835f0   R17 = 0000000000000000
> R02 = c00000000116c430   R18 = 0000000000000000
> R03 = ffffffffffffffbe   R19 = 0000000000000001
> R04 = 0000000000000000   R20 = c00000078bd61e58
> R05 = c00000000114ca78   R21 = c000000001008910
> R06 = c0000007f84838d0   R22 = c0000000011c1b74
> R07 = 0000000000000001   R23 = c000000002200030
> R08 = c0000000011c1910   R24 = 0000000000000001
> R09 = c000000001f2970c   R25 = c000000001008730
> R10 = 0000000000000001   R26 = c000000001f29478
> R11 = 0000000000000042   R27 = c0000007f84838d0
> R12 = 0000000044002084   R28 = c00000000114ca78
> R13 = c00000000ff00400   R29 = 0000000000000000
> R14 = c00000000000c4c0   R30 = c0000000010a90f0
> R15 = 0000000000000000   R31 = c0000007f84838d0
> pc  = c00000000096f32c .pstate_id_to_freq+0x2c/0x50
> cfar= c00000000096f324 .pstate_id_to_freq+0x24/0x50
> lr  = c00000000096f37c .powernv_read_cpu_freq+0x2c/0x50
> msr = 9000000000029032   cr  = 44002082
> ctr = c00000000096f350   xer = 0000000020000000   trap =  700
> 1:mon>

Gavin, in future for a dump like this it's very helpful to see the actual code
that hit the bug. You can get that with:

1:mon> di $.pstate_id_to_freq


Vaidy, judging by r3 it looks like i became negative. That would obviously
happen if powernv_pstate_info.max was zero?
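
For reference, the arithmetic behind that reading of r3 can be sketched in
a few lines of userspace C. The formula i = max - pstate_id is an
assumption about pstate_id_to_freq() (the thread does not quote the
source), and treating the 0x42 seen in R00/R11 as the pstate id is likewise
a guess. reg_as_signed(), pstate_index() and self_check() are hypothetical
helper names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret a raw 64-bit register value as a two's-complement signed number. */
static int64_t reg_as_signed(uint64_t raw)
{
	if (raw <= (uint64_t)INT64_MAX)
		return (int64_t)raw;
	/* Negative case: -(~raw) - 1 avoids an out-of-range conversion. */
	return -(int64_t)(~raw) - 1;
}

/*
 * ASSUMPTION: the driver derives its table index as max - pstate_id.
 * This mirrors the reasoning in the mail, not verified driver source.
 */
static int pstate_index(int max, int pstate_id)
{
	return max - pstate_id;
}

/* Check the two values discussed above against each other. */
static void self_check(void)
{
	/* R03 from the register dump, read as a signed value. */
	assert(reg_as_signed(0xffffffffffffffbeULL) == -66);
	/* max == 0 and a pstate id of 0x42 (the value in R00/R11). */
	assert(pstate_index(0, 0x42) == -66);
}
```

Under those assumptions, a max of zero combined with a pstate id of 0x42
gives exactly the value seen in r3, which would trip a check for i < 0.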

cheers


* Re: Kernel crash caused by cpufreq
  2014-07-28  7:03 ` Michael Ellerman
@ 2014-07-28 10:28   ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 3+ messages in thread
From: Vaidyanathan Srinivasan @ 2014-07-28 10:28 UTC
  To: Michael Ellerman; +Cc: linuxppc-dev, Gavin Shan

* Michael Ellerman <mpe@ellerman.id.au> [2014-07-28 17:03:10]:

> On Fri, 2014-07-25 at 11:07 +1000, Gavin Shan wrote:
> > I'm tracing an LSI interrupt issue on a P8 box and eventually ran into the
> > following kernel crash. Not sure if there is already a fix for this? :-)
> 
> I'm pretty sure Vaidy wrote that driver (he's on CC).

Yes, I did :)
 
> > Starting Linux PPC64 #401 SMP Fri Jul 25 10:52:28 EST 2014
> > -----------------------------------------------------
> > ppc64_pft_size                = 0x0
> > physicalMemorySize            = 0x800000000
> > htab_address                  = 0xc0000007fe000000
> > htab_hash_mask                = 0x3ffff
> > -----------------------------------------------------
> >  <- setup_system()
> > Initializing cgroup subsys cpuset
> > Initializing cgroup subsys cpuacct
> > Linux version 3.16.0-rc6-00076-g4226dbe-dirty (shangw@shangw) (gcc version 4.5.2 (crosstool-NG 1.19.0) ) #401 SMP Fri Jul 25 10:52:28 EST 2014
> > 	:
> > < Unrelated log stripped >
> 
> Are you sure there's nothing else in the log that might be related? See the
> messages in init_powernv_pstates() for example.

Most likely the PMSR register is showing out-of-bounds values because of
a potential firmware (OPAL) issue.

> > ------------[ cut here ]------------
> > kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
> > cpu 0x1: Vector: 700 (Program Check) at [c0000007f8483370]
> >     pc: c00000000096f32c: .pstate_id_to_freq+0x2c/0x50
> >     lr: c00000000096f37c: .powernv_read_cpu_freq+0x2c/0x50
> >     sp: c0000007f84835f0
> >    msr: 9000000000029032
> >   current = 0xc0000007f8400000
> >   paca    = 0xc00000000ff00400	 softe: 0	 irq_happened: 0x01
> >     pid   = 1, comm = swapper/0
> > kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
> > enter ? for help
> > [link register   ] c00000000096f37c .powernv_read_cpu_freq+0x2c/0x50
> > [c0000007f84835f0] c000000001076070 key_type_dns_resolver+0xef20/0x40d20 (unreliable)
> > [c0000007f8483670] c00000000010812c .generic_exec_single+0x8c/0x270
> > [c0000007f8483730] c0000000001083d0 .smp_call_function_single+0x90/0xb0
> > [c0000007f84837b0] c000000000108cec .smp_call_function_any+0x15c/0x200
> > [c0000007f8483860] c00000000096f27c .powernv_cpufreq_get+0x3c/0x60
> > [c0000007f84838f0] c000000000969dc4 .__cpufreq_add_dev.clone.11+0x574/0xa20
> > [c0000007f84839e0] c0000000005f7a4c .subsys_interface_register+0xec/0x130
> > [c0000007f8483a90] c000000000967af8 .cpufreq_register_driver+0x168/0x2d0
> > [c0000007f8483b30] c000000000f774cc .powernv_cpufreq_init+0x210/0x244
> > [c0000007f8483be0] c00000000000bc08 .do_one_initcall+0xc8/0x240
> > [c0000007f8483ce0] c000000000f44054 .kernel_init_freeable+0x268/0x33c
> > [c0000007f8483db0] c00000000000c4dc .kernel_init+0x1c/0x110
> > [c0000007f8483e30] c00000000000a428 .ret_from_kernel_thread+0x58/0xb0
> > 1:mon> e
> > cpu 0x1: Vector: 700 (Program Check) at [c0000007f8483370]
> >     pc: c00000000096f32c: .pstate_id_to_freq+0x2c/0x50
> >     lr: c00000000096f37c: .powernv_read_cpu_freq+0x2c/0x50
> >     sp: c0000007f84835f0
> >    msr: 9000000000029032
> >   current = 0xc0000007f8400000
> >   paca    = 0xc00000000ff00400	 softe: 0	 irq_happened: 0x01
> >     pid   = 1, comm = swapper/0
> > kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
> > 1:mon> r
> > R00 = 0000000000000042   R16 = 0000000000000000
> > R01 = c0000007f84835f0   R17 = 0000000000000000
> > R02 = c00000000116c430   R18 = 0000000000000000
> > R03 = ffffffffffffffbe   R19 = 0000000000000001
> > R04 = 0000000000000000   R20 = c00000078bd61e58
> > R05 = c00000000114ca78   R21 = c000000001008910
> > R06 = c0000007f84838d0   R22 = c0000000011c1b74
> > R07 = 0000000000000001   R23 = c000000002200030
> > R08 = c0000000011c1910   R24 = 0000000000000001
> > R09 = c000000001f2970c   R25 = c000000001008730
> > R10 = 0000000000000001   R26 = c000000001f29478
> > R11 = 0000000000000042   R27 = c0000007f84838d0
> > R12 = 0000000044002084   R28 = c00000000114ca78
> > R13 = c00000000ff00400   R29 = 0000000000000000
> > R14 = c00000000000c4c0   R30 = c0000000010a90f0
> > R15 = 0000000000000000   R31 = c0000007f84838d0
> > pc  = c00000000096f32c .pstate_id_to_freq+0x2c/0x50
> > cfar= c00000000096f324 .pstate_id_to_freq+0x24/0x50
> > lr  = c00000000096f37c .powernv_read_cpu_freq+0x2c/0x50
> > msr = 9000000000029032   cr  = 44002082
> > ctr = c00000000096f350   xer = 0000000020000000   trap =  700
> > 1:mon>
> 
> Gavin, in future for a dump like this it's very helpful to see the actual code
> that hit the bug. You can get that with:
> 
> 1:mon> di $.pstate_id_to_freq
> 
> 
> Vaidy, judging by r3 it looks like i became negative. That would obviously
> happen if powernv_pstate_info.max was zero?

Yes, negative is OK.  Something has gone wrong with the PState
firmware/hardware.  A BUG_ON() is too severe for this error.  I will
change the code so that it does not stop the system for this error, and
also investigate what is happening at runtime.
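
As a rough illustration of the direction described above, here is a
userspace sketch of a bounds check that warns instead of halting. It is not
the actual driver change: struct pstate_info, freqs_khz, the table values
and pstate_id_to_freq_checked() are all hypothetical stand-ins, and
returning 0 for a bad id is just one possible policy. In kernel code,
pr_warn() or WARN_ON() would take the place of fprintf().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver's pstate bookkeeping. */
struct pstate_info {
	int max;        /* highest pstate id; lower ids are more negative */
	int nr_pstates; /* number of entries in the frequency table */
};

/* Invented example table, indexed by (max - pstate_id). */
static const unsigned int freqs_khz[] = { 3000000, 2500000, 2000000 };
static const struct pstate_info info = { .max = 0, .nr_pstates = 3 };

/*
 * Map a pstate id to a frequency in kHz. An out-of-range id is reported
 * on stderr and yields 0, so the caller can carry on instead of the
 * whole system stopping.
 */
static unsigned int pstate_id_to_freq_checked(int pstate_id)
{
	int i = info.max - pstate_id;

	if (i < 0 || i >= info.nr_pstates) {
		fprintf(stderr, "pstate id %d out of range (max %d, nr %d)\n",
			pstate_id, info.max, info.nr_pstates);
		return 0;
	}
	return freqs_khz[i];
}
```

With this table, ids 0, -1 and -2 map to valid entries, while an id such
as 66 (index -66, as in the dump) only produces a warning.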

--Vaidy
