* Linux panics when suspend cannot offline the secondary cores
@ 2016-06-10 15:41 ` Mason
  0 siblings, 0 replies; 19+ messages in thread
From: Mason @ 2016-06-10 15:41 UTC (permalink / raw)
  To: linux-pm, Linux ARM
  Cc: Rafael J. Wysocki, Russell King, Stephen Boyd, Sebastian Frias

Hello,

I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
unhappy when the suspend framework fails to offline secondary cores.

Is this expected/by design, or could it fail more gracefully?
(It could also be something missing in my platform's code.)

Regards.


# echo mem > /sys/power/state 
[   30.722352] PM: Syncing filesystems ... done.
[   30.727146] PM: Preparing system for sleep (mem)
[   30.736927] Freezing user space processes ... (elapsed 0.001 seconds) done.
[   30.745519] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[   30.754098] PM: Suspending system (mem)
[   30.760934] PM: suspend of devices complete after 2.104 msecs
[   30.767638] PM: late suspend of devices complete after 0.883 msecs
[   30.774529] PM: noirq suspend of devices complete after 0.653 msecs
[   30.780846] Disabling non-boot CPUs ...
[   30.795697] CPU1: shutdown
[   30.795701] IN tango_cpu_die
[   30.795709] CPU1: smp_ops.cpu_die() returned, trying to resuscitate
[   30.795730] BUG: scheduling while atomic: swapper/1/0/0x00000002
[   30.795735] Modules linked in:
[   30.795756] Preemption disabled at:[<c04a5898>] schedule_preempt_disabled+0x20/0x24
[   30.795757] 
[   30.795766] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   30.795768] Hardware name: Sigma Tango DT
[   30.795773] Backtrace: 
[   30.795790] [<c010b974>] (dump_backtrace) from [<c010bb70>] (show_stack+0x18/0x1c)
[   30.795797]  r7:60000013 r6:c080eb04 r5:00000000 r4:c080eb04
[   30.795811] [<c010bb58>] (show_stack) from [<c02eb084>] (dump_stack+0x80/0x94)
[   30.795820] [<c02eb004>] (dump_stack) from [<c013cb34>] (__schedule_bug+0x6c/0xb8)
[   30.795827]  r7:c0802638 r6:e745f6c0 r5:e7ae8ec0 r4:e7460000
[   30.795833] [<c013cac8>] (__schedule_bug) from [<c04a522c>] (__schedule+0x434/0x530)
[   30.795837]  r5:e7ae8ec0 r4:c0736ec0
[   30.795842] [<c04a4df8>] (__schedule) from [<c04a5378>] (schedule+0x50/0xb0)
[   30.795852]  r10:00000000 r9:c08024f8 r8:c05b8b6c r7:c081e2d6 r6:c05ce0b8 r5:c0802494
[   30.795855]  r4:e7460000
[   30.795861] [<c04a5328>] (schedule) from [<c04a5890>] (schedule_preempt_disabled+0x18/0x24)
[   30.795865]  r5:c0802494 r4:e7460000
[   30.795876] [<c04a5878>] (schedule_preempt_disabled) from [<c0155f0c>] (cpu_startup_entry+0x10c/0x18c)
[   30.795884] [<c0155e00>] (cpu_startup_entry) from [<c010dc14>] (secondary_start_kernel+0x158/0x164)
[   30.795888]  r7:c081e2d6 r4:c080b530
[   30.795898] [<c010dabc>] (secondary_start_kernel) from [<c04a9208>] (_raw_spin_unlock_irqrestore+0x30/0x5c)
[   30.795902]  r5:c0802494 r4:00000001
[   30.952513] IN tango_cpu_kill
[   30.955537] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[   30.963668] pgd = c0004000
[   30.966382] [00000010] *pgd=00000000
[   30.969976] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[   30.975312] Modules linked in:
[   30.978379] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   30.989478] Hardware name: Sigma Tango DT
[   30.993503] task: e745f6c0 ti: e7460000 task.ti: e7460000
[   30.998933] PC is at __tick_nohz_idle_enter+0x2d8/0x444
[   31.004188] LR is at debug_smp_processor_id+0x20/0x24
[   31.009262] pc : [<c0184d1c>]    lr : [<c030305c>]    psr: 60000093
[   31.009262] sp : e7461f50  ip : e7461f20  fp : e7461fac
[   31.020800] r10: 00000000  r9 : 00000000  r8 : 00000000
[   31.026047] r7 : 00000000  r6 : 0032dcd5  r5 : 00000001  r4 : e7ae6e38
[   31.032605] r3 : 00000000  r2 : 0032dcd5  r1 : 00000000  r0 : 0032dcd5
[   31.039164] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   31.046420] Control: 10c5387d  Table: 8000404a  DAC: 00000051
[   31.052192] Process swapper/1 (pid: 0, stack limit = 0xe7460210)
[   31.058226] Stack: (0xe7461f50 to 0xe7462000)
[   31.062602] 1f40:                                     c04a4fcc c013c8b0 00000001 00000000
[   31.070821] 1f60: 35293313 00000007 34faa6c3 00000007 34f6563e 00000007 34faa6c3 00000007
[   31.079041] 1f80: ffffffff 7fffffff c0734e38 c0802494 c05ce0b8 c081e2d6 c05b8b6c c08024f8
[   31.087261] 1fa0: e7461fc4 e7461fb0 c0185294 c0184a50 e7460000 c0802494 e7461fdc e7461fc8
[   31.095480] 1fc0: c0155e58 c0185258 c080b530 c081e2d6 e7461ff4 e7461fe0 c010dc14 c0155e0c
[   31.103700] 1fe0: 00000001 c0802494 00000000 e7461ff8 c04a9208 c010dac8 454115f5 56b2e41b
[   31.111916] Backtrace: 
[   31.114376] [<c0184a44>] (__tick_nohz_idle_enter) from [<c0185294>] (tick_nohz_idle_enter+0x48/0x80)
[   31.123553]  r9:c08024f8 r8:c05b8b6c r7:c081e2d6 r6:c05ce0b8 r5:c0802494 r4:c0734e38
[   31.131353] [<c018524c>] (tick_nohz_idle_enter) from [<c0155e58>] (cpu_startup_entry+0x58/0x18c)
[   31.140181]  r5:c0802494 r4:e7460000
[   31.143778] [<c0155e00>] (cpu_startup_entry) from [<c010dc14>] (secondary_start_kernel+0x158/0x164)
[   31.152868]  r7:c081e2d6 r4:c080b530
[   31.156464] [<c010dabc>] (secondary_start_kernel) from [<c04a9208>] (_raw_spin_unlock_irqrestore+0x30/0x5c)
[   31.166253]  r5:c0802494 r4:00000001
[   31.169848] Code: e89dabf0 e14b24d4 e1a00004 ebffff22 (e1c821d0) 
[   31.175972] ---[ end trace 5e1e78cb2505c930 ]---
[   31.180611] Kernel panic - not syncing: Attempted to kill the idle task!
[   31.187346] CPU0: stopping
[   31.190064] CPU: 0 PID: 10 Comm: migration/0 Tainted: G      D W       4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   31.201426] Hardware name: Sigma Tango DT
[   31.205449] Backtrace: 
[   31.207911] [<c010b974>] (dump_backtrace) from [<c010bb70>] (show_stack+0x18/0x1c)
[   31.215516]  r7:20000193 r6:c080eb04 r5:00000000 r4:c080eb04
[   31.221218] [<c010bb58>] (show_stack) from [<c02eb084>] (dump_stack+0x80/0x94)
[   31.228478] [<c02eb004>] (dump_stack) from [<c010e034>] (handle_IPI+0x1a0/0x1b4)
[   31.235909]  r7:00000000 r6:00000004 r5:00000000 r4:c0735428
[   31.241607] [<c010de94>] (handle_IPI) from [<c01014ec>] (gic_handle_irq+0x90/0x94)
[   31.249212]  r9:e8803100 r8:e8802100 r7:e745de78 r6:e880210c r5:c080277c r4:c080ed20
[   31.257008] [<c010145c>] (gic_handle_irq) from [<c010c694>] (__irq_svc+0x54/0x90)
[   31.264527] Exception stack(0xe745de78 to 0xe745dec0)
[   31.269600] de60:                                                       00000000 c05bfe50
[   31.277820] de80: 00000000 00000001 e6e49cfc 00000001 e6e49ce8 20000013 00000000 e7ad9eec
[   31.286039] dea0: e6e49c90 e745deec e745deb8 e745dec8 c030305c c01910b8 60000013 ffffffff
[   31.294255]  r9:e7ad9eec r8:00000000 r7:e745deac r6:ffffffff r5:60000013 r4:c01910b8
[   31.302057] [<c0191008>] (multi_cpu_stop) from [<c0191304>] (cpu_stopper_thread+0xa8/0x120)
[   31.310448]  r9:e7ad9eec r8:e745c000 r7:e6e49ce8 r6:c0191008 r5:e7ad9ee4 r4:e7ad9ee0
[   31.318245] [<c019125c>] (cpu_stopper_thread) from [<c013b500>] (smpboot_thread_fn+0x164/0x288)
[   31.326985]  r10:ffffe000 r9:c080a9bc r8:00000000 r7:00000001 r6:00000000 r5:e7418680
[   31.334866]  r4:e745c000
[   31.337412] [<c013b39c>] (smpboot_thread_fn) from [<c0138434>] (kthread+0xe4/0xfc)
[   31.345017]  r10:00000000 r9:00000000 r8:00000000 r7:c013b39c r6:e7418680 r5:e7418500
[   31.352898]  r4:00000000 r3:e7452080
[   31.356493] [<c0138350>] (kthread) from [<c0107c18>] (ret_from_fork+0x14/0x3c)
[   31.363749]  r7:00000000 r6:00000000 r5:c0138350 r4:e7418500
[   31.369447] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!


* Re: Linux panics when suspend cannot offline the secondary cores
  2016-06-10 15:41 ` Mason
@ 2016-06-10 21:35   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2016-06-10 21:35 UTC (permalink / raw)
  To: Mason; +Cc: linux-pm, Linux ARM, Russell King, Stephen Boyd, Sebastian Frias

On Friday, June 10, 2016 05:41:32 PM Mason wrote:
> Hello,
> 
> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
> unhappy when the suspend framework fails to offline secondary cores.
> 
> Is this expected/by design, or could it fail more gracefully?
> (It could also be something missing in my platform's code.)

This looks like a CPU offline bug to me which is more general than just
system suspend.


> # echo mem > /sys/power/state 
> [   30.722352] PM: Syncing filesystems ... done.
> [   30.727146] PM: Preparing system for sleep (mem)
> [   30.736927] Freezing user space processes ... (elapsed 0.001 seconds) done.
> [   30.745519] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> [   30.754098] PM: Suspending system (mem)
> [   30.760934] PM: suspend of devices complete after 2.104 msecs
> [   30.767638] PM: late suspend of devices complete after 0.883 msecs
> [   30.774529] PM: noirq suspend of devices complete after 0.653 msecs
> [   30.780846] Disabling non-boot CPUs ...
> [   30.795697] CPU1: shutdown
> [   30.795701] IN tango_cpu_die
> [   30.795709] CPU1: smp_ops.cpu_die() returned, trying to resuscitate
> [   30.795730] BUG: scheduling while atomic: swapper/1/0/0x00000002
> [   30.795735] Modules linked in:
> [   30.795756] Preemption disabled at:[<c04a5898>] schedule_preempt_disabled+0x20/0x24
> [   30.795757] 
> [   30.795766] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
> [   30.795768] Hardware name: Sigma Tango DT
> [   30.795773] Backtrace: 
> [   30.795790] [<c010b974>] (dump_backtrace) from [<c010bb70>] (show_stack+0x18/0x1c)
> [   30.795797]  r7:60000013 r6:c080eb04 r5:00000000 r4:c080eb04
> [   30.795811] [<c010bb58>] (show_stack) from [<c02eb084>] (dump_stack+0x80/0x94)
> [   30.795820] [<c02eb004>] (dump_stack) from [<c013cb34>] (__schedule_bug+0x6c/0xb8)
> [   30.795827]  r7:c0802638 r6:e745f6c0 r5:e7ae8ec0 r4:e7460000
> [   30.795833] [<c013cac8>] (__schedule_bug) from [<c04a522c>] (__schedule+0x434/0x530)
> [   30.795837]  r5:e7ae8ec0 r4:c0736ec0
> [   30.795842] [<c04a4df8>] (__schedule) from [<c04a5378>] (schedule+0x50/0xb0)
> [   30.795852]  r10:00000000 r9:c08024f8 r8:c05b8b6c r7:c081e2d6 r6:c05ce0b8 r5:c0802494
> [   30.795855]  r4:e7460000
> [   30.795861] [<c04a5328>] (schedule) from [<c04a5890>] (schedule_preempt_disabled+0x18/0x24)
> [   30.795865]  r5:c0802494 r4:e7460000
> [   30.795876] [<c04a5878>] (schedule_preempt_disabled) from [<c0155f0c>] (cpu_startup_entry+0x10c/0x18c)
> [   30.795884] [<c0155e00>] (cpu_startup_entry) from [<c010dc14>] (secondary_start_kernel+0x158/0x164)
> [   30.795888]  r7:c081e2d6 r4:c080b530
> [   30.795898] [<c010dabc>] (secondary_start_kernel) from [<c04a9208>] (_raw_spin_unlock_irqrestore+0x30/0x5c)
> [   30.795902]  r5:c0802494 r4:00000001
> [   30.952513] IN tango_cpu_kill
> [   30.955537] Unable to handle kernel NULL pointer dereference at virtual address 00000010
> [   30.963668] pgd = c0004000
> [   30.966382] [00000010] *pgd=00000000
> [   30.969976] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
> [   30.975312] Modules linked in:
> [   30.978379] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
> [   30.989478] Hardware name: Sigma Tango DT
> [   30.993503] task: e745f6c0 ti: e7460000 task.ti: e7460000
> [   30.998933] PC is at __tick_nohz_idle_enter+0x2d8/0x444
> [   31.004188] LR is at debug_smp_processor_id+0x20/0x24
> [   31.009262] pc : [<c0184d1c>]    lr : [<c030305c>]    psr: 60000093
> [   31.009262] sp : e7461f50  ip : e7461f20  fp : e7461fac
> [   31.020800] r10: 00000000  r9 : 00000000  r8 : 00000000
> [   31.026047] r7 : 00000000  r6 : 0032dcd5  r5 : 00000001  r4 : e7ae6e38
> [   31.032605] r3 : 00000000  r2 : 0032dcd5  r1 : 00000000  r0 : 0032dcd5
> [   31.039164] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [   31.046420] Control: 10c5387d  Table: 8000404a  DAC: 00000051
> [   31.052192] Process swapper/1 (pid: 0, stack limit = 0xe7460210)
> [   31.058226] Stack: (0xe7461f50 to 0xe7462000)
> [   31.062602] 1f40:                                     c04a4fcc c013c8b0 00000001 00000000
> [   31.070821] 1f60: 35293313 00000007 34faa6c3 00000007 34f6563e 00000007 34faa6c3 00000007
> [   31.079041] 1f80: ffffffff 7fffffff c0734e38 c0802494 c05ce0b8 c081e2d6 c05b8b6c c08024f8
> [   31.087261] 1fa0: e7461fc4 e7461fb0 c0185294 c0184a50 e7460000 c0802494 e7461fdc e7461fc8
> [   31.095480] 1fc0: c0155e58 c0185258 c080b530 c081e2d6 e7461ff4 e7461fe0 c010dc14 c0155e0c
> [   31.103700] 1fe0: 00000001 c0802494 00000000 e7461ff8 c04a9208 c010dac8 454115f5 56b2e41b
> [   31.111916] Backtrace: 
> [   31.114376] [<c0184a44>] (__tick_nohz_idle_enter) from [<c0185294>] (tick_nohz_idle_enter+0x48/0x80)
> [   31.123553]  r9:c08024f8 r8:c05b8b6c r7:c081e2d6 r6:c05ce0b8 r5:c0802494 r4:c0734e38
> [   31.131353] [<c018524c>] (tick_nohz_idle_enter) from [<c0155e58>] (cpu_startup_entry+0x58/0x18c)
> [   31.140181]  r5:c0802494 r4:e7460000
> [   31.143778] [<c0155e00>] (cpu_startup_entry) from [<c010dc14>] (secondary_start_kernel+0x158/0x164)
> [   31.152868]  r7:c081e2d6 r4:c080b530
> [   31.156464] [<c010dabc>] (secondary_start_kernel) from [<c04a9208>] (_raw_spin_unlock_irqrestore+0x30/0x5c)
> [   31.166253]  r5:c0802494 r4:00000001
> [   31.169848] Code: e89dabf0 e14b24d4 e1a00004 ebffff22 (e1c821d0) 
> [   31.175972] ---[ end trace 5e1e78cb2505c930 ]---
> [   31.180611] Kernel panic - not syncing: Attempted to kill the idle task!
> [   31.187346] CPU0: stopping
> [   31.190064] CPU: 0 PID: 10 Comm: migration/0 Tainted: G      D W       4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
> [   31.201426] Hardware name: Sigma Tango DT
> [   31.205449] Backtrace: 
> [   31.207911] [<c010b974>] (dump_backtrace) from [<c010bb70>] (show_stack+0x18/0x1c)
> [   31.215516]  r7:20000193 r6:c080eb04 r5:00000000 r4:c080eb04
> [   31.221218] [<c010bb58>] (show_stack) from [<c02eb084>] (dump_stack+0x80/0x94)
> [   31.228478] [<c02eb004>] (dump_stack) from [<c010e034>] (handle_IPI+0x1a0/0x1b4)
> [   31.235909]  r7:00000000 r6:00000004 r5:00000000 r4:c0735428
> [   31.241607] [<c010de94>] (handle_IPI) from [<c01014ec>] (gic_handle_irq+0x90/0x94)
> [   31.249212]  r9:e8803100 r8:e8802100 r7:e745de78 r6:e880210c r5:c080277c r4:c080ed20
> [   31.257008] [<c010145c>] (gic_handle_irq) from [<c010c694>] (__irq_svc+0x54/0x90)
> [   31.264527] Exception stack(0xe745de78 to 0xe745dec0)
> [   31.269600] de60:                                                       00000000 c05bfe50
> [   31.277820] de80: 00000000 00000001 e6e49cfc 00000001 e6e49ce8 20000013 00000000 e7ad9eec
> [   31.286039] dea0: e6e49c90 e745deec e745deb8 e745dec8 c030305c c01910b8 60000013 ffffffff
> [   31.294255]  r9:e7ad9eec r8:00000000 r7:e745deac r6:ffffffff r5:60000013 r4:c01910b8
> [   31.302057] [<c0191008>] (multi_cpu_stop) from [<c0191304>] (cpu_stopper_thread+0xa8/0x120)
> [   31.310448]  r9:e7ad9eec r8:e745c000 r7:e6e49ce8 r6:c0191008 r5:e7ad9ee4 r4:e7ad9ee0
> [   31.318245] [<c019125c>] (cpu_stopper_thread) from [<c013b500>] (smpboot_thread_fn+0x164/0x288)
> [   31.326985]  r10:ffffe000 r9:c080a9bc r8:00000000 r7:00000001 r6:00000000 r5:e7418680
> [   31.334866]  r4:e745c000
> [   31.337412] [<c013b39c>] (smpboot_thread_fn) from [<c0138434>] (kthread+0xe4/0xfc)
> [   31.345017]  r10:00000000 r9:00000000 r8:00000000 r7:c013b39c r6:e7418680 r5:e7418500
> [   31.352898]  r4:00000000 r3:e7452080
> [   31.356493] [<c0138350>] (kthread) from [<c0107c18>] (ret_from_fork+0x14/0x3c)
> [   31.363749]  r7:00000000 r6:00000000 r5:c0138350 r4:e7418500
> [   31.369447] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
> --



* Re: Linux panics when suspend cannot offline the secondary cores
  2016-06-10 21:35   ` Rafael J. Wysocki
@ 2016-06-10 21:37     ` Mason
  -1 siblings, 0 replies; 19+ messages in thread
From: Mason @ 2016-06-10 21:37 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, Linux ARM, Russell King, Stephen Boyd, Sebastian Frias

On 10/06/2016 23:35, Rafael J. Wysocki wrote:
              ^^^^^

Your clock is 5 minutes ahead ;-)

> On Friday, June 10, 2016 05:41:32 PM Mason wrote:
>
>> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
>> unhappy when the suspend framework fails to offline secondary cores.
>>
>> Is this expected/by design, or could it fail more gracefully?
>> (It could also be something missing in my platform's code.)
> 
> This looks like a CPU offline bug to me which is more general than just
> system suspend.

You may be right, I will try just off-lining cpu1.
Suspend may be a red herring.

By the way, I know my implementation of tango_cpu_die
is incorrect, I was testing the failure mode.

Regards.



* Re: Linux panics when suspend cannot offline the secondary cores
  2016-06-10 21:37     ` Mason
@ 2016-06-13 12:06       ` Mason
  -1 siblings, 0 replies; 19+ messages in thread
From: Mason @ 2016-06-13 12:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, Linux ARM, Russell King, Stephen Boyd, Sebastian Frias,
	Lorenzo Pieralisi, Will Deacon, Arnd Bergmann

On 10/06/2016 23:37, Mason wrote:

> On 10/06/2016 23:35, Rafael J. Wysocki wrote:
> 
>> On Friday, June 10, 2016 05:41:32 PM Mason wrote:
>>
>>> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
>>> unhappy when the suspend framework fails to offline secondary cores.
>>>
>>> Is this expected/by design, or could it fail more gracefully?
>>> (It could also be something missing in my platform's code.)
>>
>> This looks like a CPU offline bug to me which is more general than just
>> system suspend.
> 
> You may be right, I will try just off-lining cpu1.
> Suspend may be a red herring.
> 
> By the way, I know my implementation of tango_cpu_die
> is incorrect, I was testing the failure mode.

Hello Rafael,

Suspend was indeed a red herring. Manually requesting cpu1 off-lining
also makes Linux panic when cpu_die() unexpectedly returns.

The subject should perhaps have been:

  Linux panics when secondary core off-lining fails

Could it be made to fail more gracefully?
Or is this borkage inherent to the failed operation?
Or is it a bug in my platform code?
(A bug other than tango_cpu_die() failing to kill the core.)


#ifdef CONFIG_HOTPLUG_CPU
static int tango_cpu_kill(unsigned int cpu)
{
	printk("IN %s\n", __func__);
	return 1;
}

static void tango_cpu_die(unsigned int cpu)
{
	printk("IN %s\n", __func__);
}
#endif


Regards.


# echo 0 > /sys/devices/system/cpu/cpu1/online
[   60.619026] CPU1: shutdown
[   60.619031] IN tango_cpu_die
[   60.619041] CPU1: smp_ops.cpu_die() returned, trying to resuscitate
[   60.619063] BUG: scheduling while atomic: swapper/1/0/0x00000002
[   60.619069] Modules linked in:
[   60.619088] Preemption disabled at:[<c04a5898>] schedule_preempt_disabled+0x20/0x24
[   60.619089] 
[   60.619098] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   60.619099] Hardware name: Sigma Tango DT
[   60.619104] Backtrace: 
[   60.619121] [<c010b974>] (dump_backtrace) from [<c010bb70>] (show_stack+0x18/0x1c)
[   60.619129]  r7:60000013 r6:c080eb04 r5:00000000 r4:c080eb04
[   60.619141] [<c010bb58>] (show_stack) from [<c02eb084>] (dump_stack+0x80/0x94)
[   60.619150] [<c02eb004>] (dump_stack) from [<c013cb34>] (__schedule_bug+0x6c/0xb8)
[   60.619157]  r7:c0802638 r6:df45b6c0 r5:dfbeaec0 r4:df45c000
[   60.619162] [<c013cac8>] (__schedule_bug) from [<c04a522c>] (__schedule+0x434/0x530)
[   60.619167]  r5:dfbeaec0 r4:c0736ec0
[   60.619172] [<c04a4df8>] (__schedule) from [<c04a5378>] (schedule+0x50/0xb0)
[   60.619182]  r10:00000000 r9:c08024f8 r8:c05b8b6c r7:c081e2d6 r6:c05ce0b8 r5:c0802494
[   60.619184]  r4:df45c000
[   60.619190] [<c04a5328>] (schedule) from [<c04a5890>] (schedule_preempt_disabled+0x18/0x24)
[   60.619195]  r5:c0802494 r4:df45c000
[   60.619206] [<c04a5878>] (schedule_preempt_disabled) from [<c0155f0c>] (cpu_startup_entry+0x10c/0x18c)
[   60.619213] [<c0155e00>] (cpu_startup_entry) from [<c010dc14>] (secondary_start_kernel+0x158/0x164)
[   60.619218]  r7:c081e2d6 r4:c080b530
[   60.619226] [<c010dabc>] (secondary_start_kernel) from [<c04a9208>] (_raw_spin_unlock_irqrestore+0x30/0x5c)
[   60.619231]  r5:c0802494 r4:00000001
[   60.775838] IN tango_cpu_kill
[   60.779453] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[   60.787593] pgd = c0004000
[   60.790307] [00000010] *pgd=00000000
[   60.793901] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[   60.799324] Modules linked in:
[   60.802393] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   60.813493] Hardware name: Sigma Tango DT
[   60.817518] task: df45b6c0 ti: df45c000 task.ti: df45c000
[   60.822948] PC is at __tick_nohz_idle_enter+0x2d8/0x444
[   60.828204] LR is at debug_smp_processor_id+0x20/0x24
[   60.833278] pc : [<c0184d1c>]    lr : [<c030305c>]    psr: 60000093
[   60.833278] sp : df45df50  ip : df45df20  fp : df45dfac
[   60.844815] r10: 00000000  r9 : 00000000  r8 : 00000000
[   60.850063] r7 : 00000000  r6 : 0032dcd5  r5 : 00000001  r4 : dfbe8e38
[   60.856620] r3 : 00000000  r2 : 0032dcd5  r1 : 00000000  r0 : 0032dcd5
[   60.863179] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   60.870435] Control: 10c5387d  Table: 9ed8804a  DAC: 00000051
[   60.876206] Process swapper/1 (pid: 0, stack limit = 0xdf45c210)
[   60.882240] Stack: (0xdf45df50 to 0xdf45e000)
[   60.886616] df40:                                     c04a4fcc c013c8b0 00000001 00000000
[   60.894836] df60: 26c51b42 0000000e 269f8229 0000000e 26923e6d 0000000e 269f8229 0000000e
[   60.903057] df80: ffffffff 7fffffff c0734e38 c0802494 c05ce0b8 c081e2d6 c05b8b6c c08024f8
[   60.911276] dfa0: df45dfc4 df45dfb0 c0185294 c0184a50 df45c000 c0802494 df45dfdc df45dfc8
[   60.919495] dfc0: c0155e58 c0185258 c080b530 c081e2d6 df45dff4 df45dfe0 c010dc14 c0155e0c
[   60.927716] dfe0: 00000001 c0802494 00000000 df45dff8 c04a9208 c010dac8 c1640288 22a54aa8
[   60.935932] Backtrace: 
[   60.938391] [<c0184a44>] (__tick_nohz_idle_enter) from [<c0185294>] (tick_nohz_idle_enter+0x48/0x80)
[   60.947569]  r9:c08024f8 r8:c05b8b6c r7:c081e2d6 r6:c05ce0b8 r5:c0802494 r4:c0734e38
[   60.955370] [<c018524c>] (tick_nohz_idle_enter) from [<c0155e58>] (cpu_startup_entry+0x58/0x18c)
[   60.964198]  r5:c0802494 r4:df45c000
[   60.967796] [<c0155e00>] (cpu_startup_entry) from [<c010dc14>] (secondary_start_kernel+0x158/0x164)
[   60.976885]  r7:c081e2d6 r4:c080b530
[   60.980485] [<c010dabc>] (secondary_start_kernel) from [<c04a9208>] (_raw_spin_unlock_irqrestore+0x30/0x5c)
[   60.990273]  r5:c0802494 r4:00000001
[   60.993867] Code: e89dabf0 e14b24d4 e1a00004 ebffff22 (e1c821d0) 
[   60.999991] ---[ end trace b2639488439a8390 ]---
[   61.004631] Kernel panic - not syncing: Attempted to kill the idle task!
[   61.011368] CPU0: stopping
[   61.014087] CPU: 0 PID: 10 Comm: migration/0 Tainted: G      D W       4.7.0-rc1-next-20160530-00002-g6c94ca0b0db1-dirty #117
[   61.025448] Hardware name: Sigma Tango DT
[   61.029471] Backtrace: 
[   61.031936] [<c010b974>] (dump_backtrace) from [<c010bb70>] (show_stack+0x18/0x1c)
[   61.039542]  r7:20000193 r6:c080eb04 r5:00000000 r4:c080eb04
[   61.045246] [<c010bb58>] (show_stack) from [<c02eb084>] (dump_stack+0x80/0x94)
[   61.052507] [<c02eb004>] (dump_stack) from [<c010e034>] (handle_IPI+0x1a0/0x1b4)
[   61.059936]  r7:00000000 r6:00000004 r5:00000000 r4:c0735428
[   61.065635] [<c010de94>] (handle_IPI) from [<c01014ec>] (gic_handle_irq+0x90/0x94)
[   61.073240]  r9:e0803100 r8:e0802100 r7:df459e78 r6:e080210c r5:c080277c r4:c080ed20
[   61.081038] [<c010145c>] (gic_handle_irq) from [<c010c694>] (__irq_svc+0x54/0x90)
[   61.088556] Exception stack(0xdf459e78 to 0xdf459ec0)
[   61.093629] 9e60:                                                       00000000 c05bfe50
[   61.101849] 9e80: 00000000 00000001 dee37d54 00000001 dee37d40 20000013 00000000 dfbdbeec
[   61.110069] 9ea0: dee37ce8 df459eec df459eb8 df459ec8 c030305c c01910b8 60000013 ffffffff
[   61.118285]  r9:dfbdbeec r8:00000000 r7:df459eac r6:ffffffff r5:60000013 r4:c01910b8
[   61.126086] [<c0191008>] (multi_cpu_stop) from [<c0191304>] (cpu_stopper_thread+0xa8/0x120)
[   61.134477]  r9:dfbdbeec r8:df458000 r7:dee37d40 r6:c0191008 r5:dfbdbee4 r4:dfbdbee0
[   61.142274] [<c019125c>] (cpu_stopper_thread) from [<c013b500>] (smpboot_thread_fn+0x164/0x288)
[   61.151014]  r10:ffffe000 r9:c080a9bc r8:00000000 r7:00000001 r6:00000000 r5:df41a680
[   61.158894]  r4:df458000
[   61.161440] [<c013b39c>] (smpboot_thread_fn) from [<c0138434>] (kthread+0xe4/0xfc)
[   61.169045]  r10:00000000 r9:00000000 r8:00000000 r7:c013b39c r6:df41a680 r5:df41a500
[   61.176927]  r4:00000000 r3:df44e080
[   61.180523] [<c0138350>] (kthread) from [<c0107c18>] (ret_from_fork+0x14/0x3c)
[   61.187778]  r7:00000000 r6:00000000 r5:c0138350 r4:df41a500
[   61.193475] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!



* Re: Linux panics when suspend cannot offline the secondary cores
  2016-06-13 12:06       ` Mason
@ 2016-06-13 13:30         ` Rafael J. Wysocki
  -1 siblings, 0 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2016-06-13 13:30 UTC (permalink / raw)
  To: Mason
  Cc: linux-pm, Linux ARM, Russell King, Stephen Boyd, Sebastian Frias,
	Lorenzo Pieralisi, Will Deacon, Arnd Bergmann

On Monday, June 13, 2016 02:06:14 PM Mason wrote:
> On 10/06/2016 23:37, Mason wrote:
> 
> > On 10/06/2016 23:35, Rafael J. Wysocki wrote:
> > 
> >> On Friday, June 10, 2016 05:41:32 PM Mason wrote:
> >>
> >>> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
> >>> unhappy when the suspend framework fails to offline secondary cores.
> >>>
> >>> Is this expected/by design, or could it fail more gracefully?
> >>> (It could also be something missing in my platform's code.)
> >>
> >> This looks like a CPU offline bug to me which is more general than just
> >> system suspend.
> > 
> > You may be right, I will try just off-lining cpu1.
> > Suspend may be a red herring.
> > 
> > By the way, I know my implementation of tango_cpu_die
> > is incorrect, I was testing the failure mode.
> 
> Hello Rafael,
> 
> Suspend was indeed a red herring. Manually requesting cpu1 off-lining
> also makes Linux panic when cpu_die() unexpectedly returns.
> 
> The subject should perhaps have been:
> 
>   Linux panics when secondary core off-lining fails
> 
> Could it be made to fail more gracefully?
> Or is this borkage inherent to the failed operation?
> Or is it a bug in my platform code?
> (A bug other than tango_cpu_die() failing to kill the core.)

Well, smp_ops.cpu_die() is not expected to return AFAICS, so that may be
the reason why it fails for you the way it does.

Thanks,
Rafael



* Re: Linux panics when suspend cannot offline the secondary cores
  2016-06-13 13:30         ` Rafael J. Wysocki
@ 2016-06-13 13:50           ` Mason
  -1 siblings, 0 replies; 19+ messages in thread
From: Mason @ 2016-06-13 13:50 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, Linux ARM, Russell King, Stephen Boyd, Sebastian Frias,
	Lorenzo Pieralisi, Will Deacon, Arnd Bergmann

On 13/06/2016 15:30, Rafael J. Wysocki wrote:

> On Monday, June 13, 2016 02:06:14 PM Mason wrote:
>
>> On 10/06/2016 23:37, Mason wrote:
>>
>>> On 10/06/2016 23:35, Rafael J. Wysocki wrote:
>>>
>>>> On Friday, June 10, 2016 05:41:32 PM Mason wrote:
>>>>
>>>>> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
>>>>> unhappy when the suspend framework fails to offline secondary cores.
>>>>>
>>>>> Is this expected/by design, or could it fail more gracefully?
>>>>> (It could also be something missing in my platform's code.)
>>>>
>>>> This looks like a CPU offline bug to me which is more general than just
>>>> system suspend.
>>>
>>> You may be right, I will try just off-lining cpu1.
>>> Suspend may be a red herring.
>>>
>>> By the way, I know my implementation of tango_cpu_die
>>> is incorrect, I was testing the failure mode.
>>
>> Hello Rafael,
>>
>> Suspend was indeed a red herring. Manually requesting cpu1 off-lining
>> also makes Linux panic when cpu_die() unexpectedly returns.
>>
>> The subject should perhaps have been:
>>
>>   Linux panics when secondary core off-lining fails
>>
>> Could it be made to fail more gracefully?
>> Or is this borkage inherent to the failed operation?
>> Or is it a bug in my platform code?
>> (A bug other than tango_cpu_die() failing to kill the core.)
> 
> Well, smp_ops.cpu_die() is not expected to return AFAICS, so that may be
> the reason why it fails for you the way it does.

I am aware that smp_ops.cpu_die() is not expected to return.
(I was wondering if the framework could handle it gracefully.)

The actual implementation for cpu_die() asks the firmware to off-line
the current core. If the operation fails, for whatever reason, firmware
is not supposed to return control to Linux?

Is panic the only safe thing to do in Linux:
(If yes, then why doesn't the framework panic immediately?)

static void tango_cpu_die(unsigned int cpu)
{
	ask_firmware_to_offline(cpu);
	/* if we return here, something went wrong */
	panic("firmware could not offline");
}

Regards.



* Re: Linux panics when suspend cannot offline the secondary cores
  2016-06-13 13:50           ` Mason
@ 2016-06-13 20:49             ` Rafael J. Wysocki
  -1 siblings, 0 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2016-06-13 20:49 UTC (permalink / raw)
  To: Mason
  Cc: linux-pm, Linux ARM, Russell King, Stephen Boyd, Sebastian Frias,
	Lorenzo Pieralisi, Will Deacon, Arnd Bergmann

On Monday, June 13, 2016 03:50:56 PM Mason wrote:
> On 13/06/2016 15:30, Rafael J. Wysocki wrote:
> 
> > On Monday, June 13, 2016 02:06:14 PM Mason wrote:
> >
> >> On 10/06/2016 23:37, Mason wrote:
> >>
> >>> On 10/06/2016 23:35, Rafael J. Wysocki wrote:
> >>>
> >>>> On Friday, June 10, 2016 05:41:32 PM Mason wrote:
> >>>>
> >>>>> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
> >>>>> unhappy when the suspend framework fails to offline secondary cores.
> >>>>>
> >>>>> Is this expected/by design, or could it fail more gracefully?
> >>>>> (It could also be something missing in my platform's code.)
> >>>>
> >>>> This looks like a CPU offline bug to me which is more general than just
> >>>> system suspend.
> >>>
> >>> You may be right, I will try just off-lining cpu1.
> >>> Suspend may be a red herring.
> >>>
> >>> By the way, I know my implementation of tango_cpu_die
> >>> is incorrect, I was testing the failure mode.
> >>
> >> Hello Rafael,
> >>
> >> Suspend was indeed a red herring. Manually requesting cpu1 off-lining
> >> also makes Linux panic when cpu_die() unexpectedly returns.
> >>
> >> The subject should perhaps have been:
> >>
> >>   Linux panics when secondary core off-lining fails
> >>
> >> Could it be made to fail more gracefully?
> >> Or is this borkage inherent to the failed operation?
> >> Or is it a bug in my platform code?
> >> (A bug other than tango_cpu_die() failing to kill the core.)
> > 
> > Well, smp_ops.cpu_die() is not expected to return AFAICS, so that may be
> > the reason why it fails for you the way it does.
> 
> I am aware that smp_ops.cpu_die() is not expected to return.
> (I was wondering if the framework could handle it gracefully.)
> 
> The actual implementation for cpu_die() asks the firmware to off-line
> the current core. If the operation fails, for whatever reason, firmware
> is not supposed to return control to Linux?

Firmware can do what it wants (although ideally it should just do what it is
asked for).  smp_ops.cpu_die() is not supposed to return to its caller anyway.

> Is panic the only safe thing to do in Linux:
> (If yes, then why doesn't the framework panic immediately?)

I guess all of the existing implementations of smp_ops.cpu_die() don't return
to the caller no matter what, so the caller did not have to consider anything
else.

And quite frankly I don't see why it would have to.  smp_ops.cpu_die() simply
needs to be implemented to never return.
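
As a rough illustration of the "never return" contract, here is a userspace model, not kernel code. `ask_firmware_to_offline()` is the hypothetical firmware call used earlier in this thread, modelled here as always failing. `park_cpu()`, `model_cpu_die()` and the `cpu_die_returned()` harness are names invented for this sketch; the `setjmp`/`longjmp` escape exists only so the endless loop can be observed from a test.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical stand-in for the firmware request. In this model it always
 * fails, meaning it hands control back instead of powering the core off. */
static unsigned int firmware_calls;

static void ask_firmware_to_offline(unsigned int cpu)
{
	(void)cpu;
	firmware_calls++;
}

/* Harness state: lets a test escape the otherwise endless park loop. */
static jmp_buf park_escape;
static unsigned long park_limit;
static unsigned long park_count;

/* Hypothetical stand-in for a low-power wait (WFI on real hardware). */
static void park_cpu(void)
{
	park_count++;
	if (park_limit && park_count >= park_limit)
		longjmp(park_escape, 1);
}

/* Model of a cpu_die() callback that keeps the "never returns" contract:
 * if firmware hands control back, the core is parked instead of returning. */
static void model_cpu_die(unsigned int cpu)
{
	ask_firmware_to_offline(cpu);
	for (;;)
		park_cpu();
}

/* Harness: returns 1 if model_cpu_die() came back to its caller, 0 if the
 * core was still parked after `limit` iterations. */
static int cpu_die_returned(unsigned int cpu, unsigned long limit)
{
	assert(limit > 0);
	park_limit = limit;
	park_count = 0;
	if (setjmp(park_escape))
		return 0;
	model_cpu_die(cpu);
	return 1;	/* not reached while the contract holds */
}
```

Whether a real callback should park the core like this or call panic() as proposed earlier is a policy choice; the only point made in this thread is that control must not go back to the caller.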

Thanks,
Rafael


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Linux panics when suspend cannot offline the secondary cores
@ 2016-06-13 20:49             ` Rafael J. Wysocki
  0 siblings, 0 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2016-06-13 20:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Monday, June 13, 2016 03:50:56 PM Mason wrote:
> On 13/06/2016 15:30, Rafael J. Wysocki wrote:
> 
> > On Monday, June 13, 2016 02:06:14 PM Mason wrote:
> >
> >> On 10/06/2016 23:37, Mason wrote:
> >>
> >>> On 10/06/2016 23:35, Rafael J. Wysocki wrote:
> >>>
> >>>> On Friday, June 10, 2016 05:41:32 PM Mason wrote:
> >>>>
> >>>>> I'm playing with S3 Suspend-to-RAM, and I noticed that Linux is really
> >>>>> unhappy when the suspend framework fails to offline secondary cores.
> >>>>>
> >>>>> Is this expected/by design, or could it fail more gracefully?
> >>>>> (It could also be something missing in my platform's code.)
> >>>>
> >>>> This looks like a CPU offline bug to me which is more general than just
> >>>> system suspend.
> >>>
> >>> You may be right, I will try just off-lining cpu1.
> >>> Suspend may be a red herring.
> >>>
> >>> By the way, I know my implementation of tango_cpu_die
> >>> is incorrect, I was testing the failure mode.
> >>
> >> Hello Rafael,
> >>
> >> Suspend was indeed a red herring. Manually requesting cpu1 off-lining
> >> also makes Linux panic when cpu_die() unexpectedly returns.
> >>
> >> The subject should perhaps have been:
> >>
> >>   Linux panics when secondary core off-lining fails
> >>
> >> Could it be made to fail more gracefully?
> >> Or is this borkage inherent to the failed operation?
> >> Or is it a bug in my platform code?
> >> (A bug other than tango_cpu_die() failing to kill the core.)
> > 
> > Well, smp_ops.cpu_die() is not expected to return AFAICS, so that may be
> > the reason why it fails for you the way it does.
> 
> I am aware that smp_ops.cpu_die() is not expected to return.
> (I was wondering if the framework could handle it gracefully.)
> 
> The actual implementation for cpu_die() asks the firmware to off-line
> the current core. If the operation fails, for whatever reason, firmware
> is not supposed to return control to Linux?

Firmware can do what it wants (although ideally it should just do what it is
asked for).  smp_ops.cpu_die() is not supposed to return to its caller anyway.

> Is panic the only safe thing to do in Linux?
> (If yes, then why doesn't the framework panic immediately?)

I guess all of the existing implementations of smp_ops.cpu_die() don't return
to the caller no matter what, so the caller did not have to consider anything
else.

And quite frankly I don't see why it would have to.  smp_ops.cpu_die() simply
needs to be implemented to never return.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Linux panics when suspend cannot offline the secondary cores
  2016-06-13 20:49             ` Rafael J. Wysocki
@ 2016-06-13 21:02               ` Russell King - ARM Linux
  -1 siblings, 0 replies; 19+ messages in thread
From: Russell King - ARM Linux @ 2016-06-13 21:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Sebastian Frias, Lorenzo Pieralisi, Arnd Bergmann, linux-pm,
	Will Deacon, Mason, Stephen Boyd, Linux ARM

On Mon, Jun 13, 2016 at 10:49:32PM +0200, Rafael J. Wysocki wrote:
> I guess all of the existing implementations of smp_ops.cpu_die() don't return
> to the caller no matter what, so the caller did not have to consider anything
> else.

Existing implementations for hardware which implements CPU hotplug
take the requested CPU down in such a way that smp_ops.cpu_die()
*never* returns.

We have a number of evaluation boards where it's desirable to emulate
CPU hotplug.  These boards have no power management abilities, and
have no way to power down or reset a CPU from software.  For these,
we implement CPU hotplug by taking the CPU down gracefully, taking
it out of coherency, and then placing it in a loop waiting for the
CPU up event to arrive.  At that point (and this is the only legal
time) smp_ops.cpu_die() returns - at which point you get the
resuscitating kernel message, and the CPU re-enters the kernel.

This path is _only_ for these evaluation platforms which have no
hardware support for CPU hotplug, and therefore no PM and no kexec.
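
For illustration only, the emulated path described above can be modelled in
user space roughly as follows.  This is a sketch, not kernel code:
wake_request, set_wake_script(), wait_for_wfi() and emulated_cpu_die() are
hypothetical names, and the WFI instruction is replaced by a stub that
replays a scripted list of wake-ups.

```c
#include <stddef.h>

/* Hypothetical mailbox: id of the CPU being asked to come up, or -1. */
static volatile int wake_request = -1;

/* Scripted wake-ups standing in for real events: each entry is the value
 * wake_request holds when the parked core next wakes up. */
static const int *wake_script;
static size_t wake_script_len;
static size_t wake_script_pos;

static void set_wake_script(const int *script, size_t len)
{
	wake_script = script;
	wake_script_len = len;
	wake_script_pos = 0;
	wake_request = -1;
}

/* Stand-in for "wfi": returns once the next scripted event has arrived. */
static void wait_for_wfi(void)
{
	if (wake_script_pos < wake_script_len)
		wake_request = wake_script[wake_script_pos++];
}

/*
 * Model of the emulated cpu_die(): loop until the CPU-up request for
 * this core arrives, and only then return.  Like the real loop, it never
 * returns if no such request ever arrives.  The return value (number of
 * wake-ups that were not for this core) exists only for the model.
 */
static int emulated_cpu_die(int cpu)
{
	int spurious = 0;

	for (;;) {
		wait_for_wfi();
		if (wake_request == cpu)
			return spurious;
		spurious++;
	}
}
```

The point of the model is that returning is tied to the CPU-up request and
to nothing else; any other wake-up sends the core back to waiting.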

The *only* way to have working PM support on Mason's platform is
to implement CPU hotplug properly - which means ensuring
that the CPU is either powered down or placed in reset during the
smp_ops.cpu_die() call.  Everything else (even the simulation of it)
is not good enough.

That can be done either by the dying CPU when it calls into
smp_ops.cpu_die(), or the CPU requesting the death of the CPU via
smp_ops.cpu_kill().

Either way, it's up to the platform code to implement these, and as
I say, a correct and proper implementation of this is a fundamental
requirement for system power management (like suspend) and kexec in
a SMP system.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Linux panics when suspend cannot offline the secondary cores
  2016-06-13 21:02               ` Russell King - ARM Linux
@ 2016-06-14 12:42                 ` Mason
  -1 siblings, 0 replies; 19+ messages in thread
From: Mason @ 2016-06-14 12:42 UTC (permalink / raw)
  To: Russell King - ARM Linux, Rafael J. Wysocki
  Cc: linux-pm, Linux ARM, Stephen Boyd, Sebastian Frias,
	Lorenzo Pieralisi, Will Deacon, Arnd Bergmann

On 13/06/2016 23:02, Russell King - ARM Linux wrote:

> On Mon, Jun 13, 2016 at 10:49:32PM +0200, Rafael J. Wysocki wrote:
>
>> I guess all of the existing implementations of smp_ops.cpu_die() don't return
>> to the caller no matter what, so the caller did not have to consider anything
>> else.
> 
> Existing implementations for hardware which implements CPU hotplug
> take the requested CPU down in such a way that smp_ops.cpu_die()
> *never* returns.
> 
> We have a number of evaluation boards where it's desirable to emulate
> CPU hotplug.  These boards have no power management abilities, and
> have no way to power down or reset a CPU from software.  For these,
> we implement CPU hotplug by taking the CPU down gracefully, taking
> it out of coherency, and then placing it in a loop waiting for the
> CPU up event to arrive.  At that point (and this is the only legal
> time) smp_ops.cpu_die() returns - at which point you get the
> resuscitating kernel message, and the CPU re-enters the kernel.
> 
> This path is _only_ for these evaluation platforms which have no
> hardware support for CPU hotplug, and therefore no PM and no kexec.
> 
> The *only* way to have working PM support on Mason's platform is
> to implement CPU hotplug properly - which means ensuring
> that the CPU is either powered down or placed in reset during the
> smp_ops.cpu_die() call.  Everything else (even the simulation of it)
> is not good enough.
> 
> That can be done either by the dying CPU when it calls into
> smp_ops.cpu_die(), or the CPU requesting the death of the CPU via
> smp_ops.cpu_kill().
> 
> Either way, it's up to the platform code to implement these, and as
> I say, a correct and proper implementation of this is a fundamental
> requirement for system power management (like suspend) and kexec in
> a SMP system.

Hello Russell,

The current plan is to have cpu_die() jump into the firmware, and have
the firmware "park" the calling core into a WFI loop until someone wants
to online the parked core, via the smp_boot_secondary() callback.

Would that work?

So far, I haven't cared about what HOTPLUG does with the parked core,
because we would just provide HOTPLUG as a requirement for suspend,
which offlines the secondary cores, and then we will power down the
entire SoC.

On a tangential subject, is the scheduler able to off-line idle cores?

Regards.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Rebooting Cortex A9 MPCore (was: Linux panics when suspend cannot offline the secondary cores)
  2016-06-14 12:42                 ` Mason
  (?)
@ 2016-06-15 11:48                 ` Mason
  -1 siblings, 0 replies; 19+ messages in thread
From: Mason @ 2016-06-15 11:48 UTC (permalink / raw)
  To: Russell King - ARM Linux, Rafael J. Wysocki
  Cc: linux-pm, Linux ARM, Stephen Boyd, Sebastian Frias,
	Lorenzo Pieralisi, Will Deacon, Mark Rutland, Arnd Bergmann,
	Thibaud Cornic

On 14/06/2016 14:42, Mason wrote:

> On 13/06/2016 23:02, Russell King - ARM Linux wrote:
> 
>> On Mon, Jun 13, 2016 at 10:49:32PM +0200, Rafael J. Wysocki wrote:
>>
>>> I guess all of the existing implementations of smp_ops.cpu_die() don't return
>>> to the caller no matter what, so the caller did not have to consider anything
>>> else.
>>
>> Existing implementations for hardware which implements CPU hotplug
>> take the requested CPU down in such a way that smp_ops.cpu_die()
>> *never* returns.
>>
>> We have a number of evaluation boards where it's desirable to emulate
>> CPU hotplug.  These boards have no power management abilities, and
>> have no way to power down or reset a CPU from software.  For these,
>> we implement CPU hotplug by taking the CPU down gracefully, taking
>> it out of coherency, and then placing it in a loop waiting for the
>> CPU up event to arrive.  At that point (and this is the only legal
>> time) smp_ops.cpu_die() returns - at which point you get the
>> resuscitating kernel message, and the CPU re-enters the kernel.
>>
>> This path is _only_ for these evaluation platforms which have no
>> hardware support for CPU hotplug, and therefore no PM and no kexec.
>>
>> The *only* way to have working PM support on Mason's platform is
>> to implement CPU hotplug properly - which means ensuring
>> that the CPU is either powered down or placed in reset during the
>> smp_ops.cpu_die() call.  Everything else (even the simulation of it)
>> is not good enough.
>>
>> That can be done either by the dying CPU when it calls into
>> smp_ops.cpu_die(), or the CPU requesting the death of the CPU via
>> smp_ops.cpu_kill().
>>
>> Either way, it's up to the platform code to implement these, and as
>> I say, a correct and proper implementation of this is a fundamental
>> requirement for system power management (like suspend) and kexec in
>> a SMP system.
> 
> Hello Russell,
> 
> The current plan is to have cpu_die() jump into the firmware, and have
> the firmware "park" the calling core into a WFI loop until someone wants
> to online the parked core, via the smp_boot_secondary() callback.

Link to the whole discussion:
http://thread.gmane.org/gmane.linux.power-management.general/77268

Change of plans, because of MMU issues.

cpu_die:
  secondary core jumps from Linux into the firmware
  firmware prepares the core to be reset(*)
  core spins in a busy loop => never returns

cpu_kill:
  main core jumps from Linux into the firmware
  firmware resets secondary core, and puts it in a WFE/WFI loop
      (until smp_boot_secondary() is called from Linux)

Our preliminary implementation passes basic stress tests.
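
The hand-off between the two callbacks above can be written down as a small
user-space state model.  This is only a sketch under assumptions: the
model_*() functions and the core_state enum are hypothetical names, the
firmware calls are reduced to state changes, and the return conventions
(nonzero from cpu_kill meaning success, 0 from boot_secondary meaning
success) reflect one reading of the ARM SMP code rather than anything
stated in this thread.

```c
enum core_state {
	CORE_ONLINE,	/* running Linux */
	CORE_SPINNING,	/* in firmware after cpu_die, busy loop, awaiting reset */
	CORE_PARKED,	/* reset by firmware, sitting in a WFE/WFI loop */
};

#define NR_CORES 2

static enum core_state core[NR_CORES] = { CORE_ONLINE, CORE_ONLINE };

/*
 * cpu_die: runs on the dying core.  On real hardware this never returns;
 * the model just records that the core is now spinning in firmware.
 */
static void model_cpu_die(unsigned int cpu)
{
	if (cpu < NR_CORES)
		core[cpu] = CORE_SPINNING;
}

/*
 * cpu_kill: runs on the main core.  Firmware resets the secondary and
 * parks it.  Only legal once the secondary has gone through cpu_die.
 * Returns 1 on success, 0 on failure.
 */
static int model_cpu_kill(unsigned int cpu)
{
	if (cpu >= NR_CORES || core[cpu] != CORE_SPINNING)
		return 0;
	core[cpu] = CORE_PARKED;
	return 1;
}

/* boot_secondary: releases a parked core.  Returns 0 on success, -1 otherwise. */
static int model_boot_secondary(unsigned int cpu)
{
	if (cpu >= NR_CORES || core[cpu] != CORE_PARKED)
		return -1;
	core[cpu] = CORE_ONLINE;
	return 0;
}
```

The model makes the ordering constraint explicit: the reset in cpu_kill
only happens after the secondary has entered its busy loop, and a core can
only be brought back online from the parked state.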

The starred step is a bit unclear to me...
What steps are required to prepare a Cortex A9 MPCore to safely reboot?

I briefly discussed the topic with mrutland on IRC:

> Typically the sequence is:
> 1) prevent allocation (i.e. disable translation and caching in all modes)
> 2) clean+invalidate local caches
> 3) exit coherency somehow

Point 1 was clarified thus:

> Typically, you need to prevent allocation into data or unified caches,
> and that may involve disabling data and instruction cacheability
> (since instruction lookups may allocate in unified cache)

Does someone know if step 1 is required on Cortex A9 MPCore,
and how to achieve it?

Is point 3 achieved by clearing bit 6 in ACTLR? (ACTLR.SMP)
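
To make the three steps concrete, here is a user-space sketch of the
register manipulation they would imply.  It does not answer the questions
above; it only shows what "clear SCTLR.C, clean+invalidate, clear bit 6 of
ACTLR" looks like as an ordered sequence.  struct core_regs and the helper
names are hypothetical, the cache maintenance step is a placeholder flag,
and step 1 is reduced to clearing SCTLR.C, which is an assumption (the
clarification above suggests more may be needed).  Mainline's
v7_exit_coherency_flush() helper in arch/arm/include/asm/cacheflush.h
appears to follow a similar order and may be worth comparing against.

```c
#include <stdint.h>

#define SCTLR_C		(UINT32_C(1) << 2)	/* SCTLR.C: data/unified cache enable */
#define ACTLR_SMP	(UINT32_C(1) << 6)	/* Cortex-A9 ACTLR.SMP: coherency participation */

/*
 * Simulated per-core state.  On hardware, SCTLR is accessed with
 * "mrc/mcr p15, 0, <Rt>, c1, c0, 0" and ACTLR with
 * "mrc/mcr p15, 0, <Rt>, c1, c0, 1".
 */
struct core_regs {
	uint32_t sctlr;
	uint32_t actlr;
	int dcache_clean;	/* placeholder for "local caches cleaned+invalidated" */
};

/* Step 1 (assumed minimal form): stop new allocations into the data cache. */
static void prevent_allocation(struct core_regs *r)
{
	r->sctlr &= ~SCTLR_C;
}

/* Step 2: placeholder for the clean+invalidate of the local caches. */
static void clean_invalidate_local(struct core_regs *r)
{
	r->dcache_clean = 1;
}

/* Step 3: leave the coherency domain by clearing ACTLR bit 6. */
static void exit_coherency(struct core_regs *r)
{
	r->actlr &= ~ACTLR_SMP;
}

/* The three steps in the order quoted above. */
static void prepare_for_reset(struct core_regs *r)
{
	prevent_allocation(r);
	clean_invalidate_local(r);
	exit_coherency(r);
}
```

On real hardware each register write would also need the appropriate
barriers (ISB/DSB), which this model leaves out.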

The MPCore TRM mentions "SCU CPU Power Status Register"
which speaks of modes (normal, dormant, powered-off).
Are these relevant for taking the core offline?

Regards.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-06-15 11:48 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-10 15:41 Linux panics when suspend cannot offline the secondary cores Mason
2016-06-10 15:41 ` Mason
2016-06-10 21:35 ` Rafael J. Wysocki
2016-06-10 21:35   ` Rafael J. Wysocki
2016-06-10 21:37   ` Mason
2016-06-10 21:37     ` Mason
2016-06-13 12:06     ` Mason
2016-06-13 12:06       ` Mason
2016-06-13 13:30       ` Rafael J. Wysocki
2016-06-13 13:30         ` Rafael J. Wysocki
2016-06-13 13:50         ` Mason
2016-06-13 13:50           ` Mason
2016-06-13 20:49           ` Rafael J. Wysocki
2016-06-13 20:49             ` Rafael J. Wysocki
2016-06-13 21:02             ` Russell King - ARM Linux
2016-06-13 21:02               ` Russell King - ARM Linux
2016-06-14 12:42               ` Mason
2016-06-14 12:42                 ` Mason
2016-06-15 11:48                 ` Rebooting Cortex A9 MPCore (was: Linux panics when suspend cannot offline the secondary cores) Mason