linux-pm.vger.kernel.org archive mirror
* (bisected) Lock up on sh73a0/kzm9g on cpuidle initialization
@ 2014-11-06 20:38 Geert Uytterhoeven
  2014-11-06 21:02 ` Daniel Lezcano
  0 siblings, 1 reply; 7+ messages in thread
From: Geert Uytterhoeven @ 2014-11-06 20:38 UTC
  To: Daniel Lezcano, Ingo Molnar, Nicolas Pitre
  Cc: Paul McKenney, Jiri Kosina, Rafael J. Wysocki, Linux PM list,
	Linux-sh list, linux-arm-kernel, linux-kernel

When CONFIG_CPU_IDLE=y, the kernel locks up during cpuidle initialization
on Renesas sh73a0/kzm9g-reference, which has a dual-core Cortex-A9.

Last message is:

    DMA: preallocated 256 KiB pool for atomic coherent allocations

After this it's supposed to print:

    cpuidle: using governor ladder
    cpuidle: using governor menu

I've bisected this to commit 442bf3aaf55a91ebfec71da46a4ee10a3c905bcc
("sched: Let the scheduler see CPU idle states").

Reverting that commit, and commit 83a0a96a5f26d974580fd7251043ff70c8f1823d
("sched/fair: Leverage the idle state info when choosing the "idlest" cpu"),
which depends on it, fixes the problem.

I saw the discussion "lockdep splat in CPU hotplug", so I enabled lockdep
debugging, but didn't see a lockdep splat.

I'm using CONFIG_TREE_RCU=y, as this is SMP without PREEMPT.

Anyone with a clue?

Thanks!

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: (bisected) Lock up on sh73a0/kzm9g on cpuidle initialization
  2014-11-06 20:38 (bisected) Lock up on sh73a0/kzm9g on cpuidle initialization Geert Uytterhoeven
@ 2014-11-06 21:02 ` Daniel Lezcano
  2014-11-07  7:59   ` Geert Uytterhoeven
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Lezcano @ 2014-11-06 21:02 UTC
  To: Geert Uytterhoeven, Ingo Molnar, Nicolas Pitre
  Cc: Paul McKenney, Jiri Kosina, Rafael J. Wysocki, Linux PM list,
	Linux-sh list, linux-arm-kernel, linux-kernel

On 11/06/2014 09:38 PM, Geert Uytterhoeven wrote:
> When CONFIG_CPU_IDLE=y, the kernel locks up during cpuidle initialization
> on Renesas sh73a0/kzm9g-reference, which has a dual-core Cortex-A9.
>
> Last message is:
>
>      DMA: preallocated 256 KiB pool for atomic coherent allocations
>
> After this it's supposed to print:
>
>      cpuidle: using governor ladder
>      cpuidle: using governor menu
>
> I've bisected this to commit 442bf3aaf55a91ebfec71da46a4ee10a3c905bcc
> ("sched: Let the scheduler see CPU idle states").
>
> Reverting that commit, and commit 83a0a96a5f26d974580fd7251043ff70c8f1823d
> ("sched/fair: Leverage the idle state info when choosing the "idlest"
> cpu") which
> depends on it, fixes the problem.
>
> I saw the discussion "lockdep splat in CPU hotplug", so I enabled lockdep
> debugging, but didn't see a lockdep splat.

Did you try the fix linked below?

https://lkml.org/lkml/2014/10/22/722

> I'm using CONFIG_TREE_RCU=y, as this is SMP without PREEMPT.
>
> Anyone with a clue?


-- 
  <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog



* Re: (bisected) Lock up on sh73a0/kzm9g on cpuidle initialization
  2014-11-06 21:02 ` Daniel Lezcano
@ 2014-11-07  7:59   ` Geert Uytterhoeven
  2014-11-25 17:49     ` Geert Uytterhoeven
  0 siblings, 1 reply; 7+ messages in thread
From: Geert Uytterhoeven @ 2014-11-07  7:59 UTC
  To: Daniel Lezcano
  Cc: Ingo Molnar, Nicolas Pitre, Paul McKenney, Jiri Kosina,
	Rafael J. Wysocki, Linux PM list, Linux-sh list,
	linux-arm-kernel, linux-kernel

Hi Daniel,

On Thu, Nov 6, 2014 at 10:02 PM, Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
> On 11/06/2014 09:38 PM, Geert Uytterhoeven wrote:
>> When CONFIG_CPU_IDLE=y, the kernel locks up during cpuidle initialization
>> on Renesas sh73a0/kzm9g-reference, which has a dual-core Cortex-A9.
>>
>> Last message is:
>>
>>      DMA: preallocated 256 KiB pool for atomic coherent allocations
>>
>> After this it's supposed to print:
>>
>>      cpuidle: using governor ladder
>>      cpuidle: using governor menu
>>
>> I've bisected this to commit 442bf3aaf55a91ebfec71da46a4ee10a3c905bcc
>> ("sched: Let the scheduler see CPU idle states").
>>
>> Reverting that commit, and commit 83a0a96a5f26d974580fd7251043ff70c8f1823d
>> ("sched/fair: Leverage the idle state info when choosing the "idlest"
>> cpu") which
>> depends on it, fixes the problem.
>>
>> I saw the discussion "lockdep splat in CPU hotplug", so I enabled lockdep
>> debugging, but didn't see a lockdep splat.
>
> Did you try the fix attached ?
>
> https://lkml.org/lkml/2014/10/22/722

Thanks, I didn't try that.

However, this patch seems to be in v3.18-rc3, so I'm already using it.
Hence it doesn't fix the problem for me.

On another board, with a dual Cortex-A15, the problem doesn't show up.

Gr{oetje,eeting}s,

                        Geert



* Re: (bisected) Lock up on sh73a0/kzm9g on cpuidle initialization
  2014-11-07  7:59   ` Geert Uytterhoeven
@ 2014-11-25 17:49     ` Geert Uytterhoeven
  2014-11-25 18:01       ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Geert Uytterhoeven @ 2014-11-25 17:49 UTC
  To: Daniel Lezcano
  Cc: Ingo Molnar, Nicolas Pitre, Paul McKenney, Jiri Kosina,
	Rafael J. Wysocki, Linux PM list, Linux-sh list,
	linux-arm-kernel, linux-kernel

On Fri, Nov 7, 2014 at 8:59 AM, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> On Thu, Nov 6, 2014 at 10:02 PM, Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>> On 11/06/2014 09:38 PM, Geert Uytterhoeven wrote:
>>> When CONFIG_CPU_IDLE=y, the kernel locks up during cpuidle initialization
>>> on Renesas sh73a0/kzm9g-reference, which has a dual-core Cortex-A9.
>>>
>>> Last message is:
>>>
>>>      DMA: preallocated 256 KiB pool for atomic coherent allocations
>>>
>>> After this it's supposed to print:
>>>
>>>      cpuidle: using governor ladder
>>>      cpuidle: using governor menu
>>>
>>> I've bisected this to commit 442bf3aaf55a91ebfec71da46a4ee10a3c905bcc
>>> ("sched: Let the scheduler see CPU idle states").
>>>
>>> Reverting that commit, and commit 83a0a96a5f26d974580fd7251043ff70c8f1823d
>>> ("sched/fair: Leverage the idle state info when choosing the "idlest"
>>> cpu") which
>>> depends on it, fixes the problem.
>>>
>>> I saw the discussion "lockdep splat in CPU hotplug", so I enabled lockdep
>>> debugging, but didn't see a lockdep splat.
>>
>> Did you try the fix attached ?
>>
>> https://lkml.org/lkml/2014/10/22/722
>
> Thanks, I didn't try that.
>
> However, this patch seems to be in v3.18-rc3, so I'm already using it.
> Hence it doesn't fix the problem for me.
>
> On another board, with a dual Cortex-A15, the problem doesn't show up.

This problem (regression introduced in v3.18-rc1) is still present in v3.18-rc6.

I did some more investigation, and it's hanging in the call to
synchronize_rcu() in cpuidle_uninstall_idle_handler(), which was added in
commit 442bf3aaf55a91ebfec71da46a4ee10a3c905bcc.
More specifically, it's blocked on the wait_for_completion(&rcu.completion)
in kernel/rcu/update.c:void wait_rcu_gp(call_rcu_func_t crf).

Anyone with a clue?

Thanks again!

Gr{oetje,eeting}s,

                        Geert



* Re: (bisected) Lock up on sh73a0/kzm9g on cpuidle initialization
  2014-11-25 17:49     ` Geert Uytterhoeven
@ 2014-11-25 18:01       ` Paul E. McKenney
  2014-11-25 21:27         ` Geert Uytterhoeven
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2014-11-25 18:01 UTC
  To: Geert Uytterhoeven
  Cc: Daniel Lezcano, Ingo Molnar, Nicolas Pitre, Jiri Kosina,
	Rafael J. Wysocki, Linux PM list, Linux-sh list,
	linux-arm-kernel, linux-kernel

On Tue, Nov 25, 2014 at 06:49:16PM +0100, Geert Uytterhoeven wrote:
> On Fri, Nov 7, 2014 at 8:59 AM, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> > On Thu, Nov 6, 2014 at 10:02 PM, Daniel Lezcano
> > <daniel.lezcano@linaro.org> wrote:
> >> On 11/06/2014 09:38 PM, Geert Uytterhoeven wrote:
> >>> When CONFIG_CPU_IDLE=y, the kernel locks up during cpuidle initialization
> >>> on Renesas sh73a0/kzm9g-reference, which has a dual-core Cortex-A9.
> >>>
> >>> Last message is:
> >>>
> >>>      DMA: preallocated 256 KiB pool for atomic coherent allocations
> >>>
> >>> After this it's supposed to print:
> >>>
> >>>      cpuidle: using governor ladder
> >>>      cpuidle: using governor menu
> >>>
> >>> I've bisected this to commit 442bf3aaf55a91ebfec71da46a4ee10a3c905bcc
> >>> ("sched: Let the scheduler see CPU idle states").
> >>>
> >>> Reverting that commit, and commit 83a0a96a5f26d974580fd7251043ff70c8f1823d
> >>> ("sched/fair: Leverage the idle state info when choosing the "idlest"
> >>> cpu") which
> >>> depends on it, fixes the problem.
> >>>
> >>> I saw the discussion "lockdep splat in CPU hotplug", so I enabled lockdep
> >>> debugging, but didn't see a lockdep splat.
> >>
> >> Did you try the fix attached ?
> >>
> >> https://lkml.org/lkml/2014/10/22/722
> >
> > Thanks, I didn't try that.
> >
> > However, this patch seems to be in v3.18-rc3, so I'm already using it.
> > Hence it doesn't fix the problem for me.
> >
> > On another board, with a dual Cortex-A15, the problem doesn't show up.
> 
> This problem (regression introduced in v3.18-rc1) is still present in v3.18-rc6.
> 
> I did some more investigations, and it's hanging in the call to
> synchronize_rcu() in cpuidle_uninstall_idle_handler(), which was added in
> commit 442bf3aaf55a91ebfec71da46a4ee10a3c905bcc.
> More specifically, it's blocked on the wait_for_completion(&rcu.completion)
> in kernel/rcu/update.c:void wait_rcu_gp(call_rcu_func_t crf).

You didn't disable RCU CPU stall warnings, did you?  If you did, please
re-enable them, as the stall warning messages will likely help to debug
this.  The soft-lockup checks can also be quite valuable.

If you haven't run with CONFIG_PROVE_RCU=y, please try that.  For example,
if you have CONFIG_PREEMPT=y and you do synchronize_rcu() from within
an RCU read-side critical section (don't do that, it will hang!!!),
then you will get a lockdep splat.

Does any sort of system activity (keyboard, network, etc.) unstick the
system?

If you have tried all those things without good effect, could you please
send along your .config and an alt-sysrq-t dump of all tasks' stacks?

							Thanx, Paul

> Anyone with a clue?
> 
> Thanks again!
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 



* Re: (bisected) Lock up on sh73a0/kzm9g on cpuidle initialization
  2014-11-25 18:01       ` Paul E. McKenney
@ 2014-11-25 21:27         ` Geert Uytterhoeven
  2014-11-25 22:23           ` Daniel Lezcano
  0 siblings, 1 reply; 7+ messages in thread
From: Geert Uytterhoeven @ 2014-11-25 21:27 UTC
  To: Paul McKenney
  Cc: Daniel Lezcano, Ingo Molnar, Nicolas Pitre, Jiri Kosina,
	Rafael J. Wysocki, Linux PM list, Linux-sh list,
	linux-arm-kernel, linux-kernel, Magnus Damm

Hi Paul,

On Tue, Nov 25, 2014 at 7:01 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Tue, Nov 25, 2014 at 06:49:16PM +0100, Geert Uytterhoeven wrote:
>> On Fri, Nov 7, 2014 at 8:59 AM, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
>> > On Thu, Nov 6, 2014 at 10:02 PM, Daniel Lezcano
>> > <daniel.lezcano@linaro.org> wrote:
>> >> On 11/06/2014 09:38 PM, Geert Uytterhoeven wrote:
>> >>> When CONFIG_CPU_IDLE=y, the kernel locks up during cpuidle initialization
>> >>> on Renesas sh73a0/kzm9g-reference, which has a dual-core Cortex-A9.
>> >>>
>> >>> Last message is:
>> >>>
>> >>>      DMA: preallocated 256 KiB pool for atomic coherent allocations
>> >>>
>> >>> After this it's supposed to print:
>> >>>
>> >>>      cpuidle: using governor ladder
>> >>>      cpuidle: using governor menu
>> >>>
>> >>> I've bisected this to commit 442bf3aaf55a91ebfec71da46a4ee10a3c905bcc
>> >>> ("sched: Let the scheduler see CPU idle states").
>> >>>
>> >>> Reverting that commit, and commit 83a0a96a5f26d974580fd7251043ff70c8f1823d
>> >>> ("sched/fair: Leverage the idle state info when choosing the "idlest"
>> >>> cpu") which
>> >>> depends on it, fixes the problem.
>> >>>
>> >>> I saw the discussion "lockdep splat in CPU hotplug", so I enabled lockdep
>> >>> debugging, but didn't see a lockdep splat.
>> >>
>> >> Did you try the fix attached ?
>> >>
>> >> https://lkml.org/lkml/2014/10/22/722
>> >
>> > Thanks, I didn't try that.
>> >
>> > However, this patch seems to be in v3.18-rc3, so I'm already using it.
>> > Hence it doesn't fix the problem for me.
>> >
>> > On another board, with a dual Cortex-A15, the problem doesn't show up.
>>
>> This problem (regression introduced in v3.18-rc1) is still present in v3.18-rc6.
>>
>> I did some more investigations, and it's hanging in the call to
>> synchronize_rcu() in cpuidle_uninstall_idle_handler(), which was added in
>> commit 442bf3aaf55a91ebfec71da46a4ee10a3c905bcc.
>> More specifically, it's blocked on the wait_for_completion(&rcu.completion)
>> in kernel/rcu/update.c:void wait_rcu_gp(call_rcu_func_t crf).
>
> You didn't disable RCU CPU stall warnings, did you?  If you did, please
> re-enable them, as the stall warning messages will likely help to debug
> this.  The soft-lockup checks can also be quite valuable.
>
> If you haven't run with CONFIG_PROVE_RCU=y, please try that.  For example,
> if you have CONFIG_PREEMPT=y and you do synchronize_rcu() from within
> an RCU read-side critical section (don't do that, it will hang!!!),
> then you will get a lockdep splat.
>
> Does any sort of system activity (keyboard, network, etc.) unstick the
> system?

Thanks! Unfortunately none of the above helped.

However, I found the culprit. It turned out to be a platform issue, not an
issue in the generic cpu idle or RCU code. Read on below if you're
interested in the gory details. Else just skip, and sleep well again tonight ;-)

> If you have tried all those things without good effect, could you please
> send along your .config and an alt-sysrq-t dump of all tasks' stacks?

As I didn't manage to trigger a sysrq dump over the serial console,
I just called __handle_sysrq() right before the wait_for_completion(), after
a small delay. The dump didn't show anything suspicious. Everything
looked the same as on the dual-core Cortex-A15, where the problem
doesn't manifest.

Then I noticed the sched debug output on the A15, which was missing
on the A9 build. Enabling it on the A9 gave:

    Sched Debug Version: v0.11, 3.18.0-rc6-kzm9g-reference-04913-gedc89a2a2059c7ff-dirty #101
    ktime                                   : 0.000000
    sched_clk                               : 0.000000
    cpu_clk                                 : 0.000000
    jiffies                                 : 4294928896

Oops, time is not advancing?
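
A quick sanity check on that jiffies value. It assumes the 32-bit kernel
starts the counter at INITIAL_JIFFIES, i.e. (unsigned int)(-300*HZ), and
that this build uses HZ=128. The HZ value is inferred from the number
itself; it is not stated anywhere in the dump. Under those assumptions
the printed value is exactly the boot-time value, so apparently not a
single tick had been counted. A minimal sketch, with hypothetical helper
names:

```c
#include <stdint.h>

/*
 * Boot-time value of a 32-bit jiffies counter, following the kernel's
 * INITIAL_JIFFIES definition: (unsigned int)(-300 * HZ).
 * The counter starts 5 minutes before wraparound to expose wrap bugs early.
 * ASSUMPTION: hz is the build's CONFIG_HZ (128 is inferred, not confirmed).
 */
uint32_t initial_jiffies(int hz)
{
    return (uint32_t)(-300 * hz);
}

/* Ticks counted since boot; unsigned subtraction handles wraparound. */
uint32_t ticks_since_boot(uint32_t jiffies, int hz)
{
    return jiffies - initial_jiffies(hz);
}
```

With the assumed HZ=128, initial_jiffies(128) is 4294928896, and
ticks_since_boot(4294928896u, 128) is 0.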

Dmesg also shows (early):

    clocksource_of_init: no matching clocksources found

and the timer is only initialized much later, after cpu idle initialization:

    sh_cmt e6138000.timer: ch0: used for periodic clock events

Hacking up a timer node for "arm,cortex-a9-twd-timer" in sh73a0.dtsi
(with some "guessed" values) made it work.
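
The hang in wait_rcu_gp() can be pictured with a toy userspace model.
This is only a sketch, not the kernel code: the *_model names are
hypothetical, and the idea that the grace period needs a fixed number of
timer ticks to end is a simplifying assumption made for illustration.
The real function queues a callback with call_rcu() that completes an
on-stack completion, and then blocks in wait_for_completion(). If
nothing ever drives the grace period forward, the callback never runs.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct completion. */
struct completion_model {
    bool done;
};

/* Stand-in for the callback that wakes the waiter after a grace period. */
static void wakeme_after_rcu_model(struct completion_model *c)
{
    c->done = true;
}

/*
 * Toy model of wait_rcu_gp(): queue the callback, then wait for it.
 * ASSUMPTION: the grace period ends after `ticks_needed` timer ticks.
 * Instead of blocking forever, the model gives up after `max_polls`.
 * Returns true if the completion fired, false if the wait would hang.
 */
bool wait_rcu_gp_model(bool timer_running, unsigned ticks_needed,
                       unsigned max_polls)
{
    struct completion_model completion = { .done = false };
    unsigned ticks = 0;

    for (unsigned poll = 0; poll < max_polls && !completion.done; poll++) {
        if (timer_running)
            ticks++;                             /* a tick arrives */
        if (ticks >= ticks_needed)
            wakeme_after_rcu_model(&completion); /* grace period ended */
    }
    return completion.done;
}
```

In this model, wait_rcu_gp_model(true, 3, 10) completes, while
wait_rcu_gp_model(false, 3, 10) never does, which matches the symptom
seen before the TWD timer node was added.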

Thanks!

Gr{oetje,eeting}s,

                        Geert



* Re: (bisected) Lock up on sh73a0/kzm9g on cpuidle initialization
  2014-11-25 21:27         ` Geert Uytterhoeven
@ 2014-11-25 22:23           ` Daniel Lezcano
  0 siblings, 0 replies; 7+ messages in thread
From: Daniel Lezcano @ 2014-11-25 22:23 UTC
  To: Geert Uytterhoeven, Paul McKenney
  Cc: Ingo Molnar, Nicolas Pitre, Jiri Kosina, Rafael J. Wysocki,
	Linux PM list, Linux-sh list, linux-arm-kernel, linux-kernel,
	Magnus Damm

On 11/25/2014 10:27 PM, Geert Uytterhoeven wrote:

[ ... ]

>> Does any sort of system activity (keyboard, network, etc.) unstick the
>> system?
>
> Thanks! Unfortunately none of the above helped.
>
> However, I found the culprit. It turned out to be a platform issue, not an
> issue in the generic cpu idle or RCU code. Read on below if you're
> interested in the gory details. Else just skip, and sleep well again tonight ;-)
>
>> If you have tried all those things without good effect, could you please
>> send along your .config and an alt-sysrq-t dump of all tasks' stacks?
>
> As I didn't manage to trigger a sysrq dump over the serial console,
> I just called __handle_sysrq() right before the wait_for_completion(), after
> a small delay. The dump didn't show anything suspicious. Everything
> looked the same as on the dual-core Cortex A15, where the problem
> doesn't manifest.
>
> Then I noticed the sched debug output on the A15, which was missing
> on the CA9 build. Enabling it on the A9 gave:
>
> Sched Debug Version: v0.11,
> 3.18.0-rc6-kzm9g-reference-04913-gedc89a2a2059c7ff-dirty #101
> ktime                                   : 0.000000
> sched_clk                               : 0.000000
> cpu_clk                                 : 0.000000
> jiffies                                 : 4294928896
>
> Oops, time is not advancing?
>
> Dmesg also shows (early):
>
>      clocksource_of_init: no matching clocksources found
>
> and the timer is only initialized much later, after cpu idle initialization:
>
>      sh_cmt e6138000.timer: ch0: used for periodic clock events
>
> Hacking up a timer node for "arm,cortex-a9-twd-timer" in sh73a0.dtsi
> (with some "guessed" values) made it work.
>
> Thanks!
>
> Gr{oetje,eeting}s,

Hi Geert,

thanks for sharing this information.

   -- Daniel



