* ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
@ 2023-09-04  8:33 ` Rafał Miłecki
From: Rafał Miłecki @ 2023-09-04  8:33 UTC
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng, Russell King, Daniel Lezcano, Thomas Gleixner,
	Florian Fainelli, linux-clk, linux-arm-kernel, netdev,
	linux-kernel
  Cc: openwrt-devel, bcm-kernel-feedback-list

I made a second attempt at debugging some longstanding stability issues
affecting BCM53573 SoCs. These are single-core ARM Cortex-A7 boards
with a pretty slow arch timer running at 36.8 kHz.

After 0 to 20 minutes of close to zero activity I experience hangs, and I
need to wait a minute for the watchdog to kick in and reboot the device.

First debugging attempt:
https://lore.kernel.org/netdev/0f9d0cd6-d344-7915-7bc1-7a090b8305d2@gmail.com/T/ ("ARM board lockups/hangs triggered by locks and mutexes")

After a lot of bisecting, testing & hacking I believe there are 3 kinds
of kernel changes that affect BCM53573 stability. I'd like to describe
them below to document my debugging work. I'm clueless at this point.
Maybe someone can come up with an idea of the actual issue and, ideally,
a solution.

#####

1. Locking

During my first bisecting attempts I found multiple locking-related
commits that regressed stability.

Bisected commits:

131287ff833d ("once: add DO_ONCE_SLOW() for sleepable contexts")

and the following group:

d0d583484d2e ("locking/refcount: Consolidate implementations of refcount_t")
dab787c73f6e ("locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions")
0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
809554147d60 ("locking/refcount: Improve performance of generic REFCOUNT_FULL code")
9c9269977f03 ("locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header")
04bff7d7b808 ("locking/refcount: Remove unused refcount_*_checked() variants")
513b19a43bec ("locking/refcount: Ensure integer operands are treated as signed")
68b4ee68e8c8 ("locking/refcount: Define constants for saturation and max refcount values")

I don't believe there is actually anything wrong with the above changes.
Maybe it's some tiny timing thing that my board just doesn't like?

#####

2. Clock (arm,armv7-timer)

While comparing the main clock in Broadcom's SDK with the upstream one I
noticed a tiny difference: the mask value. I don't know if it makes any
sense, but switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in
arm_arch_timer.c (to match the SDK) increases average uptime (time before
a hang/lockup happens) from 4 minutes to 36 minutes.

#####

3. Random code changes

During my bisecting attempts I found one commit that regressed kernel
stability even though its actual changes have nothing to do with locking.
It was commit ad9b10d1eaad ("mtd: core: introduce of support for dynamic
partitions"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad9b10d1eaada169bd764abcab58f08538877e26

I thought that maybe it was all about making add_mtd_device() bigger and
changing the addresses of a lot of symbols (looking at System.map). So I
reverted that mtd commit and developed a dummy change that relocates as
few symbols (System.map) as possible while still breaking stability:

--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -94,6 +94,21 @@ void __cpuidle default_idle_call(void)
  		arch_cpu_idle();
  		start_critical_timings();
  	}
+
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
  }

  static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,

The above dummy change didn't relocate thousands of symbols but only about
20 of them. They happen to be lock symbols, however. Does it make any
sense for the above diff to regress kernel stability for me and cause
hangs/lockups?

--- System.map.good
+++ System.map.bad
@@ -22214,36 +22214,36 @@
  c062e7e0 T __cpuidle_text_start
  c062e7e0 t cpu_idle_poll
  c062e860 T default_idle_call
-c062e884 T __cpuidle_text_end
-c062e888 T __lock_text_start
-c062e8a0 T _raw_spin_unlock_irqrestore
-c062e8c0 T _raw_spin_trylock
-c062e900 T _raw_write_unlock_irqrestore
-c062e920 T _raw_read_trylock
-c062e960 T _raw_write_trylock
-c062e9a0 T _raw_spin_lock_bh
-c062ea00 T _raw_read_lock_bh
-c062ea40 T _raw_write_lock_bh
-c062ea80 T _raw_spin_trylock_bh
-c062eb00 T _raw_spin_unlock_bh
-c062eb40 T _raw_write_unlock_bh
-c062eb80 T _raw_read_unlock_bh
-c062ebc0 T _raw_read_unlock_irqrestore
-c062ec00 T _raw_write_lock
-c062ec40 T _raw_write_lock_irq
-c062ec80 T _raw_write_lock_irqsave
-c062ecc0 T _raw_read_lock
-c062ed00 T _raw_spin_lock
-c062ed40 T _raw_read_lock_irq
-c062ed80 T _raw_spin_lock_irq
-c062ede0 T _raw_spin_lock_irqsave
-c062ee40 T _raw_read_lock_irqsave
-c062ee70 T __hyp_text_end
-c062ee70 T __hyp_text_start
-c062ee70 T __kprobes_text_end
-c062ee70 T __kprobes_text_start
-c062ee70 T __lock_text_end
-c062ee70 T _etext
+c062e954 T __cpuidle_text_end
+c062e958 T __lock_text_start
+c062e960 T _raw_spin_unlock_irqrestore
+c062e980 T _raw_spin_trylock
+c062e9c0 T _raw_write_unlock_irqrestore
+c062e9e0 T _raw_read_trylock
+c062ea20 T _raw_write_trylock
+c062ea60 T _raw_spin_lock_bh
+c062eac0 T _raw_read_lock_bh
+c062eb00 T _raw_write_lock_bh
+c062eb40 T _raw_spin_trylock_bh
+c062ebc0 T _raw_spin_unlock_bh
+c062ec00 T _raw_write_unlock_bh
+c062ec40 T _raw_read_unlock_bh
+c062ec80 T _raw_read_unlock_irqrestore
+c062ecc0 T _raw_write_lock
+c062ed00 T _raw_write_lock_irq
+c062ed40 T _raw_write_lock_irqsave
+c062ed80 T _raw_read_lock
+c062edc0 T _raw_spin_lock
+c062ee00 T _raw_read_lock_irq
+c062ee40 T _raw_spin_lock_irq
+c062eea0 T _raw_spin_lock_irqsave
+c062ef00 T _raw_read_lock_irqsave
+c062ef30 T __hyp_text_end
+c062ef30 T __hyp_text_start
+c062ef30 T __kprobes_text_end
+c062ef30 T __kprobes_text_start
+c062ef30 T __lock_text_end
+c062ef30 T _etext
  c062f000 D __start_rodata
  c062f000 D sigreturn_codes
  c062f044 d cpu_arch_name

###

As those hangs/lockups are related to so many different changes, it's
really hard to debug them.

This bug seems to be specific to the slow arch clock and affects
stability only when kernel locking code and symbol layout trigger some
very specific timing.

Enabling CONFIG_PROVE_LOCKING seems to make the issue go away, but it
affects so much code that it's hard to tell why it actually matters.

The same goes for disabling CONFIG_SMP. I noticed Broadcom's SDK keeps it
disabled. I tried it and it does improve stability (I had 3 devices with
6 days of uptime and counting). Again, it affects a lot of kernel parts
so it's hard to tell why it helps.

Unless someone comes up with some magic solution I'll probably try
building BCM53573 images without CONFIG_SMP for my personal needs.

* Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
  2023-09-04  8:33 ` Rafał Miłecki
@ 2023-09-04  8:58   ` Geert Uytterhoeven
From: Geert Uytterhoeven @ 2023-09-04  8:58 UTC
  To: Rafał Miłecki
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng, Russell King, Daniel Lezcano, Thomas Gleixner,
	Florian Fainelli, linux-clk, linux-arm-kernel, netdev,
	linux-kernel, openwrt-devel, bcm-kernel-feedback-list

Hi Rafał,

On Mon, Sep 4, 2023 at 10:35 AM Rafał Miłecki <zajec5@gmail.com> wrote:
> 2. Clock (arm,armv7-timer)
>
> While comparing main clock in Broadcom's SDK with upstream one I noticed
> a tiny difference: mask value. I don't know if it makes any sense but
> switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in
> arm_arch_timer.c (to match SDK) increases average uptime (time before a
> hang/lockup happens) from 4 minutes to 36 minutes.

That code path is used only for type != ARCH_TIMER_TYPE_CP15,
but your kernel log

    arch_timer: cp15 timer(s) running at 0.03MHz (virt).

suggests that type == ARCH_TIMER_TYPE_CP15?!?

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
  2023-09-04  8:33 ` Rafał Miłecki
@ 2023-09-04 15:25   ` Waiman Long
From: Waiman Long @ 2023-09-04 15:25 UTC
  To: Rafał Miłecki, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Boqun Feng, Russell King, Daniel Lezcano,
	Thomas Gleixner, Florian Fainelli, linux-clk, linux-arm-kernel,
	netdev, linux-kernel
  Cc: openwrt-devel, bcm-kernel-feedback-list


On 9/4/23 04:33, Rafał Miłecki wrote:
> As those hangs/lockups are related to so many different changes it's
> really hard to debug them.
>
> This bug seems to be specific to the slow arch clock that affects
> stability only when kernel locking code and symbols layout trigger some
> very specific timing.
>
> Enabling CONFIG_PROVE_LOCKING seems to make issue go away but it affects
> so much code it's hard to tell why it actually matters.
>
> Same for disabling CONFIG_SMP. I noticed Broadcom's SDK keeps it
> disabled. I tried it and it improves stability (I had 3 devices with 6
> days of uptime and counting) indeed. Again it affects a lot of kernel
> parts so it's hard to tell why it helps.
>
> Unless someone comes up with some magic solution I'll probably try
> building BCM53573 images without CONFIG_SMP for my personal needs.

All the locking operations rely on the fact that the instruction to
acquire or release a lock is atomic. Is it possible that this may not be
the case under certain circumstances for this ARM BCM53573 SoC? Or maybe
some Kconfig options are not set correctly, such as missing some errata
workarounds that are needed.

I don't know enough about the 32-bit ARM architecture to say whether
this is the case or not, but that is my best guess.

Cheers,
Longman


* Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
  2023-09-04 15:25   ` Waiman Long
@ 2023-09-04 15:40     ` Russell King (Oracle)
From: Russell King (Oracle) @ 2023-09-04 15:40 UTC
  To: Waiman Long
  Cc: Rafał Miłecki, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Boqun Feng, Daniel Lezcano, Thomas Gleixner,
	Florian Fainelli, linux-clk, linux-arm-kernel, netdev,
	linux-kernel, openwrt-devel, bcm-kernel-feedback-list

On Mon, Sep 04, 2023 at 11:25:57AM -0400, Waiman Long wrote:
> 
> On 9/4/23 04:33, Rafał Miłecki wrote:
> > As those hangs/lockups are related to so many different changes it's
> > really hard to debug them.
> > 
> > This bug seems to be specific to the slow arch clock that affects
> > stability only when kernel locking code and symbols layout trigger some
> > very specific timing.
> > 
> > Enabling CONFIG_PROVE_LOCKING seems to make issue go away but it affects
> > so much code it's hard to tell why it actually matters.
> > 
> > Same for disabling CONFIG_SMP. I noticed Broadcom's SDK keeps it
> > disabled. I tried it and it improves stability (I had 3 devices with 6
> > days of uptime and counting) indeed. Again it affects a lot of kernel
> > parts so it's hard to tell why it helps.
> > 
> > Unless someone comes up with some magic solution I'll probably try
> > building BCM53573 images without CONFIG_SMP for my personal needs.
> 
> All the locking operations rely on the fact that the instruction to acquire
> or release a lock is atomic. Is it possible that it may not be the case
> under certain circumstances for this ARM BCM53573 SoC? Or maybe some Kconfig
> options are not set correctly like missing some errata that are needed.
> 
> I don't know enough about the 32-bit arm architecture to say whether this is
> the case or not, but that is my best guess.

So, BCM53573 is Cortex-A7, which is ARMv7, which has the exclusive
load/store instructions. Whether the SoC has the necessary exclusive
monitors to support these instructions is another matter, and I
suspect someone with documentation would need to check that.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

* Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
  2023-09-04 15:40     ` Russell King (Oracle)
@ 2023-09-04 20:16       ` Waiman Long
From: Waiman Long @ 2023-09-04 20:16 UTC
  To: Russell King (Oracle)
  Cc: Rafał Miłecki, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Boqun Feng, Daniel Lezcano, Thomas Gleixner,
	Florian Fainelli, linux-clk, linux-arm-kernel, netdev,
	linux-kernel, openwrt-devel, bcm-kernel-feedback-list

On 9/4/23 11:40, Russell King (Oracle) wrote:
> On Mon, Sep 04, 2023 at 11:25:57AM -0400, Waiman Long wrote:
>> On 9/4/23 04:33, Rafał Miłecki wrote:
>>> As those hangs/lockups are related to so many different changes it's
>>> really hard to debug them.
>>>
>>> This bug seems to be specific to the slow arch clock that affects
>>> stability only when kernel locking code and symbols layout trigger some
>>> very specific timing.
>>>
>>> Enabling CONFIG_PROVE_LOCKING seems to make issue go away but it affects
>>> so much code it's hard to tell why it actually matters.
>>>
>>> Same for disabling CONFIG_SMP. I noticed Broadcom's SDK keeps it
>>> disabled. I tried it and it improves stability (I had 3 devices with 6
>>> days of uptime and counting) indeed. Again it affects a lot of kernel
>>> parts so it's hard to tell why it helps.
>>>
>>> Unless someone comes up with some magic solution I'll probably try
>>> building BCM53573 images without CONFIG_SMP for my personal needs.
>> All the locking operations rely on the fact that the instruction to acquire
>> or release a lock is atomic. Is it possible that it may not be the case
>> under certain circumstances for this ARM BCM53573 SoC? Or maybe some Kconfig
>> options are not set correctly like missing some errata that are needed.
>>
>> I don't know enough about the 32-bit arm architecture to say whether this is
>> the case or not, but that is my best guess.
> So, BCM53573 is Cortex-A7, which is ARMv7, which has the exclusive
> load/store instructions. Whether the SoC has the necessary exclusive
> monitors to support these instructions is another matter, and I
> suspect someone with documentation would need to check that.

To clarify, it is not necessary to use an atomic instruction as on x86;
the LL/SC style of synchronization instructions with proper hardware
support should also be enough. Again, the hardware needs to have the
proper support for those synchronization instructions to operate
correctly.

Cheers,
Longman


* Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
  2023-09-04 15:40     ` Russell King (Oracle)
@ 2023-09-05 20:07       ` Florian Fainelli
From: Florian Fainelli @ 2023-09-05 20:07 UTC
  To: Russell King (Oracle), Waiman Long
  Cc: Rafał Miłecki, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Boqun Feng, Daniel Lezcano, Thomas Gleixner,
	linux-clk, linux-arm-kernel, netdev, linux-kernel, openwrt-devel,
	bcm-kernel-feedback-list



On 9/4/2023 8:40 AM, Russell King (Oracle) wrote:
> On Mon, Sep 04, 2023 at 11:25:57AM -0400, Waiman Long wrote:
>>
>> On 9/4/23 04:33, Rafał Miłecki wrote:
>>> As those hangs/lockups are related to so many different changes it's
>>> really hard to debug them.
>>>
>>> This bug seems to be specific to the slow arch clock that affects
>>> stability only when kernel locking code and symbols layout trigger some
>>> very specific timing.
>>>
>>> Enabling CONFIG_PROVE_LOCKING seems to make issue go away but it affects
>>> so much code it's hard to tell why it actually matters.
>>>
>>> Same for disabling CONFIG_SMP. I noticed Broadcom's SDK keeps it
>>> disabled. I tried it and it improves stability (I had 3 devices with 6
>>> days of uptime and counting) indeed. Again it affects a lot of kernel
>>> parts so it's hard to tell why it helps.
>>>
>>> Unless someone comes up with some magic solution I'll probably try
>>> building BCM53573 images without CONFIG_SMP for my personal needs.
>>
>> All the locking operations rely on the fact that the instruction to acquire
>> or release a lock is atomic. Is it possible that it may not be the case
>> under certain circumstances for this ARM BCM53573 SoC? Or maybe some Kconfig
>> options are not set correctly like missing some errata that are needed.
>>
>> I don't know enough about the 32-bit arm architecture to say whether this is
>> the case or not, but that is my best guess.
> 
> So, BCM53573 is Cortex-A7, which is ARMv7, which has the exclusive
> load/store instructions. Whether the SoC has the necessary exclusive
> monitors to support these instructions is another matter, and I
> suspect someone with documentation would need to check that.

Finding documentation about this SoC has been very difficult,
unfortunately...

Would any of the lock or mutex debugging self-tests catch hardware
designed without proper support for exclusive monitors in the DRAM
controller? Keep in mind this is a uniprocessor system, however. Does
that mean we may have issues in our SMP_ON_UP alternative patching?
-- 
Florian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
  2023-09-05 20:07       ` Florian Fainelli
@ 2023-09-06  2:17         ` Waiman Long
  -1 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2023-09-06  2:17 UTC (permalink / raw)
  To: Florian Fainelli, Russell King (Oracle)
  Cc: Rafał Miłecki, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Boqun Feng, Daniel Lezcano, Thomas Gleixner,
	linux-clk, linux-arm-kernel, netdev, linux-kernel, openwrt-devel,
	bcm-kernel-feedback-list

On 9/5/23 16:07, Florian Fainelli wrote:
>
>
> On 9/4/2023 8:40 AM, Russell King (Oracle) wrote:
>> On Mon, Sep 04, 2023 at 11:25:57AM -0400, Waiman Long wrote:
>>>
>>> On 9/4/23 04:33, Rafał Miłecki wrote:
>>>> As those hangs/lockups are related to so many different changes it's
>>>> really hard to debug them.
>>>>
>>>> This bug seems to be specific to the slow arch clock that affects
>>>> stability only when kernel locking code and symbols layout trigger 
>>>> some
>>>> very specific timing.
>>>>
>>>> Enabling CONFIG_PROVE_LOCKING seems to make issue go away but it 
>>>> affects
>>>> so much code it's hard to tell why it actually matters.
>>>>
>>>> Same for disabling CONFIG_SMP. I noticed Broadcom's SDK keeps it
>>>> disabled. I tried it and it improves stability (I had 3 devices with 6
>>>> days of uptime and counting) indeed. Again it affects a lot of kernel
>>>> parts so it's hard to tell why it helps.
>>>>
>>>> Unless someone comes up with some magic solution I'll probably try
>>>> building BCM53573 images without CONFIG_SMP for my personal needs.
>>>
>>> All the locking operations rely on the fact that the instruction to 
>>> acquire
>>> or release a lock is atomic. Is it possible that it may not be the case
>>> under certain circumstances for this ARM BCM53573 SoC? Or maybe some 
>>> Kconfig
>>> options are not set correctly like missing some errata that are needed.
>>>
>>> I don't know enough about the 32-bit arm architecture to say whether 
>>> this is
>>> the case or not, but that is my best guess.
>>
>> So, BCM53573 is Cortex-A7, which is ARMv7, which has the exclusive
>> load/store instructions. Whether the SoC has the necessary exclusive
>> monitors to support these instructions is another matter, and I
>> suspect someone with documentation would need to check that.
>
> Finding documentation about this SoC has been very difficult 
> unfortunately...
>
> Would any of the lock or mutex debugging self test catch hardware 
> designed without proper support for exclusive monitors in the DRAM 
> controller? Keep in mind this is an uni-processor system however, does 
> that mean we may have issues in our SMP_ON_UP alternative patching?

Usually this kind of locking problem is timing-related and happens only
once in a while, so it is not easy to write a test that reliably detects
it. I am not sure about the SMP_ON_UP thing.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
  2023-09-04  8:33 ` Rafał Miłecki
@ 2023-09-08  8:10   ` Linus Walleij
  -1 siblings, 0 replies; 22+ messages in thread
From: Linus Walleij @ 2023-09-08  8:10 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng, Russell King, Daniel Lezcano, Thomas Gleixner,
	Florian Fainelli, linux-clk, linux-arm-kernel, netdev,
	linux-kernel, openwrt-devel, bcm-kernel-feedback-list

Hi Rafal,

On Mon, Sep 4, 2023 at 10:34 AM Rafał Miłecki <zajec5@gmail.com> wrote:

> I'm clueless at this point.
> Maybe someone can come up with an idea of actual issue & ideally a
> solution.

Damn this is frustrating.

> 2. Clock (arm,armv7-timer)
>
> While comparing main clock in Broadcom's SDK with upstream one I noticed
> a tiny difference: mask value. I don't know if it makes any sense but
> switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in
> arm_arch_timer.c (to match SDK) increases average uptime (time before a
> hang/lockup happens) from 4 minutes to 36 minutes.

This could be related to how often the system goes to idle.

> +       if (cpu_idle_force_poll == 1234)
> +               arch_cpu_idle();
> +       if (cpu_idle_force_poll == 5678)
> +               arch_cpu_idle();
> +       if (cpu_idle_force_poll == 1234)
> +               arch_cpu_idle();
> +       if (cpu_idle_force_poll == 5678)
> +               arch_cpu_idle();
> +       if (cpu_idle_force_poll == 1234)
> +               arch_cpu_idle();
> +       if (cpu_idle_force_poll == 5678)
> +               arch_cpu_idle();
> +       if (cpu_idle_force_poll == 1234)
> +               arch_cpu_idle();

Idle again.

I would have tried to see what arch_cpu_idle() is doing.

arm_pm_idle() or cpu_do_idle()?

What happens if you just put return in arch_cpu_idle()
so it does nothing?

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
  2023-09-08  8:10   ` Linus Walleij
@ 2023-11-29 21:20     ` Rafał Miłecki
  -1 siblings, 0 replies; 22+ messages in thread
From: Rafał Miłecki @ 2023-11-29 21:20 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng, Russell King, Daniel Lezcano, Thomas Gleixner,
	Florian Fainelli, linux-clk, linux-arm-kernel, netdev,
	linux-kernel, openwrt-devel, bcm-kernel-feedback-list

Hi,

it's a late reply but I didn't find enough determination earlier.

On 8.09.2023 10:10, Linus Walleij wrote:
> On Mon, Sep 4, 2023 at 10:34 AM Rafał Miłecki <zajec5@gmail.com> wrote:
> 
>> I'm clueless at this point.
>> Maybe someone can come up with an idea of actual issue & ideally a
>> solution.
> 
> Damn this is frustrating.
> 
>> 2. Clock (arm,armv7-timer)
>>
>> While comparing main clock in Broadcom's SDK with upstream one I noticed
>> a tiny difference: mask value. I don't know if it makes any sense but
>> switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in
>> arm_arch_timer.c (to match SDK) increases average uptime (time before a
>> hang/lockup happens) from 4 minutes to 36 minutes.
> 
> This could be related to how often the system goes to idle.
> 
>> +       if (cpu_idle_force_poll == 1234)
>> +               arch_cpu_idle();
>> +       if (cpu_idle_force_poll == 5678)
>> +               arch_cpu_idle();
>> +       if (cpu_idle_force_poll == 1234)
>> +               arch_cpu_idle();
>> +       if (cpu_idle_force_poll == 5678)
>> +               arch_cpu_idle();
>> +       if (cpu_idle_force_poll == 1234)
>> +               arch_cpu_idle();
>> +       if (cpu_idle_force_poll == 5678)
>> +               arch_cpu_idle();
>> +       if (cpu_idle_force_poll == 1234)
>> +               arch_cpu_idle();
> 
> Idle again.
> 
> I would have tried to see what arch_cpu_idle() is doing.
> 
> arm_pm_idle() or cpu_do_idle()?

In my case arm_pm_idle is NULL.


> What happens if you just put return in arch_cpu_idle()
> so it does nothing?

Doesn't help. I also tried putting:
udelay(10);
and
udelay(1000);
at the beginning of arch_cpu_idle(). Neither helped.


Here comes a more interesting experiment though. Putting this there:

if (!(foo++ % 10000)) {
	pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
}

doesn't seem to help.


Putting the following there, however, seems to make the kernel/device stable:

if (!(foo++ % 100)) {
	pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
}


I think I'm just going to assume those chipsets are simply hw broken.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
  2023-11-29 21:20     ` Rafał Miłecki
@ 2023-11-29 21:33       ` Linus Walleij
  -1 siblings, 0 replies; 22+ messages in thread
From: Linus Walleij @ 2023-11-29 21:33 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng, Russell King, Daniel Lezcano, Thomas Gleixner,
	Florian Fainelli, linux-clk, linux-arm-kernel, netdev,
	linux-kernel, openwrt-devel, bcm-kernel-feedback-list

On Wed, Nov 29, 2023 at 10:20 PM Rafał Miłecki <zajec5@gmail.com> wrote:

> Here comes more interesting experiment though. Putting there:
>
> if (!(foo++ % 10000)) {
>         pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
> }
>
> doesn't seem to help.
>
>
> Putting following however seems to make kernel/device stable:
>
> if (!(foo++ % 100)) {
>         pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
> }

That's just too weird.

> I think I'm just going to assume those chipsets are simply hw broken.

If disabling CPU idle on these altogether stabilizes them, then maybe that
is what we need to do?

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
  2023-11-29 21:33       ` Linus Walleij
@ 2023-11-29 21:42         ` Florian Fainelli
  -1 siblings, 0 replies; 22+ messages in thread
From: Florian Fainelli @ 2023-11-29 21:42 UTC (permalink / raw)
  To: Linus Walleij, Rafał Miłecki
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng, Russell King, Daniel Lezcano, Thomas Gleixner,
	Florian Fainelli, linux-clk, linux-arm-kernel, netdev,
	linux-kernel, openwrt-devel, bcm-kernel-feedback-list

On 11/29/23 13:33, Linus Walleij wrote:
> On Wed, Nov 29, 2023 at 10:20 PM Rafał Miłecki <zajec5@gmail.com> wrote:
> 
>> Here comes more interesting experiment though. Putting there:
>>
>> if (!(foo++ % 10000)) {
>>          pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
>> }
>>
>> doesn't seem to help.
>>
>>
>> Putting following however seems to make kernel/device stable:
>>
>> if (!(foo++ % 100)) {
>>          pr_info("[%s] arm_pm_idle:%ps\n", __func__, arm_pm_idle);
>> }
> 
> That's just too weird.

It does seem to indicate that idling for too long wreaks havoc, but it
is indeed not making much sense. Without proper documentation for this
SoC, it is hard to figure out what impact stopping the ARM CPU clock
has on the rest of the memory subsystem, especially outside of the
CPU. I do not believe that this SoC has any form of PLL clock gating or
pulse skipping.

> 
>> I think I'm just going to assume those chipsets are simply hw broken.
> 
> If disabling CPU idle on these altogether stabilize them, then maybe that
> is what we need to do?

Yes, please try booting with "nohlt" set on the kernel command line and 
see how that fares.

It would also be useful to dump the L2 CTLR and L2 ECTLR. This is a
complete shot in the dark, though I was initially wondering if there
could be some retention issues, and would have recommended disabling
the L2 retention policy completely just for testing.

MRC p15, 1, <Rt>, c9, c0, 2;

of particular interest here would be the bit at position 0; try to see if
changing it to 1 (3 cycles) or 0 (2 cycles) changes anything.

MRC p15, 1, <Rt>, c9, c0, 3;

the lower bits are reserved, so I would not necessarily expect them to 
be mapping to configurable latencies, but if you see non-zero values in 
bits [28:0], try changing them to 0 and see if that changes anything.
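
A minimal sketch of the bit checks described above, operating on raw
register values that have already been read out (the MRC reads themselves
need privileged code on the target and are not shown). The helper and
macro names are made up for this example, and the bit meanings are only
the ones stated in this message.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 0 of L2CTLR, as described above: 0 = 2 cycles, 1 = 3 cycles. */
#define L2CTLR_BIT0		(UINT32_C(1) << 0)
/* Bits [28:0] of L2ECTLR, described above as reserved. */
#define L2ECTLR_LOW_MASK	((UINT32_C(1) << 29) - 1)

/* Cycle count encoded by bit 0 of a raw L2CTLR value. */
static unsigned int l2ctlr_bit0_cycles(uint32_t l2ctlr)
{
	return (l2ctlr & L2CTLR_BIT0) ? 3 : 2;
}

/* Raw L2CTLR value with bit 0 forced to 'set' (0 or 1). */
static uint32_t l2ctlr_with_bit0(uint32_t l2ctlr, int set)
{
	assert(set == 0 || set == 1);
	return set ? (l2ctlr | L2CTLR_BIT0) : (l2ctlr & ~L2CTLR_BIT0);
}

/* Non-zero result means some of bits [28:0] of L2ECTLR are set. */
static uint32_t l2ectlr_low_bits(uint32_t l2ectlr)
{
	return l2ectlr & L2ECTLR_LOW_MASK;
}

/* Raw L2ECTLR value with bits [28:0] cleared. */
static uint32_t l2ectlr_clear_low_bits(uint32_t l2ectlr)
{
	return l2ectlr & ~L2ECTLR_LOW_MASK;
}
```

The modified values would then have to be written back with the matching
MCR encodings, which is left out here.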

Thanks for your persistence!
-- 
Florian


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-11-29 21:43 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-04  8:33 ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes Rafał Miłecki
2023-09-04  8:33 ` Rafał Miłecki
2023-09-04  8:58 ` Geert Uytterhoeven
2023-09-04  8:58   ` Geert Uytterhoeven
2023-09-04 15:25 ` Waiman Long
2023-09-04 15:25   ` Waiman Long
2023-09-04 15:40   ` Russell King (Oracle)
2023-09-04 15:40     ` Russell King (Oracle)
2023-09-04 20:16     ` Waiman Long
2023-09-04 20:16       ` Waiman Long
2023-09-05 20:07     ` Florian Fainelli
2023-09-05 20:07       ` Florian Fainelli
2023-09-06  2:17       ` Waiman Long
2023-09-06  2:17         ` Waiman Long
2023-09-08  8:10 ` Linus Walleij
2023-09-08  8:10   ` Linus Walleij
2023-11-29 21:20   ` Rafał Miłecki
2023-11-29 21:20     ` Rafał Miłecki
2023-11-29 21:33     ` Linus Walleij
2023-11-29 21:33       ` Linus Walleij
2023-11-29 21:42       ` Florian Fainelli
2023-11-29 21:42         ` Florian Fainelli
