* Lockup inside omap4_prminst_read_inst_reg on OMAP5 uEVM
@ 2020-07-26 17:59 David Shah
  2020-07-27  6:47 ` Tony Lindgren
  2020-08-01 20:57 ` David Shah
  0 siblings, 2 replies; 6+ messages in thread
From: David Shah @ 2020-07-26 17:59 UTC (permalink / raw)
  To: Discussions about the Letux Kernel, kernel@pyra-handheld.com, Linux-OMAP

Hi all,

I am looking into random lockups - significantly rarer than once a day
in typical usage, various patterns like lots of bursty network traffic
increase frequency - that affect both the uEVM and the Pyra (also
OMAP5432 based) on newer kernels (currently testing with 5.6 but I have
seen lockups with 5.7 too).

Currently I'm working with the uEVM as it is a bit easier to connect
the JTAG adapter. I managed to get a lockup with the JTAG attached, and
unfortunately the processor is locked up badly enough (presumably a
stuck memory bus?) that JTAG isn't able to get a register dump or
stacktrace. But I do get the following error which at least gives a
PC: 

CortexA15_0: Trouble Halting Target CPU: (Error -1323 @ 0xC0223E0C)
Device failed to enter debug/halt mode because pipeline is stalled.
Power-cycle the board. If error persists, confirm configuration and/or
try more reliable JTAG settings (e.g. lower TCLK). (Emulation package
9.2.0.00002) 

The second core is just sitting at WFI, don't think there is anything
suspicious about that.

Looking at the kernel disassembly this is the actual register read (ldr
r0, [r1]) part of omap4_prminst_read_inst_reg.
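
For reference, in mainline that function is essentially just a readl of an
ioremapped PRM register. Paraphrased from arch/arm/mach-omap2/prminst44xx.c
(simplified here, with the sanity checks dropped, and the exact struct
layout varies a bit between kernel versions), it is roughly:

    /* _prm_bases[] holds the ioremapped base address of each PRCM
     * partition; part/inst/idx select the partition and register. */
    u32 omap4_prminst_read_inst_reg(u8 part, s16 inst, u16 idx)
    {
            return readl_relaxed(_prm_bases[part].va + inst + idx);
    }

so the stalled ldr looks like the readl_relaxed itself, with r1 holding
the computed PRM virtual address.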

My best guess is that it is trying to read from a register that doesn't
exist or isn't responding due to the current power configuration, but I
wonder if anyone has seen this before or has any more clues on how to
debug this? It's a shame that I can't seem to see what r1 is or get a
backtrace. It looks like it might be possible to set some kind of
timeout on the interconnect; has anyone tried something like that to
debug this kind of issue?

Best

David Shah




* Re: Lockup inside omap4_prminst_read_inst_reg on OMAP5 uEVM
  2020-07-26 17:59 Lockup inside omap4_prminst_read_inst_reg on OMAP5 uEVM David Shah
@ 2020-07-27  6:47 ` Tony Lindgren
  2020-08-01 20:57 ` David Shah
  1 sibling, 0 replies; 6+ messages in thread
From: Tony Lindgren @ 2020-07-27  6:47 UTC (permalink / raw)
  To: David Shah
  Cc: Discussions about the Letux Kernel, kernel@pyra-handheld.com, Linux-OMAP

* David Shah <dave@ds0.me> [200726 17:59]:
> Hi all,
> 
> I am looking into random lockups - significantly rarer than once a day
> in typical usage, various patterns like lots of bursty network traffic
> increase frequency - that affect both the uEVM and the Pyra (also
> OMAP5432 based) on newer kernels (currently testing with 5.6 but I have
> seen lockups with 5.7 too).

Just wondering.. Is this with USB Ethernet or with WLAN?

Regards,

Tony


* Re: Lockup inside omap4_prminst_read_inst_reg on OMAP5 uEVM
  2020-07-26 17:59 Lockup inside omap4_prminst_read_inst_reg on OMAP5 uEVM David Shah
  2020-07-27  6:47 ` Tony Lindgren
@ 2020-08-01 20:57 ` David Shah
  2020-08-16 17:13   ` [Letux-kernel] " David Shah
  1 sibling, 1 reply; 6+ messages in thread
From: David Shah @ 2020-08-01 20:57 UTC (permalink / raw)
  To: Discussions about the Letux Kernel, kernel, Linux-OMAP

A tiny bit more information, if anyone has any more ideas.

I can confirm that this happened once with the device idle, and no
networking connection.

Based on the information I have been able to extract, the call stack does
seem to involve omap4_enter_lowpower but I can't be certain.

The main JTAG access I have is to be able to read out what seems to be
kernel virtual memory via the other, non-locked-up but WFI, core. I
attempted to add some tracing via writing a value to a global variable
inside the problem function and then flushing the D$, but the delay this
adds (or the cache flush itself) seems to stop the lockup from occurring
most of the time. It did lock up once with this added, but then reading
out that area of memory failed, possibly because the locked up core was
confusing the cache coherency magic inside the cores.

Since that lock-up I added 20 NOPs after the cache flush, to try and make
sure the cache flush really does work, and with those added it does not
lock up at all.
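
For concreteness, the instrumentation was something along these lines (a
reconstructed sketch rather than the exact patch: trace_marker is a name
made up here, and __cpuc_flush_dcache_area() is simply the ARM32 flush
helper I reached for, so the exact cache maintenance needed to make the
value visible over JTAG may well differ):

    #include <linux/types.h>
    #include <asm/cacheflush.h>

    /* Marker the debugger can locate by symbol and read back through the
     * still-running core after the other core has locked up. */
    static volatile u32 trace_marker;

    static void trace_step(u32 step)
    {
            trace_marker = step;
            /* Push the store out of the L1 data cache so it is really in
             * memory (or at least visible) once the core wedges. */
            __cpuc_flush_dcache_area((void *)&trace_marker, sizeof(trace_marker));
    }

with trace_step() calls dropped into the suspect idle path, and (after the
lockup described above) the 20 NOPs appended at the end of trace_step().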

Is there a better way to take advantage of this ability to read out
memory for debugging?

Best

David


On Sun, 2020-07-26 at 18:59 +0100, David Shah wrote:
> Hi all,
> 
> I am looking into random lockups - significantly rarer than once a day
> in typical usage, various patterns like lots of bursty network traffic
> increase frequency - that affect both the uEVM and the Pyra (also
> OMAP5432 based) on newer kernels (currently testing with 5.6 but I have
> seen lockups with 5.7 too).
> 
> Currently I'm working with the uEVM as it is a bit easier to connect
> the JTAG adapter. I managed to get a lockup with the JTAG attached, and
> unfortunately the processor is badly locked up enough (presumably a
> stuck memory bus?) that JTAG isn't able to get a register dump or
> stacktrace. But I do get the following error which at least gives a
> PC: 
> 
> CortexA15_0: Trouble Halting Target CPU: (Error -1323 @ 0xC0223E0C)
> Device failed to enter debug/halt mode because pipeline is stalled.
> Power-cycle the board. If error persists, confirm configuration and/or
> try more reliable JTAG settings (e.g. lower TCLK). (Emulation package
> 9.2.0.00002) 
> 
> The second core is just sitting at WFI, don't think there is anything
> suspicious about that.
> 
> Looking at the kernel disassembly this is the actual register read (ldr
> r0, [r1]) part of omap4_prminst_read_inst_reg.
> 
> My best guess is that it is trying to read from a register that doesn't
> exist or isn't responding due to the current power configuration, but I
> wonder if anyone has seen this before or has any more clues on how to
> debug this? It's a shame that I can't seem to see what r1 is or get a
> backtrace. It looks like it might be possible to set some kind of
> timeout on the interconnect; has anyone tried something like that to
> debug this kind of issue?
> 
> Best
> 
> David Shah
> 
> 



* Re: [Letux-kernel] Lockup inside omap4_prminst_read_inst_reg on OMAP5 uEVM
  2020-08-01 20:57 ` David Shah
@ 2020-08-16 17:13   ` David Shah
  2020-08-17  6:38     ` Tony Lindgren
  0 siblings, 1 reply; 6+ messages in thread
From: David Shah @ 2020-08-16 17:13 UTC (permalink / raw)
  To: Discussions about the Letux Kernel, kernel, Linux-OMAP

It seems like 'CSWR' idle may never have actually worked properly on
the OMAP5...

As an experiment, I took the old TI 3.8.y GLSDK kernel,
commit 2c871a879dbb4234232126f7075468d5bf0a50e3 and made the following
changes:

 - Enabling CONFIG_CPU_IDLE as this was not in omap2plus_defconfig back
then
 - Disabling all the kernel debugging related config, as these seem to
significantly reduce the frequency of lockups
 - OSWR idle disabled, as this is known broken
 - Some small patches to get it working with gcc9, none of which
touched any power management or idle code.

And I saw lockups with an almost identical frequency to 5.6 and 5.7
with a similar config; and the same pipeline stalled error reported by
CCS when connecting over JTAG. The only difference is the reported PC
was a read instruction inside sched_clock rather
than omap4_prminst_read_inst_reg.

I'd be interested to know if there is a backstory here. Could it be
related to the bugs that stopped OSWR from working? Is there a glsdk
kernel version that I missed where CSWR on the OMAP5 actually works
reliably?

If anyone wants to try reproducing this, the most important settings
are:

 - CONFIG_CPU_IDLE=y
 - All kernel debugging settings disabled
 - CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y

This will usually result in a lockup while idle at a login prompt
within a few hours with no other hardware connected. A lockup usually
occurs sooner (within 30 minutes) when repeatedly wget'ing a 100MB test
file in a loop.

Best

David

On Sat, 2020-08-01 at 21:57 +0100, David Shah wrote:
> A tiny bit more information, if anyone has any more ideas.
> 
> I can confirm that this happened once with the device idle, and no
> networking connection.
> 
> Based on the information I have been able to extract, the call stack does
> seem to involve omap4_enter_lowpower but I can't be certain.
> 
> The main JTAG access I have is to be able to read out what seems to be
> kernel virtual memory via the other, non-locked-up but WFI, core. I
> attempted to add some tracing via writing a value to a global variable
> inside the problem function and then flushing the D$, but the delay this
> adds (or the cache flush itself) seems to stop the lockup from occurring
> most of the time. It did lock up once with this added, but then reading
> out that area of memory failed, possibly because the locked up core was
> confusing the cache coherency magic inside the cores.
> 
> Since that lock-up I added 20 NOPs after the cache flush, to try and make
> sure the cache flush really does work, and with those added it does not
> lock up at all.
> 
> Is there a better way to take advantage of this ability to read out
> memory for debugging?
> 
> Best
> 
> David
> 
> 
> On Sun, 2020-07-26 at 18:59 +0100, David Shah wrote:
> > Hi all,
> > 
> > I am looking into random lockups - significantly rarer than once a day
> > in typical usage, various patterns like lots of bursty network traffic
> > increase frequency - that affect both the uEVM and the Pyra (also
> > OMAP5432 based) on newer kernels (currently testing with 5.6 but I have
> > seen lockups with 5.7 too).
> > 
> > Currently I'm working with the uEVM as it is a bit easier to connect
> > the JTAG adapter. I managed to get a lockup with the JTAG attached, and
> > unfortunately the processor is locked up badly enough (presumably a
> > stuck memory bus?) that JTAG isn't able to get a register dump or
> > stacktrace. But I do get the following error which at least gives a
> > PC:
> > 
> > CortexA15_0: Trouble Halting Target CPU: (Error -1323 @ 0xC0223E0C)
> > Device failed to enter debug/halt mode because pipeline is stalled.
> > Power-cycle the board. If error persists, confirm configuration and/or
> > try more reliable JTAG settings (e.g. lower TCLK). (Emulation package
> > 9.2.0.00002)
> > 
> > The second core is just sitting at WFI, don't think there is anything
> > suspicious about that.
> > 
> > Looking at the kernel disassembly this is the actual register read (ldr
> > r0, [r1]) part of omap4_prminst_read_inst_reg.
> > 
> > My best guess is that it is trying to read from a register that doesn't
> > exist or isn't responding due to the current power configuration, but I
> > wonder if anyone has seen this before or has any more clues on how to
> > debug this? It's a shame that I can't seem to see what r1 is or get a
> > backtrace. It looks like it might be possible to set some kind of
> > timeout on the interconnect; has anyone tried something like that to
> > debug this kind of issue?
> > 
> > Best
> > 
> > David Shah
> > 
> > 
> 
> _______________________________________________
> https://projects.goldelico.com/p/gta04-kernel/
> Letux-kernel mailing list
> Letux-kernel@openphoenux.org
> http://lists.goldelico.com/mailman/listinfo.cgi/letux-kernel



* Re: [Letux-kernel] Lockup inside omap4_prminst_read_inst_reg on OMAP5 uEVM
  2020-08-16 17:13   ` [Letux-kernel] " David Shah
@ 2020-08-17  6:38     ` Tony Lindgren
  2020-08-17  8:18       ` David Shah
  0 siblings, 1 reply; 6+ messages in thread
From: Tony Lindgren @ 2020-08-17  6:38 UTC (permalink / raw)
  To: David Shah; +Cc: Discussions about the Letux Kernel, kernel, Linux-OMAP

Hi,

* David Shah <dave@ds0.me> [200816 20:13]:
> It seems like 'CSWR' idle may never have actually worked properly on
> the OMAP5...
> 
> As an experiment, I took the old TI 3.8.y GLSDK kernel,
> commit 2c871a879dbb4234232126f7075468d5bf0a50e3 and made the following
> changes:
> 
>  - Enabling CONFIG_CPU_IDLE as this was not in omap2plus_defconfig back
> then
>  - Disabling all the kernel debugging related config, as these seem to
> significantly reduce the frequency of lockups
>  - OSWR idle disabled, as this is known broken
>  - Some small patches to get it working with gcc9, none of which
> touched any power management or idle code.
> 
> And I saw lockups with an almost identical frequency to 5.6 and 5.7
> with a similar config; and the same pipeline stalled error reported by
> CCS when connecting over JTAG. The only difference is the reported PC
> was a read instruction inside sched_clock rather
> than omap4_prminst_read_inst_reg.
> 
> I'd be interested to know if there is a backstory here. Could it be
> related to the bugs that stopped OSWR from working? Is there a glsdk
> kernel version that I missed where CSWR on the OMAP5 actually works
> reliably?
> 
> If anyone wants to try reproducing this, the most important settings
> are:
> 
>  - CONFIG_CPU_IDLE=y
>  - All kernel debugging settings disabled
>  - CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
> 
> This will usually result in a lockup while idle at a login prompt
> within a few hours with no other hardware connected. A lockup usually
> occurs sooner (within 30 minutes) when repeatedly wget'ing a 100MB test
> file in a loop.

Care to check if this happens with current mainline kernel and sgx
disabled?

The reason I'm asking is I used a pi-top with an omap5-igep0050 board as
a test laptop with the mainline kernel for about two years until I managed
to break the UART connector on it a few years ago :) I sure had things
working reliably with no hangs with cpuidle enabled with LPAE. This was
with the pi-top HDMI panel without sgx.

Also please see if this happens with omap5-uevm too. There, the Pyra-related
DDR self-refresh hangs should be out of the picture AFAIK, but it's still
worth testing.

Regards,

Tony


* Re: [Letux-kernel] Lockup inside omap4_prminst_read_inst_reg on OMAP5 uEVM
  2020-08-17  6:38     ` Tony Lindgren
@ 2020-08-17  8:18       ` David Shah
  0 siblings, 0 replies; 6+ messages in thread
From: David Shah @ 2020-08-17  8:18 UTC (permalink / raw)
  To: Tony Lindgren; +Cc: Discussions about the Letux Kernel, kernel, Linux-OMAP

Hi Tony,

On Mon, 2020-08-17 at 09:38 +0300, Tony Lindgren wrote:
> Hi,
> 
> * David Shah <dave@ds0.me> [200816 20:13]:
> > It seems like 'CSWR' idle may never have actually worked properly on
> > the OMAP5...
> > 
> > As an experiment, I took the old TI 3.8.y GLSDK kernel,
> > commit 2c871a879dbb4234232126f7075468d5bf0a50e3 and made the following
> > changes:
> > 
> >  - Enabling CONFIG_CPU_IDLE as this was not in omap2plus_defconfig back
> > then
> >  - Disabling all the kernel debugging related config, as these seem to
> > significantly reduce the frequency of lockups
> >  - OSWR idle disabled, as this is known broken
> >  - Some small patches to get it working with gcc9, none of which
> > touched any power management or idle code.
> > 
> > And I saw lockups with an almost identical frequency to 5.6 and 5.7
> > with a similar config; and the same pipeline stalled error reported by
> > CCS when connecting over JTAG. The only difference is the reported PC
> > was a read instruction inside sched_clock rather
> > than omap4_prminst_read_inst_reg.
> > 
> > I'd be interested to know if there is a backstory here. Could it be
> > related to the bugs that stopped OSWR from working? Is there a glsdk
> > kernel version that I missed where CSWR on the OMAP5 actually works
> > reliably?
> > 
> > If anyone wants to try reproducing this, the most important settings
> > are:
> > 
> >  - CONFIG_CPU_IDLE=y
> >  - All kernel debugging settings disabled
> >  - CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
> > 
> > This will usually result in a lockup while idle at a login prompt
> > within a few hours with no other hardware connected. A lockup usually
> > occurs sooner (within 30 minutes) when repeatedly wget'ing a 100MB test
> > file in a loop.
> 
> Care to check if this happens with current mainline kernel and sgx
> disabled?
> 
> The reason I'm asking is I used a pi-top with an omap5-igep0050 board as
> a test laptop with the mainline kernel for about two years until I managed
> to break the UART connector on it a few years ago :) I sure had things
> working reliably with no hangs with cpuidle enabled with LPAE. This was
> with the pi-top HDMI panel without sgx.
> 

That's a good idea, I'll try that. 

> Also please see if this happens with omap5-uevm too. There, the Pyra-related
> DDR self-refresh hangs should be out of the picture AFAIK, but it's still
> worth testing.
> 

Most of my testing so far has been on the uEVM, due to easier JTAG access.
For some reason that I have not yet identified, the uEVM actually locks up
slightly more frequently than the Pyra.

I wonder if there is some hardware difference going on; I know a few other
people have had good experiences with the IGEP on older mainline kernels too.

> Regards,
> 
> Tony




Thread overview: 6+ messages
2020-07-26 17:59 Lockup inside omap4_prminst_read_inst_reg on OMAP5 uEVM David Shah
2020-07-27  6:47 ` Tony Lindgren
2020-08-01 20:57 ` David Shah
2020-08-16 17:13   ` [Letux-kernel] " David Shah
2020-08-17  6:38     ` Tony Lindgren
2020-08-17  8:18       ` David Shah
