* Unstable Kernel behavior on an ARM based board
@ 2019-03-02 10:44 Embedded Engineer
  2019-03-02 11:00 ` Russell King - ARM Linux admin
  2019-03-02 11:01 ` Willy Tarreau
  0 siblings, 2 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-02 10:44 UTC (permalink / raw)
  To: linux-arm-kernel

We have designed a custom board based on an ARM-based SoC (Nvidia
TK1). Much of the board design is derived from the reference kit (Jetson
TK1). But on our board, we are seeing quite unstable behavior from the
kernel (v 3.10.40 downstream). As I understand it, the issues are
not vendor specific, so I am posting them here. The two main
instabilities we are seeing are described below:

-  The board starts and shows the u-boot output normally every time.
Sometimes the kernel then boots normally after u-boot's last message,
"Starting kernel ...", and sometimes the board hangs at this point. I
tried adding a printk in the first line of start_kernel(), but I
get the output only when the system boots normally, meaning that when
the system is stuck at "Starting kernel ...", not even start_kernel() is
executing.

-  When the system is successfully booted to command line, it hangs
randomly without showing any output. The watchdog then resets the
board.

Can someone please advise on how to debug these issues?

P.S.: The issues are the same on multiple boards.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 10:44 Unstable Kernel behavior on an ARM based board Embedded Engineer
@ 2019-03-02 11:00 ` Russell King - ARM Linux admin
  2019-03-02 11:01 ` Willy Tarreau
  1 sibling, 0 replies; 63+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-02 11:00 UTC (permalink / raw)
  To: Embedded Engineer; +Cc: linux-arm-kernel

On Sat, Mar 02, 2019 at 03:44:49PM +0500, Embedded Engineer wrote:
> We have designed a custom board based on an ARM based SoC (Nvidia
> TK1). Much of the board design is derived from reference kit (Jetson
> TK1). But on our board, we are facing quite an unstable behavior of
> Kernel (v 3.10.40 downstream). As per my understanding, the issues are
> not vendor specific, so I am posting the issues here. The two main
> instabilities we are seeing are described below:
> 
> -  The board starts and shows u-boot prints normally every time, but
> sometimes the kernel boots normally after the last print by u-boot
> "Starting kernel ..." and sometimes the boards hangs at this point. I
> tried adding printk in the first line of start_kernel() function but I
> get the print only when the system boots normally, meaning that when
> the system is stuck at "Starting kernel ...", even start_kernel() is
> not executing.
> 
> -  When the system is successfully booted to command line, it hangs
> randomly without showing any output. The watchdog then resets the
> board.
> 
> Can someone please guide how to debug these issues?

I'm sorry, but 3.10.40 is almost five years old, and will be riddled
with known security holes and bugs.  For the sake of the greater good,
please move forward to a more modern kernel, such as one released in
the last six months which is in active stable maintenance.

There are way too many insecure, vulnerable products out there which
are constantly being hijacked to be part of bot networks attacking
others.  We don't need another product developed against an ancient
kernel.

I'm afraid it would be irresponsible to help in this situation.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 10:44 Unstable Kernel behavior on an ARM based board Embedded Engineer
  2019-03-02 11:00 ` Russell King - ARM Linux admin
@ 2019-03-02 11:01 ` Willy Tarreau
  2019-03-02 11:22   ` Embedded Engineer
  1 sibling, 1 reply; 63+ messages in thread
From: Willy Tarreau @ 2019-03-02 11:01 UTC (permalink / raw)
  To: Embedded Engineer; +Cc: linux-arm-kernel

On Sat, Mar 02, 2019 at 03:44:49PM +0500, Embedded Engineer wrote:
> We have designed a custom board based on an ARM based SoC (Nvidia
> TK1). Much of the board design is derived from reference kit (Jetson
> TK1). But on our board, we are facing quite an unstable behavior of
> Kernel (v 3.10.40 downstream).
          ^^^^^^^^^

So not only is this kernel from a branch that went end of life more than
a year ago, but within this now unmaintained branch, your version is
affected by more than 4000 bugs which were fixed before the branch was
dropped:

  $ git log --oneline v3.10.40..v3.10.108 | wc -l
  4044

So there is no hope that anyone will be willing to review each of them
and try to guess which ones might be responsible for the problem you're
experiencing. The first thing to do is to switch to an up-to-date kernel
which is not affected by so many known bugs. Branches 4.4, 4.9,
4.14 and 4.19 are long-term supported, so it will be a good investment
to rebase any code you have onto one of these kernels; you'll get fixes
for free for many years.

Willy


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 11:01 ` Willy Tarreau
@ 2019-03-02 11:22   ` Embedded Engineer
  2019-03-02 11:25     ` Willy Tarreau
  2019-03-02 11:36     ` Russell King - ARM Linux admin
  0 siblings, 2 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-02 11:22 UTC (permalink / raw)
  To: Willy Tarreau, linux; +Cc: linux-arm-kernel

Thanks for the response, Russell and Willy, but AFAIK the only available
Nvidia downstream kernel for our processor is 3.10.40. I tried
building the upstream kernel (4.9.x I guess) but it kept crashing even
worse so I didn't invest any more time bringing it up on our board.


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 11:22   ` Embedded Engineer
@ 2019-03-02 11:25     ` Willy Tarreau
  2019-03-02 11:46       ` Russell King - ARM Linux admin
  2019-03-02 11:36     ` Russell King - ARM Linux admin
  1 sibling, 1 reply; 63+ messages in thread
From: Willy Tarreau @ 2019-03-02 11:25 UTC (permalink / raw)
  To: Embedded Engineer; +Cc: linux, linux-arm-kernel

On Sat, Mar 02, 2019 at 04:22:48PM +0500, Embedded Engineer wrote:
> Thanks for the response, Russell and Willy, but AFAIK the only available
> Nvidia downstream kernel for our processor is 3.10.40. I tried
> building the upstream kernel (4.9.x I guess) but it kept crashing even
> worse so I didn't invest any more time bringing it up on our board.

Then if you depend on a vendor kernel, you have to deal with that vendor
since they are the only ones who know what they secretly modified in the
kernel to make it magically work sometimes.

Willy


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 11:22   ` Embedded Engineer
  2019-03-02 11:25     ` Willy Tarreau
@ 2019-03-02 11:36     ` Russell King - ARM Linux admin
  2019-03-02 11:52       ` Embedded Engineer
  1 sibling, 1 reply; 63+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-02 11:36 UTC (permalink / raw)
  To: Embedded Engineer; +Cc: Willy Tarreau, linux-arm-kernel

On Sat, Mar 02, 2019 at 04:22:48PM +0500, Embedded Engineer wrote:
> Thanks for the response, Russell and Willy, but AFAIK the only available
> Nvidia downstream kernel for our processor is 3.10.40. I tried
> building the upstream kernel (4.9.x I guess) but it kept crashing even
> worse so I didn't invest any more time bringing it up on our board.

Hi,

Please explain what you mean by "crashing even worse".  With 4.9.x,
there's the possibility to help with a currently maintained kernel.

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 11:25     ` Willy Tarreau
@ 2019-03-02 11:46       ` Russell King - ARM Linux admin
  2019-03-04 13:57         ` Thierry Reding
  0 siblings, 1 reply; 63+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-02 11:46 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Embedded Engineer, linux-arm-kernel

On Sat, Mar 02, 2019 at 12:25:54PM +0100, Willy Tarreau wrote:
> On Sat, Mar 02, 2019 at 04:22:48PM +0500, Embedded Engineer wrote:
> > Thanks for the response, Russell and Willy, but AFAIK the only available
> > Nvidia downstream kernel for our processor is 3.10.40. I tried
> > building the upstream kernel (4.9.x I guess) but it kept crashing even
> > worse so I didn't invest any more time bringing it up on our board.
> 
> Then if you depend on a vendor kernel, you have to deal with that vendor
> since they are the only ones who know what they secretly modified in the
> kernel to make it magically work sometimes.

I don't see why that would be necessary - mainline has support for the
Nvidia Jetson TK1 board, which means we have support for the SoC, so
mainline kernels should boot fine.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 11:36     ` Russell King - ARM Linux admin
@ 2019-03-02 11:52       ` Embedded Engineer
  2019-03-02 11:57         ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 63+ messages in thread
From: Embedded Engineer @ 2019-03-02 11:52 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Willy Tarreau, linux-arm-kernel

On Sat, Mar 2, 2019 at 4:36 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
> Please explain what you mean by "crashing even worse".  With 4.9.x,
> there's the possibility to help with a currently maintained kernel.

Sorry, the kernel version is 4.8.0-rc7. I have copied the zImage (v
4.8.0-rc7) and the dtb generated with this build to my board again, but
the board does not get past "Starting kernel ...". Please find the logs
below:

Hit any key to stop autoboot:  0
MMC: no card present
switch to partitions #0, OK
mmc0(part 0) is current device
Scanning mmc 0...
Found /boot/extlinux/extlinux.conf
Retrieving file: /boot/extlinux/extlinux.conf
820 bytes read in 222 ms (2.9 KiB/s)
Jetson-TK1 eMMC boot options
1:      primary kernel
Enter choice: 1
1:      primary kernel
Retrieving file: /boot/zImage
3655400 bytes read in 193 ms (18.1 MiB/s)
append: console=ttyS0,115200n8 console=tty1 no_console_suspend=1
lp0_vec=2064@0xf46ff000 mem=2015M@2048M memtype=255
ddr_die=2048M@2048M section=256M pmuboard=0x0177:0x0000:0x02:0x43:0x00
tsec=32M@3913M otf_key=c75e5bb91eb3bd947560357b64422f85
usbcore.old_scheme_first=1 core_edp_mv=1150 core_edp_ma=4000
tegraid=40.1.1.0.0 debug_uartport=lsport,3 power_supply=Adapter
audio_codec=rt5640 modem_id=0 android.kerneltype=normal fbcon=map:1
commchip_id=0 usb_port_owner_info=0 lane_owner_info=1 emc_max_dvfs=0
touch_id=0@0 board_info=0x0177:0x0000:0x02:0x43:0x00 net.ifnames=0
root=/dev/mmcblk0p1 rw rootwait tegraboot=sdmmc gpt maxcpus=0
Retrieving file: /boot/tegra124-jetson-tk1.dtb
66405 bytes read in 308 ms (210 KiB/s)
Kernel image @ 0x81000000 [ 0x000000 - 0x37c6e8 ]
## Flattened Device Tree blob at 82000000
   Booting using the fdt blob at 0x82000000
   Using Device Tree in place at 82000000, end 82013364

Starting kernel ...


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 11:52       ` Embedded Engineer
@ 2019-03-02 11:57         ` Russell King - ARM Linux admin
  2019-03-02 12:20           ` Embedded Engineer
  0 siblings, 1 reply; 63+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-02 11:57 UTC (permalink / raw)
  To: Embedded Engineer; +Cc: Willy Tarreau, linux-arm-kernel

On Sat, Mar 02, 2019 at 04:52:54PM +0500, Embedded Engineer wrote:
> On Sat, Mar 2, 2019 at 4:36 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> > Please explain what you mean by "crashing even worse".  With 4.9.x,
> > there's the possibility to help with a currently maintained kernel.
> 
> Sorry, the kernel version is 4.8.0-rc7. I have copied the zImage (v
> 4.8.0-rc7) and the dtb generated with this build to my board again but
> the board is not getting past "Starting kernel ...". Please find below
> logs:
> 
> Hit any key to stop autoboot:  0
> MMC: no card present
> switch to partitions #0, OK
> mmc0(part 0) is current device
> Scanning mmc 0...
> Found /boot/extlinux/extlinux.conf
> Retrieving file: /boot/extlinux/extlinux.conf
> 820 bytes read in 222 ms (2.9 KiB/s)
> Jetson-TK1 eMMC boot options
> 1:      primary kernel
> Enter choice: 1
> 1:      primary kernel
> Retrieving file: /boot/zImage
> 3655400 bytes read in 193 ms (18.1 MiB/s)
> append: console=ttyS0,115200n8 console=tty1 no_console_suspend=1
> lp0_vec=2064@0xf46ff000 mem=2015M@2048M memtype=255
> ddr_die=2048M@2048M section=256M pmuboard=0x0177:0x0000:0x02:0x43:0x00
> tsec=32M@3913M otf_key=c75e5bb91eb3bd947560357b64422f85
> usbcore.old_scheme_first=1 core_edp_mv=1150 core_edp_ma=4000
> tegraid=40.1.1.0.0 debug_uartport=lsport,3 power_supply=Adapter
> audio_codec=rt5640 modem_id=0 android.kerneltype=normal fbcon=map:1
> commchip_id=0 usb_port_owner_info=0 lane_owner_info=1 emc_max_dvfs=0
> touch_id=0@0 board_info=0x0177:0x0000:0x02:0x43:0x00 net.ifnames=0
> root=/dev/mmcblk0p1 rw rootwait tegraboot=sdmmc gpt maxcpus=0
> Retrieving file: /boot/tegra124-jetson-tk1.dtb
> 66405 bytes read in 308 ms (210 KiB/s)
> Kernel image @ 0x81000000 [ 0x000000 - 0x37c6e8 ]
> ## Flattened Device Tree blob at 82000000
>    Booting using the fdt blob at 0x82000000
>    Using Device Tree in place at 82000000, end 82013364
> 
> Starting kernel ...

Okay, please try enabling DEBUG_LL, SERIAL_EARLYCON and select the
correct serial port. Then arrange for u-boot to pass "earlycon" in
addition to your other kernel arguments (shown in the "append" line
above.)
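
A minimal sketch of that last step, assuming u-boot takes its kernel
arguments from the APPEND line of /boot/extlinux/extlinux.conf (the file
named in the boot log above). add_karg is a hypothetical helper name, not
something shipped with u-boot or the kernel tree; it appends an argument
to each APPEND line unless that argument is already there.

```shell
#!/bin/sh
# add_karg FILE ARG (hypothetical helper): append ARG to every APPEND line
# of an extlinux.conf-style FILE, unless the line already carries ARG as a
# separate word. All other lines are copied through unchanged.
add_karg() {
    file=$1
    arg=$2
    tmp=$(mktemp) || return 1
    awk -v arg="$arg" '
        tolower($1) == "append" {
            present = 0
            for (i = 2; i <= NF; i++)
                if ($i == arg) present = 1
            if (!present) $0 = $0 " " arg
        }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example (back the file up first):
#   add_karg /boot/extlinux/extlinux.conf earlycon
```

Running it twice leaves a single "earlycon" on the line, so it is safe to
re-run while experimenting with other arguments.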

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 11:57         ` Russell King - ARM Linux admin
@ 2019-03-02 12:20           ` Embedded Engineer
  2019-03-02 12:39             ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 63+ messages in thread
From: Embedded Engineer @ 2019-03-02 12:20 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Willy Tarreau, linux-arm-kernel

On Sat, Mar 2, 2019 at 4:57 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> Okay, please try enabling DEBUG_LL, SERIAL_EARLYCON and select the
> correct serial port. Then arrange for u-boot to pass "earlycon" in
> addition to your other kernel arguments (shown in the "append" line
> above.)

That's great :). I didn't make any change to the kernel config as
DEBUG_LL and SERIAL_EARLYCON were already enabled. I just added 'earlycon'
to the kernel arguments and boom, it booted without any crash. However,
there were some errors mounting the rootfs. Please find below a link to
the complete boot logs:

https://pastebin.com/7CfN6wLQ


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 12:20           ` Embedded Engineer
@ 2019-03-02 12:39             ` Russell King - ARM Linux admin
  2019-03-02 13:10               ` Embedded Engineer
  2019-03-02 15:07               ` Clemens Koller
  0 siblings, 2 replies; 63+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-02 12:39 UTC (permalink / raw)
  To: Embedded Engineer; +Cc: Willy Tarreau, linux-arm-kernel

On Sat, Mar 02, 2019 at 05:20:53PM +0500, Embedded Engineer wrote:
> On Sat, Mar 2, 2019 at 4:57 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > Okay, please try enabling DEBUG_LL, SERIAL_EARLYCON and select the
> > correct serial port. Then arrange for u-boot to pass "earlycon" in
> > addition to your other kernel arguments (shown in the "append" line
> > above.)
> 
> That's great :). I didn't make any change to kernel config as
> DEBUG_LL, SERIAL_EARLYCON were already enabled. Just added 'earlycon'
> to kernel arguments and boom, it booted without any crash. However
> there were some errors mounting the rootfs. Please find below link to
> complete boot logs:
> 
> https://pastebin.com/7CfN6wLQ

I think you're lucky it got that far:

tegra-ehci 7d008000.usb: dma_pool_alloc ehci_qtd, f08de000 (corrupted)
00000000: 60 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  `...............
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: 40 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: 60 03 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  `...............
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

There is a definite pattern to that corruption - first 32-bit word in
every 32-bytes - I suspect the first 32-bit transfer into the cache
ends up being corrupted.
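
The pattern can be checked mechanically. The sketch below assumes that
0xa7 is the fill value the buffer is expected to hold (as the dump
suggests) and lists every byte that differs from it. non_poison_bytes is
a hypothetical helper name.

```shell
#!/bin/sh
# non_poison_bytes (hypothetical helper): read hexdump rows of the form
# "OFFSET: b0 b1 ... b15  ascii" on stdin and print "OFFSET+INDEX" for
# each of the 16 data bytes that differs from the assumed fill value a7.
non_poison_bytes() {
    awk -v fill="a7" '
        /^[0-9a-f]+: / {
            row = substr($1, 1, length($1) - 1)   # strip the trailing ":"
            for (i = 2; i <= 17; i++)
                if ($i != fill) printf "%s+%d\n", row, i - 2
        }'
}

# The dump from the boot log, fed in unchanged.
non_poison_bytes <<'EOF'
00000000: 60 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  `...............
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: 40 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: 60 03 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  `...............
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
EOF
```

For this dump it reports bytes 0 to 3 of the rows at 0x00, 0x20 and 0x40,
and nothing else: the first 32-bit word of every 32 bytes.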

IMHO, this points to RAM timing or wiring issues.  I don't know much
about Tegra, but could it be a too-old u-boot (it reports that it's
2014, but apparently extlinux support was only merged in 2017...)?
As you've said that these are a new board design, it could also be a
design error with the RAM wiring.

I suspect at this point you may be better to wait for those who know
Tegra to reply - Thierry Reding or Jonathan Hunter.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 12:39             ` Russell King - ARM Linux admin
@ 2019-03-02 13:10               ` Embedded Engineer
  2019-03-02 15:07               ` Clemens Koller
  1 sibling, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-02 13:10 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: linux-arm-kernel

On Sat, Mar 2, 2019 at 5:39 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> I think you're lucky it got that far:

Indeed, I think the same, because on subsequent reboots the system
always gets stuck at the following:

[    6.080364] VFS: Mounted root (ext4 filesystem) on device 179:1.
[    6.091900] devtmpfs: mounted
[    6.095711] Freeing unused kernel memory: 1024K (c0800000 - c0900000)
Mount failed for selinuxfs on /sys/fs/selinux:  No such file or directory
[    6.443184] init: plymouth-upstart-bridge main process (89) terminated with status 1
[    6.452397] init: plymouth-upstart-bridge main process ended, respawning
[    6.460580] tegra-mc 70019000.memory-controller: sdmmcwab: write @0x00000000: EMEM address decode error (EMEM decode error)
[    6.515303] init: plymouth-upstart-bridge main process (100) terminated with status 1
[    6.515387] init: plymouth-upstart-bridge main process ended, respawning
[    6.536591] tegra-mc 70019000.memory-controller: sdmmcwab: write @0x00000000: EMEM address decode error (EMEM decode error)
[    6.564351] init: ureadahead main process (92) terminated with status 5
[    6.571400] init: plymouth-upstart-bridge main process (106) terminated with status 1
[    6.581434] init: plymouth-upstart-bridge main process ended, respawning
[    6.609268] init: plymouth-upstart-bridge main process (109) terminated with status 1
[    6.622510] init: plymouth-upstart-bridge main process ended, respawning
[    6.646083] init: plymouth-upstart-bridge main process (112) terminated with status 1
[    6.656656] init: plymouth-upstart-bridge main process ended, respawning
[    6.670629] mmc0: ADMA error


> IMHO, this points to RAM timing or wiring issues.  I don't know much
> about Tegra, but could it be a too-old u-boot (it reports that it's
> 2014, but apparently extlinux support was only merged in 2017...)?
> As you've said that these are a new board design, it could also be a
> design error with the RAM wiring.

I had the same thoughts at first, but my team insisted that they have
verified the RAM using the vendor-provided memory characterization and
test tool. However, the RAM tests passed only at 204 MHz and below, so we
are currently running the RAM at 204 MHz. As a further check, I
also tested 1.5 GB of RAM using the Linux stress tool (when the board
somehow managed to get to bash) and it didn't return any errors. This
made us believe that RAM is not the root cause. Maybe you can comment
on how reliable the results of the 'stress' tool are.


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 12:39             ` Russell King - ARM Linux admin
  2019-03-02 13:10               ` Embedded Engineer
@ 2019-03-02 15:07               ` Clemens Koller
  2019-03-04  5:14                 ` Embedded Engineer
  1 sibling, 1 reply; 63+ messages in thread
From: Clemens Koller @ 2019-03-02 15:07 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/03/2019 13.39, Russell King - ARM Linux admin wrote:
> On Sat, Mar 02, 2019 at 05:20:53PM +0500, Embedded Engineer wrote:
>> On Sat, Mar 2, 2019 at 4:57 PM Russell King - ARM Linux admin
>> <linux@armlinux.org.uk> wrote:
>>>
>>> Okay, please try enabling DEBUG_LL, SERIAL_EARLYCON and select the
>>> correct serial port. Then arrange for u-boot to pass "earlycon" in
>>> addition to your other kernel arguments (shown in the "append" line
>>> above.)
>>
>> That's great :). I didn't make any change to kernel config as
>> DEBUG_LL, SERIAL_EARLYCON were already enabled. Just added 'earlycon'
>> to kernel arguments and boom, it booted without any crash. However
>> there were some errors mounting the rootfs. Please find below link to
>> complete boot logs:
>>
>> https://pastebin.com/7CfN6wLQ
> 
> I think you're lucky it got that far:
> 
> tegra-ehci 7d008000.usb: dma_pool_alloc ehci_qtd, f08de000 (corrupted)
> 00000000: 60 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  `...............
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000020: 40 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000040: 60 03 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  `...............
> 00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 
> There is a definite pattern to that corruption - first 32-bit word in
> every 32-bytes - I suspect the first 32-bit transfer into the cache
> ends up being corrupted.
> 
> IMHO, this points to RAM timing or wiring issues.  I don't know much
> about Tegra, but could it be a too-old u-boot (it reports that it's
> 2014, but apparently extlinux support was only merged in 2017...)?
> As you've said that these are a new board design, it could also be a
> design error with the RAM wiring.

The behaviour looks familiar to me if RAM timing is off. The memory initialization might not be suitable for your PCB design.
You can try to exercise some in-depth memory tests from within u-boot up to the temperature extremes of your hardware, before you switch over to linux. Check out "mtest".

Regards,
Clemens


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 15:07               ` Clemens Koller
@ 2019-03-04  5:14                 ` Embedded Engineer
  2019-03-04 10:26                   ` Vladimir Murzin
  2019-03-04 14:00                   ` Andrew Lunn
  0 siblings, 2 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-04  5:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Mar 2, 2019 at 8:07 PM Clemens Koller <clemens.ml@gmx.net> wrote:
> The behaviour looks familiar to me if RAM timing is off. The memory initialization might not be suitable for your PCB design.
> You can try to exercise some in-depth memory tests from within u-boot up to the temperature extremes of your hardware, before you switch over to linux. Checkout "mtest".

Thanks Clemens, we did consider enabling mtest in our u-boot and using
it but dropped the plan because of the description given on
https://github.com/endlessm/u-boot/blob/master/doc/README.memory-test,
specifically the following statement:

"This is probably the best known memory test utility in U-Boot.
Unfortunately, it is also the most problematic, and the most useless
one."

Are you suggesting that a board redesign is the only option left?


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-04  5:14                 ` Embedded Engineer
@ 2019-03-04 10:26                   ` Vladimir Murzin
  2019-03-04 12:25                     ` Embedded Engineer
  2019-03-04 14:00                   ` Andrew Lunn
  1 sibling, 1 reply; 63+ messages in thread
From: Vladimir Murzin @ 2019-03-04 10:26 UTC (permalink / raw)
  To: Embedded Engineer, linux-arm-kernel

On 3/4/19 5:14 AM, Embedded Engineer wrote:
> On Sat, Mar 2, 2019 at 8:07 PM Clemens Koller <clemens.ml@gmx.net> wrote:
>> The behaviour looks familiar to me if RAM timing is off. The memory initialization might not be suitable for your PCB design.
>> You can try to exercise some in-depth memory tests from within u-boot up to the temperature extremes of your hardware, before you switch over to linux. Checkout "mtest".
> 
> Thanks Clemens, we did consider enabling mtest in our u-boot and using
> it but dropped the plan because of the description given on
> https://github.com/endlessm/u-boot/blob/master/doc/README.memory-test,
> specifically the following statement:
> 
> "This is probably the best known memory test utility in U-Boot.
> Unfortunately, it is also the most problematic, and the most useless
> one."
> 

You can try in-kernel memtest:

- CONFIG_MEMTEST=y
- pass memtest in kernel's command line
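
A minimal sketch for confirming after boot that the second step took
effect: it scans a kernel command line for a memtest argument.
memtest_arg is a hypothetical helper name. As far as I know the parameter
also accepts a pass count (memtest=N), and a bare "memtest" runs the full
pattern set; treat that as an assumption and check the kernel-parameters
documentation for your kernel version.

```shell
#!/bin/sh
# memtest_arg CMDLINE (hypothetical helper): print the memtest argument
# found in the given kernel command line, or nothing if it is absent.
# The body runs in a subshell so that "set -f" (no globbing while the
# command line is split into words) does not leak into the caller.
memtest_arg() (
    set -f
    for word in $1; do
        case $word in
            memtest|memtest=*) printf '%s\n' "$word" ;;
        esac
    done
)

# On the running target: which memtest argument, if any, did the kernel get?
if [ -r /proc/cmdline ]; then
    memtest_arg "$(cat /proc/cmdline)"
fi
```

If this prints nothing on the target, the argument never reached the
kernel and the absence of memtest errors says nothing about the RAM.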


Cheers
Vladimir

> Do you suggest board redesign is only option left?



* Re: Unstable Kernel behavior on an ARM based board
  2019-03-04 10:26                   ` Vladimir Murzin
@ 2019-03-04 12:25                     ` Embedded Engineer
  2019-03-04 14:25                       ` Thierry Reding
  0 siblings, 1 reply; 63+ messages in thread
From: Embedded Engineer @ 2019-03-04 12:25 UTC (permalink / raw)
  To: Vladimir Murzin; +Cc: linux-arm-kernel

On Mon, Mar 4, 2019 at 3:26 PM Vladimir Murzin <vladimir.murzin@arm.com> wrote:
>
> You can try in-kernel memtest:
>
> - CONFIG_MEMTEST=y
> - pass memtest in kernel's command line
>

Thanks Vladimir, I tried running mtest in u-boot as suggested by Clemens
and memtest in the kernel as suggested by you. Neither test showed any
errors; however, the board sometimes hangs at "Starting kernel
...". The following logs were obtained when it booted but ended in a
crash:

https://pastebin.com/sZZjUcbh

Regards


* Re: Unstable Kernel behavior on an ARM based board
  2019-03-02 11:46       ` Russell King - ARM Linux admin
@ 2019-03-04 13:57         ` Thierry Reding
  0 siblings, 0 replies; 63+ messages in thread
From: Thierry Reding @ 2019-03-04 13:57 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Embedded Engineer, Willy Tarreau, linux-arm-kernel, Jon Hunter



On Sat, Mar 02, 2019 at 11:46:52AM +0000, Russell King - ARM Linux admin wrote:
> On Sat, Mar 02, 2019 at 12:25:54PM +0100, Willy Tarreau wrote:
> > On Sat, Mar 02, 2019 at 04:22:48PM +0500, Embedded Engineer wrote:
> > > Thanks for the response, Russell and Willy, but AFAIK the only available
> > > Nvidia downstream kernel for our processor is 3.10.40. I tried
> > > building the upstream kernel (4.9.x I guess) but it kept crashing even
> > > worse so I didn't invest any more time bringing it up on our board.
> > 
> > Then if you depend on a vendor kernel, you have to deal with that vendor
> > since they are the only ones who know what they secretly modified in the
> > kernel to make it magically work sometimes.
> 
> I don't see why that would be necessary - mainline has support for the
> Nvidia Jetson TK1 board, which means we have support for the SoC, so
> mainline kernels should boot fine.

We've got automated testing in place for most of the boards we support
upstream. Most stable kernels are tested, as is linux-next. Jetson TK1
is among the boards tested and I'm not aware of any recent regressions,
other than maybe the occasional one in linux-next that usually ends up
impacting more than just Tegra.

Adding Jon who keeps better track of the test results than I do.

I should note that there are valid reasons for people wanting to stick
with the downstream kernel. The simple truth is that we lack a certain
number of features upstream, so if customers rely on those they don't
have a lot of choice. We're actively trying to close the feature gap,
but we're not quite there yet.

That said, I agree with what Russell and Willy said. Using a 3.10 kernel
as a base for product development is a bad idea.

My suggestion is to use a recent linux-next as a baseline for testing.
That's the best bet for validating that your hardware is good. Once you
have established that, we should have a brief chat about what exactly
your requirements are and then we'll have to evaluate how best to
support you.

Thierry

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-04  5:14                 ` Embedded Engineer
  2019-03-04 10:26                   ` Vladimir Murzin
@ 2019-03-04 14:00                   ` Andrew Lunn
  2019-03-04 14:27                     ` Thierry Reding
  2019-03-04 15:27                     ` Embedded Engineer
  1 sibling, 2 replies; 63+ messages in thread
From: Andrew Lunn @ 2019-03-04 14:00 UTC (permalink / raw)
  To: Embedded Engineer; +Cc: linux-arm-kernel

> Do you suggest board redesign is only option left?

Have you checked your cache configuration when you jump into the
kernel?

If I remember correctly, you need all caches turned off. I've had
problems with an ARMv5 system which got this wrong, left the caches
turned on, and so the kernel corrupted itself during startup.

       Andrew

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-04 12:25                     ` Embedded Engineer
@ 2019-03-04 14:25                       ` Thierry Reding
  2019-03-04 15:51                           ` Embedded Engineer
  2019-03-05 10:01                           ` Embedded Engineer
  0 siblings, 2 replies; 63+ messages in thread
From: Thierry Reding @ 2019-03-04 14:25 UTC (permalink / raw)
  To: Embedded Engineer
  Cc: linux-tegra, Vladimir Murzin, linux-arm-kernel, Jon Hunter


On Mon, Mar 04, 2019 at 05:25:28PM +0500, Embedded Engineer wrote:
> On Mon, Mar 4, 2019 at 3:26 PM Vladimir Murzin <vladimir.murzin@arm.com> wrote:
> >
> > You can try in-kernel memtest:
> >
> > - CONFIG_MEMTEST=y
> > - pass memtest in kernel's command line
> >
> 
> Thanks Vladimir, I tried running mtest as suggested by Clemens in
> u-boot and memtest in kernel as suggested by you. Both tests didn't
> show any errors, however the board sometime hangs at "Starting kernel
> ...". Following logs were obtained when it booted but ended in a
> crash:
> 
> https://pastebin.com/sZZjUcbh

Other than the memory corruption issue this looks like a fairly regular
boot. It's not clear whether the crash of your /sbin/init is related to
any memory issues. The earlier boot log that you had posted showed that
it was failing to mount the root filesystem and dropped you to a
maintenance shell, so that could be an indication that something isn't
right about the root filesystem. Or it could indicate that something is
wrong when loading files from the root filesystem.

The earlier log showed EMEM address decode errors, which are odd because
the addresses clearly lie in regions that should be system memory. EMEM
address decode usually only happens if the memory controller thinks you
are trying to access memory outside of system memory.
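
To make the "outside of system memory" condition concrete, here is a
minimal sketch of the kind of range check involved. The DRAM base
(0x80000000) and the 2 GiB size are assumptions for a Jetson TK1-like
layout, and addr_in_dram is a hypothetical helper, not a memory
controller or kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration: 2 GiB of DRAM starting at 0x80000000. */
#define DRAM_BASE 0x80000000ULL
#define DRAM_SIZE 0x80000000ULL

/*
 * Return true if addr falls inside [base, base + size).
 * The subtraction form avoids overflow when base + size wraps.
 */
bool addr_in_dram(uint64_t addr, uint64_t base, uint64_t size)
{
    assert(size > 0);
    return addr >= base && (addr - base) < size;
}
```

With these assumed values, an address such as 0xac000000 is inside DRAM,
so a decode error reported for an address like that would point at the
memory configuration rather than at a genuinely stray access.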

The good news is that I think you're pretty close. The memory corruption
is somewhat worrying, but at the same time it's unlikely that you'd get
as far as you do if your memory timings are completely off. However, I
think we need to gather more information to narrow down what's going
wrong.

All of the memory related configuration is part of a file called the
BCT. I think if you could provide that it would be very useful to have.
Also, it looks like you're using the Jetson TK1 device tree to boot, so
can I assume you haven't modified it at all?

Other bits of information that would be good to know are how you are
generating the BCT and your boot images, what exactly you do to flash
the board and which release of L4T you use.

Perhaps also try to run a recent linux-next just to exclude any issues
that may have been part of the 4.8.0-rc7 that you tested.

Also adding Jon and linux-tegra for a broader audience.

Thanks,
Thierry

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-04 14:00                   ` Andrew Lunn
@ 2019-03-04 14:27                     ` Thierry Reding
  2019-03-04 15:27                     ` Embedded Engineer
  1 sibling, 0 replies; 63+ messages in thread
From: Thierry Reding @ 2019-03-04 14:27 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Embedded Engineer, linux-arm-kernel


On Mon, Mar 04, 2019 at 03:00:42PM +0100, Andrew Lunn wrote:
> > Do you suggest board redesign is only option left?
> 
> Have you checked your cache configuration when you jump into the
> kernel?
> 
> If i remember correctly, you need all caches turned off. I've had
> problems with an ARM v5 system which got this wrong, left the caches
> turned on, and so the kernel corrupted itself during startup.

That's a good point. I vaguely remember this being an issue on Tegra a
long time ago and given that U-Boot on this board seems to be fairly old
it's not something I would exclude. Recent versions of U-Boot have that
fixed, so it might be worth upgrading to a later version for that reason
as well.

Thierry

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-04 14:00                   ` Andrew Lunn
  2019-03-04 14:27                     ` Thierry Reding
@ 2019-03-04 15:27                     ` Embedded Engineer
  2019-03-04 15:57                       ` Andrew Lunn
  1 sibling, 1 reply; 63+ messages in thread
From: Embedded Engineer @ 2019-03-04 15:27 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: linux-arm-kernel

On Mon, Mar 4, 2019 at 7:00 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > Do you suggest board redesign is only option left?
>
> Have you checked your cache configuration when you jump into the
> kernel?

Thanks a lot Andrew. I have tried disabling the caches by adding
CONFIG_SYS_ICACHE_OFF, CONFIG_SYS_DCACHE_OFF & CONFIG_SYS_L2CACHE_OFF
to the U-Boot config, but the board hangs after SPL with the following
output:

U-Boot SPL 2014.10-rc2 (Mar 04 2019 - 20:19:17)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-04 14:25                       ` Thierry Reding
@ 2019-03-04 15:51                           ` Embedded Engineer
  2019-03-05 10:01                           ` Embedded Engineer
  1 sibling, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-04 15:51 UTC (permalink / raw)
  To: Thierry Reding; +Cc: linux-tegra, Vladimir Murzin, linux-arm-kernel, Jon Hunter

Thanks a lot Thierry for considering this thread.

On Mon, Mar 4, 2019 at 7:25 PM Thierry Reding <thierry.reding@gmail.com> wrote:
>
> Or it could indicate that something is
> wrong when loading files from the root filesystem.

When I used the downstream kernel and L4T filesystem, there was no
problem regarding filesystem mounting.

> All of the memory related configuration is part of a file called the
> BCT. I think if you could provide that it would be very useful to have.

Please find the link to our BCT:
https://drive.google.com/open?id=1Az4nDIImCm14cnDSfHeBPlQYlYijGMrS

> Also, it looks like you're using the Jetson TK1 device tree to boot, so
> can I assume you haven't modified it at all?

For the downstream kernel I modified the dtb by generating a new pinmux
with Nvidia's dts generation tool, but for the upstream kernel I haven't
modified any dts.

> Other bits of information that would be good to know are how you are
> generating the BCT and your boot images, what exactly you do to flash
> the board and which release of L4T you use.

We run the Shmoo memory characterization tool and get a cfg file from it.
Then we convert that cfg to a BCT (using the mkbct command, I guess).
We were never able to flash the board using the nvflash/flash.sh utility, so:
1. We build and flash U-Boot and the BCT using tegra-uboot-flasher.
2. We build the kernel separately with make, using the sources available
from the Nvidia download center.
3. We use apply_binaries.sh to copy the Tegra-related files to the sample
file system downloaded from the Nvidia download center.
4. We mount the eMMC/SD card on our Linux host using U-Boot's ums command,
and copy the whole filesystem, kernel and DTB to it.

We are using R21.7.

> Perhaps also try to run a recent linux-next just to exclude any issues
> that may have been part of the 4.8.0-rc7 that you tested.

OK, I will build a kernel from linux-next and report back here.

Thanks again.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-04 15:27                     ` Embedded Engineer
@ 2019-03-04 15:57                       ` Andrew Lunn
  2019-03-04 16:03                         ` Embedded Engineer
  0 siblings, 1 reply; 63+ messages in thread
From: Andrew Lunn @ 2019-03-04 15:57 UTC (permalink / raw)
  To: Embedded Engineer; +Cc: linux-arm-kernel

On Mon, Mar 04, 2019 at 08:27:21PM +0500, Embedded Engineer wrote:
> On Mon, Mar 4, 2019 at 7:00 PM Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > > Do you suggest board redesign is only option left?
> >
> > Have you checked your cache configuration when you jump into the
> > kernel?
> 
> Thanks alot Andrew. I have tried disabling cache by adding
> CONFIG_SYS_ICACHE_OFF, CONFIG_SYS_DCACHE_OFF & CONFIG_SYS_L2CACHE_OFF
> to u-boot config but the boards hangs after SPL by giving the
> following output:
> 
> U-Boot SPL 2014.10-rc2 (Mar 04 2019 - 20:19:17)

Using caches while inside uboot is fine. You just need to ensure they
are off when you jump into the kernel.
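
As a rough sketch of the ordering meant here: write dirty data cache
lines back to RAM first, then turn the caches off, then jump. Everything
below is a stand-in made up for illustration; the stub_* functions only
record the call order and are not U-Boot functions. As far as I recall,
the kernel's ARM booting document requires the MMU and data cache to be
off, while the instruction cache may be on or off.

```c
#include <assert.h>
#include <stddef.h>

/* Steps of a hypothetical hand-off sequence, recorded for inspection. */
enum step {
    STEP_DCACHE_FLUSH,
    STEP_DCACHE_OFF,
    STEP_ICACHE_OFF,
    STEP_ICACHE_INVALIDATE,
    STEP_JUMP,
};

#define MAX_STEPS 8
enum step trace[MAX_STEPS];
size_t trace_len;

void record(enum step s)
{
    assert(trace_len < MAX_STEPS);
    trace[trace_len++] = s;
}

/* Hypothetical stand-ins for the boot loader's cache primitives. */
void stub_dcache_flush(void)      { record(STEP_DCACHE_FLUSH); }
void stub_dcache_disable(void)    { record(STEP_DCACHE_OFF); }
void stub_icache_disable(void)    { record(STEP_ICACHE_OFF); }
void stub_icache_invalidate(void) { record(STEP_ICACHE_INVALIDATE); }
void stub_kernel_entry(void)      { record(STEP_JUMP); }

/*
 * Flush before disabling, so the kernel image and DTB in RAM are
 * complete; only then hand over control.
 */
void prepare_and_jump(void (*kernel_entry)(void))
{
    stub_dcache_flush();
    stub_dcache_disable();
    stub_icache_disable();
    stub_icache_invalidate();
    kernel_entry();
}
```

The point of the flush-first order is that disabling the data cache
without writing it back could leave stale kernel image data in RAM.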

It could be that U-Boot actually requires the caches to be enabled. The
memory controller might not be running yet, so you cannot use RAM. Perhaps
the code that gets the RAM going has to run from cache?

   Andrew

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-04 15:57                       ` Andrew Lunn
@ 2019-03-04 16:03                         ` Embedded Engineer
  0 siblings, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-04 16:03 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: linux-arm-kernel

On Mon, Mar 4, 2019 at 8:57 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> It could be, u boot actually requires caches enabled. The memory
> controller might not be running yet, so you cannot use RAM. The code
> to get the RAM going has to run from cache?

AFAIK the code that initializes the memory controller and RAM runs from
internal SRAM.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-04 14:25                       ` Thierry Reding
@ 2019-03-05 10:01                           ` Embedded Engineer
  2019-03-05 10:01                           ` Embedded Engineer
  1 sibling, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-05 10:01 UTC (permalink / raw)
  To: Thierry Reding
  Cc: linux-tegra, Andrew Lunn, Vladimir Murzin, linux-arm-kernel, Jon Hunter

On Mon, Mar 4, 2019 at 7:25 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> Perhaps also try to run a recent linux-next just to exclude any issues
> that may have been part of the 4.8.0-rc7 that you tested.

Thierry, I have disabled the caches as per Andrew's suggestion by calling
dcache_disable() and icache_disable() just before kernel_entry() in the
U-Boot source. I have also built the linux-next kernel and tested it by
booting from a microSD card, but it does not get as far as the login
console and hangs midway. Please have a look at the kernel log at the
link below:

https://pastebin.com/ByuaLxTt

P.S: If I replace the zImage and DTB on the same microSD card with the
downstream ones, it successfully takes me to the login console (although
it has the hanging issues I mentioned in previous posts).

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 10:01                           ` Embedded Engineer
@ 2019-03-05 10:07                             ` Russell King - ARM Linux admin
  -1 siblings, 0 replies; 63+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 10:07 UTC (permalink / raw)
  To: Embedded Engineer
  Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
	linux-tegra, linux-arm-kernel

On Tue, Mar 05, 2019 at 03:01:35PM +0500, Embedded Engineer wrote:
> On Mon, Mar 4, 2019 at 7:25 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> > Perhaps also try to run a recent linux-next just to exclude any issues
> > that may have been part of the 4.8.0-rc7 that you tested.
> 
> Thierry I have disabled cache as per Andrew's suggestion by calling
> dcache_disable() and icache_disable() just before kernel_entry() in
> u-boot source. I have also build the linux-next kernel and tested by
> booting from microSD card but it is not going upto login console and
> hangs midway. Please have a look at kernel logs in below link:
> 
> https://pastebin.com/ByuaLxTt

Please apply this patch so we can see the (ptrval) values.  Thanks.

8<===
From: Russell King <rmk+kernel@armlinux.org.uk>
Subject: [PATCH] lib: make vsprintf print pointers without munging

Printing pointers is useful for debugging, disable this so I can debug
the kernel.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
---
 lib/vsprintf.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 37a54a6dd594..c2ae4075c786 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -687,9 +687,9 @@ early_initcall(initialize_ptr_random);
 static char *ptr_to_id(char *buf, char *end, const void *ptr,
 		       struct printf_spec spec)
 {
-	const char *str = sizeof(ptr) == 8 ? "(____ptrval____)" : "(ptrval)";
 	unsigned long hashval;
 
+#if 0
 	/* When debugging early boot use non-cryptographically secure hash. */
 	if (unlikely(debug_boot_weak_hash)) {
 		hashval = hash_long((unsigned long)ptr, 32);
@@ -697,6 +697,7 @@ static char *ptr_to_id(char *buf, char *end, const void *ptr,
 	}
 
 	if (static_branch_unlikely(&not_filled_random_ptr_key)) {
+		const char *str = sizeof(ptr) == 8 ? "(____ptrval____)" : "(ptrval)";
 		spec.field_width = 2 * sizeof(ptr);
 		/* string length must be less than default_width */
 		return string(buf, end, str, spec);
@@ -712,6 +713,9 @@ static char *ptr_to_id(char *buf, char *end, const void *ptr,
 #else
 	hashval = (unsigned long)siphash_1u32((u32)ptr, &ptr_key);
 #endif
+#else
+	hashval = (unsigned long)ptr;
+#endif
 	return pointer_string(buf, end, (const void *)hashval, spec);
 }
 
-- 
2.7.4

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
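
For anyone reading the patch above without the kernel source at hand, its
effect can be sketched in user space roughly as follows. This is a toy
model, not the kernel code: toy_hash stands in for the keyed hash,
pointers are modelled as 32-bit values, and format_ptr, is_placeholder
and enum ptr_mode are names invented for the example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PTR_PLACEHOLDER "(ptrval)"

enum ptr_mode { PTR_RAW, PTR_HASHED };

/* Hypothetical toy hash standing in for the kernel's keyed hash. */
uint32_t toy_hash(uint32_t value, uint32_t key)
{
    return (value ^ key) * 0x9e3779b1u;
}

/*
 * Format a 32-bit pointer value into buf.
 *  - PTR_RAW: print the real value (what the patch switches to).
 *  - PTR_HASHED with no key yet: print the "(ptrval)" placeholder.
 *  - PTR_HASHED with a key: print a hashed value instead.
 */
int format_ptr(char *buf, size_t len, uint32_t ptr,
               enum ptr_mode mode, bool key_ready, uint32_t key)
{
    assert(buf != NULL && len > 0);

    if (mode == PTR_RAW)
        return snprintf(buf, len, "%08" PRIx32, ptr);
    if (!key_ready)
        return snprintf(buf, len, "%s", PTR_PLACEHOLDER);
    return snprintf(buf, len, "%08" PRIx32, toy_hash(ptr, key));
}

/* True if a formatted field still shows the placeholder. */
bool is_placeholder(const char *s)
{
    return strcmp(s, PTR_PLACEHOLDER) == 0;
}
```

In this model the early boot log shows "(ptrval)" because the key is not
ready yet, and the raw mode is what makes the addresses readable.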

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 10:07                             ` Russell King - ARM Linux admin
@ 2019-03-05 10:29                               ` Embedded Engineer
  -1 siblings, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-05 10:29 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
	linux-tegra, linux-arm-kernel

On Tue, Mar 5, 2019 at 3:07 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> Please apply this patch so we can see the (ptrval) values.  Thanks.

Please find the logs after applying the patch below:

https://pastebin.com/6TaBxPX5

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 10:01                           ` Embedded Engineer
  (?)
  (?)
@ 2019-03-05 10:32                           ` Thierry Reding
  2019-03-05 11:05                               ` Embedded Engineer
  -1 siblings, 1 reply; 63+ messages in thread
From: Thierry Reding @ 2019-03-05 10:32 UTC (permalink / raw)
  To: Embedded Engineer
  Cc: linux-tegra, Andrew Lunn, Vladimir Murzin, linux-arm-kernel, Jon Hunter


On Tue, Mar 05, 2019 at 03:01:35PM +0500, Embedded Engineer wrote:
> On Mon, Mar 4, 2019 at 7:25 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> > Perhaps also try to run a recent linux-next just to exclude any issues
> > that may have been part of the 4.8.0-rc7 that you tested.
> 
> Thierry I have disabled cache as per Andrew's suggestion by calling
> dcache_disable() and icache_disable() just before kernel_entry() in
> u-boot source. I have also build the linux-next kernel and tested by
> booting from microSD card but it is not going upto login console and
> hangs midway. Please have a look at kernel logs in below link:
> 
> https://pastebin.com/ByuaLxTt

Okay, looks fairly normal so far, except for the corrupted data. That's
definitely not normal and I think we need to fix that first, otherwise
we can't really be certain what's going on later.

Besides the memory timings in the BCT, one other thing that comes to mind
as a possible cause of memory corruption is the power supplies. Are you
sure they're all correctly configured and enabled as required? It might
be worth looking at all of them and marking them "regulator-always-on"
just to make sure an essential one isn't disabled inadvertently during
boot. The corruption happens long before unused regulators are disabled,
so that doesn't sound like it would be very relevant here. But perhaps
best to check it anyway, just in case.

> P.S: If I replace zImage and DTB of downstream same microSD card, it
> successfully takes me to login console (although it has hanging issues
> as I mentioned in previous posts)

Do the upstream kernel and DTB boot reliably, even if they don't get
you to a login prompt? Or do they also behave erratically like the
downstream kernel and DTB that you have?

Thierry

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 10:32                           ` Thierry Reding
@ 2019-03-05 11:05                               ` Embedded Engineer
  0 siblings, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-05 11:05 UTC (permalink / raw)
  To: Thierry Reding
  Cc: linux-tegra, Andrew Lunn, Vladimir Murzin, linux-arm-kernel, Jon Hunter

On Tue, Mar 5, 2019 at 3:32 PM Thierry Reding <thierry.reding@gmail.com> wrote:
>
> One thing besides memory timings in BCT that comes to mind that could be
> causing memory corruption are power supplies. Are you sure they're all
> correctly configured and enabled as required?

This part is 100% the same as the Jetson TK1 on the hardware side. In the
device tree, the node 'vdd_1v35_lp0: sd2' already has the
'regulator-always-on' property. We also once used an oscilloscope to
check whether the power drops or fluctuates during operation, and saw
that the DDR chips were getting stable power.

> Does the upstream kernel and DTB boot reliably, even if it doesn't get
> you to a login prompt? Or does it also behave erratically like the
> downstream kernel and DTB that you have?

2 out of 10 times it behaved erratically: once it got stuck at
'Starting kernel ...' and once it got stuck after the following
prints:

Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 5.0.0-rc8-next-20190304-dirty (teresol@ubuntu) (gcc version 6.1.1 20160711 (Linaro GCC 6.1-2016.08)) #2 SMP PREEMPT Tue Mar 5 02:15:14 PST 2019
[    0.000000] CPU: ARMv7 Processor [413fc0f3] revision 3 (ARMv7), cr=10c5387d
[    0.000000] CPU: div instructions available: patching division code
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
[    0.000000] OF: fdt: Machine model: NVIDIA Tegra124 Jetson TK1
[    0.000000] earlycon: uart0 at MMIO 0x70006300 (options '115200n8')
[    0.000000] printk: bootconsole [uart0] enabled
[    0.000000] Memory policy: Data cache writealloc
[    0.000000] cma: Reserved 64 MiB at 0xac000000

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 10:29                               ` Embedded Engineer
  (?)
@ 2019-03-05 11:20                               ` Thierry Reding
  -1 siblings, 0 replies; 63+ messages in thread
From: Thierry Reding @ 2019-03-05 11:20 UTC (permalink / raw)
  To: Embedded Engineer
  Cc: Andrew Lunn, Vladimir Murzin, Russell King - ARM Linux admin,
	Jon Hunter, linux-tegra, linux-arm-kernel


[-- Attachment #1.1: Type: text/plain, Size: 3924 bytes --]

On Tue, Mar 05, 2019 at 03:29:26PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 3:07 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > Please apply this patch so we can see the (ptrval) values.  Thanks.
> 
> Please find below logs after applying patch:
> 
> https://pastebin.com/6TaBxPX5

Hm... so it looks like what you're getting here is the error spew from
the DMA pool debug code in mm/dmapool.c. The way I understand it, that
code initializes the memory of each page allocated for the pool with
POOL_POISON_FREED (0xa7) (see pool_alloc_page()) and then, upon adding
the page to the pool list, stores the offset in the page->offset field
and checks the contents of the page.

The contents of the page then don't match the expected poison. The dump
of the corrupted memory is somewhat confusing because the values that
don't match the poison are actually expected, at least partially. From
my reading of the DMA pool code, the first four bytes store the offset
of the DMA block into the physical memory page. However, given the size
of the hexdump, it looks like the pool was allocated with a block size
of 64 bytes, which matches the code in drivers/usb/chipidea/udc.c that
allocates the "ci_hw_qh" pool.
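
The poison check described above can be sketched roughly as follows.
This is a minimal illustration, not the code in mm/dmapool.c: the
function name, the POISON_FREED macro and the `skip` parameter are made
up for the example; only the 0xa7 value and the 64-byte block size come
from the discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_FREED 0xa7u  /* poison value quoted in the message above */

/*
 * Return the offset of the first byte in a free block that is not the
 * poison value, or -1 if the block is intact. The first `skip` bytes
 * hold the next-free offset and are not expected to be poisoned.
 * (Hypothetical helper, for illustration only.)
 */
long first_poison_mismatch(const uint8_t *block, size_t size, size_t skip)
{
    assert(block != NULL && skip <= size);
    for (size_t i = skip; i < size; i++) {
        if (block[i] != POISON_FREED)
            return (long)i;
    }
    return -1;
}
```

Fed a 64-byte block shaped like the dumps in this thread, with skip set
to 4, a helper like this would report offset 0x20 as the first bad byte.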

What's strange here, though, is that the offset that's stored to the
first four bytes of a block seems to actually be stored twice per block.
The first offset seems to be correct, since it's apparently used to find
the offset of the next block to allocate. If you look at the first
corrupted hexdump:

  [    1.327553] tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056080 (corrupted)
  [    1.335058] 00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.343077] 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.351095] 00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.359113] 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

This is the entry for the block at offset 0x00000080 and the offset for
the next block is 0x000000c0, which is exactly 64 bytes after the
current block. However, if you then look at the second offset that's
stored at offset 0x00000020 in the block, it's 0x00000080, which does
match the offset of the current block, but I think that may just be
coincidence. The same coincidence happens for the second corrupted
block:

  [    1.367210] tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056140 (corrupted)
  [    1.374709] 00000000: 80 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.382727] 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.390744] 00000020: 40 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
  [    1.398760] 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

But not for the third:

  [    1.406965] tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec0561c0 (corrupted)
  [    1.414466] 00000000: 00 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.422483] 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.430502] 00000020: 40 03 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
  [    1.438519] 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

The fact that we see the offset stored at offset 0x20 in each block
makes me think there's perhaps some sort of aliasing happening here. But
I'm not sure how the system would even boot this far if aliasing was
really the problem. Things should be falling apart much sooner if that's
really what's going on here.

However, this sort of aliasing is not something that your typical memory
test will catch, so it could explain why they aren't reporting any
errors.
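
For illustration, one common way to make a memory test sensitive to
aliasing is to store a unique value (such as the word's own index) in
every location and only read back after all writes are done. The sketch
below assumes a plain buffer and hypothetical function names; on real
hardware the caches would also have to be bypassed or flushed, which is
not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write each word's own index into it, so every location holds a unique value. */
void fill_with_own_index(uint32_t *buf, size_t words)
{
    assert(buf != NULL);
    for (size_t i = 0; i < words; i++)
        buf[i] = (uint32_t)i;
}

/*
 * Return the index of the first word that no longer holds its own
 * index, or -1 if all words are intact. If two addresses alias, one of
 * them ends up holding the other's value and shows up here.
 */
long first_alias_mismatch(const uint32_t *buf, size_t words)
{
    assert(buf != NULL);
    for (size_t i = 0; i < words; i++) {
        if (buf[i] != (uint32_t)i)
            return (long)i;
    }
    return -1;
}
```

A test that writes the same pattern everywhere, or that reads each word
back right after writing it, would not notice a write landing in a
second location as well.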

Thierry

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 10:29                               ` Embedded Engineer
@ 2019-03-05 11:22                                 ` Russell King - ARM Linux admin
  -1 siblings, 0 replies; 63+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 11:22 UTC (permalink / raw)
  To: Embedded Engineer
  Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
	linux-tegra, linux-arm-kernel

On Tue, Mar 05, 2019 at 03:29:26PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 3:07 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > Please apply this patch so we can see the (ptrval) values.  Thanks.
> 
> Please find below logs after applying patch:
> 
> https://pastebin.com/6TaBxPX5

So we have a pattern here:

tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056080 (corrupted)
00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056140 (corrupted)
00000000: 80 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: 40 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec0561c0 (corrupted)
00000000: 00 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: 40 03 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056200 (corrupted)
00000000: 40 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: 40 05 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

and so it goes on.

The first four bytes are the offset to the next free block of memory in
this page, so can be ignored.  The remainder of the bytes should all be
0xa7, but every word at offset 32 into these is corrupted with what
looks to be a similar offset.

We dump 0x40 bytes, which, reading the code, means the pool's block
size is 0x40 bytes.  The table below lists the object offset, the next
offset, and the corruption at offset 32.  Corruption1 is from your
latest log; corruption2 is derived from your previous log, using the
next pointer to tie up between the two:

object offset	next		corruption1	corruption2
0x0080		0x00c0		0x00000080	0x00000080
0x0140		0x0180		0x00000140	0x00000100
0x01c0		0x0200		0x00000340	0x000001c0
0x0200		0x0240		0x00000540	0x000001c0
0x0280		0x02c0		0x00000340	0x00000300
0x0340		0x0380		0x00000540	0x00000140
0x03c0		0x0400		0x00000540	0x00000300
0x0400		0x0440		0x000003c0	0x00000140
0x0480		0x04c0		0x00000540	0x000003c0
0x0540		0x0580		0x00000480	0x00000540
0x05c0		0x0600		0x000005c0	0x000005c0
0x0600		0x0640		0x00000500	0x000005c0
0x0680		0x06c0		0x00000740	0x00000680
??????		0x0780				0x00000740
0x07c0		0x0800		0x000007c0	0x00000700

The corruption looks very much like offset values, except they do not
seem to follow any rhyme or reason.  They also appear to be different
on each boot.
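
The values in the table are the 32-bit words from the hexdumps read as
little-endian. A minimal sketch of that decoding, assuming a
hypothetical helper name and a byte array holding one dumped block:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the little-endian 32-bit word starting at byte `off` of a dumped block. */
uint32_t le32_at(const uint8_t *block, size_t size, size_t off)
{
    assert(block != NULL && off + 4 <= size);
    return (uint32_t)block[off]
         | ((uint32_t)block[off + 1] << 8)
         | ((uint32_t)block[off + 2] << 16)
         | ((uint32_t)block[off + 3] << 24);
}
```

For the block at ec0561c0, the bytes "00 02 00 00" at offset 0 decode to
0x200 (the next offset) and "40 03 00 00" at offset 0x20 decode to 0x340
(the corruption1 column).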

Given that the sequence here when a pool allocation occurs is:

1. allocate DMA coherent page
2. memset entire page with 0xa7
3. write next offsets
4. initialise 'offset' to zero (offset of first free object)
5. add page to pool's list of pages
6. allocate first object, updating offset to the next free offset read
   from the first word of the object.

then when the next allocation request comes along, we allocate the
next object in the same way as step 6.  At the point of allocating the
third object, we find that there is corruption in the third object at
0x20 bytes into it - or 0xa0 bytes into the page.
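
The sequence above can be modelled in a few lines of userspace C. This
is a simplified sketch, not the kernel implementation: the struct, the
function names and the 4096-byte page size are assumptions, boundary
handling is left out, and allocated blocks are not re-poisoned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES  4096u
#define BLOCK_BYTES 64u     /* block size inferred from the 0x40-byte dumps */
#define POISON      0xa7

struct sim_page {
    uint8_t  mem[PAGE_BYTES];
    uint32_t offset;        /* offset of the first free block */
};

/* Steps 2-4: poison the page, chain the free blocks, reset the free offset. */
void sim_page_init(struct sim_page *p)
{
    memset(p->mem, POISON, sizeof(p->mem));
    for (uint32_t off = 0; off < PAGE_BYTES; off += BLOCK_BYTES) {
        uint32_t next = off + BLOCK_BYTES;
        memcpy(&p->mem[off], &next, sizeof(next));
    }
    p->offset = 0;
}

/* Step 6: hand out the first free block and follow its stored next offset. */
uint32_t sim_alloc(struct sim_page *p)
{
    uint32_t got = p->offset;
    assert(got + BLOCK_BYTES <= PAGE_BYTES);    /* page not exhausted */
    memcpy(&p->offset, &p->mem[got], sizeof(p->offset));
    return got;
}
```

In this model only the first four bytes of each free block ever differ
from the poison value, so nothing in these steps writes a word at offset
0x20 of a block.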

Now, what does the driver that's allocating these do with them?  That
is done via init_eps() in drivers/usb/chipidea/udc.c, which doesn't do
anything with the allocated memory.  This is the only place that the
driver allocates from this DMA pool, which is done in a loop, so we
know that the objects allocated from this pool will be in relatively
quick succession.

So this does not make sense.

I really doubt that there is anything wrong with the kernel - this USB
driver is used on other SoCs (such as iMX6) and does not exhibit this
problem - it also works on the Tegra TK1 platform as well.

You are definitely seeing memory corruption here - but given what the
above looks like, I'd put forward another possible scenario - maybe
u-boot or something else is leaving a USB controller or some other DMA
agent active, which is writing over memory while the kernel is trying
to boot, resulting in memory corruption.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 11:05                               ` Embedded Engineer
  (?)
@ 2019-03-05 11:36                               ` Thierry Reding
  -1 siblings, 0 replies; 63+ messages in thread
From: Thierry Reding @ 2019-03-05 11:36 UTC (permalink / raw)
  To: Embedded Engineer
  Cc: linux-tegra, Andrew Lunn, Vladimir Murzin, linux-arm-kernel, Jon Hunter


[-- Attachment #1.1: Type: text/plain, Size: 2213 bytes --]

On Tue, Mar 05, 2019 at 04:05:14PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 3:32 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> >
> > One thing besides memory timings in BCT that comes to mind that could be
> > causing memory corruption are power supplies. Are you sure they're all
> > correctly configured and enabled as required?
> 
> This part is 100% same as the Jetson TK1 on hardware end. And in
> device tree, the node 'vdd_1v35_lp0: sd2' has already
> 'regulator-always-on' property. We also tried once by using
> oscilloscope to check if the power drops/fluctuates during operation
> but noticed that DDR chips were getting stable power.

Okay, sounds like that's not relevant here, then.

> > Does the upstream kernel and DTB boot reliably, even if it doesn't get
> > you to a login prompt? Or does it also behave erratically like the
> > downstream kernel and DTB that you have?
> 
> 2 out of 10 times it behaved erratically, i.e. one time it stuck at
> 'Starting kernel ...' and the other time it stuck after following
> prints:
> 
> Starting kernel ...
> 
> [    0.000000] Booting Linux on physical CPU 0x0
> [    0.000000] Linux version 5.0.0-rc8-next-20190304-dirty
> (teresol@ubuntu) (gcc version 6.1.1 20160711 (Linaro GCC 6.1-2016.08))
> #2 SMP PREEMPT Tue Mar 5 02:15:14 PST 2019
> [    0.000000] CPU: ARMv7 Processor [413fc0f3] revision 3 (ARMv7), cr=10c5387d
> [    0.000000] CPU: div instructions available: patching division code
> [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
> [    0.000000] OF: fdt: Machine model: NVIDIA Tegra124 Jetson TK1
> [    0.000000] earlycon: uart0 at MMIO 0x70006300 (options '115200n8')
> [    0.000000] printk: bootconsole [uart0] enabled
> [    0.000000] Memory policy: Data cache writealloc
> [    0.000000] cma: Reserved 64 MiB at 0xac000000

Okay, this could corroborate the aliasing hypothesis. If aliasing is
really the problem, it would most likely indicate an issue in the BCT
that happened as part of shmooing. I'm not very familiar with the tests
run as part of the Shmoo suite, but I would've hoped that it contains
tests for aliasing.

Thierry

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 11:22                                 ` Russell King - ARM Linux admin
  (?)
@ 2019-03-05 11:57                                 ` Thierry Reding
  2019-03-05 13:16                                     ` Embedded Engineer
  -1 siblings, 1 reply; 63+ messages in thread
From: Thierry Reding @ 2019-03-05 11:57 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Embedded Engineer, Vladimir Murzin, Andrew Lunn, Jon Hunter,
	linux-tegra, linux-arm-kernel


[-- Attachment #1.1: Type: text/plain, Size: 7119 bytes --]

On Tue, Mar 05, 2019 at 11:22:26AM +0000, Russell King - ARM Linux admin wrote:
> On Tue, Mar 05, 2019 at 03:29:26PM +0500, Embedded Engineer wrote:
> > On Tue, Mar 5, 2019 at 3:07 PM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
> > >
> > > Please apply this patch so we can see the (ptrval) values.  Thanks.
> > 
> > Please find below logs after applying patch:
> > 
> > https://pastebin.com/6TaBxPX5
> 
> So we have a pattern here:
> 
> tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056080 (corrupted)
> 00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056140 (corrupted)
> 00000000: 80 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000020: 40 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec0561c0 (corrupted)
> 00000000: 00 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000020: 40 03 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056200 (corrupted)
> 00000000: 40 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000020: 40 05 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 
> and so it goes on.
> 
> The first four bytes are the offset to the next free block of memory in
> this page, so can be ignored.  The remainder of the bytes should all be
> 0xa7, but every word at offset 32 into these is corrupted with what
> looks to be a similar offset.
> 
> We dump 0x40 bytes, which, reading the code makes the pool size 0x40
> bytes in size.  Tabulating the object offset, the next offset, and
> the corruption at offset 32.  Corruption1 is from your latest log,
> corruption2 is derived from your previous log using the next pointer
> to tie up between the two:
> 
> object offset	next		corruption1	corruption2
> 0x0080		0x00c0		0x00000080	0x00000080
> 0x0140		0x0180		0x00000140	0x00000100
> 0x01c0		0x0200		0x00000340	0x000001c0
> 0x0200		0x0240		0x00000540	0x000001c0
> 0x0280		0x02c0		0x00000340	0x00000300
> 0x0340		0x0380		0x00000540	0x00000140
> 0x03c0		0x0400		0x00000540	0x00000300
> 0x0400		0x0440		0x000003c0	0x00000140
> 0x0480		0x04c0		0x00000540	0x000003c0
> 0x0540		0x0580		0x00000480	0x00000540
> 0x05c0		0x0600		0x000005c0	0x000005c0
> 0x0600		0x0640		0x00000500	0x000005c0
> 0x0680		0x06c0		0x00000740	0x00000680
> ??????		0x0780				0x00000740
> 0x07c0		0x0800		0x000007c0	0x00000700
> 
> The corruption looks very much like offset values, except they do not
> seem to follow any rhyme or reason.  They also appear to be different
> on each boot.
> 
> Given that the sequence here when a pool allocation occurs is:
> 
> 1. allocate DMA coherent page
> 2. memset entire page with 0xa7
> 3. write next offsets
> 4. initialise 'offset' to zero (offset of first free object)
> 5. add page to pools list of pages
> 6. allocate first object, updating offset to the next free offset read
>    from the first word of the object.
> 
> then when the next allocation request comes along, we allocate the
> next object in the same way as step 6.  At the point of allocating the
> third object, we find that there is corruption in the third object at
> 0x20 bytes into it - or 0xa0 bytes into the page.
> 
> Now, what does the driver that's allocating these do with them?  That
> is done via init_eps() in drivers/usb/chipidea/udc.c, which doesn't do
> anything with the allocated memory.  This is the only place that the
> driver allocates from this DMA pool, which is done in a loop, so we
> know that the objects allocated from this pool will be in relatively
> quick succession.
> 
> So this does not make sense.
> 
> I really doubt that there is anything wrong with the kernel - this USB
> driver is used on other SoCs (such as iMX6) and does not exhibit this
> problem - it also works on the Tegra TK1 platform as well.
> 
> You are definitely seeing memory corruption here - but given what the
> above looks like, I'd put forward another possible scenario - maybe
> u-boot or something else is leaving a USB controller or some other DMA
> agent active, which is writing over memory while the kernel is trying
> to boot, resulting in memory corruption.

That had occurred to me as well. The kernel command line contains a
couple of memory regions that I think our downstream kernel parses and
uses to reserve memory (reformatted here for readability):

	console=ttyS0,115200n8
	console=tty1
	no_console_suspend=1
	lp0_vec=2064@0xf46ff000
	mem=2015M@2048M
	memtype=255
	ddr_die=2048M@2048M
	section=256M
	pmuboard=0x0177:0x0000:0x02:0x43:0x00
	tsec=32M@3913M
	otf_key=c75e5bb91eb3bd947560357b64422f85
	usbcore.old_scheme_first=1
	core_edp_mv=1150
	core_edp_ma=4000
	tegraid=40.1.1.0.0
	debug_uartport=lsport,3
	power_supply=Adapter
	audio_codec=rt5640
	modem_id=0
	android.kerneltype=normal
	fbcon=map:1
	commchip_id=0
	usb_port_owner_info=0
	lane_owner_info=6
	emc_max_dvfs=0
	touch_id=0@0
	board_info=0x0177:0x0000:0x02:0x43:0x00
	net.ifnames=0
	root=/dev/mmcblk1p1
	rw
	rootwait
	tegraboot=sdmmc
	gpt
	maxcpus=0
	pci=noaer

Two things stand out here:

	mem=2015M@2048M
	tsec=32M@3913M

So it looks like there are two carveout regions that the kernel isn't
supposed to touch and presumably somebody else could be using them. If
there's overlap between them and the DMA memory used by the DMA pool,
that could perhaps explain what's going on here.
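
As a rough aid for checking such an overlap by hand, the "size@base"
notation can be parsed and compared against an address as sketched
below. The struct and function names are hypothetical, and the K/M/G
suffix handling is an assumption about the notation, not the downstream
kernel's parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct region {
    uint64_t base;
    uint64_t size;
};

/* Parse a number with an optional K/M/G suffix; *end is set past what was consumed. */
uint64_t parse_size(const char *s, char **end)
{
    uint64_t v = strtoull(s, end, 0);
    switch (**end) {
    case 'G': v <<= 10; /* fall through */
    case 'M': v <<= 10; /* fall through */
    case 'K': v <<= 10; (*end)++; break;
    default: break;
    }
    return v;
}

/* Parse "size@base" (e.g. "32M@3913M"); returns false if the '@' is missing. */
bool parse_region(const char *s, struct region *r)
{
    char *end;

    assert(s != NULL && r != NULL);
    r->size = parse_size(s, &end);
    if (*end != '@')
        return false;
    r->base = parse_size(end + 1, &end);
    return true;
}

/* True if addr lies inside [base, base + size). */
bool region_contains(const struct region *r, uint64_t addr)
{
    return addr >= r->base && addr - r->base < r->size;
}
```

With this, "32M@3913M" parses to a 0x2000000-byte region starting at
0xf4900000, and region_contains() can be applied to a DMA address once
it is known.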

Can you try the following patch and send the boot log again?

Thanks,
Thierry

--- >8 ---
diff --git a/mm/dmapool.c b/mm/dmapool.c
index 76a160083506..6343d74cb963 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -361,11 +361,11 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 				continue;
 			if (pool->dev)
 				dev_err(pool->dev,
-					"dma_pool_alloc %s, %p (corrupted)\n",
-					pool->name, retval);
+					"dma_pool_alloc %s, %px/%pad (corrupted)\n",
+					pool->name, retval, handle);
 			else
-				pr_err("dma_pool_alloc %s, %p (corrupted)\n",
-					pool->name, retval);
+				pr_err("dma_pool_alloc %s, %px/%pad (corrupted)\n",
+					pool->name, retval, handle);
 
 			/*
 			 * Dump the first 4 bytes even if they are not

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 11:57                                 ` Thierry Reding
@ 2019-03-05 13:16                                     ` Embedded Engineer
  0 siblings, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-05 13:16 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Andrew Lunn, Vladimir Murzin, Russell King - ARM Linux admin,
	Jon Hunter, linux-tegra, linux-arm-kernel

That was quite an in-depth analysis that you shared, and it took me some
time to get my head around it :)

On Tue, Mar 5, 2019 at 4:57 PM Thierry Reding <thierry.reding@gmail.com> wrote:
>
> Can you try the following patch and send the boot log again?

Please check the following logs after applying your patch:

https://pastebin.com/hGGKZcLU

Sorry to add to your confusion, but the board now occasionally gets
stuck at the following point:

U-Boot SPL 2014.10-rc2 (Mar 05 2019 - 14:29:35)

U-Boot 2014.10-rc2 (Mar 05 2019 - 14:29:35)

TEGRA124
Board: NVIDIA Jetson TK1
DRAM:

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 13:16                                     ` Embedded Engineer
@ 2019-03-05 13:23                                       ` Russell King - ARM Linux admin
  -1 siblings, 0 replies; 63+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 13:23 UTC (permalink / raw)
  To: Embedded Engineer
  Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
	linux-tegra, linux-arm-kernel

On Tue, Mar 05, 2019 at 06:16:38PM +0500, Embedded Engineer wrote:
> That was quite an in-depth analysis that you shared and took some time
> get my head around it :)
> 
> On Tue, Mar 5, 2019 at 4:57 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> >
> > Can you try the following patch and send the boot log again?
> 
> Please check the following logs after applying your patch:
> 
> https://pastebin.com/hGGKZcLU

So they're at 0xec056XXX virtual, 0xac056XXX physical, which is about
704MiB into system memory, and nowhere near either of the two regions
that Thierry identified.
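
The arithmetic behind those numbers can be sketched as below. It assumes
the kernel linear map starts at 0xc0000000 and DRAM starts at 0x80000000
(2048M); both constants and the function names are assumptions chosen to
match the addresses quoted here, not values read from the running
kernel.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_ASSUMED 0xc0000000u  /* assumed start of the kernel linear map */
#define PHYS_OFFSET_ASSUMED 0x80000000u  /* assumed start of DRAM (2048M) */

/* Translate a linear-map kernel virtual address to a physical address. */
uint32_t lowmem_virt_to_phys(uint32_t virt)
{
    assert(virt >= PAGE_OFFSET_ASSUMED);
    return virt - PAGE_OFFSET_ASSUMED + PHYS_OFFSET_ASSUMED;
}

/* Distance of a physical address from the start of DRAM, in whole MiB. */
uint32_t mib_into_dram(uint32_t phys)
{
    assert(phys >= PHYS_OFFSET_ASSUMED);
    return (phys - PHYS_OFFSET_ASSUMED) >> 20;
}
```

Under these assumptions 0xec056080 maps to 0xac056080, which is 704 MiB
past the start of DRAM.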

> Sorry to add more to your confusion, now the board is getting stuck
> once in a while at following:
> 
> U-Boot SPL 2014.10-rc2 (Mar 05 2019 - 14:29:35)
> 
> U-Boot 2014.10-rc2 (Mar 05 2019 - 14:29:35)
> 
> TEGRA124
> Board: NVIDIA Jetson TK1
> DRAM:

Is there no later u-boot you can use to rule that out?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 13:23                                       ` Russell King - ARM Linux admin
@ 2019-03-05 13:32                                         ` Embedded Engineer
  -1 siblings, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-05 13:32 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
	linux-tegra, linux-arm-kernel

On Tue, Mar 5, 2019 at 6:23 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> Is there no later u-boot you can use to rule that out?

This u-boot was working just fine with our board, so I didn't try
updating it to a newer version. Also, I guess the downstream u-boot has
different text base addresses than the mainline ones, so I didn't put
any effort into that.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 13:32                                         ` Embedded Engineer
@ 2019-03-05 14:23                                           ` Russell King - ARM Linux admin
  -1 siblings, 0 replies; 63+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 14:23 UTC (permalink / raw)
  To: Embedded Engineer
  Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
	linux-tegra, linux-arm-kernel

On Tue, Mar 05, 2019 at 06:32:19PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 6:23 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > Is there no later u-boot you can use to rule that out?
> 
> This u-boot was working just fine with our board, so I didn't try
> updating it to a newer version. Also, I guess the downstream u-boot has
> different text base addresses than the mainline ones, so I didn't put
> any effort into that.

As it is also suffering from the "hanging" issue, it seems that the
problem is not specific to the kernel.

It leaves only a few possible causes:

1. The board firmware (including u-boot) is enabling some DMA that is
   causing corruption of some RAM.

2. You really do have an issue between the CPU and RAM causing
   random-ish data corruption.

It may be worth getting mm/dmapool.c to print the hexdump a number of
times to see whether the data read from the corrupted region changes.
Around line 372, there is a call to print_hex_dump().  Just replicate
that a number of times.
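
The idea of reading the same region several times can be sketched in
plain C, outside the kernel. This is a userspace analogue only, not
dmapool code; count_unstable_bytes() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read each byte of the region `passes` times and count the bytes whose
 * value changes between reads.  A non-zero result points at unstable
 * reads; zero with wrong data suggests the bad value is really stored.
 */
static size_t count_unstable_bytes(const volatile uint8_t *buf, size_t len,
				   unsigned int passes)
{
	size_t unstable = 0;
	size_t i;
	unsigned int p;

	assert(buf != NULL);

	for (i = 0; i < len; i++) {
		uint8_t first = buf[i];

		/* volatile forces a fresh load on every comparison */
		for (p = 1; p < passes; p++) {
			if (buf[i] != first) {
				unstable++;
				break;
			}
		}
	}
	return unstable;
}
```

On the board, the equivalent would be to run the existing hexdump over the
corrupted block more than once and compare the output by eye.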

Another idea would be to print a hexdump of each object as it's
allocated and the next object.

Maybe something like this (untested, may need tweaks to get it to build,
you'll also need to revert Thierry's patch):

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 6d4b97e7e9e9..3db1e9b63809 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -219,6 +219,47 @@ static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
 	} while (offset < pool->allocation);
 }
 
+#ifdef	DMAPOOL_DEBUG
+static int verify_one(struct dma_pool *pool, struct dma_page *page,
+		      unsigned int offset, const char *desc)
+{
+	dma_addr_t handle = page->dma + offset;
+	u8 *data = page->vaddr + offset;
+	int i;
+
+	for (i = sizeof(page->offset); i < pool->size; i++) {
+		if (data[i] == POOL_POISON_FREED)
+			continue;
+		if (pool->dev)
+			dev_err(pool->dev,
+				"%s %s, %pad (corrupted)\n",
+				desc, pool->name, &handle);
+		else
+			pr_err("%s %s, %pad (corrupted)\n",
+				desc, pool->name, &handle);
+
+		/*
+		 * Dump the first 4 bytes even if they are not
+		 * POOL_POISON_FREED
+		 */
+		print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 16, 1,
+				data, pool->size, 1);
+		return 1;
+	}
+	return 0;
+}
+
+static void verify_free(struct dma_pool *pool, struct dma_page *page, const char *desc)
+{
+	unsigned int offset;
+
+	for (offset = page->offset; offset < page->allocation;
+	     offset = *(int *)(page->vaddr + offset))
+		if (verify_one(pool, page, offset, desc))
+			break;
+}
+#endif
+
 static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
 {
 	struct dma_page *page;
@@ -235,6 +276,9 @@ static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
 		pool_initialise_page(pool, page);
 		page->in_use = 0;
 		page->offset = 0;
+#ifdef	DMAPOOL_DEBUG
+		verify_free(pool, page, "pool_alloc_page");
+#endif
 	} else {
 		kfree(page);
 		page = NULL;
@@ -345,35 +389,17 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 	list_add(&page->page_list, &pool->page_list);
  ready:
 	page->in_use++;
+#ifdef	DMAPOOL_DEBUG
+	verify_free(pool, page, "dma_pool_alloc pre");
+#endif
 	offset = page->offset;
 	page->offset = *(int *)(page->vaddr + offset);
 	retval = offset + page->vaddr;
 	*handle = offset + page->dma;
 #ifdef	DMAPOOL_DEBUG
-	{
-		int i;
-		u8 *data = retval;
-		/* page->offset is stored in first 4 bytes */
-		for (i = sizeof(page->offset); i < pool->size; i++) {
-			if (data[i] == POOL_POISON_FREED)
-				continue;
-			if (pool->dev)
-				dev_err(pool->dev,
-					"dma_pool_alloc %s, %p (corrupted)\n",
-					pool->name, retval);
-			else
-				pr_err("dma_pool_alloc %s, %p (corrupted)\n",
-					pool->name, retval);
-
-			/*
-			 * Dump the first 4 bytes even if they are not
-			 * POOL_POISON_FREED
-			 */
-			print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 16, 1,
-					data, pool->size, 1);
-			break;
-		}
-	}
+	verify_one(pool, page, offset, "dma_pool_alloc");
+	if (page->offset < pool->allocation)
+		verify_one(pool, page, page->offset, "dma_pool_alloc next");
 	if (!(mem_flags & __GFP_ZERO))
 		memset(retval, POOL_POISON_ALLOCATED, pool->size);
 #endif
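
For readers who want to see the free-list walk in isolation, here is a
simplified userspace model of it. It is a sketch under assumptions, not
the kernel code: it ignores pool->boundary, uses the pool's allocation
size as the end-of-list marker, and every function name below is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POISON_FREED 0xa7	/* same value as POOL_POISON_FREED */

/* Read the "next free offset" stored in the first 4 bytes of a block. */
static uint32_t next_offset(const uint8_t *page, uint32_t offset)
{
	uint32_t next;

	memcpy(&next, page + offset, sizeof(next));
	return next;
}

/* Lay out a fresh page: poison it, then chain the blocks together. */
static void init_page(uint8_t *page, uint32_t allocation, uint32_t size)
{
	uint32_t offset, next;

	assert(size >= sizeof(uint32_t) && size <= allocation);

	memset(page, POISON_FREED, allocation);
	for (offset = 0; offset < allocation; offset = next) {
		next = offset + size;
		if (next + size > allocation)
			next = allocation;	/* end-of-list marker */
		memcpy(page + offset, &next, sizeof(next));
	}
}

/*
 * Walk the free list from `head`.  Return the offset of the first block
 * with a non-poison payload or a bad link, or `allocation` if the whole
 * list is intact.
 */
static uint32_t first_corrupt_block(const uint8_t *page, uint32_t head,
				    uint32_t allocation, uint32_t size)
{
	uint32_t offset = head;
	uint32_t steps, i;

	assert(size >= sizeof(uint32_t) && size <= allocation);

	/* A sane list has at most allocation / size entries. */
	for (steps = 0; steps < allocation / size; steps++) {
		if (offset >= allocation)
			return allocation;	/* reached the end marker */
		if (offset + size > allocation)
			return offset;		/* link points past the page */
		for (i = sizeof(uint32_t); i < size; i++)
			if (page[offset + i] != POISON_FREED)
				return offset;
		offset = next_offset(page, offset);
	}
	/* Still inside the page after the maximum step count: a loop. */
	return offset < allocation ? offset : allocation;
}
```

The step limit is a design choice for the model: a corrupted link could
otherwise send the walk in circles, which is exactly the kind of damage
being hunted here.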

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 14:23                                           ` Russell King - ARM Linux admin
@ 2019-03-05 14:57                                             ` Embedded Engineer
  -1 siblings, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-05 14:57 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
	linux-tegra, linux-arm-kernel

On Tue, Mar 5, 2019 at 7:23 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
> +       for (offset = page->offset; offset < page->allocation;
> +            offset = *(int *)(page->vaddr + offset))

 error: 'struct dma_page' has no member named 'allocation'. So I
replaced 'page->allocation' with 'page->in_use'. Did you really mean
that? If so, the boot logs follow:

https://pastebin.com/rgfGdYcj

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 14:57                                             ` Embedded Engineer
@ 2019-03-05 14:58                                               ` Russell King - ARM Linux admin
  -1 siblings, 0 replies; 63+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 14:58 UTC (permalink / raw)
  To: Embedded Engineer
  Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
	linux-tegra, linux-arm-kernel

On Tue, Mar 05, 2019 at 07:57:18PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 7:23 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> > +       for (offset = page->offset; offset < page->allocation;
> > +            offset = *(int *)(page->vaddr + offset))
> 
>  error: 'struct dma_page' has no member named 'allocation'. So I
> replaced 'page->allocation' with 'page->in_use'. Did you really mean
> that? If so, the boot logs follow:

Should've been pool->allocation.  Sorry about that.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 14:58                                               ` Russell King - ARM Linux admin
@ 2019-03-05 15:11                                                 ` Embedded Engineer
  -1 siblings, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-05 15:11 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
	linux-tegra, linux-arm-kernel

On Tue, Mar 5, 2019 at 7:58 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> Should've been pool->allocation.  Sorry about that.

No problems, here are the new logs:

https://pastebin.com/dfey3LwB

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 15:11                                                 ` Embedded Engineer
@ 2019-03-05 15:31                                                   ` Russell King - ARM Linux admin
  -1 siblings, 0 replies; 63+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 15:31 UTC (permalink / raw)
  To: Embedded Engineer
  Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
	linux-tegra, linux-arm-kernel

On Tue, Mar 05, 2019 at 08:11:22PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 7:58 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > Should've been pool->allocation.  Sorry about that.
> 
> No problems, here are the new logs:
> 
> https://pastebin.com/dfey3LwB

Thanks - the patch I posted substantially increases the amount of checking
that is done... so not surprisingly we find new forms of corruption:

tegra-ehci 7d004000.usb: pool_alloc_page ehci_qh, 0xac050240 (corrupted)
00000000: a0 02 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  ....kkkkkkkkkkkk
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

and that corruption occurred _right_ after we allocated the page, memset
the entire page to 0xa7, and wrote the "next" pointers.
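
To pin down exactly which bytes are wrong in a dump like the one above, a
small scanner helps. The sketch below is illustrative only: find_bad_run()
and struct bad_run are hypothetical names, and it assumes the first four
bytes hold the "next" offset and everything else should be 0xa7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_FREED 0xa7	/* fill value of a freshly initialised page */
#define LINK_BYTES   4		/* leading bytes holding the "next" offset */

struct bad_run {
	size_t start;	/* offset of first non-poison byte, or len if none */
	size_t len;	/* number of consecutive non-poison bytes */
};

/* Locate the first run of non-poison bytes after the link word. */
static struct bad_run find_bad_run(const uint8_t *block, size_t len)
{
	struct bad_run run = { len, 0 };
	size_t i;

	assert(block != NULL && len >= LINK_BYTES);

	for (i = LINK_BYTES; i < len; i++) {
		if (block[i] == POISON_FREED) {
			if (run.len)
				break;	/* end of the first bad run */
			continue;
		}
		if (!run.len)
			run.start = i;
		run.len++;
	}
	return run;
}
```

Fed the first 32 bytes of the ehci_qh dump, this reports a run of 12 bad
bytes starting at offset 4, i.e. the remainder of the first 16-byte row.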

Again, similar scenario to the above:

tegra-ehci 7d004000.usb: pool_alloc_page ehci_qtd, 0xac0510c0 (corrupted)
00000000: 20 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7   ...............
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: e0 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

which is again right after the page is allocated and initialised.
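
The leading four bytes of each dump can be cross-checked against the
handle. The sketch below assumes a 4096-byte pool page, so that the low 12
bits of the DMA handle give the block's offset in the page, and assumes
the dump covers the whole block (0x60 bytes here). Both are inferences
from the logs, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u	/* assumption, not taken from the logs */

/* Decode the little-endian 32-bit "next" offset at the start of a block. */
static uint32_t decode_next(const uint8_t link[4])
{
	return (uint32_t)link[0] | (uint32_t)link[1] << 8 |
	       (uint32_t)link[2] << 16 | (uint32_t)link[3] << 24;
}

/* Return 1 if the link points at the block immediately following. */
static int link_is_sequential(uint32_t dma_handle, uint32_t block_size,
			      const uint8_t link[4])
{
	uint32_t offset = dma_handle & (ASSUMED_PAGE_SIZE - 1);

	assert(block_size > 0);
	return decode_next(link) == offset + block_size;
}
```

For the ehci_qtd dump, handle 0xac0510c0 gives offset 0xc0, and the link
bytes "20 01 00 00" decode to 0x120 = 0xc0 + 0x60, so the link word itself
looks intact under these assumptions.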

If we look at the ci_hw_qh case, which is the one originally identified:

tegra-udc 7d000000.usb: pool_alloc_page ci_hw_qh, 0xac056080 (corrupted)
00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

Again, just allocated the coherent DMA page, memset() it and written
the offsets to it, and it is already corrupted.  Tegra124 does not
appear to be dma-coherent, so these allocations will be for normal,
uncached memory.  That means the cache won't be loading entire
cachelines at a time from memory for these accesses, but will be
reading them byte by byte as we print the hex values.

The window for this corruption occurring is now very small.

Right now, I don't have anything further to add beyond what I've
already suggested as causes - this is *definitely* memory corruption
either by something else writing to memory, by the CPU writes not
properly being stored in RAM or the CPU not being able to reliably
read data back from RAM.

I wonder whether any of the memory testers run with normal, uncached
memory.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 15:31                                                   ` Russell King - ARM Linux admin
@ 2019-03-05 15:44                                                     ` Embedded Engineer
  -1 siblings, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-05 15:44 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
	linux-tegra, linux-arm-kernel

On Tue, Mar 5, 2019 at 8:31 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> Right now, I don't have anything further to add beyond what I've
> already suggested as causes - this is *definitely* memory corruption
> either by something else writing to memory, by the CPU writes not
> properly being stored in RAM or the CPU not being able to reliably
> read data back from RAM.

Thanks a lot for your help. I will try updating u-boot to a newer version
so that we can eliminate the chance that u-boot has left something on
in an undesired state.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 15:31                                                   ` Russell King - ARM Linux admin
  (?)
  (?)
@ 2019-03-05 16:00                                                   ` Clemens Koller
  2019-03-05 16:21                                                     ` Embedded Engineer
  2019-03-09  7:50                                                       ` Embedded Engineer
  -1 siblings, 2 replies; 63+ messages in thread
From: Clemens Koller @ 2019-03-05 16:00 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

On 05/03/2019 16.31, Russell King - ARM Linux admin wrote:
> Right now, I don't have anything further to add beyond what I've
> already suggested as causes - this is *definitely* memory corruption
> either by something else writing to memory, by the CPU writes not
> properly being stored in RAM or the CPU not being able to reliably
> read data back from RAM.
> 
> I wonder whether any of the memory testers run with normal, uncached
> memory.

Yes, this really smells like memory timing issues.
Did you try the more extensive memory test of the latest u-boot? The regular one is quite naive. The extensive one is usually *not* enabled by default.
See: CONFIG_CMD_MEMTEST in https://github.com/u-boot/u-boot/blob/master/cmd/mem.c

Then, a Shmoo plot with different memory timing/voltage/temperature might be useful as well as a PCB layout review.
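
As a rough idea of what such a test does, here is a minimal sketch of two
classic patterns: a walking-ones data bus check and an address-as-data
pass. It is not U-Boot's mtest implementation, and the function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walking-ones data bus test on a single word: returns the first
 * pattern that fails to read back, or 0 if all 32 patterns pass.
 */
static uint32_t test_data_bus(volatile uint32_t *word)
{
	uint32_t pattern;

	for (pattern = 1; pattern != 0; pattern <<= 1) {
		*word = pattern;
		if (*word != pattern)
			return pattern;
	}
	return 0;
}

/*
 * Address-as-data test: write each word's own index, read it all back,
 * then repeat with the inverted value.  Returns the index of the first
 * failing word, or `words` if the region passes.
 */
static size_t test_region(volatile uint32_t *base, size_t words)
{
	size_t i;
	int invert;

	assert(base != NULL);

	for (invert = 0; invert < 2; invert++) {
		for (i = 0; i < words; i++)
			base[i] = invert ? ~(uint32_t)i : (uint32_t)i;
		for (i = 0; i < words; i++)
			if (base[i] != (invert ? ~(uint32_t)i : (uint32_t)i))
				return i;
	}
	return words;
}
```

Run over ordinary cached memory, a test like this mostly exercises the
cache, so on the board it would need to run over an uncached mapping (or
with the caches off) to say anything about the DRAM itself.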

Regards,

Clemens
-- 

On 05/03/2019 16.31, Russell King - ARM Linux admin wrote:
> On Tue, Mar 05, 2019 at 08:11:22PM +0500, Embedded Engineer wrote:
>> On Tue, Mar 5, 2019 at 7:58 PM Russell King - ARM Linux admin
>> <linux@armlinux.org.uk> wrote:
>>>
>>> Should've been pool->allocation.  Sorry about that.
>>
>> No problems, here are the new logs:
>>
>> https://pastebin.com/dfey3LwB
> 
> Thanks - the patch I posted substantially increases the amount of checking
> that is done... so not surprisingly we find new forms of corruption:
> 
> tegra-ehci 7d004000.usb: pool_alloc_page ehci_qh, 0xac050240 (corrupted)
> 00000000: a0 02 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  ....kkkkkkkkkkkk
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000040: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 
> and that corruption occurred _right_ after we allocated the page, memset
> the entire page to 0xa7, and wrote the "next" pointers.
> 
> Again, similar scenario to the above:
> 
> tegra-ehci 7d004000.usb: pool_alloc_page ehci_qtd, 0xac0510c0 (corrupted)
> 00000000: 20 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7   ...............
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000040: e0 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 
> which is again right after the page is allocated and initialised.
> 
> If we look at the ci_hw_qh case, which is the one originally identified:
> 
> tegra-udc 7d000000.usb: pool_alloc_page ci_hw_qh, 0xac056080 (corrupted)
> 00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
> 
> Again, just allocated the coherent DMA page, memset() it and written
> the offsets to it, and it is already corrupted.  Tegra124 does not
> appear to be dma-coherent, so these allocations will be for normal,
> uncached memory.  That means the cache won't be loading entire
> cachelines at a time from memory for these accesses, but will be
> reading them byte by byte as we print the hex values.
> 
> The window for this corruption occurring is now very small.
> 
> Right now, I don't have anything further to add beyond what I've
> already suggested as causes - this is *definitely* memory corruption
> either by something else writing to memory, by the CPU writes not
> properly being stored in RAM or the CPU not being able to reliably
> read data back from RAM.
> 
> I wonder whether any of the memory testers run with normal, uncached
> memory.
> 

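The check described in the quoted analysis (fill the page with 0xa7, write the "next" offsets, then look for anything that changed) can be modelled roughly as below. This is only a simplified sketch in C and not the debugging patch posted earlier in the thread; page_init(), page_check() and the block_size parameter are hypothetical names, and the layout assumes one 32-bit "next" offset at the start of every block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POISON_BYTE 0xa7u /* fill value visible in the quoted dumps */

/* Fill the page with the poison byte, then store the offset of the
 * following block in the first 4 bytes of each block (assumed layout). */
void page_init(uint8_t *page, size_t page_size, size_t block_size)
{
    assert(block_size >= sizeof(uint32_t) && page_size % block_size == 0);
    memset(page, POISON_BYTE, page_size);
    for (size_t off = 0; off < page_size; off += block_size) {
        uint32_t next = (uint32_t)(off + block_size);
        memcpy(page + off, &next, sizeof(next));
    }
}

/* Return the offset of the first unexpected byte or "next" word,
 * or -1 if the page still looks exactly as page_init() left it. */
long page_check(const uint8_t *page, size_t page_size, size_t block_size)
{
    for (size_t off = 0; off < page_size; off += block_size) {
        uint32_t next;
        memcpy(&next, page + off, sizeof(next));
        if (next != (uint32_t)(off + block_size))
            return (long)off;
        for (size_t i = sizeof(next); i < block_size; i++)
            if (page[off + i] != POISON_BYTE)
                return (long)(off + i);
    }
    return -1;
}
```

Calling page_check() immediately after page_init() narrows the window in which a corrupting write can land, which is the point made in the quoted message.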

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 16:00                                                   ` Clemens Koller
@ 2019-03-05 16:21                                                     ` Embedded Engineer
  2019-03-09  7:50                                                       ` Embedded Engineer
  1 sibling, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-05 16:21 UTC (permalink / raw)
  To: Clemens Koller; +Cc: linux-arm-kernel

On Tue, Mar 5, 2019 at 9:01 PM Clemens Koller <clemens.ml@gmx.net> wrote:
>
> Did you try the more extensive memory test of the latest u-boot? The regular one is quite naive. This is usually *not* enabled as default.
> See: CONFIG_CMD_MEMTEST in https://github.com/u-boot/u-boot/blob/master/cmd/mem.c
>
> Then, a Shmoo plot with different memory timing/voltage/temperature might be useful as well as a PCB layout review.

I tried mtest with an old u-boot and it didn't report any errors. I
will try to get the latest u-boot working on my board and then
report the mtest results here.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 16:00                                                   ` Clemens Koller
@ 2019-03-09  7:50                                                       ` Embedded Engineer
  2019-03-09  7:50                                                       ` Embedded Engineer
  1 sibling, 0 replies; 63+ messages in thread
From: Embedded Engineer @ 2019-03-09  7:50 UTC (permalink / raw)
  To: Clemens Koller, Thierry Reding, linux-tegra, Andrew Lunn,
	Vladimir Murzin, linux-arm-kernel, Jon Hunter

On Tue, Mar 5, 2019 at 9:01 PM Clemens Koller <clemens.ml@gmx.net> wrote:
>
> Yes, this really smells like memory timing issues.
> Did you try the more extensive memory test of the latest u-boot? The regular one is quite naive. This is usually *not* enabled as default.

Unfortunately, I was unable to get the latest (or any other upstream)
u-boot running on my board, or even on the Jetson TK1 kit. Mainline
u-boot appears to have support for the Jetson TK1, but I don't know
why it's not working. The mtest command in Nvidia's downstream
version of u-boot did not report any errors.

> Then, a Shmoo plot with different memory timing/voltage/temperature might be useful as well as a PCB layout review.

I tried running the Shmoo test again, but it generated the BCT file with
the same parameters as before, so it didn't seem to work.

I also stopped u-boot at the command line and checked with an oscilloscope
for activity on the data lines between the DDR and the TK1 processor.
There was no activity on the data lines, which makes me believe that when
u-boot is idle, no peripheral (or DMA) is writing to memory and
corrupting the memory used by the kernel, as suggested
by Thierry Reding.

Can anyone give some other suggestions? It seems I'm running out of
options to test :(

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Unstable Kernel behavior on an ARM based board
  2019-03-05 15:44                                                     ` Embedded Engineer
@ 2019-03-15  8:55                                                       ` Marcel Ziswiler
  -1 siblings, 0 replies; 63+ messages in thread
From: Marcel Ziswiler @ 2019-03-15  8:55 UTC (permalink / raw)
  To: linux, embed786
  Cc: andrew, vladimir.murzin, jonathanh, thierry.reding, linux-tegra,
	linux-arm-kernel

On Tue, 2019-03-05 at 20:44 +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 8:31 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> > Right now, I don't have anything further to add beyond what I've
> > already suggested as causes - this is *definitely* memory
> > corruption
> > either by something else writing to memory, by the CPU writes not
> > properly being stored in RAM or the CPU not being able to reliably
> > read data back from RAM.
> 
> Thanks a lot for your help, I will try updating u-boot to a newer
> version so that we can eliminate the chance that u-boot has left
> something on in an undesired state.

Sorry, I just saw this thread now. I have quite a bit of TK1 experience
from our Apalis TK1 bring-up. For us, mainline U-Boot works quite nicely,
but I do remember some magic stuff called RAM repair that NVIDIA added to
their downstream, which fixed a strange hang issue we saw at times
during our extensive validation & verification:

http://git.toradex.com/cgit/u-boot-toradex.git/commit/?h=2016.11-toradex&id=df2b46ba248687c208767865abe5fca32a43faaf

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2019-03-15  8:55 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
2019-03-02 10:44 Unstable Kernel behavior on an ARM based board Embedded Engineer
2019-03-02 11:00 ` Russell King - ARM Linux admin
2019-03-02 11:01 ` Willy Tarreau
2019-03-02 11:22   ` Embedded Engineer
2019-03-02 11:25     ` Willy Tarreau
2019-03-02 11:46       ` Russell King - ARM Linux admin
2019-03-04 13:57         ` Thierry Reding
2019-03-02 11:36     ` Russell King - ARM Linux admin
2019-03-02 11:52       ` Embedded Engineer
2019-03-02 11:57         ` Russell King - ARM Linux admin
2019-03-02 12:20           ` Embedded Engineer
2019-03-02 12:39             ` Russell King - ARM Linux admin
2019-03-02 13:10               ` Embedded Engineer
2019-03-02 15:07               ` Clemens Koller
2019-03-04  5:14                 ` Embedded Engineer
2019-03-04 10:26                   ` Vladimir Murzin
2019-03-04 12:25                     ` Embedded Engineer
2019-03-04 14:25                       ` Thierry Reding
2019-03-04 15:51                         ` Embedded Engineer
2019-03-04 15:51                           ` Embedded Engineer
2019-03-05 10:01                         ` Embedded Engineer
2019-03-05 10:01                           ` Embedded Engineer
2019-03-05 10:07                           ` Russell King - ARM Linux admin
2019-03-05 10:07                             ` Russell King - ARM Linux admin
2019-03-05 10:29                             ` Embedded Engineer
2019-03-05 10:29                               ` Embedded Engineer
2019-03-05 11:20                               ` Thierry Reding
2019-03-05 11:22                               ` Russell King - ARM Linux admin
2019-03-05 11:22                                 ` Russell King - ARM Linux admin
2019-03-05 11:57                                 ` Thierry Reding
2019-03-05 13:16                                   ` Embedded Engineer
2019-03-05 13:16                                     ` Embedded Engineer
2019-03-05 13:23                                     ` Russell King - ARM Linux admin
2019-03-05 13:23                                       ` Russell King - ARM Linux admin
2019-03-05 13:32                                       ` Embedded Engineer
2019-03-05 13:32                                         ` Embedded Engineer
2019-03-05 14:23                                         ` Russell King - ARM Linux admin
2019-03-05 14:23                                           ` Russell King - ARM Linux admin
2019-03-05 14:57                                           ` Embedded Engineer
2019-03-05 14:57                                             ` Embedded Engineer
2019-03-05 14:58                                             ` Russell King - ARM Linux admin
2019-03-05 14:58                                               ` Russell King - ARM Linux admin
2019-03-05 15:11                                               ` Embedded Engineer
2019-03-05 15:11                                                 ` Embedded Engineer
2019-03-05 15:31                                                 ` Russell King - ARM Linux admin
2019-03-05 15:31                                                   ` Russell King - ARM Linux admin
2019-03-05 15:44                                                   ` Embedded Engineer
2019-03-05 15:44                                                     ` Embedded Engineer
2019-03-15  8:55                                                     ` Marcel Ziswiler
2019-03-15  8:55                                                       ` Marcel Ziswiler
2019-03-05 16:00                                                   ` Clemens Koller
2019-03-05 16:21                                                     ` Embedded Engineer
2019-03-09  7:50                                                     ` Embedded Engineer
2019-03-09  7:50                                                       ` Embedded Engineer
2019-03-05 10:32                           ` Thierry Reding
2019-03-05 11:05                             ` Embedded Engineer
2019-03-05 11:05                               ` Embedded Engineer
2019-03-05 11:36                               ` Thierry Reding
2019-03-04 14:00                   ` Andrew Lunn
2019-03-04 14:27                     ` Thierry Reding
2019-03-04 15:27                     ` Embedded Engineer
2019-03-04 15:57                       ` Andrew Lunn
2019-03-04 16:03                         ` Embedded Engineer