From: Russell King - ARM Linux admin <linux@armlinux.org.uk>
To: Embedded Engineer <embed786@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>,
	Vladimir Murzin <vladimir.murzin@arm.com>,
	Jon Hunter <jonathanh@nvidia.com>,
	Thierry Reding <thierry.reding@gmail.com>,
	linux-tegra@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org
Subject: Re: Unstable Kernel behavior on an ARM based board
Date: Tue, 5 Mar 2019 15:31:11 +0000
Message-ID: <20190305153111.frvgnoba646uf5ar@shell.armlinux.org.uk>
In-Reply-To: <CA+_ZnZSTkA8vj4QYJizL=YUcvtiabrVzKtDUPYO+2v-AkgRSpw@mail.gmail.com>

On Tue, Mar 05, 2019 at 08:11:22PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 7:58 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > Should've been pool->allocation.  Sorry about that.
> 
> No problems, here are the new logs:
> 
> https://pastebin.com/dfey3LwB

Thanks - the patch I posted substantially increases the amount of checking
that is done... so not surprisingly we find new forms of corruption:

tegra-ehci 7d004000.usb: pool_alloc_page ehci_qh, 0xac050240 (corrupted)
00000000: a0 02 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  ....kkkkkkkkkkkk
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

and that corruption occurred _right_ after we allocated the page, memset
the entire page to 0xa7, and wrote the "next" pointers.
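
For illustration, the initialise-then-check sequence described above can be
mimicked in userspace.  This is only a sketch, not the debug patch from this
thread nor the kernel's mm/dmapool.c: the sketch_* names are made up,
page-boundary handling is omitted, and the 0x60 block size used in the usage
note is inferred from the dump (a "next" offset of 0x2a0 in a block whose
address ends in 0x240), assuming the printed address is that of the block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_POISON 0xa7  /* fill byte seen in the dumps in this mail */

/* Fill the page with the poison byte, then store in the first four bytes
 * of each block the page offset of the following block.  Boundary
 * handling is deliberately left out of this sketch. */
static void sketch_init_page(uint8_t *page, size_t allocation,
                             size_t block_size)
{
    assert(block_size >= sizeof(uint32_t));
    memset(page, SKETCH_POISON, allocation);
    for (size_t off = 0; off + block_size <= allocation; off += block_size) {
        uint32_t next = (uint32_t)(off + block_size);
        memcpy(page + off, &next, sizeof(next));
    }
}

/* Return the page offset of the first byte that differs from what
 * sketch_init_page() wrote, or -1 if every block is intact. */
static long sketch_check_page(const uint8_t *page, size_t allocation,
                              size_t block_size)
{
    for (size_t off = 0; off + block_size <= allocation; off += block_size) {
        uint32_t next;
        memcpy(&next, page + off, sizeof(next));
        if (next != (uint32_t)(off + block_size))
            return (long)off;
        for (size_t i = sizeof(next); i < block_size; i++)
            if (page[off + i] != SKETCH_POISON)
                return (long)(off + i);
    }
    return -1;
}
```

Calling sketch_init_page() on a 4096-byte buffer with a block size of 0x60,
overwriting 12 bytes at offset 0x244 with 0x6b, and then calling
sketch_check_page() reports offset 0x244, which corresponds to the shape of
the ehci_qh dump above.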

Again, similar scenario to the above:

tegra-ehci 7d004000.usb: pool_alloc_page ehci_qtd, 0xac0510c0 (corrupted)
00000000: 20 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7   ...............
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: e0 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

which is again right after the page is allocated and initialised.

If we look at the ci_hw_qh case, which is the one originally identified:

tegra-udc 7d000000.usb: pool_alloc_page ci_hw_qh, 0xac056080 (corrupted)
00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

Again, we have just allocated the coherent DMA page, memset() it and written
the offsets to it, and it is already corrupted.  Tegra124 does not
appear to be dma-coherent, so these allocations will be for normal,
uncached memory.  That means the cache won't be loading entire
cachelines at a time from memory for these accesses, but will be
reading them byte by byte as we print the hex values.
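
The three dumps above can also be read mechanically.  The following is a
rough sketch under stated assumptions: that the address printed on each
"(corrupted)" line is the address of the dumped block, that pages are 4 KiB,
and that the first word of each block is a little-endian offset to the next
block.  The sketch_* helpers are hypothetical and exist only for this
illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_MASK 0xfffu  /* assumption: 4 KiB pages */

/* Little-endian 32-bit load, matching the byte order in the dumps. */
static uint32_t sketch_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Block size implied by a block's address and its stored "next" offset. */
static uint32_t sketch_block_size(uint32_t block_addr, const uint8_t *dump)
{
    return sketch_le32(dump) - (block_addr & SKETCH_PAGE_MASK);
}

/* Offset within the dump of the first byte after the "next" word that is
 * not the 0xa7 fill value, or -1 if there is none. */
static long sketch_first_bad(const uint8_t *dump, size_t len)
{
    for (size_t i = 4; i < len; i++)
        if (dump[i] != 0xa7)
            return (long)i;
    return -1;
}
```

Under those assumptions the implied block sizes are 0x60 for ehci_qh and
ehci_qtd and 0x40 for ci_hw_qh, and the first unexpected byte sits at dump
offset 0x04, 0x40 and 0x20 respectively.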

The window for this corruption occurring is now very small.

Right now, I don't have anything further to add beyond what I've
already suggested as causes - this is *definitely* memory corruption,
either by something else writing to memory, by the CPU's writes not
properly being stored in RAM, or by the CPU not being able to reliably
read data back from RAM.

I wonder whether any of the memory testers run with normal, uncached
memory.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
