On Tue, Mar 05, 2019 at 03:29:26PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 3:07 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > Please apply this patch so we can see the (ptrval) values.  Thanks.
> 
> Please find below logs after applying patch:
> 
> https://pastebin.com/6TaBxPX5

Hm... so it looks like what you're getting here is the error spew from
the DMA pool debug code in mm/dmapool.c. The way I understand it, that
code initializes the memory for each page allocated from the pool with
POOL_POISON_FREED (0xa7) (see pool_alloc_page()) and then, upon adding
the page to the pool list, it'll store the offset to the page->offset
field and check the contents of the page.

The contents of the page then don't match the expected poison. The dump
of the corrupted memory is somewhat confusing because the values that
don't match the poison are actually expected, at least partially. From
my reading of the DMA pool code, the first four bytes store the offset
of the DMA block into the physical memory page. However, given the size
of the hexdump, it looks like the pool was allocated with a block size
of 64 bytes, which matches the code in drivers/usb/chipidea/udc.c that
allocates the "ci_hw_qh" pool.
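
To make that a bit more concrete, here is a rough userspace sketch of the
check as I read it. This is not the kernel code: init_block() and
first_corrupt_byte() are made-up names, BLOCK_SIZE is the 64 bytes
guessed from the hexdump, and I'm assuming the scan skips the first four
bytes because those hold the offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POOL_POISON_FREED 0xa7	/* poison value from mm/dmapool.c */
#define BLOCK_SIZE        64	/* assumed: inferred from the hexdump size */

/*
 * Hypothetical helper: fill a block with poison and store a 32-bit
 * offset in its first four bytes (little-endian).
 */
static void init_block(uint8_t *block, uint32_t next)
{
	memset(block, POOL_POISON_FREED, BLOCK_SIZE);
	block[0] = next & 0xff;
	block[1] = (next >> 8) & 0xff;
	block[2] = (next >> 16) & 0xff;
	block[3] = (next >> 24) & 0xff;
}

/*
 * Return the index of the first byte after the offset word that is not
 * the poison value, or -1 if the rest of the block is still poisoned.
 */
static int first_corrupt_byte(const uint8_t *block, size_t size)
{
	size_t i;

	assert(size >= sizeof(uint32_t));
	for (i = sizeof(uint32_t); i < size; i++)
		if (block[i] != POOL_POISON_FREED)
			return (int)i;
	return -1;
}
```

A freshly initialized block passes this check. As soon as anything is
written past the first four bytes, the check reports the index of that
byte, which is when the "(corrupted)" message and the hexdump would be
printed.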

What's strange here, though, is that the offset that's stored to the
first four bytes of a block seems to actually be stored twice per block.
The first offset seems to be correct, since it's apparently used to find
the offset of the next block to allocate. If you look at the first
corrupted hexdump:

  [    1.327553] tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056080 (corrupted)
  [    1.335058] 00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.343077] 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.351095] 00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.359113] 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

This is the entry for the block at offset 0x00000080 and the offset for
the next block is 0x000000c0, which is exactly 64 bytes after the
current block. However, if you then look at the second offset that's
stored at offset 0x00000020 in the block, it's 0x00000080, which does
match the offset of the current block, but I think that may just be
coincidence. The same coincidence happens for the second corrupted
block:

  [    1.367210] tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056140 (corrupted)
  [    1.374709] 00000000: 80 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.382727] 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.390744] 00000020: 40 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
  [    1.398760] 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

But not for the third:

  [    1.406965] tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec0561c0 (corrupted)
  [    1.414466] 00000000: 00 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.422483] 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
  [    1.430502] 00000020: 40 03 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  @...............
  [    1.438519] 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
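
For reference, the arithmetic behind those three dumps can be written
down as a small userspace sketch. The struct and helper names are made
up for illustration, and it assumes 4 KiB pages, a page-aligned pool
page and little-endian storage of the offsets, none of which I have
verified on your board.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE   64u	/* assumed block size of the ci_hw_qh pool */
#define PAGE_MASK_4K 0xfffu	/* assumed 4 KiB pages */

/* One "(corrupted)" report: block address plus the two non-poison words. */
struct dump {
	uint32_t vaddr;		/* address printed by dma_pool_alloc */
	uint8_t word00[4];	/* bytes at 0x00 in the block */
	uint8_t word20[4];	/* bytes at 0x20 in the block */
};

static const struct dump dumps[3] = {
	{ 0xec056080, { 0xc0, 0x00, 0x00, 0x00 }, { 0x80, 0x00, 0x00, 0x00 } },
	{ 0xec056140, { 0x80, 0x01, 0x00, 0x00 }, { 0x40, 0x01, 0x00, 0x00 } },
	{ 0xec0561c0, { 0x00, 0x02, 0x00, 0x00 }, { 0x40, 0x03, 0x00, 0x00 } },
};

/* Decode a little-endian 32-bit value. */
static uint32_t le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Offset of the block within its page. */
static uint32_t block_offset(uint32_t vaddr)
{
	assert((vaddr & (BLOCK_SIZE - 1)) == 0);
	return vaddr & PAGE_MASK_4K;
}

/* Does the word at 0x00 point at the block right after this one? */
static int next_is_adjacent(const struct dump *d)
{
	return le32(d->word00) == block_offset(d->vaddr) + BLOCK_SIZE;
}

/* Does the stray word at 0x20 equal this block's own offset? */
static int stray_matches_self(const struct dump *d)
{
	return le32(d->word20) == block_offset(d->vaddr);
}
```

With the values from the logs, next_is_adjacent() holds for all three
blocks, while stray_matches_self() holds for the first two (0x80 and
0x140) but not for the third, where the stray word is 0x340 and the
block sits at offset 0x1c0.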

The fact that we see the offset stored at offset 0x20 in each block
makes me think there's perhaps some sort of aliasing happening here. But
I'm not sure how the system would even boot this far if aliasing was
really the problem. Things should be falling apart much sooner if that's
really what's going on here.

However, this sort of aliasing is not something that your typical memory
test will catch, so it could explain why they aren't reporting any
errors.

Thierry