NAND timeout issues with blank chip and Marvell NFC

@ 2018-04-24  5:31 Chris Packham
  2018-04-24 15:49 ` Steve deRosier
  2018-04-25 13:32 ` Miquel Raynal
  0 siblings, 2 replies; 16+ messages in thread
From: Chris Packham @ 2018-04-24  5:31 UTC
  To: linux-mtd; +Cc: Tobi Wulff, boris.brezillon, miquel.raynal

Hi,

We're in the process of qualifying new NAND chips (Macronix
MX30LF2G18AC) for one of our Armada-385 based devices and we're
experiencing long startup times on units with factory-fresh NAND
chips. Anecdotally, I think I've seen this behaviour on the old
chips as well (Micron MT29F2G08ABAEAWP-ITX:E).

On 4.17.0-rc2, with the newly rewritten NAND infrastructure, we see:

nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
nand: Macronix MX30LF2G18AC
nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)
Bad block table not found for chip 0
Bad block table not found for chip 0
Scanning device for bad blocks

(nothing for some time)

On an older kernel we see:

pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
nand: Macronix MX30LF2G18AC
nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
Bad block table not found for chip 0
Bad block table not found for chip 0
Scanning device for bad blocks
pxa3xx-nand f10d0000.flash: Wait time out!!!
pxa3xx-nand f10d0000.flash: Wait time out!!!
pxa3xx-nand f10d0000.flash: Wait time out!!!
pxa3xx-nand f10d0000.flash: Wait time out!!!
pxa3xx-nand f10d0000.flash: Wait time out!!!
...
(time outs continue for some time)

Presumably the new driver in 4.17.0-rc2 is experiencing the same wait 
time out but just not complaining about it.

If we leave the system running long enough (on the order of 30 minutes),
things seem to sort themselves out and bootup continues. Subsequent
boots are fine. If we run 'nand erase.chip' from u-boot on a fresh unit
and then boot into the kernel, things are also fine.
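
For a rough sense of scale, the figures above can be turned into a
per-block time budget. This is only a back-of-the-envelope sketch: it
assumes the whole delay is spent in the bad block scan and is spread
evenly over every erase block, neither of which has been confirmed. The
function names are hypothetical; the sizes come from the "256 MiB ...
erase size: 128 KiB" log line and the 30 minute estimate.

```c
#include <stdint.h>

/* Number of erase blocks on the chip (hypothetical helper). */
static uint32_t block_count(uint64_t chip_bytes, uint32_t erase_bytes)
{
    return (uint32_t)(chip_bytes / erase_bytes);
}

/*
 * Average time spent per block, ASSUMING the delay is spread evenly
 * across all blocks during the scan (an assumption, not a measurement).
 */
static double seconds_per_block(double total_seconds, uint32_t blocks)
{
    return total_seconds / (double)blocks;
}
```

With 256 MiB and 128 KiB blocks this gives 2048 blocks, so 30 minutes
(1800 s) works out to a little under 0.9 s per block, which would be
consistent with one or more command timeouts per block rather than a
single long stall.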

If we run 'nand scrub.chip -y' from u-boot we are able to re-create the 
problem.

Our suspicion is that the erased state of the chip is probably not
agreeable with either the ECC data or the bad block table location (or
both). Erasing it from u-boot must fill in valid data in the expected
places, after which the kernel is happy.
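
To illustrate the kind of check this suspicion is about: an erased page
reads back as all 0xFF, including the ECC bytes, and that is generally
not a valid codeword for the ECC engine, so a driver has to recognise
"this chunk is simply blank" separately from "this chunk is corrupt".
Below is a minimal, stand-alone sketch of such a check. It is not the
kernel's code (the raw NAND core has its own helper with a similar
purpose, nand_check_erased_ecc_chunk()); the function name and the
bitflip threshold parameter here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether a buffer looks erased.
 *
 * Counts the bits that read back as 0. Returns that count if it does
 * not exceed max_bitflips (the buffer can be treated as erased), or -1
 * if there are too many zero bits for it to be a blank chunk.
 */
static int erased_bitflips(const uint8_t *buf, size_t len, int max_bitflips)
{
    int flips = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t zeros = (uint8_t)~buf[i];   /* set bits mark 0s in buf[i] */

        while (zeros) {
            zeros &= (uint8_t)(zeros - 1);  /* clear the lowest set bit */
            if (++flips > max_bitflips)
                return -1;
        }
    }

    return flips;
}
```

A caller would run this over the data and OOB bytes of a chunk that the
ECC engine reported as uncorrectable, and treat the chunk as blank
instead of failed when the result is not -1.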

We could update our manufacturing procedures to run 'nand erase.chip'
before the first boot, but this feels wrong. Some of our devices boot
over the network, so the NAND is not normally touched by the bootloader.
It seems that some unhandled error condition is stopping the kernel
from seeing that the chip is completely blank and making forward
progress.

Has anyone else seen something like this before? Any thoughts as to how 
we can avoid the long delay?

Thanks
Chris
