From: Chris Packham
To: "linux-mtd@lists.infradead.org"
CC: Tobi Wulff, "boris.brezillon@bootlin.com", "miquel.raynal@bootlin.com"
Subject: NAND timeout issues with blank chip and Marvell NFC
Date: Tue, 24 Apr 2018 05:31:39 +0000

Hi,

We're in the process of qualifying new NAND chips (Macronix
MX30LF2G18AC) for one of our Armada-385 based devices and we're
experiencing some long startup times on units with factory fresh NAND
chips. Anecdotally I think I've also seen this behaviour on the old
chips (Micron MT29F2G08ABAEAWP-ITX:E).

On 4.17.0-rc2 with the newly re-written NAND infrastructure we see

nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
nand: Macronix MX30LF2G18AC
nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)
Bad block table not found for chip 0
Bad block table not found for chip 0
Scanning device for bad blocks

(nothing for some time)

On an older kernel we see

pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
nand: Macronix MX30LF2G18AC
nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
Bad block table not found for chip 0
Bad block table not found for chip 0
Scanning device for bad blocks
pxa3xx-nand f10d0000.flash: Wait time out!!!
pxa3xx-nand f10d0000.flash: Wait time out!!!
pxa3xx-nand f10d0000.flash: Wait time out!!!
pxa3xx-nand f10d0000.flash: Wait time out!!!
pxa3xx-nand f10d0000.flash: Wait time out!!!
...
(time outs continue for some time)

Presumably the new driver in 4.17.0-rc2 is experiencing the same wait
time out but just not complaining about it.

If we leave the system running long enough (on the order of 30 minutes)
things seem to sort themselves out and bootup continues; subsequent
boots are fine. If we run 'nand erase.chip' from u-boot on a fresh unit
and then boot into the kernel, things are also fine.

If we run 'nand scrub.chip -y' from u-boot we are able to re-create the
problem.

Our suspicion is that the erased state of the chip is probably not
agreeable with either the ECC data or the bad block table location (or
both). Erasing it from u-boot must fill in valid data in the expected
places, and then the kernel is happy.

We could update our manufacturing procedures to run 'nand erase.chip'
before the first boot but this feels wrong. Some of our devices boot
over the network so the NAND is not normally touched by the bootloader.
It seems that there is some unhandled error condition that is stopping
the kernel from seeing that the chip is completely blank and making
forward progress.

Has anyone else seen something like this before? Any thoughts as to how
we can avoid the long delay?

Thanks
Chris
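
For a rough sense of scale, here is a back-of-the-envelope sketch (not a measurement) that turns the geometry from the kernel log and the roughly 30 minute stall into a per-block figure. It assumes the delay is spread evenly over every erase block during the bad block scan, which has not been verified; the constant and function names are made up for illustration.

```python
# Geometry as reported in the kernel log for the Macronix MX30LF2G18AC.
CHIP_SIZE_BYTES = 256 * 1024 * 1024   # "256 MiB"
ERASE_SIZE_BYTES = 128 * 1024         # "erase size: 128 KiB"
PAGE_SIZE_BYTES = 2048                # "page size: 2048"

# Observed stall on a blank chip: roughly 30 minutes (approximate figure).
STALL_SECONDS = 30 * 60


def block_count(chip_size: int, erase_size: int) -> int:
    """Number of erase blocks a full bad block scan has to visit."""
    return chip_size // erase_size


def seconds_per_block(stall_seconds: float, blocks: int) -> float:
    """Average stall per block, ASSUMING the delay is evenly distributed."""
    return stall_seconds / blocks


blocks = block_count(CHIP_SIZE_BYTES, ERASE_SIZE_BYTES)
pages_per_block = ERASE_SIZE_BYTES // PAGE_SIZE_BYTES
per_block = seconds_per_block(STALL_SECONDS, blocks)

print(f"erase blocks:      {blocks}")
print(f"pages per block:   {pages_per_block}")
print(f"avg stall / block: {per_block:.3f} s")
```

Under that assumption the stall works out to a little under one second per erase block, which would be in line with each block hitting a controller timeout during the scan rather than one single long hang.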