From: Chris Packham
To: Steve deRosier
CC: "linux-mtd@lists.infradead.org", "boris.brezillon@bootlin.com", Tobi Wulff, "miquel.raynal@bootlin.com"
Subject: Re: NAND timeout issues with blank chip and Marvell NFC
Date: Wed, 25 Apr 2018 21:16:44 +0000
Message-ID: <8e8d8e830f1b4b6b808a602f845153bd@svr-chch-ex1.atlnz.lc>
List-Id: Linux MTD discussion mailing list

On 25/04/18 03:50, Steve deRosier wrote:
> Hi Chris,
>
> On Mon, Apr 23, 2018 at 10:31 PM, Chris Packham wrote:
>> Hi,
>>
>> We're in the process of qualifying new NAND chips (Macronix
>> MX30LF2G18AC) for one of our Armada-385 based devices and we're
>> experiencing some long startup times on units with factory fresh NAND
>> chips. Anecdotally I think I've also seen this behaviour on the old
>> chips as well (Micron MT29F2G08ABAEAWP-ITX:E).
>>
>> On 4.17.0-rc2 with the newly re-written NAND infrastructure we see
>>
>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
>> nand: Macronix MX30LF2G18AC
>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)
>> Bad block table not found for chip 0
>> Bad block table not found for chip 0
>> Scanning device for bad blocks
>>
>> (nothing for some time)

I should correct this. I left it overnight and it's still at this point
after >24hrs. My original statement was based on the fact that the old
driver would eventually complete.

I can't be 100% sure that this is the same result as a factory fresh chip.

>>
>> On an older kernel we see
>>
>> pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
>> nand: Macronix MX30LF2G18AC
>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
>> pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
>> Bad block table not found for chip 0
>> Bad block table not found for chip 0
>> Scanning device for bad blocks
>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>> ...
>> (time outs continue for some time)
>>
>> Presumably the new driver in 4.17.0-rc2 is experiencing the same wait
>> time out but just not complaining about it.
>>
>> If we leave the system running long enough (in the order of 30 minutes)
>> things seem to sort themselves out and bootup continues, the subsequent
>> boots are fine. If we run 'nand erase.chip' from u-boot on a fresh unit
>> and then boot into the kernel then things are also fine.
>>
>> If we run 'nand scrub.chip -y' from u-boot we are able to re-create the
>> problem.
>>
>> Our suspicion is that erased state of the chip is probably not agreeable
>> with either the ecc data or the bad block table location (or both). By
>> erasing it from u-boot this must fill in valid data in the expected
>> places and the kernel is happy.
>>
>
> During your very first boot, Linux can't find the bad-block table and
> thus does a full scan of the chip, each and every block, to find the
> manufacturer bad block marks and then constructs the table.

That's what I assumed was going on.

> I imagine
> you've got a parameter incorrect somewhere that's causing it to wait
> for timeouts at read points, instead of quickly able to read through
> the 2k or 4k blocks on that flash. On subsequent boots, you don't see
> this issue because the BBT is found and Linux just uses that. Same
> deal if you do a `nand erase.chip`, because the BBT is itself marked
> with a bad-block marker and gets skipped during a normal erase.

Any suggestion as to which setting I may have missed? I haven't adjusted
my board dts to use any of the new capabilities from the updated
framework but it does look pretty much the same as every other user of
this driver. It is just inheriting the setup from armada-38x.dtsi and
setting up the CS, BBT and ECC params. The final version looks something
like this

flash@d0000 {
        compatible = "marvell,armada370-nand";
        reg = <0xd0000 0x54>;
        #address-cells = <0x1>;
        #size-cells = <0x1>;
        interrupts = <0x0 0x54 0x4>;
        clocks = <0xe 0x0>;
        status = "okay";
        num-cs = <0x1>;
        nand-ecc-strength = <0x4>;
        nand-ecc-step-size = <0x200>;
        marvell,nand-enable-arbiter;
        nand-on-flash-bbt;
};


> Now, I don't know if you're aware of this, but by doing the `nand
> scrub.chip -y`, you've ruined the flash chip. That device can not be
> relied upon anymore. A scrub will ignore the factory bad-block-marks
> and erase them. Unless you stored this information off-chip and
> rewrite the markers, you've now lost the bad-block information from
> the manufacturer's tests. In any case, this erases the BBT, so your
> next boot triggers Linux to rebuild the BBT.

I was aware and dumped out the BBT before scrubbing. These are sample
chips anyway so I'm fine with burning them.

>
>> We could update our manufacturing procedures to run 'nand erase.chip'
>> before the first boot but this feels wrong. Some of our devices boot
>> over the network so the nand is not normally touched by the bootloader.
>> It seems that there is some unhandled error condition that is stopping
>> the kernel from seeing that the chip is completely blank and making
>> forward progress.
>>
>
> erase chip won't fix your issue. The BBT scan is going to happen
> anyway. There is however clearly some parameter that is setup
> incorrectly that's causing it to wait for the timeout instead of being
> able to quickly read pages. I don't see why that'd be unique to the
> BBT scan however, I'd expect you to see the problem on all reads, thus
> slowing down the system noticeably in general.
>
> Your hint is likely these lines:
> " marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)"
>
> You can go look at that in the driver and compare with the relevant
> behavior in the datasheets. Sorry, but I can't help more specifically,
> I'd have to know your particular hardware and datasheets and spend
> some time looking at the code.

Those messages seem to come out in both the "good" and "bad" cases. I've
been ignoring them up to now. I'll go take a closer look.
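
For a rough sense of scale, the geometry in the kernel log above can be turned into a per-block figure for the bad block scan. This is only a back-of-the-envelope sketch and not something from the thread's participants: it assumes the "in the order of 30 minutes" figure reported for the older kernel, and it assumes that time is spread evenly over every erase block. The function name scan_estimate is made up for illustration.

```python
# Geometry as reported by the kernel log in this thread.
CHIP_SIZE = 256 * 1024 * 1024   # 256 MiB
ERASE_SIZE = 128 * 1024         # 128 KiB per erase block
PAGE_SIZE = 2048                # bytes per page


def scan_estimate(total_seconds):
    """Return (blocks, pages_per_block, seconds_per_block) for a full scan.

    Assumption: the scan time is divided evenly across all erase blocks.
    """
    blocks = CHIP_SIZE // ERASE_SIZE
    pages_per_block = ERASE_SIZE // PAGE_SIZE
    return blocks, pages_per_block, total_seconds / blocks


# Assumption: roughly 30 minutes, the figure quoted for the older kernel.
blocks, pages_per_block, per_block = scan_estimate(30 * 60)
print(f"{blocks} blocks, {pages_per_block} pages per block")
print(f"about {per_block:.2f} s per block over a 30 minute scan")
```

Under those assumptions the chip has 2048 erase blocks, so a 30 minute scan works out to a little under a second per block, which is far longer than a healthy page read and would fit with each block's read waiting on a timeout rather than completing normally.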