From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from gate2.alliedtelesis.co.nz ([2001:df5:b000:5::4])
	by bombadil.infradead.org with esmtps (Exim 4.90_1 #2 (Red Hat Linux))
	id 1fBZHV-0006Gb-Nq
	for linux-mtd@lists.infradead.org; Thu, 26 Apr 2018 05:17:24 +0000
From: Chris Packham
To: Miquel Raynal, Steve deRosier
CC: "linux-mtd@lists.infradead.org", "boris.brezillon@bootlin.com", Tobi Wulff
Subject: Re: NAND timeout issues with blank chip and Marvell NFC
Date: Thu, 26 Apr 2018 05:16:57 +0000
References: <20180424180837.398957ba@xps13> <72ff5349ac6e48a9ab74986947572108@svr-chch-ex1.atlnz.lc> <7cd09dc2689643e9a8e0751e1cba3e11@svr-chch-ex1.atlnz.lc>
Content-Language: en-US
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
List-Id: Linux MTD discussion mailing list

An update for the end of my working day.

On 26/04/18 13:40, Chris Packham wrote:
> On 26/04/18 09:22, Chris Packham wrote:
>> Hi Miquel,
>>
>> On 25/04/18 04:08, Miquel Raynal wrote:
>>> Hi Steve, Chris,
>>>
>>> On Tue, 24 Apr 2018 08:49:47 -0700, Steve deRosier
>>> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> On Mon, Apr 23, 2018 at 10:31 PM, Chris Packham
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> We're in the process of qualifying new NAND chips (Macronix
>>>>> MX30LF2G18AC) for one of our Armada-385 based devices and we're
>>>>> experiencing some long startup times on units with factory fresh NAND
>>>>> chips.
>>>>> Anecdotally I think I've also seen this behaviour on the old
>>>>> chips as well (Micron MT29F2G08ABAEAWP-ITX:E).
>>>>>
>>>>> On 4.17.0-rc2 with the newly re-written NAND infrastructure we see
>>>>>
>>>>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
>>>>> nand: Macronix MX30LF2G18AC
>>>>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
>>>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
>>>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)
>>>>> Bad block table not found for chip 0
>>>>> Bad block table not found for chip 0
>>>>> Scanning device for bad blocks
>>>>>
>>>>> (nothing for some time)
>>>>>
>>>>> On an older kernel we see
>>>>>
>>>>> pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
>>>>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
>>>>> nand: Macronix MX30LF2G18AC
>>>>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
>>>>> pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
>>>>> Bad block table not found for chip 0
>>>>> Bad block table not found for chip 0
>>>>> Scanning device for bad blocks
>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>> ...
>>>>> (time outs continue for some time)
>>>>>
>>>>> Presumably the new driver in 4.17.0-rc2 is experiencing the same wait
>>>>> time out but just not complaining about it.
>>>>>
>>>>> If we leave the system running long enough (in the order of 30 minutes)
>>>>> things seem to sort themselves out and bootup continues, the subsequent
>>>>> boots are fine.
>>>>> If we run 'nand erase.chip' from u-boot on a fresh unit
>>>>> and then boot into the kernel then things are also fine.
>>>>>
>>>>> If we run 'nand scrub.chip -y' from u-boot we are able to re-create the
>>>>> problem.
>>>>>
>>>>> Our suspicion is that the erased state of the chip is probably not agreeable
>>>>> with either the ecc data or the bad block table location (or both). By
>>>>> erasing it from u-boot this must fill in valid data in the expected
>>>>> places and the kernel is happy.
>>>>>
>>>>
>>>> During your very first boot, Linux can't find the bad-block table and
>>>> thus does a full scan of the chip, each and every block, to find the
>>>> manufacturer bad block marks and then constructs the table. I imagine
>>>> you've got a parameter incorrect somewhere that's causing it to wait
>>>> for timeouts at read points, instead of being able to quickly read through
>>>> the 2k or 4k blocks on that flash. On subsequent boots, you don't see
>>>> this issue because the BBT is found and Linux just uses that. Same
>>>> deal if you do a `nand erase.chip`, because the BBT is itself marked
>>>> with a bad-block marker and gets skipped during a normal erase.
>>>
>>> I share Steve's thoughts on that, there is probably some
>>> misconfiguration at some point, having a first long boot is not a
>>> problem, but 30 minutes for a 256MiB chip... What I don't understand is
>>> that you should have timeouts with the recent kernel too if there is
>>> actually something wrong happening.
>>
>> As I mentioned in my other reply I may have understated the time. It is
>> ~30mins with the old pxa3xx driver but the new one seems to block
>> indefinitely for me.
>>
>>>>
>>>> Now, I don't know if you're aware of this, but by doing the `nand
>>>> scrub.chip -y`, you've ruined the flash chip.
>>>> That device can not be
>>>> relied upon anymore. A scrub will ignore the factory bad-block-marks
>>>> and erase them. Unless you stored this information off-chip and
>>>> rewrite the markers, you've now lost the bad-block information from
>>>> the manufacturer's tests. In any case, this erases the BBT, so your
>>>> next boot triggers Linux to rebuild the BBT.
>>>
>>> I think U-Boot will do it automatically after the scrub. But the result
>>> is still the same.
>>>
>>>>
>>>>> We could update our manufacturing procedures to run 'nand erase.chip'
>>>>> before the first boot but this feels wrong. Some of our devices boot
>>>>> over the network so the nand is not normally touched by the bootloader.
>>>>> It seems that there is some unhandled error condition that is stopping
>>>>> the kernel from seeing that the chip is completely blank and making
>>>>> forward progress.
>>>>>
>>>>
>>>> erase chip won't fix your issue. The BBT scan is going to happen
>>>> anyway. There is however clearly some parameter that is setup
>>>> incorrectly that's causing it to wait for the timeout instead of being
>>>> able to quickly read pages. I don't see why that'd be unique to the
>>>> BBT scan however, I'd expect you to see the problem on all reads, thus
>>>> slowing down the system noticeably in general.
>>>>
>>>> Your hint is likely these lines:
>>>> " marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
>>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)"
>>>>
>>>> You can go look at that in the driver and compare with the relevant
>>>> behavior in the datasheets.
>>>> Sorry, but I can't help more specifically,
>>>> I'd have to know your particular hardware and datasheets and spend
>>>> some time looking at the code.
>>>
>>> I also reproduce the problem on my Armada 38x, the two timeouts at boot
>>> time (not specifically the first one) are suspicious, I'm going to look
>>> into it.
>>
>> Thanks for leaping onto it. I'll keep investigating it here as well.
>>
>
> When I add some debugging to marvell_nfc_wait_op I see
>
> marvell-nfc f10d0000.flash: timeout_ms = 250
> marvell-nfc f10d0000.flash: done
> marvell-nfc f10d0000.flash: timeout_ms = 1
> marvell-nfc f10d0000.flash: done
> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
> nand: Macronix MX30LF2G18AC
> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> Bad block table not found for chip 0
> Bad block table not found for chip 0
> Scanning device for bad blocks
> marvell-nfc f10d0000.flash: timeout_ms = 4
> marvell-nfc f10d0000.flash: done
> marvell-nfc f10d0000.flash: timeout_ms = 600000000
>
> That last line looks quite odd. I think the problem might be related to
> this line from marvell_nfc_hw_ecc_bch_write_page()
>
>     ret = marvell_nfc_wait_op(chip,
>                               chip->data_interface.timings.sdr.tPROG_max);
>
> Based on the datasheet that number is 600 microseconds (us) not the
> milliseconds expected by marvell_nfc_wait_op().
>

So naturally throwing in some PSEC_TO_MSEC() calls stopped the really
long timeouts but then the probe would fail. It seems that I'm getting
some "page done" and "command done" interrupt indications (NDSR =
0x0000500) while attempting to write the oob data.

I've also re-done some of my initial tests and it seems that 4.17-rc2
cannot mount this chip.
The 4.16.4 kernel can.

Even if I use the old kernel to create the ubi volumes the new kernel
seems to hang while mounting in a similar place to what I was seeing
with the BBT creation.