From: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
To: Miquel Raynal <miquel.raynal@bootlin.com>,
	Steve deRosier <derosier@gmail.com>
Cc: "linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>,
	"boris.brezillon@bootlin.com" <boris.brezillon@bootlin.com>,
	Tobi Wulff <Tobi.Wulff@alliedtelesis.co.nz>
Subject: Re: NAND timeout issues with blank chip and Marvell NFC
Date: Thu, 26 Apr 2018 01:40:45 +0000
Message-ID: <7cd09dc2689643e9a8e0751e1cba3e11@svr-chch-ex1.atlnz.lc>
In-Reply-To: <72ff5349ac6e48a9ab74986947572108@svr-chch-ex1.atlnz.lc>

On 26/04/18 09:22, Chris Packham wrote:
> Hi Miquel,
> 
> On 25/04/18 04:08, Miquel Raynal wrote:
>> Hi Steve, Chris,
>>
>> On Tue, 24 Apr 2018 08:49:47 -0700, Steve deRosier <derosier@gmail.com>
>> wrote:
>>
>>> Hi Chris,
>>>
>>> On Mon, Apr 23, 2018 at 10:31 PM, Chris Packham
>>> <Chris.Packham@alliedtelesis.co.nz> wrote:
>>>> Hi,
>>>>
>>>> We're in the process of qualifying new NAND chips (Macronix
>>>> MX30LF2G18AC) for one of our Armada-385 based devices and we're
>>>> experiencing some long startup times on units with factory fresh NAND
>>>> chips. Anecdotally I think I've also seen this behaviour on the old
>>>> chips as well (Micron MT29F2G08ABAEAWP-ITX:E).
>>>>
>>>> On 4.17.0-rc2 with the newly re-written NAND infrastructure we see
>>>>
>>>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
>>>> nand: Macronix MX30LF2G18AC
>>>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
>>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
>>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)
>>>> Bad block table not found for chip 0
>>>> Bad block table not found for chip 0
>>>> Scanning device for bad blocks
>>>>
>>>> (nothing for some time)
>>>>
>>>> On an older kernel we see
>>>>
>>>> pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
>>>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
>>>> nand: Macronix MX30LF2G18AC
>>>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
>>>> pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
>>>> Bad block table not found for chip 0
>>>> Bad block table not found for chip 0
>>>> Scanning device for bad blocks
>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>> ...
>>>> (time outs continue for some time)
>>>>
>>>> Presumably the new driver in 4.17.0-rc2 is experiencing the same wait
>>>> time out but just not complaining about it.
>>>>
>>>> If we leave the system running long enough (in the order of 30 minutes)
>>>> things seem to sort themselves out and bootup continues, the subsequent
>>>> boots are fine. If we run 'nand erase.chip' from u-boot on a fresh unit
>>>> and then boot into the kernel then things are also fine.
>>>>
>>>> If we run 'nand scrub.chip -y' from u-boot we are able to re-create the
>>>> problem.
>>>>
>>>> Our suspicion is that erased state of the chip is probably not agreeable
>>>> with either the ecc data or the bad block table location (or both). By
>>>> erasing it from u-boot this must fill in valid data in the expected
>>>> places and the kernel is happy.
>>>>    
>>>
>>> During your very first boot, Linux can't find the bad-block table and
>>> thus does a full scan of the chip, each and every block, to find the
>>> manufacturer bad block marks and then constructs the table. I imagine
>>> you've got a parameter incorrect somewhere that's causing it to wait
>>> for timeouts at read points, instead of being able to quickly read through
>>> the 2k or 4k blocks on that flash.  On subsequent boots, you don't see
>>> this issue because the BBT is found and Linux just uses that. Same
>>> deal if you do a `nand erase.chip`, because the BBT is itself marked
>>> with a bad-block marker and gets skipped during a normal erase.
>>
>> I share Steve's thoughts on that, there is probably some
>> misconfiguration at some point, having a first long boot is not a
>> problem, but 30 minutes for a 256MiB chip... What I don't understand is
>> that you should have timeouts with the recent kernel too if there is
>> actually something wrong happening.
> 
> As I mentioned in my other reply I may have understated the time. It is
> ~30mins with the old pxa3xx driver but the new one seems to block
> indefinitely for me.
> 
>>>
>>> Now, I don't know if you're aware of this, but by doing the `nand
>>> scrub.chip -y`, you've ruined the flash chip.  That device can not be
>>> relied upon anymore. A scrub will ignore the factory bad-block-marks
>>> and erase them. Unless you stored this information off-chip and
>>> rewrite the markers, you've now lost the bad-block information from
>>> the manufacturer's tests.  In any case, this erases the BBT, so your
>>> next boot triggers Linux to rebuild the BBT.
>>
>> I think U-Boot will do it automatically after the scrub. But the result
>> is still the same.
>>
>>>
>>>> We could update our manufacturing procedures to run 'nand erase.chip'
>>>> before the first boot but this feels wrong. Some of our devices boot
>>>> over the network so the nand is not normally touched by the bootloader.
>>>> It seems that there is some unhandled error condition that is stopping
>>>> the kernel from seeing that the chip is completely blank and making
>>>> forward progress.
>>>>    
>>>
>>> erase chip won't fix your issue. The BBT scan is going to happen
>>> anyway. There is however clearly some parameter that is setup
>>> incorrectly that's causing it to wait for the timeout instead of being
>>> able to quickly read pages. I don't see why that'd be unique to the
>>> BBT scan however, I'd expect you to see the problem on all reads, thus
>>> slowing down the system noticeably in general.
>>>
>>> Your hint is likely these lines:
>>>       " marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
>>>         marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)"
>>>
>>> You can go look at that in the driver and compare with the relevant
>>> behavior in the datasheets. Sorry, but I can't help more specifically,
>>> I'd have to know your particular hardware and datasheets and spend
>>> some time looking at the code.
>>
>> I also reproduce the problem on my Armada 38x, the two timeouts at boot
>> time (not specifically the first one) are suspicious, I'm going to look
>> into it.
> 
> Thanks for leaping onto it. I'll keep investigating it here as well.
> 

When I add some debugging to marvell_nfc_wait_op() I see:

marvell-nfc f10d0000.flash: timeout_ms = 250
marvell-nfc f10d0000.flash: done
marvell-nfc f10d0000.flash: timeout_ms = 1
marvell-nfc f10d0000.flash: done
nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
nand: Macronix MX30LF2G18AC
nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
Bad block table not found for chip 0
Bad block table not found for chip 0
Scanning device for bad blocks
marvell-nfc f10d0000.flash: timeout_ms = 4
marvell-nfc f10d0000.flash: done
marvell-nfc f10d0000.flash: timeout_ms = 600000000

That last line looks quite odd. I think the problem might be related to 
this line from marvell_nfc_hw_ecc_bch_write_page()

  ret = marvell_nfc_wait_op(chip,
                            chip->data_interface.timings.sdr.tPROG_max);

Based on the datasheet, tPROG_max is 600 microseconds (us). The logged
600000000 matches that figure expressed in picoseconds, not the
milliseconds expected by marvell_nfc_wait_op().
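
To make the unit mismatch concrete, here is a minimal standalone sketch of
the conversion that seems to be missing. It assumes the timing value is
held in picoseconds, which is what the 600000000 in the log suggests.
Both helper names are hypothetical and are not taken from the driver.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the driver): convert picoseconds to
 * milliseconds, rounding up so that a short but non-zero delay never
 * turns into a 0 ms timeout.
 */
static inline uint64_t psec_to_msec_round_up(uint64_t psec)
{
	const uint64_t psec_per_msec = 1000000000ULL; /* 10^9 ps per ms */

	return (psec + psec_per_msec - 1) / psec_per_msec;
}

/*
 * Hypothetical helper: whole days that elapse if a timeout of `msec`
 * milliseconds is allowed to run to completion.
 */
static inline uint64_t msec_to_whole_days(uint64_t msec)
{
	return msec / (1000ULL * 60 * 60 * 24);
}
```

With these, psec_to_msec_round_up(600000000) gives 1, which is in line
with the other timeouts in the log. Taken as milliseconds, 600000000 is a
wait of just under 7 days (msec_to_whole_days(600000000) is 6), which
would look like blocking indefinitely.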
