RE: RFC: detect and manage power cut on MLC NAND

From: "Jeff Lauruhn (jlauruhn)" <jlauruhn@micron.com>
To: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Andrea Scian <rnd4@dave-tech.it>,
	Richard Weinberger <richard@nod.at>,
	mtd_mailinglist <linux-mtd@lists.infradead.org>,
	"dedekind1@gmail.com" <dedekind1@gmail.com>
Subject: RE: RFC: detect and manage power cut on MLC NAND
Date: Fri, 13 Mar 2015 23:51:53 +0000
Message-ID: <0D23F1ECC880A74392D56535BCADD7354973DF0B@NTXBOIMBX03.micron.com>
In-Reply-To: <20150313213134.1b53430b@bbrezillon>

Jeff Lauruhn

-----Original Message-----
From: linux-mtd [mailto:linux-mtd-bounces@lists.infradead.org] On Behalf Of Boris Brezillon
Sent: Friday, March 13, 2015 1:32 PM
To: Jeff Lauruhn (jlauruhn)
Cc: Richard Weinberger; dedekind1@gmail.com; mtd_mailinglist; Andrea Scian
Subject: Re: RFC: detect and manage power cut on MLC NAND

Hello Jeff,

I'm joining the discussion to ask more questions about MLC NANDs ;-).

Could you tell us more about how block wear impacts the voltage level stored in NAND cells?

1/ Are all pages in a block impacted the same way ?
	Yes. Because erase operates on a whole block, P/E cycles affect all the pages in a block.
2/ Is wear more likely to induce voltage increase, voltage decrease
   or is it unpredictable ?
Wear is a very well-known NAND characteristic. During P/E cycling, electrons can become permanently trapped in the oxide. The more P/E cycles, the more electrons get trapped. Over many P/E cycles, cells will reach a point where they look permanently programmed and can no longer be erased or programmed. As cells begin to fail, ECC can be used to recover the data. If too many bits fail in a page, the device will respond with a FAIL status after a program or erase operation.
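
For illustration, the FAIL-status handling described above could be sketched as below. This is only a Python sketch, not driver code: it assumes an ONFI-style status byte in which bit 0 reports FAIL and bit 6 reports ready, and `handle_pe_status` and `bad_blocks` are hypothetical names made up for the example. Check the datasheet of the actual part for its status layout.

```python
# Assumed ONFI-style status bits; verify against the device datasheet.
STATUS_FAIL = 0x01   # bit 0: the last program/erase operation failed
STATUS_READY = 0x40  # bit 6: the device is ready


def handle_pe_status(status: int, block: int, bad_blocks: set) -> bool:
    """Return True if the program/erase succeeded.

    On FAIL, record the block in `bad_blocks` (hypothetical bookkeeping)
    so that the caller can stop using it.
    """
    if not status & STATUS_READY:
        raise RuntimeError("status was read while the device was still busy")
    if status & STATUS_FAIL:
        bad_blocks.add(block)
        return False
    return True
```

A caller would read the status byte after each program or erase and retire the block as soon as this returns False.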

3/ Is it possible to have more than one working voltage threshold
   (read-retry mode): I did some testing on my Hynix chip (I know you
   work for Micron but that's the only MLC chip I have :-)), and I
   managed to get less bitflips by trying another read-retry mode even
   if the previous one was allowing me to successfully fix existing
   bitflips.
Read Retry (RR) is available on some newer products. RR was introduced to help maintain and improve data retention and P/E cycle endurance as geometries shrink and bits per cell increase. If the device supports RR, we have predefined RR options, based on the most likely chance of success. Start with option 1 and step through the options until you get a successful read. The datasheet (DS) usually has pretty good information.
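
The stepping procedure described above might look roughly like the sketch below. It is an illustration in Python, not real driver code: `read_page`, `ReadResult`, the ECC strength, and the simulated bitflip counts are all hypothetical, and index 0 is assumed to be the first option the device offers. Real parts define their own RR options and the commands to select them.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReadResult:
    mode: int      # read-retry option that produced the successful read
    bitflips: int  # bitflips corrected by ECC for that read


def read_with_retry(read_page: Callable[[int], int],
                    num_modes: int,
                    ecc_strength: int) -> Optional[ReadResult]:
    """Step through the read-retry options in order and stop at the
    first read whose bitflip count is within the ECC strength."""
    for mode in range(num_modes):
        bitflips = read_page(mode)
        if bitflips <= ecc_strength:
            return ReadResult(mode, bitflips)
    return None  # no option gave a correctable read


# Simulated chip: bitflips observed with each option (made-up numbers).
def fake_read_page(mode: int) -> int:
    return [60, 45, 12, 3][mode]


result = read_with_retry(fake_read_page, num_modes=4, ecc_strength=40)
```

This version stops at the first correctable read. A variant that tries every option and keeps the one with the fewest bitflips would match the experiment described in question 3, at the cost of extra reads.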

4/ Do you have any numbers/statistics that could
   help us choose the more appropriate read-retry mode according to the
   number of P/E cycles ?
I don't have numbers or statistics, but I can tell you that the RR steps are generally defined based on known NAND behavior. Go to the Micron website and search for part number MT29F128G08CBCCB, and you will find good information on RR.

5/ Any other things you'd like to share regarding read-retry ? 
RR isn't available on all devices. From your perspective, I would give users the option to use RR if it's available.

Apart from that, we're currently trying to find the most appropriate way to deal with paired pages, and this sounds rather complicated.
The current idea is to expose paired pages information up to the UBIFS layer, and let UBIFS decide when it should stop writing on pages paired with already written pages.
Moreover, we have a few pages we need to protect (UBI metadata: EC and VID headers) in order to keep UBI/UBIFS consistent.
Do you have anything to share on this topic (ideas, solutions we should consider, constraints we're not aware of, ...)?

This is one of the reasons I came to this site.  I have a great deal of device knowledge and I need to know more about how end users use the device.  

Most designs today employ power-loss detection and a graceful shutdown of the NAND. In addition, we provide Write Protect, which adds an extra layer of protection against power loss. There is still a chance that, if the power event happens while a page is being programmed, the previously programmed shared page can also be corrupted. It's not clear to me how to keep track of shared pages for every device out there. It's not like a parameter page that you can read. It's an interesting problem.
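
To make the shared-page risk concrete, here is a small Python sketch of the lookup an upper layer would need. Everything in it is an assumption for illustration: `TOY_PAIRS` is an invented pairing for an 8-page toy block, and `build_pair_map` and `page_at_risk` are hypothetical helpers. Real pairing layouts differ from device to device and would have to come from the datasheet or a per-device table.

```python
from typing import Dict, Optional, Set

# Hypothetical pairing for an 8-page toy block: key and value share cells.
TOY_PAIRS: Dict[int, int] = {0: 2, 1: 3, 4: 6, 5: 7}


def build_pair_map(pairs: Dict[int, int]) -> Dict[int, int]:
    """Make the pairing symmetric so either page of a pair can be looked up."""
    full = dict(pairs)
    full.update({high: low for low, high in pairs.items()})
    return full


def page_at_risk(page: int, written: Set[int],
                 pair_map: Dict[int, int]) -> Optional[int]:
    """Return the already-written page that shares cells with `page`,
    i.e. the page a power cut during this program could corrupt."""
    partner = pair_map.get(page)
    return partner if partner in written else None


pair_map = build_pair_map(TOY_PAIRS)
written = {0, 1}                             # pages programmed so far
risk_2 = page_at_risk(2, written, pair_map)  # page 2 shares cells with page 0
risk_4 = page_at_risk(4, written, pair_map)  # partner page 6 is not written
```

With such a lookup, a layer like UBIFS could decline to program a page whenever `page_at_risk` returns a page holding data it cannot afford to lose, such as the EC and VID headers.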

Thanks for your valuable information.

Best Regards,

Boris

--
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/