MLC NAND: all 0xff after erase?

From: Brian Norris @ 2012-07-11  0:36 UTC
  To: linux-mtd
  Cc: Mike Dunn, Artem Bityutskiy, Richard Weinberger, Kevin Cernekee,
	Jim Quinlan, Al Viro, Joel Reardon, David Woodhouse,
	Shmulik Ladkani

Hello all,

I've seen some issues with MLC NAND where I erase a block, read it
back, and receive a few bitflips such that the data is not entirely
0xff (i.e., a few bytes may be 0xfe, 0x7f, 0xf7, etc.).
However, I also notice that UBI, UBIFS, and YAFFS2 all make the
assumption that an erased page/block will be totally pristine: all
0xff. This brings me to my main question:

Can someone find an example MLC NAND datasheet that guarantees reading
an erased page will yield all 0xff data?

On first read, in fact, an example MLC datasheet doesn't even define
what "clearing the contents of a block" means. From the datasheet [1]:

  "Erase operations are used to clear the contents of a block in the
  NAND Flash array to prepare its pages for program operations."

This doesn't guarantee what happens when performing READ_PAGE on the
cleared block.

Now, if we truly cannot rely on fully-0xff after erase, then this
suggests a need for change in several different layers. A few comments
and proposals:

* A few bitflips on an otherwise-0xff page should not be treated as
unhandled corruption, as long as the erase operation did not return an
error status. Such a page should be still usable, provided the driver
has sufficient ECC capability.

* My NAND controller's HW ECC flags these erased-page bitflips as
uncorrectable errors, as there was no ECC written to the page and the
read data does not match the 0xff special "erased" case. I assume most
other ECC mechanisms would treat this similarly [2]. In such cases, I
suspect that the best the driver can do is to return the raw data
(0xff with flips) and an ECC error message.

* UBI and other FS layers need to distinguish between: (a) 0xff
cleanly-erased, (b) 0xff with bitflips, and (c) true ECC errors. In
the end, we may treat (a) and (b) the same (as erased pages), but the
problem is distinguishing between (b) and (c). This may require
modifications to:

  - the MTD API, to provide explicit notification of erased blocks.
For instance, we might introduce a new return code for mtd_read() that
represents (b); when MTD detects an ECC error, we check for all 0xff
with a threshold of bitflips, then return either -EBADMSG or a special
ERASED code.

  - UBI/UBIFS/other FSes, to utilize the new ERASED return code that
can be checked before checking for all-0xff. Either the ERASED code or
all-0xff data would be considered "erased".
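
To make the proposed check concrete, here is a minimal sketch of how the
(a)/(b)/(c) distinction could be drawn once ECC has already reported an
uncorrectable error. Everything in it is hypothetical: the enum, the
function names, and the max_flips threshold are illustration only, not
existing MTD API. A real driver would probably want to run this over the
data and the OOB/ECC bytes together, and pick the threshold from its ECC
strength.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes; these are not part of the MTD API. */
enum page_state {
	PAGE_ERASED_CLEAN,	/* (a) every byte is 0xff */
	PAGE_ERASED_FLIPS,	/* (b) 0xff with at most max_flips zero bits */
	PAGE_ECC_ERROR,		/* (c) too many zero bits to call it erased */
};

/* Count the zero bits in one byte, i.e. bits flipped away from 0xff. */
static unsigned int zero_bits(uint8_t b)
{
	uint8_t inv = (uint8_t)~b;
	unsigned int n = 0;

	while (inv) {
		inv &= (uint8_t)(inv - 1);	/* clear lowest set bit */
		n++;
	}
	return n;
}

/*
 * Classify a buffer that failed ECC. The caller is assumed to have
 * already seen an uncorrectable-error status for this read.
 */
enum page_state classify_failed_read(const uint8_t *buf, size_t len,
				     unsigned int max_flips)
{
	unsigned int flips = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		if (buf[i] == 0xff)
			continue;
		flips += zero_bits(buf[i]);
		if (flips > max_flips)
			return PAGE_ECC_ERROR;	/* bail out early */
	}
	return flips ? PAGE_ERASED_FLIPS : PAGE_ERASED_CLEAN;
}
```

With something like this, the read path could map PAGE_ECC_ERROR to
-EBADMSG and PAGE_ERASED_FLIPS to the special ERASED code, while a
PAGE_ERASED_CLEAN result after an ECC failure would just be the existing
all-0xff special case.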

Any comments are welcome, especially regarding my first question.

Thanks,
Brian

[1] Datasheet for Micron MT29F32G08CBABA
[2] Counterexample: it seems NAND_ECC_SOFT actually corrects a single
bitflip in 0xff data. Tested with nandsim and `nandwrite -n -o`.
