[PATCH v2 0/3] Fix proposal for the Micron shallow erase issue

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/3] Fix proposal for the Micron shallow erase issue
@ 2020-05-03 11:40 Miquel Raynal
  2020-05-03 11:40 ` [PATCH v2 1/3] mtd: rawnand: Add the nand_chip->erase hook Miquel Raynal
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Miquel Raynal @ 2020-05-03 11:40 UTC (permalink / raw)
  To: Richard Weinberger, Vignesh Raghavendra, Tudor Ambarus, linux-mtd
  Cc: Miquel Raynal, Zoltan Szubbocsev, Thomas Petazzoni,
	Boris Brezillon, tglx, Piotr Wojtaszczyk, Bean Huo

Hello,

After a first proposal by Thomas Gleixner and then another proposal by
Bean Huo (Micron), this is an attempt to mainline the fix for Micron's
"shallow erase" issue. IMHO this is a "pretty way", not so invasive,
with a limited performance penalty.

It has only be *compile-tested* and this is just to know if the
approach is fine or not, then I will optimize, maybe rewrite a bit and
forcibly (ask to) test it.

Changes in v2 (v1 was an RFC):
* Fixed the written pattern: should have been 0x00 instead of 0xFF.
* Removed files that I added in my commit by mistake.
* Reworded a little bit the comment about writing only odd pages.
* I am still waiting for Bean to comment of the need to write main
  area vs OOB and also the need to write pages starting from 0 or
  starting from the middle of the bloc. Anyway if I do not get more
  information, and people agree (or even test it), I will merge this
  set.

Miquel Raynal (3):
  mtd: rawnand: Add the nand_chip->erase hook
  mtd: rawnand: Add the nand_chip->write_oob hook
  mtd: rawnand: micron: Address the shallow erase issue

 drivers/mtd/nand/raw/internals.h   |   2 +
 drivers/mtd/nand/raw/nand_base.c   |  14 +++-
 drivers/mtd/nand/raw/nand_micron.c | 121 +++++++++++++++++++++++++++++
 include/linux/mtd/rawnand.h        |   6 ++
 4 files changed, 142 insertions(+), 1 deletion(-)

-- 
2.20.1

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 1/3] mtd: rawnand: Add the nand_chip->erase hook
  2020-05-03 11:40 [PATCH v2 0/3] Fix proposal for the Micron shallow erase issue Miquel Raynal
@ 2020-05-03 11:40 ` Miquel Raynal
  2020-05-03 15:01   ` Boris Brezillon
  2020-05-03 11:40 ` [PATCH v2 2/3] mtd: rawnand: Add the nand_chip->write_oob hook Miquel Raynal
  2020-05-03 11:40 ` [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue Miquel Raynal
  2 siblings, 1 reply; 31+ messages in thread
From: Miquel Raynal @ 2020-05-03 11:40 UTC (permalink / raw)
  To: Richard Weinberger, Vignesh Raghavendra, Tudor Ambarus, linux-mtd
  Cc: Miquel Raynal, Zoltan Szubbocsev, Thomas Petazzoni,
	Boris Brezillon, tglx, Piotr Wojtaszczyk, Bean Huo

In order to solve an issue with Micron NANDs, we must be able to
overload the erase operation. With this in mind, we create a ->erase
hook in the nand_chip structure which points by default to the
currently in use nand_erase_nand() helper.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
---
 drivers/mtd/nand/raw/nand_base.c | 6 +++++-
 include/linux/mtd/rawnand.h      | 3 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
index c24e5e2ba130..7c7ac722d88b 100644
--- a/drivers/mtd/nand/raw/nand_base.c
+++ b/drivers/mtd/nand/raw/nand_base.c
@@ -4158,7 +4158,9 @@ static int nand_write_oob(struct mtd_info *mtd, loff_t to,
  */
 static int nand_erase(struct mtd_info *mtd, struct erase_info *instr)
 {
-	return nand_erase_nand(mtd_to_nand(mtd), instr, 0);
+	struct nand_chip *chip = mtd_to_nand(mtd);
+
+	return chip->erase(chip, instr, 0);
 }
 
 /**
@@ -4419,6 +4421,8 @@ static void nand_set_defaults(struct nand_chip *chip)
 
 	if (!chip->buf_align)
 		chip->buf_align = 1;
+
+	chip->erase = nand_erase_nand;
 }
 
 /* Sanitize ONFI strings so we can safely print them */
diff --git a/include/linux/mtd/rawnand.h b/include/linux/mtd/rawnand.h
index 1e76196f9829..505c13f7a2ba 100644
--- a/include/linux/mtd/rawnand.h
+++ b/include/linux/mtd/rawnand.h
@@ -1020,6 +1020,7 @@ struct nand_legacy {
  *			avoid using them.
  * @setup_read_retry:	[FLASHSPECIFIC] flash (vendor) specific function for
  *			setting the read-retry mode. Mostly needed for MLC NAND.
+ * @erase:		Raw NAND erase operation.
  * @ecc:		[BOARDSPECIFIC] ECC control structure
  * @buf_align:		minimum buffer alignment required by a platform
  * @oob_poi:		"poison value buffer," used for laying out OOB data
@@ -1089,6 +1090,8 @@ struct nand_chip {
 	struct nand_legacy legacy;
 
 	int (*setup_read_retry)(struct nand_chip *chip, int retry_mode);
+	int (*erase)(struct nand_chip *chip, struct erase_info *instr,
+		     int allowbbt);
 
 	unsigned int options;
 	unsigned int bbt_options;
-- 
2.20.1


______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 2/3] mtd: rawnand: Add the nand_chip->write_oob hook
  2020-05-03 11:40 [PATCH v2 0/3] Fix proposal for the Micron shallow erase issue Miquel Raynal
  2020-05-03 11:40 ` [PATCH v2 1/3] mtd: rawnand: Add the nand_chip->erase hook Miquel Raynal
@ 2020-05-03 11:40 ` Miquel Raynal
  2020-05-03 15:09   ` Boris Brezillon
  2020-05-03 11:40 ` [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue Miquel Raynal
  2 siblings, 1 reply; 31+ messages in thread
From: Miquel Raynal @ 2020-05-03 11:40 UTC (permalink / raw)
  To: Richard Weinberger, Vignesh Raghavendra, Tudor Ambarus, linux-mtd
  Cc: Miquel Raynal, Zoltan Szubbocsev, Thomas Petazzoni,
	Boris Brezillon, tglx, Piotr Wojtaszczyk, Bean Huo

With the same approach as for the ->erase hook, in order to solve an
issue with Micron NANDs, we must be able to overload the write
operation. With this in mind, we create a ->write_oob hook in the
nand_chip structure which points by default to the
currently in use nand_write_oob() helper, renamed
nand_write_oob_nand() for the parallel with the nand_erase_nand()
one.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
---
 drivers/mtd/nand/raw/internals.h | 2 ++
 drivers/mtd/nand/raw/nand_base.c | 8 ++++++++
 include/linux/mtd/rawnand.h      | 3 +++
 3 files changed, 13 insertions(+)

diff --git a/drivers/mtd/nand/raw/internals.h b/drivers/mtd/nand/raw/internals.h
index 9d0caadf940e..caf534a6586a 100644
--- a/drivers/mtd/nand/raw/internals.h
+++ b/drivers/mtd/nand/raw/internals.h
@@ -81,6 +81,8 @@ int nand_bbm_get_next_page(struct nand_chip *chip, int page);
 int nand_markbad_bbm(struct nand_chip *chip, loff_t ofs);
 int nand_erase_nand(struct nand_chip *chip, struct erase_info *instr,
 		    int allowbbt);
+int nand_write_oob_nand(struct nand_chip *chip, loff_t to,
+			struct mtd_oob_ops *ops);
 int onfi_fill_data_interface(struct nand_chip *chip,
 			     enum nand_data_interface_type type,
 			     int timing_mode);
diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
index 7c7ac722d88b..f9cf30949f49 100644
--- a/drivers/mtd/nand/raw/nand_base.c
+++ b/drivers/mtd/nand/raw/nand_base.c
@@ -4121,6 +4121,13 @@ static int nand_write_oob(struct mtd_info *mtd, loff_t to,
 			  struct mtd_oob_ops *ops)
 {
 	struct nand_chip *chip = mtd_to_nand(mtd);
+
+	return chip->write_oob(chip, to, ops);
+}
+
+int nand_write_oob_nand(struct nand_chip *chip, loff_t to,
+			struct mtd_oob_ops *ops)
+{
 	int ret;
 
 	ops->retlen = 0;
@@ -4423,6 +4430,7 @@ static void nand_set_defaults(struct nand_chip *chip)
 		chip->buf_align = 1;
 
 	chip->erase = nand_erase_nand;
+	chip->write_oob = nand_write_oob_nand;
 }
 
 /* Sanitize ONFI strings so we can safely print them */
diff --git a/include/linux/mtd/rawnand.h b/include/linux/mtd/rawnand.h
index 505c13f7a2ba..7fbbd5d7088f 100644
--- a/include/linux/mtd/rawnand.h
+++ b/include/linux/mtd/rawnand.h
@@ -1021,6 +1021,7 @@ struct nand_legacy {
  * @setup_read_retry:	[FLASHSPECIFIC] flash (vendor) specific function for
  *			setting the read-retry mode. Mostly needed for MLC NAND.
  * @erase:		Raw NAND erase operation.
+ * @write_oob:		Raw NAND write operation.
  * @ecc:		[BOARDSPECIFIC] ECC control structure
  * @buf_align:		minimum buffer alignment required by a platform
  * @oob_poi:		"poison value buffer," used for laying out OOB data
@@ -1092,6 +1093,8 @@ struct nand_chip {
 	int (*setup_read_retry)(struct nand_chip *chip, int retry_mode);
 	int (*erase)(struct nand_chip *chip, struct erase_info *instr,
 		     int allowbbt);
+	int (*write_oob)(struct nand_chip *chip, loff_t to,
+			 struct mtd_oob_ops *ops);
 
 	unsigned int options;
 	unsigned int bbt_options;
-- 
2.20.1


______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-03 11:40 [PATCH v2 0/3] Fix proposal for the Micron shallow erase issue Miquel Raynal
  2020-05-03 11:40 ` [PATCH v2 1/3] mtd: rawnand: Add the nand_chip->erase hook Miquel Raynal
  2020-05-03 11:40 ` [PATCH v2 2/3] mtd: rawnand: Add the nand_chip->write_oob hook Miquel Raynal
@ 2020-05-03 11:40 ` Miquel Raynal
  2020-05-03 16:10   ` Steve deRosier
  2020-05-06  8:28   ` [EXT] " Bean Huo (beanhuo)
  2 siblings, 2 replies; 31+ messages in thread
From: Miquel Raynal @ 2020-05-03 11:40 UTC (permalink / raw)
  To: Richard Weinberger, Vignesh Raghavendra, Tudor Ambarus, linux-mtd
  Cc: Miquel Raynal, Zoltan Szubbocsev, Thomas Petazzoni,
	Boris Brezillon, tglx, Piotr Wojtaszczyk, Bean Huo

With recent SLC NANDs, Micron admits that a "shallow erase" issue may
be observable. It is actually the chip itself not doing a correct
erase operation because of its internal machinery stating that the
pages have not been programmed. Micron told us that there is a way to
workaround this issue: ensure that all the odd pages in the 16 first
ones of each block to erase have been fully written.

To avoid a very big performance drawback by re-writting all the pages
for each erase operation, the fix proposed here overloads the ->erase
and ->write_oob hooks to count the pages actually written at runtime
and avoid re-writting them if not needed.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
---
 drivers/mtd/nand/raw/nand_micron.c | 121 +++++++++++++++++++++++++++++
 1 file changed, 121 insertions(+)

diff --git a/drivers/mtd/nand/raw/nand_micron.c b/drivers/mtd/nand/raw/nand_micron.c
index 56654030ec7f..a9afd1b9a9e8 100644
--- a/drivers/mtd/nand/raw/nand_micron.c
+++ b/drivers/mtd/nand/raw/nand_micron.c
@@ -36,6 +36,15 @@
 #define NAND_ECC_STATUS_1_3_CORRECTED	BIT(4)
 #define NAND_ECC_STATUS_7_8_CORRECTED	(BIT(4) | BIT(3))
 
+/*
+ * Micron SLC chips are subject to a shallow erase issue: if the first
+ * pages of a block have not enough bytes programmed, the internal
+ * machinery might declare the block empty and skip the actual erase
+ * operation. This is the number of pages we check by software.
+ */
+#define MICRON_SHALLOW_ERASE_MIN_PAGE 16
+#define MICRON_PAGE_MASK_TRIGGER GENMASK(MICRON_SHALLOW_ERASE_MIN_PAGE, 0)
+
 struct nand_onfi_vendor_micron {
 	u8 two_plane_read;
 	u8 read_cache;
@@ -64,6 +73,7 @@ struct micron_on_die_ecc {
 
 struct micron_nand {
 	struct micron_on_die_ecc ecc;
+	u16 *writtenp;
 };
 
 static int micron_nand_setup_read_retry(struct nand_chip *chip, int retry_mode)
@@ -429,6 +439,106 @@ static int micron_supports_on_die_ecc(struct nand_chip *chip)
 	return MICRON_ON_DIE_SUPPORTED;
 }
 
+static int micron_nand_avoid_shallow_erase(struct nand_chip *chip,
+					   unsigned int eb)
+{
+	struct micron_nand *micron = nand_get_manufacturer_data(chip);
+	unsigned int page = eb * nanddev_pages_per_eraseblock(&chip->base);
+	u8 *databuf = nand_get_data_buf(chip);
+	int ret, i;
+
+	memset(databuf, 0x00, nanddev_page_size(&chip->base));
+
+	/* Micron advises to only write the first 8 odd pages, counting from 1 */
+	for (i = 0; i < MICRON_SHALLOW_ERASE_MIN_PAGE; i += 2, page += 2) {
+		if (!(micron->writtenp[eb] & BIT(i))) {
+			ret = nand_write_page_raw(chip, databuf, false, page);
+			if (ret)
+				return ret;
+		}
+	}
+
+	return 0;
+}
+
+static int micron_nand_erase(struct nand_chip *chip, struct erase_info *instr,
+			int allowbbt)
+{
+	struct micron_nand *micron = nand_get_manufacturer_data(chip);
+	unsigned int eb_sz = nanddev_eraseblock_size(&chip->base);
+	unsigned int first_eb = DIV_ROUND_DOWN_ULL(instr->addr, eb_sz);
+	unsigned int nb_eb = DIV_ROUND_UP_ULL(instr->len, eb_sz);
+	unsigned int eb;
+
+	if (!micron)
+		return -EINVAL;
+
+	/*
+	 * Check that enough pages have been written in each block.
+	 * If not, write them before actually erasing.
+	 */
+	for (eb = first_eb; eb < first_eb + nb_eb; eb++) {
+		/* Il all the first pages are not written yet, do it */
+		if (micron->writtenp[eb] != MICRON_PAGE_MASK_TRIGGER)
+			micron_nand_avoid_shallow_erase(chip, eb);
+
+		micron->writtenp[eb] = 0;
+	}
+
+	return nand_erase_nand(chip, instr, allowbbt);
+}
+static int micron_nand_write_oob(struct nand_chip *chip, loff_t to,
+				 struct mtd_oob_ops *ops)
+{
+	struct micron_nand *micron = nand_get_manufacturer_data(chip);
+	unsigned int eb_sz = nanddev_eraseblock_size(&chip->base);
+	unsigned int p_sz = nanddev_page_size(&chip->base);
+	unsigned int ppeb = nanddev_pages_per_eraseblock(&chip->base);
+	unsigned int nb_p_tot = ops->len / p_sz;
+	unsigned int first_eb = DIV_ROUND_DOWN_ULL(to, eb_sz);
+	unsigned int first_p = DIV_ROUND_UP_ULL(to - (first_eb * eb_sz), p_sz);
+	unsigned int nb_eb = DIV_ROUND_UP_ULL(first_p + nb_p_tot, ppeb);
+	unsigned int remaining_p, eb, nb_p;
+	int ret;
+
+	ret = nand_write_oob_nand(chip, to, ops);
+	if (ret || (ops->len != ops->retlen))
+		return ret;
+
+	/* Mark the last pages of the first erase block to write */
+	nb_p = min(nb_p_tot, ppeb - first_p);
+	micron->writtenp[first_eb] |= GENMASK(first_p + nb_p, first_p) &
+				      MICRON_PAGE_MASK_TRIGGER;
+	remaining_p = nb_p_tot - nb_p;
+
+	/* Mark all the pages of all "in-the-middle" erase blocks */
+	for (eb = first_eb + 1; eb < first_eb + nb_eb - 1; eb++) {
+		micron->writtenp[eb] |= MICRON_PAGE_MASK_TRIGGER;
+		remaining_p -= ppeb;
+	}
+
+	/* Mark the first pages of the last erase block to write */
+	if (remaining_p)
+		micron->writtenp[eb] |= GENMASK(remaining_p - 1, 0) &
+					MICRON_PAGE_MASK_TRIGGER;
+
+	return 0;
+}
+
+static bool micron_nand_with_shallow_erase_issue(struct nand_chip *chip)
+{
+	/*
+	 * The shallow erase issue has been observed with MT29F*G*A
+	 * parts but Micron suspects that the issue can happen with
+	 * almost all recent SLC but at such a low probability that it
+	 * is almost invisible. Nevertheless, as we mitigate the
+	 * performance penalty at runtime by following the number of
+	 * written pages in a block before erasing it, we may want to
+	 * enable this fix by default.
+	 */
+	return nand_is_slc(chip);
+}
+
 static int micron_nand_init(struct nand_chip *chip)
 {
 	struct mtd_info *mtd = nand_to_mtd(chip);
@@ -513,6 +623,17 @@ static int micron_nand_init(struct nand_chip *chip)
 		}
 	}
 
+	if (micron_nand_with_shallow_erase_issue(chip)) {
+		micron->writtenp = kzalloc(sizeof(u16) *
+					   nanddev_neraseblocks(&chip->base),
+					   GFP_KERNEL);
+		if (!micron->writtenp)
+			goto err_free_manuf_data;
+
+		chip->erase = micron_nand_erase;
+		chip->write_oob = micron_nand_write_oob;
+	}
+
 	return 0;
 
 err_free_manuf_data:
-- 
2.20.1


______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] mtd: rawnand: Add the nand_chip->erase hook
  2020-05-03 11:40 ` [PATCH v2 1/3] mtd: rawnand: Add the nand_chip->erase hook Miquel Raynal
@ 2020-05-03 15:01   ` Boris Brezillon
  0 siblings, 0 replies; 31+ messages in thread
From: Boris Brezillon @ 2020-05-03 15:01 UTC (permalink / raw)
  To: Miquel Raynal
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Zoltan Szubbocsev, linux-mtd, Thomas Petazzoni, tglx,
	Piotr Wojtaszczyk, Bean Huo

On Sun,  3 May 2020 13:40:27 +0200
Miquel Raynal <miquel.raynal@bootlin.com> wrote:

> In order to solve an issue with Micron NANDs, we must be able to
> overload the erase operation. With this in mind, we create a ->erase
> hook in the nand_chip structure which points by default to the
> currently in use nand_erase_nand() helper.
> 
> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
> ---
>  drivers/mtd/nand/raw/nand_base.c | 6 +++++-
>  include/linux/mtd/rawnand.h      | 3 +++
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
> index c24e5e2ba130..7c7ac722d88b 100644
> --- a/drivers/mtd/nand/raw/nand_base.c
> +++ b/drivers/mtd/nand/raw/nand_base.c
> @@ -4158,7 +4158,9 @@ static int nand_write_oob(struct mtd_info *mtd, loff_t to,
>   */
>  static int nand_erase(struct mtd_info *mtd, struct erase_info *instr)
>  {
> -	return nand_erase_nand(mtd_to_nand(mtd), instr, 0);
> +	struct nand_chip *chip = mtd_to_nand(mtd);
> +
> +	return chip->erase(chip, instr, 0);
>  }
>  
>  /**
> @@ -4419,6 +4421,8 @@ static void nand_set_defaults(struct nand_chip *chip)
>  
>  	if (!chip->buf_align)
>  		chip->buf_align = 1;
> +
> +	chip->erase = nand_erase_nand;
>  }
>  
>  /* Sanitize ONFI strings so we can safely print them */
> diff --git a/include/linux/mtd/rawnand.h b/include/linux/mtd/rawnand.h
> index 1e76196f9829..505c13f7a2ba 100644
> --- a/include/linux/mtd/rawnand.h
> +++ b/include/linux/mtd/rawnand.h
> @@ -1020,6 +1020,7 @@ struct nand_legacy {
>   *			avoid using them.
>   * @setup_read_retry:	[FLASHSPECIFIC] flash (vendor) specific function for
>   *			setting the read-retry mode. Mostly needed for MLC NAND.
> + * @erase:		Raw NAND erase operation.
>   * @ecc:		[BOARDSPECIFIC] ECC control structure
>   * @buf_align:		minimum buffer alignment required by a platform
>   * @oob_poi:		"poison value buffer," used for laying out OOB data
> @@ -1089,6 +1090,8 @@ struct nand_chip {
>  	struct nand_legacy legacy;
>  
>  	int (*setup_read_retry)(struct nand_chip *chip, int retry_mode);
> +	int (*erase)(struct nand_chip *chip, struct erase_info *instr,
> +		     int allowbbt);
>  

Before we do that, can we create a nand_chip_ops struct and place all
NAND-specific ops in there?

Regarding the erase hook itself, I wonder if it wouldn't be simpler to
have optional {pre,post}_erase() hooks instead (I think makes even more
sense for the ->write_oob() hook).

>  	unsigned int options;
>  	unsigned int bbt_options;


______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mtd: rawnand: Add the nand_chip->write_oob hook
  2020-05-03 11:40 ` [PATCH v2 2/3] mtd: rawnand: Add the nand_chip->write_oob hook Miquel Raynal
@ 2020-05-03 15:09   ` Boris Brezillon
  2020-05-03 17:02     ` Miquel Raynal
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Brezillon @ 2020-05-03 15:09 UTC (permalink / raw)
  To: Miquel Raynal
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Zoltan Szubbocsev, linux-mtd, Thomas Petazzoni, tglx,
	Piotr Wojtaszczyk, Bean Huo

On Sun,  3 May 2020 13:40:28 +0200
Miquel Raynal <miquel.raynal@bootlin.com> wrote:

> With the same approach as for the ->erase hook, in order to solve an
> issue with Micron NANDs, we must be able to overload the write
> operation. With this in mind, we create a ->write_oob hook in the
> nand_chip structure which points by default to the
> currently in use nand_write_oob() helper, renamed
> nand_write_oob_nand() for the parallel with the nand_erase_nand()
> one.

First of all, I must say I hate the hook name. Having mtd->_write_oob()
that writes both OOB and data is confusing, and you're pulling this
confusing name to the raw NAND layer.

> 
> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
> ---
>  drivers/mtd/nand/raw/internals.h | 2 ++
>  drivers/mtd/nand/raw/nand_base.c | 8 ++++++++
>  include/linux/mtd/rawnand.h      | 3 +++
>  3 files changed, 13 insertions(+)
> 
> diff --git a/drivers/mtd/nand/raw/internals.h b/drivers/mtd/nand/raw/internals.h
> index 9d0caadf940e..caf534a6586a 100644
> --- a/drivers/mtd/nand/raw/internals.h
> +++ b/drivers/mtd/nand/raw/internals.h
> @@ -81,6 +81,8 @@ int nand_bbm_get_next_page(struct nand_chip *chip, int page);
>  int nand_markbad_bbm(struct nand_chip *chip, loff_t ofs);
>  int nand_erase_nand(struct nand_chip *chip, struct erase_info *instr,
>  		    int allowbbt);
> +int nand_write_oob_nand(struct nand_chip *chip, loff_t to,
> +			struct mtd_oob_ops *ops);
>  int onfi_fill_data_interface(struct nand_chip *chip,
>  			     enum nand_data_interface_type type,
>  			     int timing_mode);
> diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
> index 7c7ac722d88b..f9cf30949f49 100644
> --- a/drivers/mtd/nand/raw/nand_base.c
> +++ b/drivers/mtd/nand/raw/nand_base.c
> @@ -4121,6 +4121,13 @@ static int nand_write_oob(struct mtd_info *mtd, loff_t to,
>  			  struct mtd_oob_ops *ops)
>  {
>  	struct nand_chip *chip = mtd_to_nand(mtd);
> +
> +	return chip->write_oob(chip, to, ops);
> +}
> +
> +int nand_write_oob_nand(struct nand_chip *chip, loff_t to,
> +			struct mtd_oob_ops *ops)

Hm, what happens next time we have a similar name, do we suffix it with
_nand_nand? :P

> +{
>  	int ret;
>  
>  	ops->retlen = 0;
> @@ -4423,6 +4430,7 @@ static void nand_set_defaults(struct nand_chip *chip)
>  		chip->buf_align = 1;
>  
>  	chip->erase = nand_erase_nand;
> +	chip->write_oob = nand_write_oob_nand;
>  }
>  
>  /* Sanitize ONFI strings so we can safely print them */
> diff --git a/include/linux/mtd/rawnand.h b/include/linux/mtd/rawnand.h
> index 505c13f7a2ba..7fbbd5d7088f 100644
> --- a/include/linux/mtd/rawnand.h
> +++ b/include/linux/mtd/rawnand.h
> @@ -1021,6 +1021,7 @@ struct nand_legacy {
>   * @setup_read_retry:	[FLASHSPECIFIC] flash (vendor) specific function for
>   *			setting the read-retry mode. Mostly needed for MLC NAND.
>   * @erase:		Raw NAND erase operation.
> + * @write_oob:		Raw NAND write operation.
>   * @ecc:		[BOARDSPECIFIC] ECC control structure
>   * @buf_align:		minimum buffer alignment required by a platform
>   * @oob_poi:		"poison value buffer," used for laying out OOB data
> @@ -1092,6 +1093,8 @@ struct nand_chip {
>  	int (*setup_read_retry)(struct nand_chip *chip, int retry_mode);
>  	int (*erase)(struct nand_chip *chip, struct erase_info *instr,
>  		     int allowbbt);
> +	int (*write_oob)(struct nand_chip *chip, loff_t to,
> +			 struct mtd_oob_ops *ops);
>  

Okay, so I'm not sure duplicating the nand_write_oob() logic is the
best option here. I'd rather go for a post write_page() hook.

Note that we probably want a post read_page() hook so we can flag
pages as written by analyzing what's returned to the caller. That would
saves us unneeded writes when the page has been read.

>  	unsigned int options;
>  	unsigned int bbt_options;


______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-03 11:40 ` [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue Miquel Raynal
@ 2020-05-03 16:10   ` Steve deRosier
  2020-05-03 16:34     ` Boris Brezillon
  2020-05-03 16:36     ` Miquel Raynal
  2020-05-06  8:28   ` [EXT] " Bean Huo (beanhuo)
  1 sibling, 2 replies; 31+ messages in thread
From: Steve deRosier @ 2020-05-03 16:10 UTC (permalink / raw)
  To: Miquel Raynal
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Zoltan Szubbocsev, linux-mtd, Thomas Petazzoni, Boris Brezillon,
	tglx, Piotr Wojtaszczyk, Bean Huo

On Sun, May 3, 2020 at 4:41 AM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>
> With recent SLC NANDs, Micron admits that a "shallow erase" issue may
> be observable. It is actually the chip itself not doing a correct
> erase operation because of its internal machinery stating that the
> pages have not been programmed. Micron told us that there is a way to
> workaround this issue: ensure that all the odd pages in the 16 first
> ones of each block to erase have been fully written.
>
> To avoid a very big performance drawback by re-writting all the pages
> for each erase operation, the fix proposed here overloads the ->erase
> and ->write_oob hooks to count the pages actually written at runtime
> and avoid re-writting them if not needed.
>
> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
> ---
>  drivers/mtd/nand/raw/nand_micron.c | 121 +++++++++++++++++++++++++++++
>  1 file changed, 121 insertions(+)
>
> diff --git a/drivers/mtd/nand/raw/nand_micron.c b/drivers/mtd/nand/raw/nand_micron.c
> index 56654030ec7f..a9afd1b9a9e8 100644
> --- a/drivers/mtd/nand/raw/nand_micron.c
> +++ b/drivers/mtd/nand/raw/nand_micron.c
> @@ -36,6 +36,15 @@
>  #define NAND_ECC_STATUS_1_3_CORRECTED  BIT(4)
>  #define NAND_ECC_STATUS_7_8_CORRECTED  (BIT(4) | BIT(3))
>
> +/*
> + * Micron SLC chips are subject to a shallow erase issue: if the first
> + * pages of a block have not enough bytes programmed, the internal
> + * machinery might declare the block empty and skip the actual erase
> + * operation. This is the number of pages we check by software.
> + */
> +#define MICRON_SHALLOW_ERASE_MIN_PAGE 16
> +#define MICRON_PAGE_MASK_TRIGGER GENMASK(MICRON_SHALLOW_ERASE_MIN_PAGE, 0)
> +
>  struct nand_onfi_vendor_micron {
>         u8 two_plane_read;
>         u8 read_cache;
> @@ -64,6 +73,7 @@ struct micron_on_die_ecc {
>
>  struct micron_nand {
>         struct micron_on_die_ecc ecc;
> +       u16 *writtenp;
>  };
>
>  static int micron_nand_setup_read_retry(struct nand_chip *chip, int retry_mode)
> @@ -429,6 +439,106 @@ static int micron_supports_on_die_ecc(struct nand_chip *chip)
>         return MICRON_ON_DIE_SUPPORTED;
>  }
>
> +static int micron_nand_avoid_shallow_erase(struct nand_chip *chip,
> +                                          unsigned int eb)
> +{
> +       struct micron_nand *micron = nand_get_manufacturer_data(chip);
> +       unsigned int page = eb * nanddev_pages_per_eraseblock(&chip->base);
> +       u8 *databuf = nand_get_data_buf(chip);
> +       int ret, i;
> +
> +       memset(databuf, 0x00, nanddev_page_size(&chip->base));
> +
> +       /* Micron advises to only write the first 8 odd pages, counting from 1 */
> +       for (i = 0; i < MICRON_SHALLOW_ERASE_MIN_PAGE; i += 2, page += 2) {
> +               if (!(micron->writtenp[eb] & BIT(i))) {
> +                       ret = nand_write_page_raw(chip, databuf, false, page);
> +                       if (ret)
> +                               return ret;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +static int micron_nand_erase(struct nand_chip *chip, struct erase_info *instr,
> +                       int allowbbt)
> +{
> +       struct micron_nand *micron = nand_get_manufacturer_data(chip);
> +       unsigned int eb_sz = nanddev_eraseblock_size(&chip->base);
> +       unsigned int first_eb = DIV_ROUND_DOWN_ULL(instr->addr, eb_sz);
> +       unsigned int nb_eb = DIV_ROUND_UP_ULL(instr->len, eb_sz);
> +       unsigned int eb;
> +
> +       if (!micron)
> +               return -EINVAL;
> +
> +       /*
> +        * Check that enough pages have been written in each block.
> +        * If not, write them before actually erasing.
> +        */
> +       for (eb = first_eb; eb < first_eb + nb_eb; eb++) {
> +               /* Il all the first pages are not written yet, do it */
> +               if (micron->writtenp[eb] != MICRON_PAGE_MASK_TRIGGER)
> +                       micron_nand_avoid_shallow_erase(chip, eb);
> +
> +               micron->writtenp[eb] = 0;
> +       }
> +
> +       return nand_erase_nand(chip, instr, allowbbt);
> +}
> +static int micron_nand_write_oob(struct nand_chip *chip, loff_t to,
> +                                struct mtd_oob_ops *ops)
> +{
> +       struct micron_nand *micron = nand_get_manufacturer_data(chip);
> +       unsigned int eb_sz = nanddev_eraseblock_size(&chip->base);
> +       unsigned int p_sz = nanddev_page_size(&chip->base);
> +       unsigned int ppeb = nanddev_pages_per_eraseblock(&chip->base);
> +       unsigned int nb_p_tot = ops->len / p_sz;
> +       unsigned int first_eb = DIV_ROUND_DOWN_ULL(to, eb_sz);
> +       unsigned int first_p = DIV_ROUND_UP_ULL(to - (first_eb * eb_sz), p_sz);
> +       unsigned int nb_eb = DIV_ROUND_UP_ULL(first_p + nb_p_tot, ppeb);
> +       unsigned int remaining_p, eb, nb_p;
> +       int ret;
> +
> +       ret = nand_write_oob_nand(chip, to, ops);
> +       if (ret || (ops->len != ops->retlen))
> +               return ret;
> +
> +       /* Mark the last pages of the first erase block to write */
> +       nb_p = min(nb_p_tot, ppeb - first_p);
> +       micron->writtenp[first_eb] |= GENMASK(first_p + nb_p, first_p) &
> +                                     MICRON_PAGE_MASK_TRIGGER;
> +       remaining_p = nb_p_tot - nb_p;
> +
> +       /* Mark all the pages of all "in-the-middle" erase blocks */
> +       for (eb = first_eb + 1; eb < first_eb + nb_eb - 1; eb++) {
> +               micron->writtenp[eb] |= MICRON_PAGE_MASK_TRIGGER;
> +               remaining_p -= ppeb;
> +       }
> +
> +       /* Mark the first pages of the last erase block to write */
> +       if (remaining_p)
> +               micron->writtenp[eb] |= GENMASK(remaining_p - 1, 0) &
> +                                       MICRON_PAGE_MASK_TRIGGER;
> +
> +       return 0;
> +}
> +
> +static bool micron_nand_with_shallow_erase_issue(struct nand_chip *chip)
> +{
> +       /*
> +        * The shallow erase issue has been observed with MT29F*G*A
> +        * parts but Micron suspects that the issue can happen with
> +        * almost all recent SLC but at such a low probability that it
> +        * is almost invisible. Nevertheless, as we mitigate the
> +        * performance penalty at runtime by following the number of
> +        * written pages in a block before erasing it, we may want to
> +        * enable this fix by default.
> +        */
> +       return nand_is_slc(chip);
> +}


Whoa, let's hold our horses here!  "almost all recent" would imply
that older SLCs aren't affected. And the likelyhood that Micron will
fix newer parts is high - because why  would they leave in a major bug
like that in the next mask? So, what you're saying is when someone
goes to upgrade their older device's Linux they're going to take a
major filesystem performance hit without knowing it (because
realistically who reads 10,000s of patches before upgrading) when
their chip doesn't need it. Because we're too lazy to get the list
from Micron and code that ugliness?

We put this in and the resulting discussions for embedded systems
designers for the next decade are going to be one of two things:
* Oh, you want to use that SLC NAND from Micron? Well then don't use
Linux because it performs crappy on Micron SLC NANDs.
OR
* Oh, you want to use Linux? Well, don't use a Micron SLC NAND then
because they perform crappy on Linux.

Let's get a list of all chip that have this bug (and let's be clear -
it's a bug, not a "quirk") and enable it for those chips specifically.
Even better if there was something in the chipinfo itself that made it
obvious which ones had the problem (because realistically it's
probably specific to a particular geometry). In any case, it's in the
best interest of Micron to identify to us exactly which chips have or
are likely to have this issue and for us to be specific on which get
assigned this quirk. It is probably listed in an errata app-note, and
if not it should be.

Strong NAK to defaulting all Micron SLC NANDs to this - unless it
truly is the case that _all_ Micron SLC NANDs in the past and in the
future likely have this problem.



>
> +
>  static int micron_nand_init(struct nand_chip *chip)
>  {
>         struct mtd_info *mtd = nand_to_mtd(chip);
> @@ -513,6 +623,17 @@ static int micron_nand_init(struct nand_chip *chip)
>                 }
>         }
>
> +       if (micron_nand_with_shallow_erase_issue(chip)) {
> +               micron->writtenp = kzalloc(sizeof(u16) *
> +                                          nanddev_neraseblocks(&chip->base),
> +                                          GFP_KERNEL);
> +               if (!micron->writtenp)
> +                       goto err_free_manuf_data;
> +
> +               chip->erase = micron_nand_erase;
> +               chip->write_oob = micron_nand_write_oob;
> +       }
> +
>         return 0;
>
>  err_free_manuf_data:
> --
> 2.20.1
>
>

- Steve

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-03 16:10   ` Steve deRosier
@ 2020-05-03 16:34     ` Boris Brezillon
  2020-05-03 16:36     ` Miquel Raynal
  1 sibling, 0 replies; 31+ messages in thread
From: Boris Brezillon @ 2020-05-03 16:34 UTC (permalink / raw)
  To: Steve deRosier
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Zoltan Szubbocsev, linux-mtd, Thomas Petazzoni, Miquel Raynal,
	tglx, Piotr Wojtaszczyk, Bean Huo

Hello Steve,

On Sun, 3 May 2020 09:10:15 -0700
Steve deRosier <derosier@gmail.com> wrote:

> > +static bool micron_nand_with_shallow_erase_issue(struct nand_chip *chip)
> > +{
> > +       /*
> > +        * The shallow erase issue has been observed with MT29F*G*A
> > +        * parts but Micron suspects that the issue can happen with
> > +        * almost all recent SLC but at such a low probability that it
> > +        * is almost invisible. Nevertheless, as we mitigate the
> > +        * performance penalty at runtime by following the number of
> > +        * written pages in a block before erasing it, we may want to
> > +        * enable this fix by default.
> > +        */
> > +       return nand_is_slc(chip);
> > +}  
> 
> 
> Whoa, let's hold our horses here!  "almost all recent" would imply
> that older SLCs aren't affected. And the likelyhood that Micron will
> fix newer parts is high - because why  would they leave in a major bug
> like that in the next mask? So, what you're saying is when someone
> goes to upgrade their older device's Linux they're going to take a
> major filesystem performance hit without knowing it (because
> realistically who reads 10,000s of patches before upgrading) when
> their chip doesn't need it.

I do agree with what you say, but... (see below).

> Because we're too lazy to get the list from Micron and code that ugliness?

Too lazy to get the list from Micron?! I can tell you we've tried hard
and they've always been reluctant to give us such a list, or a
discriminant to identify those buggy parts. They even tried to
convince us it was not a bug but a problem that's inherent to any NAND
flashes, not only theirs. They didn't go as far as claiming this was a
feature, but given the energy they spent to deny the problem I
honestly thought they would. So no, it's definitely not what you think,
and I was hoping that threatening them to merge that patch upstream
would force them to provide us this information. Looks like it never
happened.

Maybe those that had to debug those weird/hardly reproducible issues
should speak up, because that's no fun thing to spend weeks/months
chasing such bugs just to discover that Micron knows about the issue and
can provide a fix if you ask them.

> 
> We put this in and the resulting discussions for embedded systems
> designers for the next decade are going to be one of two things:
> * Oh, you want to use that SLC NAND from Micron? Well then don't use
> Linux because it performs crappy on Micron SLC NANDs.
> OR
> * Oh, you want to use Linux? Well, don't use a Micron SLC NAND then
> because they perform crappy on Linux.
> 
> Let's get a list of all chip that have this bug (and let's be clear -
> it's a bug, not a "quirk") and enable it for those chips specifically.
> Even better if there was something in the chipinfo itself that made it
> obvious which ones had the problem (because realistically it's
> probably specific to a particular geometry). In any case, it's in the
> best interest of Micron to identify to us exactly which chips have or
> are likely to have this issue and for us to be specific on which get
> assigned this quirk. It is probably listed in an errata app-note, and
> if not it should be.
> 
> Strong NAK to defaulting all Micron SLC NANDs to this - unless it
> truly is the case that _all_ Micron SLC NANDs in the past and in the
> future likely have this problem.
> 

I honestly don't have a good solution here. I guess we could blacklist
flashes one by one when people report weird issues, but when they
discover the problem is already too late, and they have plenty of units
in the wild.

Regards,

Boris

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-03 16:10   ` Steve deRosier
  2020-05-03 16:34     ` Boris Brezillon
@ 2020-05-03 16:36     ` Miquel Raynal
  2020-05-03 19:57       ` Steve deRosier
  1 sibling, 1 reply; 31+ messages in thread
From: Miquel Raynal @ 2020-05-03 16:36 UTC (permalink / raw)
  To: Steve deRosier
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Zoltan Szubbocsev, linux-mtd, Thomas Petazzoni, Boris Brezillon,
	tglx, Piotr Wojtaszczyk, Bean Huo

Hi Steve,

Steve deRosier <derosier@gmail.com> wrote on Sun, 3 May 2020 09:10:15
-0700:

> On Sun, May 3, 2020 at 4:41 AM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >
> > With recent SLC NANDs, Micron admits that a "shallow erase" issue may
> > be observable. It is actually the chip itself not doing a correct
> > erase operation because of its internal machinery stating that the
> > pages have not been programmed. Micron told us that there is a way to
> > workaround this issue: ensure that all the odd pages in the 16 first
> > ones of each block to erase have been fully written.
> >
> > To avoid a very big performance drawback by re-writting all the pages
> > for each erase operation, the fix proposed here overloads the ->erase
> > and ->write_oob hooks to count the pages actually written at runtime
> > and avoid re-writting them if not needed.
> >
> > Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
> > ---
> >  drivers/mtd/nand/raw/nand_micron.c | 121 +++++++++++++++++++++++++++++
> >  1 file changed, 121 insertions(+)
> >
> > diff --git a/drivers/mtd/nand/raw/nand_micron.c b/drivers/mtd/nand/raw/nand_micron.c
> > index 56654030ec7f..a9afd1b9a9e8 100644
> > --- a/drivers/mtd/nand/raw/nand_micron.c
> > +++ b/drivers/mtd/nand/raw/nand_micron.c
> > @@ -36,6 +36,15 @@
> >  #define NAND_ECC_STATUS_1_3_CORRECTED  BIT(4)
> >  #define NAND_ECC_STATUS_7_8_CORRECTED  (BIT(4) | BIT(3))
> >
> > +/*
> > + * Micron SLC chips are subject to a shallow erase issue: if the first
> > + * pages of a block have not enough bytes programmed, the internal
> > + * machinery might declare the block empty and skip the actual erase
> > + * operation. This is the number of pages we check by software.
> > + */
> > +#define MICRON_SHALLOW_ERASE_MIN_PAGE 16
> > +#define MICRON_PAGE_MASK_TRIGGER GENMASK(MICRON_SHALLOW_ERASE_MIN_PAGE, 0)
> > +
> >  struct nand_onfi_vendor_micron {
> >         u8 two_plane_read;
> >         u8 read_cache;
> > @@ -64,6 +73,7 @@ struct micron_on_die_ecc {
> >
> >  struct micron_nand {
> >         struct micron_on_die_ecc ecc;
> > +       u16 *writtenp;
> >  };
> >
> >  static int micron_nand_setup_read_retry(struct nand_chip *chip, int retry_mode)
> > @@ -429,6 +439,106 @@ static int micron_supports_on_die_ecc(struct nand_chip *chip)
> >         return MICRON_ON_DIE_SUPPORTED;
> >  }
> >
> > +static int micron_nand_avoid_shallow_erase(struct nand_chip *chip,
> > +                                          unsigned int eb)
> > +{
> > +       struct micron_nand *micron = nand_get_manufacturer_data(chip);
> > +       unsigned int page = eb * nanddev_pages_per_eraseblock(&chip->base);
> > +       u8 *databuf = nand_get_data_buf(chip);
> > +       int ret, i;
> > +
> > +       memset(databuf, 0x00, nanddev_page_size(&chip->base));
> > +
> > +       /* Micron advises to only write the first 8 odd pages, counting from 1 */
> > +       for (i = 0; i < MICRON_SHALLOW_ERASE_MIN_PAGE; i += 2, page += 2) {
> > +               if (!(micron->writtenp[eb] & BIT(i))) {
> > +                       ret = nand_write_page_raw(chip, databuf, false, page);
> > +                       if (ret)
> > +                               return ret;
> > +               }
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int micron_nand_erase(struct nand_chip *chip, struct erase_info *instr,
> > +                       int allowbbt)
> > +{
> > +       struct micron_nand *micron = nand_get_manufacturer_data(chip);
> > +       unsigned int eb_sz = nanddev_eraseblock_size(&chip->base);
> > +       unsigned int first_eb = DIV_ROUND_DOWN_ULL(instr->addr, eb_sz);
> > +       unsigned int nb_eb = DIV_ROUND_UP_ULL(instr->len, eb_sz);
> > +       unsigned int eb;
> > +
> > +       if (!micron)
> > +               return -EINVAL;
> > +
> > +       /*
> > +        * Check that enough pages have been written in each block.
> > +        * If not, write them before actually erasing.
> > +        */
> > +       for (eb = first_eb; eb < first_eb + nb_eb; eb++) {
> > +               /* Il all the first pages are not written yet, do it */
> > +               if (micron->writtenp[eb] != MICRON_PAGE_MASK_TRIGGER)
> > +                       micron_nand_avoid_shallow_erase(chip, eb);
> > +
> > +               micron->writtenp[eb] = 0;
> > +       }
> > +
> > +       return nand_erase_nand(chip, instr, allowbbt);
> > +}
> > +static int micron_nand_write_oob(struct nand_chip *chip, loff_t to,
> > +                                struct mtd_oob_ops *ops)
> > +{
> > +       struct micron_nand *micron = nand_get_manufacturer_data(chip);
> > +       unsigned int eb_sz = nanddev_eraseblock_size(&chip->base);
> > +       unsigned int p_sz = nanddev_page_size(&chip->base);
> > +       unsigned int ppeb = nanddev_pages_per_eraseblock(&chip->base);
> > +       unsigned int nb_p_tot = ops->len / p_sz;
> > +       unsigned int first_eb = DIV_ROUND_DOWN_ULL(to, eb_sz);
> > +       unsigned int first_p = DIV_ROUND_UP_ULL(to - (first_eb * eb_sz), p_sz);
> > +       unsigned int nb_eb = DIV_ROUND_UP_ULL(first_p + nb_p_tot, ppeb);
> > +       unsigned int remaining_p, eb, nb_p;
> > +       int ret;
> > +
> > +       ret = nand_write_oob_nand(chip, to, ops);
> > +       if (ret || (ops->len != ops->retlen))
> > +               return ret;
> > +
> > +       /* Mark the last pages of the first erase block to write */
> > +       nb_p = min(nb_p_tot, ppeb - first_p);
> > +       micron->writtenp[first_eb] |= GENMASK(first_p + nb_p, first_p) &
> > +                                     MICRON_PAGE_MASK_TRIGGER;
> > +       remaining_p = nb_p_tot - nb_p;
> > +
> > +       /* Mark all the pages of all "in-the-middle" erase blocks */
> > +       for (eb = first_eb + 1; eb < first_eb + nb_eb - 1; eb++) {
> > +               micron->writtenp[eb] |= MICRON_PAGE_MASK_TRIGGER;
> > +               remaining_p -= ppeb;
> > +       }
> > +
> > +       /* Mark the first pages of the last erase block to write */
> > +       if (remaining_p)
> > +               micron->writtenp[eb] |= GENMASK(remaining_p - 1, 0) &
> > +                                       MICRON_PAGE_MASK_TRIGGER;
> > +
> > +       return 0;
> > +}
> > +
> > +static bool micron_nand_with_shallow_erase_issue(struct nand_chip *chip)
> > +{
> > +       /*
> > +        * The shallow erase issue has been observed with MT29F*G*A
> > +        * parts but Micron suspects that the issue can happen with
> > +        * almost all recent SLC but at such a low probability that it
> > +        * is almost invisible. Nevertheless, as we mitigate the
> > +        * performance penalty at runtime by following the number of
> > +        * written pages in a block before erasing it, we may want to
> > +        * enable this fix by default.
> > +        */
> > +       return nand_is_slc(chip);
> > +}  
> 
> 
> Whoa, let's hold our horses here!  "almost all recent" would imply
> that older SLCs aren't affected. And the likelyhood that Micron will
> fix newer parts is high - because why  would they leave in a major bug
> like that in the next mask? So, what you're saying is when someone
> goes to upgrade their older device's Linux they're going to take a
> major filesystem performance hit without knowing it (because
> realistically who reads 10,000s of patches before upgrading) when
> their chip doesn't need it. Because we're too lazy to get the list
> from Micron and code that ugliness?

Well, that's where I strongly disagree. I know about this for
almost three years now. It took us this time to figure out:
1- what is impacted
2- why
3- what could work-around it

As you can see, we failed in #1 and trust me, we tried. By e-mail,
IRC, IRL. We tried hard. Bean and Zoltan in copy know about the
issue and they tried to minimize and hide it as much as they could,
lying shamelessly to us several times. Please do not call this
laziness.

> We put this in and the resulting discussions for embedded systems
> designers for the next decade are going to be one of two things:
> * Oh, you want to use that SLC NAND from Micron? Well then don't use
> Linux because it performs crappy on Micron SLC NANDs.
> OR
> * Oh, you want to use Linux? Well, don't use a Micron SLC NAND then
> because they perform crappy on Linux.
> 
> Let's get a list of all chip that have this bug (and let's be clear -
> it's a bug, not a "quirk") and enable it for those chips specifically.
> Even better if there was something in the chipinfo itself that made it
> obvious which ones had the problem (because realistically it's
> probably specific to a particular geometry). In any case, it's in the
> best interest of Micron to identify to us exactly which chips have or
> are likely to have this issue and for us to be specific on which get
> assigned this quirk. It is probably listed in an errata app-note, and
> if not it should be.
> 
> Strong NAK to defaulting all Micron SLC NANDs to this - unless it
> truly is the case that _all_ Micron SLC NANDs in the past and in the
> future likely have this problem.

I am open to alternatives.

Thanks,
Miquèl

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mtd: rawnand: Add the nand_chip->write_oob hook
  2020-05-03 15:09   ` Boris Brezillon
@ 2020-05-03 17:02     ` Miquel Raynal
  0 siblings, 0 replies; 31+ messages in thread
From: Miquel Raynal @ 2020-05-03 17:02 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Zoltan Szubbocsev, linux-mtd, Thomas Petazzoni, tglx,
	Piotr Wojtaszczyk, Bean Huo

Hi Boris,

> >  /* Sanitize ONFI strings so we can safely print them */
> > diff --git a/include/linux/mtd/rawnand.h b/include/linux/mtd/rawnand.h
> > index 505c13f7a2ba..7fbbd5d7088f 100644
> > --- a/include/linux/mtd/rawnand.h
> > +++ b/include/linux/mtd/rawnand.h
> > @@ -1021,6 +1021,7 @@ struct nand_legacy {
> >   * @setup_read_retry:	[FLASHSPECIFIC] flash (vendor) specific function for
> >   *			setting the read-retry mode. Mostly needed for MLC NAND.
> >   * @erase:		Raw NAND erase operation.
> > + * @write_oob:		Raw NAND write operation.
> >   * @ecc:		[BOARDSPECIFIC] ECC control structure
> >   * @buf_align:		minimum buffer alignment required by a platform
> >   * @oob_poi:		"poison value buffer," used for laying out OOB data
> > @@ -1092,6 +1093,8 @@ struct nand_chip {
> >  	int (*setup_read_retry)(struct nand_chip *chip, int retry_mode);
> >  	int (*erase)(struct nand_chip *chip, struct erase_info *instr,
> >  		     int allowbbt);
> > +	int (*write_oob)(struct nand_chip *chip, loff_t to,
> > +			 struct mtd_oob_ops *ops);
> >    
> 
> Okay, so I'm not sure duplicating the nand_write_oob() logic is the
> best option here. I'd rather go for a post write_page() hook.
> 
> Note that we probably want a post read_page() hook so we can flag
> pages as written by analyzing what's returned to the caller. That would
> saves us unneeded writes when the page has been read.

Ok, this is done the way you propose.

I will wait more feedback before resending this series. Especially on
the last patch which is crucial.

Thanks,
Miquèl

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-03 16:36     ` Miquel Raynal
@ 2020-05-03 19:57       ` Steve deRosier
  2020-05-06  8:37         ` [EXT] " Bean Huo (beanhuo)
  0 siblings, 1 reply; 31+ messages in thread
From: Steve deRosier @ 2020-05-03 19:57 UTC (permalink / raw)
  To: Miquel Raynal
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Zoltan Szubbocsev, linux-mtd, Thomas Petazzoni, Boris Brezillon,
	tglx, Piotr Wojtaszczyk, Bean Huo

Boris and Miquel,

On Sun, May 3, 2020 at 9:36 AM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>
> Hi Steve,
>
> Steve deRosier <derosier@gmail.com> wrote on Sun, 3 May 2020 09:10:15
> -0700:
>
> > On Sun, May 3, 2020 at 4:41 AM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> > >
> > > With recent SLC NANDs, Micron admits that a "shallow erase" issue may
> > > be observable. It is actually the chip itself not doing a correct
> > > erase operation because of its internal machinery stating that the
> > > pages have not been programmed. Micron told us that there is a way to
> > > workaround this issue: ensure that all the odd pages in the 16 first
> > > ones of each block to erase have been fully written.
> > >
> > > To avoid a very big performance drawback by re-writting all the pages
> > > for each erase operation, the fix proposed here overloads the ->erase
> > > and ->write_oob hooks to count the pages actually written at runtime
> > > and avoid re-writting them if not needed.
> > >
> > > Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
> > > ---
> > >  drivers/mtd/nand/raw/nand_micron.c | 121 +++++++++++++++++++++++++++++
> > >  1 file changed, 121 insertions(+)
> > >
> > > diff --git a/drivers/mtd/nand/raw/nand_micron.c b/drivers/mtd/nand/raw/nand_micron.c
> > > index 56654030ec7f..a9afd1b9a9e8 100644
> > > --- a/drivers/mtd/nand/raw/nand_micron.c
> > > +++ b/drivers/mtd/nand/raw/nand_micron.c
> > > @@ -36,6 +36,15 @@
> > >  #define NAND_ECC_STATUS_1_3_CORRECTED  BIT(4)
> > >  #define NAND_ECC_STATUS_7_8_CORRECTED  (BIT(4) | BIT(3))
> > >
> > > +/*
> > > + * Micron SLC chips are subject to a shallow erase issue: if the first
> > > + * pages of a block have not enough bytes programmed, the internal
> > > + * machinery might declare the block empty and skip the actual erase
> > > + * operation. This is the number of pages we check by software.
> > > + */
> > > +#define MICRON_SHALLOW_ERASE_MIN_PAGE 16
> > > +#define MICRON_PAGE_MASK_TRIGGER GENMASK(MICRON_SHALLOW_ERASE_MIN_PAGE, 0)
> > > +
> > >  struct nand_onfi_vendor_micron {
> > >         u8 two_plane_read;
> > >         u8 read_cache;
> > > @@ -64,6 +73,7 @@ struct micron_on_die_ecc {
> > >
> > >  struct micron_nand {
> > >         struct micron_on_die_ecc ecc;
> > > +       u16 *writtenp;
> > >  };
> > >
> > >  static int micron_nand_setup_read_retry(struct nand_chip *chip, int retry_mode)
> > > @@ -429,6 +439,106 @@ static int micron_supports_on_die_ecc(struct nand_chip *chip)
> > >         return MICRON_ON_DIE_SUPPORTED;
> > >  }
> > >
> > > +static int micron_nand_avoid_shallow_erase(struct nand_chip *chip,
> > > +                                          unsigned int eb)
> > > +{
> > > +       struct micron_nand *micron = nand_get_manufacturer_data(chip);
> > > +       unsigned int page = eb * nanddev_pages_per_eraseblock(&chip->base);
> > > +       u8 *databuf = nand_get_data_buf(chip);
> > > +       int ret, i;
> > > +
> > > +       memset(databuf, 0x00, nanddev_page_size(&chip->base));
> > > +
> > > +       /* Micron advises to only write the first 8 odd pages, counting from 1 */
> > > +       for (i = 0; i < MICRON_SHALLOW_ERASE_MIN_PAGE; i += 2, page += 2) {
> > > +               if (!(micron->writtenp[eb] & BIT(i))) {
> > > +                       ret = nand_write_page_raw(chip, databuf, false, page);
> > > +                       if (ret)
> > > +                               return ret;
> > > +               }
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static int micron_nand_erase(struct nand_chip *chip, struct erase_info *instr,
> > > +                       int allowbbt)
> > > +{
> > > +       struct micron_nand *micron = nand_get_manufacturer_data(chip);
> > > +       unsigned int eb_sz = nanddev_eraseblock_size(&chip->base);
> > > +       unsigned int first_eb = DIV_ROUND_DOWN_ULL(instr->addr, eb_sz);
> > > +       unsigned int nb_eb = DIV_ROUND_UP_ULL(instr->len, eb_sz);
> > > +       unsigned int eb;
> > > +
> > > +       if (!micron)
> > > +               return -EINVAL;
> > > +
> > > +       /*
> > > +        * Check that enough pages have been written in each block.
> > > +        * If not, write them before actually erasing.
> > > +        */
> > > +       for (eb = first_eb; eb < first_eb + nb_eb; eb++) {
> > > +               /* Il all the first pages are not written yet, do it */
> > > +               if (micron->writtenp[eb] != MICRON_PAGE_MASK_TRIGGER)
> > > +                       micron_nand_avoid_shallow_erase(chip, eb);
> > > +
> > > +               micron->writtenp[eb] = 0;
> > > +       }
> > > +
> > > +       return nand_erase_nand(chip, instr, allowbbt);
> > > +}
> > > +static int micron_nand_write_oob(struct nand_chip *chip, loff_t to,
> > > +                                struct mtd_oob_ops *ops)
> > > +{
> > > +       struct micron_nand *micron = nand_get_manufacturer_data(chip);
> > > +       unsigned int eb_sz = nanddev_eraseblock_size(&chip->base);
> > > +       unsigned int p_sz = nanddev_page_size(&chip->base);
> > > +       unsigned int ppeb = nanddev_pages_per_eraseblock(&chip->base);
> > > +       unsigned int nb_p_tot = ops->len / p_sz;
> > > +       unsigned int first_eb = DIV_ROUND_DOWN_ULL(to, eb_sz);
> > > +       unsigned int first_p = DIV_ROUND_UP_ULL(to - (first_eb * eb_sz), p_sz);
> > > +       unsigned int nb_eb = DIV_ROUND_UP_ULL(first_p + nb_p_tot, ppeb);
> > > +       unsigned int remaining_p, eb, nb_p;
> > > +       int ret;
> > > +
> > > +       ret = nand_write_oob_nand(chip, to, ops);
> > > +       if (ret || (ops->len != ops->retlen))
> > > +               return ret;
> > > +
> > > +       /* Mark the last pages of the first erase block to write */
> > > +       nb_p = min(nb_p_tot, ppeb - first_p);
> > > +       micron->writtenp[first_eb] |= GENMASK(first_p + nb_p, first_p) &
> > > +                                     MICRON_PAGE_MASK_TRIGGER;
> > > +       remaining_p = nb_p_tot - nb_p;
> > > +
> > > +       /* Mark all the pages of all "in-the-middle" erase blocks */
> > > +       for (eb = first_eb + 1; eb < first_eb + nb_eb - 1; eb++) {
> > > +               micron->writtenp[eb] |= MICRON_PAGE_MASK_TRIGGER;
> > > +               remaining_p -= ppeb;
> > > +       }
> > > +
> > > +       /* Mark the first pages of the last erase block to write */
> > > +       if (remaining_p)
> > > +               micron->writtenp[eb] |= GENMASK(remaining_p - 1, 0) &
> > > +                                       MICRON_PAGE_MASK_TRIGGER;
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static bool micron_nand_with_shallow_erase_issue(struct nand_chip *chip)
> > > +{
> > > +       /*
> > > +        * The shallow erase issue has been observed with MT29F*G*A
> > > +        * parts but Micron suspects that the issue can happen with
> > > +        * almost all recent SLC but at such a low probability that it
> > > +        * is almost invisible. Nevertheless, as we mitigate the
> > > +        * performance penalty at runtime by following the number of
> > > +        * written pages in a block before erasing it, we may want to
> > > +        * enable this fix by default.
> > > +        */
> > > +       return nand_is_slc(chip);
> > > +}
> >
> >
> > Whoa, let's hold our horses here!  "almost all recent" would imply
> > that older SLCs aren't affected. And the likelyhood that Micron will
> > fix newer parts is high - because why  would they leave in a major bug
> > like that in the next mask? So, what you're saying is when someone
> > goes to upgrade their older device's Linux they're going to take a
> > major filesystem performance hit without knowing it (because
> > realistically who reads 10,000s of patches before upgrading) when
> > their chip doesn't need it. Because we're too lazy to get the list
> > from Micron and code that ugliness?
>
> Well, that's where I strongly disagree. I know about this for
> almost three years now. It took us this time to figure out:
> 1- what is impacted
> 2- why
> 3- what could work-around it
>
> As you can see, we failed in #1 and trust me, we tried. By e-mail,
> IRC, IRL. We tried hard. Bean and Zoltan in copy know about the
> issue and they tried to minimize and hide it as much as they could,
> lying shamelessly to us several times. Please do not call this
> laziness.
>

I apologize.  And to clarify - I am not calling you (nor Boris, nor
the MTD community) lazy. As a long-time member of this list, I know
that's far from the case. Maybe more in this case Micron deserves to
be called lazy now that I understand the history. Though that I know
is an oversimplification too. In any case, perhaps _I_ was lazy with
my phrasing and by not looking up the history here.

Again, I apologize.

> > We put this in and the resulting discussions for embedded systems
> > designers for the next decade are going to be one of two things:
> > * Oh, you want to use that SLC NAND from Micron? Well then don't use
> > Linux because it performs crappy on Micron SLC NANDs.
> > OR
> > * Oh, you want to use Linux? Well, don't use a Micron SLC NAND then
> > because they perform crappy on Linux.
> >
> > Let's get a list of all chip that have this bug (and let's be clear -
> > it's a bug, not a "quirk") and enable it for those chips specifically.
> > Even better if there was something in the chipinfo itself that made it
> > obvious which ones had the problem (because realistically it's
> > probably specific to a particular geometry). In any case, it's in the
> > best interest of Micron to identify to us exactly which chips have or
> > are likely to have this issue and for us to be specific on which get
> > assigned this quirk. It is probably listed in an errata app-note, and
> > if not it should be.
> >
> > Strong NAK to defaulting all Micron SLC NANDs to this - unless it
> > truly is the case that _all_ Micron SLC NANDs in the past and in the
> > future likely have this problem.
>

From your description then, there's enough denial in Micron within
Micron that my statement does apply - that at least all future Micron
parts will have this problem. Simple logic -> if engineering won't
admit there is a problem, they're not going to get funding to fix the
problem. And it's not just going to go away in the next part, because
we all know that new part designs usually start by borrowing the old
part design and since you can't admit there's an issue it's likely the
problem will come along in the next part.

> I am open to alternatives.
>

This may sound extreme, but if the point here is to force the
manufacturer to give us the information that is required to support
their devices, I'd even suggest we pull the support and either drop
the driver entirely, move it to staging, or some other extremely
obvious move that says, "we are unable to properly support this
without the manufacturers help and the manufacturer doesn't care about
it's customers." Personally, when I upgrade my kernel, I want the
kernel to SHOUT to me that there's a problem with this driver rather
than have a performance issue that I might spend the next six months
trying to figure out why.

When I start a new project, one of the first things I do is check to
see if the parts that we want to use are supported by Linux. I'd like
it when I check if a Micron SLC NAND part is supported that it is
clear there's a big issue with that part. It showing up in staging
instead in the main tree would be a clear signal.

A message to Micron and other chip vendors:  Do not attempt to hide
your bugs, we will find out and hiding this stuff simply pisses us
off. Bugs happen, we get it, expect it, and can work around or fix
them. Responsible companies do not dismiss their users and will
quickly and actively investigate reported problems and then disclose
via errata. This used to be common, even as little as 10 or 20 years
ago - there was rarely a chip that I'd use that didn't have a few
errata documents associated with it. Now, good luck getting them (even
if you can get a datasheet).

We in the embedded Linux community want to work with the vendors to
get good solid upstream support. And vendors, seriously, why the hell
are you resisting it? You've got a large number of people not on your
payroll that are invested in spending time in making you succeed
because it makes things better for them too. And note the use of the
word "upstream" - I no longer use vendor drivers. They're generally of
horribly low quality and either make me stuck on a particular 8 year
old kernel or require me to rewrite them for current versions anyway.

Tomorrow I'm calling my CMs and will start the process to remove
Micron as an approved vendor and will start qualifying alternatives.
While AFAIK (and I intend to find out for sure), we don't have a
problem with the parts we've been using, I don't want to work with a
company that is technically dishonest like Micron.

To Miquel - I appreciate your work in finding and working around the
issue.  And I appreciate the pragmatic approach to fixing it.  But
perhaps it's time to say enough is enough.

I withdraw my NAK (for what it's worth), but can we find something to
very clearly mark "you may be using a broken device, but we don't know
because Micro won't tell us and so we've conservitavely setup a
workaround for all devices that affects performance"?

I'll suggest several alternative things, from most forceful to least:
1. Drop the driver.
2. Move the driver to staging.
3. Make it a kconfig option that is on by default, with strong wording
along the way of "if you know you're using a device without this
problem, you can turn this off".
4. Shove a printk of warning level in there so it's clear in the logs
for those of use who look at the logs when we upgrade.
5. Keep it on by default, and add a white-list of known devices that
are OK to not use it. Add a big comment of why (which is there) and
instructions of how to test if your device has the problem and that if
you test and know it's clean how to send a patch to add to the
whitelist.

At a minimum, I'd like to see a combo of #3 and #4.

Thanks,
- Steve

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-03 11:40 ` [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue Miquel Raynal
  2020-05-03 16:10   ` Steve deRosier
@ 2020-05-06  8:28   ` Bean Huo (beanhuo)
  2020-05-06  8:45     ` Boris Brezillon
  1 sibling, 1 reply; 31+ messages in thread
From: Bean Huo (beanhuo) @ 2020-05-06  8:28 UTC (permalink / raw)
  To: Miquel Raynal, Richard Weinberger, Vignesh Raghavendra,
	Tudor Ambarus, linux-mtd, Steve deRosier
  Cc: Zoltan Szubbocsev (zszubbocsev),
	Piotr Wojtaszczyk, tglx, Boris Brezillon, Thomas Petazzoni

Hi, Miquel
I have two questions about your patch, please help me. 

> +	 */
> +	for (eb = first_eb; eb < first_eb + nb_eb; eb++) {
> +		/* Il all the first pages are not written yet, do it */
> +		if (micron->writtenp[eb] != MICRON_PAGE_MASK_TRIGGER)
> +			micron_nand_avoid_shallow_erase(chip, eb);
> +
> +		micron->writtenp[eb] = 0;
> +	}


Here, if the power loss happens before erasing this block, for the next time boot up,
What will happen from FS layer in case FS detect this filled data?

> +
> +	return nand_erase_nand(chip, instr, allowbbt); 
>+ }

 static int
> +micron_nand_write_oob(struct nand_chip *chip, loff_t to,
> +				 struct mtd_oob_ops *ops)
> +{
> +	struct micron_nand *micron = nand_get_manufacturer_data(chip);
> +	unsigned int eb_sz = nanddev_eraseblock_size(&chip->base);
> +	unsigned int p_sz = nanddev_page_size(&chip->base);
> +	unsigned int ppeb = nanddev_pages_per_eraseblock(&chip->base);
> +	unsigned int nb_p_tot = ops->len / p_sz;
> +	unsigned int first_eb = DIV_ROUND_DOWN_ULL(to, eb_sz);
> +	unsigned int first_p = DIV_ROUND_UP_ULL(to - (first_eb * eb_sz), p_sz);
> +	unsigned int nb_eb = DIV_ROUND_UP_ULL(first_p + nb_p_tot, ppeb);
> +	unsigned int remaining_p, eb, nb_p;
> +	int ret;
> +
> +	ret = nand_write_oob_nand(chip, to, ops);
> +	if (ret || (ops->len != ops->retlen))
> +		return ret;
> +
> +	/* Mark the last pages of the first erase block to write */
> +	nb_p = min(nb_p_tot, ppeb - first_p);
> +	micron->writtenp[first_eb] |= GENMASK(first_p + nb_p, first_p) &
> +				      MICRON_PAGE_MASK_TRIGGER;
> +	remaining_p = nb_p_tot - nb_p;
> +
> +	/* Mark all the pages of all "in-the-middle" erase blocks */
> +	for (eb = first_eb + 1; eb < first_eb + nb_eb - 1; eb++) {
> +		micron->writtenp[eb] |= MICRON_PAGE_MASK_TRIGGER;
> +		remaining_p -= ppeb;
> +	}
> +
> +	/* Mark the first pages of the last erase block to write */
> +	if (remaining_p)
> +		micron->writtenp[eb] |= GENMASK(remaining_p - 1, 0) &
> +					MICRON_PAGE_MASK_TRIGGER;
> +


This micron->written is stored in the system memory, once power cut,  for the next time
Boot up, will it be reinstated or it will be 0x00?



> +	return 0;
> +}
> +


Thanks,

Bean
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [EXT] Re: [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-03 19:57       ` Steve deRosier
@ 2020-05-06  8:37         ` Bean Huo (beanhuo)
  0 siblings, 0 replies; 31+ messages in thread
From: Bean Huo (beanhuo) @ 2020-05-06  8:37 UTC (permalink / raw)
  To: Steve deRosier, Miquel Raynal
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, Boris Brezillon, tglx,
	Piotr Wojtaszczyk

Hi, Steve

I have fixed this issue in my patch https://patchwork.ozlabs.org/project/linux-mtd/patch/BYAPR08MB4533401FB969CA9A8D254F34DB9C0@BYAPR08MB4533.namprd08.prod.outlook.com/
But it's mainlined in the upstream Linux.  


Thanks,

Bean

> >
> 
> 
> I apologize.  And to clarify - I am not calling you (nor Boris, nor the MTD
> community) lazy. As a long-time member of this list, I know that's far from the
> case. Maybe more in this case Micron deserves to be called lazy now that I
> understand the history. Though that I know is an oversimplification too. In any
> case, perhaps _I_ was lazy with my phrasing and by not looking up the history
> here.
> 
> Again, I apologize.
> 
> > > We put this in and the resulting discussions for embedded systems
> > > designers for the next decade are going to be one of two things:
> > > * Oh, you want to use that SLC NAND from Micron? Well then don't use
> > > Linux because it performs crappy on Micron SLC NANDs.
> > > OR
> > > * Oh, you want to use Linux? Well, don't use a Micron SLC NAND then
> > > because they perform crappy on Linux.
> > >
> > > Let's get a list of all chip that have this bug (and let's be clear
> > > - it's a bug, not a "quirk") and enable it for those chips specifically.
> > > Even better if there was something in the chipinfo itself that made
> > > it obvious which ones had the problem (because realistically it's
> > > probably specific to a particular geometry). In any case, it's in
> > > the best interest of Micron to identify to us exactly which chips
> > > have or are likely to have this issue and for us to be specific on
> > > which get assigned this quirk. It is probably listed in an errata
> > > app-note, and if not it should be.
> > >
> > > Strong NAK to defaulting all Micron SLC NANDs to this - unless it
> > > truly is the case that _all_ Micron SLC NANDs in the past and in the
> > > future likely have this problem.
> >
> 
> From your description then, there's enough denial in Micron within Micron that
> my statement does apply - that at least all future Micron parts will have this
> problem. Simple logic -> if engineering won't admit there is a problem, they're
> not going to get funding to fix the problem. And it's not just going to go away in
> the next part, because we all know that new part designs usually start by
> borrowing the old part design and since you can't admit there's an issue it's likely
> the problem will come along in the next part.
> 
> > I am open to alternatives.
> >
> 
> This may sound extreme, but if the point here is to force the manufacturer to
> give us the information that is required to support their devices, I'd even suggest
> we pull the support and either drop the driver entirely, move it to staging, or
> some other extremely obvious move that says, "we are unable to properly
> support this without the manufacturers help and the manufacturer doesn't care
> about it's customers." Personally, when I upgrade my kernel, I want the kernel to
> SHOUT to me that there's a problem with this driver rather than have a
> performance issue that I might spend the next six months trying to figure out
> why.
> 
> When I start a new project, one of the first things I do is check to see if the parts
> that we want to use are supported by Linux. I'd like it when I check if a Micron
> SLC NAND part is supported that it is clear there's a big issue with that part. It
> showing up in staging instead in the main tree would be a clear signal.
> 
> A message to Micron and other chip vendors:  Do not attempt to hide your bugs,
> we will find out and hiding this stuff simply pisses us off. Bugs happen, we get it,
> expect it, and can work around or fix them. Responsible companies do not
> dismiss their users and will quickly and actively investigate reported problems
> and then disclose via errata. This used to be common, even as little as 10 or 20
> years ago - there was rarely a chip that I'd use that didn't have a few errata
> documents associated with it. Now, good luck getting them (even if you can get
> a datasheet).
> 
> We in the embedded Linux community want to work with the vendors to get
> good solid upstream support. And vendors, seriously, why the hell are you
> resisting it? You've got a large number of people not on your payroll that are
> invested in spending time in making you succeed because it makes things better
> for them too. And note the use of the word "upstream" - I no longer use vendor
> drivers. They're generally of horribly low quality and either make me stuck on a
> particular 8 year old kernel or require me to rewrite them for current versions
> anyway.
> 
> Tomorrow I'm calling my CMs and will start the process to remove Micron as an
> approved vendor and will start qualifying alternatives.
> While AFAIK (and I intend to find out for sure), we don't have a problem with the
> parts we've been using, I don't want to work with a company that is technically
> dishonest like Micron.
> 
> To Miquel - I appreciate your work in finding and working around the issue.  And
> I appreciate the pragmatic approach to fixing it.  But perhaps it's time to say
> enough is enough.
> 
> I withdraw my NAK (for what it's worth), but can we find something to very
> clearly mark "you may be using a broken device, but we don't know because
> Micro won't tell us and so we've conservitavely setup a workaround for all
> devices that affects performance"?
> 
> I'll suggest several alternative things, from most forceful to least:
> 1. Drop the driver.
> 2. Move the driver to staging.
> 3. Make it a kconfig option that is on by default, with strong wording along the
> way of "if you know you're using a device without this problem, you can turn this
> off".
> 4. Shove a printk of warning level in there so it's clear in the logs for those of use
> who look at the logs when we upgrade.
> 5. Keep it on by default, and add a white-list of known devices that are OK to not
> use it. Add a big comment of why (which is there) and instructions of how to test
> if your device has the problem and that if you test and know it's clean how to
> send a patch to add to the whitelist.
> 
> At a minimum, I'd like to see a combo of #3 and #4.
> 
> Thanks,
> - Steve
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06  8:28   ` [EXT] " Bean Huo (beanhuo)
@ 2020-05-06  8:45     ` Boris Brezillon
  2020-05-06 15:50       ` Bean Huo (beanhuo)
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Brezillon @ 2020-05-06  8:45 UTC (permalink / raw)
  To: Bean Huo (beanhuo)
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Steve deRosier, Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, Miquel Raynal, tglx,
	Piotr Wojtaszczyk

On Wed, 6 May 2020 08:28:43 +0000
"Bean Huo (beanhuo)" <beanhuo@micron.com> wrote:

> Hi, Miquel
> I have two questions about your patch, please help me. 
> 
> > +	 */
> > +	for (eb = first_eb; eb < first_eb + nb_eb; eb++) {
> > +		/* Il all the first pages are not written yet, do it */
> > +		if (micron->writtenp[eb] != MICRON_PAGE_MASK_TRIGGER)
> > +			micron_nand_avoid_shallow_erase(chip, eb);
> > +
> > +		micron->writtenp[eb] = 0;
> > +	}  
> 
> 
> Here, if the power loss happens before erasing this block, for the next time boot up,
> What will happen from FS layer in case FS detect this filled data?

Most likely ECC errors will be returned, but that doesn't matter since
this block was about to be erased. You have pretty much the same
problem for partially erase blocks already, and that should be handled
by the wear-leveling/FS, if not, that would be bug (note that it's
properly handled by UBI, which just considers the block as invalid and
schedules an erase).

> 
> > +
> > +	return nand_erase_nand(chip, instr, allowbbt); 
> >+ }  
> 
>  static int
> > +micron_nand_write_oob(struct nand_chip *chip, loff_t to,
> > +				 struct mtd_oob_ops *ops)
> > +{
> > +	struct micron_nand *micron = nand_get_manufacturer_data(chip);
> > +	unsigned int eb_sz = nanddev_eraseblock_size(&chip->base);
> > +	unsigned int p_sz = nanddev_page_size(&chip->base);
> > +	unsigned int ppeb = nanddev_pages_per_eraseblock(&chip->base);
> > +	unsigned int nb_p_tot = ops->len / p_sz;
> > +	unsigned int first_eb = DIV_ROUND_DOWN_ULL(to, eb_sz);
> > +	unsigned int first_p = DIV_ROUND_UP_ULL(to - (first_eb * eb_sz), p_sz);
> > +	unsigned int nb_eb = DIV_ROUND_UP_ULL(first_p + nb_p_tot, ppeb);
> > +	unsigned int remaining_p, eb, nb_p;
> > +	int ret;
> > +
> > +	ret = nand_write_oob_nand(chip, to, ops);
> > +	if (ret || (ops->len != ops->retlen))
> > +		return ret;
> > +
> > +	/* Mark the last pages of the first erase block to write */
> > +	nb_p = min(nb_p_tot, ppeb - first_p);
> > +	micron->writtenp[first_eb] |= GENMASK(first_p + nb_p, first_p) &
> > +				      MICRON_PAGE_MASK_TRIGGER;
> > +	remaining_p = nb_p_tot - nb_p;
> > +
> > +	/* Mark all the pages of all "in-the-middle" erase blocks */
> > +	for (eb = first_eb + 1; eb < first_eb + nb_eb - 1; eb++) {
> > +		micron->writtenp[eb] |= MICRON_PAGE_MASK_TRIGGER;
> > +		remaining_p -= ppeb;
> > +	}
> > +
> > +	/* Mark the first pages of the last erase block to write */
> > +	if (remaining_p)
> > +		micron->writtenp[eb] |= GENMASK(remaining_p - 1, 0) &
> > +					MICRON_PAGE_MASK_TRIGGER;
> > +  
> 
> 
> This micron->written is stored in the system memory, once power cut,  for the next time
> Boot up, will it be reinstated or it will be 0x00?

Yep, and that shouldn't be a problem, it just means we might have
unneeded page writes if the pages were already written, but, other than
the perf penalty it incurs, it should work fine.

We can optimize that a bit by adding a ->post_read_page() hook
so we can flag already read pages as written/erased and avoid those
unneeded writes in some situations.

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06  8:45     ` Boris Brezillon
@ 2020-05-06 15:50       ` Bean Huo (beanhuo)
  2020-05-06 16:04         ` Boris Brezillon
  0 siblings, 1 reply; 31+ messages in thread
From: Bean Huo (beanhuo) @ 2020-05-06 15:50 UTC (permalink / raw)
  To: Boris Brezillon, Richard Weinberger
  Cc: Vignesh Raghavendra, Tudor Ambarus, Steve deRosier,
	Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, Miquel Raynal, tglx,
	Piotr Wojtaszczyk

Hi, Boris

> 
> On Wed, 6 May 2020 08:28:43 +0000
> "Bean Huo (beanhuo)" <beanhuo@micron.com> wrote:
> 
> > Hi, Miquel
> > I have two questions about your patch, please help me.
> >
> > > +	 */
> > > +	for (eb = first_eb; eb < first_eb + nb_eb; eb++) {
> > > +		/* Il all the first pages are not written yet, do it */
> > > +		if (micron->writtenp[eb] != MICRON_PAGE_MASK_TRIGGER)
> > > +			micron_nand_avoid_shallow_erase(chip, eb);
> > > +
> > > +		micron->writtenp[eb] = 0;
> > > +	}
> >
> >
> > Here, if the power loss happens before erasing this block, for the
> > next time boot up, What will happen from FS layer in case FS detect this filled
> data?
> 
> Most likely ECC errors will be returned, but that doesn't matter since this block
> was about to be erased. You have pretty much the same problem for partially
> erase blocks already, and that should be handled by the wear-leveling/FS, if not,
> that would be bug (note that it's properly handled by UBI, which just considers
> the block as invalid and schedules an erase).
> 

Concerning this, I still have question, for the UBIFS,  If I am correct, there are
EC and VID header both being damaged, then UBIFS will re-erase it. I don't know if
UBIFS can handle there is dirty/filling data in the some pages  and EC/VID valid.
Maybe Richard has fixed it.

> >
> > This micron->written is stored in the system memory, once power cut,
> > for the next time Boot up, will it be reinstated or it will be 0x00?
> 
> Yep, and that shouldn't be a problem, it just means we might have unneeded
> page writes if the pages were already written, but, other than the perf penalty it
> incurs, it should work fine.
> 
> We can optimize that a bit by adding a ->post_read_page() hook so we can flag
> already read pages as written/erased and avoid those unneeded writes in some
> situations.

That means we assume this block being read before erasing.

Thanks,

Bean

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 15:50       ` Bean Huo (beanhuo)
@ 2020-05-06 16:04         ` Boris Brezillon
  2020-05-06 16:09           ` Bean Huo (beanhuo)
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Brezillon @ 2020-05-06 16:04 UTC (permalink / raw)
  To: Bean Huo (beanhuo)
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Steve deRosier, Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, Miquel Raynal, tglx,
	Piotr Wojtaszczyk

On Wed, 6 May 2020 15:50:32 +0000
"Bean Huo (beanhuo)" <beanhuo@micron.com> wrote:

> Hi, Boris
> 
> > 
> > On Wed, 6 May 2020 08:28:43 +0000
> > "Bean Huo (beanhuo)" <beanhuo@micron.com> wrote:
> >   
> > > Hi, Miquel
> > > I have two questions about your patch, please help me.
> > >  
> > > > +	 */
> > > > +	for (eb = first_eb; eb < first_eb + nb_eb; eb++) {
> > > > +		/* Il all the first pages are not written yet, do it */
> > > > +		if (micron->writtenp[eb] != MICRON_PAGE_MASK_TRIGGER)
> > > > +			micron_nand_avoid_shallow_erase(chip, eb);
> > > > +
> > > > +		micron->writtenp[eb] = 0;
> > > > +	}  
> > >
> > >
> > > Here, if the power loss happens before erasing this block, for the
> > > next time boot up, What will happen from FS layer in case FS detect this filled  
> > data?
> > 
> > Most likely ECC errors will be returned, but that doesn't matter since this block
> > was about to be erased. You have pretty much the same problem for partially
> > erase blocks already, and that should be handled by the wear-leveling/FS, if not,
> > that would be bug (note that it's properly handled by UBI, which just considers
> > the block as invalid and schedules an erase).
> >   
> 
> Concerning this, I still have question, for the UBIFS,  If I am correct, there are
> EC and VID header both being damaged, then UBIFS will re-erase it. I don't know if
> UBIFS can handle there is dirty/filling data in the some pages  and EC/VID valid.
> Maybe Richard has fixed it.

If the block is being erased that means there's another one mapped to
the same LEB, or the block is simply not needed anymore. In both cases,
this old block shouldn't be referenced. Again, if that happens, it's a
bug.

> 
> > >
> > > This micron->written is stored in the system memory, once power cut,
> > > for the next time Boot up, will it be reinstated or it will be 0x00?  
> > 
> > Yep, and that shouldn't be a problem, it just means we might have unneeded
> > page writes if the pages were already written, but, other than the perf penalty it
> > incurs, it should work fine.
> > 
> > We can optimize that a bit by adding a ->post_read_page() hook so we can flag
> > already read pages as written/erased and avoid those unneeded writes in some
> > situations.  
> 
> That means we assume this block being read before erasing.

Not necessarily. If pages have been read because the MTD user wanted to
access then data (e.g. UBI reads the EC/VID header, UBIFS read its
metadata, ...) it's a good opportunity for us to flag pages as
'written/erased'. But we should not generate extra IOs just for that.

You seem to assume that page reads are almost free compared to erase
operations, and that's true if you only consider the time it takes to
load a page in the NAND cache, but doing IOs is much more expensive
that you think on a shitload of platforms. If we do as you suggest, we
might have 2 rounds of IOs, one to read the pages, and one to write
them if they've not been written already.

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 16:04         ` Boris Brezillon
@ 2020-05-06 16:09           ` Bean Huo (beanhuo)
  2020-05-06 16:29             ` Boris Brezillon
  2020-05-06 18:44             ` Richard Weinberger
  0 siblings, 2 replies; 31+ messages in thread
From: Bean Huo (beanhuo) @ 2020-05-06 16:09 UTC (permalink / raw)
  To: Boris Brezillon, Richard Weinberger
  Cc: Vignesh Raghavendra, Tudor Ambarus, Steve deRosier,
	Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, Miquel Raynal, tglx,
	Piotr Wojtaszczyk

Hi, Richard

> > > > > +	}
> > > >
> > > >
> > > > Here, if the power loss happens before erasing this block, for the
> > > > next time boot up, What will happen from FS layer in case FS
> > > > detect this filled
> > > data?
> > >
> > > Most likely ECC errors will be returned, but that doesn't matter
> > > since this block was about to be erased. You have pretty much the
> > > same problem for partially erase blocks already, and that should be
> > > handled by the wear-leveling/FS, if not, that would be bug (note
> > > that it's properly handled by UBI, which just considers the block as invalid and
> schedules an erase).
> > >
> >
> > Concerning this, I still have question, for the UBIFS,  If I am
> > correct, there are EC and VID header both being damaged, then UBIFS
> > will re-erase it. I don't know if UBIFS can handle there is dirty/filling data in the
> some pages  and EC/VID valid.
> > Maybe Richard has fixed it.
> 
> If the block is being erased that means there's another one mapped to the same
> LEB, or the block is simply not needed anymore. In both cases, this old block
> shouldn't be referenced. Again, if that happens, it's a bug.

Would you please help us confirm this?  how does ubifs handle this situation? Also other FS? Eg, jffs2, yaffs

Thanks,

Bean

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 16:09           ` Bean Huo (beanhuo)
@ 2020-05-06 16:29             ` Boris Brezillon
  2020-05-06 16:50               ` Bean Huo (beanhuo)
  2020-05-06 18:44             ` Richard Weinberger
  1 sibling, 1 reply; 31+ messages in thread
From: Boris Brezillon @ 2020-05-06 16:29 UTC (permalink / raw)
  To: Bean Huo (beanhuo)
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Steve deRosier, Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, Miquel Raynal, tglx,
	Piotr Wojtaszczyk

On Wed, 6 May 2020 16:09:19 +0000
"Bean Huo (beanhuo)" <beanhuo@micron.com> wrote:

> Hi, Richard
> 
> > > > > > +	}  
> > > > >
> > > > >
> > > > > Here, if the power loss happens before erasing this block, for the
> > > > > next time boot up, What will happen from FS layer in case FS
> > > > > detect this filled  
> > > > data?
> > > >
> > > > Most likely ECC errors will be returned, but that doesn't matter
> > > > since this block was about to be erased. You have pretty much the
> > > > same problem for partially erase blocks already, and that should be
> > > > handled by the wear-leveling/FS, if not, that would be bug (note
> > > > that it's properly handled by UBI, which just considers the block as invalid and  
> > schedules an erase).  
> > > >  
> > >
> > > Concerning this, I still have question, for the UBIFS,  If I am
> > > correct, there are EC and VID header both being damaged, then UBIFS
> > > will re-erase it. I don't know if UBIFS can handle there is dirty/filling data in the  
> > some pages  and EC/VID valid.  
> > > Maybe Richard has fixed it.  
> > 
> > If the block is being erased that means there's another one mapped to the same
> > LEB, or the block is simply not needed anymore. In both cases, this old block
> > shouldn't be referenced. Again, if that happens, it's a bug.  
> 
> Would you please help us confirm this?  how does ubifs handle this situation? Also other FS? Eg, jffs2, yaffs
> 

And if you're really worried about that, forcibly writing all pages
starting from 0 should do the trick. But I don't think that's needed.
As I said, when you trigger an erase, you're not guaranteed the erase
will complete before a potential power-cut, meaning that part of your
data might still be valid, while others could be corrupted.

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 16:29             ` Boris Brezillon
@ 2020-05-06 16:50               ` Bean Huo (beanhuo)
  0 siblings, 0 replies; 31+ messages in thread
From: Bean Huo (beanhuo) @ 2020-05-06 16:50 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Steve deRosier, Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, Miquel Raynal, tglx,
	Piotr Wojtaszczyk

> >
> > > > > > > +	}
> > > > > >
> > > > > >
> > > > > > Here, if the power loss happens before erasing this block, for
> > > > > > the next time boot up, What will happen from FS layer in case
> > > > > > FS detect this filled
> > > > > data?
> > > > >
> > > > > Most likely ECC errors will be returned, but that doesn't matter
> > > > > since this block was about to be erased. You have pretty much
> > > > > the same problem for partially erase blocks already, and that
> > > > > should be handled by the wear-leveling/FS, if not, that would be
> > > > > bug (note that it's properly handled by UBI, which just
> > > > > considers the block as invalid and
> > > schedules an erase).
> > > > >
> > > >
> > > > Concerning this, I still have question, for the UBIFS,  If I am
> > > > correct, there are EC and VID header both being damaged, then
> > > > UBIFS will re-erase it. I don't know if UBIFS can handle there is
> > > > dirty/filling data in the
> > > some pages  and EC/VID valid.
> > > > Maybe Richard has fixed it.
> > >
> > > If the block is being erased that means there's another one mapped
> > > to the same LEB, or the block is simply not needed anymore. In both
> > > cases, this old block shouldn't be referenced. Again, if that happens, it's a
> bug.
> >
> > Would you please help us confirm this?  how does ubifs handle this
> > situation? Also other FS? Eg, jffs2, yaffs
> >
> 
> And if you're really worried about that, forcibly writing all pages starting from 0
> should do the trick. But I don't think that's needed.
> As I said, when you trigger an erase, you're not guaranteed the erase will
> complete before a potential power-cut, meaning that part of your data might
> still be valid, while others could be corrupted.

Boris
I think, this patch wasn't fully tested on the system. This patch inevitably lengthen the erase time.
As long going into this function, from the software point, it is erasing operation.
And from physical side, it is not. 

Thanks,

Bean

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 16:09           ` Bean Huo (beanhuo)
  2020-05-06 16:29             ` Boris Brezillon
@ 2020-05-06 18:44             ` Richard Weinberger
  2020-05-06 19:01               ` Boris Brezillon
  1 sibling, 1 reply; 31+ messages in thread
From: Richard Weinberger @ 2020-05-06 18:44 UTC (permalink / raw)
  To: Bean Huo (beanhuo)
  Cc: Vignesh Raghavendra, Tudor Ambarus,
	Zoltan Szubbocsev (zszubbocsev),
	Steve deRosier, Boris Brezillon, linux-mtd, Thomas Petazzoni,
	Miquel Raynal, tglx, Piotr Wojtaszczyk

Bean, Boris,

----- Ursprüngliche Mail -----
>> > Concerning this, I still have question, for the UBIFS,  If I am
>> > correct, there are EC and VID header both being damaged, then UBIFS
>> > will re-erase it. I don't know if UBIFS can handle there is dirty/filling data
>> > in the
>> some pages  and EC/VID valid.

Uhh. Damaging just payload asks for trouble.

>> > Maybe Richard has fixed it.
>> 
>> If the block is being erased that means there's another one mapped to the same
>> LEB, or the block is simply not needed anymore. In both cases, this old block
>> shouldn't be referenced. Again, if that happens, it's a bug.

Sadly it is not so easy.

IIRC the UBIFS log ring is such a corner case, it uses a fixed LEB range for
this purpose. Before writing to a new LEB it unmaps it. If the resulting erase operation
is interrupted before a new version of the same LEB is written reading from that
LEB would result in ECC errors.

> Would you please help us confirm this?  how does ubifs handle this situation?
> Also other FS? Eg, jffs2, yaffs

There are cases where (partially) erased LEBs are still referenced.
UBIFS assumes that a LEB it unmaps is after a power-cut either 0xFF or intact.
In relies in the fact that UBI will detect an interrupted erase operation and
re-erases the PEB.
Fastmap once violated this rule, it took years until the first user hit this.

So please make sure that the VID header will be destroyed.

JFFS2 needs also thoughts, let me check the source.

YAFFS is out tree and therefore out of mind.

Thanks,
//richard

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 18:44             ` Richard Weinberger
@ 2020-05-06 19:01               ` Boris Brezillon
  2020-05-06 19:23                 ` Richard Weinberger
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Brezillon @ 2020-05-06 19:01 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Vignesh Raghavendra, Tudor Ambarus, Steve deRosier,
	Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, Miquel Raynal, tglx,
	Piotr Wojtaszczyk, Bean Huo (beanhuo)

On Wed, 6 May 2020 20:44:29 +0200 (CEST)
Richard Weinberger <richard@nod.at> wrote:

> Bean, Boris,
> 
> ----- Ursprüngliche Mail -----
> >> > Concerning this, I still have question, for the UBIFS,  If I am
> >> > correct, there are EC and VID header both being damaged, then UBIFS
> >> > will re-erase it. I don't know if UBIFS can handle there is dirty/filling data
> >> > in the  
> >> some pages  and EC/VID valid.  
> 
> Uhh. Damaging just payload asks for trouble.

I'd expect UBI to just mark the LEB as bad and schedule it for erasure
(again, pretty similar to an interrupted erase).

> 
> >> > Maybe Richard has fixed it.  
> >> 
> >> If the block is being erased that means there's another one mapped to the same
> >> LEB, or the block is simply not needed anymore. In both cases, this old block
> >> shouldn't be referenced. Again, if that happens, it's a bug.  
> 
> Sadly it is not so easy.
> 
> IIRC the UBIFS log ring is such a corner case, it uses a fixed LEB range for
> this purpose. Before writing to a new LEB it unmaps it. If the resulting erase operation
> is interrupted before a new version of the same LEB is written reading from that
> LEB would result in ECC errors.

Duh. What happens when you have ECC errors? Does that stop the mount?
Shouldn't we make that part more robust?

>  
> > Would you please help us confirm this?  how does ubifs handle this situation?
> > Also other FS? Eg, jffs2, yaffs  
> 
> There are cases where (partially) erased LEBs are still referenced.
> UBIFS assumes that a LEB it unmaps is after a power-cut either 0xFF or intact.
> In relies in the fact that UBI will detect an interrupted erase operation and
> re-erases the PEB.
> Fastmap once violated this rule, it took years until the first user hit this.
> 
> So please make sure that the VID header will be destroyed.

I really hate the idea of having FS-specific logic in the Micron
quirk. Isn't there a way we can fix that in UBIFS? Plus, do we have any
guarantee that the EC/VID headers will be corrupted along with UBIFS
data when an erase is interrupted?

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 19:01               ` Boris Brezillon
@ 2020-05-06 19:23                 ` Richard Weinberger
  2020-05-06 20:40                   ` Boris Brezillon
  0 siblings, 1 reply; 31+ messages in thread
From: Richard Weinberger @ 2020-05-06 19:23 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Vignesh Raghavendra, Tudor Ambarus, Steve deRosier,
	Zoltan Szubbocsev, zszubbocsev, linux-mtd, Thomas Petazzoni,
	Miquel Raynal, tglx, Piotr Wojtaszczyk, Bean Huo, beanhuo

----- Ursprüngliche Mail -----
> Von: "Boris Brezillon" <boris.brezillon@collabora.com>
> An: "richard" <richard@nod.at>
> CC: "Bean Huo, beanhuo" <beanhuo@micron.com>, "Miquel Raynal" <miquel.raynal@bootlin.com>, "Vignesh Raghavendra"
> <vigneshr@ti.com>, "Tudor Ambarus" <Tudor.Ambarus@microchip.com>, "linux-mtd" <linux-mtd@lists.infradead.org>, "Steve
> deRosier" <derosier@gmail.com>, "Thomas Petazzoni" <thomas.petazzoni@bootlin.com>, "tglx" <tglx@linutronix.de>, "Zoltan
> Szubbocsev, zszubbocsev" <zszubbocsev@micron.com>, "Piotr Wojtaszczyk" <WojtaszczykP@cumminsallison.com>
> Gesendet: Mittwoch, 6. Mai 2020 21:01:58
> Betreff: Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue

> On Wed, 6 May 2020 20:44:29 +0200 (CEST)
> Richard Weinberger <richard@nod.at> wrote:
> 
>> Bean, Boris,
>> 
>> ----- Ursprüngliche Mail -----
>> >> > Concerning this, I still have question, for the UBIFS,  If I am
>> >> > correct, there are EC and VID header both being damaged, then UBIFS
>> >> > will re-erase it. I don't know if UBIFS can handle there is dirty/filling data
>> >> > in the
>> >> some pages  and EC/VID valid.
>> 
>> Uhh. Damaging just payload asks for trouble.
> 
> I'd expect UBI to just mark the LEB as bad and schedule it for erasure
> (again, pretty similar to an interrupted erase).

UBI scans only headers during attach. If you don't touch these, no way.

>> 
>> >> > Maybe Richard has fixed it.
>> >> 
>> >> If the block is being erased that means there's another one mapped to the same
>> >> LEB, or the block is simply not needed anymore. In both cases, this old block
>> >> shouldn't be referenced. Again, if that happens, it's a bug.
>> 
>> Sadly it is not so easy.
>> 
>> IIRC the UBIFS log ring is such a corner case, it uses a fixed LEB range for
>> this purpose. Before writing to a new LEB it unmaps it. If the resulting erase
>> operation
>> is interrupted before a new version of the same LEB is written reading from that
>> LEB would result in ECC errors.
> 
> Duh. What happens when you have ECC errors? Does that stop the mount?
> Shouldn't we make that part more robust?

UBIFS stops if there are *unexpected* ECC errors.
The last node in the last node group of the last log LEB is allowed to have ECC errors.

I get where you trying to head to. Let me give this a deep thought and re-read
the fixes for fastmap I did a few years ago.

>>  
>> > Would you please help us confirm this?  how does ubifs handle this situation?
>> > Also other FS? Eg, jffs2, yaffs
>> 
>> There are cases where (partially) erased LEBs are still referenced.
>> UBIFS assumes that a LEB it unmaps is after a power-cut either 0xFF or intact.
>> In relies in the fact that UBI will detect an interrupted erase operation and
>> re-erases the PEB.
>> Fastmap once violated this rule, it took years until the first user hit this.
>> 
>> So please make sure that the VID header will be destroyed.
> 
> I really hate the idea of having FS-specific logic in the Micron
> quirk. Isn't there a way we can fix that in UBIFS? Plus, do we have any
> guarantee that the EC/VID headers will be corrupted along with UBIFS
> data when an erase is interrupted?

And I hate the idea of changing UBIFS semantics. ;-)

I think you read equally much NAND datasheets as I did, so you know how often
they guarantee something. ;-)
UBIFS (actually UBI) assumes that interrupted erase will cause ECC errors across
the whole PEB.

Thanks,
//richard

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 19:23                 ` Richard Weinberger
@ 2020-05-06 20:40                   ` Boris Brezillon
  2020-05-06 20:59                     ` Richard Weinberger
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Brezillon @ 2020-05-06 20:40 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Vignesh Raghavendra, Tudor Ambarus, Steve deRosier,
	Zoltan Szubbocsev, zszubbocsev, linux-mtd, Thomas Petazzoni,
	Miquel Raynal, tglx, Piotr Wojtaszczyk, Bean Huo, beanhuo

On Wed, 6 May 2020 21:23:18 +0200 (CEST)
Richard Weinberger <richard@nod.at> wrote:

> ----- Ursprüngliche Mail -----
> > Von: "Boris Brezillon" <boris.brezillon@collabora.com>
> > An: "richard" <richard@nod.at>
> > CC: "Bean Huo, beanhuo" <beanhuo@micron.com>, "Miquel Raynal" <miquel.raynal@bootlin.com>, "Vignesh Raghavendra"
> > <vigneshr@ti.com>, "Tudor Ambarus" <Tudor.Ambarus@microchip.com>, "linux-mtd" <linux-mtd@lists.infradead.org>, "Steve
> > deRosier" <derosier@gmail.com>, "Thomas Petazzoni" <thomas.petazzoni@bootlin.com>, "tglx" <tglx@linutronix.de>, "Zoltan
> > Szubbocsev, zszubbocsev" <zszubbocsev@micron.com>, "Piotr Wojtaszczyk" <WojtaszczykP@cumminsallison.com>
> > Gesendet: Mittwoch, 6. Mai 2020 21:01:58
> > Betreff: Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue  
> 
> > On Wed, 6 May 2020 20:44:29 +0200 (CEST)
> > Richard Weinberger <richard@nod.at> wrote:
> >   
> >> Bean, Boris,
> >> 
> >> ----- Ursprüngliche Mail -----  
> >> >> > Concerning this, I still have question, for the UBIFS,  If I am
> >> >> > correct, there are EC and VID header both being damaged, then UBIFS
> >> >> > will re-erase it. I don't know if UBIFS can handle there is dirty/filling data
> >> >> > in the  
> >> >> some pages  and EC/VID valid.  
> >> 
> >> Uhh. Damaging just payload asks for trouble.  
> > 
> > I'd expect UBI to just mark the LEB as bad and schedule it for erasure
> > (again, pretty similar to an interrupted erase).  
> 
> UBI scans only headers during attach. If you don't touch these, no way.

Sorry, I misunderstood what you meant by payload. UBI should schedule
the PEB for erase if the EC/VID header is corrupted.


______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 20:40                   ` Boris Brezillon
@ 2020-05-06 20:59                     ` Richard Weinberger
  2020-05-06 21:11                       ` Boris Brezillon
  2020-05-07  9:28                       ` Bean Huo (beanhuo)
  0 siblings, 2 replies; 31+ messages in thread
From: Richard Weinberger @ 2020-05-06 20:59 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Vignesh Raghavendra, Tudor Ambarus, Steve deRosier,
	Zoltan Szubbocsev, zszubbocsev, linux-mtd, Thomas Petazzoni,
	Miquel Raynal, tglx, Piotr Wojtaszczyk, Bean Huo, beanhuo

----- Ursprüngliche Mail -----
>> UBI scans only headers during attach. If you don't touch these, no way.
> 
> Sorry, I misunderstood what you meant by payload. UBI should schedule
> the PEB for erase if the EC/VID header is corrupted.

UBI even tries to recover from such a situation. If only the EC header is bad,
it will create a new one. Only of the VID header is bad/missing and the payload
is corrupted (ECC errors or bit-flips) it will erase it.

A missing VID header plus good payload will cause UBI to stop attaching since it
violates the IO model.

Thanks,
//richard

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 20:59                     ` Richard Weinberger
@ 2020-05-06 21:11                       ` Boris Brezillon
  2020-05-07  9:28                         ` Bean Huo (beanhuo)
  2020-05-07  9:28                       ` Bean Huo (beanhuo)
  1 sibling, 1 reply; 31+ messages in thread
From: Boris Brezillon @ 2020-05-06 21:11 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Vignesh Raghavendra, Tudor Ambarus, Steve deRosier,
	Zoltan Szubbocsev, zszubbocsev, linux-mtd, Thomas Petazzoni,
	Miquel Raynal, tglx, Piotr Wojtaszczyk, Bean Huo, beanhuo

On Wed, 6 May 2020 22:59:39 +0200 (CEST)
Richard Weinberger <richard@nod.at> wrote:

> ----- Ursprüngliche Mail -----
> >> UBI scans only headers during attach. If you don't touch these, no way.  
> > 
> > Sorry, I misunderstood what you meant by payload. UBI should schedule
> > the PEB for erase if the EC/VID header is corrupted.  
> 
> UBI even tries to recover from such a situation. If only the EC header is bad,
> it will create a new one. Only of the VID header is bad/missing and the payload
> is corrupted (ECC errors or bit-flips) it will erase it.

Yep, I remember that, though in the case of a corrupted EC, UBI will
schedule a PEB recovery, and if the payload is corrupted, the read
should fail when copying the LEB content to another block.

> 
> A missing VID header plus good payload will cause UBI to stop attaching since it
> violates the IO model.

Sure, and that's not what we want to do anyway. We basically have 2
choices here:

1/ overwrite all pages starting from page 0 and ending at page 15. This
will lead to ECC errors on already written pages, and 0-filled pages
for others (unless we go for a raw write, in which case it might
actually lead to ECC errors depending on the engine).
2/ track already written pages (by reading them first), and only writes
those that have not been written yet. That means no ECC error in that
case, and no corrupted EC/VID header as well.

Clearly, #2 would help mitigate the perf penalty incurred by the Micron
workaround, but if it's not an option, maybe we can just start with
#1 (which is basically what Miquel implemented).

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 21:11                       ` Boris Brezillon
@ 2020-05-07  9:28                         ` Bean Huo (beanhuo)
  2020-05-07  9:40                           ` Boris Brezillon
  0 siblings, 1 reply; 31+ messages in thread
From: Bean Huo (beanhuo) @ 2020-05-07  9:28 UTC (permalink / raw)
  To: Boris Brezillon, Richard Weinberger
  Cc: Vignesh Raghavendra, Tudor Ambarus, Steve deRosier,
	Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, Miquel Raynal, tglx,
	Piotr Wojtaszczyk

Hi, Boris, Richard 

> >
> > A missing VID header plus good payload will cause UBI to stop
> > attaching since it violates the IO model.
> 
> Sure, and that's not what we want to do anyway. We basically have 2 choices
> here:
> 
> 1/ overwrite all pages starting from page 0 and ending at page 15. This will lead
> to ECC errors on already written pages, and 0-filled pages for others (unless we
> go for a raw write, in which case it might actually lead to ECC errors depending
> on the engine).

No need for overwriting all pages. I overwrite EC and VID page just for prevent of 
Erase power loss issue. But you hate this since FS-specific approach.
According to Richard's emails, we don't need to overwrite page0,  overwrite page1
Is enough.

> 2/ track already written pages (by reading them first), and only writes those that
> have not been written yet. That means no ECC error in that case, and no
> corrupted EC/VID header as well.

This is similar with my patch approach, but corrupted EC and VID headers. 

I have two proposals:
1.  rebase my patch, and copy one idea from Miquel's patch which records the programmed pages.
But page 1 should be overwrote considering the UBIFS re-erase mechanism.

2.  add a new padding structure in the MTD,  which is used to filling empty page in the MTD.
And once FS layer detects this padding data, just ignore or trigger re-erase this block. 

How about you?

Thanks,

Bean

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-06 20:59                     ` Richard Weinberger
  2020-05-06 21:11                       ` Boris Brezillon
@ 2020-05-07  9:28                       ` Bean Huo (beanhuo)
  2020-05-07  9:30                         ` Boris Brezillon
  2020-05-07 12:20                         ` Richard Weinberger
  1 sibling, 2 replies; 31+ messages in thread
From: Bean Huo (beanhuo) @ 2020-05-07  9:28 UTC (permalink / raw)
  To: Richard Weinberger, Boris Brezillon
  Cc: Vignesh Raghavendra, Tudor Ambarus, Steve deRosier,
	Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, Miquel Raynal, tglx,
	Piotr Wojtaszczyk

Hi Richard
Thanks. 

How about this special  situation: 

Page 0(EC) is good;
Page 1(VID) is bad;
Page 2 (data) is good;
Page 3( data) is bad, or contain filling pattern.
Page 4 (data) is good, or empty;
Page 5( data) is bad, or contain filling pattern.
Page 6 (data) is good, or empty;
.....


Bean

> 
> ----- Ursprüngliche Mail -----
> >> UBI scans only headers during attach. If you don't touch these, no way.
> >
> > Sorry, I misunderstood what you meant by payload. UBI should schedule
> > the PEB for erase if the EC/VID header is corrupted.
> 
> UBI even tries to recover from such a situation. If only the EC header is bad, it
> will create a new one. Only of the VID header is bad/missing and the payload is
> corrupted (ECC errors or bit-flips) it will erase it.
> 
> A missing VID header plus good payload will cause UBI to stop attaching since it
> violates the IO model.
> 
> Thanks,
> //richard
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-07  9:28                       ` Bean Huo (beanhuo)
@ 2020-05-07  9:30                         ` Boris Brezillon
  2020-05-07 10:02                           ` Richard Weinberger
  2020-05-07 12:20                         ` Richard Weinberger
  1 sibling, 1 reply; 31+ messages in thread
From: Boris Brezillon @ 2020-05-07  9:30 UTC (permalink / raw)
  To: Bean Huo (beanhuo), Miquel Raynal
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Steve deRosier, Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, tglx, Piotr Wojtaszczyk

On Thu, 7 May 2020 09:28:59 +0000
"Bean Huo (beanhuo)" <beanhuo@micron.com> wrote:

> Hi Richard
> Thanks. 
> 
> How about this special  situation: 
> 
> Page 0(EC) is good;
> Page 1(VID) is bad;

UBI should reject the LEB because of this VID corruption, but I'm not
sure how well that works when you add fastmap in the mix :-/.

> Page 2 (data) is good;
> Page 3( data) is bad, or contain filling pattern.
> Page 4 (data) is good, or empty;
> Page 5( data) is bad, or contain filling pattern.
> Page 6 (data) is good, or empty;
> .....
> 
> 
> Bean
> 
> > 
> > ----- Ursprüngliche Mail -----  
> > >> UBI scans only headers during attach. If you don't touch these, no way.  
> > >
> > > Sorry, I misunderstood what you meant by payload. UBI should schedule
> > > the PEB for erase if the EC/VID header is corrupted.  
> > 
> > UBI even tries to recover from such a situation. If only the EC header is bad, it
> > will create a new one. Only of the VID header is bad/missing and the payload is
> > corrupted (ECC errors or bit-flips) it will erase it.
> > 
> > A missing VID header plus good payload will cause UBI to stop attaching since it
> > violates the IO model.
> > 
> > Thanks,
> > //richard  


______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-07  9:28                         ` Bean Huo (beanhuo)
@ 2020-05-07  9:40                           ` Boris Brezillon
  0 siblings, 0 replies; 31+ messages in thread
From: Boris Brezillon @ 2020-05-07  9:40 UTC (permalink / raw)
  To: Bean Huo (beanhuo)
  Cc: Vignesh Raghavendra, Tudor Ambarus, Richard Weinberger,
	Steve deRosier, Zoltan Szubbocsev (zszubbocsev),
	linux-mtd, Thomas Petazzoni, Miquel Raynal, tglx,
	Piotr Wojtaszczyk

On Thu, 7 May 2020 09:28:29 +0000
"Bean Huo (beanhuo)" <beanhuo@micron.com> wrote:

> Hi, Boris, Richard 
> 
> > >
> > > A missing VID header plus good payload will cause UBI to stop
> > > attaching since it violates the IO model.  
> > 
> > Sure, and that's not what we want to do anyway. We basically have 2 choices
> > here:
> > 
> > 1/ overwrite all pages starting from page 0 and ending at page 15. This will lead
> > to ECC errors on already written pages, and 0-filled pages for others (unless we
> > go for a raw write, in which case it might actually lead to ECC errors depending
> > on the engine).  
> 
> No need for overwriting all pages. I overwrite EC and VID page just for prevent of 
> Erase power loss issue. But you hate this since FS-specific approach.
> According to Richard's emails, we don't need to overwrite page0,  overwrite page1
> Is enough.

It's still UBI specific, so that's still a no-go for me.

> 
> > 2/ track already written pages (by reading them first), and only writes those that
> > have not been written yet. That means no ECC error in that case, and no
> > corrupted EC/VID header as well.  
> 
> This is similar with my patch approach, but corrupted EC and VID headers. 
> 
> I have two proposals:
> 1.  rebase my patch, and copy one idea from Miquel's patch which records the programmed pages.
> But page 1 should be overwrote considering the UBIFS re-erase mechanism.

See above, and remember we do all that because Micron broke one of the
base assumptions we have about NANDs => erase should reset all bits to
1 or return an error if that fails.

> 
> 2.  add a new padding structure in the MTD,  which is used to filling empty page in the MTD.
> And once FS layer detects this padding data, just ignore or trigger re-erase this block. 

Not sure what you mean by padding structure, but yes, filling empty
pages with 0s and expecting the FS/wear-leveling layers to be happy
with that would be ideal. But, as for the "erase only 2-page because
this is where UBI puts its VID header" thing, that's a NACK on my side
if you want to introduce a new pattern that's only understood by
UBI/UBIFS.

I really wish FS/wear-leveling layers were immune to corrupted/invalid
LEBs they should no longer reference, but according to Richard, that's
not that simple.

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-07  9:30                         ` Boris Brezillon
@ 2020-05-07 10:02                           ` Richard Weinberger
  0 siblings, 0 replies; 31+ messages in thread
From: Richard Weinberger @ 2020-05-07 10:02 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Vignesh Raghavendra, Tudor Ambarus, Steve deRosier,
	Zoltan Szubbocsev, zszubbocsev, linux-mtd, Thomas Petazzoni,
	Miquel Raynal, tglx, Piotr Wojtaszczyk, Bean Huo, beanhuo

----- Ursprüngliche Mail -----
> Von: "Boris Brezillon" <boris.brezillon@collabora.com>
> An: "Bean Huo, beanhuo" <beanhuo@micron.com>, "Miquel Raynal" <miquel.raynal@bootlin.com>
> CC: "richard" <richard@nod.at>, "Vignesh Raghavendra" <vigneshr@ti.com>, "Tudor Ambarus" <Tudor.Ambarus@microchip.com>,
> "linux-mtd" <linux-mtd@lists.infradead.org>, "Steve deRosier" <derosier@gmail.com>, "Thomas Petazzoni"
> <thomas.petazzoni@bootlin.com>, "tglx" <tglx@linutronix.de>, "Zoltan Szubbocsev, zszubbocsev" <zszubbocsev@micron.com>,
> "Piotr Wojtaszczyk" <WojtaszczykP@cumminsallison.com>
> Gesendet: Donnerstag, 7. Mai 2020 11:30:50
> Betreff: Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue

> On Thu, 7 May 2020 09:28:59 +0000
> "Bean Huo (beanhuo)" <beanhuo@micron.com> wrote:
> 
>> Hi Richard
>> Thanks.
>> 
>> How about this special  situation:
>> 
>> Page 0(EC) is good;
>> Page 1(VID) is bad;
> 
> UBI should reject the LEB because of this VID corruption, but I'm not
> sure how well that works when you add fastmap in the mix :-/.

Fastmap can deal with that.
Upon first access it checks for a VID header.

Thanks,
//richard

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue
  2020-05-07  9:28                       ` Bean Huo (beanhuo)
  2020-05-07  9:30                         ` Boris Brezillon
@ 2020-05-07 12:20                         ` Richard Weinberger
  1 sibling, 0 replies; 31+ messages in thread
From: Richard Weinberger @ 2020-05-07 12:20 UTC (permalink / raw)
  To: Bean Huo, beanhuo
  Cc: Vignesh Raghavendra, Tudor Ambarus, Zoltan Szubbocsev,
	zszubbocsev, Steve deRosier, Boris Brezillon, linux-mtd,
	Thomas Petazzoni, Miquel Raynal, tglx, Piotr Wojtaszczyk

----- Ursprüngliche Mail -----
> Von: "Bean Huo, beanhuo" <beanhuo@micron.com>
> An: "richard" <richard@nod.at>, "Boris Brezillon" <boris.brezillon@collabora.com>
> CC: "Miquel Raynal" <miquel.raynal@bootlin.com>, "Vignesh Raghavendra" <vigneshr@ti.com>, "Tudor Ambarus"
> <Tudor.Ambarus@microchip.com>, "linux-mtd" <linux-mtd@lists.infradead.org>, "Steve deRosier" <derosier@gmail.com>,
> "Thomas Petazzoni" <thomas.petazzoni@bootlin.com>, "tglx" <tglx@linutronix.de>, "Zoltan Szubbocsev, zszubbocsev"
> <zszubbocsev@micron.com>, "Piotr Wojtaszczyk" <WojtaszczykP@cumminsallison.com>
> Gesendet: Donnerstag, 7. Mai 2020 11:28:59
> Betreff: RE: [EXT] [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue

> Hi Richard
> Thanks.
> 
> How about this special  situation:
> 
> Page 0(EC) is good;
> Page 1(VID) is bad;
> Page 2 (data) is good;
> Page 3( data) is bad, or contain filling pattern.
> Page 4 (data) is good, or empty;
> Page 5( data) is bad, or contain filling pattern.
> Page 6 (data) is good, or empty;

"bad" means ECC errors upon read?

UBI will be unhappy in scanning mode if VID header is bad but payload
does not show ECC errors nor bitflips and is not 0xFF.

See check_corruption() in drivers/mtd/ubi/attach.c
I'm not super eager to soften these checks but as last resort
we can modify them.

Fastmap is more forgiving and just checks whether the VID header is corrupted.
While reading this code I think we can actually use check_corruption() there too, hmmm.

Thanks,
//richard

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2020-05-07 12:20 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-03 11:40 [PATCH v2 0/3] Fix proposal for the Micron shallow erase issue Miquel Raynal
2020-05-03 11:40 ` [PATCH v2 1/3] mtd: rawnand: Add the nand_chip->erase hook Miquel Raynal
2020-05-03 15:01   ` Boris Brezillon
2020-05-03 11:40 ` [PATCH v2 2/3] mtd: rawnand: Add the nand_chip->write_oob hook Miquel Raynal
2020-05-03 15:09   ` Boris Brezillon
2020-05-03 17:02     ` Miquel Raynal
2020-05-03 11:40 ` [PATCH v2 3/3] mtd: rawnand: micron: Address the shallow erase issue Miquel Raynal
2020-05-03 16:10   ` Steve deRosier
2020-05-03 16:34     ` Boris Brezillon
2020-05-03 16:36     ` Miquel Raynal
2020-05-03 19:57       ` Steve deRosier
2020-05-06  8:37         ` [EXT] " Bean Huo (beanhuo)
2020-05-06  8:28   ` [EXT] " Bean Huo (beanhuo)
2020-05-06  8:45     ` Boris Brezillon
2020-05-06 15:50       ` Bean Huo (beanhuo)
2020-05-06 16:04         ` Boris Brezillon
2020-05-06 16:09           ` Bean Huo (beanhuo)
2020-05-06 16:29             ` Boris Brezillon
2020-05-06 16:50               ` Bean Huo (beanhuo)
2020-05-06 18:44             ` Richard Weinberger
2020-05-06 19:01               ` Boris Brezillon
2020-05-06 19:23                 ` Richard Weinberger
2020-05-06 20:40                   ` Boris Brezillon
2020-05-06 20:59                     ` Richard Weinberger
2020-05-06 21:11                       ` Boris Brezillon
2020-05-07  9:28                         ` Bean Huo (beanhuo)
2020-05-07  9:40                           ` Boris Brezillon
2020-05-07  9:28                       ` Bean Huo (beanhuo)
2020-05-07  9:30                         ` Boris Brezillon
2020-05-07 10:02                           ` Richard Weinberger
2020-05-07 12:20                         ` Richard Weinberger

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.