linux-mtd.lists.infradead.org archive mirror
* Power cut leads to "corrupt empty space"
@ 2020-02-27 13:04 Timo Ketola
  2020-02-27 13:08 ` Fabio Estevam
  2020-03-01 21:28 ` Richard Weinberger
  0 siblings, 2 replies; 11+ messages in thread
From: Timo Ketola @ 2020-02-27 13:04 UTC (permalink / raw)
  To: linux-mtd

Hi,

We have a few i.MX6D devices which have corrupted their UBIFS filesystem
on power cut and refuse to mount them any more.

The log says:

> [   10.382580] UBIFS (ubi1:0): background thread "ubifs_bgt1_0" started, PID 158
> [   10.408838] UBIFS (ubi1:0): recovery needed
> [   10.802070] UBIFS error (ubi1:0 pid 157): ubifs_scan: corrupt empty space at LEB 99:114688
> [   10.809054] UBIFS error (ubi1:0 pid 157): ubifs_scanned_corruption: corruption at LEB 99:114688
> [   10.816471] UBIFS error (ubi1:0 pid 157): ubifs_scanned_corruption: first 8192 bytes from LEB 99:114688
> [   10.824585] 00000000: 06101831 713b7e1b 002e0640 00000000 000000a0 00000200 00000554 00000000  1....~;q@...............T.......
> [   10.824601] 00000020: 00000000 00000000 0001585b 00000000 0008c48d 00000000 5d512897 00000000  ........[X...............(Q]....

...

> [   10.827751] UBIFS error (ubi1:0 pid 157): ubifs_scan: LEB 99 scanning failed
> [   10.834615] UBIFS (ubi1:0): background thread "ubifs_bgt1_0" stops

I think I found the culprit from the mtdblock contents. Fragment from
hexdump:

> 3ca20000  55 42 49 23 01 00 00 00  00 00 00 00 00 00 00 04  |UBI#............|
> 3ca20010  00 00 08 00 00 00 10 00  0c 4d 7c ed 00 00 00 00  |.........M|.....|
> 3ca20020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 3ca20030  00 00 00 00 00 00 00 00  00 00 00 00 cb 5d 1f 01  |.............]..|
> 3ca20040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 3ca20800  55 42 49 21 01 01 00 00  00 00 00 00 00 00 00 63  |UBI!...........c|
> 3ca20810  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 3ca20820  00 00 00 00 00 00 00 00  00 00 00 00 00 00 8d 07  |................|
> 3ca20830  00 00 00 00 00 00 00 00  00 00 00 00 91 2b 87 87  |.............+..|
> 3ca20840  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 3ca21000  31 18 10 06 30 3c 6d 96  cd 05 2e 00 00 00 00 00  |1...0<m.........|
> 3ca21010  a0 00 00 00 00 02 00 00  54 05 00 00 00 00 00 00  |........T.......|

...

> 3ca3b8c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 3ca3c000  31 18 10 06 7b 71 87 8f  3c 06 2e 00 00 00 00 00  |1...{q..<.......|
> 3ca3c010  a0 00 00 00 00 02 00 00  54 05 00 00 00 00 00 00  |........T.......|
> 3ca3c020  00 00 00 00 00 00 00 00  5b 58 01 00 00 00 00 00  |........[X......|
> 3ca3c030  79 c3 08 00 00 00 00 00  97 28 51 5d 00 00 00 00  |y........(Q]....|
> 3ca3c040  19 58 6d 38 00 00 00 00  19 58 6d 38 00 00 00 00  |.Xm8.....Xm8....|
> 3ca3c050  00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00  |................|
> 3ca3c060  eb 03 00 00 eb 03 00 00  a4 81 00 00 01 00 00 00  |................|
> 3ca3c070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 3ca3c080  00 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
> 3ca3c090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 3ca3c0a0  31 18 10 06 84 13 e1 a0  00 00 00 00 00 00 00 00  |1...............|
> 3ca3c0b0  1c 00 00 00 05 00 00 00  44 07 00 00 00 00 00 00  |........D.......|
> 3ca3c0c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 3ca3c800  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 3ca3d000  31 18 10 06 1b 7e 3b 71  40 06 2e 00 00 00 00 00  |1....~;q@.......|
> 3ca3d010  a0 00 00 00 00 02 00 00  54 05 00 00 00 00 00 00  |........T.......|
> 3ca3d020  00 00 00 00 00 00 00 00  5b 58 01 00 00 00 00 00  |........[X......|
> 3ca3d030  8d c4 08 00 00 00 00 00  97 28 51 5d 00 00 00 00  |.........(Q]....|
> 3ca3d040  19 58 6d 38 00 00 00 00  19 58 6d 38 00 00 00 00  |.Xm8.....Xm8....|
> 3ca3d050  00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00  |................|
> 3ca3d060  eb 03 00 00 eb 03 00 00  a4 81 00 00 01 00 00 00  |................|
> 3ca3d070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 3ca3d080  00 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
> 3ca3d090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 3ca3d0a0  31 18 10 06 84 13 e1 a0  00 00 00 00 00 00 00 00  |1...............|
> 3ca3d0b0  1c 00 00 00 05 00 00 00  44 07 00 00 00 00 00 00  |........D.......|
> 3ca3d0c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 3ca3d800  31 18 10 06 c1 6b e6 57  42 06 2e 00 00 00 00 00  |1....k.WB.......|
> 3ca3d810  a0 00 00 00 00 02 00 00  54 05 00 00 00 00 00 00  |........T.......|
> 3ca3d820  00 00 00 00 00 00 00 00  5b 58 01 00 00 00 00 00  |........[X......|
> 3ca3d830  0d c5 08 00 00 00 00 00  97 28 51 5d 00 00 00 00  |.........(Q]....|
> 3ca3d840  19 58 6d 38 00 00 00 00  19 58 6d 38 00 00 00 00  |.Xm8.....Xm8....|
> 3ca3d850  00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00  |................|
> 3ca3d860  eb 03 00 00 eb 03 00 00  a4 81 00 00 01 00 00 00  |................|
> 3ca3d870  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 3ca3d880  00 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
> 3ca3d890  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 3ca3d8a0  31 18 10 06 84 13 e1 a0  00 00 00 00 00 00 00 00  |1...............|
> 3ca3d8b0  1c 00 00 00 05 00 00 00  44 07 00 00 00 00 00 00  |........D.......|
> 3ca3d8c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 3ca3e000  31 18 10 06 0b 75 3d 9e  44 06 2e 00 00 00 00 00  |1....u=.D.......|

IIUC, ubifs_scan finds empty space at 3ca3c800, stops scanning and
checks the rest of the LEB for being empty but finds something else at
3ca3d000. Then recovery aborts and mounting fails.
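
As a cross-check, the offset in the log (LEB 99:114688) and the hexdump addresses can be tied together. The following is a minimal sketch in C, assuming the values visible in the dump: the PEB starts at 0x3ca20000, and the EC header stores a VID header offset of 0x800 and a data offset of 0x1000. The helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Values read from the hexdump (assumed to hold for this PEB). */
#define PEB_BASE    0x3ca20000u /* flash address of the PEB holding LEB 99 */
#define PEB_SIZE    0x20000u    /* erase block size */
#define DATA_OFFSET 0x1000u     /* start of LEB data inside the PEB */

/* Translate an offset inside the LEB into an absolute flash address. */
static uint32_t leb_to_flash(uint32_t peb_base, uint32_t data_offset,
                             uint32_t leb_offset)
{
    assert(leb_offset < PEB_SIZE - data_offset); /* must lie inside the LEB */
    return peb_base + data_offset + leb_offset;
}

/* Inverse: absolute flash address back to an offset inside the LEB. */
static uint32_t flash_to_leb(uint32_t peb_base, uint32_t data_offset,
                             uint32_t flash_addr)
{
    assert(flash_addr >= peb_base + data_offset);
    assert(flash_addr < peb_base + PEB_SIZE);
    return flash_addr - peb_base - data_offset;
}
```

With these numbers, LEB offset 114688 (0x1c000) maps to 0x3ca3d000, the first data after the hole, and the all-FF page at 0x3ca3c800 is LEB offset 112640 (0x1b800).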

Do I understand correctly that empty space should always be continuous
at the end of the LEB?

How could this kind of corruption happen?

Is there any way to recover from this?

Storage is NAND with 0x20000 erase block size and the kernel is 4.9.88.

--

Timo
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/


* Re: Power cut leads to "corrupt empty space"
  2020-02-27 13:04 Power cut leads to "corrupt empty space" Timo Ketola
@ 2020-02-27 13:08 ` Fabio Estevam
  2020-02-27 13:42   ` Timo Ketola
  2020-03-01 21:28 ` Richard Weinberger
  1 sibling, 1 reply; 11+ messages in thread
From: Fabio Estevam @ 2020-02-27 13:08 UTC (permalink / raw)
  To: Timo Ketola; +Cc: linux-mtd

Hi Timo,

On Thu, Feb 27, 2020 at 10:04 AM Timo Ketola <Timo.Ketola@exertus.fi> wrote:

> Storage is NAND with 0x20000 erase block size and the kernel is 4.9.88.

Could you try with kernel 5.4 or 5.5 instead?


* Re: Power cut leads to "corrupt empty space"
  2020-02-27 13:08 ` Fabio Estevam
@ 2020-02-27 13:42   ` Timo Ketola
  2020-02-27 15:16     ` Fabio Estevam
  0 siblings, 1 reply; 11+ messages in thread
From: Timo Ketola @ 2020-02-27 13:42 UTC (permalink / raw)
  To: Fabio Estevam; +Cc: linux-mtd

Hi Fabio,

On 27.2.2020 15.08, Fabio Estevam wrote:
> Hi Timo,
> 
> On Thu, Feb 27, 2020 at 10:04 AM Timo Ketola <Timo.Ketola@exertus.fi> wrote:
> 
>> Storage is NAND with 0x20000 erase block size and the kernel is 4.9.88.
> 
> Could you try with kernel 5.4 or 5.5 instead?
> 

That might take considerable effort. Do you think there are fixes for
this? Would they be on the recovery side, or would they prevent the
issue from happening in the first place?

--

Timo

* Re: Power cut leads to "corrupt empty space"
  2020-02-27 13:42   ` Timo Ketola
@ 2020-02-27 15:16     ` Fabio Estevam
  2020-02-29 12:46       ` Timo Ketola
  0 siblings, 1 reply; 11+ messages in thread
From: Fabio Estevam @ 2020-02-27 15:16 UTC (permalink / raw)
  To: Timo Ketola; +Cc: linux-mtd

Hi Timo,

On Thu, Feb 27, 2020 at 10:42 AM Timo Ketola <Timo.Ketola@exertus.fi> wrote:

> That might take considerable effort. Would you think, there should be
> fixes for this? Would it be on recovery side or preventing the issue
> happening in the first place?

It is hard to tell. 4.9.88 is an old version, so it is better to try with mainline.


* Re: Power cut leads to "corrupt empty space"
  2020-02-27 15:16     ` Fabio Estevam
@ 2020-02-29 12:46       ` Timo Ketola
  2020-02-29 13:13         ` Fabio Estevam
  2020-02-29 14:20         ` Timo Ketola
  0 siblings, 2 replies; 11+ messages in thread
From: Timo Ketola @ 2020-02-29 12:46 UTC (permalink / raw)
  To: Fabio Estevam; +Cc: linux-mtd

On 27.2.2020 17.16, Fabio Estevam wrote:
> Hi Timo,
> 
> On Thu, Feb 27, 2020 at 10:42 AM Timo Ketola <Timo.Ketola@exertus.fi> wrote:
> 
>> That might take considerable effort. Would you think, there should be
>> fixes for this? Would it be on recovery side or preventing the issue
>> happening in the first place?
> 
> It is hard to tell. 4.9.88 is an old version, so better try with mainline
> 

Ok, I managed to get v5.4 booting - almost.

First, we had the 'fsl,legacy-bch-geometry;' flag in the device tree and
I couldn't find how to get the same effect in this kernel in a
'standard way'. I had to put 'nand-ecc-strength = <8>;
nand-ecc-step-size = <512>;' into the device tree and make this change
in drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c:

> @@ -507,11 +507,11 @@ static int common_nfc_set_geometry(struct gpmi_nand_data *this)
>  	struct nand_chip *chip = &this->nand;
>  
>  	if (chip->ecc.strength > 0 && chip->ecc.size > 0)
>  		return set_geometry_by_ecc_info(this, chip->ecc.strength,
>  						chip->ecc.size);
> -
> +	return legacy_set_geometry(this);
>  	if ((of_property_read_bool(this->dev->of_node, "fsl,use-minimum-ecc"))
>  				|| legacy_set_geometry(this)) {
>  		if (!(chip->base.eccreq.strength > 0 &&
>  		      chip->base.eccreq.step_size > 0))
>  			return -EINVAL;

That is, call legacy_set_geometry unconditionally without then calling
set_geometry_by_ecc_info. After this it began to read the first half of
the NAND correctly.

Then there is a bug (I think) in the NAND chip S34ML16G2. It has four
S34ML04G2 dies and two chip selects in the package and shows up as two
chips. It reports 128KiB per EB, 8192 EBs per LUN and 2 LUNs, making up
2GiB. This is correct for the package, but then Linux finds two such
chips, for a total of 4GiB, which is not correct. So I have this in
drivers/mtd/nand/raw/nand_base.c:

> @@ -4733,12 +4760,36 @@ static int nand_detect(struct nand_chip *chip, struct nand_flash_dev *type)
>  	if (!type->name || !type->pagesize) {
>  		/* Check if the chip is ONFI compliant */
>  		ret = nand_onfi_detect(chip);
>  		if (ret < 0)
>  			return ret;
> -		else if (ret)
> +		else if (ret) {
> +			if (type->name) {
> +				struct nand_device *nand = &chip->base;
> +				unsigned luns;
> +
> +				pr_info("%s detected\n", type->name);
> +				pr_info("luns %d, eraseblocks %d, pages %d, page size %d\n",
> +						nand->memorg.luns_per_target,
> +						nand->memorg.eraseblocks_per_lun,
> +						nand->memorg.pages_per_eraseblock,
> +						nand->memorg.pagesize);
> +				pr_info("sizes: page 0x%X, erase 0x%X, chip 0x%X\n",
> +						type->pagesize,
> +						type->erasesize,
> +						type->chipsize);
> +				luns = DIV_ROUND_DOWN_ULL((u64)type->chipsize << 20,
> +						nand->memorg.pagesize *
> +						nand->memorg.pages_per_eraseblock *
> +						nand->memorg.eraseblocks_per_lun);
> +				if (nand->memorg.luns_per_target != luns) {
> +					printk("Correcting luns-per-target to %d", luns);
> +					nand->memorg.luns_per_target = luns;
> +				}
> +			}
>  			goto ident_done;
> +		}
>  
>  		/* Check if the chip is JEDEC compliant */
>  		ret = nand_jedec_detect(chip);
>  		if (ret < 0)
>  			return ret;

output:

> nand: NAND 1GiB 3,3V 8-bit detected
> nand: luns 2, eraseblocks 8192, pages 64, page size 2048
> nand: sizes: page 0x0, erase 0x0, chip 0x400
> Correcting luns-per-target to 1
> nand: device found, Manufacturer ID: 0x01, Chip ID: 0xd3
> nand: AMD/Spansion S34ML16G2
> nand: 1024 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 128
> nand: 2 chips detected
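
The correction to 1 follows from the numbers in that output. Here is a small sketch in C of the same arithmetic as the patch, outside the kernel. The helper name is hypothetical, and the chip size is taken in MiB as in the ID table entry (0x400):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the LUN arithmetic of the patch above. */
static unsigned int luns_from_chipsize(uint32_t chipsize_mib,
                                       uint32_t pagesize,
                                       uint32_t pages_per_eraseblock,
                                       uint32_t eraseblocks_per_lun)
{
    /* Size of one target (chip select) in bytes. */
    uint64_t target_bytes = (uint64_t)chipsize_mib << 20;
    /* Widen before multiplying so larger geometries cannot overflow 32 bits. */
    uint64_t lun_bytes = (uint64_t)pagesize * pages_per_eraseblock *
                         eraseblocks_per_lun;

    assert(lun_bytes > 0);
    return (unsigned int)(target_bytes / lun_bytes);
}
```

For a 1024 MiB target with 2048-byte pages, 64 pages per block and 8192 blocks per LUN, one LUN is already 1 GiB, so the result is 1 LUN per target rather than the 2 reported by the chip.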

That idea worked on the v4.9 imx kernel but not here. The driver reports
ECC errors for the second half of the NAND. I have debugged down to the
gpmi driver and checked that the page address is as it should be (e.g.
realpage 524288, page 0 0x80000 in nand_do_read_ops for the first page
of the second half) and that target selection changes correctly. But it
reads only FFs. Still, it seems to erase the correct blocks when trying
to write BBTs.

I put this in drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c:

> @@ -2270,10 +2270,18 @@ static struct dma_async_tx_descriptor *gpmi_chain_command(
>  
>  	transfer->direction = DMA_TO_DEVICE;
>  
>  	desc = dmaengine_prep_slave_sg(channel, &transfer->sgl, 1, DMA_MEM_TO_DEV,
>  				       MXS_DMA_CTRL_WAIT4END);
> +	if (1) {
> +		unsigned i;
> +		char b[160], *p;
> +
> +		p = b + sprintf(b, "Transfer from/to chip %d, pio[0] %X, naddr %d, addr", chip, pio[0], naddr);
> +		for (i = 0; i < naddr; ++i) p += sprintf(p, " %02X", addr[i]);
> +		pr_info("%s\n", b);
> +	}
>  	return desc;
>  }
>  

and see

> Transfer from/to chip 1, pio[0] 930004, naddr 3, addr C0 FF 07

for erase, which seems to work and

> Transfer from/to chip 1, pio[0] 930006, naddr 5, addr 00 00 C0 FF 07

for reads/writes, which fail.
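
For reference, the address bytes in those two lines decode as follows. This is a sketch in C, assuming the usual large-page layout of two column cycles followed by three row cycles, least significant byte first. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Split a column (byte offset inside the page) and a row (page number)
 * into the five address cycles of a read/program command. An erase
 * sends only the three row cycles, out[2..4].
 */
static void nand_addr_cycles(uint16_t column, uint32_t page, uint8_t out[5])
{
    assert(page < (1u << 24)); /* only three row cycles are available */
    out[0] = column & 0xff;
    out[1] = column >> 8;
    out[2] = page & 0xff;
    out[3] = (page >> 8) & 0xff;
    out[4] = (page >> 16) & 0xff;
}

/* Rebuild the page number from the three row cycles. */
static uint32_t nand_row_to_page(const uint8_t row[3])
{
    return (uint32_t)row[0] | ((uint32_t)row[1] << 8) |
           ((uint32_t)row[2] << 16);
}
```

The row bytes C0 FF 07 give page 0x07ffc0 = 524224. With 64 pages per block, that is the first page of block 8191, the last block of a target, and both the erase and the read/write address the same row.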

I'm really stuck.

--

Timo

* Re: Power cut leads to "corrupt empty space"
  2020-02-29 12:46       ` Timo Ketola
@ 2020-02-29 13:13         ` Fabio Estevam
  2020-02-29 14:20         ` Timo Ketola
  1 sibling, 0 replies; 11+ messages in thread
From: Fabio Estevam @ 2020-02-29 13:13 UTC (permalink / raw)
  To: Timo Ketola; +Cc: Han Xu, linux-mtd, Miquel Raynal

Adding Han Xu and Miquel



* Re: Power cut leads to "corrupt empty space"
  2020-02-29 12:46       ` Timo Ketola
  2020-02-29 13:13         ` Fabio Estevam
@ 2020-02-29 14:20         ` Timo Ketola
  1 sibling, 0 replies; 11+ messages in thread
From: Timo Ketola @ 2020-02-29 14:20 UTC (permalink / raw)
  To: Fabio Estevam; +Cc: Han Xu, linux-mtd, Miquel Raynal

On 29.2.2020 14.46, Timo Ketola wrote:
> I had to put 'nand-ecc-strength = <8>;
> nand-ecc-step-size = <512>;' into the device tree

Actually, I tried these but they didn't help. They are not there any more.

> and make this change
> in drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c:
> 
>> @@ -507,11 +507,11 @@ static int common_nfc_set_geometry(struct gpmi_nand_data *this)

That was needed.

> Still, it seems to erase correct blocks when trying to write BBTs.

This might not be true.

>> Transfer from/to chip 1, pio[0] 930006, naddr 5, addr 00 00 C0 FF 07

I tried the same in my v4.9 kernel and saw very (exactly?) similar
transactions and it works:

> Transfer from/to chip 0, pio[0] 830006, len 6, cmd 00 00 00 C0 FF 07
> Transfer from/to chip 0, pio[0] 830001, len 1, cmd 30
> Transfer from/to chip 1, pio[0] 930006, len 6, cmd 00 00 00 C0 FF 07
> Transfer from/to chip 1, pio[0] 930001, len 1, cmd 30
> Bad block table found at page 524224, version 0x01
> Bad block table found at page 1048512, version 0x01

--

Timo

* Re: Power cut leads to "corrupt empty space"
  2020-02-27 13:04 Power cut leads to "corrupt empty space" Timo Ketola
  2020-02-27 13:08 ` Fabio Estevam
@ 2020-03-01 21:28 ` Richard Weinberger
  2020-03-02 12:57   ` Timo Ketola
  1 sibling, 1 reply; 11+ messages in thread
From: Richard Weinberger @ 2020-03-01 21:28 UTC (permalink / raw)
  To: Timo Ketola; +Cc: linux-mtd

Timo,

On Thu, Feb 27, 2020 at 2:04 PM Timo Ketola <Timo.Ketola@exertus.fi> wrote:
> We have a few i.MX6D devices which have corrupted their UBIFS filesystem
> on power cut and refuse to mount them any more.
>
> The log says:
>
> > [   10.382580] UBIFS (ubi1:0): background thread "ubifs_bgt1_0" started, PID 158
> > [   10.408838] UBIFS (ubi1:0): recovery needed
> > [   10.802070] UBIFS error (ubi1:0 pid 157): ubifs_scan: corrupt empty space at LEB 99:114688
> > [   10.809054] UBIFS error (ubi1:0 pid 157): ubifs_scanned_corruption: corruption at LEB 99:114688
> > [   10.816471] UBIFS error (ubi1:0 pid 157): ubifs_scanned_corruption: first 8192 bytes from LEB 99:114688
> > [   10.824585] 00000000: 06101831 713b7e1b 002e0640 00000000 000000a0 00000200 00000554 00000000  1....~;q@...............T.......
> > [   10.824601] 00000020: 00000000 00000000 0001585b 00000000 0008c48d 00000000 5d512897 00000000  ........[X...............(Q]....
>
> ...
>
> > [   10.827751] UBIFS error (ubi1:0 pid 157): ubifs_scan: LEB 99 scanning failed
> > [   10.834615] UBIFS (ubi1:0): background thread "ubifs_bgt1_0" stops
>
> I think I found the culprit from the mtdblock contents. Fragment from
> hexdump:
>
> > 3ca20000  55 42 49 23 01 00 00 00  00 00 00 00 00 00 00 04  |UBI#............|
> > 3ca20010  00 00 08 00 00 00 10 00  0c 4d 7c ed 00 00 00 00  |.........M|.....|
> > 3ca20020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > 3ca20030  00 00 00 00 00 00 00 00  00 00 00 00 cb 5d 1f 01  |.............]..|
> > 3ca20040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > *
> > 3ca20800  55 42 49 21 01 01 00 00  00 00 00 00 00 00 00 63  |UBI!...........c|
> > 3ca20810  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > 3ca20820  00 00 00 00 00 00 00 00  00 00 00 00 00 00 8d 07  |................|
> > 3ca20830  00 00 00 00 00 00 00 00  00 00 00 00 91 2b 87 87  |.............+..|
> > 3ca20840  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > *
> > 3ca21000  31 18 10 06 30 3c 6d 96  cd 05 2e 00 00 00 00 00  |1...0<m.........|
> > 3ca21010  a0 00 00 00 00 02 00 00  54 05 00 00 00 00 00 00  |........T.......|
>
> ...
>
> > 3ca3b8c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > *
> > 3ca3c000  31 18 10 06 7b 71 87 8f  3c 06 2e 00 00 00 00 00  |1...{q..<.......|
> > 3ca3c010  a0 00 00 00 00 02 00 00  54 05 00 00 00 00 00 00  |........T.......|
> > 3ca3c020  00 00 00 00 00 00 00 00  5b 58 01 00 00 00 00 00  |........[X......|
> > 3ca3c030  79 c3 08 00 00 00 00 00  97 28 51 5d 00 00 00 00  |y........(Q]....|
> > 3ca3c040  19 58 6d 38 00 00 00 00  19 58 6d 38 00 00 00 00  |.Xm8.....Xm8....|
> > 3ca3c050  00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00  |................|
> > 3ca3c060  eb 03 00 00 eb 03 00 00  a4 81 00 00 01 00 00 00  |................|
> > 3ca3c070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > 3ca3c080  00 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
> > 3ca3c090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > 3ca3c0a0  31 18 10 06 84 13 e1 a0  00 00 00 00 00 00 00 00  |1...............|
> > 3ca3c0b0  1c 00 00 00 05 00 00 00  44 07 00 00 00 00 00 00  |........D.......|
> > 3ca3c0c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > *
> > 3ca3c800  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> > *
> > 3ca3d000  31 18 10 06 1b 7e 3b 71  40 06 2e 00 00 00 00 00  |1....~;q@.......|

So there is a whole 2KiB area of 0xFF in there.
It is also aligned, so it could be a whole page.

> > 3ca3d010  a0 00 00 00 00 02 00 00  54 05 00 00 00 00 00 00  |........T.......|
> > 3ca3d020  00 00 00 00 00 00 00 00  5b 58 01 00 00 00 00 00  |........[X......|
> > 3ca3d030  8d c4 08 00 00 00 00 00  97 28 51 5d 00 00 00 00  |.........(Q]....|
> > 3ca3d040  19 58 6d 38 00 00 00 00  19 58 6d 38 00 00 00 00  |.Xm8.....Xm8....|
> > 3ca3d050  00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00  |................|
> > 3ca3d060  eb 03 00 00 eb 03 00 00  a4 81 00 00 01 00 00 00  |................|
> > 3ca3d070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > 3ca3d080  00 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
> > 3ca3d090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > 3ca3d0a0  31 18 10 06 84 13 e1 a0  00 00 00 00 00 00 00 00  |1...............|
> > 3ca3d0b0  1c 00 00 00 05 00 00 00  44 07 00 00 00 00 00 00  |........D.......|
> > 3ca3d0c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > *
> > 3ca3d800  31 18 10 06 c1 6b e6 57  42 06 2e 00 00 00 00 00  |1....k.WB.......|
> > 3ca3d810  a0 00 00 00 00 02 00 00  54 05 00 00 00 00 00 00  |........T.......|
> > 3ca3d820  00 00 00 00 00 00 00 00  5b 58 01 00 00 00 00 00  |........[X......|
> > 3ca3d830  0d c5 08 00 00 00 00 00  97 28 51 5d 00 00 00 00  |.........(Q]....|
> > 3ca3d840  19 58 6d 38 00 00 00 00  19 58 6d 38 00 00 00 00  |.Xm8.....Xm8....|
> > 3ca3d850  00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00  |................|
> > 3ca3d860  eb 03 00 00 eb 03 00 00  a4 81 00 00 01 00 00 00  |................|
> > 3ca3d870  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > 3ca3d880  00 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
> > 3ca3d890  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > 3ca3d8a0  31 18 10 06 84 13 e1 a0  00 00 00 00 00 00 00 00  |1...............|
> > 3ca3d8b0  1c 00 00 00 05 00 00 00  44 07 00 00 00 00 00 00  |........D.......|
> > 3ca3d8c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > *
> > 3ca3e000  31 18 10 06 0b 75 3d 9e  44 06 2e 00 00 00 00 00  |1....u=.D.......|
>
> IIUC, ubifs_scan finds empty space at 3ca3c800, stops scanning and
> checks the rest of the LEB for being empty but finds something else at
> 3ca3d000. Then recovery aborts and mounting fails.
>
> Do I understand correctly that empty space should always be continuous
> at the end of the LEB?

Correct.
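
A minimal sketch of that rule in C, for illustration only. It is not the kernel's ubifs_scan code, which walks the LEB node by node. The function names and the page-granular check are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if every byte in buf[0..len) is 0xFF (erased flash), else 0. */
static int is_erased(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0xff)
            return 0;
    return 1;
}

/*
 * Walk a LEB page by page. Once the first erased page is seen, every
 * later page must be erased too. Returns the offset of the first page
 * that holds data after empty space, or -1 if the LEB is consistent.
 */
static long find_corrupt_empty_space(const uint8_t *leb, size_t leb_size,
                                     size_t page_size)
{
    int seen_empty = 0;

    assert(page_size > 0 && leb_size % page_size == 0);
    for (size_t off = 0; off < leb_size; off += page_size) {
        if (is_erased(leb + off, page_size))
            seen_empty = 1;
        else if (seen_empty)
            return (long)off; /* data found after empty space */
    }
    return -1;
}
```

Run over the LEB from the dump with a 2048-byte page size, a check like this would stop at the page following the 0xFF page, which matches the offset reported in the log.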

> How could this kind of corruption happen?

Hard to say. Maybe bad timing settings which cause writes to have no effect.
But usually this leads to ECC errors.
If you can share the image with me I can have a look and with some luck we
find traces.

Is this a mainline kernel?
Wonky drivers can lead to all kinds of "interesting" results. :->

> Is there any way to recover from this?

Not really. UBIFS' IO model got violated and it gives up.

> Storage is NAND with 0x20000 erase block size and the kernel is 4.9.88.

I guess 2KiB page size?

-- 
Thanks,
//richard


* Re: Power cut leads to "corrupt empty space"
  2020-03-01 21:28 ` Richard Weinberger
@ 2020-03-02 12:57   ` Timo Ketola
  2020-03-02 21:02     ` Richard Weinberger
  0 siblings, 1 reply; 11+ messages in thread
From: Timo Ketola @ 2020-03-02 12:57 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-mtd

On 1.3.2020 23.28, Richard Weinberger wrote:
> If you can share the image with me I can have a look and with some luck we
> find traces.

Thank you. I'll send the link separately.

> Is this a mainline kernel?
> Wonky drivers can lead to all kind of "interesting" results. :->

It is boundary-imx-o8.0.0_1.0.0-ga-pass2 from

https://github.com/boundarydevices/linux-imx6.git

branched at a51fcd6bd17c with our board support and patched with
v.4.9.88-rt66 from

git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

> I guess 2KiB page size?

Yes

--

Timo



* Re: Power cut leads to "corrupt empty space"
  2020-03-02 12:57   ` Timo Ketola
@ 2020-03-02 21:02     ` Richard Weinberger
  2020-03-03  6:27       ` Timo Ketola
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Weinberger @ 2020-03-02 21:02 UTC (permalink / raw)
  To: Timo Ketola; +Cc: linux-mtd

On Mon, Mar 2, 2020 at 1:57 PM Timo Ketola <Timo.Ketola@exertus.fi> wrote:
>
> On 1.3.2020 23.28, Richard Weinberger wrote:
> > If you can share the image with me I can have a look and with some luck we
> > find traces.
>
> Thank you. I'll send the link separately.
>
> > Is this a mainline kernel?
> > Wonky drivers can lead to all kind of "interesting" results. :->
>
> It is boundary-imx-o8.0.0_1.0.0-ga-pass2 from
>
> https://github.com/boundarydevices/linux-imx6.git
>
> branched at a51fcd6bd17c with our board support and patched with
> v.4.9.88-rt66 from

Hmm, vendor tree....
I strongly suggest giving mainline a try.
Did you also double check your NAND settings, especially timings?

--
Thanks,
//richard


* Re: Power cut leads to "corrupt empty space"
  2020-03-02 21:02     ` Richard Weinberger
@ 2020-03-03  6:27       ` Timo Ketola
  0 siblings, 0 replies; 11+ messages in thread
From: Timo Ketola @ 2020-03-03  6:27 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-mtd

Hi Richard,

Thanks for looking at this!

On 2.3.2020 23.02, Richard Weinberger wrote:
> Hmm, vendor tree....
> I strongly suggest giving mainline a try.

I share the feeling, and I have tried a couple of times to switch to
mainline (4.12 and 4.17) but failed. There were issues getting the GPU
and camera interfaces working which I was unable to solve. This time I
tried 5.4 but couldn't get even the NAND subsystem alone working:

http://lists.infradead.org/pipermail/linux-mtd/2020-February/094090.html

> Did you also double check your NAND settings, especially timings?

Not yet. I focused on finding out if the corruption could be recovered.
Now that it seems impossible, I obviously have to devise tests to try to
make sure it never happens again in the first place.

At least for now, the incidents suggest that this relates somehow to the
power cut. That would speak against bad timings. And I have a design
blooper there: when the supply voltage drops, the NAND write protect
signal is set hard. Now I'm thinking about a 'dirty power loss' scenario,
where the supply voltage drops momentarily just before the actual total
power loss, so that one page write fails and then several page writes
succeed before the final power cut. But shouldn't one failed page write
put the whole UBI/UBIFS volume in R/O mode and prevent further writes?

I hope you got my other mail with the link to the UBI image. It does
seem like simply one page in the middle had been left unwritten, doesn't
it? Is there anything there, which could be used to estimate how long
before power cut that happened?
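
One thing the hexdump does show is the sequence numbers in the node headers. Below is a minimal sketch in C for decoding them, assuming the UBIFS common header layout (little-endian magic, CRC, 64-bit sqnum, 32-bit length, node type byte). The struct and function names are made up for this example, and the CRC is not checked.

```c
#include <assert.h>
#include <stdint.h>

#define UBIFS_NODE_MAGIC 0x06101831u

/* Fields of the 24-byte common header that matter here. */
struct ch_info {
    uint32_t magic;
    uint64_t sqnum;
    uint32_t len;
    uint8_t  node_type;
};

/* Read a little-endian 32-bit value. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read a little-endian 64-bit value. */
static uint64_t le64(const uint8_t *p)
{
    return (uint64_t)le32(p) | ((uint64_t)le32(p + 4) << 32);
}

/* Decode the common header; returns 0 on success, -1 on bad magic. */
static int parse_ch(const uint8_t *p, struct ch_info *out)
{
    assert(p && out);
    out->magic = le32(p);        /* bytes 0..3 */
    /* bytes 4..7 hold the CRC, which this sketch does not verify */
    out->sqnum = le64(p + 8);    /* bytes 8..15 */
    out->len = le32(p + 16);     /* bytes 16..19 */
    out->node_type = p[20];      /* byte 20 */
    return out->magic == UBIFS_NODE_MAGIC ? 0 : -1;
}
```

Applied to the dump, the node at 0x3ca3c000 has sqnum 0x2e063c, and the nodes at 0x3ca3d000, 0x3ca3d800 and 0x3ca3e000 have 0x2e0640, 0x2e0642 and 0x2e0644. The step is 2 between the written pages and 4 across the hole, which is consistent with exactly one page write (sqnum 0x2e063e) going missing.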

--

Timo

end of thread, other threads:[~2020-03-03  6:28 UTC | newest]

Thread overview: 11+ messages
2020-02-27 13:04 Power cut leads to "corrupt empty space" Timo Ketola
2020-02-27 13:08 ` Fabio Estevam
2020-02-27 13:42   ` Timo Ketola
2020-02-27 15:16     ` Fabio Estevam
2020-02-29 12:46       ` Timo Ketola
2020-02-29 13:13         ` Fabio Estevam
2020-02-29 14:20         ` Timo Ketola
2020-03-01 21:28 ` Richard Weinberger
2020-03-02 12:57   ` Timo Ketola
2020-03-02 21:02     ` Richard Weinberger
2020-03-03  6:27       ` Timo Ketola
