* Power cut leads to "corrupt empty space"
From: Timo Ketola @ 2020-02-27 13:04 UTC
To: linux-mtd

Hi,

We have a few i.MX6D devices which have corrupted their UBIFS filesystem
on power cut and refuse to mount them any more.

The log says:

> [ 10.382580] UBIFS (ubi1:0): background thread "ubifs_bgt1_0" started, PID 158
> [ 10.408838] UBIFS (ubi1:0): recovery needed
> [ 10.802070] UBIFS error (ubi1:0 pid 157): ubifs_scan: corrupt empty space at LEB 99:114688
> [ 10.809054] UBIFS error (ubi1:0 pid 157): ubifs_scanned_corruption: corruption at LEB 99:114688
> [ 10.816471] UBIFS error (ubi1:0 pid 157): ubifs_scanned_corruption: first 8192 bytes from LEB 99:114688
> [ 10.824585] 00000000: 06101831 713b7e1b 002e0640 00000000 000000a0 00000200 00000554 00000000 1....~;q@...............T.......
> [ 10.824601] 00000020: 00000000 00000000 0001585b 00000000 0008c48d 00000000 5d512897 00000000 ........[X...............(Q]....
...
> [ 10.827751] UBIFS error (ubi1:0 pid 157): ubifs_scan: LEB 99 scanning failed
> [ 10.834615] UBIFS (ubi1:0): background thread "ubifs_bgt1_0" stops

I think I found the culprit from the mtdblock contents. Fragment from
hexdump:

> 3ca20000 55 42 49 23 01 00 00 00 00 00 00 00 00 00 00 04 |UBI#............|
> 3ca20010 00 00 08 00 00 00 10 00 0c 4d 7c ed 00 00 00 00 |.........M|.....|
> 3ca20020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> 3ca20030 00 00 00 00 00 00 00 00 00 00 00 00 cb 5d 1f 01 |.............]..|
> 3ca20040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 3ca20800 55 42 49 21 01 01 00 00 00 00 00 00 00 00 00 63 |UBI!...........c|
> 3ca20810 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> 3ca20820 00 00 00 00 00 00 00 00 00 00 00 00 00 00 8d 07 |................|
> 3ca20830 00 00 00 00 00 00 00 00 00 00 00 00 91 2b 87 87 |.............+..|
> 3ca20840 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 3ca21000 31 18 10 06 30 3c 6d 96 cd 05 2e 00 00 00 00 00 |1...0<m.........|
> 3ca21010 a0 00 00 00 00 02 00 00 54 05 00 00 00 00 00 00 |........T.......|
...
> 3ca3b8c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 3ca3c000 31 18 10 06 7b 71 87 8f 3c 06 2e 00 00 00 00 00 |1...{q..<.......|
> 3ca3c010 a0 00 00 00 00 02 00 00 54 05 00 00 00 00 00 00 |........T.......|
> 3ca3c020 00 00 00 00 00 00 00 00 5b 58 01 00 00 00 00 00 |........[X......|
> 3ca3c030 79 c3 08 00 00 00 00 00 97 28 51 5d 00 00 00 00 |y........(Q]....|
> 3ca3c040 19 58 6d 38 00 00 00 00 19 58 6d 38 00 00 00 00 |.Xm8.....Xm8....|
> 3ca3c050 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 |................|
> 3ca3c060 eb 03 00 00 eb 03 00 00 a4 81 00 00 01 00 00 00 |................|
> 3ca3c070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> 3ca3c080 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 |................|
> 3ca3c090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> 3ca3c0a0 31 18 10 06 84 13 e1 a0 00 00 00 00 00 00 00 00 |1...............|
> 3ca3c0b0 1c 00 00 00 05 00 00 00 44 07 00 00 00 00 00 00 |........D.......|
> 3ca3c0c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 3ca3c800 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
> *
> 3ca3d000 31 18 10 06 1b 7e 3b 71 40 06 2e 00 00 00 00 00 |1....~;q@.......|
> 3ca3d010 a0 00 00 00 00 02 00 00 54 05 00 00 00 00 00 00 |........T.......|
> 3ca3d020 00 00 00 00 00 00 00 00 5b 58 01 00 00 00 00 00 |........[X......|
> 3ca3d030 8d c4 08 00 00 00 00 00 97 28 51 5d 00 00 00 00 |.........(Q]....|
> 3ca3d040 19 58 6d 38 00 00 00 00 19 58 6d 38 00 00 00 00 |.Xm8.....Xm8....|
> 3ca3d050 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 |................|
> 3ca3d060 eb 03 00 00 eb 03 00 00 a4 81 00 00 01 00 00 00 |................|
> 3ca3d070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> 3ca3d080 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 |................|
> 3ca3d090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> 3ca3d0a0 31 18 10 06 84 13 e1 a0 00 00 00 00 00 00 00 00 |1...............|
> 3ca3d0b0 1c 00 00 00 05 00 00 00 44 07 00 00 00 00 00 00 |........D.......|
> 3ca3d0c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 3ca3d800 31 18 10 06 c1 6b e6 57 42 06 2e 00 00 00 00 00 |1....k.WB.......|
> 3ca3d810 a0 00 00 00 00 02 00 00 54 05 00 00 00 00 00 00 |........T.......|
> 3ca3d820 00 00 00 00 00 00 00 00 5b 58 01 00 00 00 00 00 |........[X......|
> 3ca3d830 0d c5 08 00 00 00 00 00 97 28 51 5d 00 00 00 00 |.........(Q]....|
> 3ca3d840 19 58 6d 38 00 00 00 00 19 58 6d 38 00 00 00 00 |.Xm8.....Xm8....|
> 3ca3d850 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 |................|
> 3ca3d860 eb 03 00 00 eb 03 00 00 a4 81 00 00 01 00 00 00 |................|
> 3ca3d870 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> 3ca3d880 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 |................|
> 3ca3d890 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> 3ca3d8a0 31 18 10 06 84 13 e1 a0 00 00 00 00 00 00 00 00 |1...............|
> 3ca3d8b0 1c 00 00 00 05 00 00 00 44 07 00 00 00 00 00 00 |........D.......|
> 3ca3d8c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 3ca3e000 31 18 10 06 0b 75 3d 9e 44 06 2e 00 00 00 00 00 |1....u=.D.......|

IIUC, ubifs_scan finds empty space at 3ca3c800, stops scanning and
checks the rest of the LEB for being empty but finds something else at
3ca3d000. Then recovery aborts and mounting fails.

Do I understand correctly that empty space should always be continuous
at the end of the LEB?

How could this kind of corruption happen?

Is there any way to recover from this?

Storage is NAND with 0x20000 erase block size and the kernel is 4.9.88.

--
Timo

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

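To relate the kernel log to the hexdump, the LEB offset in the error message can be converted into an offset within the dumped partition. The sketch below is only an illustration and not code from the thread. It assumes that the PEB holding LEB 99 starts at 0x3ca20000 (where the "UBI#" erase-counter header appears, followed at 0x3ca20800 by a "UBI!" VID header whose last four bytes read 0x63 = 99) and that LEB data begins 0x1000 bytes into the PEB, as the bytes `00 00 10 00` in that first header suggest. The function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Erase block size quoted in the thread (0x20000 = 128 KiB). */
#define PEB_SIZE 0x20000u

/*
 * Map an offset inside a LEB to an offset in the dumped MTD partition.
 * peb_base and data_offset are assumptions read off the hexdump.
 */
uint32_t leb_to_flash(uint32_t peb_base, uint32_t data_offset,
                      uint32_t leb_offset)
{
    assert(data_offset < PEB_SIZE);
    assert(leb_offset < PEB_SIZE - data_offset);
    return peb_base + data_offset + leb_offset;
}

/* Inverse mapping: partition offset back to an offset inside the LEB. */
uint32_t flash_to_leb(uint32_t peb_base, uint32_t data_offset,
                      uint32_t flash_addr)
{
    assert(flash_addr >= peb_base + data_offset);
    assert(flash_addr < peb_base + PEB_SIZE);
    return flash_addr - peb_base - data_offset;
}
```

Under these assumptions, `leb_to_flash(0x3ca20000, 0x1000, 114688)` gives 0x3ca3d000, the address where the dump shows node data again, and `flash_to_leb(0x3ca20000, 0x1000, 0x3ca3c800)` gives 112640, the LEB offset of the all-0xFF page one 2 KiB page earlier.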
* Re: Power cut leads to "corrupt empty space"
From: Fabio Estevam @ 2020-02-27 13:08 UTC
To: Timo Ketola; +Cc: linux-mtd

Hi Timo,

On Thu, Feb 27, 2020 at 10:04 AM Timo Ketola <Timo.Ketola@exertus.fi> wrote:

> Storage is NAND with 0x20000 erase block size and the kernel is 4.9.88.

Could you try with kernel 5.4 or 5.5 instead?

* Re: Power cut leads to "corrupt empty space"
From: Timo Ketola @ 2020-02-27 13:42 UTC
To: Fabio Estevam; +Cc: linux-mtd

Hi Fabio,

On 27.2.2020 15.08, Fabio Estevam wrote:
> Hi Timo,
>
> On Thu, Feb 27, 2020 at 10:04 AM Timo Ketola <Timo.Ketola@exertus.fi> wrote:
>
>> Storage is NAND with 0x20000 erase block size and the kernel is 4.9.88.
>
> Could you try with kernel 5.4 or 5.5 instead?
>

That might take considerable effort. Would you think, there should be
fixes for this? Would it be on recovery side or preventing the issue
happening in the first place?

--
Timo

* Re: Power cut leads to "corrupt empty space"
From: Fabio Estevam @ 2020-02-27 15:16 UTC
To: Timo Ketola; +Cc: linux-mtd

Hi Timo,

On Thu, Feb 27, 2020 at 10:42 AM Timo Ketola <Timo.Ketola@exertus.fi> wrote:

> That might take considerable effort. Would you think, there should be
> fixes for this? Would it be on recovery side or preventing the issue
> happening in the first place?

It is hard to tell. 4.9.88 is an old version, so better try with mainline

* Re: Power cut leads to "corrupt empty space"
From: Timo Ketola @ 2020-02-29 12:46 UTC
To: Fabio Estevam; +Cc: linux-mtd

On 27.2.2020 17.16, Fabio Estevam wrote:
> Hi Timo,
>
> On Thu, Feb 27, 2020 at 10:42 AM Timo Ketola <Timo.Ketola@exertus.fi> wrote:
>
>> That might take considerable effort. Would you think, there should be
>> fixes for this? Would it be on recovery side or preventing the issue
>> happening in the first place?
>
> It is hard to tell. 4.9.88 is an old version, so better try with mainline
>

Ok, I managed to get v5.4 booting - almost.

First, we had the 'fsl,legacy-bch-geometry;' flag in the device tree and I
couldn't find how I would get the same effect in this kernel in a
'standard way'. I had to put 'nand-ecc-strength = <8>;
nand-ecc-step-size = <512>;' into the device tree and make this change
in drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c:

> @@ -507,11 +507,11 @@ static int common_nfc_set_geometry(struct gpmi_nand_data *this)
>          struct nand_chip *chip = &this->nand;
>
>          if (chip->ecc.strength > 0 && chip->ecc.size > 0)
>                  return set_geometry_by_ecc_info(this, chip->ecc.strength,
>                                                  chip->ecc.size);
> -
> +        return legacy_set_geometry(this);
>          if ((of_property_read_bool(this->dev->of_node, "fsl,use-minimum-ecc"))
>                                  || legacy_set_geometry(this)) {
>                  if (!(chip->base.eccreq.strength > 0 &&
>                        chip->base.eccreq.step_size > 0))
>                          return -EINVAL;

That is, call legacy_set_geometry unconditionally without then calling
set_geometry_by_ecc_info. After this it began to read the first half of
the NAND correctly.

Then there is a bug (I think) in the NAND chip S34ML16G2. It has four
S34ML04G2 dies and two chip selects in the package and shows up as two
chips. It reports 128KiB per EB, 8192 EBs per LUN and 2 LUNs making up
2GiB. This is correct for the package but then Linux finds two such
chips, total of 4GiB, which is not correct. So I have this in
drivers/mtd/nand/raw/nand_base.c:

> @@ -4733,12 +4760,36 @@ static int nand_detect(struct nand_chip *chip, struct nand_flash_dev *type)
>          if (!type->name || !type->pagesize) {
>                  /* Check if the chip is ONFI compliant */
>                  ret = nand_onfi_detect(chip);
>                  if (ret < 0)
>                          return ret;
> -                else if (ret)
> +                else if (ret) {
> +                        if (type->name) {
> +                                struct nand_device *nand = &chip->base;
> +                                unsigned luns;
> +
> +                                pr_info("%s detected\n", type->name);
> +                                pr_info("luns %d, eraseblocks %d, pages %d, page size %d\n",
> +                                        nand->memorg.luns_per_target,
> +                                        nand->memorg.eraseblocks_per_lun,
> +                                        nand->memorg.pages_per_eraseblock,
> +                                        nand->memorg.pagesize);
> +                                pr_info("sizes: page 0x%X, erase 0x%X, chip 0x%X\n",
> +                                        type->pagesize,
> +                                        type->erasesize,
> +                                        type->chipsize);
> +                                luns = DIV_ROUND_DOWN_ULL((u64)type->chipsize << 20,
> +                                                          nand->memorg.pagesize *
> +                                                          nand->memorg.pages_per_eraseblock *
> +                                                          nand->memorg.eraseblocks_per_lun);
> +                                if (nand->memorg.luns_per_target != luns) {
> +                                        printk("Correcting luns-per-target to %d", luns);
> +                                        nand->memorg.luns_per_target = luns;
> +                                }
> +                        }
>                          goto ident_done;
> +                }
>
>                  /* Check if the chip is JEDEC compliant */
>                  ret = nand_jedec_detect(chip);
>                  if (ret < 0)
>                          return ret;

output:

> nand: NAND 1GiB 3,3V 8-bit detected
> nand: luns 2, eraseblocks 8192, pages 64, page size 2048
> nand: sizes: page 0x0, erase 0x0, chip 0x400
> Correcting luns-pre-target to 1
> nand: device found, Manufacturer ID: 0x01, Chip ID: 0xd3
> nand: AMD/Spansion S34ML16G2
> nand: 1024 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 128
> nand: 2 chips detected

That idea worked on the v4.9 imx kernel but not here. The driver reports
ECC errors for the second half of the NAND. I have debugged down to the
gpmi driver and checked that the page address is as it should be (e.g.
realpage 524288, page 0 0x80000 in nand_do_read_ops for the first page of
the second half) and that target selection changes correctly. But it
reads only FFs. Still, it seems to erase correct blocks when trying to
write BBTs.

I put this in drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c:

> @@ -2270,10 +2270,18 @@ static struct dma_async_tx_descriptor *gpmi_chain_command(
>
>          transfer->direction = DMA_TO_DEVICE;
>
>          desc = dmaengine_prep_slave_sg(channel, &transfer->sgl, 1, DMA_MEM_TO_DEV,
>                                         MXS_DMA_CTRL_WAIT4END);
> +        if (1) {
> +                unsigned i;
> +                char b[160], *p;
> +
> +                p = b + sprintf(b, "Transfer from/to chip %d, pio[0] %X, naddr %d, addr", chip, pio[0], naddr);
> +                for (i = 0; i < naddr; ++i) p += sprintf(p, " %02X", addr[i]);
> +                pr_info("%s\n", b);
> +        }
>          return desc;
>  }
>

and see

> Transfer from/to chip 1, pio[0] 930004, naddr 3, addr C0 FF 07

for erase, which seems to work and

> Transfer from/to chip 1, pio[0] 930006, naddr 5, addr 00 00 C0 FF 07

for reads/writes, which fail.

I'm really stuck.

--
Timo

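The size arithmetic behind the luns-per-target workaround can be checked in isolation. The following is a minimal user-space sketch, not kernel code; the function names are hypothetical and the numbers are the ones printed in the log above (page size 2048, 64 pages per eraseblock, 8192 eraseblocks per LUN, chip size 0x400 MiB from the ID table). Unlike the quoted patch, the sketch widens to 64 bits before multiplying, which makes no difference for this geometry but avoids overflow for larger parts.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes addressed by one target (chip select) for a given geometry. */
uint64_t nand_target_bytes(uint32_t page_size, uint32_t pages_per_eb,
                           uint32_t ebs_per_lun, uint32_t luns)
{
    return (uint64_t)page_size * pages_per_eb * ebs_per_lun * luns;
}

/*
 * Number of LUNs per target implied by the ID-table chip size (in MiB),
 * rounded down, mirroring the division done in the quoted patch.
 */
uint32_t luns_from_chipsize(uint32_t chipsize_mib, uint32_t page_size,
                            uint32_t pages_per_eb, uint32_t ebs_per_lun)
{
    uint64_t lun_bytes = (uint64_t)page_size * pages_per_eb * ebs_per_lun;

    assert(lun_bytes != 0);
    return (uint32_t)(((uint64_t)chipsize_mib << 20) / lun_bytes);
}
```

With the reported geometry, `nand_target_bytes(2048, 64, 8192, 2)` is 2 GiB per chip select, so two detected chips add up to the 4 GiB mentioned above, while `luns_from_chipsize(0x400, 2048, 64, 8192)` is 1, which brings the total back to 2 GiB.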
* Re: Power cut leads to "corrupt empty space"
From: Fabio Estevam @ 2020-02-29 13:13 UTC
To: Timo Ketola; +Cc: Han Xu, linux-mtd, Miquel Raynal

Adding Han Xu and Miquel

On Sat, Feb 29, 2020 at 9:46 AM Timo Ketola <Timo.Ketola@exertus.fi> wrote:
> [...]

* Re: Power cut leads to "corrupt empty space"
From: Timo Ketola @ 2020-02-29 14:20 UTC
To: Fabio Estevam; +Cc: Han Xu, linux-mtd, Miquel Raynal

On 29.2.2020 14.46, Timo Ketola wrote:

> I had to put 'nand-ecc-strength = <8>;
> nand-ecc-step-size = <512>;' into the device tree

Actually, I tried these but they didn't help. They are not there any more.

> and make this change
> in drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c:
>
>> @@ -507,11 +507,11 @@ static int common_nfc_set_geometry(struct gpmi_nand_data *this)

That was needed.

> Still, it seems to erase correct blocks when trying to write BBTs.

This might not be true.

>> Transfer from/to chip 1, pio[0] 930006, naddr 5, addr 00 00 C0 FF 07

I tried the same in my v4.9 kernel and saw very (exactly?) similar
transactions and it works:

> Transfer from/to chip 0, pio[0] 830006, len 6, cmd 00 00 00 C0 FF 07
> Transfer from/to chip 0, pio[0] 830001, len 1, cmd 30
> Transfer from/to chip 1, pio[0] 930006, len 6, cmd 00 00 00 C0 FF 07
> Transfer from/to chip 1, pio[0] 930001, len 1, cmd 30
> Bad block table found at page 524224, version 0x01
> Bad block table found at page 1048512, version 0x01

--
Timo

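The address bytes in these traces can be reproduced from the page numbers. The sketch below is an illustration only, with a hypothetical function name: it assumes the five-cycle addressing visible in the logs (two column bytes followed by three row bytes, least significant byte first) and that the row address is the page number within the selected chip. Under that assumption, page 1048512 on the second chip corresponds to row 1048512 - 524288 = 524224 = 0x7FFC0, the same row as the first chip's table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill 'out' with NAND address cycles, least significant byte first.
 * with_column != 0: 2 column cycles + 3 row cycles (read/program).
 * with_column == 0: 3 row cycles only (block erase).
 * Returns the number of bytes written.
 */
size_t nand_addr_cycles(uint8_t *out, size_t out_len, int with_column,
                        uint16_t column, uint32_t page)
{
    size_t n = 0;

    assert(out_len >= 5);
    if (with_column) {
        out[n++] = (uint8_t)(column & 0xff);
        out[n++] = (uint8_t)((column >> 8) & 0xff);
    }
    out[n++] = (uint8_t)(page & 0xff);
    out[n++] = (uint8_t)((page >> 8) & 0xff);
    out[n++] = (uint8_t)((page >> 16) & 0xff);
    return n;
}
```

For page 524224 this yields `C0 FF 07` without a column and `00 00 C0 FF 07` with column 0, matching the erase and read/write traces. In the v4.9 trace the leading extra `00` and the trailing `30` are consistent with the two command bytes of a NAND page read, so the v4.9 and v5.4 address bytes appear to be identical.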
* Re: Power cut leads to "corrupt empty space"
From: Richard Weinberger @ 2020-03-01 21:28 UTC
To: Timo Ketola; +Cc: linux-mtd

Timo,

On Thu, Feb 27, 2020 at 2:04 PM Timo Ketola <Timo.Ketola@exertus.fi> wrote:
> We have a few i.MX6D devices which have corrupted their UBIFS filesystem
> on power cut and refuse to mount them any more.
>
> The log says:
>
> > [ 10.382580] UBIFS (ubi1:0): background thread "ubifs_bgt1_0" started, PID 158
> > [ 10.408838] UBIFS (ubi1:0): recovery needed
> > [ 10.802070] UBIFS error (ubi1:0 pid 157): ubifs_scan: corrupt empty space at LEB 99:114688
> > [ 10.809054] UBIFS error (ubi1:0 pid 157): ubifs_scanned_corruption: corruption at LEB 99:114688
> > [ 10.816471] UBIFS error (ubi1:0 pid 157): ubifs_scanned_corruption: first 8192 bytes from LEB 99:114688
> ...
> > [ 10.827751] UBIFS error (ubi1:0 pid 157): ubifs_scan: LEB 99 scanning failed
> > [ 10.834615] UBIFS (ubi1:0): background thread "ubifs_bgt1_0" stops
>
> I think I found the culprit from the mtdblock contents. Fragment from
> hexdump:
>
> [...]
> > 3ca3c0b0 1c 00 00 00 05 00 00 00 44 07 00 00 00 00 00 00 |........D.......|
> > 3ca3c0c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > *
> > 3ca3c800 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
> > *
> > 3ca3d000 31 18 10 06 1b 7e 3b 71 40 06 2e 00 00 00 00 00 |1....~;q@.......|

So there is a whole 2KiB area of 0xFF in there.
It is also aligned, so it could be a whole page.

> [...]
>
> IIUC, ubifs_scan finds empty space at 3ca3c800, stops scanning and
> checks the rest of the LEB for being empty but finds something else at
> 3ca3d000. Then recovery aborts and mounting fails.
>
> Do I understand correctly that empty space should always be continuous
> at the end of the LEB?

Correct.

> How could this kind of corruption happen?

Hard to say. Maybe bad timing settings which cause writes to have no effect.
But usually this leads to ECC errors.

If you can share the image with me I can have a look and with some luck we
find traces.

Is this a mainline kernel?
Wonky drivers can lead to all kinds of "interesting" results. :->

> Is there any way to recover from this?

Not really. UBIFS' IO model got violated and it gives up.

> Storage is NAND with 0x20000 erase block size and the kernel is 4.9.88.

I guess 2KiB page size?

--
Thanks,
//richard

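The rule confirmed here (once empty space starts, everything up to the end of the LEB must be empty) can be illustrated with a small stand-alone check. This is a simplified sketch and not the kernel's ubifs_scan(): it works on whole pages instead of walking nodes, the function name is invented for the example, and "empty" simply means all bytes are 0xFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Look for data after the first completely empty (all-0xFF) page.
 * Returns the LEB offset of the first non-0xFF byte found after that
 * page, or -1 if the empty space is contiguous up to the end.
 */
long find_corrupt_empty_space(const uint8_t *leb, size_t leb_size,
                              size_t page_size)
{
    size_t off, i;

    assert(page_size > 0 && leb_size % page_size == 0);

    /* Step 1: find the first page that is entirely 0xFF. */
    for (off = 0; off < leb_size; off += page_size) {
        int empty = 1;

        for (i = 0; i < page_size; i++) {
            if (leb[off + i] != 0xff) {
                empty = 0;
                break;
            }
        }
        if (empty)
            break;
    }

    /* Step 2: everything from there to the end must be 0xFF too. */
    for (i = off; i < leb_size; i++) {
        if (leb[i] != 0xff)
            return (long)i;
    }
    return -1;
}
```

Applied to a copy of LEB 99 with a 2048-byte page size, a check like this would stop at the empty page at offset 112640 and then hit data at offset 114688, the same position the kernel reports as "corrupt empty space".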
* Re: Power cut leads to "corrupt empty space"
From: Timo Ketola @ 2020-03-02 12:57 UTC
To: Richard Weinberger; +Cc: linux-mtd

On 1.3.2020 23.28, Richard Weinberger wrote:
> If you can share the image with me I can have a look and with some luck we
> find traces.

Thank you. I'll send the link separately.

> Is this a mainline kernel?
> Wonky drivers can lead to all kind of "interesting" results. :->

It is boundary-imx-o8.0.0_1.0.0-ga-pass2 from

https://github.com/boundarydevices/linux-imx6.git

branched at a51fcd6bd17c with our board support and patched with
v.4.9.88-rt66 from

git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

> I guess 2KiB page size?

Yes

--
Timo

* Re: Power cut leads to "corrupt empty space"
From: Richard Weinberger @ 2020-03-02 21:02 UTC
To: Timo Ketola; +Cc: linux-mtd

On Mon, Mar 2, 2020 at 1:57 PM Timo Ketola <Timo.Ketola@exertus.fi> wrote:
>
> On 1.3.2020 23.28, Richard Weinberger wrote:
> > If you can share the image with me I can have a look and with some luck we
> > find traces.
>
> Thank you. I'll send the link separately.
>
> > Is this a mainline kernel?
> > Wonky drivers can lead to all kind of "interesting" results. :->
>
> It is boundary-imx-o8.0.0_1.0.0-ga-pass2 from
>
> https://github.com/boundarydevices/linux-imx6.git
>
> branched at a51fcd6bd17c with our board support and patched with
> v.4.9.88-rt66 from

Hmm, vendor tree....
I strongly suggest giving mainline a try.

Did you also double check your NAND settings, especially timings?

--
Thanks,
//richard

* Re: Power cut leads to "corrupt empty space"
From: Timo Ketola @ 2020-03-03 6:27 UTC
To: Richard Weinberger; +Cc: linux-mtd

Hi Richard,

Thanks for looking at this!

On 2.3.2020 23.02, Richard Weinberger wrote:
> Hmm, vendor tree....
> I strongly suggest giving mainline a try.

I share the feeling, and I have tried a couple of times to switch to
mainline (4.12 and 4.17) but failed. There were issues getting the GPU
and camera interfaces working which I was unable to solve. This time I
tried 5.4 but couldn't get even the NAND subsystem alone working:

http://lists.infradead.org/pipermail/linux-mtd/2020-February/094090.html

> Did you also double check your NAND settings, especially timings?

Not yet. I focused on finding out if the corruption could be recovered.
Now that it seems impossible, I obviously have to devise tests to try to
make sure it never happens again in the first place.

At least for now, the incidents suggest that this relates somehow to the
power cut. That would speak against bad timings. And I have a design
blooper there: when the supply voltage drops, the NAND write protect
signal is asserted by hardware. Now I'm thinking about a 'dirty power
loss' scenario, where the supply voltage drops momentarily just before
the actual total power loss, so that one page write fails and then
several page writes succeed before the final power cut. But shouldn't
one failed page write put the whole UBI/UBIFS volume in R/O mode and
prevent further writes?

I hope you got my other mail with the link to the UBI image. It does
seem like simply one page in the middle had been left unwritten, doesn't
it? Is there anything there which could be used to estimate how long
before the power cut that happened?

--
Timo

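One thing that can be read from the hexdump in the first message is the sequence number of each node, which sits in the 24-byte UBIFS common header at the start of every node (little-endian magic, CRC, 64-bit sequence number, length, node type, group type, padding). The sketch below decodes that header from raw bytes. It is an illustration with made-up function names, and the interpretation given after it is an inference from the dump, not something confirmed in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define UBIFS_NODE_MAGIC 0x06101831u

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read a little-endian 64-bit value from a byte buffer. */
static uint64_t le64(const uint8_t *p)
{
    return (uint64_t)le32(p) | ((uint64_t)le32(p + 4) << 32);
}

/*
 * Decode the common header found at the start of a node.
 * Returns 1 and fills the outputs if the magic matches, 0 otherwise
 * (for example on an erased, all-0xFF page).
 */
int ubifs_ch_parse(const uint8_t *p, uint64_t *sqnum, uint32_t *len,
                   uint8_t *node_type)
{
    if (le32(p) != UBIFS_NODE_MAGIC)
        return 0;
    *sqnum = le64(p + 8);      /* bytes 8..15  */
    *len = le32(p + 16);       /* bytes 16..19 */
    *node_type = p[20];        /* byte 20      */
    return 1;
}
```

Fed with the first 24 bytes at 0x3ca3c000, 0x3ca3d000, 0x3ca3d800 and 0x3ca3e000, this gives sequence numbers 0x2e063c, 0x2e0640, 0x2e0642 and 0x2e0644. The later pages step by two, so the gap between 0x2e063c and 0x2e0640 would be consistent with exactly one page (presumably carrying 0x2e063e) never reaching the flash, followed by at least three further successful page writes. The sequence numbers alone do not say how much wall-clock time passed between those writes.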