Hi there!

I've been looking into a few (very rare) cases of ubifs file system
corruption and noticed that our system fails the mtd-utils io_paral
test about 50% of the time. I doubt this is a problem with ubi or
ubifs, but I was hoping for some insights on what the problem might
be.

System Hardware -
Processor - Xilinx Zynq UltraScale+, 4x ARM Cortex-A53 (XCZU3EG)
Integrated Arasan NAND Flash controller IP Rev v3p9_140822
Flash - Micron MT29UZ4B8DZZHGPB-107 (DDR3 + MT29F4G08 NAND die)

Software -
Petalinux 2017.3 - Linux Kernel 4.9.0
mtd-utils 2.1.1

mtdinfo /dev/mtd2
mtd2
Name:                           misc
Type:                           nand
Eraseblock size:                131072 bytes, 128.0 KiB
Amount of eraseblocks:          1024 (134217728 bytes, 128.0 MiB)
Minimum input/output unit size: 2048 bytes
Sub-page size:                  2048 bytes
OOB size:                       64 bytes
Character device major/minor:   90:4
Bad blocks are allowed:         true
Device is writable:             true

ubinfo -d 5
ubi5
Volumes count:                           0
Logical eraseblock size:                 126976 bytes, 124.0 KiB
Total amount of logical eraseblocks:     1020 (129515520 bytes, 123.5 MiB)
Amount of available logical eraseblocks: 940 (119357440 bytes, 113.8 MiB)
Maximum count of volumes                 128
Count of bad physical eraseblocks:       4
Count of reserved physical eraseblocks:  76
Current maximum erase counter value:     96
Minimum input/output unit size:          2048 bytes
Character device major/minor:            218:0


I ran the full battery of UBI tests using the "runubitests.sh" script,
and so far only io_paral fails. I modified io_paral so that it prints
the differences between the two buffers to help with troubleshooting.
The patch is attached in case anyone finds it useful.
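
For readers without the attachment, a buffer comparison of this kind
could look roughly like the sketch below. This is NOT the attached
patch; first_diff() and dump_around() are hypothetical helper names
used only for illustration, and the output format is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from io_paral or the attached patch):
 * returns the offset of the first differing byte, or len if the
 * two buffers are identical. */
static size_t first_diff(const unsigned char *wr, const unsigned char *rd,
                         size_t len)
{
    size_t i;

    assert(wr != NULL && rd != NULL);
    for (i = 0; i < len; i++)
        if (wr[i] != rd[i])
            return i;
    return len;
}

/* Hypothetical helper: prints up to `count` bytes of both buffers,
 * starting `before` bytes ahead of offset `off` (clamped to the
 * buffer bounds), and marks the bytes that differ. */
static void dump_around(const unsigned char *wr, const unsigned char *rd,
                        size_t len, size_t off, size_t before, size_t count)
{
    size_t start = off > before ? off - before : 0;
    size_t end = start + count < len ? start + count : len;
    size_t i;

    for (i = start; i < end; i++)
        printf("%8zu: written %02x read %02x%s\n", i, wr[i], rd[i],
               wr[i] != rd[i] ? "  <--" : "");
}
```

Calling first_diff() on the written and read-back buffers and then
dump_around() with before = 10 would give a dump that starts 10 bytes
ahead of the first mismatch.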

The test fails in write_thread() at line 225, where the written buffer
is compared with what was read back. With my patched version (where
the line numbers differ) it prints:

write_thread():290: written and read data are different at byte: 14336

14336 / 2048 = 7, so the error starts exactly at the beginning of page
index 7 (counting from zero), i.e. the 8th page of the buffer. The hex
dump of the buffers is attached; keep in mind that it starts 10 bytes
before the failure. Looking at the logs, we can see that the 8th page
is all zeros, and the correct data shows up again from the 9th page
onward.
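
The page arithmetic can be sketched in C as follows, assuming the
2048-byte page size reported by mtdinfo. page_index() and
page_is_zero() are hypothetical helpers for illustration only; they
are not part of io_paral or the attached patch.

```c
#include <assert.h>
#include <stddef.h>

/* Zero-based index of the page that contains byte offset `off`. */
static size_t page_index(size_t off, size_t page_size)
{
    assert(page_size > 0);
    return off / page_size;
}

/* Returns 1 if page `idx` of `buf` contains only zero bytes, else 0.
 * The caller must make sure (idx + 1) * page_size does not exceed the
 * buffer length. */
static int page_is_zero(const unsigned char *buf, size_t idx,
                        size_t page_size)
{
    const unsigned char *p = buf + idx * page_size;
    size_t i;

    for (i = 0; i < page_size; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

With these, page_index(14336, 2048) evaluates to 7, and running
page_is_zero() over each page of the read-back buffer would show
whether the damage is confined to whole pages.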

So essentially one page is missing its data and contains all zeros
instead. I can imagine lots of problems that could cause this, but I
was hoping someone could point me towards the usual suspects. I've
seen it on multiple units, so it does not seem to be just one bad
flash device. I have not seen nandpagetest or nandsubpagetest fail
yet, although both run for a substantially shorter time than
io_paral.

Any hints and help are deeply appreciated.

Thanks

/Otto