On Fri, 2012-06-29 at 16:05 +1000, Iwo Mergler wrote:
> > > It is possible to avoid the failure by performing a large number of
> > > filesystem operations (i.e. file system benchmark) during the first
> > > session.
> >
> > Hmm, sounds strange.
> 
> While trying to reproduce the problem, I have come across another
> way to avoid it. If the boot scripts in the rootfs perform an
> ubiformat, attach, mkvol & mount on an unrelated empty mtd
> partition, the problem goes away.
> 
> Is there any global state shared between separate UBI/UBIFS
> partitions?

No. Do your MTD partitions overlap? What is in /proc/mtd?

> > This means the driver is buggy: it does not support sub-pages but
> > still reports that it does. Just fix it instead.
> 
> I was under the impression that the subpage capability is extracted
> from the ONFI information. So I take it there is a flag for the
> driver to override that?

I do not know your system, but if your flash chip supports sub-pages but
the ECC you use does not allow them, the driver should report that
sub-pages are not supported.

> > Did you try to mount an empty volume and let UBIFS auto-format it, and
> > then reproduce the issue?
> 
> No, UBIFS created from an empty partition work OK. In fact, doing that
> also stops the rootfs mount failure on the second boot.

Sounds like this is not a UBIFS fault but rather a side effect of
something strange happening elsewhere. Probably it is related to how you
flash it.

We had the following issue in the past.

1. You have some UBI on your flash. Then you want to flash a new image.
2. The flasher for some reason did not erase some PEBs of the partition.
Probably because Linux's view of the partition and the flasher's did not
100% match. Anyway, one or a few PEBs were not erased at the end of the
partition. Let's call them "ghost PEBs".
3. We flashed the new image.
4. UBI attached the partition, the ghost PEBs were scanned and treated
as valid PEBs and their data appeared in one of the volumes, because
their generation numbers were higher than in PEBs from the new image
(the generation number is in the UBI headers). The ghost data, instead
of valid data, was read by UBIFS. And we had strange corruptions.

We introduced a so-called "image sequence number" to catch such issues.
It is stored in the EC header. All EC headers on the MTD device have to
have the same one. Every time we generate an image, we pick a random
one. So if there are ghost PEBs, we notice this because they have a
different image sequence number.

See 'image_seq' in drivers/mtd/ubi/ubi-media.h.
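
To check a flash dump for ghost PEBs by hand, something like the
following could be used. This is only a minimal sketch: it assumes a raw
dump of the whole partition without OOB bytes, a known PEB size, and the
64-byte big-endian EC header layout from ubi-media.h. The function names
are made up for illustration, the header CRC is not verified, and
"expected" is simply the most common value in the dump, which is not how
the kernel picks it.

```python
import struct
from collections import Counter

UBI_EC_HDR_MAGIC = 0x55424923  # ASCII "UBI#"
# Big-endian fields at the start of the EC header:
# magic (u32), version (u8), 3 padding bytes, ec (u64),
# vid_hdr_offset (u32), data_offset (u32), image_seq (u32)
EC_HDR_FMT = ">IB3xQIII"


def scan_image_seq(dump, peb_size):
    """Return {PEB number: image_seq} for each PEB that starts with an EC header."""
    fixed = struct.calcsize(EC_HDR_FMT)
    found = {}
    for pnum in range(len(dump) // peb_size):
        start = pnum * peb_size
        fields = struct.unpack(EC_HDR_FMT, dump[start:start + fixed])
        magic, image_seq = fields[0], fields[-1]
        if magic == UBI_EC_HDR_MAGIC:  # erased PEBs read as 0xFF and are skipped
            found[pnum] = image_seq
    return found


def ghost_pebs(found):
    """Return PEB numbers whose image_seq differs from the most common value."""
    if not found:
        return []
    expected, _count = Counter(found.values()).most_common(1)[0]
    return sorted(p for p, seq in found.items() if seq != expected)
```

Usage would be along the lines of
ghost_pebs(scan_image_seq(open("dump.bin", "rb").read(), 128 * 1024)),
where both the file name and the PEB size are placeholders. An empty
list means every EC header found carries the same image sequence number.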

Can this problem affect you as well?

If you use 'ubiformat' for flashing your images, it will generate a
random image sequence number every time it flashes. So it won't use the
one in the image.

Do you use ubiformat for flashing? If not, try to re-generate your image
(ubinize will put a different number there), flash it, and see what
happens. If ghost PEBs are present, you'd get an error like this:

UBI error: process_eb: bad image sequence number 3726164569 in PEB 47,
expected 642536469

Additional thoughts...

I think it would be more interesting if you could enable debugging for
real. The docs on the web-site are out of date: we switched to dynamic
debugging, so you need to enable the debugging messages differently. I
need to write a howto. I do not know how to do this via the kernel
cmdline so far and need to find out, but I know how to do it via
debugfs. Check Documentation/dynamic-debug-howto.txt.
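
For the debugfs route, a rough sketch of what the howto describes is
below. It assumes a kernel built with CONFIG_DYNAMIC_DEBUG, debugfs
mounted at /sys/kernel/debug, and that the modules are named "ubi" and
"ubifs". The helper name is made up for illustration. It writes one
"module <name> +p" query to the dynamic debug control file, which turns
on all debugging print sites in that module.

```python
DYNDBG_CONTROL = "/sys/kernel/debug/dynamic_debug/control"


def enable_module_debug(module, control=DYNDBG_CONTROL):
    """Write a 'module <name> +p' query to the dynamic debug control file."""
    query = "module %s +p" % module
    # One query per write; needs root on a real system.
    with open(control, "w") as f:
        f.write(query)
    return query
```

Calling enable_module_debug("ubi") and enable_module_debug("ubifs") as
root before attaching and mounting should make the messages show up in
dmesg, provided the assumptions above hold on your kernel.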

The image is not very helpful. UBI or UBIFS messages would probably
allow us to track what UBI/UBIFS is doing to the "faulty" LEB and the
corresponding PEB and verify that it is ok. But I really have a strong
feeling it is not a UBI/UBIFS fault, so maybe we'd spend the time just
proving this.

-- 
Best Regards,
Artem Bityutskiy