* "corrupt empty space" error on boot?!?
@ 2015-03-02 16:39 Steve deRosier
  2015-03-03  7:31 ` Artem Bityutskiy
  0 siblings, 1 reply; 4+ messages in thread
From: Steve deRosier @ 2015-03-02 16:39 UTC
  To: linux-mtd

Hi All,

So, after torturing one of our devices by rebooting it for a few
hundred iterations, we ran across a situation where the system fails
to boot due to a corrupt empty space error:

    Starting kernel ...

    Uncompressing Linux... done, booting the kernel.
    UBIFS error (pid 1): ubifs_scan: corrupt empty space at LEB 4:3918
    UBIFS error (pid 1): ubifs_scanned_corruption: corruption at LEB 4:398
    UBIFS error (pid 1): ubifs_scanned_corruption: first 8192 bytes from 8
    UBIFS error (pid 1): ubifs_scan: LEB 4 scanning failed
    Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-b)

This is on kernel v3.8, atmel_nand driver.  In earlier discussions, it
was suggested that the driver would encounter this sort of problem
because the driver/chip can't do ECC in erased pages so a bitflip
there could be an issue.  This is the first time I've seen this
problem in the wild though.

1. Is this likely what I'm seeing?
2. Will moving to a recent kernel help (we're currently updating our
mainline to bleeding-edge 4.0)?
3. How can I programmatically recover from this situation?

Logically, it seems to me that a non-ECC-protected bit-flip in an
empty page should be a non-issue. UBI should be able to move the
block, erase the block, torture/return-to-service and move on with
its life.  No data is destroyed or even affected.

A unit not mounting the rootfs because of a bit-flip in _empty_space_
is unacceptable to us, so I've got to figure out a way to deal with
this rare event.

Any help would be appreciated.

Thanks,
- Steve


* Re: "corrupt empty space" error on boot?!?
  2015-03-02 16:39 "corrupt empty space" error on boot?!? Steve deRosier
@ 2015-03-03  7:31 ` Artem Bityutskiy
  2015-03-03 15:25   ` Steve deRosier
  0 siblings, 1 reply; 4+ messages in thread
From: Artem Bityutskiy @ 2015-03-03  7:31 UTC
  To: Steve deRosier; +Cc: linux-mtd

On Mon, 2015-03-02 at 08:39 -0800, Steve deRosier wrote:
> Logically, it seems to me that a non-ECC-protected bit-flip in an
> empty page should be a non-issue. UBI should be able to move the
> block, erase the block, torture/return-to-service and move on with
> its life.  No data is destroyed or even affected.

Yes, you are right, if there is a corruption, UBIFS can:

1. Try to understand if this is a corruption in empty space or not.
2. If yes, recover the LEB.

But this is not implemented. People keep hitting this issue, but no one
has contributed fixes yet.
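
To make step 1 concrete, here is a minimal sketch of such a check. It is
only an illustration, not code from fs/ubifs: the function name, the
return labels and the max_bitflips threshold are all assumptions made
for this example. It counts the zero bits in a region that is expected
to be erased (all 0xFF) and treats a small number of them as bit-flips
rather than as real corruption.

```python
def classify_empty_space(buf: bytes, max_bitflips: int = 1) -> str:
    """Classify a region that is expected to be erased (all 0xFF).

    Hypothetical helper: the threshold and the labels are illustrative
    assumptions, not UBIFS behaviour.
    """
    # An erased NAND byte reads 0xFF, so every 0 bit is a deviation.
    zero_bits = sum(bin(b ^ 0xFF).count("1") for b in buf)
    if zero_bits == 0:
        return "empty"
    if zero_bits <= max_bitflips:
        return "bitflips"   # candidate for "recover the LEB"
    return "corrupt"        # too many zero bits to be a stray flip
```

A caller would pass the bytes from the end of the last valid node up to
the end of the LEB. What a sensible threshold is depends on the flash
part and its ECC strength, so treat the default of 1 as a placeholder.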

> A unit not mounting the rootfs because of a bit-flip in _empty_space_
> is unacceptable to us, so I've got to figure out a way to deal with
> this rare event.

Well, improving UBIFS would be one of the possible solutions.

Artem.


* Re: "corrupt empty space" error on boot?!?
  2015-03-03  7:31 ` Artem Bityutskiy
@ 2015-03-03 15:25   ` Steve deRosier
  2015-03-04  3:31     ` hujianyang
  0 siblings, 1 reply; 4+ messages in thread
From: Steve deRosier @ 2015-03-03 15:25 UTC
  To: Artem Bityutskiy; +Cc: linux-mtd

Thanks Artem.

On Mon, Mar 2, 2015 at 11:31 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> Yes, you are right, if there is a corruption, UBIFS can:
>
> 1. Try to understand if this is a corruption in empty space or not.
> 2. If yes, recover the LEB.
>
> But this is not implemented. People keep hitting this issue, but no one
> contributed fixes yet.
>
>> A unit not mounting the rootfs because of a bit-flip in _empty_space_
>> is unacceptable to us, so I've got to figure out a way to deal with
>> this rare event.
>
> Well, improving UBIFS would be one of the possible solutions.
>

OK, two questions then:

1. Is there anything I can do from userspace, or U-Boot, to recover
this filesystem?  We've got mirrored filesystems, so we actually can
detect the failure and mount the other one and fix the first from
there.  Or maybe I can mount it ro and switch to the other filesystem
and reboot?

2. I'd like to be able to replicate the problem so I can fix it, but
simply poking a random bit to a random empty PEB won't do the trick.
I've actually tried this before when doing other investigations and
nothing bad happened, likely because the empty page I hit was never
looked at by UBIFS.  I know there's got to be a way to map LEB to PEB,
how do I do that/where is the table?  Specifically, how to map "LEB
4:3918" to a physical block and page on the flash device?

I'll give fixing it and contributing the patch a try. I'm up against a
project deadline with a board-bring-up right now (they wanted it done
2 weeks ago and I'm having to report on it each day now), so I
probably won't have time on it till next week.

- Steve


* Re: "corrupt empty space" error on boot?!?
  2015-03-03 15:25   ` Steve deRosier
@ 2015-03-04  3:31     ` hujianyang
  0 siblings, 0 replies; 4+ messages in thread
From: hujianyang @ 2015-03-04  3:31 UTC
  To: Steve deRosier; +Cc: linux-mtd, Artem Bityutskiy

Hi Steve,

On 2015/3/3 23:25, Steve deRosier wrote:
> Thanks Artem.
> 
> On Mon, Mar 2, 2015 at 11:31 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>> Yes, you are right, if there is a corruption, UBIFS can:
>>
>> 1. Try to understand if this is a corruption in empty space or not.
>> 2. If yes, recover the LEB.
>>
>> But this is not implemented. People keep hitting this issue, but no one
>> contributed fixes yet.
>>
>>> A unit not mounting the rootfs because of a bit-flip in _empty_space_
>>> is unacceptable to us, so I've got to figure out a way to deal with
>>> this rare event.
>>
>> Well, improving UBIFS would be one of the possible solutions.
>>
> 
> OK, two questions then:
> 
> 1. Is there anything I can do from userspace, or uboot, to recover
> this filesystem?  We've got mirrored filesystems, so we actually can
> detect the failure and mount the other one and fix the first from
> there.  Or maybe I can mount it ro and switch to the other filesystem
> and reboot?

That's what I want to work on next. We discussed the recovery of UBIFS
a few days ago; please see:

http://lists.infradead.org/pipermail/linux-mtd/2015-February/057710.html

Artem gave lots of suggestions in that thread.

The first thing I want to do is separate recovery from the mount path.
Currently, when we mount a partition, UBIFS tries to clean up corrupted
data during the mount itself, and if it then hits an error it cannot
fix, the mount aborts with those changes already written. I think such
changes appended to a corrupted image may confuse later recovery. So my
plan is to only mark the corrupted data during mount and clean it up
once the mount scan has finished.

The next step is to try an R/O mount if a non-recoverable error occurs.

> 
> 2. I'd like to be able to replicate the problem so I can fix it, but
> simply poking a random bit to a random empty PEB won't do the trick.
> I've actually tried this before when doing other investigations and

Yes, I saw your log; it's hard to inject. The corruption must be in a
LEB that is scanned during mount, and it must lie in the empty space
after the valid data.

See function 'ubifs_scan' in fs/ubifs/scan.c.
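
As a rough sketch of that condition, the helper below takes the contents
of one LEB as a byte string, finds where the data appears to end, and
clears one bit in the first byte of the empty space that follows. The
function names and the min_io parameter are assumptions made for this
example, and the end of data is only guessed from trailing 0xFF bytes,
so it does not parse UBIFS nodes. Writing the modified image back to the
volume is not shown.

```python
def end_of_data(leb: bytes, min_io: int) -> int:
    """Guess where valid data ends: strip trailing 0xFF and round the
    length up to the next min_io boundary (assumed write unit)."""
    used = len(leb.rstrip(b"\xff"))
    return -(-used // min_io) * min_io  # ceiling to a multiple of min_io


def flip_bit_in_empty_space(leb: bytes, min_io: int, bit: int = 0):
    """Return (modified LEB image, offset of the flipped byte)."""
    off = end_of_data(leb, min_io)
    if off >= len(leb):
        raise ValueError("LEB has no empty space after its data")
    out = bytearray(leb)
    out[off] &= ~(1 << bit) & 0xFF  # NAND bit-flips in erased space go 1 -> 0
    return bytes(out), off
```

For this to reproduce the failure, the chosen LEB still has to be one
that UBIFS scans at mount time, as described above.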

> nothing bad happened, likely because the empty page I hit was never
> looked at by UBIFS.  I know there's got to be a way to map LEB to PEB,
> how do I do that/where is the table?  Specifically, how to map "LEB
> 4:3918" to a physical block and page on the flash device?
> 

You can try my ubidump to solve this problem.

http://lists.infradead.org/pipermail/linux-mtd/2014-December/056828.html

First, read the superblock LEB (LEB 0) and the master LEBs (LEB 1 and
LEB 2) to find the logical position of each field, and use the
leb_change ioctl to change it.
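
On the LEB-to-PEB question: without fastmap, UBI keeps no mapping table
on flash. It rebuilds the mapping at attach time by reading the EC and
VID headers of every PEB. The sketch below is not ubidump; it shows the
same idea on a raw dump of the MTD partition taken without OOB data. The
header magics and field offsets follow the UBI on-flash format as I
understand it. The function names are made up for this example, header
CRCs are not checked, and the UBI volume ID has to be known separately.

```python
import struct

UBI_EC_MAGIC = b"UBI#"   # erase-counter header, at offset 0 of each PEB
UBI_VID_MAGIC = b"UBI!"  # volume-ID header, at vid_hdr_offset in the PEB


def build_leb_map(image: bytes, peb_size: int) -> dict:
    """Return {(vol_id, lnum): (peb, data_offset)} for a raw dump."""
    best = {}  # (vol_id, lnum) -> (sqnum, peb, data_offset)
    for peb in range(len(image) // peb_size):
        base = peb * peb_size
        ec = image[base:base + 64]
        if ec[:4] != UBI_EC_MAGIC:
            continue  # erased, bad, or not a UBI block
        vid_off, data_off = struct.unpack_from(">II", ec, 16)
        vid = image[base + vid_off:base + vid_off + 64]
        if len(vid) < 64 or vid[:4] != UBI_VID_MAGIC:
            continue  # free PEB: no LEB mapped to it
        vol_id, lnum = struct.unpack_from(">II", vid, 8)
        (sqnum,) = struct.unpack_from(">Q", vid, 40)
        key = (vol_id, lnum)
        # If two PEBs claim the same LEB, the higher sequence number wins.
        if key not in best or sqnum > best[key][0]:
            best[key] = (sqnum, peb, data_off)
    return {k: (peb, off) for k, (_, peb, off) in best.items()}


def physical_location(leb_map, vol_id, lnum, offset, peb_size, page_size):
    """Map a LEB:offset pair to (peb, page within PEB, byte address)."""
    peb, data_off = leb_map[(vol_id, lnum)]
    in_peb = data_off + offset
    return peb, in_peb // page_size, peb * peb_size + in_peb
```

With the map built, calling physical_location() with the LEB number and
offset from the error message gives the block, the page inside that
block, and the byte address within the partition.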

> I'll give fixing it and contributing the patch a try. I'm up against a
> project deadline with a board-bring-up right now (they wanted it done
> 2 weeks ago and I'm having to report on it each day now), so I
> probably won't have time on it till next week.
>

I'm busy with personal stuff these days, but I'd like to set up a coding
environment at home this month so I can continue working at night, which
is daytime in the West.

I'd be glad to see your patch!

Thanks,
Hu

> - Steve
