* "corrupt empty space" error on boot?!?
From: Steve deRosier @ 2015-03-02 16:39 UTC
To: linux-mtd

Hi All,

So, after torturing one of our devices by rebooting it for a few hundred
iterations, we ran across a situation where the system fails to boot due
to a corrupt empty space error:

Starting kernel ...

Uncompressing Linux... done, booting the kernel.
UBIFS error (pid 1): ubifs_scan: corrupt empty space at LEB 4:3918
UBIFS error (pid 1): ubifs_scanned_corruption: corruption at LEB 4:398
UBIFS error (pid 1): ubifs_scanned_corruption: first 8192 bytes from 8
UBIFS error (pid 1): ubifs_scan: LEB 4 scanning failed
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-b)

This is on kernel v3.8, atmel_nand driver. In earlier discussions, it
was suggested that the driver would encounter this sort of problem
because the driver/chip can't do ECC in erased pages, so a bitflip there
could be an issue. This is the first time I've seen this problem in the
wild though.

1. Is this likely what I'm seeing?
2. Will moving to a recent kernel help (we're currently updating our
   mainline to bleeding-edge 4.0)?
3. How can I programmatically recover from this situation?

Logically, it seems to me that a non ecc protected bit-flip in an
empty page should be a non-issue. UBI should be able to move the
block, erase the block, torture/return-to-service and move on with
its life. No data is destroyed or even affected.

A unit not mounting the rootfs because of a bit-flip in _empty_space_
is unacceptable to us, so I've got to figure out a way to deal with
this rare event.

Any help would be appreciated.

Thanks,
- Steve

* Re: "corrupt empty space" error on boot?!?
From: Artem Bityutskiy @ 2015-03-03 7:31 UTC
To: Steve deRosier; +Cc: linux-mtd

On Mon, 2015-03-02 at 08:39 -0800, Steve deRosier wrote:
> Logically, it seems to me that a non ecc protected bit-flip in an
> empty page should be a non-issue. UBI should be able to move the
> block, erase the block, torture/return-to-service and move on with
> its life. No data is destroyed or even affected.

Yes, you are right, if there is a corruption, UBIFS can:

1. Try to understand if this is a corruption in empty space or not.
2. If yes, recover the LEB.

But this is not implemented. People keep hitting this issue, but no one
contributed fixes yet.

> A unit not mounting the rootfs because of a bit-flip in _empty_space_
> is unacceptable to us, so I've got to figure out a way to deal with
> this rare event.

Well, improving UBIFS would be one of the possible solutions.

Artem.
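
Step 1 above (deciding whether a corruption lies in what should be empty
space) could be sketched roughly as follows. This is an illustration
only, not UBIFS code: erased NAND reads back as all 0xFF, so the sketch
counts zero bits in a region that is expected to be empty and treats a
small number of them as bit-flips. The function names and the
max_bitflips threshold are hypothetical; a sensible threshold would
depend on the NAND chip and driver.

```python
def count_flipped_bits(buf: bytes) -> int:
    """Count bits that read as 0 in a buffer expected to be all 0xFF."""
    return sum(8 - bin(b).count("1") for b in buf)


def classify_empty_space(buf: bytes, max_bitflips: int = 4) -> str:
    """Classify a region that should be erased (hypothetical helper).

    max_bitflips is an assumed tolerance, not a value taken from UBIFS.
    """
    flips = count_flipped_bits(buf)
    if flips == 0:
        return "empty"
    if flips <= max_bitflips:
        # Few zero bits: most likely bit-flips in erased flash, so the
        # LEB could in principle be recovered (step 2).
        return "bitflips-in-empty-space"
    # Too many zero bits to be bit-flips: real data or real corruption.
    return "data-or-corruption"
```

A caller would run this on the bytes that follow the last valid node in
a LEB, and only attempt recovery for the "bitflips-in-empty-space" case.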

* Re: "corrupt empty space" error on boot?!?
From: Steve deRosier @ 2015-03-03 15:25 UTC
To: Artem Bityutskiy; +Cc: linux-mtd

Thanks Artem.

On Mon, Mar 2, 2015 at 11:31 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> Yes, you are right, if there is a corruption, UBIFS can:
>
> 1. Try to understand if this is a corruption in empty space or not.
> 2. If yes, recover the LEB.
>
> But this is not implemented. People keep hitting this issue, but no one
> contributed fixes yet.
>
>> A unit not mounting the rootfs because of a bit-flip in _empty_space_
>> is unacceptable to us, so I've got to figure out a way to deal with
>> this rare event.
>
> Well, improving UBIFS would be one of the possible solutions.
>

OK, two questions then:

1. Is there anything I can do from userspace, or U-Boot, to recover
this filesystem? We've got mirrored filesystems, so we actually can
detect the failure and mount the other one and fix the first from
there. Or maybe I can mount it ro and switch to the other filesystem
and reboot?

2. I'd like to be able to replicate the problem so I can fix it, but
simply poking a random bit into a random empty PEB won't do the trick.
I've actually tried this before when doing other investigations and
nothing bad happened, likely because the empty page I hit was never
looked at by UBIFS. I know there's got to be a way to map LEB to PEB;
how do I do that, and where is the table? Specifically, how do I map
"LEB 4:3918" to a physical block and page on the flash device?

I'll give fixing it and contributing the patch a try. I'm up against a
project deadline with a board bring-up right now (they wanted it done
2 weeks ago and I'm having to report on it each day now), so I
probably won't have time for it till next week.

- Steve

* Re: "corrupt empty space" error on boot?!?
From: hujianyang @ 2015-03-04 3:31 UTC
To: Steve deRosier; +Cc: linux-mtd, Artem Bityutskiy

Hi Steve,

On 2015/3/3 23:25, Steve deRosier wrote:
> Thanks Artem.
>
> On Mon, Mar 2, 2015 at 11:31 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>> Yes, you are right, if there is a corruption, UBIFS can:
>>
>> 1. Try to understand if this is a corruption in empty space or not.
>> 2. If yes, recover the LEB.
>>
>> But this is not implemented. People keep hitting this issue, but no one
>> contributed fixes yet.
>>
>>> A unit not mounting the rootfs because of a bit-flip in _empty_space_
>>> is unacceptable to us, so I've got to figure out a way to deal with
>>> this rare event.
>>
>> Well, improving UBIFS would be one of the possible solutions.
>>
>
> OK, two questions then:
>
> 1. Is there anything I can do from userspace, or U-Boot, to recover
> this filesystem? We've got mirrored filesystems, so we actually can
> detect the failure and mount the other one and fix the first from
> there. Or maybe I can mount it ro and switch to the other filesystem
> and reboot?

That's what I want to do next. We discussed the recovery of UBIFS some
days ago, please see:

http://lists.infradead.org/pipermail/linux-mtd/2015-February/057710.html

Artem gave lots of suggestions in that thread.

The first thing I want to do is separate the recovery path from the
mount path. Currently, when we mount a partition, UBIFS tries to clean
up corrupted data in the mount path, but if it hits an error it can't
fix, the mounting thread bails out and leaves behind the changes made
during the failed mount. I think appending such changes to a corrupted
image may confuse later recovery of it. So my plan is to only mark the
corrupted data during mount and clean it up once the mount scan
finishes.

The next step is to try a R/O mount if a non-recoverable error occurs.

>
> 2. I'd like to be able to replicate the problem so I can fix it, but
> simply poking a random bit to a random empty PEB won't do the trick.
> I've actually tried this before when doing other investigations and

Yes, I saw your log; this is hard to inject. The corruption must be in
a LEB that is scanned during mount, and must be in the empty space
after valid data. See the function 'ubifs_scan' in fs/ubifs/scan.c.

> nothing bad happened, likely because the empty page I hit was never
> looked at by UBIFS. I know there's got to be a way to map LEB to PEB,
> how do I do that/where is the table? Specifically, how to map "LEB
> 4:3918" to a physical block and page on the flash device?
>

You can try my ubidump to solve this problem.

http://lists.infradead.org/pipermail/linux-mtd/2014-December/056828.html

First, read the super LEB (LEB 0) and the master LEBs (LEB 1, LEB 2) to
find the logical position of each field, then use the leb_change ioctl
to change it.

> I'll give fixing it and contributing the patch a try. I'm up against a
> project deadline with a board-bring-up right now (they wanted it done
> 2 weeks ago and I'm having to report on it each day now), so I
> probably won't have time on it till next week.
>

I'm busy with personal stuff these days, but I'd like to set up a coding
environment at home this month so I can continue working at night
(daytime in the West).

I'd be glad to see your patch!

Thanks,
Hu

> - Steve
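
On the LEB-to-PEB question: UBI keeps no fixed table on flash (leaving
fastmap aside); each mapped PEB carries a VID header naming the volume
and LEB it holds. The sketch below rebuilds that mapping offline from a
raw dump of the MTD partition (no OOB data). It is an illustration under
stated assumptions, not a tool from this thread: the header layouts
follow my reading of drivers/mtd/ubi/ubi-media.h, header CRCs are not
checked, and build_leb_map and locate are hypothetical names. The UBI
volume ID that holds the rootfs is not given in the log, so it is left
as a parameter.

```python
import struct

EC_MAGIC = 0x55424923   # "UBI#", erase-counter header
VID_MAGIC = 0x55424921  # "UBI!", volume-identifier header
# Big-endian, 64 bytes each (assumed layout, see ubi-media.h).
EC_HDR = struct.Struct(">IB3xQIII32xI")
VID_HDR = struct.Struct(">IBBBBII4xIIII4xQ12xI")


def build_leb_map(dump, peb_size):
    """Return {(vol_id, lnum): (peb, sqnum, data_offset)} from a raw dump."""
    leb_map = {}
    for peb in range(len(dump) // peb_size):
        base = peb * peb_size
        magic, _ver, _ec, vid_off, data_off, _seq, _crc = \
            EC_HDR.unpack_from(dump, base)
        if magic != EC_MAGIC or vid_off + VID_HDR.size > peb_size:
            continue  # not a UBI-formatted PEB
        vid = VID_HDR.unpack_from(dump, base + vid_off)
        if vid[0] != VID_MAGIC:
            continue  # free (unmapped) PEB
        vol_id, lnum, sqnum = vid[5], vid[6], vid[11]
        old = leb_map.get((vol_id, lnum))
        # If several PEBs claim the same LEB, keep the newest copy.
        if old is None or sqnum > old[1]:
            leb_map[(vol_id, lnum)] = (peb, sqnum, data_off)
    return leb_map


def locate(leb_map, vol_id, lnum, offs, peb_size, page_size):
    """Map vol_id/LEB:offset to (peb, page within PEB, byte offset in dump)."""
    peb, _sqnum, data_off = leb_map[(vol_id, lnum)]
    in_peb = data_off + offs
    return peb, in_peb // page_size, peb * peb_size + in_peb
```

With the map built from a dump of the failing device, locate(m, vol_id,
4, 3918, peb_size, page_size) would give the physical block and page
behind "LEB 4:3918", provided the layout assumptions above hold for
that image.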