* Some questions on bit-flips and JFFS2
@ 2010-05-03 13:05 Thorsten Mühlfelder
  2010-05-04  9:28 ` Norbert van Bolhuis
  2010-05-05  6:59 ` Ricard Wanderlof
  0 siblings, 2 replies; 17+ messages in thread
From: Thorsten Mühlfelder @ 2010-05-03 13:05 UTC (permalink / raw)
  To: linux-mtd

Hi there,

I'm experiencing some problems with bit-flips on devices using NAND and JFFS2:
NAND device: Manufacturer ID: 0x2c, Chip ID: 0xdc (Micron NAND 512MiB 3,3V 
8-bit)
Creating 2 MTD partitions on "NAND 512MiB 3,3V 8-bit":
0x00000000-0x00a00000 : "Bootloader Area"
0x00a00000-0x20000000 : "User Area"
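
For orientation, the two ranges above can be converted to sizes with a little
shell arithmetic. This is only a sketch: the helper name part_info is made up,
and the 128 KiB erase block size is an assumption taken from the mtd_debug
output quoted later in this thread.

```shell
# Hypothetical helper: print the size in MiB and the number of erase blocks
# of a partition given as start and end offsets (end exclusive).
part_info() {
    start=$(( $1 ))
    end=$(( $2 ))
    erasesize=$(( 128 * 1024 ))   # assumption: 128 KiB erase blocks
    size=$(( end - start ))
    echo "$(( size / 1024 / 1024 )) MiB, $(( size / erasesize )) erase blocks"
}

part_info 0x00000000 0x00a00000   # "Bootloader Area"
part_info 0x00a00000 0x20000000   # "User Area"
```

So the raw bootloader/kernel area is 10 MiB (80 erase blocks) and the rest of
the chip belongs to the JFFS2 user area.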

In rare cases 1 or 2 bits in the bootloader area (kernel) flip, so that the 
system won't boot anymore (kernel checksum error).
As the bootloader image is not mounted at all I wonder if this may be caused 
by these read disturbs I've heard of.

I've found some statements from different people about it here on the ML:

> We use JFFS2. As known JFFS2 detects and corrects single bit-flips
> (per 256 byte subpage) but it doesn't physically correct them
> on the NAND device itself.

and:

> AFAIK, jffs2 doesn't handle correctly bit flip on read: it won't try to
> copy the data on another block while the data can still be recovered
> by ecc.

For me this means that data is still read correctly because of ECC, but it
won't get moved to a new block if a bit-flip happens? And what happens if
this occurs on the kernel partition?

Furthermore:
> > How about detection of ECC errors in read only partitions?
>
> ECC should be done on both rw and read-only partitions. Sometimes NAND gets
> read disturbs which would impact on read-only partitions. Also, write
> disturbs from writes to one partition can still corrupt a read-only
> partition on the same chip.

So writing to my root partition may harm my kernel partition, too?

PS: I could not reproduce the bit-flip problem. It just happens in rare cases. 
Furthermore some of my devices are using Samsung NAND instead of the Micron 
NAND and did not show any problems yet. So perhaps my problem is just some
bad NAND chips? But still I have to find a solution for the problem.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Some questions on bit-flips and JFFS2
  2010-05-03 13:05 Some questions on bit-flips and JFFS2 Thorsten Mühlfelder
@ 2010-05-04  9:28 ` Norbert van Bolhuis
  2010-05-04 14:59   ` Thorsten Mühlfelder
  2010-05-05  6:59 ` Ricard Wanderlof
  1 sibling, 1 reply; 17+ messages in thread
From: Norbert van Bolhuis @ 2010-05-04  9:28 UTC (permalink / raw)
  To: Thorsten Mühlfelder; +Cc: linux-mtd

Thorsten Mühlfelder wrote:
> Hi there,
> 
> I'm experiencing some problems with bit-flips on devices using NAND and JFFS2:
> NAND device: Manufacturer ID: 0x2c, Chip ID: 0xdc (Micron NAND 512MiB 3,3V 
> 8-bit)
> Creating 2 MTD partitions on "NAND 512MiB 3,3V 8-bit":
> 0x00000000-0x00a00000 : "Bootloader Area"
> 0x00a00000-0x20000000 : "User Area"
> 
> In rare cases 1 or 2 bits in the bootloader area (kernel) flip, so that the 
> system won't boot anymore (kernel checksum error).
> As the bootloader image is not mounted at all I wonder if this may be caused 
> by these read disturbs I've heard of.
> 


This may very well be the case.


> I've found some statements from different people about it here on the ML:
> 
>> We use JFFS2. As known JFFS2 detects and corrects single bit-flips
>> (per 256 byte subpage) but it doesn't physically correct them
>> on the NAND device itself.
> 
> and:
> 
>> AFAIK, jffs2 doesn't handle correctly bit flip on read: it won't try to
>> copy the data on another block while the data can still be recovered
>> by ecc.
> 
> For me this means that data is still read correctly because of ECC, but it
> won't get moved to a new block if a bit-flip happens? And what happens if
> this occurs on the kernel partition?
> 


True.
ECC is taken care of by the low-level MTD/NAND code
(e.g. drivers/mtd/nand/nand_base.c). These routines do indicate
errors, but jffs2 doesn't really handle them (see jffs2_flash_read).
The kernel partition is a bare MTD(BLOCK) partition, so the block will
certainly not be moved or handled. This means the same (= nothing) will happen.


> Furthermore:
>>> How about detection of ECC errors in read only partitions?
>> ECC should be done on both rw and read-only partitions. Sometimes NAND gets
>> read disturbs which would impact on read-only partitions. Also, write
>> disturbs from writes to one partition can still corrupt a read-only
>> partition on the same chip.
> 
> So writing to my root partition may harm my kernel partition, too?
> 


I don't know. Check/ask your hardware supplier. Micron may have some
details/documents about this.


> PS: I could not reproduce the bit-flip problem. It just happens in rare cases. 
> Furthermore some of my devices are using Samsung NAND instead of the Micron 
> NAND and did not show any problems yet. So perhaps my problem is just some
> bad NAND chips? But still I have to find a solution for the problem.
> 


Maybe; as said, check/ask your hardware supplier.
Maybe "refreshing" the block helps (that is, saving the data, erasing the block(s) and
reprogramming all data). You could try this.
The best solution is of course UBIFS. UBI/UBIFS will handle bad blocks and read/write
disturbs. Include your kernel partition in the (big) flash filesystem partition and
start using UBIFS (instead of JFFS2).


hth,
Norbert van Bolhuis.


* Re: Some questions on bit-flips and JFFS2
  2010-05-04  9:28 ` Norbert van Bolhuis
@ 2010-05-04 14:59   ` Thorsten Mühlfelder
  2010-05-05  8:34     ` Norbert van Bolhuis
  0 siblings, 1 reply; 17+ messages in thread
From: Thorsten Mühlfelder @ 2010-05-04 14:59 UTC (permalink / raw)
  To: linux-mtd; +Cc: Norbert van Bolhuis

Am Tuesday 04 May 2010 11:28:54 schrieb Norbert van Bolhuis:
> Thorsten Mühlfelder wrote:
> > Hi there,
> >
> > I'm experiencing some problems with bit-flips on devices using NAND and
> > JFFS2: NAND device: Manufacturer ID: 0x2c, Chip ID: 0xdc (Micron NAND
> > 512MiB 3,3V 8-bit)
> > Creating 2 MTD partitions on "NAND 512MiB 3,3V 8-bit":
> > 0x00000000-0x00a00000 : "Bootloader Area"
> > 0x00a00000-0x20000000 : "User Area"
> >
> > In rare cases 1 or 2 bits in the bootloader area (kernel) flip, so that
> > the system won't boot anymore (kernel checksum error).
> > As the bootloader image is not mounted at all I wonder if this may be
> > caused by these read disturbs I've heard of.
...
> > PS: I could not reproduce the bit-flip problem. It just happens in rare
> > cases. Furthermore some of my devices are using Samsung NAND instead of
> > the Micron NAND and did not show any problems yet. So perhaps my problem
> > is just some bad NAND chips? But still I have to find a solution for the
> > problem.
>
> Maybe, as said check/ask your hardware supplier.
> Maybe "refreshing" the block helps (that is saving the data, erasing the
> block(s) and reprogramming all data). You could try this.

I've already thought about something like this:
1. After the first successful bootup, dump an mtd0 image and calculate its
md5sum:
nanddump -o -b -f mtd0.img /dev/mtd0
2. Before each shutdown dump the image again and check if the md5sum has 
changed.
3. If it has changed write the initial dump back:
flash_eraseall /dev/mtd0
nandwrite -p /dev/mtd0 mtd0.img

Would this be the right method?
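
The three steps could be scripted roughly as follows. This is a minimal
sketch, not a tested tool: the function names, the reference image path and
the temporary file are made up for illustration, and the nanddump,
flash_eraseall and nandwrite invocations are simply copied from the commands
above. As far as I know, the meaning of some nanddump options differs between
mtd-utils releases, so check "nanddump --help" on the target first.

```shell
#!/bin/sh
# Sketch of the dump / compare / rewrite idea. All paths are assumptions.
MTD_DEV=${MTD_DEV:-/dev/mtd0}
REF_IMG=${REF_IMG:-/var/lib/bootcheck/mtd0.img}

# Print the MD5 of a file (first field of the md5sum output).
image_md5() {
    md5sum "$1" | cut -d' ' -f1
}

# Return 0 if the two images differ, i.e. a rewrite would be needed.
images_differ() {
    [ "$(image_md5 "$1")" != "$(image_md5 "$2")" ]
}

# Dump the partition to a file; flags copied from step 1 above.
dump_partition() {
    nanddump -o -b -f "$1" "$MTD_DEV"
}

# Step 1: run once after the first successful boot.
create_reference() {
    dump_partition "$REF_IMG"
}

# Step 3: erase and rewrite the partition. NOT power-fail safe.
rewrite_partition() {
    flash_eraseall "$MTD_DEV" && nandwrite -p "$MTD_DEV" "$REF_IMG"
}

# Steps 2 and 3: dump again, compare, rewrite only on a mismatch.
check_and_refresh() {
    tmp=/tmp/mtd0.now.img          # hypothetical scratch file
    dump_partition "$tmp" || return 1
    if images_differ "$REF_IMG" "$tmp"; then
        rewrite_partition
    fi
    rm -f "$tmp"
}
```

The idea would be to call create_reference once and check_and_refresh from the
shutdown script. The comparison only sees what the dump returns, so it reacts
to changed data, not to the number of bits the ECC had to repair on the way.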

> The best solution is of course UBIFS. UBI/UBIFS will handle bad blocks and
> read/write disturbs. Include your kernel partition into the (big) flash
> filesystem partition and start using UBIFS (i.s.o. JFFS2).

Is there any How-To for U-Boot?

Greetings
Thorsten


* Re: Some questions on bit-flips and JFFS2
  2010-05-03 13:05 Some questions on bit-flips and JFFS2 Thorsten Mühlfelder
  2010-05-04  9:28 ` Norbert van Bolhuis
@ 2010-05-05  6:59 ` Ricard Wanderlof
  1 sibling, 0 replies; 17+ messages in thread
From: Ricard Wanderlof @ 2010-05-05  6:59 UTC (permalink / raw)
  To: Thorsten Mühlfelder; +Cc: linux-mtd


On Mon, 3 May 2010, Thorsten Mühlfelder wrote:

> PS: I could not reproduce the bit-flip problem. It just happens in rare cases.
> Furthermore some of my devices are using Samsung NAND instead of the Micron
> NAND and did not show any problems yet. So perhaps my problem is just some
> bad NAND chips? But still I have to find a solution for the problem.

In our experience, which is limited to 32 MiB and 128 MiB SLC flashes,
Micron, Hynix and Numonyx have much worse bit read error rates than
Samsung and Toshiba, even though the data sheets hint at the same level of
data integrity. We made some random tests, reading again and again from 
the same flash partition, and the first group above tended to show 
uncorrectable ECC errors already after less than a million reads, whereas 
the second group showed only single-bit errors even after 20 or even 60 
million reads.

Of course, reading that many times from a flash partition, especially a
boot partition, is hardly realistic, but in our experience, it simulates
quite well the situation of having a seldom-read boot partition in a
flash where there is activity going on in other parts of the flash, i.e. 
an embedded system where a single flash chip provides all non-volatile 
storage, over a long period of time.

Of course these are random samples only, but they've been very consistent 
so far. Lately a new batch of Numonyx 128 MiB flashes arrived which seem 
to have better error rates. One can speculate as to why but I'll leave 
that discussion off this mailing list.

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30


* Re: Some questions on bit-flips and JFFS2
  2010-05-04 14:59   ` Thorsten Mühlfelder
@ 2010-05-05  8:34     ` Norbert van Bolhuis
  2010-05-05  8:40       ` Ricard Wanderlof
  2010-05-11  8:09       ` Thorsten Mühlfelder
  0 siblings, 2 replies; 17+ messages in thread
From: Norbert van Bolhuis @ 2010-05-05  8:34 UTC (permalink / raw)
  To: Thorsten Mühlfelder; +Cc: linux-mtd

Thorsten Mühlfelder wrote:

.
.
.

>> Maybe, as said check/ask your hardware supplier.
>> Maybe "refreshing" the block helps (that is saving the data, erasing the
>> block(s) and reprogramming all data). You could try this.
> 
> I've already thought about something like this:
> 1. After the first successful bootup, dump an mtd0 image and calculate its
> md5sum:
> nanddump -o -b -f mtd0.img /dev/mtd0
> 2. Before each shutdown dump the image again and check if the md5sum has 
> changed.
> 3. If it has changed write the initial dump back:
> flash_eraseall /dev/mtd0
> nandwrite -p /dev/mtd0 mtd0.img
> 
> Would this be the right method?
> 


Yes, that's the idea, given that "refreshing" really helps to prevent
(actually delay) future read disturbs.

But this won't work well. nanddump doesn't tell you about single (i.e.
corrected) bit errors.
A change will only be detected, and the data refreshed, if there is an
uncorrectable error (2 or more bits flip). A read disturb tends to be
unstable though, meaning sometimes it's there and sometimes not.
This means you may miss an uncorrectable error (2 bits flip).

Another problem is a sudden reboot (e.g. crash or power-loss). There's no
check then.

It is much easier to let UBI/UBIFS deal with this stuff. It's designed for
this.
u-boot supports UBIFS (read-only). This means you can put a kernel image
on UBIFS and make u-boot read/boot it.


hth,
Norbert van Bolhuis.


* Re: Some questions on bit-flips and JFFS2
  2010-05-05  8:34     ` Norbert van Bolhuis
@ 2010-05-05  8:40       ` Ricard Wanderlof
  2010-05-05  8:51         ` Artem Bityutskiy
  2010-05-11  8:09       ` Thorsten Mühlfelder
  1 sibling, 1 reply; 17+ messages in thread
From: Ricard Wanderlof @ 2010-05-05  8:40 UTC (permalink / raw)
  To: Norbert van Bolhuis; +Cc: Thorsten Mühlfelder, linux-mtd


On Wed, 5 May 2010, Norbert van Bolhuis wrote:

>> I've already thought about something like this:
>> 1. After the first successful bootup, dump an mtd0 image and calculate its
>> md5sum:
>> nanddump -o -b -f mtd0.img /dev/mtd0
>> 2. Before each shutdown dump the image again and check if the md5sum has
>> changed.
>> 3. If it has changed write the initial dump back:
>> flash_eraseall /dev/mtd0
>> nandwrite -p /dev/mtd0 mtd0.img
>>
>> Would this be the right method?
>
> yes that's the idea, given that "refreshing" really helps to prevent
> (actually delay) future read disturbs.
> ...
>
> Another problem is a sudden reboot (e.g. crash or power-loss). There's 
> no check then.

Also, if a power failure occurs during erase or nandwrite the system will 
not boot next time.

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30


* Re: Some questions on bit-flips and JFFS2
  2010-05-05  8:40       ` Ricard Wanderlof
@ 2010-05-05  8:51         ` Artem Bityutskiy
  2010-05-05  9:20           ` Ricard Wanderlof
  2010-05-05  9:29           ` Norbert van Bolhuis
  0 siblings, 2 replies; 17+ messages in thread
From: Artem Bityutskiy @ 2010-05-05  8:51 UTC (permalink / raw)
  To: Ricard Wanderlof; +Cc: Thorsten Mühlfelder, linux-mtd, Norbert van Bolhuis

On Wed, 2010-05-05 at 10:40 +0200, Ricard Wanderlof wrote:
> On Wed, 5 May 2010, Norbert van Bolhuis wrote:
> 
> >> I've already thought about something like this:
> >> 1. After the first successful bootup, dump an mtd0 image and calculate its
> >> md5sum:
> >> nanddump -o -b -f mtd0.img /dev/mtd0
> >> 2. Before each shutdown dump the image again and check if the md5sum has
> >> changed.
> >> 3. If it has changed write the initial dump back:
> >> flash_eraseall /dev/mtd0
> >> nandwrite -p /dev/mtd0 mtd0.img
> >>
> >> Would this be the right method?
> >
> > yes that's the idea, given that "refreshing" really helps to prevent
> > (actually delay) future read disturbs.
> > ...
> >
> > Another problem is a sudden reboot (e.g. crash or power-loss). There's 
> > no check then.
> 
> Also, if a power failure occurs during erase or nandwrite the system will 
> not boot next time.

Unless you supply your device with an UPS :-)

I did not really follow the discussion, so sorry if the following is
unrelated: I think it should not be too difficult to teach JFFS2 to
force GC on eraseblocks with bit-flips.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: Some questions on bit-flips and JFFS2
  2010-05-05  8:51         ` Artem Bityutskiy
@ 2010-05-05  9:20           ` Ricard Wanderlof
  2010-05-11  7:59             ` Thorsten Mühlfelder
  2010-05-05  9:29           ` Norbert van Bolhuis
  1 sibling, 1 reply; 17+ messages in thread
From: Ricard Wanderlof @ 2010-05-05  9:20 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: linux-mtd, Thorsten Mühlfelder, Ricard Wanderlöf,
	Norbert van Bolhuis


On Wed, 5 May 2010, Artem Bityutskiy wrote:

> Unless you supply your device with an UPS :-)

Or rather, unless you never power down your device. An UPS is great for
unexpected power outages, but sooner or later you might want to power down 
your device under manual control anyway... :-)

> I did not really follow the discussion, so sorry if the following is 
> unrelated: I think it should not be too difficult to teach JFFS2 to 
> force GC on eraseblocks with bit-flips.

I think in this case the partition in question just held a raw Linux 
kernel with no file system, so JFFS2 is out of the picture here.

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30


* Re: Some questions on bit-flips and JFFS2
  2010-05-05  8:51         ` Artem Bityutskiy
  2010-05-05  9:20           ` Ricard Wanderlof
@ 2010-05-05  9:29           ` Norbert van Bolhuis
  1 sibling, 0 replies; 17+ messages in thread
From: Norbert van Bolhuis @ 2010-05-05  9:29 UTC (permalink / raw)
  To: dedekind1; +Cc: linux-mtd, Thorsten Mühlfelder, Ricard Wanderlof

Artem Bityutskiy wrote:

.
.
.

> 
> Unless you supply your device with an UPS :-)
> 
> I did not really follow the discussion, so sorry if the following is
> unrelated: I think it should not be too difficult to teach JFFS2 to
> force GC on eraseblocks with bit-flips.
> 

I can confirm that ;-)
We actually implemented this (for our ancient 2.4.25 kernel with the 2005 version of JFFS2)
because we were seriously suffering from NAND bit flips on our Numonyx SLC NAND device.
See:
http://lists.infradead.org/pipermail/linux-mtd/2009-August/027080.html

This would only benefit the OP if the kernel image is kept on JFFS2 (too).


* Re: Some questions on bit-flips and JFFS2
  2010-05-05  9:20           ` Ricard Wanderlof
@ 2010-05-11  7:59             ` Thorsten Mühlfelder
  2010-05-11  9:35               ` Ricard Wanderlof
  0 siblings, 1 reply; 17+ messages in thread
From: Thorsten Mühlfelder @ 2010-05-11  7:59 UTC (permalink / raw)
  To: linux-mtd; +Cc: Ricard Wanderlof, Norbert van Bolhuis, Artem Bityutskiy

Am Wednesday 05 May 2010 11:20:42 schrieb Ricard Wanderlof:
> > I did not really follow the discussion, so sorry if the following is
> > unrelated: I think it should not be too difficult to teach JFFS2 to
> > force GC on eraseblocks with bit-flips.
>
> I think in this case the partition in question just held a raw Linux
> kernel with no file system, so JFFS2 is out of the picture here.
>
> /Ricard

After investigating the problem I can tell you that you are completely
right ;-)
Atmel's flash tool Sam-ba 2.5 was used to flash the first NAND partition as
follows:
1. Bootstrap
2. U-Boot bootloader
3. U-Boot environment variables
4. uImage Linux kernel
All these things are just written raw to predefined memory addresses, and if
any bit in there flips, the system won't boot anymore.
So my idea was to create an image of that partition from within running Linux,
check if the partition changes and, if so, write it back before shutdown. At
least this may reduce the failure rate.
But unfortunately the Sam-Ba 2.5 tool has a bug: it uses a different bad block
table structure, and Linux refuses to read/write every block that was written
by Sam-Ba 2.5 because it recognizes them as bad blocks.
So for now I have no idea what I can do to reduce the failure rate.
At least there is still no board using Samsung flash that has failed, and I
hope all problems are related to the Micron flash.

Kind regards
Thorsten


* Re: Some questions on bit-flips and JFFS2
  2010-05-05  8:34     ` Norbert van Bolhuis
  2010-05-05  8:40       ` Ricard Wanderlof
@ 2010-05-11  8:09       ` Thorsten Mühlfelder
  2010-05-11 14:55         ` Norbert van Bolhuis
  1 sibling, 1 reply; 17+ messages in thread
From: Thorsten Mühlfelder @ 2010-05-11  8:09 UTC (permalink / raw)
  To: linux-mtd; +Cc: Norbert van Bolhuis

Am Wednesday 05 May 2010 10:34:14 schrieb Norbert van Bolhuis:
> It is much easier to let UBI/UBIFS deal with this stuff. It's designed for
> this.
> u-boot supports UBIFS (read-only). This means you can put a kernel image
> on UBIFS and make u-boot read/boot it.

How would I do this? The only thing I've found about it is this discussion of 
April 2008:
http://lists.infradead.org/pipermail/linux-mtd/2008-April/021268.html

Thanks for any tip
Thorsten


* Re: Some questions on bit-flips and JFFS2
  2010-05-11  7:59             ` Thorsten Mühlfelder
@ 2010-05-11  9:35               ` Ricard Wanderlof
  2010-05-12 11:21                 ` Thorsten Mühlfelder
  0 siblings, 1 reply; 17+ messages in thread
From: Ricard Wanderlof @ 2010-05-11  9:35 UTC (permalink / raw)
  To: Thorsten Mühlfelder; +Cc: linux-mtd, Norbert van Bolhuis, Artem Bityutskiy


On Tue, 11 May 2010, Thorsten Mühlfelder wrote:

>> I think in this case the partition in question just held a raw Linux
>> kernel with no file system, so JFFS2 is out of the picture here.
>>
> But unfortunately the Sam-Ba 2.5 tool has a bug: it uses a different bad block
> table structure, and Linux refuses to read/write every block that was written
> by Sam-Ba 2.5 because it recognizes them as bad blocks.
> So for now I have no idea what I can do to reduce the failure rate.

I don't know if this is a good way, but you could patch your kernel so it 
doesn't stop you from erasing/writing badblocks. Then you could write your 
own application which checks for bad blocks in the same way that the 
Sam-Ba 2.5 tool does, which would allow you to rewrite everything written 
by that tool.

> At least there is still no board using Samsung flash that has failed and I
> hope all problems are related to the Micron flash.

Even if you are not using jffs2, mtd will still perform single-bit error
correction thanks to the ECC algorithm, so you need to be unlucky enough
to get two bitflips within a 256 byte region for the system to fail.
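
To make the 256 byte granularity concrete: two flipped bits only become fatal
when both land in the same ECC region. A small sketch, assuming one
single-bit-correcting code per 256 bytes as described above; the function
names are hypothetical and nothing here talks to real hardware.

```shell
ECC_STEP=256   # bytes covered by one ECC code (assumption from the text above)

# Print the index of the ECC region that contains a given byte offset.
ecc_step_of() {
    echo $(( $1 / ECC_STEP ))
}

# Return 0 if bit flips at the two byte offsets fall into the same region,
# which a single-bit-correcting code can no longer repair.
uncorrectable_pair() {
    [ "$(ecc_step_of "$1")" -eq "$(ecc_step_of "$2")" ]
}

uncorrectable_pair 10 200 && echo "offsets 10 and 200: same region, uncorrectable"
uncorrectable_pair 10 300 || echo "offsets 10 and 300: different regions, both corrected"
```

A 2048 byte page holds eight such regions, so two flips in one page are not
necessarily a problem; two flips in one region are.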

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30


* Re: Some questions on bit-flips and JFFS2
  2010-05-11  8:09       ` Thorsten Mühlfelder
@ 2010-05-11 14:55         ` Norbert van Bolhuis
  2010-05-12 10:48           ` Thorsten Mühlfelder
  0 siblings, 1 reply; 17+ messages in thread
From: Norbert van Bolhuis @ 2010-05-11 14:55 UTC (permalink / raw)
  To: Thorsten Mühlfelder; +Cc: linux-mtd

Thorsten Mühlfelder wrote:
> Am Wednesday 05 May 2010 10:34:14 schrieb Norbert van Bolhuis:
>> It is much easier to let UBI/UBIFS deal with this stuff. It's designed for
>> this.
>> u-boot supports UBIFS (read-only). This means you can put a kernel image
>> on UBIFS and make u-boot read/boot it.
> 
> How would I do this? The only thing I've found about it is this discussion of 
> April 2008:
> http://lists.infradead.org/pipermail/linux-mtd/2008-April/021268.html
> 
> Thanks for any tip
> Thorsten
> 

What I meant was to keep the NAND flash partitions the same and
use UBIFS (i.s.o. JFFS2) and of course also put the kernel uImage on
UBIFS.
The goal here is to make u-boot read and boot the kernel image from UBIFS
(i.s.o. from bare NAND).
But I understand Linux cannot read whatever Sam-Ba 2.5 writes, so I guess
it's not even possible to read the kernel image and put it on UBIFS.

Or can you put a kernel image on it beforehand? Are we talking about
deployed systems or to-be-deployed systems?

Maybe you first have to find a solution for the Sam-Ba Linux
incompatibility. I guess u-boot can read the kernel uImage, so u-boot
does seem to understand whatever Sam-Ba 2.5 writes.

Anyway, if you switch to UBIFS and *can* put the kernel image on UBIFS,
fatal bit flips in the uImage on bare NAND don't matter.
On the other hand: fatal bit flips in bootstrap/u-boot/u-boot_env will still
prevent the system from booting. So maybe refreshing those images is also necessary.

You need a recent linux kernel and u-boot for them to support UBIFS.

Note that: if you would upgrade your system with a new u-boot/linux and
UBIFS i.s.o. JFFS2, *everything* on flash will change. If you want to do
this for deployed systems you need a huge upgrade script.


I don't use NAND+UBIFS myself, so I have no experience there.

The below pointers should have some relevant info though:

- u-boot home-page: http://www.denx.de/wiki/U-Boot/WebHome
- u-boot mailing list: http://lists.denx.de/pipermail/u-boot/
- u-boot source code GIT tree: http://git.denx.de/cgi-bin/gitweb.cgi?p=u-boot.git;a=tree
   (check out boards with NAND and boot-from-NAND support)
- specialized GIT trees for NAND and UBI: http://www.denx.de/wiki/U-Boot/Custodians


Regards,
Norbert.


* Re: Some questions on bit-flips and JFFS2
  2010-05-11 14:55         ` Norbert van Bolhuis
@ 2010-05-12 10:48           ` Thorsten Mühlfelder
  0 siblings, 0 replies; 17+ messages in thread
From: Thorsten Mühlfelder @ 2010-05-12 10:48 UTC (permalink / raw)
  To: linux-mtd; +Cc: Norbert van Bolhuis

Am Tuesday 11 May 2010 16:55:19 schrieb Norbert van Bolhuis:
> Or can you put a kernel image on it beforehand ? are we talking about
> deployed systems or to be deployed systems ?

Of course the thoughts about UBIFS are only for future systems that are not
deployed yet. Also, there is a newer version of the Atmel Sam-ba tool that
fixes the bad blocks problem.
But because of the Sam-ba problem it seems to be hard to get a fix done for
already deployed systems.

> Anyway, if you switch to UBIFS and *can* put the kernel image on UBIFS
> fatal bit flips in the uImage on bare NAND don't matter.
> On the other hand: fatal bit flips in bootstrap/u-boot/u-boot_env will
> still prevent the system from booting. So maybe also refreshing those images
> is necessary.

This is right. But since the kernel image is about 1.5 MB and all the
U-Boot stuff is only about 300 kB, it would still improve reliability.
Furthermore, UBIFS will have other advantages (e.g. lower mount times), too.
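
As a rough back-of-the-envelope version of that argument: the sketch below
assumes, as a simplification, that the chance of a fatal flip is proportional
to the number of raw bytes that must read back correctly. The sizes are the
approximate ones quoted above; the variable and function names are made up.

```shell
KERNEL_KB=1536   # about 1.5 MB uImage
UBOOT_KB=300     # bootstrap + U-Boot + environment, about 300 kB

# Percentage of the raw boot data that would move under UBI's protection
# if the kernel image were stored on UBIFS (integer arithmetic, rounded down).
reduction_percent() {
    total=$(( KERNEL_KB + UBOOT_KB ))
    echo $(( 100 * KERNEL_KB / total ))
}

echo "raw area shrinks by about $(reduction_percent)%"
```

Under that assumption roughly five sixths of the exposed raw data would be
gone, leaving only the bootstrap and U-Boot parts unprotected.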

> You need a recent linux kernel and u-boot for them to support UBIFS.
>
> Note that: if you would upgrade your system with a new u-boot/linux and
> UBIFS i.s.o. JFFS2, *everything* on flash will change. If you want to do
> this for deployed systems you need a huge upgrade script.

This won't be done for deployed systems ;-)

> I don't use NAND+UBIFS myself, so I have no experience there.
>
> The below pointers should have some relevant info though:
>
> - u-boot home-page: http://www.denx.de/wiki/U-Boot/WebHome
> - u-boot mailing list: http://lists.denx.de/pipermail/u-boot/
> - u-boot source code GIT tree:
> http://git.denx.de/cgi-bin/gitweb.cgi?p=u-boot.git;a=tree (checkout boards
> with NAND and boot-from-NAND support)
> - specialized GIT trees for NAND and UBI:
> http://www.denx.de/wiki/U-Boot/Custodians

Thanks, will have to take a look at it later. But for now it's a higher 
priority to reduce failure rate in already deployed systems.

Regards
Thorsten


* Re: Some questions on bit-flips and JFFS2
  2010-05-11  9:35               ` Ricard Wanderlof
@ 2010-05-12 11:21                 ` Thorsten Mühlfelder
  2010-05-12 12:22                   ` Ricard Wanderlof
  0 siblings, 1 reply; 17+ messages in thread
From: Thorsten Mühlfelder @ 2010-05-12 11:21 UTC (permalink / raw)
  To: linux-mtd; +Cc: Norbert van Bolhuis

Am Tuesday 11 May 2010 11:35:07 schrieb Ricard Wanderlof:
> On Tue, 11 May 2010, Thorsten Mühlfelder wrote:
> > But unfortunately the Sam-Ba 2.5 tool has a bug: it uses a different bad
> > block table structure, and Linux refuses to read/write every block that
> > was written by Sam-Ba 2.5 because it recognizes them as bad blocks.
> > So for now I have no idea what I can do to reduce the failure rate.
>
> I don't know if this is a good way, but you could patch your kernel so it
> doesn't stop you from erasing/writing badblocks. 

So the only way to get bad blocks erased (scrubbed) in Linux is to have a
patched kernel? This would be a problem, because I don't know any way of
getting a new kernel onto already deployed systems without first deleting the
Sam-ba bad blocks.
BTW: Atmel FAQ says the following about it:
> SAM-BA v2.6 and NandFlash bad block management
>
> Question:
> SAM-BA v2.6 finds a lot of bad blocks when erasing or programming the
> NANDFLASH memory. Is it normal? How should I handle them? 
>
> Answer: 
> This case usually appears when SAM-BA v2.5 (or older) was used to program
> the NandFlash on the AT91SAM9260-EK or AT91SAM9263-EK boards.
>
> The blocks are not really bad, but data (especially ECC bytes) was written
> in the spare area bytes reserved to tag bad blocks. So SAM-BA v2.6 detects
> them as bad. To solve this problem and get an empty NandFlash without bad
> blocks, follow these steps :
>
> - launch SAM-BA v2.6 GUI
> - in the NANDFLASH tab, select the 'NandFlash Init' script and execute it
> - in the TCL shell part of the GUI, type :
> '::NANDFLASH::EraseAllNandFlashFull' WARNING : this procedure will erase
> all data AND bad block tags too (spare area zones), thus manufacturer bad
> block tagging will be lost.
>
> If you know which blocks were tagged bad by the manufacturer, you can
> manually tag them again by typing '::NANDFLASH::TagBadBlock <block_number>'
> in the SAM-BA TCL shell.

So IMHO there are only 2 options:
- Within a running Linux, remove/erase all bad blocks from the beginning of the
kernel image to the end of the partition, test the erased area with nandtest,
mark real bad blocks as bad, and write the new kernel image to the right
address again.
- Or write some tool that can distinguish between real bad blocks and the
bad blocks created by Sam-ba 2.5, and unmark the false bad blocks. But perhaps
this is not possible at all.

Perhaps somebody knows where I can find detailed information about 
these "spare area bytes reserved to tag bad blocks"? As far as I understand 
this is the OOB area, which is 64 bytes on my NAND:
/mtd_debug info /dev/mtd0
mtd.type = MTD_NANDFLASH
mtd.flags = MTD_CAP_NANDFLASH
mtd.size = 10485760 (10M)
mtd.erasesize = 131072 (128K)
mtd.writesize = 2048 (2K)
mtd.oobsize = 64 
regions = 0

Is the OOB part of a page or does each page have an extra OOB (2048+64 bytes)?
Is the OOB located at the beginning or at the end of each page?

Sorry for bothering you with all these questions,
Thorsten

> Then you could write your 
> own application which checks for bad blocks in the same way that the
> Sam-Ba 2.5 tool does, which would allow you to rewrite everything written
> by that tool.
>
> > At least there is still no board using Samsung flash that has failed and
> > I hope all problems are related to the Micron flash.
>
> Even if you are not using jffs2, mtd will still perform single-bit error
> correction thanks to the ECC algorithm, so you need to be unlucky enough
> to get two bitflips within a 256 byte region for the system to fail.
>
> /Ricard


* Re: Some questions on bit-flips and JFFS2
  2010-05-12 11:21                 ` Thorsten Mühlfelder
@ 2010-05-12 12:22                   ` Ricard Wanderlof
  2010-05-19  7:45                     ` Thorsten Mühlfelder
  0 siblings, 1 reply; 17+ messages in thread
From: Ricard Wanderlof @ 2010-05-12 12:22 UTC (permalink / raw)
  To: Thorsten Mühlfelder; +Cc: linux-mtd, Norbert van Bolhuis


On Wed, 12 May 2010, Thorsten Mühlfelder wrote:

>> I don't know if this is a good way, but you could patch your kernel so it
>> doesn't stop you from erasing/writing badblocks.
>
> So the only way to get bad blocks erased (scrubbed) in Linux is to have a
> patched kernel? This would be a problem, because I don't know any way of
> getting a new kernel to already deployed systems without deleting the Sam-ba
> bad blocks before.

Another option would be to write a kernel module that bypasses mtd. But
the mtd write routine is very adamant about not erasing bad blocks.

> So IMHO there are only 2 options:
> - Within a running Linux remove/erase all bad blocks from beginning of kernel
> image to end of the partition, test the erased area with nandtest and mark
> real bad blocks as bad, write the new kernel image to the right address again

Testing for bad blocks is something you can't do in practice on a device.
At the factory, AFAIK, they perform bad block tests while operating the
chip at the limits of its specification, to try to catch all blocks which
are marginal. Testing whether such a block is 'bad' in an existing system
most often doesn't give the expected result. On our development boards, it
sometimes happens that someone manages to erase the whole flash including
the bad block markers. We then routinely just assume all blocks are good,
which works well enough in the lab, although I would never sell a product
with a flash that had been erased that way.

> - Or write some tool, that can distinguish between real bad blocks and the
> Sam-ba 2.5 created bad blocks, unmark the false bad blocks. But perhaps this
> is not possible at all.

It depends on how exactly Sam-ba 2.5 overwrites the existing bad block 
markers.

> Perhaps somebody knows where I can find detailed information about
> these "spare area bytes reserved to tag bad blocks"? As far as I understand

The nand flash data sheets from the manufacturers have information both on 
the memory array layout and how the bad blocks are marked.

> Is the OOB part of a page or does each page have an extra OOB (2048+64 bytes)?

The OOB (in this case) is an 'extra' 64 bytes per 2048 byte page.

> Is the OOB located at the beginning or at the end of each page?

Conceptually it is at the end, although it is normally accessed and read 
in a separate read operation. I think you can perform a sequential read on 
a nand flash which will read the page + oob in succession.
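
As a small illustration of the "2048 + 64 bytes per page" layout, here is a
sketch that splits a raw dump into page data and OOB. It assumes a dump in
which each page's 2048 data bytes are immediately followed by its 64 OOB
bytes; the helper name split_pages is hypothetical.

```python
PAGE_SIZE = 2048   # data bytes per page
OOB_SIZE = 64      # spare (out-of-band) bytes per page


def split_pages(raw):
    """Yield (page_number, data, oob) for every page in the dump.

    ASSUMPTION: the dump stores page data followed by that page's OOB,
    one 2048+64 byte record per page.
    """
    record = PAGE_SIZE + OOB_SIZE
    if len(raw) % record:
        raise ValueError("dump length is not a multiple of 2048+64 bytes")
    for n in range(len(raw) // record):
        start = n * record
        data = raw[start:start + PAGE_SIZE]
        oob = raw[start + PAGE_SIZE:start + record]
        yield n, data, oob
```

Each yielded oob slice is the 64-byte area in which the bad block marker
and the ECC bytes live.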

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

* Re: Some questions on bit-flips and JFFS2
  2010-05-12 12:22                   ` Ricard Wanderlof
@ 2010-05-19  7:45                     ` Thorsten Mühlfelder
  0 siblings, 0 replies; 17+ messages in thread
From: Thorsten Mühlfelder @ 2010-05-19  7:45 UTC (permalink / raw)
  To: linux-mtd; +Cc: Ricard Wanderlof, Norbert van Bolhuis

On Wednesday 12 May 2010 14:22:19, Ricard Wanderlof wrote:
> > - Or write some tool, that can distinguish between real bad blocks and
> > the Sam-ba 2.5 created bad blocks, unmark the false bad blocks. But
> > perhaps this is not possible at all.
>
> It depends on how exactly Sam-ba 2.5 overwrites the existing bad block
> markers.

As the dumps below show, the Sam-Ba 2.5 tool writes some ECC code to the
first 4 bytes of each erase block's OOB, while the bad block information is
usually stored in the first byte. Therefore it is impossible to restore the
factory-set bad block markers. IMHO this is a serious bug in Atmel's Sam-Ba
tool, but it is fixed in newer versions.
I'm quite sure this behaviour is the source of my problem. Factory-set bad
blocks have probably been overwritten by Sam-Ba, and I guess the bit flips
occur in these blocks:

U-Boot> nand dump 0x100000
Page 00100000 dump:
...
OOB:
	b1 20 46 df ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
U-Boot> nand dump 0x100800
Page 00100800 dump:
...
OOB:
	71 31 86 ce ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
U-Boot> nand dump 101800
Page 00101800 dump:
...
OOB:
	16 4b 16 4b ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
U-Boot> nand dump 102000
Page 00102000 dump:
...
OOB:
	06 47 f1 b8 ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
U-Boot> nand dump 102800
Page 00102800 dump:
...
OOB:
	24 5c d3 a3 ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
U-Boot> nand dump 103000
Page 00103000 dump:
...
OOB:
	c3 37 c3 37 ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
U-Boot> nand dump 103800
Page 00103800 dump:
...
OOB:
	81 0c 81 0c ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
	ff ff ff ff ff ff ff ff
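
To make the conflict visible, here is a minimal sketch that checks the
first OOB byte of each page dumped above. The hex rows are copied from the
dumps. The 128 KiB erase block size and the rule "first OOB byte other than
0xff means bad" are assumptions; the data sheet is authoritative on where
the marker lives. The function names are hypothetical.

```python
# First OOB row of each page, copied from the U-Boot dumps above.
OOB_FIRST_ROWS = {
    0x100000: "b1 20 46 df ff ff ff ff",
    0x100800: "71 31 86 ce ff ff ff ff",
    0x101800: "16 4b 16 4b ff ff ff ff",
    0x102000: "06 47 f1 b8 ff ff ff ff",
    0x102800: "24 5c d3 a3 ff ff ff ff",
    0x103000: "c3 37 c3 37 ff ff ff ff",
    0x103800: "81 0c 81 0c ff ff ff ff",
}

BLOCK_SIZE = 0x20000   # ASSUMPTION: 128 KiB erase blocks


def marker_is_clean(oob_row):
    """True if the first OOB byte still holds the erased value 0xff."""
    return bytes.fromhex(oob_row)[0] == 0xFF


def overwritten_pages(rows=OOB_FIRST_ROWS):
    """Return the page addresses whose first OOB byte is not 0xff."""
    return [addr for addr, row in sorted(rows.items())
            if not marker_is_clean(row)]


def starts_block(addr):
    """True if addr is the first page of an erase block."""
    return addr % BLOCK_SIZE == 0
```

Under these assumptions every dumped page has a non-0xff first OOB byte,
including the page at 0x100000 that starts an erase block, so a scan that
relies on that byte can no longer tell a factory marker from Sam-Ba's ECC
data.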

Thread overview: 17+ messages
2010-05-03 13:05 Some questions on bit-flips and JFFS2 Thorsten Mühlfelder
2010-05-04  9:28 ` Norbert van Bolhuis
2010-05-04 14:59   ` Thorsten Mühlfelder
2010-05-05  8:34     ` Norbert van Bolhuis
2010-05-05  8:40       ` Ricard Wanderlof
2010-05-05  8:51         ` Artem Bityutskiy
2010-05-05  9:20           ` Ricard Wanderlof
2010-05-11  7:59             ` Thorsten Mühlfelder
2010-05-11  9:35               ` Ricard Wanderlof
2010-05-12 11:21                 ` Thorsten Mühlfelder
2010-05-12 12:22                   ` Ricard Wanderlof
2010-05-19  7:45                     ` Thorsten Mühlfelder
2010-05-05  9:29           ` Norbert van Bolhuis
2010-05-11  8:09       ` Thorsten Mühlfelder
2010-05-11 14:55         ` Norbert van Bolhuis
2010-05-12 10:48           ` Thorsten Mühlfelder
2010-05-05  6:59 ` Ricard Wanderlof