I don't understand how the counter for erasures is being maintained during erase failures

* I don't understand how the counter for erasures is being maintained during erase failures
@ 2011-04-12 12:57 Atlant Schmidt
  2011-04-14  7:13 ` Artem Bityutskiy
  0 siblings, 1 reply; 7+ messages in thread
From: Atlant Schmidt @ 2011-04-12 12:57 UTC (permalink / raw)
  To: 'linux-mtd@lists.infradead.org'

Folks:

On my linux system (running MTD/UBI/UBIfs), the following
event occurred:

  [62452.439299] UBI error: ubi_io_write: error -5 while writing 516096 bytes to PEB 3982:8192, written 503808 bytes
  [62452.465874] UBI: run torture test for PEB 3982
  [62463.910000] UBI: PEB 3982 passed torture test, do not mark it a bad
  [62466.666439] UBI error: ubi_io_write: error -5 while writing 516096 bytes to PEB 3982:8192, written 503808 bytes
  [62466.693753] UBI: run torture test for PEB 3982
  [62477.763592] UBI: PEB 3982 passed torture test, do not mark it a bad
    :
    :
  [62622.746585] UBI error: ubi_io_write: error -5 while writing 516096 bytes to PEB 3982:8192, written 503808 bytes
  [62622.801612] UBI: run torture test for PEB 3982
  [62633.821650] UBI: PEB 3982 passed torture test, do not mark it a bad
  [62636.629686] UBI error: ubi_io_write: error -5 while writing 516096 bytes to PEB 3982:8192, written 503808 bytes
  [62636.661260] UBI: run torture test for PEB 3982
  [62643.962758] UBI error: torture_peb: read problems on freshly erased PEB 3982, must be bad
  [62643.992792] UBI error: erase_worker: failed to erase PEB 3982, error -5
  [62644.022791] UBI: mark PEB 3982 as bad
  [62644.045182] UBI: 37 PEBs left in the reserve

At this point, I dumped out the contents of PEB 3982:

  /> ubi_dump.pl 3982
  PEB f8e (3982):  ec magic number is not correct. Is: 5a5a5a5a   Should be: 55424923
  PEB 3982:
    00000000:   5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A  ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
    00000020:   5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A  ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
    00000040:   5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A  ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
    00000060:   5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A  ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
      :
      :

So that PEB no longer contains any ubi_ec_hdr struct.
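
For readers following along, here is a minimal sketch of the kind of check the dump script appears to be making. The magic value 0x55424923 ("UBI#") is taken from the output above, and UBI stores its on-flash headers big-endian. The helper names `read_be32` and `peb_has_ec_magic` are made up for this illustration and are not UBI functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Magic value the dump tool expected ("UBI#" in ASCII), per the output above. */
#define EC_HDR_MAGIC 0x55424923u

/* Read a big-endian 32-bit value; UBI headers are big-endian on flash. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: returns 1 if the PEB starts with the EC header magic. */
static int peb_has_ec_magic(const uint8_t *peb, size_t len)
{
    if (len < 4)
        return 0;
    return read_be32(peb) == EC_HDR_MAGIC;
}
```

A PEB still holding the 0x5A torture pattern reads back 0x5A5A5A5A in its first word, so a check like this would reject it, which matches the "ec magic number is not correct" line.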

What happens next?

When I reboot, this block *HASN'T* been added to the bad block
list (nor were the other two blocks "marked as bad" during this
linux boot session). And after the reboot, my script reports
the following information about PEB 3982:

  /> ubi_dump.pl 3982
  PEB f8e (3982):  Erased 16
  Minimum erase count: 16
  Average erase count: 16 computed across 1 blocks
  Maximum erase count: 16

This can't be accurate -- the block was tortured 14 times
during the failure and each torture represents three erase/
write cycles, right? (Per torture_peb(), 0xA5, 0x5A, and 0x00.)
So even if this block had somehow been "virgin" (and it's
certainly not!), it should now have an erase count of at
least 3*14=42, just considering the torturing.
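
The arithmetic above can be written out as a small sketch. It assumes, as the paragraph does, one erase per torture pattern and three patterns per run. The function name is hypothetical and nothing here is taken from the UBI sources.

```c
/* Lower bound on a PEB's erase counter after repeated torture runs.
 * erases_per_run = 3 follows the reading of torture_peb() given above
 * (one erase for each of the patterns 0xA5, 0x5A and 0x00). */
static unsigned long min_ec_after_torture(unsigned long start_ec,
                                          unsigned long runs,
                                          unsigned long erases_per_run)
{
    return start_ec + runs * erases_per_run;
}
```

With a start value of 0, 14 runs and 3 erases per run this gives the 42 quoted above.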

Also, given that it failed to erase (or at least couldn't be
successfully read when freshly erased), why doesn't the block
permanently join the pool of bad PEBs?

                           Atlant

This e-mail and the information, including any attachments, it contains are intended to be a confidential communication only to the person or entity to whom it is addressed and may contain information that is privileged. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please immediately notify the sender and destroy the original message.

Thank you.

Please consider the environment before printing this email.

* Re: I don't understand how the counter for erasures is being maintained during erase failures
  2011-04-12 12:57 I don't understand how the counter for erasures is being maintained during erase failures Atlant Schmidt
@ 2011-04-14  7:13 ` Artem Bityutskiy
  2011-04-14 10:53   ` Atlant Schmidt
  0 siblings, 1 reply; 7+ messages in thread
From: Artem Bityutskiy @ 2011-04-14  7:13 UTC (permalink / raw)
  To: Atlant Schmidt; +Cc: 'linux-mtd@lists.infradead.org'

Hi,

On Tue, 2011-04-12 at 08:57 -0400, Atlant Schmidt wrote:
> Folks:
> 
> On my linux system (running MTD/UBI/UBIfs), the following
> event occurred:
> 
> 
>   [62452.439299] UBI error: ubi_io_write: error -5 while writing 516096 bytes to PEB 3982:8192, written 503808 bytes
>   [62452.465874] UBI: run torture test for PEB 3982
>   [62463.910000] UBI: PEB 3982 passed torture test, do not mark it a bad
>   [62466.666439] UBI error: ubi_io_write: error -5 while writing 516096 bytes to PEB 3982:8192, written 503808 bytes
>   [62466.693753] UBI: run torture test for PEB 3982
>   [62477.763592] UBI: PEB 3982 passed torture test, do not mark it a bad
>     :
>     :
>   [62622.746585] UBI error: ubi_io_write: error -5 while writing 516096 bytes to PEB 3982:8192, written 503808 bytes
>   [62622.801612] UBI: run torture test for PEB 3982
>   [62633.821650] UBI: PEB 3982 passed torture test, do not mark it a bad
>   [62636.629686] UBI error: ubi_io_write: error -5 while writing 516096 bytes to PEB 3982:8192, written 503808 bytes
>   [62636.661260] UBI: run torture test for PEB 3982
>   [62643.962758] UBI error: torture_peb: read problems on freshly erased PEB 3982, must be bad
>   [62643.992792] UBI error: erase_worker: failed to erase PEB 3982, error -5
>   [62644.022791] UBI: mark PEB 3982 as bad
>   [62644.045182] UBI: 37 PEBs left in the reserve

What is the flash? Is it MLC?

> At this point, I dumped out the contents of PEB 3982:
> 
>   /> ubi_dump.pl 3982
>   PEB f8e (3982):  ec magic number is not correct. Is: 5a5a5a5a   Should be: 55424923
>   PEB 3982:
>     00000000:   5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A  ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
>     00000020:   5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A  ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
>     00000040:   5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A  ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
>     00000060:   5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A  ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
>       :
>       :
> 
> 
> So that PEB no longer contains any ubi_ec_hdr struct.

Maybe we should change the torture test a bit to emulate real usage:
write the patterns in 3 steps rather than in one go. I mean, write the
pattern to where the EC header should be, then to where the VID header
should be, and then to where the data should be. I think in your case the
problem would have been spotted sooner that way. You can try to do this.
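
A rough sketch of that idea, run against an in-memory stand-in rather than real flash, might look like the following. PEB_SIZE and DATA_OFF come from the log above (8192 + 516096 = 524288). VID_HDR_OFF = 4096 is an assumption, and `struct mock_peb` and all the function names are hypothetical. This is not the kernel's torture_peb(), only an illustration of programming the three regions separately.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* PEB_SIZE and DATA_OFF match the log (8192 + 516096 = 524288).
 * VID_HDR_OFF is an assumption: one 4096-byte page after the EC header. */
#define PEB_SIZE    524288u
#define EC_HDR_OFF       0u
#define VID_HDR_OFF   4096u
#define DATA_OFF      8192u

static const uint8_t patterns[] = { 0xA5, 0x5A, 0x00 };

/* Hypothetical in-memory stand-in for one physical eraseblock. */
struct mock_peb {
    uint8_t bytes[PEB_SIZE];
    unsigned erases;
};

static void mock_erase(struct mock_peb *p)
{
    memset(p->bytes, 0xFF, PEB_SIZE);
    p->erases++;
}

/* Programming NAND can only clear bits, so a write is modelled as AND. */
static void mock_program(struct mock_peb *p, size_t off, size_t len, uint8_t patt)
{
    for (size_t i = 0; i < len; i++)
        p->bytes[off + i] &= patt;
}

/* Returns 1 if every byte in the region equals patt, else 0. */
static int region_is(const struct mock_peb *p, size_t off, size_t len, uint8_t patt)
{
    for (size_t i = 0; i < len; i++)
        if (p->bytes[off + i] != patt)
            return 0;
    return 1;
}

/* Program one region and read it back; 0 on success, -1 on mismatch. */
static int program_and_check(struct mock_peb *p, size_t off, size_t len, uint8_t patt)
{
    mock_program(p, off, len, patt);
    return region_is(p, off, len, patt) ? 0 : -1;
}

/* Returns the number of patterns tested, or -1 at the first failure. */
static int torture_in_three_steps(struct mock_peb *p)
{
    size_t npatt = sizeof(patterns) / sizeof(patterns[0]);

    for (size_t i = 0; i < npatt; i++) {
        mock_erase(p);
        if (!region_is(p, 0, PEB_SIZE, 0xFF))
            return -1;
        /* Three separate writes: EC header area, VID header area, data. */
        if (program_and_check(p, EC_HDR_OFF, VID_HDR_OFF - EC_HDR_OFF, patterns[i]) ||
            program_and_check(p, VID_HDR_OFF, DATA_OFF - VID_HDR_OFF, patterns[i]) ||
            program_and_check(p, DATA_OFF, PEB_SIZE - DATA_OFF, patterns[i]))
            return -1;
    }
    return (int)npatt;
}
```

The mock never fails, so it only shows the ordering of operations. On real hardware each step would go through the MTD write and read paths, and any error there would end the test early.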

> What happens next?

It should be marked as bad.

> 
> When I reboot, this block *HASN'T* been added to the bad block
> list (nor were the other two blocks "marked as bad" during this
> linux boot session).

This is a real problem; you should dig into this and fix your drivers.

>  And after the reboot, my script reports
> the following information about PEB 3982:
> 
>   /> ubi_dump.pl 3982
>   PEB f8e (3982):  Erased 16
>   Minimum erase count: 16
>   Average erase count: 16 computed across 1 blocks
>   Maximum erase count: 16

Yes, the erase counter was lost and the average was used.
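
As a simplified illustration of "the average was used": blocks whose erase counter cannot be read are given the mean of the counters that could be read. The sketch below assumes a plain array of counters with a sentinel for unknown values. The names `UNKNOWN_EC`, `mean_known_ec` and `fill_unknown_ec` are invented here and do not reflect the real UBI scanning code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for "EC header missing or unreadable". */
#define UNKNOWN_EC ((int64_t)-1)

/* Mean of all known erase counters; 0 if none are known. */
static int64_t mean_known_ec(const int64_t *ec, size_t n)
{
    int64_t sum = 0;
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (ec[i] != UNKNOWN_EC) {
            sum += ec[i];
            count++;
        }
    }
    return count ? sum / (int64_t)count : 0;
}

/* Give every block with an unknown counter the mean of the known ones. */
static void fill_unknown_ec(int64_t *ec, size_t n)
{
    int64_t mean = mean_known_ec(ec, n);

    for (size_t i = 0; i < n; i++)
        if (ec[i] == UNKNOWN_EC)
            ec[i] = mean;
}
```

Under this model a block that lost its EC header during torturing would come back with the device-wide average (16 in the dump above) rather than its true count.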

> This can't be accurate -- the block was tortured 14 times
> during the failure and each torture represents three erase/
> write cycles, right? (Per torture_peb(), 0xA5, 0x5A, and 0x00.)
> So even if this block had somehow been "virgin" (and it's
> certainly not!), it should now have an erase count of at
> least 3*14=42, just considering the torturing.

If the block had passed the torture test, the EC would be correct. But it
did not, and it should have been marked bad. UBI should not use it at
all.

So the wrong EC counter is not something you should worry about. This is
not a problem.

> Also, given that it failed to erase (or at least couldn't be
> successfully read when freshly erased), why doesn't the block
> permanently join the pool of bad PEBs?

That's the real problem. I do not know why; this is an issue in your
driver, below the UBI level, somewhere at the MTD level. You need to dig
into this.
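
One way to start digging from user space is to ask the MTD layer directly whether it considers the block bad, before and after a reboot. The sketch below uses the MEMGETINFO and MEMGETBADBLOCK ioctls from <mtd/mtd-user.h>. It assumes that PEB numbers map to offsets of pnum * erasesize on the underlying MTD device. The device path is whatever MTD character device UBI is attached to on your board (the one in the usage note is only a placeholder), and the helper names are hypothetical.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <mtd/mtd-user.h>

/* Byte offset of a PEB on the MTD device, assuming PEB n starts at n * erasesize. */
static long long peb_offset(unsigned pnum, unsigned peb_size)
{
    return (long long)pnum * peb_size;
}

/* Hypothetical helper: 1 if MTD reports the block bad, 0 if good, -1 on error. */
static int peb_is_bad(const char *mtd_path, unsigned pnum)
{
    struct mtd_info_user info;
    __kernel_loff_t offs;
    int fd, ret;

    fd = open(mtd_path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The erase size of the device gives the PEB size. */
    if (ioctl(fd, MEMGETINFO, &info) < 0) {
        close(fd);
        return -1;
    }

    offs = peb_offset(pnum, info.erasesize);
    ret = ioctl(fd, MEMGETBADBLOCK, &offs);
    close(fd);

    return ret < 0 ? -1 : (ret > 0);
}
```

For example, `peb_is_bad("/dev/mtd0", 3982)` (path chosen only for illustration) returning 0 right after UBI logged "mark PEB 3982 as bad" would suggest the driver's bad-block marking is not taking effect.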

> Please consider the environment before printing this email.

Sure, I won't print it! :-)

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

* RE: I don't understand how the counter for erasures is being maintained during erase failures
  2011-04-14  7:13 ` Artem Bityutskiy
@ 2011-04-14 10:53   ` Atlant Schmidt
  2011-04-14 10:55     ` Artem Bityutskiy
  2011-04-14 11:35     ` Artem Bityutskiy
  0 siblings, 2 replies; 7+ messages in thread
From: Atlant Schmidt @ 2011-04-14 10:53 UTC (permalink / raw)
  To: 'dedekind1@gmail.com'; +Cc: 'linux-mtd@lists.infradead.org'

Artem:

> What is the flash? Is it MLC?

Today, unfortunately, yes, although our newest board
revisions have switched to SLC and we're retrofitting
the older boards as we can. But some of our systems
will be living with MLC for a while yet.

> This is a real problem; you should dig into this and fix your drivers.

We're using the off-the-shelf MTD driver (although
we should probably be using newer versions of everything:
MTD, UBI, and UBIfs). But I'm becoming familiar with
the code so I'll look into this. If I get stuck, folks
on the list seem to be helpful to others with questions.

But the question I'll start-off with is: What specific
step(s) is/are necessary to cause UBI to permanently
consider this a bad block?

> > Please consider the environment before printing this email.
>
> Sure, I won't print it! :-)

As I'm sure you realize, I've no control over that
disclaimer, but someone, somewhere thought it was
a good idea.

                          Atlant

* RE: I don't understand how the counter for erasures is being maintained during erase failures
  2011-04-14 10:53   ` Atlant Schmidt
@ 2011-04-14 10:55     ` Artem Bityutskiy
  2011-04-14 10:58       ` Atlant Schmidt
  2011-04-14 11:35     ` Artem Bityutskiy
  1 sibling, 1 reply; 7+ messages in thread
From: Artem Bityutskiy @ 2011-04-14 10:55 UTC (permalink / raw)
  To: Atlant Schmidt; +Cc: 'linux-mtd@lists.infradead.org'

On Thu, 2011-04-14 at 06:53 -0400, Atlant Schmidt wrote:
> But the question I'll start-off with is: What specific
> step(s) is/are necessary to cause UBI to permanently
> consider this a bad block? 

When you fail to erase it or when the torture test fails, AFAIR. Just find
all callers of 'ubi_io_mark_bad()'.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

* RE: I don't understand how the counter for erasures is being maintained during erase failures
  2011-04-14 10:55     ` Artem Bityutskiy
@ 2011-04-14 10:58       ` Atlant Schmidt
  0 siblings, 0 replies; 7+ messages in thread
From: Atlant Schmidt @ 2011-04-14 10:58 UTC (permalink / raw)
  To: 'dedekind1@gmail.com'; +Cc: 'linux-mtd@lists.infradead.org'

Artem:

  Thanks!

       Atlant

* RE: I don't understand how the counter for erasures is being maintained during erase failures
  2011-04-14 10:53   ` Atlant Schmidt
  2011-04-14 10:55     ` Artem Bityutskiy
@ 2011-04-14 11:35     ` Artem Bityutskiy
  2011-04-14 11:53       ` Atlant Schmidt
  1 sibling, 1 reply; 7+ messages in thread
From: Artem Bityutskiy @ 2011-04-14 11:35 UTC (permalink / raw)
  To: Atlant Schmidt; +Cc: 'linux-mtd@lists.infradead.org'

On Thu, 2011-04-14 at 06:53 -0400, Atlant Schmidt wrote:
> Artem:
> 
> > What is the flash? Is it MLC?
> 
> Today, unfortunately, yes, although our newest board
> revisions have switched to SLC and we're retrofitting
> the older boards as we can. But some of our systems
> will be living with MLC for a while yet.

BTW, in case you did not read that:
http://www.linux-mtd.infradead.org/faq/ubifs.html#L_ubifs_mlc

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

* RE: I don't understand how the counter for erasures is being maintained during erase failures
  2011-04-14 11:35     ` Artem Bityutskiy
@ 2011-04-14 11:53       ` Atlant Schmidt
  0 siblings, 0 replies; 7+ messages in thread
From: Atlant Schmidt @ 2011-04-14 11:53 UTC (permalink / raw)
  To: 'dedekind1@gmail.com'; +Cc: 'linux-mtd@lists.infradead.org'

Artem:

> BTW, in case you did not read that:
> http://www.linux-mtd.infradead.org/faq/ubifs.html#L_ubifs_mlc

  Oh yes, here in my shop, we've read that!
  Too late, perhaps ;-), but we've read that!
  But just like disclaimers automatically
  included in E-mail messages, we live with
  what we have and not what we wish we had!

                        Atlant
