From: Mark Spieth <mspieth@digivation.com.au>
To: u-boot@lists.denx.de
Subject: [U-Boot] UBI fixable bit-flip issue
Date: Fri, 13 Jul 2018 00:03:43 +1000
Message-ID: <38d06e1a-ab42-fa51-cdda-f55b92ad439a@digivation.com.au>
In-Reply-To: <14079607.ILCUqeWBoJ@blindfold>



On 12 July 2018 18:46:11 GMT+10:00, Richard Weinberger <richard@nod.at> 
wrote:
>Mark,
>
>On Thursday, 12 July 2018, 07:22:13 CEST, Heiko Schocher wrote:
>> Hello Mark,
>> 
>> added Richard Weinberger to cc...
>> 
>> On 12.07.2018 at 02:28, Mark Spieth wrote:
>> > Hi
>> > 
>> > In the process of investigating a boot failure on one of our devices, the
>> > 
>> > UBI: fixable bit-flip detected at PEB
>> > 
>> > message was seen with the following behaviour during kernel load in u-boot.
>> > 
>> > Read [2285568] bytes
>> > UBI: fixable bit-flip detected at PEB 415
>> > UBI: schedule PEB 415 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 415
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: schedule PEB 419 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: schedule PEB 420 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > 
>> > This repeats until reset.
>
>Do you see the same symptom also on Linux?
>We need to be very sure that it is actually a UBI problem.

The Linux we ship has an up-to-date mtd/ubi driver, so it already has the 
75% bit-flip threshold, which hides the issue on new flash. So the two are 
not the same. Untested on Linux.
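
To make the threshold concrete, here is a minimal sketch. It assumes the
default is three quarters of the ECC strength, rounded up, and that a read is
only reported as needing scrubbing once the worst corrected bit-flip count
reaches that value. The function names are hypothetical and are not the
driver's API.

```c
#include <assert.h>

/* Hypothetical helper: 3/4 of the ECC strength, rounded up. */
static unsigned int default_bitflip_threshold(unsigned int ecc_strength)
{
    return (ecc_strength * 3 + 3) / 4;
}

/*
 * Hypothetical helper: returns 1 if a read that needed max_bitflips
 * corrections should be reported so that the PEB gets scrubbed.
 */
static int needs_scrub(unsigned int max_bitflips, unsigned int threshold)
{
    /* A zero threshold would flag every read, so reject it here. */
    assert(threshold > 0);
    return max_bitflips >= threshold;
}
```

With 4-bit ECC, for example, one or two corrected bit-flips would not trigger
a move under this rule; three or more would. A driver without the threshold
behaves as if the threshold were 1.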

>
>> > This fix is not a root cause fix though. Investigating further led to the
>> > following root cause solution. The following is AFAICT.
>> > 
>> > When the scrubber chooses a PEB to move the data to, it picks one from the
>> > free balanced tree. This tree is sorted by EC (erase count) and then by PEB
>> > number.
>> > 
>> > The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF, which is
>> > 8192 in this config. So the find_wl_entry function will find a PEB that is
>> > better in error count than the current PEB EC. This
>
>error count? You mean erase count?

Yes of course.

> 
>> > can easily cause it to find the PEB that was just moved from if it is the
>> > lowest numbered PEB in the free tree. Waiting for EC to go above 8192 would
>> > take a long time and cause premature aging of the flash PEBs in question.
>> > 
>> > The easy solution is to change the max parameter to this call to 0 so it
>> > finds a PEB with a smaller EC than the one being replaced. This means it
>> > won't use the previously discarded PEB as its first choice.
>
>For scrubbing this might be a good idea, but not for regular wear-leveling.
Yes, only for scrubbing, not wear leveling.
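
The selection rule under discussion can be sketched roughly as below. This is
a simplified model, not the driver code: it uses a sorted array instead of the
red-black tree, and struct free_peb and pick_free() are hypothetical names. It
assumes the rule is "take the last entry whose EC is strictly below the lowest
free EC plus diff, otherwise the first entry".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the free tree. */
struct free_peb {
    int ec;   /* erase counter */
    int pnum; /* physical eraseblock number */
};

/*
 * pool[] must be sorted ascending by (ec, pnum), like the free tree.
 * Returns the index of the last entry with ec < pool[0].ec + diff,
 * or 0 (the lowest-EC entry) when no entry qualifies.
 */
static size_t pick_free(const struct free_peb *pool, size_t n, int diff)
{
    size_t i, best = 0;
    int max;

    assert(n > 0);
    max = pool[0].ec + diff;
    for (i = 0; i < n; i++) {
        if (pool[i].ec < max)
            best = i;
    }
    return best;
}
```

In this model a PEB that has just been erased carries the highest EC in the
pool, so a large diff selects it again straight away, while a diff of 0 always
falls back to the lowest-EC entry.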
>
>See comment in UBI:
>/*
> * When a physical eraseblock is moved, the WL sub-system has to pick the target
> * physical eraseblock to move to. The simplest way would be just to pick the
> * one with the highest erase counter. But in certain workloads this could lead
> * to an unlimited wear of one or few physical eraseblock. Indeed, imagine a
> * situation when the picked physical eraseblock is constantly erased after the
> * data is written to it. So, we have a constant which limits the highest erase
> * counter of the free physical eraseblock to pick. Namely, the WL sub-system
> * does not pick eraseblocks with erase counter greater than the lowest erase
> * counter plus %WL_FREE_MAX_DIFF.
> */
>#define WL_FREE_MAX_DIFF (2*UBI_WL_THRESHOLD)
>
>So we could change the logic such that for regular wear-leveling we keep using
>WL_FREE_MAX_DIFF, but for scrubbing (which is 1:1 wear-leveling but the source
>PEB is showing bit-flips) we use a lower value. IMHO WL_FREE_MAX_DIFF/2 would
>be a good choice. I'm not sure whether 0 is too extreme and might cause other
>distortions.

Yes, the wear-leveling threshold is still WL_FREE_MAX_DIFF and the 
scrubbing threshold is 0.

This is why I'm asking. Because the two PEBs will track each other's EC, 
I'm not sure that will work.
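
The concern can be illustrated with a small worst-case model. It assumes that
every relocated block immediately shows bit-flips again, as in the log above,
and that the erased source PEB returns to the free pool at once. struct model,
pick_target() and count_bounces() are hypothetical names; this is a sketch,
not the UBI implementation.

```c
#include <assert.h>

#define NPEBS 4

/* Hypothetical model: erase counter and free flag per PEB. */
struct model {
    int ec[NPEBS];
    int is_free[NPEBS];
};

/*
 * Highest-EC free PEB with ec < lowest free EC + diff (ties go to the
 * higher PEB number); falls back to the lowest-EC free PEB.
 */
static int pick_target(const struct model *m, int diff)
{
    int i, lowest = -1, best = -1;

    for (i = 0; i < NPEBS; i++) {
        if (m->is_free[i] && (lowest < 0 || m->ec[i] < m->ec[lowest]))
            lowest = i;
    }
    assert(lowest >= 0);
    for (i = 0; i < NPEBS; i++) {
        if (m->is_free[i] && m->ec[i] < m->ec[lowest] + diff &&
            (best < 0 || m->ec[i] >= m->ec[best]))
            best = i;
    }
    return best < 0 ? lowest : best;
}

/*
 * Scrub `rounds` times starting from used PEB `src`. Counts the moves
 * that land on the PEB vacated in the previous round (a "bounce").
 */
static int count_bounces(struct model *m, int src, int diff, int rounds)
{
    int r, prev = -1, bounces = 0;

    for (r = 0; r < rounds; r++) {
        int dst = pick_target(m, diff);

        if (dst == prev)
            bounces++;
        m->is_free[dst] = 0; /* data now lives in dst */
        m->ec[src]++;        /* src is erased ... */
        m->is_free[src] = 1; /* ... and returned to the free pool */
        prev = src;
        src = dst;
    }
    return bounces;
}
```

In this model, starting with four PEBs at EC 1, a diff of 8192 and a diff of
4096 both bounce between the same two PEBs on every round after the first,
because each erase keeps the pair at the top of the free pool. A diff of 0
rotates through all free PEBs with no bounces.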
>
>Mark, can you please file a patch and send it to the linux-mtd mailing list?
>Such a change needs to go through Linux and then to u-boot.
>But first we need to think about and discuss it in detail.

Will do.

> 
>>   I am not sure if it is so easy ...
>>
>> > This fix was implemented and fixable bit-flip errors no longer hang/freeze
>> > the boot process! UBI erase and reformat was used between re-tests to get
>> > consistent results.
>> > 
>> > Adding the above 75% correctable bitflip threshold is also a good thing as
>> > less movement will ensue when the FLASH is new, but as the flash ages, the
>> > root cause will once again be invoked causing un-recoverable boot failures.
>> > 
>> > Note this fault is also in the latest kernel drivers for UBI and may also
>> > exist in other wear leveling implementations. The kernel driver issue may
>> > be at fault for android devices locking up/freezing sporadically during
>> > FLASH read when scrubbing due to a relatively full flash and correctable
>> > errors causing ping pong PEB moves.
>> > 
>> > The question is, is my root cause solution sound or have I missed something?
>> 
>> I have to think about it before I write nonsense, but maybe Richard has
>> a deeper insight here.
>

Thanks for your input.

Mark


-- 
Mark Spieth, PhD
Digivation Pty Ltd
9 Catalina Ave
ASHBURTON VIC 3147
Australia
Phone: +61 4 11 515717 (0411515717)
Fax: +61 3 9885 5774
