[U-Boot] UBI fixable bit-flip issue

From: Mark Spieth <mspieth@digivation.com.au>
To: u-boot@lists.denx.de
Subject: [U-Boot] UBI fixable bit-flip issue
Date: Fri, 13 Jul 2018 00:03:43 +1000
Message-ID: <38d06e1a-ab42-fa51-cdda-f55b92ad439a@digivation.com.au>
In-Reply-To: <14079607.ILCUqeWBoJ@blindfold>

On 12 July 2018 18:46:11 GMT+10:00, Richard Weinberger <richard@nod.at> 
wrote:
>Mark,
>
>On Thursday, 12 July 2018, 07:22:13 CEST, Heiko Schocher wrote:
>> Hello Mark,
>> 
>> added Richard Weinberger to cc...
>> 
>> On 12.07.2018 at 02:28, Mark Spieth wrote:
>> > Hi
>> > 
>> > In the process of investigating a boot failure on one of our devices, the
>> > 
>> > UBI: fixable bit-flip detected at PEB
>> > 
>> > message was seen with the following behaviour during kernel load in u-boot.
>> > 
>> > Read [2285568] bytes
>> > UBI: fixable bit-flip detected at PEB 415
>> > UBI: schedule PEB 415 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 415
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: schedule PEB 419 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: schedule PEB 420 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > 
>> > This repeats until reset.
>
>Do you see the same symptom also on Linux?
>We need to be very sure that it is actually a UBI problem.

The Linux kernel we ship has an up-to-date mtd/ubi driver, so it already 
has the 75% bit-flip threshold, which hides the issue on new flash. So 
the two are not the same. I have not tested this on Linux.
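
For reference, a minimal sketch of what such a threshold does, assuming
it is 75% of the ECC strength rounded up (my reading of the "75%"
figure; the function names are made up for illustration and are not the
mtd API):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: 75% of the correctable bits per ECC step, rounded up. */
static int bitflip_threshold(int ecc_strength)
{
    assert(ecc_strength > 0);
    return (ecc_strength * 3 + 3) / 4;
}

/* A read only counts as "needs scrubbing" at or above the threshold. */
static bool needs_scrub(int max_bitflips, int ecc_strength)
{
    return max_bitflips >= bitflip_threshold(ecc_strength);
}
```

With 8-bit ECC this gives a threshold of 6, so a read with one or two
corrected bits does not schedule a move. New flash rarely gets that
far, which is how the threshold hides the issue.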

>
>> > This fix is not a root cause fix though. Investigating further led to the following root cause
>> > solution. The following is AFAICT.
>> > 
>> > When the scrubber chooses a PEB to move the data to, it picks one from the free balanced tree.
>> > This tree is sorted by EC (erase count) and then by PEB number.
>> > 
>> > The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF, which is 8192 in this config. So the
>> > find_wl_entry function will find a PEB that is better in error count than the current PEB EC. This
>
>error count? You mean erase count?

Yes of course.

> 
>> > can easily cause it to find the PEB that was just moved from if it is the lowest numbered PEB in the
>> > free tree. Waiting for EC to go above 8192 would take a long time and cause premature aging of the
>> > flash PEBs in question.
>> > 
>> > The easy solution is to change the max parameter to this call to 0 so it finds a PEB with a smaller
>> > EC than the one being replaced. This means it won't use the previously discarded PEB as its first
>> > choice.
>
>For scrubbing this might be a good idea, but not for regular wear-leveling.

Yes, only for scrubbing, not wear leveling.
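
To make the ping-pong concrete, here is a toy model of the target
selection. It is a sketch, not the UBI code: pick_free() is a
hypothetical stand-in for the free-tree lookup, assumed to return the
free PEB with the highest EC below (lowest free EC + diff) and to fall
back to the lowest-EC PEB when nothing qualifies. The model also
assumes every read of the data triggers another scrub, as in the log.

```c
#include <assert.h>
#include <stdbool.h>

#define NPEB 6  /* size of the toy flash, chosen for illustration */

struct peb {
    int ec;     /* erase counter */
    bool free;  /* true while the PEB is available as a target */
};

struct sim_result {
    int distinct_targets;  /* how many different PEBs received the data */
    int max_ec;            /* highest erase counter at the end */
};

/* Toy lookup: highest EC below (lowest free EC + diff), ties to the
 * higher PEB number; with diff == 0 the lowest-EC, lowest-numbered
 * free PEB is returned. Returns -1 if no PEB is free. */
static int pick_free(const struct peb *p, int n, int diff)
{
    int first = -1;

    for (int i = 0; i < n; i++)
        if (p[i].free && (first < 0 || p[i].ec < p[first].ec))
            first = i;
    if (first < 0)
        return -1;

    int limit = p[first].ec + diff;
    int best = first;

    for (int i = 0; i < n; i++) {
        if (!p[i].free || p[i].ec >= limit)
            continue;
        if (p[i].ec > p[best].ec || (p[i].ec == p[best].ec && i > best))
            best = i;
    }
    return best;
}

/* Move one block of data 'moves' times, erasing the source each time. */
static struct sim_result simulate(int diff, int moves)
{
    struct peb p[NPEB];
    bool used[NPEB] = { false };
    struct sim_result r = { 0, 0 };
    int src = 0;  /* the data starts on PEB 0 */

    assert(moves >= 0);
    for (int i = 0; i < NPEB; i++) {
        p[i].ec = 1;
        p[i].free = (i != src);
    }

    for (int m = 0; m < moves; m++) {
        int dst = pick_free(p, NPEB, diff);

        assert(dst >= 0);
        p[dst].free = false;
        if (!used[dst]) {
            used[dst] = true;
            r.distinct_targets++;
        }
        p[src].ec++;        /* the old PEB is erased ... */
        p[src].free = true; /* ... and goes back to the free set */
        src = dst;
    }

    for (int i = 0; i < NPEB; i++)
        if (p[i].ec > r.max_ec)
            r.max_ec = p[i].ec;
    return r;
}
```

In this model, simulate(8192, 8) bounces the data between two PEBs
only, while simulate(0, 8) rotates it through all six, so the erases
are spread out instead of landing on the same pair.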
>
>See comment in UBI:
>/*
> * When a physical eraseblock is moved, the WL sub-system has to pick the target
> * physical eraseblock to move to. The simplest way would be just to pick the
> * one with the highest erase counter. But in certain workloads this could lead
> * to an unlimited wear of one or few physical eraseblock. Indeed, imagine a
> * situation when the picked physical eraseblock is constantly erased after the
> * data is written to it. So, we have a constant which limits the highest erase
> * counter of the free physical eraseblock to pick. Namely, the WL sub-system
> * does not pick eraseblocks with erase counter greater than the lowest erase
> * counter plus %WL_FREE_MAX_DIFF.
> */
>#define WL_FREE_MAX_DIFF (2*UBI_WL_THRESHOLD)
>
>So we could change the logic such that for regular wear-leveling we keep using WL_FREE_MAX_DIFF,
>but for scrubbing (which is 1:1 wear-leveling but the source PEB is showing bit-flips) we use
>a lower value. IMHO WL_FREE_MAX_DIFF/2 would be a good choice.
>I'm not sure whether 0 is too extreme and might cause other distortions.

Yes, the wear-leveling threshold is still WL_FREE_MAX_DIFF and the 
scrubbing threshold is 0.

This is why I'm asking. Because the two PEBs will track each other's EC, 
I'm not sure that will work.
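
The two options under discussion can be written side by side. This is
a sketch only: free_max_diff() and scrub_policy are hypothetical names,
and UBI_WL_THRESHOLD is assumed to be 4096 so that WL_FREE_MAX_DIFF
comes out at the 8192 seen in this config.

```c
#include <assert.h>
#include <stdbool.h>

#define UBI_WL_THRESHOLD 4096                 /* assumed value */
#define WL_FREE_MAX_DIFF (2 * UBI_WL_THRESHOLD)

/* Hypothetical switch between the two scrubbing limits discussed. */
enum scrub_policy {
    SCRUB_DIFF_ZERO,  /* limit of 0 for scrubbing */
    SCRUB_DIFF_HALF   /* limit of WL_FREE_MAX_DIFF/2 for scrubbing */
};

/* Limit to pass to the free-PEB lookup for a given kind of move. */
static int free_max_diff(bool scrubbing, enum scrub_policy policy)
{
    assert(policy == SCRUB_DIFF_ZERO || policy == SCRUB_DIFF_HALF);

    if (!scrubbing)
        return WL_FREE_MAX_DIFF;  /* regular wear-leveling unchanged */
    return policy == SCRUB_DIFF_HALF ? WL_FREE_MAX_DIFF / 2 : 0;
}
```

Regular wear-leveling keeps 8192 either way; only the scrubbing path
would see 0 or 4096.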
>
>Mark, can you please file a patch and send it to the linux-mtd mailing list?
>Such a change needs to go through Linux and then to u-boot.
>But first we need to think about and discuss it in detail.

Will do.

> 
>>   I am not sure if it is so easy ...
>>
>> > This fix was implemented and fixable bit-flip errors no longer hang/freeze the boot process! UBI
>> > erase and reformat was used between re-tests to get consistent results.
>> > 
>> > Adding the above 75% correctable bitflip threshold is also a good thing as less movement will ensue
>> > when the FLASH is new, but as the flash ages, the root cause will once again be invoked causing
>> > un-recoverable boot failures.
>> > 
>> > Note this fault is also in the latest kernel drivers for UBI and may also exist in other wear
>> > leveling implementations. The kernel driver issue may be at fault for android devices locking
>> > up/freezing sporadically during FLASH read when scrubbing due to a relatively full flash and
>> > correctable errors causing ping pong PEB moves.
>> > 
>> > The question is, is my root cause solution sound or have I missed something?
>> 
>> I have to think about it before I write nonsense, but maybe Richard has
>> deeper insight here.
>

Thanks for your input.

Mark

-- 
Mark Spieth, PhD
Digivation Pty Ltd
9 Catalina Ave
ASHBURTON VIC 3147
Australia
Phone: +61 4 11 515717 (0411515717)
Fax: +61 3 9885 5774