From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Weinberger
Date: Thu, 12 Jul 2018 10:46:11 +0200
Subject: [U-Boot] UBI fixable bit-flip issue
In-Reply-To: <6079a07f-b819-ed81-6d2c-58ae38629595@denx.de>
References: <6079a07f-b819-ed81-6d2c-58ae38629595@denx.de>
Message-ID: <14079607.ILCUqeWBoJ@blindfold>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: u-boot@lists.denx.de

Mark,

On Thursday, 12 July 2018, 07:22:13 CEST, Heiko Schocher wrote:
> Hello Mark,
>
> added Richard Weinberger to cc...
>
> On 12.07.2018 at 02:28, Mark Spieth wrote:
> > Hi
> >
> > In the process of investigating a boot failure on one of our devices, the
> >
> > UBI: fixable bit-flip detected at PEB
> >
> > message was seen with the following behaviour during kernel load in u-boot.
> >
> > Read [2285568] bytes
> > UBI: fixable bit-flip detected at PEB 415
> > UBI: schedule PEB 415 for scrubbing
> > UBI: fixable bit-flip detected at PEB 415
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: schedule PEB 419 for scrubbing
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: schedule PEB 420 for scrubbing
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> >
> > This repeats until reset.

Do you see the same symptom also on Linux?
We need to be very sure that it is actually a UBI problem.

> > This fix is not a root cause fix though.
> > Investigating further led to the following root cause solution. The following is AFAICT.
> >
> > When the scrubber chooses a PEB to move the data to, it takes one from the free balanced tree.
> > This tree is sorted by EC (erase count) and then by PEB number.
> >
> > The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF which is 8192 in this config. So the
> > find_wl_entry function will find a PEB that is better in error count than the current PEB EC. This

error count? You mean erase count?

> > can easily cause it to find the PEB that was just moved from if it is the lowest numbered PEB in the
> > free tree. Waiting for EC to go above 8192 would take a long time and cause premature aging of the
> > flash PEBs in question.
> >
> > The easy solution is to change the max parameter to this call to 0 so it finds a PEB with a smaller
> > EC than the one being replaced. This means it won't use the previously discarded PEB as its first
> > choice.

For scrubbing this might be a good idea, but not for regular wear-leveling.
See comment in UBI:

/*
 * When a physical eraseblock is moved, the WL sub-system has to pick the target
 * physical eraseblock to move to. The simplest way would be just to pick the
 * one with the highest erase counter. But in certain workloads this could lead
 * to an unlimited wear of one or few physical eraseblock. Indeed, imagine a
 * situation when the picked physical eraseblock is constantly erased after the
 * data is written to it. So, we have a constant which limits the highest erase
 * counter of the free physical eraseblock to pick. Namely, the WL sub-system
 * does not pick eraseblocks with erase counter greater than the lowest erase
 * counter plus %WL_FREE_MAX_DIFF.
 */
#define WL_FREE_MAX_DIFF (2*UBI_WL_THRESHOLD)

So we could change the logic such that for regular wear-leveling we keep using
WL_FREE_MAX_DIFF, but for scrubbing (which is 1:1 wear-leveling but the source PEB
is showing bit-flips) we use a lower value.
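
To make the idea of a per-caller limit concrete, here is a rough sketch. It is a
simplified model and not the actual UBI code: struct free_peb, pick_free_peb()
and free_max_diff() are made-up names, a linear scan stands in for the RB-tree
walk, and the selection rule (highest EC below lowest EC plus the limit, ties
going to the higher PEB number, fallback to the lowest-EC entry) is an assumption
about the behaviour under discussion. 8192 is the value Mark reported for his
config.

```c
#include <stddef.h>

/* Value reported for this config, i.e. 2 * UBI_WL_THRESHOLD. */
#define WL_FREE_MAX_DIFF 8192

/* Hypothetical, simplified stand-in for an entry of the free tree. */
struct free_peb {
	int pnum;	/* physical eraseblock number */
	int ec;		/* erase counter */
};

/*
 * Model of picking a target among the free PEBs (assumed rule, see above):
 * only entries with EC below (lowest EC + max_diff) qualify, the highest EC
 * among them wins, ties go to the higher PEB number. If nothing qualifies,
 * e.g. with max_diff == 0, fall back to the lowest-EC entry.
 */
static const struct free_peb *pick_free_peb(const struct free_peb *free_pebs,
					     size_t n, int max_diff)
{
	const struct free_peb *lowest, *best = NULL;
	size_t i;

	if (n == 0)
		return NULL;

	/* Lowest (EC, pnum) entry, i.e. what would be first in the tree. */
	lowest = &free_pebs[0];
	for (i = 1; i < n; i++) {
		const struct free_peb *e = &free_pebs[i];

		if (e->ec < lowest->ec ||
		    (e->ec == lowest->ec && e->pnum < lowest->pnum))
			lowest = e;
	}

	for (i = 0; i < n; i++) {
		const struct free_peb *e = &free_pebs[i];

		if (e->ec >= lowest->ec + max_diff)
			continue;	/* too worn compared to the lowest EC */
		if (!best || e->ec > best->ec ||
		    (e->ec == best->ec && e->pnum > best->pnum))
			best = e;
	}

	return best ? best : lowest;
}

/* Full limit for regular wear-leveling, half of it for scrubbing. */
static int free_max_diff(int scrubbing)
{
	return scrubbing ? WL_FREE_MAX_DIFF / 2 : WL_FREE_MAX_DIFF;
}
```

With made-up free PEBs (pnum 100, EC 3), (pnum 200, EC 10) and (pnum 415, EC 5000),
this model picks PEB 415 with the regular limit, PEB 200 with the halved scrubbing
limit, and PEB 100 with a limit of 0.
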
IMHO WL_FREE_MAX_DIFF/2 would be a good choice.
I'm not sure whether 0 is too extreme and might cause other distortions.

Mark, can you please file a patch and send it to the linux-mtd mailing list?
Such a change needs to go through Linux and then to u-boot.
But first we need to think about and discuss it in detail.

> I am not sure if it is so easy ...
>
> > This fix was implemented and fixable bit-flip errors no longer hang/freeze the boot process! UBI
> > erase and reformat was used between re-tests to get consistent results.
> >
> > Adding the above 75% correctable bitflip threshold is also a good thing as less movement will ensue
> > when the FLASH is new, but as the flash ages, the root cause will once again be invoked causing
> > un-recoverable boot failures.
> >
> > Note this fault is also in the latest kernel drivers for UBI and may also exist in other wear
> > leveling implementations. The kernel driver issue may be at fault for android devices locking
> > up/freezing sporadically during FLASH read when scrubbing due to a relatively full flash and
> > correctable errors causing ping pong PEB moves.
> >
> > The question is, is my root cause solution sound or have I missed something?
>
> I have to think about it before I write nonsense, but maybe Richard has
> a deeper insight here.

Please see my comments. :)

Thanks,
//richard