From mboxrd@z Thu Jan 1 00:00:00 1970
From: Heiko Schocher
Date: Thu, 12 Jul 2018 07:22:13 +0200
Subject: [U-Boot] UBI fixable bit-flip issue
In-Reply-To:
References:
Message-ID: <6079a07f-b819-ed81-6d2c-58ae38629595@denx.de>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: u-boot@lists.denx.de

Hello Mark,

added Richard Weinberger to cc...

On 12.07.2018 at 02:28, Mark Spieth wrote:
> Hi
>
> In the process of investigating a boot failure on one of our devices, the
>
> UBI: fixable bit-flip detected at PEB
>
> message was seen with the following behaviour during kernel load in U-Boot.
>
> Read [2285568] bytes
> UBI: fixable bit-flip detected at PEB 415
> UBI: schedule PEB 415 for scrubbing
> UBI: fixable bit-flip detected at PEB 415
> UBI: fixable bit-flip detected at PEB 419
> UBI: schedule PEB 419 for scrubbing
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: schedule PEB 420 for scrubbing
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
>
> This repeats until reset.
>
> U-Boot is a patched version of 2010.06 supplied by the chip vendor. No newer version is available
> from the vendor to try.

:-(

Can you use current mainline? It's hard to say anything about an 8-year-old vendor U-Boot
version ...

> The patches include the init eba/wl swap.

What do you mean here?

> A more detailed log with debugging available follows:
>
> UBI: fixable bit-flip detected at PEB 419
> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 19, torture 0
> UBI DBG: erase_worker: erase PEB 419 EC 19
> UBI DBG: sync_erase: erase PEB 419, old EC 19
> UBI DBG: do_sync_erase: erase PEB 419
> UBI DBG: sync_erase: erased PEB 419, new EC 20
> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
> UBI DBG: ensure_wear_leveling: schedule scrubbing
> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
> UBI: fixable bit-flip detected at PEB 420
> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
> UBI: fixable bit-flip detected at PEB 419
> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 20, torture 0
> UBI DBG: erase_worker: erase PEB 419 EC 20
> UBI DBG: sync_erase: erase PEB 419, old EC 20
> UBI DBG: do_sync_erase: erase PEB 419
> UBI DBG: sync_erase: erased PEB 419, new EC 21
> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
> UBI DBG: ensure_wear_leveling: schedule scrubbing
> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
> UBI: fixable bit-flip detected at PEB 420
> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
> UBI: fixable bit-flip detected at PEB 419
>
> Investigation showed that a read with correctable bit errors returned -EUCLEAN to the UBI read
> function.
>
> I read https://lists.denx.de/pipermail/u-boot/2013-September/161961.html, which details a
> workaround: do not return EUCLEAN from the NAND reader unless the number of bits fixed during the
> read exceeds 75% of the total number of correctable bits. This was implemented in this version of
> UBI in U-Boot 2010.06, and it does hide the infinite bit-flip loop, since this is new NAND flash.
> The original 2010.06 implementation returns EUCLEAN for any number of fixable bit flips and thus
> causes the PEB to be moved to the best free one (scrub mode in wear_leveling_worker).
>
> This is not a root-cause fix, though. Investigating further led to the following root-cause
> solution. The following is AFAICT.
>
> The scrubber chooses the PEB to move the data to from the free balanced tree. This tree is sorted
> by EC (erase count) and then by PEB number.
>
> The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF, which is 8192 in this config. So
> the find_wl_entry function will find a PEB that is better in erase count than the current PEB's EC.
> This can easily cause it to find the PEB that was just moved from if it is the lowest-numbered PEB
> in the free tree. Waiting for the EC to go above 8192 would take a long time and cause premature
> aging of the flash PEBs in question.
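
The selection behaviour described above can be sketched in C. This is a simplified model, not the vendor or mainline code: `struct free_peb` and `pick_free` are hypothetical names, the free tree is replaced by an array sorted by (EC, PEB number), and the rule assumed here is that the search returns the last entry whose EC is still below the lowest free EC plus `max_diff`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an entry of the free tree. */
struct free_peb {
    int pnum; /* physical eraseblock number */
    int ec;   /* erase counter */
};

/*
 * Assumed selection rule: `pool` is sorted ascending by (ec, pnum).
 * Return the last entry whose EC is below (lowest EC + max_diff);
 * if no entry qualifies, fall back to the first (lowest-EC) entry.
 */
static const struct free_peb *pick_free(const struct free_peb *pool,
                                        size_t n, int max_diff)
{
    assert(n > 0);

    const struct free_peb *best = &pool[0];
    int limit = pool[0].ec + max_diff;

    for (size_t i = 0; i < n; i++) {
        if (pool[i].ec < limit)
            best = &pool[i];
    }
    return best;
}
```

Under these assumptions, a pool of {PEB 100, EC 19}, {PEB 101, EC 19}, {PEB 419, EC 21} gives PEB 419 for a `max_diff` of 8192, that is, the block that was just erased and had its EC incremented, while a `max_diff` of 0 gives PEB 100.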
>
> The easy solution is to change the max parameter to this call to 0 so it finds a PEB with a smaller
> EC than the one being replaced. This means it won't use the previously discarded PEB as its first
> choice.

I am not sure if it is so easy ...

> This fix was implemented and fixable bit-flip errors no longer hang/freeze the boot process! UBI
> erase and reformat was used between re-tests to get consistent results.
>
> Adding the above 75% correctable bit-flip threshold is also a good thing, as less movement will
> ensue when the flash is new, but as the flash ages, the root cause will once again be invoked,
> causing unrecoverable boot failures.
>
> Note this fault is also in the latest kernel drivers for UBI and may also exist in other wear
> leveling implementations. The kernel driver issue may be at fault for Android devices locking
> up/freezing sporadically during flash read when scrubbing, due to a relatively full flash and
> correctable errors causing ping-pong PEB moves.
>
> The question is, is my root cause solution sound or have I missed something?

I have to think about it before I write nonsense, but maybe Richard has deeper insight here.

> I know an algo change would probably be better or a way to detect move loops to prevent this from
> occurring, but this solution does work on all the devices that were failing manufacture tests
> previously.

bye,
Heiko
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-52 Fax: +49-8142-66989-80 Email: hs at denx.de