From: Heiko Schocher
Date: Thu, 12 Jul 2018 10:08:16 +0200
Subject: [U-Boot] UBI fixable bit-flip issue
References: <6079a07f-b819-ed81-6d2c-58ae38629595@denx.de>
To: u-boot@lists.denx.de

Hello Mark,

On 12.07.2018 at 07:38, Mark Spieth wrote:
>
> On 12/07/18 15:22, Heiko Schocher wrote:
>> Hello Mark,
>>
>> added Richard Weinberger to cc...
>>
>> On 12.07.2018 at 02:28, Mark Spieth wrote:
>>> Hi
>>>
>>> In the process of investigating a boot failure on one of our devices, the
>>>
>>> UBI: fixable bit-flip detected at PEB
>>>
>>> message was seen with the following behaviour during kernel load in U-Boot.
>>>
>>> Read [2285568] bytes
>>> UBI: fixable bit-flip detected at PEB 415
>>> UBI: schedule PEB 415 for scrubbing
>>> UBI: fixable bit-flip detected at PEB 415
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: schedule PEB 419 for scrubbing
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: schedule PEB 420 for scrubbing
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>>
>>> This repeats until reset.
>>>
>>> U-Boot is a patched version of 2010.06 supplied by the chip vendor. No newer version is available
>>> from the vendor to try.
>>
>> :-(
>>
>> Can you use current mainline ?
>> It's hard to say something
>> about an 8 year old vendor U-Boot version ...
> I know. I did look at the current 2018.07 and 2014.10 as comparison.
>
> There are many patches applied by the vendor so porting them with the large changes to driver
> structure would be difficult and time consuming.
> The vendor is Lantiq and the SDK is current (this year).
>>
>>> The patches include the init eba/wl swap.
>>
>> What do you mean here?
> https://lists.denx.de/pipermail/u-boot/2013-January/143199.html
> This patch was already applied by the vendor.
>
> ubi_eba_init_scan() must be initialised before ubi_wl_init_scan() and in that baseline they were the
> wrong way around.
>
> There is only 1 other message chain for fixable bit flips (2011) and that was not useful for this
> problem.
>>
>>> A more detailed log with debugging available follows:
>>>
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 19, torture 0
>>> UBI DBG: erase_worker: erase PEB 419 EC 19
>>> UBI DBG: sync_erase: erase PEB 419, old EC 19
>>> UBI DBG: do_sync_erase: erase PEB 419
>>> UBI DBG: sync_erase: erased PEB 419, new EC 20
>>> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
>>> UBI DBG: ensure_wear_leveling: schedule scrubbing
>>> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
>>> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
>>> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
>>> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 20, torture 0
>>> UBI DBG: erase_worker: erase PEB 419 EC 20
>>> UBI DBG: sync_erase: erase PEB 419, old EC 20
>>> UBI DBG: do_sync_erase: erase PEB 419
>>> UBI DBG: sync_erase: erased PEB 419, new EC 21
>>> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
>>> UBI DBG: ensure_wear_leveling: schedule scrubbing
>>> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
>>> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
>>> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
>>> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
>>> UBI: fixable bit-flip detected at PEB 419
>>>
>>> Investigation showed that a read with correctable bit errors was done, returning -EUCLEAN to the
>>> ubi read function.
>>>
>>> I read https://lists.denx.de/pipermail/u-boot/2013-September/161961.html which details a
>>> workaround: do not return EUCLEAN from the NAND reader unless the number of fixed bits
>>> exceeded 75% of the total number of correctable bits during the read.
>>> This was implemented
>>> in this version of ubi in U-Boot 2010.06 and it does hide the infinite bit-flip issue since this
>>> is new NAND FLASH. The original 2010.06 implementation returns EUCLEAN for any number of fixable
>>> bit flips and thus causes the PEB to be moved to the best free one (scrub mode in wear_leveling_worker).
>>>
>>> This fix is not a root cause fix though. Investigating further led to the following root cause
>>> solution. The following is AFAICT.
>>>
>>> When the scrubber chooses a PEB to move to, it takes it from the free balanced tree. This tree is sorted by
>>> EC (erase count) and then by PEB number.
>>>
>>> The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF which is 8192 in this config. So
>>> the find_wl_entry function will find a PEB that is better in erase count than the current PEB EC.
>>> This can easily cause it to find the PEB that was just moved from if it is the lowest numbered
>>> PEB in the free tree. Waiting for EC to go above 8192 would take a long time and cause premature
>>> aging of the flash PEBs in question.
>>>
>>> The easy solution is to change the max parameter to this call to 0 so it finds a PEB with a
>>> smaller EC than the one being replaced. This means it won't use the previously discarded PEB as
>>> its first choice.
>>
>> I am not sure if it is so easy ...
> This is why I'm asking :-)
>>
>>> This fix was implemented and fixable bit-flip errors no longer hang/freeze the boot process! UBI
>>> erase and reformat was used between re-tests to get consistent results.
>>>
>>> Adding the above 75% correctable bitflip threshold is also a good thing as less movement will
>>> ensue when the FLASH is new, but as the flash ages, the root cause will once again be invoked
>>> causing un-recoverable boot failures.
>>>
>>> Note this fault is also in the latest kernel drivers for UBI and may also exist in other wear
>>> leveling implementations.
>>> The kernel driver issue may be at fault for Android devices locking
>>> up/freezing sporadically during FLASH read when scrubbing due to a relatively full flash and
>>> correctable errors causing ping pong PEB moves.
>>>
>>> The question is, is my root cause solution sound or have I missed something?
>>
>> I have to think about it before I write nonsense, but maybe Richard has
>> a deeper insight here.
>>
>>> I know an algo change would probably be better or a way to detect move loops to prevent this from
>>> occurring, but this solution does work on all the devices that were failing manufacture tests
>>> previously.
>>
> Is there another message board that deals with the mtd ubi driver specifically?

Yes of course ...

bye,
Heiko
-- 
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-52 Fax: +49-8142-66989-80 Email: hs at denx.de
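
As an aside to the find_wl_entry discussion quoted above, the ping-pong can be reproduced with a small toy model. This is only a sketch in Python, not the UBI code. It assumes that find_wl_entry picks the free entry with the highest (EC, PEB) key whose EC is still below "lowest free EC + max", falling back to the lowest entry when none qualifies. It also assumes that a PEB which reads back with a fixable bit-flip is erased (EC + 1) and returned to the free tree. WL_FREE_MAX_DIFF = 8192 and the PEB/EC numbers come from the thread; scrub(), flaky and max_attempts are hypothetical names used only for illustration.

```python
WL_FREE_MAX_DIFF = 8192  # value quoted in the thread for this config


def find_wl_entry(free, max_diff):
    """Toy stand-in for the selection rule (an assumption, see the text above).

    `free` holds (ec, peb) tuples, ordered by EC and then by PEB number.
    Returns the highest entry whose EC is below lowest_ec + max_diff,
    or the lowest entry if no entry satisfies that limit.
    """
    entries = sorted(free)
    limit = entries[0][0] + max_diff
    below = [entry for entry in entries if entry[0] < limit]
    return below[-1] if below else entries[0]


def scrub(free, flaky, max_diff, max_attempts=10):
    """Try to move one LEB to a free PEB and return the targets tried.

    `flaky` is the set of PEBs assumed to always read back with a
    fixable bit-flip. Such a target is erased (EC + 1) and put back
    into the free set, then the selection is repeated.
    """
    free = set(free)  # work on a copy
    tried = []
    for _ in range(max_attempts):
        ec, peb = find_wl_entry(free, max_diff)
        tried.append(peb)
        free.remove((ec, peb))
        if peb not in flaky:
            return tried  # move succeeded
        free.add((ec + 1, peb))  # erased and returned to the free set
    return tried  # gave up after max_attempts


free_pebs = {(19, 100), (19, 200), (20, 419)}
print(scrub(free_pebs, flaky={419}, max_diff=WL_FREE_MAX_DIFF))  # PEB 419 ten times
print(scrub(free_pebs, flaky={419}, max_diff=0))                 # [100]
```

Under these assumptions the just-erased PEB always has the highest erase count, so a large max keeps selecting it, while max = 0 selects the lowest-EC entry and the loop ends after at most one retry. Whether the real driver behaves this way on a given version would need to be checked against its wl.c.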