* [U-Boot] UBI fixable bit-flip issue
@ 2018-07-12  0:28 Mark Spieth
  2018-07-12  5:22 ` Heiko Schocher
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Spieth @ 2018-07-12  0:28 UTC (permalink / raw)
  To: u-boot

Hi

In the process of investigating a boot failure on one of our devices, the

UBI: fixable bit-flip detected at PEB

message was seen, with the following behaviour during kernel load in U-Boot.

Read [2285568] bytes
UBI: fixable bit-flip detected at PEB 415
UBI: schedule PEB 415 for scrubbing
UBI: fixable bit-flip detected at PEB 415
UBI: fixable bit-flip detected at PEB 419
UBI: schedule PEB 419 for scrubbing
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: schedule PEB 420 for scrubbing
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419

This repeats until reset.

U-Boot is a patched version of 2010.06 supplied by the chip vendor. No
newer version is available from the vendor to try.

The patches include the init eba/wl swap.

A more detailed log with debugging available follows:

UBI: fixable bit-flip detected at PEB 419
UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 19, torture 0
UBI DBG: erase_worker: erase PEB 419 EC 19
UBI DBG: sync_erase: erase PEB 419, old EC 19
UBI DBG: do_sync_erase: erase PEB 419
UBI DBG: sync_erase: erased PEB 419, new EC 20
UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
UBI DBG: ensure_wear_leveling: schedule scrubbing
UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
UBI: fixable bit-flip detected at PEB 420
UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
UBI: fixable bit-flip detected at PEB 419
UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 20, torture 0
UBI DBG: erase_worker: erase PEB 419 EC 20
UBI DBG: sync_erase: erase PEB 419, old EC 20
UBI DBG: do_sync_erase: erase PEB 419
UBI DBG: sync_erase: erased PEB 419, new EC 21
UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
UBI DBG: ensure_wear_leveling: schedule scrubbing
UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
UBI: fixable bit-flip detected at PEB 420
UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
UBI: fixable bit-flip detected at PEB 419

Investigation showed that a read that hit correctable bit errors was
returning -EUCLEAN to the UBI read function.

I then read
https://lists.denx.de/pipermail/u-boot/2013-September/161961.html, which
details a workaround: do not return EUCLEAN from the NAND reader unless
the number of corrected bits in the read exceeds 75% of the total number
of correctable bits. This was implemented in this version of UBI in
U-Boot 2010.06, and it does hide the infinite bit-flip loop, since this
is new NAND flash. The original 2010.06 implementation returns EUCLEAN
for any number of fixable bit flips and thus moves the PEB to the best
free one (scrub mode in wear_leveling_worker).
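
To make the threshold idea concrete, here is a minimal sketch of such a
check. It is not the code from the linked patch or from the vendor tree:
the helper names scrub_threshold() and read_status() are made up for
illustration, and treating "75%" as "rounded up, and reached or exceeded"
is an assumption.

```c
#include <assert.h>
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value; defined here only if the host lacks it */
#endif

/*
 * Hypothetical helper: smallest number of corrected bits in one ECC
 * step that should trigger scrubbing, taken as 75% of the ECC
 * strength, rounded up.
 */
static unsigned int scrub_threshold(unsigned int ecc_strength)
{
	return (ecc_strength * 3 + 3) / 4;
}

/*
 * Hypothetical helper: map the worst per-step bit-flip count of a read
 * to a return code. Returns 0 for clean or mildly corrected data and
 * -EUCLEAN once the threshold is reached.
 */
static int read_status(unsigned int max_bitflips, unsigned int ecc_strength)
{
	assert(ecc_strength > 0);

	if (max_bitflips >= scrub_threshold(ecc_strength))
		return -EUCLEAN;

	return 0;
}
```

With 4-bit ECC this reports a single corrected bit as a clean read and
only asks for scrubbing from three corrected bits upwards.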

This is not a root cause fix, though. Investigating further led to the
following root cause solution. The following is AFAICT.

When the scrubber chooses a PEB to move the data to, it takes it from
the free balanced tree. This tree is sorted by EC (erase count) and then
by PEB number.

The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF, which
is 8192 in this config. So the find_wl_entry function will find a PEB
that is better in error count than the current PEB EC. This can easily
cause it to find the PEB that was just moved from if it is the lowest
numbered PEB in the free tree. Waiting for EC to go above 8192 would
take a long time and cause premature aging of the flash PEBs in
question.

The easy solution is to change the max parameter of this call to 0 so it
finds a PEB with a smaller EC than the one being replaced. This means it
won't use the previously discarded PEB as its first choice.
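
The effect of the max parameter can be modelled in a few lines. This is
only a sketch: pick_free() is a hypothetical stand-in that walks a sorted
array instead of the RB-tree, and its selection rule (take the last entry
whose EC is below lowest EC + max, else the first entry) is my reading of
find_wl_entry, not a copy of it. The two EC 19 entries are invented; PEB
419 with EC 21 is taken from the debug log above.

```c
#include <assert.h>
#include <stddef.h>

struct wl_entry {
	int ec;   /* erase counter */
	int pnum; /* physical eraseblock number */
};

/* Example free list, sorted by EC and then by PEB number. */
static const struct wl_entry example_free[] = {
	{ 19, 300 }, /* hypothetical */
	{ 19, 301 }, /* hypothetical */
	{ 21, 419 }, /* just erased again, so its EC is the highest */
};

/*
 * Simplified model of the free-PEB selection: return the last entry
 * whose EC is below (lowest EC + max_diff), or the first entry if no
 * other entry qualifies.
 */
static const struct wl_entry *pick_free(const struct wl_entry *free_list,
					size_t n, int max_diff)
{
	const struct wl_entry *best;
	int limit;
	size_t i;

	assert(n > 0);

	best = &free_list[0];
	limit = free_list[0].ec + max_diff;

	for (i = 1; i < n; i++)
		if (free_list[i].ec < limit)
			best = &free_list[i];

	return best;
}
```

In this model pick_free(example_free, 3, 8192) returns PEB 419 again,
while pick_free(example_free, 3, 0) returns PEB 300, the entry with the
lowest EC.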

This fix was implemented and fixable bit-flip errors no longer 
hang/freeze the boot process! UBI erase and reformat was used between 
re-tests to get consistent results.

Adding the above 75% correctable bit-flip threshold is also a good
thing, as less movement will ensue when the flash is new, but as the
flash ages, the root cause will once again be triggered, causing
unrecoverable boot failures.

Note that this fault is also in the latest kernel drivers for UBI and
may also exist in other wear-leveling implementations. The kernel driver
issue may be at fault for Android devices locking up/freezing
sporadically during flash reads when scrubbing, due to a relatively full
flash and correctable errors causing ping-pong PEB moves.

The question is, is my root cause solution sound or have I missed something?

I know an algorithm change would probably be better, or a way to detect
move loops to prevent this from occurring, but this solution does work
on all the devices that were previously failing manufacturing tests.
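
For what it is worth, a move-loop detector could be as small as the
sketch below. Nothing like this exists in the code I looked at: struct
peb_move, is_move_loop() and the idea of keeping a short history of scrub
moves are all hypothetical. The logged_moves entries mirror the two
"scrub PEB 420 to PEB 419" lines in the debug log; other_moves is
invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One scrub move: data copied from PEB 'from' to PEB 'to'. */
struct peb_move {
	int from;
	int to;
};

/* The two moves seen in the debug log. */
static const struct peb_move logged_moves[] = {
	{ 420, 419 },
	{ 420, 419 },
};

/* A hypothetical history without a repeat. */
static const struct peb_move other_moves[] = {
	{ 415, 300 },
	{ 420, 419 },
};

/*
 * Return true if the last 'limit' moves in 'hist' are all the same
 * (from, to) pair, i.e. the scrubber keeps retrying one move.
 */
static bool is_move_loop(const struct peb_move *hist, size_t n, size_t limit)
{
	const struct peb_move *last;
	size_t i;

	assert(limit > 0);

	if (n < limit)
		return false;

	last = &hist[n - 1];
	for (i = n - limit; i < n; i++)
		if (hist[i].from != last->from || hist[i].to != last->to)
			return false;

	return true;
}
```

A caller could then skip the candidate target and pick another free PEB
once is_move_loop() reports a repeat.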

Regards

Mark


* [U-Boot] UBI fixable bit-flip issue
  2018-07-12  0:28 [U-Boot] UBI fixable bit-flip issue Mark Spieth
@ 2018-07-12  5:22 ` Heiko Schocher
  2018-07-12  5:38   ` Mark Spieth
  2018-07-12  8:46   ` Richard Weinberger
  0 siblings, 2 replies; 13+ messages in thread
From: Heiko Schocher @ 2018-07-12  5:22 UTC (permalink / raw)
  To: u-boot

Hello Mark,

added Richard Weinberger to cc...

On 12.07.2018 at 02:28, Mark Spieth wrote:
> Hi
> 
> In the process of investigating a boot failure on one of our devices, the
> 
> UBI: fixable bit-flip detected at PEB
> 
> message was seen with the following behaviour during kernel load in u-boot.
> 
> Read [2285568] bytes
> UBI: fixable bit-flip detected at PEB 415
> UBI: schedule PEB 415 for scrubbing
> UBI: fixable bit-flip detected at PEB 415
> UBI: fixable bit-flip detected at PEB 419
> UBI: schedule PEB 419 for scrubbing
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: schedule PEB 420 for scrubbing
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> 
> This repeats until reset.
> 
> U boot is a patched version of 2010.06 supplied by the chip vendor. No newer version is available 
> from the vendor to try.

:-(

Can you use current mainline? It's hard to say anything
about an 8-year-old vendor U-Boot version ...

> The patches include the init eba/wl swap.

What do you mean here?

> A more detailed log with debugging available follows:
> 
> UBI: fixable bit-flip detected at PEB 419
> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 19, torture 0
> UBI DBG: erase_worker: erase PEB 419 EC 19
> UBI DBG: sync_erase: erase PEB 419, old EC 19
> UBI DBG: do_sync_erase: erase PEB 419
> UBI DBG: sync_erase: erased PEB 419, new EC 20
> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
> UBI DBG: ensure_wear_leveling: schedule scrubbing
> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
> UBI: fixable bit-flip detected at PEB 420
> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
> UBI: fixable bit-flip detected at PEB 419
> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 20, torture 0
> UBI DBG: erase_worker: erase PEB 419 EC 20
> UBI DBG: sync_erase: erase PEB 419, old EC 20
> UBI DBG: do_sync_erase: erase PEB 419
> UBI DBG: sync_erase: erased PEB 419, new EC 21
> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
> UBI DBG: ensure_wear_leveling: schedule scrubbing
> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
> UBI: fixable bit-flip detected at PEB 420
> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
> UBI: fixable bit-flip detected at PEB 419
> 
> Investigation showed that a read with correctable bit errors was done returning -EUCLEAN to the ubi 
> read function.
> 
> Having read https://lists.denx.de/pipermail/u-boot/2013-September/161961.html which details a 
> workaround to not return EUCLEAN from the NAND reader unless the number of fixed bits returned was 
> 75% of the total number of correctable bits was exceeded during the read. This was impleneted in 
> this version of ubi in uboot 2010.06 and it does hide the bit-flip infinite issue since this is new 
> NAND FLASH. The original 2010.06 implementation returns EUCLEAN for any number of fixable bit flips 
> and thus causes the PEB move to the best free one (scrub mode in wear_leveling_worker).
> 
> This fix is not a root cause fix though. Investigating further led to the following root cause 
> solution. The following is AFAICT.
> 
> When the scrubber chooses a PEB to move the from the free balanced tree. This tree is sorted by EC 
> (erase count) and then by PEB number.
> 
> The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF which is 8192 in this config. So the 
> find_wl_entry function will find a PEB that is better in error count that the current PEB EC. This 
> can easily cause it to find the PEB that was just moved from if it is the lowest numbered PEB in the 
> free tree. Waiting for EC to go above 8192 would take a long time and cause premature aging of the 
> flash PEBs in question.
> 
> The easy solution is to change the max parameter to this call to 0 so it finds a PEB with a smaller 
> EC than the one being replaced. This means it wont use the previously discarded PEB as its first 
> choice.

  I am not sure if it is so easy ...

> This fix was implemented and fixable bit-flip errors no longer hang/freeze the boot process! UBI 
> erase and reformat was used between re-tests to get consistent results.
> 
> Adding the above 75% correctable bitflip threshold is also a good thing as less movement will ensue 
> when the FLASH is new, but as the flash ages, the root cause will once again be invoked causing 
> un-recoverable boot failures.
> 
> Note this fault is also in the latest kernel drivers for UBI and may also exist in other wear 
> leveling implementations. The kernel driver issue may be at fault for android devices locking 
> up/freezing sporadically during FLASH read when scrubbing due to a relatively full flash and 
> correctable errors causing ping pong PEB moves.
> 
> The question is, is my root cause solution sound or have I missed something?

I have to think about it before I write nonsense, but maybe Richard
has deeper insight here.

> I know an algo change would probably be better or a way to detect move loops to prevent this from 
> occurring, but this solution does work on all the devices that were failing manufacture tests 
> previously.

bye,
Heiko
-- 
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs at denx.de


* [U-Boot] UBI fixable bit-flip issue
  2018-07-12  5:22 ` Heiko Schocher
@ 2018-07-12  5:38   ` Mark Spieth
  2018-07-12  8:08     ` Heiko Schocher
  2018-07-12  8:46   ` Richard Weinberger
  1 sibling, 1 reply; 13+ messages in thread
From: Mark Spieth @ 2018-07-12  5:38 UTC (permalink / raw)
  To: u-boot


On 12/07/18 15:22, Heiko Schocher wrote:
> Hello Mark,
>
> added Richard Weinberger to cc...
>
> On 12.07.2018 at 02:28, Mark Spieth wrote:
>> Hi
>>
>> In the process of investigating a boot failure on one of our devices, 
>> the
>>
>> UBI: fixable bit-flip detected at PEB
>>
>> message was seen with the following behaviour during kernel load in 
>> u-boot.
>>
>> Read [2285568] bytes
>> UBI: fixable bit-flip detected at PEB 415
>> UBI: schedule PEB 415 for scrubbing
>> UBI: fixable bit-flip detected at PEB 415
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: schedule PEB 419 for scrubbing
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: schedule PEB 420 for scrubbing
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>>
>> This repeats until reset.
>>
>> U boot is a patched version of 2010.06 supplied by the chip vendor. 
>> No newer version is available from the vendor to try.
>
> :-(
>
> Can you use current mainline ? It s hard to say something
> about a 8 year old vendor U-Boot version ...
I know. I did look at the current 2018.07 and 2014.10 for comparison.

There are many patches applied by the vendor, so porting them across the
large changes to the driver structure would be difficult and time-consuming.
The vendor is Lantiq and the SDK is current (this year).
>
>> The patches include the init eba/wl swap.
>
> What do you mean here?
https://lists.denx.de/pipermail/u-boot/2013-January/143199.html
This patch was already applied by the vendor.

ubi_eba_init_scan() must be called before ubi_wl_init_scan(), and in
that baseline they were the wrong way around.

There is only one other thread about fixable bit flips (2011), and it
was not useful for this problem.
>
>> A more detailed log with debugging available follows:
>>
>> UBI: fixable bit-flip detected at PEB 419
>> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 19, torture 0
>> UBI DBG: erase_worker: erase PEB 419 EC 19
>> UBI DBG: sync_erase: erase PEB 419, old EC 19
>> UBI DBG: do_sync_erase: erase PEB 419
>> UBI DBG: sync_erase: erased PEB 419, new EC 20
>> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
>> UBI DBG: ensure_wear_leveling: schedule scrubbing
>> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
>> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
>> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
>> UBI: fixable bit-flip detected at PEB 420
>> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
>> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
>> UBI: fixable bit-flip detected at PEB 419
>> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 20, torture 0
>> UBI DBG: erase_worker: erase PEB 419 EC 20
>> UBI DBG: sync_erase: erase PEB 419, old EC 20
>> UBI DBG: do_sync_erase: erase PEB 419
>> UBI DBG: sync_erase: erased PEB 419, new EC 21
>> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
>> UBI DBG: ensure_wear_leveling: schedule scrubbing
>> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
>> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
>> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
>> UBI: fixable bit-flip detected at PEB 420
>> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
>> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
>> UBI: fixable bit-flip detected at PEB 419
>>
>> Investigation showed that a read with correctable bit errors was done 
>> returning -EUCLEAN to the ubi read function.
>>
>> Having read 
>> https://lists.denx.de/pipermail/u-boot/2013-September/161961.html 
>> which details a workaround to not return EUCLEAN from the NAND reader 
>> unless the number of fixed bits returned was 75% of the total number 
>> of correctable bits was exceeded during the read. This was impleneted 
>> in this version of ubi in uboot 2010.06 and it does hide the bit-flip 
>> infinite issue since this is new NAND FLASH. The original 2010.06 
>> implementation returns EUCLEAN for any number of fixable bit flips 
>> and thus causes the PEB move to the best free one (scrub mode in 
>> wear_leveling_worker).
>>
>> This fix is not a root cause fix though. Investigating further led to 
>> the following root cause solution. The following is AFAICT.
>>
>> When the scrubber chooses a PEB to move the from the free balanced 
>> tree. This tree is sorted by EC (erase count) and then by PEB number.
>>
>> The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF which 
>> is 8192 in this config. So the find_wl_entry function will find a PEB 
>> that is better in error count that the current PEB EC. This can 
>> easily cause it to find the PEB that was just moved from if it is the 
>> lowest numbered PEB in the free tree. Waiting for EC to go above 8192 
>> would take a long time and cause premature aging of the flash PEBs in 
>> question.
>>
>> The easy solution is to change the max parameter to this call to 0 so 
>> it finds a PEB with a smaller EC than the one being replaced. This 
>> means it wont use the previously discarded PEB as its first choice.
>
>  I am not sure if it is so easy ...
This is why I'm asking :-)
>
>> This fix was implemented and fixable bit-flip errors no longer 
>> hang/freeze the boot process! UBI erase and reformat was used between 
>> re-tests to get consistent results.
>>
>> Adding the above 75% correctable bitflip threshold is also a good 
>> thing as less movement will ensue when the FLASH is new, but as the 
>> flash ages, the root cause will once again be invoked causing 
>> un-recoverable boot failures.
>>
>> Note this fault is also in the latest kernel drivers for UBI and may 
>> also exist in other wear leveling implementations. The kernel driver 
>> issue may be at fault for android devices locking up/freezing 
>> sporadically during FLASH read when scrubbing due to a relatively 
>> full flash and correctable errors causing ping pong PEB moves.
>>
>> The question is, is my root cause solution sound or have I missed 
>> something?
>
> I have to think about, before I write nonsene, but may Richard has
> here a deeper insight.
>
>> I know an algo change would probably be better or a way to detect 
>> move loops to prevent this from occurring, but this solution does 
>> work on all the devices that were failing manufacture tests previously.
>
Is there another message board that deals with the MTD UBI driver
specifically?

Thanks
Mark


* [U-Boot] UBI fixable bit-flip issue
  2018-07-12  5:38   ` Mark Spieth
@ 2018-07-12  8:08     ` Heiko Schocher
  0 siblings, 0 replies; 13+ messages in thread
From: Heiko Schocher @ 2018-07-12  8:08 UTC (permalink / raw)
  To: u-boot

Hello Mark,

On 12.07.2018 at 07:38, Mark Spieth wrote:
> 
> On 12/07/18 15:22, Heiko Schocher wrote:
>> Hello Mark,
>>
>> added Richard Weinberger to cc...
>>
>> On 12.07.2018 at 02:28, Mark Spieth wrote:
>>> Hi
>>>
>>> In the process of investigating a boot failure on one of our devices, the
>>>
>>> UBI: fixable bit-flip detected at PEB
>>>
>>> message was seen with the following behaviour during kernel load in u-boot.
>>>
>>> Read [2285568] bytes
>>> UBI: fixable bit-flip detected at PEB 415
>>> UBI: schedule PEB 415 for scrubbing
>>> UBI: fixable bit-flip detected at PEB 415
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: schedule PEB 419 for scrubbing
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: schedule PEB 420 for scrubbing
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>>
>>> This repeats until reset.
>>>
>>> U boot is a patched version of 2010.06 supplied by the chip vendor. No newer version is available 
>>> from the vendor to try.
>>
>> :-(
>>
>> Can you use current mainline ? It s hard to say something
>> about a 8 year old vendor U-Boot version ...
> I know. I did look at the current 2018.07 and 2014.10 as comparison.
> 
> There are many patches applied by the vendor so porting them with the large changes to driver 
> structure would be difficult and time consuming.
> The vendor is Lantiq and the SDK is current (this year).
>>
>>> The patches include the init eba/wl swap.
>>
>> What do you mean here?
> https://lists.denx.de/pipermail/u-boot/2013-January/143199.html
> This patch was already applied by the vendor.
> 
> ubi_eba_init_scan() must be initialised before ubi_wl_init_scan() and in that baseline they were the 
> wrong way around.
> 
> There is only 1 other message chain for fixable bit flips (2011) and that was not useful for this 
> problem.
>>
>>> A more detailed log with debugging available follows:
>>>
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 19, torture 0
>>> UBI DBG: erase_worker: erase PEB 419 EC 19
>>> UBI DBG: sync_erase: erase PEB 419, old EC 19
>>> UBI DBG: do_sync_erase: erase PEB 419
>>> UBI DBG: sync_erase: erased PEB 419, new EC 20
>>> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
>>> UBI DBG: ensure_wear_leveling: schedule scrubbing
>>> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
>>> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
>>> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
>>> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 20, torture 0
>>> UBI DBG: erase_worker: erase PEB 419 EC 20
>>> UBI DBG: sync_erase: erase PEB 419, old EC 20
>>> UBI DBG: do_sync_erase: erase PEB 419
>>> UBI DBG: sync_erase: erased PEB 419, new EC 21
>>> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
>>> UBI DBG: ensure_wear_leveling: schedule scrubbing
>>> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
>>> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
>>> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
>>> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
>>> UBI: fixable bit-flip detected at PEB 419
>>>
>>> Investigation showed that a read with correctable bit errors was done returning -EUCLEAN to the 
>>> ubi read function.
>>>
>>> Having read https://lists.denx.de/pipermail/u-boot/2013-September/161961.html which details a 
>>> workaround to not return EUCLEAN from the NAND reader unless the number of fixed bits returned 
>>> was 75% of the total number of correctable bits was exceeded during the read. This was impleneted 
>>> in this version of ubi in uboot 2010.06 and it does hide the bit-flip infinite issue since this 
>>> is new NAND FLASH. The original 2010.06 implementation returns EUCLEAN for any number of fixable 
>>> bit flips and thus causes the PEB move to the best free one (scrub mode in wear_leveling_worker).
>>>
>>> This fix is not a root cause fix though. Investigating further led to the following root cause 
>>> solution. The following is AFAICT.
>>>
>>> When the scrubber chooses a PEB to move the from the free balanced tree. This tree is sorted by 
>>> EC (erase count) and then by PEB number.
>>>
>>> The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF which is 8192 in this config. So 
>>> the find_wl_entry function will find a PEB that is better in error count that the current PEB EC. 
>>> This can easily cause it to find the PEB that was just moved from if it is the lowest numbered 
>>> PEB in the free tree. Waiting for EC to go above 8192 would take a long time and cause premature 
>>> aging of the flash PEBs in question.
>>>
>>> The easy solution is to change the max parameter to this call to 0 so it finds a PEB with a 
>>> smaller EC than the one being replaced. This means it wont use the previously discarded PEB as 
>>> its first choice.
>>
>>  I am not sure if it is so easy ...
> This is why I'm asking :-)
>>
>>> This fix was implemented and fixable bit-flip errors no longer hang/freeze the boot process! UBI 
>>> erase and reformat was used between re-tests to get consistent results.
>>>
>>> Adding the above 75% correctable bitflip threshold is also a good thing as less movement will 
>>> ensue when the FLASH is new, but as the flash ages, the root cause will once again be invoked 
>>> causing un-recoverable boot failures.
>>>
>>> Note this fault is also in the latest kernel drivers for UBI and may also exist in other wear 
>>> leveling implementations. The kernel driver issue may be at fault for android devices locking 
>>> up/freezing sporadically during FLASH read when scrubbing due to a relatively full flash and 
>>> correctable errors causing ping pong PEB moves.
>>>
>>> The question is, is my root cause solution sound or have I missed something?
>>
>> I have to think about, before I write nonsene, but may Richard has
>> here a deeper insight.
>>
>>> I know an algo change would probably be better or a way to detect move loops to prevent this from 
>>> occurring, but this solution does work on all the devices that were failing manufacture tests 
>>> previously.
>>
> Is there another message board that deal with the mtd ubi driver specifically?

Yes of course ...

bye,
Heiko
-- 
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs at denx.de


* [U-Boot] UBI fixable bit-flip issue
  2018-07-12  5:22 ` Heiko Schocher
  2018-07-12  5:38   ` Mark Spieth
@ 2018-07-12  8:46   ` Richard Weinberger
  2018-07-12  9:50     ` Mark Spieth
  2018-07-12 14:03     ` Mark Spieth
  1 sibling, 2 replies; 13+ messages in thread
From: Richard Weinberger @ 2018-07-12  8:46 UTC (permalink / raw)
  To: u-boot

Mark,

On Thursday, 12 July 2018, 07:22:13 CEST, Heiko Schocher wrote:
> Hello Mark,
> 
> added Richard Weinberger to cc...
> 
> On 12.07.2018 at 02:28, Mark Spieth wrote:
> > Hi
> > 
> > In the process of investigating a boot failure on one of our devices, the
> > 
> > UBI: fixable bit-flip detected at PEB
> > 
> > message was seen with the following behaviour during kernel load in u-boot.
> > 
> > Read [2285568] bytes
> > UBI: fixable bit-flip detected at PEB 415
> > UBI: schedule PEB 415 for scrubbing
> > UBI: fixable bit-flip detected at PEB 415
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: schedule PEB 419 for scrubbing
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: schedule PEB 420 for scrubbing
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > 
> > This repeats until reset.

Do you see the same symptom also on Linux?
We need to be very sure that it is actually a UBI problem.

> > This fix is not a root cause fix though. Investigating further led to the following root cause 
> > solution. The following is AFAICT.
> > 
> > When the scrubber chooses a PEB to move the data to, it takes one from the free balanced tree. 
> > This tree is sorted by EC (erase count) and then by PEB number.
> > 
> > The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF which is 8192 in this config. So the 
> > find_wl_entry function will find a PEB that is better in error count than the current PEB EC. This

error count? You mean erase count?
 
> > can easily cause it to find the PEB that was just moved from if it is the lowest numbered PEB in the 
> > free tree. Waiting for EC to go above 8192 would take a long time and cause premature aging of the 
> > flash PEBs in question.
> > 
> > The easy solution is to change the max parameter to this call to 0 so it finds a PEB with a smaller 
> > EC than the one being replaced. This means it won't use the previously discarded PEB as its first 
> > choice.

For scrubbing this might be a good idea, but not for regular wear-leveling.

See comment in UBI:
/*
 * When a physical eraseblock is moved, the WL sub-system has to pick the target
 * physical eraseblock to move to. The simplest way would be just to pick the
 * one with the highest erase counter. But in certain workloads this could lead
 * to an unlimited wear of one or few physical eraseblock. Indeed, imagine a
 * situation when the picked physical eraseblock is constantly erased after the
 * data is written to it. So, we have a constant which limits the highest erase
 * counter of the free physical eraseblock to pick. Namely, the WL sub-system
 * does not pick eraseblocks with erase counter greater than the lowest erase
 * counter plus %WL_FREE_MAX_DIFF.
 */
#define WL_FREE_MAX_DIFF (2*UBI_WL_THRESHOLD)

So we could change the logic such that for regular wear-leveling we keep using WL_FREE_MAX_DIFF,
but for scrubbing (which is 1:1 wear-leveling but the source PEB is showing bit-flips) we use
a lower value. IMHO WL_FREE_MAX_DIFF/2 would be a good choice.
I'm not sure whether 0 is too extreme and might cause other distortions.
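
A minimal sketch of that split, under stated assumptions: free_max_diff() and may_pick() are made-up helper names and not part of UBI, UBI_WL_THRESHOLD is taken as 4096 (which gives the 8192 Mark reports for his config), and the bound check only restates the rule from the comment above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value: 2 * 4096 matches the 8192 reported in this thread. */
#define UBI_WL_THRESHOLD 4096
#define WL_FREE_MAX_DIFF (2 * UBI_WL_THRESHOLD)

/*
 * Hypothetical helper: the limit used when picking a target PEB.
 * Regular wear-leveling keeps WL_FREE_MAX_DIFF, scrubbing uses half of it.
 */
static int free_max_diff(bool scrubbing)
{
	return scrubbing ? WL_FREE_MAX_DIFF / 2 : WL_FREE_MAX_DIFF;
}

/*
 * Rule from the UBI comment: do not pick a free PEB whose erase counter
 * is greater than the lowest free erase counter plus the limit.
 */
static bool may_pick(int ec, int lowest_free_ec, int max_diff)
{
	assert(max_diff >= 0);
	return ec <= lowest_free_ec + max_diff;
}
```

With these numbers a free PEB at EC 5000 would still be a valid target for regular wear-leveling when the lowest free EC is 10, but not for scrubbing.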

Mark, can you please file a patch and send it to linux-mtd mailing list?
Such a change needs to go through Linux and then to u-boot.
But first we need to think about and discuss it in detail.
 
>   I am not sure if it is so easy ...
>
> > This fix was implemented and fixable bit-flip errors no longer hang/freeze the boot process! UBI 
> > erase and reformat was used between re-tests to get consistent results.
> > 
> > Adding the above 75% correctable bitflip threshold is also a good thing as less movement will ensue 
> > when the FLASH is new, but as the flash ages, the root cause will once again be invoked causing 
> > un-recoverable boot failures.
> > 
> > Note this fault is also in the latest kernel drivers for UBI and may also exist in other wear 
> > leveling implementations. The kernel driver issue may be at fault for android devices locking 
> > up/freezing sporadically during FLASH read when scrubbing due to a relatively full flash and 
> > correctable errors causing ping pong PEB moves.
> > 
> > The question is, is my root cause solution sound or have I missed something?
> 
> I have to think about it before I write nonsense, but maybe Richard has
> a deeper insight here.

Please see my comments. :)

Thanks,
//richard


* [U-Boot] UBI fixable bit-flip issue
  2018-07-12  8:46   ` Richard Weinberger
@ 2018-07-12  9:50     ` Mark Spieth
  2018-07-12 14:03     ` Mark Spieth
  1 sibling, 0 replies; 13+ messages in thread
From: Mark Spieth @ 2018-07-12  9:50 UTC (permalink / raw)
  To: u-boot



On 12 July 2018 18:46:11 GMT+10:00, Richard Weinberger <richard@nod.at> wrote:
>Mark,
>
>On Thursday, 12 July 2018, 07:22:13 CEST, Heiko Schocher wrote:
>> Hello Mark,
>> 
>> added Richard Weinberger to cc...
>> 
>> On 12.07.2018 at 02:28, Mark Spieth wrote:
>> > Hi
>> > 
>> > In the process of investigating a boot failure on one of our
>devices, the
>> > 
>> > UBI: fixable bit-flip detected at PEB
>> > 
>> > message was seen with the following behaviour during kernel load in
>u-boot.
>> > 
>> > Read [2285568] bytes
>> > UBI: fixable bit-flip detected at PEB 415
>> > UBI: schedule PEB 415 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 415
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: schedule PEB 419 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: schedule PEB 420 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > 
>> > This repeats until reset.
>
>Do you see the same symptom also on Linux?
>We need to be very sure that it is actually a UBI problem.

The Linux kernel provided has an up-to-date mtd/ubi driver, so it already has the 75% bit-flip threshold, which hides the issue on new flash. So the two are not the same, and this is untested on Linux.

>
>> > This fix is not a root cause fix though. Investigating further led
>to the following root cause 
>> > solution. The following is AFAICT.
>> > 
>> > When the scrubber chooses a PEB to move the data to, it takes one from the free balanced
>tree. This tree is sorted by EC 
>> > (erase count) and then by PEB number.
>> > 
>> > The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF
>which is 8192 in this config. So the 
>> > find_wl_entry function will find a PEB that is better in error
>count than the current PEB EC. This
>
>error count? You mean erase count?

Yes of course.

> 
>> > can easily cause it to find the PEB that was just moved from if it
>is the lowest numbered PEB in the 
>> > free tree. Waiting for EC to go above 8192 would take a long time
>and cause premature aging of the 
>> > flash PEBs in question.
>> > 
>> > The easy solution is to change the max parameter to this call to 0
>so it finds a PEB with a smaller 
>> > EC than the one being replaced. This means it won't use the
>previously discarded PEB as its first 
>> > choice.
>
>For scrubbing this might be a good idea, but not for regular
>wear-leveling.
Yes, only for scrubbing, not wear leveling.
>
>See comment in UBI:
>/*
>* When a physical eraseblock is moved, the WL sub-system has to pick
>the target
>* physical eraseblock to move to. The simplest way would be just to
>pick the
>* one with the highest erase counter. But in certain workloads this
>could lead
>* to an unlimited wear of one or few physical eraseblock. Indeed,
>imagine a
>* situation when the picked physical eraseblock is constantly erased
>after the
>* data is written to it. So, we have a constant which limits the
>highest erase
>* counter of the free physical eraseblock to pick. Namely, the WL
>sub-system
>* does not pick eraseblocks with erase counter greater than the lowest
>erase
> * counter plus %WL_FREE_MAX_DIFF.
> */
>#define WL_FREE_MAX_DIFF (2*UBI_WL_THRESHOLD)
>
>So we could change the logic such that for regular wear-leveling we
>keep using WL_FREE_MAX_DIFF,
>but for scrubbing (which is 1:1 wear-leveling but the source PEB is
>showing bit-flips) we use
>a lower value. IMHO WL_FREE_MAX_DIFF/2 would be a good choice.
>I'm not sure whether 0 is too extreme and might cause other
>distortions.

Yes, the wear-leveling threshold is still WL_FREE_MAX_DIFF and the scrubbing threshold is 0.

This is why I'm asking. Because the two PEBs will track each other's EC, I'm not sure that will work.
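
To make the concern concrete, here is a small standalone C model. It is not the UBI code: struct peb, pick_target() and count_ping_pong() are made-up names, the selection rule is taken only from the UBI comment quoted above, and the model assumes the worst case where the moved data shows a fixable bit-flip again wherever it lands.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a wear-leveling entry; not the kernel's struct. */
struct peb {
	int pnum;	/* physical eraseblock number */
	int ec;		/* erase counter */
	int free;	/* non-zero while the PEB sits in the free pool */
};

/* Ordering of the free tree as described in the thread: EC, then PEB number. */
static int key_less(const struct peb *a, const struct peb *b)
{
	return a->ec < b->ec || (a->ec == b->ec && a->pnum < b->pnum);
}

/*
 * Pick the free PEB with the highest key whose EC does not exceed the
 * lowest free EC plus max_diff. Returns an index, or -1 if nothing is free.
 */
static int pick_target(const struct peb *pebs, size_t n, int max_diff)
{
	int lowest_ec = -1, best = -1;
	size_t i;

	for (i = 0; i < n; i++)
		if (pebs[i].free && (lowest_ec < 0 || pebs[i].ec < lowest_ec))
			lowest_ec = pebs[i].ec;

	for (i = 0; i < n; i++) {
		if (!pebs[i].free || pebs[i].ec > lowest_ec + max_diff)
			continue;
		if (best < 0 || key_less(&pebs[best], &pebs[i]))
			best = (int)i;
	}
	return best;
}

/*
 * Scrub the data `rounds` times, starting from the used PEB at index src.
 * Returns how many moves landed on the PEB vacated by the previous move.
 */
static int count_ping_pong(struct peb *pebs, size_t n, int src,
			   int max_diff, int rounds)
{
	int prev = -1, hits = 0, r;

	assert(src >= 0 && (size_t)src < n && !pebs[src].free);
	for (r = 0; r < rounds; r++) {
		int dst = pick_target(pebs, n, max_diff);

		if (dst < 0)
			break;
		if (dst == prev)
			hits++;
		pebs[dst].free = 0;	/* data now lives in dst */
		pebs[src].ec++;		/* old PEB is erased ... */
		pebs[src].free = 1;	/* ... and returned to the free pool */
		prev = src;
		src = dst;
	}
	return hits;
}
```

With four free PEBs at EC 2 plus PEBs 419 and 420 at EC 19, this model keeps bouncing the data between 419 and 420 for a limit of 8192 and also for 4096, because both erase counters rise together and stay far below the limit. A limit of 0 walks through the low-EC PEBs instead, although in this model a very small free pool would eventually come back around as well.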

>
>Mark, can you please file a patch and send it to linux-mtd mailing
>list?
>Such a change needs to go through Linux and then to u-boot.
>But first we need to think about and discuss it in detail.

Will do.

> 
>>   I am not sure if it is so easy ...
>>
>> > This fix was implemented and fixable bit-flip errors no longer
>hang/freeze the boot process! UBI 
>> > erase and reformat was used between re-tests to get consistent
>results.
>> > 
>> > Adding the above 75% correctable bitflip threshold is also a good
>thing as less movement will ensue 
>> > when the FLASH is new, but as the flash ages, the root cause will
>once again be invoked causing 
>> > un-recoverable boot failures.
>> > 
>> > Note this fault is also in the latest kernel drivers for UBI and
>may also exist in other wear 
>> > leveling implementations. The kernel driver issue may be at fault
>for android devices locking 
>> > up/freezing sporadically during FLASH read when scrubbing due to a
>relatively full flash and 
>> > correctable errors causing ping pong PEB moves.
>> > 
>> > The question is, is my root cause solution sound or have I missed
>something?
>> 
>> I have to think about it before I write nonsense, but maybe Richard has
>> a deeper insight here.
>

Thanks for your input.

Mark

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* [U-Boot] UBI fixable bit-flip issue
  2018-07-12  8:46   ` Richard Weinberger
  2018-07-12  9:50     ` Mark Spieth
@ 2018-07-12 14:03     ` Mark Spieth
  2018-08-16  8:50       ` Richard Weinberger
  1 sibling, 1 reply; 13+ messages in thread
From: Mark Spieth @ 2018-07-12 14:03 UTC (permalink / raw)
  To: u-boot



On 12 July 2018 18:46:11 GMT+10:00, Richard Weinberger <richard@nod.at> 
wrote:
>Mark,
>
>On Thursday, 12 July 2018, 07:22:13 CEST, Heiko Schocher wrote:
>> Hello Mark,
>> 
>> added Richard Weinberger to cc...
>> 
>> On 12.07.2018 at 02:28, Mark Spieth wrote:
>> > Hi
>> > 
>> > In the process of investigating a boot failure on one of our
>devices, the
>> > 
>> > UBI: fixable bit-flip detected at PEB
>> > 
>> > message was seen with the following behaviour during kernel load in
>u-boot.
>> > 
>> > Read [2285568] bytes
>> > UBI: fixable bit-flip detected at PEB 415
>> > UBI: schedule PEB 415 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 415
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: schedule PEB 419 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: schedule PEB 420 for scrubbing
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > UBI: fixable bit-flip detected at PEB 420
>> > UBI: fixable bit-flip detected at PEB 419
>> > 
>> > This repeats until reset.
>
>Do you see the same symptom also on Linux?
>We need to be very sure that it is actually a UBI problem.

The Linux kernel provided has an up-to-date mtd/ubi driver, so it already 
has the 75% bit-flip threshold, which hides the issue on new flash. So the 
two are not the same, and this is untested on Linux.

>
>> > This fix is not a root cause fix though. Investigating further led
>to the following root cause 
>> > solution. The following is AFAICT.
>> > 
>> > When the scrubber chooses a PEB to move the data to, it takes one from the free balanced
>tree. This tree is sorted by EC 
>> > (erase count) and then by PEB number.
>> > 
>> > The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF
>which is 8192 in this config. So the 
>> > find_wl_entry function will find a PEB that is better in error
>count than the current PEB EC. This
>
>error count? You mean erase count?

Yes of course.

> 
>> > can easily cause it to find the PEB that was just moved from if it
>is the lowest numbered PEB in the 
>> > free tree. Waiting for EC to go above 8192 would take a long time
>and cause premature aging of the 
>> > flash PEBs in question.
>> > 
>> > The easy solution is to change the max parameter to this call to 0
>so it finds a PEB with a smaller 
>> > EC than the one being replaced. This means it won't use the
>previously discarded PEB as its first 
>> > choice.
>
>For scrubbing this might be a good idea, but not for regular
>wear-leveling.
Yes, only for scrubbing, not wear leveling.
>
>See comment in UBI:
>/*
>* When a physical eraseblock is moved, the WL sub-system has to pick
>the target
>* physical eraseblock to move to. The simplest way would be just to
>pick the
>* one with the highest erase counter. But in certain workloads this
>could lead
>* to an unlimited wear of one or few physical eraseblock. Indeed,
>imagine a
>* situation when the picked physical eraseblock is constantly erased
>after the
>* data is written to it. So, we have a constant which limits the
>highest erase
>* counter of the free physical eraseblock to pick. Namely, the WL
>sub-system
>* does not pick eraseblocks with erase counter greater than the lowest
>erase
> * counter plus %WL_FREE_MAX_DIFF.
> */
>#define WL_FREE_MAX_DIFF (2*UBI_WL_THRESHOLD)
>
>So we could change the logic such that for regular wear-leveling we
>keep using WL_FREE_MAX_DIFF,
>but for scrubbing (which is 1:1 wear-leveling but the source PEB is
>showing bit-flips) we use
>a lower value. IMHO WL_FREE_MAX_DIFF/2 would be a good choice.
>I'm not sure whether 0 is too extreme and might cause other
>distortions.

Yes, the wear-leveling threshold is still WL_FREE_MAX_DIFF and the 
scrubbing threshold is 0.

This is why I'm asking. Because the two PEBs will track each other's EC, 
I'm not sure that will work.
>
>Mark, can you please file a patch and send it to linux-mtd mailing
>list?
>Such a change needs to go through Linux and then to u-boot.
>But first we need to think about and discuss it in detail.

Will do.

> 
>>   I am not sure if it is so easy ...
>>
>> > This fix was implemented and fixable bit-flip errors no longer
>hang/freeze the boot process! UBI 
>> > erase and reformat was used between re-tests to get consistent
>results.
>> > 
>> > Adding the above 75% correctable bitflip threshold is also a good
>thing as less movement will ensue 
>> > when the FLASH is new, but as the flash ages, the root cause will
>once again be invoked causing 
>> > un-recoverable boot failures.
>> > 
>> > Note this fault is also in the latest kernel drivers for UBI and
>may also exist in other wear 
>> > leveling implementations. The kernel driver issue may be at fault
>for android devices locking 
>> > up/freezing sporadically during FLASH read when scrubbing due to a
>relatively full flash and 
>> > correctable errors causing ping pong PEB moves.
>> > 
>> > The question is, is my root cause solution sound or have I missed
>something?
>> 
>> I have to think about it before I write nonsense, but maybe Richard has
>> a deeper insight here.
>

Thanks for your input.

Mark


-- 
Mark Spieth, PhD
Digivation Pty Ltd
9 Catalina Ave
ASHBURTON VIC 3147
Australia
Phone: +61 4 11 515717 (0411515717)
Fax: +61 3 9885 5774


* [U-Boot] UBI fixable bit-flip issue
  2018-07-12 14:03     ` Mark Spieth
@ 2018-08-16  8:50       ` Richard Weinberger
  2018-08-16 23:31         ` Mark Spieth
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Weinberger @ 2018-08-16  8:50 UTC (permalink / raw)
  To: u-boot

Mark,

On Thursday, 12 July 2018, 16:03:43 CEST, Mark Spieth wrote:
> >Mark, can you please file a patch and send it to linux-mtd mailing
> >list?
> >Such a change needs to go through Linux and then to u-boot.
> >But first we need to think about and discuss it in detail.
> 
> Will do.

Did you find some time to do that?
I'd like to have this resolved in both Linux and u-boot.

Thanks,
//richard


* [U-Boot] UBI fixable bit-flip issue
  2018-08-16  8:50       ` Richard Weinberger
@ 2018-08-16 23:31         ` Mark Spieth
  0 siblings, 0 replies; 13+ messages in thread
From: Mark Spieth @ 2018-08-16 23:31 UTC (permalink / raw)
  To: u-boot

On 16/08/18 18:50, Richard Weinberger wrote:
> Mark,
>
> On Thursday, 12 July 2018, 16:03:43 CEST, Mark Spieth wrote:
>>> Mark, can you please file a patch and send it to linux-mtd mailing
>>> list?
>>> Such a change needs to go through Linux and then to u-boot.
>>> But first we need to think about and discuss it in detail.
>> Will do.
> Did you find some time to do that?
> I'd like to have this resolved in both Linux and u-boot.
>
Apologies, Richard.
I went on to another task and forgot about this.

I have been trying in my spare time to write a unit test that demonstrates 
this fault, but have not completed it yet. There is no good unit test 
framework for this code, so I have had to put one together using techniques 
I already use for other non-kernel C projects with googletest/mock. I will 
post it when it works.

The patch for HEAD is attached (I hope this is acceptable).

It is not a good solution, but it does prevent the bit-flip issue in the 
older U-Boot.

Mark
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ubi-bit-flip-fix.patch
Type: text/x-patch
Size: 2413 bytes
Desc: not available
URL: <http://lists.denx.de/pipermail/u-boot/attachments/20180817/8f6317a1/attachment.bin>


* [U-Boot] UBI Fixable bit-flip issue.
  2012-12-17  8:44   ` Holger Brunck
@ 2012-12-17 18:00     ` Vikram Narayanan
  0 siblings, 0 replies; 13+ messages in thread
From: Vikram Narayanan @ 2012-12-17 18:00 UTC (permalink / raw)
  To: u-boot

Hi,

On 12/17/2012 2:14 PM, Holger Brunck wrote:
> Hi,
>
> On 12/15/2012 04:14 AM, Vikram Narayanan wrote:
>> On 12/14/2012 11:33 PM, Vikram Narayanan wrote:
>>>
>>> I'm seeing a fixable bit-flip in the current u-boot (v2012.10) on an
>>> i.MX6 Solo based custom board. The problem is similar to the one
>>> explained here [1].
>>>
>>> As observed by the thread's author, does reverting the commit "1b1f9a9"
>>> solve the issue? Has anyone else faced a similar issue?
>>>
>
> This was a workaround I used until I found a proper solution for v2011.09. In the
> meantime the following fix was included in u-boot:
>
> http://git.denx.de/?p=u-boot.git;a=commit;h=d63894654df72b010de2abb4b3f07d0d755f65b6
>
> This fix solves the issue in my case. The patch is included in v2012.10, so
> that release should be OK. Maybe you hit a different problem.
>

Thanks for clarifying. I'll see if I can reproduce it somehow and also 
rebase my work on top of v2012.10 to see if that solves the issue.

Thanks,
Vikram


* [U-Boot] UBI Fixable bit-flip issue.
  2012-12-15  3:14 ` Vikram Narayanan
@ 2012-12-17  8:44   ` Holger Brunck
  2012-12-17 18:00     ` Vikram Narayanan
  0 siblings, 1 reply; 13+ messages in thread
From: Holger Brunck @ 2012-12-17  8:44 UTC (permalink / raw)
  To: u-boot

Hi,

On 12/15/2012 04:14 AM, Vikram Narayanan wrote:
> On 12/14/2012 11:33 PM, Vikram Narayanan wrote:
>>
>> I'm seeing a fixable bit-flip in the current u-boot (v2012.10) on an
>> i.MX6 Solo based custom board. The problem is similar to the one
>> explained here [1].
>>
>> As observed by the thread's author, does reverting the commit "1b1f9a9"
>> solve the issue? Has anyone else faced a similar issue?
>>

This was a workaround I used until I found a proper solution for v2011.09. In the
meantime the following fix was included in u-boot:

http://git.denx.de/?p=u-boot.git;a=commit;h=d63894654df72b010de2abb4b3f07d0d755f65b6

This fix solves the issue in my case. The patch is included in v2012.10, so
that release should be OK. Maybe you hit a different problem.

Regards
Holger

>>
>> [1] http://lists.denx.de/pipermail/u-boot/2011-September/100237.html
>> [2] http://lists.denx.de/pipermail/u-boot/2011-September/101887.html
> 


* [U-Boot] UBI Fixable bit-flip issue.
  2012-12-14 18:03 [U-Boot] UBI Fixable " Vikram Narayanan
@ 2012-12-15  3:14 ` Vikram Narayanan
  2012-12-17  8:44   ` Holger Brunck
  0 siblings, 1 reply; 13+ messages in thread
From: Vikram Narayanan @ 2012-12-15  3:14 UTC (permalink / raw)
  To: u-boot

Ccing the Author of [1].

On 12/14/2012 11:33 PM, Vikram Narayanan wrote:
> Hello,
>
> I'm seeing a fixable bit-flip in the current u-boot (v2012.10) on an
> i.MX6 Solo based custom board. The problem is similar to the one
> explained here [1].
>
> As observed by the thread's author, does reverting the commit "1b1f9a9"
> solve the issue? Has anyone else faced a similar issue?
>
> Thanks,
> Vikram
>
> [1] http://lists.denx.de/pipermail/u-boot/2011-September/100237.html
> [2] http://lists.denx.de/pipermail/u-boot/2011-September/101887.html


* [U-Boot] UBI Fixable bit-flip issue.
@ 2012-12-14 18:03 Vikram Narayanan
  2012-12-15  3:14 ` Vikram Narayanan
  0 siblings, 1 reply; 13+ messages in thread
From: Vikram Narayanan @ 2012-12-14 18:03 UTC (permalink / raw)
  To: u-boot

Hello,

I'm seeing a fixable bit-flip in the current u-boot (v2012.10) on an 
i.MX6 Solo based custom board. The problem is similar to the one 
explained here [1].

As observed by the thread's author, does reverting the commit "1b1f9a9" 
solve the issue? Has anyone else faced a similar issue?

Thanks,
Vikram

[1] http://lists.denx.de/pipermail/u-boot/2011-September/100237.html
[2] http://lists.denx.de/pipermail/u-boot/2011-September/101887.html


end of thread, other threads:[~2018-08-16 23:31 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-12  0:28 [U-Boot] UBI fixable bit-flip issue Mark Spieth
2018-07-12  5:22 ` Heiko Schocher
2018-07-12  5:38   ` Mark Spieth
2018-07-12  8:08     ` Heiko Schocher
2018-07-12  8:46   ` Richard Weinberger
2018-07-12  9:50     ` Mark Spieth
2018-07-12 14:03     ` Mark Spieth
2018-08-16  8:50       ` Richard Weinberger
2018-08-16 23:31         ` Mark Spieth
  -- strict thread matches above, loose matches on Subject: below --
2012-12-14 18:03 [U-Boot] UBI Fixable " Vikram Narayanan
2012-12-15  3:14 ` Vikram Narayanan
2012-12-17  8:44   ` Holger Brunck
2012-12-17 18:00     ` Vikram Narayanan
