linux-mtd.lists.infradead.org archive mirror
* Corruped NAND booting for all devices
@ 2020-02-05  1:43 JH
  2020-02-05  2:32 ` Steve deRosier
  2020-02-05  8:12 ` Jef Driesen
  0 siblings, 2 replies; 7+ messages in thread
From: JH @ 2020-02-05  1:43 UTC (permalink / raw)
  To: linux-mtd

Hi,

It is a bad day: 5 devices all failed to boot from NAND, all of a
sudden, today. The 5 devices run kernel 4.19.75 on a customized
iMX6ULL board and had been running for weeks. The device DC power is
supplied from AC via ADC and regulator. We turned the power on and off
several times while installing those test devices into test boxes over
the last couple of days without problems, then they all failed
together mysteriously today. They cannot complete booting to Linux
user space, so I am not able to log in to check and debug.

Here is some more background information: those 5 devices are from the
same new hardware revision, which switched to a new power regulator,
and they all have the same firmware installed, kernel 4.19 built from
Yocto.

Another fact: several devices of the old hardware revision (using a
different power regulator) have the same version of kernel and MTD
installed, but different versions of the high-level applications, in
the same test box, and they are all running fine. So that rules out a
power surge as the cause of the problem.

I am trying to figure out whether the problem is caused by firmware or
by hardware. As it is at a very low level, I doubt the high-level
applications could contribute to the problem, but I could be wrong.
There were some NAND bad blocks; I am not sure if that could cause the
boot corruption. The Yocto image was built from the same Yocto build
system; I am not clear whether the Yocto image could cause the NAND
corruption or not. Also, could using NAND fastmap cause the problem?
The boot messages are attached below; I would appreciate your insights.

Welcome to  minicom 2.7.1
OPTIONS:
Compiled on Aug 20 2018, 10:22:42.
Port /dev/cu.usbserial-FTAPPVUJ, 11:47:55
Press Meta-Z for help on special keys
??????
U-Boot 2018.03-g6ce2bc2ae2 (Jan 17 2020 - 00:57:41 +0000)
CPU:   Freescale i.MX6ULZ rev1.1 900 MHz (running at 396 MHz)
CPU:   Commercial temperature grade (0C to 95C) at 56C
Reset cause: POR
Model: Freescale i.MX6 ULZ 14x14 EVK Board
Board: MX6ULZ 14x14 EVK
DRAM:  512 MiB
NAND:  256 MiB
MMC:   FSL_SDHC: 0
Loading Environment from NAND... *** Warning - bad CRC, using default
environment
Failed (-5)
No panel detected: default to TFT43AB
Display: TFT43AB (480x272)
Video: LCDIF@0x21c8000 is fused, disable it
In:    serial
Out:   serial
Err:   serial
Net:   No ethernet found.
Normal Boot
Hit any key to stop autoboot:  0
Booting image 1
NAND read: device 0 offset 0x800000, size 0x2000000
 33554432 bytes read: OK
NAND read: device 0 offset 0x600000, size 0x200000
 2097152 bytes read: OK
Kernel image @ 0x80800000 [ 0x000000 - 0x7f09d0 ]
## Flattened Device Tree blob at 83000000
   Booting using the fdt blob at 0x83000000
   Using Device Tree in place at 83000000, end 83008972
ft_system_setup for mx6

Starting kernel ...
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.19.75 (oe-user@oe-host) (gcc version
8.2.0 (GCC)) #1 SMP Fri Nov 8 06:519
[    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
[    0.000000] CPU: div instructions available: patching division code
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing
instruction cache
[    0.000000] OF: fdt: Machine model: Traverse PMU RevC
[    0.000000] earlycon: ec_imx6q0 at MMIO 0x02020000 (options '')
[    0.000000] bootconsole [ec_imx6q0] enabled
[    0.000000] Memory policy: Data cache writealloc
[    0.000000] cma: Reserved 64 MiB at 0x8c000000
[    0.000000] random: get_random_bytes called from
start_kernel+0x8c/0x48c with crng_init=0
[    0.000000] percpu: Embedded 18 pages/cpu s41896 r8192 d23640 u73728
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 65024
[    0.000000] Kernel command line: console=ttymxc0,115200 earlycon
mtdparts=gpmi-nand:4m(boot),2m(uboo1
[    0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
[    0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
[    0.000000] Memory: 169124K/262144K available (11264K kernel code,
881K rwdata, 3664K rodata, 1024K )
[    0.000000] Virtual kernel memory layout:
[    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
[    0.000000]     fixmap  : 0xffc00000 - 0xfff00000   (3072 kB)
[    0.000000]     vmalloc : 0xd0800000 - 0xff800000   ( 752 MB)
[    0.000000]     lowmem  : 0xc0000000 - 0xd0000000   ( 256 MB)
[    0.000000]     pkmap   : 0xbfe00000 - 0xc0000000   (   2 MB)
[    0.000000]     modules : 0xbf000000 - 0xbfe00000   (  14 MB)
[    0.000000]       .text : 0x(ptrval) - 0x(ptrval)   (12256 kB)
[    0.000000]       .init : 0x(ptrval) - 0x(ptrval)   (1024 kB)
[    0.000000]       .data : 0x(ptrval) - 0x(ptrval)   ( 882 kB)
[    0.000000]        .bss : 0x(ptrval) - 0x(ptrval)   (7649 kB)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] Running RCU self tests
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu:     RCU event tracing is enabled.
[    0.000000] rcu:     RCU lockdep checking is enabled.
[    0.000000] rcu:     RCU restricting CPUs from NR_CPUS=4 to nr_cpu_ids=1.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[    0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[    0.000000] Switching to timer-based delay loop, resolution 41ns
[    0.000022] sched_clock: 32 bits at 24MHz, resolution 41ns, wraps
every 89478484971ns
[    0.007844] clocksource: mxc_timer1: mask: 0xffffffff max_cycles:
0xffffffff, max_idle_ns: 796358519s
[    0.020292] Console: colour dummy device 80x30
[    0.021978] Lock dependency validator: Copyright (c) 2006 Red Hat,
Inc., Ingo Molnar
[    0.030048] ... MAX_LOCKDEP_SUBCLASSES:  8
[    0.033798] ... MAX_LOCK_DEPTH:          48
[    0.037971] ... MAX_LOCKDEP_KEYS:        8191
[    0.042436] ... CLASSHASH_SIZE:          4096
[    0.046672] ... MAX_LOCKDEP_ENTRIES:     32768
[    0.051108] ... MAX_LOCKDEP_CHAINS:      65536
[    0.055644] ... CHAINHASH_SIZE:          32768
[    0.059983]  memory used by lock dependency info: 4655 kB
[    0.065378]  per task-struct memory footprint: 1536 bytes
[    0.070985] Calibrating delay loop (skipped), value calculated
using timer frequency.. 48.00 BogoMIP)
[    0.081242] pid_max: default: 32768 minimum: 301
[    0.086598] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.092513] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.103902] CPU: Testing write buffer coherency: ok
[    0.108359] /cpus/cpu@0 missing clock-frequency property
[    0.111490] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[    0.123335] Setting up static identity map for 0x80100000 - 0x80100078
[    0.128797] rcu: Hierarchical SRCU implementation.
[    0.137850] smp: Bringing up secondary CPUs ...
[    0.139607] smp: Brought up 1 node, 1 CPU
[    0.143604] SMP: Total of 1 processors activated (48.00 BogoMIPS).
[    0.149934] CPU: All CPU(s) started in SVC mode.
[    0.159781] devtmpfs: initialized
[    0.202335] VFP support v0.3: implementor 41 architecture 2 part 30
variant 7 rev 5
[    0.209235] clocksource: jiffies: mask: 0xffffffff max_cycles:
0xffffffff, max_idle_ns: 191126044627s
[    0.217465] futex hash table entries: 256 (order: 2, 16384 bytes)
[    0.230315] pinctrl core: initialized pinctrl subsystem
[    0.242556] NET: Registered protocol family 16
[    0.324620] DMA: preallocated 256 KiB pool for atomic coherent allocations
[    0.337326] cpuidle: using governor menu
[    0.380108] vdd3p0: supplied by regulator-dummy
[    0.386436] cpu: supplied by regulator-dummy
[    0.392527] vddsoc: supplied by regulator-dummy
[    0.444794] No ATAGs?
[    0.445397] hw-breakpoint: found 5 (+1 reserved) breakpoint and 4
watchpoint registers.
[    0.453188] hw-breakpoint: maximum watchpoint size is 8 bytes.
[    0.471287] imx6ul-pinctrl 20e0000.iomuxc: initialized IMX pinctrl driver
[    0.477738] imx6ul-pinctrl 2290000.iomuxc-snvs: no groups defined
in /soc/aips-bus@2200000/iomuxc-sn0
[    0.485316] imx6ul-pinctrl 2290000.iomuxc-snvs: initialized IMX
pinctrl driver
[    0.659193] mxs-dma 1804000.dma-apbh: initialized
[    0.672898] vgaarb: loaded
[    0.676307] SCSI subsystem initialized
[    0.680972] usbcore: registered new interface driver usbfs
[    0.684197] usbcore: registered new interface driver hub
[    0.690015] usbcore: registered new device driver usb
[    0.698122] media: Linux media interface: v0.10
[    0.700421] videodev: Linux video capture interface: v2.00
[    0.706461] pps_core: LinuxPPS API ver. 1 registered
[    0.710524] pps_core: Software ver. 5.3.6 - Copyright 2005-2007
Rodolfo Giometti <giometti@linux.it>
[    0.719590] PTP clock support registered
[    0.725896] Advanced Linux Sound Architecture Driver Initialized.
[    0.735897] Bluetooth: Core ver 2.22
[    0.736983] NET: Registered protocol family 31
[    0.741129] Bluetooth: HCI device and connection manager initialized
[    0.747893] Bluetooth: HCI socket layer initialized
[    0.752390] Bluetooth: L2CAP socket layer initialized
[    0.758035] Bluetooth: SCO socket layer initialized
[    0.767559] clocksource: Switched to clocksource mxc_timer1
[    1.441132] VFS: Disk quotas dquot_6.6.0
[    1.442713] VFS: Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
[    1.530381] NET: Registered protocol family 2
[    1.536966] tcp_listen_portaddr_hash hash table entries: 128
(order: 0, 5120 bytes)
[    1.542545] TCP established hash table entries: 2048 (order: 1, 8192 bytes)
[    1.549492] TCP bind hash table entries: 2048 (order: 4, 73728 bytes)
[    1.555907] TCP: Hash tables configured (established 2048 bind 2048)
[    1.563100] UDP hash table entries: 256 (order: 2, 20480 bytes)
[    1.568026] UDP-Lite hash table entries: 256 (order: 2, 20480 bytes)
[    1.575032] NET: Registered protocol family 1
[    1.584008] RPC: Registered named UNIX socket transport module.
[    1.587224] RPC: Registered udp transport module.
[    1.592175] RPC: Registered tcp transport module.
[    1.596597] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    1.622774] Initialise system trusted keyrings
[    1.626159] workingset: timestamp_bits=30 max_order=16 bucket_order=0
[    1.682563] NFS: Registering the id_resolver key type
[    1.685086] Key type id_resolver registered
[    1.689525] Key type id_legacy registered
[    1.693431] jffs2: version 2.2. (NAND) © 2001-2006 Red Hat, Inc.
[    1.703874] fuse init (API version 7.27)
[    1.751880] Key type asymmetric registered
[    1.753433] Asymmetric key parser 'x509' registered
[    1.760327] io scheduler noop registered
[    1.761997] io scheduler deadline registered
[    1.767245] io scheduler cfq registered (default)
[    1.771508] io scheduler mq-deadline registered
[    1.775474] io scheduler kyber registered
[    1.824841] imx-sdma 20ec000.sdma: Direct firmware load for
imx/sdma/sdma-imx6q.bin failed with erro2
[    1.832280] imx-sdma 20ec000.sdma: external firmware not found,
using ROM firmware
[    1.851955] console [ttymxc0] enabled
[    1.855858] bootconsole [ec_imx6q0] disabled
[    1.869926] 21e8000.serial: ttymxc1 at MMIO 0x21e8000 (irq = 49,
base_baud = 5000000) is a IMX
[    1.884516] 21f0000.serial: ttymxc3 at MMIO 0x21f0000 (irq = 50,
base_baud = 5000000) is a IMX
[    2.002621] brd: module loaded
[    2.080376] loop: module loaded
[    2.109170] random: fast init done
[    2.141063] nand: device found, Manufacturer ID: 0xef, Chip ID: 0xda
[    2.147794] nand: Winbond W29N02GV
[    2.151330] nand: 256 MiB, SLC, erase size: 128 KiB, page size:
2048, OOB size: 64
[    2.165475] Bad block table found at page 131008, version 0x01
[    2.173020] Bad block table found at page 130944, version 0x01
[    2.180892] 6 cmdlinepart partitions found on MTD device gpmi-nand
[    2.187223] Creating 6 MTD partitions on "gpmi-nand":
[    2.192661] 0x000000000000-0x000000400000 : "boot"
[    2.214714] 0x000000400000-0x000000600000 : "ubootenv"
[    2.231187] 0x000000600000-0x000000800000 : "dtb"
[    2.246842] 0x000000800000-0x000002800000 : "kernel1"
[    2.311648] 0x000002800000-0x000004800000 : "kernel2"
[    2.376299] 0x000004800000-0x000010000000 : "ubi"
[    2.684250] gpmi-nand 1806000.gpmi-nand: driver registered.
[    2.700166] libphy: Fixed MDIO Bus: probed
[    2.708597] CAN device driver interface
[    2.720153] usbcore: registered new interface driver asix
[    2.725978] usbcore: registered new interface driver ax88179_178a
[    2.732720] usbcore: registered new interface driver cdc_ether
[    2.739153] usbcore: registered new interface driver net1080
[    2.745207] usbcore: registered new interface driver cdc_subset
[    2.751721] usbcore: registered new interface driver zaurus
[    2.758081] usbcore: registered new interface driver cdc_ncm
[    2.764143] usbcore: registered new interface driver qmi_wwan
[    2.770205] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    2.776866] ehci-pci: EHCI PCI platform driver
[    2.781887] ehci-mxc: Freescale On-Chip EHCI Host driver
[    2.789587] usbcore: registered new interface driver cdc_wdm
[    2.795868] usbcore: registered new interface driver usb-storage
[    3.442670] mxs_phy 20c9000.usbphy: Data pin can't make good contact.
[    3.455664] imx_usb 2184200.usb: 2184200.usb supply vbus not found,
using dummy regulator
[    3.466393] imx_usb 2184200.usb: Linked as a consumer to regulator.0
[    3.479980] ci_hdrc ci_hdrc.1: EHCI Host Controller
[    3.485507] ci_hdrc ci_hdrc.1: new USB bus registered, assigned bus number 1
[    3.517779] ci_hdrc ci_hdrc.1: USB 2.0 started, EHCI 1.00
[    3.533486] hub 1-0:1.0: USB hub found
[    3.538773] hub 1-0:1.0: 1 port detected
[    3.560890] input: 20cc000.snvs:snvs-powerkey as
/devices/soc0/soc/2000000.aips-bus/20cc000.snvs/20c0
[    3.593166] snvs_rtc 20cc000.snvs:snvs-rtc-lp: rtc core: registered
20cc000.snvs:snvs-rtc-lp as rtc0
[    3.603922] i2c /dev entries driver
[    3.626515] imx2-wdt 20bc000.wdog: timeout 60 sec (nowayout=0)
[    3.634895] Bluetooth: HCI UART driver ver 2.3
[    3.639725] Bluetooth: HCI UART protocol H4 registered
[    3.646246] Bluetooth: HCI UART protocol LL registered
[    3.661035] sdhci: Secure Digital Host Controller Interface driver
[    3.667585] sdhci: Copyright(c) Pierre Ossman
[    3.672061] sdhci-pltfm: SDHCI platform and OF driver helper
[    3.682821] sdhci-esdhc-imx 2190000.usdhc: Linked as a consumer to
regulator.4
[    3.731410] mmc0: SDHCI controller on 2190000.usdhc [2190000.usdhc]
using ADMA
[    3.755207] usbcore: registered new interface driver usbhid
[    3.761180] usbhid: USB HID core driver
[    3.841803] mmc0: new high speed SDIO card at address 0001
[    3.853262] NET: Registered protocol family 10
[    3.871111] Segment Routing with IPv6
[    3.875518] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[    3.887265] NET: Registered protocol family 17
[    3.892554] can: controller area network core (rev 20170425 abi 9)
[    3.899993] NET: Registered protocol family 29
[    3.904608] can: raw protocol (rev 20170425)
[    3.909486] can: broadcast manager protocol (rev 20170425 t)
[    3.915347] can: netlink gateway (rev 20170425) max_hops=1
[    3.923735] Key type dns_resolver registered
[    3.933594] cpu cpu0: Linked as a consumer to regulator.2
[    3.940306] cpu cpu0: Linked as a consumer to regulator.3
[    3.951480] cpu cpu0: failed to disable 696MHz OPP
[    3.962215] Registering SWP/SWPB emulation handler
[    3.974089] Loading compiled-in X.509 certificates
[    4.091219] imx_thermal tempmon: Commercial CPU temperature grade -
max:95C critical:90C passive:85C
[    4.105067] ubi0: default fastmap pool size: 70
[    4.110009] ubi0: default fastmap WL pool size: 35
[    4.114863] ubi0: attaching mtd5
[    5.699392] ubi0: scanning is finished
[    5.729392] ubi0: attached mtd5 (name "ubi", size 184 MiB)
[    5.734965] ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
[    5.742078] ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
[    5.749091] ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
[    5.756111] ubi0: good PEBs: 1468, bad PEBs: 4, corrupted PEBs: 0
[    5.762353] ubi0: user volume: 1, internal volumes: 1, max. volumes
count: 128
[    5.769725] ubi0: max/mean erase counter: 1/0, WL threshold: 4096,
image sequence number: 2012832169
[    5.779000] ubi0: available PEBs: 104, total reserved PEBs: 1364,
PEBs reserved for bad PEB handling6
[    5.789597] ubi0: background thread "ubi_bgt0d" started, PID 63
[    5.797608] snvs_rtc 20cc000.snvs:snvs-rtc-lp: setting system clock
to 1970-01-01 00:00:02 UTC (2)
[    5.807115] cfg80211: Loading compiled-in X.509 certificates for
regulatory database
[    5.824472] cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[    5.833748] ALSA device list:
[    5.836790]   No soundcards found.
[    5.841708] platform regulatory.0: Direct firmware load for
regulatory.db failed with error -2
[    5.850691] cfg80211: failed to load regulatory.db
[    5.906636] random: crng init done
[    5.915558] UBIFS (ubi0:0): recovery needed
[    6.777033] UBIFS (ubi0:0): recovery deferred
[    6.782640] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0,
name "rootfs_data", R/O mode
[    6.791634] UBIFS (ubi0:0): LEB size: 126976 bytes (124 KiB),
min./max. I/O unit sizes: 2048 bytes/2s
[    6.801709] UBIFS (ubi0:0): FS size: 166338560 bytes (158 MiB, 1310
LEBs), journal size 8380416 byte)
[    6.812543] UBIFS (ubi0:0): reserved for root: 4952683 bytes (4836 KiB)
[    6.819293] UBIFS (ubi0:0): media format: w4/r0 (latest is w5/r0),
UUID 88DD740C-C81A-4579-A6CC-721Al
[    6.836415] VFS: Mounted root (ubifs filesystem) readonly on device 0:13.
[    6.847946] devtmpfs: mounted
[    6.854231] Freeing unused kernel memory: 1024K
[    6.860023] Run /sbin/init as init process
[    7.499562] systemd[1]: System time before build time, advancing clock.
[    7.679364] systemd[1]: systemd 239 running in system mode. (-PAM
-AUDIT -SELINUX +IMA -APPARMOR +SM)
[    7.705865] systemd[1]: Detected architecture arm.

Welcome to OpenEmbedded nodistro.0!

[    7.775814] systemd[1]: Set hostname to <solar>.
[    8.733396] systemd[1]: File
/lib/systemd/system/systemd-journald.service:36 configures an IP
firewa.
[    8.752243] systemd[1]: Proceeding WITHOUT firewalling in effect!
(This warning is only shown for th)
[    9.498046] systemd[1]: Listening on Journal Socket (/dev/log).
[  OK  ] Listening on Journal Socket (/dev/log).
[    9.553931] systemd[1]: Listening on udev Kernel Socket.
[  OK  ] Listening on udev Kernel Socket.
[    9.575093] systemd[1]: Listening on Journal Socket.
[  OK  ] Listening on Journal Socket.
[    9.653485] systemd[1]: Mounting Kernel Debug File System...
         Mounting Kernel Debug File System...
[    9.706431] systemd[77]: sys-kernel-debug.mount: Failed to execute
command: No such file or directory
[    9.721420] systemd[1]: Reached target Swap.
[  OK  ] Reached target Swap.
         Mounting Temporary Directory (/tmp)...
[  OK  ] Listening on Network Service N[    9.819279] systemd[78]:
tmp.mount: Failed to execute commandy
etlink Socket.
         Starting Journal Service...
         Starting Apply Kernel Variables...
         Mounting FUSE Control File System...
[   10.031009] systemd[82]: sys-fs-fuse-connections.mount: Failed to
execute command: No such file or dy
[  OK  ] Reached target Remote File Systems.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Started Forward Password Requests to Wall Directory Watch.
[  OK  ] Reached target Slices.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Created slice system-serial\x2dgetty.slice.
         Starting File System Check on Root Device...
[  OK  ] Started Dispatch Password Requests to Console Directory Watch.
[  OK  ] Reached target Paths.
[  OK  ] Listening on initctl Compatibility Named Pipe.
[  OK  ] Listening on udev Control Socket.
         Starting udev Coldplug all Devices...
[  OK  ] Reached target Host and Network Name Lookups.
[   10.379141] systemd[84]: systemd-udev-trigger.service: Failed to
execute command: No such file or diy
[FAILED] Failed to mount Kernel Debug File System.
See 'systemctl status sys-kernel-debug.mount' for details.
[FAILED] Failed to mount Temporary Directory (/tmp).
See 'systemctl status tmp.mount' for details.
[DEPEND] Dependency failed for Network Service.
[DEPEND] Dependency failed for Network Time Synchronization.
[  OK  ] Started Journal Service.
[  OK  ] Started Apply Kernel Variables.
[FAILED] Failed to mount FUSE Control File System.
See 'systemctl status sys-fs-fuse-connections.mount' for details.
[  OK  ] Started File System Check on Root Device.
[FAILED] Failed to start udev Coldplug all Devices.
See 'systemctl status systemd-udev-trigger.service' for details.
[  OK  ] Reached target System Time Synchronized.
         Starting Remount Root and Kernel File Systems...
[FAILED] Failed to start Remount Root and Kernel File Systems.
See 'systemctl status systemd-remount-fs.service' for details.
         Starting Flush Journal to Persistent Storage...
         Starting Create Static Device Nodes in /dev...
[FAILED] Failed to start Flush Journal to Persistent Storage.
See 'systemctl status systemd-journal-flush.service' for details.
[FAILED] Failed to start Create Static Device Nodes in /dev.
See 'systemctl status systemd-tmpfiles-setup-dev.service' for details.
         Starting udev Kernel Device Manager...
[  OK  ] Reached target Local File Systems (Pre).
         Mounting /var/volatile...
[  OK  ] Reached target Containers.
[FAILED] Failed to mount /var/volatile.
See 'systemctl status var-volatile.mount' for details.
[DEPEND] Dependency failed for Bind mount volatile /var/cache.
[DEPEND] Dependency failed for Bind mount volatile /srv.
[DEPEND] Dependency failed for Bind mount volatile /var/spool.
[DEPEND] Dependency failed for Bind mount volatile /var/lib.
[DEPEND] Dependency failed for Local File Systems.
[  OK  ] Reached target Network.
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Timers.
[  OK  ] Reached target Sockets.
[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.
         Starting Create Volatile Files and Directories...
         Starting Load/Save Random Seed...
[  OK  ] Started udev Kernel Device Manager.
[FAILED] Failed to start Create Volatile Files and Directories.
See 'systemctl status systemd-tmpfiles-setup.service' for details.
[  OK  ] Started Load/Save Random Seed.
         Starting Update UTMP about System Boot/Shutdown...
[  OK  ] Started Update UTMP about System Boot/Shutdown.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Press Enter for maintenance
(or press Control-D to continue):
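
The figures in the ubi0 attach messages and the MTD partition table
above can be cross-checked with simple arithmetic. The following is a
minimal Python sketch, not a diagnostic tool: the constants are copied
from this log, and the function name ubi_geometry_report is made up for
illustration. It only checks that the reported numbers agree with each
other; it says nothing about what caused the corruption.

```python
# Values copied from the ubi0 attach messages and the MTD partition table
# in the boot log above.
PEB_SIZE = 131072            # "PEB size: 131072 bytes (128 KiB)"
LEB_SIZE = 126976            # "LEB size: 126976 bytes"
DATA_OFFSET = 4096           # "data offset: 4096"
GOOD_PEBS, BAD_PEBS = 1468, 4
UBI_START, UBI_END = 0x04800000, 0x10000000   # "ubi" MTD partition


def ubi_geometry_report():
    """Check that the numbers UBI printed are consistent with each other."""
    total_pebs = GOOD_PEBS + BAD_PEBS
    part_bytes = UBI_END - UBI_START
    return {
        # A LEB is a PEB minus the space taken by the EC and VID headers.
        "leb_matches": PEB_SIZE - DATA_OFFSET == LEB_SIZE,
        # Good plus bad PEBs should cover the whole "ubi" partition.
        "peb_count_matches": total_pebs * PEB_SIZE == part_bytes,
        "partition_mib": part_bytes // (1024 * 1024),
        "bad_peb_percent": round(100.0 * BAD_PEBS / total_pebs, 2),
    }


if __name__ == "__main__":
    for key, value in ubi_geometry_report().items():
        print(f"{key}: {value}")
```

With the values from this log, both consistency checks pass and the
partition size comes out as 184 MiB, matching the "size 184 MiB" that
UBI reports for mtd5.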

Thank you.

Kind regards,

- jh

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Corruped NAND booting for all devices
  2020-02-05  1:43 Corruped NAND booting for all devices JH
@ 2020-02-05  2:32 ` Steve deRosier
  2020-02-05  7:58   ` JH
  2020-02-05  8:12 ` Jef Driesen
  1 sibling, 1 reply; 7+ messages in thread
From: Steve deRosier @ 2020-02-05  2:32 UTC (permalink / raw)
  To: JH; +Cc: linux-mtd

Hi JH,

On Tue, Feb 4, 2020 at 5:43 PM JH <jupiter.hce@gmail.com> wrote:
>
> Hi,
>
> It is a bad day we have 5 devices failed NAND booting all of certain
> today. The 5 devices running kernel 4.19.75 on iMX6ULL customized
> board, the devices had been running for weeks, the device DC power is
> supplied from AC via ADC and regulator, we turned power on and off
> several times when installing those test devices to test boxes in the
> last couple of days without problems, then they all failed together
> mysteriously today. It could not complete the booting to Linux user
> space, so I am not able to log into the user space to check and to
> debug it.
>
...snip...

I've been following your questions on both this list and the
linux-wireless one. May I recommend some reading:
http://www.linux-mtd.infradead.org/doc/nand.html

It isn't clear what filesystem you're using, though I recall from an
earlier email you weren't running UBIFS. But in the log I do see UBIFS
messages. In any case, based on your descriptions, I strongly suspect
NAND bitflips are causing your filesystem corruptions, and you likely
don't have the correct settings for the ECC strength as necessary for
your NAND. Or maybe you're not flashing images correctly and the ECC
info is getting lost.  Or maybe you're writing logs and such to flash
and you're filling up the filesystem. Maybe your extents aren't correct
and one filesystem overwrites another. Unfortunately, you've got your
system so cobbled up with user-space prettiness in your log output
that you're obscuring the kernel log output that would help you
diagnose the problems.
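
As a stopgap while the console configuration is being fixed, an already
captured log can be post-processed to keep only the kernel-timestamped
lines. This is a small Python sketch under stated assumptions: the
function name kernel_lines and the keyword filter are made up for
illustration, and a kernel line is taken to be anything starting with a
"[ seconds.microseconds]" stamp. It cannot recover messages that were
never printed, and continuation lines wrapped by the terminal are
dropped.

```python
import re

# Kernel printk lines start with a "[   12.345678] " timestamp.
KERNEL_LINE = re.compile(r"^\[\s*\d+\.\d{6}\]\s")


def kernel_lines(console_text, keywords=()):
    """Return kernel-timestamped lines, optionally only those with a keyword."""
    out = []
    for line in console_text.splitlines():
        if not KERNEL_LINE.match(line):
            # Skips systemd "[  OK  ]" / "[FAILED]" status lines and
            # terminal-wrapped continuation lines.
            continue
        if keywords and not any(k in line for k in keywords):
            continue
        out.append(line)
    return out


if __name__ == "__main__":
    import sys
    # Usage: python filter_log.py < minicom.cap
    for line in kernel_lines(sys.stdin.read(), keywords=("ubi", "UBIFS", "nand")):
        print(line)
```

Note that systemd messages routed through the kernel log carry the same
timestamp and are kept, so a keyword list is usually still wanted.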

Some steps/advice to help your debugging:
* Stop making assumptions about what is or couldn't possibly be wrong.
Use evidence only. Test and validate each assumption.
* Fix your serial port logging output so you can actually see all the
kernel messages instead of the systemd messages that aren't helping
you.
* You don't need access to the user-space tools on your corrupted
filesystem. You could nfs mount a root via U-Boot and then use the
tools to analyze your flash.
* Read the schematics of your device. Understand how your NAND is
hooked up.  Is it correct?
* Read the datasheet of your NAND and your flash controller. Check
your configurations against requirements.
* Understand what your flash partitions are, where each filesystem is
in NAND, etc. Check that the extents are correct and you're not
overwriting.
* Make sure you're not writing stuff to your flash and filling it.
* Check your required ECC strength. Verify that ECC bits are actually
being written during use and are correct.
* Make sure your method of flashing images writes the ECC bits correctly. Verify.
* You can use U-Boot to dump your NAND pages and verify your ECC bits
are being written how they should be.
* Enable as much kernel log output as possible so you can see the
relevant debug messages.
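
The extent check in the list above can be sketched in a few lines of
Python. The partition table, chip size and erase size below are copied
from the boot log earlier in this thread; the function names
check_layout and reads_inside are made up for illustration. This only
validates the layout the kernel reports. The layout used by the
flashing tool and by U-Boot has to be compared against it separately.

```python
# (name, start, end) copied from the "Creating 6 MTD partitions" messages.
PARTITIONS = [
    ("boot",     0x00000000, 0x00400000),
    ("ubootenv", 0x00400000, 0x00600000),
    ("dtb",      0x00600000, 0x00800000),
    ("kernel1",  0x00800000, 0x02800000),
    ("kernel2",  0x02800000, 0x04800000),
    ("ubi",      0x04800000, 0x10000000),
]
CHIP_SIZE = 256 * 1024 * 1024   # "nand: 256 MiB"
ERASE_SIZE = 128 * 1024         # "erase size: 128 KiB"


def check_layout(parts, chip_size, erase_size):
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    prev_end = 0
    for name, start, end in sorted(parts, key=lambda p: p[1]):
        if start % erase_size or end % erase_size:
            problems.append(f"{name}: not aligned to erase block")
        if start < prev_end:
            problems.append(f"{name}: overlaps previous partition")
        elif start > prev_end:
            problems.append(f"{name}: gap of {start - prev_end:#x} bytes before it")
        if end > chip_size:
            problems.append(f"{name}: extends past end of chip")
        prev_end = max(prev_end, end)
    return problems


def reads_inside(parts, name, offset, size):
    """True if a raw read (offset, size) stays inside the named partition."""
    for pname, start, end in parts:
        if pname == name:
            return start <= offset and offset + size <= end
    raise KeyError(name)


if __name__ == "__main__":
    print(check_layout(PARTITIONS, CHIP_SIZE, ERASE_SIZE) or "layout looks consistent")
    # The two "NAND read" commands U-Boot issued in the log.
    print(reads_inside(PARTITIONS, "kernel1", 0x800000, 0x2000000))
    print(reads_inside(PARTITIONS, "dtb", 0x600000, 0x200000))
```

For the table in this log the check finds no overlap, gap or
misalignment, and both U-Boot reads stay inside their partitions, so at
least the kernel's view of the extents is self-consistent.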

Also see the list here:
http://lists.infradead.org/pipermail/linux-mtd/2018-December/086331.html

I don't know what's going on with your system. You have presented a
large number of random symptoms, a lot of assumptions, but very little
real information that we can help you with. And from the information
you present, pretty much no one here is going to be able to solve it
for you - _you_ need to solve your problem. The only way you're going
to do that is to UNDERSTAND the problem first. Get the right debug
output, understand your hardware inside and out, and verify the
software matches the hardware configuration and you'll probably get a
lot closer to finding your problem.

Debugging flash corruption problems is a non-trivial activity.  Last
product I had to do it on took me 6 months of investigation before we
finally solved it. It was a combination of several errors, and fixing
each one helped, but of course made the others harder to find as the
cycle time between failures increased. Some problems were our fault
and others were caused by an undocumented silicon error that took us a
while to realize and work around. Buckle down and go step by step.
With luck you'll find the problem quickly. If not, take it as an
opportunity to become an expert in every level of your system. I wish
you luck.

- Steve


* Re: Corruped NAND booting for all devices
  2020-02-05  2:32 ` Steve deRosier
@ 2020-02-05  7:58   ` JH
  2020-02-05 17:06     ` Steve deRosier
  0 siblings, 1 reply; 7+ messages in thread
From: JH @ 2020-02-05  7:58 UTC (permalink / raw)
  To: Steve deRosier; +Cc: linux-mtd

Hi Steve,

On 2/5/20, Steve deRosier <derosier@gmail.com> wrote:
> I've been following your questions on both this list and the
> linux-wireless one. May I recommend some reading:
> http://www.linux-mtd.infradead.org/doc/nand.html
>
> It isn't clear what filesystem you're using, though I recall from an
> earlier email you weren't running UBIFS. But in the log I do see UBIFS
> messages. In any case, based on your descriptions, I strongly suspect
> NAND bitflips are causing your filesystem corruptions, and you likely
> don't have the correct settings for the ECC strength as necessary for
> your NAND. Or maybe you're not flashing images correctly and the ECC
> info is getting lost.  Or maybe you're writing logs and such to flash
> and you're filing up the filesystem. Maybe your extents aren't correct
> and one filesystem overwrites another. Unfortunately, you've got your
> system so cobbled up with user-space prettiness in your log output
> that you're obscuring the kernel log output that would help you
> diagnose the problems.

Yes, the file system is UBIFS. The different revisions of test units
have been running for many months; they were relatively stable until
now, with the new revision of hardware. As you found, we have lots of
low-level problems when running the new revision of hardware. As both
firmware and hardware have evolved, the first rational thing is to
narrow down the source of the problem.

I appreciate all your advice, which is very helpful and valid. The
hardware was designed by other contractors, and there are limited tools
and equipment for a software guy to debug the hardware. The hardware
contractors firmly ruled out any issues in hardware and pointed the
finger at the software image built from Yocto as the cause of the NAND
corruption. The Yocto image contains all open source components: Linux
kernel, connman, MTD, ofono, etc. So I am trying to figure out whether
there are limitations and constraints on turning the device power off
while it may be in the middle of erasing pages. Would that corrupt the
NAND flash? Or might we not have set things up properly?

I posted the message here to gather information from your experience
and to take your advice on figuring out in what circumstances NAND
corruption can occur, so we can mitigate the issues as much as
possible.

As you said, there are so many things in software and hardware that
could cause NAND corruption. What I am particularly interested in is
whether a so-called bad Yocto image could cause the NAND corruption. To
be clear, I am not talking about software problems in that image; I am
talking about a Yocto build system problem that generated a bad image.
I thought that if you built a bad image, it would not run the first
time. If an image boots from NAND well for several days, what could the
Yocto build system do to make the image corrupt the NAND later, like a
virus? It does not make sense to me, but I could be wrong.

Thank you.

Kind regards,

- jh


* Re: Corruped NAND booting for all devices
  2020-02-05  1:43 Corruped NAND booting for all devices JH
  2020-02-05  2:32 ` Steve deRosier
@ 2020-02-05  8:12 ` Jef Driesen
  2020-02-06  0:49   ` JH
  1 sibling, 1 reply; 7+ messages in thread
From: Jef Driesen @ 2020-02-05  8:12 UTC (permalink / raw)
  To: JH, linux-mtd

On 2/5/20 2:43 AM, JH wrote:
> It is a bad day we have 5 devices failed NAND booting all of certain
> today. The 5 devices running kernel 4.19.75 on iMX6ULL customized
> board, the devices had been running for weeks, the device DC power is
> supplied from AC via ADC and regulator, we turned power on and off
> several times when installing those test devices to test boxes in the
> last couple of days without problems, then they all failed together
> mysteriously today. It could not complete the booting to Linux user
> space, so I am not able to log into the user space to check and to
> debug it.
> 
> ...
> [    5.915558] UBIFS (ubi0:0): recovery needed
> [    6.777033] UBIFS (ubi0:0): recovery deferred
> [    6.782640] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0,
> name "rootfs_data", R/O mode
> 
> ...
> 
> [FAILED] Failed to mount /var/volatile.
> See 'systemctl status var-volatile.mount' for details.
> [DEPEND] Dependency failed for Bind mount volatile /var/cache.
> [DEPEND] Dependency failed for Bind mount volatile /srv.
> [DEPEND] Dependency failed for Bind mount volatile /var/spool.
> [DEPEND] Dependency failed for Bind mount volatile /var/lib.

At first sight, it looks like you have a read-only ubifs filesystem, with
an overlay filesystem backed by another read-write ubifs filesystem? And
that read-write filesystem fails to mount after a power failure?

In that case, this sounds very similar to the problem I reported last week:

http://lists.infradead.org/pipermail/linux-mtd/2020-January/093542.html

Jef


* Re: Corruped NAND booting for all devices
  2020-02-05  7:58   ` JH
@ 2020-02-05 17:06     ` Steve deRosier
  0 siblings, 0 replies; 7+ messages in thread
From: Steve deRosier @ 2020-02-05 17:06 UTC (permalink / raw)
  To: JH; +Cc: linux-mtd

Hi JH,

On Tue, Feb 4, 2020 at 11:58 PM JH <jupiter.hce@gmail.com> wrote:
>
> Hi Steve,
>
> On 2/5/20, Steve deRosier <derosier@gmail.com> wrote:
> > I've been following your questions on both this list and the
> > linux-wireless one. May I recommend some reading:
> > http://www.linux-mtd.infradead.org/doc/nand.html
> >
> > It isn't clear what filesystem you're using, though I recall from an
> > earlier email you weren't running UBIFS. But in the log I do see UBIFS
> > messages. In any case, based on your descriptions, I strongly suspect
> > NAND bitflips are causing your filesystem corruptions, and you likely
> > don't have the correct settings for the ECC strength as necessary for
> > your NAND. Or maybe you're not flashing images correctly and the ECC
> > info is getting lost.  Or maybe you're writing logs and such to flash
> > and you're filling up the filesystem. Maybe your extents aren't correct
> > and one filesystem overwrites another. Unfortunately, you've got your
> > system so cobbled up with user-space prettiness in your log output
> > that you're obscuring the kernel log output that would help you
> > diagnose the problems.
>
> Yes, the file system is UBIFS, the different revision of test units
> have been running for many months, they were relative stable until now
> for a new revision of hardware. Like you found, we have lots of
> problems in low level when running the new revision of hardware. As
> both firmware and hardware evolved, the first rational thing is to
> narrow down the source of the problem.
>

My suggestion, assuming that you have a version that ran on the old
hardware and that it can run on the new hardware (so, basics like
processor types, etc haven't changed enough to keep it from working)
is to run the known-good software on the new hardware and see where
you stand. My prediction is it will fail hard, but it should be
informative.

With hardware changes, there are two levels:

1. Small enough changes it's still more or less the same platform, but
with a few things that need to be changed.
2. Big enough changes it should be considered a new platform and
treated as such.

Either way, the approach is similar, just different in scope. In the
former, take a look at the old schematic and the new schematic and see
what changed. Pull the datasheets of any chips that changed (both old
and new) and check changed parameters. Check for changed port lines,
gpio pulls, chip selects, and timing parameters.

In the case of the latter, we're talking the processor changed, the
device architecture changed (not like ARM to MIPS, that's covered in
"processor" changed, more like how your overall device is designed),
etc...  Honestly just start over. Assume everything changed and start
over from scratch. Examine everything, make sure your DT matches, your
compile flags. Test and confirm everything. Basically ground-up board
bring-up.

A good question for you - did the old design use SLC NAND and the new
MLC NAND? That makes a huge difference. Also, I've found that not all
manufacturers are equal in reliability. I had a hardware team that
wanted to substitute a cheaper but "equivalent" part from a different
manufacturer. It had the exact same specs and in theory should have
run seamlessly - yet we had endless corruptions on the first articles
they sent me. We just said "no" and stayed with the more expensive
part because saving $0.45 on something that would sell only a few
tens-of-thousands of units wasn't worth the engineering time to "solve
it in software" if it was even indeed possible to do so. Note that
this was only possible because the only thing that changed was the one
chip, which from a software standpoint should have been identical.
When you only change one thing at a time, it helps you find the
problems.

> I appreciate all your advice which are very helpful and valid, the
> hardware was designed by other contractors, there is limited tools and
> equipment for software guy to debug the hardware. Hardware contractors
> firmly ruled out any issues in hardware, they pointed finger to
> software image built from Yocto to cause the NAND corruption. The

Of course they are saying "it's not my problem". You seem to be living
in a "throw-it-over-the-wall" style organization. In my experience,
you have three choices - get the hell out, become a hardware expert,
or become close friends with one of the guys on the hardware team and
change the culture. You need to realize that in a way they're right -
by their limited perspective, ie the electrons go where they should
go, everything is fine.  But gluing a few chips down on a board is
only 2% of an embedded engineer's job; there's a lot more to do
because our chips are now so programmable, and at the end of the day
it still needs to work right.

Again, examine and understand the datasheets. Check against the
schematic. Satisfy yourself there are no electrical errors. Validate
that every value that gets set in a controller register is the correct
one.

> Yocto image contains all open sources, Linux kernel, connman, MTD,
> ofono, etc, so I try to figure out if there are limitations and
> constrains to turn the device power off while it may be in the middle
> of erasing pages, would that cause the NAND flash corrupted? Or we
> might not set up things properly?
>

UBI is designed to be power-cut safe. That's not to say there haven't
been bugs or that there isn't one now.  But basic things like "the power
cut happened when we erased a page" shouldn't be a concern unless
there's a driver bug.

> As you said, there are so many things in software and hardware could
> cause the NAND corruption, what I am particular interested in is if so
> called a bad Yocto image could cause the NAND corruption, let's make
> it clear I am not talking about software problems in that image, I am
> talking about Yocto build system problem which generated a bad image.
> I thought, if you built a bad image, it would not be able to run at
> first time, if an image to run NAND booting well for several days,
> what that the Yocto build system could to make the image corrupted the
> NAND  late like a virus? It does not make sense to me, but I could be
> wrong.

While I would say "nothing is impossible", the scenario you're talking
about is close to it. You're looking down a dry hole. If you can
successfully build and run it, it's not a "bad Yocto image" in the
sense you're describing.

Data loss over time is usually a bit-rot issue in my experience. Check
your ECC parameters, check that ECC data is being written correctly in
ALL cases. Check that U-boot and the kernel agree on the ECC
parameters.  Note - you should be seeing ECC warning messages in your
kernel log on boot, but to my eye you don't have the right settings
there to see what you need to see.
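
As a rough sketch of the ECC check described above: on a device that
still boots, the kernel's view of the ECC settings and the running error
counters can be read from sysfs. This assumes the attributes listed in
the kernel's sysfs-class-mtd ABI are exposed by the NAND driver in use
(not every driver provides all of them). The helper name read_mtd_stats
is made up for illustration.

```python
from pathlib import Path

# Attribute names taken from the kernel's sysfs-class-mtd ABI.
# Assumption: the NAND driver on the board exposes at least some of them.
ATTRS = ("name", "ecc_strength", "ecc_step_size", "bitflip_threshold",
         "corrected_bits", "ecc_failures", "bad_blocks")


def read_mtd_stats(root="/sys/class/mtd"):
    """Return {device: {attribute: value}} for every mtdN entry under root."""
    stats = {}
    for dev in sorted(Path(root).glob("mtd[0-9]*")):
        if dev.name.endswith("ro"):
            continue  # skip the read-only aliases (mtd0ro, mtd1ro, ...)
        entry = {}
        for attr in ATTRS:
            attr_file = dev / attr
            if attr_file.is_file():
                entry[attr] = attr_file.read_text().strip()
        stats[dev.name] = entry
    return stats
```

Calling read_mtd_stats() on the target and printing the result gives one
dictionary per MTD device. An ecc_strength that does not match what the
NAND datasheet requires, or a corrected_bits count that keeps climbing
between boots, would be worth comparing against what U-boot is
configured to use.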

Also, be sure you're not filling up the filesystem, or have partitions
overlapping and thus overwriting each other. One type of "bad image"
is where the new one is larger than expected, and thus when you
flash it you either overwrite some other area or you truncate what you
are flashing. In some cases like these, it is possible for it to run
normally...until it doesn't. But those types of things are easy to
check.
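
The overlap and image-size checks above can be sketched as a small
script run on the build host. The partition table is passed in by hand
as (name, offset, size) tuples in bytes; the function name and the
example values are hypothetical and not taken from this board.

```python
def find_layout_problems(partitions, image_sizes=None):
    """Check a flash layout for overlaps and oversized images.

    partitions:  list of (name, offset, size) tuples, in bytes.
    image_sizes: optional {partition name: image size in bytes}.
    Returns a list of human-readable problem descriptions.
    """
    problems = []
    prev_name, prev_end = None, 0
    # Walk partitions by start offset, remembering the furthest end seen.
    for name, offset, size in sorted(partitions, key=lambda p: p[1]):
        if prev_name is not None and offset < prev_end:
            problems.append(f"{prev_name} overlaps {name}")
        if prev_name is None or offset + size > prev_end:
            prev_name, prev_end = name, offset + size
    # An image larger than its partition gets truncated or spills over.
    for name, _offset, size in partitions:
        image = (image_sizes or {}).get(name)
        if image is not None and image > size:
            problems.append(f"image for {name} is {image - size} bytes too large")
    return problems
```

An empty list means the layout passed both checks. On NAND it is
sensible to leave extra headroom in each partition, since bad blocks
reduce the usable space below the nominal partition size.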

Check that your method of flashing doesn't ignore bad-block markers.
Or that when you flash it that ECC data gets written correctly.

To give you an example: I had a system where the typical method of
flashing the system for production was via u-boot.  I worked with it
every day reflashing via u-boot. I never had a problem.  We got random
reports of problems from the field and even some of my colleagues
would see corrupted (couldn't boot) NAND after a while.  Going through
the problem, I discovered that our u-boot method of flashing worked
perfectly and wrote ECC correctly.  I eventually discovered that our
user-space update script would not write ECC but instead left it
cleared. And, at least with the version of UBIFS we were using, UBIFS
was tolerant of no ECC data. Basically, it would read the page and if
there was ECC data it would validate it and correct or error out if
there was a problem, but if there was no ECC data for the page, it
would short-circuit and basically say "nothing for me to check, all is
OK". So, in the short term anything flashed with the buggy update
script would run fine and would only show problems weeks or so later.
And of course, only if a bit flip happened in the wrong place, etc....
So it was rare enough it took a while to notice. But devices that were
never updated (only production flashed) would be fine.  Or ones that
got upgraded via the u-boot method, like I happened to do, never saw
the problems. Hard to track down and only found because I went through
everything. Hence my comment of "ALL cases".

Hope that helps,
- Steve


* Re: Corruped NAND booting for all devices
  2020-02-05  8:12 ` Jef Driesen
@ 2020-02-06  0:49   ` JH
  2020-02-06  1:40     ` Steve deRosier
  0 siblings, 1 reply; 7+ messages in thread
From: JH @ 2020-02-06  0:49 UTC (permalink / raw)
  To: Jef Driesen; +Cc: linux-mtd

Hi Jef,

On 2/5/20, Jef Driesen <jef.driesen@niko.eu> wrote:
>> [    6.782640] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0,
>> name "rootfs_data", R/O mode
>>
>> ...
>>
>> [FAILED] Failed to mount /var/volatile.
>> See 'systemctl status var-volatile.mount' for details.
>> [DEPEND] Dependency failed for Bind mount volatile /var/cache.
>> [DEPEND] Dependency failed for Bind mount volatile /srv.
>> [DEPEND] Dependency failed for Bind mount volatile /var/spool.
>> [DEPEND] Dependency failed for Bind mount volatile /var/lib.
>
> At first sight, it looks you have a read-only ubifs filesystem, with an
> overlay filesystem backed by another read-write ubifs filesystem? And
> that read-write filesystem fails to mount after a power failure?

I did notice the R/O mode. It never appeared on the previous hardware
revision, and the MTD partition for UBI is always RW. I do not know
where it comes from, since the MTD partitioning and installation were
performed by our hardware contractor. I don't believe my Yocto image
could have changed it, but correct me if I am wrong.

> In that case, this sounds very similar to the problem I reported last week:
>
> http://lists.infradead.org/pipermail/linux-mtd/2020-January/093542.html

Interesting, good to know there is an issue here.

I guess the questions are: first, why do we have a RO ubifs filesystem
with an overlay backed by a RW ubifs filesystem in the first place; and
second, is it an MTD bug, or is it our fault for setting up the RO and
overlay RW ubifs incorrectly?

I'll keep an eye on it.

Thank you so much for sharing your information.

Kind regards,

- jh


* Re: Corruped NAND booting for all devices
  2020-02-06  0:49   ` JH
@ 2020-02-06  1:40     ` Steve deRosier
  0 siblings, 0 replies; 7+ messages in thread
From: Steve deRosier @ 2020-02-06  1:40 UTC (permalink / raw)
  To: JH; +Cc: linux-mtd, Jef Driesen

On Wed, Feb 5, 2020 at 4:49 PM JH <jupiter.hce@gmail.com> wrote:
>
> Hi Jef,
>
> On 2/5/20, Jef Driesen <jef.driesen@niko.eu> wrote:
> >> [    6.782640] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0,
> >> name "rootfs_data", R/O mode
> >>
> >> ...
> >>
> >> [FAILED] Failed to mount /var/volatile.
> >> See 'systemctl status var-volatile.mount' for details.
> >> [DEPEND] Dependency failed for Bind mount volatile /var/cache.
> >> [DEPEND] Dependency failed for Bind mount volatile /srv.
> >> [DEPEND] Dependency failed for Bind mount volatile /var/spool.
> >> [DEPEND] Dependency failed for Bind mount volatile /var/lib.
> >
> > At first sight, it looks you have a read-only ubifs filesystem, with an
> > overlay filesystem backed by another read-write ubifs filesystem? And
> > that read-write filesystem fails to mount after a power failure?
>
> I did notice the R/O mode, it was never in previous hardware revision,
> the MTD partition for UBI is always RW, I did not know where is that
> from, since the MTD partitions and installation was performed by our
> hardware contractor, I don't believe it could be my Yocto image to
> change it, but correct me, I could be wrong.
>

When UBIFS encounters an error, it switches the partition to R/O.
That's not a hardware thing necessarily.

Also, "MTD partitions and installation was performed by our hardware
contractor"?!?!  You've got to figure out the setup. Are you saying
that they're 100% responsible for everything below the level of
userspace? Or are you responsible for kernel, u-boot, etc... all the
software?  If so, this is a software thing and you've got to own it.
If they are truly responsible for everything lower than the userspace
software, it's time you stop wasting your time on it and get them to
debug it, this is way out of your realm.

> > In that case, this sounds very similar to the problem I reported last week:
> >
> > http://lists.infradead.org/pipermail/linux-mtd/2020-January/093542.html
>
> Interesting, good to know there is an issue here,
>

This is not necessarily your problem. Read and understand it, and run
some tests to check. But it isn't likely the same thing.

> I guess the question here, firstly why we had the RO ubifs filesystem
> backed by an overlay RW ubifs filesystem in the first place, secondly
> is it an MTD bug or is it our fault to set up wrong RO and overlay RW
> ubifs?
>

Do you know you have this setup or not?  Based on what you're saying,
I don't think you have any idea. Boot a working device and check to
see if you are running an overlayfs.
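
One way to do that check on a working device is to look at the mount
table. The sketch below parses text in the /proc/mounts format and
reports overlay mounts and any ubifs volume currently mounted read-only.
The function name is made up for illustration, and it assumes the
overlay shows up with filesystem type "overlay" (or "overlayfs" on some
older vendor kernels).

```python
def summarize_mounts(mounts_text):
    """Parse /proc/mounts-style text.

    Returns (overlay mount points, read-only ubifs mount points).
    """
    overlays, ro_ubifs = [], []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a valid mount entry
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in ("overlay", "overlayfs"):
            overlays.append(mount_point)
        elif fs_type == "ubifs" and "ro" in options.split(","):
            ro_ubifs.append(mount_point)
    return overlays, ro_ubifs
```

On the target this would be called as
summarize_mounts(open("/proc/mounts").read()). A ubifs volume that is
listed read-only on a failing unit but read-write on a healthy one
points at UBIFS having switched to R/O after an error, rather than at a
deliberate read-only setup.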

- Steve


