On 2019/8/21 4:05 PM, Peter Chant wrote:
> On 8/20/19 10:59 PM, Chris Murphy wrote:
>> On Tue, Aug 20, 2019 at 3:10 PM Peter Chant wrote:
>>>
>>> Chasing IO errors. BTRFS: error (device dm-2) in
>>> btrfs_run_delayed_refs:2907: errno=-5 IO failure
>>>
>>>
>>> I've just had an odd one.
>>>
>>> Over the last few days I've noticed a file system blocking, if that is
>>> the correct term, and this morning go read only. This resulted in a lot
>>> of checksum errors.
>>
>> That doesn't sound good. Checksum errors where? A complete start to
>> finish dmesg is most useful in this case.
>>
> Things to note:
> System has five drives:
> SSD for boot with ext4 and btrfs partitions - no issues.
> NVME for lxc and a database, via lvm it carries a btrfs and xfs file
> systems, no issue.
> Various overlayfs for lxc containers.
>
> Three WD reds, 2x3TB, 1x4TB, RAID1 problematic.
>
> I'll run the checks shortly.

Well, btrfs check will also report that transid mismatch, and possibly a
lot of extent tree corruption.

[...]
> [ 48.540518] BTRFS: device label Data devid 3 transid 2265510 /dev/dm-2
> [ 62.219602] BTRFS info (device dm-2): allowing degraded mounts
> [ 62.220612] BTRFS info (device dm-2): use zstd compression, level 3
> [ 62.221339] BTRFS info (device dm-2): enabling auto defrag
> [ 62.222031] BTRFS info (device dm-2): disk space caching is enabled
> [ 62.223323] BTRFS warning (device dm-2): devid 5 uuid
> 89195df2-4e3d-4856-aab0-2aa9f59b3846 is missing
> [ 62.232894] BTRFS warning (device dm-2): devid 5 uuid
> 89195df2-4e3d-4856-aab0-2aa9f59b3846 is missing
> [ 95.956952] BTRFS info (device dm-2): checking UUID tree
> [ 99.232089] BTRFS info (device dm-2): device fsid
> 159b8826-8380-45be-acb6-0cb992a8dfd7 devid 4 moved
> old:/dev/mapper/data_disk_1 new:/dev/dm-1
> [ 99.232146] BTRFS info (device dm-2): device fsid
> 159b8826-8380-45be-acb6-0cb992a8dfd7 devid 3 moved
> old:/dev/mapper/data_disk_2 new:/dev/dm-2
> [ 99.237670] BTRFS info (device dm-2): device fsid
> 159b8826-8380-45be-acb6-0cb992a8dfd7 devid 4 moved old:/dev/dm-1
> new:/dev/mapper/data_disk_1
> [ 99.242692] BTRFS info (device dm-2): device fsid
> 159b8826-8380-45be-acb6-0cb992a8dfd7 devid 3 moved old:/dev/dm-2
> new:/dev/mapper/data_disk_2
> [ 99.710315] EDAC amd64: Node 0: DRAM ECC disabled.
> [ 99.710317] EDAC amd64: ECC disabled in the BIOS or no ECC
> capability, module will not load.
> Either enable ECC checking or force module loading by
> setting 'ecc_enable_override'.
> (Note that use of the override may cause unknown side
> effects.)

Not sure what the ECC part is doing, but it repeats quite a few times.
I'd assume it's unrelated though.

[...]
> [ 142.507291] BTRFS error (device dm-2): parent transid verify failed
> on 13395960053760 wanted 2265296 found 2263090
> [ 142.544548] BTRFS error (device dm-2): parent transid verify failed
> on 13395960053760 wanted 2265296 found 2263090
> [ 142.544561] BTRFS: error (device dm-2) in
> btrfs_run_delayed_refs:2907: errno=-5 IO failure

This means btrfs is trying to read the extent tree for CoW, but by that
time the extent tree is already corrupted, so the read returns -EIO and
btrfs_run_delayed_refs just passes the error up.

Not sure if it's related to the device replace, but in any case the
corruption has already happened.

The device replace may be an interesting clue, as our dm-log-writes
tests currently focus mostly on single-device usage.

I'd recommend the regular rescue procedure:

- Try that skip_bg patchset if possible
  This provides the best salvage method so far, with full subvolumes
  available, although it needs out-of-tree patches.
  https://patchwork.kernel.org/project/linux-btrfs/list/?series=130637

- btrfs-restore
  The regular unmounted recovery; needs extra space.
  Latest btrfs-progs recommended.

Thanks,
Qu

> [ 142.544564] BTRFS info (device dm-2): forced readonly
> [ 144.108801] lxc-int: port 2(veth4OH9N0) entered disabled state
> [ 144.144061] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [ 144.144110] lxc-int: port 2(veth4OH9N0) entered blocking state
> [ 144.144113] lxc-int: port 2(veth4OH9N0) entered forwarding state
> [ 145.973587] NFSD: attempt to initialize umh client tracking in a
> container ignored.
> [ 145.973629] NFSD: attempt to initialize legacy client tracking in a
> container ignored.
> [ 145.973629] NFSD: Unable to initialize client recovery tracking! (-22)
> [ 145.973631] NFSD: starting 90-second grace period (net f0000492)
> [ 146.341292] lxc-int: port 4(vethNJGFQC) entered disabled state
> [ 146.379910] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [ 146.379947] lxc-int: port 4(vethNJGFQC) entered blocking state
> [ 146.379949] lxc-int: port 4(vethNJGFQC) entered forwarding state
> [ 156.039660] lxc-int: port 3(veth70VEEM) entered disabled state
> [ 156.069863] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [ 156.069900] lxc-int: port 3(veth70VEEM) entered blocking state
> [ 156.069902] lxc-int: port 3(veth70VEEM) entered forwarding state
>
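
P.S. To see how stale the tree block behind a "parent transid verify
failed" message is, the wanted/found pair can be pulled out of dmesg
mechanically. The following is only a rough sketch, not a btrfs-progs
tool: the names TransidMismatch and find_transid_mismatches are made up
for illustration, and it assumes raw dmesg output where each kernel
message sits on one line (not mail-wrapped as in the quote above).

```python
import re
from typing import List, NamedTuple

# Matches the kernel message format seen in the quoted log.
# Whitespace between fields is matched loosely.
TRANSID_RE = re.compile(
    r"BTRFS error \(device (?P<device>[^)]+)\):\s+"
    r"parent transid verify failed\s+on\s+(?P<bytenr>\d+)\s+"
    r"wanted\s+(?P<wanted>\d+)\s+found\s+(?P<found>\d+)"
)


class TransidMismatch(NamedTuple):
    """One distinct transid mismatch (hypothetical helper type)."""
    device: str
    bytenr: int
    wanted: int   # generation reported as "wanted"
    found: int    # generation reported as "found"

    @property
    def gap(self) -> int:
        # Difference between the wanted and found generations.
        return self.wanted - self.found


def find_transid_mismatches(dmesg_text: str) -> List[TransidMismatch]:
    """Return each distinct mismatch once, in order of first appearance."""
    seen = set()
    result = []
    for m in TRANSID_RE.finditer(dmesg_text):
        item = TransidMismatch(
            m["device"], int(m["bytenr"]), int(m["wanted"]), int(m["found"])
        )
        if item not in seen:  # the kernel often repeats the same message
            seen.add(item)
            result.append(item)
    return result


if __name__ == "__main__":
    import sys
    # Usage: dmesg | python3 this_script.py
    for r in find_transid_mismatches(sys.stdin.read()):
        print(f"{r.device} bytenr={r.bytenr} "
              f"wanted={r.wanted} found={r.found} gap={r.gap}")
```

For the block in the log above (wanted 2265296, found 2263090) the gap
is 2206 generations.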