* errors found in extent allocation tree or chunk allocation after power failure
@ 2019-09-25 14:50 Pallissard, Matthew
  2019-09-25 19:08 ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Pallissard, Matthew @ 2019-09-25 14:50 UTC (permalink / raw)
  To: linux-btrfs



Hey all,

tl;dr Had a power failure while executing snapshot operations on a 6-drive HDD raid-10 configuration. I now have errors.  I'm willing to let this array be a guinea pig.


Not-quite-ancient hardware running in a basement lab: a Dell R710 with 6 HDDs.  It acts as a hypervisor; as such, the files on disk are few but large.
Mostly qcow2, a few raw images, and the XML that describes the VMs.  The VM images range from ~20GB to ~1TB.

I had two snapshots from June of 2019 that I was cleaning up.  After the deletions I went to take a fresh snapshot, running the commands in immediate succession.
While the snapshot command was running, the decade-plus-old UPS decided to fail.  The order of operations was as follows (a rough command sketch follows the list).

1. snapshot delete old0
2. snapshot delete old1
3. snapshot new
[power failure]
4. btrfsck
5. btrfs balance
6. btrfs del new
7. btrfsck (same errors as #4)
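
Roughly, in commands (the mountpoint and snapshot paths here are assumptions, not the exact ones used):

   # hypothetical paths; adjust to the real subvolume layout
   btrfs subvolume delete /mnt/array/old0               # 1
   btrfs subvolume delete /mnt/array/old1               # 2
   btrfs subvolume snapshot /mnt/array /mnt/array/new   # 3
   # -- power failure --
   btrfsck /dev/mapper/luks-a                           # 4 (read-only check)
   btrfs balance start /mnt/array                       # 5
   btrfs subvolume delete /mnt/array/new                # 6
   btrfsck /dev/mapper/luks-a                           # 7 (same errors as #4)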


I can mount the filesystem.  I have backups, but am attempting one more at the moment.
I plan on running a balance again after the backup completes, unless of course someone here tells me differently.
I'm willing to perform any potentially destructive operations if any devs here would find it helpful.


Version:
Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 GNU/Linux
Btrfs version: btrfs-progs v5.2

Info:

Sorry, I didn't run the commands specified on the list wiki verbatim; I captured this info before double-checking that.  I can go back later and gather additional info if need be.

1. There are 6 spinning-rust drives
2. Each drive is passed through a PERC RAID controller individually
3. Each drive is a LUKS volume
4. Btrfs exists on each device-mapped LUKS volume
5. There are no device errors
6. There is one stacktrace in dmesg that happened during the balance.
  6a. I have not seen a hung-task timeout on this array before.  I also haven't done many intensive I/O operations.
  6b. That's not to say it hasn't happened and I simply missed it.
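
In other words, each drive's stack is assembled roughly like this (device names and mountpoint are assumptions):

   # one LUKS mapping per physical disk behind the PERC; repeated for sdb..sdf
   cryptsetup open /dev/sda luks-a
   # btrfs raid10 across the six device-mapper nodes (presumably how it was created)
   mkfs.btrfs -d raid10 -m raid10 /dev/mapper/luks-a /dev/mapper/luks-b \
       /dev/mapper/luks-c /dev/mapper/luks-d /dev/mapper/luks-e /dev/mapper/luks-f
   # mounting any one member brings up the whole array
   mount /dev/mapper/luks-a /mnt/array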


> Overall:
>     Device size:                   5.45TiB
>     Device allocated:              3.30TiB
>     Device unallocated:            2.15TiB
>     Device missing:                  0.00B
>     Used:                          3.22TiB
>     Free (estimated):              1.12TiB      (min: 1.12TiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
>
> Data,RAID10: Size:1.65TiB, Used:1.61TiB
>    /dev/mapper/luks-a    562.00GiB
>    /dev/mapper/luks-b    562.00GiB
>    /dev/mapper/luks-c    562.00GiB
>    /dev/mapper/luks-d    562.00GiB
>    /dev/mapper/luks-e    562.00GiB
>    /dev/mapper/luks-f    562.00GiB
>
> Metadata,RAID10: Size:4.03GiB, Used:2.24GiB
>    /dev/mapper/luks-a      1.34GiB
>    /dev/mapper/luks-b      1.34GiB
>    /dev/mapper/luks-c      1.34GiB
>    /dev/mapper/luks-d      1.34GiB
>    /dev/mapper/luks-e      1.34GiB
>    /dev/mapper/luks-f      1.34GiB
>
> System,RAID10: Size:7.88MiB, Used:224.00KiB
>    /dev/mapper/luks-a      2.62MiB
>    /dev/mapper/luks-b      2.62MiB
>    /dev/mapper/luks-c      2.62MiB
>    /dev/mapper/luks-d      2.62MiB
>    /dev/mapper/luks-e      2.62MiB
>    /dev/mapper/luks-f      2.62MiB
>
> Unallocated:
>    /dev/mapper/luks-a    367.65GiB
>    /dev/mapper/luks-b    367.65GiB
>    /dev/mapper/luks-c    367.65GiB
>    /dev/mapper/luks-d    367.65GiB
>    /dev/mapper/luks-e    367.65GiB
>    /dev/mapper/luks-f    367.65GiB
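
(That report is what 'btrfs filesystem usage' prints; the mountpoint here is an assumption:)

   btrfs filesystem usage /mnt/array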


Error:
btrfsck output
> btrfsck /dev/mapper/luks-a

> Opening filesystem to check...
> Checking filesystem on /dev/mapper/luks-a
> UUID: 1f562f3d-5ba4-4bae-9099-2a4eb836e7f8
> [1/7] checking root items
> [2/7] checking extents
> ref mismatch on [1453984899072 524288] extent item 0, found 1
> data backref 1453984899072 root 5 owner 1468 offset 1910132736 num_refs 0 not found in extent tree
> incorrect local backref count on 1453984899072 root 5 owner 1468 offset 1910132736 found 1 wanted 0 back 0x556ab4725490
> backpointer mismatch on [1453984899072 524288]
> ERROR: errors found in extent allocation tree or chunk allocation
> [3/7] checking free space cache
> [4/7] checking fs roots
> root 5 inode 271 errors 2000, link count wrong
>         unresolved ref dir 261 index 4 namelen 22 name mirror-nginx-p01.qcow2 filetype 1 errors 0
> root 5 inode 272 errors 2000, link count wrong
>         unresolved ref dir 261 index 5 namelen 18 name bind-dns-p01.qcow2 filetype 1 errors 0
> root 5 inode 273 errors 2000, link count wrong
>         unresolved ref dir 261 index 6 namelen 18 name ovpn-vpn-p01.qcow2 filetype 1 errors 0
> root 5 inode 321 errors 2000, link count wrong
>         unresolved ref dir 261 index 7 namelen 18 name isc-dhcp-p01.qcow2 filetype 1 errors 0
> root 5 inode 425 errors 2000, link count wrong
>         unresolved ref dir 261 index 8 namelen 18 name afs-file-p01.qcow2 filetype 1 errors 0
> root 5 inode 564 errors 2000, link count wrong
>         unresolved ref dir 277 index 2 namelen 20 name plex-nginx-p01.qcow2 filetype 1 errors 0
> root 5 inode 607 errors 2000, link count wrong
>         unresolved ref dir 277 index 3 namelen 19 name linux-rtr-p01.qcow2 filetype 1 errors 0
> root 5 inode 1340 errors 2000, link count wrong
>         unresolved ref dir 261 index 106 namelen 17 name gen-nfs-p01.qcow2 filetype 1 errors 0
> root 5 inode 1453 errors 2000, link count wrong
>         unresolved ref dir 261 index 115 namelen 23 name gen-kube-cont-p03.qcow2 filetype 1 errors 0
> root 5 inode 1454 errors 2000, link count wrong
>         unresolved ref dir 261 index 116 namelen 25 name gen-master-cont-p01.qcow2 filetype 1 errors 0
> root 5 inode 1455 errors 2000, link count wrong
>         unresolved ref dir 261 index 117 namelen 23 name gen-kube-cont-p04.qcow2 filetype 1 errors 0
> root 5 inode 1458 errors 2000, link count wrong
>         unresolved ref dir 261 index 118 namelen 23 name gen-kube-cont-p01.qcow2 filetype 1 errors 0
> root 5 inode 1468 errors 3000, some csum missing, link count wrong
>         unresolved ref dir 261 index 119 namelen 23 name gen-kube-cont-p02.qcow2 filetype 1 errors 0
> root 5 inode 1499 errors 2000, link count wrong
>         unresolved ref dir 261 index 120 namelen 22 name gen-nfs-p01-nfs0.qcow2 filetype 1 errors 0
> root 5 inode 2456 errors 2100, file extent discount, link count wrong
> Found file extent holes:
>         start: 0, len: 7612334080
>         unresolved ref dir 261 index 131 namelen 27 name infra-vol-gluster-p01.qcow2 filetype 1 errors 0
> root 5 inode 2474 errors 2000, link count wrong
>         unresolved ref dir 261 index 133 namelen 31 name matt-headless-desktop-p01.qcow2 filetype 1 errors 0
> ERROR: errors found in fs roots
> found 1769310683136 bytes used, error(s) found
> total csum bytes: 990772952
> total tree bytes: 2456469504
> total fs tree bytes: 601128960
> total extent tree bytes: 614989824
> btree space waste bytes: 430394740
> file data blocks allocated: 19047078367232
>  referenced 1549807284224


Here is the output of:
> dmesg --ctime | grep -v audit | grep -i -B 10 -A 10 btrfs

I took the liberty of removing all of the duplicate messages, unrelated device enumeration at boot, audit logs, etc.
NOTE: I plugged in a USB drive and ran mkfs on it to capture this output.  (I think I trimmed all of that out successfully.)

> [   73.164024] cryptd: max_cpu_qlen set to 1000
> [   73.538329] BTRFS: device fsid 1f562f3d-5ba4-4bae-9099-2a4eb836e7f8 devid 1 transid 8156599 /dev/dm-0
> [   75.318334] BTRFS: device fsid 1f562f3d-5ba4-4bae-9099-2a4eb836e7f8 devid 2 transid 8156599 /dev/dm-1
> [   77.076659] BTRFS: device fsid 1f562f3d-5ba4-4bae-9099-2a4eb836e7f8 devid 3 transid 8156599 /dev/dm-2
> [   78.892821] BTRFS: device fsid 1f562f3d-5ba4-4bae-9099-2a4eb836e7f8 devid 4 transid 8156599 /dev/dm-3
> [   80.701286] BTRFS: device fsid 1f562f3d-5ba4-4bae-9099-2a4eb836e7f8 devid 5 transid 8156599 /dev/dm-4
> [   82.568686] BTRFS: device fsid 1f562f3d-5ba4-4bae-9099-2a4eb836e7f8 devid 6 transid 8156599 /dev/dm-5
> [   82.922537] BTRFS info (device dm-0): disk space caching is enabled
> [   82.922541] BTRFS info (device dm-0): has skinny extents
> [  184.375844] BTRFS: error (device dm-0) in btrfs_replay_log:2302: errno=-22 unknown (Failed to recover log tree)
> [  188.832121] BTRFS error (device dm-0): open_ctree failed
> [ 1223.523145] BTRFS info (device dm-0): allowing degraded mounts
> [ 1223.523148] BTRFS info (device dm-0): disk space caching is enabled
> [ 1223.523150] BTRFS info (device dm-0): has skinny extents
> [ 1228.714934] BTRFS info (device dm-0): checking UUID tree
> [ 1257.835784] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1257.836801] BTRFS warning (device dm-0): failed to load free space cache for block group 534753705984, rebuilding it now
> [ 1260.892300] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1260.893284] BTRFS warning (device dm-0): failed to load free space cache for block group 1171751043072, rebuilding it now
> [ 1261.543959] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1261.544959] BTRFS warning (device dm-0): failed to load free space cache for block group 1258724130816, rebuilding it now
> [ 1261.712144] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1261.712152] BTRFS warning (device dm-0): failed to load free space cache for block group 1281272709120, rebuilding it now
> [ 1262.083594] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1262.084552] BTRFS warning (device dm-0): failed to load free space cache for block group 1329591091200, rebuilding it now
> [ 1262.091133] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1262.092086] BTRFS warning (device dm-0): failed to load free space cache for block group 1332812316672, rebuilding it now
> [ 1262.137185] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1262.138036] BTRFS warning (device dm-0): failed to load free space cache for block group 1339254767616, rebuilding it now
> [ 1262.143510] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1262.144370] BTRFS warning (device dm-0): failed to load free space cache for block group 1342475993088, rebuilding it now
> [ 1262.159166] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1262.160013] BTRFS warning (device dm-0): failed to load free space cache for block group 1345697218560, rebuilding it now
> [ 1262.164526] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1262.165401] BTRFS warning (device dm-0): failed to load free space cache for block group 1348918444032, rebuilding it now
> [ 1262.178524] BTRFS warning (device dm-0): failed to load free space cache for block group 1352139669504, rebuilding it now
> [ 1262.194532] BTRFS warning (device dm-0): failed to load free space cache for block group 1355360894976, rebuilding it now
> [ 1262.210247] BTRFS warning (device dm-0): failed to load free space cache for block group 1358582120448, rebuilding it now
> [ 1262.217203] BTRFS warning (device dm-0): failed to load free space cache for block group 1361803345920, rebuilding it now
> [ 1262.226936] BTRFS warning (device dm-0): failed to load free space cache for block group 1365024571392, rebuilding it now
> [ 1262.283055] BTRFS warning (device dm-0): failed to load free space cache for block group 1371467022336, rebuilding it now
> [ 1262.300303] BTRFS warning (device dm-0): failed to load free space cache for block group 1374688247808, rebuilding it now
> [ 1263.555486] io_ctl_check_crc: 7 callbacks suppressed
> [ 1263.555489] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1263.556515] BTRFS warning (device dm-0): failed to load free space cache for block group 1825659813888, rebuilding it now
> [ 1264.241637] BTRFS info (device dm-0): balance: start -d -m -s
> [ 1264.242432] BTRFS info (device dm-0): relocating block group 2532215488512 flags data|raid10
> [ 1600.754725] BTRFS error (device dm-0): csum mismatch on free space cache
> [ 1600.755753] BTRFS warning (device dm-0): failed to load free space cache for block group 1474546237440, rebuilding it now


These repeat a lot:

> [ 1638.059938] BTRFS info (device dm-0): found 911 extents
> [ 1648.389965] BTRFS info (device dm-0): found 911 extents
> [ 1648.764284] BTRFS info (device dm-0): relocating block group 2528994263040 flags data|raid10

End repeat.


> [ 9127.793658] WARNING: CPU: 9 PID: 1294 at fs/btrfs/extent-tree.c:1573 lookup_inline_extent_backref+0x5fd/0x640 [btrfs]


It looks like my array took too long here:

> [ 9127.793660] Modules linked in: dm_crypt crypto_simd cryptd glue_helper aes_x86_64 algif_skcipher af_alg dm_mod fuse openafs(POE) 8021q garp mrp bridge stp llc bonding xt_tcpudp xt_state xt_conntrack iptable_filter xt_comment xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 msr iptable_mangle nls_iso8859_1 nls_cp437 vfat fat ipmi_ssif mgag200 intel_powerclamp i2c_algo_bit ttm coretemp kvm_intel drm_kms_helper drm kvm irqbypass agpgart syscopyarea sysfillrect intel_cstate sysimgblt intel_uncore fb_sys_fops ipmi_si iTCO_wdt bnx2 iTCO_vendor_support gpio_ich tpm_tis tpm_tis_core ipmi_devintf input_leds mousedev joydev lpc_ich ipmi_msghandler i7core_edac tpm evdev dcdbas rng_core pcspkr mac_hid acpi_power_meter sg ip_tables x_tables btrfs libcrc32c crc32c_generic xor raid6_pq hid_generic ses enclosure ata_generic sd_mod scsi_transport_sas pata_acpi usbhid hid uas usb_storage uhci_hcd ata_piix libata megaraid_sas crc32c_intel scsi_mod ehci_pci ehci_hcd
> [ 9127.793700] CPU: 9 PID: 1294 Comm: btrfs Tainted: P          IOE     5.2.2-arch1-1-ARCH #1
> [ 9127.793701] Hardware name: Dell Inc. PowerEdge R710/0YDJK3, BIOS 6.1.0 10/18/2011
> [ 9127.793718] RIP: 0010:lookup_inline_extent_backref+0x5fd/0x640 [btrfs]
> [ 9127.793720] Code: 8d 8d 8f 00 00 00 4c 39 ce 0f 83 64 fd ff ff 0f 0b 4c 8b 64 24 38 b8 8b ff ff ff e9 c9 fb ff ff 49 89 dc 31 db e9 c1 fc ff ff <0f> 0b b8 fb ff ff ff e9 b3 fb ff ff 80 7c 24 5e bf 0f 87 81 fe ff
> [ 9127.793721] RSP: 0018:ffffadd4c7bef710 EFLAGS: 00010202
> [ 9127.793723] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
> [ 9127.793723] RDX: 0000000000000001 RSI: ffffa2bfd043a540 RDI: ffffa2bf5d34a318
> [ 9127.793724] RBP: 0000000000080000 R08: 0000000000000000 R09: 0000000000000001
> [ 9127.793725] R10: 0000000000000045 R11: ffffa2bb00000000 R12: ffffa2bfd043a540
> [ 9127.793726] R13: ffffa2c160ef2958 R14: 000000000000000d R15: ffffa2c1640b9000
> [ 9127.793727] FS:  00007f7202fa58c0(0000) GS:ffffa2c167b00000(0000) knlGS:0000000000000000
> [ 9127.793728] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9127.793729] CR2: 000055768cb67552 CR3: 0000000c3c638000 CR4: 00000000000006e0
> [ 9127.793730] Call Trace:
> [ 9127.793752]  insert_inline_extent_backref+0x5a/0xe0 [btrfs]
> [ 9127.793768]  __btrfs_inc_extent_ref.isra.0+0xa0/0x2a0 [btrfs]
> [ 9127.793773]  ? _raw_spin_unlock+0x16/0x30
> [ 9127.793790]  __btrfs_run_delayed_refs+0x886/0x1080 [btrfs]
> [ 9127.793809]  btrfs_run_delayed_refs.part.0+0x4e/0x160 [btrfs]
> [ 9127.793828]  btrfs_commit_transaction+0xaa/0x980 [btrfs]
> [ 9127.793851]  prepare_to_merge+0x200/0x230 [btrfs]
> [ 9127.793873]  relocate_block_group+0x38c/0x620 [btrfs]
> [ 9127.793896]  btrfs_relocate_block_group+0x156/0x2f0 [btrfs]
> [ 9127.793917]  btrfs_relocate_chunk+0x31/0xa0 [btrfs]
> [ 9127.793938]  btrfs_balance+0x71a/0xee0 [btrfs]
> [ 9127.793961]  btrfs_ioctl_balance+0x292/0x340 [btrfs]
> [ 9127.793983]  btrfs_ioctl+0x84c/0x3060 [btrfs]
> [ 9127.793987]  ? preempt_count_add+0x79/0xb0
> [ 9127.793988]  ? _raw_spin_lock_irqsave+0x26/0x50
> [ 9127.793990]  ? up+0x12/0x60
> [ 9127.793991]  ? preempt_count_add+0x79/0xb0
> [ 9127.793992]  ? _raw_spin_lock+0x13/0x30
> [ 9127.793994]  ? _raw_spin_unlock+0x16/0x30
> [ 9127.793995]  ? preempt_count_add+0x79/0xb0
> [ 9127.793996]  ? _raw_spin_lock_irqsave+0x26/0x50
> [ 9127.793998]  ? _raw_spin_unlock_irqrestore+0x20/0x40
> [ 9127.793999]  ? __wake_up_common_lock+0x8d/0xc0
> --
> [ 9127.794018]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 9127.794020] RIP: 0033:0x7f720309b21b
> [ 9127.794022] Code: 0f 1e fa 48 8b 05 75 8c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 8c 0c 00 f7 d8 64 89 01 48
> [ 9127.794022] RSP: 002b:00007ffedae78488 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 9127.794024] RAX: ffffffffffffffda RBX: 00007ffedae784f0 RCX: 00007f720309b21b
> [ 9127.794025] RDX: 00007ffedae784f0 RSI: 00000000c4009420 RDI: 0000000000000004
> [ 9127.794025] RBP: 0000000000000004 R08: 0000000000000027 R09: 0000000000000002
> [ 9127.794026] R10: 00007ffedae78367 R11: 0000000000000246 R12: 000055ac4325c125
> [ 9127.794027] R13: 00007ffedae79dea R14: 0000000000000000 R15: 00007f72031656a8
> [ 9127.794030] ---[ end trace e6a280878e0f6ea2 ]---
> [ 9127.794034] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2907: errno=-5 IO failure
> [ 9127.794074] BTRFS info (device dm-0): forced readonly


That may have caused the balance to fail:
> [ 9127.795439] BTRFS info (device dm-0): balance: ended with status: -30


> [36269.471145] BTRFS info (device dm-0): disk space caching is enabled
> [36269.471148] BTRFS info (device dm-0): has skinny extents
> [36272.920637] BTRFS info (device dm-0): checking UUID tree
> [36273.905836] BTRFS info (device dm-0): balance: resume -dusage=90 -musage=90 -susage=90
> [36273.906920] BTRFS info (device dm-0): relocating block group 2943425052672 flags data|raid10
> [36274.096312] BTRFS info (device dm-0): relocating block group 2940203827200 flags data|raid10
> [36309.330874] BTRFS info (device dm-0): found 77 extents
> [36312.572905] BTRFS info (device dm-0): found 77 extents
> [36313.110274] BTRFS info (device dm-0): balance: paused
> [36689.402334] BTRFS info (device dm-0): disk space caching is enabled
> [36689.402337] BTRFS info (device dm-0): has skinny extents


But then it started again:

> [36692.424302] BTRFS info (device dm-0): balance: resume -dusage=90 -musage=90 -susage=90


More repeating:

> [36692.425104] BTRFS info (device dm-0): relocating block group 2943425052672 flags data|raid10
> [36723.945784] BTRFS info (device dm-0): found 77 extents
> [37049.457773] BTRFS info (device dm-0): found 77 extents

End repeating.

> [42819.594722] BTRFS info (device dm-0): no csum found for inode 1468 start 1910132736
> [42819.594731] BTRFS info (device dm-0): no csum found for inode 1468 start 1910136832
> [42819.594790] BTRFS info (device dm-0): no csum found for inode 1468 start 1910140928
> [42819.594795] BTRFS info (device dm-0): no csum found for inode 1468 start 1910145024
> [42819.594799] BTRFS info (device dm-0): no csum found for inode 1468 start 1910149120
> [42819.594803] BTRFS info (device dm-0): no csum found for inode 1468 start 1910153216
> [42819.594806] BTRFS info (device dm-0): no csum found for inode 1468 start 1910157312
> [42819.594810] BTRFS info (device dm-0): no csum found for inode 1468 start 1910161408
> [42819.594813] BTRFS info (device dm-0): no csum found for inode 1468 start 1910165504
> [42819.594817] BTRFS info (device dm-0): no csum found for inode 1468 start 1910169600
> [42819.602862] BTRFS warning (device dm-0): csum failed root 5 ino 1468 off 1910140928 csum 0x972b9901 expected csum 0x00000000 mirror 2
> [42819.602889] BTRFS warning (device dm-0): csum failed root 5 ino 1468 off 1910337536 csum 0x98f94189 expected csum 0x00000000 mirror 2
> [42819.602912] BTRFS warning (device dm-0): csum failed root 5 ino 1468 off 1910145024 csum 0x7722536c expected csum 0x00000000 mirror 2
> [42819.602920] BTRFS warning (device dm-0): csum failed root 5 ino 1468 off 1910341632 csum 0x98f94189 expected csum 0x00000000 mirror 2
> [42819.602935] BTRFS warning (device dm-0): csum failed root 5 ino 1468 off 1910345728 csum 0x98f94189 expected csum 0x00000000 mirror 2
> [42819.602953] BTRFS warning (device dm-0): csum failed root 5 ino 1468 off 1910149120 csum 0x642755f8 expected csum 0x00000000 mirror 2
> [42819.602955] BTRFS warning (device dm-0): csum failed root 5 ino 1468 off 1910349824 csum 0x98f94189 expected csum 0x00000000 mirror 2
> [42819.602967] BTRFS warning (device dm-0): csum failed root 5 ino 1468 off 1910353920 csum 0x98f94189 expected csum 0x00000000 mirror 2
> [42819.602981] BTRFS warning (device dm-0): csum failed root 5 ino 1468 off 1910358016 csum 0x98f94189 expected csum 0x00000000 mirror 2
> [42819.602998] BTRFS warning (device dm-0): csum failed root 5 ino 1468 off 1910153216 csum 0xc8eac75b expected csum 0x00000000 mirror 2



Matt Pallissard


* Re: errors found in extent allocation tree or chunk allocation after power failure
  2019-09-25 14:50 errors found in extent allocation tree or chunk allocation after power failure Pallissard, Matthew
@ 2019-09-25 19:08 ` Chris Murphy
  2019-09-25 19:34   ` Pallissard, Matthew
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2019-09-25 19:08 UTC (permalink / raw)
  To: Pallissard, Matthew; +Cc: Btrfs BTRFS

On Wed, Sep 25, 2019 at 8:50 AM Pallissard, Matthew <matt@pallissard.net> wrote:
>
> Version:
> Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 GNU/Linux

You need to upgrade to arch kernel 5.2.14 or newer (they backported
the fix first appearing in stable 5.2.15). Or you need to downgrade to
5.1 series.
https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdmanana@kernel.org/T/#u

That's a nasty bug. I don't offhand see evidence that you've hit this
bug. But I'm not certain. So first thing should be to use a different
kernel.

Next, anytime there is a crash or power failure with Btrfs raid56, you
need to do a complete scrub of the volume. It will obviously take time,
but that's what needs to be done first.

OK actually, before the scrub you need to confirm that each drive's
SCT ERC time is *less* than the kernel's SCSI command timer. e.g.

# smartctl -l scterc /dev/sda
# cat /sys/block/sda/device/timeout

The SCT ERC value is in deciseconds, so convert to seconds. The second
value is in seconds. The first value must be shorter. By default the
kernel's command timer per device is 30 seconds; typical consumer
drives take much longer. So depending on the reply from your drive for
that smart command, you might either change the drive timer or the
SCSI command timer - or it might actually be perfect. NAS-specific
drives and nearline and SAS all tend to have short SCT ERC by default,
around 7 seconds. That's fine.

Note that the smart command is transient: when the drive powers off it
goes back to a default. And on reboot, the kernel's command timer also
resets.
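
Concretely, checking and - if needed - adjusting both values looks
roughly like this; the 7-second and 180-second numbers and the device
name are placeholders to adapt:

   # read the drive's error-recovery limit (reported in deciseconds)
   smartctl -l scterc /dev/sda
   # cap recovery at 7.0 seconds (70 deciseconds) for reads and writes
   smartctl -l scterc,70,70 /dev/sda
   # if the drive doesn't support SCT ERC, raise the kernel's timer instead
   echo 180 > /sys/block/sda/device/timeout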


-- 
Chris Murphy


* Re: errors found in extent allocation tree or chunk allocation after power failure
  2019-09-25 19:08 ` Chris Murphy
@ 2019-09-25 19:34   ` Pallissard, Matthew
  2019-09-25 21:05     ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Pallissard, Matthew @ 2019-09-25 19:34 UTC (permalink / raw)
  To: linux-btrfs



Chris,

Thank you for your reply.  Responses in-line.

On 2019-09-25T13:08:34, Chris Murphy wrote:
> On Wed, Sep 25, 2019 at 8:50 AM Pallissard, Matthew <matt@pallissard.net> wrote:
> >
> > Version:
> > Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 GNU/Linux
>
> You need to upgrade to arch kernel 5.2.14 or newer (they backported the fix first appearing in stable 5.2.15). Or you need to downgrade to 5.1 series.
> https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdmanana@kernel.org/T/#u
>
> That's a nasty bug. I don't offhand see evidence that you've hit this bug. But I'm not certain. So first thing should be to use a different kernel.

Interesting, I'll go ahead with a kernel upgrade as that's easy enough.

However, that looks like it's related to a stacktrace regarding a hung process, which is not the original problem I had.

Based on the output in my previous email, I've been working under the assumption that there is a problem on-disk.  Is that not correct?


> Next, anytime there is a crash or power failure with Btrfs raid56, you need to do a complete scrub of the volume. It will obviously take time, but that's what needs to be done first.

I'm using raid 10, not 5 or 6.

> OK actually, before the scrub you need to confirm that each drive's SCT ERC time is *less* than the kernel's SCSI command timer. e.g.

I gather that I should probably do this before any scrub, be it raid 5, 6, or 10.  But is a scrub the operation I should attempt on this raid 10 array to repair the specific errors mentioned in my previous email?

Thanks again.

Matt Pallissard


* Re: errors found in extent allocation tree or chunk allocation after power failure
  2019-09-25 19:34   ` Pallissard, Matthew
@ 2019-09-25 21:05     ` Chris Murphy
  2019-09-25 21:32       ` Pallissard, Matthew
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2019-09-25 21:05 UTC (permalink / raw)
  To: Pallissard, Matthew; +Cc: Btrfs BTRFS, Qu Wenruo

On Wed, Sep 25, 2019 at 1:34 PM Pallissard, Matthew <matt@pallissard.net> wrote:
>
>
> Chris,
>
> Thank you for your reply.  Responses in-line.
>
> On 2019-09-25T13:08:34, Chris Murphy wrote:
> > On Wed, Sep 25, 2019 at 8:50 AM Pallissard, Matthew <matt@pallissard.net> wrote:
> > >
> > > Version:
> > > Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 GNU/Linux
> >
> > You need to upgrade to arch kernel 5.2.14 or newer (they backported the fix first appearing in stable 5.2.15). Or you need to downgrade to 5.1 series.
> > https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdmanana@kernel.org/T/#u
> >
> > That's a nasty bug. I don't offhand see evidence that you've hit this bug. But I'm not certain. So first thing should be to use a different kernel.
>
> Interesting, I'll go ahead with a kernel upgrade as that's easy enough.
>
> However, that looks like it's related to a stacktrace regarding a hung process, which is not the original problem I had.
>
> Based on the output in my previous email, I've been working under the assumption that there is a problem on-disk.  Is that not correct?

That bug does cause filesystem corruption that is not repairable.
Whether you have that problem or a different problem, I'm not sure.
But it's best to avoid combining problems.

The file system mounts rw now? Or still only mounts ro?

I think most of the errors reported by btrfs check, if they still
exist after doing a scrub, should be repaired by 'btrfs check
--repair', but I don't advise that until later. I'm not a developer;
maybe Qu can offer some advice on those errors.


> > Next, anytime there is a crash or power failure with Btrfs raid56, you need to do a complete scrub of the volume. It will obviously take time, but that's what needs to be done first.
>
> I'm using raid 10, not 5 or 6.

Same advice, but it's not as important to raid10 because it doesn't
have the write hole problem.


> > OK actually, before the scrub you need to confirm that each drive's SCT ERC time is *less* than the kernel's SCSI command timer. e.g.
>
> I gather that I should probably do this before any scrub, be it raid 5, 6, or 10.  But is a scrub the operation I should attempt on this raid 10 array to repair the specific errors mentioned in my previous email?
>

Definitely deal with the timing issue first. If by chance there are
bad sectors on any of the drives, they must be properly reported by
the drive with a discrete read error in order for Btrfs to do a proper
fixup. If the times are mismatched, then Linux can get tired waiting,
and do a link reset on the drive before the read error happens. And
now the whole command queue is lost and the problem isn't fixed.

There are myriad errors and the advice I'm giving to scrub is a safe
first step to make sure the storage stack is sane - or at least we
know where the simpler problems are. And then move to the less simple
ones that have higher risk.  It also changes the volume the least.
Everything else, like balance and chunk recover and btrfs check
--repair - all make substantial changes to the file system and have
higher risk of making things worse.
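
For reference, the scrub itself is just something like this on the
mounted filesystem (mountpoint is an assumption):

   # -B stays in the foreground, -d prints per-device stats when done
   btrfs scrub start -Bd /mnt/array
   # if backgrounded, progress can be watched from another shell
   btrfs scrub status /mnt/array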

In theory if the storage stack does exactly what Btrfs says, then at
worst you should lose some data, but the file system itself should be
consistent. And that includes power failures. The fact there's
problems reported suggests a bug somewhere - it could be Btrfs, it
could be device mapper, it could be controller or drive firmware.

--
Chris Murphy


* Re: errors found in extent allocation tree or chunk allocation after power failure
  2019-09-25 21:05     ` Chris Murphy
@ 2019-09-25 21:32       ` Pallissard, Matthew
  2019-09-25 21:56         ` Chris Murphy
  2019-09-28  0:01         ` [UNRESOLVED] " Pallissard, Matthew
  0 siblings, 2 replies; 9+ messages in thread
From: Pallissard, Matthew @ 2019-09-25 21:32 UTC (permalink / raw)
  To: Btrfs BTRFS


On 2019-09-25T15:05:44, Chris Murphy wrote:
> On Wed, Sep 25, 2019 at 1:34 PM Pallissard, Matthew <matt@pallissard.net> wrote:
> > On 2019-09-25T13:08:34, Chris Murphy wrote:
> > > On Wed, Sep 25, 2019 at 8:50 AM Pallissard, Matthew <matt@pallissard.net> wrote:
> > > >
> > > > Version:
> > > > Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 GNU/Linux
> > >
> > > You need to upgrade to arch kernel 5.2.14 or newer (they backported the fix first appearing in stable 5.2.15). Or you need to downgrade to 5.1 series.
> > > https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdmanana@kernel.org/T/#u
> > >
> > > That's a nasty bug. I don't offhand see evidence that you've hit this bug. But I'm not certain. So first thing should be to use a different kernel.
> >
> > Interesting, I'll go ahead with a kernel upgrade as that's easy enough.
> > However, that looks like it's related to a stacktrace regarding a hung process, which is not the original problem I had.
> > Based on the output in my previous email, I've been working under the assumption that there is a problem on-disk.  Is that not correct?
>
> That bug does cause filesystem corruption that is not repairable.
> Whether you have that problem or a different problem, I'm not sure.
> But it's best to avoid combining problems.
>
> The file system mounts rw now? Or still only mounts ro?

It mounts RW, but I have yet to attempt an actual write.


> I think most of the errors reported by btrfs check, if they still exist after doing a scrub, should be repaired by 'btrfs check --repair', but I don't advise that until later. I'm not a developer; maybe Qu can offer some advice on those errors.


> > > Next, anytime there is a crash or power failure with Btrfs raid56, you need to do a complete scrub of the volume. It will obviously take time, but that's what needs to be done first.
> >
> > I'm using raid 10, not 5 or 6.
>
> Same advice, but it's not as important to raid10 because it doesn't have the write hole problem.


> > > OK actually, before the scrub you need to confirm that each drive's SCT ERC time is *less* than the kernel's SCSI command timer. e.g.
> >
> > I gather that I should probably do this before any scrub, be it raid 5, 6, or 10.  But is a scrub the operation I should attempt on this raid 10 array to repair the specific errors mentioned in my previous email?
>
> Definitely deal with the timing issue first. If by chance there are bad sectors on any of the drives, they must be properly reported by the drive with a discrete read error in order for Btrfs to do a proper fixup. If the times are mismatched, then Linux can get tired waiting, and do a link reset on the drive before the read error happens. And now the whole command queue is lost and the problem isn't fixed.

Good to know, that seems like a critical piece of information.  A few searches turned up this page, https://wiki.debian.org/Btrfs#FAQ.

Should this be noted on the 'gotchas' or 'getting started' page as well?  I'd be happy to make edits should the powers that be allow it.


> There are myriad errors and the advice I'm giving to scrub is a safe first step to make sure the storage stack is sane - or at least we know where the simpler problems are. And then move to the less simple ones that have higher risk.  It also changes the volume the least. Everything else, like balance and chunk recover and btrfs check --repair - all make substantial changes to the file system and have higher risk of making things worse.

This sounds sensible.


> In theory if the storage stack does exactly what Btrfs says, then at worst you should lose some data, but the file system itself should be consistent. And that includes power failures. The fact there's problems reported suggests a bug somewhere - it could be Btrfs, it could be device mapper, it could be controller or drive firmware.

I'll go ahead with a kernel upgrade/make sure the timing issues are squared away.  Then I'll kick off a scrub.

I'll report back when the scrub is complete or something interesting happens.  Whichever comes first.

Thanks again.


Matt Pallissard


* Re: errors found in extent allocation tree or chunk allocation after power failure
  2019-09-25 21:32       ` Pallissard, Matthew
@ 2019-09-25 21:56         ` Chris Murphy
  2019-09-25 22:03           ` Pallissard, Matthew
  2019-09-28  0:01         ` [UNRESOLVED] " Pallissard, Matthew
  1 sibling, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2019-09-25 21:56 UTC (permalink / raw)
  To: Pallissard, Matthew; +Cc: Btrfs BTRFS

On Wed, Sep 25, 2019 at 3:32 PM Pallissard, Matthew <matt@pallissard.net> wrote:
>
> On 2019-09-25T15:05:44, Chris Murphy wrote:
> > Definitely deal with the timing issue first. If by chance there are bad sectors on any of the drives, they must be properly reported by the drive with a discrete read error in order for Btrfs to do a proper fixup. If the times are mismatched, then Linux can get tired waiting, and do a link reset on the drive before the read error happens. And now the whole command queue is lost and the problem isn't fixed.
>
> Good to know, that seems like a critical piece of information.  A few searches turned up this page, https://wiki.debian.org/Btrfs#FAQ.
>
> Should this be noted on the 'gotchas' or 'getting started' page as well?  I'd be happy to make edits should the powers that be allow it.

Should what be noted as a gotcha? The timing stuff? That's not Btrfs
specific. It's just a default that's become shitty because of the
crazy amount of time consumer drives can take doing "deep recovery"
for bad sectors, which can exceed a minute. It's incredible how slow
that is and how many attempts are being made. But I guess on rare
occasion this does cause a recovery, while also making your computer
slow as balls. Anyway, this 30 second timer is obsolete, but kernel
developers so far refuse to change it, arguing every distribution that
cares about desktop users, and users who use consumer drives for data
storage, should change the timer default for their users using a udev
rule. Except no distro I know of does that. This affects everyone with
consumer drives that have deep recoveries, most common with hard
drives. But it's especially negative on large data storage stacks
using any kind of RAID. You'll find this problem all over the
linux-raid@ archive; it comes up all the time. Still.

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
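
The usual fix is a udev rule along these lines - just a sketch; the
file path and the 180-second value are assumptions to tune:

   # /etc/udev/rules.d/60-scsi-timeout.rules - raise the SCSI command timer
   cat > /etc/udev/rules.d/60-scsi-timeout.rules <<'EOF'
   ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"
   EOF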

-- 
Chris Murphy


* Re: errors found in extent allocation tree or chunk allocation after power failure
  2019-09-25 21:56         ` Chris Murphy
@ 2019-09-25 22:03           ` Pallissard, Matthew
  0 siblings, 0 replies; 9+ messages in thread
From: Pallissard, Matthew @ 2019-09-25 22:03 UTC (permalink / raw)
  To: linux-btrfs


On 2019-09-25T15:56:54, Chris Murphy wrote:
> On Wed, Sep 25, 2019 at 3:32 PM Pallissard, Matthew <matt@pallissard.net> wrote:
> >
> > On 2019-09-25T15:05:44, Chris Murphy wrote:
> > > Definitely deal with the timing issue first. If by chance there are bad sectors on any of the drives, they must be properly reported by the drive with a discrete read error in order for Btrfs to do a proper fixup. If the times are mismatched, then Linux can get tired waiting, and do a link reset on the drive before the read error happens. And now the whole command queue is lost and the problem isn't fixed.
> >
> > Good to know, that seems like a critical piece of information.  A few searches turned up this page, https://wiki.debian.org/Btrfs#FAQ.
> >
> > Should this be noted on the 'gotchas' or 'getting started' page as well?  I'd be happy to make edits should the powers that be allow it.
> 
> Should what be noted as a gotcha? The timing stuff? That's not Btrfs
> specific. It's just a default that's become shitty because of the
> crazy amount of time consumer drives can take doing "deep recovery"
> for bad sectors, which can exceed a minute. It's incredible how slow
> that is and how many attempts are being made. But I guess on rare
> occasion this does cause a recovery, while also making your computer
> slow as balls. Anyway, this 30 second timer is obsolete, but kernel
> developers so far refuse to change it, arguing every distribution that
> cares about desktop users, and users who use consumer drives for data
> storage, should change the timer default for their users using a udev
> rule. Except no distro I know of does that. This affects everyone with
> consumer drives that have deep recoveries, most common with hard
> drives. But it's especially negative on large data storage stacks
> using any kind of RAID. You'll find this problem all over the
> linux-raid@ archive; it comes up all the time. Still.
> 
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

Thanks for the link.

Also, reading that made me smile.  I appreciate your perspective.


Matt Pallissard


* [UNRESOLVED] Re: errors found in extent allocation tree or chunk allocation after power failure
  2019-09-25 21:32       ` Pallissard, Matthew
  2019-09-25 21:56         ` Chris Murphy
@ 2019-09-28  0:01         ` Pallissard, Matthew
  2019-09-28  0:03           ` Pallissard, Matthew
  1 sibling, 1 reply; 9+ messages in thread
From: Pallissard, Matthew @ 2019-09-28  0:01 UTC (permalink / raw)
  To: Btrfs BTRFS



On 2019-09-25T14:32:31, Pallissard, Matthew wrote:
> On 2019-09-25T15:05:44, Chris Murphy wrote:
> > On Wed, Sep 25, 2019 at 1:34 PM Pallissard, Matthew <matt@pallissard.net> wrote:
> > > On 2019-09-25T13:08:34, Chris Murphy wrote:
> > > > On Wed, Sep 25, 2019 at 8:50 AM Pallissard, Matthew <matt@pallissard.net> wrote:
> > > > >
> > > > > Version:
> > > > > Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 GNU/Linux
> > > >
> > > > You need to upgrade to arch kernel 5.2.14 or newer (they backported the fix first appearing in stable 5.2.15). Or you need to downgrade to 5.1 series.
> > > > https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdmanana@kernel.org/T/#u
> > > >
> > > > That's a nasty bug. I don't offhand see evidence that you've hit this bug. But I'm not certain. So first thing should be to use a different kernel.
> > >
> > > Interesting, I'll go ahead with a kernel upgrade as that's easy enough.
> > > However, that looks like it's related to a stacktrace regarding a hung process, which is not the original problem I had.
> > > Based on the output in my previous email, I've been working under the assumption that there is a problem on-disk.  Is that not correct?
> >
> > That bug does cause filesystem corruption that is not repairable.
> > Whether you have that problem or a different problem, I'm not sure.
> > But it's best to avoid combining problems.
> >
> > The file system mounts rw now? Or still only mounts ro?
> 
> It mounts RW, but I have yet to attempt an actual write.
> 
> 
> > I think most of the errors reported by btrfs check, if they still exist after doing a scrub, should be repaired by 'btrfs check --repair', but I don't advise that until later. I'm not a developer; maybe Qu can offer some advice on those errors.
> 
> 
> > > > Next, anytime there is a crash or power failure with Btrfs raid56, you need to do a complete scrub of the volume. It will obviously take time, but that's what needs to be done first.
> > >
> > > I'm using raid 10, not 5 or 6.
> >
> > Same advice, but it's not as important to raid10 because it doesn't have the write hole problem.
> 
> 
> > > > OK actually, before the scrub you need to confirm that each drive's SCT ERC time is *less* than the kernel's SCSI command timer. e.g.
> > >
> > > I gather that I should probably do this before any scrub, be it raid 5, 6, or 10.  But is a scrub the operation I should attempt on this raid 10 array to repair the specific errors mentioned in my previous email?
> >
> > Definitely deal with the timing issue first. If by chance there are bad sectors on any of the drives, they must be properly reported by the drive with a discrete read error in order for Btrfs to do a proper fixup. If the times are mismatched, then Linux can get tired waiting, and do a link reset on the drive before the read error happens. And now the whole command queue is lost and the problem isn't fixed.
> 
> Good to know, that seems like a critical piece of information.  A few searches turned up this page, https://wiki.debian.org/Btrfs#FAQ.
> 
> Should this be noted on the 'gotchas' or 'getting started' page as well?  I'd be happy to make edits should the powers that be allow it.
> 
> 
> > There are myriad errors and the advice I'm giving to scrub is a safe first step to make sure the storage stack is sane - or at least we know where the simpler problems are. And then move to the less simple ones that have higher risk.  It also changes the volume the least. Everything else, like balance and chunk recover and btrfs check --repair - all make substantial changes to the file system and have higher risk of making things worse.
> 
> This sounds sensible.
> 
> 
> > In theory if the storage stack does exactly what Btrfs says, then at worst you should lose some data, but the file system itself should be consistent. And that includes power failures. The fact there's problems reported suggests a bug somewhere - it could be Btrfs, it could be device mapper, it could be controller or drive firmware.
> 
> I'll go ahead with a kernel upgrade/make sure the timing issues are squared away.  Then I'll kick off a scrub.
> 
> I'll report back when the scrub is complete or something interesting happens.  Whichever comes first.

As a followup:
1. I took care of the timing issues.
2. Ran a scrub.
3. Ran a balance; it kept failing with about 20% left.
  - stacktraces in dmesg showed spinlock stuff

4. Got I/O errors on one file during my final backup.
  - post-backup hashsums of everything else checked out
  - the errors during the copy were csum mismatches, should anyone care

5. Ran a bunch of potentially disruptive btrfs check commands in alphabetical order because "why not at this point?" (see the sketch after this list)
  - they had zero effect as far as I can tell; all the same files were readable and the btrfs check errors looked identical (admittedly I didn't put them side by side)

6. Re-provisioned the array and restored from backups.
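
For reference, the destructive knobs available in btrfs-progs v5.2 look roughly like this; treat it as a sketch, not a transcript of exactly what I ran:

   # all of these rewrite metadata - unmounted filesystem and backups first
   btrfs check --repair /dev/mapper/luks-a            # general repair
   btrfs check --init-csum-tree /dev/mapper/luks-a    # rebuild the checksum tree
   btrfs check --init-extent-tree /dev/mapper/luks-a  # rebuild the extent tree
   btrfs rescue chunk-recover /dev/mapper/luks-a      # rebuild the chunk tree
   btrfs rescue zero-log /dev/mapper/luks-a           # discard the log tree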

As I thought about it, it may not have been an issue with the original power outage.  I only ran a check after the power outage.  My array could have had an issue due to a previous bug; I was on a 5.2.x kernel for several weeks under high load.  Anyway, there are enough unknowns to make a root cause analysis not worth my time.

Marking this as unresolved for folks in the future who may be looking for answers.

Matt Pallissard


* Re: [UNRESOLVED] Re: errors found in extent allocation tree or chunk allocation after power failure
  2019-09-28  0:01         ` [UNRESOLVED] " Pallissard, Matthew
@ 2019-09-28  0:03           ` Pallissard, Matthew
  0 siblings, 0 replies; 9+ messages in thread
From: Pallissard, Matthew @ 2019-09-28  0:03 UTC (permalink / raw)
  To: Btrfs BTRFS



On 2019-09-27T17:01:27, Pallissard, Matthew wrote:
> 
> On 2019-09-25T14:32:31, Pallissard, Matthew wrote:
> > On 2019-09-25T15:05:44, Chris Murphy wrote:
> > > On Wed, Sep 25, 2019 at 1:34 PM Pallissard, Matthew <matt@pallissard.net> wrote:
> > > > On 2019-09-25T13:08:34, Chris Murphy wrote:
> > > > > On Wed, Sep 25, 2019 at 8:50 AM Pallissard, Matthew <matt@pallissard.net> wrote:
> > > > > >
> > > > > > Version:
> > > > > > Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 GNU/Linux
> > > > >
> > > > > You need to upgrade to arch kernel 5.2.14 or newer (they backported the fix first appearing in stable 5.2.15). Or you need to downgrade to 5.1 series.
> > > > > https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdmanana@kernel.org/T/#u
> > > > >
> > > > > That's a nasty bug. I don't offhand see evidence that you've hit this bug. But I'm not certain. So first thing should be to use a different kernel.
> > > >
> > > > Interesting, I'll go ahead with a kernel upgrade as that's easy enough.
> > > > However, that looks like it's related to a stacktrace regarding a hung process, which is not the original problem I had.
> > > > Based on the output in my previous email, I've been working under the assumption that there is a problem on-disk.  Is that not correct?
> > >
> > > That bug does cause filesystem corruption that is not repairable.
> > > Whether you have that problem or a different problem, I'm not sure.
> > > But it's best to avoid combining problems.
> > >
> > > The file system mounts rw now? Or still only mounts ro?
> > 
> > It mounts RW, but I have yet to attempt an actual write.
> > 
> > 
> > > I think most of the errors reported by btrfs check, if they still exist after doing a scrub, should be repaired by 'btrfs check --repair', but I don't advise that until later. I'm not a developer; maybe Qu can offer some advice on those errors.
> > 
> > 
> > > > > Next, anytime there is a crash or power failure with Btrfs raid56, you need to do a complete scrub of the volume. It will obviously take time, but that's what needs to be done first.
> > > >
> > > > I'm using raid 10, not 5 or 6.
> > >
> > > Same advice, but it's not as important to raid10 because it doesn't have the write hole problem.
> > 
> > 
> > > > > OK actually, before the scrub you need to confirm that each drive's SCT ERC time is *less* than the kernel's SCSI command timer. e.g.
> > > >
> > > > I gather that I should probably do this before any scrub, be it raid 5, 6, or 10.  But is a scrub the operation I should attempt on this raid 10 array to repair the specific errors mentioned in my previous email?
> > >
> > > Definitely deal with the timing issue first. If by chance there are bad sectors on any of the drives, they must be properly reported by the drive with a discrete read error in order for Btrfs to do a proper fixup. If the times are mismatched, then Linux can get tired waiting, and do a link reset on the drive before the read error happens. And now the whole command queue is lost and the problem isn't fixed.
> > 
> > Good to know, that seems like a critical piece of information.  A few searches turned up this page, https://wiki.debian.org/Btrfs#FAQ.
> > 
> > Should this be noted on the 'gotchas' or 'getting started' page as well?  I'd be happy to make edits should the powers that be allow it.
> > 
> > 
> > > There are myriad errors and the advice I'm giving to scrub is a safe first step to make sure the storage stack is sane - or at least we know where the simpler problems are. And then move to the less simple ones that have higher risk.  It also changes the volume the least. Everything else, like balance and chunk recover and btrfs check --repair - all make substantial changes to the file system and have higher risk of making things worse.
> > 
> > This sounds sensible.
> > 
> > 
> > > In theory if the storage stack does exactly what Btrfs says, then at worst you should lose some data, but the file system itself should be consistent. And that includes power failures. The fact there's problems reported suggests a bug somewhere - it could be Btrfs, it could be device mapper, it could be controller or drive firmware.
> > 
> > I'll go ahead with a kernel upgrade/make sure the timing issues are squared away.  Then I'll kick off a scrub.
> > 
> > I'll report back when the scrub is complete or something interesting happens.  Whichever comes first.
> 
> As a followup:
> 1. I took care of the timing issues.
> 2. Ran a scrub.
> 3. Ran a balance; it kept failing with about 20% left.
>   - stacktraces in dmesg showed spinlock stuff
> 
> 4. Got I/O errors on one file during my final backup.
>   - post-backup hashsums of everything else checked out
>   - the errors during the copy were csum mismatches, should anyone care
> 
> 5. Ran a bunch of potentially disruptive btrfs check commands in alphabetical order because "why not at this point?"
>   - they had zero effect as far as I can tell; all the same files were readable and the btrfs check errors looked identical (admittedly I didn't put them side by side)
> 
> 6. Re-provisioned the array and restored from backups.
> 
> As I thought about it, it may not have been an issue with the original power outage.  I only ran a check after the power outage.  My array could have had an issue due to a previous bug; I was on a 5.2.x kernel for several weeks under high load.  Anyway, there are enough unknowns to make a root cause analysis not worth my time.
> 
> Marking this as unresolved for folks in the future who may be looking for answers.
> 

Man, I should have read that over one more time for typos. Oh well.

Matt Pallissard

