On 2020/11/26 下午6:53, Joe Hermaszewski wrote: > Hi Qu, > > This is the output of btrfs check with the device unmounted: > > ``` > [root@nixos:~]# btrfs check --readonly /dev/sda1 > Opening filesystem to check... > Checking filesystem on /dev/sda1 > UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124 > [1/7] checking root items > [2/7] checking extents > [3/7] checking free space cache > [4/7] checking fs roots > root 294 inode 24665 errors 100, file extent discount > Found file extent holes: > start: 3709534208, len: 163840 > root 294 inode 406548 errors 100, file extent discount > Found file extent holes: > start: 0, len: 99450880 > ERROR: errors found in fs roots > found 11279251902464 bytes used, error(s) found > total csum bytes: 10935906504 > total tree bytes: 18455740416 > total fs tree bytes: 5798936576 > total extent tree bytes: 532250624 > btree space waste bytes: 2308849707 > file data blocks allocated: 17243418611712 > referenced 14210256932864 > ``` > > Looks the same as before... Then it means, at least your metadata is pretty safe. Nothing really needs to be worried (so far). This also means, the problem only happens at runtime, not something on-disk. If you are OK to re-compile the kernel, and the warning message is always the same at line number 531 of disk-io.c, then I recommend the attached debug diff The diff will output the found_start (from eb) and page_start (from page). If it's memory bitflip, we should have a pretty signature diff at the output. If not memory bitflip, then we really have a bug to dig. > > I'm afraid I don't have the precise details of the transid error > saved. I think I meant to write earlier that the message that comes up > now and then is the "errno=-5 IO failure (Error while writing out > transaction)". I've not seen the transid mismatch one since the first > incident, sorry for that slip-up. > > I'll try and attach to a known good system, will I have to attach the > whole array because I don't think I have that many enclosures for the > drives. You only need to mount the fs you're seeing the problem. But if that fs is on multiple devices, then I guess you need to mount them all. Thanks, Qu > > I'll also run a memtest on this box. > > Thank you for the helpful response. > > Best, > Joe > > On Thu, Nov 26, 2020 at 2:15 PM Qu Wenruo wrote: >> >> >> >> On 2020/11/25 下午11:28, Joe Hermaszewski wrote: >>> Hi, >>> >>> I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1. >>> The filesystem has started behaving badly recently and I'm writing to: >>> >>> - Solicit advice on how best to get the system back to a stable state >>> - Report a potential bug >>> >>> ## What happened: >>> >>> A couple of days ago I could no longer ssh into it, and on the serial >>> connection there were heaps of messages (and new ones appearing with great >>> frequency) along the lines of: `parent transid verify failed on blah... wanted >>> x got y`. >>> >>> Although I don't have a record of the precise messages I do remember that there >>> was a difference of `15` between x and y. >> >> This normally can be a sign of unreliable HDD, which lies on FLUSH, >> killing metadata COW. >> >> But, your btrfs check doesn't report the same problem, thus I'm confused. >> >> Would you please try to run a "btrfs check --readonly /dev/sda1" with >> the fs unmounted? >> >> And, would you provide the full dmesg of that mismatch? >> The reason for the exact number is, I'm suspecting hardware memory >> corruption. >> >>> >>> I power-cycled system and started a scrub after it rebooted, this was >>> interrupted quite promptly by several more errors in btrfs, and the disk >>> remounted RO. >>> >>> Every now and then in the kernel log I get messages like: >>> >>> `parent transid verify failed on blah... wanted x got y` >> >> Not showing up in the gist dmesg though. >> >>> >>> ## Important info >>> >>> The dev stats are all zero. >>> >>> Here are the outputs of some btrfs commands, dmesg and the kernel log from the >>> previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9 >>> >>> The "cut here" portion of the kernel log is as follows >>> >>> ``` >>> [ 409.158097] ------------[ cut here ]------------ >>> [ 409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531 >>> btree_csum_one_bio+0x208/0x248 [btrfs] >> >> The line number shows, the tree bytenr doesn't match with the one in memory. >> This is really too rare to be seen, especially when we have no other >> error reported from btrfs (at least in the gist) >> >> Since there is no other problems showing up in the gist, it means it >> could be a bit flip, considering the magic generation gap you mention is >> 15, I'm more suspicious about memory bit flip. >> >> If you can provide the full parent transid mismatch error message, it >> may help to determine the possibility of hardware memory corruption. >> >> Thanks, >> Qu >> >>> [ 409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat >>> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 >>> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw >>> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp >>> ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter >>> ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp >>> llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor >>> raid6_pq >>> [ 409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted 5.4.77 #1-NixOS >>> [ 409.158260] Hardware name: Marvell Armada 380/385 (Device Tree) >>> [ 409.158261] Backtrace: >>> [ 409.158272] [] (dump_backtrace) from [] >>> (show_stack+0x20/0x24) >>> [ 409.158277] r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044 >>> [ 409.158283] [] (show_stack) from [] >>> (dump_stack+0x98/0xac) >>> [ 409.158288] [] (dump_stack) from [] (__warn+0xe0/0x108) >>> [ 409.158292] r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990 >>> [ 409.158296] [] (__warn) from [] >>> (warn_slowpath_fmt+0x74/0xc4) >>> [ 409.158300] r7:00000213 r6:bf120990 r5:00000000 r4:e2392000 >>> [ 409.158358] [] (warn_slowpath_fmt) from [] >>> (btree_csum_one_bio+0x208/0x248 [btrfs]) >>> [ 409.158363] r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0 >>> r5:00000000 r4:eefd2d3c >>> [ 409.158465] [] (btree_csum_one_bio [btrfs]) from >>> [] (btree_submit_bio_hook+0xe8/0x100 [btrfs]) >>> [ 409.158470] r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000 >>> r6:00000000 r5:ea3d17f0 >>> [ 409.158472] r4:bf059e0c >>> [ 409.158575] [] (btree_submit_bio_hook [btrfs]) from >>> [] (submit_one_bio+0x44/0x5c [btrfs]) >>> [ 409.158578] r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c >>> [ 409.158683] [] (submit_one_bio [btrfs]) from [] >>> (btree_write_cache_pages+0x380/0x408 [btrfs]) >>> [ 409.158686] r5:00000000 r4:00000000 >>> [ 409.158788] [] (btree_write_cache_pages [btrfs]) from >>> [] (btree_writepages+0x7c/0x84 [btrfs]) >>> [ 409.158793] r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000 >>> r6:e2393d80 r5:ecc45f70 >>> [ 409.158794] r4:e2393d80 >>> [ 409.158850] [] (btree_writepages [btrfs]) from >>> [] (do_writepages+0x58/0xf4) >>> [ 409.158852] r5:ecc45f70 r4:ecc45e68 >>> [ 409.158860] [] (do_writepages) from [] >>> (__filemap_fdatawrite_range+0xf8/0x130) >>> [ 409.158864] r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 r4:ecc45e68 >>> [ 409.158869] [] (__filemap_fdatawrite_range) from >>> [] (filemap_fdatawrite_range+0x2c/0x34) >>> [ 409.158874] r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c >>> r6:c73bc628 r5:00001000 >>> [ 409.158875] r4:4fd0bfff >>> [ 409.158929] [] (filemap_fdatawrite_range) from >>> [] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs]) >>> [ 409.158931] r5:00000001 r4:00000000 >>> [ 409.159033] [] (btrfs_write_marked_extents [btrfs]) from >>> [] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs]) >>> [ 409.159038] r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628 >>> r6:ec82b000 r5:e2392000 >>> [ 409.159040] r4:c8b81ca8 >>> [ 409.159141] [] (btrfs_write_and_wait_transaction [btrfs]) >>> from [] (btrfs_commit_transaction+0x75c/0xc94 [btrfs]) >>> [ 409.159145] r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 r4:00000000 >>> [ 409.159248] [] (btrfs_commit_transaction [btrfs]) from >>> [] (transaction_kthread+0x19c/0x1e0 [btrfs]) >>> [ 409.159253] r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064 >>> r6:ec82b414 r5:00000bb8 >>> [ 409.159254] r4:ec82b000 >>> [ 409.159309] [] (transaction_kthread [btrfs]) from >>> [] (kthread+0x170/0x174) >>> [ 409.159313] r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000 >>> r6:00000000 r5:ed5ee700 >>> [ 409.159315] r4:ed5ee1c0 >>> [ 409.159320] [] (kthread) from [] >>> (ret_from_fork+0x14/0x2c) >>> [ 409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8) >>> [ 409.159326] 3fa0: 00000000 >>> 00000000 00000000 00000000 >>> [ 409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000 >>> 00000000 00000000 00000000 >>> [ 409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 >>> [ 409.159338] r10:00000000 r9:00000000 r8:00000000 r7:00000000 >>> r6:00000000 r5:c014f94c >>> [ 409.159340] r4:ed5ee700 >>> [ 409.159342] ---[ end trace eea59ced12fa7859 ]--- >>> [ 409.165084] BTRFS: error (device sda1) in >>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while >>> writing out transaction) >>> [ 409.176920] BTRFS info (device sda1): forced readonly >>> [ 409.176947] BTRFS warning (device sda1): Skipping commit of aborted >>> transaction. >>> [ 409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832: >>> errno=-5 IO failure >>> [ 409.185049] BTRFS info (device sda1): delayed_refs has NO entry >>> [ 409.310199] BTRFS info (device sda1): scrub: not finished on devid >>> 3 with status: -125 >>> [ 409.664880] BTRFS info (device sda1): scrub: not finished on devid >>> 4 with status: -125 >>> [ 410.106791] BTRFS info (device sda1): scrub: not finished on devid >>> 1 with status: -125 >>> [ 411.268585] BTRFS warning (device sda1): failed setting block group ro: -30 >>> [ 411.268594] BTRFS info (device sda1): scrub: not finished on devid >>> 2 with status: -30 >>> [ 411.268605] BTRFS info (device sda1): delayed_refs has NO entry >>> ``` >>> >>> Information requested here >>> (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list): >>> >>> ``` >>> $ uname -a >>> Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l GNU/Linux >>> >>> $ btrfs --version >>> btrfs-progs v5.7 >>> >>> $ sudo btrfs fi show >>> Label: none uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124 >>> Total devices 4 FS bytes used 10.26TiB >>> devid 1 size 3.64TiB used 2.40TiB path /dev/sda1 >>> devid 2 size 3.64TiB used 2.40TiB path /dev/sdc1 >>> devid 3 size 9.09TiB used 7.86TiB path /dev/sdd1 >>> devid 4 size 9.09TiB used 7.86TiB path /dev/sdb1 >>> >>> Label: none uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2 >>> Total devices 1 FS bytes used 116.35MiB >>> devid 1 size 399.82MiB used 224.00MiB path /dev/sda2 >>> >>> $ sudo btrfs fi df / >>> Data, RAID1: total=10.25TiB, used=10.24TiB >>> System, RAID1: total=64.00MiB, used=1.45MiB >>> Metadata, RAID1: total=18.00GiB, used=17.19GiB >>> GlobalReserve, single: total=512.00MiB, used=0.00B >>> ``` >>> >>> Thanks to demfloro and multicore on #btrfs for prompting this email. >>> >>> Best wishes, >>> Joe >>> >>