Hello,

I have a pair of identical drives that I set up with btrfs for redundancy in a 24x7 network attached storage application. The drives reside in an external enclosure and are each connected to the server with a dedicated eSATA cable.

I set up the array in the fall of 2017 and had about a year of operation without noticing any issues. In January 2019 the file system remounted read-only. I have retained the dmesg logs and notes on the actions I took at that time. In summary, I saw a bunch of SATA "link is slow to respond" and "hard resetting link" messages, followed by BTRFS errors, "BTRFS: Transaction aborted (error -5)", and a stack trace. I tried a few things (unsuccessfully) and had to power down the hardware. After a power cycle the file system mounted R/W as normal. I then ran a btrfs check and btrfs scrub, which showed no errors.

# btrfs check -p /dev/disk/by-uuid/X
Checking filesystem on /dev/disk/by-uuid/X
UUID: X
checking extents [o]
checking free space cache [.]
checking fs roots [o]
checking csums
checking root refs
found 2953419067392 bytes used err is 0
total csum bytes: 2880319844
total tree bytes: 3249602560
total fs tree bytes: 153665536
total extent tree bytes: 30392320
btree space waste bytes: 162811208
file data blocks allocated: 2950169763840
 referenced 2950168137728

# btrfs scrub start -B /dev/disk/by-uuid/X
scrub done for X
        scrub started at Thu Jan 24 21:18:33 2019 and finished after 03:58:21
        total bytes scrubbed: 2.69TiB with 0 errors

I also swapped out my long eSATA cables (6 foot) for shorter cables (20 inch) to try to address the link issues... I thought everything was good, but this episode may be related to subsequent events...

<10 Months Later>

In October 2019 I had a similar event. The logs show SATA "hard resetting link" messages for both drives as well as BTRFS "lost page write due to IO error" messages, then "forced readonly", "BTRFS: Transaction aborted (error -5)", and a stack trace. After a hard power cycle I ran "btrfs check -p", which produced a stream of messages like "parent transid verify failed on 2418006753280 wanted 34457 found 30647" and then the following:

parent transid verify failed on 2417279598592 wanted 35322 found 29823
checking root refs
found 3066087231488 bytes used err is 0
total csum bytes: 2990197476
total tree bytes: 3376070656
total fs tree bytes: 158498816
total extent tree bytes: 31932416
btree space waste bytes: 171614389
file data blocks allocated: 3062711459840
 referenced 3062709796864

I re-ran the check and captured all output to a file just in case; the file is 315,000 lines long... After some internet research I decided to mount the file system and run a scrub.
The file system mounts read/write successfully, but I was seeing patterns of log entries like:

BTRFS error (device dm-1): parent transid verify failed on 2176843776 wanted 34455 found 27114
BTRFS info (device dm-1): read error corrected: ino 1 off 2176843776 (dev /dev/mapper/K1JG82AD sector 4251648)

# btrfs scrub start -B /dev/disk/by-uuid/X

After starting the scrub the logs were showing several patterns:

verify_parent_transid: 89 callbacks suppressed
repair_io_failure: 386 callbacks suppressed
BTRFS error (device dm-1): csum mismatch on free space cache
BTRFS warning (device dm-1): failed to load free space cache for block group 1131786797056, rebuilding it now
BTRFS error (device dm-1): parent transid verify failed on 2417279385600 wanted 35322 found 29823
BTRFS info (device dm-1): read error corrected: ino 1 off 2417279385600 (dev /dev/mapper/K1JG82AD sector 4719086112)

But the summary output looked OK...

scrub done for X
        scrub started at Thu Oct 31 00:27:29 2019 and finished after 04:07:45
        total bytes scrubbed: 2.79TiB with 0 errors

# date
Sun Dec 8 11:56:29 EST 2019

# btrfs fi show
Label: none  uuid: X
        Total devices 2 FS bytes used 2.80TiB
        devid    1 size 5.46TiB used 2.80TiB path /dev/mapper/K1JG82AD
        devid    2 size 5.46TiB used 2.80TiB path /dev/mapper/K1JGYJJD

# btrfs dev stat /mnt/raid
[/dev/mapper/K1JG82AD].write_io_errs    15799831
[/dev/mapper/K1JG82AD].read_io_errs     15764242
[/dev/mapper/K1JG82AD].flush_io_errs    4385
[/dev/mapper/K1JG82AD].corruption_errs  0
[/dev/mapper/K1JG82AD].generation_errs  0
[/dev/mapper/K1JGYJJD].write_io_errs    0
[/dev/mapper/K1JGYJJD].read_io_errs     0
[/dev/mapper/K1JGYJJD].flush_io_errs    0
[/dev/mapper/K1JGYJJD].corruption_errs  0
[/dev/mapper/K1JGYJJD].generation_errs  0

At this point I was still seeing occasional log entries for "parent transid verify failed" and "read error corrected", so I decided to upgrade from Debian 9 to Debian 10 to get more current tools. Running a scrub with the Debian 10 tools, I saw errors detected and corrected... I also saw SATA link issues during the scrub...

# date
Mon 09 Dec 2019 10:29:05 PM EST

# btrfs scrub start -B -d /dev/disk/by-uuid/X
scrub device /dev/mapper/K1JG82AD (id 1) done
        scrub started at Sun Dec 8 23:06:59 2019 and finished after 05:46:26
        total bytes scrubbed: 2.80TiB with 9490467 errors
        error details: verify=1349 csum=9489118
        corrected errors: 9490467, uncorrectable errors: 0, unverified errors: 0
WARNING: errors detected during scrubbing, corrected

# btrfs dev stat /mnt/raid
[/dev/mapper/K1JG82AD].write_io_errs    15799831
[/dev/mapper/K1JG82AD].read_io_errs     15764242
[/dev/mapper/K1JG82AD].flush_io_errs    4385
[/dev/mapper/K1JG82AD].corruption_errs  9489118
[/dev/mapper/K1JG82AD].generation_errs  1349
[/dev/mapper/K1JGYJJD].write_io_errs    0
[/dev/mapper/K1JGYJJD].read_io_errs     0
[/dev/mapper/K1JGYJJD].flush_io_errs    0
[/dev/mapper/K1JGYJJD].corruption_errs  0
[/dev/mapper/K1JGYJJD].generation_errs  0

At this point I want to separate the file system debugging from the hardware debugging. I would like to offload all data to a different disk (if possible) to maintain availability while I work with the hardware, then return to this hardware once I am certain it is rock solid...
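As part of that separation, I am also thinking of zeroing the per-device error counters before the next round of testing, so that any new errors stand out from the historical totals above. A rough sketch of what I have in mind (my understanding is that -z prints the current values and then resets them to zero):

# btrfs dev stat -z /mnt/raid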
Questions:

1) How should I interpret these errors? The btrfs messages seem to be telling me that there are errors everywhere, but that they are all correctable... Should I panic? Should I proceed?

2) Is my file system broken? Is my data corrupted? Should I be able to scrub etc. to get back to operation without scary log messages? Can I trust the data that I copy out now, or do I need to fall back on old/incomplete backups?

3) What steps are recommended to back up / offload / recover the data? I am considering installing the disks in a different machine, mounting the array read-only, and then pulling a full copy of the data (roughly along the lines of the sketch appended at the end of this message)...

4) What steps should I take to clean up the file system errors/messages? Start fresh after a full backup (though I hate the idea of migrating off a redundant array onto a single disk in the process)? Scrub etc.? Evaluate each disk independently and rebuild one from the other?

Regards,
Stephen

- System Information -

2017 - 2019/12/8 (Debian 9)
linux-4.9.110
Package: btrfs-progs (4.7.3-1)

2019/12/8 - present (Debian 10)
# uname -a
Linux 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux
# btrfs --version
btrfs-progs v4.20.1
# btrfs fi show

- dmesg / stack traces -
Attached
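- Offload Plan (rough sketch) -

For question 3, this is only what I have in mind, not something I have run yet; /mnt/offload is just a placeholder for wherever the new disk ends up mounted:

# mount -o ro /dev/disk/by-uuid/X /mnt/raid
# rsync -aHAX --progress /mnt/raid/ /mnt/offload/

If the array will not mount read-only on the other machine, I would look at "btrfs restore" as a fallback for pulling files off the unmounted devices.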