Hello,

I have a pair of identical drives that I set up with btrfs for redundancy in a 24x7 network attached storage application. The drives reside in an external enclosure and are each connected to the server with a dedicated eSATA cable.

I set up the array in the fall of 2017 and had about a year of operation without noticing any issues. In January 2019 the file system remounted read-only. I have retained the dmesg logs and notes on the actions I took at that time. In summary, I saw a bunch of SATA "link is slow to respond" and "hard resetting link" messages, followed by BTRFS errors, "BTRFS: Transaction aborted (error -5)", and a stack trace. I tried a few things (unsuccessfully) and had to power down the hardware. After a power cycle the file system mounted R/W as normal. I then ran a btrfs check and btrfs scrub, which showed no errors.

# btrfs check -p /dev/disk/by-uuid/X
Checking filesystem on /dev/disk/by-uuid/X
UUID: X
checking extents [o]
checking free space cache [.]
checking fs roots [o]
checking csums
checking root refs
found 2953419067392 bytes used err is 0
total csum bytes: 2880319844
total tree bytes: 3249602560
total fs tree bytes: 153665536
total extent tree bytes: 30392320
btree space waste bytes: 162811208
file data blocks allocated: 2950169763840
 referenced 2950168137728

# btrfs scrub start -B /dev/disk/by-uuid/X
scrub done for X
        scrub started at Thu Jan 24 21:18:33 2019 and finished after 03:58:21
        total bytes scrubbed: 2.69TiB with 0 errors

I also swapped out my long eSATA cables (6 foot) for shorter cables (20 inch) to try to address the link issues... I thought everything was good, but this episode may be related to subsequent events...

<10 Months Later>

In October 2019 I had a similar event. The logs show SATA "hard resetting link" messages for both drives as well as BTRFS "lost page write due to IO error" messages, then "forced readonly", "BTRFS: Transaction aborted (error -5)", and a stack trace. After a hard power cycle I ran "btrfs check -p", which produced a stream of messages like "parent transid verify failed on 2418006753280 wanted 34457 found 30647" and then the following:

parent transid verify failed on 2417279598592 wanted 35322 found 29823
checking root refs
found 3066087231488 bytes used err is 0
total csum bytes: 2990197476
total tree bytes: 3376070656
total fs tree bytes: 158498816
total extent tree bytes: 31932416
btree space waste bytes: 171614389
file data blocks allocated: 3062711459840
 referenced 3062709796864

I re-ran the check and captured all output to a file just in case; the file is 315,000 lines long... After some internet research I decided to mount the file system and run a scrub.
The file system mounts read/write successfully, but I was seeing patterns of log entries like:

BTRFS error (device dm-1): parent transid verify failed on 2176843776 wanted 34455 found 27114
BTRFS info (device dm-1): read error corrected: ino 1 off 2176843776 (dev /dev/mapper/K1JG82AD sector 4251648)

# btrfs scrub start -B /dev/disk/by-uuid/X

After starting the scrub the logs were showing several patterns:

verify_parent_transid: 89 callbacks suppressed
repair_io_failure: 386 callbacks suppressed
BTRFS error (device dm-1): csum mismatch on free space cache
BTRFS warning (device dm-1): failed to load free space cache for block group 1131786797056, rebuilding it now
BTRFS error (device dm-1): parent transid verify failed on 2417279385600 wanted 35322 found 29823
BTRFS info (device dm-1): read error corrected: ino 1 off 2417279385600 (dev /dev/mapper/K1JG82AD sector 4719086112)

But the summary output looked OK...

scrub done for X
        scrub started at Thu Oct 31 00:27:29 2019 and finished after 04:07:45
        total bytes scrubbed: 2.79TiB with 0 errors

# date
Sun Dec 8 11:56:29 EST 2019

# btrfs fi show
Label: none  uuid: X
        Total devices 2 FS bytes used 2.80TiB
        devid    1 size 5.46TiB used 2.80TiB path /dev/mapper/K1JG82AD
        devid    2 size 5.46TiB used 2.80TiB path /dev/mapper/K1JGYJJD

# btrfs dev stat /mnt/raid
[/dev/mapper/K1JG82AD].write_io_errs    15799831
[/dev/mapper/K1JG82AD].read_io_errs     15764242
[/dev/mapper/K1JG82AD].flush_io_errs    4385
[/dev/mapper/K1JG82AD].corruption_errs  0
[/dev/mapper/K1JG82AD].generation_errs  0
[/dev/mapper/K1JGYJJD].write_io_errs    0
[/dev/mapper/K1JGYJJD].read_io_errs     0
[/dev/mapper/K1JGYJJD].flush_io_errs    0
[/dev/mapper/K1JGYJJD].corruption_errs  0
[/dev/mapper/K1JGYJJD].generation_errs  0

At this point I was still seeing occasional log entries for "parent transid verify failed" and "read error corrected", so I decided to upgrade from Debian 9 to Debian 10 to get more current tools. Running a scrub with the Debian 10 tools, I saw errors detected and corrected... I also saw SATA link issues during the scrub...

# date
Mon 09 Dec 2019 10:29:05 PM EST

# btrfs scrub start -B -d /dev/disk/by-uuid/X
scrub device /dev/mapper/K1JG82AD (id 1) done
        scrub started at Sun Dec 8 23:06:59 2019 and finished after 05:46:26
        total bytes scrubbed: 2.80TiB with 9490467 errors
        error details: verify=1349 csum=9489118
        corrected errors: 9490467, uncorrectable errors: 0, unverified errors: 0
WARNING: errors detected during scrubbing, corrected

# btrfs dev stat /mnt/raid
[/dev/mapper/K1JG82AD].write_io_errs    15799831
[/dev/mapper/K1JG82AD].read_io_errs     15764242
[/dev/mapper/K1JG82AD].flush_io_errs    4385
[/dev/mapper/K1JG82AD].corruption_errs  9489118
[/dev/mapper/K1JG82AD].generation_errs  1349
[/dev/mapper/K1JGYJJD].write_io_errs    0
[/dev/mapper/K1JGYJJD].read_io_errs     0
[/dev/mapper/K1JGYJJD].flush_io_errs    0
[/dev/mapper/K1JGYJJD].corruption_errs  0
[/dev/mapper/K1JGYJJD].generation_errs  0

At this point I want to separate the file system debugging from the hardware debugging. I would like to offload all data to a different disk (if possible) to maintain availability while I work with the hardware, then return to this hardware once I am certain it is rock solid...
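As part of that separation, I am also thinking of zeroing the per-device error counters before the next round of testing, so that any new errors stand out from the historical totals above. A rough sketch of what I have in mind (my understanding is that -z prints the current values and then resets them to zero):

# btrfs dev stat -z /mnt/raid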
Questions:

1) How should I interpret these errors? The btrfs messages seem to be telling me that there are errors everywhere, but that they are all correctable... Should I panic? Should I proceed?

2) Is my file system broken? Is my data corrupted? Should I be able to scrub etc. to get back to operation without scary log messages? Can I trust the data that I copy out now, or do I need to fall back on old/incomplete backups?

3) What steps are recommended to back up / offload / recover the data? I am considering installing the disks in a different machine, mounting the array read-only, and then pulling a full copy of the data (roughly along the lines of the sketch appended at the end of this message)...

4) What steps should I take to clean up the file system errors/messages? Start fresh after a full backup (though I hate the idea of migrating off a redundant array onto a single disk in the process)? Scrub etc.? Evaluate each disk independently and rebuild one from the other?

Regards,
Stephen

- System Information -

2017 - 2019/12/8 (Debian 9)
linux-4.9.110
Package: btrfs-progs (4.7.3-1)

2019/12/8 - present (Debian 10)
# uname -a
Linux 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux
# btrfs --version
btrfs-progs v4.20.1
# btrfs fi show

- dmesg / stack traces -
Attached
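- Offload Plan (rough sketch) -

For question 3, this is only what I have in mind, not something I have run yet; /mnt/offload is just a placeholder for wherever the new disk ends up mounted:

# mount -o ro /dev/disk/by-uuid/X /mnt/raid
# rsync -aHAX --progress /mnt/raid/ /mnt/offload/

If the array will not mount read-only on the other machine, I would look at "btrfs restore" as a fallback for pulling files off the unmounted devices.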