Hello, I think I've encountered a deadlock between btrfs-transacti and postgres process(es). Here is the system information (the btrfs fi usage output was obtained after power-off and boot):

# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)

# uname -a
Linux prod-dbsnap-01 5.2.1-1.el7.elrepo.x86_64 #1 SMP Sun Jul 14 08:15:04 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version
btrfs-progs v5.2

# btrfs filesystem usage /data/pg_data
Overall:
    Device size:                   2.00TiB
    Device allocated:            345.03GiB
    Device unallocated:            1.66TiB
    Device missing:                  0.00B
    Used:                        338.07GiB
    Free (estimated):              1.67TiB      (min: 854.27GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID0: Size:332.00GiB, Used:329.22GiB
   /dev/sdb       83.00GiB
   /dev/sdc       83.00GiB
   /dev/sdd       83.00GiB
   /dev/sde       83.00GiB

Metadata,RAID10: Size:6.50GiB, Used:4.42GiB
   /dev/sdb        3.25GiB
   /dev/sdc        3.25GiB
   /dev/sdd        3.25GiB
   /dev/sde        3.25GiB

System,RAID10: Size:16.00MiB, Used:48.00KiB
   /dev/sdb        8.00MiB
   /dev/sdc        8.00MiB
   /dev/sdd        8.00MiB
   /dev/sde        8.00MiB

Unallocated:
   /dev/sdb      425.74GiB
   /dev/sdc      425.74GiB
   /dev/sdd      425.74GiB
   /dev/sde      425.74GiB

There were three btrfs subvolumes, and on each one a Postgres database slave was doing recovery (single-threaded writing). But there was a lot of writing. Prior to starting the Postgres slaves I had been restoring a base backup from the backup server, which was done by a number of parallel rsync processes (12 at most, I think).

The filesystem is mounted with:

# grep btrfs /proc/mounts
/dev/sdd /data/pg_data btrfs rw,noatime,compress-force=zstd:3,space_cache,subvolid=5,subvol=/ 0 0

After several hours I got this in /var/log/messages:

# grep 'INFO: task .*blocked for more than' messages
Jul 17 16:47:09 prod-dbsnap-01 kernel: INFO: task btrfs-transacti:1361 blocked for more than 122 seconds.
Jul 17 16:47:09 prod-dbsnap-01 kernel: INFO: task postgres:62682 blocked for more than 122 seconds.
Jul 17 16:47:09 prod-dbsnap-01 kernel: INFO: task postgres:80145 blocked for more than 122 seconds.
Jul 17 16:47:09 prod-dbsnap-01 kernel: INFO: task postgres:87299 blocked for more than 122 seconds.
Jul 17 16:47:09 prod-dbsnap-01 kernel: INFO: task postgres:93108 blocked for more than 122 seconds.
Jul 17 16:49:12 prod-dbsnap-01 kernel: INFO: task btrfs-transacti:1361 blocked for more than 245 seconds.
Jul 17 16:49:12 prod-dbsnap-01 kernel: INFO: task postgres:62682 blocked for more than 245 seconds.
Jul 17 16:49:12 prod-dbsnap-01 kernel: INFO: task postgres:80145 blocked for more than 245 seconds.
Jul 17 16:49:12 prod-dbsnap-01 kernel: INFO: task postgres:87299 blocked for more than 245 seconds.
Jul 17 16:49:12 prod-dbsnap-01 kernel: INFO: task postgres:93108 blocked for more than 245 seconds.

The full log is in the attachment.

It seems to me that there is a deadlock between btrfs-transacti and any process that is trying to write something (I'm not sure about reading). While this was going on, CPU usage (according to top) was non-existent. There are 4 CPUs (it's a virtual machine on VMware); 3 were 100% idle and the 4th was 100% in I/O wait. (Unfortunately, I didn't find out which process was on that CPU.)
I powered off the machine (yesterday), booted this morning, and things are working without errors. I have stopped the Postgres clusters, though. I have a few questions:

1. After something like this happens and the machine is rebooted, is there a procedure that would lower the probability of hitting the deadlock again (maybe btrfs scrub or btrfs defragment or something like that)? This happened after heavy write activity, so maybe fragmentation had something to do with it.

2. Should I run btrfsck (or something else) to verify on-disk integrity after a problem like this? Or is it purely an in-memory problem, so that I can assume nothing bad happened to the data on disk?

3. I'd like to write a watchdog program to catch deadlocks and reboot (probably power-cycle) the VM, but I'm not sure what the appropriate check should be (see the rough sketch at the end of this mail). Does it have to write something to disk, or would reading be sufficient? And how do I bypass the OS buffer cache (fsync() or O_DIRECT should do it for writing, but I'm not sure about reading)? What would an appropriate timeout be? (If the operation doesn't complete within xx seconds a reboot should be triggered, but I don't know how many seconds to wait so that heavy load, where things are merely slow but not deadlocked, doesn't trigger it.) Should I put the watchdog process in the real-time scheduling class (or whatever the Linux equivalent is called)? Since this is a mainline kernel, I'm not sure I can assume that the real-time support won't have bugs of its own.

And last, but not least, is there additional data that could help with debugging issues like this? (If possible, something that could be programmed into the watchdog service.)
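To make question 3 more concrete, this is roughly what I have in mind. It is only a sketch: the probe path, the 120-second timeout, the 30-second interval, and the use of SysRq to dump blocked-task stacks and force a reboot are placeholders I'm not sure about, and they are exactly the parts I'm asking for advice on.

#!/usr/bin/env python3
# Watchdog sketch: periodically write a small file on the btrfs mount and
# fsync() it so the data actually has to reach the disk. If the probe does
# not finish within TIMEOUT seconds, dump blocked-task stacks to dmesg
# (SysRq "w") and force an immediate reboot (SysRq "b"). Must run as root
# and kernel.sysrq must permit these functions.

import os
import threading
import time

CHECK_FILE = "/data/pg_data/.watchdog_probe"   # placeholder probe path
TIMEOUT = 120                                   # placeholder timeout, seconds
INTERVAL = 30                                   # placeholder interval, seconds

def sysrq(key):
    with open("/proc/sysrq-trigger", "w") as f:
        f.write(key)

def probe():
    fd = os.open(CHECK_FILE, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, str(time.time()).encode())
        os.fsync(fd)          # force the write (and a btrfs transaction) to disk
    finally:
        os.close(fd)

def main():
    while True:
        t = threading.Thread(target=probe, daemon=True)
        t.start()
        t.join(TIMEOUT)
        if t.is_alive():
            # Probe is stuck (presumably in D state): collect evidence, reboot.
            sysrq("w")        # dump stacks of uninterruptible tasks to dmesg
            time.sleep(5)
            sysrq("b")        # immediate reboot, no clean shutdown
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()

The SysRq "w" dump before rebooting is also the kind of "additional data" I had in mind for the last question, but I don't know whether anything else would be useful to collect.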