Hello,

it seems my last e-mail was filtered, as I can't find it in the archives, so I am resending it and including all attachments in one tarball.

On 26. 08. 20 20:07, Chris Murphy wrote:
> OK so from the attachments..
>
> cat /proc/<pid>/stack for md1_raid6
>
> [<0>] rq_qos_wait+0xfa/0x170
> [<0>] wbt_wait+0x98/0xe0
> [<0>] __rq_qos_throttle+0x23/0x30
> [<0>] blk_mq_make_request+0x12a/0x5d0
> [<0>] generic_make_request+0xcf/0x310
> [<0>] submit_bio+0x42/0x1c0
> [<0>] md_update_sb.part.71+0x3c0/0x8f0 [md_mod]
> [<0>] r5l_do_reclaim+0x32a/0x3b0 [raid456]
> [<0>] md_thread+0x94/0x150 [md_mod]
> [<0>] kthread+0x112/0x130
> [<0>] ret_from_fork+0x22/0x40
>
> Btrfs snapshot flushing might instigate the problem but it seems to me
> there's some kind of contention or blocking happening within md, and
> that's why everything stalls. But I can't tell why.
>
> Do you have any iostat output at the time of this problem? I'm
> wondering if md is waiting on disks. If not, try `iostat -dxm 5` and
> share a few minutes before and after the freeze/hang.

We detected the issue on Monday 31. 08. 2020 at 15:24. It must have happened sometime between 15:22 and 15:24, as we monitor the state every 2 minutes. We recorded the stacks of the blocked processes, the output of sysrq+w and the requested `iostat`. Then at 15:45 we "unstuck" it manually by accessing the md1 device with dd (reading a few random blocks). A rough sketch of the commands we used is at the end of this mail.

I hope the attached file names are self-explanatory. Please let me know if we can do anything more to track the issue down, or if I forgot something.

Thanks a lot,
Vojtech and Michal

Description of the devices in iostat, just for recap:
- sda-sdf: 6 HDD disks
- sdg, sdh: 2 SSD disks
- md0: raid1 over sdg1 and sdh1 ("SSD RAID", Physical Volume for LVM)
- md1: our "problematic" raid6 over sda-sdf ("HDD RAID", btrfs formatted)
- Logical volumes over the md0 Physical Volume (on the SSD RAID):
  - dm-0: 4G LV for SWAP
  - dm-1: 16G LV for the root file system (ext4 formatted)
  - dm-2: 1G LV for the md1 journal
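
For completeness, the data collection was roughly as follows (a simplified sketch; the exact script and file names in the tarball differ slightly):

  # blocked-task backtraces via sysrq+w, then grab them from the kernel log
  echo w > /proc/sysrq-trigger
  dmesg > sysrq_w.txt

  # kernel stacks of every process stuck in uninterruptible sleep (D state)
  for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
      echo "=== pid $pid ($(cat /proc/$pid/comm)) ==="
      cat /proc/$pid/stack
  done > blocked_stacks.txt

  # the requested iostat, left running across the hang (stopped after ~10 min)
  timeout 600 iostat -dxm 5 > iostat.txt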
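
And the manual "unstick" at 15:45 was roughly this (again only a sketch; the offsets were arbitrary, bash $RANDOM is shown just for illustration):

  # read a few random 1 MiB blocks directly from the stuck md1 device
  for i in 1 2 3 4 5; do
      skip=$((RANDOM % 100000))
      dd if=/dev/md1 of=/dev/null bs=1M count=1 skip="$skip" iflag=direct
  done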