XFS disaster recovery

* XFS disaster recovery
@ 2022-02-01 23:07 Sean Caron
  2022-02-01 23:33 ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Caron @ 2022-02-01 23:07 UTC (permalink / raw)
  To: linux-xfs, Sean Caron

Hi all,

Me again with another not-backed-up XFS filesystem that's in a little
trouble. Last time I stopped by to discuss my woes, I was told that I
could check in here and get some help reading the tea leaves before I
do anything drastic so I'm doing that :)

Brief backstory: This is a RAID 60 composed of three 18-drive RAID 6
strings of 8 TB disk drives, around 460 TB total capacity. Last week
we had a disk fail out of the array. We replaced the disk and the
recovery hung at around 70%.

We power cycled the machine and enclosure and got the recovery to run
to completion. Just as it finished up, the same string dropped another
drive.

We replaced that drive and started recovery again. It got a fair bit
into the recovery, then hung just as did the first drive recovery, at
around +/- 70%. We power cycled everything again, then started the
recovery. As the recovery was running again, a third disk started to
throw read errors.

At this point, I decided to just stop trying to recover this array so
it's up with two disks down but otherwise assembled. I figured I would
just try to mount ro,norecovery and try to salvage as much as possible
at this point before going any further.

Trying to mount ro,norecovery, I am getting an error:

metadata I/O error in "xfs_trans_read_buf_map at daddr ... len 8 error 74
Metadata CRC error detected at xfs_agf_read_verify+0xd0/0xf0 [xfs],
xfs_agf block ...

I ran an xfs_repair -L -n just to see what it would spit out. It
completes within 15-20 minutes (which I feel might be a good sign,
from my experience, outcomes are inversely proportional to run time),
but the output is implying that it would unlink over 100,000 files
(I'm not sure how many total files are on the filesystem, in terms of
what proportion of loss this would equate to) and it also says:

"Inode allocation btrees are too corrupted, skipping phases 6 and 7"

which sounds a little ominous.

It would be a huge help if someone could help me get a little insight
into this and determine the best way forward at this point to try and
salvage as much as possible.

Happy to provide any data, dmesg output, etc as needed. Please just
let me know what would be helpful for diagnosis.

Thanks so much,

Sean

^ permalink raw reply	[flat|nested] 20+ messages in thread