* Question regarding XFS crisis recovery
@ 2021-11-15 17:14 Sean Caron
  2021-11-15 18:13 ` Roger Willcocks
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Sean Caron @ 2021-11-15 17:14 UTC (permalink / raw)
  To: linux-xfs, Sean Caron

Hi all,

I recently had to manage a storage failure on a ~150 TB XFS volume and
I just wanted to check with the group here to see if anything could
have been done differently. Here is my story.

We had a 150 TB RAID 60 volume formatted with XFS. The volume was made
up of two 21-drive RAID 6 strings (4 TB drives). This was all done
with Linux MD software RAID.

The filesystem was filled to 100% capacity when it failed. I'm not
sure if this contributed to the poor outcome.

There was no backup available of this filesystem (of course).

About a week ago, we had two drives become spuriously ejected from one
of the two RAID 6 strings that composed this volume. This seems to
happen sometimes as a result of various hardware and software
glitches. We checked the drives with smartctl, added them back to the
array and a resync operation started.
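
We re-added them with mdadm, roughly along these lines (the array and
device names here are illustrative, not the exact ones we used):

  mdadm --manage /dev/md0 --re-add /dev/sdX
  cat /proc/mdstat                 # watch the resync progress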

The resync ran for a little while and failed, because a third disk in
the array (which mdadm had never failed out, and smartctl still
thought was OK) reported a read error/bad blocks and dropped out of
the array.

We decided to clone the failed disk to a brand new replacement drive with:

dd conv=notrunc,noerror,sync

We figured we'd lose a few sectors, which would get nulled out, but
we'd have a drive that could get through the rebuild without being
kicked out due to read errors (we've used this technique successfully
in the past to recover from this kind of situation).
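
The full invocation was roughly the following (device names are
placeholders, and the block size we used may have differed):

  dd if=/dev/sdX of=/dev/sdY bs=64K conv=notrunc,noerror,sync
  # noerror: don't stop on read errors
  # sync: pad failed reads with NULs up to the block size, so the copy
  #       stays offset-aligned (each bad read costs at most one block)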

Clone completed. We swapped the clone drive with the bad blocks drive
and kicked off another rebuild.

Rebuild fails again because a fourth drive is throwing bad blocks/read
errors and gets kicked out of the array.

We scan all 21 drives in this array with smartctl and there are
actually three more drives in total where SMART has logged read
errors.
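
The scan itself was just smartctl over every member disk, something
like this (the device list is illustrative):

  for d in /dev/sd[a-u]; do
      echo "== $d"
      smartctl -l error "$d" | grep -E 'Error Count|No Errors Logged'
  done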

This is starting to look pretty bad but what can we do? We just clone
these three drives to three more fresh drives using dd
conv=notrunc,noerror,sync.

Swap them in for the old bad block drives and kick off another
rebuild. The rebuild actually runs and completes successfully. MD
thinks the array is fine, running, not degraded at all.
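
As far as the usual checks could tell, nothing was wrong (array name
illustrative):

  cat /proc/mdstat
  mdadm --detail /dev/md0     # State : clean, Failed Devices : 0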

We mount the array. It mounts, but it is obviously pretty damaged.
Normally when this happens we try to mount it read only and copy off
what we can, then write it off. This time, we can hardly do anything
but an "ls" in the filesystem without getting "structure needs
cleaning". Doing any kind of material access to the filesystem gives
various major errors (e.g. "in-memory corruption of filesystem data
detected") and the filesystem goes offline. Reads just fail with I/O
errors.

What can we do? Seems like at this stage we just run xfs_repair and
hope for the best, right?

Ran xfs_repair in dry run mode and it's looking pretty bad, just from
the sheer amount of output.
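
By dry run I mean xfs_repair's no-modify mode, i.e. something like
(device name illustrative):

  xfs_repair -n /dev/md0      # report problems, change nothing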

But there's no real way to know exactly how much data xfs_repair will
wipe out, and what alternatives do we have? The filesystem hardly
mounts without faulting anyway. Seems like there's little choice but
to run it and see what shakes out.

We run xfs_repair overnight. It ran for a while, then eventually hung
in Phase 4, I think.

We killed xfs_repair off and re-ran it with the -P flag. It runs for
maybe two or three hours and eventually completes.

We mount the filesystem up. Of around 150 TB, we have maybe 10% of
that in data salad in lost+found, 21 GB of good data and the rest is
gone.

Copy off what we can, and call it dead. This is where we're at now.

It seems like the MD rebuild process really scrambled things somehow.
I'm not sure if this was due to some kind of kernel bug, or just
zeroed out bad sectors in wrong places or what. Once the md resync
ran, we were cooked.

I guess, after blowing through four or five "Hope you have a backup,
but if not, you can try this and pray" checkpoints, I just want to
check with the developers and group here to see if we did the best
thing possible given the circumstances?

Xfs_repair is it, right? When things are that scrambled, pretty much
all you can do is run an xfs_repair and hope for the best? Am I
correct in thinking that there is no better or alternative tool that
will give different results?

Can a commercial data recovery service make any better sense of a
scrambled XFS than xfs_repair could? When the underlying device is
presenting OK, just scrambled data on it?

Thanks,

Sean


* Re: Question regarding XFS crisis recovery
  2021-11-15 17:14 Question regarding XFS crisis recovery Sean Caron
@ 2021-11-15 18:13 ` Roger Willcocks
  2021-11-15 18:35 ` Chris Murphy
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Roger Willcocks @ 2021-11-15 18:13 UTC (permalink / raw)
  To: Sean Caron; +Cc: Roger Willcocks, linux-xfs

In principle that should have worked. And yes, when you’ve got the filesystem back to the point where it mounts, xfs_repair is your only option.

It might have been useful to take an xfs_metadump before the repair, to see what xfs_repair would make of it, and to share it with others for their thoughts.
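
Something along these lines, with the output path being whatever
scratch space you have to hand (illustrative here):

  xfs_metadump -g /dev/mdX /scratch/fs.metadump

It only captures (obfuscated) metadata, so the image is far smaller
than the filesystem itself.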

It does seem like there should be an md resync recovery option which substitutes zeroes for bad blocks instead of giving up immediately. A few blocks of corrupted data in 150 TB is obviously preferable to no data at all.

Or allow it to fall back to reading the ‘dropped out’ drives if there’s a read error elsewhere in the stripe while they’re being rebuilt.

—
Roger



* Re: Question regarding XFS crisis recovery
  2021-11-15 17:14 Question regarding XFS crisis recovery Sean Caron
  2021-11-15 18:13 ` Roger Willcocks
@ 2021-11-15 18:35 ` Chris Murphy
  2021-11-15 18:51 ` Eric Sandeen
  2021-11-15 21:21 ` Dave Chinner
  3 siblings, 0 replies; 6+ messages in thread
From: Chris Murphy @ 2021-11-15 18:35 UTC (permalink / raw)
  To: Sean Caron; +Cc: xfs list

I'm going to let others address the XFS issues, if any. My take is
that this is not XFS related at all, but a problem with the lower
layers of the storage stack.

What is the SCT ERC value for each of the drives?  This value must be
less than the kernel's SCSI command timer, which by default is 30
seconds.

It sounds to me like a common misconfiguration where the drive SCT ERC
is not configured, so bad sectors accumulate over time because they
are never being fixed up. And once a single stripe is lost,
representing a critical amount of file system metadata, you lose the
whole file system. It's a very high penalty for what is actually an
avoidable problem, but avoiding it relies on esoteric knowledge, and
the problem persists because of the resistance of downstream distros
to changing kernel defaults (they don't understand most of the knobs)
and the upstream kernel developers' reluctance to change defaults
because of various downstream expectations based on them. Those are
generally valid positions, but in the specific case of large software
raid arrays, Linux has a bad reputation strictly because of crap
defaults where the common case is that SCT ERC is a higher value than
the SCSI command timer. And this will *always* lead to data loss,
eventually.

Check the *device* timeout with this command:
smartctl -l scterc /dev/sdX

Check the *kernel* timeout with this command:
cat /sys/block/sdX/device/timeout

If the drive doesn't support configurable SCT ERC, then you must
increase the kernel's command timer to a ridiculous value like 180
seconds. Seriously, 180 seconds for a drive to decide whether a sector
is unreadable is ridiculous, but the logic of a consumer drive is that
there's no redundancy, so it should try as long and hard as possible
before giving up, which is the exact opposite of what we want in a
raid array.
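
Concretely, that means something like this per drive (sdX is a
placeholder, and 7 seconds is just a commonly used value):

  smartctl -l scterc,70,70 /dev/sdX          # set SCT ERC to 7.0s read/write
  echo 180 > /sys/block/sdX/device/timeout   # fallback if SCT ERC can't be set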

This guide is a bit stale; I prefer to change either SCT ERC or the
command timer with a udev rule, but the result is the same.
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
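
A sketch of the udev rule approach (illustrative, untested here):

  # /etc/udev/rules.d/60-scterc.rules
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
    RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"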

-- 
Chris Murphy


* Re: Question regarding XFS crisis recovery
  2021-11-15 17:14 Question regarding XFS crisis recovery Sean Caron
  2021-11-15 18:13 ` Roger Willcocks
  2021-11-15 18:35 ` Chris Murphy
@ 2021-11-15 18:51 ` Eric Sandeen
  2021-11-15 21:21 ` Dave Chinner
  3 siblings, 0 replies; 6+ messages in thread
From: Eric Sandeen @ 2021-11-15 18:51 UTC (permalink / raw)
  To: Sean Caron, linux-xfs

On 11/15/21 11:14 AM, Sean Caron wrote:
> I guess, after blowing through four or five "Hope you have a backup,
> but if not, you can try this and pray" checkpoints, I just want to
> check with the developers and group here to see if we did the best
> thing possible given the circumstances?

Overall I suppose that what you did sounds reasonable, unless I'm missing
something about the MD raid state. Having that many drives failing out
sounds bad.  Any idea how many blocks got skipped with your dd clone?

> Xfs_repair is it, right? When things are that scrambled, pretty much
> all you can do is run an xfs_repair and hope for the best? Am I
> correct in thinking that there is no better or alternative tool that
> will give different results?

well ... from the xfs POV, yes, but xfs_repair is not a data recovery tool,
it is designed to make the filesystem consistent again, not to recover
all data.  (that might sound a bit glib, but while repair obviously tries
to salvage/correct what it can, in the end, the goal is consistency.)

There's not much xfs_repair can do if the block device has been severely
scrambled beneath it.

> Can a commercial data recovery service make any better sense of a
> scrambled XFS than xfs_repair could? When the underlying device is
> presenting OK, just scrambled data on it?

maybe? :) not sure, personally. I have never used a data recovery service.

Sorry for your data loss. I'd suggest asking these questions of the
md raid folks as well, because I think that's where your problems started,
and frankly xfs / xfs_repair may have been handed an impossible task.

(Not blaming md; perhaps the hardware failure made this all inevitable,
but maybe the md devs will have thoughts on your recovery attempts.)

-Eric


* Re: Question regarding XFS crisis recovery
  2021-11-15 17:14 Question regarding XFS crisis recovery Sean Caron
                   ` (2 preceding siblings ...)
  2021-11-15 18:51 ` Eric Sandeen
@ 2021-11-15 21:21 ` Dave Chinner
  2021-11-16 17:27   ` Sean Caron
  3 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2021-11-15 21:21 UTC (permalink / raw)
  To: Sean Caron; +Cc: linux-xfs

On Mon, Nov 15, 2021 at 12:14:34PM -0500, Sean Caron wrote:
> Hi all,
> 
> I recently had to manage a storage failure on a ~150 TB XFS volume and
> I just wanted to check with the group here to see if anything could
> have been done differently. Here is my story.

:(

> We had a 150 TB RAID 60 volume formatted with XFS. The volume was made
> up of two 21-drive RAID 6 strings (4 TB drives). This was all done
> with Linux MD software RAID.

A 21-drive RAID-6 made this cascading failure scenario inevitable,
especially if all the drives were identical (same vendor and
manufacturing batch). Once the first drive goes bad, the rest are at
death's door. RAID rebuild is about the most intensive sustained
load you can put on a drive, and if a drive is marginal that's often
all that is needed to kick it over the edge. The more disks in the
RAID set, the more likely cascading failures during rebuild are.

> We mount the array. It mounts, but it is obviously pretty damaged.
> Normally when this happens we try to mount it read only and copy off
> what we can, then write it off. This time, we can hardly do anything
> but an "ls" in the filesystem without getting "structure needs
> cleaning".

Which makes me think that the damage is, unfortunately, high up in the
directory hierarchy, and that the inodes and sub-directories that hold
most of the data can't be accessed.

> Doing any kind of material access to the filesystem gives
> various major errors (i.e. "in-memory corruption of filesystem data
> detected") and the filesystem goes offline. Reads just fail with I/O
> errors.
> 
> What can we do? Seems like at this stage we just run xfs_repair and
> hope for the best, right?

Not quite. The next step would have been to take a metadump of the
broken filesystem and then restore the image to a file on non-broken
storage. Then you can run repair on the restored metadump image and
see just how much ends up being left after xfs_repair runs. That
tells you the likely result of running repair without actually
changing anything in the damaged storage.
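
i.e. something along these lines (paths illustrative):

  xfs_metadump -g /dev/mdX /scratch/fs.metadump
  xfs_mdrestore /scratch/fs.metadump /scratch/fs.img
  xfs_repair /scratch/fs.img
  mount -o loop,ro /scratch/fs.img /mnt/test

The metadump contains no file data, but it lets you see how much of
the directory structure and how many inodes survive the repair.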

> Ran xfs_repair in dry run mode and it's looking pretty bad, just from
> the sheer amount of output.
> 
> But there's no real way to know exactly how much data xfs_repair will
> wipe out, and what alternatives do we have?

That's exactly what metadump/restore/repair/"mount -o loop" allows
us to evaluate.

> We run xfs_repair overnight. It ran for a while, then eventually hung
> in Phase 4, I think.
> 
> We killed xfs_repair off and re-ran it with the -P flag. It runs for
> maybe two or three hours and eventually completes.
> 
> We mount the filesystem up. Of around 150 TB, we have maybe 10% of
> that in data salad in lost+found, 21 GB of good data and the rest is
> gone.
> 
> Copy off what we can, and call it dead. This is where we're at now.

Yeah, and there's probably not a lot that can be done now except run
custom data scrapers over the raw disk blocks to try to recognise
unconnected metadata and files to try to recover the raw information
that is no longer connected to the repaired directory structure.
That's slow, time consuming and expensive.

> It seems like the MD rebuild process really scrambled things somehow.
> I'm not sure if this was due to some kind of kernel bug, or just
> zeroed out bad sectors in wrong places or what. Once the md resync
> ran, we were cooked.
> 
> I guess, after blowing through four or five "Hope you have a backup,
> but if not, you can try this and pray" checkpoints, I just want to
> check with the developers and group here to see if we did the best
> thing possible given the circumstances?

Before running repair - which is a "can't go back once it's started"
operation - you probably should have reached out for advice. We do
have tools that allow us to examine, investigate and modify the
on-disk format manually (xfs_db), and with metadump you can provide
us with a compact, obfuscated metadata-only image that we can look
at directly and see if there's anything that can be done to recover
the data from the broken filesystem. xfs_db requires substantial
expertise to use as a manual recovery tool, so it's not something
that just anyone can do...
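
Read-only poking around looks something like this (device illustrative):

  xfs_db -r /dev/mdX
  xfs_db> sb 0
  xfs_db> p

but knowing what to look at, and what might be worth tweaking, is the
hard part.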

> Xfs_repair is it, right? When things are that scrambled, pretty much
> all you can do is run an xfs_repair and hope for the best? Am I
> correct in thinking that there is no better or alternative tool that
> will give different results?

There are other tools that we have that can help understand the
nature of the corruption before an operation is performed that can't
be undone. Using those tools can lead to a better outcome, but in
RAID failure cases like these it is still often "storage is
completely scrambled, the filesystem and the data on it is toast no
matter what we do"....

> Can a commercial data recovery service make any better sense of a
> scrambled XFS than xfs_repair could? When the underlying device is
> presenting OK, just scrambled data on it?

Commercial data recovery services have their own custom data
scrapers that pull all the disconnected fragments of data off the
drive and then they tend to reconstruct the data manually from
there. They have a different goal to xfs_repair (data recovery vs
filesystem consistency) but a good data recovery service might be
able to scrape some of the data from disk blocks that xfs_repair
removed all the corrupt metadata references to...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Question regarding XFS crisis recovery
  2021-11-15 21:21 ` Dave Chinner
@ 2021-11-16 17:27   ` Sean Caron
  0 siblings, 0 replies; 6+ messages in thread
From: Sean Caron @ 2021-11-16 17:27 UTC (permalink / raw)
  To: Dave Chinner, Sean Caron; +Cc: linux-xfs

Thank you so much, Dave, for taking some time out with such a
thoughtful response. Also thank you to everyone else on this email
thread who contributed condolences and much useful information. I will
keep this all for future reference and will definitely remember that
there are some really good people here willing to help out when
disaster strikes. I've also put it on my agenda to check on SCT ERC
values set on our drives in other arrays. This has been very
educational for me and I really appreciate the help and outside review
of the actions taken to respond to this failure.

Best,

Sean

