linux-btrfs.vger.kernel.org archive mirror
* Recover data from damaged disk in "array"
@ 2021-01-19  0:00 Hérikz Nawarro
  2021-01-23  6:29 ` Chris Murphy
  2021-01-23 17:27 ` Zygo Blaxell
  0 siblings, 2 replies; 5+ messages in thread
From: Hérikz Nawarro @ 2021-01-19  0:00 UTC (permalink / raw)
  To: linux-btrfs

Hello everyone,

I have an array of 4 disks with btrfs, configured with data single and
metadata DUP. One disk in this array was plugged in with a bad SATA
cable that broke the plastic part of the data port (the pins are still
intact). I can still read the disk with an adapter, but is there a way
to "isolate" this disk, recover all of its data, and later replace the
faulty disk in the array with a new one?

Cheers,


* Re: Recover data from damaged disk in "array"
  2021-01-19  0:00 Recover data from damaged disk in "array" Hérikz Nawarro
@ 2021-01-23  6:29 ` Chris Murphy
  2021-01-25  1:48   ` Hérikz Nawarro
  2021-01-23 17:27 ` Zygo Blaxell
  1 sibling, 1 reply; 5+ messages in thread
From: Chris Murphy @ 2021-01-23  6:29 UTC (permalink / raw)
  To: Hérikz Nawarro; +Cc: Btrfs BTRFS

On Mon, Jan 18, 2021 at 5:02 PM Hérikz Nawarro <herikz.nawarro@gmail.com> wrote:
>
> Hello everyone,
>
> I have an array of 4 disks with btrfs, configured with data single and
> metadata DUP. One disk in this array was plugged in with a bad SATA
> cable that broke the plastic part of the data port (the pins are still
> intact). I can still read the disk with an adapter, but is there a way
> to "isolate" this disk, recover all of its data, and later replace the
> faulty disk in the array with a new one?

I'm not sure what you mean by isolate, or what's meant by recover all
data. To recover all data on all four disks suggests replicating all
of it to another file system - i.e. backup, rsync, snapshot(s) +
send/receive.
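
For example, a minimal sketch of the send/receive route, assuming the
filesystem is mounted at /array and a separate btrfs filesystem for the
copy is mounted at /backup (both paths are placeholders):

	# Take a read-only snapshot (send needs a read-only source),
	# then stream it to the backup filesystem.
	btrfs subvolume snapshot -r /array /array/backup-snap
	btrfs send /array/backup-snap | btrfs receive /backup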

Are there any kernel messages reporting btrfs problems with this file
system? That should be resolved as a priority before anything else.

Also, DUP metadata on a multiple-device btrfs is suboptimal: it's a
single point of failure. I suggest converting to raid1 metadata so the
file system can correct for drive-specific problems/bugs by getting a
good copy from another drive. If the DUP metadata happens to be on the
drive with the bad SATA cable, that could easily result in loss or
corruption of both copies of the metadata, and the whole file system
could implode.
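
Concretely, that conversion is a single balance; assuming the filesystem
is mounted at /array:

	# Convert only the metadata block groups to raid1; data stays single.
	btrfs balance start -mconvert=raid1 /array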

-- 
Chris Murphy


* Re: Recover data from damaged disk in "array"
  2021-01-19  0:00 Recover data from damaged disk in "array" Hérikz Nawarro
  2021-01-23  6:29 ` Chris Murphy
@ 2021-01-23 17:27 ` Zygo Blaxell
  2021-01-25  1:41   ` Hérikz Nawarro
  1 sibling, 1 reply; 5+ messages in thread
From: Zygo Blaxell @ 2021-01-23 17:27 UTC (permalink / raw)
  To: Hérikz Nawarro; +Cc: linux-btrfs

On Mon, Jan 18, 2021 at 09:00:58PM -0300, Hérikz Nawarro wrote:
> Hello everyone,
> 
> I have an array of 4 disks with btrfs, configured with data single and
> metadata DUP.

OK, that's weird.  Multiple disks should always have metadata in a raid1*
profile (raid1, raid10, raid1c3, or raid1c4).  dup metadata on multiple
disks, especially spinners, is going to be slow and brittle with no
upside.

> One disk in this array was plugged in with a bad SATA cable that broke
> the plastic part of the data port (the pins are still intact). I can
> still read the disk with an adapter, but is there a way to "isolate"
> this disk, recover all of its data, and later replace the faulty disk
> in the array with a new one?

There's no redundancy in this array, so you will have to keep the broken
disk online (or the filesystem unmounted) until a solution is implemented.

I wouldn't advise running with a broken connector at all, especially
without raid1 metadata.

Ideally, boot from rescue media, copy the broken device to a replacement
disk with dd, then remove the broken disk and mount the filesystem with
4 healthy disks.
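
A rough sketch of that copy, assuming /dev/sdX is the broken disk (via
the adapter) and /dev/sdY is a replacement at least as large; run it
from the rescue environment with the filesystem unmounted, and double
check the device names first:

	# Raw block copy of the whole device.  The copy keeps the same btrfs
	# UUID and devid, so detach the broken disk before mounting the array.
	dd if=/dev/sdX of=/dev/sdY bs=4M conv=fsync status=progress

	# Confirm all 4 member devices are visible before mounting.
	btrfs filesystem show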

If you try to operate with a broken connector, you could get disconnects
and lost writes.  With dup metadata there is no redundancy across
drives, so a lost metadata write on a single disk is a fatal error.
That will be a stress-test for btrfs's lost write detection, and even
if it works, it will force the filesystem read-only whenever it occurs
in a metadata write.  In the worst case, the disconnection resets the
drive and prevents its write cache from working properly, so a write is
lost in metadata, and the filesystem is unrecoverably damaged.
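
If you do run it that way for a while, the per-device error counters
give an early warning that writes are being dropped (assuming the
filesystem is mounted at /array):

	# Non-zero write_io_errs or flush_io_errs on the flaky device means
	# the connector is already losing writes; also watch dmesg.
	btrfs device stats /array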

There are other ways to do this, but they take longer, in some cases
orders of magnitude longer (and therefore higher risk):

1.  convert the metadata to raid1, starting with the faulty drive
(in these examples I'm just going to call it device 3, use the
correct device ID for your array):

	# Remove metadata from broken device first
	btrfs balance start -mdevid=3,convert=raid1,soft /array

	# Continue converting all other metadata in the array:
	btrfs balance start -mconvert=raid1,soft /array

After metadata is converted to raid1, an intermittent drive connection is
a much more recoverable problem, and you can replace the broken disk at
your leisure.  You'll get csum and IO errors when the drive disconnects,
but these errors will not be fatal to the filesystem as a whole because
the metadata will be safely written on other devices.

2.  convert the metadata to raid1 as in option 1, then delete the missing
device.  This is by far the slowest option, and only works if you have
sufficient space on the other drives for the new data.

3.  convert the metadata to raid1 as in option 1, add more disks so that
there is enough space for the device delete in option 2, then proceed
with the device delete in option 2.  This is probably worse than option
2 in terms of potential failure modes, but I put it here for completeness.

4.  when the replacement disk arrives, run 'btrfs replace' from the broken
disk to the new disk, then convert the metadata to raid1 as in option 1
so you're not using dup metadata any more.  This is as fast as the 'dd'
solution, but there is a slightly higher risk as the broken disk might
disconnect during a write and abort the replace operation.
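
For options 2 and 4 the commands would look roughly like this, with
/dev/sdY standing in for the new disk and 3 being the example device ID
from option 1:

	# Option 4: copy the failing device onto the new disk while the
	# filesystem stays mounted, then convert metadata as in option 1.
	btrfs replace start 3 /dev/sdY /array
	btrfs replace status /array

	# Option 2: after the raid1 conversion, shrink the array by removing
	# the failing device (slow; needs free space on the other drives).
	btrfs device remove 3 /array

	# Either way, verify the metadata profile afterwards.
	btrfs filesystem usage /array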

> Cheers,


* Re: Recover data from damaged disk in "array"
  2021-01-23 17:27 ` Zygo Blaxell
@ 2021-01-25  1:41   ` Hérikz Nawarro
  0 siblings, 0 replies; 5+ messages in thread
From: Hérikz Nawarro @ 2021-01-25  1:41 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

> OK, that's weird.  Multiple disks should always have metadata in a raid1*
> profile (raid1, raid10, raid1c3, or raid1c4).  dup metadata on multiple
> disks, especially spinners, is going to be slow and brittle with no
> upside.

I didn't know about this.

> There are other ways to do this, but they take longer, in some cases
> orders of magnitude longer (and therefore higher risk):
>
> 1.  convert the metadata to raid1, starting with the faulty drive
> (in these examples I'm just going to call it device 3, use the
> correct device ID for your array):
>
>        # Remove metadata from broken device first
>        btrfs balance start -mdevid=3,convert=raid1,soft /array
>
>        # Continue converting all other metadata in the array:
>        btrfs balance start -mconvert=raid1,soft /array
>
> After metadata is converted to raid1, an intermittent drive connection is
> a much more recoverable problem, and you can replace the broken disk at
> your leisure.  You'll get csum and IO errors when the drive disconnects,
> but these errors will not be fatal to the filesystem as a whole because
> the metadata will be safely written on other devices.
>
> 2.  convert the metadata to raid1 as in option 1, then delete the missing
> device.  This is by far the slowest option, and only works if you have
> sufficient space on the other drives for the new data.
>
> 3.  convert the metadata to raid1 as in option 1, add more disks so that
> there is enough space for the device delete in option 2, then proceed
> with the device delete in option 2.  This is probably worse than option
> 2 in terms of potential failure modes, but I put it here for completeness.
>
> 4.  when the replacement disk arrives, run 'btrfs replace' from the broken
> disk to the new disk, then convert the metadata to raid1 as in option 1
> so you're not using dup metadata any more.  This is as fast as the 'dd'
> solution, but there is a slightly higher risk as the broken disk might
> disconnect during a write and abort the replace operation.

Thanks for the options, I'll try them soon.



* Re: Recover data from damaged disk in "array"
  2021-01-23  6:29 ` Chris Murphy
@ 2021-01-25  1:48   ` Hérikz Nawarro
  0 siblings, 0 replies; 5+ messages in thread
From: Hérikz Nawarro @ 2021-01-25  1:48 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

> I'm not sure what you mean by isolate, or what's meant by recover all
> data. To recover all data on all four disks suggests replicating all
> of it to another file system - i.e. backup, rsync, snapshot(s) +
> send/receive.

I mean dd'ing the disk to an image file and copying the data off it
before replacing the broken disk.
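
Something like this, assuming /dev/sdX is the broken disk and there is
enough room for the image somewhere else (paths are placeholders):

	dd if=/dev/sdX of=/mnt/backup/broken-disk.img bs=4M conv=fsync status=progress
	# The image can later be attached as a block device if needed:
	losetup -f --show /mnt/backup/broken-disk.img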

> Are there any kernel messages reporting btrfs problems with this file
> system? That should be resolved as a priority before anything else.

No, the fs is fine, and I stopped using it when the disk port broke.

> Also, DUP metadata on a multiple-device btrfs is suboptimal: it's a
> single point of failure. I suggest converting to raid1 metadata so the
> file system can correct for drive-specific problems/bugs by getting a
> good copy from another drive. If the DUP metadata happens to be on the
> drive with the bad SATA cable, that could easily result in loss or
> corruption of both copies of the metadata, and the whole file system
> could implode.

I'll try to convert the whole fs as soon as I get a new disk for replacement.

