* Recover data from damaged disk in "array"

From: Hérikz Nawarro @ 2021-01-19 0:00 UTC
To: linux-btrfs

Hello everyone,

I have an array of 4 disks with btrfs configured with data "single" and
metadata "dup". One disk in this array was plugged in with a bad SATA
cable that broke the plastic part of the data port (the pins are still
intact). I can still read the disk with an adapter, but is there a way
to "isolate" this disk, recover all of its data, and later replace the
faulty disk in the array with a new one?

Cheers,
* Re: Recover data from damaged disk in "array"

From: Chris Murphy @ 2021-01-23 6:29 UTC
To: Hérikz Nawarro; +Cc: linux-btrfs

On Mon, Jan 18, 2021 at 5:02 PM Hérikz Nawarro <herikz.nawarro@gmail.com> wrote:
>
> Hello everyone,
>
> I got an array of 4 disks with btrfs configured with data single and
> metadata dup, one disk of this array was plugged with a bad sata cable
> that broke the plastic part of the data port (the pins still intact),
> i still can read the disk with an adapter, but there's a way to
> "isolate" this disk, recover all data and later replace the fault disk
> in the array with a new one?

I'm not sure what you mean by isolate, or what's meant by recover all
data. Recovering all data on all four disks suggests replicating all of
it to another file system - i.e. backup, rsync, or snapshot(s) +
send/receive.

Are there any kernel messages reporting btrfs problems with this file
system? That should be resolved as a priority before anything else.

Also, DUP metadata on a multiple-device btrfs is suboptimal: it's a
single point of failure. I suggest converting to raid1 metadata so the
file system can correct drive-specific problems/bugs by getting a good
copy from another drive. If the DUP metadata happens to be on the drive
with the bad SATA cable, that could easily result in loss or corruption
of both copies of the metadata, and the whole file system can implode.

--
Chris Murphy
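For the kernel-message check, something like the sketch below works on most Linux systems. The mount point `/array` and the sample stats line are illustrative; real input comes from `dmesg` and `btrfs device stats`:

```shell
# Look for btrfs errors in the kernel log (mount point is illustrative):
#   dmesg --level=err,warn | grep -i btrfs
#   btrfs device stats /array
#
# A tiny filter that flags any nonzero counter in `btrfs device stats`
# output, so problems stand out at a glance:
check_stats() {
  awk '$2 > 0 {print "ERRORS:", $0}'
}

# Sample line; real usage: btrfs device stats /array | check_stats
printf '[/dev/sdc].write_io_errs 0\n[/dev/sdc].corruption_errs 3\n' | check_stats
# → ERRORS: [/dev/sdc].corruption_errs 3
```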
* Re: Recover data from damaged disk in "array"

From: Hérikz Nawarro @ 2021-01-25 1:48 UTC
To: Chris Murphy; +Cc: linux-btrfs

> I'm not sure what you mean by isolate, or what's meant by recover all
> data. To recover all data on all four disks suggests replicating all
> of it to another file system - i.e. backup, rsync, snapshot(s) +
> send/receive.

I mean dd the disk to a file and copy the data off, before replacing
the broken disk.

> Are there any kernel messages reporting btrfs problems with this file
> system? That should be resolved as a priority before anything else.

No, the fs is fine and I stopped using it when the disk port broke.

> Also, DUP metadata for multiple device btrfs is suboptimal. It's a
> single point of failure. I suggest converting to raid1 metadata so the
> file system can correct for drive specific problems/bugs by getting a
> good copy from another drive. If it's the case DUP metadata is on the
> drive with the bad sata cable, that could easily result in loss or
> corruption of both copies of metadata and the whole file system can
> implode.

I'll try to convert the whole fs as soon as I get a new disk for
replacement.
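The "dd the disk to a file" step can be wrapped so the copy logic is exercisable on any source. All device names and paths below are illustrative placeholders, not from the thread:

```shell
# Core of the "dd the disk to a file" approach. Real usage would be
# something like: image_device /dev/sdc /mnt/space/disk3.img
# (both names are placeholders). The image can then be attached with
# `losetup -f --show disk3.img`, followed by `btrfs device scan` and a
# normal mount of the array.
image_device() {
  # $1 = source device/file, $2 = destination image.
  # conv=noerror keeps going past unreadable sectors on a flaky disk.
  dd if="$1" of="$2" bs=1M conv=noerror status=none
}
```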
* Re: Recover data from damaged disk in "array"

From: Zygo Blaxell @ 2021-01-23 17:27 UTC
To: Hérikz Nawarro; +Cc: linux-btrfs

On Mon, Jan 18, 2021 at 09:00:58PM -0300, Hérikz Nawarro wrote:
> Hello everyone,
>
> I got an array of 4 disks with btrfs configured with data single and
> metadata dup

OK, that's weird. Multiple disks should always have metadata in a raid1*
profile (raid1, raid10, raid1c3, or raid1c4). dup metadata on multiple
disks, especially spinners, is going to be slow and brittle with no
upside.

> , one disk of this array was plugged with a bad sata cable
> that broke the plastic part of the data port (the pins still intact),
> i still can read the disk with an adapter, but there's a way to
> "isolate" this disk, recover all data and later replace the fault disk
> in the array with a new one?

There's no redundancy in this array, so you will have to keep the broken
disk online (or the filesystem unmounted) until a solution is
implemented. I wouldn't advise running with a broken connector at all,
especially without raid1 metadata.

Ideally, boot from rescue media, copy the broken device to a replacement
disk with dd, then remove the broken disk and mount the filesystem with
4 healthy disks.

If you try to operate with a broken connector, you could get disconnects
and lost writes. With dup metadata there is no redundancy across drives,
so a lost metadata write on a single disk is a fatal error. That will be
a stress test for btrfs's lost-write detection, and even if it works, it
will force the filesystem read-only whenever a metadata write is lost.
In the worst case, the disconnection resets the drive and prevents its
write cache from working properly, so a metadata write is lost and the
filesystem is unrecoverably damaged.

There are other ways to do this, but they take longer, in some cases
orders of magnitude longer (and therefore higher risk):

1. Convert the metadata to raid1, starting with the faulty drive (in
these examples I'm just going to call it device 3; use the correct
device ID for your array):

	# Remove metadata from broken device first
	btrfs balance start -mdevid=3,convert=raid1,soft /array

	# Continue converting all other metadata in the array:
	btrfs balance start -mconvert=raid1,soft /array

After metadata is converted to raid1, an intermittent drive connection
is a much more recoverable problem, and you can replace the broken disk
at your leisure. You'll get csum and IO errors when the drive
disconnects, but these errors will not be fatal to the filesystem as a
whole because the metadata will be safely written on other devices.

2. Convert the metadata to raid1 as in option 1, then delete the
missing device. This is by far the slowest option, and only works if
you have sufficient space on the other drives for the new data.

3. Convert the metadata to raid1 as in option 1, add more disks so that
there is enough space for the device delete in option 2, then proceed
with the device delete in option 2. This is probably worse than option
2 in terms of potential failure modes, but I put it here for
completeness.

4. When the replacement disk arrives, run 'btrfs replace' from the
broken disk to the new disk, then convert the metadata to raid1 as in
option 1 so you're not using dup metadata any more. This is as fast as
the 'dd' solution, but there is a slightly higher risk: the broken disk
might disconnect during a write and abort the replace operation.

> Cheers,
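The `-mdevid=3` balance filter above takes a btrfs device ID, not a device node; `btrfs filesystem show` prints both. A small helper can map one to the other (the line format is assumed from typical btrfs-progs output; the sizes and paths below are illustrative):

```shell
# Find the btrfs devid for a given device path in the output of
# `btrfs filesystem show` (format assumed from typical btrfs-progs
# output; sizes and paths below are illustrative).
devid_of() {
  # $1 = device path; stdin = `btrfs filesystem show` output
  awk -v dev="$1" '$1 == "devid" && $NF == dev {print $2}'
}

# Sample line; real usage: btrfs filesystem show /array | devid_of /dev/sdc
printf '\tdevid    3 size 3.64TiB used 2.10TiB path /dev/sdc\n' | devid_of /dev/sdc
# → 3
```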
* Re: Recover data from damaged disk in "array"

From: Hérikz Nawarro @ 2021-01-25 1:41 UTC
To: Zygo Blaxell; +Cc: linux-btrfs

> OK, that's weird. Multiple disks should always have metadata in a raid1*
> profile (raid1, raid10, raid1c3, or raid1c4). dup metadata on multiple
> disks, especially spinners, is going to be slow and brittle with no
> upside.

I didn't know about this.

> There are other ways to do this, but they take longer, in some cases
> orders of magnitude longer (and therefore higher risk):
>
> 1. convert the metadata to raid1, starting with the faulty drive
> [...]
>
> 4. when the replacement disk arrives, run 'btrfs replace' from the
> broken disk to the new disk, then convert the metadata to raid1 as in
> option 1 so you're not using dup metadata any more. This is as fast
> as the 'dd' solution, but there is a slightly higher risk as the
> broken disk might disconnect during a write and abort the replace
> operation.

Thanks for the options, I'll try them soon.
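Once the replacement disk is installed, option 4 might look like the sketch below. The devid (3), `/dev/sdd`, and `/array` are placeholders, and `DRY_RUN=1` prints the commands instead of executing them:

```shell
# Sketch of option 4: replace the broken member, then drop dup metadata.
# devid 3, /dev/sdd, and /array are placeholders; look up your own devid
# with `btrfs filesystem show`.
replace_and_convert() {
  # $1 = devid of the broken disk, $2 = new disk, $3 = mount point.
  run=${DRY_RUN:+echo}   # set DRY_RUN=1 to print instead of execute
  $run btrfs replace start "$1" "$2" "$3" &&
  $run btrfs replace status "$3" &&
  $run btrfs balance start -mconvert=raid1,soft "$3"
}

# Dry run first, to review what would be executed:
DRY_RUN=1 replace_and_convert 3 /dev/sdd /array
```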