* Recommendation on raid5 drive error resolution
From: Gareth Pye @ 2016-08-25 7:23 UTC
To: linux-btrfs

So I've been living on the reckless-side (meta RAID6, data RAID5) and I have a drive or two that isn't playing nicely any more.

dmesg of the system running for a few minutes: http://pastebin.com/9pHBRQVe

Everything of value is backed up, but I'd rather keep data than download it all again. When I only saw one disk having troubles I was concerned. Now I notice both sda and sdc having issues I'm thinking I might be about to have a bad time.

What else should I provide?

--
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia

* Re: Recommendation on raid5 drive error resolution
From: DanglingPointer @ 2016-08-28 7:05 UTC
To: Gareth Pye, linux-btrfs

Hi Gareth,

I'm interested in how you go with this, as my setup is somewhat similar, with RAID5 for both metadata and data.

Don't take this as advice, as I have never done it; however, if I were in your shoes, I would take out one of the disks that isn't playing nicely and rebuild the array. Once it is running smoothly, I would replace the other disk that isn't playing nicely and rebuild again. The whole process will take a fair bit of time, but better to be safe than sorry.

Like I said, I have never done it, so do so at your own risk.

DanglingPointer

* Re: Recommendation on raid5 drive error resolution
From: Chris Murphy @ 2016-08-28 17:15 UTC
To: Gareth Pye; +Cc: linux-btrfs

On Thu, Aug 25, 2016 at 1:23 AM, Gareth Pye <gareth@cerberos.id.au> wrote:
> So I've been living on the reckless-side (meta RAID6, data RAID5) and
> I have a drive or two that isn't playing nicely any more.
>
> dmesg of the system running for a few minutes: http://pastebin.com/9pHBRQVe
>
> Everything of value is backed up, but I'd rather keep data than
> download it all again. When I only saw one disk having troubles I was
> concerned. Now I notice both sda and sdc having issues I'm thinking I
> might be about to have a bad time.
>
> What else should I provide?

[ 72.555921] BTRFS info (device sda7): bdev /dev/sdc errs: wr 0, rd 9091, flush 0, corrupt 0, gen 0
[ 72.555941] BTRFS info (device sda7): bdev /dev/sdh errs: wr 0, rd 74, flush 0, corrupt 0, gen 0

Two devices with read errors, bad. If they overlap, it's basically a dead raid5. And it also means you *CANNOT* remove either drive.

So now you have a problem, and I highly advise that you refresh your backup, because this is a really fragile state for any raid5.

What's the result from these two commands for every drive in this array?

smartctl -l scterc <dev>
cat /sys/block/sdX/device/timeout

The SCTERC value must be less than the timeout. This really must be the first thing you do, even before starting your backup, because otherwise a misconfiguration here has a very good chance of preventing you from getting a successful backup. Note these are not persistent settings.

--
Chris Murphy
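
The comparison Chris asks for can be sketched in shell. This is a minimal sketch, not a definitive tool. It assumes that `smartctl -l scterc` prints a line of the form `Read: 70 (7.0 seconds)` with the value in tenths of a second (or `Read: Disabled`), and that `/sys/block/<dev>/device/timeout` holds the kernel command timeout in seconds. The function names `scterc_read_ds`, `erc_below_timeout` and `check_drive` are made up for illustration.

```shell
# Extract the SCT ERC "Read:" value (tenths of a second) from the text that
# `smartctl -l scterc <dev>` prints on stdin. Prints nothing when the value
# is not numeric (for example "Disabled").
scterc_read_ds() {
    awk '$1 == "Read:" && $2 ~ /^[0-9]+$/ { print $2; exit }'
}

# Succeed only if the ERC value (tenths of a second) is shorter than the
# kernel timeout (seconds). Arguments: <erc_deciseconds> <timeout_seconds>
erc_below_timeout() {
    [ "$1" -lt $(( $2 * 10 )) ]
}

# Hypothetical wrapper for one drive, e.g. `check_drive sda`. Needs root.
check_drive() {
    dev=$1
    erc=$(smartctl -l scterc "/dev/$dev" | scterc_read_ds)
    timeout=$(cat "/sys/block/$dev/device/timeout")
    if [ -z "$erc" ]; then
        echo "$dev: SCT ERC disabled or unsupported (kernel timeout ${timeout}s)"
    elif erc_below_timeout "$erc" "$timeout"; then
        echo "$dev: ok (ERC ${erc} ds, kernel timeout ${timeout}s)"
    else
        echo "$dev: ERC ${erc} ds is not below kernel timeout ${timeout}s"
    fi
}
```

Running `check_drive` once per member drive gives one line per device. Where a drive reports no usable ERC value, the usual options are to set one with `smartctl -l scterc,70,70 <dev>` or to raise the kernel timeout by writing a larger number of seconds to the sysfs file; as noted above, neither setting survives a reboot.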

* Re: Recommendation on raid5 drive error resolution
From: Gareth Pye @ 2016-08-29 0:15 UTC
To: Chris Murphy; +Cc: linux-btrfs

Current status:

Knowing things were bad I did set the scterc values sanely, but the box was getting less stable so I thought a reboot was a good idea. That reboot failed to mount the partition at all, and everything triggered my 'is this a PSU issue' sense, so I've left the box off till I've got time to check if a PSU replacement makes anything happier. That might happen tonight or tomorrow.

I'll update the thread when I do that.

* Re: Recommendation on raid5 drive error resolution
From: Gareth Pye @ 2016-08-29 0:18 UTC
To: Chris Murphy; +Cc: linux-btrfs

Am I right that the wr: 0 means that the disks should at least be in a nice consistent state? I know that overlapping read failures can still cause everything to fail.
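
The thread does not answer this question directly. As a small aid for reading the counters, the sketch below pulls every non-zero counter out of btrfs `bdev ... errs:` lines such as the two quoted earlier. It assumes the line format shown in this thread; the function name `nonzero_errs` is made up for illustration. On a mounted filesystem, `btrfs device stats <mountpoint>` reports the same per-device counters.

```shell
# Read kernel log lines on stdin and print "<device> <counter> <value>"
# for every counter that is greater than zero.
nonzero_errs() {
    # Keep only "<device> <counters>" from lines containing "bdev ... errs:".
    sed -n 's/.*bdev \([^ ]*\) errs: \(.*\)$/\1 \2/p' |
    tr -d ',' |
    # Fields are now: device name1 value1 name2 value2 ...
    awk '{ for (i = 2; i < NF; i += 2) if ($(i + 1) + 0 > 0) print $1, $i, $(i + 1) }'
}
```

For example, `dmesg | nonzero_errs` would list only the devices and counters that need attention, which makes it easier to see at a glance that the errors here are all reads.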

* Re: Recommendation on raid5 drive error resolution
From: Gareth Pye @ 2016-08-29 23:01 UTC
To: Chris Murphy; +Cc: linux-btrfs

When I can get this stupid box to boot from an external drive I'll have some idea of what is going on....

* Re: Recommendation on raid5 drive error resolution
From: Gareth Pye @ 2016-08-30 9:58 UTC
To: Chris Murphy; +Cc: linux-btrfs

Okay, things aren't looking good. The FS won't mount for me: http://pastebin.com/sEEdRxsN

* Re: Recommendation on raid5 drive error resolution
From: Chris Murphy @ 2016-08-30 18:04 UTC
To: Gareth Pye; +Cc: Chris Murphy, linux-btrfs

On Tue, Aug 30, 2016 at 3:58 AM, Gareth Pye <gareth@cerberos.id.au> wrote:
> Okay, things aren't looking good. The FS won't mount for me:
> http://pastebin.com/sEEdRxsN

Try to mount with -o ro,degraded. I have no idea which device it'll end up dropping, but it might at least get you a read-only mount so you can get stuff off, if you want, without modifying the file system.

One of us would have to go look in source to see what causes "[ 163.612313] BTRFS: failed to read the system array on sdd" to appear for each device. It's suspicious that every drive produces that message, and there are no fixup messages at all, ever. So it sounds like it's not even getting far enough to figure out what's bad and reconstruct from parity. And I don't see csum errors either, which is also suspicious. It's like the bootstrapping itself is failing, which kinda implicates superblocks?

What do you get for btrfs rescue super-recover -v /dev/sdX ?

--
Chris Murphy
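
The two suggestions above can be written out per device. The sketch below only prints the commands so they can be reviewed before anything touches the array; it runs nothing itself. The function name `print_rescue_plan`, the mount point and the device names are placeholders, and it assumes that naming any one member device is enough for the mount command.

```shell
# Print (do not run) the read-only degraded mount and the superblock check
# for each member device. Arguments: <mountpoint> <device>...
print_rescue_plan() {
    mnt=$1
    shift
    # The first listed device is used to name the filesystem to mount.
    echo "mount -o ro,degraded $1 $mnt"
    for dev in "$@"; do
        echo "btrfs rescue super-recover -v $dev"
    done
}
```

Called as `print_rescue_plan /mnt/pool /dev/sda7 /dev/sdc`, it emits one mount line followed by one super-recover line per device, which can then be run by hand, one at a time, while watching dmesg.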

* Re: Recommendation on raid5 drive error resolution
From: Chris Murphy @ 2016-08-30 18:28 UTC
To: Chris Murphy; +Cc: Gareth Pye, linux-btrfs

On Tue, Aug 30, 2016 at 12:04 PM, Chris Murphy <lists@colorremedies.com> wrote:
> One of us would have to go look in source to see what causes "[
> 163.612313] BTRFS: failed to read the system array on sdd" to appear

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/disk-io.c?id=refs/tags/v4.7.2
line 2864

And btrfs_read_sys_array is found here, at line 6587:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/volumes.c?id=refs/tags/v4.7.2

And then comparing your 4.4.13 to 4.7.2....
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/disk-io.c?id=v4.7.2&id2=v4.4.13

There are changes in these areas, but it looks like they're mainly printk's becoming btrfs_err. But I'd try a newer kernel before you give up on it.

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/volumes.c?id=v4.7.2&id2=v4.4.13

More changes here too.

I suggest using btrfs-progs 4.5.3 or 4.6.1. You could also try 4.7, but I'm getting some weird unexplained errors that only progs 4.7 complains about (clean scrubs, clean mounts, completely working file system, but a buncha backref complaints from 4.7's btrfs check). But I think the super-recover -v output should be reliable with any version in the last ~year.

--
Chris Murphy

* Re: Recommendation on raid5 drive error resolution
From: Gareth Pye @ 2016-08-30 21:23 UTC
To: Chris Murphy; +Cc: linux-btrfs

On Wed, Aug 31, 2016 at 4:28 AM, Chris Murphy <lists@colorremedies.com> wrote:
> But I'd try a newer kernel before you
> give up on it.

Any recommendations on liveCDs that have recent kernels & btrfs tools? For no apparent reason system isn't booting normally either, and I'm reluctant to fix that before at least confirming the things I at least partially care about have a recent backup.

* Re: Recommendation on raid5 drive error resolution
From: Chris Murphy @ 2016-08-30 21:45 UTC
To: Gareth Pye; +Cc: Chris Murphy, linux-btrfs

On Tue, Aug 30, 2016 at 3:23 PM, Gareth Pye <gareth@cerberos.id.au> wrote:
> On Wed, Aug 31, 2016 at 4:28 AM, Chris Murphy <lists@colorremedies.com> wrote:
>> But I'd try a newer kernel before you
>> give up on it.
>
> Any recommendations on liveCDs that have recent kernels & btrfs tools?
> For no apparent reason system isn't booting normally either, and I'm
> reluctant to fix that before at least confirming the things I at least
> partially care about have a recent backup.

Fedora 25 Alpha was released today with kernel 4.8rc2 and btrfs-progs 4.6.1.
https://getfedora.org/en/workstation/prerelease/

The top green "Download" button offers GNOME. If you want something smaller, on the right hand side are netinstall images with the same kernel and progs, but no GUI. You can choose the Troubleshooting menu, and then the Rescue a Fedora System option. It boots, and then you're at a text UI where you can just get to a shell, option 3.

The easiest way to create a USB stick is with dd, and it'll boot practically anything: BIOS, UEFI, even Macs. Not all wireless firmware is included in these media, so if you have a wired connection it'll be easier to get dmesg and the output of btrfs check off. If you opt for the larger image (GNOME), it's a bit easier to get the terminal output into a file and either scp it to another computer, or you can use fpaste <filename> and it'll spit back a URL where it uploaded the text.

--
Chris Murphy
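
Writing the image with dd can be sketched as below. This is a guarded sketch rather than a recommended procedure: the function name `write_image` and its arguments are made up for illustration, and the checks only confirm that the image exists and that the target is a block device. They cannot tell whether it is the right block device, and dd overwrites whatever it is pointed at.

```shell
# Write an ISO image to a USB stick. Arguments: <image-file> <target-device>
# Refuses to run dd unless the image is a regular file and the target is a
# block device.
write_image() {
    image=$1
    target=$2
    [ -f "$image" ] || { echo "no such image: $image" >&2; return 1; }
    [ -b "$target" ] || { echo "not a block device: $target" >&2; return 1; }
    # conv=fsync makes dd flush the data to the device before it exits.
    dd if="$image" of="$target" bs=4M conv=fsync
}
```

The target should be the whole device (for example /dev/sdX, not /dev/sdX1), and it is worth confirming its name with lsblk immediately before running the command.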

* Re: Recommendation on raid5 drive error resolution
From: Gareth Pye @ 2016-08-30 21:46 UTC
To: Chris Murphy; +Cc: linux-btrfs

Or I could just once again select the right boot device in the BIOS. I think I want some new hardware :)

* Re: Recommendation on raid5 drive error resolution
From: Gareth Pye @ 2016-08-31 23:04 UTC
To: Chris Murphy; +Cc: linux-btrfs

ro,degraded has mounted it nicely and my rsync of the more useful data is progressing at the speed of WiFi.

There are still repeated read errors from one drive, but the rsync hasn't bailed yet, which I think means there aren't any overlapping errors in any of the files it has touched thus far. Am I right, or is there likely to be corrupt data in the files I've synced off?

* Re: Recommendation on raid5 drive error resolution
From: Austin S. Hemmelgarn @ 2016-09-01 11:25 UTC
To: Gareth Pye, Chris Murphy; +Cc: linux-btrfs

On 2016-08-31 19:04, Gareth Pye wrote:
> ro,degraded has mounted it nicely and my rsync of the more useful data
> is progressing at the speed of WiFi.
>
> There are repeated read errors from one drive still but the rsync
> hasn't bailed yet, which I think means there aren't any overlapping
> errors in any of the files it has touched thus far. Am I right or is
> there likely to be corrupt data in the files I've synced off?

Unless you've been running with nodatacow or nodatasum in your mount options, what you've concluded should be correct. I would still suggest verifying the data by some external means if possible. This type of situation is not something that's well tested, and TBH I'm amazed that things are working to the degree that they are.
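
One possible form of the "external means" Austin mentions is a byte-for-byte comparison of the freshly copied files against an independent reference, such as an older backup or a fresh download. The sketch below assumes such a reference tree exists; the function name `verify_copy` and both directory arguments are placeholders. Files that were legitimately changed since the reference was made will also be reported, so the output is a list to review, not proof of corruption.

```shell
# Compare every regular file under <reference> with the file at the same
# relative path under <copy>. Prints the relative path of each file that is
# missing from <copy> or differs, and returns non-zero if any were found.
# File names containing newlines are not handled.
verify_copy() {
    ref=$1
    copy=$2
    ( cd "$ref" && find . -type f ) | (
        status=0
        while IFS= read -r rel; do
            # cmp -s is silent and fails on any difference or missing file.
            if ! cmp -s "$ref/$rel" "$copy/$rel"; then
                echo "$rel"
                status=1
            fi
        done
        exit "$status"
    )
}
```

For example, `verify_copy /backup/old /backup/from-array` would print nothing and succeed only if every file in the older backup has an identical counterpart in the new copy.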

* Re: Recommendation on raid5 drive error resolution
From: Gareth Pye @ 2016-09-07 0:35 UTC
To: Austin S. Hemmelgarn; +Cc: Chris Murphy, linux-btrfs

Things have been copying off really well. I'm starting to suspect the issue was the PSU, which I've swapped out. What is the line I should see in dmesg if the degraded option was actually used when mounting the file system?
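
The thread ends without an answer to this question. One check that does not depend on the wording of any kernel message is to look at the active mount options. The sketch below assumes that btrfs lists `degraded` among the options shown in /proc/mounts when the option is in effect; the function name `mounted_degraded` and the optional second argument are made up for illustration.

```shell
# Succeed if <mountpoint> is listed with the "degraded" option.
# Arguments: <mountpoint> [mounts-file], default mounts file /proc/mounts
mounted_degraded() {
    awk -v mp="$1" '
        $2 == mp {
            # Field 4 is the comma-separated option list.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "degraded")
                    found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}
```

Usage would be along the lines of `mounted_degraded /mnt/pool && echo "degraded option is active"`. Note that this shows the option was accepted, not that a device was actually missing at mount time.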

Thread overview: 15+ messages
2016-08-25  7:23 Recommendation on raid5 drive error resolution Gareth Pye
2016-08-28  7:05 ` DanglingPointer
2016-08-28 17:15 ` Chris Murphy
2016-08-29  0:15   ` Gareth Pye
2016-08-29  0:18     ` Gareth Pye
2016-08-29 23:01       ` Gareth Pye
2016-08-30  9:58         ` Gareth Pye
2016-08-30 18:04           ` Chris Murphy
2016-08-30 18:28             ` Chris Murphy
2016-08-30 21:23               ` Gareth Pye
2016-08-30 21:45                 ` Chris Murphy
2016-08-30 21:46                 ` Gareth Pye
2016-08-31 23:04                   ` Gareth Pye
2016-09-01 11:25                     ` Austin S. Hemmelgarn
2016-09-07  0:35                       ` Gareth Pye