* Recommendation on raid5 drive error resolution
@ 2016-08-25  7:23 Gareth Pye
  2016-08-28  7:05 ` DanglingPointer
  2016-08-28 17:15 ` Chris Murphy
  0 siblings, 2 replies; 15+ messages in thread
From: Gareth Pye @ 2016-08-25  7:23 UTC (permalink / raw)
  To: linux-btrfs

So I've been living on the reckless-side (meta RAID6, data RAID5) and
I have a drive or two that isn't playing nicely any more.

dmesg of the system running for a few minutes: http://pastebin.com/9pHBRQVe

Everything of value is backed up, but I'd rather keep data than
download it all again. When I only saw one disk having troubles I was
concerned. Now I notice both sda and sdc having issues I'm thinking I
might be about to have a bad time.

What else should I provide?

-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia


* Re: Recommendation on raid5 drive error resolution
  2016-08-25  7:23 Recommendation on raid5 drive error resolution Gareth Pye
@ 2016-08-28  7:05 ` DanglingPointer
  2016-08-28 17:15 ` Chris Murphy
  1 sibling, 0 replies; 15+ messages in thread
From: DanglingPointer @ 2016-08-28  7:05 UTC (permalink / raw)
  To: Gareth Pye, linux-btrfs

Hi Gareth,

I'm interested in how you go with this, as I'm in a somewhat similar
situation with RAID5 for both.  Don't take this as advice, as I have
never done it; however, if I were in your shoes, I would take out one of
the disks that isn't playing nicely and rebuild the array.  Once it is
running smoothly, I would take the other disk that isn't playing nicely,
replace it, and rebuild again.  The whole process will take a fair bit
of time, but better to be safe than sorry.

Like I said I have never done it so do so at your own risk.

DanglingPointer

On 25/08/16 17:23, Gareth Pye wrote:
> So I've been living on the reckless-side (meta RAID6, data RAID5) and
> I have a drive or two that isn't playing nicely any more.
>
> dmesg of the system running for a few minutes: http://pastebin.com/9pHBRQVe
>
> Everything of value is backed up, but I'd rather keep data than
> download it all again. When I only saw one disk having troubles I was
> concerned. Now I notice both sda and sdc having issues I'm thinking I
> might be about to have a bad time.
>
> What else should I provide?
>



* Re: Recommendation on raid5 drive error resolution
  2016-08-25  7:23 Recommendation on raid5 drive error resolution Gareth Pye
  2016-08-28  7:05 ` DanglingPointer
@ 2016-08-28 17:15 ` Chris Murphy
  2016-08-29  0:15   ` Gareth Pye
  1 sibling, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2016-08-28 17:15 UTC (permalink / raw)
  To: Gareth Pye; +Cc: linux-btrfs

On Thu, Aug 25, 2016 at 1:23 AM, Gareth Pye <gareth@cerberos.id.au> wrote:
> So I've been living on the reckless-side (meta RAID6, data RAID5) and
> I have a drive or two that isn't playing nicely any more.
>
> dmesg of the system running for a few minutes: http://pastebin.com/9pHBRQVe
>
> Everything of value is backed up, but I'd rather keep data than
> download it all again. When I only saw one disk having troubles I was
> concerned. Now I notice both sda and sdc having issues I'm thinking I
> might be about to have a bad time.
>
> What else should I provide?


[   72.555921] BTRFS info (device sda7): bdev /dev/sdc errs: wr 0, rd
9091, flush 0, corrupt 0, gen 0
[   72.555941] BTRFS info (device sda7): bdev /dev/sdh errs: wr 0, rd
74, flush 0, corrupt 0, gen 0

Two devices with read errors, bad. If they overlap (unreadable sectors
in the same stripe on both drives), it's basically a dead raid5. And it
also means you *CANNOT* remove either drive.  So now you have a
problem, and I highly advise that you refresh your backup because this
is a really fragile state for any raid5.
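
For reference, the same per-device counters can be read at any time
with `btrfs device stats <mountpoint>`. Below is a minimal sketch for
picking out the devices with nonzero read errors. The function name is
made up for illustration, and it assumes the usual
`[/dev/sdX].read_io_errs N` output layout.

```shell
#!/bin/sh
# Sketch only: devices_with_read_errors is a hypothetical helper name.
# It reads `btrfs device stats` output on stdin and prints
# "<device> <count>" for every device whose read_io_errs is nonzero.
devices_with_read_errors() {
    awk '/\.read_io_errs/ && $2 > 0 {
        dev = $1
        sub(/^\[/, "", dev)                 # strip the leading "["
        sub(/\]\.read_io_errs$/, "", dev)   # strip "].read_io_errs"
        print dev, $2
    }'
}
```

Typical use would be something like
`btrfs device stats /mnt/array | devices_with_read_errors`, where
/mnt/array is a placeholder for the real mount point.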

What's the result from these two commands for every drive in this array?

smartctl -l scterc <dev>
cat /sys/block/sdX/device/timeout

The SCTERC value must be less than the timeout. This really must be
the first thing you do, even before starting your backup, because
otherwise a misconfiguration here has a very good chance of preventing
the success of getting a backup. Note these are not persistent
settings.
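
A rough sketch of that check, for anyone who wants to script it. The
helper names are invented for this example, and the parser assumes
smartctl prints a "Read: <n> (<x> seconds)" line with <n> in tenths of
a second, or "Read: Disabled". The sysfs timeout is in whole seconds.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical helpers.

# erc_below_timeout ERC_DECISECONDS TIMEOUT_SECONDS
# Returns 0 when SCT ERC is enabled and shorter than the kernel timeout.
erc_below_timeout() {
    erc_ds=$1
    timeout_s=$2
    # An ERC of 0 is treated as "disabled": the drive may then retry
    # for much longer than the kernel is willing to wait.
    [ "$erc_ds" -gt 0 ] && [ "$erc_ds" -lt $((timeout_s * 10)) ]
}

# parse_erc_read: reads `smartctl -l scterc` output on stdin and prints
# the read ERC value in deciseconds (0 if disabled or not reported).
parse_erc_read() {
    awk '$1 == "Read:" { if ($2 ~ /^[0-9]+$/) v = $2 } END { print v + 0 }'
}

# check_drive sdX: needs root and smartmontools; nothing calls it here.
check_drive() {
    dev=$1
    erc=$(smartctl -l scterc "/dev/$dev" | parse_erc_read)
    timeout=$(cat "/sys/block/$dev/device/timeout")
    if erc_below_timeout "$erc" "$timeout"; then
        echo "$dev: ok (ERC ${erc} ds, timeout ${timeout} s)"
    else
        echo "$dev: misconfigured (ERC ${erc} ds, timeout ${timeout} s)"
    fi
}
```

If a drive comes back misconfigured, the usual fixes are
`smartctl -l scterc,70,70 /dev/sdX` on drives that support SCT ERC, or
raising the kernel side with `echo 180 > /sys/block/sdX/device/timeout`
on drives that don't. The values 70 and 180 are common choices, not
requirements.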


-- 
Chris Murphy


* Re: Recommendation on raid5 drive error resolution
  2016-08-28 17:15 ` Chris Murphy
@ 2016-08-29  0:15   ` Gareth Pye
  2016-08-29  0:18     ` Gareth Pye
  0 siblings, 1 reply; 15+ messages in thread
From: Gareth Pye @ 2016-08-29  0:15 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

Current status:

Knowing things were bad I did set the scterc values sanely, but the
box was getting less stable so I thought a reboot was a good idea.
That reboot failed to mount the partition at all and everything
triggered my 'is this a psu issue' sense so I've left the box off till
I've got time to check if a psu replacement makes anything happier.
That might happen tonight or tomorrow.

I'll update the thread when I do that.


* Re: Recommendation on raid5 drive error resolution
  2016-08-29  0:15   ` Gareth Pye
@ 2016-08-29  0:18     ` Gareth Pye
  2016-08-29 23:01       ` Gareth Pye
  0 siblings, 1 reply; 15+ messages in thread
From: Gareth Pye @ 2016-08-29  0:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

Am I right that the wr: 0 means that the disks should at least be in a
nice consistent state? I know that overlapping read fails can still
cause everything to fail.


* Re: Recommendation on raid5 drive error resolution
  2016-08-29  0:18     ` Gareth Pye
@ 2016-08-29 23:01       ` Gareth Pye
  2016-08-30  9:58         ` Gareth Pye
  0 siblings, 1 reply; 15+ messages in thread
From: Gareth Pye @ 2016-08-29 23:01 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

When I can get this stupid box to boot from an external drive I'll
have some idea of what is going on....


* Re: Recommendation on raid5 drive error resolution
  2016-08-29 23:01       ` Gareth Pye
@ 2016-08-30  9:58         ` Gareth Pye
  2016-08-30 18:04           ` Chris Murphy
  0 siblings, 1 reply; 15+ messages in thread
From: Gareth Pye @ 2016-08-30  9:58 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

Okay, things aren't looking good. The FS won't mount for me:
http://pastebin.com/sEEdRxsN

On Tue, Aug 30, 2016 at 9:01 AM, Gareth Pye <gareth@cerberos.id.au> wrote:
> When I can get this stupid box to boot from an external drive I'll
> have some idea of what is going on....



-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia


* Re: Recommendation on raid5 drive error resolution
  2016-08-30  9:58         ` Gareth Pye
@ 2016-08-30 18:04           ` Chris Murphy
  2016-08-30 18:28             ` Chris Murphy
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2016-08-30 18:04 UTC (permalink / raw)
  To: Gareth Pye; +Cc: Chris Murphy, linux-btrfs

On Tue, Aug 30, 2016 at 3:58 AM, Gareth Pye <gareth@cerberos.id.au> wrote:
> Okay, things aren't looking good. The FS won't mount for me:
> http://pastebin.com/sEEdRxsN

Try to mount with -o ro,degraded. I have no idea which device it'll
end up dropping, but it might at least get you a read only mount so
you can get stuff off - if you want - without modifying the file
system.
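
Roughly, as a sketch: the device, mount point and destination below
are placeholders, the function names are made up, and DRY_RUN=1 just
prints the commands so they can be checked before running anything for
real.

```shell
#!/bin/sh
# Sketch only: run, rescue_mount and copy_off are hypothetical helpers.

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rescue_mount DEVICE MOUNTPOINT
# Read-only, degraded mount: nothing is written to the file system.
rescue_mount() {
    run mount -o ro,degraded "$1" "$2"
}

# copy_off SOURCE_DIR DEST_DIR
# Copy the contents of SOURCE_DIR into DEST_DIR, preserving metadata.
copy_off() {
    run rsync -a "$1/" "$2/"
}
```

Any one member device of the array can be given to mount. For example,
`DRY_RUN=1; rescue_mount /dev/sdX /mnt/rescue` prints the mount command
without touching the disks.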

One of us would have to go look in source to see what causes "[
163.612313] BTRFS: failed to read the system array on sdd" to appear
for each device. It's suspicious that every drive produces that
message, and there are no fixup messages at all ever. So it sounds
like it's not even getting far enough to figure out what's bad and
reconstruct from parity. And I don't even see csum errors either,
which is also suspicious. It's like the bootstrapping itself is
failing, which kinda implicates superblocks?

What do you get for

btrfs rescue super-recover -v /dev/sdX ?



-- 
Chris Murphy


* Re: Recommendation on raid5 drive error resolution
  2016-08-30 18:04           ` Chris Murphy
@ 2016-08-30 18:28             ` Chris Murphy
  2016-08-30 21:23               ` Gareth Pye
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2016-08-30 18:28 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Gareth Pye, linux-btrfs

On Tue, Aug 30, 2016 at 12:04 PM, Chris Murphy <lists@colorremedies.com> wrote:

> One of us would have to go look in source to see what causes "[
> 163.612313] BTRFS: failed to read the system array on sdd" to appear

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/disk-io.c?id=refs/tags/v4.7.2
line 2864

And btrfs_read_sys_array is found here, at line 6587:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/volumes.c?id=refs/tags/v4.7.2

And then comparing your 4.4.13 to 4.7.2....
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/disk-io.c?id=v4.7.2&id2=v4.4.13

There are changes in these areas but looks like they're mainly
printk's becoming btrfs_err. But I'd try a newer kernel before you
give up on it.

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/volumes.c?id=v4.7.2&id2=v4.4.13
More changes here too.

I suggest using btrfs-progs 4.5.3 or 4.6.1. You could also try 4.7 but
I'm getting some weird unexplained errors that only progs 4.7
complains about (clean scrubs, clean mounts, completely working file
system, but a buncha backref complaints from 4.7's btrfs check). But I
think the super-recover -v output should be reliable with any version
in the last ~year.

-- 
Chris Murphy


* Re: Recommendation on raid5 drive error resolution
  2016-08-30 18:28             ` Chris Murphy
@ 2016-08-30 21:23               ` Gareth Pye
  2016-08-30 21:45                 ` Chris Murphy
  2016-08-30 21:46                 ` Gareth Pye
  0 siblings, 2 replies; 15+ messages in thread
From: Gareth Pye @ 2016-08-30 21:23 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On Wed, Aug 31, 2016 at 4:28 AM, Chris Murphy <lists@colorremedies.com> wrote:
> But I'd try a newer kernel before you
> give up on it.


Any recommendations on liveCDs that have recent kernels & btrfs tools?
For no apparent reason system isn't booting normally either, and I'm
reluctant to fix that before at least confirming the things I at least
partially care about have a recent backup.

-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia


* Re: Recommendation on raid5 drive error resolution
  2016-08-30 21:23               ` Gareth Pye
@ 2016-08-30 21:45                 ` Chris Murphy
  2016-08-30 21:46                 ` Gareth Pye
  1 sibling, 0 replies; 15+ messages in thread
From: Chris Murphy @ 2016-08-30 21:45 UTC (permalink / raw)
  To: Gareth Pye; +Cc: Chris Murphy, linux-btrfs

On Tue, Aug 30, 2016 at 3:23 PM, Gareth Pye <gareth@cerberos.id.au> wrote:
> On Wed, Aug 31, 2016 at 4:28 AM, Chris Murphy <lists@colorremedies.com> wrote:
>> But I'd try a newer kernel before you
>> give up on it.
>
>
> Any recommendations on liveCDs that have recent kernels & btrfs tools?
> For no apparent reason system isn't booting normally either, and I'm
> reluctant to fix that before at least confirming the things I at least
> partially care about have a recent backup.

Fedora 25 Alpha released today with kernel 4.8rc2 and btrfs-progs 4.6.1.
https://getfedora.org/en/workstation/prerelease/

The top green "Download" button offers GNOME. If you want something
smaller, on the right hand side are netinstall images with the same
kernel and progs, but no GUI. You can choose the Troubleshooting menu,
and then the Rescue a Fedora System option. It boots, and then you're
at a text UI where you can just get to a shell, option 3.

The easiest way to create a USB stick is with dd, and it'll boot
practically anything: BIOS, UEFI, even Macs. Not all wireless firmware
is included in these media, so if you have a wired connection it'll be
easier to get dmesg and the contents of btrfs check off. If you opt
for the larger image (GNOME), it's a bit easier to get the terminal
output into a file and either scp it to another computer or use
fpaste <filename>, which will spit back a URL where it uploaded
the text.
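
A sketch of the dd step with a basic safety check, since dd will
happily overwrite the wrong disk. The function names and the image
file name are placeholders, not anything Fedora ships.

```shell
#!/bin/sh
# Sketch only: is_mounted and write_image are hypothetical helpers.

# is_mounted DEVICE [MOUNTS_FILE]
# Returns 0 if DEVICE, or anything whose name starts with DEVICE (such
# as one of its partitions), appears as a source in the mounts table.
is_mounted() {
    awk -v d="$1" 'index($1, d) == 1 { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# write_image IMAGE DEVICE
# Writes IMAGE to the whole DEVICE, refusing if it looks to be in use.
write_image() {
    img=$1
    dev=$2
    if is_mounted "$dev"; then
        echo "refusing: $dev (or a partition on it) is mounted" >&2
        return 1
    fi
    dd if="$img" of="$dev" bs=4M conv=fsync
}
```

Usage would be along the lines of
`write_image Fedora-netinst.iso /dev/sdX` as root, with both names
replaced by the real image and the real USB device.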




-- 
Chris Murphy


* Re: Recommendation on raid5 drive error resolution
  2016-08-30 21:23               ` Gareth Pye
  2016-08-30 21:45                 ` Chris Murphy
@ 2016-08-30 21:46                 ` Gareth Pye
  2016-08-31 23:04                   ` Gareth Pye
  1 sibling, 1 reply; 15+ messages in thread
From: Gareth Pye @ 2016-08-30 21:46 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

Or I could just once again select the right boot device in the bios. I
think I want some new hardware :)

On Wed, Aug 31, 2016 at 7:23 AM, Gareth Pye <gareth@cerberos.id.au> wrote:
> On Wed, Aug 31, 2016 at 4:28 AM, Chris Murphy <lists@colorremedies.com> wrote:
>> But I'd try a newer kernel before you
>> give up on it.
>
>
> Any recommendations on liveCDs that have recent kernels & btrfs tools?
> For no apparent reason system isn't booting normally either, and I'm
> reluctant to fix that before at least confirming the things I at least
> partially care about have a recent backup.
>
> --
> Gareth Pye - blog.cerberos.id.au
> Level 2 MTG Judge, Melbourne, Australia



-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia


* Re: Recommendation on raid5 drive error resolution
  2016-08-30 21:46                 ` Gareth Pye
@ 2016-08-31 23:04                   ` Gareth Pye
  2016-09-01 11:25                     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 15+ messages in thread
From: Gareth Pye @ 2016-08-31 23:04 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

ro,degraded has mounted it nicely and my rsync of the more useful data
is progressing at the speed of WiFi.

There are repeated read errors from one drive still but the rsync
hasn't bailed yet, which I think means there aren't any overlapping
errors in any of the files it has touched thus far. Am I right or is
there likely to be corrupt data in the files I've synced off?

On Wed, Aug 31, 2016 at 7:46 AM, Gareth Pye <gareth@cerberos.id.au> wrote:
> Or I could just once again select the right boot device in the bios. I
> think I want some new hardware :)
>
> On Wed, Aug 31, 2016 at 7:23 AM, Gareth Pye <gareth@cerberos.id.au> wrote:
>> On Wed, Aug 31, 2016 at 4:28 AM, Chris Murphy <lists@colorremedies.com> wrote:
>>> But I'd try a newer kernel before you
>>> give up on it.
>>
>>
>> Any recommendations on liveCDs that have recent kernels & btrfs tools?
>> For no apparent reason system isn't booting normally either, and I'm
>> reluctant to fix that before at least confirming the things I at least
>> partially care about have a recent backup.
>>
>> --
>> Gareth Pye - blog.cerberos.id.au
>> Level 2 MTG Judge, Melbourne, Australia
>
>
>
> --
> Gareth Pye - blog.cerberos.id.au
> Level 2 MTG Judge, Melbourne, Australia



-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia


* Re: Recommendation on raid5 drive error resolution
  2016-08-31 23:04                   ` Gareth Pye
@ 2016-09-01 11:25                     ` Austin S. Hemmelgarn
  2016-09-07  0:35                       ` Gareth Pye
  0 siblings, 1 reply; 15+ messages in thread
From: Austin S. Hemmelgarn @ 2016-09-01 11:25 UTC (permalink / raw)
  To: Gareth Pye, Chris Murphy; +Cc: linux-btrfs

On 2016-08-31 19:04, Gareth Pye wrote:
> ro,degraded has mounted it nicely and my rsync of the more useful data
> is progressing at the speed of WiFi.
>
> There are repeated read errors from one drive still but the rsync
> hasn't bailed yet, which I think means there aren't any overlapping
> errors in any of the files it has touched thus far. Am I right or is
> there likely to be corrupt data in the files I've synced off?
Unless you've been running with nodatacow or nodatasum in your mount
options, what you've concluded should be correct.  I would still
suggest verifying the data by some external means if possible; this
type of situation is not something that's well tested, and TBH I'm
amazed that things are working to the degree that they are.
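
One possible way to do that external check, as a rough sketch:
checksum the rescued copy and compare it against an older backup of
the same tree. The function names are made up for this example, and it
assumes GNU find, sort, xargs and sha256sum are available.

```shell
#!/bin/sh
# Sketch only: manifest and same_content are hypothetical helpers.

# manifest DIR
# Prints "<sha256>  ./relative/path" for every regular file under DIR,
# in a stable order so two manifests can be compared directly.
manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum )
}

# same_content DIR_A DIR_B
# Returns 0 when both trees hold the same files with the same contents.
same_content() {
    [ "$(manifest "$1")" = "$(manifest "$2")" ]
}
```

For large trees it is probably more practical to save each manifest to
a file and diff them, so the mismatching paths are listed. Files that
legitimately changed since the older backup will show up as
differences too.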
>
> On Wed, Aug 31, 2016 at 7:46 AM, Gareth Pye <gareth@cerberos.id.au> wrote:
>> Or I could just once again select the right boot device in the bios. I
>> think I want some new hardware :)
>>
>> On Wed, Aug 31, 2016 at 7:23 AM, Gareth Pye <gareth@cerberos.id.au> wrote:
>>> On Wed, Aug 31, 2016 at 4:28 AM, Chris Murphy <lists@colorremedies.com> wrote:
>>>> But I'd try a newer kernel before you
>>>> give up on it.
>>>
>>>
>>> Any recommendations on liveCDs that have recent kernels & btrfs tools?
>>> For no apparent reason system isn't booting normally either, and I'm
>>> reluctant to fix that before at least confirming the things I at least
>>> partially care about have a recent backup.
>>>
>>> --
>>> Gareth Pye - blog.cerberos.id.au
>>> Level 2 MTG Judge, Melbourne, Australia
>>
>>
>>
>> --
>> Gareth Pye - blog.cerberos.id.au
>> Level 2 MTG Judge, Melbourne, Australia
>
>
>



* Re: Recommendation on raid5 drive error resolution
  2016-09-01 11:25                     ` Austin S. Hemmelgarn
@ 2016-09-07  0:35                       ` Gareth Pye
  0 siblings, 0 replies; 15+ messages in thread
From: Gareth Pye @ 2016-09-07  0:35 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, linux-btrfs

Things have been copying off really well.

I'm starting to suspect the issue was the PSU which I've swapped out.
What is the line I should see in dmesg if the degraded option was
actually used when mounting the file system?

On Thu, Sep 1, 2016 at 9:25 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-08-31 19:04, Gareth Pye wrote:
>>
>> ro,degraded has mounted it nicely and my rsync of the more useful data
>> is progressing at the speed of WiFi.
>>
>> There are repeated read errors from one drive still but the rsync
>> hasn't bailed yet, which I think means there aren't any overlapping
>> errors in any of the files it has touched thus far. Am I right or is
>> there likely to be corrupt data in the files I've synced off?
>
> Unless you've been running with nodatacow or nodatasum in your mount
> options, what you've concluded should be correct.  I would still suggest
> verifying the data by some external means if possible; this type of
> situation is not something that's well tested, and TBH I'm amazed that
> things are working to the degree that they are.
>
>>
>> On Wed, Aug 31, 2016 at 7:46 AM, Gareth Pye <gareth@cerberos.id.au> wrote:
>>>
>>> Or I could just once again select the right boot device in the bios. I
>>> think I want some new hardware :)
>>>
>>> On Wed, Aug 31, 2016 at 7:23 AM, Gareth Pye <gareth@cerberos.id.au>
>>> wrote:
>>>>
>>>> On Wed, Aug 31, 2016 at 4:28 AM, Chris Murphy <lists@colorremedies.com>
>>>> wrote:
>>>>>
>>>>> But I'd try a newer kernel before you
>>>>> give up on it.
>>>>
>>>>
>>>>
>>>> Any recommendations on liveCDs that have recent kernels & btrfs tools?
>>>> For no apparent reason system isn't booting normally either, and I'm
>>>> reluctant to fix that before at least confirming the things I at least
>>>> partially care about have a recent backup.
>>>>
>>>> --
>>>> Gareth Pye - blog.cerberos.id.au
>>>> Level 2 MTG Judge, Melbourne, Australia
>>>
>>>
>>>
>>>
>>> --
>>> Gareth Pye - blog.cerberos.id.au
>>> Level 2 MTG Judge, Melbourne, Australia
>>
>>
>>
>>
>



-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia

