* 3-disk RAID5 won't assemble
@ 2017-10-20 19:51 Alex Elder
  2017-10-20 20:34 ` Wols Lists
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Elder @ 2017-10-20 19:51 UTC (permalink / raw)
  To: linux-raid

I have a 3-disk RAID5 with identical drives that won't assemble.

The event counts on two of them are the same (20592) and one is
quite a bit less (20466).  I do not expect failing hardware.

The problem occurred while I was copying some large files to
the XFS volume on the device, while doing something else that
ate up all my memory.  (It was a long time ago, so that's
about as much detail as I can provide--I assumed the OOM killer
ultimately was to blame, somehow.)

It *sounds* like the two drives with the same event count should
be enough to recover my volume.  But forcibly doing that is scary
so I'm writing here for encouragement and guidance.
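(For reference, the event counts above came from examining each member
partition.  A sketch of that step, with a DRY_RUN guard so it only
prints the command here - mdadm --examine needs root:)

```shell
# Sketch: compare per-member event counts with mdadm --examine.
# DRY_RUN=1 prints the command instead of running it.
DRY_RUN=1
examine() {
  if [ -n "$DRY_RUN" ]; then
    echo "would run: mdadm --examine $1"
  else
    mdadm --examine "$1" | grep -E 'Event|Update Time|State'
  fi
}
for p in /dev/sdb1 /dev/sdc1 /dev/sdd1; do examine "$p"; done
```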

  {1156} root@meat-> mdadm --stop /dev/md0
  mdadm: stopped /dev/md0
  {1157} root@meat-> mdadm --assemble /dev/md0 /dev/sd[bcd]1
  mdadm: /dev/md0 assembled from 2 drives - not enough to start the
  array while not clean - consider --force.
  {1158} root@meat->

I can provide plenty more information, but thought I'd start by
introducing the problem.

How should I proceed?  Thanks.

					-Alex


* Re: 3-disk RAID5 won't assemble
  2017-10-20 19:51 3-disk RAID5 won't assemble Alex Elder
@ 2017-10-20 20:34 ` Wols Lists
  2017-10-20 22:46   ` Alex Elder
  0 siblings, 1 reply; 7+ messages in thread
From: Wols Lists @ 2017-10-20 20:34 UTC (permalink / raw)
  To: Alex Elder, linux-raid

On 20/10/17 20:51, Alex Elder wrote:
> I have a 3-disk RAID5 with identical drives that won't assemble.
> 
> The event counts on two of them are the same (20592) and one is
> quite a bit less (20466).  I do not expect failing hardware.

First things first. Have you looked at the raid wiki?
https://raid.wiki.kernel.org/index.php/Linux_Raid

In particular, take a read of the "When things go wrogn" section. And
especially, do a "smartctl -x" - are your drives desktop drives?
> 
> The problem occurred while I was copying some large files to
> the XFS volume on the device, while doing something else that
> ate up all my memory.  (It was a long time ago, so that's
> about as much detail as I can provide--I assumed the OOM killer
> ultimately was to blame, somehow.)

Have you rebooted since then? If that really was the problem, the array
should have failed the first time you rebooted.
> 
> It *sounds* like the two drives with the same event count should
> be enough to recover my volume.  But forcibly doing that is scary
> so I'm writing here for encouragement and guidance.
> 
>   {1156} root@meat-> mdadm --stop /dev/md0
>   mdadm: stopped /dev/md0
>   {1157} root@meat-> mdadm --assemble /dev/md0 /dev/sd[bcd]1
>   mdadm: /dev/md0 assembled from 2 drives - not enough to start the
>   array while not clean - consider --force.
>   {1158} root@meat->

Okay. Do NOT force all three drives. Forcing the two with the same event
count is safe - you have no redundancy so it's not going to start
mucking about with the drives. But first you need to be certain it's not
desktop drives and a timeout problem.
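Once you're sure of that, the safe path looks something like this (a
sketch only - check which of your partitions actually have the matching
event counts first; the device roles below are assumptions, and the
DRY_RUN guard just prints the commands):

```shell
# Sketch of the safe recovery path. Device roles are assumptions:
# sdb1/sdc1 = the two with matching event counts, sdd1 = the stale one.
GOOD="/dev/sdb1 /dev/sdc1"
STALE="/dev/sdd1"

DRY_RUN=1                    # unset to actually run the commands
run() { if [ -n "$DRY_RUN" ]; then echo "would run: $*"; else "$@"; fi; }

run mdadm --stop /dev/md0
run mdadm --assemble --force /dev/md0 $GOOD   # the two good drives ONLY
run mdadm /dev/md0 --re-add $STALE            # fast catch-up if a bitmap exists
```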
> 
> I can provide plenty more information, but thought I'd start by
> introducing the problem.
> 
> How should I proceed?  Thanks.
> 
Read the wiki?

Make sure it's not the timeout problem !!!

Does your array have bitmap enabled?

Once we're happy that your drives are fine, you can force the two good
drives, and then re-add the third. If you have bitmaps enabled, this
will bring it quickly up to scratch without needing a full resync.

And once the third is re-added, you need to do a scrub.

But it looks like everything is pretty much fine. Recovery *should* be
easy (famous last words ...)
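The scrub itself is just a write to sysfs - roughly this (md0 is an
assumption; the DRY_RUN guard makes it only print what it would do):

```shell
# Kick off a scrub ("check" pass) on the re-assembled array.
# md0 is an assumption; substitute your array's name.
MD=md0
DRY_RUN=1
scrub() {
  if [ -n "$DRY_RUN" ]; then
    echo "would write 'check' to /sys/block/$MD/md/sync_action"
  else
    echo check > "/sys/block/$MD/md/sync_action"
  fi
}
scrub
# watch progress in /proc/mdstat; afterwards inspect
# /sys/block/md0/md/mismatch_cnt
```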

If you're not sure you're happy, post all the requested diagnostics to
the list - preferably inline in your emails - and let an expert take a look.

Cheers,
Wol


* Re: 3-disk RAID5 won't assemble
  2017-10-20 20:34 ` Wols Lists
@ 2017-10-20 22:46   ` Alex Elder
  2017-10-20 23:37     ` Anthony Youngman
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Elder @ 2017-10-20 22:46 UTC (permalink / raw)
  To: Wols Lists, linux-raid


On 10/20/2017 03:34 PM, Wols Lists wrote:
> On 20/10/17 20:51, Alex Elder wrote:
>> I have a 3-disk RAID5 with identical drives that won't assemble.
>>
>> The event counts on two of them are the same (20592) and one is
>> quite a bit less (20466).  I do not expect failing hardware.
> 
> First things first. Have you looked at the raid wiki?
> https://raid.wiki.kernel.org/index.php/Linux_Raid

Yes.  And I gathered all the data on the "Asking_for_help" page
before asking the list.  The information did not solve my problem,
but it told me enough that I believed I'm probably OK if I take
the proper steps.  I've attached those files to this message.

> In particular, take a read of the "When things go wrogn" section. And
> especially, do a "smartctl -x" - are your drives desktop drives?

I ran "smartctl -xall" on each and saved the output, but now
I see that gave an error (so I now ran "smartctl -x" instead).

They are 2.5" Seagate laptop drives (ST4000LM016).  And:
"SCT Error Recovery Control command not supported".

The volume was functioning for a long time (for months anyway)
prior to the failure.

>> The problem occurred while I was copying some large files to
>> the XFS volume on the device, while doing something else that
>> ate up all my memory.  (It was a long time ago, so that's
>> about as much detail as I can provide--I assumed the OOM killer
>> ultimately was to blame, somehow.)
> 
> Have you rebooted since then? If that really was the problem, the array
> should have failed the first time you rebooted.

Yes, it did fail the first time I booted after the problem
first occurred.  I didn't have time to look at it when
it happened and have made do without it until now.  I have
rebooted many times since the failure, knowing the volume
would not work until I fixed it.

>> It *sounds* like the two drives with the same event count should
>> be enough to recover my volume.  But forcibly doing that is scary
>> so I'm writing here for encouragement and guidance.
>>
>>   {1156} root@meat-> mdadm --stop /dev/md0
>>   mdadm: stopped /dev/md0
>>   {1157} root@meat-> mdadm --assemble /dev/md0 /dev/sd[bcd]1
>>   mdadm: /dev/md0 assembled from 2 drives - not enough to start the
>>   array while not clean - consider --force.
>>   {1158} root@meat->
> 
> Okay. Do NOT force all three drives. Forcing the two with the same event
> count is safe - you have no redundancy so it's not going to start
> mucking about with the drives. But first you need to be certain it's not
> desktop drives and a timeout problem.
>>
>> I can provide plenty more information, but thought I'd start by
>> introducing the problem.
>>
>> How should I proceed?  Thanks.
>>
> Read the wiki?

Done.  I can't promise I grokked what was most important.

> Make sure it's not the timeout problem !!!

I increased the timeout to 180 and changed the readahead values
to 1024.  These values do not "stick" across a reboot so I'll
need to add them to a startup script.
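Something like this is what I have in mind for that script (a sketch -
the 180 s value and the sdb..sdd device list are just what applies to my
setup):

```shell
#!/bin/sh
# Raise the kernel command timeout for md member drives that lack SCT ERC,
# so slow in-drive error recovery doesn't get a healthy drive kicked.
# Sketch only: the 180 s value and the sdb..sdd list match this thread.
set_timeouts() {
  root=${1:-/sys}    # sysfs root is a parameter so the logic is testable
  for dev in sdb sdc sdd; do
    t="$root/block/$dev/device/timeout"
    if [ -w "$t" ]; then
      echo 180 > "$t"
    else
      echo "skipping $dev: $t not writable" >&2
    fi
  done
}
set_timeouts /sys
```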

> Does your array have bitmap enabled?

It appears so: Internal Bitmap : 8 sectors from superblock

> Once we're happy that your drives are fine, you can force the two good
> drives, and then re-add the third. If you have bitmaps enabled, this
> will bring it quickly up to scratch without needing a full resync.

That's very reassuring.

> And once the third is re-added, you need to do a scrub.
> 
> But it looks like everything is pretty much fine. Recovery *should* be
> easy (famous last words ...)

Piece of cake.

> If you're not sure you're happy, post all the requested diagnostics to
> the list - preferably inline in your emails - and let an expert take a look.

Sorry, I attached them...

I am fairly sure my drives aren't damaged, though I had not done
the timeout fix previously.

I would really appreciate having someone who's been through this
before glance through the mdadm, mdstat, and smartctl output I've
provided to make sure I'm not mistaken about the state of things.
Setting up the RAID was fine; but now that I'm (finally) dealing
with my first failure I am a bit apprehensive.

Thanks a lot for your response.

					-Alex

> Cheers,
> Wol
> 


[-- Attachment #2: raid_info.tgz --]
[-- Type: application/x-compressed-tar, Size: 6319 bytes --]


* Re: 3-disk RAID5 won't assemble
  2017-10-20 22:46   ` Alex Elder
@ 2017-10-20 23:37     ` Anthony Youngman
  2017-10-20 23:49       ` Phil Turmel
  0 siblings, 1 reply; 7+ messages in thread
From: Anthony Youngman @ 2017-10-20 23:37 UTC (permalink / raw)
  To: Alex Elder, linux-raid

On 20/10/17 23:46, Alex Elder wrote:
> On 10/20/2017 03:34 PM, Wols Lists wrote:
>> On 20/10/17 20:51, Alex Elder wrote:
>>> I have a 3-disk RAID5 with identical drives that won't assemble.
>>>
>>> The event counts on two of them are the same (20592) and one is
>>> quite a bit less (20466).  I do not expect failing hardware.
>>
>> First things first. Have you looked at the raid wiki?
>> https://raid.wiki.kernel.org/index.php/Linux_Raid
> 
> Yes.  And I gathered all the data on the "Asking_for_help" page
> before asking the list.  The information did not solve my problem,
> but it told me enough that I believed I'm probably OK if I take
> the proper steps.  I've attached those files to this message.
> 
>> In particular, take a read of the "When things go wrogn" section. And
>> especially, do a "smartctl -x" - are your drives desktop drives?
> 
> I ran "smartctl -xall" on each and saved the output, but now
> I see that gave an error (so I now ran "smartctl -x" instead).
> 
> They are 2.5" Seagate laptop drives (ST4000LM016).  And:
> "SCT Error Recovery Control command not supported".
> 
> The volume was functioning for a long time (for months anyway)
> prior to the failure.
> 
Okay. That's perfectly normal. The timeout problem is basically because, 
over time, magnetism fades. So the array WILL work perfectly fine to 
start with. But the data you wrote on day 1, if you never rewrite it, 
will slowly fade away (especially if you rewrite the sectors next to it).

>>> The problem occurred while I was copying some large files to
>>> the XFS volume on the device, while doing something else that
>>> ate up all my memory.  (It was a long time ago, so that's
>>> about as much detail as I can provide--I assumed the OOM killer
>>> ultimately was to blame, somehow.)
>>
Well, for some reason, you will have asked the drive to *read* some old 
data, and it couldn't. And BOOM linux *thought* the drive was dead 
(that's the timeout problem), kicked the drive, and killed the array.

>> Have you rebooted since then? If that really was the problem, the array
>> should have failed the first time you rebooted.
> 
> Yes, it did fail the first time I booted after the problem
> first occurred.  I didn't have time to look at it when
> it happened and have made do without it until now.  I have
> rebooted many times since the failure, knowing the volume
> would not work until I fixed it.
> 
>>> It *sounds* like the two drives with the same event count should
>>> be enough to recover my volume.  But forcibly doing that is scary
>>> so I'm writing here for encouragement and guidance.
>>>
>>>    {1156} root@meat-> mdadm --stop /dev/md0
>>>    mdadm: stopped /dev/md0
>>>    {1157} root@meat-> mdadm --assemble /dev/md0 /dev/sd[bcd]1
>>>    mdadm: /dev/md0 assembled from 2 drives - not enough to start the
>>>    array while not clean - consider --force.
>>>    {1158} root@meat->
>>
>> Okay. Do NOT force all three drives. Forcing the two with the same event
>> count is safe - you have no redundancy so it's not going to start
>> mucking about with the drives. But first you need to be certain it's not
>> desktop drives and a timeout problem.
>>>
>>> I can provide plenty more information, but thought I'd start by
>>> introducing the problem.
>>>
>>> How should I proceed?  Thanks.
>>>
>> Read the wiki?
> 
> Done.  I can't promise I grokked what was most important.
> 
>> Make sure it's not the timeout problem !!!
> 
> I increased the timeout to 180 and changed the readahead values
> to 1024.  These values do not "stick" across a reboot so I'll
> need to add them to a startup script.
> 
>> Does your array have bitmap enabled?
> 
> It appears so: Internal Bitmap : 8 sectors from superblock
> 
>> Once we're happy that your drives are fine, you can force the two good
>> drives, and then re-add the third. If you have bitmaps enabled, this
>> will bring it quickly up to scratch without needing a full resync.
> 
> That's very reassuring.
> 
>> And once the third is re-added, you need to do a scrub.
>>
>> But it looks like everything is pretty much fine. Recovery *should* be
>> easy (famous last words ...)
> 
> Piece of cake.
> 
>> If you're not sure you're happy, post all the requested diagnostics to
>> the list - preferably inline in your emails - and let an expert take a look.
> 
> Sorry, I attached them...
> 
> I am fairly sure my drives aren't damaged, though I had not done
> the timeout fix previously.

And what's happened looks like the absolutely typical timeout problem.
> 
> I would really appreciate having someone who's been through this
> before glance through the mdadm, mdstat, and smartctl output I've
> provided to make sure I'm not mistaken about the state of things.
> Setting up the RAID was fine; but now that I'm (finally) dealing
> with my first failure I am a bit apprehensive.
> 
> Thanks a lot for your response.
> 
Failing to fix the timeouts is just asking for problems to strike - 
usually some way down the line. Make sure you set that script up to run 
every boot BEFORE you try and fix the array. Then, as I say, just force 
the two good drives and re-add the third. Your array will be back ... 
(and if you really need to reassure yourself, try it out with overlays 
first).
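If you do go the overlay route, the wiki's trick boils down to a device
mapper snapshot per member, roughly like this (a sketch - needs root,
and the names and 4G overlay size are assumptions, not canonical):

```shell
# Sketch of the raid wiki's overlay technique: all writes land in a
# sparse copy-on-write file, the real member disks stay untouched, so a
# forced assemble can be rehearsed and then simply thrown away.
# Needs root; device names and the 4G overlay size are assumptions.
overlay_create() {                 # $1 = real member, e.g. /dev/sdb1
  name=$(basename "$1")
  truncate -s 4G "/tmp/overlay-$name"
  loop=$(losetup -f --show "/tmp/overlay-$name")
  size=$(blockdev --getsz "$1")
  echo "0 $size snapshot $1 $loop P 8" | dmsetup create "overlay-$name"
  echo "/dev/mapper/overlay-$name"
}
# Usage (commented out here; the real disks are only ever read):
#   for d in /dev/sd[bc]1; do overlay_create "$d"; done
#   mdadm --assemble --force /dev/md0 /dev/mapper/overlay-sd[bc]1
```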

Cheers,
Wol


* Re: 3-disk RAID5 won't assemble
  2017-10-20 23:37     ` Anthony Youngman
@ 2017-10-20 23:49       ` Phil Turmel
  2017-10-21  3:40         ` Alex Elder
  0 siblings, 1 reply; 7+ messages in thread
From: Phil Turmel @ 2017-10-20 23:49 UTC (permalink / raw)
  To: Anthony Youngman, Alex Elder, linux-raid

On 10/20/2017 07:37 PM, Anthony Youngman wrote:
> On 20/10/17 23:46, Alex Elder wrote:

>> I would really appreciate having someone who's been through this
>> before glance through the mdadm, mdstat, and smartctl output I've
>> provided to make sure I'm not mistaken about the state of things.
>> Setting up the RAID was fine; but now that I'm (finally) dealing
>> with my first failure I am a bit apprehensive.
>>
>> Thanks a lot for your response.
>>
> Failing to fix the timeouts is just asking for problems to strike -
> usually some way down the line. Make sure you set that script up to run
> every boot BEFORE you try and fix the array. Then, as I say, just force
> the two good drives and re-add the third. Your array will be back ...
> (and if you really need to reassure yourself, try it out with overlays
> first).

I endorse this advice and concur with its positive tone.  Your array
should be fine, Alex.

Buy NAS drives in the future, though.

Phil


* Re: 3-disk RAID5 won't assemble
  2017-10-20 23:49       ` Phil Turmel
@ 2017-10-21  3:40         ` Alex Elder
  2017-10-21  9:36           ` Anthony Youngman
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Elder @ 2017-10-21  3:40 UTC (permalink / raw)
  To: Phil Turmel, Anthony Youngman, linux-raid

On 10/20/2017 06:49 PM, Phil Turmel wrote:
> On 10/20/2017 07:37 PM, Anthony Youngman wrote:
>> On 20/10/17 23:46, Alex Elder wrote:
> 
>>> I would really appreciate having someone who's been through this
>>> before glance through the mdadm, mdstat, and smartctl output I've
>>> provided to make sure I'm not mistaken about the state of things.
>>> Setting up the RAID was fine; but now that I'm (finally) dealing
>>> with my first failure I am a bit apprehensive.
>>>
>>> Thanks a lot for your response.
>>>
>> Failing to fix the timeouts is just asking for problems to strike -
>> usually some way down the line. Make sure you set that script up to run
>> every boot BEFORE you try and fix the array. Then, as I say, just force
>> the two good drives and re-add the third. Your array will be back ...
>> (and if you really need to reassure yourself, try it out with overlays
>> first).
> 
> I endorse this advice and concur with its positive tone.  Your array
> should be fine, Alex.

You guys are really great.  I am back running again.  Thank you
for taking the time to help me out.  It saved me a LOT of time
trying to chase down what I needed to do, and most importantly
you gave me the confidence to proceed with only a modest level
of anxiety...  The hardest thing was creating a new systemd
service to run the script at boot time (also a first for me).
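For anyone who finds this thread later, a unit for that job can be as
small as this sketch (the unit and script names are my own choices, not
canonical; install as /etc/systemd/system/drive-timeouts.service, then
"systemctl daemon-reload && systemctl enable drive-timeouts.service"):

```ini
# Sketch of a oneshot unit that runs the timeout script at every boot.
[Unit]
Description=Raise SCSI command timeouts for md member drives
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/set-drive-timeouts.sh

[Install]
WantedBy=multi-user.target
```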

> Buy NAS drives in the future, though.

Hey, they were never going to fail!  RAID solves all problems.

But yeah, I'll look more closely at that next time.

Thanks again.

					-Alex

> Phil
> 



* Re: 3-disk RAID5 won't assemble
  2017-10-21  3:40         ` Alex Elder
@ 2017-10-21  9:36           ` Anthony Youngman
  0 siblings, 0 replies; 7+ messages in thread
From: Anthony Youngman @ 2017-10-21  9:36 UTC (permalink / raw)
  To: Alex Elder, Anthony Youngman, linux-raid

On 21/10/17 04:40, Alex Elder wrote:
> You guys are really great.  I am back running again.  Thank you
> for taking the time to help me out.  It saved me a LOT of time
> trying to chase down what I needed to do, and most importantly
> you gave me the confidence to proceed with only a modest level
> of anxiety...  The hardest thing was creating a new systemd
> service to run the script at boot time (also a first for me).
> 
>> Buy NAS drives in the future, though.

> Hey, they were never going to fail!  RAID solves all problems.

Even the alleged experts (definitely "alleged" in my case :-) don't 
always do the right thing :-) I'm currently running two Seagate 
Barracudas - *without* the timeout script! - in a mirror configuration. 
But over the next few days I'll be building a new pc and I've got two 
raid-capable drives. It'll be a big learning experience because I use 
gentoo ... learning systemd, learning QEMU, learning KVM, learning...
> 
> But yeah, I'll look more closely at that next time.
> 
Cheers,
Wol

