* Stacked array data recovery
@ 2012-06-21 22:44 Ramon Hofer
  2012-06-22 14:32 ` Ramon Hofer
  2012-06-22 14:37 ` Ramon Hofer
  0 siblings, 2 replies; 27+ messages in thread
From: Ramon Hofer @ 2012-06-21 22:44 UTC (permalink / raw)
  To: linux-raid

Hi all

I had a media server at home with a RAID5 array of four disks and ran
out of space.  A separate NAS didn't appeal to me, so I wanted to
enlarge my server instead.

I bought a Norco case with 20 hot swap slots.

With the kind help of Stan Hoeppner I was able to find good hardware to 
connect the disks.
Maybe it's not important, but it's an LSI MegaRAID SAS 9240-4i with an
Intel SAS expander, and the mainboard is an Asus P7P55D.
I use Debian Squeeze with the backported 3.2.0-2-amd64 kernel.
The mdadm package version is "v3.1.4 - 31st August 2010".
(More information about the drives is below in the pastebins.)

The plan was to set up several RAID5 arrays as members of a linear RAID.

Unfortunately I was careless about cooling the disks, so they are
probably damaged and I will have to replace them (hopefully still under
warranty).

I set up md1 with four WD Green 2 TB drives
(mdadm -C /dev/md1 -c 128 -n4 -l5 /dev/sd[abcd]) and added it as a single
device to the linear array md0 (mdadm -C /dev/md0 --force -n1 -l linear /dev/md1).
Then I copied the data from the old RAID to it (btw the filesystem on md0
is XFS, if this is relevant).
During this process the disks weren't cooled.
They are about 1-2 years old and mdadm marked some of them as faulty.  I
thought this was because they were green drives and bought four new WD
Black 2 TB drives.

I tried the same again and this time it worked well.
Then I took the four Samsung 1.5 TB drives from the old server and added 
them to the Norco case. With them I created the RAID5 array md2, grew the 
linear array md0 and its filesystem.
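
Roughly, the grow step looked like this (a sketch from memory; the
chunk size and the mount point are only examples):

~$ mdadm -C /dev/md2 -c 128 -n4 -l5 /dev/sd[efgh]
~$ mdadm --grow /dev/md0 --add /dev/md2
~$ xfs_growfs /mnt/storage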

At the time I thought it would be fine to use the array while it was
still syncing the disks.
During this process the disks of md1 made strange noises and were marked
faulty, as were some disks of md2.

Here are the mails I got from mdadm:
http://pastebin.com/raw.php?i=ftpmfSpv

Now I have solved the cooling problem and want to replace the disks. But 
I'd like to try to rescue the data on md0.

I did a smartctl readout of the disks. Here's a summary of the disks I 
have:

/dev/sd[abcd] WD blacks md1
/dev/sd[efgh] Samsungs md2
/dev/sd[jklm] WD green md3 (not yet created)

This is the output of "smartctl -l error" for each of the disks:
http://pastebin.com/raw.php?i=JtYkweNp

and "smartctl -x":
http://pastebin.com/raw.php?i=QFK6dyZs

and "smartctl -H":
http://pastebin.com/raw.php?i=hAEdyvCz

When I start the system md2 gets started and is syncing its disks, but
md1 stays stopped.

Is it possible to start md1 again to be able to start the linear array 
md0 so that I can access the data?

Thanks in advance for any help.


Best regards
Ramon



* Re: Stacked array data recovery
  2012-06-21 22:44 Stacked array data recovery Ramon Hofer
@ 2012-06-22 14:32 ` Ramon Hofer
  2012-06-23 12:05   ` Stan Hoeppner
  2012-06-22 14:37 ` Ramon Hofer
  1 sibling, 1 reply; 27+ messages in thread
From: Ramon Hofer @ 2012-06-22 14:32 UTC (permalink / raw)
  To: linux-raid

On Thu, 21 Jun 2012 22:44:10 +0000, Ramon Hofer wrote:

> Is it possible to start md1 again to be able to start the linear array
> md0 so that I can access the data?

Stan will probably be angry with me but I did the following:

I found the recovery page in your wiki:
https://raid.wiki.kernel.org/index.php/RAID_Recovery

So I followed the process there.

~$ mdadm --examine /dev/sd[abcd] > raid.status.md1

~$ mdadm --create --assume-clean --level=5 --raid-devices=4 /dev/md1 /dev/sda /dev/sdb /dev/sdc /dev/sdd

~$

~$




* Re: Stacked array data recovery
  2012-06-21 22:44 Stacked array data recovery Ramon Hofer
  2012-06-22 14:32 ` Ramon Hofer
@ 2012-06-22 14:37 ` Ramon Hofer
  2012-06-23 12:09   ` Stan Hoeppner
  1 sibling, 1 reply; 27+ messages in thread
From: Ramon Hofer @ 2012-06-22 14:37 UTC (permalink / raw)
  To: linux-raid

Sorry, I hit the wrong button.

On Thu, 21 Jun 2012 22:44:10 +0000, Ramon Hofer wrote:

> Is it possible to start md1 again to be able to start the linear array
> md0 so that I can access the data?

Stan will probably be angry with me but I did the following:

I found the recovery page in your wiki:
https://raid.wiki.kernel.org/index.php/RAID_Recovery

So I followed the process there.
This is what I did:

~$ mdadm --examine /dev/sd[abcd] > raid.status.md1

~$ mdadm --create --assume-clean -c 128 --level=5 --raid-devices=4 /dev/md1 /dev/sda /dev/sdb /dev/sdc /dev/sdd

It was up and clean.

~$ mdadm --examine /dev/md[12] > raid.status.md0

~$ mdadm -C --assume-clean -n2 -l linear /dev/md0 /dev/md[12]

Now I could mount md0 and all my data is still there :-D

Thanks for this great wiki page!


Best regards
Ramon



* Re: Stacked array data recovery
  2012-06-22 14:32 ` Ramon Hofer
@ 2012-06-23 12:05   ` Stan Hoeppner
  0 siblings, 0 replies; 27+ messages in thread
From: Stan Hoeppner @ 2012-06-23 12:05 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: linux-raid

On 6/22/2012 9:32 AM, Ramon Hofer wrote:
> On Thu, 21 Jun 2012 22:44:10 +0000, Ramon Hofer wrote:
> 
>> Is it possible to start md1 again to be able to start the linear array
>> md0 so that I can access the data?
> 
> Stan will probably be angry with me but I did the following:

Not at all.  I just wanted to make sure you had all of your post-crash
ducks in a row before trying to bring the arrays back up, given the
circumstances under which they went down.

> I found the recovery page in your wiki:
> https://raid.wiki.kernel.org/index.php/RAID_Recovery
> 
> So I followed the process there.
> 
> ~$ mdadm --examine /dev/sd[abcd] > raid.status.md1
> 
> ~$ mdadm --create --assume-clean --level=5 --raid-devices=4 /dev/md1 /dev/sda /dev/sdb /dev/sdc /dev/sdd

-- 
Stan


* Re: Stacked array data recovery
  2012-06-22 14:37 ` Ramon Hofer
@ 2012-06-23 12:09   ` Stan Hoeppner
  2012-06-24 12:15     ` Ramon Hofer
  0 siblings, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2012-06-23 12:09 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: linux-raid

On 6/22/2012 9:37 AM, Ramon Hofer wrote:
> Sorry, I hit the wrong button. On Thu, 21 Jun 2012 22:44:10 +0000, Ramon
> Hofer wrote:
> 
>> Is it possible to start md1 again to be able to start the linear array
>> md0 so that I can access the data?
> 
> Stan will probably be angry with me but I did the following:
> 
> I found the recovery page in your wiki:
> https://raid.wiki.kernel.org/index.php/RAID_Recovery
> 
> So I followed the process there.
> This is what I did:
> 
> ~$ mdadm --examine /dev/sd[abcd] > raid.status.md1
> 
> ~$ mdadm --create --assume-clean -c 128 --level=5 --raid-devices=4 /dev/md1 /dev/sda /dev/sdb /dev/sdc /dev/sdd
> 
> It was up and clean.
> 
> ~$ mdadm --examine /dev/md[12] > raid.status.md0
> 
> ~$ mdadm -C --assume-clean -n2 -l linear /dev/md0 /dev/md[12]
> 
> Now I could mount md0 and all my data is still there :-D

You should have run an "xfs_repair -n" before mounting.  "-n" means no
modify, making it a check operation.  If it finds errors then rerun it
without the "-n" so it can make necessary repairs.  Then remount.  Sorry
I forgot to mention this, or remind you, whichever is the case. :)
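
Roughly, the sequence is (a sketch; the mount point name is only an
example, and the filesystem must be unmounted while xfs_repair runs):

~$ umount /dev/md0
~$ xfs_repair -n /dev/md0      # check only, changes nothing
~$ xfs_repair /dev/md0         # only if the check reported problems
~$ mount /dev/md0 /mnt/storage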

-- 
Stan


* Re: Stacked array data recovery
  2012-06-23 12:09   ` Stan Hoeppner
@ 2012-06-24 12:15     ` Ramon Hofer
  2012-06-24 14:12       ` Stan Hoeppner
  0 siblings, 1 reply; 27+ messages in thread
From: Ramon Hofer @ 2012-06-24 12:15 UTC (permalink / raw)
  To: linux-raid

On Sat, 23 Jun 2012 07:09:55 -0500, Stan Hoeppner wrote:

> On 6/22/2012 9:37 AM, Ramon Hofer wrote:
>> Sorry, I hit the wrong button. On Thu, 21 Jun 2012 22:44:10 +0000, Ramon
>> Hofer wrote:
>> 
>>> Is it possible to start md1 again to be able to start the linear array
>>> md0 so that I can access the data?
>> 
>> Stan will probably be angry with me but I did the following:
>> 
>> I found the recovery page in your wiki:
>> https://raid.wiki.kernel.org/index.php/RAID_Recovery
>> 
>> So I followed the process there.
>> This is what I did:
>> 
>> ~$ mdadm --examine /dev/sd[abcd] > raid.status.md1
>> 
>> ~$ mdadm --create --assume-clean -c 128 --level=5 --raid-devices=4 /dev/md1 /dev/sda /dev/sdb /dev/sdc /dev/sdd
>> 
>> It was up and clean.
>> 
>> ~$ mdadm --examine /dev/md[12] > raid.status.md0
>> 
>> ~$ mdadm -C --assume-clean -n2 -l linear /dev/md0 /dev/md[12]
>> 
>> Now I could mount md0 and all my data is still there :-D
> 
> You should have run an "xfs_repair -n" before mounting.  "-n" means no
> modify, making it a check operation.  If it finds errors then rerun it
> without the "-n" so it can make necessary repairs.  Then remount.  Sorry
> I forgot to mention this, or remind you, whichever is the case. :)

Thank you!

You had mentioned it but I forgot to do it.
I did it now and still everything looks good.
At least with the WD blacks and the Samsung drives.

One WD green was again marked faulty when I tried to create an array with 
them.

This is the output of dmesg:
http://pastebin.com/raw.php?i=5aukYJa8

This seems to be not good:
[61142.466334] md/raid:md9: read error not correctable (sector 3758190680 
on sdk).
[61142.466338] md/raid:md9: Disk failure on sdk, disabling device.

What could be the reason for this issue?
Is it because the disk is broken, or is it not suited for RAID use?

I'm now running smartctl -t long /dev/sdk.
I have no clue if this helps in any way...

Here's the output of smartctl -a /dev/sdk:
http://pastebin.com/raw.php?i=2ULrx6du


Should I bring the disk back to my dealer, or is it an issue with using
it with mdadm?


Best regards
Ramon



* Re: Stacked array data recovery
  2012-06-24 12:15     ` Ramon Hofer
@ 2012-06-24 14:12       ` Stan Hoeppner
  2012-06-25  3:51         ` Stan Hoeppner
  0 siblings, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2012-06-24 14:12 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: linux-raid

On 6/24/2012 7:15 AM, Ramon Hofer wrote:
> On Sat, 23 Jun 2012 07:09:55 -0500, Stan Hoeppner wrote:

>> You should have run an "xfs_repair -n" before mounting.  "-n" means no
>> modify, making it a check operation.  If it finds errors then rerun it
>> without the "-n" so it can make necessary repairs.  Then remount.  Sorry
>> I forgot to mention this, or remind you, whichever is the case. :)
> 
> Thank you!
> 
> You had mentioned it but I forgot to do it.
> I did it now and still everything looks good.
> At least with the WD blacks and the Samsung drives.

Fantastic.

> One WD green was again marked faulty when I tried to create an array with 
> them.
> 
> This is the output of dmesg:
> http://pastebin.com/raw.php?i=5aukYJa8

This shows you have 3 bad sectors that have not been reallocated.  This
may be correctable, maybe not.  It depends on whether this drive has
exhausted its spare sector pool.

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3

If the spare sector pool has not been exhausted, you could try to
overwrite each bad block manually and then sync to force the drive to
reallocate the sectors.  But at this point, given it's a WD20EARS, and
has some hours under its belt, you may be better off writing zeros to
the entire OS visible portion of the drive.  This will tend to flush out
any other bad sectors or problems with the drive, and if there are none
should repair the 3 bad sectors by reallocating them (replacing them
with spare blocks).  This operation will take several hours to
complete.  Read this entire email before you run any commands.

~$ dd if=/dev/zero of=/dev/sdk bs=1M; sync

WARNING:  THIS COMMAND WILL ERASE A DISK DRIVE.  Be very careful.
WARNING:  THIS COMMAND WILL ERASE A DISK DRIVE.  Be very careful.
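
For completeness, the per-sector overwrite mentioned above would look
roughly like this (a sketch; the LBA must be taken from the SMART
self-test log or the md error, and 512-byte logical sectors are
assumed):

~$ dd if=/dev/zero of=/dev/sdk bs=512 count=1 seek=3758190680 oflag=direct
~$ sync

But as noted, given the hours on this drive, the full wipe is the more
thorough check.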

> This seems to be not good:
> [61142.466334] md/raid:md9: read error not correctable (sector 3758190680 
> on sdk).
> [61142.466338] md/raid:md9: Disk failure on sdk, disabling device.

This is one of the 3 bad sectors.

> What could be the reason for this issue?
> Is it because the disk is broken, or is it not suited for RAID use?

No, just platter surface defects.  Common with very large drives.

> I'm now running smartctl -t long /dev/sdk.
> I have no clue if this helps in any way...
> 
> Here's the output of smartctl -a /dev/sdk:
> http://pastebin.com/raw.php?i=2ULrx6du

It identified the same bad sector listed in the md failure: 3758190680

# 1  Extended offline    Completed: read failure       90%      5174         3758190680

But you have two other bad sectors as well, apparently, that this self
test didn't pick up.  They were however previously logged.

> Should I bring the disk back to my dealer, or is it an issue with using
> it with mdadm?

That's premature.  If you don't have any irreplaceable data on md9 yet,
I'd recommend erasing all 4 EARS drives with the dd command so you have
a "fresh start".  You can do this in parallel so they complete at the
~same time:

The easiest way is to simply put an ampersand at the end of each
command, which puts each process in the background and frees up the
command line for the next command.  I don't know which device names
those WDs are so I'm using fictional examples:

~$ dd if=/dev/zero of=/dev/sdw bs=1M &
~$ dd if=/dev/zero of=/dev/sdx bs=1M &
~$ dd if=/dev/zero of=/dev/sdy bs=1M &
~$ dd if=/dev/zero of=/dev/sdz bs=1M &

WARNING:  THESE COMMANDS WILL ERASE DISK DRIVES.  Be very careful.
WARNING:  THESE COMMANDS WILL ERASE DISK DRIVES.  Be very careful.


MAKE SURE YOU ENTER THE CORRECT DRIVE DEVICE NAMES.  If you enter the
name of a WD Black, you will erase the Black drive.

After they all finish you'll see something like this 4 times but the
values will be immensely larger:

1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0164695 s, 63.7 MB/s

After you see 4 of those, issue a sync to force any remaining pending
writes out of the buffer cache and drive caches:

~$ sync

There will be no output from the sync command.  Wait until the drive
lights for these 4 drives stop flashing.  Then create the md array again.

If you get any errors from the dd commands for /dev/sdk, or any of the
drives, don't create the md array.  Post the errors here first.  The
errors may indicate you need to replace a drive.  So you need to know
that before trying to create the array again.
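
If the wipes all finish without errors, re-creating the array would
then be roughly (a sketch, assuming the same 4-drive RAID5 layout and
chunk size as your other arrays, and the device names you listed
earlier):

~$ mdadm -C /dev/md9 -c 128 -n4 -l5 /dev/sd[jklm]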

-- 
Stan


* Re: Stacked array data recovery
  2012-06-24 14:12       ` Stan Hoeppner
@ 2012-06-25  3:51         ` Stan Hoeppner
  2012-06-25 10:31           ` Ramon Hofer
  0 siblings, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2012-06-25  3:51 UTC (permalink / raw)
  To: stan; +Cc: Ramon Hofer, linux-raid

On 6/24/2012 9:12 AM, Stan Hoeppner wrote:

> That's premature.  If you don't have any irreplaceable data on md9 yet,
> I'd recommend erasing all 4 EARS drives with the dd command so you have
> a "fresh start".

Sorry Ramon, I meant the Samsungs here, not EARS.  You probably understood.

-- 
Stan



* Re: Stacked array data recovery
  2012-06-25  3:51         ` Stan Hoeppner
@ 2012-06-25 10:31           ` Ramon Hofer
  2012-06-26  1:53             ` Stan Hoeppner
  0 siblings, 1 reply; 27+ messages in thread
From: Ramon Hofer @ 2012-06-25 10:31 UTC (permalink / raw)
  To: linux-raid

On Sun, 24 Jun 2012 22:51:32 -0500, Stan Hoeppner wrote:

> On 6/24/2012 9:12 AM, Stan Hoeppner wrote:
> 
>> That's premature.  If you don't have any irreplaceable data on md9 yet,
>> I'd recommend erasing all 4 EARS drives with the dd command so you have
>> a "fresh start".
> 
> Sorry Ramon, I meant the Samsungs here, not EARS.  You probably
> understood.

No, sorry I'm a bit confused.

The Samsung drives worked fine so far. I already have used the linear 
array and don't know what is written to md2 through md0.
But I could remove one Samsung disk from md2, dd it, re add it and do 
this procedure for the other three Samsungs.

What about the WD green?
I tried to dd them yesterday but when I wanted to stream a movie from the 
server it stopped. Sometimes I couldn't even ssh into the server and when 
I could the remote shell froze after a very short time.

Should I try to dd them again but one after the other so that I know 
which one makes problems?


Best regards



* Re: Stacked array data recovery
  2012-06-25 10:31           ` Ramon Hofer
@ 2012-06-26  1:53             ` Stan Hoeppner
  2012-06-26  8:37               ` Ramon Hofer
  0 siblings, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2012-06-26  1:53 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: linux-raid

On 6/25/2012 5:31 AM, Ramon Hofer wrote:
> On Sun, 24 Jun 2012 22:51:32 -0500, Stan Hoeppner wrote:
> 
>> On 6/24/2012 9:12 AM, Stan Hoeppner wrote:
>>
>>> That's premature.  If you don't have any irreplaceable data on md9 yet,
>>> I'd recommend erasing all 4 EARS drives with the dd command so you have
>>> a "fresh start".
>>
>> Sorry Ramon, I meant the Samsungs here, not EARS.  You probably
>> understood.
> 
> No, sorry I'm a bit confused.

I'm confused as well.  The error you pasted was on md9, which I thought
was the old Samsung array.

[61142.466334] md/raid:md9: read error not correctable (sector 3758190680
on sdk).
[61142.466338] md/raid:md9: Disk failure on sdk, disabling device.

Which disk is /dev/sdk?  WD20EARS or Samsung?

> The Samsung drives worked fine so far. I already have used the linear 
> array and don't know what is written to md2 through md0.
> But I could remove one Samsung disk from md2, dd it, re add it and do 
> this procedure for the other three Samsungs.

Ok, so md1 are the Blacks, md2 are the Samsungs.  You tried to create
another array, md9, using the WD20EARS, and one, /dev/sdk, generated the
error above.  Is this correct?

> What about the WD green?

Ok, so currently the WD20EARS drives are not part of an array, correct?
 And you're following the procedure I posted to dd the four drives, correct?

> I tried to dd them yesterday 

There is no "try" here.  Once you start the dd commands they run until
complete.  You didn't kill the processes did you?

> but when I wanted to stream a movie from the 
> server it stopped. 

What do you mean "it stopped"?  What stopped?  The playback in the
client app?

> Sometimes I couldn't even ssh into the server and when 
> I could the remote shell froze after a very short time.

You had 4 dd processes writing zeros to 4 drives at full bandwidth,
consuming something like 480MB/s at the beginning and around 200MB/s at
the end as the platter diameter gets smaller.  The controller chip on
the LSI HBA is seeing tens of thousands of write IOPS.  Not to mention
the four dd processes are generating a good deal of CPU load.  And if
you're not running irqbalance, which you're surely not, interrupts from
the controller are only going to 1 CPU core.

My point is, running these 4 dd's in parallel is going to be very taxing
on your system.  I guess I should have added a caveat/warning in my 'dd'
email that you should not do any other work on the system while it's
dd'ing the 4 drives.  Sorry for failing to mention this.

> Should I try to dd them again but one after the other so that I know 
> which one makes problems?

You first need to explain what you mean by "try again".  Unless you
killed the processes, or rebooted or power cycled the machine, the dd
processes would have run to completion.  I get the feeling you've
omitted some important details.

Oh, please reply-to-all Ramon so these hit my inbox.  List mail goes to
separate folders, and I don't check them in a timely manner.

-- 
Stan


* Re: Stacked array data recovery
  2012-06-26  1:53             ` Stan Hoeppner
@ 2012-06-26  8:37               ` Ramon Hofer
  2012-06-26 20:23                 ` Stan Hoeppner
  0 siblings, 1 reply; 27+ messages in thread
From: Ramon Hofer @ 2012-06-26  8:37 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On Mon, 2012-06-25 at 20:53 -0500, Stan Hoeppner wrote:
> On 6/25/2012 5:31 AM, Ramon Hofer wrote:
> > On Sun, 24 Jun 2012 22:51:32 -0500, Stan Hoeppner wrote:
> > 
> >> On 6/24/2012 9:12 AM, Stan Hoeppner wrote:
> >>
> >>> That's premature.  If you don't have any irreplaceable data on md9 yet,
> >>> I'd recommend erasing all 4 EARS drives with the dd command so you have
> >>> a "fresh start".
> >>
> >> Sorry Ramon, I meant the Samsungs here, not EARS.  You probably
> >> understood.
> > 
> > No, sorry I'm a bit confused.
> 
> I'm confused as well.  The error you pasted was on md9, which I thought
> was the old Samsung array.

Sorry, I should have been more precise.

After I was able to recover md1 (WD Blacks) I created md2 with the
Samsungs.

Then I wanted to test the WD Greens by creating md9 and copying the
MythTV recordings onto it.  (I did that because I wanted to switch the
recordings drive to XFS as well.)


> [61142.466334] md/raid:md9: read error not correctable (sector 3758190680
> on sdk).
> [61142.466338] md/raid:md9: Disk failure on sdk, disabling device.
> 
> Which disk is /dev/sdk?  WD20EARS or Samsung?

All the disks in md9 are WD20EARS.

Sorry again for the confusion!


> > The Samsung drives worked fine so far. I already have used the linear 
> > array and don't know what is written to md2 through md0.
> > But I could remove one Samsung disk from md2, dd it, re add it and do 
> > this procedure for the other three Samsungs.
> 
> Ok, so md1 are the Blacks, md2 are the Samsungs.  You tried to create
> another array, md9, using the WD20EARS, and one, /dev/sdk, generated the
> error above.  Is this correct?

Exactly.


> > What about the WD green?
> 
> Ok, so currently the WD20EARS drives are not part of an array, correct?
>  And you're following the procedure I posted to dd the four drives, correct?

No, they're not.
And yes, I did. But the server behaved very strangely. Sometimes I
couldn't ssh into it anymore. Sometimes I could and the connection
froze.


> > I tried to dd them yesterday 
> 
> There is no "try" here.  Once you start the dd commands they run until
> complete.  You didn't kill the processes did you?

I wanted to watch a movie that evening. It streamed fine until about 15
min to the end but I really had to see the end before going to bed.


> > but when I wanted to stream a movie from the 
> > server it stopped. 
> 
> What do you mean "it stopped"?  What stopped?  The playback in the
> client app?

Yes.
I first thought it was the client app.  But after I couldn't ssh into
the server and the ssh connection kept freezing, I thought I'd reboot
it.

I thought it couldn't be very hard to write a lot of zeros...


> > Sometimes I couldn't even ssh into the server and when 
> > I could the remote shell froze after a very short time.
> 
> You had 4 dd processes writing zeros to 4 drives at full bandwidth,
> consuming something like 480MB/s at the beginning and around 200MB/s at
> the end as the platter diameter gets smaller.  The controller chip on
> the LSI HBA is seeing tens of thousands of write IOPS.  Not to mention
> the four dd processes are generating a good deal of CPU load.  And if
> you're not running irqbalance, which you're surely not, interrupts from
> the controller are only going to 1 CPU core.
> 
> My point is, running these 4 dd's in parallel is going to be very taxing
> on your system.  I guess I should have added a caveat/warning in my 'dd'
> email that you should not do any other work on the system while it's
> dd'ing the 4 drives.  Sorry for failing to mention this.

I ran top to see if the system was busy, and I saw that the CPU wasn't.
But the system load was higher than I had ever seen it (around 10).
Now I see that the movie couldn't be streamed because the LSI controller
didn't have any bandwidth left for it.

So maybe I can just rerun the four dd commands when the server isn't
busy? Or even take out the drives and run the command on another
machine?


> > Should I try to dd them again but one after the other so that I know 
> > which one makes problems?
> 
> You first need to explain what you mean by "try again".  Unless you
> killed the processes, or rebooted or power cycled the machine, the dd
> processes would have run to completion.  I get the feeling you've
> omitted some important details.

Sorry, I didn't explain properly what I did.

When the dd commands had been running for some time I wanted to watch
that movie in the evening.  Unfortunately it stopped about 15 minutes
before the end, and it was very thrilling ;-)

So I rebooted the frontend machine because I thought the cause was the
XBMC version with MythTV PVR support I use, which is alpha or beta.

But the movie stopped again after a few seconds.  It's really strange
because it ran fine for about 1 hour 50 mins.  Only the last 15 or 20
minutes caused problems.

When I first ssh-ed into the server the connection froze as if the
network connection had gone down.  But I could still ping it.  I tried
several times; sometimes I couldn't log in, sometimes I could.

Btw I ran the four dd commands within a screen session if this is of any
importance?


> Oh, please reply-to-all Ramon so these hit my inbox.  List mail goes to
> separate folders, and I don't check them in a timely manner.

Sorry, last time I used Pan to reply.  With it it's not possible to
reply to the list and to you at the same time.
But Evolution can :-)


Best regards
Ramon



* Re: Stacked array data recovery
  2012-06-26  8:37               ` Ramon Hofer
@ 2012-06-26 20:23                 ` Stan Hoeppner
  2012-06-27  9:07                   ` Ramon Hofer
  2012-07-02 10:12                   ` Ramon Hofer
  0 siblings, 2 replies; 27+ messages in thread
From: Stan Hoeppner @ 2012-06-26 20:23 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: linux-raid

On 6/26/2012 3:37 AM, Ramon Hofer wrote:

> Btw I ran the four dd commands within a screen session if this is of any
> importance?

Can you get to that screen session?  Recall I showed you the output you
should have seen when each dd command completed?  Did you see four such
sets of output in the screen session?

Regardless, unless the dd commands hung in some way, they should not
show up in top right now.  So it's probably safe to assume they completed.

So at this point you can try creating the RAID5 array again.  If the dd
command did what we wanted, /dev/sdk should have remapped the bad
sector, and you shouldn't get the error kicking that drive.  If you
still do, you may need to replace the drive.

-- 
Stan


* Re: Stacked array data recovery
  2012-06-26 20:23                 ` Stan Hoeppner
@ 2012-06-27  9:07                   ` Ramon Hofer
  2012-06-27 12:34                     ` Stan Hoeppner
  2012-06-28 18:44                     ` Krzysztof Adamski
  2012-07-02 10:12                   ` Ramon Hofer
  1 sibling, 2 replies; 27+ messages in thread
From: Ramon Hofer @ 2012-06-27  9:07 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On Tue, 2012-06-26 at 15:23 -0500, Stan Hoeppner wrote:
> On 6/26/2012 3:37 AM, Ramon Hofer wrote:
> 
> > Btw I ran the four dd commands within a screen session if this is of any
> > importance?
> 
> Can you get to that screen session?  Recall I showed you the output you
> should have seen when each dd command completed?  Did you see four such
> sets of output in the screen session?

I was able to get into the screen session but the output wasn't shown.
It seemed as if all four commands were still running.
top showed the dd commands as well, but I'm not sure if all four were
still running.


> Regardless, unless the dd commands hung in some way, they should not
> show up in top right now.  So it's probably safe to assume they completed.

How long would it approximately take for the dd command to clear the
drive?
Can I assume a write rate of 100 MB/s?
With a disk size of 2 TB this should take about 6 hours?
I think I'll let it run again this evening so that it can complete until
the next day.
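
(Rough arithmetic: 2 TB is about 2,000,000 MB; at a sustained 100 MB/s
that is roughly 20,000 s, i.e. about 5.5 hours, so 6 hours per drive
seems a reasonable estimate.)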


> So at this point you can try creating the RAID5 array again.  If the dd
> command did what we wanted, /dev/sdk should have remapped the bad
> sector, and you shouldn't get the error kicking that drive.  If you
> still do, you may need to replace the drive.

I'm not sure if I didn't kill the dd command too early.
Maybe it's better to let it run again. Maybe even each disk at once?
Maybe this would already tell if a disk is faulty?


Cheers
Ramon





* Re: Stacked array data recovery
  2012-06-27  9:07                   ` Ramon Hofer
@ 2012-06-27 12:34                     ` Stan Hoeppner
  2012-06-27 19:19                       ` Ramon Hofer
  2012-06-28 18:44                     ` Krzysztof Adamski
  1 sibling, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2012-06-27 12:34 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: linux-raid

On 6/27/2012 4:07 AM, Ramon Hofer wrote:
> On Tue, 2012-06-26 at 15:23 -0500, Stan Hoeppner wrote:
>> On 6/26/2012 3:37 AM, Ramon Hofer wrote:
>>
>>> Btw I ran the four dd commands within a screen session if this is of any
>>> importance?
>>
>> Can you get to that screen session?  Recall I showed you the output you
>> should have seen when each dd command completed?  Did you see four such
>> sets of output in the screen session?
> 
> I was able to get into the screen session but the output wasn't shown.
> It seemed as if all four commands were still running.
> top showed the dd commands as well, but I'm not sure if all four were
> still running.

Ok that's strange.  Looks like they hung somehow if they're still
showing in top.  Go ahead and kill them.  After you kill em confirm they
exited (no longer in top).

>> Regardless, unless the dd commands hung in some way, they should not
>> show up in top right now.  So it's probably safe to assume they completed.
> 
> How long would it approximately take for the dd command to clear the
> drive?
> Can I assume a write rate of 100 MB/s?

It'll be 100-120 at the beginning, and slow to 20-30 as dd is writing
sectors on the inner platter tracks.

> With a disk size of 2 TB this should take about 6 hours?

Yeah, something like that.

> I think I'll let it run again this evening so that it can complete until
> the next day.

Do them one at a time this time, starting with /dev/sdk.  Remember to
issue sync.  E.g.

~$ dd if=/dev/zero of=/dev/sdk bs=16384; sync

Try a smaller block size this time, 16KB instead of 1M.

Oh, Ramon, before you do any of this, change to the deadline elevator:

~$ echo deadline > /sys/block/sdk/queue/scheduler

Dangit, Ramon, I can't believe I forgot to tell you this when you
installed the 9240.  Debian, like most distros, defaults to the CFQ
elevator, which is good, I guess, for interactive desktop use.  But it
sucks like a Hoover with most server workloads.  So add this to your
kernel boot options in grub's menu.lst, normally found in
/boot/grub/menu.lst:

elevator=deadline

That will enable deadline each time you boot.  So after you make the
change go ahead and reboot the system.  Then proceed with your dd commands.
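
If you want to switch the other drives at runtime without waiting for
the reboot, a quick sketch (adjust the device glob to your disks):

~$ for f in /sys/block/sd[a-m]/queue/scheduler; do echo deadline > $f; done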

>> So at this point you can try creating the RAID5 array again.  If the dd
>> command did what we wanted, /dev/sdk should have remapped the bad
>> sector, and you shouldn't get the error kicking that drive.  If you
>> still do, you may need to replace the drive.
> 
> I'm not sure if I didn't kill the dd command too early.
> Maybe it's better to let it run again. Maybe even each disk at once?
> Maybe this would already tell if a disk is faulty?

Yeah, do em one at a time this time.  It'll cause less load, and you
should still be able to watch movies etc while it runs.

-- 
Stan


* Re: Stacked array data recovery
  2012-06-27 12:34                     ` Stan Hoeppner
@ 2012-06-27 19:19                       ` Ramon Hofer
  2012-06-28 19:57                         ` Stan Hoeppner
  0 siblings, 1 reply; 27+ messages in thread
From: Ramon Hofer @ 2012-06-27 19:19 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On Wed, 27 Jun 2012 07:34:13 -0500
Stan Hoeppner <stan@hardwarefreak.com> wrote:

> > With a disk size of 2 TB this should take about 6 hours?
> 
> Yeah, something like that.
> 
> > I think I'll let it run again this evening so that it can complete
> > until the next day.
> 
> Do them one at a time this time, starting with /dev/sdk.  Remember to
> issue sync.  E.g.
> 
> ~$ dd if=/dev/zero of=/dev/sdk bs=16384; sync
> 
> Try a smaller block size this time, 16KB instead of 1M.

Ok, thanks I will. And additionally I will write down what time I
started each command so when one of them still hasn't finished after 12
hours or so the disk will have to be replaced right?


> Oh, Ramon, before you do any of this, change to the deadline elevator:
> 
> ~$ echo deadline > /sys/block/sdk/queue/scheduler
> 
> Dangit, Ramon, I can't believe I forgot to tell you this when you
> installed the 9240.  Debian, like most distros, defaults to the CFQ
> elevator, which is good, I guess, for interactive desktop use.  But it
> sucks like a Hoover with most server workloads.  So add this to your
> kernel boot options in grub's menu.lst, normally found in
> /boot/grub/menu.lst:
> 
> elevator=deadline
> 
> That will enable deadline each time you boot.  So after you make the
> change go ahead and reboot the system.  Then proceed with your dd
> commands.

Very interesting!!!
Thanks for that :-)

Btw: I use grub2. The file I edited is /etc/default/grub. I changed
this line:

GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=deadline"

and ran
~# update-grub

I checked if the right scheduler is running:

~$ cat /sys/block/sdk/queue/scheduler
noop [deadline] cfq 

Is what I did correct?


> >> So at this point you can try creating the RAID5 array again.  If
> >> the dd command did what we wanted, /dev/sdk should have remapped
> >> the bad sector, and you shouldn't get the error kicking that
> >> drive.  If you still do, you may need to replace the drive.
> > 
> > I'm not sure if I didn't kill the dd command too early.
> > Maybe it's better to let it run again. Maybe even each disk at once?
> > Maybe this would already tell if a disk is faulty?
> 
> Yeah, do em one at a time this time.  It'll cause less load, and you
> should still be able to watch movies etc while it runs.

And I can see if one of them behaves strangely :-)


Cheers
Ramon


* Re: Stacked array data recovery
  2012-06-27  9:07                   ` Ramon Hofer
  2012-06-27 12:34                     ` Stan Hoeppner
@ 2012-06-28 18:44                     ` Krzysztof Adamski
  2012-06-29  7:44                       ` Ramon Hofer
  1 sibling, 1 reply; 27+ messages in thread
From: Krzysztof Adamski @ 2012-06-28 18:44 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: linux-raid

On Wed, 2012-06-27 at 11:07 +0200, Ramon Hofer wrote:

> I'm not sure if I didn't kill the dd command too early.
> Maybe it's better to let it run again. Maybe even each disk at once?
> Maybe this would already tell if a disk is faulty?

You can send USR1 signal to the dd command and it will print how far it
has gone, like this:
# kill -USR1 <pid of dd>

K



* Re: Stacked array data recovery
  2012-06-27 19:19                       ` Ramon Hofer
@ 2012-06-28 19:57                         ` Stan Hoeppner
  2012-06-29  7:58                           ` Ramon Hofer
  0 siblings, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2012-06-28 19:57 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: linux-raid

On 6/27/2012 2:19 PM, Ramon Hofer wrote:
> 
> Ok, thanks I will. And additionally I will write down what time I
> started each command so when one of them still hasn't finished after 12
> hours or so the disk will have to be replaced right?

That's unnecessary.  Linux retains the start time of each process:

~$ ps -ef|grep dd
...
root      4338  4307 95 14:49 pts/0    00:00:48 dd if=/dev/zero of=./test
...

The 5th column shows the start time.  If the process has been running
more than 24 hours the start date will be shown instead of the start time.
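
Alternatively, ps can show the elapsed run time directly (a sketch):

~$ ps -o etime,args -C dd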

> GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=deadline"
> 
> and ran
> ~# update-grub
> 
> I checked if the right scheduler is running:
> 
> ~$ cat /sys/block/sdk/queue/scheduler
> noop [deadline] cfq 
> 
> Is what I did correct?

Yep.  The brackets surrounding deadline show it is enabled.

> And I can see if one of them behaves strangely :-)

Yep.

-- 
Stan



* Re: Stacked array data recovery
  2012-06-28 18:44                     ` Krzysztof Adamski
@ 2012-06-29  7:44                       ` Ramon Hofer
  2012-06-29 10:15                         ` John Robinson
  0 siblings, 1 reply; 27+ messages in thread
From: Ramon Hofer @ 2012-06-29  7:44 UTC (permalink / raw)
  To: Krzysztof Adamski; +Cc: linux-raid

On Thu, 2012-06-28 at 14:44 -0400, Krzysztof Adamski wrote:
> You can send USR1 signal to the dd command and it will print how far it
> has gone, like this:
> # kill -USR1 <pid of dd>

Great!
Thanks for that :-)



* Re: Stacked array data recovery
  2012-06-28 19:57                         ` Stan Hoeppner
@ 2012-06-29  7:58                           ` Ramon Hofer
  0 siblings, 0 replies; 27+ messages in thread
From: Ramon Hofer @ 2012-06-29  7:58 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On Thu, 2012-06-28 at 14:57 -0500, Stan Hoeppner wrote:
> On 6/27/2012 2:19 PM, Ramon Hofer wrote:
> > 
> > Ok, thanks I will. And additionally I will write down what time I
> > started each command so when one of them still hasn't finished after 12
> > hours or so the disk will have to be replaced right?
> 
> That's unnecessary.  Linux retains the start time of each process:
> 
> ~$ ps -ef|grep dd
> ...
> root      4338  4307 95 14:49 pts/0    00:00:48 dd if=/dev/zero of=./test
> ...
> 
> The 5th column shows the start time.  If the process has been running
> more than 24 hours the start date will be shown instead of the start time.

Thanks for that :-)


The first drive (sdj) has finished.
I'm now dding the second (sdk). This one caused problems I think. I'll
report how it went...


Cheers
Ramon



* Re: Stacked array data recovery
  2012-06-29  7:44                       ` Ramon Hofer
@ 2012-06-29 10:15                         ` John Robinson
  2012-06-29 11:19                           ` Ramon Hofer
  0 siblings, 1 reply; 27+ messages in thread
From: John Robinson @ 2012-06-29 10:15 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: Krzysztof Adamski, linux-raid

On 29/06/2012 08:44, Ramon Hofer wrote:
> On Thu, 2012-06-28 at 14:44 -0400, Krzysztof Adamski wrote:
>> You can send USR1 signal to the dd command and it will print how far it
>> has gone, like this:
>> # kill -USR1 <pid of dd>
>
> Great!
> Thanks for that :-)

When running dd on whole discs, I usually run:

while p=`pidof dd` ; do kill -USR1 $p ; sleep 30 ; done

so I have a continuous progress monitor.

Cheers,

John.


* Re: Stacked array data recovery
  2012-06-29 10:15                         ` John Robinson
@ 2012-06-29 11:19                           ` Ramon Hofer
  0 siblings, 0 replies; 27+ messages in thread
From: Ramon Hofer @ 2012-06-29 11:19 UTC (permalink / raw)
  To: John Robinson; +Cc: Krzysztof Adamski, linux-raid

On Fri, 2012-06-29 at 11:15 +0100, John Robinson wrote:
> On 29/06/2012 08:44, Ramon Hofer wrote:
> > On Thu, 2012-06-28 at 14:44 -0400, Krzysztof Adamski wrote:
> >> You can send USR1 signal to the dd command and it will print how far it
> >> has gone, like this:
> >> # kill -USR1 <pid of dd>
> >
> > Great!
> > Thanks for that :-)
> 
> When running dd on whole discs, I usually run:
> 
> while p=`pidof dd` ; do kill -USR1 $p ; sleep 30 ; done
> 
> so I have a continuous progress monitor.

Very cool!
Thanks too :-)



* Re: Stacked array data recovery
  2012-06-26 20:23                 ` Stan Hoeppner
  2012-06-27  9:07                   ` Ramon Hofer
@ 2012-07-02 10:12                   ` Ramon Hofer
  2012-07-02 11:46                     ` Phil Turmel
  2012-07-02 20:27                     ` Stan Hoeppner
  1 sibling, 2 replies; 27+ messages in thread
From: Ramon Hofer @ 2012-07-02 10:12 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On Tue, 2012-06-26 at 15:23 -0500, Stan Hoeppner wrote:

> Regardless, unless the dd commands hung in some way, they should not
> show up in top right now.  So it's probably safe to assume they completed.
> 
> So at this point you can try creating the RAID5 array again.  If the dd
> command did what we wanted, /dev/sdk should have remapped the bad
> sector, and you shouldn't get the error kicking that drive.  If you
> still do, you may need to replace the drive.

I have successfully run dd for all the four drives.

But because I couldn't create the raid I ran smartctl again for all of
them. It seems that sdk has to be replaced. Here are the outputs:

sdj:
http://pastebin.com/raw.php?i=V60FJ4wC

sdk:
http://pastebin.com/raw.php?i=cgfq3202
There are some pre-fail and one FAILING_NOW messages.

sdl:
http://pastebin.com/raw.php?i=tipjbpxu

sdm:
http://pastebin.com/raw.php?i=sZnrCJ5Q

Is this the right moment to bring sdk back to my dealer?


Best regards
Ramon



* Re: Stacked array data recovery
  2012-07-02 10:12                   ` Ramon Hofer
@ 2012-07-02 11:46                     ` Phil Turmel
  2012-07-02 12:18                       ` Ramon Hofer
  2012-07-02 20:27                     ` Stan Hoeppner
  1 sibling, 1 reply; 27+ messages in thread
From: Phil Turmel @ 2012-07-02 11:46 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: stan, linux-raid

Hi Ramon,

On 07/02/2012 06:12 AM, Ramon Hofer wrote:
> On Tue, 2012-06-26 at 15:23 -0500, Stan Hoeppner wrote:
> 
>> Regardless, unless the dd commands hung in some way, they should not
>> show up in top right now.  So it's probably safe to assume they completed.
>>
>> So at this point you can try creating the RAID5 array again.  If the dd
>> command did what we wanted, /dev/sdk should have remapped the bad
>> sector, and you shouldn't get the error kicking that drive.  If you
>> still do, you may need to replace the drive.
> 
> I have successfully run dd for all the four drives.
> 
> But because I couldn't create the raid I ran smartctl again for all of
> them. It seems that sdk has to be replaced. Here are the outputs:
> 
> sdj:
> http://pastebin.com/raw.php?i=V60FJ4wC
> 
> sdk:
> http://pastebin.com/raw.php?i=cgfq3202
> There are some pre-fail and one FAILING_NOW messages.

Absolutely take this drive back to your dealer.  It has re-allocated
over a thousand sectors.  In my opinion, a drive should be put on a
short-term (~3month) replacement list for the next convenient
opportunity as soon as the first sector gets relocated.  If you get more
relocations in quick succession, replace immediately.
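
To keep an eye on these counters automatically, an entry along these
lines in /etc/smartd.conf will log attribute changes and mail on
failures (a sketch; the mail address is only an example):

/dev/sdk -a -m admin@example.com

Then restart the smartd/smartmontools service.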

> sdl:
> http://pastebin.com/raw.php?i=tipjbpxu

Interesting.  This drive claims to support SCTERC.  I was under the
impression the EARS drives didn't.  Could you please show the output of
"smartctl -l scterc /dev/sdl" ?

Regards,

Phil


* Re: Stacked array data recovery
  2012-07-02 11:46                     ` Phil Turmel
@ 2012-07-02 12:18                       ` Ramon Hofer
  2012-07-02 21:42                         ` Phil Turmel
  0 siblings, 1 reply; 27+ messages in thread
From: Ramon Hofer @ 2012-07-02 12:18 UTC (permalink / raw)
  To: Phil Turmel; +Cc: stan, linux-raid

On Mon, 2012-07-02 at 07:46 -0400, Phil Turmel wrote:
> Hi Ramon,
> 
> On 07/02/2012 06:12 AM, Ramon Hofer wrote:
> > On Tue, 2012-06-26 at 15:23 -0500, Stan Hoeppner wrote:
> > 
> >> Regardless, unless the dd commands hung in some way, they should not
> >> show up in top right now.  So it's probably safe to assume they completed.
> >>
> >> So at this point you can try creating the RAID5 array again.  If the dd
> >> command did what we wanted, /dev/sdk should have remapped the bad
> >> sector, and you shouldn't get the error kicking that drive.  If you
> >> still do, you may need to replace the drive.
> > 
> > I have successfully run dd for all the four drives.
> > 
> > But because I couldn't create the raid I ran smartctl again for all of
> > them. It seems that sdk has to be replaced. Here are the outputs:
> > 
> > sdj:
> > http://pastebin.com/raw.php?i=V60FJ4wC
> > 
> > sdk:
> > http://pastebin.com/raw.php?i=cgfq3202
> > There are some pre-fail and one FAILING_NOW messages.
> 
> Absolutely take this drive back to your dealer.  It has re-allocated
> over a thousand sectors.  In my opinion, a drive should be put on a
> short-term (~3month) replacement list for the next convenient
> opportunity as soon as the first sector gets relocated.  If you get more
> relocations in quick succession, replace immediately.

Ok, thanks!
I will bring it back.

I've never had to replace a disk so far. Does it depend on the dealer or
do the manufacturers replace them already when sectors get reallocated? 


> > sdl:
> > http://pastebin.com/raw.php?i=tipjbpxu
> 
> Interesting.  This drive claims to support SCTERC.  I was under the
> impression the EARS drives didn't.  Could you please show the output of
> "smartctl -l scterc /dev/sdl" ?

Sure:

~# smartctl -l scterc /dev/sdl
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen,
http://smartmontools.sourceforge.net

Error SMART WRITE LOG does not return COUNT and LBA_LOW register
Warning: device does not support SCT (Get) Error Recovery Control
command


Cheers
Ramon





* Re: Stacked array data recovery
  2012-07-02 10:12                   ` Ramon Hofer
  2012-07-02 11:46                     ` Phil Turmel
@ 2012-07-02 20:27                     ` Stan Hoeppner
  2012-07-03  7:16                       ` Ramon Hofer
  1 sibling, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2012-07-02 20:27 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: linux-raid

On 7/2/2012 5:12 AM, Ramon Hofer wrote:
> On Tue, 2012-06-26 at 15:23 -0500, Stan Hoeppner wrote:
> 
>> Regardless, unless the dd commands hung in some way, they should not
>> show up in top right now.  So it's probably safe to assume they completed.
>>
>> So at this point you can try creating the RAID5 array again.  If the dd
>> command did what we wanted, /dev/sdk should have remapped the bad
>> sector, and you shouldn't get the error kicking that drive.  If you
>> still do, you may need to replace the drive.
> 
> I have successfully run dd for all the four drives.
> 
> But because I couldn't create the raid I ran smartctl again for all of
> them. It seems that sdk has to be replaced. Here are the outputs:

> sdk:
> http://pastebin.com/raw.php?i=cgfq3202

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...
  5 Reallocated_Sector_Ct   0x0033   133   133   140    Pre-fail  Always   FAILING_NOW 1265
  7 Seek_Error_Rate         0x002e   103   103   000    Old_age   Always       -       15797
196 Reallocated_Event_Count 0x0032   142   142   000    Old_age   Always       -       58
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       14
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       15
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       15

Yes, the drive is toast.  Above are the indicators.  IIRC, your previous smartctl run showed only a count of 3 for Current_Pending_Sector, but no other errors.  Zero'ing the drive with dd has uncovered the true severity of the drive's problems, as another 1000+ bad sectors were identified by the drive and remapped.

So yes, the drive needs to be replaced.
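
A quick way to check just these counters on the other drives (a
sketch):

~$ smartctl -A /dev/sdX | egrep 'Reallocated|Pending|Offline_Uncorrectable'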

-- 
Stan


* Re: Stacked array data recovery
  2012-07-02 12:18                       ` Ramon Hofer
@ 2012-07-02 21:42                         ` Phil Turmel
  0 siblings, 0 replies; 27+ messages in thread
From: Phil Turmel @ 2012-07-02 21:42 UTC (permalink / raw)
  To: Ramon Hofer; +Cc: stan, linux-raid

On 07/02/2012 08:18 AM, Ramon Hofer wrote:
> Ok, thanks!
> I will bring it back.
> 
> I've never had to replace a disk so far. Does it depend on the dealer or
> do the manufacturers replace them already when sectors get reallocated? 

I don't know--I've never had an in-service drive fail while still under
warranty :-)  I did have a DOA replaced once, but I doubt that reflects
the hoops you might have to jump through.  Maybe just send them the
smartctl report?

As a hobbyist and small businessman, I only run my own servers.  I don't
handle the volume of drives necessary to personally know the RMA
departments.

Others around here probably do.

>> Interesting.  This drive claims to support SCTERC.  I was under the
>> impression the EARS drives didn't.  Could you please show the output of
>> "smartctl -l scterc /dev/sdl" ?
> 
> Sure:
> 
> ~# smartctl -l scterc /dev/sdl
> smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen,
> http://smartmontools.sourceforge.net
> 
> Error SMART WRITE LOG does not return COUNT and LBA_LOW register
> Warning: device does not support SCT (Get) Error Recovery Control
> command

Very interesting indeed.  The summary report says it *does* support
SCTERC, but when actually probed, it refuses.  That's just evil.

Thanks for running the query.

Regards,

Phil


* Re: Stacked array data recovery
  2012-07-02 20:27                     ` Stan Hoeppner
@ 2012-07-03  7:16                       ` Ramon Hofer
  0 siblings, 0 replies; 27+ messages in thread
From: Ramon Hofer @ 2012-07-03  7:16 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On Mon, 2012-07-02 at 15:27 -0500, Stan Hoeppner wrote:
> On 7/2/2012 5:12 AM, Ramon Hofer wrote:
> > On Tue, 2012-06-26 at 15:23 -0500, Stan Hoeppner wrote:
> > 
> >> Regardless, unless the dd commands hung in some way, they should not
> >> show up in top right now.  So it's probably safe to assume they completed.
> >>
> >> So at this point you can try creating the RAID5 array again.  If the dd
> >> command did what we wanted, /dev/sdk should have remapped the bad
> >> sector, and you shouldn't get the error kicking that drive.  If you
> >> still do, you may need to replace the drive.
> > 
> > I have successfully run dd for all the four drives.
> > 
> > But because I couldn't create the raid I ran smartctl again for all of
> > them. It seems that sdk has to be replaced. Here are the outputs:
> 
> > sdk:
> > http://pastebin.com/raw.php?i=cgfq3202
> 
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> ...
>   5 Reallocated_Sector_Ct   0x0033   133   133   140    Pre-fail  Always   FAILING_NOW 1265
>   7 Seek_Error_Rate         0x002e   103   103   000    Old_age   Always       -       15797
> 196 Reallocated_Event_Count 0x0032   142   142   000    Old_age   Always       -       58
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       14
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       15
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       15
> 
> Yes, the drive is toast.  Above are the indicators.  IIRC, your previous smartctl run showed only a count of 3 for Current_Pending_Sector, but no other errors.  Zero'ing the drive with dd has uncovered the true severity of the drive's problems, as another 1000+ bad sectors were identified by the drive and remapped.
> 
> So yes, the drive needs to be replaced.

Thanks for the explanation!

I brought the drive to my dealer yesterday and could take a new one with
me. :-)
I only had to pay the difference of CHF 10.-


Btw Stan: I wrote to the manufacturer of the damping mats.  He told me
that they are electrically insulating.  I asked about the resistance
value and whether they are flammable.  When I have the answer I will
order them...


Cheers
Ramon


