* recovery of hosed raid5 array
@ 2003-10-11 15:52 Jason Lunz
  2003-10-11 18:59 ` linux-raid
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Lunz @ 2003-10-11 15:52 UTC (permalink / raw)
  To: linux-raid

So I have a small raid5 array of 5 25G ide disks for a home server. It
was built entirely on 2.4.22 with mdadm in debian unstable a couple of
months ago.  One of the disks died last night and was removed from the
array, which went on in degraded mode.  The dead disk, /dev/hdg, stopped
responding and was deactivated by the promise ide driver.

After rebooting, the raid continued to work in degraded mode, and
/dev/hdg seemed to check out ok. After checking it with smartctl and
doing some successful reads from it, I added it back to the array with
"mdadm -a".

Once the rebuild started, I noticed an email from smartd, saying that
the _other_ drive on the Promise controller (/dev/hde) had had errors.
So I foolishly checked it out with smartctl, at which point everything
went to hell. /dev/hde now clicks repeatedly and refuses to work for
more than a minute or so, while /dev/hdg is partially destroyed from
being re-added. The rebuild was 3-4% complete when hde died.

So I realize I'm probably fucked, but I really want to get the 80G or so
of data back. Three of the drives are fine, /dev/hde is dead, but if I
can use even some of /dev/hdg, I should be able to at least recover part
of the data if I can get the ext3 fs to mount. How would I go about
this?

I know if I just wanted to use the four good disks, I could do
"mdadm -Af". But how do I go about getting at the pre-rebuild data on
/dev/hdg? Is there any way to restore its superblock data to the
pre-rebuild state?
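
(By "mdadm -Af" I mean a forced assembly along these lines -- the member
names other than hdg are just placeholders for the surviving disks:)

	mdadm --assemble --force /dev/md0 /dev/hda1 /dev/hdc1 /dev/hdi1 /dev/hdg1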

I will be eternally indebted to anyone that can help.

thanks,

Jason



* Re: recovery of hosed raid5 array
  2003-10-11 15:52 recovery of hosed raid5 array Jason Lunz
@ 2003-10-11 18:59 ` linux-raid
  2003-10-12  1:14   ` rob
  2003-10-12 15:14   ` Jason Lunz
  0 siblings, 2 replies; 12+ messages in thread
From: linux-raid @ 2003-10-11 18:59 UTC (permalink / raw)
  To: Jason Lunz; +Cc: linux-raid


 Hi,

On Sat, 11 Oct 2003, Jason Lunz wrote:

> Date: Sat, 11 Oct 2003 15:52:40 +0000 (UTC)
> From: Jason Lunz <lunz@falooley.org>
> To: linux-raid@vger.kernel.org
> Subject: recovery of hosed raid5 array
> 
[...]
> Once the rebuild started, I noticed an email from smartd, saying that
> the _other_ drive on the Promise controller (/dev/hde) had had errors.
> So I foolishly checked it out with smartctl,

  Can you please share with us newbies why using "smartctl"
in such a situation is foolish, so we don't make the same
mistake?

  Thanks.

    John


[...]
> at which point everything
> went to hell. /dev/hde now clicks repeatedly and refuses to work for
> more than a minute or so, while /dev/hdg is partially destroyed from
> being re-added. The rebuild was 3-4% complete when hde died.


-- 
-- Gospel of Jesus' kingdom = saving power of God for all who believe --
                 ## To some, nothing is impossible. ##


* Re: recovery of hosed raid5 array
  2003-10-11 18:59 ` linux-raid
@ 2003-10-12  1:14   ` rob
  2003-10-12 15:14   ` Jason Lunz
  1 sibling, 0 replies; 12+ messages in thread
From: rob @ 2003-10-12  1:14 UTC (permalink / raw)
  To: linux-raid; +Cc: Jason Lunz, linux-raid

What IS "smartctl"?

linux-raid@ied.com wrote:

> Hi,
>
> On Sat, 11 Oct 2003, Jason Lunz wrote:
>
>> Date: Sat, 11 Oct 2003 15:52:40 +0000 (UTC)
>> From: Jason Lunz <lunz@falooley.org>
>> To: linux-raid@vger.kernel.org
>> Subject: recovery of hosed raid5 array
>
> [...]
>
>> Once the rebuild started, I noticed an email from smartd, saying that
>> the _other_ drive on the Promise controller (/dev/hde) had had errors.
>> So I foolishly checked it out with smartctl,
>
> Can you please share with us newbies why using "smartctl"
> in such a situation is foolish, so we don't make the same
> mistake?
>
> Thanks.
>
>   John
>
> [...]
>
>> at which point everything
>> went to hell. /dev/hde now clicks repeatedly and refuses to work for
>> more than a minute or so, while /dev/hdg is partially destroyed from
>> being re-added. The rebuild was 3-4% complete when hde died.


* Re: recovery of hosed raid5 array
  2003-10-11 18:59 ` linux-raid
  2003-10-12  1:14   ` rob
@ 2003-10-12 15:14   ` Jason Lunz
  2003-10-12 17:39     ` dean gaudet
  1 sibling, 1 reply; 12+ messages in thread
From: Jason Lunz @ 2003-10-12 15:14 UTC (permalink / raw)
  To: linux-raid

linux-raid@ied.com said:
> Can you please share with us newbies why using "smartctl" in such a
> situation is foolish, so we don't make the same mistake?

SMART is a drive self-monitoring facility for assessing the health of
hard disks, and smartctl is the tool from the smartmontools suite used
under Linux to query the SMART status of a disk.
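
A typical one-shot query looks something like this:

	smartctl -a /dev/hde	# dump the drive's SMART attributes, error log and self-test results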

What was foolish was me provoking /dev/hde by asking it to report
diagnostics with smartctl at the same time the array was rebuilding
/dev/hdg. Even if something _was_ wrong with hde, it wouldn't have
helped me to find out then during the rebuild. Had the resync completed,
I'd have all my data now and one dead disk.

The question remains: What's the best way to get at the mostly unharmed
data on hdg from before the rebuild started? I know it's there.

Jason



* Re: recovery of hosed raid5 array
  2003-10-12 15:14   ` Jason Lunz
@ 2003-10-12 17:39     ` dean gaudet
  2003-10-12 18:04       ` Jason Lunz
                         ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: dean gaudet @ 2003-10-12 17:39 UTC (permalink / raw)
  To: Jason Lunz; +Cc: linux-raid



On Sun, 12 Oct 2003, Jason Lunz wrote:

> What was foolish was me provoking /dev/hde by asking it to report
> diagnostics with smartctl at the same time the array was rebuilding
> /dev/hdg. Even if something _was_ wrong with hde, it wouldn't have
> helped me to find out then during the rebuild. Had the resync completed,
> I'd have all my data now and one dead disk.

querying SMART shouldn't cause this to happen -- but i've seen it occur
with a promise controller and maxtor disks.  i used to query the SMART
data once a night just to have a log.  then i switched it to once every 5
minutes so i could graph the drive temperature... and when i went to once
every 5 minutes the system became unstable.  the kernel would randomly
lose the ability to talk to a disk.  the problem would go away after a
reboot.  i assume it was some sort of race condition.

i've since switched from promise to 3ware, and now i can't use smartctl to
query the data.  (mind you a kind engineer from 3ware sent me the code i
need to query SMART from the drives, i've just never had the chance to
merge it into smartctl).


> The question remains: What's the best way to get at the mostly unharmed
> data on hdg from before the rebuild started? I know it's there.

mdadm can do it for you ... you need to know exactly which disk was in
which position in the raid.  then you recreate the raid using "missing" in
the slot where /dev/hde belonged.  then you'll have a degraded array, so
md won't try rebuilding it.  then you can copy off the data.

you need to know the exact numberings, and the exact commands you used to
create the array in the first place.
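
(if the superblocks on the surviving disks are still readable, something
like this shows each member's recorded slot plus the chunk size and layout
-- the partition name below is just an example:)

	mdadm --examine /dev/hdg1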

-dean


* Re: recovery of hosed raid5 array
  2003-10-12 17:39     ` dean gaudet
@ 2003-10-12 18:04       ` Jason Lunz
  2003-10-12 18:34         ` dean gaudet
  2003-10-13 13:54       ` Dragan Simic
  2003-10-26 19:40       ` Mike Fedyk
  2 siblings, 1 reply; 12+ messages in thread
From: Jason Lunz @ 2003-10-12 18:04 UTC (permalink / raw)
  To: dean gaudet; +Cc: linux-raid

On Sun, Oct 12, 2003 at 10:39AM -0700, dean gaudet wrote:
> querying SMART shouldn't cause this to happen -- but i've seen it occur

That's true, in theory. Fact is, asking a broken drive to do SMART stuff
may be the straw that breaks the camel's back. Immediately upon using
smartctl -a, /dev/hde started making repetitive clicking sounds every
half second or so, and the ide driver spammed syslog spectacularly. It's
only worked sporadically since (no more than a few minutes at a time),
even when power-cycled.

> with a promise controller and maxtor disks.  i used to query the SMART

oddly enough, it _is_ a maxtor disk on a promise controller (PDC20626, I
believe).

> i've since switched from promise to 3ware, and now i can't use smartctl to
> query the data.  (mind you a kind engineer from 3ware sent me the code i
> need to query SMART from the drives, i've just never had the chance to
> merge it into smartctl).

I've heard noises about 3ware support in very recent releases of
smartmontools, iirc.

> mdadm can do it for you ... you need to know exactly which disk was in
> which position in the raid.  then you recreate the raid using "missing" in
> the slot where /dev/hde belonged.  then you'll have a degraded array, so
> md won't try rebuilding it.  then you can copy off the data.

seriously? Did you read the whole thread? mdadm will do the right thing
even though /dev/hdg was 3% into a resync when /dev/hde died? That would
be lovely.

> you need to know the exact numberings, and the exact commands you used
> to create the array in the first place.

How might I go about figuring this out? I got a 120G drive yesterday
that's large enough to capture raw images of all the raid disks, so I
can try different combinations of commands. What I can't do is look at
the logs, because the non-raid portion of the now-dead /dev/hde held the
root, /usr, and /var partitions.
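
(The images are plain dd copies, roughly like this -- the output path is
just wherever the 120G drive happens to be mounted:)

	# conv=noerror,sync keeps reading past bad sectors, padding them with zeros
	dd if=/dev/hdg of=/mnt/new/hdg.img bs=1M conv=noerror,sync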

Jason


* Re: recovery of hosed raid5 array
  2003-10-12 18:04       ` Jason Lunz
@ 2003-10-12 18:34         ` dean gaudet
  0 siblings, 0 replies; 12+ messages in thread
From: dean gaudet @ 2003-10-12 18:34 UTC (permalink / raw)
  To: Jason Lunz; +Cc: linux-raid

On Sun, 12 Oct 2003, Jason Lunz wrote:

> On Sun, Oct 12, 2003 at 10:39AM -0700, dean gaudet wrote:
>
> > mdadm can do it for you ... you need to know exactly which disk was in
> > which position in the raid.  then you recreate the raid using "missing" in
> > the slot where /dev/hde belonged.  then you'll have a degraded array, so
> > md won't try rebuilding it.  then you can copy off the data.
>
> seriously? Did you read the whole thread? mdadm will do the right thing
> even though /dev/hdg was 3% into a resync when /dev/hde died? That would
> be lovely.

yeah it's not gonna be pretty no matter what you try, but you can at least
force md into thinking the remaining disks are part of a degraded raid.
you should mount any fs read-only at this point though.


> > you need to know the exact numberings, and the exact commands you used
> > to create the array in the first place.
>
> How might I go about figuring this out? I got a 120G drive yesterday
> that's large enough to capture raw images of all the raid disks, so I
> can try different combinations of commands. What I can't do is look at
> the logs, because the non-raid portion of the now-dead /dev/hde held the
> root, /usr, and /var partitions.

unfortunately if you don't have any logs or any memory of what positions
the disks were in you're kind of screwed.  it's in dmesg after a boot --
in the past i've fetched it from a backup of /var/log/dmesg on another
system.  i.e.:

raid5: device sdh1 operational as raid disk 6
raid5: device sdg1 operational as raid disk 5
raid5: spare disk sdf1
raid5: device sde1 operational as raid disk 4
raid5: device sdd1 operational as raid disk 3
raid5: device sdc1 operational as raid disk 2
raid5: device sdb1 operational as raid disk 1
raid5: device sda1 operational as raid disk 0

unfortunately md doesn't log the chunksize in dmesg... you can get the
chunksize from /proc/mdstat though (which is another place to get the
disk positions).

Personalities : [linear] [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid5 sdh1[6] sdg1[5] sdf1[7] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
      720321792 blocks level 5, 64k chunk, algorithm 2 [7/7] [UUUUUUU]

if you've never had any faulty disk and swapped in a spare then your
raid should be in the exact order you originally created it.

if i wanted to forcefully reconstruct that array without sde1 i'd be
doing something like (you need to --stop your md0 before doing this):

	mdadm --create /dev/md0 --chunk=64 --level=5 --raid-devices=7 \
		/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 missing \
		/dev/sdg1 /dev/sdh1

notice the "missing".

if you specified a non-default raid5 algorithm then you need to include
that as well.
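
(the layout is given with --parity; note that "algorithm 2" in /proc/mdstat
is the default left-symmetric layout, so spelling it out is harmless, e.g.:)

	mdadm --create /dev/md0 --chunk=64 --level=5 --parity=left-symmetric \
		--raid-devices=7 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 missing \
		/dev/sdg1 /dev/sdh1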

this will create brandnew raid superblocks... there's no going back
after you've done this.  to md it will be like this is a brand new
array.

cross your fingers and mount the fs read-only and see if any of your
data is intact.

as a backup you could partition copy md0 to another disk/raid using dd
and then you can fsck that copy ... you might get further than you would
mounting the original read-only.
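
(a rough sketch of that, assuming a spare partition at least as large as
md0 -- the target name below is made up:)

	dd if=/dev/md0 of=/dev/hdi1 bs=1M conv=noerror,sync
	fsck.ext3 -fy /dev/hdi1	# repair the copy, not the original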

if /dev/hdg has a surface error and md marks it as faulty again then
what you'll need to do is copy /dev/hdg to a fresh disk (use dd on the
partition) then do like above but replace hdg with the copy... you'll
get garbage wherever hdg had surface errors, but at least md won't mark
it as faulty.  (the fs probably won't be happy.)

hmm i suppose if you're clever you can find the bad sectors with dd, then
overwrite them with zeros -- if the disk has any spare blocks left this
will work and you won't have to copy to another disk...  you lose the
data either way.  i'm going to skip explaining how to use dd like this
because you really should know what you're doing if you want to try it.

trust me, if any of this isn't clear then don't do it until you understand
what i'm suggesting.  there's really no going back.

-dean


* Re: recovery of hosed raid5 array
  2003-10-12 17:39     ` dean gaudet
  2003-10-12 18:04       ` Jason Lunz
@ 2003-10-13 13:54       ` Dragan Simic
  2003-10-13 17:00         ` dean gaudet
  2003-10-26 19:40       ` Mike Fedyk
  2 siblings, 1 reply; 12+ messages in thread
From: Dragan Simic @ 2003-10-13 13:54 UTC (permalink / raw)
  To: linux-raid

On Sun, 12 Oct 2003, dean gaudet wrote:

> querying SMART shouldn't cause this to happen -- but i've seen it occur
> with a promise controller and maxtor disks.  i used to query the SMART
> data once a night just to have a log.  then i switched it to once every 5
> minutes so i could graph the drive temperature... and when i went to once
> every 5 minutes the system became unstable.  the kernel would randomly
> lose the ability to talk to a disk.  the problem would go away after a
> reboot.  i assume it was some sort of race condition.

Just a small drop-in: I have a Promise FastTrak133 with two Maxtor HDDs
attached to it, and running smartmontools every 5 minutes I haven't
noticed any trouble or signs of instability.

And about data recovery: if two HDDs fail in a RAID5 array and you had
nr-spare-disks set to zero, I *think* there is no chance to recover your
data, because RAID5 protects you against at most one HDD failure.

If I'm wrong, please correct me. ;)


-- 

.----------------------------------------------------------------------------.
| Pozdrav / Best Wishes,     dsimic@urc.bl.ac.yu  | LL   The Choice of       |
| Dragan Simic                 RS.BA Hostmaster   | LL            GNU        |
| URC B.Luka / RSKoming.NET  System/Network Admin | LLLL i n u x  Generation |
`----------------------------------------------------------------------------'



* Re: recovery of hosed raid5 array
  2003-10-13 13:54       ` Dragan Simic
@ 2003-10-13 17:00         ` dean gaudet
  0 siblings, 0 replies; 12+ messages in thread
From: dean gaudet @ 2003-10-13 17:00 UTC (permalink / raw)
  To: Dragan Simic; +Cc: linux-raid

On Mon, 13 Oct 2003, Dragan Simic wrote:

> On Sun, 12 Oct 2003, dean gaudet wrote:
>
> > querying SMART shouldn't cause this to happen -- but i've seen it occur
> > with a promise controller and maxtor disks.  i used to query the SMART
> > data once a night just to have a log.  then i switched it to once every 5
> > minutes so i could graph the drive temperature... and when i went to once
> > every 5 minutes the system became unstable.  the kernel would randomly
> > lose the ability to talk to a disk.  the problem would go away after a
> > reboot.  i assume it was some sort of race condition.
>
> Just a small drop-in: I have a Promise FastTrak133 with two Maxtor HDDs
> attached to it, and running smartmontools every 5 minutes, didn't notice
> any troubles or signs of instability.

yeah i've had a hard time reproducing it outside of the production system
i had the troubles on.  that system was exceptionally busy though.  it was
also SMP.  i had both ultra100tx2 and ultra133tx2 in that box at the time.
(the problem could also have been the maxtor firmware -- they were older
80GB 7200rpm disks.)

-dean


* Re: recovery of hosed raid5 array
  2003-10-12 17:39     ` dean gaudet
  2003-10-12 18:04       ` Jason Lunz
  2003-10-13 13:54       ` Dragan Simic
@ 2003-10-26 19:40       ` Mike Fedyk
  2003-10-26 19:48         ` Jason Lunz
  2003-10-26 21:29         ` dean gaudet
  2 siblings, 2 replies; 12+ messages in thread
From: Mike Fedyk @ 2003-10-26 19:40 UTC (permalink / raw)
  To: dean gaudet; +Cc: Jason Lunz, linux-raid

On Sun, Oct 12, 2003 at 10:39:50AM -0700, dean gaudet wrote:
> 
> 
> On Sun, 12 Oct 2003, Jason Lunz wrote:
> 
> > What was foolish was me provoking /dev/hde by asking it to report
> > diagnostics with smartctl at the same time the array was rebuilding
> > /dev/hdg. Even if something _was_ wrong with hde, it wouldn't have
> > helped me to find out then during the rebuild. Had the resync completed,
> > I'd have all my data now and one dead disk.
> 
> querying SMART shouldn't cause this to happen -- but i've seen it occur
> with a promise controller and maxtor disks.  i used to query the SMART
> data once a night just to have a log.  then i switched it to once every 5
> minutes so i could graph the drive temperature... and when i went to once
> every 5 minutes the system became unstable.  the kernel would randomly
> lose the ability to talk to a disk.  the problem would go away after a
> reboot.  i assume it was some sort of race condition.

1.  Were they maxtor 160GB 8MB cache drives?

2.  Is there any package that will take one drive in a raid1/5 array
offline, and run badblocks on it?


* Re: recovery of hosed raid5 array
  2003-10-26 19:40       ` Mike Fedyk
@ 2003-10-26 19:48         ` Jason Lunz
  2003-10-26 21:29         ` dean gaudet
  1 sibling, 0 replies; 12+ messages in thread
From: Jason Lunz @ 2003-10-26 19:48 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: linux-raid

On Sun, Oct 26, 2003 at 11:40AM -0800, Mike Fedyk wrote:
> 1.  Were they maxtor 160GB 8MB cache drives?

Maxtor DiamondMax Plus 30G, model #53073H4, on a 
Promise 20262 pci ide card

luckily enough, I got all my data back. The "dead" drive turned out to
work well enough to read most of the data. By some miracle, the portion
of the drive with unrecoverable sectors was entirely in the 10% of the
disk that wasn't part of the raid partition.

> 2.  Is there any package that will take one drive in a raid1/5 array
> offline, and run badblocks on it?

package? That's a pretty short script to write, using mdadm and
badblocks.
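
The core of it would be something like this (device names are placeholders,
and badblocks runs its default read-only test):

	mdadm /dev/md0 --fail /dev/hdg1 --remove /dev/hdg1
	badblocks -sv /dev/hdg1
	mdadm /dev/md0 --add /dev/hdg1	# re-adding triggers a full resync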

Jason


* Re: recovery of hosed raid5 array
  2003-10-26 19:40       ` Mike Fedyk
  2003-10-26 19:48         ` Jason Lunz
@ 2003-10-26 21:29         ` dean gaudet
  1 sibling, 0 replies; 12+ messages in thread
From: dean gaudet @ 2003-10-26 21:29 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Jason Lunz, linux-raid

On Sun, 26 Oct 2003, Mike Fedyk wrote:

> On Sun, Oct 12, 2003 at 10:39:50AM -0700, dean gaudet wrote:
> >
> > querying SMART shouldn't cause this to happen -- but i've seen it occur
> > with a promise controller and maxtor disks.  i used to query the SMART
> > data once a night just to have a log.  then i switched it to once every 5
> > minutes so i could graph the drive temperature... and when i went to once
> > every 5 minutes the system became unstable.  the kernel would randomly
> > lose the ability to talk to a disk.  the problem would go away after a
> > reboot.  i assume it was some sort of race condition.
>
> 1.  Were they maxtor 160GB 8MB cache drives?

no they were 6L080J4 ... 80GB D740X.  attached to both ultra100tx2 and
ultra133tx.  this was also long enough ago that the promise driver was
still the single driver, not the newer driver.  who knows what the problem
really was.

-dean

