* Re: mdadm RAID5 array failure
@ 2007-02-09  3:15 jahammonds prost
  2007-02-09  3:26 ` Neil Brown
  0 siblings, 1 reply; 4+ messages in thread
From: jahammonds prost @ 2007-02-09  3:15 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

> mdadm -Af /dev/md0 should get it back for you. 

It did indeed... Thank you.

> But you really want to find out why it died.

Well, it looks like I have a bad section on hde, which got tickled as I was copying files onto it... As the rebuild progressed, and hit around 6%, it hit the same spot on the disk again, and locked the box up solid. I ended up setting speed_limit_min and speed_limit_max to 0 so that the rebuild didn't happen, activated my LVM volume groups, and mounted the first of the logical volumes. I've just copied off all the files on that LV, and tomorrow I'll get the other 2 done. I do have a spare drive in the array... any idea why it wasn't being activated when hde went offline?
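
For the record, the knobs and commands involved were roughly these (the volume group, LV name and mount point below are just placeholders):

    echo 0 > /proc/sys/dev/raid/speed_limit_min   # throttle the resync right down so it never gets going
    echo 0 > /proc/sys/dev/raid/speed_limit_max
    vgchange -ay                                  # activate the LVM volume groups sitting on md0
    mount -o ro /dev/MyVolGroup/lv01 /mnt/rescue  # mount the first LV read-only and copy the files off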

> What kernel version are you running?

Kernel is 2.6.17-1.2142.FC4, and mdadm is v1.11.0 (11 April 2005).

Am I right in assuming that the underlying RAID layer doesn't do any bad block handling?


Once again, thank you for your help.


Graham

----- Original Message ----
From: Neil Brown <neilb@suse.de>
To: jahammonds prost <gmitch64@yahoo.com>
Cc: linux-raid@vger.kernel.org
Sent: Wednesday, 7 February, 2007 10:57:47 PM
Subject: Re: mdadm RAID5 array failure


On Thursday February 8, gmitch64@yahoo.com wrote:

> I'm running an FC4 system. I was copying some files on to the server
> this weekend, and the server locked up hard, and I had to power
> off. I rebooted the server, and the array came up fine, but when I
> tried to fsck the filesystem, fsck just locked up at about 40%. I
> left it sitting there for 12 hours, hoping it was going to come
> back, but I had to power off the server again. When I now reboot the
> server, it is failing to mount my raid5 array.. 
>  
>       mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to start the array.

mdadm -Af /dev/md0
should get it back for you.  But you really want to find out why it
died.
Were there any kernel messages at the time of the first failure?
What kernel version are you running?

>  
> I've added the output from the various files/commands at the bottom...
> I am a little confused at the output.. According to /dev/hd[cgh],
> there is only 1 failed disk in the array, so why does it think that
> there are 3 failed disks in the array? 

You need to look at the 'Event' count.  md will look for the device
with the highest event count and reject anything with an event count 2
or more less than that.

NeilBrown


		


* Re: mdadm RAID5 array failure
  2007-02-09  3:15 mdadm RAID5 array failure jahammonds prost
@ 2007-02-09  3:26 ` Neil Brown
  0 siblings, 0 replies; 4+ messages in thread
From: Neil Brown @ 2007-02-09  3:26 UTC (permalink / raw)
  To: jahammonds prost; +Cc: linux-raid

On Thursday February 8, gmitch64@yahoo.com wrote:
> > mdadm -Af /dev/md0 should get it back for you. 
> 
> It did indeed... Thank you.
> 
> > But you really want to find out why it died.

Good!

> 
> Well, it looks like I have a bad section on hde, which got tickled
> as I was copying files onto it... As the rebuild progressed, and hit
> around 6%, it hit the same spot on the disk again, and locked the
> box up solid. I ended up setting speed_limit_min and speed_limit_max
> to 0 so that the rebuild didn't happen, activated my LVM volume
> groups, and mounted the first of the logical volumes. I've just
> copied off all the files on that LV, and tomorrow I'll get the other
> 2 done. I do have a spare drive in the array... any idea why it
> wasn't being activated when hde went offline? 

I would need to look at kernel logs to be sure of what was happening.
If the problem with the drive causes the drive controller to hang
(rather than return an error) then there is not much that the raid
layer can do.

If you do get any kernel logs when the machine hangs, or if you can
get something out with
  alt-sysrq-t
then I suspect the maintainer of the relevant driver would like to
know about it - testing error conditions in drives can be hard without
having the right sort of faulty drive....
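
(If the magic SysRq key isn't already enabled, something along these lines should do it on a 2.6 kernel built with CONFIG_MAGIC_SYSRQ:

    echo 1 > /proc/sys/kernel/sysrq   # enable the SysRq functions
    echo t > /proc/sysrq-trigger      # same effect as alt-sysrq-t: dump task states to the kernel log

The dump ends up in dmesg / /var/log/messages, assuming the box stays alive long enough to write it out.)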

NeilBrown


* Re: mdadm RAID5 array failure
  2007-02-08  3:36 jahammonds prost
@ 2007-02-08  3:57 ` Neil Brown
  0 siblings, 0 replies; 4+ messages in thread
From: Neil Brown @ 2007-02-08  3:57 UTC (permalink / raw)
  To: jahammonds prost; +Cc: linux-raid

On Thursday February 8, gmitch64@yahoo.com wrote:

> I'm running an FC4 system. I was copying some files on to the server
> this weekend, and the server locked up hard, and I had to power
> off. I rebooted the server, and the array came up fine, but when I
> tried to fsck the filesystem, fsck just locked up at about 40%. I
> left it sitting there for 12 hours, hoping it was going to come
> back, but I had to power off the server again. When I now reboot the
> server, it is failing to mount my raid5 array.. 
>  
>       mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to start the array.

mdadm -Af /dev/md0
should get it back for you.  But you really want to find out why it
died.
Were there any kernel messages at the time of the first failure?
What kernel version are you running?
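
Since /proc/mdstat shows md0 sitting there inactive, you may need to stop it first; roughly:

    mdadm -S /dev/md0    # stop the inactive, half-assembled array
    mdadm -Af /dev/md0   # force-assemble it from the devices listed in mdadm.conf
    cat /proc/mdstat     # check that md0 came up and whether the spare starts rebuilding

The -f tells mdadm to go ahead even though some members have stale event counts.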

>  
> I've added the output from the various files/commands at the bottom...
> I am a little confused at the output.. According to /dev/hd[cgh],
> there is only 1 failed disk in the array, so why does it think that
> there are 3 failed disks in the array? 

You need to look at the 'Event' count.  md will look for the device
with the highest event count and reject anything with an event count 2
or more less than that.
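
A quick way to compare them across all the members is something like:

    mdadm -E /dev/hd[bcdefgh] | grep -E '^/dev/|Events'

which prints each device name followed by its event count.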

NeilBrown


* mdadm RAID5 array failure
@ 2007-02-08  3:36 jahammonds prost
  2007-02-08  3:57 ` Neil Brown
  0 siblings, 1 reply; 4+ messages in thread
From: jahammonds prost @ 2007-02-08  3:36 UTC (permalink / raw)
  To: linux-raid

I'm running an FC4 system. I was copying some files on to the server this weekend, and the server locked up hard, and I had to power off. I rebooted the server, and the array came up fine, but when I tried to fsck the filesystem, fsck just locked up at about 40%. I left it sitting there for 12 hours, hoping it was going to come back, but I had to power off the server again. When I now reboot the server, it is failing to mount my raid5 array..
 
      mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to start the array.
 
I've added the output from the various files/commands at the bottom...
I am a little confused at the output.. According to /dev/hd[cgh], there is only 1 failed disk in the array, so why does it think that there are 3 failed disks in the array? It looks like there is only 1 failed disk – I got an error from SMARTD about it when I got the server back into multiuser mode, so I know there is an issue with the disk (Device: /dev/hde, 8 Offline uncorrectable sectors), but there are still enough disks to bring up the array, and for the spare disk to start rebuilding.
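
The SMART attributes themselves can be checked with smartmontools (assuming smartctl is installed); roughly:

    smartctl -H /dev/hde                                  # overall health assessment
    smartctl -A /dev/hde | grep -iE 'Pending|Uncorrect'   # pending / offline-uncorrectable sector counts
    smartctl -t long /dev/hde                             # start a long self-test; read the result later with: smartctl -l selftest /dev/hde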
 
I've spent the last couple of days googling around, and I can't seem to find much on how to recover a failed md array. Is there any way to get the array back and working? Unfortunately I don't have a backup of this array, and I'd really like to try and get the data back (there are 3 LVM logical volumes on it).
 
Thanks very much for any help.
 
 
Graham
 
 
 
My /etc/mdadm.conf looks like this
 
]# cat /etc/mdadm.conf
DEVICE /dev/hd*[a-z]
ARRAY /dev/md0 level=raid5 num-devices=6 UUID=96c7d78a:2113ea58:9dc237f1:79a60ddf
  
devices=/dev/hdh,/dev/hdg,/dev/hdf,/dev/hde,/dev/hdd,/dev/hdc,/dev/hdb
 
 
Looking at /proc/mdstat, I am getting this output
 
# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : inactive hdc[0] hdb[6] hdh[5] hdg[4] hdf[3] hde[2] hdd[1]
      1378888832 blocks super non-persistent
 
 
 
 
Here's the output when run on the device that some of the other disks think has failed...
 
# mdadm -E /dev/hde
/dev/hde:
          Magic : a92b4efc
        Version : 00.90.02
           UUID : 96c7d78a:2113ea58:9dc237f1:79a60ddf
  Creation Time : Wed Feb  1 17:10:39 2006
     Raid Level : raid5
   Raid Devices : 6
  Total Devices : 7
Preferred Minor : 0
 
    Update Time : Sun Feb  4 17:29:53 2007
          State : active
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1
       Checksum : dcab70d - correct
         Events : 0.840944
 
         Layout : left-symmetric
     Chunk Size : 128K
 
      Number   Major   Minor   RaidDevice State
this     2      33        0        2      active sync   /dev/hde
 
   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2      33        0        2      active sync   /dev/hde
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb
 
 
Running an mdadm -E on /dev/hd[bcgh] gives this,
 
 
      Number   Major   Minor   RaidDevice State
this     6       3       64        6      spare   /dev/hdb
 
   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2       0        0        2      faulty removed
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb
 
 
 
And running mdadm -E on /dev/hd[def]
 
      Number   Major   Minor   RaidDevice State
this     3      33       64        3      active sync   /dev/hdf
 
   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2      33        0        2      active sync   /dev/hde
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb
 
 
Looking at /var/log/messages, shows the following
 
Feb  6 12:36:42 file01bert kernel: md: bind<hdd>
Feb  6 12:36:42 file01bert kernel: md: bind<hde>
Feb  6 12:36:42 file01bert kernel: md: bind<hdf>
Feb  6 12:36:42 file01bert kernel: md: bind<hdg>
Feb  6 12:36:42 file01bert kernel: md: bind<hdh>
Feb  6 12:36:42 file01bert kernel: md: bind<hdb>
Feb  6 12:36:42 file01bert kernel: md: bind<hdc>
Feb  6 12:36:42 file01bert kernel: md: kicking non-fresh hdf from array!
Feb  6 12:36:42 file01bert kernel: md: unbind<hdf>
Feb  6 12:36:42 file01bert kernel: md: export_rdev(hdf)
Feb  6 12:36:42 file01bert kernel: md: kicking non-fresh hde from array!
Feb  6 12:36:42 file01bert kernel: md: unbind<hde>
Feb  6 12:36:42 file01bert kernel: md: export_rdev(hde)
Feb  6 12:36:42 file01bert kernel: md: kicking non-fresh hdd from array!
Feb  6 12:36:42 file01bert kernel: md: unbind<hdd>
Feb  6 12:36:42 file01bert kernel: md: export_rdev(hdd)
Feb  6 12:36:42 file01bert kernel: md: md0: raid array is not clean -- starting background reconstruction
Feb  6 12:36:42 file01bert kernel: raid5: device hdc operational as raid disk 0
Feb  6 12:36:42 file01bert kernel: raid5: device hdh operational as raid disk 5
Feb  6 12:36:42 file01bert kernel: raid5: device hdg operational as raid disk 4
Feb  6 12:36:42 file01bert kernel: raid5: not enough operational devices for md0 (3/6 failed)
Feb  6 12:36:42 file01bert kernel: RAID5 conf printout:
Feb  6 12:36:42 file01bert kernel:  --- rd:6 wd:3 fd:3
Feb  6 12:36:42 file01bert kernel:  disk 0, o:1, dev:hdc
Feb  6 12:36:42 file01bert kernel:  disk 4, o:1, dev:hdg
Feb  6 12:36:42 file01bert kernel:  disk 5, o:1, dev:hdh
Feb  6 12:36:42 file01bert kernel: raid5: failed to run raid set md0
Feb  6 12:36:42 file01bert kernel: md: pers->run() failed ...


	
	
		

