* Failed drive while converting raid5 to raid6, then a hard reboot
@ 2012-04-30 13:59 Hákon Gíslason
  2012-05-08 20:48 ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Hákon Gíslason @ 2012-04-30 13:59 UTC (permalink / raw)
  To: linux-raid

Hello,
I've been having frequent drive "failures": drives are reported
failed/bad and mdadm sends me an email telling me things went wrong,
but after a reboot or two they are perfectly fine again. I'm not sure
what it is, but this server is quite new and I think there might be
more behind it, such as bad memory or the motherboard (I've been
having other issues as well). I've had 4 drive "failures" this month,
all on different drives except for one, which "failed" twice, and all
were fixed with a reboot or rebuild (every drive reported bad by mdadm
passed an extensive SMART test).
Due to this, I decided to convert my raid5 array to a raid6 array
while I find the root cause of the problem.
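
For what it's worth, the SMART tests can be run roughly like this (a sketch
only; smartmontools is assumed, and /dev/sda stands in for whichever drive
had been flagged):

   smartctl -t long /dev/sda   # start an extended (long) self-test
   smartctl -a /dev/sda        # once it finishes, check attributes and the self-test log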

I started the conversion right after a drive failure & rebuild, but after
it had converted/reshaped approx. 4% (if I remember correctly, and it
was going really slowly, ~7500 minutes to completion), it reported
another drive bad, and the conversion to raid6 stopped (it said
"rebuilding", but the speed was 0K/sec and the time left was a few
million minutes).
After that happened, I tried to stop the array and reboot the server,
as I had done previously to get the reportedly "bad" drive working
again, but it wouldn't stop the array or reboot, nor could I
unmount it; it just hung whenever I tried to do anything with
/dev/md0. After trying to reboot a few times, I just killed the power
and restarted it. Admittedly this was probably not the best thing I
could have done at that point.

I have a backup of about 80% of the data on there; it's been a month since
the last complete backup (because I ran out of backup disk space).

So, the big question, can the array be activated, and can it complete
the conversion to raid6? And will I get my data back?
I hope the data can be rescued, and any help I can get would be much
appreciated!

I'm fairly new to raid in general, and have been using mdadm for about
a month now.
Here's some data:

root@axiom:~# mdadm --examine --scan
ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1
name=axiom.is:0


root@axiom:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
      7814054240 blocks super 1.2

root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
mdadm: /dev/md0 is already in use.

root@axiom:~# mdadm --stop /dev/md0
mdadm: stopped /dev/md0

root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
mdadm: Failed to restore critical section for reshape, sorry.
      Possibly you needed to specify the --backup-file

root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
--backup-file=/root/mdadm-backup-file
mdadm: Failed to restore critical section for reshape, sorry.

root@axiom:~# fdisk -l | grep 2000
Disk /dev/sda doesn't contain a valid partition table
Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes
Disk /dev/sde: 2000.4 GB, 2000398934016 bytes
Disk /dev/sdf: 2000.4 GB, 2000398934016 bytes

root@axiom:~# mdadm --examine /dev/sd{a,b,c,e,f}
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : cfedbfc1:feaee982:4e92ccf4:45e08ed1
           Name : axiom.is:0  (local to host axiom.is)
  Creation Time : Mon Apr  9 01:05:20 2012
     Raid Level : raid6
   Raid Devices : 5

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
     Array Size : 11721080448 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : b11a7424:fc470ea7:51ba6ea0:158c0ce6

  Reshape pos'n : 242343936 (231.12 GiB 248.16 GB)
     New Layout : left-symmetric

    Update Time : Sun Oct 14 15:20:06 2012
       Checksum : 76ecd244 - correct
         Events : 138274

         Layout : left-symmetric-6
     Chunk Size : 32K

   Device Role : Active device 3
   Array State : .AAAA ('A' == active, '.' == missing)
/dev/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x6
     Array UUID : cfedbfc1:feaee982:4e92ccf4:45e08ed1
           Name : axiom.is:0  (local to host axiom.is)
  Creation Time : Mon Apr  9 01:05:20 2012
     Raid Level : raid6
   Raid Devices : 5

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
     Array Size : 11721080448 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
Recovery Offset : 161546240 sectors
          State : active
    Device UUID : 8389f39f:cc7fa027:f10cf717:1d41d40b

  Reshape pos'n : 242343936 (231.12 GiB 248.16 GB)
     New Layout : left-symmetric

    Update Time : Sun Oct 14 15:20:06 2012
       Checksum : 19ef8090 - correct
         Events : 138274

         Layout : left-symmetric-6
     Chunk Size : 32K

   Device Role : Active device 4
   Array State : .AAAA ('A' == active, '.' == missing)
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : cfedbfc1:feaee982:4e92ccf4:45e08ed1
           Name : axiom.is:0  (local to host axiom.is)
  Creation Time : Mon Apr  9 01:05:20 2012
     Raid Level : raid6
   Raid Devices : 5

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
     Array Size : 11721080448 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : b2cec17f:e526b42e:9e69e46b:23be5163

  Reshape pos'n : 242343936 (231.12 GiB 248.16 GB)
     New Layout : left-symmetric

    Update Time : Sun Oct 14 15:20:06 2012
       Checksum : a29b468a - correct
         Events : 138274

         Layout : left-symmetric-6
     Chunk Size : 32K

   Device Role : Active device 1
   Array State : .AAAA ('A' == active, '.' == missing)
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : cfedbfc1:feaee982:4e92ccf4:45e08ed1
           Name : axiom.is:0  (local to host axiom.is)
  Creation Time : Mon Apr  9 01:05:20 2012
     Raid Level : raid6
   Raid Devices : 5

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
     Array Size : 11721080448 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 21c799cd:58be3156:6830865b:fa984134

  Reshape pos'n : 242343936 (231.12 GiB 248.16 GB)
     New Layout : left-symmetric

    Update Time : Sun Oct 14 15:20:06 2012
       Checksum : d882780e - correct
         Events : 138274

         Layout : left-symmetric-6
     Chunk Size : 32K

   Device Role : Active device 2
   Array State : .AAAA ('A' == active, '.' == missing)
/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : cfedbfc1:feaee982:4e92ccf4:45e08ed1
           Name : axiom.is:0  (local to host axiom.is)
  Creation Time : Mon Apr  9 01:05:20 2012
     Raid Level : raid6
   Raid Devices : 5

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
     Array Size : 11721080448 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 8b043488:8379f327:5f00e0fe:6a1e0bee

  Reshape pos'n : 242343936 (231.12 GiB 248.16 GB)
     New Layout : left-symmetric

    Update Time : Sat Apr 28 22:57:36 2012
       Checksum : c122639f - correct
         Events : 138241

         Layout : left-symmetric-6
     Chunk Size : 32K

   Device Role : Active device 0
   Array State : AAAAA ('A' == active, '.' == missing)
--
Hákon G.

* Re: Failed drive while converting raid5 to raid6, then a hard reboot
  2012-04-30 13:59 Failed drive while converting raid5 to raid6, then a hard reboot Hákon Gíslason
@ 2012-05-08 20:48 ` NeilBrown
  2012-05-08 22:19   ` Hákon Gíslason
  0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2012-05-08 20:48 UTC (permalink / raw)
  To: Hákon Gíslason; +Cc: linux-raid

On Mon, 30 Apr 2012 13:59:56 +0000 Hákon Gíslason <hakon.gislason@gmail.com>
wrote:

> Hello,
> I've been having frequent drive "failures", as in, they are reported
> failed/bad and mdadm sends me an email telling me things went wrong,
> etc... but after a reboot or two, they are perfectly fine again. I'm
> not sure what it is, but this server is quite new and I think there
> might be more behind it, bad memory or the motherboard (I've been
> having other issues as well). I've had 4 drive "failures" in this
> month, all different drives except for one, which "failed" twice, and
> all have been fixed with a reboot or rebuild (all drives reported bad
> by mdadm passed an extensive SMART test).
> Due to this, I decided to convert my raid5 array to a raid6 array
> while I find the root cause of the problem.
> 
> I started the conversion right after a drive failure & rebuild, but as
> it had converted/reshaped aprox. 4%(if I remember correctly, and it
> was going really slowly, ~7500 minutes to completion), it reported
> another drive bad, and the conversion to raid6 stopped (it said
> "rebuilding", but the speed was 0K/sec and the time left was a few
> million minutes.
> After that happened, I tried to stop the array and reboot the server,
> as I had done previously to get the reportedly "bad" drive working
> again, but It wouldn't stop the array or reboot, neither could I
> unmount it, it just hung whenever I tried to do something with
> /dev/md0. After trying to reboot a few times, I just killed the power
> and re-started it. Admittedly this was probably not the best thing I
> could have done at that point.
> 
> I have backup of ca. 80% of the data on there, it's been a month since
> the last complete backup (because I ran out of backup disk space).
> 
> So, the big question, can the array be activated, and can it complete
> the conversion to raid6? And will I get my data back?
> I hope the data can be rescued, and any help I can get would be much
> appreciated!
> 
> I'm fairly new to raid in general, and have been using mdadm for about
> a month now.
> Here's some data:
> 
> root@axiom:~# mdadm --examine --scan
> ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1
> name=axiom.is:0
> 
> 
> root@axiom:~# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
>       7814054240 blocks super 1.2
> 
> root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> mdadm: /dev/md0 is already in use.
> 
> root@axiom:~# mdadm --stop /dev/md0
> mdadm: stopped /dev/md0
> 
> root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> mdadm: Failed to restore critical section for reshape, sorry.
>       Possibly you needed to specify the --backup-file
> 
> root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> --backup-file=/root/mdadm-backup-file
> mdadm: Failed to restore critical section for reshape, sorry.

What version of mdadm are you using?

I suggest getting a newer one (I'm about to release 3.2.4, but 3.2.3 should
be fine) and, if that alone doesn't help, adding the "--invalid-backup" option.
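
For example (only a sketch, reusing the backup-file path from your
transcript; adjust the device list or keep using --scan as you did before):

   mdadm --assemble --scan --force --run /dev/md0 \
         --backup-file=/root/mdadm-backup-file --invalid-backup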

However, I very strongly suggest you try to resolve the problem which is
causing your drives to fail.  Until you resolve that it will keep happening,
and having it happen repeatedly during the (slow) reshape process would not
be good.

Maybe plug the drives into another computer, or another controller, while the
reshape runs?

NeilBrown




* Re: Failed drive while converting raid5 to raid6, then a hard reboot
  2012-05-08 20:48 ` NeilBrown
@ 2012-05-08 22:19   ` Hákon Gíslason
  2012-05-08 23:03     ` Hákon Gíslason
  2012-05-08 23:21     ` NeilBrown
  0 siblings, 2 replies; 9+ messages in thread
From: Hákon Gíslason @ 2012-05-08 22:19 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Thank you for the reply, Neil
I was using mdadm from the package manager in Debian stable first
(v3.1.4), but after the constant drive failures I upgraded to the
latest one (3.2.3).
I've come to the conclusion that the drives are failing either because
they are "green" drives whose power-saving features are causing them to
be "disconnected", or because the cables that came with the motherboard
aren't good enough. I'm not 100% sure about either, but at the moment
these seem the likely causes. It could also be incompatible hardware or
the kernel that I'm using (Proxmox Debian kernel: 2.6.32-11-pve).
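
If drive power management does turn out to be the culprit, the sort of
thing that could be tried is sketched below (this assumes hdparm is
installed, the drives honour APM, and host0/sda are only example names):

   hdparm -B 255 /dev/sda   # APM level 255 = drive power management off
   echo max_performance > /sys/class/scsi_host/host0/link_power_management_policy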

I got the array assembled (thank you), but what about the raid5 to
raid6 conversion? Do I have to complete it for this to work, or will
mdadm know what to do? Can I cancel (revert) the conversion and get
the array back to raid5?

/proc/mdstat contains:

root@axiom:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active (read-only) raid6 sdc[6] sdb[5] sda[4] sdd[7]
      5860540224 blocks super 1.2 level 6, 32k chunk, algorithm 18 [5/3] [_UUU_]

unused devices: <none>

If I try to mount the volume group on the array the kernel panics, and
the system hangs. Is that related to the incomplete conversion?

Thanks,
--
Hákon G.



On 8 May 2012 20:48, NeilBrown <neilb@suse.de> wrote:
>
> On Mon, 30 Apr 2012 13:59:56 +0000 Hákon Gíslason
> <hakon.gislason@gmail.com>
> wrote:
>
> > Hello,
> > I've been having frequent drive "failures", as in, they are reported
> > failed/bad and mdadm sends me an email telling me things went wrong,
> > etc... but after a reboot or two, they are perfectly fine again. I'm
> > not sure what it is, but this server is quite new and I think there
> > might be more behind it, bad memory or the motherboard (I've been
> > having other issues as well). I've had 4 drive "failures" in this
> > month, all different drives except for one, which "failed" twice, and
> > all have been fixed with a reboot or rebuild (all drives reported bad
> > by mdadm passed an extensive SMART test).
> > Due to this, I decided to convert my raid5 array to a raid6 array
> > while I find the root cause of the problem.
> >
> > I started the conversion right after a drive failure & rebuild, but as
> > it had converted/reshaped aprox. 4%(if I remember correctly, and it
> > was going really slowly, ~7500 minutes to completion), it reported
> > another drive bad, and the conversion to raid6 stopped (it said
> > "rebuilding", but the speed was 0K/sec and the time left was a few
> > million minutes.
> > After that happened, I tried to stop the array and reboot the server,
> > as I had done previously to get the reportedly "bad" drive working
> > again, but It wouldn't stop the array or reboot, neither could I
> > unmount it, it just hung whenever I tried to do something with
> > /dev/md0. After trying to reboot a few times, I just killed the power
> > and re-started it. Admittedly this was probably not the best thing I
> > could have done at that point.
> >
> > I have backup of ca. 80% of the data on there, it's been a month since
> > the last complete backup (because I ran out of backup disk space).
> >
> > So, the big question, can the array be activated, and can it complete
> > the conversion to raid6? And will I get my data back?
> > I hope the data can be rescued, and any help I can get would be much
> > appreciated!
> >
> > I'm fairly new to raid in general, and have been using mdadm for about
> > a month now.
> > Here's some data:
> >
> > root@axiom:~# mdadm --examine --scan
> > ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1
> > name=axiom.is:0
> >
> >
> > root@axiom:~# cat /proc/mdstat
> > Personalities : [raid6] [raid5] [raid4]
> > md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
> >       7814054240 blocks super 1.2
> >
> > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> > mdadm: /dev/md0 is already in use.
> >
> > root@axiom:~# mdadm --stop /dev/md0
> > mdadm: stopped /dev/md0
> >
> > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> > mdadm: Failed to restore critical section for reshape, sorry.
> >       Possibly you needed to specify the --backup-file
> >
> > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> > --backup-file=/root/mdadm-backup-file
> > mdadm: Failed to restore critical section for reshape, sorry.
>
> What version of mdadm are you using?
>
> I suggest getting a newer one (I'm about to release 3.2.4, but 3.2.3
> should
> be fine) and if just that doesn't help, add the "--invalid-backup" option.
>
> However I very strongly suggest you try to resolve the problem which is
> causing your drives to fail.  Until you resolve that it will keep
> happening
> and having it happen repeatly during the (slow) reshape process would not
> be
> good.
>
> Maybe plug the drives into another computer, or another controller, while
> the
> reshape runs?
>
> NeilBrown
>
>

* Re: Failed drive while converting raid5 to raid6, then a hard reboot
  2012-05-08 22:19   ` Hákon Gíslason
@ 2012-05-08 23:03     ` Hákon Gíslason
  2012-05-08 23:21     ` NeilBrown
  1 sibling, 0 replies; 9+ messages in thread
From: Hákon Gíslason @ 2012-05-08 23:03 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Forgot this: http://pastebin.ubuntu.com/976915/
--
Hákon G.


On 8 May 2012 22:19, Hákon Gíslason <hakon.gislason@gmail.com> wrote:
> Thank you for the reply, Neil
> I was using mdadm from the package manager in Debian stable first
> (v3.1.4), but after the constant drive failures I upgraded to the
> latest one (3.2.3).
> I've come to the conclusion that the drives are either failing because
> they are "green" drives, and might have power-saving features that are
> causing them to be "disconnected", or that the cables that came with
> the motherboard aren't good enough. I'm not 100% sure about either,
> but at the moment these seem likely causes. It could be incompatible
> hardware or the kernel that I'm using (proxmox debian kernel:
> 2.6.32-11-pve).
>
> I got the array assembled (thank you), but what about the raid5 to
> raid6 conversion? Do I have to complete it for this to work, or will
> mdadm know what to do? Can I cancel (revert) the conversion and get
> the array back to raid5?
>
> /proc/mdstat contains:
>
> root@axiom:~# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active (read-only) raid6 sdc[6] sdb[5] sda[4] sdd[7]
>      5860540224 blocks super 1.2 level 6, 32k chunk, algorithm 18 [5/3] [_UUU_]
>
> unused devices: <none>
>
> If I try to mount the volume group on the array the kernel panics, and
> the system hangs. Is that related to the incomplete conversion?
>
> Thanks,
> --
> Hákon G.
>
>
>
> On 8 May 2012 20:48, NeilBrown <neilb@suse.de> wrote:
>>
>> On Mon, 30 Apr 2012 13:59:56 +0000 Hákon Gíslason
>> <hakon.gislason@gmail.com>
>> wrote:
>>
>> > Hello,
>> > I've been having frequent drive "failures", as in, they are reported
>> > failed/bad and mdadm sends me an email telling me things went wrong,
>> > etc... but after a reboot or two, they are perfectly fine again. I'm
>> > not sure what it is, but this server is quite new and I think there
>> > might be more behind it, bad memory or the motherboard (I've been
>> > having other issues as well). I've had 4 drive "failures" in this
>> > month, all different drives except for one, which "failed" twice, and
>> > all have been fixed with a reboot or rebuild (all drives reported bad
>> > by mdadm passed an extensive SMART test).
>> > Due to this, I decided to convert my raid5 array to a raid6 array
>> > while I find the root cause of the problem.
>> >
>> > I started the conversion right after a drive failure & rebuild, but as
>> > it had converted/reshaped aprox. 4%(if I remember correctly, and it
>> > was going really slowly, ~7500 minutes to completion), it reported
>> > another drive bad, and the conversion to raid6 stopped (it said
>> > "rebuilding", but the speed was 0K/sec and the time left was a few
>> > million minutes.
>> > After that happened, I tried to stop the array and reboot the server,
>> > as I had done previously to get the reportedly "bad" drive working
>> > again, but It wouldn't stop the array or reboot, neither could I
>> > unmount it, it just hung whenever I tried to do something with
>> > /dev/md0. After trying to reboot a few times, I just killed the power
>> > and re-started it. Admittedly this was probably not the best thing I
>> > could have done at that point.
>> >
>> > I have backup of ca. 80% of the data on there, it's been a month since
>> > the last complete backup (because I ran out of backup disk space).
>> >
>> > So, the big question, can the array be activated, and can it complete
>> > the conversion to raid6? And will I get my data back?
>> > I hope the data can be rescued, and any help I can get would be much
>> > appreciated!
>> >
>> > I'm fairly new to raid in general, and have been using mdadm for about
>> > a month now.
>> > Here's some data:
>> >
>> > root@axiom:~# mdadm --examine --scan
>> > ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1
>> > name=axiom.is:0
>> >
>> >
>> > root@axiom:~# cat /proc/mdstat
>> > Personalities : [raid6] [raid5] [raid4]
>> > md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
>> >       7814054240 blocks super 1.2
>> >
>> > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>> > mdadm: /dev/md0 is already in use.
>> >
>> > root@axiom:~# mdadm --stop /dev/md0
>> > mdadm: stopped /dev/md0
>> >
>> > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>> > mdadm: Failed to restore critical section for reshape, sorry.
>> >       Possibly you needed to specify the --backup-file
>> >
>> > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>> > --backup-file=/root/mdadm-backup-file
>> > mdadm: Failed to restore critical section for reshape, sorry.
>>
>> What version of mdadm are you using?
>>
>> I suggest getting a newer one (I'm about to release 3.2.4, but 3.2.3
>> should
>> be fine) and if just that doesn't help, add the "--invalid-backup" option.
>>
>> However I very strongly suggest you try to resolve the problem which is
>> causing your drives to fail.  Until you resolve that it will keep
>> happening
>> and having it happen repeatly during the (slow) reshape process would not
>> be
>> good.
>>
>> Maybe plug the drives into another computer, or another controller, while
>> the
>> reshape runs?
>>
>> NeilBrown
>>
>>

* Re: Failed drive while converting raid5 to raid6, then a hard reboot
  2012-05-08 22:19   ` Hákon Gíslason
  2012-05-08 23:03     ` Hákon Gíslason
@ 2012-05-08 23:21     ` NeilBrown
  2012-05-08 23:55       ` Hákon Gíslason
  1 sibling, 1 reply; 9+ messages in thread
From: NeilBrown @ 2012-05-08 23:21 UTC (permalink / raw)
  To: Hákon Gíslason; +Cc: linux-raid

On Tue, 8 May 2012 22:19:49 +0000 Hákon Gíslason <hakon.gislason@gmail.com>
wrote:

> Thank you for the reply, Neil
> I was using mdadm from the package manager in Debian stable first
> (v3.1.4), but after the constant drive failures I upgraded to the
> latest one (3.2.3).
> I've come to the conclusion that the drives are either failing because
> they are "green" drives, and might have power-saving features that are
> causing them to be "disconnected", or that the cables that came with
> the motherboard aren't good enough. I'm not 100% sure about either,
> but at the moment these seem likely causes. It could be incompatible
> hardware or the kernel that I'm using (proxmox debian kernel:
> 2.6.32-11-pve).
> 
> I got the array assembled (thank you), but what about the raid5 to
> raid6 conversion? Do I have to complete it for this to work, or will
> mdadm know what to do? Can I cancel (revert) the conversion and get
> the array back to raid5?
> 
> /proc/mdstat contains:
> 
> root@axiom:~# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active (read-only) raid6 sdc[6] sdb[5] sda[4] sdd[7]
>       5860540224 blocks super 1.2 level 6, 32k chunk, algorithm 18 [5/3] [_UUU_]
> 
> unused devices: <none>
> 
> If I try to mount the volume group on the array the kernel panics, and
> the system hangs. Is that related to the incomplete conversion?

The array should be part way through the conversion.  If you
   mdadm -E /dev/sda
it should report something like "Reshape Position : XXXX" indicating
how far along it is.
The reshape will not restart while the array is read-only.  Once you make it
writable it will automatically restart the reshape from where it left off.

The kernel panic is because the array is read-only and the filesystem tries
to write to it.  I think that is fixed in more recent kernels (i.e. ext4
refuses to mount rather than trying and crashing).

So you should just be able to "mdadm --read-write /dev/md0" to make the array
writable, and then continue using it ... until another device fails.

Reverting the reshape is not currently possible.  Maybe it will be with Linux
3.5 and mdadm-3.3, but that is all months away.

I would recommend an "fsck -n /dev/md0" first and if that seems mostly OK,
and if "mdadm -E /dev/sda" reports the "Reshape Position" as expected, then
make the array read-write, mount it, and back up any important data.
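
Sketched out, assuming the filesystem sits directly on /dev/md0 (if LVM is
on top, as the mention of a volume group suggests, point the fsck and mount
at the logical volume instead; /mnt is only an example mount point):

   fsck -n /dev/md0                      # read-only check, changes nothing
   mdadm -E /dev/sda | grep -i reshape   # confirm the reshape position looks sane
   mdadm --read-write /dev/md0           # the reshape resumes from here
   mount /dev/md0 /mnt                   # then copy off the important data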

NeilBrown


> 
> Thanks,
> --
> Hákon G.
> 
> 
> 
> On 8 May 2012 20:48, NeilBrown <neilb@suse.de> wrote:
> >
> > On Mon, 30 Apr 2012 13:59:56 +0000 Hákon Gíslason
> > <hakon.gislason@gmail.com>
> > wrote:
> >
> > > Hello,
> > > I've been having frequent drive "failures", as in, they are reported
> > > failed/bad and mdadm sends me an email telling me things went wrong,
> > > etc... but after a reboot or two, they are perfectly fine again. I'm
> > > not sure what it is, but this server is quite new and I think there
> > > might be more behind it, bad memory or the motherboard (I've been
> > > having other issues as well). I've had 4 drive "failures" in this
> > > month, all different drives except for one, which "failed" twice, and
> > > all have been fixed with a reboot or rebuild (all drives reported bad
> > > by mdadm passed an extensive SMART test).
> > > Due to this, I decided to convert my raid5 array to a raid6 array
> > > while I find the root cause of the problem.
> > >
> > > I started the conversion right after a drive failure & rebuild, but as
> > > it had converted/reshaped aprox. 4%(if I remember correctly, and it
> > > was going really slowly, ~7500 minutes to completion), it reported
> > > another drive bad, and the conversion to raid6 stopped (it said
> > > "rebuilding", but the speed was 0K/sec and the time left was a few
> > > million minutes.
> > > After that happened, I tried to stop the array and reboot the server,
> > > as I had done previously to get the reportedly "bad" drive working
> > > again, but It wouldn't stop the array or reboot, neither could I
> > > unmount it, it just hung whenever I tried to do something with
> > > /dev/md0. After trying to reboot a few times, I just killed the power
> > > and re-started it. Admittedly this was probably not the best thing I
> > > could have done at that point.
> > >
> > > I have backup of ca. 80% of the data on there, it's been a month since
> > > the last complete backup (because I ran out of backup disk space).
> > >
> > > So, the big question, can the array be activated, and can it complete
> > > the conversion to raid6? And will I get my data back?
> > > I hope the data can be rescued, and any help I can get would be much
> > > appreciated!
> > >
> > > I'm fairly new to raid in general, and have been using mdadm for about
> > > a month now.
> > > Here's some data:
> > >
> > > root@axiom:~# mdadm --examine --scan
> > > ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1
> > > name=axiom.is:0
> > >
> > >
> > > root@axiom:~# cat /proc/mdstat
> > > Personalities : [raid6] [raid5] [raid4]
> > > md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
> > >       7814054240 blocks super 1.2
> > >
> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> > > mdadm: /dev/md0 is already in use.
> > >
> > > root@axiom:~# mdadm --stop /dev/md0
> > > mdadm: stopped /dev/md0
> > >
> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> > > mdadm: Failed to restore critical section for reshape, sorry.
> > >       Possibly you needed to specify the --backup-file
> > >
> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> > > --backup-file=/root/mdadm-backup-file
> > > mdadm: Failed to restore critical section for reshape, sorry.
> >
> > What version of mdadm are you using?
> >
> > I suggest getting a newer one (I'm about to release 3.2.4, but 3.2.3
> > should
> > be fine) and if just that doesn't help, add the "--invalid-backup" option.
> >
> > However I very strongly suggest you try to resolve the problem which is
> > causing your drives to fail.  Until you resolve that it will keep
> > happening
> > and having it happen repeatly during the (slow) reshape process would not
> > be
> > good.
> >
> > Maybe plug the drives into another computer, or another controller, while
> > the
> > reshape runs?
> >
> > NeilBrown
> >
> >



* Re: Failed drive while converting raid5 to raid6, then a hard reboot
  2012-05-08 23:21     ` NeilBrown
@ 2012-05-08 23:55       ` Hákon Gíslason
  2012-05-09  0:20         ` Hákon Gíslason
  0 siblings, 1 reply; 9+ messages in thread
From: Hákon Gíslason @ 2012-05-08 23:55 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Thank you very much!
It's currently rebuilding; I'll attempt to mount the volume once the
rebuild completes. But before that, I'm going to image all the disks to my
friend's array, just to be safe. After that, back up everything.
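
For the imaging step, something along these lines per member disk, with the
array stopped (a sketch assuming GNU ddrescue is installed and /mnt/backup
has room for a full 2 TB image per drive):

   ddrescue /dev/sda /mnt/backup/sda.img /mnt/backup/sda.map   # image plus map/log file
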
Again, thank you for your help!
--
Hákon G.


On 8 May 2012 23:21, NeilBrown <neilb@suse.de> wrote:
> On Tue, 8 May 2012 22:19:49 +0000 Hákon Gíslason <hakon.gislason@gmail.com>
> wrote:
>
>> Thank you for the reply, Neil
>> I was using mdadm from the package manager in Debian stable first
>> (v3.1.4), but after the constant drive failures I upgraded to the
>> latest one (3.2.3).
>> I've come to the conclusion that the drives are either failing because
>> they are "green" drives, and might have power-saving features that are
>> causing them to be "disconnected", or that the cables that came with
>> the motherboard aren't good enough. I'm not 100% sure about either,
>> but at the moment these seem likely causes. It could be incompatible
>> hardware or the kernel that I'm using (proxmox debian kernel:
>> 2.6.32-11-pve).
>>
>> I got the array assembled (thank you), but what about the raid5 to
>> raid6 conversion? Do I have to complete it for this to work, or will
>> mdadm know what to do? Can I cancel (revert) the conversion and get
>> the array back to raid5?
>>
>> /proc/mdstat contains:
>>
>> root@axiom:~# cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active (read-only) raid6 sdc[6] sdb[5] sda[4] sdd[7]
>>       5860540224 blocks super 1.2 level 6, 32k chunk, algorithm 18 [5/3] [_UUU_]
>>
>> unused devices: <none>
>>
>> If I try to mount the volume group on the array the kernel panics, and
>> the system hangs. Is that related to the incomplete conversion?
>
> The array should be part way through the conversion.  If you
>   mdadm -E /dev/sda
> it should report something like "Reshape Position : XXXX" indicating
> how far along it is.
> The reshape will not restart while the array is read-only.  Once you make it
> writeable it will automatically restart the reshape from where it is up to.
>
> The kernel panic is because the array is read-only and the filesystem tries
> to write to it.  I think that is fixed in more recent kernels (i.e. ext4
> refuses to mount rather than trying and crashing).
>
> So you should just be able to "mdadm --read-write /dev/md0" to make the array
> writable, and then continue using it ... until another device fails.
>
> Reverting the reshape is not currently possible.  Maybe it will be with Linux
> 3.5 and mdadm-3.3, but that is all months away.
>
> I would recommend an "fsck -n /dev/md0" first and if that seems mostly OK,
> and if "mdadm -E /dev/sda" reports the "Reshape Position" as expected, then
> make the array read-write, mount it, and backup any important data.
>
> NeilBrown
>
>
>>
>> Thanks,
>> --
>> Hákon G.
>>
>>
>>
>> On 8 May 2012 20:48, NeilBrown <neilb@suse.de> wrote:
>> >
>> > On Mon, 30 Apr 2012 13:59:56 +0000 Hákon Gíslason
>> > <hakon.gislason@gmail.com>
>> > wrote:
>> >
>> > > Hello,
>> > > I've been having frequent drive "failures", as in, they are reported
>> > > failed/bad and mdadm sends me an email telling me things went wrong,
>> > > etc... but after a reboot or two, they are perfectly fine again. I'm
>> > > not sure what it is, but this server is quite new and I think there
>> > > might be more behind it, bad memory or the motherboard (I've been
>> > > having other issues as well). I've had 4 drive "failures" in this
>> > > month, all different drives except for one, which "failed" twice, and
>> > > all have been fixed with a reboot or rebuild (all drives reported bad
>> > > by mdadm passed an extensive SMART test).
>> > > Due to this, I decided to convert my raid5 array to a raid6 array
>> > > while I find the root cause of the problem.
>> > >
>> > > I started the conversion right after a drive failure & rebuild, but as
>> > > it had converted/reshaped aprox. 4%(if I remember correctly, and it
>> > > was going really slowly, ~7500 minutes to completion), it reported
>> > > another drive bad, and the conversion to raid6 stopped (it said
>> > > "rebuilding", but the speed was 0K/sec and the time left was a few
>> > > million minutes.
>> > > After that happened, I tried to stop the array and reboot the server,
>> > > as I had done previously to get the reportedly "bad" drive working
>> > > again, but It wouldn't stop the array or reboot, neither could I
>> > > unmount it, it just hung whenever I tried to do something with
>> > > /dev/md0. After trying to reboot a few times, I just killed the power
>> > > and re-started it. Admittedly this was probably not the best thing I
>> > > could have done at that point.
>> > >
>> > > I have backup of ca. 80% of the data on there, it's been a month since
>> > > the last complete backup (because I ran out of backup disk space).
>> > >
>> > > So, the big question, can the array be activated, and can it complete
>> > > the conversion to raid6? And will I get my data back?
>> > > I hope the data can be rescued, and any help I can get would be much
>> > > appreciated!
>> > >
>> > > I'm fairly new to raid in general, and have been using mdadm for about
>> > > a month now.
>> > > Here's some data:
>> > >
>> > > root@axiom:~# mdadm --examine --scan
>> > > ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1
>> > > name=axiom.is:0
>> > >
>> > >
>> > > root@axiom:~# cat /proc/mdstat
>> > > Personalities : [raid6] [raid5] [raid4]
>> > > md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
>> > >       7814054240 blocks super 1.2
>> > >
>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>> > > mdadm: /dev/md0 is already in use.
>> > >
>> > > root@axiom:~# mdadm --stop /dev/md0
>> > > mdadm: stopped /dev/md0
>> > >
>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>> > > mdadm: Failed to restore critical section for reshape, sorry.
>> > >       Possibly you needed to specify the --backup-file
>> > >
>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>> > > --backup-file=/root/mdadm-backup-file
>> > > mdadm: Failed to restore critical section for reshape, sorry.
>> >
>> > What version of mdadm are you using?
>> >
>> > I suggest getting a newer one (I'm about to release 3.2.4, but 3.2.3
>> > should
>> > be fine) and if just that doesn't help, add the "--invalid-backup" option.
>> >
>> > However I very strongly suggest you try to resolve the problem which is
>> > causing your drives to fail.  Until you resolve that it will keep
>> > happening
>> > and having it happen repeatly during the (slow) reshape process would not
>> > be
>> > good.
>> >
>> > Maybe plug the drives into another computer, or another controller, while
>> > the
>> > reshape runs?
>> >
>> > NeilBrown
>> >
>> >
>

* Re: Failed drive while converting raid5 to raid6, then a hard reboot
  2012-05-08 23:55       ` Hákon Gíslason
@ 2012-05-09  0:20         ` Hákon Gíslason
  2012-05-09  0:46           ` Hákon Gíslason
  2012-05-09  0:47           ` NeilBrown
  0 siblings, 2 replies; 9+ messages in thread
From: Hákon Gíslason @ 2012-05-09  0:20 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi again. I thought the drives would last long enough to complete the
reshape: I assembled the array, it started reshaping, I went for a
shower, and came back to this: http://pastebin.ubuntu.com/976993/

The logs show the same as when the other drives failed:
May  8 23:58:26 axiom kernel: ata4: hard resetting link
May  8 23:58:32 axiom kernel: ata4: link is slow to respond, please be
patient (ready=0)
May  8 23:58:37 axiom kernel: ata4: hard resetting link
May  8 23:58:42 axiom kernel: ata4: link is slow to respond, please be
patient (ready=0)
May  8 23:58:47 axiom kernel: ata4: hard resetting link
May  8 23:58:52 axiom kernel: ata4: link is slow to respond, please be
patient (ready=0)
May  8 23:59:22 axiom kernel: ata4: limiting SATA link speed to 1.5 Gbps
May  8 23:59:22 axiom kernel: ata4: hard resetting link
May  8 23:59:27 axiom kernel: ata4.00: disabled
May  8 23:59:27 axiom kernel: ata4: EH complete
May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Unhandled error code
May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] CDB: Write(10): 2a 00
00 00 00 08 00 00 02 00
May  8 23:59:27 axiom kernel: md: super_written gets error=-5, uptodate=0
May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Unhandled error code
May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] CDB: Read(10): 28 00
0a 9d cb 00 00 00 40 00
May  8 23:59:27 axiom kernel: md: md0: reshape done.

What course of action do you suggest I take now?

--
Hákon G.


On 8 May 2012 23:55, Hákon Gíslason <hakon.gislason@gmail.com> wrote:
> Thank you very much!
> It's currently rebuilding, I'll make an attempt to mount the volume
> once it completes the build. But before that, I'm going to image all
> the disks to my friends array, just to be safe. After that, backup
> everything.
> Again, thank you for your help!
> --
> Hákon G.
>
>
> On 8 May 2012 23:21, NeilBrown <neilb@suse.de> wrote:
>> On Tue, 8 May 2012 22:19:49 +0000 Hákon Gíslason <hakon.gislason@gmail.com>
>> wrote:
>>
>>> Thank you for the reply, Neil
>>> I was using mdadm from the package manager in Debian stable first
>>> (v3.1.4), but after the constant drive failures I upgraded to the
>>> latest one (3.2.3).
>>> I've come to the conclusion that the drives are either failing because
>>> they are "green" drives, and might have power-saving features that are
>>> causing them to be "disconnected", or that the cables that came with
>>> the motherboard aren't good enough. I'm not 100% sure about either,
>>> but at the moment these seem likely causes. It could be incompatible
>>> hardware or the kernel that I'm using (proxmox debian kernel:
>>> 2.6.32-11-pve).
>>>
>>> I got the array assembled (thank you), but what about the raid5 to
>>> raid6 conversion? Do I have to complete it for this to work, or will
>>> mdadm know what to do? Can I cancel (revert) the conversion and get
>>> the array back to raid5?
>>>
>>> /proc/mdstat contains:
>>>
>>> root@axiom:~# cat /proc/mdstat
>>> Personalities : [raid6] [raid5] [raid4]
>>> md0 : active (read-only) raid6 sdc[6] sdb[5] sda[4] sdd[7]
>>>       5860540224 blocks super 1.2 level 6, 32k chunk, algorithm 18 [5/3] [_UUU_]
>>>
>>> unused devices: <none>
>>>
>>> If I try to mount the volume group on the array the kernel panics, and
>>> the system hangs. Is that related to the incomplete conversion?
>>
>> The array should be part way through the conversion.  If you
>>   mdadm -E /dev/sda
>> it should report something like "Reshape Position : XXXX" indicating
>> how far along it is.
>> The reshape will not restart while the array is read-only.  Once you make it
>> writeable it will automatically restart the reshape from where it is up to.
>>
>> The kernel panic is because the array is read-only and the filesystem tries
>> to write to it.  I think that is fixed in more recent kernels (i.e. ext4
>> refuses to mount rather than trying and crashing).
>>
>> So you should just be able to "mdadm --read-write /dev/md0" to make the array
>> writable, and then continue using it ... until another device fails.
>>
>> Reverting the reshape is not currently possible.  Maybe it will be with Linux
>> 3.5 and mdadm-3.3, but that is all months away.
>>
>> I would recommend an "fsck -n /dev/md0" first and if that seems mostly OK,
>> and if "mdadm -E /dev/sda" reports the "Reshape Position" as expected, then
>> make the array read-write, mount it, and backup any important data.
>>
>> NeilBrown
>>
>>
>>>
>>> Thanks,
>>> --
>>> Hákon G.
>>>
>>>
>>>
>>> On 8 May 2012 20:48, NeilBrown <neilb@suse.de> wrote:
>>> >
>>> > On Mon, 30 Apr 2012 13:59:56 +0000 Hákon Gíslason
>>> > <hakon.gislason@gmail.com>
>>> > wrote:
>>> >
>>> > > Hello,
>>> > > I've been having frequent drive "failures", as in, they are reported
>>> > > failed/bad and mdadm sends me an email telling me things went wrong,
>>> > > etc... but after a reboot or two, they are perfectly fine again. I'm
>>> > > not sure what it is, but this server is quite new and I think there
>>> > > might be more behind it, bad memory or the motherboard (I've been
>>> > > having other issues as well). I've had 4 drive "failures" in this
>>> > > month, all different drives except for one, which "failed" twice, and
>>> > > all have been fixed with a reboot or rebuild (all drives reported bad
>>> > > by mdadm passed an extensive SMART test).
>>> > > Due to this, I decided to convert my raid5 array to a raid6 array
>>> > > while I find the root cause of the problem.
>>> > >
>>> > > I started the conversion right after a drive failure & rebuild, but as
>>> > > it had converted/reshaped aprox. 4%(if I remember correctly, and it
>>> > > was going really slowly, ~7500 minutes to completion), it reported
>>> > > another drive bad, and the conversion to raid6 stopped (it said
>>> > > "rebuilding", but the speed was 0K/sec and the time left was a few
>>> > > million minutes.
>>> > > After that happened, I tried to stop the array and reboot the server,
>>> > > as I had done previously to get the reportedly "bad" drive working
>>> > > again, but It wouldn't stop the array or reboot, neither could I
>>> > > unmount it, it just hung whenever I tried to do something with
>>> > > /dev/md0. After trying to reboot a few times, I just killed the power
>>> > > and re-started it. Admittedly this was probably not the best thing I
>>> > > could have done at that point.
>>> > >
>>> > > I have backup of ca. 80% of the data on there, it's been a month since
>>> > > the last complete backup (because I ran out of backup disk space).
>>> > >
>>> > > So, the big question, can the array be activated, and can it complete
>>> > > the conversion to raid6? And will I get my data back?
>>> > > I hope the data can be rescued, and any help I can get would be much
>>> > > appreciated!
>>> > >
>>> > > I'm fairly new to raid in general, and have been using mdadm for about
>>> > > a month now.
>>> > > Here's some data:
>>> > >
>>> > > root@axiom:~# mdadm --examine --scan
>>> > > ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1
>>> > > name=axiom.is:0
>>> > >
>>> > >
>>> > > root@axiom:~# cat /proc/mdstat
>>> > > Personalities : [raid6] [raid5] [raid4]
>>> > > md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
>>> > >       7814054240 blocks super 1.2
>>> > >
>>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>>> > > mdadm: /dev/md0 is already in use.
>>> > >
>>> > > root@axiom:~# mdadm --stop /dev/md0
>>> > > mdadm: stopped /dev/md0
>>> > >
>>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>>> > > mdadm: Failed to restore critical section for reshape, sorry.
>>> > >       Possibly you needed to specify the --backup-file
>>> > >
>>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>>> > > --backup-file=/root/mdadm-backup-file
>>> > > mdadm: Failed to restore critical section for reshape, sorry.
>>> >
>>> > What version of mdadm are you using?
>>> >
>>> > I suggest getting a newer one (I'm about to release 3.2.4, but 3.2.3
>>> > should
>>> > be fine) and if just that doesn't help, add the "--invalid-backup" option.
>>> >
>>> > However I very strongly suggest you try to resolve the problem which is
>>> > causing your drives to fail.  Until you resolve that it will keep
>>> > happening
>>> > and having it happen repeatly during the (slow) reshape process would not
>>> > be
>>> > good.
>>> >
>>> > Maybe plug the drives into another computer, or another controller, while
>>> > the
>>> > reshape runs?
>>> >
>>> > NeilBrown
>>> >
>>> >
>>

* Re: Failed drive while converting raid5 to raid6, then a hard reboot
  2012-05-09  0:20         ` Hákon Gíslason
@ 2012-05-09  0:46           ` Hákon Gíslason
  2012-05-09  0:47           ` NeilBrown
  1 sibling, 0 replies; 9+ messages in thread
From: Hákon Gíslason @ 2012-05-09  0:46 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Never mind, I got it back up and running using --force.
I'm stopping the array and halting the server; going to image all the disks first.
--
Hákon G.


On 9 May 2012 00:20, Hákon Gíslason <hakon.gislason@gmail.com> wrote:
> Hi again, I thought the drives would last long enough to complete the
> reshape, I assembled the array, it started reshaping, went for a
> shower, and came back to this: http://pastebin.ubuntu.com/976993/
>
> The logs show the same as when the other drives failed:
> May  8 23:58:26 axiom kernel: ata4: hard resetting link
> May  8 23:58:32 axiom kernel: ata4: link is slow to respond, please be
> patient (ready=0)
> May  8 23:58:37 axiom kernel: ata4: hard resetting link
> May  8 23:58:42 axiom kernel: ata4: link is slow to respond, please be
> patient (ready=0)
> May  8 23:58:47 axiom kernel: ata4: hard resetting link
> May  8 23:58:52 axiom kernel: ata4: link is slow to respond, please be
> patient (ready=0)
> May  8 23:59:22 axiom kernel: ata4: limiting SATA link speed to 1.5 Gbps
> May  8 23:59:22 axiom kernel: ata4: hard resetting link
> May  8 23:59:27 axiom kernel: ata4.00: disabled
> May  8 23:59:27 axiom kernel: ata4: EH complete
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Unhandled error code
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] CDB: Write(10): 2a 00
> 00 00 00 08 00 00 02 00
> May  8 23:59:27 axiom kernel: md: super_written gets error=-5, uptodate=0
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Unhandled error code
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] CDB: Read(10): 28 00
> 0a 9d cb 00 00 00 40 00
> May  8 23:59:27 axiom kernel: md: md0: reshape done.
>
> What course of action do you suggest I take now?
>
> --
> Hákon G.
>
>
> On 8 May 2012 23:55, Hákon Gíslason <hakon.gislason@gmail.com> wrote:
>> Thank you very much!
>> It's currently rebuilding, I'll make an attempt to mount the volume
>> once it completes the build. But before that, I'm going to image all
>> the disks to my friends array, just to be safe. After that, backup
>> everything.
>> Again, thank you for your help!
>> --
>> Hákon G.
>>
>>
>> On 8 May 2012 23:21, NeilBrown <neilb@suse.de> wrote:
>>> On Tue, 8 May 2012 22:19:49 +0000 Hákon Gíslason <hakon.gislason@gmail.com>
>>> wrote:
>>>
>>>> Thank you for the reply, Neil
>>>> I was using mdadm from the package manager in Debian stable first
>>>> (v3.1.4), but after the constant drive failures I upgraded to the
>>>> latest one (3.2.3).
>>>> I've come to the conclusion that the drives are either failing because
>>>> they are "green" drives, and might have power-saving features that are
>>>> causing them to be "disconnected", or that the cables that came with
>>>> the motherboard aren't good enough. I'm not 100% sure about either,
>>>> but at the moment these seem likely causes. It could be incompatible
>>>> hardware or the kernel that I'm using (proxmox debian kernel:
>>>> 2.6.32-11-pve).
>>>>
>>>> I got the array assembled (thank you), but what about the raid5 to
>>>> raid6 conversion? Do I have to complete it for this to work, or will
>>>> mdadm know what to do? Can I cancel (revert) the conversion and get
>>>> the array back to raid5?
>>>>
>>>> /proc/mdstat contains:
>>>>
>>>> root@axiom:~# cat /proc/mdstat
>>>> Personalities : [raid6] [raid5] [raid4]
>>>> md0 : active (read-only) raid6 sdc[6] sdb[5] sda[4] sdd[7]
>>>>       5860540224 blocks super 1.2 level 6, 32k chunk, algorithm 18 [5/3] [_UUU_]
>>>>
>>>> unused devices: <none>
>>>>
>>>> If I try to mount the volume group on the array the kernel panics, and
>>>> the system hangs. Is that related to the incomplete conversion?
>>>
>>> The array should be part way through the conversion.  If you
>>>   mdadm -E /dev/sda
>>> it should report something like "Reshape Position : XXXX" indicating
>>> how far along it is.
>>> The reshape will not restart while the array is read-only.  Once you make it
>>> writeable it will automatically restart the reshape from where it is up to.
>>>
>>> The kernel panic is because the array is read-only and the filesystem tries
>>> to write to it.  I think that is fixed in more recent kernels (i.e. ext4
>>> refuses to mount rather than trying and crashing).
>>>
>>> So you should just be able to "mdadm --read-write /dev/md0" to make the array
>>> writable, and then continue using it ... until another device fails.
>>>
>>> Reverting the reshape is not currently possible.  Maybe it will be with Linux
>>> 3.5 and mdadm-3.3, but that is all months away.
>>>
>>> I would recommend an "fsck -n /dev/md0" first and if that seems mostly OK,
>>> and if "mdadm -E /dev/sda" reports the "Reshape Position" as expected, then
>>> make the array read-write, mount it, and backup any important data.
>>>
>>> NeilBrown
>>>
>>>
>>>>
>>>> Thanks,
>>>> --
>>>> Hákon G.
>>>>
>>>>
>>>>
>>>> On 8 May 2012 20:48, NeilBrown <neilb@suse.de> wrote:
>>>> >
>>>> > On Mon, 30 Apr 2012 13:59:56 +0000 Hákon Gíslason
>>>> > <hakon.gislason@gmail.com>
>>>> > wrote:
>>>> >
>>>> > > Hello,
>>>> > > I've been having frequent drive "failures", as in, they are reported
>>>> > > failed/bad and mdadm sends me an email telling me things went wrong,
>>>> > > etc... but after a reboot or two, they are perfectly fine again. I'm
>>>> > > not sure what it is, but this server is quite new and I think there
>>>> > > might be more behind it, bad memory or the motherboard (I've been
>>>> > > having other issues as well). I've had 4 drive "failures" in this
>>>> > > month, all different drives except for one, which "failed" twice, and
>>>> > > all have been fixed with a reboot or rebuild (all drives reported bad
>>>> > > by mdadm passed an extensive SMART test).
>>>> > > Due to this, I decided to convert my raid5 array to a raid6 array
>>>> > > while I find the root cause of the problem.
>>>> > >
>>>> > > I started the conversion right after a drive failure & rebuild, but as
>>>> > > it had converted/reshaped aprox. 4%(if I remember correctly, and it
>>>> > > was going really slowly, ~7500 minutes to completion), it reported
>>>> > > another drive bad, and the conversion to raid6 stopped (it said
>>>> > > "rebuilding", but the speed was 0K/sec and the time left was a few
>>>> > > million minutes.
>>>> > > After that happened, I tried to stop the array and reboot the server,
>>>> > > as I had done previously to get the reportedly "bad" drive working
>>>> > > again, but It wouldn't stop the array or reboot, neither could I
>>>> > > unmount it, it just hung whenever I tried to do something with
>>>> > > /dev/md0. After trying to reboot a few times, I just killed the power
>>>> > > and re-started it. Admittedly this was probably not the best thing I
>>>> > > could have done at that point.
>>>> > >
>>>> > > I have backup of ca. 80% of the data on there, it's been a month since
>>>> > > the last complete backup (because I ran out of backup disk space).
>>>> > >
>>>> > > So, the big question, can the array be activated, and can it complete
>>>> > > the conversion to raid6? And will I get my data back?
>>>> > > I hope the data can be rescued, and any help I can get would be much
>>>> > > appreciated!
>>>> > >
>>>> > > I'm fairly new to raid in general, and have been using mdadm for about
>>>> > > a month now.
>>>> > > Here's some data:
>>>> > >
>>>> > > root@axiom:~# mdadm --examine --scan
>>>> > > ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1
>>>> > > name=axiom.is:0
>>>> > >
>>>> > >
>>>> > > root@axiom:~# cat /proc/mdstat
>>>> > > Personalities : [raid6] [raid5] [raid4]
>>>> > > md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
>>>> > >       7814054240 blocks super 1.2
>>>> > >
>>>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>>>> > > mdadm: /dev/md0 is already in use.
>>>> > >
>>>> > > root@axiom:~# mdadm --stop /dev/md0
>>>> > > mdadm: stopped /dev/md0
>>>> > >
>>>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>>>> > > mdadm: Failed to restore critical section for reshape, sorry.
>>>> > >       Possibly you needed to specify the --backup-file
>>>> > >
>>>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>>>> > > --backup-file=/root/mdadm-backup-file
>>>> > > mdadm: Failed to restore critical section for reshape, sorry.
>>>> >
>>>> > What version of mdadm are you using?
>>>> >
>>>> > I suggest getting a newer one (I'm about to release 3.2.4, but 3.2.3
>>>> > should
>>>> > be fine) and if just that doesn't help, add the "--invalid-backup" option.
>>>> >
>>>> > However I very strongly suggest you try to resolve the problem which is
>>>> > causing your drives to fail.  Until you resolve that it will keep
>>>> > happening
>>>> > and having it happen repeatly during the (slow) reshape process would not
>>>> > be
>>>> > good.
>>>> >
>>>> > Maybe plug the drives into another computer, or another controller, while
>>>> > the
>>>> > reshape runs?
>>>> >
>>>> > NeilBrown
>>>> >
>>>> >
>>>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failed drive while converting raid5 to raid6, then a hard reboot
  2012-05-09  0:20         ` Hákon Gíslason
  2012-05-09  0:46           ` Hákon Gíslason
@ 2012-05-09  0:47           ` NeilBrown
  1 sibling, 0 replies; 9+ messages in thread
From: NeilBrown @ 2012-05-09  0:47 UTC (permalink / raw)
  To: Hákon Gíslason; +Cc: linux-raid


On Wed, 9 May 2012 00:20:29 +0000 Hákon Gíslason <hakon.gislason@gmail.com>
wrote:

> Hi again. I thought the drives would last long enough to complete the
> reshape, so I assembled the array and it started reshaping; I went for a
> shower and came back to this: http://pastebin.ubuntu.com/976993/
> 
> The logs show the same as when the other drives failed:
> May  8 23:58:26 axiom kernel: ata4: hard resetting link
> May  8 23:58:32 axiom kernel: ata4: link is slow to respond, please be
> patient (ready=0)
> May  8 23:58:37 axiom kernel: ata4: hard resetting link
> May  8 23:58:42 axiom kernel: ata4: link is slow to respond, please be
> patient (ready=0)
> May  8 23:58:47 axiom kernel: ata4: hard resetting link
> May  8 23:58:52 axiom kernel: ata4: link is slow to respond, please be
> patient (ready=0)
> May  8 23:59:22 axiom kernel: ata4: limiting SATA link speed to 1.5 Gbps
> May  8 23:59:22 axiom kernel: ata4: hard resetting link
> May  8 23:59:27 axiom kernel: ata4.00: disabled
> May  8 23:59:27 axiom kernel: ata4: EH complete
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Unhandled error code
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] CDB: Write(10): 2a 00
> 00 00 00 08 00 00 02 00
> May  8 23:59:27 axiom kernel: md: super_written gets error=-5, uptodate=0
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Unhandled error code
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] CDB: Read(10): 28 00
> 0a 9d cb 00 00 00 40 00
> May  8 23:59:27 axiom kernel: md: md0: reshape done.
> 
> What course of action do you suggest I take now?

I'm not surprised.  Until you fix the underlying issue you will continue to
suffer pain.
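
One place to start narrowing that down - purely as a sketch, with sdX as a
placeholder for each member drive and smartmontools assumed to be installed -
is the drives' error-recovery and power-saving behaviour, since desktop/"green"
drives with long internal recovery times or aggressive idling are a common
reason for md to kick otherwise healthy disks:

   smartctl -l scterc /dev/sdX        # show SCT Error Recovery Control timeouts (if supported)
   smartctl -l scterc,70,70 /dev/sdX  # limit recovery to 7.0s so a bad sector fails fast instead of stalling the bus
   cat /sys/class/scsi_host/host*/link_power_management_policy   # look for aggressive SATA link power saving

None of this is guaranteed to be the cause here; they are just cheap checks to
run before blaming the cables or the controller.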

You should be able to assemble the array again the same way as before - plus
the --force option.
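
Concretely, that would look something like this (a sketch only - the
backup-file path is the one used earlier in the thread, and --invalid-backup
is only needed if the "Failed to restore critical section" message comes back):

   mdadm --stop /dev/md0
   mdadm --assemble --scan --force --run /dev/md0 \
         --backup-file=/root/mdadm-backup-file --invalid-backup
   cat /proc/mdstat     # the reshape should resume once the array is read-write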

It will continue to reshape for a little while and then will probably hit
another error.  Each time that happens there is a risk of some corruption.

Once you get it going again you could
   echo frozen > /sys/block/md0/md/sync_action
to freeze the reshape.  Then mount the filesystem and back up the important
data.
That way the constant reshape activity won't trigger any errors - though
just extracting data for the backup might.
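
In practice that could look roughly like this (the mount point and rsync
target are placeholders, and if the array holds an LVM volume group, activate
and mount the logical volume rather than /dev/md0 itself):

   echo frozen > /sys/block/md0/md/sync_action   # pause the reshape
   mount -o ro /dev/md0 /mnt                     # read-only mount while copying
   rsync -a /mnt/important/ /backup/target/      # grab the irreplaceable data first
   umount /mnt
   echo idle > /sys/block/md0/md/sync_action     # clearing "frozen" should let the reshape carry on

Checking /proc/mdstat afterwards will confirm whether the reshape has picked
up again.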

NeilBrown


> 
> --
> Hákon G.
> 
> 
> On 8 May 2012 23:55, Hákon Gíslason <hakon.gislason@gmail.com> wrote:
> > Thank you very much!
> > It's currently rebuilding, I'll make an attempt to mount the volume
> > once it completes the build. But before that, I'm going to image all
> > the disks to my friend's array, just to be safe. After that, back up
> > everything.
> > Again, thank you for your help!
> > --
> > Hákon G.
> >
> >
> > On 8 May 2012 23:21, NeilBrown <neilb@suse.de> wrote:
> >> On Tue, 8 May 2012 22:19:49 +0000 Hákon Gíslason <hakon.gislason@gmail.com>
> >> wrote:
> >>
> >>> Thank you for the reply, Neil
> >>> I was using mdadm from the package manager in Debian stable first
> >>> (v3.1.4), but after the constant drive failures I upgraded to the
> >>> latest one (3.2.3).
> >>> I've come to the conclusion that the drives are failing either because
> >>> they are "green" drives with power-saving features that are causing
> >>> them to be "disconnected", or because the cables that came with
> >>> the motherboard aren't good enough. I'm not 100% sure about either,
> >>> but at the moment these seem the likely causes. It could be incompatible
> >>> hardware or the kernel that I'm using (proxmox debian kernel:
> >>> 2.6.32-11-pve).
> >>>
> >>> I got the array assembled (thank you), but what about the raid5 to
> >>> raid6 conversion? Do I have to complete it for this to work, or will
> >>> mdadm know what to do? Can I cancel (revert) the conversion and get
> >>> the array back to raid5?
> >>>
> >>> /proc/mdstat contains:
> >>>
> >>> root@axiom:~# cat /proc/mdstat
> >>> Personalities : [raid6] [raid5] [raid4]
> >>> md0 : active (read-only) raid6 sdc[6] sdb[5] sda[4] sdd[7]
> >>>       5860540224 blocks super 1.2 level 6, 32k chunk, algorithm 18 [5/3] [_UUU_]
> >>>
> >>> unused devices: <none>
> >>>
> >>> If I try to mount the volume group on the array the kernel panics, and
> >>> the system hangs. Is that related to the incomplete conversion?
> >>
> >> The array should be part way through the conversion.  If you
> >>   mdadm -E /dev/sda
> >> it should report something like "Reshape Position : XXXX" indicating
> >> how far along it is.
> >> The reshape will not restart while the array is read-only.  Once you make it
> >> writeable it will automatically restart the reshape from where it is up to.
> >>
> >> The kernel panic is because the array is read-only and the filesystem tries
> >> to write to it.  I think that is fixed in more recent kernels (i.e. ext4
> >> refuses to mount rather than trying and crashing).
> >>
> >> So you should just be able to "mdadm --read-write /dev/md0" to make the array
> >> writable, and then continue using it ... until another device fails.
> >>
> >> Reverting the reshape is not currently possible.  Maybe it will be with Linux
> >> 3.5 and mdadm-3.3, but that is all months away.
> >>
> >> I would recommend an "fsck -n /dev/md0" first and if that seems mostly OK,
> >> and if "mdadm -E /dev/sda" reports the "Reshape Position" as expected, then
> >> make the array read-write, mount it, and backup any important data.
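
Put together, the sequence described above looks roughly like this (a sketch;
/mnt is a placeholder, and if md0 holds LVM the fsck and mount would be run
against the logical volume instead):

   mdadm -E /dev/sda | grep -i reshape   # confirm a reshape position is recorded
   fsck -n /dev/md0                      # read-only check, nothing is written
   mdadm --read-write /dev/md0           # the reshape resumes from the recorded position
   mount /dev/md0 /mnt                   # then copy off the important data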
> >>
> >> NeilBrown
> >>
> >>
> >>>
> >>> Thanks,
> >>> --
> >>> Hákon G.
> >>>
> >>>
> >>>
> >>> On 8 May 2012 20:48, NeilBrown <neilb@suse.de> wrote:
> >>> >
> >>> > On Mon, 30 Apr 2012 13:59:56 +0000 Hákon Gíslason
> >>> > <hakon.gislason@gmail.com>
> >>> > wrote:
> >>> >
> >>> > > Hello,
> >>> > > I've been having frequent drive "failures", as in, they are reported
> >>> > > failed/bad and mdadm sends me an email telling me things went wrong,
> >>> > > etc... but after a reboot or two, they are perfectly fine again. I'm
> >>> > > not sure what it is, but this server is quite new and I think there
> >>> > > might be more behind it, bad memory or the motherboard (I've been
> >>> > > having other issues as well). I've had 4 drive "failures" in this
> >>> > > month, all different drives except for one, which "failed" twice, and
> >>> > > all have been fixed with a reboot or rebuild (all drives reported bad
> >>> > > by mdadm passed an extensive SMART test).
> >>> > > Due to this, I decided to convert my raid5 array to a raid6 array
> >>> > > while I find the root cause of the problem.
> >>> > >
> >>> > > I started the conversion right after a drive failure & rebuild, but as
> >>> > > it had converted/reshaped approx. 4% (if I remember correctly, and it
> >>> > > was going really slowly, ~7500 minutes to completion), it reported
> >>> > > another drive bad, and the conversion to raid6 stopped (it said
> >>> > > "rebuilding", but the speed was 0K/sec and the time left was a few
> >>> > > million minutes).
> >>> > > After that happened, I tried to stop the array and reboot the server,
> >>> > > as I had done previously to get the reportedly "bad" drive working
> >>> > > again, but it wouldn't stop the array or reboot, nor could I
> >>> > > unmount it; it just hung whenever I tried to do something with
> >>> > > /dev/md0. After trying to reboot a few times, I just killed the power
> >>> > > and re-started it. Admittedly this was probably not the best thing I
> >>> > > could have done at that point.
> >>> > >
> >>> > > I have a backup of ca. 80% of the data on there; it's been a month since
> >>> > > the last complete backup (because I ran out of backup disk space).
> >>> > >
> >>> > > So, the big question, can the array be activated, and can it complete
> >>> > > the conversion to raid6? And will I get my data back?
> >>> > > I hope the data can be rescued, and any help I can get would be much
> >>> > > appreciated!
> >>> > >
> >>> > > I'm fairly new to raid in general, and have been using mdadm for about
> >>> > > a month now.
> >>> > > Here's some data:
> >>> > >
> >>> > > root@axiom:~# mdadm --examine --scan
> >>> > > ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1
> >>> > > name=axiom.is:0
> >>> > >
> >>> > >
> >>> > > root@axiom:~# cat /proc/mdstat
> >>> > > Personalities : [raid6] [raid5] [raid4]
> >>> > > md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
> >>> > >       7814054240 blocks super 1.2
> >>> > >
> >>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> >>> > > mdadm: /dev/md0 is already in use.
> >>> > >
> >>> > > root@axiom:~# mdadm --stop /dev/md0
> >>> > > mdadm: stopped /dev/md0
> >>> > >
> >>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> >>> > > mdadm: Failed to restore critical section for reshape, sorry.
> >>> > >       Possibly you needed to specify the --backup-file
> >>> > >
> >>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> >>> > > --backup-file=/root/mdadm-backup-file
> >>> > > mdadm: Failed to restore critical section for reshape, sorry.
> >>> >
> >>> > What version of mdadm are you using?
> >>> >
> >>> > I suggest getting a newer one (I'm about to release 3.2.4, but 3.2.3
> >>> > should be fine) and if just that doesn't help, add the "--invalid-backup"
> >>> > option.
> >>> >
> >>> > However I very strongly suggest you try to resolve the problem which is
> >>> > causing your drives to fail.  Until you resolve that it will keep happening,
> >>> > and having it happen repeatedly during the (slow) reshape process would
> >>> > not be good.
> >>> >
> >>> > Maybe plug the drives into another computer, or another controller, while
> >>> > the reshape runs?
> >>> >
> >>> > NeilBrown
> >>> >
> >>> >
> >>




end of thread, other threads:[~2012-05-09  0:47 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-30 13:59 Failed drive while converting raid5 to raid6, then a hard reboot Hákon Gíslason
2012-05-08 20:48 ` NeilBrown
2012-05-08 22:19   ` Hákon Gíslason
2012-05-08 23:03     ` Hákon Gíslason
2012-05-08 23:21     ` NeilBrown
2012-05-08 23:55       ` Hákon Gíslason
2012-05-09  0:20         ` Hákon Gíslason
2012-05-09  0:46           ` Hákon Gíslason
2012-05-09  0:47           ` NeilBrown
