* Re: Help needed recovering from raid failure
@ 2015-04-29 18:17 Peter van Es
  2015-04-29 23:27 ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Peter van Es @ 2015-04-29 18:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Dear Neil,

First of all, I really appreciate you trying to help me. This is the first time I’m deploying software raid, so I really appreciate the guidance.


> On 29 Apr 2015, at 00:26, NeilBrown <neilb@suse.de> wrote:
> 
> This isn't really reporting anything new.
> There is probably a daily cron job which reports all degraded arrays.  This
> message is reported by that job.

I understand...

> 
> 
> Why do you think the array is off-line?  The above message doesn't suggest
> that.
> 

My Ubuntu server was accessible through ssh but did not serve web pages, files, etc. When I went to the console,
it told me it had taken the array offline because /dev/sdd2 and /dev/sdc2 had been degraded.
Those two drives were out of the array.

> 
>> 
>> Needless to say, I can't boot the system anymore as the boot drive is /dev/md0, and GRUB can't
>> get at it. I do need to recover data (I know, but there's stuff on there I have no backup for--yet).
> 
> You boot off a RAID5?  Does grub support that?  I didn't know.
> But md0 hasn't failed, has it?
> 
> Confused.

Well, it took a little time but yes, I managed to define a raid 5 array that the system was able to boot from. 

> There is something VERY sick here.  I suggest that you tread very carefully.
> 
> All your '1' partitions should be about 2GB and the '2' partitions about 2TB
> 
> But the --examine output suggests sda2 and sdb2 are 2TB, while sdd2 and sde2
> are 2GB.
> 
> That really really shouldn't happen.  Maybe check your partition table
> (fdisk).
> I really cannot see how this would happen.

But this question, and the previous one you asked, give me some idea of what I may have done…

I think I confused /dev/md0 and /dev/md1 (now called /dev/md126 and /dev/md127 when running off the USB stick).

/dev/md0 is a swap array (around 6GB, comprised of 4 x 2 GB in raid 5)
/dev/md1 is the boot and data array (around 5 TB, comprised of 4 x ~2 TB in raid 5) 

I must have confused them and tried to add the /dev/sdc2 and /dev/sdd2 partitions to the /dev/md0 array (mdadm --add /dev/md0 /dev/sdc2)
instead of to the /dev/md1 array. They were then added as spare drives and their superblocks were overwritten, but since
a) no swap space was used, and
b) they were added as spares,

the data should not have been overwritten.
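
In hindsight, a quick check of which array a partition actually belongs to, before running any --add, would have caught the mix-up. Something along these lines (the grep patterns are just illustrative):

  mdadm --examine /dev/sdc2 | grep -E 'Array UUID|Device Role'
  mdadm --detail /dev/md0 | grep UUID

If the two UUIDs had not matched, it would have been obvious the partition was headed for the wrong array.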

> 
> Can you
>  mdadm -Ss
> 
> to stop all the arrays, then
> 
>  fdisk -l /dev/sd?
> 
> then 
> 
>  mdadm -Esvv
> 

Neil, here they are: again, I appreciate you taking the time and guiding me through this!

Is there any way to resurrect the superblocks and try to force-assemble the array, skipping the failing drive /dev/sdd2? (The /dev/sdd2 drive produced some errors that I observed in the log; /dev/sdc2 must have had a one-off issue to be taken out.) I have two new drives (they arrived today) and a new SSD. I would want to get the array assembled using /dev/sdc2, perhaps forcing it back into the array geometry and “hoping for the best”, and then install a new /dev/sdd2 and let it be recovered. Then I’ll create boot and swap partitions on the SSD, which means that any array failure should no longer prevent the system from booting…
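
Roughly what I was hoping for, purely as a sketch (device names as the drives appear right now, booted from the USB stick, so /dev/sdd2 here is the old /dev/sdc2; treat the names as illustrative):

  mdadm -Ss
  mdadm --assemble --force --run /dev/md1 /dev/sda2 /dev/sdb2 /dev/sdd2

i.e. start the big array degraded without the failing drive, and add the replacement once it is running.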

Requested outputs are below

Thanks, 

Peter


fdisk output: (USB devices deleted)


Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000f24ee

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     3905535     1951744   fd  Linux raid autodetect
/dev/sda2   *     3905536  3907028991  1951561728   fd  Linux raid autodetect

Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00029d5c

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048     3905535     1951744   fd  Linux raid autodetect
/dev/sdb2   *     3905536  3907028991  1951561728   fd  Linux raid autodetect


Disk /dev/sdd: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000727bf

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1            2048     3905535     1951744   fd  Linux raid autodetect
/dev/sdd2   *     3905536  3907028991  1951561728   fd  Linux raid autodetect

Disk /dev/sde: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0009fe7f

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1            2048     3905535     1951744   fd  Linux raid autodetect
/dev/sde2   *     3905536  3907028991  1951561728   fd  Linux raid autodetect


mdadm -Esvv output (USB devices deleted)

/dev/sde2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : dbe238a3:c7a528c1:a1b78589:276ecfcf
           Name : ubuntu:0  (local to host ubuntu)
  Creation Time : Wed Apr  1 22:27:42 2015
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3903121408 (1861.15 GiB 1998.40 GB)
     Array Size : 5850624 (5.58 GiB 5.99 GB)
  Used Dev Size : 3900416 (1904.82 MiB 1997.01 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : cdae3287:91168194:942ba99d:1a85c466

    Update Time : Wed Apr 29 17:46:25 2015
       Checksum : b8b84dad - correct
         Events : 30

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : dbe238a3:c7a528c1:a1b78589:276ecfcf
           Name : ubuntu:0  (local to host ubuntu)
  Creation Time : Wed Apr  1 22:27:42 2015
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB)
     Array Size : 5850624 (5.58 GiB 5.99 GB)
  Used Dev Size : 3900416 (1904.82 MiB 1997.01 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : b051f523:4887e729:cd63bed1:8c2a7575

    Update Time : Wed Apr 29 17:46:25 2015
       Checksum : 453ddeef - correct
         Events : 30

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sde:
   MBR Magic : aa55
Partition[0] :      3903488 sectors at         2048 (type fd)
Partition[1] :   3903123456 sectors at      3905536 (type fd)
/dev/sdd2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : dbe238a3:c7a528c1:a1b78589:276ecfcf
           Name : ubuntu:0  (local to host ubuntu)
  Creation Time : Wed Apr  1 22:27:42 2015
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3903121408 (1861.15 GiB 1998.40 GB)
     Array Size : 5850624 (5.58 GiB 5.99 GB)
  Used Dev Size : 3900416 (1904.82 MiB 1997.01 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 0f3f2b91:09cbb344:e52c4c4b:722d65c4

    Update Time : Wed Apr 29 17:46:25 2015
       Checksum : 7e273c0f - correct
         Events : 30

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : dbe238a3:c7a528c1:a1b78589:276ecfcf
           Name : ubuntu:0  (local to host ubuntu)
  Creation Time : Wed Apr  1 22:27:42 2015
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB)
     Array Size : 5850624 (5.58 GiB 5.99 GB)
  Used Dev Size : 3900416 (1904.82 MiB 1997.01 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : b6668730:3b1380bf:556700d9:30df829c

    Update Time : Wed Apr 29 17:46:25 2015
       Checksum : 15b83814 - correct
         Events : 30

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sdd:
   MBR Magic : aa55
Partition[0] :      3903488 sectors at         2048 (type fd)
Partition[1] :   3903123456 sectors at      3905536 (type fd)
/dev/sdb2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 1f28f7bb:7b3ecd41:ca0fa5d1:ccd008df
           Name : ubuntu:1  (local to host ubuntu)
  Creation Time : Wed Apr  1 22:27:58 2015
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3902861312 (1861.03 GiB 1998.26 GB)
     Array Size : 5854290432 (5583.09 GiB 5994.79 GB)
  Used Dev Size : 3902860288 (1861.03 GiB 1998.26 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : f1e79609:79b7ac23:55197f70:e8fbfd58

    Update Time : Sun Apr 26 05:59:13 2015
       Checksum : 696f4e76 - correct
         Events : 18014

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AA.. ('A' == active, '.' == missing)
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : dbe238a3:c7a528c1:a1b78589:276ecfcf
           Name : ubuntu:0  (local to host ubuntu)
  Creation Time : Wed Apr  1 22:27:42 2015
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB)
     Array Size : 5850624 (5.58 GiB 5.99 GB)
  Used Dev Size : 3900416 (1904.82 MiB 1997.01 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : f52239b1:0fb87e7e:71e29ea4:bf67184a

    Update Time : Wed Apr 29 17:46:25 2015
       Checksum : ce9c9cd0 - correct
         Events : 30

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sdb:
   MBR Magic : aa55
Partition[0] :      3903488 sectors at         2048 (type fd)
Partition[1] :   3903123456 sectors at      3905536 (type fd)
/dev/sda2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 1f28f7bb:7b3ecd41:ca0fa5d1:ccd008df
           Name : ubuntu:1  (local to host ubuntu)
  Creation Time : Wed Apr  1 22:27:58 2015
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3902861312 (1861.03 GiB 1998.26 GB)
     Array Size : 5854290432 (5583.09 GiB 5994.79 GB)
  Used Dev Size : 3902860288 (1861.03 GiB 1998.26 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 713e556d:ca104217:785db68a:d820a57b

    Update Time : Sun Apr 26 05:59:13 2015
       Checksum : fda151f9 - correct
         Events : 18014

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AA.. ('A' == active, '.' == missing)
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : dbe238a3:c7a528c1:a1b78589:276ecfcf
           Name : ubuntu:0  (local to host ubuntu)
  Creation Time : Wed Apr  1 22:27:42 2015
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB)
     Array Size : 5850624 (5.58 GiB 5.99 GB)
  Used Dev Size : 3900416 (1904.82 MiB 1997.01 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : c483532d:06f93351:cfdf5a92:e83855b5

    Update Time : Wed Apr 29 17:46:25 2015
       Checksum : 76650d1c - correct
         Events : 30

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sda:
   MBR Magic : aa55
Partition[0] :      3903488 sectors at         2048 (type fd)
Partition[1] :   3903123456 sectors at      3905536 (type fd)


* Re: Help needed recovering from raid failure
  2015-04-29 18:17 Help needed recovering from raid failure Peter van Es
@ 2015-04-29 23:27 ` NeilBrown
  2015-04-30 19:25   ` Peter van Es
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2015-04-29 23:27 UTC (permalink / raw)
  To: Peter van Es; +Cc: linux-raid


On Wed, 29 Apr 2015 20:17:09 +0200 Peter van Es <vanes.peter@gmail.com> wrote:

> Dear Neil,
> 
> first of all, I really appreciate you trying to help me. This is the first time I’m deploying software raid, so really appreciate the guidance.
> 
> 
> > On 29 Apr 2015, at 00:26, NeilBrown <neilb@suse.de> wrote:
> > 
> > This isn't really reporting anything new.
> > There is probably a daily cron job which reports all degraded arrays.  This
> > message is reported by that job.
> 
> I understand...
> 
> > 
> > 
> > Why do you think the array is off-line?  The above message doesn't suggest
> > that.
> > 
> 
> My Ubuntu server was accessible through ssh but did not serve webpages, files etc. When I went to the console, 
> it told me it had taken the array offline because of degraded /dev/sdd2 and /dev/sdc2
> Those two drives were out of the array. 
> 
> > 
> >> 
> >> Needless to say, I can't boot the system anymore as the boot drive is /dev/md0, and GRUB can't
> >> get at it. I do need to recover data (I know, but there's stuff on there I have no backup for--yet).
> > 
> > You boot off a RAID5?  Does grub support that?  I didn't know.
> > But md0 hasn't failed, has it?
> > 
> > Confused.
> 
> Well, it took a little time but yes, I managed to define a raid 5 array that the system was able to boot from. 
> 
> > There is something VERY sick here.  I suggest that you tread very carefully.
> > 
> > All your '1' partitions should be about 2GB and the '2' partitions about 2TB
> > 
> > But the --examine output suggests sda2 and sdb2 are 2TB, while sdd2 and sde2
> > are 2GB.
> > 
> > That really really shouldn't happen.  Maybe check your partition table
> > (fdisk).
> > I really cannot see how this would happen.
> 
> But this question, and the previous question you asked, tell me a little of what I may have done…
> 
> I think confused /dev/md0 and /dev/md1 (now called /dev/md126 and /dev/md127 when running of the USB stick). 
> 
> /dev/md0 is a swap array (around 6GB, comprised of 4 x 2 GB in raid 5)
> /dev/md1 is the boot and data array (around 5 TB, comprised of 4 x ~2 TB in raid 5) 
> 
> I must have confused them and tried to add the /dev/sdc2 and /dev/sdd2 drive to the /dev/md0 array (mdadm —add /dev/md0 /dev/sdc2)

Oops!

> instead of to the /dev/md1 array.  They were  then added as spare drives, their superblocks were overwritten, but since
> a) no swap space was used, and 
> b) they were added as spares
> 
> The data should not have been overwritten.

Hopefully not.

> 
> > 
> > Can you
> >  mdadm -Ss
> > 
> > to stop all the arrays, then
> > 
> >  fdisk -l /dev/sd?
> > 
> > then 
> > 
> >  mdadm -Esvv
> > 
> 
> Neil, here they are: again, I appreciate you taking the time and guiding me through this!
> 
> Is there any way to resurrect the super blocks and try to force assemble the array, skipping the failing drive /dev/sdd2 (the /dev/sdd2 drive created some errors I observed in the log, /dev/sdc2 must have had a one off issue to be taken out….). I have two new drives (arrived today), and a new SSD drive. I would want to get the new array assembled using /dev/sdc2 perhaps forcing it back to the array geometry and “hoping for the best” and then install a new /dev/sdd2 to be recovered. Then I’ll create a boot and swap drive off the SSD which means that any array failures should not prevent the system from booting…

As you have destroyed some metadata, it is no longer possible to 'assemble'
the array.  We need to re-create it.

sda2 and sdb2 appear to be the first two drives of the array.  sdd2 failed
first, so sde2 is a better choice to use.  It is probably reasonable to
assume that it was the fourth drive in the array.  If that assumption proves
false then it might be the third.

Before doing this, double-check whether the names have changed: check that
  mdadm --examine /dev/sda2
shows
>      Array UUID : 1f28f7bb:7b3ecd41:ca0fa5d1:ccd008df
>    Device Role : Active device 0

(among other info) and that
  mdadm --examine /dev/sdb2
shows the same Array UUID and
>    Device Role : Active device 1


Then run

 mdadm -C /dev/md1 -l5 -n4 --data-offset=262144s --metadata=1.2 --assume-clean \
  /dev/sda2 /dev/sdb2 missing /dev/sde2

Then

 fsck -n -f /dev/md1

If that works, mount /dev/md1 and have a look around and confirm everything
looks OK.
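
A read-only mount is the safer way to poke around while checking, e.g. (the mount point is just an example):

  mount -o ro /dev/md1 /mnt
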
If fsck complains, we might have sde2 in the wrong position.  Or maybe sde
and sdd changed names.
run
  mdadm -Ss
then rerun the -C command with a different list of devices. e.g.
  /dev/sda2 /dev/sdb2 /dev/sde2 missing

Always have one 'missing' device or you will be very likely to get
out-of-sync data.

Once you have data that looks OK, copy out any really really important stuff.
Then, if you think the 4th drive is reliable enough, or if you have replaced
it, add the '2' partition of the fourth drive to the array and let it rebuild.
Then you should be back to a safe working array.
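
Something like the following, substituting whatever name the replacement drive ends up with for sdX:

  mdadm /dev/md1 --add /dev/sdX2

and keep an eye on /proc/mdstat until the recovery finishes.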

NeilBrown





* Re: Help needed recovering from raid failure
  2015-04-29 23:27 ` NeilBrown
@ 2015-04-30 19:25   ` Peter van Es
  2015-05-01  2:31     ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Peter van Es @ 2015-04-30 19:25 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Neil,

Thanks. I followed your instructions (slightly modified, as my version of mdadm did not support the --data-offset option). /dev/sdd was the 3rd drive, and I had physically removed the 4th drive from my server.

I managed to restart the array. Then I replaced the failing drive, created partitions on it the same as on /dev/sda, and added them to the two arrays.
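
(For anyone following along: copying the MBR partition layout and re-adding can be done with something like the following; /dev/sdX stands for the new drive here, so this is only a sketch, not the exact commands I typed.)

  sfdisk -d /dev/sda | sfdisk /dev/sdX
  mdadm /dev/md0 --add /dev/sdX1
  mdadm /dev/md1 --add /dev/sdX2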

It is now rebuilding the data array, and will be done in 440 minutes... It appears that I've lost nothing important.

One question: I did spot that the Array UUID changed with the Create command. Is there any way of getting it back to the old value?

Peter


> 
> Before doing this, double check that the names have changed, so check that
>  mdadm --examine /dev/sda2
> shows
>>     Array UUID : 1f28f7bb:7b3ecd41:ca0fa5d1:ccd008df
>>   Device Role : Active device 0
> 
> (among other info) and  that 
>  mdadm --examine /dev/sdb2
> show the same Array UUID and
>>   Device Role : Active device 1
> 
> 
> Then run
> 
> mdadm -C /dev/md1 -l5 -n4 --data-offset=262144s --metadata=1.2 --assume-clean \
>  /dev/sda2 /dev/sdb2 missing /dev/sde2




* Re: Help needed recovering from raid failure
  2015-04-30 19:25   ` Peter van Es
@ 2015-05-01  2:31     ` NeilBrown
  0 siblings, 0 replies; 7+ messages in thread
From: NeilBrown @ 2015-05-01  2:31 UTC (permalink / raw)
  To: Peter van Es; +Cc: linux-raid


On Thu, 30 Apr 2015 21:25:04 +0200 Peter van Es <vanes.peter@gmail.com> wrote:

> Neil,
> 
> thanks. I followed your instructions (slightly modified as my version of mdadm did not support the --data-offset stanza). /dev/sdd was the 3rd drive and I had physically removed the 4th drive from my server.
> 
> I managed to restart the array. Then I replaced the failing drive, created partitions the same as on /dev/sda and added it to the two arrays.
> 
> It is now rebuilding for the data array, and will be done in 440 minutes.... It appears that I've lost nothing important...

Excellent.

> 
> One question: I did spot that the Array UUID has changed on the Create command. Is there any way of getting it back to the old value ?

Why would you want to?

But I think you can.  Firstly stop the array (so you need to be booted from a
USB or similar) and then

 mdadm --assemble /dev/mdWHATEVER --update=uuid --uuid=your:favo:rite:nums ..list.of.devices..

NeilBrown

> 
> Peter
> 
> 
> > 
> > Before doing this, double check that the names have changed, so check that
> >  mdadm --examine /dev/sda2
> > shows
> >>     Array UUID : 1f28f7bb:7b3ecd41:ca0fa5d1:ccd008df
> >>   Device Role : Active device 0
> > 
> > (among other info) and  that 
> >  mdadm --examine /dev/sdb2
> > show the same Array UUID and
> >>   Device Role : Active device 1
> > 
> > 
> > Then run
> > 
> > mdadm -C /dev/md1 -l5 -n4 --data-offset=262144s --metadata=1.2 --assume-clean \
> >  /dev/sda2 /dev/sdb2 missing /dev/sde2
> 




* Re: Help needed recovering from raid failure
  2015-04-27  9:35 Peter van Es
  2015-04-27 11:07 ` Mikael Abrahamsson
@ 2015-04-28 22:26 ` NeilBrown
  1 sibling, 0 replies; 7+ messages in thread
From: NeilBrown @ 2015-04-28 22:26 UTC (permalink / raw)
  To: Peter van Es; +Cc: linux-raid


On Mon, 27 Apr 2015 11:35:09 +0200 Peter van Es <vanes.peter@gmail.com> wrote:

> Sorry for the long post...
> 
> I am running Ubuntu LTS 14.04.02 Server edition, 64 bits, with 4x 2.0TB drives in a raid-5 array.
> 
> The 4th drive was beginning to show read errors. Because it was weekend, I could not go out
> and buy a spare 2TB drive to replace the one that was beginning to fail.
> 
> I first got a fail event:
> 
> This is an automatically generated mail message from mdadm
> running on bali
> 
> A Fail event had been detected on md device /dev/md/1.
> 
> It could be related to component device /dev/sdd2.
> 
> Faithfully yours, etc.
> 
> P.S. The /proc/mdstat file currently contains the following:
> 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md1 : active raid5 sdc2[2] sdb2[1] sda2[0] sdd2[3](F)
>     5854290432 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
> 
> md0 : active raid5 sdc1[2] sdd1[3] sdb1[1] sda1[0]
>     5850624 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
> 
> unused devices: <none>
> 
> And then subsequently, around 18 hours later:
> 
> This is an automatically generated mail message from mdadm
> running on bali
> 
> A DegradedArray event had been detected on md device /dev/md/1.

This isn't really reporting anything new.
There is probably a daily cron job which reports all degraded arrays.  This
message is reported by that job.

> 
> Faithfully yours, etc.
> 
> P.S. The /proc/mdstat file currently contains the following:
> 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md1 : active raid5 sdc2[2] sdb2[1] sda2[0] sdd2[3](F)
>     5854290432 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
> 
> md0 : active raid5 sdc1[2] sdd1[3] sdb1[1] sda1[0]
>     5850624 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
> 
> unused devices: <none>
> 
> The server had taken the array off line at that point.

Why do you think the array is off-line?  The above message doesn't suggest
that.


> 
> Needless to say, I can't boot the system anymore as the boot drive is /dev/md0, and GRUB can't
> get at it. I do need to recover data (I know, but there's stuff on there I have no backup for--yet).

You boot off a RAID5?  Does grub support that?  I didn't know.
But md0 hasn't failed, has it?

Confused.



> 
> I booted Linux from a USB stick (which is on /dev/sdc1 hence changing the numbering),
> in recovery mode. Below is the output of /proc/mdstat and 
> mdadm --examine. It looks like somehow the /dev/sdd2 and /dev/sde2 drives took on the 
> super block of the /dev/md127 device (my swap file). May that have been done by the boot from
> the Ubuntu USB stick?

There is something VERY sick here.  I suggest that you tread very carefully.

All your '1' partitions should be about 2GB and the '2' partitions about 2TB

But the --examine output suggests sda2 and sdb2 are 2TB, while sdd2 and sde2
are 2GB.

That really really shouldn't happen.  Maybe check your partition table
(fdisk).
I really cannot see how this would happen.
> 
> My plan... assemble a degraded array, with /dev/sde2 (the 4th drive, formerly known as /dev/sdd2) not in it.
> Because the fail event put the file system in RO mode, I expect /dev/sdd2 (formerly /dev/sdc2) to be ok.
> Then insert new 2TB drive in slot 4. Let system resync and recover.
> 
> I'm running xfs on the /dev/md1 device.
> 
> Questions:
> 
> 1. is this the wise course of action ?
> 2. how exactly do I reassemble the array (/etc/mdadm.conf is inaccessible in recovery mode)
> 3. what command line options do I use exactly from the --examine output below without screwing things up
> 
> And help or pointers gratefully accepted

Can you
  mdadm -Ss

to stop all the arrays, then

  fdisk -l /dev/sd?

then 

  mdadm -Esvv

and post all of that.  Hopefully some of it will make sense.

NeilBrown




* Re: Help needed recovering from raid failure
  2015-04-27  9:35 Peter van Es
@ 2015-04-27 11:07 ` Mikael Abrahamsson
  2015-04-28 22:26 ` NeilBrown
  1 sibling, 0 replies; 7+ messages in thread
From: Mikael Abrahamsson @ 2015-04-27 11:07 UTC (permalink / raw)
  To: Peter van Es; +Cc: linux-raid


> I booted Linux from a USB stick (which is on /dev/sdc1 hence changing the numbering),
> in recovery mode. Below is the output of /proc/mdstat and
> mdadm --examine. It looks like somehow the /dev/sdd2 and /dev/sde2 drives took on the
> super block of the /dev/md127 device (my swap file). May that have been done by the boot from
> the Ubuntu USB stick?

Your event counters are strange: two drives are showing 18014, and two
drives are showing an event count of 26. Two drives show an update time of
the 26th, two show an update time of the 27th of April. This doesn't make
much sense.
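
You can compare them quickly with something like:

  for d in /dev/sd[abde]2 ; do echo $d ; mdadm --examine $d | grep Events ; done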

If I were you, I would try to make really really sure that I had unplugged
the drive that first went offline, then I would use "mdadm --assemble
--force <md> <component drives>" to get the array up in degraded mode. I
would then mount it read-only and try to copy the most important
information onto some other disk. After that you can try to add the new
drive you bought and let it re-sync. Most likely this will not work, as you
probably have read errors on at least one other drive. You can use
"smartctl" from "smartmontools" to verify. Most likely at least one other
drive will have "pending sectors", which are sectors that cannot be read.
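
For example, something like this per drive will show the relevant counts:

  smartctl -a /dev/sdX | grep -i -E 'pending|reallocated'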

Also, I recommend you do this:

for x in /sys/block/sd[a-z] ; do
         echo 180  > $x/device/timeout
done

echo 4096 > /sys/block/md0/md/stripe_cache_size

Change md0 above to your md-device. This will increase your kernel 
timeouts and lessen the risk that drives will be considered dead when they 
are only having problems reading a block.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Help needed recovering from raid failure
@ 2015-04-27  9:35 Peter van Es
  2015-04-27 11:07 ` Mikael Abrahamsson
  2015-04-28 22:26 ` NeilBrown
  0 siblings, 2 replies; 7+ messages in thread
From: Peter van Es @ 2015-04-27  9:35 UTC (permalink / raw)
  To: linux-raid

Sorry for the long post...

I am running Ubuntu LTS 14.04.02 Server edition, 64 bits, with 4x 2.0TB drives in a raid-5 array.

The 4th drive was beginning to show read errors. Because it was the weekend, I could not go out
and buy a spare 2TB drive to replace the one that was beginning to fail.

I first got a fail event:

This is an automatically generated mail message from mdadm
running on bali

A Fail event had been detected on md device /dev/md/1.

It could be related to component device /dev/sdd2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid5 sdc2[2] sdb2[1] sda2[0] sdd2[3](F)
    5854290432 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

md0 : active raid5 sdc1[2] sdd1[3] sdb1[1] sda1[0]
    5850624 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>

And then subsequently, around 18 hours later:

This is an automatically generated mail message from mdadm
running on bali

A DegradedArray event had been detected on md device /dev/md/1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid5 sdc2[2] sdb2[1] sda2[0] sdd2[3](F)
    5854290432 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

md0 : active raid5 sdc1[2] sdd1[3] sdb1[1] sda1[0]
    5850624 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>

The server had taken the array off line at that point.

Needless to say, I can't boot the system anymore as the boot drive is /dev/md0, and GRUB can't
get at it. I do need to recover data (I know, but there's stuff on there I have no backup for--yet).

I booted Linux from a USB stick (which shows up as /dev/sdc1, hence the changed numbering),
in recovery mode. Below is the output of /proc/mdstat and
mdadm --examine. It looks like somehow the /dev/sdd2 and /dev/sde2 drives took on the
superblock of the /dev/md127 device (my swap array). Could that have been done by the boot from
the Ubuntu USB stick?

My plan... assemble a degraded array, with /dev/sde2 (the 4th drive, formerly known as /dev/sdd2) not in it.
Because the fail event put the file system in RO mode, I expect /dev/sdd2 (formerly /dev/sdc2) to be ok.
Then insert the new 2TB drive in slot 4 and let the system resync and recover.

I'm running xfs on the /dev/md1 device.

Questions:

1. Is this the wise course of action?
2. How exactly do I reassemble the array (/etc/mdadm.conf is inaccessible in recovery mode)?
3. What command-line options exactly do I use, based on the --examine output below, without screwing things up?

Any help or pointers gratefully accepted.

Peter van Es




/proc/mdstat (in recovery)

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10] 
md126 : inactive sdb2[1](S) sda2[0](S)
     3902861312 blocks super 1.2

md127 : active raid5 sde2[5](S) sde1[3] sdb1[1] sda1[0] sdd1[2] sdd2[4](S)
     5850624 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>

mdadm --examine /dev/sd[abde]2 


/dev/sda2:
         Magic : a92b4efc
       Version : 1.2
   Feature Map : 0x0
    Array UUID : 1f28f7bb:7b3ecd41:ca0fa5d1:ccd008df
          Name : ubuntu:1  (local to host ubuntu)
 Creation Time : Wed Apr  1 22:27:58 2015
    Raid Level : raid5
  Raid Devices : 4

Avail Dev Size : 3902861312 (1861.03 GiB 1998.26 GB)
    Array Size : 5854290432 (5583.09 GiB 5994.79 GB)
 Used Dev Size : 3902860288 (1861.03 GiB 1998.26 GB)
   Data Offset : 262144 sectors
  Super Offset : 8 sectors
         State : clean
   Device UUID : 713e556d:ca104217:785db68a:d820a57b

   Update Time : Sun Apr 26 05:59:13 2015
      Checksum : fda151f9 - correct
        Events : 18014

        Layout : left-symmetric
    Chunk Size : 512K

  Device Role : Active device 0
  Array State : AA.. ('A' == active, '.' == missing)

/dev/sdb2:
         Magic : a92b4efc
       Version : 1.2
   Feature Map : 0x0
    Array UUID : 1f28f7bb:7b3ecd41:ca0fa5d1:ccd008df
          Name : ubuntu:1  (local to host ubuntu)
 Creation Time : Wed Apr  1 22:27:58 2015
    Raid Level : raid5
  Raid Devices : 4

Avail Dev Size : 3902861312 (1861.03 GiB 1998.26 GB)
    Array Size : 5854290432 (5583.09 GiB 5994.79 GB)
 Used Dev Size : 3902860288 (1861.03 GiB 1998.26 GB)
   Data Offset : 262144 sectors
  Super Offset : 8 sectors
         State : clean
   Device UUID : f1e79609:79b7ac23:55197f70:e8fbfd58

   Update Time : Sun Apr 26 05:59:13 2015
      Checksum : 696f4e76 - correct
        Events : 18014

        Layout : left-symmetric
    Chunk Size : 512K

  Device Role : Active device 1
  Array State : AA.. ('A' == active, '.' == missing)

/dev/sdd2:
         Magic : a92b4efc
       Version : 1.2
   Feature Map : 0x0
    Array UUID : dbe238a3:c7a528c1:a1b78589:276ecfcf
          Name : ubuntu:0  (local to host ubuntu)
 Creation Time : Wed Apr  1 22:27:42 2015
    Raid Level : raid5
  Raid Devices : 4

Avail Dev Size : 3903121408 (1861.15 GiB 1998.40 GB)
    Array Size : 5850624 (5.58 GiB 5.99 GB)
 Used Dev Size : 3900416 (1904.82 MiB 1997.01 MB)
   Data Offset : 2048 sectors
  Super Offset : 8 sectors
         State : clean
   Device UUID : 0f3f2b91:09cbb344:e52c4c4b:722d65c4

   Update Time : Mon Apr 27 08:37:15 2015
      Checksum : 7e241855 - correct
        Events : 26

        Layout : left-symmetric
    Chunk Size : 512K

  Device Role : spare
  Array State : AAAA ('A' == active, '.' == missing)

/dev/sde2:
         Magic : a92b4efc
       Version : 1.2
   Feature Map : 0x0
    Array UUID : dbe238a3:c7a528c1:a1b78589:276ecfcf
          Name : ubuntu:0  (local to host ubuntu)
 Creation Time : Wed Apr  1 22:27:42 2015
    Raid Level : raid5
  Raid Devices : 4

Avail Dev Size : 3903121408 (1861.15 GiB 1998.40 GB)
    Array Size : 5850624 (5.58 GiB 5.99 GB)
 Used Dev Size : 3900416 (1904.82 MiB 1997.01 MB)
   Data Offset : 2048 sectors
  Super Offset : 8 sectors
         State : clean
   Device UUID : cdae3287:91168194:942ba99d:1a85c466

   Update Time : Mon Apr 27 08:37:15 2015
      Checksum : b8b529f3 - correct
        Events : 26

        Layout : left-symmetric
    Chunk Size : 512K

  Device Role : spare
  Array State : AAAA ('A' == active, '.' == missing)

^ permalink raw reply	[flat|nested] 7+ messages in thread
