* mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
@ 2014-01-11  6:42 Großkreutz, Julian
  2014-01-11 17:47 ` Phil Turmel
  0 siblings, 1 reply; 11+ messages in thread
From: Großkreutz, Julian @ 2014-01-11  6:42 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb

Dear all, dear Neil (thanks for pointing me to this list),

I am in desperate need of help. mdadm is fantastic work, and I have
relied on mdadm for years to run very stable server systems, never had
major problems I could not solve.

This time it's different:

On a Centos 6.x (can't remember) initially in 2012:

parted to create GPT partitions on 5 Seagate drives 3TB each

Model: ATA ST3000DM001-9YN1 (scsi)
Disk /dev/sda: 5860533168s  # sd[bcde] identical
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start     End          Size         File system  Name     Flags
1      2048s     1953791s     1951744s     ext4                  boot
2      1955840s  5860532223s  5858576384s               primary  raid

I used an unknown mdadm version including unknown offset parameters for
4k alignment to create

/dev/sd[abcde]1 as /dev/md0 raid 1 for booting (1 GB)
/dev/sd[abcde]2 as /dev/md1 raid 6 for data (9 TB) lvm physical drive
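
From memory, the create commands were roughly of this shape; the exact
mdadm version, metadata version and offset/alignment options are lost,
so treat this only as an illustration:

mdadm --create /dev/md0 --level=1 --raid-devices=5 /dev/sd[abcde]1
mdadm --create /dev/md1 --level=6 --raid-devices=5 /dev/sd[abcde]2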

Later added 3 more 3T identical Seagate drives with identical partition
layout, but later firmware.

Using likely a different newer version of mdadm I expanded RAID 6 by 2
drives and added 1 spare.
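
The expansion, as far as I remember, was along these lines (again, the
exact invocation is lost):

mdadm --add /dev/md1 /dev/sdf2 /dev/sdg2 /dev/sdh2
mdadm --grow /dev/md1 --raid-devices=7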

/dev/md1 was at 15 TB gross, 13 TB usable, expanded pv

Ran fine

Then I moved the 8 disks to a new server with an hba and backplane,
array did not start because mdadm did not find the superblocks on the
original 5 devices /dev/sd[abcde]2. Moving the disks back to the old
server the error did not vanish. Using a centos 6.3 livecd, I got the
following:

[root@livecd ~]# mdadm -Evvvvs /dev/sd[abcdefgh]2
mdadm: No md superblock detected on /dev/sda2.
mdadm: No md superblock detected on /dev/sdb2.
mdadm: No md superblock detected on /dev/sdc2.
mdadm: No md superblock detected on /dev/sdd2.
mdadm: No md superblock detected on /dev/sde2.

/dev/sdf2:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
               Name : 1
      Creation Time : Wed Jul 31 18:24:38 2013
         Raid Level : raid6
       Raid Devices : 7

     Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
         Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
      Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : active
        Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d

        Update Time : Mon Dec 16 01:16:26 2013
           Checksum : ee921c43 - correct
             Events : 327

             Layout : left-symmetric
         Chunk Size : 256K

      Device Role : Active device 5
      Array State : A.AAAAA ('A' == active, '.' == missing)

/dev/sdg2:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
               Name : 1
      Creation Time : Wed Jul 31 18:24:38 2013
         Raid Level : raid6
       Raid Devices : 7

     Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
         Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
      Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : active
        Device UUID : a1e1e51b:d8912985:e51207a9:1d718292

        Update Time : Mon Dec 16 01:16:26 2013
           Checksum : 4ef01fe9 - correct
             Events : 327

             Layout : left-symmetric
         Chunk Size : 256K

        Device Role : Active device 6
        Array State : A.AAAAA ('A' == active, '.' == missing)

/dev/sdh2:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
               Name : 1
      Creation Time : Wed Jul 31 18:24:38 2013
         Raid Level : raid6
       Raid Devices : 7

     Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
         Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
      Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : active
        Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1

        Update Time : Mon Dec 16 01:16:26 2013
           Checksum : a1330e97 - correct
             Events : 327

             Layout : left-symmetric
         Chunk Size : 256K

       Device Role : spare
       Array State : A.AAAAA ('A' == active, '.' == missing)


I suspect that the superblock of the original 5 devices is at a
different location, possibly because they were created with a different
mdadm version, i.e. at the end of the partitions. Booting the drives
with the hba in IT (non-raid) mode on the new server may have introduced
an initialization on the first five drives at the end of the partitions,
because I can hexdump something with "EFI PART" in the last 64 kB in all
8 partitions used for the raid 6, which may not have affected the 3
added drives which show metadata 1.2.
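
If I have the superblock layout right (0.90 sits at a 64-KiB-aligned spot
within the last 128 KiB of the device, 1.0 within the last ~12 KiB), then
dumping the last 128 KiB of each partition and grepping for the md magic
(a92b4efc, stored on disk as fc 4e 2b a9) should show whether anything is
still there, e.g. for sda2:

dd if=/dev/sda2 skip=$((5858576384 - 256)) count=256 | hexdump -C | grep "fc 4e 2b a9"

(5858576384 is the partition size in 512-byte sectors from parted, so the
last 256 sectors are the last 128 KiB.)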

If any of you can help me sort this, I would greatly appreciate it. I
guess I need the mdadm version where I can set the data offset
differently for each device, but it doesn't compile with an error in
sha1.c:

sha1.h:29:22: Fehler: ansidecl.h: Datei oder Verzeichnis nicht gefunden
(the German error says: ansidecl.h: No such file or directory)

What would be the best way to proceed? There is critical data on this
raid, not fully backed up.

(UPDATE)

Thanks for getting back.

Yes, it's bad, I know, also tweaking without keeping exact records of
versions and offsets.

I am, however, rather sure that nothing was written to the disks when I
plugged them into the NEW server, unless starting up a live cd causes an
automatic assemble attempt with an update to the superblocks. That I
cannot exclude.

What I did so far w/o writing to the disks

get non-00 data at the beginning of sda2:

dd if=/dev/sda skip=1955840 bs=512 count=10 | hexdump -C | grep [^00]

gives me

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  1e b5 54 51 20 4c 56 4d  32 20 78 5b 35 41 25 72  |..TQ LVM2 x[5A%r|
00001010  30 4e 2a 3e 01 00 00 00  00 10 00 00 00 00 00 00  |0N*>............|
00001020  00 00 02 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001200  76 67 5f 6e 65 64 69 67  73 30 32 20 7b 0a 69 64  |vg_nedigs02 {.id|
00001210  20 3d 20 22 32 4c 62 48  71 64 2d 72 67 42 74 2d  | = "2LbHqd-rgBt-|
00001220  45 4a 75 31 2d 32 52 36  31 2d 41 35 7a 74 2d 6e  |EJu1-2R61-A5zt-n|
00001230  49 58 53 2d 66 79 4f 36  33 73 22 0a 73 65 71 6e  |IXS-fyO63s".seqn|
00001240  6f 20 3d 20 37 0a 66 6f  72 6d 61 74 20 3d 20 22  |o = 7.format = "|
00001250  6c 76 6d 32 22 20 23 20  69 6e 66 6f 72 6d 61 74  |lvm2" # informat|
(cont'd)

but on /dev/sdb

00000000  5f 80 00 00 5f 80 01 00  5f 80 02 00 5f 80 03 00  |_..._..._..._...|
00000010  5f 80 04 00 5f 80 0c 00  5f 80 0d 00 00 00 00 00  |_..._..._.......|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  60 80 00 00 60 80 01 00  60 80 02 00 60 80 03 00  |`...`...`...`...|
00001010  60 80 04 00 60 80 0c 00  60 80 0d 00 00 00 00 00  |`...`...`.......|
00001020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001400

so my initial guess that the data may start at 00001000 did not pan out.

Does anybody have an idea of how to reliably identify an mdadm
superblock in a hexdump of the drive ?

And second, have I got my numbers right? In parted I see the block
count, and when I multiply 512 (not 4096!) by the total count I get 3
TB, so I think I have to use bs=512 in dd to get the partition
boundaries correct.

As for the last state: one drive was set faulty, apparently, but the
spare had not been integrated. I may have gotten caught in a bug
described by Neil Brown, where on shutdown disks were wrongly reported,
and subsequently superblock information was overwritten.

I don't have NAS/SAN storage space to make identical copies of 5x3 TB,
but maybe I should buy 5 more disks and do a dd mirror so I have a
backup of the current state.
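
For the copies I would probably use ddrescue rather than plain dd, so an
interrupted copy can be resumed (sdX stands for whatever name the target
disk gets):

ddrescue -f /dev/sda /dev/sdX /root/sda.map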

Again, any help / ideas welcome, especially building an mdadm version
with data-offset options ...

Julian



* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
  2014-01-11  6:42 mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock Großkreutz, Julian
@ 2014-01-11 17:47 ` Phil Turmel
       [not found]   ` <1389632980.11328.104.camel@achilles.aeskuladis.de>
  0 siblings, 1 reply; 11+ messages in thread
From: Phil Turmel @ 2014-01-11 17:47 UTC (permalink / raw)
  To: "Großkreutz, Julian", linux-raid; +Cc: neilb

Hi Julian,

Very good report.  I think we can help.

On 01/11/2014 01:42 AM, Großkreutz, Julian wrote:
> Dear all, dear Neil (thanks for pointing me to this list),
> 
> I am in desperate need of help. mdadm is fantastic work, and I have
> relied on mdadm for years to run very stable server systems, never had
> major problems I could not solve.
> 
> This time it's different:
> 
> On a Centos 6.x (can't remember) initially in 2012:
> 
> parted to create GPT partitions on 5 Seagate drives 3TB each
> 
> Model: ATA ST3000DM001-9YN1 (scsi)
> Disk /dev/sda: 5860533168s  # sd[bcde] identical
> Sector size (logical/physical): 512B/4096B
> Partition Table: gpt
> 
> Number  Start     End          Size         File system  Name     Flags
> 1      2048s     1953791s     1951744s     ext4                  boot
> 2      1955840s  5860532223s  5858576384s               primary  raid

Ok.

Please also show the partition tables for the /dev/sd[fgh].

> I used an unknown mdadm version including unknown offset parameters for
> 4k alignment to create
> 
> /dev/sd[abcde]1 as /dev/md0 raid 1 for booting (1 GB)
> /dev/sd[abcde]2 as /dev/md1 raid 6 for data (9 TB) lvm physical drive
> 
> Later added 3 more 3T identical Seagate drives with identical partition
> layout, but later firmware.
> 
> Using likely a different newer version of mdadm I expanded RAID 6 by 2
> drives and added 1 spare.
> 
> /dev/md1 was at 15 TB gross, 13 TB usable, expanded pv
> 
> Ran fine

Ok.  The evidence below suggests you created the
larger array from scratch instead of using --grow.  Do you remember?

> Then I moved the 8 disks to a new server with an hba and backplane,
> array did not start because mdadm did not find the superblocks on the
> original 5 devices /dev/sd[abcde]2. Moving the disks back to the old
> server the error did not vanish. Using a centos 6.3 livecd, I got the
> following:
> 
> [root@livecd ~]# mdadm -Evvvvs /dev/sd[abcdefgh]2
> mdadm: No md superblock detected on /dev/sda2.
> mdadm: No md superblock detected on /dev/sdb2.
> mdadm: No md superblock detected on /dev/sdc2.
> mdadm: No md superblock detected on /dev/sdd2.
> mdadm: No md superblock detected on /dev/sde2.
> 
> /dev/sdf2:
>               Magic : a92b4efc
>             Version : 1.2
>         Feature Map : 0x0
>          Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>                Name : 1
>       Creation Time : Wed Jul 31 18:24:38 2013

Note this creation time...  would have been 2012 if you had used --grow.

>          Raid Level : raid6
>        Raid Devices : 7
> 
>      Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>          Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>       Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)

This used dev size is very odd.  The unused space after the data area is
1155584 sectors (>500MiB).

>         Data Offset : 262144 sectors
>        Super Offset : 8 sectors
>               State : active
>         Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d
> 
>         Update Time : Mon Dec 16 01:16:26 2013
>            Checksum : ee921c43 - correct
>              Events : 327
> 
>              Layout : left-symmetric
>          Chunk Size : 256K
> 
>       Device Role : Active device 5
>       Array State : A.AAAAA ('A' == active, '.' == missing)
> 
> /dev/sdg2:
>               Magic : a92b4efc
>             Version : 1.2
>         Feature Map : 0x0
>          Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>                Name : 1
>       Creation Time : Wed Jul 31 18:24:38 2013
>          Raid Level : raid6
>        Raid Devices : 7
> 
>      Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>          Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>       Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>         Data Offset : 262144 sectors
>        Super Offset : 8 sectors
>               State : active
>         Device UUID : a1e1e51b:d8912985:e51207a9:1d718292
> 
>         Update Time : Mon Dec 16 01:16:26 2013
>            Checksum : 4ef01fe9 - correct
>              Events : 327
> 
>              Layout : left-symmetric
>          Chunk Size : 256K
> 
>         Device Role : Active device 6
>         Array State : A.AAAAA ('A' == active, '.' == missing)
> 
> /dev/sdh2:
>               Magic : a92b4efc
>             Version : 1.2
>         Feature Map : 0x0
>          Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>                Name : 1
>       Creation Time : Wed Jul 31 18:24:38 2013
>          Raid Level : raid6
>        Raid Devices : 7
> 
>      Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>          Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>       Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>         Data Offset : 262144 sectors
>        Super Offset : 8 sectors
>               State : active
>         Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1
> 
>         Update Time : Mon Dec 16 01:16:26 2013
>            Checksum : a1330e97 - correct
>              Events : 327
> 
>              Layout : left-symmetric
>          Chunk Size : 256K
> 
>        Device Role : spare
>        Array State : A.AAAAA ('A' == active, '.' == missing)
> 
> 
> I suspect that the superblock of the original 5 devices is at a
> different location, possibly because they were created with a different
> mdadm version, i.e. at the end of the partitions. Booting the drives
> with the hba in IT (non-raid) mode on the new server may have introduced
> an initialization on the first five drives at the end of the partitions,
> because I can hexdump something with "EFI PART" in the last 64 kB in all
> 8 partitions used for the raid 6, which may not have affected the 3
> added drives which show metadata 1.2.

The "EFI PART" is part of the backup copy of the GPT.  All the drives in
a working array will have the same metadata version (superblock
location) even if the data offsets are different.

I would suggest hexdumping entire devices looking for the MD superblock
magic value, which will always be at the start of a 4k-aligned block.

Show (will take a long time, even with the big block size):

for x in /dev/sd[a-e]2 ; do
    echo -e "\nDevice $x"
    dd if=$x bs=1M | hexdump -C | grep "000  fc 4e 2b a9"
done

For any candidates found, hexdump the whole 4k block for us.
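
If the grep prints a match at, say, offset 0000000012345000, something of
this shape will show the whole 4k block (prepend 0x, and keep the device
you were scanning):

dd if=/dev/sda2 bs=4096 skip=$((0x12345000 / 4096)) count=1 | hexdump -C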

> If any of you can help me sort this, I would greatly appreciate it. I
> guess I need the mdadm version where I can set the data offset
> differently for each device, but it doesn't compile with an error in
> sha1.c:
> 
> sha1.h:29:22: Fehler: ansidecl.h: Datei oder Verzeichnis nicht gefunden
> (the German error says: ansidecl.h: No such file or directory)

You probably need some *-dev packages.  I don't use the RHEL platform,
so I'm not sure what you'd need.  In the Ubuntu world, it'd be the
"build-essential" meta-package.

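On CentOS the rough equivalent would be something like this (ansidecl.h
itself ships in binutils-devel, if I remember right; "yum provides" will
tell you for sure):

yum provides '*/ansidecl.h'
yum groupinstall "Development Tools"
yum install binutils-devel
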
> What would be the best way to proceed? There is critical data on this
> raid, not fully backed up.
> 
> (UPDATE)
> 
> Thanks for getting back.
> 
> Yes, it's bad, I know, also tweaking without keeping exact records of
> versions and offsets.
> 
> I am, however, rather sure that nothing was written to the disks when I
> plugged them into the NEW server, unless starting up a live cd causes an
> automatic assemble attempt with an update to the superblocks. That I
> cannot exclude.
> 
> What I did so far w/o writing to the disks
> 
> get non-00 data at the beginning of sda2:
> 
> dd if=/dev/sda skip=1955840 bs=512 count=10 | hexdump -C | grep [^00]

FWIW, you could have combined "if=/dev/sda skip=1955840" into
"if=/dev/sda2" . . . :-)

> gives me
> 
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00001000  1e b5 54 51 20 4c 56 4d  32 20 78 5b 35 41 25 72  |..TQ LVM2 x[5A%r|
> 00001010  30 4e 2a 3e 01 00 00 00  00 10 00 00 00 00 00 00  |0N*>............|
> 00001020  00 00 02 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 00001030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00001200  76 67 5f 6e 65 64 69 67  73 30 32 20 7b 0a 69 64  |vg_nedigs02 {.id|
> 00001210  20 3d 20 22 32 4c 62 48  71 64 2d 72 67 42 74 2d  | = "2LbHqd-rgBt-|
> 00001220  45 4a 75 31 2d 32 52 36  31 2d 41 35 7a 74 2d 6e  |EJu1-2R61-A5zt-n|
> 00001230  49 58 53 2d 66 79 4f 36  33 73 22 0a 73 65 71 6e  |IXS-fyO63s".seqn|
> 00001240  6f 20 3d 20 37 0a 66 6f  72 6d 61 74 20 3d 20 22  |o = 7.format = "|
> 00001250  6c 76 6d 32 22 20 23 20  69 6e 66 6f 72 6d 61 74  |lvm2" # informat|
> (cont'd)

This implies that /dev/sda2 is the first device in a raid5/6 that uses
metadata 0.9 or 1.0.  You've found the LVM PV signature, which starts at
4k into a PV.  Theoretically, this could be a stray, abandoned signature
from the original array, with the real LVM signature at the 262144
offset.  Show:

dd if=/dev/sda2 skip=262144 count=16 |hexdump -C

> 
> but on /dev/sdb
> 
> 00000000  5f 80 00 00 5f 80 01 00  5f 80 02 00 5f 80 03 00  |_..._..._..._...|
> 00000010  5f 80 04 00 5f 80 0c 00  5f 80 0d 00 00 00 00 00  |_..._..._.......|
> 00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00001000  60 80 00 00 60 80 01 00  60 80 02 00 60 80 03 00  |`...`...`...`...|
> 00001010  60 80 04 00 60 80 0c 00  60 80 0d 00 00 00 00 00  |`...`...`.......|
> 00001020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00001400
> 
> so my initial guess that the data may start at 00001000 did not pan out.

No, but with parity raid scattering data amongst the participating
devices, the report on /dev/sdb2 is expected.

> Does anybody have an idea of how to reliably identify an mdadm
> superblock in a hexdump of the drive ?

Above.

> And second, have I got my numbers right? In parted I see the block
> count, and when I multiply 512 (not 4096!) by the total count I get 3
> TB, so I think I have to use bs=512 in dd to get the partition
> boundaries correct.

dd uses bs=512 as the default.  And it can access the partitions directly.
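
i.e. with the default bs=512 these two read exactly the same ten sectors:

dd if=/dev/sda  skip=1955840 count=10 | hexdump -C
dd if=/dev/sda2 count=10              | hexdump -C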

> As for the last state: one drive was set faulty, apparently, but the
> spare had not been integrated. I may have gotten caught in a bug
> described by Neil Brown, where on shutdown disks were wrongly reported,
> and subsequently superblock information was overwritten.

Possible.  If so, you may not find any superblocks with the grep above.

> I don't have NAS/SAN storage space to make identical copies of 5x3 TB,
> but maybe I should buy 5 more disks and do a dd mirror so I have a
> backup of the current state.

We can do some more non-destructive investigation first.

Regards,

Phil


* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
       [not found]   ` <1389632980.11328.104.camel@achilles.aeskuladis.de>
@ 2014-01-13 18:42     ` Phil Turmel
  2014-01-13 20:11       ` Chris Murphy
  2014-01-14 10:31       ` Großkreutz, Julian
  0 siblings, 2 replies; 11+ messages in thread
From: Phil Turmel @ 2014-01-13 18:42 UTC (permalink / raw)
  To: "Großkreutz, Julian", linux-raid; +Cc: neilb

Hi Julian,

[Note, your reply didn't make it to linux-raid due to size.  I believe
the limit is 150k ~ 200k.]

On 01/13/2014 12:09 PM, Großkreutz, Julian wrote:
> Hi Phil,
> 
> thanks for getting back so quickly
> 
>>> Model: ATA ST3000DM001-9YN1 (scsi)

Aside: This model looks familiar.  I'm pretty sure these drives are
desktop models that lack scterc support.  Meaning they are *not*
generally suitable for raid duty.  Search the archives for combinations
of "timeout mismatch", "scterc", "URE", and "scrub" for a full
explanation.  If I've guessed correctly, you *must* use the driver
timeout work-around before proceeding.

[trim /]

> I noticed one difference: part 1 is one sector longer than
> on /dev/sd[abcde], but part 2 starts at the same sector in all 8 drives
> and has the same length in all 8 drives. I usually leave 800-1000
> sectors unallocated at the end. As previously mentioned the first 5
> drives are older, the last three newer (the drive with the oldest
> firmware is sdb, which has (coincidence?) gone missing according to sd[fgh]).

Some versions of parted/fdisk will get you that extra sector, wasting a
megabyte or more between partitions.  Not relevant here, I don't think.
 The partitions we need are all consistent.

[trim /]

>> Ok.  The evidence below suggests you created the
>> larger array from scratch instead of using --grow.  Do you remember?
>>
> I seem to recall that building the initial 5 disk raid 6 was difficult,
> and I think I needed a custom compiled mdadm version (now residing on
> the inaccessible raid) which allowed me to align the offsets and
> optimize performance which was otherwise abysmal. I may have chosen 1.0
> superblock. Extending the raid was difficult as well, but I don't recall
> recreating it from scratch. Maybe I tried once using standard settings,
> didn't work, and then used the "custom" mdadm with offsets on the new
> drives as well. Sadly I can't remember. The existing superblock 1.2
> on /dev/sd[fgh] seems standard in data offset and superblock offset.

I don't think you can run an array with mixed superblock locations, so
I'm now concerned that the partitions on /dev/sd[a-e] aren't correct.
Instead of attempting to find the superblock signature, I think we
should first try to find the LVM2 signature.

>> Note this creation time...  would have been 2012 if you had used --grow.
>>
> Don't pin me down on 2012, but surely the original set of five was not
> created in July 2013; my third child was born on the 8th. By then this raid
> was up and served as an extra mirror archive.

But you could have backed up and re-created from scratch *after*.  It
does say July 31.

>> This used dev size is very odd.  The unused space after the data area is
>> 1155584 sectors (>500MiB).
> 
> Possibly the result of my fiddling with a custom mdadm and offsets to
> begin with? I presume I could not have set this manually.

No, I don't think so.

[trim /]

>> I would suggest hexdumping entire devices looking for the MD superblock
>> magic value, which will always be at the start of a 4k-aligned block.
>>
>> Show (will take a long time, even with the big block size):
>>
>> for x in /dev/sd[a-e]2 ; do
>>     echo -e "\nDevice $x"
>>     dd if=$x bs=1M | hexdump -C | grep "000  fc 4e 2b a9"
>> done
>>
> 
> I started it, but this old dual Xeon puts 1.2 MB/s through the hexdump
> thread if data is not zero -> it will take approx. 20 days!
> 
> For now: the last 2.8 GB of all 8 drives did not show the signature:
> 
> [root@livecd ~]# for x in /dev/sd[a-h]; do echo -e "\nDevice $x"; dd if=$x skip=5855000000 count=100000000 |hexdump -C |grep "000  fc 4e 2b a9"; done

Don't bother with this now.

> So attached you will find hexdumps of 64k of /dev/sd[a-h]2 at sector 0
> and 262144 which shows the superblock 1.2 on sd[fgh]2, not on sd[a-e]2,
> but may help to identify data_offset; I suspect it is 2048 on sd[a-e]2
> and 262144 on sd[fgh]2.
>

Jackpot!  LVM2 embedded backup data at the correct location for mdadm
data offset == 262144.  And on /dev/sda2, which is the only device that
should have it (first device in the raid).

From /dev/sda2 @ 262144:

> 00001200  76 67 5f 6e 65 64 69 67  73 30 32 20 5d 0a 69 64  |vg_nedigs02 ].id|
> 00001210  20 3d 20 22 32 4c 62 48  71 64 2d 72 67 42 9f 6e  | = "2LbHqd-rgB.n|
> 00001220  45 4a 75 31 2d 32 52 36  31 2d 41 35 f5 75 2d 6e  |EJu1-2R61-A5.u-n|
> 00001230  49 58 53 2d 66 79 4f 36  33 73 22 0a 73 65 3a 01  |IXS-fyO63s".se:.|
> 00001240  6f 20 3d 20 33 36 0a 66  6f 72 6d 61 ca 24 3d 20  |o = 36.forma.$= |
> 00001250  22 6c 76 6d 32 22 20 23  20 69 6e 66 6f 72 6b ac  |"lvm2" # infork.|
> 00001260  74 69 6f 6e 61 6c 0a 73  74 61 74 75 ee 22 3d 20  |tional.statu."= |
> 00001270  5b 22 52 45 53 49 5a 45  41 42 4c 45 22 2c 3e c0  |["RESIZEABLE",>.|
> 00001280  52 45 41 44 22 2c 20 22  57 52 49 54 d0 27 5d 0a  |READ", "WRIT.'].|
> 00001290  66 6c 61 67 73 20 3d 20  5b 5d 0a 65 78 74 4b df  |flags = [].extK.|
> 000012a0  74 5f 73 69 7a 65 20 3d  20 38 31 39 3e 08 6d 61  |t_size = 819>.ma|
> 000012b0  78 5f 6c 76 20 3d 20 30  0a 6d 61 78 5f 70 14 13  |x_lv = 0.max_p..|
> 000012c0  3d 20 30 0a 6d 65 74 61  64 61 74 61 b3 63 6f 70  |= 0.metadata.cop|
> 000012d0  69 65 73 20 3d 20 30 0a  0a 70 68 79 73 69 97 c4  |ies = 0..physi..|
> 000012e0  6c 5f 76 6f 6c 75 6d 65  73 20 7b 0a 2e 78 76 30  |l_volumes {..xv0|
> 000012f0  20 7b 0a 69 64 20 3d 20  22 50 4a 48 4c 67 bf 14  | {.id = "PJHLg..|
> 00001300  53 70 56 70 2d 47 55 71  34 2d 6b 4a 57 7f 2d 39  |SpVp-GUq4-kJW.-9|
> 00001310  6d 74 4b 2d 31 6c 65 4a  2d 73 36 64 39 6a d8 1b  |mtK-1leJ-s6d9j..|
> 00001320  0a 64 65 76 69 63 65 20  3d 20 22 2f 79 6c 76 2f  |.device = "/ylv/|
> 00001330  73 64 66 32 22 0a 0a 73  74 61 74 75 73 20 7d 18  |sdf2"..status }.|
> 00001340  5b 22 41 4c 4c 4f 43 41  54 41 42 4c df 25 5d 0a  |["ALLOCATABL.%].|
> 00001350  66 6c 61 67 73 20 3d 20  5b 5d 0a 64 65 76 e3 b5  |flags = [].dev..|
> 00001360  69 7a 65 20 3d 20 31 30  32 34 30 30 ce 33 30 0a  |ize = 102400.30.|
> 00001370  70 65 5f 73 74 61 72 74  20 3d 20 32 30 34 99 22  |pe_start = 204."|
> 00001380  70 65 5f 63 6f 75 6e 74  20 3d 20 31 cd 33 39 39  |pe_count = 1.399|
> 00001390  0a 7d 0a 0a 70 76 31 20  7b 0a 69 64 20 3d 92 37  |.}..pv1 {.id =.7|
> 000013a0  44 39 7a 75 70 37 2d 6a  76 79 46 2d 6b 32 73 42  |D9zup7-jvyF-k2sB|
> 000013b0  2d 42 75 59 30 2d 39 74  73 61 2d 41 78 68 11 86  |-BuY0-9tsa-Axh..|
> 000013c0  34 45 51 48 4e 71 22 0a  64 65 76 69 c0 61 20 3d  |4EQHNq".devi.a =|
> 000013d0  20 22 2f 64 65 76 2f 6d  64 31 22 0a 0a 73 e4 c6  | "/dev/md1"..s..|
> 000013e0  74 75 73 20 3d 20 5b 22  41 4c 4c 4f db 41 54 41  |tus = ["ALLO.ATA|
> 000013f0  42 4c 45 22 5d 0a 66 6c  61 67 73 20 3d 20 f4 12  |BLE"].flags = ..|
> 00001400  0a 64 65 76 5f 73 69 7a  65 20 3d 20 14 39 32 38  |.dev_size = .928|
> 00001410  35 37 39 33 32 38 30 0a  70 65 5f 73 74 61 99 37  |5793280.pe_sta.7|
> 00001420  20 3d 20 35 31 32 0a 70  65 5f 63 6f 4d 6d 74 20  | = 512.pe_coMmt |
> 00001430  3d 20 33 35 37 34 39 32  35 0a 7d 0a 7d 0a 77 f1  |= 3574925.}.}.w.|
> 00001440  6f 67 69 63 61 6c 5f 76  6f 6c 75 6d 9c 7d 20 7b  |ogical_volum.} {|
> 00001450  0a 0a 6c 76 5f 76 61 72  20 7b 0a 69 64 20 b9 ee  |..lv_var {.id ..|
> 00001460  22 5a 4a 47 56 55 4d 2d  4d 70 76 50 a8 7a 6f 49  |"ZJGVUM-MpvP.zoI|
> 00001470  39 2d 68 31 39 47 2d 57  70 75 6d 2d 4e 4b ee d5  |9-h19G-Wpum-NK..|
> 00001480  2d 4a 77 34 32 31 59 22  0a 73 74 61 f4 70 73 20  |-Jw421Y".sta.ps |
> 00001490  3d 20 5b 22 52 45 41 44  22 2c 20 22 57 52 73 ed  |= ["READ", "WRs.|
> 000014a0  45 22 2c 20 22 56 49 53  49 42 4c 45 86 5a 0a 66  |E", "VISIBLE.Z.f|
> 000014b0  6c 61 67 73 20 3d 20 5b  5d 0a 73 65 67 6d b4 4c  |lags = [].segm.L|
> 000014c0  74 5f 63 6f 75 6e 74 20  3d 20 31 0a d1 76 65 67  |t_count = 1..veg|
> 000014d0  6d 65 6e 74 31 20 7b 0a  73 74 61 72 74 5f 3c f6  |ment1 {.start_<.|
> 000014e0  74 65 6e 74 20 3d 20 30  0a 65 78 74 9a 68 74 5f  |tent = 0.ext.ht_|
> 000014f0  63 6f 75 6e 74 20 3d 20  31 32 35 30 0a 0a 97 fc  |count = 1250....|
> 00001500  70 65 20 3d 20 22 73 74  72 69 70 65 a4 23 0a 73  |pe = "stripe.#.s|
> 00001510  74 72 69 70 65 5f 63 6f  75 6e 74 20 3d 20 5c a5  |tripe_count = \.|
> 00001520  23 20 6c 69 6e 65 61 72  0a 0a 73 74 75 69 70 65  |# linear..stuipe|
> 00001530  73 20 3d 20 5b 0a 22 70  76 30 22 2c 20 34 88 4c  |s = [."pv0", 4.L|
> 00001540  39 0a 5d 0a 7d 0a 7d 0a  0a 6c 76 5f b5 6e 6f 74  |9.].}.}..lv_.not|
> 00001550  20 7b 0a 69 64 20 3d 20  22 4c 48 58 57 4f 97 f4  | {.id = "LHXWO..|
> 00001560  47 30 6f 63 2d 62 4a 54  31 2d 49 6e 5d 36 2d 36  |G0oc-bJT1-In]6-6|
> 00001570  46 39 58 2d 7a 76 4b 50  2d 53 68 73 74 66 b7 69  |F9X-zvKP-Shstf.i|
> 00001580  0a 73 74 61 74 75 73 20  3d 20 5b 22 0b 42 41 44  |.status = [".BAD|
> 00001590  22 2c 20 22 57 52 49 54  45 22 2c 20 22 56 39 ed  |", "WRITE", "V9.|
> 000015a0  49 42 4c 45 22 5d 0a 66  6c 61 67 73 ef 3d 20 5b  |IBLE"].flags.= [|
> 000015b0  5d 0a 73 65 67 6d 65 6e  74 5f 63 6f 75 6e 7b 0b  |].segment_coun{.|
> 000015c0  3d 20 31 0a 0a 73 65 67  6d 65 6e 74 4c 27 7b 0a  |= 1..segmentL'{.|
> 000015d0  73 74 61 72 74 5f 65 78  74 65 6e 74 20 3d 1a 75  |start_extent =.u|
> 000015e0  0a 65 78 74 65 6e 74 5f  63 6f 75 6e ae 22 3d 20  |.extent_coun."= |
> 000015f0  32 35 30 30 0a 0a 74 79  70 65 20 3d 20 22 c9 37  |2500..type = ".7|
> 00001600  72 69 70 65 64 22 0a 73  74 72 69 70 77 50 63 6f  |riped".stripwPco|
> 00001610  75 6e 74 20 3d 20 31 09  23 20 6c 69 6e 65 f1 fc  |unt = 1.# line..|
> 00001620  0a 0a 73 74 72 69 70 65  73 20 3d 20 24 0b 22 70  |..stripes = $."p|
> 00001630  76 30 22 2c 20 32 34 39  39 0a 5d 0a 7d 0a 05 56  |v0", 2499.].}..V|
> 00001640  0a 6c 76 5f 68 6f 6d 65  20 7b 0a 69 26 22 3d 20  |.lv_home {.i&"= |
> 00001650  22 76 48 4a 37 4d 34 2d  74 74 77 4f 2d 46 71 7d  |"vHJ7M4-ttwO-Fq}|
> 00001660  6e 2d 72 35 67 71 2d 74  44 48 74 2d 38 49 64 37  |n-r5gq-tDHt-8Id7|
> 00001670  2d 54 56 74 52 6f 36 22  0a 73 74 61 74 75 ff 91  |-TVtRo6".statu..|
> 00001680  3d 20 5b 22 52 45 41 44  22 2c 20 22 9a 54 49 54  |= ["READ", ".TIT|
> 00001690  45 22 2c 20 22 56 49 53  49 42 4c 45 22 5d 47 54  |E", "VISIBLE"]GT|
> 000016a0  6c 61 67 73 20 3d 20 5b  5d 0a 73 65 e6 6b 65 6e  |lags = [].se.ken|
> 000016b0  74 5f 63 6f 75 6e 74 20  3d 20 31 0a 0a 73 fe d2  |t_count = 1..s..|
> 000016c0  6d 65 6e 74 31 20 7b 0a  73 74 61 72 3e 50 65 78  |ment1 {.star>Pex|
> 000016d0  74 65 6e 74 20 3d 20 30  0a 65 78 74 65 6e 77 a2  |tent = 0.extenw.|
> 000016e0  63 6f 75 6e 74 20 3d 20  32 35 30 30 13 0a 74 79  |count = 2500..ty|
> 000016f0  70 65 20 3d 20 22 73 74  72 69 70 65 64 22 dd 28  |pe = "striped".(|
> 00001700  74 72 69 70 65 5f 63 6f  75 6e 74 20 1e 22 31 09  |tripe_count ."1.|
> 00001710  23 20 6c 69 6e 65 61 72  0a 0a 73 74 72 69 2a 8b  |# linear..stri*.|
> 00001720  73 20 3d 20 5b 0a 22 70  76 30 22 2c 1c 35 32 34  |s = [."pv0",.524|
> 00001730  39 0a 5d 0a 7d 0a 7d 0a  0a 6c 76 5f 73 77 5d dc  |9.].}.}..lv_sw].|
> 00001740  20 7b 0a 69 64 20 3d 20  22 58 6f 36 e6 7a 36 2d  | {.id = "Xo6.z6-|
> 00001750  39 62 61 38 2d 49 54 53  73 2d 57 63 61 78 ba 6f  |9ba8-ITSs-Wcax.o|
> 00001760  73 42 52 2d 6e 48 65 61  2d 65 44 45 63 61 33 22  |sBR-nHea-eDEca3"|
> 00001770  0a 73 74 61 74 75 73 20  3d 20 5b 22 52 45 08 4f  |.status = ["RE.O|
> 00001780  22 2c 20 22 57 52 49 54  45 22 2c 20 ec 50 49 53  |", "WRITE", .PIS|
> 00001790  49 42 4c 45 22 5d 0a 66  6c 61 67 73 20 3d 9d 2d  |IBLE"].flags =.-|
> 000017a0  5d 0a 73 65 67 6d 65 6e  74 5f 63 6f 04 6a 74 20  |].segment_co.jt |
> 000017b0  3d 20 31 0a 0a 73 65 67  6d 65 6e 74 31 20 72 ec  |= 1..segment1 r.|
> 000017c0  73 74 61 72 74 5f 65 78  74 65 6e 74 7b 3d 20 30  |start_extent{= 0|
> 000017d0  0a 65 78 74 65 6e 74 5f  63 6f 75 6e 74 20 5e 17  |.extent_count ^.|
> 000017e0  32 34 39 39 0a 0a 74 79  70 65 20 3d f7 21 73 74  |2499..type =.!st|
> 000017f0  72 69 70 65 64 22 0a 73  74 72 69 70 65 5f 1a 13  |riped".stripe_..|
> 00001800  75 6e 74 20 3d 20 31 09  23 20 6c 69 51 65 61 72  |unt = 1.# liQear|
> 00001810  0a 0a 73 74 72 69 70 65  73 20 3d 20 5b 0a 1e 68  |..stripes = [..h|
> 00001820  76 30 22 2c 20 30 0a 5d  0a 7d 0a 7d 0a 0a 6c 76  |v0", 0.].}.}..lv|
> 00001830  5f 74 6d 70 20 7b 0a 69  64 20 3d 20 22 6b 66 55  |_tmp {.id = "kfU|
> 00001840  76 49 50 2d 55 4f 56 50  2d 53 67 61 24 2a 55 71  |vIP-UOVP-Sga$*Uq|
> 00001850  49 4f 2d 56 36 32 6f 2d  33 56 58 47 2d 52 7e 09  |IO-V62o-3VXG-R~.|
> 00001860  67 6b 75 22 0a 73 74 61  74 75 73 20 c2 27 5b 22  |gku".status .'["|
> 00001870  52 45 41 44 22 2c 20 22  57 52 49 54 45 22 9e 35  |READ", "WRITE".5|
> 00001880  22 56 49 53 49 42 4c 45  22 5d 0a 66 37 61 67 73  |"VISIBLE"].f7ags|
> 00001890  20 3d 20 5b 5d 0a 73 65  67 6d 65 6e 74 5f 00 58  | = [].segment_.X|
> 000018a0  75 6e 74 20 3d 20 31 0a  0a 73 65 67 80 61 6e 74  |unt = 1..seg.ant|
> 000018b0  31 20 7b 0a 73 74 61 72  74 5f 65 78 74 65 89 ec  |1 {.start_exte..|
> 000018c0  20 3d 20 30 0a 65 78 74  65 6e 74 5f 2a 6c 75 6e  | = 0.extent_*lun|
> 000018d0  74 20 3d 20 32 35 30 30  0a 0a 74 79 70 65 16 87  |t = 2500..type..|
> 000018e0  20 22 73 74 72 69 70 65  64 22 0a 73 40 77 69 70  | "striped".s@wip|
> 000018f0  65 5f 63 6f 75 6e 74 20  3d 20 31 09 23 20 31 06  |e_count = 1.# 1.|
> 00001900  6e 65 61 72 0a 0a 73 74  72 69 70 65 cc 25 3d 20  |near..stripe.%= |
> 00001910  5b 0a 22 70 76 30 22 2c  20 38 37 34 39 0a 5b ab  |[."pv0", 8749.[.|
> 00001920  7d 0a 7d 0a 7d 0a 7d 0a  23 20 47 65 b1 66 72 61  |}.}.}.}.# Ge.fra|
> 00001930  74 65 64 20 62 79 20 4c  56 4d 32 20 76 65 89 6d  |ted by LVM2 ve.m|
> 00001940  69 6f 6e 20 32 2e 30 32  2e 39 38 28 ff 2f 2d 52  |ion 2.02.98(./-R|
> 00001950  48 45 4c 36 20 28 32 30  31 32 2d 31 30 2d 7c 07  |HEL6 (2012-10-|.|
> 00001960  29 3a 20 57 65 64 20 4a  75 6c 20 33 14 22 31 38  |): Wed Jul 3."18|
> 00001970  3a 32 36 3a 31 39 20 32  30 31 33 0a 0a 63 d5 09  |:26:19 2013..c..|
> 00001980  74 65 6e 74 73 20 3d 20  22 54 65 78 69 21 46 6f  |tents = "Texi!Fo|
> 00001990  72 6d 61 74 20 56 6f 6c  75 6d 65 20 47 72 a8 8f  |rmat Volume Gr..|
> 000019a0  70 22 0a 76 65 72 73 69  6f 6e 20 3d 20 31 0a 0a  |p".version = 1..|
> 000019b0  64 65 73 63 72 69 70 74  69 6f 6e 20 3d 20 22 22  |description = ""|
> 000019c0  0a 0a 63 72 65 61 74 69  6f 6e 5f 68 f7 73 74 20  |..creation_h.st |
> 000019d0  3d 20 22 6e 65 64 69 67  73 33 30 2e 6e 65 cb 26  |= "nedigs30.ne.&|
> 000019e0  67 2e 61 65 73 6b 75 6c  61 64 69 73 2e 6c 6f 63  |g.aeskuladis.loc|
> 000019f0  61 6c 22 09 23 20 4c 69  6e 75 78 20 6e 65 64 69  |al".# Linux nedi|
> 00001a00  67 73 33 30 2e 6e 65 64  69 67 2e 61 65 73 6b 75  |gs30.nedig.aesku|
> 00001a10  6c 61 64 69 73 2e 6c 6f  63 61 6c 20 32 2e 36 2e  |ladis.local 2.6.|
> 00001a20  33 32 2d 33 35 38 2e 36  2e 31 2e 65 93 35 2e 78  |32-358.6.1.e.5.x|
> 00001a30  38 36 5f 36 34 20 23 31  20 53 4d 50 20 54 85 b1  |86_64 #1 SMP T..|
> 00001a40  20 41 70 72 20 32 33 20  31 39 3a 32 76 3a 30 30  | Apr 23 19:2v:00|
> 00001a50  20 55 54 43 20 32 30 31  33 20 78 38 36 5f 10 f7  | UTC 2013 x86_..|
> 00001a60  0a 63 72 65 61 74 69 6f  6e 5f 74 69 71 61 20 3d  |.creation_tiqa =|
> 00001a70  20 31 33 37 35 32 38 37  39 37 39 09 23 20 d2 32  | 1375287979.# .2|
> 00001a80  64 20 4a 75 6c 20 33 31  20 31 38 3a af 37 3a 31  |d Jul 31 18:.7:1|
> 00001a90  39 20 32 30 31 33 0a 0a  00 00 00 00 00 00 ee 12  |9 2013..........|

Note the creation date/time at the end (with a corrupted byte):

Jul 31 18:?7:19 2013

There are other corrupted bytes scattered around.  I'd be worried about
the RAM in this machine.  Since you are using non-enterprise drives, I'm
going to go out on a limb here and guess that the server doesn't have
ECC ram...

Part of the signature that should have shown up at 00001000 is missing,
too.

Consider performing an extended memcheck run to see what's going on.
Maybe move the entire stack of disks to another server.

>>> 00001200  76 67 5f 6e 65 64 69 67  73 30 32 20 7b 0a 69 64  |vg_nedigs02
>>> {.id|
>>> 00001210  20 3d 20 22 32 4c 62 48  71 64 2d 72 67 42 74 2d  | =
>>> "2LbHqd-rgBt-|
>>> 00001220  45 4a 75 31 2d 32 52 36  31 2d 41 35 7a 74 2d 6e  |
>>> EJu1-2R61-A5zt-n|
>>> 00001230  49 58 53 2d 66 79 4f 36  33 73 22 0a 73 65 71 6e  |
>>> IXS-fyO63s".seqn|
>>> 00001240  6f 20 3d 20 37 0a 66 6f  72 6d 61 74 20 3d 20 22  |o =
>>> 7.format = "|
>>> 00001250  6c 76 6d 32 22 20 23 20  69 6e 66 6f 72 6d 61 74  |lvm2" #
>>> informat|
>>> (cont'd)
>>
>> This implies that /dev/sda2 is the first device in a raid5/6 that uses
>> metadata 0.9 or 1.0.  You've found the LVM PV signature, which starts at
>> 4k into a PV.  Theoretically, this could be a stray, abandoned signature
>> from the original array, with the real LVM signature at the 262144
>> offset.  Show:

This certainly was a stray LVM2 signature from a version 1.0 metadata
array.  It matches the new location, if you allow for the scattered
corrupted bytes.  Even the same UUID, suggesting you did a vgcfgbackup
and vgcfgrestore sequence.

[trim /]

>> No, but with parity raid scattering data amongst the participating
>> devices, the report on /dev/sdb2 is expected.
>>
>>> As for the last state: one drive was set faulty, apparently, but the
>>> spare had not been integrated. I may have gotten caught in a bug
>>> described by Neil Brown, where on shutdown disks were wrongly reported,
>>> and subsequently superblock information was overwritten.
>>
>> Possible.  If so, you may not find any superblocks with the grep above.

With memory corruption, all kinds of weird behavior is possible.

> In all, I think I lost all superblock information on sd[a-e]2, possibly
> when I extended the raid set; superblock 1.2 could not be written to
> 262144 on sd[a-e]2 because data started at 2048, so no place to put the
> superblocks.
> 
> I would proceed to try a non-destructive assembly of the raid (i.e.
> read-only through a loop device for each drive) with the freshly
> compiled mdadm_offset with /dev/sd[a-e]2:2048 and /dev/sd[f-h]2:262144.
> Make sense?

Based on the signature discovered above, we should be able to --create
--assume-clean with the modern default data offset.  We know the
following device roles:

/dev/sda2 == 0
/dev/sdf2 == 5
/dev/sdg2 == 6
/dev/sdh2 == spare

So /dev/sdh2 should be left out until the array is working.
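
For later reference, the re-create will have roughly this shape (X, Y, Z,
W stand for the still-unknown order of b-e; run it only against copies or
overlays).  The first thing to verify afterwards is that the new
superblocks match the old geometry:

mdadm --create /dev/md1 --assume-clean --metadata=1.2 --level=6 \
      --raid-devices=7 --chunk=256 --layout=left-symmetric \
      /dev/sda2 /dev/sdX2 /dev/sdY2 /dev/sdZ2 /dev/sdW2 /dev/sdf2 /dev/sdg2
mdadm -E /dev/sdf2 | grep -E 'Data Offset|Chunk|Layout'
      # must still show 262144 sectors / 256K / left-symmetric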

Please re-execute the "mdadm -E" reports for /dev/sd[fgh]2 and show them
uncut.  (Use the lasted mdadm.)  That should fill in the likely device
order of the remaining drives.

Also, it is important that you document which drive serial numbers are
currently occupying the different device names.  An excerpt from "ls -l
/dev/disk/by-id/" would do.

I have to admit that I'm very concerned about your corrupted LVM
signature at offset 262144.  LVM probably won't recognize your PV once
the array is assembled correctly, making it difficult to
non-destructively test the filesystems on your logical volumes.  You may
have to duplicate your disks onto new ones so that an LVM restore can be
safely attempted.
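
For reference, the LVM restore itself, once an assembly candidate is up,
would look roughly like this, assuming the configuration backup from the
old installation is still available at its usual default path (the VG and
LV names are taken from the hexdump above):

vgcfgrestore -f /etc/lvm/backup/vg_nedigs02 vg_nedigs02
vgchange -ay vg_nedigs02
fsck -n /dev/vg_nedigs02/lv_home    # read-only check; repeat for the other LVs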

Do *not* buy desktop drives!  You need raid-capable drives like the WD
Red at the least.

Phil


* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
  2014-01-13 18:42     ` Phil Turmel
@ 2014-01-13 20:11       ` Chris Murphy
  2014-01-14 10:31       ` Großkreutz, Julian
  1 sibling, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2014-01-13 20:11 UTC (permalink / raw)
  To: linux-raid


On Jan 13, 2014, at 11:42 AM, Phil Turmel <philip@turmel.org> wrote:

> Do *not* buy desktop drives!  You need raid-capable drives like the WD
> Red at the least.


Yeah I agree. If you care about the data, suck it up and use the right drive.

Very slight threadjack here: WD has a Caviar and Scorpio Blue, and all of the models I've seen, both desktop and laptop interestingly enough, have SCT ERC support. They are "tested and recommended" for raid0/raid1 only. WD says they are not warranted for use in (among other things) "multi-bay chassis" even though they don't list raid5 by name. I think the question is whether vibration is a concern with this class of drive, which then points a typical user to the Caviar/Scorpio Black which has the same recommendation and proscription as the Blue, *but* at least one desktop and one laptop Black model I have, do not support SCT ERC.

So it's almost like, spec-wise, the Red has the vibration tolerance of the Black, but the SCT ERC support the Blue has. It's just odd they do it this way though. And finding out what drives have SCT ERC support is non-obvious.

Chris Murphy



* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
  2014-01-13 18:42     ` Phil Turmel
  2014-01-13 20:11       ` Chris Murphy
@ 2014-01-14 10:31       ` Großkreutz, Julian
  2014-01-14 13:14         ` Phil Turmel
  1 sibling, 1 reply; 11+ messages in thread
From: Großkreutz, Julian @ 2014-01-14 10:31 UTC (permalink / raw)
  To: Phil Turmel, linux-raid; +Cc: neilb

Hi Phil,

thanks again for bearing with me.

> >
> >>> Model: ATA ST3000DM001-9YN1 (scsi)
>
> Aside: This model looks familiar.  I'm pretty sure these drives are
> desktop models that lack scterc support.  Meaning they are *not*
> generally suitable for raid duty.  Search the archives for combinations
> of "timeout mismatch", "scterc", "URE", and "scrub" for a full
> explanation.  If I've guessed correctly, you *must* use the driver
> timeout work-around before proceeding.
>

Yes I did, and smartctl showed no significant problems. The 10-year-old
server (supermicro enterprise grade dual Xeon with 8 GB ECC RAM) had
started to create problems early January, which is why I wanted to move
the drives to a new server in the first place, to then transfer the data
to a new set of enterprise grade disks. I had checked the memory and the
disks in a burn-in for several days, including timeouts and power saving,
before I set up the raid in 2012/2013, and did not have any issues then.

One of the reasons I tend to use mdadm is that I am able to utilize
existing hardware to create bridging solutions until money comes in for
better hardware, and moving an mdadm raid has so far never created a
serious problem.

> > So attached you will find hexdumps of 64k of /dev/sd[a-h]2 at sector 0
> > and 262144 which shows the superblock 1.2 on sd[fgh]2, not on sd[a-e]2,
> > but may help to identify data_offset; I suspect it is 2048 on sd[a-e]2
> > and 262144 on sd[fgh]2.
> >
>
> Jackpot!  LVM2 embedded backup data at the correct location for mdadm
> data offset == 262144.  And on /dev/sda2, which is the only device that
> should have it (first device in the raid).
>
> From /dev/sda2 @ 262144:
>
> > 00001200  76 67 5f 6e 65 64 69 67  73 30 32 20 5d 0a 69 64  |vg_nedigs02 ].id|
> > 00001210  20 3d 20 22 32 4c 62 48  71 64 2d 72 67 42 9f 6e  | = "2LbHqd-rgB.n|
> > 00001220  45 4a 75 31 2d 32 52 36  31 2d 41 35 f5 75 2d 6e  |EJu1-2R61-A5.u-n|
> > 00001230  49 58 53 2d 66 79 4f 36  33 73 22 0a 73 65 3a 01  |IXS-fyO63s".se:.|
> > 00001240  6f 20 3d 20 33 36 0a 66  6f 72 6d 61 ca 24 3d 20  |o = 36.forma.$= |
> > 00001250  22 6c 76 6d 32 22 20 23  20 69 6e 66 6f 72 6b ac  |"lvm2" # infork.|
> ...
> > 00001a70  20 31 33 37 35 32 38 37  39 37 39 09 23 20 d2 32  | 1375287979.# .2|
> > 00001a80  64 20 4a 75 6c 20 33 31  20 31 38 3a af 37 3a 31  |d Jul 31 18:.7:1|
> > 00001a90  39 20 32 30 31 33 0a 0a  00 00 00 00 00 00 ee 12  |9 2013..........|
>
> Note the creation date/time at the end (with a corrupted byte):
>
> Jul 31 18:?7:19 2013
>
> There are other corrupted bytes scattered around.  I'd be worried about
> the RAM in this machine.  Since you are using non-enterprise drives, I'm
> going to go out on a limb here and guess that the server doesn't have
> ECC ram...
see above
> Consider performing an extended memcheck run to see what's going on.
> Maybe move the entire stack of disks to another server.
>
That's what I did initially; I moved it back because it failed, and will now
move again into the new server before proceeding.

> Based on the signature discovered above, we should be able to --create
> --assume-clean with the modern default data offset.  We know the
> following device roles:
>
> /dev/sda2 == 0
> /dev/sdf2 == 5
> /dev/sdg2 == 6
> /dev/sdh2 == spare
>
> So /dev/sdh2 should be left out until the array is working.
>
> Please re-execute the "mdadm -E" reports for /dev/sd[fgh]2 and show them
> uncut.  (Use the lasted mdadm.)  That should fill in the likely device
> order of the remaining drives.

[root@livecd mnt]# mdadm -E /dev/sd[fgh]2

/dev/sdf2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
           Name : 1
  Creation Time : Wed Jul 31 18:24:38 2013
     Raid Level : raid6
   Raid Devices : 7

 Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
     Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
  Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d

    Update Time : Mon Dec 16 01:16:26 2013
       Checksum : ee921c43 - correct
         Events : 327

         Layout : left-symmetric
     Chunk Size : 256K

   Device Role : Active device 5
   Array State : A.AAAAA ('A' == active, '.' == missing)
/dev/sdg2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
           Name : 1
  Creation Time : Wed Jul 31 18:24:38 2013
     Raid Level : raid6
   Raid Devices : 7

 Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
     Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
  Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : a1e1e51b:d8912985:e51207a9:1d718292

    Update Time : Mon Dec 16 01:16:26 2013
       Checksum : 4ef01fe9 - correct
         Events : 327

         Layout : left-symmetric
     Chunk Size : 256K

   Device Role : Active device 6
   Array State : A.AAAAA ('A' == active, '.' == missing)


/dev/sdh2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
           Name : 1
  Creation Time : Wed Jul 31 18:24:38 2013
     Raid Level : raid6
   Raid Devices : 7

 Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
     Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
  Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1

    Update Time : Mon Dec 16 01:16:26 2013
       Checksum : a1330e97 - correct
         Events : 327

         Layout : left-symmetric
     Chunk Size : 256K

   Device Role : spare
   Array State : A.AAAAA ('A' == active, '.' == missing)

> Also, it is important that you document which drive serial numbers are
> currently occupying the different device names.  An excerpt from "ls -l
> /dev/disk/by-id/" would do.

scsi-SATA_ST3000DM001-9YN_S1F026VJ -> ../../sda
scsi-SATA_ST3000DM001-9YN_W1F0TB3C -> ../../sdb
scsi-SATA_ST3000DM001-9YN_S1F04KAK -> ../../sdc
scsi-SATA_ST3000DM001-9YN_W1F0RWJY -> ../../sdd
scsi-SATA_ST3000DM001-9YN_S1F08N7Q -> ../../sde
scsi-SATA_ST3000DM001-9YN_Z1F1F3TC -> ../../sdf
scsi-SATA_ST3000DM001-9YN_W1F1ZZ9T -> ../../sdg
scsi-SATA_ST3000DM001-9YN_Z1F1X0AC -> ../../sdh


> I have to admit that I'm very concerned about your corrupted LVM
> signature at offset 262144.  LVM probably won't recognize your PV once
> the array is assembled correctly, making it difficult to
> non-destructively test the filesystems on your logical volumes.  You may
> have to duplicate your disks onto new ones so that an LVM restore can be
> safely attempted.

> Do *not* buy desktop drives!  You need raid-capable drives like the WD
> Red at the least.

;-) Already ordered WD reds, will be delivered any time now. I guess I
have now reached that level after years of making do with very limited
budgets.

I am a bit more relaxed now because I found that a scheduled transfer of
the data to the university tape robot had completed before Christmas. So
this local archive mirror is (luckily) not critical. I still want to
understand whether all this is just a result of shaky hardware, or an
mdadm (misuse) issue. Losing (all superblocks on) five drives in a large
software raid 6 instead of bytes is not something I would like to repeat
any time soon by, e.g., mishandling mdadm.

We have then

Wed Jul 31 18:24:38 2013 on sd[f-h]2 for creation of the raid 6, and
Wed Jul 31 18:?7:19 2013 for creation of the LVM group

could well be.

So I will move the disks to the new server, make 1:1 copies to new
drives and then attempt an assembly using --assume-clean in which
order?

Thanks so much, I have learned a lot already.

Regards


Julian





* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
  2014-01-14 10:31       ` Großkreutz, Julian
@ 2014-01-14 13:14         ` Phil Turmel
  2014-01-14 14:00           ` AW: " Großkreutz, Julian
  2014-01-14 17:47           ` Wilson Jonathan
  0 siblings, 2 replies; 11+ messages in thread
From: Phil Turmel @ 2014-01-14 13:14 UTC (permalink / raw)
  To: "Großkreutz, Julian", linux-raid; +Cc: neilb

On 01/14/2014 05:31 AM, Großkreutz, Julian wrote:
> Hi Phil,
> 
> thanks again for bearing with me.

No problem.

>>>>> Model: ATA ST3000DM001-9YN1 (scsi)
>>
>> Aside: This model looks familiar.  I'm pretty sure these drives are
>> desktop models that lack scterc support.  Meaning they are *not*
>> generally suitable for raid duty.  Search the archives for combinations
>> of "timeout mismatch", "scterc", "URE", and "scrub" for a full
>> explanation.  If I've guessed correctly, you *must* use the driver
>> timeout work-around before proceeding.
>>
> 
> Yes I did, and smartctl showed no significant problems.

?.  What did "smartctl -l scterc" say?  If it says unsupported, you have
a problem.  The workaround is to set the driver timeouts to ~180 seconds
for each such drive.

If scterc is supported, but disabled, you can set 7-second timeouts with
"smartctl -l scterc,70,70", but you must do so on every power cycle.
Either way, you need boot-time scripting or distro support.

Raid-rated drives power up with a reasonable setting here.
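
The boot-time scripting amounts to a few lines run against every member
drive, roughly like this (it relies on smartctl returning non-zero when
ERC is not supported, and falls back to the long driver timeout):

for x in /dev/sd[a-h] ; do
    smartctl -l scterc,70,70 $x > /dev/null 2>&1 \
        || echo 180 > /sys/block/${x##*/}/device/timeout
done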

> The 10-year-old
> server (supermicro enterprise grade dual Xeon with 8 GB ECC RAM) had
> started to create problems early January, which is why I wanted to move
> the drives to a new server in the first place, to then transfer the data
> to a new set of enterprise grade disks. I had checked the memory and the
> disks in a burn-in for several days, including timeouts and power saving,
> before I set up the raid in 2012/2013, and did not have any issues then.

Ok.  This makes sense.

> One of the reasons I tend to use mdadm is that I am able to utilize
> existing hardware to create bridging solutions until money comes in for
> better hardware, and moving an mdadm raid has so far never created a
> serious problem.

Many people discover the timeout problem the first time they have an
otherwise correctable read error in their array, and the array falls
apart instead.  This list's archives are well-populated with such cases.

>>> So attached you will find hexdumps of 64k of /dev/sd[a-h]2 at sector 0
>>> and 262144 which shows the superblock 1.2 on sd[fgh]2, not on sd[a-e]2,
>>> but may help to identify data_offset; I suspect it is 2048 on sd[a-e]2
>>> and 262144 on sd[fgh]2.
>>>
>>
>> Jackpot!  LVM2 embedded backup data at the correct location for mdadm
>> data offset == 262144.  And on /dev/sda2, which is the only device that
>> should have it (first device in the raid).
>>
>> From /dev/sda2 @ 262144:
>>
>>> 00001200  76 67 5f 6e 65 64 69 67  73 30 32 20 5d 0a 69 64  |vg_nedigs02 ].id|
>>> 00001210  20 3d 20 22 32 4c 62 48  71 64 2d 72 67 42 9f 6e  | = "2LbHqd-rgB.n|
>>> 00001220  45 4a 75 31 2d 32 52 36  31 2d 41 35 f5 75 2d 6e  |EJu1-2R61-A5.u-n|
>>> 00001230  49 58 53 2d 66 79 4f 36  33 73 22 0a 73 65 3a 01  |IXS-fyO63s".se:.|
>>> 00001240  6f 20 3d 20 33 36 0a 66  6f 72 6d 61 ca 24 3d 20  |o = 36.forma.$= |
>>> 00001250  22 6c 76 6d 32 22 20 23  20 69 6e 66 6f 72 6b ac  |"lvm2" # infork.|
>> ...
>>> 00001a70  20 31 33 37 35 32 38 37  39 37 39 09 23 20 d2 32  | 1375287979.# .2|
>>> 00001a80  64 20 4a 75 6c 20 33 31  20 31 38 3a af 37 3a 31  |d Jul 31 18:.7:1|
>>> 00001a90  39 20 32 30 31 33 0a 0a  00 00 00 00 00 00 ee 12  |9 2013..........|
>>
>> Note the creation date/time at the end (with a corrupted byte):
>>
>> Jul 31 18:?7:19 2013
>>
>> There are other corrupted bytes scattered around.  I'd be worried about
>> the RAM in this machine.  Since you are using non-enterprise drives, I'm
>> going to go out on a limb here and guess that the server doesn't have
>> ECC ram...
> see above

Understood.  With really old memory, double-faults in the ECC could have
panic'd the server, leaving scattered data unwritten.

>> Consider performing an extended memcheck run to see what's going on.
>> Maybe move the entire stack of disks to another server.
>>
> That's what I did initially; I moved it back because it failed, and will now
> move again into the new server before proceeding.

Ok.

>> Based on the signature discovered above, we should be able to --create
>> --assume-clean with the modern default data offset.  We know the
>> following device roles:
>>
>> /dev/sda2 == 0
>> /dev/sdf2 == 5
>> /dev/sdg2 == 6
>> /dev/sdh2 == spare
>>
>> So /dev/sdh2 should be left out until the array is working.
>>
>> Please re-execute the "mdadm -E" reports for /dev/sd[fgh]2 and show them
>> uncut.  (Use the lasted mdadm.)  That should fill in the likely device
>> order of the remaining drives.

Hmmm.  Typo on my part: s/lasted/latest/  Newer mdadm will give more
information.  In particular, I wanted the tail of each report where each
device lists what it last knew about all of the other devices' roles.

> [root@livecd mnt]# mdadm -E /dev/sd[fgh]2
> 
> /dev/sdf2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013
>      Raid Level : raid6
>    Raid Devices : 7
> 
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d
> 
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : ee921c43 - correct
>          Events : 327
> 
>          Layout : left-symmetric
>      Chunk Size : 256K
> 
>    Device Role : Active device 5
>    Array State : A.AAAAA ('A' == active, '.' == missing)

I was expecting more info after this.

> /dev/sdg2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013
>      Raid Level : raid6
>    Raid Devices : 7
> 
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : a1e1e51b:d8912985:e51207a9:1d718292
> 
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : 4ef01fe9 - correct
>          Events : 327
> 
>          Layout : left-symmetric
>      Chunk Size : 256K
> 
>    Device Role : Active device 6
>    Array State : A.AAAAA ('A' == active, '.' == missing)

And here.

> /dev/sdh2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013
>      Raid Level : raid6
>    Raid Devices : 7
> 
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1
> 
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : a1330e97 - correct
>          Events : 327
> 
>          Layout : left-symmetric
>      Chunk Size : 256K
> 
>    Device Role : spare
>    Array State : A.AAAAA ('A' == active, '.' == missing)

And here.

>> Also, it is important that you document which drive serial numbers are
>> currently occupying the different device names.  An excerpt from "ls -l
>> /dev/disk/by-id/" would do.
> 
> scsi-SATA_ST3000DM001-9YN_S1F026VJ -> ../../sda
> scsi-SATA_ST3000DM001-9YN_W1F0TB3C -> ../../sdb
> scsi-SATA_ST3000DM001-9YN_S1F04KAK -> ../../sdc
> scsi-SATA_ST3000DM001-9YN_W1F0RWJY -> ../../sdd
> scsi-SATA_ST3000DM001-9YN_S1F08N7Q -> ../../sde
> scsi-SATA_ST3000DM001-9YN_Z1F1F3TC -> ../../sdf
> scsi-SATA_ST3000DM001-9YN_W1F1ZZ9T -> ../../sdg
> scsi-SATA_ST3000DM001-9YN_Z1F1X0AC -> ../../sdh

Ok.  Be sure to recheck this list any time you boot, since the device
order matters.

> I am a bit more relaxed now because I found that a scheduled transfer of
> the data to the university tape robot had completed before Christmas. So
> this local archive mirror is (luckily) not critical. I still want to
> understand whether all this is just a result of shaky hardware, or an
> mdadm (misuse) issue. Losing (all superblocks on) five drives in a large
> software raid 6, rather than just a few bytes, is not something I would
> like to repeat any time soon by, e.g., mishandling mdadm.

I think you skated over the edge due to a flaky motherboard.  mdadm
can't fix that.  In fact, since you have a backup, I personally wouldn't
bother with further reconstruction efforts.  If you have a recent
vgcfgbackup, it's doable, but I have little confidence in the device
order: [a????fg], probably [abcdefg].  There are 4! == 24 permutations
there, each of which will require a vgcfgrestore before you can check
the reconstruction with "fsck -n".
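
For what it's worth, once a candidate array has been created, one such
check cycle would look roughly like the following sketch, run only
against the copies.  Here vg_nedigs02 is the VG name visible in the
embedded LVM metadata above, while the backup file path and the LV name
are placeholders you'd have to substitute:

  vgcfgrestore -f /path/to/your/vgcfgbackup vg_nedigs02
  vgchange -ay vg_nedigs02
  fsck -n /dev/vg_nedigs02/lv_archive    # read-only check; LV name is a placeholder
  vgchange -an vg_nedigs02               # deactivate before trying the next order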

> We have then
> 
> Wed Jul 31 18:24:38 2013 on sdf-h2 for creation of the raid6 and
> Wed Jul 31 18:?7:19 2013 for creation of the lvm group
> 
> could well be.

I don't see any way to get such a timestamp except "certainly was".

> So I will move the disks to the new server, make 1:1 copies to new
> drives, and then attempt an assembly using --assume-clean. In which
> order?

All permutations of [a????fg] with b, c, d, and e.

Try likely combinations gleaned from "mdadm -E" reports first to
shortcut the process.
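
As a sketch of a single attempt (parameters taken from the -E output
above, device names assuming the mapping you listed, and everything
done on the 1:1 copies only):

  mdadm --stop /dev/md1 2>/dev/null    # clear any previous attempt
  mdadm --create /dev/md1 --assume-clean \
        --metadata=1.2 --level=6 --raid-devices=7 \
        --chunk=256 --layout=left-symmetric \
        /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 /dev/sde2 /dev/sdf2 /dev/sdg2
  mdadm -E /dev/sda2 | grep -E 'Data Offset|Dev Size'

The last command must report the same 262144-sector data offset as the
old superblocks; if your mdadm's default differs, stop before writing
anything further.  Then run the vgcfgrestore and "fsck -n" check as
above, and if it fails, stop the array and try the next order.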

> Thanks so much, I have learned a lot already.

You are welcome, and good luck.

Regards,

Phil

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* AW: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
  2014-01-14 13:14         ` Phil Turmel
@ 2014-01-14 14:00           ` Großkreutz, Julian
  2014-01-14 17:47           ` Wilson Jonathan
  1 sibling, 0 replies; 11+ messages in thread
From: Großkreutz, Julian @ 2014-01-14 14:00 UTC (permalink / raw)
  To: 'Phil Turmel', 'linux-raid@vger.kernel.org'
  Cc: 'neilb@suse.de'

Hi Phil,

great help, a lot of lessons learned on my part, thanks again.

I will not try to rescue the raid; time constraints forbid it. But I will from now on implement a strict minimum hardware requirements policy :-)

Regards

Julian

-----Original Message-----
From: Phil Turmel [mailto:philip@turmel.org]
Sent: Tuesday, 14 January 2014 14:15
To: Großkreutz, Julian; linux-raid@vger.kernel.org
Cc: neilb@suse.de
Subject: Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock

On 01/14/2014 05:31 AM, Großkreutz, Julian wrote:
> Hi Phil,
>
> thanks again for bearing with me.

No problem.

>>>>> Model: ATA ST3000DM001-9YN1 (scsi)
>>
>> Aside: This model looks familiar.  I'm pretty sure these drives are
>> desktop models that lack scterc support.  Meaning they are *not*
>> generally suitable for raid duty.  Search the archives for
>> combinations of "timeout mismatch", "scterc", "URE", and "scrub" for
>> a full explanation.  If I've guessed correctly, you *must* use the
>> driver timeout work-around before proceeding.
>>
>
> Yes I did, and smartctl showed no significant problems.

?.  What did "smartctl -l scterc" say?  If it says unsupported, you have a problem.  The workaround is to set the driver timeouts to ~180 seconds for each such drive.

If scterc is supported, but disabled, you can set 7-second timeouts with "smartctl -l scterc,70,70", but you must do so on every power cycle.
Either way, you need boot-time scripting or distro support.

Raid-rated drives power up with a reasonable setting here.
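
Spelled out as commands, with sdX a placeholder for each member disk:

  smartctl -l scterc /dev/sdX               # does the drive support ERC?
  smartctl -l scterc,70,70 /dev/sdX         # if yes: 7.0s limits, redo after every power cycle
  echo 180 > /sys/block/sdX/device/timeout  # if no: raise the kernel's command timeout instead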

> The 10-year-old
> server (Supermicro enterprise-grade dual Xeon with 8 GB ECC RAM) had
> started to create problems in early January, which is why I wanted to
> move the drives to a new server in the first place and then transfer
> the data to a new set of enterprise-grade disks. I had checked the
> memory and the disks in a burn-in for several days, including timeout
> and power saving, before I set up the raid in 2012/2013, and did not
> have any issues then.

Ok.  This makes sense.

> One of the reasons I tend to use mdadm is that I am able to utilize
> existing hardware to create bridging solutions until money comes in
> for better hardware, and moving an mdadm raid has so far never created
> a serious problem.

Many people discover the timeout problem the first time they have an otherwise correctable read error in their array, and the array falls apart instead.  This list's archives are well-populated with such cases.

>>> So attached you will find hexdumps of 64k of /dev/sd[a-h]2 at sector
>>> 0 and 262144, which show the superblock 1.2 on sd[fgh]2, not on
>>> sd[a-e]2, but may help to identify data_offset; I suspect it is 2048
>>> on sd[a-e]2 and 262144 on sd[fgh]2.
>>>
>>
>> Jackpot!  LVM2 embedded backup data at the correct location for mdadm
>> data offset == 262144.  And on /dev/sda2, which is the only device
>> that should have it (first device in the raid).
>>
>> From /dev/sda2 @ 262144:
>>
>>> 00001200  76 67 5f 6e 65 64 69 67  73 30 32 20 5d 0a 69 64  |vg_nedigs02 ].id|
>>> 00001210  20 3d 20 22 32 4c 62 48  71 64 2d 72 67 42 9f 6e  | = "2LbHqd-rgB.n|
>>> 00001220  45 4a 75 31 2d 32 52 36  31 2d 41 35 f5 75 2d 6e  |EJu1-2R61-A5.u-n|
>>> 00001230  49 58 53 2d 66 79 4f 36  33 73 22 0a 73 65 3a 01  |IXS-fyO63s".se:.|
>>> 00001240  6f 20 3d 20 33 36 0a 66  6f 72 6d 61 ca 24 3d 20  |o = 36.forma.$= |
>>> 00001250  22 6c 76 6d 32 22 20 23  20 69 6e 66 6f 72 6b ac  |"lvm2" # infork.|
>> ...
>>> 00001a70  20 31 33 37 35 32 38 37  39 37 39 09 23 20 d2 32  | 1375287979.# .2|
>>> 00001a80  64 20 4a 75 6c 20 33 31  20 31 38 3a af 37 3a 31  |d Jul 31 18:.7:1|
>>> 00001a90  39 20 32 30 31 33 0a 0a  00 00 00 00 00 00 ee 12  |9 2013..........|
>>
>> Note the creation date/time at the end (with a corrupted byte):
>>
>> Jul 31 18:?7:19 2013
>>
>> There are other corrupted bytes scattered around.  I'd be worried
>> about the RAM in this machine.  Since you are using non-enterprise
>> drives, I'm going to go out on a limb here and guess that the server
>> doesn't have ECC ram...
> see above

Understood.  With really old memory, double-faults in the ECC could have panic'd the server, leaving scattered data unwritten.

>> Consider performing an extended memcheck run to see what's going on.
>> Maybe move the entire stack of disks to another server.
>>
> That's what I did initially; I moved it back because it failed, and now
> I will move it again into the new server before proceeding.

Ok.

>> Based on the signature discovered above, we should be able to
>> --create --assume-clean with the modern default data offset.  We know
>> the following device roles:
>>
>> /dev/sda2 == 0
>> /dev/sdf2 == 5
>> /dev/sdg2 == 6
>> /dev/sdh2 == spare
>>
>> So /dev/sdh2 should be left out until the array is working.
>>
>> Please re-execute the "mdadm -E" reports for /dev/sd[fgh]2 and show
>> them uncut.  (Use the lasted mdadm.)  That should fill in the likely
>> device order of the remaining drives.

Hmmm.  Typo on my part: s/lasted/latest/  Newer mdadm will give more information.  In particular, I wanted the tail of each report where each device lists what it last knew about all of the other devices' roles.

> [root@livecd mnt]# mdadm -E /dev/sd[fgh]2
>
> /dev/sdf2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013
>      Raid Level : raid6
>    Raid Devices : 7
>
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d
>
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : ee921c43 - correct
>          Events : 327
>
>          Layout : left-symmetric
>      Chunk Size : 256K
>
>    Device Role : Active device 5
>    Array State : A.AAAAA ('A' == active, '.' == missing)

I was expecting more info after this.

> /dev/sdg2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013
>      Raid Level : raid6
>    Raid Devices : 7
>
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : a1e1e51b:d8912985:e51207a9:1d718292
>
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : 4ef01fe9 - correct
>          Events : 327
>
>          Layout : left-symmetric
>      Chunk Size : 256K
>
>    Device Role : Active device 6
>    Array State : A.AAAAA ('A' == active, '.' == missing)

And here.

> /dev/sdh2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013
>      Raid Level : raid6
>    Raid Devices : 7
>
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1
>
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : a1330e97 - correct
>          Events : 327
>
>          Layout : left-symmetric
>      Chunk Size : 256K
>
>    Device Role : spare
>    Array State : A.AAAAA ('A' == active, '.' == missing)

And here.

>> Also, it is important that you document which drive serial numbers
>> are currently occupying the different device names.  An excerpt from
>> "ls -l /dev/disk/by-id/" would do.
>
> scsi-SATA_ST3000DM001-9YN_S1F026VJ -> ../../sda
> scsi-SATA_ST3000DM001-9YN_W1F0TB3C -> ../../sdb
> scsi-SATA_ST3000DM001-9YN_S1F04KAK -> ../../sdc
> scsi-SATA_ST3000DM001-9YN_W1F0RWJY -> ../../sdd
> scsi-SATA_ST3000DM001-9YN_S1F08N7Q -> ../../sde
> scsi-SATA_ST3000DM001-9YN_Z1F1F3TC -> ../../sdf
> scsi-SATA_ST3000DM001-9YN_W1F1ZZ9T -> ../../sdg
> scsi-SATA_ST3000DM001-9YN_Z1F1X0AC -> ../../sdh

Ok.  Be sure to recheck this list any time you boot, since the device order matters.

> I am a bit more relaxed now because I found that a scheduled transfer
> of the data to the university tape robot had completed before
> Christmas. So this local archive mirror is (luckily) not critical. I
> still want to understand whether all this is just a result of shaky
> hardware, or an mdadm (misuse) issue. Losing (all superblocks on) five
> drives in a large software raid 6, rather than just a few bytes, is not
> something I would like to repeat any time soon by, e.g., mishandling mdadm.

I think you skated over the edge due to a flaky motherboard.  mdadm can't fix that.  In fact, since you have a backup, I personally wouldn't bother with further reconstruction efforts.  If you have a recent vgcfgbackup, it's doable, but I have little confidence in the device
order: [a????fg], probably [abcdefg].  There are 4! == 24 permutations there, each of which will require a vgcfgrestore before you can check the reconstruction with "fsck -n".

> We have then
>
> Wed Jul 31 18:24:38 2013 on sdf-h2 for creation of the raid6 and Wed
> Jul 31 18:?7:19 2013 for creation of the lvm group
>
> could well be.

I don't see any way to get such a timestamp except "certainly was".

> So I will move the disks to the new server, make 1:1 copies to new
> drives, and then attempt an assembly using --assume-clean. In which
> order?

All permutations of [a????fg] with b, c, d, and e.

Try likely combinations gleaned from "mdadm -E" reports first to shortcut the process.

> Thanks so much, I have learned a lot already.

You are welcome, and good luck.

Regards,

Phil


Universitätsklinikum Jena - Bachstrasse 18 - D-07743 Jena
The legally required company information can be found at http://www.uniklinikum-jena.de/Pflichtangaben.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
  2014-01-14 13:14         ` Phil Turmel
  2014-01-14 14:00           ` AW: " Großkreutz, Julian
@ 2014-01-14 17:47           ` Wilson Jonathan
  2014-01-14 18:43             ` Phil Turmel
  1 sibling, 1 reply; 11+ messages in thread
From: Wilson Jonathan @ 2014-01-14 17:47 UTC (permalink / raw)
  To: Phil Turmel; +Cc: "Großkreutz, Julian", linux-raid, neilb

On Tue, 2014-01-14 at 08:14 -0500, Phil Turmel wrote:

> ?.  What did "smartctl -l scterc" say?  If it says unsupported, you have
> a problem.  The workaround is to set the driver timeouts to ~180 seconds
> for each such drive.
> 
> If scterc is supported, but disabled, you can set 7-second timeouts with
> "smartctl -l scterc,70,70", but you must do so on every power cycle.
> Either way, you need boot-time scripting or distro support.
> 
> Raid-rated drives power up with a reasonable setting here.
> 
> Many people discover the timeout problem the first time they have an
> otherwise correctable read error in their array, and the array falls
> apart instead.  This list's archives are well-populated with such cases.

Snipped for brevity above.

I understand the issue of "timeout" on drives that might perform long
error checking which then causes mdadm, via the device (block?) driver
issuing a time out, to then kick the drive. In this instance you allow
some time for a drive to try and fix things at the expense of a hung
array for a longer period of time.

I also understand that with scterc the drive gives up (in effect timing
itself out) when it hits the 7 second, or thereabouts, mark and
subsequently mdadm kicks the drive out. In this specific instance the
idea is to kill a drive quickly so that the raid doesn't hang longer
than a few seconds.

However surely these things (bar the amount of time) result in the same
final result of a drive being kicked out. Even in a non-mdadm hardware
raid setup, the drive is either kicked because it didn't return in 7
seconds, or the drive kicks itself because it gave up before 7
seconds.

If anything, surely when you have a degraded array that will fail if any
more disks are kicked, you actually need to do the reverse of normal
raid wisdom: set the timeout in the device (block) layer as long as
possible, and then, if the drives have scterc enabled, disable it
(assuming the drive physically allows it and, when disabled, performs a
harder, or any, internal retry/CRC/etc.) to force the drives to give
their all to get back any as yet unknown failing sectors, should they
occur during a rebuild of a failed drive.

Surely, unless I'm missing something, rebuilding a failed drive's data
means that you want the system to not kick if at all possible, and having
scterc enabled or a short timeout (shorter than the drive's max time,
unless that time is indefinite retry) is the last thing you want?
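
(Concretely, the settings I have in mind would be something like the
following, with sdX a placeholder for each member disk:

  echo 600 > /sys/block/sdX/device/timeout   # let the drive retry as long as it needs
  smartctl -l scterc,0,0 /dev/sdX            # 0,0 disables ERC so the drive keeps trying

assuming I have understood the knobs correctly.)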

> 
> Regards,
> 
> Phil


Jon



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
  2014-01-14 17:47           ` Wilson Jonathan
@ 2014-01-14 18:43             ` Phil Turmel
  2014-01-15 12:50               ` Wilson Jonathan
  0 siblings, 1 reply; 11+ messages in thread
From: Phil Turmel @ 2014-01-14 18:43 UTC (permalink / raw)
  To: Wilson Jonathan; +Cc: "Großkreutz, Julian", linux-raid, neilb

On 01/14/2014 12:47 PM, Wilson Jonathan wrote:

[trim /]

> I understand the issue of "timeout" on drives that might perform long
> error checking which then causes mdadm, via the device (block?) driver
> issuing a time out, to then kick the drive. In this instance you allow
> some time for a drive to try and fix things at the expense of a hung
> array for a longer period of time.
> 
> I also understand that with scterc the drive gives up (in effect timing
> itself out) when it hits the 7 second, or thereabouts, mark and
> subsequently mdadm kicks the drive out. In this specific instance the
> idea is to kill a drive quickly so that the raid doesn't hang longer
> than a few seconds.

No.  The intent is to fail the read without failing the controller channel.

> However surely these things (bar the amount of time) result in the same
> final result of a drive being kicked out. Even in a non-mdadm hardware
> raid setup, the drive is either kicked because it didn't return in 7
> seconds, or the drive kicks itself because it gave up before 7
> seconds.

No.  Upon a failed read, MD will obtain/reconstruct the problem sector
from remaining redundancy, then write the correct data back.  Occasional
read errors of this type are normal, and fix themselves when the sector
is written again.  MD will only fail a drive after *multiple* read
errors, not just one.  (Isolated bursts of up to 20, then ~ ten per hour.)
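
(That threshold is a per-array tunable on reasonably recent kernels;
assuming the array appears as /dev/md1, you can see it with

  cat /sys/block/md1/md/max_read_errors    # defaults to 20

and raise it if needed.)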

[trim /]

> Surely, unless I'm missing something, rebuilding a failed drive's data
> means that you want the system to not kick if at all possible, and having
> scterc enabled or a short timeout (shorter than the drive's max time,
> unless that time is indefinite retry) is the last thing you want?

What you are missing is what happens when the controller channel times
out.  The original read is reported failed to MD while the driver tries
to revive the unresponsive drive.  MD proceeds to obtain/reconstruct the
missing data, then write back.  But the device is not communicating--the
driver has reset the channel, and will continue not communicating until
the drive firmware finally gives up on the original read.  So the
*write* fails instantly, kicking the drive out of the array.

When you, the admin, get around to looking, the drive is idle but
apparently fine.  (It gains a "pending" sector, which stays until the
drive is told to write over that spot.)
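
(You can see that state from SMART, with sdX again a placeholder:

  smartctl -A /dev/sdX | grep -E -i 'pending|reallocat'

will show the pending and reallocated sector counts.)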

HTH,

Phil

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
  2014-01-14 18:43             ` Phil Turmel
@ 2014-01-15 12:50               ` Wilson Jonathan
  2014-01-15 13:35                 ` Phil Turmel
  0 siblings, 1 reply; 11+ messages in thread
From: Wilson Jonathan @ 2014-01-15 12:50 UTC (permalink / raw)
  To: Phil Turmel; +Cc: "Großkreutz, Julian", linux-raid, neilb

On Tue, 2014-01-14 at 13:43 -0500, Phil Turmel wrote:
> On 01/14/2014 12:47 PM, Wilson Jonathan wrote:
> 
> [trim /]
> 
> > I understand the issue of "timeout" on drives that might perform long
> > error checking which then causes mdadm, via the device (block?) driver
> > issuing a time out, to then kick the drive. In this instance you allow
> > some time for a drive to try and fix things at the expense of a hung
> > array for a longer period of time.
> > 
> > I also understand that with scterc the drive gives up (in effect timing
> > itself out) when it hits the 7 second, or thereabouts, mark and
> > subsequently mdadm kicks the drive out. In this specific instance the
> > idea is to kill a drive quickly so that the raid doesn't hang longer
> > than a few seconds.
> 
> No.  The intent is to fail the read without failing the controller channel.

Arrr, thanks for the clarification... I hadn't realised that instead of
the drive returning an "Error, I can't get the data, I'm dead in the
water" message, it returned a "warning, I can't get the data, you
deal with it and get back to me, I'm still working" kind of affair.

> 
> > However surely these things (bar the amount of time) result in the same
> > final result of a drive being kicked out. Even in a non-mdadm hardware
> > raid setup, the drive is either kicked because it didn't return in 7
> > seconds, or the drive kicks itself because it gave up before 7
> > seconds.
> 
> No.  Upon a failed read, MD will obtain/reconstruct the problem sector
> from remaining redundancy, then write the correct data back.  Occasional
> read errors of this type are normal, and fix themselves when the sector
> is written again.  MD will only fail a drive after *multiple* read
> errors, not just one.  (Isolated bursts of up to 20, then ~ ten per hour.)
> 

I see now... I had totally the wrong idea of what happened and how they
differed. 

> [trim /]
> 
> > Surely, unless I'm missing something, rebuilding a failed drive's data
> > means that you want the system to not kick if at all possible, and having
> > scterc enabled or a short timeout (shorter than the drive's max time,
> > unless that time is indefinite retry) is the last thing you want?
> 
> What you are missing is what happens when the controller channel times
> out.  The original read is reported failed to MD while the driver tries
> to revive the unresponsive drive.  MD proceeds to obtain/reconstruct the
> missing data, then write back.  But the device is not communicating--the
> driver has reset the channel, and will continue not communicating until
> the drive firmware finally gives up on the original read.  So the
> *write* fails instantly, kicking the drive out of the array.
> 
> When you, the admin, get around to looking, the drive is idle but
> apparently fine.  (It gains a "pending" sector, which stays until the
> drive is told to write over that spot.)
> 
> HTH,

It does, thanks for the information :-)

> 
> Phil
> 

Jon



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
  2014-01-15 12:50               ` Wilson Jonathan
@ 2014-01-15 13:35                 ` Phil Turmel
  0 siblings, 0 replies; 11+ messages in thread
From: Phil Turmel @ 2014-01-15 13:35 UTC (permalink / raw)
  To: Wilson Jonathan; +Cc: "Großkreutz, Julian", linux-raid, neilb

On 01/15/2014 07:50 AM, Wilson Jonathan wrote:
> On Tue, 2014-01-14 at 13:43 -0500, Phil Turmel wrote:
>> On 01/14/2014 12:47 PM, Wilson Jonathan wrote:
>>
>> [trim /]
>>
>>> I understand the issue of "timeout" on drives that might perform long
>>> error checking which then causes mdadm, via the device (block?) driver
>>> issuing a time out, to then kick the drive. In this instance you allow
>>> some time for a drive to try and fix things at the expense of a hung
>>> array for a longer period of time.
>>>
>>> I also understand that with scterc the drive gives up (in effect timing
>>> itself out) when it hits the 7 second, or thereabouts, mark and
>>> subsequently mdadm kicks the drive out. In this specific instance the
>>> idea is to kill a drive quickly so that the raid doesn't hang longer
>>> than a few seconds.
>>
>> No.  The intent is to fail the read without failing the controller channel.
> 
> Arrr, thanks for the clarification... I hadn't realised that instead of
> the drive returning an "Error, I can't get the data, I'm dead in the
> water" message, it returned a "warning, I can't get the data, you
> deal with it and get back to me, I'm still working" kind of affair.

Let me emphasize one point here:  while a drive is performing error
recovery, it *stops talking to the controller*.  The drive isn't
replying with a warning as you suggest--it isn't replying *at all*.
Modern desktop drives try *very hard* to recover bad sectors, under the
assumption that they have the only copy of the data.  Typically, they'll
work at it for two *minutes* or more.

The linux kernel driver will give up after 30 seconds and try to reset
the drive.  The drive firmware ignores the reset, possibly multiple
times, until it is done retrying the original read.  When it does
finally reset, it is too late--it's been bumped from the array.
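
(The 30 seconds is the per-device default; it is readable and adjustable
via sysfs, with sdX a placeholder:

  cat /sys/block/sdX/device/timeout          # 30 by default
  echo 180 > /sys/block/sdX/device/timeout   # the usual mismatch workaround

for every member disk, on every boot.)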

But the drive didn't really fail, leading to:

>> When you, the admin, get around to looking, the drive is idle but
>> apparently fine.  (It gains a "pending" sector, which stays until the
>> drive is told to write over that spot.)
>>
>> HTH,
> 
> It does, thanks for the information :-)

You are welcome.

Phil


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-01-15 13:35 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-11  6:42 mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock Großkreutz, Julian
2014-01-11 17:47 ` Phil Turmel
     [not found]   ` <1389632980.11328.104.camel@achilles.aeskuladis.de>
2014-01-13 18:42     ` Phil Turmel
2014-01-13 20:11       ` Chris Murphy
2014-01-14 10:31       ` Großkreutz, Julian
2014-01-14 13:14         ` Phil Turmel
2014-01-14 14:00           ` AW: " Großkreutz, Julian
2014-01-14 17:47           ` Wilson Jonathan
2014-01-14 18:43             ` Phil Turmel
2014-01-15 12:50               ` Wilson Jonathan
2014-01-15 13:35                 ` Phil Turmel
