* mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
From: Großkreutz, Julian @ 2014-01-11 6:42 UTC (permalink / raw)
To: linux-raid; +Cc: neilb
Dear all, dear Neil (thanks for pointing me to this list),
I am in desperate need of help. mdadm is fantastic work, and I have
relied on mdadm for years to run very stable server systems, never had
major problems I could not solve.
This time it's different:
On CentOS 6.x (I can't remember the exact version), initially in 2012:
parted to create GPT partitions on 5 Seagate drives 3TB each
Model: ATA ST3000DM001-9YN1 (scsi)
Disk /dev/sda: 5860533168s # sd[bcde] identical
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Number Start End Size File system Name Flags
1 2048s 1953791s 1951744s ext4 boot
2 1955840s 5860532223s 5858576384s primary raid
I used an unknown mdadm version including unknown offset parameters for
4k alignment to create
/dev/sd[abcde]1 as /dev/md0 raid 1 for booting (1 GB)
/dev/sd[abcde]2 as /dev/md1 raid 6 for data (9 TB) lvm physical drive
Later added 3 more 3T identical Seagate drives with identical partition
layout, but later firmware.
Using likely a different newer version of mdadm I expanded RAID 6 by 2
drives and added 1 spare.
/dev/md1 was at 15 TB gross, 13 TB usable, expanded pv
Ran fine
Then I moved the 8 disks to a new server with an hba and backplane,
array did not start because mdadm did not find the superblocks on the
original 5 devices /dev/sd[abcde]2. Moving the disks back to the old
server the error did not vanish. Using a centos 6.3 livecd, I got the
following:
[root@livecd ~]# mdadm -Evvvvs /dev/sd[abcdefgh]2
mdadm: No md superblock detected on /dev/sda2.
mdadm: No md superblock detected on /dev/sdb2.
mdadm: No md superblock detected on /dev/sdc2.
mdadm: No md superblock detected on /dev/sdd2.
mdadm: No md superblock detected on /dev/sde2.
/dev/sdf2:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
Name : 1
Creation Time : Wed Jul 31 18:24:38 2013
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : active
Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d
Update Time : Mon Dec 16 01:16:26 2013
Checksum : ee921c43 - correct
Events : 327
Layout : left-symmetric
Chunk Size : 256K
Device Role : Active device 5
Array State : A.AAAAA ('A' == active, '.' == missing)
/dev/sdg2:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
Name : 1
Creation Time : Wed Jul 31 18:24:38 2013
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : active
Device UUID : a1e1e51b:d8912985:e51207a9:1d718292
Update Time : Mon Dec 16 01:16:26 2013
Checksum : 4ef01fe9 - correct
Events : 327
Layout : left-symmetric
Chunk Size : 256K
Device Role : Active device 6
Array State : A.AAAAA ('A' == active, '.' == missing)
/dev/sdh2:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
Name : 1
Creation Time : Wed Jul 31 18:24:38 2013
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : active
Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1
Update Time : Mon Dec 16 01:16:26 2013
Checksum : a1330e97 - correct
Events : 327
Layout : left-symmetric
Chunk Size : 256K
Device Role : spare
Array State : A.AAAAA ('A' == active, '.' == missing)
I suspect that the superblocks of the original 5 devices are at a
different location, possibly because they were created with a different
mdadm version, i.e. at the end of the partitions. Booting the drives
with the hba in IT (non-raid) mode on the new server may have written
something to the end of the partitions on the first five drives,
because I can hexdump something containing "EFI PART" in the last 64 kB
of all 8 partitions used for the raid 6; this may not have affected the
3 added drives, which show metadata 1.2.
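For reference, the usual superblock locations can be derived from the partition size in the parted listing above. The following is only a sketch based on the documented md metadata layouts (mdadm's super0.c/super1.c are the authoritative reference): 0.90 sits in a 64K-aligned block at least 64K from the end, 1.0 near the end 4K-aligned, 1.1 at sector 0, and 1.2 at sector 8.

```shell
# Candidate superblock sector offsets for /dev/sd[a-e]2, whose size is
# 5858576384 sectors per the parted listing above. A sketch, not gospel.
part=5858576384

sb_090=$(( (part & ~127) - 128 ))   # 0.90: 64K-aligned, >= 64K from end
sb_10=$(( (part - 16) & ~7 ))       # 1.0:  4K-aligned, >= 8K from end
sb_11=0                             # 1.1:  at sector 0
sb_12=8                             # 1.2:  at sector 8 (4K in)

echo "0.90:$sb_090 1.0:$sb_10 1.1:$sb_11 1.2:$sb_12"
```

With these numbers a targeted `dd ... skip=$sb_090 count=8 | hexdump -C` per candidate is much cheaper than scanning the whole device.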
If any of you can help me sort this out I would greatly appreciate it. I
guess I need the mdadm version where I can set the data offset
differently for each device, but it doesn't compile; it errors out in
sha1.c:
sha1.h:29:22: Fehler: ansidecl.h: Datei oder Verzeichnis nicht gefunden
(German for "error: ansidecl.h: No such file or directory")
What would be the best way to proceed? There is critical data on this
raid, not fully backed up.
(UPD'T)
Thanks for getting back.
Yes, it's bad, I know; so is tweaking without keeping exact records of
versions and offsets.
I am, however, rather sure that nothing was written to the disks when I
plugged them into the NEW server, unless starting up a live cd causes an
automatic assemble attempt with an update to the superblocks. That I
cannot exclude.
What I did so far w/o writing to the disks
get non-00 data at the beginning of sda2:
dd if=/dev/sda skip=1955840 bs=512 count=10 | hexdump -C | grep [^00]
gives me
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 1e b5 54 51 20 4c 56 4d 32 20 78 5b 35 41 25 72 |..TQ LVM2 x[5A%r|
00001010 30 4e 2a 3e 01 00 00 00 00 10 00 00 00 00 00 00 |0N*>............|
00001020 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00001030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001200 76 67 5f 6e 65 64 69 67 73 30 32 20 7b 0a 69 64 |vg_nedigs02 {.id|
00001210 20 3d 20 22 32 4c 62 48 71 64 2d 72 67 42 74 2d | = "2LbHqd-rgBt-|
00001220 45 4a 75 31 2d 32 52 36 31 2d 41 35 7a 74 2d 6e |EJu1-2R61-A5zt-n|
00001230 49 58 53 2d 66 79 4f 36 33 73 22 0a 73 65 71 6e |IXS-fyO63s".seqn|
00001240 6f 20 3d 20 37 0a 66 6f 72 6d 61 74 20 3d 20 22 |o = 7.format = "|
00001250 6c 76 6d 32 22 20 23 20 69 6e 66 6f 72 6d 61 74 |lvm2" # informat|
(cont'd)
but on /dev/sdb2
00000000 5f 80 00 00 5f 80 01 00 5f 80 02 00 5f 80 03 00 |_..._..._..._...|
00000010 5f 80 04 00 5f 80 0c 00 5f 80 0d 00 00 00 00 00 |_..._..._.......|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 60 80 00 00 60 80 01 00 60 80 02 00 60 80 03 00 |`...`...`...`...|
00001010 60 80 04 00 60 80 0c 00 60 80 0d 00 00 00 00 00 |`...`...`.......|
00001020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001400
so my initial guess that the data may start at 00001000 did not pan out.
Does anybody have an idea of how to reliably identify an mdadm
superblock in a hexdump of the drive?
And second, have I got my numbers right? In parted I see the sector
count, and when I multiply the total count by 512 (not 4096!) I get 3
TB, so I think I have to use bs=512 in dd to get the partition
boundaries correct.
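That arithmetic can be checked directly from the parted figures quoted above (all numbers are taken from those listings):

```shell
# parted reports logical 512-byte sectors even though the physical
# sector size is 4096, so multiplying by 512 should reproduce ~3 TB.
disk_sectors=5860533168
part2_start=1955840
part2_end=5860532223

disk_bytes=$(( disk_sectors * 512 ))              # total disk size in bytes
part2_sectors=$(( part2_end - part2_start + 1 ))  # parted's Size column

echo "disk: $disk_bytes bytes, partition 2: $part2_sectors sectors"
```

The result (3,000,592,982,016 bytes, i.e. ~3 TB, and 5858576384 sectors for partition 2) matches the parted output, so bs=512 with those skip values is correct.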
As for the last state: one drive was set faulty, apparently, but the
spare had not been integrated. I may have been caught by a bug
described by Neil Brown, where on shutdown disks were wrongly reported,
and subsequently superblock information was overwritten.
I don't have NAS/SAN storage space to make identical copies of 5x3 TB,
but maybe I should buy 5 more disks and do a dd mirror so I have a
backup of the current state.
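A minimal sketch of such a sector-for-sector copy, demonstrated here on scratch files (on the real hardware SRC and DST would be whole disks, e.g. SRC=/dev/sda and DST a blank disk of at least the same size; GNU ddrescue would additionally log and retry read errors):

```shell
# conv=noerror,sync keeps dd going past unreadable blocks (zero-padding
# them) instead of aborting, so one bad sector doesn't kill the image.
SRC=${SRC:-/tmp/member.img}
DST=${DST:-/tmp/member-copy.img}
[ -e "$SRC" ] || dd if=/dev/urandom of="$SRC" bs=64K count=4 2>/dev/null

dd if="$SRC" of="$DST" bs=64K conv=noerror,sync 2>/dev/null
cmp -s "$SRC" "$DST" && echo "copy verified"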
Again, any help / ideas welcome, especially building an mdadm version
with offset_data options ...
Julian
Universitätsklinikum Jena - Bachstrasse 18 - D-07743 Jena
You can find the legally required disclosures (in German) at http://www.uniklinikum-jena.de/Pflichtangaben.html
* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
[not found] ` <1389632980.11328.104.camel@achilles.aeskuladis.de>
From: Phil Turmel @ 2014-01-11 17:47 UTC (permalink / raw)
To: "Großkreutz, Julian", linux-raid; +Cc: neilb
Hi Julian,
Very good report. I think we can help.
On 01/11/2014 01:42 AM, Großkreutz, Julian wrote:
> Dear all, dear Neil (thanks for pointing me to this list),
>
> I am in desperate need of help. mdadm is fantastic work, and I have
> relied on mdadm for years to run very stable server systems, never had
> major problems I could not solve.
>
> This time it's different:
>
> On CentOS 6.x (I can't remember the exact version), initially in 2012:
>
> parted to create GPT partitions on 5 Seagate drives 3TB each
>
> Model: ATA ST3000DM001-9YN1 (scsi)
> Disk /dev/sda: 5860533168s # sd[bcde] identical
> Sector size (logical/physical): 512B/4096B
> Partition Table: gpt
>
> Number Start End Size File system Name Flags
> 1 2048s 1953791s 1951744s ext4 boot
> 2 1955840s 5860532223s 5858576384s primary raid
Ok.
Please also show the partition tables for the /dev/sd[fgh].
> I used an unknown mdadm version including unknown offset parameters for
> 4k alignment to create
>
> /dev/sd[abcde]1 as /dev/md0 raid 1 for booting (1 GB)
> /dev/sd[abcde]2 as /dev/md1 raid 6 for data (9 TB) lvm physical drive
>
> Later added 3 more 3T identical Seagate drives with identical partition
> layout, but later firmware.
>
> Using likely a different newer version of mdadm I expanded RAID 6 by 2
> drives and added 1 spare.
>
> /dev/md1 was at 15 TB gross, 13 TB usable, expanded pv
>
> Ran fine
Ok. Your evidence below suggests you created the larger array from
scratch instead of using --grow. Do you remember?
> Then I moved the 8 disks to a new server with an hba and backplane,
> array did not start because mdadm did not find the superblocks on the
> original 5 devices /dev/sd[abcde]2. Moving the disks back to the old
> server the error did not vanish. Using a centos 6.3 livecd, I got the
> following:
>
> [root@livecd ~]# mdadm -Evvvvs /dev/sd[abcdefgh]2
> mdadm: No md superblock detected on /dev/sda2.
> mdadm: No md superblock detected on /dev/sdb2.
> mdadm: No md superblock detected on /dev/sdc2.
> mdadm: No md superblock detected on /dev/sdd2.
> mdadm: No md superblock detected on /dev/sde2.
>
> /dev/sdf2:
> Magic : a92b4efc
> Version : 1.2
> Feature Map : 0x0
> Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
> Name : 1
> Creation Time : Wed Jul 31 18:24:38 2013
Note this creation time... would have been 2012 if you had used --grow.
> Raid Level : raid6
> Raid Devices : 7
>
> Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
> Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
> Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
This used dev size is very odd. The unused space after the data area is
1155584 sectors (>500MiB).
> Data Offset : 262144 sectors
> Super Offset : 8 sectors
> State : active
> Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d
>
> Update Time : Mon Dec 16 01:16:26 2013
> Checksum : ee921c43 - correct
> Events : 327
>
> Layout : left-symmetric
> Chunk Size : 256K
>
> Device Role : Active device 5
> Array State : A.AAAAA ('A' == active, '.' == missing)
>
> /dev/sdg2:
> Magic : a92b4efc
> Version : 1.2
> Feature Map : 0x0
> Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
> Name : 1
> Creation Time : Wed Jul 31 18:24:38 2013
> Raid Level : raid6
> Raid Devices : 7
>
> Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
> Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
> Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
> Data Offset : 262144 sectors
> Super Offset : 8 sectors
> State : active
> Device UUID : a1e1e51b:d8912985:e51207a9:1d718292
>
> Update Time : Mon Dec 16 01:16:26 2013
> Checksum : 4ef01fe9 - correct
> Events : 327
>
> Layout : left-symmetric
> Chunk Size : 256K
>
> Device Role : Active device 6
> Array State : A.AAAAA ('A' == active, '.' == missing)
>
> /dev/sdh2:
> Magic : a92b4efc
> Version : 1.2
> Feature Map : 0x0
> Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
> Name : 1
> Creation Time : Wed Jul 31 18:24:38 2013
> Raid Level : raid6
> Raid Devices : 7
>
> Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
> Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
> Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
> Data Offset : 262144 sectors
> Super Offset : 8 sectors
> State : active
> Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1
>
> Update Time : Mon Dec 16 01:16:26 2013
> Checksum : a1330e97 - correct
> Events : 327
>
> Layout : left-symmetric
> Chunk Size : 256K
>
> Device Role : spare
> Array State : A.AAAAA ('A' == active, '.' == missing)
>
>
> I suspect that the superblocks of the original 5 devices are at a
> different location, possibly because they were created with a different
> mdadm version, i.e. at the end of the partitions. Booting the drives
> with the hba in IT (non-raid) mode on the new server may have written
> something to the end of the partitions on the first five drives,
> because I can hexdump something containing "EFI PART" in the last 64 kB
> of all 8 partitions used for the raid 6; this may not have affected the
> 3 added drives, which show metadata 1.2.
The "EFI PART" is part of the backup copy of the GPT. All the drives in
a working array will have the same metadata version (superblock
location) even if the data offsets are different.
I would suggest hexdumping entire devices looking for the MD superblock
magic value, which will always be at the start of a 4k-aligned block.
Show (will take a long time, even with the big block size):
for x in /dev/sd[a-e]2 ; do echo -e "\nDevice $x" ; dd if=$x bs=1M | hexdump -C | grep "000  fc 4e 2b a9" ; done
(Note the two spaces after "000": hexdump -C separates the offset from
the first byte with two spaces.)
For any candidates found, hexdump the whole 4k block for us.
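Before committing to a multi-hour scan, the pipeline can be sanity-checked on a scratch image (a sketch; the /tmp path is arbitrary). The magic a92b4efc is stored little-endian on disk, hence the fc 4e 2b a9 byte order, and hexdump -C prints two spaces between the offset and the first byte:

```shell
# Build a 32K scratch image with the md magic planted at a 4K-aligned
# offset (3 * 4096), then confirm the grep pattern finds it.
img=/tmp/sbtest.img
dd if=/dev/zero of=$img bs=4K count=8 2>/dev/null
printf '\xfc\x4e\x2b\xa9' | dd of=$img bs=1 seek=$((3 * 4096)) conv=notrunc 2>/dev/null

# A 4K-aligned offset always ends in "000", so this anchors the match:
dd if=$img bs=1M 2>/dev/null | hexdump -C | grep "000  fc 4e 2b a9"
```

If that prints the planted line, the same pipeline can be trusted on the real partitions.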
> If any of you can help me sort this out I would greatly appreciate it. I
> guess I need the mdadm version where I can set the data offset
> differently for each device, but it doesn't compile; it errors out in
> sha1.c:
>
> sha1.h:29:22: Fehler: ansidecl.h: Datei oder Verzeichnis nicht gefunden
> (didn't find ansidecl.h, error in German)
You probably need some *-dev packages. I don't use the RHEL platform,
so I'm not sure what you'd need. In the Ubuntu world, it'd be the
"build-essential" meta-package.
> What would be the best way to proceed? There is critical data on this
> raid, not fully backed up.
>
> (UPD'T)
>
> Thanks for getting back.
>
> Yes, it's bad, I know; so is tweaking without keeping exact records of
> versions and offsets.
>
> I am, however, rather sure that nothing was written to the disks when I
> plugged them into the NEW server, unless starting up a live cd causes an
> automatic assemble attempt with an update to the superblocks. That I
> cannot exclude.
>
> What I did so far w/o writing to the disks
>
> get non-00 data at the beginning of sda2:
>
> dd if=/dev/sda skip=1955840 bs=512 count=10 | hexdump -C | grep [^00]
FWIW, you could have combined "if=/dev/sda skip=1955840" into
"if=/dev/sda2" . . . :-)
> gives me
>
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00001000 1e b5 54 51 20 4c 56 4d 32 20 78 5b 35 41 25 72 |..TQ LVM2 x[5A%r|
> 00001010 30 4e 2a 3e 01 00 00 00 00 10 00 00 00 00 00 00 |0N*>............|
> 00001020 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> 00001030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00001200 76 67 5f 6e 65 64 69 67 73 30 32 20 7b 0a 69 64 |vg_nedigs02 {.id|
> 00001210 20 3d 20 22 32 4c 62 48 71 64 2d 72 67 42 74 2d | = "2LbHqd-rgBt-|
> 00001220 45 4a 75 31 2d 32 52 36 31 2d 41 35 7a 74 2d 6e |EJu1-2R61-A5zt-n|
> 00001230 49 58 53 2d 66 79 4f 36 33 73 22 0a 73 65 71 6e |IXS-fyO63s".seqn|
> 00001240 6f 20 3d 20 37 0a 66 6f 72 6d 61 74 20 3d 20 22 |o = 7.format = "|
> 00001250 6c 76 6d 32 22 20 23 20 69 6e 66 6f 72 6d 61 74 |lvm2" # informat|
> (cont'd)
This implies that /dev/sda2 is the first device in a raid5/6 that uses
metadata 0.9 or 1.0. You've found the LVM PV signature, which starts at
4k into a PV. Theoretically, this could be a stray, abandoned signature
from the original array, with the real LVM signature at the 262144
offset. Show:
dd if=/dev/sda2 skip=262144 count=16 |hexdump -C
>
> but on /dev/sdb2
>
> 00000000 5f 80 00 00 5f 80 01 00 5f 80 02 00 5f 80 03 00 |_..._..._..._...|
> 00000010 5f 80 04 00 5f 80 0c 00 5f 80 0d 00 00 00 00 00 |_..._..._.......|
> 00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00001000 60 80 00 00 60 80 01 00 60 80 02 00 60 80 03 00 |`...`...`...`...|
> 00001010 60 80 04 00 60 80 0c 00 60 80 0d 00 00 00 00 00 |`...`...`.......|
> 00001020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00001400
>
> so my initial guess that the data may start at 00001000 did not pan out.
No, but with parity raid scattering data amongst the participating
devices, the report on /dev/sdb2 is expected.
> Does anybody have an idea of how to reliably identify an mdadm
> superblock in a hexdump of the drive?
Above.
> And second, have I got my numbers right? In parted I see the sector
> count, and when I multiply the total count by 512 (not 4096!) I get 3
> TB, so I think I have to use bs=512 in dd to get the partition
> boundaries correct.
dd uses bs=512 as the default. And it can access the partitions directly.
> As for the last state: one drive was set faulty, apparently, but the
> spare had not been integrated. I may have been caught by a bug
> described by Neil Brown, where on shutdown disks were wrongly reported,
> and subsequently superblock information was overwritten.
Possible. If so, you may not find any superblocks with the grep above.
> I don't have NAS/SAN storage space to make identical copies of 5x3 TB,
> but maybe I should buy 5 more disks and do a dd mirror so I have a
> backup of the current state.
We can do some more non-destructive investigation first.
Regards,
Phil
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
[not found] ` <1389632980.11328.104.camel@achilles.aeskuladis.de>
@ 2014-01-13 18:42 ` Phil Turmel
2014-01-13 20:11 ` Chris Murphy
2014-01-14 10:31 ` Großkreutz, Julian
From: Phil Turmel @ 2014-01-13 18:42 UTC (permalink / raw)
To: "Großkreutz, Julian", linux-raid; +Cc: neilb
Hi Julian,
[Note, your reply didn't make it to linux-raid due to size. I believe
the limit is 150k ~ 200k.]
On 01/13/2014 12:09 PM, Großkreutz, Julian wrote:
> Hi Phil,
>
> thanks for getting back so quickly
>
>>> Model: ATA ST3000DM001-9YN1 (scsi)
Aside: This model looks familiar. I'm pretty sure these drives are
desktop models that lack scterc support. Meaning they are *not*
generally suitable for raid duty. Search the archives for combinations
of "timeout mismatch", "scterc", "URE", and "scrub" for a full
explanation. If I've guessed correctly, you *must* use the driver
timeout work-around before proceeding.
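The work-around itself is short. This is a sketch (the /dev/sd[a-h] device list and the 180-second value are assumptions to adapt): try to enable 7-second SCT error recovery control, and where a drive refuses, raise the kernel's command timer so the driver outlasts the drive's internal retries.

```shell
# Requires root on the affected host. "smartctl -l scterc,70,70" asks
# the drive to give up on a bad sector after 7.0 s; desktop drives that
# reject it get a 180 s kernel command timeout via sysfs instead.
for d in /dev/sd[a-h]; do
  if ! smartctl -l scterc,70,70 "$d" >/dev/null 2>&1; then
    t="/sys/block/${d##*/}/device/timeout"
    [ -w "$t" ] && echo 180 > "$t"
  fi
done
echo "timeout work-around attempted on all members"
```

Note that neither setting survives a reboot or drive power cycle, so this belongs in a boot script on the rebuilt system.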
[trim /]
> I noticed one difference: part 1 is one sector longer than
> on /dev/sd[abcde], but part 2 starts at the same sector in all 8 drives
> and has the same length in all 8 drives. I usually leave 800-1000
> sectors unallocated at the end. As previously mentioned the first 5
> drives are older, the last three newer (the drive with the oldest
> firmware is sdb which has (coincidence?) gone missing according to sd[fgh]).
Some versions of parted/fdisk will get you that extra sector, wasting a
megabyte or more between partitions. Not relevant here, I don't think.
The partitions we need are all consistent.
[trim /]
>> Ok. Your evidence below suggests you created the larger array from
>> scratch instead of using --grow. Do you remember?
>>
> I seem to recall that building the initial 5 disk raid 6 was difficult,
> and I think I needed a custom compiled mdadm version (now residing on
> the inaccessible raid) which allowed me to align the offsets and
> optimize performance which was otherwise abysmal. I may have chosen 1.0
> superblock. Extending the raid was difficult as well, but I don't recall
> recreating it from scratch. Maybe I tried once using standard settings,
> didn't work, and then used the "custom" mdadm with offsets on the new
> drives as well. Sadly I can't remember. The existing superblock 1.2
> on /dev/sd[fgh] seems standard in data offset and superblock offset.
I don't think you can run an array with mixed superblock locations, so
I'm now concerned that the partitions on /dev/sd[a-e] aren't correct.
Instead of attempting to find the superblock signature, I think we
should first try to find the LVM2 signature.
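One cheap way to do that without a full superblock scan: the LVM2 metadata-area header carries a printable magic string (" LVM2 x[5A%r0N*>", visible in the sda2 dump above), which grep can locate by byte offset. A scratch-file sketch (substitute the real partition, e.g. /dev/sda2, for the image file):

```shell
# Plant the printable part of the LVM2 metadata-header magic at an
# arbitrary offset in a scratch image, then locate it. grep flags:
# -a treat binary as text, -b print byte offset, -o print only matches.
img=/tmp/lvmtest.img
dd if=/dev/zero of=$img bs=4K count=4 2>/dev/null
printf ' LVM2 x[5A%%r0N*>' | dd of=$img bs=1 seek=8704 conv=notrunc 2>/dev/null

grep -abo 'LVM2 x.5A' $img
```

On a real member, the reported byte offset minus 512 (the magic sits 512 bytes into the metadata area's first sector) points at the start of the LVM metadata area, from which the md data offset can be inferred.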
>> Note this creation time... would have been 2012 if you had used --grow.
>>
> Dont pin me down on 2012, but surely the original set of five was not
> created July 2013, my third child was born on the 8th. By then this raid
> was up and served as an extra mirror archive.
But you could have backed up and re-created from scratch *after*. It
does say July 31.
>> This used dev size is very odd. The unused space after the data area is
>> 1155584 sectors (>500MiB).
>
> Possibly the result of my fiddling with a custom mdadm and offsets to
> begin with? I presume I could not have set this manually.
No, I don't think so.
[trim /]
>> I would suggest hexdumping entire devices looking for the MD superblock
>> magic value, which will always be at the start of a 4k-aligned block.
>>
>> Show (will take a long time, even with the big block size):
>>
>> for x in /dev/sd[a-e]2 ; do echo -e "\nDevice $x" ; dd if=$x bs=1M | hexdump -C | grep "000  fc 4e 2b a9" ; done
>>
>
> I started it, but this old dual Xeon puts 1.2 MB/s through the hexdump
> thread if the data is not all zero -> it will take approx. 20 days!
>
> For now: the last 2.8 GB of all 8 drives did not show the signature:
>
> [root@livecd ~]# for x in /dev/sd[a-h]; do echo -e "\nDevice $x"; dd if=$x skip=5855000000 count=100000000 |hexdump -C |grep "000 fc 4e 2b a9"; done
Don't bother with this now.
> So attached you will find hexdumps of 64k of /dev/sd[a-h]2 at sector 0
> and 262144, which show the superblock 1.2 on sd[fgh]2, not on sd[a-e]2,
> but may help to identify data_offset; I suspect it is 2048 on sd[a-e]2
> and 262144 on sd[fgh]2.
>
Jackpot! LVM2 embedded backup data at the correct location for mdadm
data offset == 262144. And on /dev/sda2, which is the only device that
should have it (first device in the raid).
From /dev/sda2 @ 262144:
> 00001200 76 67 5f 6e 65 64 69 67 73 30 32 20 5d 0a 69 64 |vg_nedigs02 ].id|
> 00001210 20 3d 20 22 32 4c 62 48 71 64 2d 72 67 42 9f 6e | = "2LbHqd-rgB.n|
> 00001220 45 4a 75 31 2d 32 52 36 31 2d 41 35 f5 75 2d 6e |EJu1-2R61-A5.u-n|
> 00001230 49 58 53 2d 66 79 4f 36 33 73 22 0a 73 65 3a 01 |IXS-fyO63s".se:.|
> 00001240 6f 20 3d 20 33 36 0a 66 6f 72 6d 61 ca 24 3d 20 |o = 36.forma.$= |
> 00001250 22 6c 76 6d 32 22 20 23 20 69 6e 66 6f 72 6b ac |"lvm2" # infork.|
> 00001260 74 69 6f 6e 61 6c 0a 73 74 61 74 75 ee 22 3d 20 |tional.statu."= |
> 00001270 5b 22 52 45 53 49 5a 45 41 42 4c 45 22 2c 3e c0 |["RESIZEABLE",>.|
> 00001280 52 45 41 44 22 2c 20 22 57 52 49 54 d0 27 5d 0a |READ", "WRIT.'].|
> 00001290 66 6c 61 67 73 20 3d 20 5b 5d 0a 65 78 74 4b df |flags = [].extK.|
> 000012a0 74 5f 73 69 7a 65 20 3d 20 38 31 39 3e 08 6d 61 |t_size = 819>.ma|
> 000012b0 78 5f 6c 76 20 3d 20 30 0a 6d 61 78 5f 70 14 13 |x_lv = 0.max_p..|
> 000012c0 3d 20 30 0a 6d 65 74 61 64 61 74 61 b3 63 6f 70 |= 0.metadata.cop|
> 000012d0 69 65 73 20 3d 20 30 0a 0a 70 68 79 73 69 97 c4 |ies = 0..physi..|
> 000012e0 6c 5f 76 6f 6c 75 6d 65 73 20 7b 0a 2e 78 76 30 |l_volumes {..xv0|
> 000012f0 20 7b 0a 69 64 20 3d 20 22 50 4a 48 4c 67 bf 14 | {.id = "PJHLg..|
> 00001300 53 70 56 70 2d 47 55 71 34 2d 6b 4a 57 7f 2d 39 |SpVp-GUq4-kJW.-9|
> 00001310 6d 74 4b 2d 31 6c 65 4a 2d 73 36 64 39 6a d8 1b |mtK-1leJ-s6d9j..|
> 00001320 0a 64 65 76 69 63 65 20 3d 20 22 2f 79 6c 76 2f |.device = "/ylv/|
> 00001330 73 64 66 32 22 0a 0a 73 74 61 74 75 73 20 7d 18 |sdf2"..status }.|
> 00001340 5b 22 41 4c 4c 4f 43 41 54 41 42 4c df 25 5d 0a |["ALLOCATABL.%].|
> 00001350 66 6c 61 67 73 20 3d 20 5b 5d 0a 64 65 76 e3 b5 |flags = [].dev..|
> 00001360 69 7a 65 20 3d 20 31 30 32 34 30 30 ce 33 30 0a |ize = 102400.30.|
> 00001370 70 65 5f 73 74 61 72 74 20 3d 20 32 30 34 99 22 |pe_start = 204."|
> 00001380 70 65 5f 63 6f 75 6e 74 20 3d 20 31 cd 33 39 39 |pe_count = 1.399|
> 00001390 0a 7d 0a 0a 70 76 31 20 7b 0a 69 64 20 3d 92 37 |.}..pv1 {.id =.7|
> 000013a0 44 39 7a 75 70 37 2d 6a 76 79 46 2d 6b 32 73 42 |D9zup7-jvyF-k2sB|
> 000013b0 2d 42 75 59 30 2d 39 74 73 61 2d 41 78 68 11 86 |-BuY0-9tsa-Axh..|
> 000013c0 34 45 51 48 4e 71 22 0a 64 65 76 69 c0 61 20 3d |4EQHNq".devi.a =|
> 000013d0 20 22 2f 64 65 76 2f 6d 64 31 22 0a 0a 73 e4 c6 | "/dev/md1"..s..|
> 000013e0 74 75 73 20 3d 20 5b 22 41 4c 4c 4f db 41 54 41 |tus = ["ALLO.ATA|
> 000013f0 42 4c 45 22 5d 0a 66 6c 61 67 73 20 3d 20 f4 12 |BLE"].flags = ..|
> 00001400 0a 64 65 76 5f 73 69 7a 65 20 3d 20 14 39 32 38 |.dev_size = .928|
> 00001410 35 37 39 33 32 38 30 0a 70 65 5f 73 74 61 99 37 |5793280.pe_sta.7|
> 00001420 20 3d 20 35 31 32 0a 70 65 5f 63 6f 4d 6d 74 20 | = 512.pe_coMmt |
> 00001430 3d 20 33 35 37 34 39 32 35 0a 7d 0a 7d 0a 77 f1 |= 3574925.}.}.w.|
> 00001440 6f 67 69 63 61 6c 5f 76 6f 6c 75 6d 9c 7d 20 7b |ogical_volum.} {|
> 00001450 0a 0a 6c 76 5f 76 61 72 20 7b 0a 69 64 20 b9 ee |..lv_var {.id ..|
> 00001460 22 5a 4a 47 56 55 4d 2d 4d 70 76 50 a8 7a 6f 49 |"ZJGVUM-MpvP.zoI|
> 00001470 39 2d 68 31 39 47 2d 57 70 75 6d 2d 4e 4b ee d5 |9-h19G-Wpum-NK..|
> 00001480 2d 4a 77 34 32 31 59 22 0a 73 74 61 f4 70 73 20 |-Jw421Y".sta.ps |
> 00001490 3d 20 5b 22 52 45 41 44 22 2c 20 22 57 52 73 ed |= ["READ", "WRs.|
> 000014a0 45 22 2c 20 22 56 49 53 49 42 4c 45 86 5a 0a 66 |E", "VISIBLE.Z.f|
> 000014b0 6c 61 67 73 20 3d 20 5b 5d 0a 73 65 67 6d b4 4c |lags = [].segm.L|
> 000014c0 74 5f 63 6f 75 6e 74 20 3d 20 31 0a d1 76 65 67 |t_count = 1..veg|
> 000014d0 6d 65 6e 74 31 20 7b 0a 73 74 61 72 74 5f 3c f6 |ment1 {.start_<.|
> 000014e0 74 65 6e 74 20 3d 20 30 0a 65 78 74 9a 68 74 5f |tent = 0.ext.ht_|
> 000014f0 63 6f 75 6e 74 20 3d 20 31 32 35 30 0a 0a 97 fc |count = 1250....|
> 00001500 70 65 20 3d 20 22 73 74 72 69 70 65 a4 23 0a 73 |pe = "stripe.#.s|
> 00001510 74 72 69 70 65 5f 63 6f 75 6e 74 20 3d 20 5c a5 |tripe_count = \.|
> 00001520 23 20 6c 69 6e 65 61 72 0a 0a 73 74 75 69 70 65 |# linear..stuipe|
> 00001530 73 20 3d 20 5b 0a 22 70 76 30 22 2c 20 34 88 4c |s = [."pv0", 4.L|
> 00001540 39 0a 5d 0a 7d 0a 7d 0a 0a 6c 76 5f b5 6e 6f 74 |9.].}.}..lv_.not|
> 00001550 20 7b 0a 69 64 20 3d 20 22 4c 48 58 57 4f 97 f4 | {.id = "LHXWO..|
> 00001560 47 30 6f 63 2d 62 4a 54 31 2d 49 6e 5d 36 2d 36 |G0oc-bJT1-In]6-6|
> 00001570 46 39 58 2d 7a 76 4b 50 2d 53 68 73 74 66 b7 69 |F9X-zvKP-Shstf.i|
> 00001580 0a 73 74 61 74 75 73 20 3d 20 5b 22 0b 42 41 44 |.status = [".BAD|
> 00001590 22 2c 20 22 57 52 49 54 45 22 2c 20 22 56 39 ed |", "WRITE", "V9.|
> 000015a0 49 42 4c 45 22 5d 0a 66 6c 61 67 73 ef 3d 20 5b |IBLE"].flags.= [|
> 000015b0 5d 0a 73 65 67 6d 65 6e 74 5f 63 6f 75 6e 7b 0b |].segment_coun{.|
> 000015c0 3d 20 31 0a 0a 73 65 67 6d 65 6e 74 4c 27 7b 0a |= 1..segmentL'{.|
> 000015d0 73 74 61 72 74 5f 65 78 74 65 6e 74 20 3d 1a 75 |start_extent =.u|
> 000015e0 0a 65 78 74 65 6e 74 5f 63 6f 75 6e ae 22 3d 20 |.extent_coun."= |
> 000015f0 32 35 30 30 0a 0a 74 79 70 65 20 3d 20 22 c9 37 |2500..type = ".7|
> 00001600 72 69 70 65 64 22 0a 73 74 72 69 70 77 50 63 6f |riped".stripwPco|
> 00001610 75 6e 74 20 3d 20 31 09 23 20 6c 69 6e 65 f1 fc |unt = 1.# line..|
> 00001620 0a 0a 73 74 72 69 70 65 73 20 3d 20 24 0b 22 70 |..stripes = $."p|
> 00001630 76 30 22 2c 20 32 34 39 39 0a 5d 0a 7d 0a 05 56 |v0", 2499.].}..V|
> 00001640 0a 6c 76 5f 68 6f 6d 65 20 7b 0a 69 26 22 3d 20 |.lv_home {.i&"= |
> 00001650 22 76 48 4a 37 4d 34 2d 74 74 77 4f 2d 46 71 7d |"vHJ7M4-ttwO-Fq}|
> 00001660 6e 2d 72 35 67 71 2d 74 44 48 74 2d 38 49 64 37 |n-r5gq-tDHt-8Id7|
> 00001670 2d 54 56 74 52 6f 36 22 0a 73 74 61 74 75 ff 91 |-TVtRo6".statu..|
> 00001680 3d 20 5b 22 52 45 41 44 22 2c 20 22 9a 54 49 54 |= ["READ", ".TIT|
> 00001690 45 22 2c 20 22 56 49 53 49 42 4c 45 22 5d 47 54 |E", "VISIBLE"]GT|
> 000016a0 6c 61 67 73 20 3d 20 5b 5d 0a 73 65 e6 6b 65 6e |lags = [].se.ken|
> 000016b0 74 5f 63 6f 75 6e 74 20 3d 20 31 0a 0a 73 fe d2 |t_count = 1..s..|
> 000016c0 6d 65 6e 74 31 20 7b 0a 73 74 61 72 3e 50 65 78 |ment1 {.star>Pex|
> 000016d0 74 65 6e 74 20 3d 20 30 0a 65 78 74 65 6e 77 a2 |tent = 0.extenw.|
> 000016e0 63 6f 75 6e 74 20 3d 20 32 35 30 30 13 0a 74 79 |count = 2500..ty|
> 000016f0 70 65 20 3d 20 22 73 74 72 69 70 65 64 22 dd 28 |pe = "striped".(|
> 00001700 74 72 69 70 65 5f 63 6f 75 6e 74 20 1e 22 31 09 |tripe_count ."1.|
> 00001710 23 20 6c 69 6e 65 61 72 0a 0a 73 74 72 69 2a 8b |# linear..stri*.|
> 00001720 73 20 3d 20 5b 0a 22 70 76 30 22 2c 1c 35 32 34 |s = [."pv0",.524|
> 00001730 39 0a 5d 0a 7d 0a 7d 0a 0a 6c 76 5f 73 77 5d dc |9.].}.}..lv_sw].|
> 00001740 20 7b 0a 69 64 20 3d 20 22 58 6f 36 e6 7a 36 2d | {.id = "Xo6.z6-|
> 00001750 39 62 61 38 2d 49 54 53 73 2d 57 63 61 78 ba 6f |9ba8-ITSs-Wcax.o|
> 00001760 73 42 52 2d 6e 48 65 61 2d 65 44 45 63 61 33 22 |sBR-nHea-eDEca3"|
> 00001770 0a 73 74 61 74 75 73 20 3d 20 5b 22 52 45 08 4f |.status = ["RE.O|
> 00001780 22 2c 20 22 57 52 49 54 45 22 2c 20 ec 50 49 53 |", "WRITE", .PIS|
> 00001790 49 42 4c 45 22 5d 0a 66 6c 61 67 73 20 3d 9d 2d |IBLE"].flags =.-|
> 000017a0 5d 0a 73 65 67 6d 65 6e 74 5f 63 6f 04 6a 74 20 |].segment_co.jt |
> 000017b0 3d 20 31 0a 0a 73 65 67 6d 65 6e 74 31 20 72 ec |= 1..segment1 r.|
> 000017c0 73 74 61 72 74 5f 65 78 74 65 6e 74 7b 3d 20 30 |start_extent{= 0|
> 000017d0 0a 65 78 74 65 6e 74 5f 63 6f 75 6e 74 20 5e 17 |.extent_count ^.|
> 000017e0 32 34 39 39 0a 0a 74 79 70 65 20 3d f7 21 73 74 |2499..type =.!st|
> 000017f0 72 69 70 65 64 22 0a 73 74 72 69 70 65 5f 1a 13 |riped".stripe_..|
> 00001800 75 6e 74 20 3d 20 31 09 23 20 6c 69 51 65 61 72 |unt = 1.# liQear|
> 00001810 0a 0a 73 74 72 69 70 65 73 20 3d 20 5b 0a 1e 68 |..stripes = [..h|
> 00001820 76 30 22 2c 20 30 0a 5d 0a 7d 0a 7d 0a 0a 6c 76 |v0", 0.].}.}..lv|
> 00001830 5f 74 6d 70 20 7b 0a 69 64 20 3d 20 22 6b 66 55 |_tmp {.id = "kfU|
> 00001840 76 49 50 2d 55 4f 56 50 2d 53 67 61 24 2a 55 71 |vIP-UOVP-Sga$*Uq|
> 00001850 49 4f 2d 56 36 32 6f 2d 33 56 58 47 2d 52 7e 09 |IO-V62o-3VXG-R~.|
> 00001860 67 6b 75 22 0a 73 74 61 74 75 73 20 c2 27 5b 22 |gku".status .'["|
> 00001870 52 45 41 44 22 2c 20 22 57 52 49 54 45 22 9e 35 |READ", "WRITE".5|
> 00001880 22 56 49 53 49 42 4c 45 22 5d 0a 66 37 61 67 73 |"VISIBLE"].f7ags|
> 00001890 20 3d 20 5b 5d 0a 73 65 67 6d 65 6e 74 5f 00 58 | = [].segment_.X|
> 000018a0 75 6e 74 20 3d 20 31 0a 0a 73 65 67 80 61 6e 74 |unt = 1..seg.ant|
> 000018b0 31 20 7b 0a 73 74 61 72 74 5f 65 78 74 65 89 ec |1 {.start_exte..|
> 000018c0 20 3d 20 30 0a 65 78 74 65 6e 74 5f 2a 6c 75 6e | = 0.extent_*lun|
> 000018d0 74 20 3d 20 32 35 30 30 0a 0a 74 79 70 65 16 87 |t = 2500..type..|
> 000018e0 20 22 73 74 72 69 70 65 64 22 0a 73 40 77 69 70 | "striped".s@wip|
> 000018f0 65 5f 63 6f 75 6e 74 20 3d 20 31 09 23 20 31 06 |e_count = 1.# 1.|
> 00001900 6e 65 61 72 0a 0a 73 74 72 69 70 65 cc 25 3d 20 |near..stripe.%= |
> 00001910 5b 0a 22 70 76 30 22 2c 20 38 37 34 39 0a 5b ab |[."pv0", 8749.[.|
> 00001920 7d 0a 7d 0a 7d 0a 7d 0a 23 20 47 65 b1 66 72 61 |}.}.}.}.# Ge.fra|
> 00001930 74 65 64 20 62 79 20 4c 56 4d 32 20 76 65 89 6d |ted by LVM2 ve.m|
> 00001940 69 6f 6e 20 32 2e 30 32 2e 39 38 28 ff 2f 2d 52 |ion 2.02.98(./-R|
> 00001950 48 45 4c 36 20 28 32 30 31 32 2d 31 30 2d 7c 07 |HEL6 (2012-10-|.|
> 00001960 29 3a 20 57 65 64 20 4a 75 6c 20 33 14 22 31 38 |): Wed Jul 3."18|
> 00001970 3a 32 36 3a 31 39 20 32 30 31 33 0a 0a 63 d5 09 |:26:19 2013..c..|
> 00001980 74 65 6e 74 73 20 3d 20 22 54 65 78 69 21 46 6f |tents = "Texi!Fo|
> 00001990 72 6d 61 74 20 56 6f 6c 75 6d 65 20 47 72 a8 8f |rmat Volume Gr..|
> 000019a0 70 22 0a 76 65 72 73 69 6f 6e 20 3d 20 31 0a 0a |p".version = 1..|
> 000019b0 64 65 73 63 72 69 70 74 69 6f 6e 20 3d 20 22 22 |description = ""|
> 000019c0 0a 0a 63 72 65 61 74 69 6f 6e 5f 68 f7 73 74 20 |..creation_h.st |
> 000019d0 3d 20 22 6e 65 64 69 67 73 33 30 2e 6e 65 cb 26 |= "nedigs30.ne.&|
> 000019e0 67 2e 61 65 73 6b 75 6c 61 64 69 73 2e 6c 6f 63 |g.aeskuladis.loc|
> 000019f0 61 6c 22 09 23 20 4c 69 6e 75 78 20 6e 65 64 69 |al".# Linux nedi|
> 00001a00 67 73 33 30 2e 6e 65 64 69 67 2e 61 65 73 6b 75 |gs30.nedig.aesku|
> 00001a10 6c 61 64 69 73 2e 6c 6f 63 61 6c 20 32 2e 36 2e |ladis.local 2.6.|
> 00001a20 33 32 2d 33 35 38 2e 36 2e 31 2e 65 93 35 2e 78 |32-358.6.1.e.5.x|
> 00001a30 38 36 5f 36 34 20 23 31 20 53 4d 50 20 54 85 b1 |86_64 #1 SMP T..|
> 00001a40 20 41 70 72 20 32 33 20 31 39 3a 32 76 3a 30 30 | Apr 23 19:2v:00|
> 00001a50 20 55 54 43 20 32 30 31 33 20 78 38 36 5f 10 f7 | UTC 2013 x86_..|
> 00001a60 0a 63 72 65 61 74 69 6f 6e 5f 74 69 71 61 20 3d |.creation_tiqa =|
> 00001a70 20 31 33 37 35 32 38 37 39 37 39 09 23 20 d2 32 | 1375287979.# .2|
> 00001a80 64 20 4a 75 6c 20 33 31 20 31 38 3a af 37 3a 31 |d Jul 31 18:.7:1|
> 00001a90 39 20 32 30 31 33 0a 0a 00 00 00 00 00 00 ee 12 |9 2013..........|
Note the creation date/time at the end (with a corrupted byte):
Jul 31 18:?7:19 2013
There are other corrupted bytes scattered around. I'd be worried about
the RAM in this machine. Since you are using non-enterprise drives, I'm
going to go out on a limb here and guess that the server doesn't have
ECC ram...
Part of the signature that should have showed up at 00001000 is missing,
too.
Consider performing an extended memcheck run to see what's going on.
Maybe move the entire stack of disks to another server.
>>> 00001200 76 67 5f 6e 65 64 69 67 73 30 32 20 7b 0a 69 64 |vg_nedigs02 {.id|
>>> 00001210 20 3d 20 22 32 4c 62 48 71 64 2d 72 67 42 74 2d | = "2LbHqd-rgBt-|
>>> 00001220 45 4a 75 31 2d 32 52 36 31 2d 41 35 7a 74 2d 6e |EJu1-2R61-A5zt-n|
>>> 00001230 49 58 53 2d 66 79 4f 36 33 73 22 0a 73 65 71 6e |IXS-fyO63s".seqn|
>>> 00001240 6f 20 3d 20 37 0a 66 6f 72 6d 61 74 20 3d 20 22 |o = 7.format = "|
>>> 00001250 6c 76 6d 32 22 20 23 20 69 6e 66 6f 72 6d 61 74 |lvm2" # informat|
>>> (cont'd)
>>
>> This implies that /dev/sda2 is the first device in a raid5/6 that uses
>> metadata 0.9 or 1.0. You've found the LVM PV signature, which starts at
>> 4k into a PV. Theoretically, this could be a stray, abandoned signature
>> from the original array, with the real LVM signature at the 262144
>> offset. Show:
This certainly was a stray LVM2 signature from a version 1.0 metadata
array. It matches the new location, if you allow for the scattered
corrupted bytes. Even the same UUID, suggesting you did a vgcfgbackup
and vgcfgrestore sequence.
[trim /]
>> No, but with parity raid scattering data amongst the participating
>> devices, the report on /dev/sdb2 is expected.
>>
>>> As for the last state: one drive was set faulty, apparently, but the
>>> spare had not been integrated. I may have gotten caught in a bug
>>> described by Neil Brown, where on shutdown disk were wrongly reported,
>>> and subsequently superblock information was overwritten.
>>
>> Possible. If so, you may not find any superblocks with the grep above.
With memory corruption, all kinds of weird behavior is possible.
> In all, I think I lost all superblock information on sd[a-e]2, possibly
> when I extended the raid set; superblock 1.2 could not be written to
> 262144 on sd[a-e]2 because data started at 2048, so no place to put the
> superblocks.
>
> I would proceed to try a non-destructive assembly of the raid (i.e.
> read-only through a loop device for each drive) with the freshly
> compiled mdadm_offset with /dev/sd[a-e]2:2048 and /dev/sd[f-h]2:262144.
> Make sense ?
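Julian's read-only loop idea could be sketched roughly as below. This is only a dry run (DRYRUN=1 prints the commands instead of executing them), the device names are the ones from this thread, and the patched mdadm_offset invocation itself is not shown:

```shell
# Non-destructive setup sketch: wrap each member partition in a read-only
# loop device so that nothing can be written during assembly attempts.
# DRYRUN=1 -> only print what would be run (the real commands need root).
DRYRUN=1
run() { if [ "$DRYRUN" = 1 ]; then echo "$*"; else "$@"; fi; }

for part in sda2 sdb2 sdc2 sdd2 sde2 sdf2 sdg2 sdh2; do
  run losetup --find --show --read-only "/dev/$part"   # prints e.g. /dev/loop0
done
# then, against the loop devices, something like:
#   run mdadm --assemble --readonly /dev/md1 /dev/loop[0-7]
```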
Based on the signature discovered above, we should be able to --create
--assume-clean with the modern default data offset. We know the
following device roles:
/dev/sda2 == 0
/dev/sdf2 == 5
/dev/sdg2 == 6
/dev/sdh2 == spare
So /dev/sdh2 should be left out until the array is working.
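A minimal sketch of that plan, assuming the geometry reported by the surviving sd[fgh]2 superblocks (raid6, 7 devices, 256K chunk, left-symmetric, metadata 1.2) and one unverified guess at the b/c/d/e order. The echo keeps it a dry run; it must only ever be pointed at copies or overlays:

```shell
# --create --assume-clean rewrites metadata without touching data, but a
# wrong device order or geometry scrambles the array: dry-run only.
# Assumes an mdadm new enough that its default data offset matches the
# 262144 sectors seen in the -E output (otherwise --data-offset is needed).
cmd="mdadm --create /dev/md1 --assume-clean --metadata=1.2 \
  --level=6 --raid-devices=7 --chunk=256 --layout=left-symmetric \
  /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 /dev/sde2 /dev/sdf2 /dev/sdg2"
echo "$cmd"
```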
Please re-execute the "mdadm -E" reports for /dev/sd[fgh]2 and show them
uncut. (Use the lasted mdadm.) That should fill in the likely device
order of the remaining drives.
Also, it is important that you document which drive serial numbers are
currently occupying the different device names. An excerpt from "ls -l
/dev/disk/by-id/" would do.
I have to admit that I'm very concerned about your corrupted LVM
signature at offset 262144. LVM probably won't recognize your PV once
the array is assembled correctly, making it difficult to
non-destructively test the filesystems on your logical volumes. You may
have to duplicate your disks onto new ones so that an LVM restore can be
safely attempted.
Do *not* buy desktop drives! You need raid-capable drives like the WD
Red at the least.
Phil
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
2014-01-13 18:42 ` Phil Turmel
@ 2014-01-13 20:11 ` Chris Murphy
2014-01-14 10:31 ` Großkreutz, Julian
1 sibling, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2014-01-13 20:11 UTC (permalink / raw)
To: linux-raid
On Jan 13, 2014, at 11:42 AM, Phil Turmel <philip@turmel.org> wrote:
> Do *not* buy desktop drives! You need raid-capable drives like the WD
> Red at the least.
Yeah I agree. If you care about the data, suck it up and use the right drive.
Very slight threadjack here: WD has a Caviar and Scorpio Blue, and all of the models I've seen, both desktop and laptop interestingly enough, have SCT ERC support. They are "tested and recommended" for raid0/raid1 only. WD says they are not warranted for use in (among other things) "multi-bay chassis" even though they don't list raid5 by name. I think the question is whether vibration is a concern with this class of drive, which then points a typical user to the Caviar/Scorpio Black which has the same recommendation and proscription as the Blue, *but* at least one desktop and one laptop Black model I have, do not support SCT ERC.
So spec-wise it's almost as if the Red has the vibration tolerance of the Black but the SCT ERC support of the Blue. It's just odd that they segment it this way. And finding out which drives have SCT ERC support is non-obvious.
Chris Murphy
* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
2014-01-13 18:42 ` Phil Turmel
2014-01-13 20:11 ` Chris Murphy
@ 2014-01-14 10:31 ` Großkreutz, Julian
2014-01-14 13:14 ` Phil Turmel
1 sibling, 1 reply; 11+ messages in thread
From: Großkreutz, Julian @ 2014-01-14 10:31 UTC (permalink / raw)
To: Phil Turmel, linux-raid; +Cc: neilb
Hi Phil,
thanks again for bearing with me.
> >
> >>> Model: ATA ST3000DM001-9YN1 (scsi)
>
> Aside: This model looks familiar. I'm pretty sure these drives are
> desktop models that lack scterc support. Meaning they are *not*
> generally suitable for raid duty. Search the archives for combinations
> of "timeout mismatch", "scterc", "URE", and "scrub" for a full
> explanation. If I've guessed correctly, you *must* use the driver
> timeout work-around before proceeding.
>
Yes I did, and smartctl showed no significant problems. The 10 year old
server (supermicro enterprise grade dual Xeon with 8 GB ECC RAM) had
started to create problems early January which is why I wanted to move
the drives to a new server in the first place, to then transfer the data
to a new set of enterprise-grade disks. I had checked the memory and the
disks in a burn-in for several days, including timeout and power-saving
behavior, before I set up the raid in 2012/2013, and did not have any issues then.
One of the reasons I tend to use mdadm is that I am able to utilize
existing hardware to create bridging solutions until money comes in for
better hardware, and moving an mdadm raid has so far never created a
serious problem.
> > So attached You will find hexdumps of 64k of /sda/sd[a-h]2 at sector 0
> > and 262144 which shows the superblock 1.2 on sd[fgh]2, not on sd[a-e]2,
> > but may help to identify data_offset; I suspect it is 2048 on sd[a-e]2
> > and 262144 on sd[fgh]2.
> >
>
> Jackpot! LVM2 embedded backup data at the correct location for mdadm
> data offset == 262144. And on /dev/sda2, which is the only device that
> should have it (first device in the raid).
>
> From /dev/sda2 @ 262144:
>
> > 00001200 76 67 5f 6e 65 64 69 67 73 30 32 20 5d 0a 69 64 |vg_nedigs02 ].id|
> > 00001210 20 3d 20 22 32 4c 62 48 71 64 2d 72 67 42 9f 6e | = "2LbHqd-rgB.n|
> > 00001220 45 4a 75 31 2d 32 52 36 31 2d 41 35 f5 75 2d 6e |EJu1-2R61-A5.u-n|
> > 00001230 49 58 53 2d 66 79 4f 36 33 73 22 0a 73 65 3a 01 |IXS-fyO63s".se:.|
> > 00001240 6f 20 3d 20 33 36 0a 66 6f 72 6d 61 ca 24 3d 20 |o = 36.forma.$= |
> > 00001250 22 6c 76 6d 32 22 20 23 20 69 6e 66 6f 72 6b ac |"lvm2" # infork.|
> ...
> > 00001a70 20 31 33 37 35 32 38 37 39 37 39 09 23 20 d2 32 | 1375287979.# .2|
> > 00001a80 64 20 4a 75 6c 20 33 31 20 31 38 3a af 37 3a 31 |d Jul 31 18:.7:1|
> > 00001a90 39 20 32 30 31 33 0a 0a 00 00 00 00 00 00 ee 12 |9 2013..........|
>
> Note the creation date/time at the end (with a corrupted byte):
>
> Jul 31 18:?7:19 2013
>
> There are other corrupted bytes scattered around. I'd be worried about
> the RAM in this machine. Since you are using non-enterprise drives, I'm
> going to go out on a limb here and guess that the server doesn't have
> ECC ram...
see above
> Consider performing an extended memcheck run to see what's going on.
> Maybe move the entire stack of disks to another server.
>
That's what I did initially; I moved it back because it failed, and will
now move the disks into the new server again before proceeding.
> Based on the signature discovered above, we should be able to --create
> --assume-clean with the modern default data offset. We know the
> following device roles:
>
> /dev/sda2 == 0
> /dev/sdf2 == 5
> /dev/sdg2 == 6
> /dev/sdh2 == spare
>
> So /dev/sdh2 should be left out until the array is working.
>
> Please re-execute the "mdadm -E" reports for /dev/sd[fgh]2 and show them
> uncut. (Use the lasted mdadm.) That should fill in the likely device
> order of the remaining drives.
[root@livecd mnt]# mdadm -E /dev/sd[fgh]2
/dev/sdf2:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
Name : 1
Creation Time : Wed Jul 31 18:24:38 2013
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : active
Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d
Update Time : Mon Dec 16 01:16:26 2013
Checksum : ee921c43 - correct
Events : 327
Layout : left-symmetric
Chunk Size : 256K
Device Role : Active device 5
Array State : A.AAAAA ('A' == active, '.' == missing)
/dev/sdg2:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
Name : 1
Creation Time : Wed Jul 31 18:24:38 2013
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : active
Device UUID : a1e1e51b:d8912985:e51207a9:1d718292
Update Time : Mon Dec 16 01:16:26 2013
Checksum : 4ef01fe9 - correct
Events : 327
Layout : left-symmetric
Chunk Size : 256K
Device Role : Active device 6
Array State : A.AAAAA ('A' == active, '.' == missing)
/dev/sdh2:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
Name : 1
Creation Time : Wed Jul 31 18:24:38 2013
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : active
Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1
Update Time : Mon Dec 16 01:16:26 2013
Checksum : a1330e97 - correct
Events : 327
Layout : left-symmetric
Chunk Size : 256K
Device Role : spare
Array State : A.AAAAA ('A' == active, '.' == missing)
> Also, it is important that you document which drive serial numbers are
> currently occupying the different device names. An excerpt from "ls -l
> /dev/disk/by-id/" would do.
scsi-SATA_ST3000DM001-9YN_S1F026VJ -> ../../sda
scsi-SATA_ST3000DM001-9YN_W1F0TB3C -> ../../sdb
scsi-SATA_ST3000DM001-9YN_S1F04KAK -> ../../sdc
scsi-SATA_ST3000DM001-9YN_W1F0RWJY -> ../../sdd
scsi-SATA_ST3000DM001-9YN_S1F08N7Q -> ../../sde
scsi-SATA_ST3000DM001-9YN_Z1F1F3TC -> ../../sdf
scsi-SATA_ST3000DM001-9YN_W1F1ZZ9T -> ../../sdg
scsi-SATA_ST3000DM001-9YN_Z1F1X0AC -> ../../sdh
> I have to admit that I'm very concerned about your corrupted LVM
> signature at offset 262144. LVM probably won't recognize your PV once
> the array is assembled correctly, making it difficult to
> non-destructively test the filesystems on your logical volumes. You may
> have to duplicate your disks onto new ones so that an LVM restore can be
> safely attempted.
> Do *not* buy desktop drives! You need raid-capable drives like the WD
> Red at the least.
;-) Already ordered WD reds, will be delivered any time now. I guess I
have now reached that level after years of making do with very limited
budgets.
I am a bit more relaxed now because I found that a scheduled transfer of
the data to the university tape robot had completed before Christmas. So
this local archive mirror is (luckily) not critical. I still want to
understand whether all this is just a result of shaky hardware, or an
mdadm (misuse) issue. Losing (all superblocks on) five drives in a large
software raid 6 instead of bytes is not something I would like to repeat
any time soon by ie. mishandling mdadm.
We have then:
Wed Jul 31 18:24:38 2013 on sd[f-h]2 for creation of the raid6 and
Wed Jul 31 18:?7:19 2013 for creation of the lvm group
which could well be.
So I will move the disks to the new server, make 1:1 copies to new
drives, and then attempt an assembly using --assume-clean. In which
order?
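The copy step might look like the sketch below. The commented ddrescue line shows the whole-disk form (the target device and mapfile path are hypothetical); the executable part clones a scratch file instead, so it is safe to run anywhere:

```shell
# Clone-before-experimenting, demonstrated on a throwaway file.
# On the real hardware the source/target would be whole disks, e.g.:
#   ddrescue -f /dev/sda /dev/sdi /root/sda.map   # hypothetical target + mapfile
src=$(mktemp) && dst=$(mktemp)
printf 'raid member image' > "$src"
dd if="$src" of="$dst" bs=1M conv=fsync status=none   # byte-for-byte copy
cmp -s "$src" "$dst" && echo "copy verified"
```

Verifying the copy with cmp (or a checksum) before experimenting is cheap insurance when the originals are the only remaining evidence.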
Thanks so much, I have learned a lot already.
Regards
Julian
Universitätsklinikum Jena - Bachstrasse 18 - D-07743 Jena
Die gesetzlichen Pflichtangaben finden Sie unter http://www.uniklinikum-jena.de/Pflichtangaben.html
* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
2014-01-14 10:31 ` Großkreutz, Julian
@ 2014-01-14 13:14 ` Phil Turmel
2014-01-14 14:00 ` AW: " Großkreutz, Julian
2014-01-14 17:47 ` Wilson Jonathan
0 siblings, 2 replies; 11+ messages in thread
From: Phil Turmel @ 2014-01-14 13:14 UTC (permalink / raw)
To: "Großkreutz, Julian", linux-raid; +Cc: neilb
On 01/14/2014 05:31 AM, Großkreutz, Julian wrote:
> Hi Phil,
>
> thanks again for bearing with me.
No problem.
>>>>> Model: ATA ST3000DM001-9YN1 (scsi)
>>
>> Aside: This model looks familiar. I'm pretty sure these drives are
>> desktop models that lack scterc support. Meaning they are *not*
>> generally suitable for raid duty. Search the archives for combinations
>> of "timeout mismatch", "scterc", "URE", and "scrub" for a full
>> explanation. If I've guessed correctly, you *must* use the driver
>> timeout work-around before proceeding.
>>
>
> Yes I did, and smartctl showed no significant problems.
?. What did "smartctl -l scterc" say? If it says unsupported, you have
a problem. The workaround is to set the driver timeouts to ~180 seconds
for each such drive.
If scterc is supported, but disabled, you can set 7-second timeouts with
"smartctl -l scterc,70,70", but you must do so on every power cycle.
Either way, you need boot-time scripting or distro support.
Raid-rated drives power up with a reasonable setting here.
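The workaround Phil describes could be scripted at boot roughly like this (drive list as in this thread; DRYRUN=1 makes it print rather than execute, since the real commands need root and real hardware):

```shell
# Per-drive timeout workaround sketch, e.g. called from rc.local:
# try to enable 7-second ERC; if the drive has no scterc support, raise
# the kernel's command timeout to ~180s so it outlasts the drive's own
# internal (up to ~2 minute) retries.
DRYRUN=1
erc_or_timeout() {
  dev="$1"
  if [ "$DRYRUN" = 1 ]; then
    echo "smartctl -l scterc,70,70 /dev/$dev || echo 180 > /sys/block/$dev/device/timeout"
  elif ! smartctl -l scterc,70,70 "/dev/$dev" >/dev/null 2>&1; then
    echo 180 > "/sys/block/$dev/device/timeout"
  fi
}
for d in sda sdb sdc sdd sde sdf sdg sdh; do erc_or_timeout "$d"; done
```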
> The 10 year old
> server (supermicro enterprise grade dual Xeon with 8 GB ECC RAM) had
> started to create problems early January which is why I wanted to move
> the drives to a new server in the first place, to then transfer the data
> to a new set of enterprise grade disks. I had checked the memory and the
> disks in a burn in for several days including time out and power saving
> before I set up the raid 2012/2013, and did not have any issues then.
Ok. This makes sense.
> One of the reasons I tend use mdadm is that I am able to utilize
> existing hardware to create bridging solutions until money comes in for
> better hardware, and moving an mdadm raid has so far never created a
> serious problem.
Many people discover the timeout problem the first time they have an
otherwise correctable read error in their array, and the array falls
apart instead. This list's archives are well-populated with such cases.
>>> So attached You will find hexdumps of 64k of /sda/sd[a-h]2 at sector 0
>>> and 262144 which shows the superblock 1.2 on sd[fgh]2, not on sd[a-e]2,
>>> but may help to identify data_offset; I suspect it is 2048 on sd[a-e]2
>>> and 262144 on sd[fgh]2.
>>>
>>
>> Jackpot! LVM2 embedded backup data at the correct location for mdadm
>> data offset == 262144. And on /dev/sda2, which is the only device that
>> should have it (first device in the raid).
>>
>> From /dev/sda2 @ 262144:
>>
>>> 00001200 76 67 5f 6e 65 64 69 67 73 30 32 20 5d 0a 69 64 |vg_nedigs02 ].id|
>>> 00001210 20 3d 20 22 32 4c 62 48 71 64 2d 72 67 42 9f 6e | = "2LbHqd-rgB.n|
>>> 00001220 45 4a 75 31 2d 32 52 36 31 2d 41 35 f5 75 2d 6e |EJu1-2R61-A5.u-n|
>>> 00001230 49 58 53 2d 66 79 4f 36 33 73 22 0a 73 65 3a 01 |IXS-fyO63s".se:.|
>>> 00001240 6f 20 3d 20 33 36 0a 66 6f 72 6d 61 ca 24 3d 20 |o = 36.forma.$= |
>>> 00001250 22 6c 76 6d 32 22 20 23 20 69 6e 66 6f 72 6b ac |"lvm2" # infork.|
>> ...
>>> 00001a70 20 31 33 37 35 32 38 37 39 37 39 09 23 20 d2 32 | 1375287979.# .2|
>>> 00001a80 64 20 4a 75 6c 20 33 31 20 31 38 3a af 37 3a 31 |d Jul 31 18:.7:1|
>>> 00001a90 39 20 32 30 31 33 0a 0a 00 00 00 00 00 00 ee 12 |9 2013..........|
>>
>> Note the creation date/time at the end (with a corrupted byte):
>>
>> Jul 31 18:?7:19 2013
>>
>> There are other corrupted bytes scattered around. I'd be worried about
>> the RAM in this machine. Since you are using non-enterprise drives, I'm
>> going to go out on a limb here and guess that the server doesn't have
>> ECC ram...
> see above
Understood. With really old memory, double-faults in the ECC could have
panic'd the server, leaving scattered data unwritten.
>> Consider performing an extended memcheck run to see what's going on.
>> Maybe move the entire stack of disks to another server.
>>
> Thats what I did initially, moved it back because it failed, now will
> move again into the new server before proceeding.
Ok.
>> Based on the signature discovered above, we should be able to --create
>> --assume-clean with the modern default data offset. We know the
>> following device roles:
>>
>> /dev/sda2 == 0
>> /dev/sdf2 == 5
>> /dev/sdg2 == 6
>> /dev/sdh2 == spare
>>
>> So /dev/sdh2 should be left out until the array is working.
>>
>> Please re-execute the "mdadm -E" reports for /dev/sd[fgh]2 and show them
>> uncut. (Use the lasted mdadm.) That should fill in the likely device
>> order of the remaining drives.
Hmmm. Typo on my part: s/lasted/latest/ Newer mdadm will give more
information. In particular, I wanted the tail of each report where each
device lists what it last knew about all of the other devices' roles.
> [root@livecd mnt]# mdadm -E /dev/sd[fgh]2
>
> /dev/sdf2:
> Magic : a92b4efc
> Version : 1.2
> Feature Map : 0x0
> Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
> Name : 1
> Creation Time : Wed Jul 31 18:24:38 2013
> Raid Level : raid6
> Raid Devices : 7
>
> Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
> Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
> Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
> Data Offset : 262144 sectors
> Super Offset : 8 sectors
> State : active
> Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d
>
> Update Time : Mon Dec 16 01:16:26 2013
> Checksum : ee921c43 - correct
> Events : 327
>
> Layout : left-symmetric
> Chunk Size : 256K
>
> Device Role : Active device 5
> Array State : A.AAAAA ('A' == active, '.' == missing)
I was expecting more info after this.
> /dev/sdg2:
> Magic : a92b4efc
> Version : 1.2
> Feature Map : 0x0
> Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
> Name : 1
> Creation Time : Wed Jul 31 18:24:38 2013
> Raid Level : raid6
> Raid Devices : 7
>
> Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
> Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
> Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
> Data Offset : 262144 sectors
> Super Offset : 8 sectors
> State : active
> Device UUID : a1e1e51b:d8912985:e51207a9:1d718292
>
> Update Time : Mon Dec 16 01:16:26 2013
> Checksum : 4ef01fe9 - correct
> Events : 327
>
> Layout : left-symmetric
> Chunk Size : 256K
>
> Device Role : Active device 6
> Array State : A.AAAAA ('A' == active, '.' == missing)
And here.
> /dev/sdh2:
> Magic : a92b4efc
> Version : 1.2
> Feature Map : 0x0
> Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
> Name : 1
> Creation Time : Wed Jul 31 18:24:38 2013
> Raid Level : raid6
> Raid Devices : 7
>
> Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
> Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
> Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
> Data Offset : 262144 sectors
> Super Offset : 8 sectors
> State : active
> Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1
>
> Update Time : Mon Dec 16 01:16:26 2013
> Checksum : a1330e97 - correct
> Events : 327
>
> Layout : left-symmetric
> Chunk Size : 256K
>
> Device Role : spare
> Array State : A.AAAAA ('A' == active, '.' == missing)
And here.
>> Also, it is important that you document which drive serial numbers are
>> currently occupying the different device names. An excerpt from "ls -l
>> /dev/disk/by-id/" would do.
>
> scsi-SATA_ST3000DM001-9YN_S1F026VJ -> ../../sda
> scsi-SATA_ST3000DM001-9YN_W1F0TB3C -> ../../sdb
> scsi-SATA_ST3000DM001-9YN_S1F04KAK -> ../../sdc
> scsi-SATA_ST3000DM001-9YN_W1F0RWJY -> ../../sdd
> scsi-SATA_ST3000DM001-9YN_S1F08N7Q -> ../../sde
> scsi-SATA_ST3000DM001-9YN_Z1F1F3TC -> ../../sdf
> scsi-SATA_ST3000DM001-9YN_W1F1ZZ9T -> ../../sdg
> scsi-SATA_ST3000DM001-9YN_Z1F1X0AC -> ../../sdh
Ok. Be sure to recheck this list any time you boot, since the device
order matters.
> I am a bit more relaxed now because I found that a scheduled transfer of
> the data to the university tape robot had completed before christmas. So
> this local archive mirror is (luckily) not critical. I still want to
> understand whether all this is just a result of shaky hardware, or an
> mdadm (misuse) issue. Losing (all superblocks on) five drives in a large
> software raid 6 instead of bytes is not something I would like to repeat
> any time soon by ie. mishandling mdadm.
I think you skated over the edge due to a flaky motherboard. mdadm
can't fix that. In fact, since you have a backup, I personally wouldn't
bother with further reconstruction efforts. If you have a recent
vgcfgbackup, it's doable, but I have little confidence in the device
order: [a????fg], probably [abcdefg]. There are 4! == 24 permutations
there, each of which will require a vgcfgrestore before you can check
the reconstruction with "fsck -n".
> We have then
>
> Wed Jul 31 18:24:38 2013 on sdf-h2 for creation of the raid6 and
> wed Jul 31 18:?7:19 2013 for creation of the lvm group
>
> could well be.
I don't see any way to get such a timestamp except "certainly was".
> So I will move the disks to the new server, make 1:1 copies to new
> drives and then attempt an assembly using --assume-clean in which
> order ?
All permutations of [a????fg] with b, c, d, and e.
Try likely combinations gleaned from "mdadm -E" reports first to
shortcut the process.
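The enumeration can be sketched like this. The geometry flags mirror the mdadm -E output above, the data offset is left to the modern default Phil mentions, and every printed command is only a candidate to be tested against copies or overlays, never the sole remaining disks:

```shell
# Generate the 4! = 24 candidate --create commands: slots 0, 5, 6 are
# pinned to a, f, g; b, c, d, e fill slots 1-4 in every possible order.
perms() {
  for p1 in b c d e; do for p2 in b c d e; do
  for p3 in b c d e; do for p4 in b c d e; do
    # skip any combination where a letter repeats
    case "$p1$p2$p3$p4" in (*b*b*|*c*c*|*d*d*|*e*e*) continue ;; esac
    echo "mdadm --create /dev/md1 --assume-clean --metadata=1.2" \
         "--level=6 --raid-devices=7 --chunk=256 --layout=left-symmetric" \
         "/dev/sda2 /dev/sd${p1}2 /dev/sd${p2}2 /dev/sd${p3}2 /dev/sd${p4}2 /dev/sdf2 /dev/sdg2"
  done; done; done; done
}
perms | wc -l   # 4! = 24 candidate commands
```

After each candidate --create, a vgcfgrestore plus "fsck -n" (as Phil describes) decides whether that order was the right one.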
> Thanks so much, I have learned a lot already.
You are welcome, and good luck.
Regards,
Phil
* AW: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
2014-01-14 13:14 ` Phil Turmel
@ 2014-01-14 14:00 ` Großkreutz, Julian
2014-01-14 17:47 ` Wilson Jonathan
1 sibling, 0 replies; 11+ messages in thread
From: Großkreutz, Julian @ 2014-01-14 14:00 UTC (permalink / raw)
To: 'Phil Turmel', 'linux-raid@vger.kernel.org'
Cc: 'neilb@suse.de'
Hi Phil,
great help, a lot of lessons learned on my part, thanks again.
I will not try to rescue the raid, time constraints forbid this, but I will from now on implement a strict minimum hardware requirements policy :-)
Regards
Julian
-----Ursprüngliche Nachricht-----
Von: Phil Turmel [mailto:philip@turmel.org]
Gesendet: Dienstag, 14. Januar 2014 14:15
An: Großkreutz, Julian; linux-raid@vger.kernel.org
Cc: neilb@suse.de
Betreff: Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
On 01/14/2014 05:31 AM, Großkreutz, Julian wrote:
> Hi Phil,
>
> thanks again for bearing with me.
No problem.
>>>>> Model: ATA ST3000DM001-9YN1 (scsi)
>>
>> Aside: This model looks familiar. I'm pretty sure these drives are
>> desktop models that lack scterc support. Meaning they are *not*
>> generally suitable for raid duty. Search the archives for
>> combinations of "timeout mismatch", "scterc", "URE", and "scrub" for
>> a full explanation. If I've guessed correctly, you *must* use the
>> driver timeout work-around before proceeding.
>>
>
> Yes I did, and smartctl showed no significant problems.
?. What did "smartctl -l scterc" say? If it says unsupported, you have a problem. The workaround is to set the driver timeouts to ~180 seconds for each such drive.
If scterc is supported, but disabled, you can set 7-second timeouts with "smartctl -l scterc,70,70", but you must do so on every power cycle.
Either way, you need boot-time scripting or distro support.
Raid-rated drives power up with a reasonable setting here.
> The 10 year old
> server (supermicro enterprise grade dual Xeon with 8 GB ECC RAM) had
> started to create problems early January which is why I wanted to move
> the drives to a new server in the first place, to then transfer the
> data to a new set of enterprise grade disks. I had checked the memory
> and the disks in a burn in for several days including time out and
> power saving before I set up the raid 2012/2013, and did not have any issues then.
Ok. This makes sense.
> One of the reasons I tend use mdadm is that I am able to utilize
> existing hardware to create bridging solutions until money comes in
> for better hardware, and moving an mdadm raid has so far never created
> a serious problem.
Many people discover the timeout problem the first time they have an otherwise correctable read error in their array, and the array falls apart instead. This list's archives are well-populated with such cases.
>>> So attached You will find hexdumps of 64k of /sda/sd[a-h]2 at sector
>>> 0 and 262144 which shows the superblock 1.2 on sd[fgh]2, not on
>>> sd[a-e]2, but may help to identify data_offset; I suspect it is 2048
>>> on sd[a-e]2 and 262144 on sd[fgh]2.
>>>
>>
>> Jackpot! LVM2 embedded backup data at the correct location for mdadm
>> data offset == 262144. And on /dev/sda2, which is the only device
>> that should have it (first device in the raid).
>>
>> From /dev/sda2 @ 262144:
>>
>>> 00001200 76 67 5f 6e 65 64 69 67 73 30 32 20 5d 0a 69 64 |vg_nedigs02 ].id|
>>> 00001210 20 3d 20 22 32 4c 62 48 71 64 2d 72 67 42 9f 6e | = "2LbHqd-rgB.n|
>>> 00001220 45 4a 75 31 2d 32 52 36 31 2d 41 35 f5 75 2d 6e |EJu1-2R61-A5.u-n|
>>> 00001230 49 58 53 2d 66 79 4f 36 33 73 22 0a 73 65 3a 01 |IXS-fyO63s".se:.|
>>> 00001240 6f 20 3d 20 33 36 0a 66 6f 72 6d 61 ca 24 3d 20 |o = 36.forma.$= |
>>> 00001250 22 6c 76 6d 32 22 20 23 20 69 6e 66 6f 72 6b ac |"lvm2" # infork.|
>> ...
>>> 00001a70 20 31 33 37 35 32 38 37 39 37 39 09 23 20 d2 32 | 1375287979.# .2|
>>> 00001a80 64 20 4a 75 6c 20 33 31 20 31 38 3a af 37 3a 31 |d Jul 31 18:.7:1|
>>> 00001a90 39 20 32 30 31 33 0a 0a 00 00 00 00 00 00 ee 12 |9 2013..........|
>>
>> Note the creation date/time at the end (with a corrupted byte):
>>
>> Jul 31 18:?7:19 2013
>>
>> There are other corrupted bytes scattered around. I'd be worried
>> about the RAM in this machine. Since you are using non-enterprise
>> drives, I'm going to go out on a limb here and guess that the server
>> doesn't have ECC ram...
> see above
Understood. With really old memory, double-faults in the ECC could have panic'd the server, leaving scattered data unwritten.
>> Consider performing an extended memcheck run to see what's going on.
>> Maybe move the entire stack of disks to another server.
>>
> Thats what I did initially, moved it back because it failed, now will
> move again into the new server before proceeding.
Ok.
>> Based on the signature discovered above, we should be able to
>> --create --assume-clean with the modern default data offset. We know
>> the following device roles:
>>
>> /dev/sda2 == 0
>> /dev/sdf2 == 5
>> /dev/sdg2 == 6
>> /dev/sdh2 == spare
>>
>> So /dev/sdh2 should be left out until the array is working.
>>
>> Please re-execute the "mdadm -E" reports for /dev/sd[fgh]2 and show
>> them uncut. (Use the lasted mdadm.) That should fill in the likely
>> device order of the remaining drives.
Hmmm. Typo on my part: s/lasted/latest/ Newer mdadm will give more information. In particular, I wanted the tail of each report where each device lists what it last knew about all of the other devices' roles.
> [root@livecd mnt]# mdadm -E /dev/sd[fgh]2
>
> /dev/sdf2:
> Magic : a92b4efc
> Version : 1.2
> Feature Map : 0x0
> Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
> Name : 1
> Creation Time : Wed Jul 31 18:24:38 2013
> Raid Level : raid6
> Raid Devices : 7
>
> Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
> Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
> Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
> Data Offset : 262144 sectors
> Super Offset : 8 sectors
> State : active
> Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d
>
> Update Time : Mon Dec 16 01:16:26 2013
> Checksum : ee921c43 - correct
> Events : 327
>
> Layout : left-symmetric
> Chunk Size : 256K
>
> Device Role : Active device 5
> Array State : A.AAAAA ('A' == active, '.' == missing)
I was expecting more info after this.
> /dev/sdg2:
> Magic : a92b4efc
> Version : 1.2
> Feature Map : 0x0
> Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
> Name : 1
> Creation Time : Wed Jul 31 18:24:38 2013
> Raid Level : raid6
> Raid Devices : 7
>
> Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
> Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
> Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
> Data Offset : 262144 sectors
> Super Offset : 8 sectors
> State : active
> Device UUID : a1e1e51b:d8912985:e51207a9:1d718292
>
> Update Time : Mon Dec 16 01:16:26 2013
> Checksum : 4ef01fe9 - correct
> Events : 327
>
> Layout : left-symmetric
> Chunk Size : 256K
>
> Device Role : Active device 6
> Array State : A.AAAAA ('A' == active, '.' == missing)
And here.
> /dev/sdh2:
> Magic : a92b4efc
> Version : 1.2
> Feature Map : 0x0
> Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
> Name : 1
> Creation Time : Wed Jul 31 18:24:38 2013
> Raid Level : raid6
> Raid Devices : 7
>
> Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
> Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
> Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
> Data Offset : 262144 sectors
> Super Offset : 8 sectors
> State : active
> Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1
>
> Update Time : Mon Dec 16 01:16:26 2013
> Checksum : a1330e97 - correct
> Events : 327
>
> Layout : left-symmetric
> Chunk Size : 256K
>
> Device Role : spare
> Array State : A.AAAAA ('A' == active, '.' == missing)
And here.
>> Also, it is important that you document which drive serial numbers
>> are currently occupying the different device names. An excerpt from
>> "ls -l /dev/disk/by-id/" would do.
>
> scsi-SATA_ST3000DM001-9YN_S1F026VJ -> ../../sda
> scsi-SATA_ST3000DM001-9YN_W1F0TB3C -> ../../sdb
> scsi-SATA_ST3000DM001-9YN_S1F04KAK -> ../../sdc
> scsi-SATA_ST3000DM001-9YN_W1F0RWJY -> ../../sdd
> scsi-SATA_ST3000DM001-9YN_S1F08N7Q -> ../../sde
> scsi-SATA_ST3000DM001-9YN_Z1F1F3TC -> ../../sdf
> scsi-SATA_ST3000DM001-9YN_W1F1ZZ9T -> ../../sdg
> scsi-SATA_ST3000DM001-9YN_Z1F1X0AC -> ../../sdh
Ok. Be sure to recheck this list any time you boot, since the device order matters.
> I am a bit more relaxed now because I found that a scheduled transfer
> of the data to the university tape robot had completed before
> christmas. So this local archive mirror is (luckily) not critical. I
> still want to understand whether all this is just a result of shaky
> hardware, or an mdadm (misuse) issue. Losing (all superblocks on) five
> drives in a large software raid 6 instead of bytes is not something I
> would like to repeat any time soon by ie. mishandling mdadm.
I think you skated over the edge due to a flaky motherboard. mdadm can't fix that. In fact, since you have a backup, I personally wouldn't bother with further reconstruction efforts. If you have a recent vgcfgbackup, it's doable, but I have little confidence in the device
order: [a????fg], probably [abcdefg]. There are 4! == 24 permutations there, each of which will require a vgcfgrestore before you can check the reconstruction with "fsck -n".
> We have then
>
> Wed Jul 31 18:24:38 2013 on sdf-h2 for creation of the raid6 and wed
> Jul 31 18:?7:19 2013 for creation of the lvm group
>
> could well be.
I don't see how such a timestamp could appear otherwise, so "could well be" is more like "certainly was".
> So I will move the disks to the new server, make 1:1 copies to new
> drives and then attempt an assembly using --assume-clean in which
> order ?
All permutations of [a????fg] with b, c, d, and e.
Try likely combinations gleaned from "mdadm -E" reports first to shortcut the process.
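The brute-force search described above can at least be enumerated mechanically. An illustrative sketch in Python: the fixed slots come from the -E reports in this thread, but the mdadm invocation built here is a hypothetical template only. A newer mdadm may pick a different data offset than the original 262144 sectors, so verify every parameter first, and only ever run a --create against copies or overlays, never the only remaining drives.

```python
from itertools import permutations

# Known slots from the "mdadm -E" reports in this thread:
# sda2 == role 0, sdf2 == role 5, sdg2 == role 6. Drives b,
# c, d and e fill roles 1-4 in some unknown order, giving
# 4! == 24 candidates. sdh2 was a spare and is left out.
UNKNOWN = ["sdb2", "sdc2", "sdd2", "sde2"]

def candidate_orders():
    """Yield every possible 7-device order."""
    for perm in permutations(UNKNOWN):
        yield ["sda2", *perm, "sdf2", "sdg2"]

commands = [
    "mdadm --create /dev/md1 --assume-clean --level=6 "
    "--raid-devices=7 --chunk=256 --metadata=1.2 "
    + " ".join("/dev/" + d for d in order)
    for order in candidate_orders()
]

print(len(commands), "candidate create commands")
print(commands[0])
```

After each candidate, a vgcfgrestore plus "fsck -n" decides whether that order was the right one before moving to the next.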
> Thanks so much, I have learned a lot already.
You are welcome, and good luck.
Regards,
Phil
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
2014-01-14 13:14 ` Phil Turmel
2014-01-14 14:00 ` AW: " Großkreutz, Julian
@ 2014-01-14 17:47 ` Wilson Jonathan
2014-01-14 18:43 ` Phil Turmel
1 sibling, 1 reply; 11+ messages in thread
From: Wilson Jonathan @ 2014-01-14 17:47 UTC (permalink / raw)
To: Phil Turmel; +Cc: "Großkreutz, Julian", linux-raid, neilb
On Tue, 2014-01-14 at 08:14 -0500, Phil Turmel wrote:
> ?. What did "smartctl -l scterc" say? If it says unsupported, you have
> a problem. The workaround is to set the driver timeouts to ~180 seconds
> for each such drive.
>
> If scterc is supported, but disabled, you can set 7-second timeouts with
> "smartctl -l scterc,70,70", but you must do so on every power cycle.
> Either way, you need boot-time scripting or distro support.
>
> Raid-rated drives power up with a reasonable setting here.
>
> Many people discover the timeout problem the first time they have an
> otherwise correctable read error in their array, and the array falls
> apart instead. This list's archives are well-populated with such cases.
Snipped for brevity above.
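The boot-time scripting mentioned in the quote boils down to a per-drive decision. A hedged sketch (Python that only builds the command strings rather than running them; the smartctl syntax and sysfs timeout path are the usual ones, but treat this as an illustration, not a drop-in script):

```python
def timeout_fix(dev, scterc_supported):
    """Return the per-drive command suggested in this thread.

    If the drive supports SCT ERC, cap its internal error
    recovery at 7.0 seconds (the 70s are tenths of a second).
    Otherwise, raise the kernel's command timeout to ~180 s
    so the driver outlasts the drive's own retries.
    """
    name = dev.rsplit("/", 1)[-1]          # e.g. "sda"
    if scterc_supported:
        return f"smartctl -l scterc,70,70 {dev}"
    return f"echo 180 > /sys/block/{name}/device/timeout"

print(timeout_fix("/dev/sda", True))
print(timeout_fix("/dev/sdb", False))
```

Either setting has to be reapplied on every power cycle, which is why distro support or an init script is needed.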
I understand the issue of "timeout" on drives that might perform long
error checking, which then causes mdadm, via the device (block?) driver
issuing a timeout, to kick the drive. In this instance you allow the
drive some time to try and fix things, at the expense of a hung array
for a longer period of time.
I also understand that with scterc the drive gives up (in effect timing
itself out) when it hits the 7 second, or thereabouts, mark and
subsequently mdadm kicks the drive out. In this specific instance the
idea is to kill a drive quickly so that the raid doesn't hang for longer
than a few seconds.
However, surely these things (bar the amount of time) result in the same
final result of a drive being kicked out. Even in a non-mdadm hardware
raid setup, the drive is either kicked because it didn't return within 7
seconds, or the drive kicks itself because it gave up before 7
seconds.
If anything, surely when you have a degraded array that will fail if any
more disks are kicked, you actually need to do the reverse of normal
raid wisdom: set the timeout in the device (block) layer to as long as
possible, and if the drives have scterc enabled then disable it
(assuming the drive physically allows it and, when disabled, performs a
harder, or any, internal retry/crc/etc.) to force the drives to give
their all to recover any as-yet-unknown potentially failing sectors
should they occur during a rebuild of a failed drive.
Surely, unless I'm missing something, rebuilding a failed drive's data
means that you want the system not to kick a drive if at all possible,
and having scterc enabled, or a short timeout (shorter than the drive's
max recovery time, unless that time is indefinite retry), is the last
thing you want?
>
> Regards,
>
> Phil
Jon
* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
2014-01-14 17:47 ` Wilson Jonathan
@ 2014-01-14 18:43 ` Phil Turmel
2014-01-15 12:50 ` Wilson Jonathan
0 siblings, 1 reply; 11+ messages in thread
From: Phil Turmel @ 2014-01-14 18:43 UTC (permalink / raw)
To: Wilson Jonathan; +Cc: "Großkreutz, Julian", linux-raid, neilb
On 01/14/2014 12:47 PM, Wilson Jonathan wrote:
[trim /]
> I understand the issue of "timeout" on drives that might perform long
> error checking, which then causes mdadm, via the device (block?) driver
> issuing a timeout, to kick the drive. In this instance you allow the
> drive some time to try and fix things, at the expense of a hung array
> for a longer period of time.
>
> I also understand that with scterc the drive gives up (in effect timing
> itself out) when it hits the 7 second, or thereabouts, mark and
> subsequently mdadm kicks the drive out. In this specific instance the
> idea is to kill a drive quickly so that the raid doesn't hang for longer
> than a few seconds.
No. The intent is to fail the read without failing the controller channel.
> However, surely these things (bar the amount of time) result in the same
> final result of a drive being kicked out. Even in a non-mdadm hardware
> raid setup, the drive is either kicked because it didn't return within 7
> seconds, or the drive kicks itself because it gave up before 7
> seconds.
No. Upon a failed read, MD will obtain/reconstruct the problem sector
from remaining redundancy, then write the correct data back. Occasional
read errors of this type are normal, and fix themselves when the sector
is written again. MD will only fail a drive after *multiple* read
errors, not just one. (Isolated bursts of up to 20, then ~ ten per hour.)
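The "bursts of up to 20, then ~ten per hour" tolerance can be pictured as a decaying error counter. The model below is an illustration only, not the kernel code (md's actual logic lives around its max_corrected_read_errors tunable, default 20, with periodic halving of the count; the decay schedule here is a simplification):

```python
MAX_READ_ERRORS = 20          # md default for corrected read errors

def drive_survives(error_times_h):
    """Illustrative leaky-bucket model of md's read-error limit.

    The per-drive error count is halved once per elapsed hour;
    the drive is failed when the count exceeds MAX_READ_ERRORS.
    error_times_h is a sequence of error timestamps in hours.
    """
    count, last_decay = 0, 0.0
    for t in sorted(error_times_h):
        # apply one halving per full hour elapsed
        while t - last_decay >= 1.0:
            count //= 2
            last_decay += 1.0
        count += 1
        if count > MAX_READ_ERRORS:
            return False
    return True

# A burst of 20 quick errors is tolerated...
print(drive_survives([0.01 * i for i in range(20)]))
# ...but 25 in quick succession fails the drive.
print(drive_survives([0.01 * i for i in range(25)]))
```

Under this model a steady rate of roughly ten errors per hour stays under the limit indefinitely, which matches the rule of thumb above.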
[trim /]
> Surely, unless I'm missing something, rebuilding a failed drive's data
> means that you want the system not to kick a drive if at all possible,
> and having scterc enabled, or a short timeout (shorter than the drive's
> max recovery time, unless that time is indefinite retry), is the last thing you want?
What you are missing is what happens when the controller channel times
out. The original read is reported failed to MD while the driver tries
to revive the unresponsive drive. MD proceeds to obtain/reconstruct the
missing data, then write back. But the device is not communicating--the
driver has reset the channel, and will continue not communicating until
the drive firmware finally gives up on the original read. So the
*write* fails instantly, kicking the drive out of the array.
When you, the admin, get around to looking, the drive is idle but
apparently fine. (It gains a "pending" sector, which stays until the
drive is told to write over that spot.)
HTH,
Phil
* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
2014-01-14 18:43 ` Phil Turmel
@ 2014-01-15 12:50 ` Wilson Jonathan
2014-01-15 13:35 ` Phil Turmel
0 siblings, 1 reply; 11+ messages in thread
From: Wilson Jonathan @ 2014-01-15 12:50 UTC (permalink / raw)
To: Phil Turmel; +Cc: "Großkreutz, Julian", linux-raid, neilb
On Tue, 2014-01-14 at 13:43 -0500, Phil Turmel wrote:
> On 01/14/2014 12:47 PM, Wilson Jonathan wrote:
>
> [trim /]
>
> > I understand the issue of "timeout" on drives that might perform long
> > error checking, which then causes mdadm, via the device (block?) driver
> > issuing a timeout, to kick the drive. In this instance you allow the
> > drive some time to try and fix things, at the expense of a hung array
> > for a longer period of time.
> >
> > I also understand that with scterc the drive gives up (in effect timing
> > itself out) when it hits the 7 second, or thereabouts, mark and
> > subsequently mdadm kicks the drive out. In this specific instance the
> > idea is to kill a drive quickly so that the raid doesn't hang for longer
> > than a few seconds.
>
> No. The intent is to fail the read without failing the controller channel.
Arrr, thanks for the clarification... I hadn't realised that instead of
returning an "Error, I can't get the data, I'm dead in the water"
message, the drive returns a "warning, I can't get the data, you deal
with it and get back to me, I'm still working" kind of affair.
>
> > However, surely these things (bar the amount of time) result in the same
> > final result of a drive being kicked out. Even in a non-mdadm hardware
> > raid setup, the drive is either kicked because it didn't return within 7
> > seconds, or the drive kicks itself because it gave up before 7
> > seconds.
>
> No. Upon a failed read, MD will obtain/reconstruct the problem sector
> from remaining redundancy, then write the correct data back. Occasional
> read errors of this type are normal, and fix themselves when the sector
> is written again. MD will only fail a drive after *multiple* read
> errors, not just one. (Isolated bursts of up to 20, then ~ ten per hour.)
>
I see now... I had totally the wrong idea of what happened and how they
differed.
> [trim /]
>
> > Surely, unless I'm missing something, rebuilding a failed drive's data
> > means that you want the system not to kick a drive if at all possible,
> > and having scterc enabled, or a short timeout (shorter than the drive's
> > max recovery time, unless that time is indefinite retry), is the last thing you want?
>
> What you are missing is what happens when the controller channel times
> out. The original read is reported failed to MD while the driver tries
> to revive the unresponsive drive. MD proceeds to obtain/reconstruct the
> missing data, then write back. But the device is not communicating--the
> driver has reset the channel, and will continue not communicating until
> the drive firmware finally gives up on the original read. So the
> *write* fails instantly, kicking the drive out of the array.
>
> When you, the admin, get around to looking, the drive is idle but
> apparently fine. (It gains a "pending" sector, which stays until the
> drive is told to write over that spot.)
>
> HTH,
It does, thanks for the information :-)
>
> Phil
>
Jon
* Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
2014-01-15 12:50 ` Wilson Jonathan
@ 2014-01-15 13:35 ` Phil Turmel
0 siblings, 0 replies; 11+ messages in thread
From: Phil Turmel @ 2014-01-15 13:35 UTC (permalink / raw)
To: Wilson Jonathan; +Cc: "Großkreutz, Julian", linux-raid, neilb
On 01/15/2014 07:50 AM, Wilson Jonathan wrote:
> On Tue, 2014-01-14 at 13:43 -0500, Phil Turmel wrote:
>> On 01/14/2014 12:47 PM, Wilson Jonathan wrote:
>>
>> [trim /]
>>
>>> I understand the issue of "timeout" on drives that might perform long
>>> error checking, which then causes mdadm, via the device (block?) driver
>>> issuing a timeout, to kick the drive. In this instance you allow the
>>> drive some time to try and fix things, at the expense of a hung array
>>> for a longer period of time.
>>>
>>> I also understand that with scterc the drive gives up (in effect timing
>>> itself out) when it hits the 7 second, or thereabouts, mark and
>>> subsequently mdadm kicks the drive out. In this specific instance the
>>> idea is to kill a drive quickly so that the raid doesn't hang for longer
>>> than a few seconds.
>>
>> No. The intent is to fail the read without failing the controller channel.
>
> Arrr, thanks for the clarification... I hadn't realised that instead of
> returning an "Error, I can't get the data, I'm dead in the water"
> message, the drive returns a "warning, I can't get the data, you deal
> with it and get back to me, I'm still working" kind of affair.
Let me emphasize one point here: while a drive is performing error
recovery, it *stops talking to the controller*. The drive isn't
replying with a warning as you suggest--it isn't replying *at all*.
Modern desktop drives try *very hard* to recover bad sectors, under the
assumption that they have the only copy of the data. Typically, they'll
work at it for two *minutes* or more.
The linux kernel driver will give up after 30 seconds and try to reset
the drive. The drive firmware ignores the reset, possibly multiple
times, until it is done retrying the original read. When it does
finally reset, it is too late--it's been bumped from the array.
But the drive didn't really fail, leading to:
>> When you, the admin, get around to looking, the drive is idle but
>> apparently fine. (It gains a "pending" sector, which stays until the
>> drive is told to write over that spot.)
>>
>> HTH,
>
> It does, thanks for the information :-)
You are welcome.
Phil
end of thread, other threads:[~2014-01-15 13:35 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-11 6:42 mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock Großkreutz, Julian
2014-01-11 17:47 ` Phil Turmel
[not found] ` <1389632980.11328.104.camel@achilles.aeskuladis.de>
2014-01-13 18:42 ` Phil Turmel
2014-01-13 20:11 ` Chris Murphy
2014-01-14 10:31 ` Großkreutz, Julian
2014-01-14 13:14 ` Phil Turmel
2014-01-14 14:00 ` AW: " Großkreutz, Julian
2014-01-14 17:47 ` Wilson Jonathan
2014-01-14 18:43 ` Phil Turmel
2014-01-15 12:50 ` Wilson Jonathan
2014-01-15 13:35 ` Phil Turmel