* RAID6 recovery with 6/9 drives out-of-sync
From: Peckins, Steven E @ 2016-05-31  2:43 UTC
  To: linux-raid


I have a system with a 9+1 disk RAID6 array that mdadm will only assemble with "3 drives and 1 spare - not enough to start the array."  The metadata version is 1.1; mdadm version is v3.3.

The component devices in the array are supposed to be multipath devices (dm-multipath), but for some reason, when the server was restarted, md grabbed both dm-* components and raw devices.  I *think* that this is what caused the problem.

The output from "mdadm --examine" shows that the drives in this array have either 44 events (4 drives, including the spare) or 35 events (6 drives); the 35-event drives also show an earlier "Update Time."  All components report a "clean" State, but the four drives with the newer timestamp regard the six older ones as missing (AAA......).

	Six drives report this:

		Update Time : Thu May 26 12:10:15 2016
			 Events : 35
	   Array State : AAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)

	Four drives report this:

		Update Time : Thu May 26 15:44:23 2016
			 Events : 44
	   Array State : AAA...... ('A' == active, '.' == missing, 'R' == replacing)

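(For reference, a loop along these lines pulls those three fields from
every member at once -- the device names here are purely illustrative:

	for d in /dev/dm-0 /dev/dm-1 /dev/dm-11; do
		echo "== $d =="
		mdadm --examine "$d" | egrep 'Update Time|Events|Array State'
	done
)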

I've used dd to duplicate all but one of the "missing" drives to other spares in the system prior to running any "forceful" mdadm commands on this array.  One of the drives (dm-15) errored out early in the process with what looks like a bad sector, but the others completed fine.
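
(For the record, the duplication was along these lines -- source and
target names here are illustrative.  GNU ddrescue would be a more
forgiving choice for the drive with the bad sector, since it skips and
maps unreadable areas instead of stopping:

	dd if=/dev/dm-14 of=/dev/sdX bs=1M conv=noerror,sync   # clone of a healthy member
	ddrescue -f /dev/dm-15 /dev/sdY /root/dm-15.map        # for the drive with read errors
)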

After making those copies, I ran mdadm --assemble --force; the best it could do was five drives:  "/dev/md10 assembled from 5 drives and 1 spare - not enough to start the array."

Interestingly, it says it cleared the "FAULTY" flag from two devices, even though the output from --examine showed all components as clean.

(There are five other 9+1 RAID6 arrays in this system, and they all came up without issue.)

I'm seeking advice on how to proceed at this point.  If more information is required, please ask.

Output from --examine:  http://pastebin.com/khvPWrba
Output from --assemble:  http://pastebin.com/s2GkHkah



* Re: RAID6 recovery with 6/9 drives out-of-sync
From: Phil Turmel @ 2016-05-31 19:19 UTC
  To: Peckins, Steven E, linux-raid

On 05/30/2016 10:43 PM, Peckins, Steven E wrote:
> 
> I have a system with a 9+1 disk RAID6 array that mdadm will only assemble with "3 drives and 1 spare - not enough to start the array."  The metadata version is 1.1; mdadm version is v3.3.
> 
> The component devices in the array are supposed to be multipath devices (dm-multipath), but for some reason, when the server was restarted, md grabbed both dm-* components and raw devices.  I *think* that this is what caused the problem.

Quite possible.  You probably need a DEVICE clause in your mdadm.conf
to exclude the raw devices from the arrays.
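
Something along these lines, for example -- the globs are illustrative,
so adjust them to your multipath naming and system-disk partitions:

	# /etc/mdadm.conf
	# Scan only the multipath maps and the system-disk partitions;
	# the raw sd* paths sitting behind multipath are never considered.
	DEVICE /dev/mapper/mpath* /dev/sd[ab][12]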


> I'm seeking advice on how to proceed at this point.  If more information is required, please ask.

Hmmm.  The partial success on mdadm --force suggests trying that again.
Possibly with --force twice on the command line.

Forced assembly is precisely what you need -- don't despair and attempt
anything else.

Do review /proc/mdstat before each assembly attempt to make sure nothing
is partially assembled with those devices or the underlying raw devices.
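
Roughly this sequence -- the member list at the end is illustrative, so
substitute your actual dm-*/mapper paths (all nine members plus the spare):

	cat /proc/mdstat               # nothing should be holding md10 or its members
	mdadm --stop /dev/md10         # tear down any partial assembly first
	mdadm --assemble --force --verbose /dev/md10 /dev/dm-0 /dev/dm-1 /dev/dm-11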

Phil



* Re: RAID6 recovery with 6/9 drives out-of-sync
From: Peckins, Steven E @ 2016-06-01 11:32 UTC
  To: Phil Turmel; +Cc: linux-raid


On May 31, 2016, at 2:19 PM, Phil Turmel <philip@turmel.org> wrote:

> On 05/30/2016 10:43 PM, Peckins, Steven E wrote:
>> 
>> The component devices in the array are supposed to be multipath devices (dm-multipath), but for some reason, when the server was restarted, md grabbed both dm-* components and raw devices.  I *think* that this is what caused the problem.
> 
> Quite possible.  You probably need a DEVICE clause in your mdadm.conf
> to exclude the raw devices from the arrays.

I had a typo in the DEVICE glob for the system disks (/dev/sd[ab]* instead of /dev/sd[ab][12]).


>> I'm seeking advice on how to proceed at this point.  If more information is required, please ask.
> 
> Hmmm.  The partial success on mdadm --force suggests trying that again.
> Possibly with --force twice on the command line.
> 
> Forced assembly is precisely what you need -- don't despair and attempt
> anything else.

Repeating the command was not successful; it is still reporting "/dev/md10 assembled from 5 drives and 1 spare - not enough to start the array."  Four drives are listed as "possibly out of date."  I assume those are the four that are not being incorporated.

Output from --assemble --force 1x and 2x:  http://pastebin.com/k1dT2zYC

--steve--

* Re: RAID6 recovery with 6/9 drives out-of-sync
From: Phil Turmel @ 2016-06-01 12:06 UTC
  To: Peckins, Steven E; +Cc: linux-raid

On 06/01/2016 07:32 AM, Peckins, Steven E wrote:
> 
> On May 31, 2016, at 2:19 PM, Phil Turmel <philip@turmel.org> wrote:
> 
>> On 05/30/2016 10:43 PM, Peckins, Steven E wrote:
>>>
>>> The component devices in the array are supposed to be multipath devices (dm-multipath), but for some reason, when the server was restarted, md grabbed both dm-* components and raw devices.  I *think* that this is what caused the problem.
>>
>> Quite possible.  You probably need a DEVICE clause in your mdadm.conf
>> to exclude the raw devices from the arrays.
> 
> I had a typo in the DEVICE glob for the system disks (/dev/sd[ab]* instead of /dev/sd[ab][12]).

Understood, but be aware that if you have to hotswap one of these system
devices, they may not get the sda or sdb name, preventing a re-add or a
replacement from joining the array.

Since you are having to use /dev/mapper entries for some arrays,
consider using /dev/disk/by*/ symlinks for your system arrays.
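
Something like the following, for example -- the id strings are made up,
so use whatever ls -l /dev/disk/by-id/ shows for your two system disks:

	# multiple DEVICE lines accumulate
	DEVICE /dev/disk/by-id/ata-EXAMPLE_SERIAL_A-part[12]
	DEVICE /dev/disk/by-id/ata-EXAMPLE_SERIAL_B-part[12]
	DEVICE /dev/mapper/mpath*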

>>> I'm seeking advice on how to proceed at this point.  If more information is required, please ask.
>>
>> Hmmm.  The partial success on mdadm --force suggests trying that again.
>> Possibly with --force twice on the command line.
>>
>> Forced assembly is precisely what you need -- don't despair and attempt
>> anything else.
> 
> Repeating the command was not successful; it is still reporting "/dev/md10 assembled from 5 drives and 1 spare - not enough to start the array."  Four drives are listed as "possibly out of date."  I assume those are the four that are not being incorporated.
> 
> Output from --assemble --force 1x and 2x:  http://pastebin.com/k1dT2zYC

{ In the future, please paste these in-line so the archives will have
them.  The size limit for this list is ~ 100k. }

I vaguely recall a bug in forced reassembly for many out-of-date drives.
 Please clone and build the latest mdadm userspace[1] and run that mdadm
binary for the forced assembly.  Also show the portion of dmesg that
corresponds to the attempt.
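
Something like this is usually all it takes; you can run the freshly
built binary straight from the source tree for a one-off assembly:

	git clone https://github.com/neilbrown/mdadm.git
	cd mdadm
	make
	./mdadm --version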

Phil

[1] https://github.com/neilbrown/mdadm


* Re: RAID6 recovery with 6/9 drives out-of-sync
From: Peckins, Steven E @ 2016-06-01 13:16 UTC
  To: Phil Turmel; +Cc: linux-raid


On Jun 1, 2016, at 7:06 AM, Phil Turmel <philip@turmel.org> wrote:
> 
> Understood, but be aware that if you have to hotswap one of these system
> devices, they may not get the sda or sdb name, preventing a re-add or a
> replacement from joining the array.
> 
> Since you are having to use /dev/mapper entries for some arrays,
> consider using /dev/disk/by*/ symlinks for your system arrays.

Noted and updated.  (Those two drives are connected to the motherboard SATA, and the kernel names have been stable.  All other drives are connected through on-board SAS controllers and HBAs, etc.)


> I vaguely recall a bug in forced reassembly for many out-of-date drives.
> Please clone and build the latest mdadm userspace[1] and run that mdadm
> binary for the forced assembly.  Also show the portion of dmesg that
> corresponds to the attempt.

Good call!  The latest mdadm was able to assemble this array.

ocarina mdadm-latest # ./mdadm --version
mdadm - v3.4 - 28th January 2016
ocarina mdadm-latest # ./mdadm --assemble /dev/md10 --force --verbose /dev/dm-{0,1,11,12,13,14,15,16,17,28}
mdadm: looking for devices for /dev/md10
mdadm: /dev/dm-0 is identified as a member of /dev/md10, slot 0.
mdadm: /dev/dm-1 is identified as a member of /dev/md10, slot 1.
mdadm: /dev/dm-11 is identified as a member of /dev/md10, slot 2.
mdadm: /dev/dm-12 is identified as a member of /dev/md10, slot 3.
mdadm: /dev/dm-13 is identified as a member of /dev/md10, slot 4.
mdadm: /dev/dm-14 is identified as a member of /dev/md10, slot 5.
mdadm: /dev/dm-15 is identified as a member of /dev/md10, slot 6.
mdadm: /dev/dm-16 is identified as a member of /dev/md10, slot 7.
mdadm: /dev/dm-17 is identified as a member of /dev/md10, slot 8.
mdadm: /dev/dm-28 is identified as a member of /dev/md10, slot -1.
mdadm: forcing event count in /dev/dm-14(5) from 35 upto 44
mdadm: forcing event count in /dev/dm-15(6) from 35 upto 44
mdadm: forcing event count in /dev/dm-16(7) from 35 upto 44
mdadm: forcing event count in /dev/dm-17(8) from 35 upto 44
mdadm: clearing FAULTY flag for device 5 in /dev/md10 for /dev/dm-14
mdadm: clearing FAULTY flag for device 6 in /dev/md10 for /dev/dm-15
mdadm: clearing FAULTY flag for device 7 in /dev/md10 for /dev/dm-16
mdadm: clearing FAULTY flag for device 8 in /dev/md10 for /dev/dm-17
mdadm: Marking array /dev/md10 as 'clean'
mdadm: added /dev/dm-1 to /dev/md10 as 1
mdadm: added /dev/dm-11 to /dev/md10 as 2
mdadm: added /dev/dm-12 to /dev/md10 as 3
mdadm: added /dev/dm-13 to /dev/md10 as 4
mdadm: added /dev/dm-14 to /dev/md10 as 5
mdadm: added /dev/dm-15 to /dev/md10 as 6
mdadm: added /dev/dm-16 to /dev/md10 as 7
mdadm: added /dev/dm-17 to /dev/md10 as 8
mdadm: added /dev/dm-28 to /dev/md10 as -1
mdadm: added /dev/dm-0 to /dev/md10 as 0
mdadm: /dev/md10 has been started with 9 drives and 1 spare.

Output from dmesg for successful --assemble --force with latest mdadm binary:

[Wed Jun  1 08:23:15 2016] md: md10 stopped.
[Wed Jun  1 08:23:15 2016] md: bind<dm-1>
[Wed Jun  1 08:23:15 2016] md: bind<dm-11>
[Wed Jun  1 08:23:15 2016] md: bind<dm-12>
[Wed Jun  1 08:23:15 2016] md: bind<dm-13>
[Wed Jun  1 08:23:15 2016] md: bind<dm-14>
[Wed Jun  1 08:23:15 2016] md: bind<dm-15>
[Wed Jun  1 08:23:15 2016] md: bind<dm-16>
[Wed Jun  1 08:23:15 2016] md: bind<dm-17>
[Wed Jun  1 08:23:15 2016] md: bind<dm-28>
[Wed Jun  1 08:23:15 2016] md: bind<dm-0>
[Wed Jun  1 08:23:15 2016] md/raid:md10: device dm-0 operational as raid disk 0
[Wed Jun  1 08:23:15 2016] md/raid:md10: device dm-17 operational as raid disk 8
[Wed Jun  1 08:23:15 2016] md/raid:md10: device dm-16 operational as raid disk 7
[Wed Jun  1 08:23:15 2016] md/raid:md10: device dm-15 operational as raid disk 6
[Wed Jun  1 08:23:15 2016] md/raid:md10: device dm-14 operational as raid disk 5
[Wed Jun  1 08:23:15 2016] md/raid:md10: device dm-13 operational as raid disk 4
[Wed Jun  1 08:23:15 2016] md/raid:md10: device dm-12 operational as raid disk 3
[Wed Jun  1 08:23:15 2016] md/raid:md10: device dm-11 operational as raid disk 2
[Wed Jun  1 08:23:15 2016] md/raid:md10: device dm-1 operational as raid disk 1
[Wed Jun  1 08:23:15 2016] md/raid:md10: allocated 9558kB
[Wed Jun  1 08:23:15 2016] md/raid:md10: raid level 6 active with 9 out of 9 devices, algorithm 2
[Wed Jun  1 08:23:15 2016] RAID conf printout:
[Wed Jun  1 08:23:15 2016]  --- level:6 rd:9 wd:9
[Wed Jun  1 08:23:15 2016]  disk 0, o:1, dev:dm-0
[Wed Jun  1 08:23:15 2016]  disk 1, o:1, dev:dm-1
[Wed Jun  1 08:23:15 2016]  disk 2, o:1, dev:dm-11
[Wed Jun  1 08:23:15 2016]  disk 3, o:1, dev:dm-12
[Wed Jun  1 08:23:15 2016]  disk 4, o:1, dev:dm-13
[Wed Jun  1 08:23:15 2016]  disk 5, o:1, dev:dm-14
[Wed Jun  1 08:23:15 2016]  disk 6, o:1, dev:dm-15
[Wed Jun  1 08:23:15 2016]  disk 7, o:1, dev:dm-16
[Wed Jun  1 08:23:15 2016]  disk 8, o:1, dev:dm-17
[Wed Jun  1 08:23:15 2016] md10: detected capacity change from 0 to 14002780897280
[Wed Jun  1 08:23:15 2016] RAID conf printout:
[Wed Jun  1 08:23:15 2016]  --- level:6 rd:9 wd:9
[Wed Jun  1 08:23:15 2016]  disk 0, o:1, dev:dm-0
[Wed Jun  1 08:23:15 2016]  disk 1, o:1, dev:dm-1
[Wed Jun  1 08:23:15 2016]  disk 2, o:1, dev:dm-11
[Wed Jun  1 08:23:15 2016]  disk 3, o:1, dev:dm-12
[Wed Jun  1 08:23:15 2016]  disk 4, o:1, dev:dm-13
[Wed Jun  1 08:23:15 2016]  disk 5, o:1, dev:dm-14
[Wed Jun  1 08:23:15 2016]  disk 6, o:1, dev:dm-15
[Wed Jun  1 08:23:15 2016]  disk 7, o:1, dev:dm-16
[Wed Jun  1 08:23:15 2016]  disk 8, o:1, dev:dm-17
[Wed Jun  1 08:23:15 2016]  md10: unknown partition table

Uneventful dmesg output from the EARLIER, unsuccessful attempt with mdadm 3.3:

[Wed Jun  1 07:35:22 2016] md: md10 stopped.
[Wed Jun  1 07:35:22 2016] md: bind<dm-1>
[Wed Jun  1 07:35:22 2016] md: bind<dm-11>
[Wed Jun  1 07:35:22 2016] md: bind<dm-12>
[Wed Jun  1 07:35:22 2016] md: bind<dm-13>
[Wed Jun  1 07:35:22 2016] md: bind<dm-14>
[Wed Jun  1 07:35:22 2016] md: bind<dm-15>
[Wed Jun  1 07:35:22 2016] md: bind<dm-16>
[Wed Jun  1 07:35:22 2016] md: bind<dm-17>
[Wed Jun  1 07:35:22 2016] md: bind<dm-28>
[Wed Jun  1 07:35:22 2016] md: bind<dm-0>
[Wed Jun  1 07:35:22 2016] md: md10 stopped.
[Wed Jun  1 07:35:22 2016] md: unbind<dm-0>
[Wed Jun  1 07:35:22 2016] md: export_rdev(dm-0)
[Wed Jun  1 07:35:22 2016] md: unbind<dm-28>
[Wed Jun  1 07:35:22 2016] md: export_rdev(dm-28)
[Wed Jun  1 07:35:22 2016] md: unbind<dm-17>
[Wed Jun  1 07:35:22 2016] md: export_rdev(dm-17)
[Wed Jun  1 07:35:22 2016] md: unbind<dm-16>
[Wed Jun  1 07:35:22 2016] md: export_rdev(dm-16)
[Wed Jun  1 07:35:22 2016] md: unbind<dm-15>
[Wed Jun  1 07:35:22 2016] md: export_rdev(dm-15)
[Wed Jun  1 07:35:22 2016] md: unbind<dm-14>
[Wed Jun  1 07:35:22 2016] md: export_rdev(dm-14)
[Wed Jun  1 07:35:22 2016] md: unbind<dm-13>
[Wed Jun  1 07:35:22 2016] md: export_rdev(dm-13)
[Wed Jun  1 07:35:22 2016] md: unbind<dm-12>
[Wed Jun  1 07:35:22 2016] md: export_rdev(dm-12)
[Wed Jun  1 07:35:22 2016] md: unbind<dm-11>
[Wed Jun  1 07:35:22 2016] md: export_rdev(dm-11)
[Wed Jun  1 07:35:22 2016] md: unbind<dm-1>
[Wed Jun  1 07:35:22 2016] md: export_rdev(dm-1)

I activated the lvm volume and mounted the filesystem.  Everything looks intact.
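
(For completeness, reactivation was just the usual LVM steps, roughly as
below -- volume group, LV, and mount point names are illustrative:

	vgscan
	vgchange -ay vg_example
	mount -o ro /dev/vg_example/lv_example /mnt/recovered   # read-only first as a sanity check
)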

Thanks for your help recovering this array!  I had been avoiding updating mdadm while recovering it, as I had read of potential issues when using newer versions on arrays created with earlier ones.

—steve



* Re: RAID6 recovery with 6/9 drives out-of-sync
From: Phil Turmel @ 2016-06-01 13:22 UTC
  To: Peckins, Steven E; +Cc: linux-raid

On 06/01/2016 09:16 AM, Peckins, Steven E wrote:

> I activated the lvm volume and mounted the filesystem.  Everything
> looks intact.

Yay!

> Thanks for your help recovering this array!

You're welcome.

> I had been avoiding updating mdadm while recovering it, as I had read
> of potential issues when using newer versions on arrays created with
> earlier ones.

Viewing bug reports and bug fix patches is the primary reason I'm
subscribed to this list.  It lets me make my own stable upgrade decisions.

Phil
