* How to assemble 4-disk raid5 with one broken disk and one marked as spare by operator error?
From: Tomas Agartz @ 2013-12-08 20:04 UTC
  To: linux-raid

After booting a server that had been powered off for some time, the 4-disk
raid5 device was up and running in read-only mode with one disk missing.
In what was, in hindsight, a hasty decision, "mdadm --manage --add /dev/md0
/dev/sdd" was executed to re-add the missing device to the array.

At this point, all hell broke loose :) The first thing that happened was
that sdd was added as a spare instead of being re-added as expected. The
second thing was that a different disk, sdb, was kicked from the array
because of read/SATA-bus errors. The root disk also bailed and the system
had to be power-cycled.

The real problem, from the start, was probably that sdb was bad all along,
but for some reason sdd was the device missing from the array after the
initial boot.

Trying to read data from sdb gives read errors and timeouts, but I was
able to run "mdadm --examine" on it after resetting the SATA port.

The current state is that, out of 4 disks two are good (sde and sdf), one 
is (in error) marked as a spare (sdd), and the fourth device is unusable 
(sdb).

What is the correct method to change the spare disk back to a data disk
and try to restart the array with 3 out of 4 devices (sdd, sde and sdf)?

The device has never had a spare, so I think that sdd used to be "Active 
device 0" before this happened?

Possibly relevant data from mdadm --examine on the four devices:

sdb          State : clean
sdb         Events : 333560
sdb   Device Role : Active device 3
sdb   Array State : .AAA ('A' == active, '.' == missing)

sdd          State : clean
sdd         Events : 333562
sdd   Device Role : spare
sdd   Array State : .AA. ('A' == active, '.' == missing)

sde          State : clean
sde         Events : 333562
sde   Device Role : Active device 1
sde   Array State : .AA. ('A' == active, '.' == missing)

sdf          State : clean
sdf         Events : 333562
sdf   Device Role : Active device 2
sdf   Array State : .AA. ('A' == active, '.' == missing)
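
A summary like the one above can be pulled out of the full --examine
output with a small filter. This is only a sketch: the function name
examine_summary is made up for illustration, and it assumes the v1.x
superblock field names "Events" and "Device Role" shown above.

```shell
#!/bin/sh
# Reduce `mdadm --examine` output (read on stdin) to one line:
# "<event count> <device role>".
# Assumes v1.x superblock field names; examine_summary is a hypothetical helper.
examine_summary() {
    awk -F' : ' '
        /^ *Events :/      { events = $2 }
        /^ *Device Role :/ { role = $2 }
        END                { print events, role }
    '
}

# Hypothetical usage (needs root and the real devices, so left commented out):
# for d in /dev/sdb /dev/sdd /dev/sde /dev/sdf; do
#     printf '%s ' "$d"
#     mdadm --examine "$d" | examine_summary
# done
```

Devices whose event counts match were last updated together; here sdb is
two events behind the other three.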

If no one else has any better suggestions, my best guess would be to: 
"mdadm --create /dev/md0 --level=5 --raid-devices=4 --assume-clean 
/dev/sdd /dev/sde /dev/sdf missing" (the device was created with default 
values, metadata 1.2, chunk size 512K, layout left-symmetric).

(Other crazy ideas involve editing the superblock of sdd and making it 
device 0 and then trying to start the array after that).

Best regards,
Tomas

* Re: How to assemble 4-disk raid5 with one broken disk and one marked as spare by operator error?
From: NeilBrown @ 2013-12-09  3:46 UTC
  To: Tomas Agartz; +Cc: linux-raid

On Sun, 8 Dec 2013 21:04:42 +0100 (CET) Tomas Agartz <tlund@nxs.se> wrote:

> After booting a server that had been powered off for some time, the 4-disk
> raid5 device was up and running in read-only mode with one disk missing.
> In what was, in hindsight, a hasty decision, "mdadm --manage --add /dev/md0
> /dev/sdd" was executed to re-add the missing device to the array.
> 
> At this point, all hell broke loose :) The first thing that happened was
> that sdd was added as a spare instead of being re-added as expected. The
> second thing was that a different disk, sdb, was kicked from the array
> because of read/SATA-bus errors. The root disk also bailed and the system
> had to be power-cycled.

If you want to re-add, it is safest to ask mdadm to --re-add, not to --add.
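
As a rough sketch of that advice, using the device names from this
thread: the helper names run and readd_member are made up for
illustration, and the wrapper only prints the command unless DRY_RUN=0
is set. As far as I understand it, --re-add fails with an error when the
device cannot be slotted back into its old position, whereas --add may
fall back to treating it as a fresh spare.

```shell
#!/bin/sh
# Print the command by default; set DRY_RUN=0 to actually execute it.
# (run and readd_member are hypothetical helper names.)
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Explicitly request a re-add of a former member into its old slot.
readd_member() {
    array=$1
    member=$2
    run mdadm --manage "$array" --re-add "$member"
}

# Example, matching the situation in this thread (prints only, by default):
# readd_member /dev/md0 /dev/sdd
```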

> 
> The real problem, from the start, was probably that sdb was bad all along,
> but for some reason sdd was the device missing from the array after the
> initial boot.
>
> Trying to read data from sdb gives read errors and timeouts, but I was
> able to run "mdadm --examine" on it after resetting the SATA port.
> 
> The current state is that, out of 4 disks two are good (sde and sdf), one 
> is (in error) marked as a spare (sdd), and the fourth device is unusable 
> (sdb).
> 
> What is the correct method to change the spare disk back to a data disk
> and try to restart the array with 3 out of 4 devices (sdd, sde and sdf)?
> 

The only real option at this point is to --create the array.  There isn't
enough information for mdadm to be able to do anything clever.

> The device has never had a spare, so I think that sdd used to be "Active 
> device 0" before this happened?
> 
> Possibly relevant data from mdadm --examine on the four devices:
> 
> sdb          State : clean
> sdb         Events : 333560
> sdb   Device Role : Active device 3
> sdb   Array State : .AAA ('A' == active, '.' == missing)
> 
> sdd          State : clean
> sdd         Events : 333562
> sdd   Device Role : spare
> sdd   Array State : .AA. ('A' == active, '.' == missing)
> 
> sde          State : clean
> sde         Events : 333562
> sde   Device Role : Active device 1
> sde   Array State : .AA. ('A' == active, '.' == missing)
> 
> sdf          State : clean
> sdf         Events : 333562
> sdf   Device Role : Active device 2
> sdf   Array State : .AA. ('A' == active, '.' == missing)
> 
> If no one else has any better suggestions, my best guess would be to: 
> "mdadm --create /dev/md0 --level=5 --raid-devices=4 --assume-clean 
> /dev/sdd /dev/sde /dev/sdf missing" (the device was created with default 
> values, metadata 1.2, chunk size 512K, layout left-symmetric).

Check the "Data Offset" of the devices and make sure the newly created array
gets the same "Data Offset" (it can explicitly be set with the latest mdadm).
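
A sketch of how that check might look. The function names are made up
for illustration and the 262144-sector figure in the comments is a
hypothetical value, not one taken from these disks. --examine reports
the offset in 512-byte sectors, while --data-offset is read as KiB
unless a suffix is given, so the value is halved. The device order in
the printed command simply repeats the proposal quoted above, which
rests on the guess that sdd was device 0.

```shell
#!/bin/sh
# Extract the "Data Offset" value (in 512-byte sectors) from
# `mdadm --examine` output read on stdin.
data_offset_sectors() {
    awk -F' : ' '/^ *Data Offset :/ { split($2, f, " "); print f[1] }'
}

# Convert 512-byte sectors to KiB, the unit --data-offset expects.
sectors_to_kib() {
    echo $(( $1 / 2 ))
}

# Print (do not run) a create command using the parameters quoted in the
# thread: metadata 1.2, 512K chunk, left-symmetric, sdd assumed to be slot 0.
recreate_cmd() {
    kib=$(sectors_to_kib "$1")
    echo "mdadm --create /dev/md0 --assume-clean --metadata=1.2" \
         "--level=5 --raid-devices=4 --chunk=512 --layout=left-symmetric" \
         "--data-offset=${kib}K /dev/sdd /dev/sde /dev/sdf missing"
}

# Hypothetical usage:
# off=$(mdadm --examine /dev/sde | data_offset_sectors)   # e.g. 262144
# recreate_cmd "$off"
```

Printing the command first leaves room to compare the offset across all
surviving members before anything is written to the disks.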

NeilBrown


> 
> (Other crazy ideas involve editing the superblock of sdd and making it 
> device 0 and then trying to start the array after that).
> 
> Best regards,
> Tomas

* Re: How to assemble 4-disk raid5 with one broken disk and one marked as spare by operator error?
From: Tomas Agartz @ 2013-12-09 11:27 UTC
  To: NeilBrown; +Cc: linux-raid

On Mon, 9 Dec 2013, NeilBrown wrote:

> If you want to re-add, it is safest to ask mdadm to --re-add, not to --add.

Yeah, that was a bit of a brain fart :(

> The only real option at this point is to --create the array.  There isn't
> enough information for mdadm to be able to do anything clever.

> Check the "Data Offset" of the devices and make sure the newly created array
> gets the same "Data Offset" (it can explicitly be set with the latest mdadm).

Ok, then I was on the right track at least.

However! Good news! After fiddling with SATA cables and, in the end,
replacing all of them, I managed to get the "sdb" disk working again. I
did a --force assemble with those 3 disks and then added the fourth disk
back and let it rebuild. Since the array was never written to while all of
this happened, the filesystem (and all the data) was intact when done.
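
For the archives, the sequence was roughly as sketched below. This is a
reconstruction, not a transcript: print_recovery_plan is a made-up helper
that only prints the commands, and it assumes the three assembled disks
were sdb, sde and sdf (the ones still carrying active roles), with sdd
added back afterwards.

```shell
#!/bin/sh
# Print the two-step recovery: force-assemble from the members that still
# hold active roles, then add the remaining disk so it gets rebuilt.
# Arguments: <array> <disk-to-add-afterwards> <member>...
# (print_recovery_plan is a hypothetical helper; it executes nothing.)
print_recovery_plan() {
    array=$1
    late=$2
    shift 2
    echo "mdadm --assemble --force $array $*"
    echo "mdadm --manage $array --add $late"
}

# Assumed device roles for this thread:
# print_recovery_plan /dev/md0 /dev/sdd /dev/sdb /dev/sde /dev/sdf
```

Rebuild progress can then be followed in /proc/mdstat.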

Best regards,
Tomas
