* Re-assembling array after double device failure
@ 2017-03-27 13:38 Andy Smith
  2017-03-27 14:31 ` Andreas Klauer
  2017-03-27 15:23 ` Anthony Youngman
  0 siblings, 2 replies; 5+ messages in thread
From: Andy Smith @ 2017-03-27 13:38 UTC (permalink / raw)
  To: linux-raid

Hi,

I'm attempting to clean up after what is most likely a
timeout-related double device failure (yes, I know).

I just want to check I have the right procedure here.

So, the initial situation was a two-device RAID-10 (sdc, sdd). sdc saw
some I/O errors and was kicked. Contents of /proc/mdstat after that:

md4 : active raid10 sdc[0](F) sdd[1]
      3906886656 blocks super 1.2 512K chunks 2 far-copies [2/1] [_U]
      bitmap: 7/30 pages [28KB], 65536KB chunk

A couple of hours later, sdd also saw some I/O errors and was
similarly kicked. Neither /dev/sdc nor /dev/sdd appears as a device
node in the system any more at this point, and the controller doesn't
see them.

sdd was re-plugged and re-appeared as sdg.

An mdadm --examine of /dev/sdg looks like:

/dev/sdg:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 4100ddce:8edf6082:ba50427e:60da0a42
           Name : elephant:4  (local to host elephant)
  Creation Time : Fri Nov 18 22:53:10 2016
     Raid Level : raid10
   Raid Devices : 2

 Avail Dev Size : 7813775024 (3725.90 GiB 4000.65 GB)
     Array Size : 3906886656 (3725.90 GiB 4000.65 GB)
  Used Dev Size : 7813773312 (3725.90 GiB 4000.65 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=1712 sectors
          State : active
    Device UUID : d9c9d81d:c487599a:3d3e3a30:0c512610

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Mar 26 00:00:01 2017
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : ec70d450 - correct
         Events : 298824

         Layout : far=2
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : .A ('A' == active, '.' == missing, 'R' == replacing)

mdadm config:

$ grep -v '^#' /etc/mdadm/mdadm.conf | grep -v '^$'
DEVICE /dev/sd*
CREATE owner=root group=disk mode=0660 auto=yes
HOMEHOST <system>
MAILADDR root
ARRAY /dev/md/0  metadata=1.2 UUID=400bac1d:e2c5d6ef:fea3b8c8:bcb70f8f
ARRAY /dev/md/1  metadata=1.2 UUID=e29c8b89:705f0116:d888f77e:2b6e32f5
ARRAY /dev/md/2  metadata=1.2 UUID=039b3427:4be5157a:6e2d53bd:fe898803
ARRAY /dev/md/3  metadata=1.2 UUID=30f745ce:7ed41b53:4df72181:7406ea1d
ARRAY /dev/md/4  metadata=1.2 UUID=4100ddce:8edf6082:ba50427e:60da0a42
ARRAY /dev/md/5  metadata=1.2 UUID=957030cf:c09f023d:ceaebb27:e546f095

(other arrays are on different devices and are not involved here)

So, I think I need to:

- Increase /sys/block/sdg/device/timeout to 180 (already done). TLER
  not supported.

- Stop md4.

  mdadm --stop /dev/md4

- Assemble it again.

  mdadm --assemble /dev/md4

 The theory being that there is at least one good device (sdg, which
 was sdd).

- If that complains, I would then have to consider re-creating the
  array with something like:

  mdadm --create /dev/md4 --assume-clean --level=10 --layout=f2 \
        --raid-devices=2 missing /dev/sdg

- Once it's up and running, add sdc back in and let it sync

- Make timeout changes permanent.
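
(Roughly what I have in mind for the "add sdc back in" step - just a
sketch, device names will be whatever they are after re-plugging, and
the --re-add is only my hope that the write-intent bitmap lets it catch
up without a full resync:)

  cat /proc/mdstat                  # check md4 came up, degraded
  mdadm --detail /dev/md4           # confirm state and event count
  mdadm /dev/md4 --re-add /dev/sdc  # bitmap-based catch-up if accepted
  mdadm /dev/md4 --add /dev/sdc     # full rebuild if --re-add is refused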

Does that seem correct?

I'm fairly confident that the drives themselves are actually okay -
nothing untoward in SMART data - so I'm not going to replace them at
this stage.

Cheers,
Andy


* Re: Re-assembling array after double device failure
  2017-03-27 13:38 Re-assembling array after double device failure Andy Smith
@ 2017-03-27 14:31 ` Andreas Klauer
  2017-03-27 15:27   ` Anthony Youngman
  2017-03-27 15:23 ` Anthony Youngman
  1 sibling, 1 reply; 5+ messages in thread
From: Andreas Klauer @ 2017-03-27 14:31 UTC (permalink / raw)
  To: linux-raid

On Mon, Mar 27, 2017 at 01:38:13PM +0000, Andy Smith wrote:
> I'm fairly confident that the drives themselves are actually okay -
> nothing untoward in SMART data - so I'm not going to replace them at
> this stage.

You did not show any logs or SMART output. There is literally nothing 
in your mail that points at timeouts. If your confidence is based on 
the frequent "disk got kicked. must be timeouts!!1" mails on this list, 
then I wish you all the best. Praying works for some people, right...?

If you get two disks kicked, chances are something is seriously wrong.
If there is any doubt at all, and no backups exist, ddrescue both drives.
Better to make a copy you don't need than need a copy you didn't make.
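
For example (just a sketch - device names and the destination are
placeholders, pick somewhere with enough space for full images):

  ddrescue -d /dev/sdg /mnt/backup/sdg.img /mnt/backup/sdg.map
  ddrescue -d /dev/sdc /mnt/backup/sdc.img /mnt/backup/sdc.map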

Be very careful with mdadm --create. Defaults change over time and 
rescue systems might give you old mdadm versions, so you have to 
specify everything (metadata version, data offsets, ...).
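
Something like this, going by the --examine you posted (a sketch only -
verify every value yourself, /dev/sdg is assumed to still be the old
sdd, and --data-offset needs a reasonably recent mdadm):

  mdadm --create /dev/md4 --assume-clean --metadata=1.2 \
        --level=10 --layout=f2 --chunk=512K --raid-devices=2 \
        --data-offset=262144s missing /dev/sdg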

Consider using overlays for experiments:

https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file

(But not on faulty drives.)
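
The basic idea, if you go that route (sketch only, names made up; the
wiki script handles multiple devices and cleanup for you):

  truncate -s 4001G /tmp/sdg-overlay.img
  loop=$(losetup -f --show /tmp/sdg-overlay.img)
  dmsetup create sdg-cow --table \
      "0 $(blockdev --getsz /dev/sdg) snapshot /dev/sdg $loop P 8"

  # then experiment on /dev/mapper/sdg-cow; all writes land in the
  # sparse overlay file and /dev/sdg itself stays untouched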

Regards
Andreas Klauer


* Re: Re-assembling array after double device failure
  2017-03-27 13:38 Re-assembling array after double device failure Andy Smith
  2017-03-27 14:31 ` Andreas Klauer
@ 2017-03-27 15:23 ` Anthony Youngman
  2017-03-31  4:25   ` Andy Smith
  1 sibling, 1 reply; 5+ messages in thread
From: Anthony Youngman @ 2017-03-27 15:23 UTC (permalink / raw)
  To: linux-raid



On 27/03/17 14:38, Andy Smith wrote:
> Hi,
>
> I'm attempting to clean up after what is most likely a
> timeout-related double device failure (yes, I know).
>
> I just want to check I have the right procedure here.
>
> So, the initial situation was a two-device RAID-10 (sdc, sdd). sdc saw
> some I/O errors and was kicked. Contents of /proc/mdstat after that:
>
> md4 : active raid10 sdc[0](F) sdd[1]
>       3906886656 blocks super 1.2 512K chunks 2 far-copies [2/1] [_U]
>       bitmap: 7/30 pages [28KB], 65536KB chunk
>
> A couple of hours later, sdd also saw some I/O errors and was
> similarly kicked. Neither /dev/sdc nor /dev/sdd appears as a device
> node in the system any more at this point, and the controller doesn't
> see them.
>
> sdd was re-plugged and re-appeared as sdg.
>
> An mdadm --examine of /dev/sdg looks like:
>
> /dev/sdg:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x1
>      Array UUID : 4100ddce:8edf6082:ba50427e:60da0a42
>            Name : elephant:4  (local to host elephant)
>   Creation Time : Fri Nov 18 22:53:10 2016
>      Raid Level : raid10
>    Raid Devices : 2
>
>  Avail Dev Size : 7813775024 (3725.90 GiB 4000.65 GB)
>      Array Size : 3906886656 (3725.90 GiB 4000.65 GB)
>   Used Dev Size : 7813773312 (3725.90 GiB 4000.65 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>    Unused Space : before=262056 sectors, after=1712 sectors
>           State : active
>     Device UUID : d9c9d81d:c487599a:3d3e3a30:0c512610
>
> Internal Bitmap : 8 sectors from superblock
>     Update Time : Sun Mar 26 00:00:01 2017
>   Bad Block Log : 512 entries available at offset 72 sectors
>        Checksum : ec70d450 - correct
>          Events : 298824
>
>          Layout : far=2
>      Chunk Size : 512K
>
>    Device Role : Active device 1
>    Array State : .A ('A' == active, '.' == missing, 'R' == replacing)
>
> mdadm config:
>
> $ grep -v '^#' /etc/mdadm/mdadm.conf | grep -v '^$'
> DEVICE /dev/sd*
> CREATE owner=root group=disk mode=0660 auto=yes
> HOMEHOST <system>
> MAILADDR root
> ARRAY /dev/md/0  metadata=1.2 UUID=400bac1d:e2c5d6ef:fea3b8c8:bcb70f8f
> ARRAY /dev/md/1  metadata=1.2 UUID=e29c8b89:705f0116:d888f77e:2b6e32f5
> ARRAY /dev/md/2  metadata=1.2 UUID=039b3427:4be5157a:6e2d53bd:fe898803
> ARRAY /dev/md/3  metadata=1.2 UUID=30f745ce:7ed41b53:4df72181:7406ea1d
> ARRAY /dev/md/4  metadata=1.2 UUID=4100ddce:8edf6082:ba50427e:60da0a42
> ARRAY /dev/md/5  metadata=1.2 UUID=957030cf:c09f023d:ceaebb27:e546f095
>
> (other arrays are on different devices and are not involved here)
>
> So, I think I need to:
>
> - Increase /sys/block/sdg/device/timeout to 180 (already done). TLER
>   not supported.
>
> - Stop md4.
>
>   mdadm --stop /dev/md4
>
> - Assemble it again.
>
>   mdadm --assemble /dev/md4
>
>  The theory being that there is at least one good device (sdg, which
>  was sdd).
>
> - If that complains, I would then have to consider re-creating the
>   array with something like:

NEVER NEVER NEVER use --create except as a last resort. Try --assemble 
--force. And if you are going to try it, as an absolute minimum, read 
the kernel raid wiki, get lsdrv, run it AND MAKE SURE THE OUTPUT IS SAFE 
SOMEWHERE.

https://raid.wiki.kernel.org/index.php/Asking_for_help
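
For the forced assemble itself, something along these lines (a sketch
only - substitute whatever device names the drives have by then, and
compare the event counts from --examine first):

  mdadm --stop /dev/md4
  mdadm --examine /dev/sdg | grep -E 'Events|Update Time|Array State'
  mdadm --assemble --force /dev/md4 /dev/sdg   # plus /dev/sdc if it's visible again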

Snag is, you might end up with a non-functional array with two spare 
drives. I'll have to step back and let the experts handle that if it 
happens.
>
>   mdadm --create /dev/md4 --assume-clean --level=10 --layout=f2 \
>         --raid-devices=2 missing /dev/sdg
>
> - Once it's up and running, add sdc back in and let it sync
>
> - Make timeout changes permanent.

I'd do this as the very first step - I think you need to put a script in 
your run-level. There's a good sample script on the wiki.

That way it'll get done as the system boots, and should prevent any 
problems. Oh - and do scheduled scrubs, as the fact you're getting 
timeout errors indicates that something is wrong - a scrub is probably 
sufficient to clean it up.
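
Something along these lines (a sketch, not the wiki's actual script -
adjust the device list to suit):

  # e.g. from rc.local or an init script, on every boot:
  for t in /sys/block/sd*/device/timeout; do echo 180 > "$t"; done

  # and to kick off a scrub of the array by hand:
  echo check > /sys/block/md4/md/sync_action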
>
> Does that seem correct?

Hopefully fixing the timeout, followed by a --assemble --force, then a 
scrub, will be all that's required.

>
> I'm fairly confident that the drives themselves are actually okay -
> nothing untoward in SMART data - so I'm not going to replace them at
> this stage.
>
Cheers,
Wol


* Re: Re-assembling array after double device failure
  2017-03-27 14:31 ` Andreas Klauer
@ 2017-03-27 15:27   ` Anthony Youngman
  0 siblings, 0 replies; 5+ messages in thread
From: Anthony Youngman @ 2017-03-27 15:27 UTC (permalink / raw)
  To: Andreas Klauer, linux-raid



On 27/03/17 15:31, Andreas Klauer wrote:
> You did not show any logs or SMART output. There is literally nothing
> in your mail that points at timeouts. If your confidence is based on
> the frequent "disk got kicked. must be timeouts!!1" mails on this list,
> then I wish you all the best. Praying works for some people, right...?

To quote the OP's original email ...

"- Increase /sys/block/sdg/device/timeout to 180 (already done). TLER
   not supported."

Cheers,
Wol


* Re: Re-assembling array after double device failure
  2017-03-27 15:23 ` Anthony Youngman
@ 2017-03-31  4:25   ` Andy Smith
  0 siblings, 0 replies; 5+ messages in thread
From: Andy Smith @ 2017-03-31  4:25 UTC (permalink / raw)
  To: linux-raid

Hi Anthony,

On Mon, Mar 27, 2017 at 04:23:19PM +0100, Anthony Youngman wrote:
> Hopefully fixing the timeout, followed by a --assemble --force, then
> a scrub, will be all that's required.

Yep, thanks. Luckily it assembled without --force, and I was then
able to replace the older drive with a new one and rebuild onto it.
It completed that without error and seems happy now.

I'm going to subject the removed drive that got kicked out first to
some more in-depth scrutiny in another machine as soon as I get a
chance.

Cheers,
Andy

