* raid1 recoverable after system crash?
@ 2016-04-07 12:44 Brian J. Murrell
  2016-04-07 13:00 ` Roman Mamedov
  0 siblings, 1 reply; 5+ messages in thread
From: Brian J. Murrell @ 2016-04-07 12:44 UTC (permalink / raw)
  To: linux-raid

Hi

I had a system crash.  When it came back up, one of my arrays was
degraded:

# cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sdd[0]
      1953514496 blocks [2/1] [U_]
      
md1 : active raid0 sdc[1] sdb[0]
      1953524736 blocks super 1.2 512k chunks
      
unused devices: <none>

md1 is supposed to be a member of md0 but it's not currently:

/dev/md0:
        Version : 0.90
  Creation Time : Mon Jan 26 19:51:38 2009
     Raid Level : raid1
     Array Size : 1953514496 (1863.02 GiB 2000.40 GB)
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Apr  7 08:19:25 2016
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 2f8fc5e0:a0eb646a:2303d005:33f25f21
         Events : 0.5030755

    Number   Major   Minor   RaidDevice State
       0       8       48        0      active sync   /dev/sdd
       1       0        0        1      removed

It doesn't seem to be re-addable:

# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --manage /dev/md0 --re-add /dev/md1
mdadm: --re-add for /dev/md1 to /dev/md0 is not possible

It doesn't seem to be assemble-able:

# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --assemble /dev/md0 /dev/sdd /dev/md1
mdadm: /dev/md0 has been started with 1 drive (out of 2).

even when forced:

# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --assemble --force /dev/md0 /dev/sdd /dev/md1
mdadm: /dev/md0 has been started with 1 drive (out of 2).

Is my only option here to fail/remove /dev/md1 from the array and then
add it back as a new device, or is there a more graceful recovery
possible?
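
(Concretely, I assume that would be something like

# mdadm --manage /dev/md0 --add /dev/md1

which, if I understand correctly, would mean a full resync of the
array.)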

Some additional info:

# mdadm --examine /dev/md1
/dev/md1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 2f8fc5e0:a0eb646a:2303d005:33f25f21
  Creation Time : Mon Jan 26 19:51:38 2009
     Raid Level : raid1
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 1953514496 (1863.02 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0

    Update Time : Thu Apr  7 00:18:47 2016
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : e5ddeac5 - correct
         Events : 5030737


      Number   Major   Minor   RaidDevice State
this     1       9        1        1      active sync   /dev/md1
   0     0       8       48        0      active sync   /dev/sdd
   1     1       9        1        1      active sync   /dev/md1
# mdadm --examine /dev/sdd
/dev/sdd:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 2f8fc5e0:a0eb646a:2303d005:33f25f21
  Creation Time : Mon Jan 26 19:51:38 2009
     Raid Level : raid1
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 1953514496 (1863.02 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0

    Update Time : Thu Apr  7 08:37:04 2016
          State : clean
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0
       Checksum : e62b232b - correct
         Events : 5030757


      Number   Major   Minor   RaidDevice State
this     0       8       48        0      active sync   /dev/sdd

   0     0       8       48        0      active sync   /dev/sdd
   1     1       0        0        1      faulty removed

Most happy to provide any additional information needed.

Cheers,
b.

* Re: raid1 recoverable after system crash?
  2016-04-07 12:44 raid1 recoverable after system crash? Brian J. Murrell
@ 2016-04-07 13:00 ` Roman Mamedov
  2016-04-07 16:11   ` Brian J. Murrell
  0 siblings, 1 reply; 5+ messages in thread
From: Roman Mamedov @ 2016-04-07 13:00 UTC (permalink / raw)
  To: Brian J. Murrell; +Cc: linux-raid

On Thu, 07 Apr 2016 08:44:46 -0400
"Brian J. Murrell" <brian@interlinx.bc.ca> wrote:

> # cat /proc/mdstat 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid1 sdd[0]
>       1953514496 blocks [2/1] [U_]
>       
> md1 : active raid0 sdc[1] sdb[0]
>       1953524736 blocks super 1.2 512k chunks
...
> Is my only option here to fail/remove /dev/md1 from the array and re-
> add it that way or is there a more graceful recovery possible here?

You do not have a write-intent bitmap on md0, so --re-add will not work.
It seems you should --add it now; then, after it rebuilds, use --grow to
add a bitmap so that in the future you can use --re-add.
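
Roughly something like this (untested here, so double-check against the
mdadm man page first):

# mdadm --manage /dev/md0 --add /dev/md1
  ... wait for the recovery to finish ...
# mdadm --grow /dev/md0 --bitmap=internal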

As to why the situation occurred in the first place, you should ensure
that md1 assembles before md0, perhaps by listing both arrays in
/etc/mdadm/mdadm.conf in the order you need (and don't forget to
regenerate the initrd.img).
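
For example, something along these lines in /etc/mdadm/mdadm.conf (the
md1 UUID is a placeholder; take the real values from the output of
"mdadm --detail --scan"):

ARRAY /dev/md1 metadata=1.2 UUID=<uuid-of-md1>
ARRAY /dev/md0 metadata=0.90 UUID=2f8fc5e0:a0eb646a:2303d005:33f25f21

followed by "update-initramfs -u" (or your distro's equivalent) to
regenerate the initrd.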

-- 
With respect,
Roman

* Re: raid1 recoverable after system crash?
  2016-04-07 13:00 ` Roman Mamedov
@ 2016-04-07 16:11   ` Brian J. Murrell
  2016-04-07 16:59     ` Roman Mamedov
  0 siblings, 1 reply; 5+ messages in thread
From: Brian J. Murrell @ 2016-04-07 16:11 UTC (permalink / raw)
  To: linux-raid

On Thu, 2016-04-07 at 18:00 +0500, Roman Mamedov wrote:
> 
> You do not have a write-intent bitmap on md0, so --re-add will not
> work.

Ahhh.  OK.

> It seems
> you should --add it now;

Tried that.  It started off and got this far:

# cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 md1[2](F) sdd[0]
      1953514496 blocks [2/1] [U_]
      [================>....]  recovery = 82.0% (1602507648/1953514496) finish=42613.2min speed=137K/sec

before hitting this:

2016 Apr  7 12:01:00 linux [16583.606363] md/raid1:md0: Disk failure on md1, disabling device.
2016 Apr  7 12:01:00 linux [16583.606366] md/raid1:md0: Operation continuing on 1 devices.
2016 Apr  7 12:01:00 linux FailSpare event detected on md device /dev/md0, component device /dev/md1
2016 Apr  7 12:01:01 linux [16583.907982] BUG: unable to handle kernel paging request at 0000000099b899b8
2016 Apr  7 12:01:01 linux [16583.908009] IP: [<ffffffffa0019227>] call_bio_endio+0x37/0xb0 [raid1]
2016 Apr  7 12:01:01 linux [16583.908009] Oops: 0000 [#1] SMP
2016 Apr  7 12:01:01 linux [16583.908009] Stack:
2016 Apr  7 12:01:01 linux [16583.908009] Call Trace:
2016 Apr  7 12:01:01 linux [16583.908009] Code: 4c 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 66 66 66 66 90 4c 8b 67 28 48 8b 47 20 41 bf 01 00 00 00 48 89 fb 41 8b 54 24 2c <4c> 8b 28 85 d2 75 42 48 8b 43 18 a8 01 75 07 3e 41 80 64 24 18
2016 Apr  7 12:01:01 linux [16583.908009] RIP  [<ffffffffa0019227>] call_bio_endio+0x37/0xb0 [raid1]
2016 Apr  7 12:01:01 linux [16583.908009] CR2: 0000000099b899b8

And it seems to be stuck there now.

dmesg contents at http://www.interlinx.bc.ca/~brian/raid-dmesg.txt

> then, after it rebuilds, use --grow to add a
> bitmap so that in the future you can use --re-add.

Cool.  Will do, when this finally gets fixed.

> As to why the situation occurred in the first place, you should
> ensure that md1 assembles before md0.

Yeah.  I only just noticed, because of this incident, that the order in
mdadm.conf is wrong.  :-(

Cheers,
b.

* Re: raid1 recoverable after system crash?
  2016-04-07 16:11   ` Brian J. Murrell
@ 2016-04-07 16:59     ` Roman Mamedov
  2016-04-07 17:10       ` Brian J. Murrell
  0 siblings, 1 reply; 5+ messages in thread
From: Roman Mamedov @ 2016-04-07 16:59 UTC (permalink / raw)
  To: Brian J. Murrell; +Cc: linux-raid

On Thu, 07 Apr 2016 12:11:37 -0400
"Brian J. Murrell" <brian@interlinx.bc.ca> wrote:

> 2016 Apr  7 12:01:00 linux [16583.606363] md/raid1:md0: Disk failure on md1, disabling device.
> 2016 Apr  7 12:01:00 linux [16583.606366] md/raid1:md0: Operation continuing on 1 devices.
> 2016 Apr  7 12:01:00 linux FailSpare event detected on md device /dev/md0, component device /dev/md1
> 2016 Apr  7 12:01:01 linux [16583.907982] BUG: unable to handle kernel paging request at 0000000099b899b8
> 2016 Apr  7 12:01:01 linux [16583.908009] IP: [<ffffffffa0019227>] call_bio_endio+0x37/0xb0 [raid1]
> 2016 Apr  7 12:01:01 linux [16583.908009] Oops: 0000 [#1] SMP
> 2016 Apr  7 12:01:01 linux [16583.908009] Stack:
> 2016 Apr  7 12:01:01 linux [16583.908009] Call Trace:
> 2016 Apr  7 12:01:01 linux [16583.908009] Code: 4c 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 66 66 66 66 90 4c 8b 67 28 48 8b 47 20 41 bf 01 00 00 00 48 89 fb 41 8b 54 24 2c <4c> 8b 28 85 d2 75 42 48 8b 43 18 a8 01 75 07 3e 41 80 64 24 18
> 2016 Apr  7 12:01:01 linux [16583.908009] RIP  [<ffffffffa0019227>] call_bio_endio+0x37/0xb0 [raid1]
> 2016 Apr  7 12:01:01 linux [16583.908009] CR2: 0000000099b899b8
> 
> And it seems to be stuck there now.
> 
> dmesg contents at http://www.interlinx.bc.ca/~brian/raid-dmesg.txt

Don't know what's up with this bug; maybe someone else does.

How are the drives connected?  Plain SATA, and not something unusual
such as USB?

As something to try, upgrade to a newer kernel if you can: the 3.2
series is very old, and the bug may well have been fixed in newer ones.

-- 
With respect,
Roman

* Re: raid1 recoverable after system crash?
  2016-04-07 16:59     ` Roman Mamedov
@ 2016-04-07 17:10       ` Brian J. Murrell
  0 siblings, 0 replies; 5+ messages in thread
From: Brian J. Murrell @ 2016-04-07 17:10 UTC (permalink / raw)
  Cc: linux-raid

On Thu, 2016-04-07 at 21:59 +0500, Roman Mamedov wrote:
> 
> How are the drives connected, plain SATA, not something unusual such
> as USB?

Yes, plain old SATA.  This is all hardware and configuration that has
not changed and had been working for many, many months until today's
"incident".

Cheers,
b.
