* Raid auto-assembly upon boot - device order
From: Pavel Hofman @ 2011-06-27 14:15 UTC
  To: linux-raid

Hi,

We are running a raid1 on top of two raid0s. In this specific case we
cannot use raid10 (different device sizes, etc.). Upon booting, the raid1
is always started degraded, with one of the raid0s missing. The log
says the missing raid could not be found. However, after booting,
/proc/mdstat lists both raid0s as OK.

I guess either the raids are assembled in the wrong order (their mutual
dependencies not considered), or without letting the previously assembled
device "settle down". I am wondering what the proper way to fix this
issue would be. The raids are huge (over 1TB each) and recovery takes
many hours.

Our mdadm.conf lists the raids in the proper order, following their
dependencies.

Thanks a lot for any help or suggestions.

Best regards,

Pavel.


* Re: Raid auto-assembly upon boot - device order
From: Phil Turmel @ 2011-06-27 14:47 UTC
  To: Pavel Hofman; +Cc: linux-raid

Hi Pavel,

On 06/27/2011 10:15 AM, Pavel Hofman wrote:
> Hi,
> 
> We are running a raid1 on top of two raid0s. In this specific case we
> cannot use raid10 (different device sizes, etc.). Upon booting, the raid1
> is always started degraded, with one of the raid0s missing. The log
> says the missing raid could not be found. However, after booting,
> /proc/mdstat lists both raid0s as OK.
> 
> I guess either the raids are assembled in the wrong order (their mutual
> dependencies not considered), or without letting the previously assembled
> device "settle down". I am wondering what the proper way to fix this
> issue would be. The raids are huge (over 1TB each) and recovery takes
> many hours.
> 
> Our mdadm.conf lists the raids in the proper order, following their
> dependencies.

I would first check the copy of mdadm.conf in your initramfs.  If it specifies just the raid1, you can end up in this situation.

Most distributions have an 'update-initramfs' script or something similar which must be run after any updates to files that are needed in early boot.

If this is your problem, it also explains why the raid0 appears OK after booting.  The init-scripts that are available on the real root filesystem, containing the correct mdadm.conf, will assemble it.
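
If you want to double-check, here is a rough sketch for a Debian-style setup (assuming a gzip-compressed cpio image and the usual paths; adjust the image name to your kernel version):

mkdir /tmp/initrd-check && cd /tmp/initrd-check
zcat /boot/initrd.img-$(uname -r) | cpio -id     # unpack the current image
diff etc/mdadm/mdadm.conf /etc/mdadm/mdadm.conf  # compare embedded vs. live config
update-initramfs -u                              # rebuild after fixing the live file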

HTH,

Phil


* Re: Raid auto-assembly upon boot - device order
From: Pavel Hofman @ 2011-06-28 10:18 UTC
  To: Phil Turmel; +Cc: linux-raid


On 27.6.2011 16:47, Phil Turmel wrote:
> Hi Pavel,
> 
> On 06/27/2011 10:15 AM, Pavel Hofman wrote:
>> Hi,
>> 
>> 
>> Our mdadm.conf lists the raids in the proper order, following their
>> dependencies.
> 
> I would first check the copy of mdadm.conf in your initramfs.  If it
> specifies just the raid1, you can end up in this situation.
> Most distributions have an 'update-initramfs' script or something
> similar which must be run after any updates to files that are needed
> in early boot.

Hi Phil,

Thanks a lot for your reply. I update the initramfs image regularly.
Just to make sure, I uncompressed the current image; its mdadm.conf lists
all the raids correctly:

DEVICE /dev/sd[a-z][1-9] /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6 /dev/md7 /dev/md8 /dev/md9
ARRAY /dev/md5 level=raid1 metadata=1.0 num-devices=2 UUID=2f88c280:3d7af418:e8d459c5:782e3ed2
ARRAY /dev/md6 level=raid1 metadata=1.0 num-devices=2 UUID=1f83ea99:a9e4d498:a6543047:af0a3b38
ARRAY /dev/md7 level=raid1 metadata=1.0 num-devices=2 UUID=dde16cd5:2e17c743:fcc7926c:fcf5081e
ARRAY /dev/md3 level=raid0 num-devices=2 UUID=8c9c28dd:ac12a9ef:a6200310:fe6d9686
ARRAY /dev/md1 level=raid1 num-devices=5 UUID=588cbbfd:4835b4da:0d7a0b1c:7bf552bb
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=28714b52:55b123f5:a6200310:fe6d9686
ARRAY /dev/md4 level=raid0 num-devices=2 UUID=ce213d01:e50809ed:a6200310:fe6d9686
ARRAY /dev/md8 level=raid0 num-devices=2 metadata=00.90 UUID=5d23817a:fde9d31b:05afacbb:371c5cc4
ARRAY /dev/md9 level=raid0 num-devices=2 metadata=00.90 UUID=9854dd7a:43e8f27f:05afacbb:371c5cc4


This is my rather complex setup:
Personalities : [raid1] [raid0]
md4 : active raid0 sdb1[0] sdd3[1]
      2178180864 blocks 64k chunks

md2 : active raid1 sdc2[0] sdd2[1]
      8787456 blocks [2/2] [UU]

md3 : active raid0 sda1[0] sdc3[1]
      2178180864 blocks 64k chunks

md7 : active raid1 md6[2] md5[1]
      2178180592 blocks super 1.0 [2/1] [_U]
      [===========>.........]  recovery = 59.3% (1293749868/2178180592)
finish=164746.8min speed=87K/sec

md6 : active raid1 md4[0]
      2178180728 blocks super 1.0 [2/1] [U_]

md5 : active raid1 md3[2]
      2178180728 blocks super 1.0 [2/1] [U_]
      bitmap: 9/9 pages [36KB], 131072KB chunk

md1 : active raid1 sdc1[0] sdd1[3]
      10739328 blocks [5/2] [U__U_]


You can see md7 recovering, even though both md5 and md6 were present.

Here is the relevant part of dmesg at boot:


[   11.957040] device-mapper: uevent: version 1.0.3
[   11.957040] device-mapper: ioctl: 4.13.0-ioctl (2007-10-18)
initialised: dm-devel@redhat.com
[   12.017047] md: md1 still in use.
[   12.017047] md: md1 still in use.
[   12.017821] md: md5 stopped.
[   12.133051] md: md6 stopped.
[   12.134968] md: md7 stopped.
[   12.141042] md: md3 stopped.
[   12.193037] md: bind<sdc3>
[   12.193037] md: bind<sda1>
[   12.237037] md: raid0 personality registered for level 0
[   12.237037] md3: setting max_sectors to 128, segment boundary to 32767
[   12.237037] raid0: looking at sda1
[   12.237037] raid0:   comparing sda1(732571904) with sda1(732571904)
[   12.237037] raid0:   END
[   12.237037] raid0:   ==> UNIQUE
[   12.237037] raid0: 1 zones
[   12.237037] raid0: looking at sdc3
[   12.237037] raid0:   comparing sdc3(1445608960) with sda1(732571904)
[   12.237037] raid0:   NOT EQUAL
[   12.237037] raid0:   comparing sdc3(1445608960) with sdc3(1445608960)
[   12.237037] raid0:   END
[   12.237037] raid0:   ==> UNIQUE
[   12.237037] raid0: 2 zones
[   12.237037] raid0: FINAL 2 zones
[   12.237037] raid0: zone 1
[   12.237037] raid0: checking sda1 ... nope.
[   12.237037] raid0: checking sdc3 ... contained as device 0
[   12.237037]   (1445608960) is smallest!.
[   12.237037] raid0: zone->nb_dev: 1, size: 713037056
[   12.237037] raid0: current zone offset: 1445608960
[   12.237037] raid0: done.
[   12.237037] raid0 : md_size is 2178180864 blocks.
[   12.237037] raid0 : conf->hash_spacing is 1465143808 blocks.
[   12.237037] raid0 : nb_zone is 2.
[   12.237037] raid0 : Allocating 16 bytes for hash.
[   12.241039] md: md2 stopped.
[   12.261038] md: bind<sdd2>
[   12.261038] md: bind<sdc2>
[   12.305037] raid1: raid set md2 active with 2 out of 2 mirrors
[   12.305037] md: md4 stopped.
[   12.317037] md: bind<sdd3>
[   12.317037] md: bind<sdb1>
[   12.361036] md4: setting max_sectors to 128, segment boundary to 32767
[   12.361036] raid0: looking at sdb1
[   12.361036] raid0:   comparing sdb1(732571904) with sdb1(732571904)
[   12.361036] raid0:   END
[   12.361036] raid0:   ==> UNIQUE
[   12.361036] raid0: 1 zones
[   12.361036] raid0: looking at sdd3
[   12.361036] raid0:   comparing sdd3(1445608960) with sdb1(732571904)
[   12.361036] raid0:   NOT EQUAL
[   12.361036] raid0:   comparing sdd3(1445608960) with sdd3(1445608960)
[   12.361036] raid0:   END
[   12.361036] raid0:   ==> UNIQUE
[   12.361036] raid0: 2 zones
[   12.361036] raid0: FINAL 2 zones
[   12.361036] raid0: zone 1
[   12.361036] raid0: checking sdb1 ... nope.
[   12.361036] raid0: checking sdd3 ... contained as device 0
[   12.361036]   (1445608960) is smallest!.
[   12.361036] raid0: zone->nb_dev: 1, size: 713037056
[   12.361036] raid0: current zone offset: 1445608960
[   12.361036] raid0: done.
[   12.361036] raid0 : md_size is 2178180864 blocks.
[   12.361036] raid0 : conf->hash_spacing is 1465143808 blocks.
[   12.361036] raid0 : nb_zone is 2.
[   12.361036] raid0 : Allocating 16 bytes for hash.
[   12.361036] md: md8 stopped.
[   12.413036] md: md9 stopped.
[   12.429036] md: bind<md3>
[   12.469035] raid1: raid set md5 active with 1 out of 2 mirrors
[   12.473035] md5: bitmap initialized from disk: read 1/1 pages, set
5027 bits
[   12.473035] created bitmap (9 pages) for device md5
[   12.509036] md: bind<md5>
[   12.549035] raid1: raid set md7 active with 1 out of 2 mirrors
[   12.573039] md: md6 stopped.
[   12.573039] md: bind<md4>
[   12.573039] md: md6: raid array is not clean -- starting background
reconstruction
[   12.617034] raid1: raid set md6 active with 1 out of 2 mirrors

Please notice that md7 is being assembled before md6, its component, is
even mentioned. After that, md6 is marked as not clean, even though both
md5 and md6 are degraded (the missing drives are connected weekly via
eSATA from an external enclosure and used for offline backups).

Plus, how can a background reconstruction be started on md6 if it is
degraded and the other mirror half is not even present?

Thanks a lot,

Pavel.



* Re: Raid auto-assembly upon boot - device order
From: Phil Turmel @ 2011-06-28 11:03 UTC
  To: Pavel Hofman; +Cc: linux-raid

Good morning, Pavel,

On 06/28/2011 06:18 AM, Pavel Hofman wrote:
> 
> On 27.6.2011 16:47, Phil Turmel wrote:
>> Hi Pavel,
>>
>> On 06/27/2011 10:15 AM, Pavel Hofman wrote:
>>> Hi,
>>>
>>>
>>> Our mdadm.conf lists the raids in the proper order, following their
>>> dependencies.
>>
>> I would first check the copy of mdadm.conf in your initramfs.  If it
>> specifies just the raid1, you can end up in this situation.
>> Most distributions have an 'update-initramfs' script or something
>> similar which must be run after any updates to files that are needed
>> in early boot.
> 
> Hi Phil,
> 
> Thanks a lot for your reply. I update the initramfs image regularly.
> Just to make sure I uncompressed the current image, mdadm.conf lists all
> the raids correctly:
> 
> DEVICE /dev/sd[a-z][1-9] /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6 /dev/md7 /dev/md8 /dev/md9
> ARRAY /dev/md5 level=raid1 metadata=1.0 num-devices=2 UUID=2f88c280:3d7af418:e8d459c5:782e3ed2
> ARRAY /dev/md6 level=raid1 metadata=1.0 num-devices=2 UUID=1f83ea99:a9e4d498:a6543047:af0a3b38
> ARRAY /dev/md7 level=raid1 metadata=1.0 num-devices=2 UUID=dde16cd5:2e17c743:fcc7926c:fcf5081e
> ARRAY /dev/md3 level=raid0 num-devices=2 UUID=8c9c28dd:ac12a9ef:a6200310:fe6d9686
> ARRAY /dev/md1 level=raid1 num-devices=5 UUID=588cbbfd:4835b4da:0d7a0b1c:7bf552bb
> ARRAY /dev/md2 level=raid1 num-devices=2 UUID=28714b52:55b123f5:a6200310:fe6d9686
> ARRAY /dev/md4 level=raid0 num-devices=2 UUID=ce213d01:e50809ed:a6200310:fe6d9686
> ARRAY /dev/md8 level=raid0 num-devices=2 metadata=00.90 UUID=5d23817a:fde9d31b:05afacbb:371c5cc4
> ARRAY /dev/md9 level=raid0 num-devices=2 metadata=00.90 UUID=9854dd7a:43e8f27f:05afacbb:371c5cc4

OK.  Some are out of order (md3 & md4 ought to be listed before md5 & md6), but it seems not to matter.

> This is my rather complex setup:
> Personalities : [raid1] [raid0]
> md4 : active raid0 sdb1[0] sdd3[1]
>       2178180864 blocks 64k chunks
> 
> md2 : active raid1 sdc2[0] sdd2[1]
>       8787456 blocks [2/2] [UU]
> 
> md3 : active raid0 sda1[0] sdc3[1]
>       2178180864 blocks 64k chunks
> 
> md7 : active raid1 md6[2] md5[1]
>       2178180592 blocks super 1.0 [2/1] [_U]
>       [===========>.........]  recovery = 59.3% (1293749868/2178180592)
> finish=164746.8min speed=87K/sec
> 
> md6 : active raid1 md4[0]
>       2178180728 blocks super 1.0 [2/1] [U_]
> 
> md5 : active raid1 md3[2]
>       2178180728 blocks super 1.0 [2/1] [U_]
>       bitmap: 9/9 pages [36KB], 131072KB chunk
> 
> md1 : active raid1 sdc1[0] sdd1[3]
>       10739328 blocks [5/2] [U__U_]
> 
> 
> You can see md7 recovering, even though both md5 and md6 were present.

Yes, but md5 & md6 are themselves degraded.  They should not have started unless you are enabling that globally.

ps.  "lsdrv" would be really useful here to understand your layering setup.

http://github.com/pturmel/lsdrv

> Here is the relevant part of dmesg at boot:
> 
> 
> [   11.957040] device-mapper: uevent: version 1.0.3
> [   11.957040] device-mapper: ioctl: 4.13.0-ioctl (2007-10-18)
> initialised: dm-devel@redhat.com
> [   12.017047] md: md1 still in use.
> [   12.017047] md: md1 still in use.
> [   12.017821] md: md5 stopped.
> [   12.133051] md: md6 stopped.
> [   12.134968] md: md7 stopped.
> [   12.141042] md: md3 stopped.
> [   12.193037] md: bind<sdc3>
> [   12.193037] md: bind<sda1>
> [   12.237037] md: raid0 personality registered for level 0
> [   12.237037] md3: setting max_sectors to 128, segment boundary to 32767
> [   12.237037] raid0: looking at sda1
> [   12.237037] raid0:   comparing sda1(732571904) with sda1(732571904)
> [   12.237037] raid0:   END
> [   12.237037] raid0:   ==> UNIQUE
> [   12.237037] raid0: 1 zones
> [   12.237037] raid0: looking at sdc3
> [   12.237037] raid0:   comparing sdc3(1445608960) with sda1(732571904)
> [   12.237037] raid0:   NOT EQUAL
> [   12.237037] raid0:   comparing sdc3(1445608960) with sdc3(1445608960)
> [   12.237037] raid0:   END
> [   12.237037] raid0:   ==> UNIQUE
> [   12.237037] raid0: 2 zones
> [   12.237037] raid0: FINAL 2 zones
> [   12.237037] raid0: zone 1
> [   12.237037] raid0: checking sda1 ... nope.
> [   12.237037] raid0: checking sdc3 ... contained as device 0
> [   12.237037]   (1445608960) is smallest!.
> [   12.237037] raid0: zone->nb_dev: 1, size: 713037056
> [   12.237037] raid0: current zone offset: 1445608960
> [   12.237037] raid0: done.
> [   12.237037] raid0 : md_size is 2178180864 blocks.
> [   12.237037] raid0 : conf->hash_spacing is 1465143808 blocks.
> [   12.237037] raid0 : nb_zone is 2.
> [   12.237037] raid0 : Allocating 16 bytes for hash.
> [   12.241039] md: md2 stopped.
> [   12.261038] md: bind<sdd2>
> [   12.261038] md: bind<sdc2>
> [   12.305037] raid1: raid set md2 active with 2 out of 2 mirrors
> [   12.305037] md: md4 stopped.
> [   12.317037] md: bind<sdd3>
> [   12.317037] md: bind<sdb1>
> [   12.361036] md4: setting max_sectors to 128, segment boundary to 32767
> [   12.361036] raid0: looking at sdb1
> [   12.361036] raid0:   comparing sdb1(732571904) with sdb1(732571904)
> [   12.361036] raid0:   END
> [   12.361036] raid0:   ==> UNIQUE
> [   12.361036] raid0: 1 zones
> [   12.361036] raid0: looking at sdd3
> [   12.361036] raid0:   comparing sdd3(1445608960) with sdb1(732571904)
> [   12.361036] raid0:   NOT EQUAL
> [   12.361036] raid0:   comparing sdd3(1445608960) with sdd3(1445608960)
> [   12.361036] raid0:   END
> [   12.361036] raid0:   ==> UNIQUE
> [   12.361036] raid0: 2 zones
> [   12.361036] raid0: FINAL 2 zones
> [   12.361036] raid0: zone 1
> [   12.361036] raid0: checking sdb1 ... nope.
> [   12.361036] raid0: checking sdd3 ... contained as device 0
> [   12.361036]   (1445608960) is smallest!.
> [   12.361036] raid0: zone->nb_dev: 1, size: 713037056
> [   12.361036] raid0: current zone offset: 1445608960
> [   12.361036] raid0: done.
> [   12.361036] raid0 : md_size is 2178180864 blocks.
> [   12.361036] raid0 : conf->hash_spacing is 1465143808 blocks.
> [   12.361036] raid0 : nb_zone is 2.
> [   12.361036] raid0 : Allocating 16 bytes for hash.
> [   12.361036] md: md8 stopped.
> [   12.413036] md: md9 stopped.
> [   12.429036] md: bind<md3>
> [   12.469035] raid1: raid set md5 active with 1 out of 2 mirrors
> [   12.473035] md5: bitmap initialized from disk: read 1/1 pages, set
> 5027 bits
> [   12.473035] created bitmap (9 pages) for device md5
> [   12.509036] md: bind<md5>
> [   12.549035] raid1: raid set md7 active with 1 out of 2 mirrors
> [   12.573039] md: md6 stopped.
> [   12.573039] md: bind<md4>
> [   12.573039] md: md6: raid array is not clean -- starting background
> reconstruction
> [   12.617034] raid1: raid set md6 active with 1 out of 2 mirrors
> 
> Please notice that md7 is being assembled before md6, its component, is
> even mentioned. After that, md6 is marked as not clean, even though both
> md5 and md6 are degraded (the missing drives are connected weekly via
> eSATA from an external enclosure and used for offline backups).

I suspect it is merely timing.  You are using degraded arrays deliberately as part of your backup scheme, which means you must be using "start_dirty_degraded" as a kernel parameter.  That enables md7, which you don't want degraded, to start degraded when md6 is a hundred or so milliseconds late to the party.
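
If you want to confirm whether it is in effect, something along these lines should show it (just a sketch, assuming your kernel exposes the md module parameter at the usual sysfs path):

grep -o start_dirty_degraded /proc/cmdline
cat /sys/module/md_mod/parameters/start_dirty_degraded   # 1 = enabled, 0 = disabled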

I think you have a couple options:

1) Don't run degraded arrays.  Use other backup tools.
2) Remove md7 from your mdadm.conf in your initramfs.  Don't let early userspace assemble it.  The extra time should then allow your initscripts on your real root fs to assemble it with both members.  This only works if md7 does not contain your real root fs.
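
For illustration only (a sketch, with the device names taken from your mdstat above), the late assembly could be as simple as:

mdadm --assemble /dev/md7 /dev/md5 /dev/md6
# or, with the full mdadm.conf present on the real root fs, simply:
mdadm --assemble --scan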

> Plus, how can a background reconstruction be started on md6 if it is
> degraded and the other mirror half is not even present?

Don't know.  Maybe one of your existing drives is occupying a major/minor combination that your eSATA drive occupied on your last backup.  I'm pretty sure the message is harmless.  I noticed that md5 has a bitmap, but md6 does not.  I wonder if adding a bitmap to md6 would change the timing enough to help you.

Relying on timing variations for successful boot doesn't sound great to me.

> Thanks a lot,
> 
> Pavel.
> 

HTH,

Phil


* Re: Raid auto-assembly upon boot - device order
From: Pavel Hofman @ 2011-06-28 12:01 UTC
  To: Phil Turmel; +Cc: linux-raid

Hi Phil,

On 28.6.2011 13:03, Phil Turmel wrote:
> Good morning, Pavel,
> 
> On 06/28/2011 06:18 AM, Pavel Hofman wrote:
>> 
>> 
>> Hi Phil,
>> 
>> This is my rather complex setup: Personalities : [raid1] [raid0] 
>> md4 : active raid0 sdb1[0] sdd3[1] 2178180864 blocks 64k chunks
>> 
>> md2 : active raid1 sdc2[0] sdd2[1] 8787456 blocks [2/2] [UU]
>> 
>> md3 : active raid0 sda1[0] sdc3[1] 2178180864 blocks 64k chunks
>> 
>> md7 : active raid1 md6[2] md5[1] 2178180592 blocks super 1.0 [2/1]
>> [_U] [===========>.........]  recovery = 59.3%
>> (1293749868/2178180592) finish=164746.8min speed=87K/sec
>> 
>> md6 : active raid1 md4[0] 2178180728 blocks super 1.0 [2/1] [U_]
>> 
>> md5 : active raid1 md3[2] 2178180728 blocks super 1.0 [2/1] [U_] 
>> bitmap: 9/9 pages [36KB], 131072KB chunk
>> 
>> md1 : active raid1 sdc1[0] sdd1[3] 10739328 blocks [5/2] [U__U_]
>> 
>> 
>> You can see md7 recovering, even though both md5 and md6 were
>> present.
> 
> Yes, but md5 & md6 are themselves degraded.  Should not have started
> unless you are globally enabling it.

> 
> ps.  "lsdrv" would be really useful here to understand your layering
> setup.
> 
> http://github.com/pturmel/lsdrv

Thanks a lot for your quick reply. And for your wonderful tool too.

orfeus:/boot# lsdrv
PCI [AMD_IDE] 00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
 └─ide 2.0 HL-DT-ST RW/DVD GCC-H20N {[No Information Found]}
    └─hde: [33:0] Empty/Unknown 4.00g
PCI [sata_nv] 00:05.0 IDE interface: nVidia Corporation MCP55 SATA
Controller (rev a3)
 ├─scsi 0:0:0:0 ATA SAMSUNG HD753LJ {S13UJDWQ912345}
 │  └─sda: [8:0] MD raid10 (4) 698.64g inactive
{646f62e3:626d2cb3:05afacbb:371c5cc4}
 │     └─sda1: [8:1] MD raid0 (0/2) 698.64g md3 clean in_sync
{8c9c28dd:ac12a9ef:a6200310:fe6d9686}
 │        └─md3: [9:3] MD raid1 (0/2) 2.03t md5 active in_sync
'orfeus:5' {2f88c280:3d7af418:e8d459c5:782e3ed2}
 │           └─md5: [9:5] MD raid1 (1/2) 2.03t md7 active in_sync
'orfeus:7' {dde16cd5:2e17c743:fcc7926c:fcf5081e}
 │              └─md7: [9:7] (xfs) 2.03t 'backup'
{d987301b-dfb1-4c99-8f72-f4b400ba46c9}
 │                 └─Mounted as /dev/md7 @ /mnt/raid
 └─scsi 1:0:0:0 ATA ST3750330AS {9QK0VFJ9}
    └─sdb: [8:16] Empty/Unknown 698.64g
       └─sdb1: [8:17] MD raid0 (0/2) 698.64g md4 clean in_sync
{ce213d01:e50809ed:a6200310:fe6d9686}
          └─md4: [9:4] MD raid1 (0/2) 2.03t md6 active in_sync
''orfeus':6' {1f83ea99:a9e4d498:a6543047:af0a3b38}
             └─md6: [9:6] MD raid1 (0/2) 2.03t md7 active spare
''orfeus':7' {dde16cd5:2e17c743:fcc7926c:fcf5081e}
PCI [sata_nv] 00:05.1 IDE interface: nVidia Corporation MCP55 SATA
Controller (rev a3)
 ├─scsi 2:0:0:0 ATA ST31500341AS {9VS15Y1L}
 │  └─sdc: [8:32] Empty/Unknown 1.36t
 │     ├─sdc1: [8:33] MD raid1 (0/5) 10.24g md1 clean in_sync
{588cbbfd:4835b4da:0d7a0b1c:7bf552bb}
 │     │  └─md1: [9:1] (ext3) 10.24g {f620df1e-6dd6-43ab-b4e6-8e1fd4a447f7}
 │     │     └─Mounted as /dev/md1 @ /
 │     ├─sdc2: [8:34] MD raid1 (0/2) 8.38g md2 clean in_sync
{28714b52:55b123f5:a6200310:fe6d9686}
 │     │  └─md2: [9:2] (swap) 8.38g {1804bbc6-a61b-44ea-9cc9-ac3ce6f17305}
 │     └─sdc3: [8:35] MD raid0 (1/2) 1.35t md3 clean in_sync
{8c9c28dd:ac12a9ef:a6200310:fe6d9686}
 └─scsi 3:0:0:0 ATA ST31500341AS {9VS13H4N}
    └─sdd: [8:48] Empty/Unknown 1.36t
       ├─sdd1: [8:49] MD raid1 (3/5) 10.24g md1 clean in_sync
{588cbbfd:4835b4da:0d7a0b1c:7bf552bb}
       ├─sdd2: [8:50] MD raid1 (1/2) 8.38g md2 clean in_sync
{28714b52:55b123f5:a6200310:fe6d9686}
       └─sdd3: [8:51] MD raid0 (1/2) 1.35t md4 clean in_sync
{ce213d01:e50809ed:a6200310:fe6d9686}

Still, you understood the setup correctly at first glance, even without the visualisation :)

> 
> 
> I suspect it is merely timing.  You are using degraded arrays
> deliberately as part of your backup scheme, which means you must be
> using "start_dirty_degraded" as a kernel parameter.  That enables
> md7, which you don't want degraded, to start degraded when md6 is a
> hundred or so milliseconds late to the party.

Running rgrep on /etc and /boot reveals no such kernel parameter on this
system. I have never had problems with the arrays not starting - perhaps
it is hard-compiled into the Debian (lenny) kernel? The config for the
current kernel in /boot does not list any such parameter either.

Would using this parameter just change the timing?

> 
> I think you have a couple options:
> 
> 1) Don't run degraded arrays.  Use other backup tools.

It took me several years to find a reasonably fast way to offline-backup
that partition with tens of millions of backuppc hardlinks :)

> 2) Remove md7
> from your mdadm.conf in your initramfs.  Don't let early userspace
> assemble it.  The extra time should then allow your initscripts on
> your real root fs to assemble it with both members.  This only works
> if md7 does not contain your real root fs.

Fantastic, I will do so. I just have to find a way to keep a different
mdadm.conf in /etc and in the initramfs while preserving the useful
update-initramfs functionality :)
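
Perhaps a small initramfs-tools hook could do it - an untested sketch, assuming the mdadm hook copies the config to etc/mdadm/mdadm.conf inside the staging directory ($DESTDIR) and that each ARRAY entry sits on a single line:

#!/bin/sh
# /etc/initramfs-tools/hooks/strip-md7 (mark it executable)
PREREQ="mdadm"
prereqs() { echo "$PREREQ"; }
case "$1" in
    prereqs) prereqs; exit 0;;
esac
# Drop only the md7 ARRAY line from the copy bundled into the initramfs,
# leaving /etc/mdadm/mdadm.conf itself untouched.
sed -i '/^ARRAY \/dev\/md7 /d' "${DESTDIR}/etc/mdadm/mdadm.conf"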
> 
>> Plus, how can a background reconstruction be started on md6 if it is
>> degraded and the other mirror half is not even present?
> 
> Don't know.  Maybe one of your existing drives is occupying a
> major/minor combination that your esata drive occupied on your last
> backup.  I'm pretty sure the message is harmless.  I noticed that md5
> has a bitmap, but md6 does not.  I wonder if adding a bitmap to md6
> would change the timing enough to help you.

Wow, the bitmap is indeed missing on md6. I swear it was there in the
past :) It cuts the synchronization time for the offline copies down
significantly. I have two offline drive sets, each rotating every two
weeks. One offline set plugs into md5, the other into md6. This way I can
have two bitmaps, one for each set. Apparently not right now :-)
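
If it is really gone, re-adding it should be a one-liner, something like:

mdadm --grow /dev/md6 --bitmap=internal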

> 
> Relying on timing variations for successful boot doesn't sound great
> to me.

You are right. Hopefully the significantly delayed assembly will work OK.

I very much appreciate your help, thanks a lot,

Pavel.


* Re: Raid auto-assembly upon boot - device order
From: Phil Turmel @ 2011-06-28 15:39 UTC
  To: Pavel Hofman; +Cc: linux-raid

On 06/28/2011 08:01 AM, Pavel Hofman wrote:
> Hi Phil,

[...]

> Thanks a lot for your quick reply. And for your wonderful tool too.

You're welcome.

> orfeus:/boot# lsdrv
> PCI [AMD_IDE] 00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
>  └─ide 2.0 HL-DT-ST RW/DVD GCC-H20N {[No Information Found]}
>     └─hde: [33:0] Empty/Unknown 4.00g
> PCI [sata_nv] 00:05.0 IDE interface: nVidia Corporation MCP55 SATA
> Controller (rev a3)
>  ├─scsi 0:0:0:0 ATA SAMSUNG HD753LJ {S13UJDWQ912345}
>  │  └─sda: [8:0] MD raid10 (4) 698.64g inactive
> {646f62e3:626d2cb3:05afacbb:371c5cc4}
>  │     └─sda1: [8:1] MD raid0 (0/2) 698.64g md3 clean in_sync
> {8c9c28dd:ac12a9ef:a6200310:fe6d9686}
>  │        └─md3: [9:3] MD raid1 (0/2) 2.03t md5 active in_sync
> 'orfeus:5' {2f88c280:3d7af418:e8d459c5:782e3ed2}
>  │           └─md5: [9:5] MD raid1 (1/2) 2.03t md7 active in_sync
> 'orfeus:7' {dde16cd5:2e17c743:fcc7926c:fcf5081e}
>  │              └─md7: [9:7] (xfs) 2.03t 'backup'
> {d987301b-dfb1-4c99-8f72-f4b400ba46c9}
>  │                 └─Mounted as /dev/md7 @ /mnt/raid
>  └─scsi 1:0:0:0 ATA ST3750330AS {9QK0VFJ9}
>     └─sdb: [8:16] Empty/Unknown 698.64g
>        └─sdb1: [8:17] MD raid0 (0/2) 698.64g md4 clean in_sync
> {ce213d01:e50809ed:a6200310:fe6d9686}
>           └─md4: [9:4] MD raid1 (0/2) 2.03t md6 active in_sync
> ''orfeus':6' {1f83ea99:a9e4d498:a6543047:af0a3b38}
>              └─md6: [9:6] MD raid1 (0/2) 2.03t md7 active spare
> ''orfeus':7' {dde16cd5:2e17c743:fcc7926c:fcf5081e}
> PCI [sata_nv] 00:05.1 IDE interface: nVidia Corporation MCP55 SATA
> Controller (rev a3)
>  ├─scsi 2:0:0:0 ATA ST31500341AS {9VS15Y1L}
>  │  └─sdc: [8:32] Empty/Unknown 1.36t
>  │     ├─sdc1: [8:33] MD raid1 (0/5) 10.24g md1 clean in_sync
> {588cbbfd:4835b4da:0d7a0b1c:7bf552bb}
>  │     │  └─md1: [9:1] (ext3) 10.24g {f620df1e-6dd6-43ab-b4e6-8e1fd4a447f7}
>  │     │     └─Mounted as /dev/md1 @ /
>  │     ├─sdc2: [8:34] MD raid1 (0/2) 8.38g md2 clean in_sync
> {28714b52:55b123f5:a6200310:fe6d9686}
>  │     │  └─md2: [9:2] (swap) 8.38g {1804bbc6-a61b-44ea-9cc9-ac3ce6f17305}
>  │     └─sdc3: [8:35] MD raid0 (1/2) 1.35t md3 clean in_sync
> {8c9c28dd:ac12a9ef:a6200310:fe6d9686}
>  └─scsi 3:0:0:0 ATA ST31500341AS {9VS13H4N}
>     └─sdd: [8:48] Empty/Unknown 1.36t
>        ├─sdd1: [8:49] MD raid1 (3/5) 10.24g md1 clean in_sync
> {588cbbfd:4835b4da:0d7a0b1c:7bf552bb}
>        ├─sdd2: [8:50] MD raid1 (1/2) 8.38g md2 clean in_sync
> {28714b52:55b123f5:a6200310:fe6d9686}
>        └─sdd3: [8:51] MD raid0 (1/2) 1.35t md4 clean in_sync
> {ce213d01:e50809ed:a6200310:fe6d9686}

Pretty deep layering.  I think I'm going to reduce the amount of indentation per layer.

> Still, you understood the setup correctly at first glance, even without the visualisation :)
> 
>>
>>
>> I suspect it is merely timing.  You are using degraded arrays
>> deliberately as part of your backup scheme, which means you must be
>> using "start_dirty_degraded" as a kernel parameter.  That enables
>> md7, which you don't want degraded, to start degraded when md6 is a
>> hundred or so milliseconds late to the party.
> 
> Running rgrep on /etc and /boot reveals no such kernel parameter on this
> system. I have never had problems with the arrays not starting - perhaps
> it is hard-compiled into the Debian (lenny) kernel? The config for the
> current kernel in /boot does not list any such parameter either.
> 
> Would using this parameter just change the timing?

No.  Degraded arrays are not supposed to assemble without it.  Maybe it only applies to kernel autoassembly, which I no longer use.

>> I think you have a couple options:
>>
>> 1) Don't run degraded arrays.  Use other backup tools.
> 
> It took me several years to find a reasonably fast way to offline-backup
> that partition with tens of millions of backuppc hardlinks :)

I've heard of hardlink horrors with backuppc.  I don't use it myself.  I prefer to use LVM on top of MD, then take compressed backups of LVM snapshots.
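
Roughly along these lines (just a sketch, with made-up VG/LV names, mount point, and target path):

lvcreate --snapshot --size 10G --name data-snap /dev/vg0/data
mount -o ro /dev/vg0/data-snap /mnt/snap
tar -czf /backup/data-$(date +%F).tar.gz -C /mnt/snap .
umount /mnt/snap && lvremove -f /dev/vg0/data-snap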

>> 2) Remove md7
>> from your mdadm.conf in your initramfs.  Don't let early userspace
>> assemble it.  The extra time should then allow your initscripts on
>> your real root fs to assemble it with both members.  This only works
>> if md7 does not contain your real root fs.
> 
> Fantastic, I will do so. Just have to find a way to keep different
> mdadm.conf in /etc and in initramfs while preserving the useful
> update-initramfs functionality :)

I haven't dug that deep.  I use dracut, myself.

>>> Plus, how can a background reconstruction be started on md6 if it is
>>> degraded and the other mirror half is not even present?
>>
>> Don't know.  Maybe one of your existing drives is occupying a
>> major/minor combination that your esata drive occupied on your last
>> backup.  I'm pretty sure the message is harmless.  I noticed that md5
>> has a bitmap, but md6 does not.  I wonder if adding a bitmap to md6
>> would change the timing enough to help you.
> 
> Wow, the bitmap is indeed missing on md6. I swear it was there in the
> past :) It cuts the synchronization time for the offline copies down
> significantly. I have two offline drive sets, each rotating every two
> weeks. One offline set plugs into md5, the other into md6. This way I can
> have two bitmaps, one for each set. Apparently not right now :-)

Mirror w/ bitmap would make 1:1 backups faster.  I understand why you are doing this, but I'd be worried about filesystem integrity at the point in time you disconnect the backup drive.  Have you performed any tests to be sure you can recover usable data from the offline copy?  If I recall correctly, an LVM snapshot operation incorporates a filesystem metadata sync.

>> Relying on timing variations for successful boot doesn't sound great
>> to me.
> 
> You are right. Hopefully the significantly delayed assembly will work OK.
> 
> I very much appreciate your help, thanks a lot,
> 
> Pavel.

Phil


* Re: Raid auto-assembly upon boot - device order
From: Pavel Hofman @ 2011-06-28 19:18 UTC
  To: Phil Turmel; +Cc: linux-raid

On 28.6.2011 17:39, Phil Turmel wrote:
>> 
>> It took me several years to find a reasonably fast way to
>> offline-backup that partition with tens of millions of backuppc
>> hardlinks :)
> 
> I've heard of hardlink horrors with backuppc.  I don't use it myself.
> I prefer to use LVM on top of MD, then take compressed backups of LVM
> snapshots.

One of my goals was to be able to stick the offline backup drives into
any PC, boot from a Debian netinstall CD, chroot into the mirrored root,
mount the large data partitions, and be ready to start recovery in a
matter of minutes. That is why I prefer mirroring whole filesystems.
> 
> Mirror w/ bitmap would make 1:1 backups faster.  I understand why you
> are doing this, but I'd be worried about filesystem integrity at the
> point in time you disconnect the backup drive.  Have you performed
> any tests to be sure you can recover usable data from the offline
> copy?  If I recall correctly, an LVM snapshot operation incorporates
> a filesystem metadata sync.

Yes, I am using a rather complicated automatic procedure. When the
resyncing finishes, the script waits until backuppc finishes the
currently running jobs (while starting new ones is disabled), shuts
backuppc down, kills all other processes accessing the partition,
umounts the filesystem, removes the offline drives from the md5/md6
arrays, mounts the backup raid again, restarts the backup processes, and
then checks the offline copy: it mounts the offline drives as a partition
and tries to read a random file from it. Only after that are the offline
drives unmounted, their array disassembled, and the drives put to sleep,
until the operator removes them from their external bays and takes them
away from the company premises.
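
Condensed, the md side of it boils down to something like this (with /dev/sdX1 standing in for whichever offline drive is plugged in this rotation, and md5 as the example array):

umount /mnt/raid                                     # md7, the backup filesystem
mdadm /dev/md5 --fail /dev/sdX1 --remove /dev/sdX1   # detach the offline drive
mount /dev/md7 /mnt/raid                             # backup array back in service
mount -o ro /dev/sdX1 /mnt/verify                    # spot-check the offline copy
umount /mnt/verify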

The offline drives have saved my butt several times :)

Regards,

Pavel.

