Persistent failures with simple md setup

From: Hans-Peter Jansen @ 2013-01-29 22:14 UTC
  To: linux-raid

[Looks like my first message didn't make it to the list, hence sending
it again with tarballed attachments]

Dear list members,

Among the systems I take care of, there is one pretty bog-standard
openSUSE 12.1 installation that stands out with persistent device
failures on boot.

Here is a typical case:

~# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] 
md3 : active raid1 sda4[0]
      869702736 blocks super 1.0 [2/1] [U_]
      bitmap: 57/415 pages [228KB], 1024KB chunk

md0 : active raid1 sda1[0]
      96376 blocks super 1.0 [2/1] [U_]
      bitmap: 1/6 pages [4KB], 8KB chunk

md1 : active (auto-read-only) raid1 sdb2[1] sda2[0]
      2096468 blocks super 1.0 [2/2] [UU]
      bitmap: 0/8 pages [0KB], 128KB chunk

md124 : active raid1 sdb3[1] sda3[0]
      104856180 blocks super 1.0 [2/2] [UU]
      bitmap: 8/200 pages [32KB], 256KB chunk

[no line breaks on purpose]
Jan 29 20:22:36 zaphkiel kernel: [   11.047504] md: raid1 personality registered for level 1
Jan 29 20:22:36 zaphkiel kernel: [   11.549612] md: bind<sda3>
Jan 29 20:22:36 zaphkiel kernel: [   11.587037] md: bind<sdb3>
Jan 29 20:22:36 zaphkiel kernel: [   11.630965] md/raid1:md124: active with 2 out of 2 mirrors
Jan 29 20:22:36 zaphkiel kernel: [   11.708396] md124: bitmap initialized from disk: read 13/13 pages, set 1 of 409595 bits
Jan 29 20:22:36 zaphkiel kernel: [   11.769213] md124: detected capacity change from 0 to 107372728320
Jan 29 20:22:36 zaphkiel kernel: [   11.981192] md: raid0 personality registered for level 0
Jan 29 20:22:36 zaphkiel kernel: [   12.020959] md: raid10 personality registered for level 10
Jan 29 20:22:36 zaphkiel kernel: [   12.625530] md: raid6 personality registered for level 6
Jan 29 20:22:36 zaphkiel kernel: [   12.657414] md: raid5 personality registered for level 5
Jan 29 20:22:36 zaphkiel kernel: [   12.689261] md: raid4 personality registered for level 4
Jan 29 20:22:36 zaphkiel kernel: [   25.151590] md: bind<sda2>
Jan 29 20:22:36 zaphkiel kernel: [   25.314284] md: bind<sda1>
Jan 29 20:22:36 zaphkiel kernel: [   25.409503] md: bind<sda4>
Jan 29 20:22:36 zaphkiel kernel: [   25.568103] md/raid1:md0: active with 1 out of 2 mirrors
Jan 29 20:22:36 zaphkiel kernel: [   25.689110] md: bind<sdb2>
Jan 29 20:22:36 zaphkiel kernel: [   25.713385] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 12047 bits
Jan 29 20:22:36 zaphkiel kernel: [   25.837207] md0: detected capacity change from 0 to 98689024
Jan 29 20:22:36 zaphkiel kernel: [   26.045361] md/raid1:md1: active with 2 out of 2 mirrors
Jan 29 20:22:36 zaphkiel kernel: [   26.260500] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 16379 bits
Jan 29 20:22:36 zaphkiel kernel: [   26.349129] md1: detected capacity change from 0 to 2146783232
Jan 29 20:22:36 zaphkiel kernel: [   26.391526] md/raid1:md3: active with 1 out of 2 mirrors
Jan 29 20:22:36 zaphkiel kernel: [   27.188346] md3: bitmap initialized from disk: read 26/26 pages, set 1547 of 849320 bits
Jan 29 20:22:36 zaphkiel kernel: [   27.302622] md3: detected capacity change from 0 to 890575601664

This looks like some kind of race during device detection: md0 and md3
are started with one mirror each, and sdb1 and sdb4 are never bound.
The full boot sequence log leading to this mess is attached.
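
To make the suspected race easier to see, here is a rough sketch that
extracts the `md: bind<...>` and `active with N out of M mirrors` events
from the log lines quoted above. The helper names `summarize` and
`unbound_partners` are hypothetical, and the pairing of sdaN with sdbN
is an assumption taken from the mdstat listing, not something the log
itself states.

```python
import re

# Selected lines copied from the boot log quoted in the message.
LOG = """\
Jan 29 20:22:36 zaphkiel kernel: [   11.549612] md: bind<sda3>
Jan 29 20:22:36 zaphkiel kernel: [   11.587037] md: bind<sdb3>
Jan 29 20:22:36 zaphkiel kernel: [   11.630965] md/raid1:md124: active with 2 out of 2 mirrors
Jan 29 20:22:36 zaphkiel kernel: [   25.151590] md: bind<sda2>
Jan 29 20:22:36 zaphkiel kernel: [   25.314284] md: bind<sda1>
Jan 29 20:22:36 zaphkiel kernel: [   25.409503] md: bind<sda4>
Jan 29 20:22:36 zaphkiel kernel: [   25.568103] md/raid1:md0: active with 1 out of 2 mirrors
Jan 29 20:22:36 zaphkiel kernel: [   25.689110] md: bind<sdb2>
Jan 29 20:22:36 zaphkiel kernel: [   26.045361] md/raid1:md1: active with 2 out of 2 mirrors
Jan 29 20:22:36 zaphkiel kernel: [   26.391526] md/raid1:md3: active with 1 out of 2 mirrors
"""

BIND = re.compile(r"\[\s*(\d+\.\d+)\] md: bind<(\w+)>")
ACTIVE = re.compile(
    r"\[\s*(\d+\.\d+)\] md/raid1:(md\d+): active with (\d+) out of (\d+) mirrors"
)


def summarize(log_text):
    """Return (binds, started): device -> bind time, and
    array -> (start time, mirrors present, mirrors expected)."""
    binds, started = {}, {}
    for line in log_text.splitlines():
        m = BIND.search(line)
        if m:
            binds[m.group(2)] = float(m.group(1))
            continue
        m = ACTIVE.search(line)
        if m:
            started[m.group(2)] = (float(m.group(1)), int(m.group(3)), int(m.group(4)))
    return binds, started


def unbound_partners(binds, disks=("sda", "sdb")):
    """Partitions bound on one disk whose same-numbered partner on the
    other disk never shows up (assumes sdaN mirrors sdbN)."""
    numbers = {dev[len(d):] for dev in binds for d in disks if dev.startswith(d)}
    return [d + n for n in sorted(numbers) for d in disks if d + n not in binds]


if __name__ == "__main__":
    binds, started = summarize(LOG)
    for name, (when, have, want) in started.items():
        if have < want:
            print(f"{name} started degraded at {when:.6f}s ({have}/{want})")
    print("never bound:", unbound_partners(binds))
```

On these lines it flags md0 and md3 as started degraded and lists sdb1
and sdb4 as never bound, while sdb2 is bound only after md0 has already
been started.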

The major parts operating here are: 
mdadm-3.2.2-4.9.1.i586
mkinitrd-2.7.0-39.3.1.i586
kernel-desktop-3.1.10-1.16.1.i586
kernel-desktop-base-3.1.10-1.16.1.i586

Sure, the system can be repaired with:
mdadm --add /dev/md0 /dev/sdb1
mdadm --add /dev/md3 /dev/sdb4

in this case, but which partition is affected varies randomly from boot
to boot; only md124 (the root fs) seems stable. The strange md naming
is the result of an upgrade installation. The device details are
attached as well.
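
Since the affected partitions change from boot to boot, the repair
commands could be derived from /proc/mdstat rather than typed by hand.
The following is only a sketch: `readd_commands` is a made-up name, it
assumes every array mirrors sdaN with sdbN as in the listing above, and
it merely builds the `mdadm --add` command strings without running
anything.

```python
import re

# Degraded part of the /proc/mdstat output quoted in the message,
# plus one healthy array for contrast.
SAMPLE = """\
md3 : active raid1 sda4[0]
      869702736 blocks super 1.0 [2/1] [U_]
      bitmap: 57/415 pages [228KB], 1024KB chunk

md0 : active raid1 sda1[0]
      96376 blocks super 1.0 [2/1] [U_]
      bitmap: 1/6 pages [4KB], 8KB chunk

md1 : active (auto-read-only) raid1 sdb2[1] sda2[0]
      2096468 blocks super 1.0 [2/2] [UU]
      bitmap: 0/8 pages [0KB], 128KB chunk
"""


def readd_commands(mdstat_text, disks=("sda", "sdb")):
    """Build 'mdadm --add' command strings for degraded arrays.

    Assumption: partition N of each disk in `disks` belongs to the same
    array, so the missing member is the same-numbered partition on the
    other disk.
    """
    commands = []
    name, members = None, []
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if header:
            name = header.group(1)
            # Members appear as e.g. "sda4[0]" on the header line.
            members = re.findall(r"(\w+)\[\d+\]", header.group(2))
            continue
        status = re.search(r"\[\d+/\d+\]\s+\[([U_]+)\]", line)
        if not (name and status and "_" in status.group(1)):
            continue
        for member in members:
            part = re.match(r"([a-z]+)(\d+)$", member)
            if not part or part.group(1) not in disks:
                continue
            for disk in disks:
                candidate = disk + part.group(2)
                if candidate not in members:
                    commands.append(f"mdadm --add /dev/{name} /dev/{candidate}")
    return commands


if __name__ == "__main__":
    # Print only; review the output before running any of it as root.
    for command in readd_commands(SAMPLE):
        print(command)
```

For the sample this yields the same two commands given above. Printing
instead of executing is deliberate: with the active device switching
between boots, which half of a mirror holds the newer data should be
checked before anything is re-added.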

It even happens that the active device *switches* between boots, which
is a perfect recipe for actually losing data. Hence this md setup
doesn't improve data safety; it becomes the reason for losing data.

Could some kind soul tell me what's going on here?

Thanks in advance,
Pete

[-- Attachment #2: details-and-log.tar.bz2 --]
[-- Type: application/x-bzip-compressed-tar, Size: 18989 bytes --]
