All of lore.kernel.org
 help / color / mirror / Atom feed
* Persistent failures with simple md setup
@ 2013-01-29 22:14 Hans-Peter Jansen
  2013-01-30  9:07 ` Sebastian Riemer
  2013-01-30  9:20 ` Roy Sigurd Karlsbakk
  0 siblings, 2 replies; 19+ messages in thread
From: Hans-Peter Jansen @ 2013-01-29 22:14 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3990 bytes --]

[Looks like my first messge didn't made it to the list, hence send 
again with tarballed attachments]

Dear list members,

one of the systems, I take care of, there's one pretty bog standard 
openSUSE 12.1 installation, that stick out with continued device 
failures on boot:

Here a typical case:

~# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] 
md3 : active raid1 sda4[0]
      869702736 blocks super 1.0 [2/1] [U_]
      bitmap: 57/415 pages [228KB], 1024KB chunk

md0 : active raid1 sda1[0]
      96376 blocks super 1.0 [2/1] [U_]
      bitmap: 1/6 pages [4KB], 8KB chunk

md1 : active (auto-read-only) raid1 sdb2[1] sda2[0]
      2096468 blocks super 1.0 [2/2] [UU]
      bitmap: 0/8 pages [0KB], 128KB chunk

md124 : active raid1 sdb3[1] sda3[0]
      104856180 blocks super 1.0 [2/2] [UU]
      bitmap: 8/200 pages [32KB], 256KB chunk

[no line breaks on purpose]
Jan 29 20:22:36 zaphkiel kernel: [   11.047504] md: raid1 personality registered for level 1
Jan 29 20:22:36 zaphkiel kernel: [   11.549612] md: bind<sda3>
Jan 29 20:22:36 zaphkiel kernel: [   11.587037] md: bind<sdb3>
Jan 29 20:22:36 zaphkiel kernel: [   11.630965] md/raid1:md124: active with 2 out of 2 mirrors
Jan 29 20:22:36 zaphkiel kernel: [   11.708396] md124: bitmap initialized from disk: read 13/13 pages, set 1 of 409595 bits
Jan 29 20:22:36 zaphkiel kernel: [   11.769213] md124: detected capacity change from 0 to 107372728320
Jan 29 20:22:36 zaphkiel kernel: [   11.981192] md: raid0 personality registered for level 0
Jan 29 20:22:36 zaphkiel kernel: [   12.020959] md: raid10 personality registered for level 10
Jan 29 20:22:36 zaphkiel kernel: [   12.625530] md: raid6 personality registered for level 6
Jan 29 20:22:36 zaphkiel kernel: [   12.657414] md: raid5 personality registered for level 5
Jan 29 20:22:36 zaphkiel kernel: [   12.689261] md: raid4 personality registered for level 4
Jan 29 20:22:36 zaphkiel kernel: [   25.151590] md: bind<sda2>
Jan 29 20:22:36 zaphkiel kernel: [   25.314284] md: bind<sda1>
Jan 29 20:22:36 zaphkiel kernel: [   25.409503] md: bind<sda4>
Jan 29 20:22:36 zaphkiel kernel: [   25.568103] md/raid1:md0: active with 1 out of 2 mirrors
Jan 29 20:22:36 zaphkiel kernel: [   25.689110] md: bind<sdb2>
Jan 29 20:22:36 zaphkiel kernel: [   25.713385] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 12047 bits
Jan 29 20:22:36 zaphkiel kernel: [   25.837207] md0: detected capacity change from 0 to 98689024
Jan 29 20:22:36 zaphkiel kernel: [   26.045361] md/raid1:md1: active with 2 out of 2 mirrors
Jan 29 20:22:36 zaphkiel kernel: [   26.260500] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 16379 bits
Jan 29 20:22:36 zaphkiel kernel: [   26.349129] md1: detected capacity change from 0 to 2146783232
Jan 29 20:22:36 zaphkiel kernel: [   26.391526] md/raid1:md3: active with 1 out of 2 mirrors
Jan 29 20:22:36 zaphkiel kernel: [   27.188346] md3: bitmap initialized from disk: read 26/26 pages, set 1547 of 849320 bits
Jan 29 20:22:36 zaphkiel kernel: [   27.302622] md3: detected capacity change from 0 to 890575601664

This looks like some kind of race during device detection.
The full boot sequence log leading to this mess is attached. 

The major parts operating here are: 
mdadm-3.2.2-4.9.1.i586
mkinitrd-2.7.0-39.3.1.i586
kernel-desktop-3.1.10-1.16.1.i586
kernel-desktop-base-3.1.10-1.16.1.i586

Sure the system can be repaired with:
mdadm --add /dev/md0 /dev/sdb1
mdadm --add /dev/md3 /dev/sdb4

for this case, but the behavior which partition is affected is random, 
only md124 seems stable (the root fs). The strange md naming was the 
result of an upgrade installation. The device details are attached as 
well.

It happens, that the active device even *switches* between boots, which 
is a perfect recipe for actually loosing data, hence this md doesn't 
raise data security, it is the reason for loosing them. 

Could some kind soul tell me, what's going on here?

Thanks in advance,
Pete

[-- Attachment #2: details-and-log.tar.bz2 --]
[-- Type: application/x-bzip-compressed-tar, Size: 18989 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-01-29 22:14 Persistent failures with simple md setup Hans-Peter Jansen
@ 2013-01-30  9:07 ` Sebastian Riemer
  2013-01-30 17:12   ` Hans-Peter Jansen
  2013-01-30  9:20 ` Roy Sigurd Karlsbakk
  1 sibling, 1 reply; 19+ messages in thread
From: Sebastian Riemer @ 2013-01-30  9:07 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: Linux RAID, NeilBrown

On 29.01.2013 23:14, Hans-Peter Jansen wrote:
[...]
> ~# cat /proc/mdstat
> Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] 
> md3 : active raid1 sda4[0]
>       869702736 blocks super 1.0 [2/1] [U_]
>       bitmap: 57/415 pages [228KB], 1024KB chunk
> 
> md0 : active raid1 sda1[0]
>       96376 blocks super 1.0 [2/1] [U_]
>       bitmap: 1/6 pages [4KB], 8KB chunk
> 
> md1 : active (auto-read-only) raid1 sdb2[1] sda2[0]
>       2096468 blocks super 1.0 [2/2] [UU]
>       bitmap: 0/8 pages [0KB], 128KB chunk
> 
> md124 : active raid1 sdb3[1] sda3[0]
>       104856180 blocks super 1.0 [2/2] [UU]
>       bitmap: 8/200 pages [32KB], 256KB chunk
> 
> This looks like some kind of race during device detection.
> The full boot sequence log leading to this mess is attached. 

[...]

> Could some kind soul tell me, what's going on here?

Funny, we've observed similar strange behavior when putting MD devices
on iSCSI/SRP exports. We connect to the SCSI target and udev does lots
of crap assembling only 1/2 or even 0/2 devices. This is why we disable
all udev rules related to MD and do it by custom scripts.

In mdadm 3.2.6 a possible fix has been introduced.

Check: git log mdadm-3.2.5..mdadm-3.2.6


commit 090900c3d2eb5b3aef5251a21228483c32246cc7
Author: Harald Hoyer <harald@redhat.com>
Date:   Mon Aug 13 08:00:21 2012 +1000

    udev-rules: prevent systemd from mount devices before they are ready.

commit b7e05d2373313dd8d0cb687479ad58a88f37d29f
Author: NeilBrown <neilb@suse.de>
Date:   Thu May 24 11:49:49 2012 +1000

    udev-rules: prevent systemd from mount devices before they are ready.


Does mdadm 3.2.6 solve this?

Cheers,
Sebastian


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-01-29 22:14 Persistent failures with simple md setup Hans-Peter Jansen
  2013-01-30  9:07 ` Sebastian Riemer
@ 2013-01-30  9:20 ` Roy Sigurd Karlsbakk
  1 sibling, 0 replies; 19+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-01-30  9:20 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: linux-raid

> ~# cat /proc/mdstat
> Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4]
> md3 : active raid1 sda4[0]
> 869702736 blocks super 1.0 [2/1] [U_]
> bitmap: 57/415 pages [228KB], 1024KB chunk
> 
> md0 : active raid1 sda1[0]
> 96376 blocks super 1.0 [2/1] [U_]
> bitmap: 1/6 pages [4KB], 8KB chunk
> 
> md1 : active (auto-read-only) raid1 sdb2[1] sda2[0]
> 2096468 blocks super 1.0 [2/2] [UU]
> bitmap: 0/8 pages [0KB], 128KB chunk
> 
> md124 : active raid1 sdb3[1] sda3[0]
> 104856180 blocks super 1.0 [2/2] [UU]
> bitmap: 8/200 pages [32KB], 256KB chunk
...
> for this case, but the behavior which partition is affected is random,
> only md124 seems stable (the root fs). The strange md naming was the
> result of an upgrade installation. The device details are attached as
> well.
> 
> It happens, that the active device even *switches* between boots,
> which
> is a perfect recipe for actually loosing data, hence this md doesn't
> raise data security, it is the reason for loosing them.
> 
> Could some kind soul tell me, what's going on here?

This looks somewhat familiar. I had a similar issue after install Ubuntu 12.04 on my home server. IIRC udev/mdadm was out of sync, so that the raid was started before all drives were ready. This was fixed around ubuntu 12.04.1. But then again, since it finds some of the partitions, and not losing whole drives, this seems somewhat unlikely.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med xenotyp etymologi. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-01-30  9:07 ` Sebastian Riemer
@ 2013-01-30 17:12   ` Hans-Peter Jansen
  2013-02-04 20:43     ` Hans-Peter Jansen
  0 siblings, 1 reply; 19+ messages in thread
From: Hans-Peter Jansen @ 2013-01-30 17:12 UTC (permalink / raw)
  To: Sebastian Riemer; +Cc: Linux RAID, NeilBrown

Dear Sebastian,

thanks for your valuable response.

Am Mittwoch, 30. Januar 2013, 10:07:24 schrieb Sebastian Riemer:
> On 29.01.2013 23:14, Hans-Peter Jansen wrote:
> [...]
> 
> > ~# cat /proc/mdstat
> > Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4]
> > md3 : active raid1 sda4[0]
> > 
> >       869702736 blocks super 1.0 [2/1] [U_]
> >       bitmap: 57/415 pages [228KB], 1024KB chunk
> > 
> > md0 : active raid1 sda1[0]
> > 
> >       96376 blocks super 1.0 [2/1] [U_]
> >       bitmap: 1/6 pages [4KB], 8KB chunk
> > 
> > md1 : active (auto-read-only) raid1 sdb2[1] sda2[0]
> > 
> >       2096468 blocks super 1.0 [2/2] [UU]
> >       bitmap: 0/8 pages [0KB], 128KB chunk
> > 
> > md124 : active raid1 sdb3[1] sda3[0]
> > 
> >       104856180 blocks super 1.0 [2/2] [UU]
> >       bitmap: 8/200 pages [32KB], 256KB chunk
> > 
> > This looks like some kind of race during device detection.
> > The full boot sequence log leading to this mess is attached.
> 
> [...]
> 
> > Could some kind soul tell me, what's going on here?
> 
> Funny, we've observed similar strange behavior when putting MD devices
> on iSCSI/SRP exports. We connect to the SCSI target and udev does lots
> of crap assembling only 1/2 or even 0/2 devices. This is why we disable
> all udev rules related to MD and do it by custom scripts.

Oh, I see. Forgot to mention, I do not enjoy fiddling in mkinitrd code ;-)  
(been there, had to do that for my aufs2 based diskless setups)

> In mdadm 3.2.6 a possible fix has been introduced.
> 
> Check: git log mdadm-3.2.5..mdadm-3.2.6
> 
> 
> commit 090900c3d2eb5b3aef5251a21228483c32246cc7
> Author: Harald Hoyer <harald@redhat.com>
> Date:   Mon Aug 13 08:00:21 2012 +1000
> 
>     udev-rules: prevent systemd from mount devices before they are ready.
> 
> commit b7e05d2373313dd8d0cb687479ad58a88f37d29f
> Author: NeilBrown <neilb@suse.de>
> Date:   Thu May 24 11:49:49 2012 +1000
> 
>     udev-rules: prevent systemd from mount devices before they are ready.
> 
> 
> Does mdadm 3.2.6 solve this?

Hmm, according to mdadm from openSUSE:12.1:Update, the relevant fixes should 
be in place. It might be an unfortunate combination of this issue and the 
asynchronously applied updates, interfered by the *switching* behavior. 

I started with regenerating the initrds now, and a first reboot succeeded so 
far. Good.

Will ask my friend to reboot the system a dozen times tonight.

Thanks,
Pete

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-01-30 17:12   ` Hans-Peter Jansen
@ 2013-02-04 20:43     ` Hans-Peter Jansen
  2013-02-05  3:44       ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Hans-Peter Jansen @ 2013-02-04 20:43 UTC (permalink / raw)
  To: Linux RAID; +Cc: Sebastian Riemer, NeilBrown

[-- Attachment #1: Type: text/plain, Size: 10019 bytes --]

Am Mittwoch, 30. Januar 2013, 18:12:46 schrieb Hans-Peter Jansen:
> 
> Hmm, according to mdadm from openSUSE:12.1:Update, the relevant fixes should
> be in place. It might be an unfortunate combination of this issue and the
> asynchronously applied updates, interfered by the *switching* behavior.
> 
> I started with regenerating the initrds now, and a first reboot succeeded so
> far. Good.
> 
> Will ask my friend to reboot the system a dozen times tonight.

After a few reboots, the issue reappeared. I really believe now, that by
driving the md in degraded mode for some time and with the switching behavior, 
just re-adding the devices resulted in unsynced raid1 devices.

Next, my friend managed to create a nearby data disaster: I've explained him,
how he would be able to re-add a device himself. He did so on sunday with his
home partition, and since there appeared no progress bar in /proc/mdstat, he 
immediately repeated the command. 

Neil, is it conceivable (due to a race or the like), that repeating to add 
(re-add) a device potentially creates data salad, since that home-fs (xfs) 
gone mad a few minutes later (firefox crashed, and couldn't be started, kmail 
crashed, and so on (all those processes, that write to ~). He decided to 
reboot, and that jailed him in the emergency recovery console, because /home 
couldn't be mounted anymore.

Both parts of the mirror were affected, the "old" part was ~200kb 
undestructive xfs_repair log, the other ~900kb, hence I decided to use the 
smaller one. First I failed and removed the other part, and then attempted to 
repair it. Unfortunately, the real repair run bailed out with:

disconnected inode 2161430687, moving to lost+found
corrupt dinode 2161430687, extent total = 1, nblocks = 0.  This is a bug.
Please capture the filesystem metadata with xfs_metadump and
report it to xfs@oss.sgi.com.
cache_node_purge: refcount was 1, not zero (node=0xf867208)

fatal error -- 117 - couldn't iget disconnected inode

although I already used the (current) xfsprogs-3.1.6 version. :-(

After fixing that issue manually with xfs_db (== great fun), I was able to
recover the filesystem. It lost(+found) just a few new items, nobody cares 
about... So far, so good..

Now, the unsynced state disturbed me. Just re-adding the bad device might
result in an invalid mirror again. A "repair" run cannot be controlled. Hence
I zeroed the superblock of that partition, and added it. Et voila, it 
completely synced that mirror. Good.

Today, I hammered the raid1 partitions with "check". During one run, this 
appeared in syslog:

Feb  4 11:18:26 zaphkiel kernel: [11165.652478] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:26 zaphkiel kernel: [11165.652486] ata2.00: irq_stat 0x40000008
Feb  4 11:18:26 zaphkiel kernel: [11165.652495] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:26 zaphkiel kernel: [11165.652510] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb  4 11:18:26 zaphkiel kernel: [11165.652513]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:26 zaphkiel kernel: [11165.652520] ata2.00: status: { DRDY ERR }
Feb  4 11:18:26 zaphkiel kernel: [11165.652524] ata2.00: error: { UNC }
Feb  4 11:18:26 zaphkiel kernel: [11165.652876] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
Feb  4 11:18:26 zaphkiel kernel: [11165.652882] ata2.00: revalidation failed (errno=-5)
Feb  4 11:18:26 zaphkiel kernel: [11165.652890] ata2: hard resetting link
Feb  4 11:18:26 zaphkiel kernel: [11165.957043] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb  4 11:18:26 zaphkiel kernel: [11165.969910] ata2.00: configured for UDMA/133
Feb  4 11:18:26 zaphkiel kernel: [11165.970048] ata2: EH complete
Feb  4 11:18:28 zaphkiel kernel: [11167.949241] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:28 zaphkiel kernel: [11167.949249] ata2.00: irq_stat 0x40000008
Feb  4 11:18:28 zaphkiel kernel: [11167.949257] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:28 zaphkiel kernel: [11167.949272] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb  4 11:18:28 zaphkiel kernel: [11167.949275]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:28 zaphkiel kernel: [11167.949282] ata2.00: status: { DRDY ERR }
Feb  4 11:18:28 zaphkiel kernel: [11167.949287] ata2.00: error: { UNC }
Feb  4 11:18:28 zaphkiel kernel: [11167.962146] ata2.00: configured for UDMA/133
Feb  4 11:18:28 zaphkiel kernel: [11167.962206] ata2: EH complete
Feb  4 11:18:30 zaphkiel kernel: [11169.898187] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:30 zaphkiel kernel: [11169.898195] ata2.00: irq_stat 0x40000008
Feb  4 11:18:30 zaphkiel kernel: [11169.898204] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:30 zaphkiel kernel: [11169.898219] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb  4 11:18:30 zaphkiel kernel: [11169.898222]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:30 zaphkiel kernel: [11169.898229] ata2.00: status: { DRDY ERR }
Feb  4 11:18:30 zaphkiel kernel: [11169.898234] ata2.00: error: { UNC }
Feb  4 11:18:30 zaphkiel kernel: [11169.912066] ata2.00: configured for UDMA/133
Feb  4 11:18:30 zaphkiel kernel: [11169.912117] ata2: EH complete
Feb  4 11:18:32 zaphkiel kernel: [11171.905192] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:32 zaphkiel kernel: [11171.905200] ata2.00: irq_stat 0x40000008
Feb  4 11:18:32 zaphkiel kernel: [11171.905208] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:32 zaphkiel kernel: [11171.905223] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb  4 11:18:32 zaphkiel kernel: [11171.905226]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:32 zaphkiel kernel: [11171.905233] ata2.00: status: { DRDY ERR }
Feb  4 11:18:32 zaphkiel kernel: [11171.905238] ata2.00: error: { UNC }
Feb  4 11:18:32 zaphkiel kernel: [11171.919099] ata2.00: configured for UDMA/133
Feb  4 11:18:32 zaphkiel kernel: [11171.919152] ata2: EH complete
Feb  4 11:18:34 zaphkiel kernel: [11173.912191] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:34 zaphkiel kernel: [11173.912199] ata2.00: irq_stat 0x40000008
Feb  4 11:18:34 zaphkiel kernel: [11173.912208] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:34 zaphkiel kernel: [11173.912223] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb  4 11:18:34 zaphkiel kernel: [11173.912226]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:34 zaphkiel kernel: [11173.912233] ata2.00: status: { DRDY ERR }
Feb  4 11:18:34 zaphkiel kernel: [11173.912238] ata2.00: error: { UNC }
Feb  4 11:18:34 zaphkiel kernel: [11173.925101] ata2.00: configured for UDMA/133
Feb  4 11:18:34 zaphkiel kernel: [11173.925159] ata2: EH complete
Feb  4 11:18:36 zaphkiel kernel: [11175.861152] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:36 zaphkiel kernel: [11175.861160] ata2.00: irq_stat 0x40000008
Feb  4 11:18:36 zaphkiel kernel: [11175.861168] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:36 zaphkiel kernel: [11175.861183] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb  4 11:18:36 zaphkiel kernel: [11175.861186]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:36 zaphkiel kernel: [11175.861193] ata2.00: status: { DRDY ERR }
Feb  4 11:18:36 zaphkiel kernel: [11175.861198] ata2.00: error: { UNC }
Feb  4 11:18:36 zaphkiel kernel: [11175.874052] ata2.00: configured for UDMA/133
Feb  4 11:18:36 zaphkiel kernel: [11175.874103] sd 1:0:0:0: [sdb] Unhandled sense code
Feb  4 11:18:36 zaphkiel kernel: [11175.874109] sd 1:0:0:0: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb  4 11:18:36 zaphkiel kernel: [11175.874117] sd 1:0:0:0: [sdb]  Sense Key : Medium Error [current] [descriptor]
Feb  4 11:18:36 zaphkiel kernel: [11175.874125] Descriptor sense data with sense descriptors (in hex):
Feb  4 11:18:36 zaphkiel kernel: [11175.874130]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Feb  4 11:18:36 zaphkiel kernel: [11175.874145]         0c c2 ef 3f 
Feb  4 11:18:36 zaphkiel kernel: [11175.874153] sd 1:0:0:0: [sdb]  Add. Sense: Unrecovered read error - auto reallocate failed
Feb  4 11:18:36 zaphkiel kernel: [11175.874163] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 0c c2 ef 12 00 00 80 00
Feb  4 11:18:36 zaphkiel kernel: [11175.874180] end_request: I/O error, dev sdb, sector 214101823
Feb  4 11:18:36 zaphkiel kernel: [11175.874234] ata2: EH complete
Feb  4 11:18:38 zaphkiel kernel: [11177.954091] md: md124: data-check done.

This is a classical URE, isn't it? Interestingly, nonetheless, the raid1 check 
run succeeded! (Not so good, is it?)

Before you ask, both drives have a sane timeout already:

smartctl -l scterc /dev/sda
smartctl 6.0 2012-10-10 r3643 [i686-linux-3.1.10-1.9-desktop] (SUSE RPM)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)


smartctl -l scterc /dev/sdb
smartctl 6.0 2012-10-10 r3643 [i686-linux-3.1.10-1.9-desktop] (SUSE RPM)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)


Attached is the result of smartctl -x of sda (good), and sdb (bad).

Could somebody from the audience have a look into it, and give me an 
assessment, how dangerous the state of this drive really is.

Last question: since I had to massage the system anyway, I've updated mdadm 
from 3.2.2 to 3.2.6. I red, that it can be dangerous to do so, what do I risk
here?

Thanks in advance,
Pete

[-- Attachment #2: smart-sdb-20130204-140400.log --]
[-- Type: text/x-log, Size: 17301 bytes --]

smartctl 6.0 2012-10-10 r3643 [i686-linux-3.1.10-1.9-desktop] (SUSE RPM)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 RE
Device Model:     SAMSUNG HE103UJ
Serial Number:    S13VJDWS900483
LU WWN Device Id: 5 0024e9 002167ef6
Firmware Version: 1AA01118
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7, ATA8-ACS T13/1699-D revision 3b
Local Time is:    Mon Feb  4 14:04:00 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Disabled
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(12124) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 203) minutes.
Conveyance self-test routine
recommended polling time: 	 (  22) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   099   099   051    -    531
  3 Spin_Up_Time            POS---   073   073   011    -    9000
  4 Start_Stop_Count        -O--CK   099   099   000    -    1278
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   100   100   051    -    0
  8 Seek_Time_Performance   P-S--K   100   100   015    -    0
  9 Power_On_Hours          -O--CK   097   097   000    -    16416
 10 Spin_Retry_Count        PO--CK   100   100   051    -    0
 11 Calibration_Retry_Count -O--C-   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   099   099   000    -    1277
 13 Read_Soft_Error_Rate    -OSR--   099   099   000    -    528
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    1052
188 Command_Timeout         -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   076   059   000    -    24 (Min/Max 12/24)
194 Temperature_Celsius     -O---K   077   058   000    -    23 (Min/Max 12/26)
195 Hardware_ECC_Recovered  -O-RC-   100   100   000    -    19500704
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   253   253   000    -    0
200 Multi_Zone_Error_Rate   -O-R--   100   100   000    -    0
201 Soft_Read_Error_Rate    -O-R--   099   099   000    -    12
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
GP/S  Log at address 0x00 has    1 sectors [Log Directory]
SMART Log at address 0x01 has    1 sectors [Summary SMART error log]
SMART Log at address 0x02 has    2 sectors [Comprehensive SMART error log]
GP    Log at address 0x03 has    2 sectors [Ext. Comprehensive SMART error log]
GP    Log at address 0x04 has    2 sectors [Device Statistics log]
SMART Log at address 0x06 has    1 sectors [SMART self-test log]
GP    Log at address 0x07 has    2 sectors [Extended self-test log]
SMART Log at address 0x09 has    1 sectors [Selective self-test log]
GP    Log at address 0x10 has    1 sectors [NCQ Command Error log]
GP    Log at address 0x11 has    1 sectors [SATA Phy Event Counters]
GP    Log at address 0x20 has    2 sectors [Streaming performance log]
GP    Log at address 0x21 has    1 sectors [Write stream error log]
GP    Log at address 0x22 has    1 sectors [Read stream error log]
GP/S  Log at address 0x80 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x81 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x82 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x83 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x84 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x85 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x86 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x87 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x88 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x89 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8a has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8b has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8c has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8d has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8e has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8f has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x90 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x91 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x92 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x93 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x94 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x95 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x96 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x97 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x98 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x99 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9a has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9b has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9c has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9d has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9e has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9f has   16 sectors [Host vendor specific log]
GP/S  Log at address 0xe0 has    1 sectors [SCT Command/Status]
GP/S  Log at address 0xe1 has    1 sectors [SCT Data Transfer]

SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
Device Error Count: 6
	CR     = Command Register
	FEATR  = Features Register
	COUNT  = Count (was: Sector Count) Register
	LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
	LH     = LBA High (was: Cylinder High) Register    ]   LBA
	LM     = LBA Mid (was: Cylinder Low) Register      ] Register
	LL     = LBA Low (was: Sector Number) Register     ]
	DV     = Device (was: Device/Head) Register
	DC     = Device Control Register
	ER     = Error register
	ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 6 [5] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  00 -- 42 00 00 00 00 0c c2 ef 3e 40 00   at LBA = 0x0cc2ef3e = 214101822

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 00 80 00 00 00 c2 f2 92 40 00     03:06:50.480  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f3 12 40 00     03:06:50.480  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f3 92 40 00     03:06:50.480  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f6 92 40 00     03:06:50.480  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f7 12 40 00     03:06:50.480  READ FPDMA QUEUED

Error 5 [4] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  00 -- 42 00 00 00 00 0c c2 ef 3e 40 00   at LBA = 0x0cc2ef3e = 214101822

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 00 80 00 00 00 c2 f7 92 40 00     03:06:48.480  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f5 92 40 00     03:06:48.480  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 ef 12 40 00     03:06:48.480  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 fe 12 40 00     03:06:48.480  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 fb 12 40 00     03:06:48.480  READ FPDMA QUEUED

Error 4 [3] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  00 -- 42 00 00 00 00 0c c2 ef 3e 40 00   at LBA = 0x0cc2ef3e = 214101822

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 00 80 00 00 00 c2 f2 92 40 00     03:06:46.470  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f3 12 40 00     03:06:46.470  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f3 92 40 00     03:06:46.470  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f6 92 40 00     03:06:46.470  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f7 12 40 00     03:06:46.470  READ FPDMA QUEUED

Error 3 [2] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  00 -- 42 00 00 00 00 0c c2 ef 3e 40 00   at LBA = 0x0cc2ef3e = 214101822

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 00 80 00 00 00 c2 f7 92 40 00     03:06:44.520  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f5 92 40 00     03:06:44.520  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 ef 12 40 00     03:06:44.520  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 fe 12 40 00     03:06:44.520  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 fb 12 40 00     03:06:44.520  READ FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  00 -- 42 00 00 00 00 0c c2 ef 3e 40 00   at LBA = 0x0cc2ef3e = 214101822

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 00 80 00 00 00 c2 f2 92 40 00     03:06:42.530  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f3 12 40 00     03:06:42.530  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f3 92 40 00     03:06:42.530  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f6 92 40 00     03:06:42.530  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f7 12 40 00     03:06:42.530  READ FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  00 -- 42 00 00 00 00 0c c2 ef 3e 40 00   at LBA = 0x0cc2ef3e = 214101822

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 00 80 00 00 00 c2 fb 92 40 00     03:06:40.150  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 fb 12 40 00     03:06:40.150  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 fa 92 40 00     03:06:40.150  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 fa 12 40 00     03:06:40.150  READ FPDMA QUEUED
  60 00 00 00 80 00 00 00 c2 f9 92 40 00     03:06:40.150  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (2 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  2
SCT Version (vendor specific):       256 (0x0100)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                 23 Celsius
Power Cycle Max Temperature:         26 Celsius
Lifetime    Max Temperature:         63 Celsius
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:     -4/72 Celsius
Min/Max Temperature Limit:           -9/77 Celsius
Temperature History Size (Index):    128 (4)

Index    Estimated Time   Temperature Celsius
   5    2013-02-04 11:57    24  *****
 ...    ..( 32 skipped).    ..  *****
  38    2013-02-04 12:30    24  *****
  39    2013-02-04 12:31    25  ******
  40    2013-02-04 12:32    24  *****
 ...    ..( 52 skipped).    ..  *****
  93    2013-02-04 13:25    24  *****
  94    2013-02-04 13:26    23  ****
 ...    ..( 37 skipped).    ..  ****
   4    2013-02-04 14:04    23  ****

SCT Error Recovery Control:
           Read:     70 (7,0 seconds)
          Write:     70 (7,0 seconds)

ATA_READ_LOG_EXT (addr=0x04:0x00, page=0, n=1) failed: scsi error aborted command
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            5  Transition from drive PhyRdy to drive PhyNRdy
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC


[-- Attachment #3: smart-sda-20130204-140400.log --]
[-- Type: text/x-log, Size: 10814 bytes --]

smartctl 6.0 2012-10-10 r3643 [i686-linux-3.1.10-1.9-desktop] (SUSE RPM)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 RE
Device Model:     SAMSUNG HE103UJ
Serial Number:    S13VJDWS900475
LU WWN Device Id: 5 0024e9 002167cfe
Firmware Version: 1AA01118
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7, ATA8-ACS T13/1699-D revision 3b
Local Time is:    Mon Feb  4 14:04:00 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Disabled
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(11933) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 200) minutes.
Conveyance self-test routine
recommended polling time: 	 (  21) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   100   100   051    -    0
  3 Spin_Up_Time            POS---   072   072   011    -    9340
  4 Start_Stop_Count        -O--CK   099   099   000    -    1278
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   100   100   051    -    0
  8 Seek_Time_Performance   P-S--K   100   100   015    -    0
  9 Power_On_Hours          -O--CK   097   097   000    -    16415
 10 Spin_Retry_Count        PO--CK   100   100   051    -    0
 11 Calibration_Retry_Count -O--C-   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   099   099   000    -    1277
 13 Read_Soft_Error_Rate    -OSR--   100   100   000    -    0
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   077   001   000    -    23 (Min/Max 11/23)
194 Temperature_Celsius     -O---K   078   058   000    -    22 (Min/Max 11/25)
195 Hardware_ECC_Recovered  -O-RC-   100   100   000    -    3573535
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
200 Multi_Zone_Error_Rate   -O-R--   100   100   000    -    0
201 Soft_Read_Error_Rate    -O-R--   100   100   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
GP/S  Log at address 0x00 has    1 sectors [Log Directory]
SMART Log at address 0x01 has    1 sectors [Summary SMART error log]
SMART Log at address 0x02 has    2 sectors [Comprehensive SMART error log]
GP    Log at address 0x03 has    2 sectors [Ext. Comprehensive SMART error log]
GP    Log at address 0x04 has    2 sectors [Device Statistics log]
SMART Log at address 0x06 has    1 sectors [SMART self-test log]
GP    Log at address 0x07 has    2 sectors [Extended self-test log]
SMART Log at address 0x09 has    1 sectors [Selective self-test log]
GP    Log at address 0x10 has    1 sectors [NCQ Command Error log]
GP    Log at address 0x11 has    1 sectors [SATA Phy Event Counters]
GP    Log at address 0x20 has    2 sectors [Streaming performance log]
GP    Log at address 0x21 has    1 sectors [Write stream error log]
GP    Log at address 0x22 has    1 sectors [Read stream error log]
GP/S  Log at address 0x80 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x81 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x82 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x83 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x84 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x85 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x86 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x87 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x88 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x89 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8a has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8b has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8c has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8d has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8e has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8f has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x90 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x91 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x92 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x93 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x94 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x95 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x96 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x97 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x98 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x99 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9a has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9b has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9c has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9d has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9e has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9f has   16 sectors [Host vendor specific log]
GP/S  Log at address 0xe0 has    1 sectors [SCT Command/Status]
GP/S  Log at address 0xe1 has    1 sectors [SCT Data Transfer]

SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (2 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  2
SCT Version (vendor specific):       256 (0x0100)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                 22 Celsius
Power Cycle Max Temperature:         26 Celsius
Lifetime    Max Temperature:         55 Celsius
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:     -4/72 Celsius
Min/Max Temperature Limit:           -9/77 Celsius
Temperature History Size (Index):    128 (10)

Index    Estimated Time   Temperature Celsius
  11    2013-02-04 11:57    23  ****
 ...    ..( 31 skipped).    ..  ****
  43    2013-02-04 12:29    23  ****
  44    2013-02-04 12:30    24  *****
 ...    ..(  2 skipped).    ..  *****
  47    2013-02-04 12:33    24  *****
  48    2013-02-04 12:34    23  ****
 ...    ..( 79 skipped).    ..  ****
   0    2013-02-04 13:54    23  ****
   1    2013-02-04 13:55    22  ***
 ...    ..(  8 skipped).    ..  ***
  10    2013-02-04 14:04    22  ***

SCT Error Recovery Control:
           Read:     70 (7,0 seconds)
          Write:     70 (7,0 seconds)

ATA_READ_LOG_EXT (addr=0x04:0x00, page=0, n=1) failed: scsi error aborted command
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            4  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            4  Transition from drive PhyRdy to drive PhyNRdy
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-02-04 20:43     ` Hans-Peter Jansen
@ 2013-02-05  3:44       ` NeilBrown
  2013-02-27 17:01         ` Hans-Peter Jansen
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2013-02-05  3:44 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: Linux RAID, Sebastian Riemer

[-- Attachment #1: Type: text/plain, Size: 9028 bytes --]

On Mon, 04 Feb 2013 21:43:29 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote:

> Am Mittwoch, 30. Januar 2013, 18:12:46 schrieb Hans-Peter Jansen:
> > 
> > Hmm, according to mdadm from openSUSE:12.1:Update, the relevant fixes should
> > be in place. It might be an unfortunate combination of this issue and the
> > asynchronously applied updates, interfered by the *switching* behavior.
> > 
> > I started with regenerating the initrds now, and a first reboot succeeded so
> > far. Good.
> > 
> > Will ask my friend to reboot the system a dozen times tonight.
> 
> After a few reboots, the issue reappeared. I really believe now, that by
> driving the md in degraded mode for some time and with the switching behavior, 
> just re-adding the devices resulted in unsynced raid1 devices.
> 
> Next, my friend managed to create a nearby data disaster: I've explained him,
> how he would be able to re-add a device himself. He did so on sunday with his
> home partition, and since there appeared no progress bar in /proc/mdstat, he 
> immediately repeated the command. 
> 
> Neil, is it conceivable (due to a race or the like), that repeating to add 
> (re-add) a device potentially creates data salad, since that home-fs (xfs) 
> gone mad a few minutes later (firefox crashed, and couldn't be started, kmail 
> crashed, and so on (all those processes, that write to ~). He decided to 
> reboot, and that jailed him in the emergency recovery console, because /home 
> couldn't be mounted anymore.

There was a bug prior to 2.6.37 (fixed by commit 1a855a0606653d2) which
sounds vaguely related, but you seem to be running a 3.0.x kernel(?) so
shouldn't be affected by that.

Without logs of precisely what happened, it is very hard to guess.



> Today, I hammered the raid1 partitions with "check". During one run, this 
> appeared in syslog:
> 
> Feb  4 11:18:26 zaphkiel kernel: [11165.652478] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:26 zaphkiel kernel: [11165.652486] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:26 zaphkiel kernel: [11165.652495] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:26 zaphkiel kernel: [11165.652510] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
> Feb  4 11:18:26 zaphkiel kernel: [11165.652513]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:26 zaphkiel kernel: [11165.652520] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:26 zaphkiel kernel: [11165.652524] ata2.00: error: { UNC }
> Feb  4 11:18:26 zaphkiel kernel: [11165.652876] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
> Feb  4 11:18:26 zaphkiel kernel: [11165.652882] ata2.00: revalidation failed (errno=-5)
> Feb  4 11:18:26 zaphkiel kernel: [11165.652890] ata2: hard resetting link
> Feb  4 11:18:26 zaphkiel kernel: [11165.957043] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> Feb  4 11:18:26 zaphkiel kernel: [11165.969910] ata2.00: configured for UDMA/133
> Feb  4 11:18:26 zaphkiel kernel: [11165.970048] ata2: EH complete
> Feb  4 11:18:28 zaphkiel kernel: [11167.949241] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:28 zaphkiel kernel: [11167.949249] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:28 zaphkiel kernel: [11167.949257] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:28 zaphkiel kernel: [11167.949272] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
> Feb  4 11:18:28 zaphkiel kernel: [11167.949275]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:28 zaphkiel kernel: [11167.949282] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:28 zaphkiel kernel: [11167.949287] ata2.00: error: { UNC }
> Feb  4 11:18:28 zaphkiel kernel: [11167.962146] ata2.00: configured for UDMA/133
> Feb  4 11:18:28 zaphkiel kernel: [11167.962206] ata2: EH complete
> Feb  4 11:18:30 zaphkiel kernel: [11169.898187] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:30 zaphkiel kernel: [11169.898195] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:30 zaphkiel kernel: [11169.898204] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:30 zaphkiel kernel: [11169.898219] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
> Feb  4 11:18:30 zaphkiel kernel: [11169.898222]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:30 zaphkiel kernel: [11169.898229] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:30 zaphkiel kernel: [11169.898234] ata2.00: error: { UNC }
> Feb  4 11:18:30 zaphkiel kernel: [11169.912066] ata2.00: configured for UDMA/133
> Feb  4 11:18:30 zaphkiel kernel: [11169.912117] ata2: EH complete
> Feb  4 11:18:32 zaphkiel kernel: [11171.905192] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:32 zaphkiel kernel: [11171.905200] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:32 zaphkiel kernel: [11171.905208] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:32 zaphkiel kernel: [11171.905223] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
> Feb  4 11:18:32 zaphkiel kernel: [11171.905226]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:32 zaphkiel kernel: [11171.905233] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:32 zaphkiel kernel: [11171.905238] ata2.00: error: { UNC }
> Feb  4 11:18:32 zaphkiel kernel: [11171.919099] ata2.00: configured for UDMA/133
> Feb  4 11:18:32 zaphkiel kernel: [11171.919152] ata2: EH complete
> Feb  4 11:18:34 zaphkiel kernel: [11173.912191] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:34 zaphkiel kernel: [11173.912199] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:34 zaphkiel kernel: [11173.912208] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:34 zaphkiel kernel: [11173.912223] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
> Feb  4 11:18:34 zaphkiel kernel: [11173.912226]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:34 zaphkiel kernel: [11173.912233] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:34 zaphkiel kernel: [11173.912238] ata2.00: error: { UNC }
> Feb  4 11:18:34 zaphkiel kernel: [11173.925101] ata2.00: configured for UDMA/133
> Feb  4 11:18:34 zaphkiel kernel: [11173.925159] ata2: EH complete
> Feb  4 11:18:36 zaphkiel kernel: [11175.861152] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:36 zaphkiel kernel: [11175.861160] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:36 zaphkiel kernel: [11175.861168] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:36 zaphkiel kernel: [11175.861183] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
> Feb  4 11:18:36 zaphkiel kernel: [11175.861186]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:36 zaphkiel kernel: [11175.861193] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:36 zaphkiel kernel: [11175.861198] ata2.00: error: { UNC }
> Feb  4 11:18:36 zaphkiel kernel: [11175.874052] ata2.00: configured for UDMA/133
> Feb  4 11:18:36 zaphkiel kernel: [11175.874103] sd 1:0:0:0: [sdb] Unhandled sense code
> Feb  4 11:18:36 zaphkiel kernel: [11175.874109] sd 1:0:0:0: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Feb  4 11:18:36 zaphkiel kernel: [11175.874117] sd 1:0:0:0: [sdb]  Sense Key : Medium Error [current] [descriptor]
> Feb  4 11:18:36 zaphkiel kernel: [11175.874125] Descriptor sense data with sense descriptors (in hex):
> Feb  4 11:18:36 zaphkiel kernel: [11175.874130]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
> Feb  4 11:18:36 zaphkiel kernel: [11175.874145]         0c c2 ef 3f 
> Feb  4 11:18:36 zaphkiel kernel: [11175.874153] sd 1:0:0:0: [sdb]  Add. Sense: Unrecovered read error - auto reallocate failed
> Feb  4 11:18:36 zaphkiel kernel: [11175.874163] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 0c c2 ef 12 00 00 80 00
> Feb  4 11:18:36 zaphkiel kernel: [11175.874180] end_request: I/O error, dev sdb, sector 214101823
> Feb  4 11:18:36 zaphkiel kernel: [11175.874234] ata2: EH complete
> Feb  4 11:18:38 zaphkiel kernel: [11177.954091] md: md124: data-check done.
> 
> This is a classical URE, isn't it? Interestingly, nonetheless, the raid1 check 
> run succeeded! (Not so good, is it?)

What?  Success not good?  :-)

md didn't report any errors.  Maybe it didn't see any.  Where is sector
214101823?



> Last question: since I had to massage the system anyway, I've updated mdadm 
> from 3.2.2 to 3.2.6. I red, that it can be dangerous to do so, what do I risk
> here?

Where did you read that?
If you find you need to re-create an array with "--create", a different mdadm
might give a different result, which wouldn't be what you want.  Otherwise it
should be safe.

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-02-05  3:44       ` NeilBrown
@ 2013-02-27 17:01         ` Hans-Peter Jansen
  2013-02-28  3:40           ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Hans-Peter Jansen @ 2013-02-27 17:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID

[I intentionally switched off line wrapping, since it renders the given information
 inedible]

Hi Neil,

sorry for the delay. Meanwhile, I updated to mdadm 3.2.6, and recreated the initrd.
Since I do the recreation on a regular base, no data salad appeared, but the initial
issue persists. I trigger checking the related md on each manual intervention.

Hence approaching this again. 

That's a typical boot sequence:

Feb 27 08:35:48 zaphkiel kernel: [    0.000000] Kernel command line: PROFILE=ibk root=/dev/disk/by-uuid/782b24f2-26f6-4ea9-ab8b-d377cd2ac97b console=ttyS0,19200 console=tty0 resume=/dev/md127 splash=silent  console=ttyS0,115200n8 vga=791
Feb 27 08:35:48 zaphkiel kernel: [    6.816164] sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Feb 27 08:35:48 zaphkiel kernel: [    6.862240] sd 1:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Feb 27 08:35:48 zaphkiel kernel: [    6.983157] sd 0:0:0:0: [sda] Write Protect is off
Feb 27 08:35:48 zaphkiel kernel: [    7.011914] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Feb 27 08:35:48 zaphkiel kernel: [    7.011932] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 27 08:35:48 zaphkiel kernel: [    7.159045] sd 1:0:0:0: [sdb] Write Protect is off
Feb 27 08:35:48 zaphkiel kernel: [    7.187766] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Feb 27 08:35:48 zaphkiel kernel: [    7.187783] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 27 08:35:48 zaphkiel kernel: [    7.319637]  sda: sda1 sda2 sda3 sda4
Feb 27 08:35:48 zaphkiel kernel: [    7.386469] sd 0:0:0:0: [sda] Attached SCSI disk
Feb 27 08:35:48 zaphkiel kernel: [    7.414335]  sdb: sdb1 sdb2 sdb3 sdb4
Feb 27 08:35:48 zaphkiel kernel: [    7.470597] sd 1:0:0:0: [sdb] Attached SCSI disk
Feb 27 08:35:48 zaphkiel kernel: [    9.676078] PM: Checking hibernation image partition /dev/md127
Feb 27 08:35:48 zaphkiel kernel: [    9.956010] sd 6:0:0:0: [sdc] Attached SCSI removable disk
Feb 27 08:35:48 zaphkiel kernel: [   10.175875] sd 6:0:0:1: [sdd] Attached SCSI removable disk
Feb 27 08:35:48 zaphkiel kernel: [   10.262999] sd 6:0:0:2: [sde] Attached SCSI removable disk
Feb 27 08:35:48 zaphkiel kernel: [   10.299000] sd 6:0:0:3: [sdf] Attached SCSI removable disk
Feb 27 08:35:48 zaphkiel kernel: [   10.618022] md: raid1 personality registered for level 1
Feb 27 08:35:48 zaphkiel kernel: [   11.139833] md: bind<sda3>
Feb 27 08:35:48 zaphkiel kernel: [   11.158438] md: bind<sdb3>
Feb 27 08:35:48 zaphkiel kernel: [   11.200897] md/raid1:md124: active with 2 out of 2 mirrors
Feb 27 08:35:48 zaphkiel kernel: [   11.233921] created bitmap (200 pages) for device md124
Feb 27 08:35:48 zaphkiel kernel: [   11.276742] md124: bitmap initialized from disk: read 13/13 pages, set 0 of 409595 bits
Feb 27 08:35:48 zaphkiel kernel: [   11.356796] md124: detected capacity change from 0 to 107372728320
Feb 27 08:35:48 zaphkiel kernel: [   11.395610]  md124: unknown partition table
Feb 27 08:35:48 zaphkiel kernel: [   11.543579] md: raid0 personality registered for level 0
Feb 27 08:35:48 zaphkiel kernel: [   11.583994] md: raid10 personality registered for level 10
Feb 27 08:35:48 zaphkiel kernel: [   12.188576] md: raid6 personality registered for level 6
Feb 27 08:35:48 zaphkiel kernel: [   12.220473] md: raid5 personality registered for level 5
Feb 27 08:35:48 zaphkiel kernel: [   12.252320] md: raid4 personality registered for level 4
Feb 27 08:35:48 zaphkiel kernel: [   12.318865] XFS (md124): Mounting Filesystem
Feb 27 08:35:48 zaphkiel kernel: [   12.535666] XFS (md124): Ending clean mount
Feb 27 08:35:48 zaphkiel kernel: [   22.720866] sd 0:0:0:0: Attached scsi generic sg0 type 0
Feb 27 08:35:48 zaphkiel kernel: [   23.126126] sd 1:0:0:0: Attached scsi generic sg1 type 0
Feb 27 08:35:48 zaphkiel kernel: [   23.203876] sd 6:0:0:0: Attached scsi generic sg2 type 0
Feb 27 08:35:48 zaphkiel kernel: [   23.402184] sd 6:0:0:1: Attached scsi generic sg3 type 0
Feb 27 08:35:48 zaphkiel kernel: [   23.725330] sd 6:0:0:2: Attached scsi generic sg4 type 0
Feb 27 08:35:48 zaphkiel kernel: [   23.917899] sd 6:0:0:3: Attached scsi generic sg5 type 0
Feb 27 08:35:48 zaphkiel kernel: [   24.492587] md: bind<sda2>
Feb 27 08:35:48 zaphkiel kernel: [   24.711804] md: bind<sda1>
Feb 27 08:35:48 zaphkiel kernel: [   24.821071] md: bind<sda4>
Feb 27 08:35:48 zaphkiel kernel: [   25.607192] md: bind<sdb2>
Feb 27 08:35:48 zaphkiel kernel: [   25.672335] md/raid1:md1: active with 2 out of 2 mirrors
Feb 27 08:35:48 zaphkiel kernel: [   25.723678] created bitmap (8 pages) for device md1
Feb 27 08:35:48 zaphkiel kernel: [   25.771764] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 16379 bits
Feb 27 08:35:48 zaphkiel kernel: [   25.854963] md/raid1:md0: active with 1 out of 2 mirrors
Feb 27 08:35:48 zaphkiel kernel: [   25.886903] md1: detected capacity change from 0 to 2146783232
Feb 27 08:35:48 zaphkiel kernel: [   25.929142] created bitmap (6 pages) for device md0
Feb 27 08:35:48 zaphkiel kernel: [   25.958684]  md1: unknown partition table
Feb 27 08:35:48 zaphkiel kernel: [   25.995737] md: bind<sdb4>
Feb 27 08:35:48 zaphkiel kernel: [   26.012500] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 12047 bits
Feb 27 08:35:48 zaphkiel kernel: [   26.077270] md0: detected capacity change from 0 to 98689024
Feb 27 08:35:48 zaphkiel kernel: [   26.168097] md/raid1:md3: active with 2 out of 2 mirrors
Feb 27 08:35:48 zaphkiel kernel: [   26.205258]  md0: unknown partition table
Feb 27 08:35:48 zaphkiel kernel: [   26.242295] created bitmap (415 pages) for device md3
Feb 27 08:35:48 zaphkiel kernel: [   26.304271] md3: bitmap initialized from disk: read 26/26 pages, set 1465 of 849320 bits
Feb 27 08:35:48 zaphkiel kernel: [   26.392578] md3: detected capacity change from 0 to 890575601664
Feb 27 08:35:48 zaphkiel kernel: [   26.575527]  md3: unknown partition table
Feb 27 08:35:48 zaphkiel kernel: [   26.763091] Adding 2096464k swap on /dev/md1.  Priority:0 extents:1 across:2096464k 
Feb 27 08:35:48 zaphkiel kernel: [   29.268318] XFS (md3): Mounting Filesystem
Feb 27 08:35:48 zaphkiel kernel: [   29.521844] EXT4-fs (md0): mounting ext2 file system using the ext4 subsystem
Feb 27 08:35:48 zaphkiel kernel: [   29.684176] EXT4-fs (md0): mounted filesystem without journal. Opts: acl,user_xattr
Feb 27 08:35:48 zaphkiel kernel: [   30.356255] XFS (md3): Ending clean mount

Today, /dev/sdb1 of /dev/md0 vanished. Re-adding and checking doesn't revealed 
any error, but it would be cool to get to the bottom of this, as re-adding some
lost partitions to the md every second day *is* a real PITA.

What could I do to help resolving this issue? 


Thanks in advance,
Pete


Usual system info (let me know, if something is missing):

# cat /etc/SuSE-release
openSUSE 12.1 (i586)
VERSION = 12.1
CODENAME = Asparagus


# uname -a (vendor kernel)
Linux zaphkiel 3.1.10-1.16-desktop #1 SMP PREEMPT Wed Jun 27 05:21:40 UTC 2012 (d016078) i686 athlon i386 GNU/Linux


# LANG=C fdisk -l /dev/sd{a,b}

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c39ea

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *       16065      208844       96390   fd  Linux raid autodetect
/dev/sda2          208845     4401809     2096482+  fd  Linux raid autodetect
/dev/sda3         4401810   214114319   104856255   fd  Linux raid autodetect
/dev/sda4       214114320  1953520064   869702872+  fd  Linux raid autodetect

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0005bc3f

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *       16065      208844       96390   fd  Linux raid autodetect
/dev/sdb2          208845     4401809     2096482+  fd  Linux raid autodetect
/dev/sdb3         4401810   214114319   104856255   fd  Linux raid autodetect
/dev/sdb4       214114320  1953520064   869702872+  fd  Linux raid autodetect


cat /proc/mdstat
Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] 
md3 : active raid1 sdb4[1] sda4[2]
      869702736 blocks super 1.0 [2/2] [UU]
      bitmap: 14/415 pages [56KB], 1024KB chunk

md0 : active raid1 sdb1[1] sda1[0]
      96376 blocks super 1.0 [2/2] [UU]
      bitmap: 0/6 pages [0KB], 8KB chunk

md1 : active raid1 sdb2[1] sda2[0]
      2096468 blocks super 1.0 [2/2] [UU]
      bitmap: 0/8 pages [0KB], 128KB chunk

md124 : active raid1 sdb3[1] sda3[0]
      104856180 blocks super 1.0 [2/2] [UU]
      bitmap: 6/200 pages [24KB], 256KB chunk

unused devices: <none>


# cat /etc/mdadm.conf
ARRAY /dev/md124 metadata=1.0 name=zaphkiel:2 UUID=4b91a798:7385d1ea:14128dab:9b09575e
ARRAY /dev/md/0 metadata=1.0 name=zaphkiel:0 UUID=a06939cd:1261d1ce:76eaa74c:db8ba5f9
ARRAY /dev/md/3 metadata=1.0 name=zaphkiel:3 UUID=8f2d01ee:035536bc:d23a6cf2:f0d863b8
ARRAY /dev/md1 metadata=1.0 name=zaphkiel:1 UUID=d6b3eadd:297b01f0:33860c88:9dc851cc
MAILADDR root


Note the silly naming..


# mdadm -Evvs
/dev/sdb4:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : 8f2d01ee:035536bc:d23a6cf2:f0d863b8
           Name : zaphkiel:3  (local to host zaphkiel)
  Creation Time : Thu Dec 31 14:04:04 2009
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 1739405472 (829.41 GiB 890.58 GB)
     Array Size : 869702736 (829.41 GiB 890.58 GB)
   Super Offset : 1739405728 sectors
          State : clean
    Device UUID : 373b5269:1f407175:abd8f4f7:d8cba089

Internal Bitmap : -208 sectors from superblock
    Update Time : Wed Feb 27 17:18:21 2013
       Checksum : 6957806 - correct
         Events : 8320716


   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing)
/dev/sdb3:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : 4b91a798:7385d1ea:14128dab:9b09575e
           Name : zaphkiel:2  (local to host zaphkiel)
  Creation Time : Thu Dec 31 14:04:03 2009
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 209712360 (100.00 GiB 107.37 GB)
     Array Size : 104856180 (100.00 GiB 107.37 GB)
   Super Offset : 209712488 sectors
          State : active
    Device UUID : 42c13439:d288d793:f2468bba:19adbfd2

Internal Bitmap : -101 sectors from superblock
    Update Time : Wed Feb 27 17:18:21 2013
       Checksum : 26e2aafc - correct
         Events : 15267


   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing)
/dev/sdb2:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : d6b3eadd:297b01f0:33860c88:9dc851cc
           Name : zaphkiel:1  (local to host zaphkiel)
  Creation Time : Thu Dec 31 14:04:02 2009
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 4192936 (2047.68 MiB 2146.78 MB)
     Array Size : 2096468 (2047.68 MiB 2146.78 MB)
   Super Offset : 4192944 sectors
          State : clean
    Device UUID : dfd7f969:e9ce7838:8de84f6b:0b04acb5

Internal Bitmap : 2 sectors from superblock
    Update Time : Wed Feb 27 17:15:03 2013
       Checksum : 272647 - correct
         Events : 1764


   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing)
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : a06939cd:1261d1ce:76eaa74c:db8ba5f9
           Name : zaphkiel:0  (local to host zaphkiel)
  Creation Time : Thu Dec 31 14:04:01 2009
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 192752 (94.13 MiB 98.69 MB)
     Array Size : 96376 (94.13 MiB 98.69 MB)
   Super Offset : 192760 sectors
          State : clean
    Device UUID : b8cd1539:9f294a42:710a79e1:c40a4ce0

Internal Bitmap : 2 sectors from superblock
    Update Time : Wed Feb 27 16:18:25 2013
       Checksum : 39343a85 - correct
         Events : 2609


   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing)
/dev/sdb:
   MBR Magic : aa55
Partition[0] :       192780 sectors at        16065 (type fd)
Partition[1] :      4192965 sectors at       208845 (type fd)
Partition[2] :    209712510 sectors at      4401810 (type fd)
Partition[3] :   1739405745 sectors at    214114320 (type fd)
/dev/sda4:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : 8f2d01ee:035536bc:d23a6cf2:f0d863b8
           Name : zaphkiel:3  (local to host zaphkiel)
  Creation Time : Thu Dec 31 14:04:04 2009
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 1739405472 (829.41 GiB 890.58 GB)
     Array Size : 869702736 (829.41 GiB 890.58 GB)
   Super Offset : 1739405728 sectors
          State : clean
    Device UUID : 0b568d2a:c8e5a8e5:269188ba:22367205

Internal Bitmap : -208 sectors from superblock
    Update Time : Wed Feb 27 17:18:21 2013
       Checksum : 766d5b48 - correct
         Events : 8320716


   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)
/dev/sda3:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : 4b91a798:7385d1ea:14128dab:9b09575e
           Name : zaphkiel:2  (local to host zaphkiel)
  Creation Time : Thu Dec 31 14:04:03 2009
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 209712360 (100.00 GiB 107.37 GB)
     Array Size : 104856180 (100.00 GiB 107.37 GB)
   Super Offset : 209712488 sectors
          State : active
    Device UUID : 0d0a5a95:cc454e60:fcea6dd4:ea6aabcd

Internal Bitmap : -101 sectors from superblock
    Update Time : Wed Feb 27 17:18:21 2013
       Checksum : 644d129b - correct
         Events : 15267

   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)
/dev/sda2:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : d6b3eadd:297b01f0:33860c88:9dc851cc
           Name : zaphkiel:1  (local to host zaphkiel)
  Creation Time : Thu Dec 31 14:04:02 2009
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 4192936 (2047.68 MiB 2146.78 MB)
     Array Size : 2096468 (2047.68 MiB 2146.78 MB)
   Super Offset : 4192944 sectors
          State : clean
    Device UUID : 4c169220:335d3afc:d26ca91a:3ff38cd1

Internal Bitmap : 2 sectors from superblock
    Update Time : Wed Feb 27 17:15:03 2013
       Checksum : 45bb6676 - correct
         Events : 1764


   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)
/dev/sda1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : a06939cd:1261d1ce:76eaa74c:db8ba5f9
           Name : zaphkiel:0  (local to host zaphkiel)
  Creation Time : Thu Dec 31 14:04:01 2009
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 192752 (94.13 MiB 98.69 MB)
     Array Size : 96376 (94.13 MiB 98.69 MB)
   Super Offset : 192760 sectors
          State : clean
    Device UUID : 7bc0a6ec:81f9494d:b5e55c6a:961bbccd

Internal Bitmap : 2 sectors from superblock
    Update Time : Wed Feb 27 16:18:25 2013
       Checksum : 6e18e93f - correct
         Events : 2609


   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)
/dev/sda:
   MBR Magic : aa55
Partition[0] :       192780 sectors at        16065 (type fd)
Partition[1] :      4192965 sectors at       208845 (type fd)
Partition[2] :    209712510 sectors at      4401810 (type fd)
Partition[3] :   1739405745 sectors at    214114320 (type fd)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-02-27 17:01         ` Hans-Peter Jansen
@ 2013-02-28  3:40           ` NeilBrown
  2013-02-28 10:49             ` Hans-Peter Jansen
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2013-02-28  3:40 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 9617 bytes --]

On Wed, 27 Feb 2013 18:01:29 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote:

> [I intentionally switched off line wrapping, since it renders the given information
>  inedible]
> 
> Hi Neil,
> 
> sorry for the delay. Meanwhile, I updated to mdadm 3.2.6, and recreated the initrd.
> Since I do the recreation on a regular base, no data salad appeared, but the initial
> issue persists. I trigger checking the related md on each manual intervention.
> 
> Hence approaching this again. 
> 
> That's a typical boot sequence:

BTW, you aren't the only one to hit this issue .... I think I'm starting to
get closer to the problem.


> 
> Feb 27 08:35:48 zaphkiel kernel: [    0.000000] Kernel command line: PROFILE=ibk root=/dev/disk/by-uuid/782b24f2-26f6-4ea9-ab8b-d377cd2ac97b console=ttyS0,19200 console=tty0 resume=/dev/md127 splash=silent  console=ttyS0,115200n8 vga=791
> Feb 27 08:35:48 zaphkiel kernel: [    6.816164] sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
> Feb 27 08:35:48 zaphkiel kernel: [    6.862240] sd 1:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
> Feb 27 08:35:48 zaphkiel kernel: [    6.983157] sd 0:0:0:0: [sda] Write Protect is off
> Feb 27 08:35:48 zaphkiel kernel: [    7.011914] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> Feb 27 08:35:48 zaphkiel kernel: [    7.011932] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Feb 27 08:35:48 zaphkiel kernel: [    7.159045] sd 1:0:0:0: [sdb] Write Protect is off
> Feb 27 08:35:48 zaphkiel kernel: [    7.187766] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
> Feb 27 08:35:48 zaphkiel kernel: [    7.187783] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Feb 27 08:35:48 zaphkiel kernel: [    7.319637]  sda: sda1 sda2 sda3 sda4
> Feb 27 08:35:48 zaphkiel kernel: [    7.386469] sd 0:0:0:0: [sda] Attached SCSI disk
> Feb 27 08:35:48 zaphkiel kernel: [    7.414335]  sdb: sdb1 sdb2 sdb3 sdb4
> Feb 27 08:35:48 zaphkiel kernel: [    7.470597] sd 1:0:0:0: [sdb] Attached SCSI disk
> Feb 27 08:35:48 zaphkiel kernel: [    9.676078] PM: Checking hibernation image partition /dev/md127
> Feb 27 08:35:48 zaphkiel kernel: [    9.956010] sd 6:0:0:0: [sdc] Attached SCSI removable disk
> Feb 27 08:35:48 zaphkiel kernel: [   10.175875] sd 6:0:0:1: [sdd] Attached SCSI removable disk
> Feb 27 08:35:48 zaphkiel kernel: [   10.262999] sd 6:0:0:2: [sde] Attached SCSI removable disk
> Feb 27 08:35:48 zaphkiel kernel: [   10.299000] sd 6:0:0:3: [sdf] Attached SCSI removable disk
> Feb 27 08:35:48 zaphkiel kernel: [   10.618022] md: raid1 personality registered for level 1
> Feb 27 08:35:48 zaphkiel kernel: [   11.139833] md: bind<sda3>
> Feb 27 08:35:48 zaphkiel kernel: [   11.158438] md: bind<sdb3>
> Feb 27 08:35:48 zaphkiel kernel: [   11.200897] md/raid1:md124: active with 2 out of 2 mirrors

Presumably this is md124 being assembled in the initrd as it contains the
root filesystem.


> Feb 27 08:35:48 zaphkiel kernel: [   11.233921] created bitmap (200 pages) for device md124
> Feb 27 08:35:48 zaphkiel kernel: [   11.276742] md124: bitmap initialized from disk: read 13/13 pages, set 0 of 409595 bits
> Feb 27 08:35:48 zaphkiel kernel: [   11.356796] md124: detected capacity change from 0 to 107372728320
> Feb 27 08:35:48 zaphkiel kernel: [   11.395610]  md124: unknown partition table
> Feb 27 08:35:48 zaphkiel kernel: [   11.543579] md: raid0 personality registered for level 0
> Feb 27 08:35:48 zaphkiel kernel: [   11.583994] md: raid10 personality registered for level 10
> Feb 27 08:35:48 zaphkiel kernel: [   12.188576] md: raid6 personality registered for level 6
> Feb 27 08:35:48 zaphkiel kernel: [   12.220473] md: raid5 personality registered for level 5
> Feb 27 08:35:48 zaphkiel kernel: [   12.252320] md: raid4 personality registered for level 4
> Feb 27 08:35:48 zaphkiel kernel: [   12.318865] XFS (md124): Mounting Filesystem
> Feb 27 08:35:48 zaphkiel kernel: [   12.535666] XFS (md124): Ending clean mount

And now XFS has mounted that, so presumably about there the boot sequence
switches from initrd to real-root.  You have obviously removed some lines,
presumably because you thought brevity would help.  For future reference: it
doesn't.  I much prefer to see complete logs so I can correlate various parts
of the sequence and be sure I understand what I'm seeing.  But I probably
have enough here.

> Feb 27 08:35:48 zaphkiel kernel: [   22.720866] sd 0:0:0:0: Attached scsi generic sg0 type 0
> Feb 27 08:35:48 zaphkiel kernel: [   23.126126] sd 1:0:0:0: Attached scsi generic sg1 type 0
> Feb 27 08:35:48 zaphkiel kernel: [   23.203876] sd 6:0:0:0: Attached scsi generic sg2 type 0
> Feb 27 08:35:48 zaphkiel kernel: [   23.402184] sd 6:0:0:1: Attached scsi generic sg3 type 0
> Feb 27 08:35:48 zaphkiel kernel: [   23.725330] sd 6:0:0:2: Attached scsi generic sg4 type 0
> Feb 27 08:35:48 zaphkiel kernel: [   23.917899] sd 6:0:0:3: Attached scsi generic sg5 type 0
> Feb 27 08:35:48 zaphkiel kernel: [   24.492587] md: bind<sda2>
> Feb 27 08:35:48 zaphkiel kernel: [   24.711804] md: bind<sda1>
> Feb 27 08:35:48 zaphkiel kernel: [   24.821071] md: bind<sda4>

These are parts of 3 different arrays that are starting to be assembled in
parallel.  That must be udev running "mdadm -I /dev/sdaX" which will partly
assemble each array.  when the 'sdb' bits come along "mdadm -I /dev/sdbX"
should complete the assembly.


> Feb 27 08:35:48 zaphkiel kernel: [   25.607192] md: bind<sdb2>
> Feb 27 08:35:48 zaphkiel kernel: [   25.672335] md/raid1:md1: active with 2 out of 2 mirrors
> Feb 27 08:35:48 zaphkiel kernel: [   25.723678] created bitmap (8 pages) for device md1
> Feb 27 08:35:48 zaphkiel kernel: [   25.771764] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 16379 bits

And there is the first one:  mdadm -I /dev/sdb2 must have been run by udev so
md1 got fully assembled.


> Feb 27 08:35:48 zaphkiel kernel: [   25.854963] md/raid1:md0: active with 1 out of 2 mirrors

But now md0 got assembled with waiting for sdb1.  This must have been the
   mdadm -IRs
in /etc/init.d/boot.md.  The purpose of this is to find any partially
assembled arrays and complete the assembly.  This is needed if, for example,
sdb has failed complete and would never be found.  However it shouldn't
happen yet.  It should wait until all the devices that have been discovered
have been processed.
A little early in the boot.md file you will see:
        if [ "$MDADM_DEVICE_TIMEOUT" -gt 0 ] ; then
            /sbin/udevadm settle --timeout="$MDADM_DEVICE_TIMEOUT"
        fi

so unless you have MDADM_DEVICE_TIMEOUT=0 in /etc/sysconfig/mdadm (which I
very much doubt) it should have already waited.  But there is no sign that it
did.  So maybe "udevadm settle" is broken?


> Feb 27 08:35:48 zaphkiel kernel: [   25.886903] md1: detected capacity change from 0 to 2146783232
> Feb 27 08:35:48 zaphkiel kernel: [   25.929142] created bitmap (6 pages) for device md0
> Feb 27 08:35:48 zaphkiel kernel: [   25.958684]  md1: unknown partition table
> Feb 27 08:35:48 zaphkiel kernel: [   25.995737] md: bind<sdb4>
> Feb 27 08:35:48 zaphkiel kernel: [   26.012500] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 12047 bits
> Feb 27 08:35:48 zaphkiel kernel: [   26.077270] md0: detected capacity change from 0 to 98689024
> Feb 27 08:35:48 zaphkiel kernel: [   26.168097] md/raid1:md3: active with 2 out of 2 mirrors

And here comes sdb4, so "mdadm -I /dev/sdb4" must have been run by udev.  And
it got in before the  "mdadm -IRs" managed to find that md0 was incomplete,
and started it.


> Feb 27 08:35:48 zaphkiel kernel: [   26.205258]  md0: unknown partition table
> Feb 27 08:35:48 zaphkiel kernel: [   26.242295] created bitmap (415 pages) for device md3
> Feb 27 08:35:48 zaphkiel kernel: [   26.304271] md3: bitmap initialized from disk: read 26/26 pages, set 1465 of 849320 bits
> Feb 27 08:35:48 zaphkiel kernel: [   26.392578] md3: detected capacity change from 0 to 890575601664
> Feb 27 08:35:48 zaphkiel kernel: [   26.575527]  md3: unknown partition table
> Feb 27 08:35:48 zaphkiel kernel: [   26.763091] Adding 2096464k swap on /dev/md1.  Priority:0 extents:1 across:2096464k 
> Feb 27 08:35:48 zaphkiel kernel: [   29.268318] XFS (md3): Mounting Filesystem
> Feb 27 08:35:48 zaphkiel kernel: [   29.521844] EXT4-fs (md0): mounting ext2 file system using the ext4 subsystem
> Feb 27 08:35:48 zaphkiel kernel: [   29.684176] EXT4-fs (md0): mounted filesystem without journal. Opts: acl,user_xattr
> Feb 27 08:35:48 zaphkiel kernel: [   30.356255] XFS (md3): Ending clean mount
> 
> Today, /dev/sdb1 of /dev/md0 vanished. Re-adding and checking doesn't revealed 
> any error, but it would be cool to get to the bottom of this, as re-adding some
> lost partitions to the md every second day *is* a real PITA.
> 
> What could I do to help resolving this issue? 

It really feels like a udev bug, but it is hard to be sure.
Could you edit /etc/init.d/boot.md and add
   echo BEFORE mdadm -IRs > /dev/kmsg
just before the call the "$mdadm_BIN -IRs",
   echo AFTER mdadm -IRs > /dev/kmsg
just after that call, and
   echo AFTER mdadm -Asc > /dev/mksg
just after the call to "$mdadm_BIN -A -s -c $mdadm_CONFIG"

And maybe put similar messages just before and after the  
    /sbin/udevadm settle ....
call.

If you can then reproduce the problem, the extra logging might tell us
something useful.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-02-28  3:40           ` NeilBrown
@ 2013-02-28 10:49             ` Hans-Peter Jansen
  2013-02-28 21:25               ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Hans-Peter Jansen @ 2013-02-28 10:49 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID

Am Donnerstag, 28. Februar 2013, 14:40:13 schrieb NeilBrown:
> On Wed, 27 Feb 2013 18:01:29 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote:
> 
> > [I intentionally switched off line wrapping, since it renders the given information
> >  inedible]
> > 
> > Hi Neil,
> > 
> > sorry for the delay. Meanwhile, I updated to mdadm 3.2.6, and recreated the initrd.
> > Since I do the recreation on a regular base, no data salad appeared, but the initial
> > issue persists. I trigger checking the related md on each manual intervention.
> > 
> > Hence approaching this again. 
> > 
> > That's a typical boot sequence:
> 
> BTW, you aren't the only one to hit this issue .... I think I'm starting to
> get closer to the problem.

Nice to hear, will do my best to assist.

> 
> > 
> > Feb 27 08:35:48 zaphkiel kernel: [    0.000000] Kernel command line: PROFILE=ibk root=/dev/disk/by-uuid/782b24f2-26f6-4ea9-ab8b-d377cd2ac97b console=ttyS0,19200 console=tty0 resume=/dev/md127 splash=silent  console=ttyS0,115200n8 vga=791
> > Feb 27 08:35:48 zaphkiel kernel: [    6.816164] sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
> > Feb 27 08:35:48 zaphkiel kernel: [    6.862240] sd 1:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
> > Feb 27 08:35:48 zaphkiel kernel: [    6.983157] sd 0:0:0:0: [sda] Write Protect is off
> > Feb 27 08:35:48 zaphkiel kernel: [    7.011914] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> > Feb 27 08:35:48 zaphkiel kernel: [    7.011932] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> > Feb 27 08:35:48 zaphkiel kernel: [    7.159045] sd 1:0:0:0: [sdb] Write Protect is off
> > Feb 27 08:35:48 zaphkiel kernel: [    7.187766] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
> > Feb 27 08:35:48 zaphkiel kernel: [    7.187783] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> > Feb 27 08:35:48 zaphkiel kernel: [    7.319637]  sda: sda1 sda2 sda3 sda4
> > Feb 27 08:35:48 zaphkiel kernel: [    7.386469] sd 0:0:0:0: [sda] Attached SCSI disk
> > Feb 27 08:35:48 zaphkiel kernel: [    7.414335]  sdb: sdb1 sdb2 sdb3 sdb4
> > Feb 27 08:35:48 zaphkiel kernel: [    7.470597] sd 1:0:0:0: [sdb] Attached SCSI disk
> > Feb 27 08:35:48 zaphkiel kernel: [    9.676078] PM: Checking hibernation image partition /dev/md127
> > Feb 27 08:35:48 zaphkiel kernel: [    9.956010] sd 6:0:0:0: [sdc] Attached SCSI removable disk
> > Feb 27 08:35:48 zaphkiel kernel: [   10.175875] sd 6:0:0:1: [sdd] Attached SCSI removable disk
> > Feb 27 08:35:48 zaphkiel kernel: [   10.262999] sd 6:0:0:2: [sde] Attached SCSI removable disk
> > Feb 27 08:35:48 zaphkiel kernel: [   10.299000] sd 6:0:0:3: [sdf] Attached SCSI removable disk
> > Feb 27 08:35:48 zaphkiel kernel: [   10.618022] md: raid1 personality registered for level 1
> > Feb 27 08:35:48 zaphkiel kernel: [   11.139833] md: bind<sda3>
> > Feb 27 08:35:48 zaphkiel kernel: [   11.158438] md: bind<sdb3>
> > Feb 27 08:35:48 zaphkiel kernel: [   11.200897] md/raid1:md124: active with 2 out of 2 mirrors
> 
> Presumably this is md124 being assembled in the initrd as it contains the
> root filesystem.

Correct.

> 
> > Feb 27 08:35:48 zaphkiel kernel: [   11.233921] created bitmap (200 pages) for device md124
> > Feb 27 08:35:48 zaphkiel kernel: [   11.276742] md124: bitmap initialized from disk: read 13/13 pages, set 0 of 409595 bits
> > Feb 27 08:35:48 zaphkiel kernel: [   11.356796] md124: detected capacity change from 0 to 107372728320
> > Feb 27 08:35:48 zaphkiel kernel: [   11.395610]  md124: unknown partition table
> > Feb 27 08:35:48 zaphkiel kernel: [   11.543579] md: raid0 personality registered for level 0
> > Feb 27 08:35:48 zaphkiel kernel: [   11.583994] md: raid10 personality registered for level 10
> > Feb 27 08:35:48 zaphkiel kernel: [   12.188576] md: raid6 personality registered for level 6
> > Feb 27 08:35:48 zaphkiel kernel: [   12.220473] md: raid5 personality registered for level 5
> > Feb 27 08:35:48 zaphkiel kernel: [   12.252320] md: raid4 personality registered for level 4
> > Feb 27 08:35:48 zaphkiel kernel: [   12.318865] XFS (md124): Mounting Filesystem
> > Feb 27 08:35:48 zaphkiel kernel: [   12.535666] XFS (md124): Ending clean mount
> 
> And now XFS has mounted that, so presumably about there the boot sequence
> switches from initrd to real-root.  You have obviously removed some lines,
> presumably because you thought brevity would help.  For future reference: it
> doesn't.  I much prefer to see complete logs so I can correlate various parts
> of the sequence and be sure I understand what I'm seeing.  But I probably
> have enough here.

Sorry, I grepped for 'md|sd' and removed some lines triggered from pata_amd, 
which doesn't have any business in this scenario anyway (all sata). 

Will remove this dangling driver from initrd later.

We're leaving initrd here.

> > Feb 27 08:35:48 zaphkiel kernel: [   22.720866] sd 0:0:0:0: Attached scsi generic sg0 type 0
> > Feb 27 08:35:48 zaphkiel kernel: [   23.126126] sd 1:0:0:0: Attached scsi generic sg1 type 0
> > Feb 27 08:35:48 zaphkiel kernel: [   23.203876] sd 6:0:0:0: Attached scsi generic sg2 type 0
> > Feb 27 08:35:48 zaphkiel kernel: [   23.402184] sd 6:0:0:1: Attached scsi generic sg3 type 0
> > Feb 27 08:35:48 zaphkiel kernel: [   23.725330] sd 6:0:0:2: Attached scsi generic sg4 type 0
> > Feb 27 08:35:48 zaphkiel kernel: [   23.917899] sd 6:0:0:3: Attached scsi generic sg5 type 0
> > Feb 27 08:35:48 zaphkiel kernel: [   24.492587] md: bind<sda2>
> > Feb 27 08:35:48 zaphkiel kernel: [   24.711804] md: bind<sda1>
> > Feb 27 08:35:48 zaphkiel kernel: [   24.821071] md: bind<sda4>
> 
> These are parts of 3 different arrays that are starting to be assembled in
> parallel.  That must be udev running "mdadm -I /dev/sdaX" which will partly
> assemble each array.  when the 'sdb' bits come along "mdadm -I /dev/sdbX"
> should complete the assembly.

Yes, exactly.

> 
> > Feb 27 08:35:48 zaphkiel kernel: [   25.607192] md: bind<sdb2>
> > Feb 27 08:35:48 zaphkiel kernel: [   25.672335] md/raid1:md1: active with 2 out of 2 mirrors
> > Feb 27 08:35:48 zaphkiel kernel: [   25.723678] created bitmap (8 pages) for device md1
> > Feb 27 08:35:48 zaphkiel kernel: [   25.771764] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 16379 bits
> 
> And there is the first one:  mdadm -I /dev/sdb2 must have been run by udev so
> md1 got fully assembled.
> 
> 
> > Feb 27 08:35:48 zaphkiel kernel: [   25.854963] md/raid1:md0: active with 1 out of 2 mirrors
> 
> But now md0 got assembled with waiting for sdb1.  This must have been the
>    mdadm -IRs
> in /etc/init.d/boot.md.  The purpose of this is to find any partially
> assembled arrays and complete the assembly.  This is needed if, for example,
> sdb has failed complete and would never be found.  However it shouldn't
> happen yet.  It should wait until all the devices that have been discovered
> have been processed.
> A little early in the boot.md file you will see:
>         if [ "$MDADM_DEVICE_TIMEOUT" -gt 0 ] ; then
>             /sbin/udevadm settle --timeout="$MDADM_DEVICE_TIMEOUT"
>         fi
> 
> so unless you have MDADM_DEVICE_TIMEOUT=0 in /etc/sysconfig/mdadm (which I
> very much doubt) it should have already waited.  But there is no sign that it
> did.  So maybe "udevadm settle" is broken?

MDADM_DEVICE_TIMEOUT="60"

As far as I can see, udevadm settle doesn't pick up all devices under some 
circumstances. The man page says, it waits up to --timeout seconds for the
event queue to become empty. Obviously, the event queue must be incomplete
then, given, that one md is assembled already in initrd (12 seconds ago),
and all other parts physically exist since then.

Probably, the events got eaten already in initrd already, given that timing.

> 
> > Feb 27 08:35:48 zaphkiel kernel: [   25.886903] md1: detected capacity change from 0 to 2146783232
> > Feb 27 08:35:48 zaphkiel kernel: [   25.929142] created bitmap (6 pages) for device md0
> > Feb 27 08:35:48 zaphkiel kernel: [   25.958684]  md1: unknown partition table
> > Feb 27 08:35:48 zaphkiel kernel: [   25.995737] md: bind<sdb4>
> > Feb 27 08:35:48 zaphkiel kernel: [   26.012500] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 12047 bits
> > Feb 27 08:35:48 zaphkiel kernel: [   26.077270] md0: detected capacity change from 0 to 98689024
> > Feb 27 08:35:48 zaphkiel kernel: [   26.168097] md/raid1:md3: active with 2 out of 2 mirrors
> 
> And here comes sdb4, so "mdadm -I /dev/sdb4" must have been run by udev.  And
> it got in before the  "mdadm -IRs" managed to find that md0 was incomplete,
> and started it.
> 
> 
> > Feb 27 08:35:48 zaphkiel kernel: [   26.205258]  md0: unknown partition table
> > Feb 27 08:35:48 zaphkiel kernel: [   26.242295] created bitmap (415 pages) for device md3
> > Feb 27 08:35:48 zaphkiel kernel: [   26.304271] md3: bitmap initialized from disk: read 26/26 pages, set 1465 of 849320 bits
> > Feb 27 08:35:48 zaphkiel kernel: [   26.392578] md3: detected capacity change from 0 to 890575601664
> > Feb 27 08:35:48 zaphkiel kernel: [   26.575527]  md3: unknown partition table
> > Feb 27 08:35:48 zaphkiel kernel: [   26.763091] Adding 2096464k swap on /dev/md1.  Priority:0 extents:1 across:2096464k 
> > Feb 27 08:35:48 zaphkiel kernel: [   29.268318] XFS (md3): Mounting Filesystem
> > Feb 27 08:35:48 zaphkiel kernel: [   29.521844] EXT4-fs (md0): mounting ext2 file system using the ext4 subsystem
> > Feb 27 08:35:48 zaphkiel kernel: [   29.684176] EXT4-fs (md0): mounted filesystem without journal. Opts: acl,user_xattr
> > Feb 27 08:35:48 zaphkiel kernel: [   30.356255] XFS (md3): Ending clean mount
> > 
> > Today, /dev/sdb1 of /dev/md0 vanished. Re-adding and checking doesn't revealed 
> > any error, but it would be cool to get to the bottom of this, as re-adding some
> > lost partitions to the md every second day *is* a real PITA.
> > 
> > What could I do to help resolving this issue? 
> 
> It really feels like a udev bug, but it is hard to be sure.
> Could you edit /etc/init.d/boot.md and add
>    echo BEFORE mdadm -IRs > /dev/kmsg
> just before the call the "$mdadm_BIN -IRs",
>    echo AFTER mdadm -IRs > /dev/kmsg
> just after that call, and
>    echo AFTER mdadm -Asc > /dev/mksg
> just after the call to "$mdadm_BIN -A -s -c $mdadm_CONFIG"
> 
> And maybe put similar messages just before and after the  
>     /sbin/udevadm settle ....
> call.
> 
> If you can then reproduce the problem, the extra logging might tell us
> something useful.

I've enabled initrd (linuxrc=trace) and boot.md (#!/bin/sh -x) debugging, hence 
we should see the full story next time.

Thanks,
Pete

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-02-28 10:49             ` Hans-Peter Jansen
@ 2013-02-28 21:25               ` NeilBrown
  2013-02-28 22:16                 ` Hans-Peter Jansen
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2013-02-28 21:25 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 977 bytes --]

On Thu, 28 Feb 2013 11:49:53 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote:

This is what I asked for:

> > It really feels like a udev bug, but it is hard to be sure.
> > Could you edit /etc/init.d/boot.md and add
> >    echo BEFORE mdadm -IRs > /dev/kmsg
> > just before the call the "$mdadm_BIN -IRs",
> >    echo AFTER mdadm -IRs > /dev/kmsg
> > just after that call, and
> >    echo AFTER mdadm -Asc > /dev/mksg
> > just after the call to "$mdadm_BIN -A -s -c $mdadm_CONFIG"
> > 
> > And maybe put similar messages just before and after the  
> >     /sbin/udevadm settle ....
> > call.
> > 
> > If you can then reproduce the problem, the extra logging might tell us
> > something useful.
> 

This is not at all the same thing:

> I've enabled initrd (linuxrc=trace) and boot.md (#!/bin/sh -x) debugging, hence 
> we should see the full story next time.
>

It might help, but I'm far from certain that it will be nearly as useful.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-02-28 21:25               ` NeilBrown
@ 2013-02-28 22:16                 ` Hans-Peter Jansen
       [not found]                   ` <4291349.FrQcKOnicQ@xrated>
  0 siblings, 1 reply; 19+ messages in thread
From: Hans-Peter Jansen @ 2013-02-28 22:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID

Am Freitag, 1. März 2013, 08:25:20 schrieb NeilBrown:
> On Thu, 28 Feb 2013 11:49:53 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote:
> 
> This is what I asked for:
> > > It really feels like a udev bug, but it is hard to be sure.
> > > Could you edit /etc/init.d/boot.md and add
> > > 
> > >    echo BEFORE mdadm -IRs > /dev/kmsg
> > > 
> > > just before the call the "$mdadm_BIN -IRs",
> > > 
> > >    echo AFTER mdadm -IRs > /dev/kmsg
> > > 
> > > just after that call, and
> > > 
> > >    echo AFTER mdadm -Asc > /dev/mksg
> > > 
> > > just after the call to "$mdadm_BIN -A -s -c $mdadm_CONFIG"
> > > 
> > > And maybe put similar messages just before and after the
> > > 
> > >     /sbin/udevadm settle ....
> > > 
> > > call.
> > > 
> > > If you can then reproduce the problem, the extra logging might tell us
> > > something useful.
> 
> This is not at all the same thing:
> > I've enabled initrd (linuxrc=trace) and boot.md (#!/bin/sh -x) debugging,
> > hence we should see the full story next time.
> 
> It might help, but I'm far from certain that it will be nearly as useful.

Okay, okay, I do it both ways now..

Let's see, when it triggers.

> NeilBrown

Cheers,
Pete
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
       [not found]                   ` <4291349.FrQcKOnicQ@xrated>
@ 2013-03-03 23:33                     ` NeilBrown
  2013-03-13  0:52                     ` NeilBrown
  1 sibling, 0 replies; 19+ messages in thread
From: NeilBrown @ 2013-03-03 23:33 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 4939 bytes --]

On Sun, 03 Mar 2013 20:31:57 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote:

> Am Donnerstag, 28. Februar 2013, 23:16:01 schrieb Hans-Peter Jansen:
> > Am Freitag, 1. März 2013, 08:25:20 schrieb NeilBrown:
> > > On Thu, 28 Feb 2013 11:49:53 +0100 Hans-Peter Jansen <hpj@urpla.net>
> > > wrote:
> > > 
> > > This is what I asked for:
> > > > > It really feels like a udev bug, but it is hard to be sure.
> > > > > Could you edit /etc/init.d/boot.md and add
> > > > > 
> > > > >    echo BEFORE mdadm -IRs > /dev/kmsg
> > > > > 
> > > > > just before the call the "$mdadm_BIN -IRs",
> > > > > 
> > > > >    echo AFTER mdadm -IRs > /dev/kmsg
> > > > > 
> > > > > just after that call, and
> > > > > 
> > > > >    echo AFTER mdadm -Asc > /dev/mksg
> > > > > 
> > > > > just after the call to "$mdadm_BIN -A -s -c $mdadm_CONFIG"
> > > > > 
> > > > > And maybe put similar messages just before and after the
> > > > > 
> > > > >     /sbin/udevadm settle ....
> > > > > 
> > > > > call.
> > > > > 
> > > > > If you can then reproduce the problem, the extra logging might tell us
> > > > > something useful.
> > > 
> > > This is not at all the same thing:
> > > > I've enabled initrd (linuxrc=trace) and boot.md (#!/bin/sh -x)
> > > > debugging,
> > > > hence we should see the full story next time.
> > > 
> > > It might help, but I'm far from certain that it will be nearly as useful.
> 
> I see, sh debug messages do not condense in /var/log/something, they're routed
> somewhere else. I not sure, how I like that systemd implied sillyness.
> 
> > Okay, okay, I do it both ways now..
> > 
> > Let's see, when it triggers.
> 
> I'm not sure, how the basic kernel boot contribute to boiling this issue down, 
> but here we go: (not censored this time, hope it will go though..)



Thanks.  The interesting bit is here:

> Mar  3 09:11:59 zaphkiel kernel: [   28.781703] md: bind<sdb4>
It seems that udev is running "mdadm -I" on sdb4 here

> Mar  3 09:11:59 zaphkiel kernel: [   28.799939] tveeprom 1-0050: audio processor is MSP3415 (idx 6)
> Mar  3 09:11:59 zaphkiel kernel: [   28.862699] sr 7:0:0:0: Attached scsi generic sg6 type 5
> Mar  3 09:11:59 zaphkiel kernel: [   28.894667] tveeprom 1-0050: has radio
> Mar  3 09:11:59 zaphkiel kernel: [   28.917194] bttv0: Hauppauge eeprom indicates model#44354
> Mar  3 09:11:59 zaphkiel kernel: [   28.984037] bttv0: tuner type=20
> Mar  3 09:11:59 zaphkiel kernel: [   29.003670] md: bind<sdb1>

and "mdadm -I sdb1"  here
> Mar  3 09:11:59 zaphkiel kernel: [   29.026123] input: HDA Digital PCBeep as /devices/pci0000:00/0000:00:07.0/input/input5
> Mar  3 09:11:59 zaphkiel kernel: [   29.098905] i2c-core: driver [msp3400] using legacy suspend method
> Mar  3 09:11:59 zaphkiel kernel: [   29.139227] AFTER udevadm settle --timeout=60

and now udev is "settled".  This is much less than 60 seconds after the
"BEFORE udevadm settle", so it didn't time out - it must have exhausted the
queue.

> Mar  3 09:11:59 zaphkiel kernel: [   29.211092] i2c-core: driver [msp3400] using legacy resume method
> Mar  3 09:11:59 zaphkiel kernel: [   29.276147] msp3400 1-0040: MSP3415D-B3 found @ 0x80 (bt878 #0 [sw])
> Mar  3 09:11:59 zaphkiel kernel: [   29.325203] md: bind<sdb2>

But look - it seems that "mdadm -I /dev/sdb2" was just run - presumably by
udev.

> Mar  3 09:11:59 zaphkiel kernel: [   29.391360] msp3400 1-0040: msp3400 supports nicam, BEFORE mdadm -IRs

We managed to get two lines blended together here, but it is clear that this
is just before "mdadm -IRs" runs.

> Mar  3 09:11:59 zaphkiel kernel: [   29.455183] mode is autodetect
> Mar  3 09:11:59 zaphkiel kernel: [   29.525553] md/raid1:md0: active with 1 out of 2 mirrors
> 

And unsurprisingly, md0 is assembled with only on mirror because we have seen
sdb1 but not sda1

> Mar  3 09:11:59 zaphkiel kernel: [   29.572395] i2c-core: driver [tuner] using legacy suspend method
> Mar  3 09:11:59 zaphkiel kernel: [   29.642012] i2c-core: driver [tuner] using legacy resume method
> Mar  3 09:11:59 zaphkiel kernel: [   29.688786] md: bind<sda4>

and here comes sda4.  Presumably sda1 is attempted a little later, but as md0
is already assembled, mdadm refuses to do anything with it.

So "mdadm -IRs" is being run too early, apparently because "udevadm settle" is
exiting too early.  However I have looked closely at the udev code, and run
various tests, and cannot see how it could possibly exit early.  I can
imagine it exiting a bit late in some anomalous situations, but not early.

The only way I can explain this is to suggest that something other than udev
is running "mdadm -I".  But I cannot imagine what would be doing that.

If you put a "sleep 10" before the "mdadm -IRs", I suspect your problems
would go away.  It isn't really a nice solution, but it is the only one I can
think of at the moment.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
       [not found]                   ` <4291349.FrQcKOnicQ@xrated>
  2013-03-03 23:33                     ` NeilBrown
@ 2013-03-13  0:52                     ` NeilBrown
  2013-03-15 22:43                       ` Hans-Peter Jansen
  1 sibling, 1 reply; 19+ messages in thread
From: NeilBrown @ 2013-03-13  0:52 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 733 bytes --]


hi,
 wonder if you could try one more test for me.
 With the other (echo BEFORE / echo AFTER etc) tracing still there, change the
   /sbin/udevadm settle --timeout=$MDADM_DEVICE_TIMEOUT
 to

   /sbin/udevadm --debug settle --timeout=$MDADM_DEVICE_TIMEOUT > /dev/kmsg 2>&1

 I found there is a case where "udevadm settle" can exit before the queue is
 empty.  It seems like a very unlikely scenario, but it seems clear that
 something "unlikely" is happening.
 I'm hoping to see
    timeout waiting for udev queue

 appear in the logs when this runs.

 If you could add that and post the 'dmesg' output if you ever get that
 message - or maybe even if you don't - that would be very helpful.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-03-13  0:52                     ` NeilBrown
@ 2013-03-15 22:43                       ` Hans-Peter Jansen
  2013-03-18 11:20                         ` Hans-Peter Jansen
  0 siblings, 1 reply; 19+ messages in thread
From: Hans-Peter Jansen @ 2013-03-15 22:43 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID

Am Mittwoch, 13. März 2013, 11:52:35 schrieben Sie:
> hi,
>  wonder if you could try one more test for me.
>  With the other (echo BEFORE / echo AFTER etc) tracing still there, change
> the /sbin/udevadm settle --timeout=$MDADM_DEVICE_TIMEOUT
>  to
> 
>    /sbin/udevadm --debug settle --timeout=$MDADM_DEVICE_TIMEOUT > /dev/kmsg
> 2>&1
> 
>  I found there is a case where "udevadm settle" can exit before the queue is
> empty.  It seems like a very unlikely scenario, but it seems clear that
> something "unlikely" is happening.
>  I'm hoping to see
>     timeout waiting for udev queue
> 
>  appear in the logs when this runs.
> 
>  If you could add that and post the 'dmesg' output if you ever get that
>  message - or maybe even if you don't - that would be very helpful.

Did that right now and disabled the sleep 10 in front, that provisionally 
"solved" this issue BTW. Now waiting for another md issue to occur.

Cheers,
Pete
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-03-15 22:43                       ` Hans-Peter Jansen
@ 2013-03-18 11:20                         ` Hans-Peter Jansen
  2013-03-21  3:24                           ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Hans-Peter Jansen @ 2013-03-18 11:20 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID

Am Freitag, 15. März 2013, 23:43:31 schrieb Hans-Peter Jansen:
> Am Mittwoch, 13. März 2013, 11:52:35 schrieben Sie:
> > hi,
> > 
> >  wonder if you could try one more test for me.
> >  With the other (echo BEFORE / echo AFTER etc) tracing still there, change
> > 
> > the /sbin/udevadm settle --timeout=$MDADM_DEVICE_TIMEOUT
> > 
> >  to
> >  
> >    /sbin/udevadm --debug settle --timeout=$MDADM_DEVICE_TIMEOUT >
> >    /dev/kmsg
> > 
> > 2>&1
> > 
> >  I found there is a case where "udevadm settle" can exit before the queue
> >  is
> > 
> > empty.  It seems like a very unlikely scenario, but it seems clear that
> > something "unlikely" is happening.
> > 
> >  I'm hoping to see
> >  
> >     timeout waiting for udev queue
> >  
> >  appear in the logs when this runs.
> >  
> >  If you could add that and post the 'dmesg' output if you ever get that
> >  message - or maybe even if you don't - that would be very helpful.
> 
> Did that right now and disabled the sleep 10 in front, that provisionally
> "solved" this issue BTW. Now waiting for another md issue to occur.

Here we go: (not censored in any way, and I hope to get the "window" right:
md0 was affected this time.

Mar 18 09:02:20 zaphkiel kernel: [   10.153098] sd 6:0:0:0: [sdc] 7741440 512-byte logical blocks: (3.96 GB/3.69 GiB)
Mar 18 09:02:20 zaphkiel kernel: [   10.202222] sd 6:0:0:0: [sdc] Write Protect is off
Mar 18 09:02:20 zaphkiel kernel: [   10.230965] sd 6:0:0:0: [sdc] Mode Sense: 03 00 00 00
Mar 18 09:02:20 zaphkiel kernel: [   10.234226] sd 6:0:0:0: [sdc] No Caching mode page present
Mar 18 09:02:20 zaphkiel kernel: [   10.267155] sd 6:0:0:0: [sdc] Assuming drive cache: write through
Mar 18 09:02:20 zaphkiel kernel: [   10.309225] sd 6:0:0:0: [sdc] No Caching mode page present
Mar 18 09:02:20 zaphkiel kernel: [   10.342135] sd 6:0:0:0: [sdc] Assuming drive cache: write through
Mar 18 09:02:20 zaphkiel kernel: [   10.379763]  sdc: sdc1
Mar 18 09:02:20 zaphkiel kernel: [   10.400224] sd 6:0:0:0: [sdc] No Caching mode page present
Mar 18 09:02:20 zaphkiel kernel: [   10.433138] sd 6:0:0:0: [sdc] Assuming drive cache: write through
Mar 18 09:02:20 zaphkiel kernel: [   10.469953] sd 6:0:0:0: [sdc] Attached SCSI removable disk
Mar 18 09:02:20 zaphkiel kernel: [   10.502901] Freeing unused kernel memory: 488k freed
Mar 18 09:02:20 zaphkiel kernel: [   10.532672] usb 2-4: device descriptor read/64, error -71
Mar 18 09:02:20 zaphkiel kernel: [   10.565302] Write protecting the kernel text: 5204k
Mar 18 09:02:20 zaphkiel kernel: [   10.594587] Write protecting the kernel read-only data: 3540k
Mar 18 09:02:20 zaphkiel kernel: [   10.629070] NX-protecting the kernel data: 5036k
Mar 18 09:02:20 zaphkiel kernel: [   10.699772] ACPI: acpi_idle registered with cpuidle
Mar 18 09:02:20 zaphkiel kernel: [   10.735081] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
Mar 18 09:02:20 zaphkiel kernel: [   10.792858] SGI XFS Quota Management subsystem
Mar 18 09:02:20 zaphkiel kernel: [   10.821500] md: raid1 personality registered for level 1
Mar 18 09:02:20 zaphkiel kernel: [   10.853390] usb 2-4: new high speed USB device number 5 using ehci_hcd
Mar 18 09:02:20 zaphkiel kernel: [   10.894261] BIOS EDD facility v0.16 2004-Jun-25, 3 devices found
Mar 18 09:02:20 zaphkiel kernel: [   11.157034] usb 2-4: device descriptor read/64, error -71
Mar 18 09:02:20 zaphkiel kernel: [   11.385910] md: bind<sda3>
Mar 18 09:02:20 zaphkiel kernel: [   11.407964] md: bind<sdb3>
Mar 18 09:02:20 zaphkiel kernel: [   11.427609] bio: create slab <bio-1> at 1
Mar 18 09:02:20 zaphkiel kernel: [   11.451733] md/raid1:md124: active with 2 out of 2 mirrors
Mar 18 09:02:20 zaphkiel kernel: [   11.484761] created bitmap (200 pages) for device md124
Mar 18 09:02:20 zaphkiel kernel: [   11.527745] md124: bitmap initialized from disk: read 13/13 pages, set 0 of 409595 bits
Mar 18 09:02:20 zaphkiel kernel: [   11.589866] md124: detected capacity change from 0 to 107372728320
Mar 18 09:02:20 zaphkiel kernel: [   11.628729]  md124: unknown partition table
Mar 18 09:02:20 zaphkiel kernel: [   11.705023] usb 2-4: device descriptor read/64, error -71
Mar 18 09:02:20 zaphkiel kernel: [   11.940026] usb 2-4: new high speed USB device number 6 using ehci_hcd
Mar 18 09:02:20 zaphkiel kernel: [   12.443021] usb 2-4: device not accepting address 6, error -71
Mar 18 09:02:20 zaphkiel kernel: [   12.579024] usb 2-4: new high speed USB device number 7 using ehci_hcd
Mar 18 09:02:20 zaphkiel kernel: [   13.082029] usb 2-4: device not accepting address 7, error -71
Mar 18 09:02:20 zaphkiel kernel: [   13.117018] hub 2-0:1.0: unable to enumerate USB device on port 4
Mar 18 09:02:20 zaphkiel kernel: [   13.385793] md: raid0 personality registered for level 0
Mar 18 09:02:20 zaphkiel kernel: [   13.424034] usb 4-2: new low speed USB device number 2 using ohci_hcd
Mar 18 09:02:20 zaphkiel kernel: [   13.471246] md: raid10 personality registered for level 10
Mar 18 09:02:20 zaphkiel kernel: [   13.517387] async_tx: api initialized (async)
Mar 18 09:02:20 zaphkiel kernel: [   13.544243] xor: automatically using best checksumming function: pIII_sse
Mar 18 09:02:20 zaphkiel kernel: [   13.590008]    pIII_sse  :  7224.000 MB/sec
Mar 18 09:02:20 zaphkiel kernel: [   13.615137] xor: using function: pIII_sse (7224.000 MB/sec)
Mar 18 09:02:20 zaphkiel kernel: [   13.666017] raid6: int32x1    996 MB/s
Mar 18 09:02:20 zaphkiel kernel: [   13.705044] raid6: int32x2   1101 MB/s
Mar 18 09:02:20 zaphkiel kernel: [   13.744041] raid6: int32x4    789 MB/s
Mar 18 09:02:20 zaphkiel kernel: [   13.783015] raid6: int32x8    878 MB/s
Mar 18 09:02:20 zaphkiel kernel: [   13.822041] raid6: mmxx1     1972 MB/s
Mar 18 09:02:20 zaphkiel kernel: [   13.861017] raid6: mmxx2     3632 MB/s
Mar 18 09:02:20 zaphkiel kernel: [   13.883514] usb 4-2: New USB device found, idVendor=093a, idProduct=2510
Mar 18 09:02:20 zaphkiel kernel: [   13.940023] raid6: sse1x1    2007 MB/s
Mar 18 09:02:20 zaphkiel kernel: [   13.962555] usb 4-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Mar 18 09:02:20 zaphkiel kernel: [   14.022017] raid6: sse1x2    3406 MB/s
Mar 18 09:02:20 zaphkiel kernel: [   14.061017] usb 4-2: Product: USB OPTICAL MOUSE
Mar 18 09:02:20 zaphkiel kernel: [   14.088224] raid6: sse2x1    3433 MB/s
Mar 18 09:02:20 zaphkiel kernel: [   14.110741] usb 4-2: Manufacturer: PIXART
Mar 18 09:02:20 zaphkiel kernel: [   14.152588] raid6: sse2x2    4511 MB/s
Mar 18 09:02:20 zaphkiel kernel: [   14.175186] raid6: using algorithm sse2x2 (4511 MB/s)
Mar 18 09:02:20 zaphkiel kernel: [   14.207849] md: raid6 personality registered for level 6
Mar 18 09:02:20 zaphkiel kernel: [   14.239745] md: raid5 personality registered for level 5
Mar 18 09:02:20 zaphkiel kernel: [   14.272105] input: PIXART USB OPTICAL MOUSE as /devices/pci0000:00/0000:00:04.0/usb4/4-2/4-2:1.0/input/input1
Mar 18 09:02:20 zaphkiel kernel: [   14.334075] md: raid4 personality registered for level 4
Mar 18 09:02:20 zaphkiel kernel: [   14.371117] generic-usb 0003:093A:2510.0001: input,hidraw0: USB HID v1.11 Mouse [PIXART USB OPTICAL MOUSE] on usb-0000:00:04.0-2/input0
Mar 18 09:02:20 zaphkiel kernel: [   14.485966] XFS (md124): Mounting Filesystem
Mar 18 09:02:20 zaphkiel kernel: [   14.701101] XFS (md124): Ending clean mount
Mar 18 09:02:20 zaphkiel kernel: [   14.731718] usb 4-4: new full speed USB device number 3 using ohci_hcd
Mar 18 09:02:20 zaphkiel kernel: [   14.959749] usb 4-4: not running at top speed; connect to a high speed hub
Mar 18 09:02:20 zaphkiel kernel: [   15.020945] usb 4-4: New USB device found, idVendor=05e3, idProduct=0760
Mar 18 09:02:20 zaphkiel kernel: [   15.066896] usb 4-4: New USB device strings: Mfr=0, Product=3, SerialNumber=4
Mar 18 09:02:20 zaphkiel kernel: [   15.109742] usb 4-4: Product: Flash Reader
Mar 18 09:02:20 zaphkiel kernel: [   15.134328] usb 4-4: SerialNumber: 03570
Mar 18 09:02:20 zaphkiel kernel: [   15.159828] scsi7 : usb-storage 4-4:1.0
Mar 18 09:02:20 zaphkiel kernel: [   16.191763] scsi 7:0:0:0: Direct-Access     Generic  STORAGE DEVICE   0125 PQ: 0 ANSI: 0
Mar 18 09:02:20 zaphkiel kernel: [   16.256768] scsi 7:0:0:1: Direct-Access     Generic  STORAGE DEVICE   0125 PQ: 0 ANSI: 0
Mar 18 09:02:20 zaphkiel kernel: [   16.332762] scsi 7:0:0:2: Direct-Access     Generic  STORAGE DEVICE   0125 PQ: 0 ANSI: 0
Mar 18 09:02:20 zaphkiel kernel: [   16.401753] sd 7:0:0:0: [sdd] Attached SCSI removable disk
Mar 18 09:02:20 zaphkiel kernel: [   16.450765] scsi 7:0:0:3: Direct-Access     Generic  STORAGE DEVICE   0125 PQ: 0 ANSI: 0
Mar 18 09:02:20 zaphkiel kernel: [   16.508754] sd 7:0:0:1: [sde] Attached SCSI removable disk
Mar 18 09:02:20 zaphkiel kernel: [   16.572754] sd 7:0:0:2: [sdf] Attached SCSI removable disk
Mar 18 09:02:20 zaphkiel kernel: [   16.625754] sd 7:0:0:3: [sdg] Attached SCSI removable disk
Mar 18 09:02:20 zaphkiel kernel: [   20.412626] forcedeth: Reverse Engineered nForce ethernet driver. Version 0.64.
Mar 18 09:02:20 zaphkiel kernel: [   20.460450] ACPI: PCI Interrupt Link [LMAC] enabled at IRQ 22
Mar 18 09:02:20 zaphkiel kernel: [   20.502526] forcedeth 0000:00:0a.0: PCI INT A -> Link[LMAC] -> GSI 22 (level, low) -> IRQ 22
Mar 18 09:02:20 zaphkiel kernel: [   20.555583] forcedeth 0000:00:0a.0: setting latency timer to 64
Mar 18 09:02:20 zaphkiel kernel: [   20.608180] forcedeth 0000:00:0a.0: ifname eth0, PHY OUI 0x732 @ 3, addr 00:23:54:2b:d2:58
Mar 18 09:02:20 zaphkiel kernel: [   20.661808] forcedeth 0000:00:0a.0: highdma csum pwrctl mgmt gbit lnktim msi desc-v3
Mar 18 09:02:20 zaphkiel kernel: [   21.514743] 8139too: 8139too Fast Ethernet driver 0.9.28
Mar 18 09:02:20 zaphkiel kernel: [   21.661650] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 16
Mar 18 09:02:20 zaphkiel kernel: [   21.745044] 8139too 0000:01:06.0: PCI INT A -> Link[LNKA] -> GSI 16 (level, low) -> IRQ 16
Mar 18 09:02:20 zaphkiel kernel: [   21.853853] 8139too 0000:01:06.0: eth1: RealTek RTL8139 at 0xf7a32c00, 00:08:54:0a:9b:53, IRQ 16
Mar 18 09:02:20 zaphkiel kernel: [   22.792962] BEFORE udevadm settle --timeout=60
Mar 18 09:02:20 zaphkiel kernel: [   22.879311] main: 
Mar 18 09:02:20 zaphkiel kernel: [   22.935601] runtime dir '/run/udev'
Mar 18 09:02:20 zaphkiel kernel: [   22.962183] run_command: 
Mar 18 09:02:20 zaphkiel kernel: [   23.021385] calling: settle
Mar 18 09:02:20 zaphkiel kernel: [   26.373950] i2c i2c-0: nForce2 SMBus adapter at 0x600
Mar 18 09:02:20 zaphkiel kernel: [   26.439755] WARNING! power/level is deprecated; use power/control instead
Mar 18 09:02:20 zaphkiel kernel: [   26.482815] ACPI: resource nForce2_smbus [io  0x0700-0x073f] conflicts with ACPI region SM00 [io 0x700-0x73f]
Mar 18 09:02:20 zaphkiel kernel: [   26.579063] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
Mar 18 09:02:20 zaphkiel kernel: [   26.649433] firewire_ohci 0000:04:00.0: PCI INT A -> Link[LN3A] -> GSI 17 (level, low) -> IRQ 17
Mar 18 09:02:20 zaphkiel kernel: [   26.739746] input: Power Button as /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input2
Mar 18 09:02:20 zaphkiel kernel: [   26.796477] firewire_ohci 0000:04:00.0: setting latency timer to 64
Mar 18 09:02:20 zaphkiel kernel: [   26.801886] Linux video capture interface: v2.00
Mar 18 09:02:20 zaphkiel kernel: [   26.849080] firewire_ohci: Added fw-ohci device 0000:04:00.0, OHCI v1.10, 4 IR + 4 IT contexts, quirks 0x10
Mar 18 09:02:20 zaphkiel kernel: [   26.966186] pata_amd 0000:00:06.0: version 0.4.1
Mar 18 09:02:20 zaphkiel kernel: [   26.966229] pata_amd 0000:00:06.0: setting latency timer to 64
Mar 18 09:02:20 zaphkiel kernel: [   26.968773] scsi8 : pata_amd
Mar 18 09:02:20 zaphkiel kernel: [   26.992242] sd 0:0:0:0: Attached scsi generic sg0 type 0
Mar 18 09:02:20 zaphkiel kernel: [   27.031690] scsi9 : pata_amd
Mar 18 09:02:20 zaphkiel kernel: [   27.064347] ata7: PATA max UDMA/133 cmd 0x1f0 ctl 0x3f6 bmdma 0xffa0 irq 14
Mar 18 09:02:20 zaphkiel kernel: [   27.152035] ata8: PATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0xffa8 irq 15
Mar 18 09:02:20 zaphkiel kernel: [   27.207504] IT8712 SuperIO detected.
Mar 18 09:02:20 zaphkiel kernel: [   27.248861] k8temp 0000:00:18.3: Temperature readouts might be wrong - check erratum #141
Mar 18 09:02:20 zaphkiel kernel: [   27.358729] parport_pc 00:04: reported by Plug and Play ACPI
Mar 18 09:02:20 zaphkiel kernel: [   27.396357] sd 1:0:0:0: Attached scsi generic sg1 type 0
Mar 18 09:02:20 zaphkiel kernel: [   27.435502] ata7.00: ATAPI: HL-DT-ST DVDRAM GSA-4167B, DL12, max UDMA/33
Mar 18 09:02:20 zaphkiel kernel: [   27.479017] firewire_core: created device fw0: GUID 001e8c00018b93c2, S400
Mar 18 09:02:20 zaphkiel kernel: [   27.522961] ACPI: Power Button [PWRB]
Mar 18 09:02:20 zaphkiel kernel: [   27.547160] parport0: PC-style at 0x378
Mar 18 09:02:20 zaphkiel kernel: [   27.569182] ata7: nv_mode_filter: 0x739f&0x739f->0x739f, BIOS=0x7000 (0xc0000000) ACPI=0x701f (60:900:0x11)
Mar 18 09:02:20 zaphkiel kernel: [   27.572162] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input3
Mar 18 09:02:20 zaphkiel kernel: [   27.619485] nvidia: module license 'NVIDIA' taints kernel.
Mar 18 09:02:20 zaphkiel kernel: [   27.652780] ata7.00: configured for UDMA/33
Mar 18 09:02:20 zaphkiel kernel: [   27.681871] sd 6:0:0:0: Attached scsi generic sg2 type 0
Mar 18 09:02:20 zaphkiel kernel: [   27.720249] scsi 8:0:0:0: CD-ROM            HL-DT-ST DVDRAM GSA-4167B DL12 PQ: 0 ANSI: 5
Mar 18 09:02:20 zaphkiel kernel: [   27.773178] , irq 7
Mar 18 09:02:20 zaphkiel kernel: [   27.785284] Disabling lock debugging due to kernel taint
Mar 18 09:02:20 zaphkiel kernel: [   27.818744]  [
Mar 18 09:02:20 zaphkiel kernel: [   27.827966] sd 7:0:0:0: Attached scsi generic sg3 type 0
Mar 18 09:02:20 zaphkiel kernel: [   27.862615] PCSPP
Mar 18 09:02:20 zaphkiel kernel: [   27.873974] ACPI: Power Button [PWRF]
Mar 18 09:02:20 zaphkiel kernel: [   27.903194] input: PC Speaker as /devices/platform/pcspkr/input/input4
Mar 18 09:02:20 zaphkiel kernel: [   27.947382] ,TRISTATE]
Mar 18 09:02:20 zaphkiel kernel: [   27.967591] wmi: Mapper loaded
Mar 18 09:02:20 zaphkiel kernel: [   28.019498] sd 7:0:0:1: Attached scsi generic sg4 type 0
Mar 18 09:02:20 zaphkiel kernel: [   28.086333] ppdev: user-space parallel port driver
Mar 18 09:02:20 zaphkiel kernel: [   28.145165] sd 7:0:0:2: Attached scsi generic sg5 type 0
Mar 18 09:02:20 zaphkiel kernel: [   28.219780] sd 7:0:0:3: Attached scsi generic sg6 type 0
Mar 18 09:02:20 zaphkiel kernel: [   28.294832] scsi 8:0:0:0: Attached scsi generic sg7 type 5
Mar 18 09:02:20 zaphkiel kernel: [   28.359838] ata8: port disabled--ignoring
Mar 18 09:02:20 zaphkiel kernel: [   28.377526] snd_hda_intel 0000:00:07.0: power state changed by ACPI to D0
Mar 18 09:02:20 zaphkiel kernel: [   28.460047] snd_hda_intel 0000:00:07.0: power state changed by ACPI to D0
Mar 18 09:02:20 zaphkiel kernel: [   28.544146] ACPI: PCI Interrupt Link [LAZA] enabled at IRQ 21
Mar 18 09:02:20 zaphkiel kernel: [   28.616023] snd_hda_intel 0000:00:07.0: PCI INT A -> Link[LAZA] -> GSI 21 (level, low) -> IRQ 21
Mar 18 09:02:20 zaphkiel kernel: [   28.725762] hda_intel: Disabling MSI
Mar 18 09:02:20 zaphkiel kernel: [   28.749946] bttv: driver version 0.9.19 loaded
Mar 18 09:02:20 zaphkiel kernel: [   28.776705] snd_hda_intel 0000:00:07.0: setting latency timer to 64
Mar 18 09:02:20 zaphkiel kernel: [   28.787103] ALSA hda_intel.c:1454 Enable delay in RIRB handling
Mar 18 09:02:20 zaphkiel kernel: [   28.804531] bttv: using 8 buffers with 2080k (520 pages) each for capture
Mar 18 09:02:20 zaphkiel kernel: [   28.864408] IR NEC protocol handler initialized
Mar 18 09:02:20 zaphkiel kernel: [   28.960457] IR RC5(x) protocol handler initialized
Mar 18 09:02:20 zaphkiel kernel: [   29.004015] bttv: Bt8xx card found (0).
Mar 18 09:02:20 zaphkiel kernel: [   29.088643] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 19
Mar 18 09:02:20 zaphkiel kernel: [   29.123158] bttv 0000:01:07.0: PCI INT A -> Link[LNKB] -> GSI 19 (level, low) -> IRQ 19
Mar 18 09:02:20 zaphkiel kernel: [   29.180972] IR RC6 protocol handler initialized
Mar 18 09:02:20 zaphkiel kernel: [   29.250091] bttv0: Bt878 (rev 17) at 0000:01:07.0, 
Mar 18 09:02:20 zaphkiel kernel: [   29.278950] IR JVC protocol handler initialized
Mar 18 09:02:20 zaphkiel kernel: [   29.339358] irq: 19, latency: 64, mmio: 0xcfffe000
Mar 18 09:02:20 zaphkiel kernel: [   29.400761] bttv0: detected: Hauppauge WinTV [card=10], PCI subsystem ID is 0070:13eb
Mar 18 09:02:20 zaphkiel kernel: [   29.447978] IR Sony protocol handler initialized
Mar 18 09:02:20 zaphkiel kernel: [   29.521989] IR MCE Keyboard/mouse protocol handler initialized
Mar 18 09:02:20 zaphkiel kernel: [   29.562022] bttv0: using: Hauppauge (bt878) [card=10,autodetected]
Mar 18 09:02:20 zaphkiel kernel: [   29.636072] lirc_dev: IR Remote Control driver registered, major 250 
Mar 18 09:02:20 zaphkiel kernel: [   29.679569] bttv0: gpio: en=00000000, out=00000000 in=00ffffdb [init]
Mar 18 09:02:20 zaphkiel kernel: [   29.683413] bttv0: Hauppauge/Voodoo msp34xx: reset line init [5]
Mar 18 09:02:20 zaphkiel kernel: [   29.725031] ALSA patch_realtek.c:1228 SKU: Nid=0x1d sku_cfg=0x4015e601
Mar 18 09:02:20 zaphkiel kernel: [   29.725035] ALSA patch_realtek.c:1230 SKU: port_connectivity=0x1
Mar 18 09:02:20 zaphkiel kernel: [   29.725038] ALSA patch_realtek.c:1231 SKU: enable_pcbeep=0x1
Mar 18 09:02:20 zaphkiel kernel: [   29.725040] ALSA patch_realtek.c:1232 SKU: check_sum=0x00000005
Mar 18 09:02:20 zaphkiel kernel: [   29.725042] ALSA patch_realtek.c:1233 SKU: customization=0x000000e6
Mar 18 09:02:20 zaphkiel kernel: [   29.725045] ALSA patch_realtek.c:1234 SKU: external_amp=0x0
Mar 18 09:02:20 zaphkiel kernel: [   29.725047] ALSA patch_realtek.c:1235 SKU: platform_type=0x0
Mar 18 09:02:20 zaphkiel kernel: [   29.725049] ALSA patch_realtek.c:1236 SKU: swap=0x0
Mar 18 09:02:20 zaphkiel kernel: [   29.725051] ALSA patch_realtek.c:1237 SKU: override=0x1
Mar 18 09:02:20 zaphkiel kernel: [   29.742390] md: bind<sdb4>
Mar 18 09:02:20 zaphkiel kernel: [   29.779377] IR LIRC bridge handler initialized
Mar 18 09:02:20 zaphkiel kernel: [   29.857738] tveeprom 1-0050: Hauppauge model 44354, rev C142, serial# 2295831
Mar 18 09:02:20 zaphkiel kernel: [   29.921654] md: bind<sda1>
Mar 18 09:02:20 zaphkiel kernel: [   29.946033] tveeprom 1-0050: tuner model is Temic 4009FR5 (idx 42, type 20)
Mar 18 09:02:20 zaphkiel kernel: [   30.020132] sr0: scsi3-mmc drive: 48x/48x writer dvd-ram cd/rw xa/form2 cdda tray
Mar 18 09:02:20 zaphkiel kernel: [   30.065051] tveeprom 1-0050: TV standards PAL(B/G) (eeprom 0x04)
Mar 18 09:02:20 zaphkiel kernel: [   30.103013] tveeprom 1-0050: audio processor is MSP3415 (idx 6)
Mar 18 09:02:20 zaphkiel kernel: [   30.145378] md: bind<sda2>
Mar 18 09:02:20 zaphkiel kernel: [   30.174031] tveeprom 1-0050: has radio
Mar 18 09:02:20 zaphkiel kernel: [   30.221049] bttv0: Hauppauge eeprom indicates model#44354
Mar 18 09:02:20 zaphkiel kernel: [   30.269011] cdrom: Uniform CD-ROM driver Revision: 3.20
Mar 18 09:02:20 zaphkiel kernel: [   30.300405] bttv0: tuner type=20
Mar 18 09:02:20 zaphkiel kernel: [   30.337486] sr 8:0:0:0: Attached scsi CD-ROM sr0
Mar 18 09:02:20 zaphkiel kernel: [   30.340058] nvidia 0000:02:00.0: PCI INT A -> Link[LN0A] -> GSI 19 (level, low) -> IRQ 19
Mar 18 09:02:20 zaphkiel kernel: [   30.389781] i2c-core: driver [msp3400] using legacy suspend method
Mar 18 09:02:20 zaphkiel kernel: [   30.427379] md: bind<sda4>
Mar 18 09:02:20 zaphkiel kernel: [   30.447871] i2c-core: driver [msp3400] using legacy resume method
Mar 18 09:02:20 zaphkiel kernel: [   30.486398] nvidia 0000:02:00.0: setting latency timer to 64
Mar 18 09:02:20 zaphkiel kernel: [   30.486403] vgaarb: device changed decodes: PCI:0000:02:00.0,olddecodes=io+mem,decodes=none:owns=io+mem
Mar 18 09:02:20 zaphkiel kernel: [   30.552124] msp3400 1-0040: MSP3415D-B3 found @ 0x80 (bt878 #0 [sw])
Mar 18 09:02:20 zaphkiel kernel: [   30.591822] NVRM: loading NVIDIA UNIX x86 Kernel Module  304.64  Tue Oct 30 11:09:29 PDT 2012
Mar 18 09:02:20 zaphkiel kernel: [   30.641940] msp3400 1-0040: msp3400 supports nicam, mode is autodetect
Mar 18 09:02:20 zaphkiel kernel: [   30.683731] 
Mar 18 09:02:20 zaphkiel kernel: [   30.728574] i2c-core: driver [tuner] using legacy suspend method
Mar 18 09:02:20 zaphkiel kernel: [   30.767893] i2c-core: driver [tuner] using legacy resume method
Mar 18 09:02:20 zaphkiel kernel: [   30.811881] tuner 1-0061: Tuner -1 found with type(s) Radio TV.
Mar 18 09:02:20 zaphkiel kernel: [   30.847964] md/raid1:md3: active with 2 out of 2 mirrors
Mar 18 09:02:20 zaphkiel kernel: [   30.881959] created bitmap (415 pages) for device md3
Mar 18 09:02:20 zaphkiel kernel: [   30.990096] input: HDA Digital PCBeep as /devices/pci0000:00/0000:00:07.0/input/input5
Mar 18 09:02:20 zaphkiel kernel: [   31.038390] tuner-simple 1-0061: creating new instance
Mar 18 09:02:20 zaphkiel kernel: [   31.084560] tuner-simple 1-0061: type set to 20 (Temic PAL_BG (4009 FR5) or PAL_I (4069 FR5))
Mar 18 09:02:20 zaphkiel kernel: [   31.140892] bttv0: registered device video0
Mar 18 09:02:20 zaphkiel kernel: [   31.167942] bttv0: registered device vbi0
Mar 18 09:02:20 zaphkiel kernel: [   31.192924] bttv0: registered device radio0
Mar 18 09:02:20 zaphkiel kernel: [   31.223065] bttv0: PLL: 28636363 => 35468950 .. ok
Mar 18 09:02:20 zaphkiel kernel: [   31.312236] AFTER udevadm settle --timeout=60
Mar 18 09:02:20 zaphkiel kernel: [   31.437261] md3: bitmap initialized from disk: read 26/26 pages, set 1464 of 849320 bits
Mar 18 09:02:20 zaphkiel kernel: [   31.536642] BEFORE mdadm -IRs
Mar 18 09:02:20 zaphkiel kernel: [   31.562744] md3: detected capacity change from 0 to 890575601664
Mar 18 09:02:20 zaphkiel kernel: [   31.645517] md/raid1:md0: active with 1 out of 2 mirrors
Mar 18 09:02:20 zaphkiel kernel: [   31.709880]  md3: unknown partition table
Mar 18 09:02:20 zaphkiel kernel: [   31.734035] md: bind<sdb2>
Mar 18 09:02:20 zaphkiel kernel: [   31.753744] created bitmap (6 pages) for device md0
Mar 18 09:02:20 zaphkiel kernel: [   31.793486] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 12047 bits
Mar 18 09:02:20 zaphkiel kernel: [   31.874934] md/raid1:md1: active with 2 out of 2 mirrors
Mar 18 09:02:20 zaphkiel kernel: [   31.907025] md0: detected capacity change from 0 to 98689024
Mar 18 09:02:20 zaphkiel kernel: [   32.004102] created bitmap (8 pages) for device md1
Mar 18 09:02:20 zaphkiel kernel: [   32.078506] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 16379 bits
Mar 18 09:02:20 zaphkiel kernel: [   32.146550]  md0: unknown partition table
Mar 18 09:02:20 zaphkiel kernel: [   32.170680] md1: detected capacity change from 0 to 2146783232
Mar 18 09:02:20 zaphkiel kernel: [   32.209513] AFTER mdadm -IRs
Mar 18 09:02:20 zaphkiel kernel: [   32.452210]  md1: unknown partition table
Mar 18 09:02:20 zaphkiel kernel: [   32.783780] Adding 2096464k swap on /dev/md1.  Priority:-1 extents:1 across:2096464k 
Mar 18 09:02:20 zaphkiel kernel: [   34.746669] AFTER mdadm -Asc
Mar 18 09:02:20 zaphkiel kernel: [   36.399439] XFS (md3): Mounting Filesystem
Mar 18 09:02:20 zaphkiel kernel: [   36.943112] EXT4-fs (md0): mounting ext2 file system using the ext4 subsystem
Mar 18 09:02:20 zaphkiel kernel: [   37.014070] XFS (md3): Ending clean mount
Mar 18 09:02:20 zaphkiel kernel: [   37.257681] EXT4-fs (md0): mounted filesystem without journal. Opts: acl,user_xattr

Thanks,
Pete
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-03-18 11:20                         ` Hans-Peter Jansen
@ 2013-03-21  3:24                           ` NeilBrown
  2013-04-10 13:28                             ` Hans-Peter Jansen
  2013-04-10 13:44                             ` Hans-Peter Jansen
  0 siblings, 2 replies; 19+ messages in thread
From: NeilBrown @ 2013-03-21  3:24 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 1869 bytes --]

On Mon, 18 Mar 2013 12:20:57 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote:

> Am Freitag, 15. März 2013, 23:43:31 schrieb Hans-Peter Jansen:
> > Am Mittwoch, 13. März 2013, 11:52:35 schrieben Sie:
> > > hi,
> > > 
> > >  wonder if you could try one more test for me.
> > >  With the other (echo BEFORE / echo AFTER etc) tracing still there, change
> > > 
> > > the /sbin/udevadm settle --timeout=$MDADM_DEVICE_TIMEOUT
> > > 
> > >  to
> > >  
> > >    /sbin/udevadm --debug settle --timeout=$MDADM_DEVICE_TIMEOUT >
> > >    /dev/kmsg
> > > 
> > > 2>&1
> > > 
> > >  I found there is a case where "udevadm settle" can exit before the queue
> > >  is
> > > 
> > > empty.  It seems like a very unlikely scenario, but it seems clear that
> > > something "unlikely" is happening.
> > > 
> > >  I'm hoping to see
> > >  
> > >     timeout waiting for udev queue
> > >  
> > >  appear in the logs when this runs.
> > >  
> > >  If you could add that and post the 'dmesg' output if you ever get that
> > >  message - or maybe even if you don't - that would be very helpful.
> > 
> > Did that right now and disabled the sleep 10 in front, that provisionally
> > "solved" this issue BTW. Now waiting for another md issue to occur.
> 
> Here we go: (not censored in any way, and I hope to get the "window" right:
> md0 was affected this time.
> 

Thanks! Unfortunately it is not as helpful as I hoped, but it does suggest
that "udevadm settle" does sometimes appear to misbehave even if there aren't
any problems with the md array.

Could I ask for one more?
Prefix the "udevadm settle " command with
  strace  -o /tmp/udevadm.trace -s 500

(making sure that you have 'strace' installed) and then post the
"/tmp/udevadm.trace" file.  Hopefully that will at least allow me to rule out
some possibilities.

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-03-21  3:24                           ` NeilBrown
@ 2013-04-10 13:28                             ` Hans-Peter Jansen
  2013-04-10 13:44                             ` Hans-Peter Jansen
  1 sibling, 0 replies; 19+ messages in thread
From: Hans-Peter Jansen @ 2013-04-10 13:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 2179 bytes --]

On Donnerstag, 21. März 2013 14:24:59 NeilBrown wrote:
> On Mon, 18 Mar 2013 12:20:57 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote:
> > Am Freitag, 15. März 2013, 23:43:31 schrieb Hans-Peter Jansen:
> > > Am Mittwoch, 13. März 2013, 11:52:35 schrieben Sie:
> > > > hi,
> > > > 
> > > >  wonder if you could try one more test for me.
> > > >  With the other (echo BEFORE / echo AFTER etc) tracing still there,
> > > >  change
> > > > 
> > > > the /sbin/udevadm settle --timeout=$MDADM_DEVICE_TIMEOUT
> > > > 
> > > >  to
> > > >  
> > > >    /sbin/udevadm --debug settle --timeout=$MDADM_DEVICE_TIMEOUT >
> > > >    /dev/kmsg
> > > > 
> > > > 2>&1
> > > > 
> > > >  I found there is a case where "udevadm settle" can exit before the
> > > >  queue
> > > >  is
> > > > 
> > > > empty.  It seems like a very unlikely scenario, but it seems clear
> > > > that
> > > > something "unlikely" is happening.
> > > > 
> > > >  I'm hoping to see
> > > >  
> > > >     timeout waiting for udev queue
> > > >  
> > > >  appear in the logs when this runs.
> > > >  
> > > >  If you could add that and post the 'dmesg' output if you ever get
> > > >  that
> > > >  message - or maybe even if you don't - that would be very helpful.
> > > 
> > > Did that right now and disabled the sleep 10 in front, that
> > > provisionally
> > > "solved" this issue BTW. Now waiting for another md issue to occur.
> > 
> > Here we go: (not censored in any way, and I hope to get the "window"
> > right:
> > md0 was affected this time.
> 
> Thanks! Unfortunately it is not as helpful as I hoped, but it does suggest
> that "udevadm settle" does sometimes appear to misbehave even if there
> aren't any problems with the md array.
> 
> Could I ask for one more?
> Prefix the "udevadm settle " command with
>   strace  -o /tmp/udevadm.trace -s 500
> 
> (making sure that you have 'strace' installed) and then post the
> "/tmp/udevadm.trace" file.  Hopefully that will at least allow me to rule
> out some possibilities.

After reducing the timeout to 40 secs, we're able to catch a failure today
with two degraded arrays: md0 and md1.

Cheers,
Pete

[-- Attachment #2: udevadm.trace.gz --]
[-- Type: application/x-gzip, Size: 5524 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-03-21  3:24                           ` NeilBrown
  2013-04-10 13:28                             ` Hans-Peter Jansen
@ 2013-04-10 13:44                             ` Hans-Peter Jansen
  2013-04-11  7:33                               ` NeilBrown
  1 sibling, 1 reply; 19+ messages in thread
From: Hans-Peter Jansen @ 2013-04-10 13:44 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 683 bytes --]

On Donnerstag, 21. März 2013 14:24:59 NeilBrown wrote:
> 
> Thanks! Unfortunately it is not as helpful as I hoped, but it does suggest
> that "udevadm settle" does sometimes appear to misbehave even if there
> aren't any problems with the md array.
> 
> Could I ask for one more?
> Prefix the "udevadm settle " command with
>   strace  -o /tmp/udevadm.trace -s 500
> 
> (making sure that you have 'strace' installed) and then post the
> "/tmp/udevadm.trace" file.  Hopefully that will at least allow me to rule
> out some possibilities.

After reducing the timeout to 40 secs, we're able to catch a failure today
with two degraded arrays: md0 and md1.

Cheers,
Pete

[-- Attachment #2: udevadm.trace.gz --]
[-- Type: application/x-gzip, Size: 5524 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Persistent failures with simple md setup
  2013-04-10 13:44                             ` Hans-Peter Jansen
@ 2013-04-11  7:33                               ` NeilBrown
  0 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2013-04-11  7:33 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 2177 bytes --]

On Wed, 10 Apr 2013 15:44:25 +0200 Hans-Peter Jansen <hpj@urpla.net> wrote:

> On Donnerstag, 21. März 2013 14:24:59 NeilBrown wrote:
> > 
> > Thanks! Unfortunately it is not as helpful as I hoped, but it does suggest
> > that "udevadm settle" does sometimes appear to misbehave even if there
> > aren't any problems with the md array.
> > 
> > Could I ask for one more?
> > Prefix the "udevadm settle " command with
> >   strace  -o /tmp/udevadm.trace -s 500
> > 
> > (making sure that you have 'strace' installed) and then post the
> > "/tmp/udevadm.trace" file.  Hopefully that will at least allow me to rule
> > out some possibilities.
> 
> After reducing the timeout to 40 secs, we're able to catch a failure today
> with two degraded arrays: md0 and md1.
> 
> Cheers,
> Pete

Thanks.

There is a similar bug report open at
  https://bugzilla.novell.com/show_bug.cgi?id=793954

I received a generally similar trace there and have been looking at yours
and those and the code and not getting very far.

The  traces show the /run/udev/queue.bin getting bigger and bigger, then
shrinking down to '8':
 
 %zgrep 'fstat64(4,' /tmp/udevadm.trace.gz | sed 's/.*st_size=\([0-9]*\),.*/\1/'

This is expected.  Whenever the queue becomes empty, the file is reset
(normally 'add' and 'remove' records are simply appended).
It is supposed to also be shrunk when it is bigger than 4K and over half of
the file is wasted (due to add/remove pairs) but the calculation of wastage
is broken and I think this never happens.

Nevertheless there seems to be a correlation between the times when it fails
and the max size of the queue.bin file.  In the files from the buzilla, the
times that it works, the queue.bin file never gets big.

But that could just be an accident.

In any case it is clear that udevadm is working correctly.  Which seems to
suggest that maybe 'udev' is messing with the queue.bin file.  Either that or
I'm misunderstand some other details and am entirely on the wrong path.

But I've gone over the udev code again and again and I cannot fault it.

So I'm still stumped.

Thanks for your help though.
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-04-11  7:33 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-29 22:14 Persistent failures with simple md setup Hans-Peter Jansen
2013-01-30  9:07 ` Sebastian Riemer
2013-01-30 17:12   ` Hans-Peter Jansen
2013-02-04 20:43     ` Hans-Peter Jansen
2013-02-05  3:44       ` NeilBrown
2013-02-27 17:01         ` Hans-Peter Jansen
2013-02-28  3:40           ` NeilBrown
2013-02-28 10:49             ` Hans-Peter Jansen
2013-02-28 21:25               ` NeilBrown
2013-02-28 22:16                 ` Hans-Peter Jansen
     [not found]                   ` <4291349.FrQcKOnicQ@xrated>
2013-03-03 23:33                     ` NeilBrown
2013-03-13  0:52                     ` NeilBrown
2013-03-15 22:43                       ` Hans-Peter Jansen
2013-03-18 11:20                         ` Hans-Peter Jansen
2013-03-21  3:24                           ` NeilBrown
2013-04-10 13:28                             ` Hans-Peter Jansen
2013-04-10 13:44                             ` Hans-Peter Jansen
2013-04-11  7:33                               ` NeilBrown
2013-01-30  9:20 ` Roy Sigurd Karlsbakk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.