* Persistent failures with simple md setup @ 2013-01-29 22:14 Hans-Peter Jansen 2013-01-30 9:07 ` Sebastian Riemer 2013-01-30 9:20 ` Roy Sigurd Karlsbakk 0 siblings, 2 replies; 19+ messages in thread From: Hans-Peter Jansen @ 2013-01-29 22:14 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 3990 bytes --] [Looks like my first messge didn't made it to the list, hence send again with tarballed attachments] Dear list members, one of the systems, I take care of, there's one pretty bog standard openSUSE 12.1 installation, that stick out with continued device failures on boot: Here a typical case: ~# cat /proc/mdstat Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] md3 : active raid1 sda4[0] 869702736 blocks super 1.0 [2/1] [U_] bitmap: 57/415 pages [228KB], 1024KB chunk md0 : active raid1 sda1[0] 96376 blocks super 1.0 [2/1] [U_] bitmap: 1/6 pages [4KB], 8KB chunk md1 : active (auto-read-only) raid1 sdb2[1] sda2[0] 2096468 blocks super 1.0 [2/2] [UU] bitmap: 0/8 pages [0KB], 128KB chunk md124 : active raid1 sdb3[1] sda3[0] 104856180 blocks super 1.0 [2/2] [UU] bitmap: 8/200 pages [32KB], 256KB chunk [no line breaks on purpose] Jan 29 20:22:36 zaphkiel kernel: [ 11.047504] md: raid1 personality registered for level 1 Jan 29 20:22:36 zaphkiel kernel: [ 11.549612] md: bind<sda3> Jan 29 20:22:36 zaphkiel kernel: [ 11.587037] md: bind<sdb3> Jan 29 20:22:36 zaphkiel kernel: [ 11.630965] md/raid1:md124: active with 2 out of 2 mirrors Jan 29 20:22:36 zaphkiel kernel: [ 11.708396] md124: bitmap initialized from disk: read 13/13 pages, set 1 of 409595 bits Jan 29 20:22:36 zaphkiel kernel: [ 11.769213] md124: detected capacity change from 0 to 107372728320 Jan 29 20:22:36 zaphkiel kernel: [ 11.981192] md: raid0 personality registered for level 0 Jan 29 20:22:36 zaphkiel kernel: [ 12.020959] md: raid10 personality registered for level 10 Jan 29 20:22:36 zaphkiel kernel: [ 12.625530] md: raid6 personality registered for level 6 Jan 29 20:22:36 zaphkiel kernel: [ 12.657414] md: raid5 personality registered for level 5 Jan 29 20:22:36 zaphkiel kernel: [ 12.689261] md: raid4 personality registered for level 4 Jan 29 20:22:36 zaphkiel kernel: [ 25.151590] md: bind<sda2> Jan 29 20:22:36 zaphkiel kernel: [ 25.314284] md: bind<sda1> Jan 29 20:22:36 zaphkiel kernel: [ 25.409503] md: bind<sda4> Jan 29 20:22:36 zaphkiel kernel: [ 25.568103] md/raid1:md0: active with 1 out of 2 mirrors Jan 29 20:22:36 zaphkiel kernel: [ 25.689110] md: bind<sdb2> Jan 29 20:22:36 zaphkiel kernel: [ 25.713385] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 12047 bits Jan 29 20:22:36 zaphkiel kernel: [ 25.837207] md0: detected capacity change from 0 to 98689024 Jan 29 20:22:36 zaphkiel kernel: [ 26.045361] md/raid1:md1: active with 2 out of 2 mirrors Jan 29 20:22:36 zaphkiel kernel: [ 26.260500] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 16379 bits Jan 29 20:22:36 zaphkiel kernel: [ 26.349129] md1: detected capacity change from 0 to 2146783232 Jan 29 20:22:36 zaphkiel kernel: [ 26.391526] md/raid1:md3: active with 1 out of 2 mirrors Jan 29 20:22:36 zaphkiel kernel: [ 27.188346] md3: bitmap initialized from disk: read 26/26 pages, set 1547 of 849320 bits Jan 29 20:22:36 zaphkiel kernel: [ 27.302622] md3: detected capacity change from 0 to 890575601664 This looks like some kind of race during device detection. The full boot sequence log leading to this mess is attached. The major parts operating here are: mdadm-3.2.2-4.9.1.i586 mkinitrd-2.7.0-39.3.1.i586 kernel-desktop-3.1.10-1.16.1.i586 kernel-desktop-base-3.1.10-1.16.1.i586 Sure the system can be repaired with: mdadm --add /dev/md0 /dev/sdb1 mdadm --add /dev/md3 /dev/sdb4 for this case, but the behavior which partition is affected is random, only md124 seems stable (the root fs). The strange md naming was the result of an upgrade installation. The device details are attached as well. It happens, that the active device even *switches* between boots, which is a perfect recipe for actually loosing data, hence this md doesn't raise data security, it is the reason for loosing them. Could some kind soul tell me, what's going on here? Thanks in advance, Pete [-- Attachment #2: details-and-log.tar.bz2 --] [-- Type: application/x-bzip-compressed-tar, Size: 18989 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-01-29 22:14 Persistent failures with simple md setup Hans-Peter Jansen @ 2013-01-30 9:07 ` Sebastian Riemer 2013-01-30 17:12 ` Hans-Peter Jansen 2013-01-30 9:20 ` Roy Sigurd Karlsbakk 1 sibling, 1 reply; 19+ messages in thread From: Sebastian Riemer @ 2013-01-30 9:07 UTC (permalink / raw) To: Hans-Peter Jansen; +Cc: Linux RAID, NeilBrown On 29.01.2013 23:14, Hans-Peter Jansen wrote: [...] > ~# cat /proc/mdstat > Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] > md3 : active raid1 sda4[0] > 869702736 blocks super 1.0 [2/1] [U_] > bitmap: 57/415 pages [228KB], 1024KB chunk > > md0 : active raid1 sda1[0] > 96376 blocks super 1.0 [2/1] [U_] > bitmap: 1/6 pages [4KB], 8KB chunk > > md1 : active (auto-read-only) raid1 sdb2[1] sda2[0] > 2096468 blocks super 1.0 [2/2] [UU] > bitmap: 0/8 pages [0KB], 128KB chunk > > md124 : active raid1 sdb3[1] sda3[0] > 104856180 blocks super 1.0 [2/2] [UU] > bitmap: 8/200 pages [32KB], 256KB chunk > > This looks like some kind of race during device detection. > The full boot sequence log leading to this mess is attached. [...] > Could some kind soul tell me, what's going on here? Funny, we've observed similar strange behavior when putting MD devices on iSCSI/SRP exports. We connect to the SCSI target and udev does lots of crap assembling only 1/2 or even 0/2 devices. This is why we disable all udev rules related to MD and do it by custom scripts. In mdadm 3.2.6 a possible fix has been introduced. Check: git log mdadm-3.2.5..mdadm-3.2.6 commit 090900c3d2eb5b3aef5251a21228483c32246cc7 Author: Harald Hoyer <harald@redhat.com> Date: Mon Aug 13 08:00:21 2012 +1000 udev-rules: prevent systemd from mount devices before they are ready. commit b7e05d2373313dd8d0cb687479ad58a88f37d29f Author: NeilBrown <neilb@suse.de> Date: Thu May 24 11:49:49 2012 +1000 udev-rules: prevent systemd from mount devices before they are ready. Does mdadm 3.2.6 solve this? Cheers, Sebastian ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-01-30 9:07 ` Sebastian Riemer @ 2013-01-30 17:12 ` Hans-Peter Jansen 2013-02-04 20:43 ` Hans-Peter Jansen 0 siblings, 1 reply; 19+ messages in thread From: Hans-Peter Jansen @ 2013-01-30 17:12 UTC (permalink / raw) To: Sebastian Riemer; +Cc: Linux RAID, NeilBrown Dear Sebastian, thanks for your valuable response. Am Mittwoch, 30. Januar 2013, 10:07:24 schrieb Sebastian Riemer: > On 29.01.2013 23:14, Hans-Peter Jansen wrote: > [...] > > > ~# cat /proc/mdstat > > Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] > > md3 : active raid1 sda4[0] > > > > 869702736 blocks super 1.0 [2/1] [U_] > > bitmap: 57/415 pages [228KB], 1024KB chunk > > > > md0 : active raid1 sda1[0] > > > > 96376 blocks super 1.0 [2/1] [U_] > > bitmap: 1/6 pages [4KB], 8KB chunk > > > > md1 : active (auto-read-only) raid1 sdb2[1] sda2[0] > > > > 2096468 blocks super 1.0 [2/2] [UU] > > bitmap: 0/8 pages [0KB], 128KB chunk > > > > md124 : active raid1 sdb3[1] sda3[0] > > > > 104856180 blocks super 1.0 [2/2] [UU] > > bitmap: 8/200 pages [32KB], 256KB chunk > > > > This looks like some kind of race during device detection. > > The full boot sequence log leading to this mess is attached. > > [...] > > > Could some kind soul tell me, what's going on here? > > Funny, we've observed similar strange behavior when putting MD devices > on iSCSI/SRP exports. We connect to the SCSI target and udev does lots > of crap assembling only 1/2 or even 0/2 devices. This is why we disable > all udev rules related to MD and do it by custom scripts. Oh, I see. Forgot to mention, I do not enjoy fiddling in mkinitrd code ;-) (been there, had to do that for my aufs2 based diskless setups) > In mdadm 3.2.6 a possible fix has been introduced. > > Check: git log mdadm-3.2.5..mdadm-3.2.6 > > > commit 090900c3d2eb5b3aef5251a21228483c32246cc7 > Author: Harald Hoyer <harald@redhat.com> > Date: Mon Aug 13 08:00:21 2012 +1000 > > udev-rules: prevent systemd from mount devices before they are ready. > > commit b7e05d2373313dd8d0cb687479ad58a88f37d29f > Author: NeilBrown <neilb@suse.de> > Date: Thu May 24 11:49:49 2012 +1000 > > udev-rules: prevent systemd from mount devices before they are ready. > > > Does mdadm 3.2.6 solve this? Hmm, according to mdadm from openSUSE:12.1:Update, the relevant fixes should be in place. It might be an unfortunate combination of this issue and the asynchronously applied updates, interfered by the *switching* behavior. I started with regenerating the initrds now, and a first reboot succeeded so far. Good. Will ask my friend to reboot the system a dozen times tonight. Thanks, Pete ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-01-30 17:12 ` Hans-Peter Jansen @ 2013-02-04 20:43 ` Hans-Peter Jansen 2013-02-05 3:44 ` NeilBrown 0 siblings, 1 reply; 19+ messages in thread From: Hans-Peter Jansen @ 2013-02-04 20:43 UTC (permalink / raw) To: Linux RAID; +Cc: Sebastian Riemer, NeilBrown [-- Attachment #1: Type: text/plain, Size: 10019 bytes --] Am Mittwoch, 30. Januar 2013, 18:12:46 schrieb Hans-Peter Jansen: > > Hmm, according to mdadm from openSUSE:12.1:Update, the relevant fixes should > be in place. It might be an unfortunate combination of this issue and the > asynchronously applied updates, interfered by the *switching* behavior. > > I started with regenerating the initrds now, and a first reboot succeeded so > far. Good. > > Will ask my friend to reboot the system a dozen times tonight. After a few reboots, the issue reappeared. I really believe now, that by driving the md in degraded mode for some time and with the switching behavior, just re-adding the devices resulted in unsynced raid1 devices. Next, my friend managed to create a nearby data disaster: I've explained him, how he would be able to re-add a device himself. He did so on sunday with his home partition, and since there appeared no progress bar in /proc/mdstat, he immediately repeated the command. Neil, is it conceivable (due to a race or the like), that repeating to add (re-add) a device potentially creates data salad, since that home-fs (xfs) gone mad a few minutes later (firefox crashed, and couldn't be started, kmail crashed, and so on (all those processes, that write to ~). He decided to reboot, and that jailed him in the emergency recovery console, because /home couldn't be mounted anymore. Both parts of the mirror were affected, the "old" part was ~200kb undestructive xfs_repair log, the other ~900kb, hence I decided to use the smaller one. First I failed and removed the other part, and then attempted to repair it. Unfortunately, the real repair run bailed out with: disconnected inode 2161430687, moving to lost+found corrupt dinode 2161430687, extent total = 1, nblocks = 0. This is a bug. Please capture the filesystem metadata with xfs_metadump and report it to xfs@oss.sgi.com. cache_node_purge: refcount was 1, not zero (node=0xf867208) fatal error -- 117 - couldn't iget disconnected inode although I already used the (current) xfsprogs-3.1.6 version. :-( After fixing that issue manually with xfs_db (== great fun), I was able to recover the filesystem. It lost(+found) just a few new items, nobody cares about... So far, so good.. Now, the unsynced state disturbed me. Just re-adding the bad device might result in an invalid mirror again. A "repair" run cannot be controlled. Hence I zeroed the superblock of that partition, and added it. Et voila, it completely synced that mirror. Good. Today, I hammered the raid1 partitions with "check". During one run, this appeared in syslog: Feb 4 11:18:26 zaphkiel kernel: [11165.652478] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 Feb 4 11:18:26 zaphkiel kernel: [11165.652486] ata2.00: irq_stat 0x40000008 Feb 4 11:18:26 zaphkiel kernel: [11165.652495] ata2.00: failed command: READ FPDMA QUEUED Feb 4 11:18:26 zaphkiel kernel: [11165.652510] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in Feb 4 11:18:26 zaphkiel kernel: [11165.652513] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> Feb 4 11:18:26 zaphkiel kernel: [11165.652520] ata2.00: status: { DRDY ERR } Feb 4 11:18:26 zaphkiel kernel: [11165.652524] ata2.00: error: { UNC } Feb 4 11:18:26 zaphkiel kernel: [11165.652876] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100) Feb 4 11:18:26 zaphkiel kernel: [11165.652882] ata2.00: revalidation failed (errno=-5) Feb 4 11:18:26 zaphkiel kernel: [11165.652890] ata2: hard resetting link Feb 4 11:18:26 zaphkiel kernel: [11165.957043] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Feb 4 11:18:26 zaphkiel kernel: [11165.969910] ata2.00: configured for UDMA/133 Feb 4 11:18:26 zaphkiel kernel: [11165.970048] ata2: EH complete Feb 4 11:18:28 zaphkiel kernel: [11167.949241] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 Feb 4 11:18:28 zaphkiel kernel: [11167.949249] ata2.00: irq_stat 0x40000008 Feb 4 11:18:28 zaphkiel kernel: [11167.949257] ata2.00: failed command: READ FPDMA QUEUED Feb 4 11:18:28 zaphkiel kernel: [11167.949272] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in Feb 4 11:18:28 zaphkiel kernel: [11167.949275] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> Feb 4 11:18:28 zaphkiel kernel: [11167.949282] ata2.00: status: { DRDY ERR } Feb 4 11:18:28 zaphkiel kernel: [11167.949287] ata2.00: error: { UNC } Feb 4 11:18:28 zaphkiel kernel: [11167.962146] ata2.00: configured for UDMA/133 Feb 4 11:18:28 zaphkiel kernel: [11167.962206] ata2: EH complete Feb 4 11:18:30 zaphkiel kernel: [11169.898187] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 Feb 4 11:18:30 zaphkiel kernel: [11169.898195] ata2.00: irq_stat 0x40000008 Feb 4 11:18:30 zaphkiel kernel: [11169.898204] ata2.00: failed command: READ FPDMA QUEUED Feb 4 11:18:30 zaphkiel kernel: [11169.898219] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in Feb 4 11:18:30 zaphkiel kernel: [11169.898222] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> Feb 4 11:18:30 zaphkiel kernel: [11169.898229] ata2.00: status: { DRDY ERR } Feb 4 11:18:30 zaphkiel kernel: [11169.898234] ata2.00: error: { UNC } Feb 4 11:18:30 zaphkiel kernel: [11169.912066] ata2.00: configured for UDMA/133 Feb 4 11:18:30 zaphkiel kernel: [11169.912117] ata2: EH complete Feb 4 11:18:32 zaphkiel kernel: [11171.905192] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 Feb 4 11:18:32 zaphkiel kernel: [11171.905200] ata2.00: irq_stat 0x40000008 Feb 4 11:18:32 zaphkiel kernel: [11171.905208] ata2.00: failed command: READ FPDMA QUEUED Feb 4 11:18:32 zaphkiel kernel: [11171.905223] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in Feb 4 11:18:32 zaphkiel kernel: [11171.905226] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> Feb 4 11:18:32 zaphkiel kernel: [11171.905233] ata2.00: status: { DRDY ERR } Feb 4 11:18:32 zaphkiel kernel: [11171.905238] ata2.00: error: { UNC } Feb 4 11:18:32 zaphkiel kernel: [11171.919099] ata2.00: configured for UDMA/133 Feb 4 11:18:32 zaphkiel kernel: [11171.919152] ata2: EH complete Feb 4 11:18:34 zaphkiel kernel: [11173.912191] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 Feb 4 11:18:34 zaphkiel kernel: [11173.912199] ata2.00: irq_stat 0x40000008 Feb 4 11:18:34 zaphkiel kernel: [11173.912208] ata2.00: failed command: READ FPDMA QUEUED Feb 4 11:18:34 zaphkiel kernel: [11173.912223] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in Feb 4 11:18:34 zaphkiel kernel: [11173.912226] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> Feb 4 11:18:34 zaphkiel kernel: [11173.912233] ata2.00: status: { DRDY ERR } Feb 4 11:18:34 zaphkiel kernel: [11173.912238] ata2.00: error: { UNC } Feb 4 11:18:34 zaphkiel kernel: [11173.925101] ata2.00: configured for UDMA/133 Feb 4 11:18:34 zaphkiel kernel: [11173.925159] ata2: EH complete Feb 4 11:18:36 zaphkiel kernel: [11175.861152] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 Feb 4 11:18:36 zaphkiel kernel: [11175.861160] ata2.00: irq_stat 0x40000008 Feb 4 11:18:36 zaphkiel kernel: [11175.861168] ata2.00: failed command: READ FPDMA QUEUED Feb 4 11:18:36 zaphkiel kernel: [11175.861183] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in Feb 4 11:18:36 zaphkiel kernel: [11175.861186] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> Feb 4 11:18:36 zaphkiel kernel: [11175.861193] ata2.00: status: { DRDY ERR } Feb 4 11:18:36 zaphkiel kernel: [11175.861198] ata2.00: error: { UNC } Feb 4 11:18:36 zaphkiel kernel: [11175.874052] ata2.00: configured for UDMA/133 Feb 4 11:18:36 zaphkiel kernel: [11175.874103] sd 1:0:0:0: [sdb] Unhandled sense code Feb 4 11:18:36 zaphkiel kernel: [11175.874109] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE Feb 4 11:18:36 zaphkiel kernel: [11175.874117] sd 1:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor] Feb 4 11:18:36 zaphkiel kernel: [11175.874125] Descriptor sense data with sense descriptors (in hex): Feb 4 11:18:36 zaphkiel kernel: [11175.874130] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 Feb 4 11:18:36 zaphkiel kernel: [11175.874145] 0c c2 ef 3f Feb 4 11:18:36 zaphkiel kernel: [11175.874153] sd 1:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed Feb 4 11:18:36 zaphkiel kernel: [11175.874163] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 0c c2 ef 12 00 00 80 00 Feb 4 11:18:36 zaphkiel kernel: [11175.874180] end_request: I/O error, dev sdb, sector 214101823 Feb 4 11:18:36 zaphkiel kernel: [11175.874234] ata2: EH complete Feb 4 11:18:38 zaphkiel kernel: [11177.954091] md: md124: data-check done. This is a classical URE, isn't it? Interestingly, nonetheless, the raid1 check run succeeded! (Not so good, is it?) Before you ask, both drives have a sane timeout already: smartctl -l scterc /dev/sda smartctl 6.0 2012-10-10 r3643 [i686-linux-3.1.10-1.9-desktop] (SUSE RPM) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org SCT Error Recovery Control: Read: 70 (7.0 seconds) Write: 70 (7.0 seconds) smartctl -l scterc /dev/sdb smartctl 6.0 2012-10-10 r3643 [i686-linux-3.1.10-1.9-desktop] (SUSE RPM) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org SCT Error Recovery Control: Read: 70 (7.0 seconds) Write: 70 (7.0 seconds) Attached is the result of smartctl -x of sda (good), and sdb (bad). Could somebody from the audience have a look into it, and give me an assessment, how dangerous the state of this drive really is. Last question: since I had to massage the system anyway, I've updated mdadm from 3.2.2 to 3.2.6. I red, that it can be dangerous to do so, what do I risk here? Thanks in advance, Pete [-- Attachment #2: smart-sdb-20130204-140400.log --] [-- Type: text/x-log, Size: 17301 bytes --] smartctl 6.0 2012-10-10 r3643 [i686-linux-3.1.10-1.9-desktop] (SUSE RPM) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: SAMSUNG SpinPoint F1 RE Device Model: SAMSUNG HE103UJ Serial Number: S13VJDWS900483 LU WWN Device Id: 5 0024e9 002167ef6 Firmware Version: 1AA01118 User Capacity: 1.000.204.886.016 bytes [1,00 TB] Sector Size: 512 bytes logical/physical Device is: In smartctl database [for details use: -P show] ATA Version is: ATA/ATAPI-7, ATA8-ACS T13/1699-D revision 3b Local Time is: Mon Feb 4 14:04:00 2013 CET SMART support is: Available - device has SMART capability. SMART support is: Enabled AAM feature is: Disabled APM feature is: Disabled Rd look-ahead is: Enabled Write cache is: Enabled ATA Security is: Disabled, NOT FROZEN [SEC1] === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (12124) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 203) minutes. Conveyance self-test routine recommended polling time: ( 22) minutes. SCT capabilities: (0x003f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE 1 Raw_Read_Error_Rate POSR-- 099 099 051 - 531 3 Spin_Up_Time POS--- 073 073 011 - 9000 4 Start_Stop_Count -O--CK 099 099 000 - 1278 5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0 7 Seek_Error_Rate POSR-- 100 100 051 - 0 8 Seek_Time_Performance P-S--K 100 100 015 - 0 9 Power_On_Hours -O--CK 097 097 000 - 16416 10 Spin_Retry_Count PO--CK 100 100 051 - 0 11 Calibration_Retry_Count -O--C- 100 100 000 - 0 12 Power_Cycle_Count -O--CK 099 099 000 - 1277 13 Read_Soft_Error_Rate -OSR-- 099 099 000 - 528 183 Runtime_Bad_Block -O--CK 100 100 000 - 0 184 End-to-End_Error PO--CK 100 100 000 - 0 187 Reported_Uncorrect -O--CK 100 100 000 - 1052 188 Command_Timeout -O--CK 100 100 000 - 0 190 Airflow_Temperature_Cel -O---K 076 059 000 - 24 (Min/Max 12/24) 194 Temperature_Celsius -O---K 077 058 000 - 23 (Min/Max 12/26) 195 Hardware_ECC_Recovered -O-RC- 100 100 000 - 19500704 196 Reallocated_Event_Count -O--CK 100 100 000 - 0 197 Current_Pending_Sector -O--C- 100 100 000 - 0 198 Offline_Uncorrectable ----CK 100 100 000 - 0 199 UDMA_CRC_Error_Count -OSRCK 253 253 000 - 0 200 Multi_Zone_Error_Rate -O-R-- 100 100 000 - 0 201 Soft_Read_Error_Rate -O-R-- 099 099 000 - 12 ||||||_ K auto-keep |||||__ C event count ||||___ R error rate |||____ S speed/performance ||_____ O updated online |______ P prefailure warning General Purpose Log Directory Version 1 SMART Log Directory Version 1 [multi-sector log support] GP/S Log at address 0x00 has 1 sectors [Log Directory] SMART Log at address 0x01 has 1 sectors [Summary SMART error log] SMART Log at address 0x02 has 2 sectors [Comprehensive SMART error log] GP Log at address 0x03 has 2 sectors [Ext. Comprehensive SMART error log] GP Log at address 0x04 has 2 sectors [Device Statistics log] SMART Log at address 0x06 has 1 sectors [SMART self-test log] GP Log at address 0x07 has 2 sectors [Extended self-test log] SMART Log at address 0x09 has 1 sectors [Selective self-test log] GP Log at address 0x10 has 1 sectors [NCQ Command Error log] GP Log at address 0x11 has 1 sectors [SATA Phy Event Counters] GP Log at address 0x20 has 2 sectors [Streaming performance log] GP Log at address 0x21 has 1 sectors [Write stream error log] GP Log at address 0x22 has 1 sectors [Read stream error log] GP/S Log at address 0x80 has 16 sectors [Host vendor specific log] GP/S Log at address 0x81 has 16 sectors [Host vendor specific log] GP/S Log at address 0x82 has 16 sectors [Host vendor specific log] GP/S Log at address 0x83 has 16 sectors [Host vendor specific log] GP/S Log at address 0x84 has 16 sectors [Host vendor specific log] GP/S Log at address 0x85 has 16 sectors [Host vendor specific log] GP/S Log at address 0x86 has 16 sectors [Host vendor specific log] GP/S Log at address 0x87 has 16 sectors [Host vendor specific log] GP/S Log at address 0x88 has 16 sectors [Host vendor specific log] GP/S Log at address 0x89 has 16 sectors [Host vendor specific log] GP/S Log at address 0x8a has 16 sectors [Host vendor specific log] GP/S Log at address 0x8b has 16 sectors [Host vendor specific log] GP/S Log at address 0x8c has 16 sectors [Host vendor specific log] GP/S Log at address 0x8d has 16 sectors [Host vendor specific log] GP/S Log at address 0x8e has 16 sectors [Host vendor specific log] GP/S Log at address 0x8f has 16 sectors [Host vendor specific log] GP/S Log at address 0x90 has 16 sectors [Host vendor specific log] GP/S Log at address 0x91 has 16 sectors [Host vendor specific log] GP/S Log at address 0x92 has 16 sectors [Host vendor specific log] GP/S Log at address 0x93 has 16 sectors [Host vendor specific log] GP/S Log at address 0x94 has 16 sectors [Host vendor specific log] GP/S Log at address 0x95 has 16 sectors [Host vendor specific log] GP/S Log at address 0x96 has 16 sectors [Host vendor specific log] GP/S Log at address 0x97 has 16 sectors [Host vendor specific log] GP/S Log at address 0x98 has 16 sectors [Host vendor specific log] GP/S Log at address 0x99 has 16 sectors [Host vendor specific log] GP/S Log at address 0x9a has 16 sectors [Host vendor specific log] GP/S Log at address 0x9b has 16 sectors [Host vendor specific log] GP/S Log at address 0x9c has 16 sectors [Host vendor specific log] GP/S Log at address 0x9d has 16 sectors [Host vendor specific log] GP/S Log at address 0x9e has 16 sectors [Host vendor specific log] GP/S Log at address 0x9f has 16 sectors [Host vendor specific log] GP/S Log at address 0xe0 has 1 sectors [SCT Command/Status] GP/S Log at address 0xe1 has 1 sectors [SCT Data Transfer] SMART Extended Comprehensive Error Log Version: 1 (2 sectors) Device Error Count: 6 CR = Command Register FEATR = Features Register COUNT = Count (was: Sector Count) Register LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8 LH = LBA High (was: Cylinder High) Register ] LBA LM = LBA Mid (was: Cylinder Low) Register ] Register LL = LBA Low (was: Sector Number) Register ] DV = Device (was: Device/Head) Register DC = Device Control Register ER = Error register ST = Status register Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days. Error 6 [5] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER -- ST COUNT LBA_48 LH LM LL DV DC -- -- -- == -- == == == -- -- -- -- -- 00 -- 42 00 00 00 00 0c c2 ef 3e 40 00 at LBA = 0x0cc2ef3e = 214101822 Commands leading to the command that caused the error were: CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name -- == -- == -- == == == -- -- -- -- -- --------------- -------------------- 60 00 00 00 80 00 00 00 c2 f2 92 40 00 03:06:50.480 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f3 12 40 00 03:06:50.480 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f3 92 40 00 03:06:50.480 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f6 92 40 00 03:06:50.480 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f7 12 40 00 03:06:50.480 READ FPDMA QUEUED Error 5 [4] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER -- ST COUNT LBA_48 LH LM LL DV DC -- -- -- == -- == == == -- -- -- -- -- 00 -- 42 00 00 00 00 0c c2 ef 3e 40 00 at LBA = 0x0cc2ef3e = 214101822 Commands leading to the command that caused the error were: CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name -- == -- == -- == == == -- -- -- -- -- --------------- -------------------- 60 00 00 00 80 00 00 00 c2 f7 92 40 00 03:06:48.480 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f5 92 40 00 03:06:48.480 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 ef 12 40 00 03:06:48.480 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 fe 12 40 00 03:06:48.480 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 fb 12 40 00 03:06:48.480 READ FPDMA QUEUED Error 4 [3] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER -- ST COUNT LBA_48 LH LM LL DV DC -- -- -- == -- == == == -- -- -- -- -- 00 -- 42 00 00 00 00 0c c2 ef 3e 40 00 at LBA = 0x0cc2ef3e = 214101822 Commands leading to the command that caused the error were: CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name -- == -- == -- == == == -- -- -- -- -- --------------- -------------------- 60 00 00 00 80 00 00 00 c2 f2 92 40 00 03:06:46.470 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f3 12 40 00 03:06:46.470 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f3 92 40 00 03:06:46.470 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f6 92 40 00 03:06:46.470 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f7 12 40 00 03:06:46.470 READ FPDMA QUEUED Error 3 [2] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER -- ST COUNT LBA_48 LH LM LL DV DC -- -- -- == -- == == == -- -- -- -- -- 00 -- 42 00 00 00 00 0c c2 ef 3e 40 00 at LBA = 0x0cc2ef3e = 214101822 Commands leading to the command that caused the error were: CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name -- == -- == -- == == == -- -- -- -- -- --------------- -------------------- 60 00 00 00 80 00 00 00 c2 f7 92 40 00 03:06:44.520 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f5 92 40 00 03:06:44.520 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 ef 12 40 00 03:06:44.520 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 fe 12 40 00 03:06:44.520 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 fb 12 40 00 03:06:44.520 READ FPDMA QUEUED Error 2 [1] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER -- ST COUNT LBA_48 LH LM LL DV DC -- -- -- == -- == == == -- -- -- -- -- 00 -- 42 00 00 00 00 0c c2 ef 3e 40 00 at LBA = 0x0cc2ef3e = 214101822 Commands leading to the command that caused the error were: CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name -- == -- == -- == == == -- -- -- -- -- --------------- -------------------- 60 00 00 00 80 00 00 00 c2 f2 92 40 00 03:06:42.530 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f3 12 40 00 03:06:42.530 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f3 92 40 00 03:06:42.530 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f6 92 40 00 03:06:42.530 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f7 12 40 00 03:06:42.530 READ FPDMA QUEUED Error 1 [0] occurred at disk power-on lifetime: 16413 hours (683 days + 21 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER -- ST COUNT LBA_48 LH LM LL DV DC -- -- -- == -- == == == -- -- -- -- -- 00 -- 42 00 00 00 00 0c c2 ef 3e 40 00 at LBA = 0x0cc2ef3e = 214101822 Commands leading to the command that caused the error were: CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name -- == -- == -- == == == -- -- -- -- -- --------------- -------------------- 60 00 00 00 80 00 00 00 c2 fb 92 40 00 03:06:40.150 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 fb 12 40 00 03:06:40.150 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 fa 92 40 00 03:06:40.150 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 fa 12 40 00 03:06:40.150 READ FPDMA QUEUED 60 00 00 00 80 00 00 00 c2 f9 92 40 00 03:06:40.150 READ FPDMA QUEUED SMART Extended Self-test Log Version: 1 (2 sectors) No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. SCT Status Version: 2 SCT Version (vendor specific): 256 (0x0100) SCT Support Level: 1 Device State: Active (0) Current Temperature: 23 Celsius Power Cycle Max Temperature: 26 Celsius Lifetime Max Temperature: 63 Celsius SCT Temperature History Version: 2 Temperature Sampling Period: 1 minute Temperature Logging Interval: 1 minute Min/Max recommended Temperature: -4/72 Celsius Min/Max Temperature Limit: -9/77 Celsius Temperature History Size (Index): 128 (4) Index Estimated Time Temperature Celsius 5 2013-02-04 11:57 24 ***** ... ..( 32 skipped). .. ***** 38 2013-02-04 12:30 24 ***** 39 2013-02-04 12:31 25 ****** 40 2013-02-04 12:32 24 ***** ... ..( 52 skipped). .. ***** 93 2013-02-04 13:25 24 ***** 94 2013-02-04 13:26 23 **** ... ..( 37 skipped). .. **** 4 2013-02-04 14:04 23 **** SCT Error Recovery Control: Read: 70 (7,0 seconds) Write: 70 (7,0 seconds) ATA_READ_LOG_EXT (addr=0x04:0x00, page=0, n=1) failed: scsi error aborted command SATA Phy Event Counters (GP Log 0x11) ID Size Value Description 0x000a 2 5 Device-to-host register FISes sent due to a COMRESET 0x0001 2 0 Command failed due to ICRC error 0x0002 2 0 R_ERR response for data FIS 0x0003 2 0 R_ERR response for device-to-host data FIS 0x0004 2 0 R_ERR response for host-to-device data FIS 0x0005 2 0 R_ERR response for non-data FIS 0x0006 2 0 R_ERR response for device-to-host non-data FIS 0x0007 2 0 R_ERR response for host-to-device non-data FIS 0x0008 2 0 Device-to-host non-data FIS retries 0x0009 2 5 Transition from drive PhyRdy to drive PhyNRdy 0x000b 2 0 CRC errors within host-to-device FIS 0x000d 2 0 Non-CRC errors within host-to-device FIS 0x000f 2 0 R_ERR response for host-to-device data FIS, CRC 0x0010 2 0 R_ERR response for host-to-device data FIS, non-CRC 0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC 0x0013 2 0 R_ERR response for host-to-device non-data FIS, non-CRC [-- Attachment #3: smart-sda-20130204-140400.log --] [-- Type: text/x-log, Size: 10814 bytes --] smartctl 6.0 2012-10-10 r3643 [i686-linux-3.1.10-1.9-desktop] (SUSE RPM) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: SAMSUNG SpinPoint F1 RE Device Model: SAMSUNG HE103UJ Serial Number: S13VJDWS900475 LU WWN Device Id: 5 0024e9 002167cfe Firmware Version: 1AA01118 User Capacity: 1.000.204.886.016 bytes [1,00 TB] Sector Size: 512 bytes logical/physical Device is: In smartctl database [for details use: -P show] ATA Version is: ATA/ATAPI-7, ATA8-ACS T13/1699-D revision 3b Local Time is: Mon Feb 4 14:04:00 2013 CET SMART support is: Available - device has SMART capability. SMART support is: Enabled AAM feature is: Disabled APM feature is: Disabled Rd look-ahead is: Enabled Write cache is: Enabled ATA Security is: Disabled, NOT FROZEN [SEC1] === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (11933) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 200) minutes. Conveyance self-test routine recommended polling time: ( 21) minutes. SCT capabilities: (0x003f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE 1 Raw_Read_Error_Rate POSR-- 100 100 051 - 0 3 Spin_Up_Time POS--- 072 072 011 - 9340 4 Start_Stop_Count -O--CK 099 099 000 - 1278 5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0 7 Seek_Error_Rate POSR-- 100 100 051 - 0 8 Seek_Time_Performance P-S--K 100 100 015 - 0 9 Power_On_Hours -O--CK 097 097 000 - 16415 10 Spin_Retry_Count PO--CK 100 100 051 - 0 11 Calibration_Retry_Count -O--C- 100 100 000 - 0 12 Power_Cycle_Count -O--CK 099 099 000 - 1277 13 Read_Soft_Error_Rate -OSR-- 100 100 000 - 0 183 Runtime_Bad_Block -O--CK 100 100 000 - 0 184 End-to-End_Error PO--CK 100 100 000 - 0 187 Reported_Uncorrect -O--CK 100 100 000 - 0 188 Command_Timeout -O--CK 100 100 000 - 0 190 Airflow_Temperature_Cel -O---K 077 001 000 - 23 (Min/Max 11/23) 194 Temperature_Celsius -O---K 078 058 000 - 22 (Min/Max 11/25) 195 Hardware_ECC_Recovered -O-RC- 100 100 000 - 3573535 196 Reallocated_Event_Count -O--CK 100 100 000 - 0 197 Current_Pending_Sector -O--C- 100 100 000 - 0 198 Offline_Uncorrectable ----CK 100 100 000 - 0 199 UDMA_CRC_Error_Count -OSRCK 100 100 000 - 0 200 Multi_Zone_Error_Rate -O-R-- 100 100 000 - 0 201 Soft_Read_Error_Rate -O-R-- 100 100 000 - 0 ||||||_ K auto-keep |||||__ C event count ||||___ R error rate |||____ S speed/performance ||_____ O updated online |______ P prefailure warning General Purpose Log Directory Version 1 SMART Log Directory Version 1 [multi-sector log support] GP/S Log at address 0x00 has 1 sectors [Log Directory] SMART Log at address 0x01 has 1 sectors [Summary SMART error log] SMART Log at address 0x02 has 2 sectors [Comprehensive SMART error log] GP Log at address 0x03 has 2 sectors [Ext. Comprehensive SMART error log] GP Log at address 0x04 has 2 sectors [Device Statistics log] SMART Log at address 0x06 has 1 sectors [SMART self-test log] GP Log at address 0x07 has 2 sectors [Extended self-test log] SMART Log at address 0x09 has 1 sectors [Selective self-test log] GP Log at address 0x10 has 1 sectors [NCQ Command Error log] GP Log at address 0x11 has 1 sectors [SATA Phy Event Counters] GP Log at address 0x20 has 2 sectors [Streaming performance log] GP Log at address 0x21 has 1 sectors [Write stream error log] GP Log at address 0x22 has 1 sectors [Read stream error log] GP/S Log at address 0x80 has 16 sectors [Host vendor specific log] GP/S Log at address 0x81 has 16 sectors [Host vendor specific log] GP/S Log at address 0x82 has 16 sectors [Host vendor specific log] GP/S Log at address 0x83 has 16 sectors [Host vendor specific log] GP/S Log at address 0x84 has 16 sectors [Host vendor specific log] GP/S Log at address 0x85 has 16 sectors [Host vendor specific log] GP/S Log at address 0x86 has 16 sectors [Host vendor specific log] GP/S Log at address 0x87 has 16 sectors [Host vendor specific log] GP/S Log at address 0x88 has 16 sectors [Host vendor specific log] GP/S Log at address 0x89 has 16 sectors [Host vendor specific log] GP/S Log at address 0x8a has 16 sectors [Host vendor specific log] GP/S Log at address 0x8b has 16 sectors [Host vendor specific log] GP/S Log at address 0x8c has 16 sectors [Host vendor specific log] GP/S Log at address 0x8d has 16 sectors [Host vendor specific log] GP/S Log at address 0x8e has 16 sectors [Host vendor specific log] GP/S Log at address 0x8f has 16 sectors [Host vendor specific log] GP/S Log at address 0x90 has 16 sectors [Host vendor specific log] GP/S Log at address 0x91 has 16 sectors [Host vendor specific log] GP/S Log at address 0x92 has 16 sectors [Host vendor specific log] GP/S Log at address 0x93 has 16 sectors [Host vendor specific log] GP/S Log at address 0x94 has 16 sectors [Host vendor specific log] GP/S Log at address 0x95 has 16 sectors [Host vendor specific log] GP/S Log at address 0x96 has 16 sectors [Host vendor specific log] GP/S Log at address 0x97 has 16 sectors [Host vendor specific log] GP/S Log at address 0x98 has 16 sectors [Host vendor specific log] GP/S Log at address 0x99 has 16 sectors [Host vendor specific log] GP/S Log at address 0x9a has 16 sectors [Host vendor specific log] GP/S Log at address 0x9b has 16 sectors [Host vendor specific log] GP/S Log at address 0x9c has 16 sectors [Host vendor specific log] GP/S Log at address 0x9d has 16 sectors [Host vendor specific log] GP/S Log at address 0x9e has 16 sectors [Host vendor specific log] GP/S Log at address 0x9f has 16 sectors [Host vendor specific log] GP/S Log at address 0xe0 has 1 sectors [SCT Command/Status] GP/S Log at address 0xe1 has 1 sectors [SCT Data Transfer] SMART Extended Comprehensive Error Log Version: 1 (2 sectors) No Errors Logged SMART Extended Self-test Log Version: 1 (2 sectors) No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. SCT Status Version: 2 SCT Version (vendor specific): 256 (0x0100) SCT Support Level: 1 Device State: Active (0) Current Temperature: 22 Celsius Power Cycle Max Temperature: 26 Celsius Lifetime Max Temperature: 55 Celsius SCT Temperature History Version: 2 Temperature Sampling Period: 1 minute Temperature Logging Interval: 1 minute Min/Max recommended Temperature: -4/72 Celsius Min/Max Temperature Limit: -9/77 Celsius Temperature History Size (Index): 128 (10) Index Estimated Time Temperature Celsius 11 2013-02-04 11:57 23 **** ... ..( 31 skipped). .. **** 43 2013-02-04 12:29 23 **** 44 2013-02-04 12:30 24 ***** ... ..( 2 skipped). .. ***** 47 2013-02-04 12:33 24 ***** 48 2013-02-04 12:34 23 **** ... ..( 79 skipped). .. **** 0 2013-02-04 13:54 23 **** 1 2013-02-04 13:55 22 *** ... ..( 8 skipped). .. *** 10 2013-02-04 14:04 22 *** SCT Error Recovery Control: Read: 70 (7,0 seconds) Write: 70 (7,0 seconds) ATA_READ_LOG_EXT (addr=0x04:0x00, page=0, n=1) failed: scsi error aborted command SATA Phy Event Counters (GP Log 0x11) ID Size Value Description 0x000a 2 4 Device-to-host register FISes sent due to a COMRESET 0x0001 2 0 Command failed due to ICRC error 0x0002 2 0 R_ERR response for data FIS 0x0003 2 0 R_ERR response for device-to-host data FIS 0x0004 2 0 R_ERR response for host-to-device data FIS 0x0005 2 0 R_ERR response for non-data FIS 0x0006 2 0 R_ERR response for device-to-host non-data FIS 0x0007 2 0 R_ERR response for host-to-device non-data FIS 0x0008 2 0 Device-to-host non-data FIS retries 0x0009 2 4 Transition from drive PhyRdy to drive PhyNRdy 0x000b 2 0 CRC errors within host-to-device FIS 0x000d 2 0 Non-CRC errors within host-to-device FIS 0x000f 2 0 R_ERR response for host-to-device data FIS, CRC 0x0010 2 0 R_ERR response for host-to-device data FIS, non-CRC 0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC 0x0013 2 0 R_ERR response for host-to-device non-data FIS, non-CRC ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-02-04 20:43 ` Hans-Peter Jansen @ 2013-02-05 3:44 ` NeilBrown 2013-02-27 17:01 ` Hans-Peter Jansen 0 siblings, 1 reply; 19+ messages in thread From: NeilBrown @ 2013-02-05 3:44 UTC (permalink / raw) To: Hans-Peter Jansen; +Cc: Linux RAID, Sebastian Riemer [-- Attachment #1: Type: text/plain, Size: 9028 bytes --] On Mon, 04 Feb 2013 21:43:29 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote: > Am Mittwoch, 30. Januar 2013, 18:12:46 schrieb Hans-Peter Jansen: > > > > Hmm, according to mdadm from openSUSE:12.1:Update, the relevant fixes should > > be in place. It might be an unfortunate combination of this issue and the > > asynchronously applied updates, interfered by the *switching* behavior. > > > > I started with regenerating the initrds now, and a first reboot succeeded so > > far. Good. > > > > Will ask my friend to reboot the system a dozen times tonight. > > After a few reboots, the issue reappeared. I really believe now, that by > driving the md in degraded mode for some time and with the switching behavior, > just re-adding the devices resulted in unsynced raid1 devices. > > Next, my friend managed to create a nearby data disaster: I've explained him, > how he would be able to re-add a device himself. He did so on sunday with his > home partition, and since there appeared no progress bar in /proc/mdstat, he > immediately repeated the command. > > Neil, is it conceivable (due to a race or the like), that repeating to add > (re-add) a device potentially creates data salad, since that home-fs (xfs) > gone mad a few minutes later (firefox crashed, and couldn't be started, kmail > crashed, and so on (all those processes, that write to ~). He decided to > reboot, and that jailed him in the emergency recovery console, because /home > couldn't be mounted anymore. There was a bug prior to 2.6.37 (fixed by commit 1a855a0606653d2) which sounds vaguely related, but you seem to be running a 3.0.x kernel(?) so shouldn't be affected by that. Without logs of precisely what happened, it is very hard to guess. > Today, I hammered the raid1 partitions with "check". During one run, this > appeared in syslog: > > Feb 4 11:18:26 zaphkiel kernel: [11165.652478] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 > Feb 4 11:18:26 zaphkiel kernel: [11165.652486] ata2.00: irq_stat 0x40000008 > Feb 4 11:18:26 zaphkiel kernel: [11165.652495] ata2.00: failed command: READ FPDMA QUEUED > Feb 4 11:18:26 zaphkiel kernel: [11165.652510] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in > Feb 4 11:18:26 zaphkiel kernel: [11165.652513] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> > Feb 4 11:18:26 zaphkiel kernel: [11165.652520] ata2.00: status: { DRDY ERR } > Feb 4 11:18:26 zaphkiel kernel: [11165.652524] ata2.00: error: { UNC } > Feb 4 11:18:26 zaphkiel kernel: [11165.652876] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100) > Feb 4 11:18:26 zaphkiel kernel: [11165.652882] ata2.00: revalidation failed (errno=-5) > Feb 4 11:18:26 zaphkiel kernel: [11165.652890] ata2: hard resetting link > Feb 4 11:18:26 zaphkiel kernel: [11165.957043] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > Feb 4 11:18:26 zaphkiel kernel: [11165.969910] ata2.00: configured for UDMA/133 > Feb 4 11:18:26 zaphkiel kernel: [11165.970048] ata2: EH complete > Feb 4 11:18:28 zaphkiel kernel: [11167.949241] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 > Feb 4 11:18:28 zaphkiel kernel: [11167.949249] ata2.00: irq_stat 0x40000008 > Feb 4 11:18:28 zaphkiel kernel: [11167.949257] ata2.00: failed command: READ FPDMA QUEUED > Feb 4 11:18:28 zaphkiel kernel: [11167.949272] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in > Feb 4 11:18:28 zaphkiel kernel: [11167.949275] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> > Feb 4 11:18:28 zaphkiel kernel: [11167.949282] ata2.00: status: { DRDY ERR } > Feb 4 11:18:28 zaphkiel kernel: [11167.949287] ata2.00: error: { UNC } > Feb 4 11:18:28 zaphkiel kernel: [11167.962146] ata2.00: configured for UDMA/133 > Feb 4 11:18:28 zaphkiel kernel: [11167.962206] ata2: EH complete > Feb 4 11:18:30 zaphkiel kernel: [11169.898187] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 > Feb 4 11:18:30 zaphkiel kernel: [11169.898195] ata2.00: irq_stat 0x40000008 > Feb 4 11:18:30 zaphkiel kernel: [11169.898204] ata2.00: failed command: READ FPDMA QUEUED > Feb 4 11:18:30 zaphkiel kernel: [11169.898219] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in > Feb 4 11:18:30 zaphkiel kernel: [11169.898222] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> > Feb 4 11:18:30 zaphkiel kernel: [11169.898229] ata2.00: status: { DRDY ERR } > Feb 4 11:18:30 zaphkiel kernel: [11169.898234] ata2.00: error: { UNC } > Feb 4 11:18:30 zaphkiel kernel: [11169.912066] ata2.00: configured for UDMA/133 > Feb 4 11:18:30 zaphkiel kernel: [11169.912117] ata2: EH complete > Feb 4 11:18:32 zaphkiel kernel: [11171.905192] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 > Feb 4 11:18:32 zaphkiel kernel: [11171.905200] ata2.00: irq_stat 0x40000008 > Feb 4 11:18:32 zaphkiel kernel: [11171.905208] ata2.00: failed command: READ FPDMA QUEUED > Feb 4 11:18:32 zaphkiel kernel: [11171.905223] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in > Feb 4 11:18:32 zaphkiel kernel: [11171.905226] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> > Feb 4 11:18:32 zaphkiel kernel: [11171.905233] ata2.00: status: { DRDY ERR } > Feb 4 11:18:32 zaphkiel kernel: [11171.905238] ata2.00: error: { UNC } > Feb 4 11:18:32 zaphkiel kernel: [11171.919099] ata2.00: configured for UDMA/133 > Feb 4 11:18:32 zaphkiel kernel: [11171.919152] ata2: EH complete > Feb 4 11:18:34 zaphkiel kernel: [11173.912191] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 > Feb 4 11:18:34 zaphkiel kernel: [11173.912199] ata2.00: irq_stat 0x40000008 > Feb 4 11:18:34 zaphkiel kernel: [11173.912208] ata2.00: failed command: READ FPDMA QUEUED > Feb 4 11:18:34 zaphkiel kernel: [11173.912223] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in > Feb 4 11:18:34 zaphkiel kernel: [11173.912226] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> > Feb 4 11:18:34 zaphkiel kernel: [11173.912233] ata2.00: status: { DRDY ERR } > Feb 4 11:18:34 zaphkiel kernel: [11173.912238] ata2.00: error: { UNC } > Feb 4 11:18:34 zaphkiel kernel: [11173.925101] ata2.00: configured for UDMA/133 > Feb 4 11:18:34 zaphkiel kernel: [11173.925159] ata2: EH complete > Feb 4 11:18:36 zaphkiel kernel: [11175.861152] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 > Feb 4 11:18:36 zaphkiel kernel: [11175.861160] ata2.00: irq_stat 0x40000008 > Feb 4 11:18:36 zaphkiel kernel: [11175.861168] ata2.00: failed command: READ FPDMA QUEUED > Feb 4 11:18:36 zaphkiel kernel: [11175.861183] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in > Feb 4 11:18:36 zaphkiel kernel: [11175.861186] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F> > Feb 4 11:18:36 zaphkiel kernel: [11175.861193] ata2.00: status: { DRDY ERR } > Feb 4 11:18:36 zaphkiel kernel: [11175.861198] ata2.00: error: { UNC } > Feb 4 11:18:36 zaphkiel kernel: [11175.874052] ata2.00: configured for UDMA/133 > Feb 4 11:18:36 zaphkiel kernel: [11175.874103] sd 1:0:0:0: [sdb] Unhandled sense code > Feb 4 11:18:36 zaphkiel kernel: [11175.874109] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE > Feb 4 11:18:36 zaphkiel kernel: [11175.874117] sd 1:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor] > Feb 4 11:18:36 zaphkiel kernel: [11175.874125] Descriptor sense data with sense descriptors (in hex): > Feb 4 11:18:36 zaphkiel kernel: [11175.874130] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 > Feb 4 11:18:36 zaphkiel kernel: [11175.874145] 0c c2 ef 3f > Feb 4 11:18:36 zaphkiel kernel: [11175.874153] sd 1:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed > Feb 4 11:18:36 zaphkiel kernel: [11175.874163] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 0c c2 ef 12 00 00 80 00 > Feb 4 11:18:36 zaphkiel kernel: [11175.874180] end_request: I/O error, dev sdb, sector 214101823 > Feb 4 11:18:36 zaphkiel kernel: [11175.874234] ata2: EH complete > Feb 4 11:18:38 zaphkiel kernel: [11177.954091] md: md124: data-check done. > > This is a classical URE, isn't it? Interestingly, nonetheless, the raid1 check > run succeeded! (Not so good, is it?) What? Success not good? :-) md didn't report any errors. Maybe it didn't see any. Where is sector 214101823? > Last question: since I had to massage the system anyway, I've updated mdadm > from 3.2.2 to 3.2.6. I red, that it can be dangerous to do so, what do I risk > here? Where did you read that? If you find you need to re-create an array with "--create", a different mdadm might give a different result, which wouldn't be what you want. Otherwise it should be safe. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-02-05 3:44 ` NeilBrown @ 2013-02-27 17:01 ` Hans-Peter Jansen 2013-02-28 3:40 ` NeilBrown 0 siblings, 1 reply; 19+ messages in thread From: Hans-Peter Jansen @ 2013-02-27 17:01 UTC (permalink / raw) To: NeilBrown; +Cc: Linux RAID [I intentionally switched off line wrapping, since it renders the given information inedible] Hi Neil, sorry for the delay. Meanwhile, I updated to mdadm 3.2.6, and recreated the initrd. Since I do the recreation on a regular base, no data salad appeared, but the initial issue persists. I trigger checking the related md on each manual intervention. Hence approaching this again. That's a typical boot sequence: Feb 27 08:35:48 zaphkiel kernel: [ 0.000000] Kernel command line: PROFILE=ibk root=/dev/disk/by-uuid/782b24f2-26f6-4ea9-ab8b-d377cd2ac97b console=ttyS0,19200 console=tty0 resume=/dev/md127 splash=silent console=ttyS0,115200n8 vga=791 Feb 27 08:35:48 zaphkiel kernel: [ 6.816164] sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB) Feb 27 08:35:48 zaphkiel kernel: [ 6.862240] sd 1:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB) Feb 27 08:35:48 zaphkiel kernel: [ 6.983157] sd 0:0:0:0: [sda] Write Protect is off Feb 27 08:35:48 zaphkiel kernel: [ 7.011914] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 Feb 27 08:35:48 zaphkiel kernel: [ 7.011932] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Feb 27 08:35:48 zaphkiel kernel: [ 7.159045] sd 1:0:0:0: [sdb] Write Protect is off Feb 27 08:35:48 zaphkiel kernel: [ 7.187766] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00 Feb 27 08:35:48 zaphkiel kernel: [ 7.187783] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Feb 27 08:35:48 zaphkiel kernel: [ 7.319637] sda: sda1 sda2 sda3 sda4 Feb 27 08:35:48 zaphkiel kernel: [ 7.386469] sd 0:0:0:0: [sda] Attached SCSI disk Feb 27 08:35:48 zaphkiel kernel: [ 7.414335] sdb: sdb1 sdb2 sdb3 sdb4 Feb 27 08:35:48 zaphkiel kernel: [ 7.470597] sd 1:0:0:0: [sdb] Attached SCSI disk Feb 27 08:35:48 zaphkiel kernel: [ 9.676078] PM: Checking hibernation image partition /dev/md127 Feb 27 08:35:48 zaphkiel kernel: [ 9.956010] sd 6:0:0:0: [sdc] Attached SCSI removable disk Feb 27 08:35:48 zaphkiel kernel: [ 10.175875] sd 6:0:0:1: [sdd] Attached SCSI removable disk Feb 27 08:35:48 zaphkiel kernel: [ 10.262999] sd 6:0:0:2: [sde] Attached SCSI removable disk Feb 27 08:35:48 zaphkiel kernel: [ 10.299000] sd 6:0:0:3: [sdf] Attached SCSI removable disk Feb 27 08:35:48 zaphkiel kernel: [ 10.618022] md: raid1 personality registered for level 1 Feb 27 08:35:48 zaphkiel kernel: [ 11.139833] md: bind<sda3> Feb 27 08:35:48 zaphkiel kernel: [ 11.158438] md: bind<sdb3> Feb 27 08:35:48 zaphkiel kernel: [ 11.200897] md/raid1:md124: active with 2 out of 2 mirrors Feb 27 08:35:48 zaphkiel kernel: [ 11.233921] created bitmap (200 pages) for device md124 Feb 27 08:35:48 zaphkiel kernel: [ 11.276742] md124: bitmap initialized from disk: read 13/13 pages, set 0 of 409595 bits Feb 27 08:35:48 zaphkiel kernel: [ 11.356796] md124: detected capacity change from 0 to 107372728320 Feb 27 08:35:48 zaphkiel kernel: [ 11.395610] md124: unknown partition table Feb 27 08:35:48 zaphkiel kernel: [ 11.543579] md: raid0 personality registered for level 0 Feb 27 08:35:48 zaphkiel kernel: [ 11.583994] md: raid10 personality registered for level 10 Feb 27 08:35:48 zaphkiel kernel: [ 12.188576] md: raid6 personality registered for level 6 Feb 27 08:35:48 zaphkiel kernel: [ 12.220473] md: raid5 personality registered for level 5 Feb 27 08:35:48 zaphkiel kernel: [ 12.252320] md: raid4 personality registered for level 4 Feb 27 08:35:48 zaphkiel kernel: [ 12.318865] XFS (md124): Mounting Filesystem Feb 27 08:35:48 zaphkiel kernel: [ 12.535666] XFS (md124): Ending clean mount Feb 27 08:35:48 zaphkiel kernel: [ 22.720866] sd 0:0:0:0: Attached scsi generic sg0 type 0 Feb 27 08:35:48 zaphkiel kernel: [ 23.126126] sd 1:0:0:0: Attached scsi generic sg1 type 0 Feb 27 08:35:48 zaphkiel kernel: [ 23.203876] sd 6:0:0:0: Attached scsi generic sg2 type 0 Feb 27 08:35:48 zaphkiel kernel: [ 23.402184] sd 6:0:0:1: Attached scsi generic sg3 type 0 Feb 27 08:35:48 zaphkiel kernel: [ 23.725330] sd 6:0:0:2: Attached scsi generic sg4 type 0 Feb 27 08:35:48 zaphkiel kernel: [ 23.917899] sd 6:0:0:3: Attached scsi generic sg5 type 0 Feb 27 08:35:48 zaphkiel kernel: [ 24.492587] md: bind<sda2> Feb 27 08:35:48 zaphkiel kernel: [ 24.711804] md: bind<sda1> Feb 27 08:35:48 zaphkiel kernel: [ 24.821071] md: bind<sda4> Feb 27 08:35:48 zaphkiel kernel: [ 25.607192] md: bind<sdb2> Feb 27 08:35:48 zaphkiel kernel: [ 25.672335] md/raid1:md1: active with 2 out of 2 mirrors Feb 27 08:35:48 zaphkiel kernel: [ 25.723678] created bitmap (8 pages) for device md1 Feb 27 08:35:48 zaphkiel kernel: [ 25.771764] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 16379 bits Feb 27 08:35:48 zaphkiel kernel: [ 25.854963] md/raid1:md0: active with 1 out of 2 mirrors Feb 27 08:35:48 zaphkiel kernel: [ 25.886903] md1: detected capacity change from 0 to 2146783232 Feb 27 08:35:48 zaphkiel kernel: [ 25.929142] created bitmap (6 pages) for device md0 Feb 27 08:35:48 zaphkiel kernel: [ 25.958684] md1: unknown partition table Feb 27 08:35:48 zaphkiel kernel: [ 25.995737] md: bind<sdb4> Feb 27 08:35:48 zaphkiel kernel: [ 26.012500] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 12047 bits Feb 27 08:35:48 zaphkiel kernel: [ 26.077270] md0: detected capacity change from 0 to 98689024 Feb 27 08:35:48 zaphkiel kernel: [ 26.168097] md/raid1:md3: active with 2 out of 2 mirrors Feb 27 08:35:48 zaphkiel kernel: [ 26.205258] md0: unknown partition table Feb 27 08:35:48 zaphkiel kernel: [ 26.242295] created bitmap (415 pages) for device md3 Feb 27 08:35:48 zaphkiel kernel: [ 26.304271] md3: bitmap initialized from disk: read 26/26 pages, set 1465 of 849320 bits Feb 27 08:35:48 zaphkiel kernel: [ 26.392578] md3: detected capacity change from 0 to 890575601664 Feb 27 08:35:48 zaphkiel kernel: [ 26.575527] md3: unknown partition table Feb 27 08:35:48 zaphkiel kernel: [ 26.763091] Adding 2096464k swap on /dev/md1. Priority:0 extents:1 across:2096464k Feb 27 08:35:48 zaphkiel kernel: [ 29.268318] XFS (md3): Mounting Filesystem Feb 27 08:35:48 zaphkiel kernel: [ 29.521844] EXT4-fs (md0): mounting ext2 file system using the ext4 subsystem Feb 27 08:35:48 zaphkiel kernel: [ 29.684176] EXT4-fs (md0): mounted filesystem without journal. Opts: acl,user_xattr Feb 27 08:35:48 zaphkiel kernel: [ 30.356255] XFS (md3): Ending clean mount Today, /dev/sdb1 of /dev/md0 vanished. Re-adding and checking doesn't revealed any error, but it would be cool to get to the bottom of this, as re-adding some lost partitions to the md every second day *is* a real PITA. What could I do to help resolving this issue? Thanks in advance, Pete Usual system info (let me know, if something is missing): # cat /etc/SuSE-release openSUSE 12.1 (i586) VERSION = 12.1 CODENAME = Asparagus # uname -a (vendor kernel) Linux zaphkiel 3.1.10-1.16-desktop #1 SMP PREEMPT Wed Jun 27 05:21:40 UTC 2012 (d016078) i686 athlon i386 GNU/Linux # LANG=C fdisk -l /dev/sd{a,b} Disk /dev/sda: 1000.2 GB, 1000204886016 bytes 255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x000c39ea Device Boot Start End Blocks Id System /dev/sda1 * 16065 208844 96390 fd Linux raid autodetect /dev/sda2 208845 4401809 2096482+ fd Linux raid autodetect /dev/sda3 4401810 214114319 104856255 fd Linux raid autodetect /dev/sda4 214114320 1953520064 869702872+ fd Linux raid autodetect Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes 255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x0005bc3f Device Boot Start End Blocks Id System /dev/sdb1 * 16065 208844 96390 fd Linux raid autodetect /dev/sdb2 208845 4401809 2096482+ fd Linux raid autodetect /dev/sdb3 4401810 214114319 104856255 fd Linux raid autodetect /dev/sdb4 214114320 1953520064 869702872+ fd Linux raid autodetect cat /proc/mdstat Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] md3 : active raid1 sdb4[1] sda4[2] 869702736 blocks super 1.0 [2/2] [UU] bitmap: 14/415 pages [56KB], 1024KB chunk md0 : active raid1 sdb1[1] sda1[0] 96376 blocks super 1.0 [2/2] [UU] bitmap: 0/6 pages [0KB], 8KB chunk md1 : active raid1 sdb2[1] sda2[0] 2096468 blocks super 1.0 [2/2] [UU] bitmap: 0/8 pages [0KB], 128KB chunk md124 : active raid1 sdb3[1] sda3[0] 104856180 blocks super 1.0 [2/2] [UU] bitmap: 6/200 pages [24KB], 256KB chunk unused devices: <none> # cat /etc/mdadm.conf ARRAY /dev/md124 metadata=1.0 name=zaphkiel:2 UUID=4b91a798:7385d1ea:14128dab:9b09575e ARRAY /dev/md/0 metadata=1.0 name=zaphkiel:0 UUID=a06939cd:1261d1ce:76eaa74c:db8ba5f9 ARRAY /dev/md/3 metadata=1.0 name=zaphkiel:3 UUID=8f2d01ee:035536bc:d23a6cf2:f0d863b8 ARRAY /dev/md1 metadata=1.0 name=zaphkiel:1 UUID=d6b3eadd:297b01f0:33860c88:9dc851cc MAILADDR root Note the silly naming.. # mdadm -Evvs /dev/sdb4: Magic : a92b4efc Version : 1.0 Feature Map : 0x1 Array UUID : 8f2d01ee:035536bc:d23a6cf2:f0d863b8 Name : zaphkiel:3 (local to host zaphkiel) Creation Time : Thu Dec 31 14:04:04 2009 Raid Level : raid1 Raid Devices : 2 Avail Dev Size : 1739405472 (829.41 GiB 890.58 GB) Array Size : 869702736 (829.41 GiB 890.58 GB) Super Offset : 1739405728 sectors State : clean Device UUID : 373b5269:1f407175:abd8f4f7:d8cba089 Internal Bitmap : -208 sectors from superblock Update Time : Wed Feb 27 17:18:21 2013 Checksum : 6957806 - correct Events : 8320716 Device Role : Active device 1 Array State : AA ('A' == active, '.' == missing) /dev/sdb3: Magic : a92b4efc Version : 1.0 Feature Map : 0x1 Array UUID : 4b91a798:7385d1ea:14128dab:9b09575e Name : zaphkiel:2 (local to host zaphkiel) Creation Time : Thu Dec 31 14:04:03 2009 Raid Level : raid1 Raid Devices : 2 Avail Dev Size : 209712360 (100.00 GiB 107.37 GB) Array Size : 104856180 (100.00 GiB 107.37 GB) Super Offset : 209712488 sectors State : active Device UUID : 42c13439:d288d793:f2468bba:19adbfd2 Internal Bitmap : -101 sectors from superblock Update Time : Wed Feb 27 17:18:21 2013 Checksum : 26e2aafc - correct Events : 15267 Device Role : Active device 1 Array State : AA ('A' == active, '.' == missing) /dev/sdb2: Magic : a92b4efc Version : 1.0 Feature Map : 0x1 Array UUID : d6b3eadd:297b01f0:33860c88:9dc851cc Name : zaphkiel:1 (local to host zaphkiel) Creation Time : Thu Dec 31 14:04:02 2009 Raid Level : raid1 Raid Devices : 2 Avail Dev Size : 4192936 (2047.68 MiB 2146.78 MB) Array Size : 2096468 (2047.68 MiB 2146.78 MB) Super Offset : 4192944 sectors State : clean Device UUID : dfd7f969:e9ce7838:8de84f6b:0b04acb5 Internal Bitmap : 2 sectors from superblock Update Time : Wed Feb 27 17:15:03 2013 Checksum : 272647 - correct Events : 1764 Device Role : Active device 1 Array State : AA ('A' == active, '.' == missing) /dev/sdb1: Magic : a92b4efc Version : 1.0 Feature Map : 0x1 Array UUID : a06939cd:1261d1ce:76eaa74c:db8ba5f9 Name : zaphkiel:0 (local to host zaphkiel) Creation Time : Thu Dec 31 14:04:01 2009 Raid Level : raid1 Raid Devices : 2 Avail Dev Size : 192752 (94.13 MiB 98.69 MB) Array Size : 96376 (94.13 MiB 98.69 MB) Super Offset : 192760 sectors State : clean Device UUID : b8cd1539:9f294a42:710a79e1:c40a4ce0 Internal Bitmap : 2 sectors from superblock Update Time : Wed Feb 27 16:18:25 2013 Checksum : 39343a85 - correct Events : 2609 Device Role : Active device 1 Array State : AA ('A' == active, '.' == missing) /dev/sdb: MBR Magic : aa55 Partition[0] : 192780 sectors at 16065 (type fd) Partition[1] : 4192965 sectors at 208845 (type fd) Partition[2] : 209712510 sectors at 4401810 (type fd) Partition[3] : 1739405745 sectors at 214114320 (type fd) /dev/sda4: Magic : a92b4efc Version : 1.0 Feature Map : 0x1 Array UUID : 8f2d01ee:035536bc:d23a6cf2:f0d863b8 Name : zaphkiel:3 (local to host zaphkiel) Creation Time : Thu Dec 31 14:04:04 2009 Raid Level : raid1 Raid Devices : 2 Avail Dev Size : 1739405472 (829.41 GiB 890.58 GB) Array Size : 869702736 (829.41 GiB 890.58 GB) Super Offset : 1739405728 sectors State : clean Device UUID : 0b568d2a:c8e5a8e5:269188ba:22367205 Internal Bitmap : -208 sectors from superblock Update Time : Wed Feb 27 17:18:21 2013 Checksum : 766d5b48 - correct Events : 8320716 Device Role : Active device 0 Array State : AA ('A' == active, '.' == missing) /dev/sda3: Magic : a92b4efc Version : 1.0 Feature Map : 0x1 Array UUID : 4b91a798:7385d1ea:14128dab:9b09575e Name : zaphkiel:2 (local to host zaphkiel) Creation Time : Thu Dec 31 14:04:03 2009 Raid Level : raid1 Raid Devices : 2 Avail Dev Size : 209712360 (100.00 GiB 107.37 GB) Array Size : 104856180 (100.00 GiB 107.37 GB) Super Offset : 209712488 sectors State : active Device UUID : 0d0a5a95:cc454e60:fcea6dd4:ea6aabcd Internal Bitmap : -101 sectors from superblock Update Time : Wed Feb 27 17:18:21 2013 Checksum : 644d129b - correct Events : 15267 Device Role : Active device 0 Array State : AA ('A' == active, '.' == missing) /dev/sda2: Magic : a92b4efc Version : 1.0 Feature Map : 0x1 Array UUID : d6b3eadd:297b01f0:33860c88:9dc851cc Name : zaphkiel:1 (local to host zaphkiel) Creation Time : Thu Dec 31 14:04:02 2009 Raid Level : raid1 Raid Devices : 2 Avail Dev Size : 4192936 (2047.68 MiB 2146.78 MB) Array Size : 2096468 (2047.68 MiB 2146.78 MB) Super Offset : 4192944 sectors State : clean Device UUID : 4c169220:335d3afc:d26ca91a:3ff38cd1 Internal Bitmap : 2 sectors from superblock Update Time : Wed Feb 27 17:15:03 2013 Checksum : 45bb6676 - correct Events : 1764 Device Role : Active device 0 Array State : AA ('A' == active, '.' == missing) /dev/sda1: Magic : a92b4efc Version : 1.0 Feature Map : 0x1 Array UUID : a06939cd:1261d1ce:76eaa74c:db8ba5f9 Name : zaphkiel:0 (local to host zaphkiel) Creation Time : Thu Dec 31 14:04:01 2009 Raid Level : raid1 Raid Devices : 2 Avail Dev Size : 192752 (94.13 MiB 98.69 MB) Array Size : 96376 (94.13 MiB 98.69 MB) Super Offset : 192760 sectors State : clean Device UUID : 7bc0a6ec:81f9494d:b5e55c6a:961bbccd Internal Bitmap : 2 sectors from superblock Update Time : Wed Feb 27 16:18:25 2013 Checksum : 6e18e93f - correct Events : 2609 Device Role : Active device 0 Array State : AA ('A' == active, '.' == missing) /dev/sda: MBR Magic : aa55 Partition[0] : 192780 sectors at 16065 (type fd) Partition[1] : 4192965 sectors at 208845 (type fd) Partition[2] : 209712510 sectors at 4401810 (type fd) Partition[3] : 1739405745 sectors at 214114320 (type fd) ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-02-27 17:01 ` Hans-Peter Jansen @ 2013-02-28 3:40 ` NeilBrown 2013-02-28 10:49 ` Hans-Peter Jansen 0 siblings, 1 reply; 19+ messages in thread From: NeilBrown @ 2013-02-28 3:40 UTC (permalink / raw) To: Hans-Peter Jansen; +Cc: Linux RAID [-- Attachment #1: Type: text/plain, Size: 9617 bytes --] On Wed, 27 Feb 2013 18:01:29 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote: > [I intentionally switched off line wrapping, since it renders the given information > inedible] > > Hi Neil, > > sorry for the delay. Meanwhile, I updated to mdadm 3.2.6, and recreated the initrd. > Since I do the recreation on a regular base, no data salad appeared, but the initial > issue persists. I trigger checking the related md on each manual intervention. > > Hence approaching this again. > > That's a typical boot sequence: BTW, you aren't the only one to hit this issue .... I think I'm starting to get closer to the problem. > > Feb 27 08:35:48 zaphkiel kernel: [ 0.000000] Kernel command line: PROFILE=ibk root=/dev/disk/by-uuid/782b24f2-26f6-4ea9-ab8b-d377cd2ac97b console=ttyS0,19200 console=tty0 resume=/dev/md127 splash=silent console=ttyS0,115200n8 vga=791 > Feb 27 08:35:48 zaphkiel kernel: [ 6.816164] sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB) > Feb 27 08:35:48 zaphkiel kernel: [ 6.862240] sd 1:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB) > Feb 27 08:35:48 zaphkiel kernel: [ 6.983157] sd 0:0:0:0: [sda] Write Protect is off > Feb 27 08:35:48 zaphkiel kernel: [ 7.011914] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 > Feb 27 08:35:48 zaphkiel kernel: [ 7.011932] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA > Feb 27 08:35:48 zaphkiel kernel: [ 7.159045] sd 1:0:0:0: [sdb] Write Protect is off > Feb 27 08:35:48 zaphkiel kernel: [ 7.187766] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00 > Feb 27 08:35:48 zaphkiel kernel: [ 7.187783] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA > Feb 27 08:35:48 zaphkiel kernel: [ 7.319637] sda: sda1 sda2 sda3 sda4 > Feb 27 08:35:48 zaphkiel kernel: [ 7.386469] sd 0:0:0:0: [sda] Attached SCSI disk > Feb 27 08:35:48 zaphkiel kernel: [ 7.414335] sdb: sdb1 sdb2 sdb3 sdb4 > Feb 27 08:35:48 zaphkiel kernel: [ 7.470597] sd 1:0:0:0: [sdb] Attached SCSI disk > Feb 27 08:35:48 zaphkiel kernel: [ 9.676078] PM: Checking hibernation image partition /dev/md127 > Feb 27 08:35:48 zaphkiel kernel: [ 9.956010] sd 6:0:0:0: [sdc] Attached SCSI removable disk > Feb 27 08:35:48 zaphkiel kernel: [ 10.175875] sd 6:0:0:1: [sdd] Attached SCSI removable disk > Feb 27 08:35:48 zaphkiel kernel: [ 10.262999] sd 6:0:0:2: [sde] Attached SCSI removable disk > Feb 27 08:35:48 zaphkiel kernel: [ 10.299000] sd 6:0:0:3: [sdf] Attached SCSI removable disk > Feb 27 08:35:48 zaphkiel kernel: [ 10.618022] md: raid1 personality registered for level 1 > Feb 27 08:35:48 zaphkiel kernel: [ 11.139833] md: bind<sda3> > Feb 27 08:35:48 zaphkiel kernel: [ 11.158438] md: bind<sdb3> > Feb 27 08:35:48 zaphkiel kernel: [ 11.200897] md/raid1:md124: active with 2 out of 2 mirrors Presumably this is md124 being assembled in the initrd as it contains the root filesystem. > Feb 27 08:35:48 zaphkiel kernel: [ 11.233921] created bitmap (200 pages) for device md124 > Feb 27 08:35:48 zaphkiel kernel: [ 11.276742] md124: bitmap initialized from disk: read 13/13 pages, set 0 of 409595 bits > Feb 27 08:35:48 zaphkiel kernel: [ 11.356796] md124: detected capacity change from 0 to 107372728320 > Feb 27 08:35:48 zaphkiel kernel: [ 11.395610] md124: unknown partition table > Feb 27 08:35:48 zaphkiel kernel: [ 11.543579] md: raid0 personality registered for level 0 > Feb 27 08:35:48 zaphkiel kernel: [ 11.583994] md: raid10 personality registered for level 10 > Feb 27 08:35:48 zaphkiel kernel: [ 12.188576] md: raid6 personality registered for level 6 > Feb 27 08:35:48 zaphkiel kernel: [ 12.220473] md: raid5 personality registered for level 5 > Feb 27 08:35:48 zaphkiel kernel: [ 12.252320] md: raid4 personality registered for level 4 > Feb 27 08:35:48 zaphkiel kernel: [ 12.318865] XFS (md124): Mounting Filesystem > Feb 27 08:35:48 zaphkiel kernel: [ 12.535666] XFS (md124): Ending clean mount And now XFS has mounted that, so presumably about there the boot sequence switches from initrd to real-root. You have obviously removed some lines, presumably because you thought brevity would help. For future reference: it doesn't. I much prefer to see complete logs so I can correlate various parts of the sequence and be sure I understand what I'm seeing. But I probably have enough here. > Feb 27 08:35:48 zaphkiel kernel: [ 22.720866] sd 0:0:0:0: Attached scsi generic sg0 type 0 > Feb 27 08:35:48 zaphkiel kernel: [ 23.126126] sd 1:0:0:0: Attached scsi generic sg1 type 0 > Feb 27 08:35:48 zaphkiel kernel: [ 23.203876] sd 6:0:0:0: Attached scsi generic sg2 type 0 > Feb 27 08:35:48 zaphkiel kernel: [ 23.402184] sd 6:0:0:1: Attached scsi generic sg3 type 0 > Feb 27 08:35:48 zaphkiel kernel: [ 23.725330] sd 6:0:0:2: Attached scsi generic sg4 type 0 > Feb 27 08:35:48 zaphkiel kernel: [ 23.917899] sd 6:0:0:3: Attached scsi generic sg5 type 0 > Feb 27 08:35:48 zaphkiel kernel: [ 24.492587] md: bind<sda2> > Feb 27 08:35:48 zaphkiel kernel: [ 24.711804] md: bind<sda1> > Feb 27 08:35:48 zaphkiel kernel: [ 24.821071] md: bind<sda4> These are parts of 3 different arrays that are starting to be assembled in parallel. That must be udev running "mdadm -I /dev/sdaX" which will partly assemble each array. when the 'sdb' bits come along "mdadm -I /dev/sdbX" should complete the assembly. > Feb 27 08:35:48 zaphkiel kernel: [ 25.607192] md: bind<sdb2> > Feb 27 08:35:48 zaphkiel kernel: [ 25.672335] md/raid1:md1: active with 2 out of 2 mirrors > Feb 27 08:35:48 zaphkiel kernel: [ 25.723678] created bitmap (8 pages) for device md1 > Feb 27 08:35:48 zaphkiel kernel: [ 25.771764] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 16379 bits And there is the first one: mdadm -I /dev/sdb2 must have been run by udev so md1 got fully assembled. > Feb 27 08:35:48 zaphkiel kernel: [ 25.854963] md/raid1:md0: active with 1 out of 2 mirrors But now md0 got assembled with waiting for sdb1. This must have been the mdadm -IRs in /etc/init.d/boot.md. The purpose of this is to find any partially assembled arrays and complete the assembly. This is needed if, for example, sdb has failed complete and would never be found. However it shouldn't happen yet. It should wait until all the devices that have been discovered have been processed. A little early in the boot.md file you will see: if [ "$MDADM_DEVICE_TIMEOUT" -gt 0 ] ; then /sbin/udevadm settle --timeout="$MDADM_DEVICE_TIMEOUT" fi so unless you have MDADM_DEVICE_TIMEOUT=0 in /etc/sysconfig/mdadm (which I very much doubt) it should have already waited. But there is no sign that it did. So maybe "udevadm settle" is broken? > Feb 27 08:35:48 zaphkiel kernel: [ 25.886903] md1: detected capacity change from 0 to 2146783232 > Feb 27 08:35:48 zaphkiel kernel: [ 25.929142] created bitmap (6 pages) for device md0 > Feb 27 08:35:48 zaphkiel kernel: [ 25.958684] md1: unknown partition table > Feb 27 08:35:48 zaphkiel kernel: [ 25.995737] md: bind<sdb4> > Feb 27 08:35:48 zaphkiel kernel: [ 26.012500] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 12047 bits > Feb 27 08:35:48 zaphkiel kernel: [ 26.077270] md0: detected capacity change from 0 to 98689024 > Feb 27 08:35:48 zaphkiel kernel: [ 26.168097] md/raid1:md3: active with 2 out of 2 mirrors And here comes sdb4, so "mdadm -I /dev/sdb4" must have been run by udev. And it got in before the "mdadm -IRs" managed to find that md0 was incomplete, and started it. > Feb 27 08:35:48 zaphkiel kernel: [ 26.205258] md0: unknown partition table > Feb 27 08:35:48 zaphkiel kernel: [ 26.242295] created bitmap (415 pages) for device md3 > Feb 27 08:35:48 zaphkiel kernel: [ 26.304271] md3: bitmap initialized from disk: read 26/26 pages, set 1465 of 849320 bits > Feb 27 08:35:48 zaphkiel kernel: [ 26.392578] md3: detected capacity change from 0 to 890575601664 > Feb 27 08:35:48 zaphkiel kernel: [ 26.575527] md3: unknown partition table > Feb 27 08:35:48 zaphkiel kernel: [ 26.763091] Adding 2096464k swap on /dev/md1. Priority:0 extents:1 across:2096464k > Feb 27 08:35:48 zaphkiel kernel: [ 29.268318] XFS (md3): Mounting Filesystem > Feb 27 08:35:48 zaphkiel kernel: [ 29.521844] EXT4-fs (md0): mounting ext2 file system using the ext4 subsystem > Feb 27 08:35:48 zaphkiel kernel: [ 29.684176] EXT4-fs (md0): mounted filesystem without journal. Opts: acl,user_xattr > Feb 27 08:35:48 zaphkiel kernel: [ 30.356255] XFS (md3): Ending clean mount > > Today, /dev/sdb1 of /dev/md0 vanished. Re-adding and checking doesn't revealed > any error, but it would be cool to get to the bottom of this, as re-adding some > lost partitions to the md every second day *is* a real PITA. > > What could I do to help resolving this issue? It really feels like a udev bug, but it is hard to be sure. Could you edit /etc/init.d/boot.md and add echo BEFORE mdadm -IRs > /dev/kmsg just before the call the "$mdadm_BIN -IRs", echo AFTER mdadm -IRs > /dev/kmsg just after that call, and echo AFTER mdadm -Asc > /dev/mksg just after the call to "$mdadm_BIN -A -s -c $mdadm_CONFIG" And maybe put similar messages just before and after the /sbin/udevadm settle .... call. If you can then reproduce the problem, the extra logging might tell us something useful. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-02-28 3:40 ` NeilBrown @ 2013-02-28 10:49 ` Hans-Peter Jansen 2013-02-28 21:25 ` NeilBrown 0 siblings, 1 reply; 19+ messages in thread From: Hans-Peter Jansen @ 2013-02-28 10:49 UTC (permalink / raw) To: NeilBrown; +Cc: Linux RAID Am Donnerstag, 28. Februar 2013, 14:40:13 schrieb NeilBrown: > On Wed, 27 Feb 2013 18:01:29 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote: > > > [I intentionally switched off line wrapping, since it renders the given information > > inedible] > > > > Hi Neil, > > > > sorry for the delay. Meanwhile, I updated to mdadm 3.2.6, and recreated the initrd. > > Since I do the recreation on a regular base, no data salad appeared, but the initial > > issue persists. I trigger checking the related md on each manual intervention. > > > > Hence approaching this again. > > > > That's a typical boot sequence: > > BTW, you aren't the only one to hit this issue .... I think I'm starting to > get closer to the problem. Nice to hear, will do my best to assist. > > > > > Feb 27 08:35:48 zaphkiel kernel: [ 0.000000] Kernel command line: PROFILE=ibk root=/dev/disk/by-uuid/782b24f2-26f6-4ea9-ab8b-d377cd2ac97b console=ttyS0,19200 console=tty0 resume=/dev/md127 splash=silent console=ttyS0,115200n8 vga=791 > > Feb 27 08:35:48 zaphkiel kernel: [ 6.816164] sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB) > > Feb 27 08:35:48 zaphkiel kernel: [ 6.862240] sd 1:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB) > > Feb 27 08:35:48 zaphkiel kernel: [ 6.983157] sd 0:0:0:0: [sda] Write Protect is off > > Feb 27 08:35:48 zaphkiel kernel: [ 7.011914] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 > > Feb 27 08:35:48 zaphkiel kernel: [ 7.011932] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA > > Feb 27 08:35:48 zaphkiel kernel: [ 7.159045] sd 1:0:0:0: [sdb] Write Protect is off > > Feb 27 08:35:48 zaphkiel kernel: [ 7.187766] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00 > > Feb 27 08:35:48 zaphkiel kernel: [ 7.187783] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA > > Feb 27 08:35:48 zaphkiel kernel: [ 7.319637] sda: sda1 sda2 sda3 sda4 > > Feb 27 08:35:48 zaphkiel kernel: [ 7.386469] sd 0:0:0:0: [sda] Attached SCSI disk > > Feb 27 08:35:48 zaphkiel kernel: [ 7.414335] sdb: sdb1 sdb2 sdb3 sdb4 > > Feb 27 08:35:48 zaphkiel kernel: [ 7.470597] sd 1:0:0:0: [sdb] Attached SCSI disk > > Feb 27 08:35:48 zaphkiel kernel: [ 9.676078] PM: Checking hibernation image partition /dev/md127 > > Feb 27 08:35:48 zaphkiel kernel: [ 9.956010] sd 6:0:0:0: [sdc] Attached SCSI removable disk > > Feb 27 08:35:48 zaphkiel kernel: [ 10.175875] sd 6:0:0:1: [sdd] Attached SCSI removable disk > > Feb 27 08:35:48 zaphkiel kernel: [ 10.262999] sd 6:0:0:2: [sde] Attached SCSI removable disk > > Feb 27 08:35:48 zaphkiel kernel: [ 10.299000] sd 6:0:0:3: [sdf] Attached SCSI removable disk > > Feb 27 08:35:48 zaphkiel kernel: [ 10.618022] md: raid1 personality registered for level 1 > > Feb 27 08:35:48 zaphkiel kernel: [ 11.139833] md: bind<sda3> > > Feb 27 08:35:48 zaphkiel kernel: [ 11.158438] md: bind<sdb3> > > Feb 27 08:35:48 zaphkiel kernel: [ 11.200897] md/raid1:md124: active with 2 out of 2 mirrors > > Presumably this is md124 being assembled in the initrd as it contains the > root filesystem. Correct. > > > Feb 27 08:35:48 zaphkiel kernel: [ 11.233921] created bitmap (200 pages) for device md124 > > Feb 27 08:35:48 zaphkiel kernel: [ 11.276742] md124: bitmap initialized from disk: read 13/13 pages, set 0 of 409595 bits > > Feb 27 08:35:48 zaphkiel kernel: [ 11.356796] md124: detected capacity change from 0 to 107372728320 > > Feb 27 08:35:48 zaphkiel kernel: [ 11.395610] md124: unknown partition table > > Feb 27 08:35:48 zaphkiel kernel: [ 11.543579] md: raid0 personality registered for level 0 > > Feb 27 08:35:48 zaphkiel kernel: [ 11.583994] md: raid10 personality registered for level 10 > > Feb 27 08:35:48 zaphkiel kernel: [ 12.188576] md: raid6 personality registered for level 6 > > Feb 27 08:35:48 zaphkiel kernel: [ 12.220473] md: raid5 personality registered for level 5 > > Feb 27 08:35:48 zaphkiel kernel: [ 12.252320] md: raid4 personality registered for level 4 > > Feb 27 08:35:48 zaphkiel kernel: [ 12.318865] XFS (md124): Mounting Filesystem > > Feb 27 08:35:48 zaphkiel kernel: [ 12.535666] XFS (md124): Ending clean mount > > And now XFS has mounted that, so presumably about there the boot sequence > switches from initrd to real-root. You have obviously removed some lines, > presumably because you thought brevity would help. For future reference: it > doesn't. I much prefer to see complete logs so I can correlate various parts > of the sequence and be sure I understand what I'm seeing. But I probably > have enough here. Sorry, I grepped for 'md|sd' and removed some lines triggered from pata_amd, which doesn't have any business in this scenario anyway (all sata). Will remove this dangling driver from initrd later. We're leaving initrd here. > > Feb 27 08:35:48 zaphkiel kernel: [ 22.720866] sd 0:0:0:0: Attached scsi generic sg0 type 0 > > Feb 27 08:35:48 zaphkiel kernel: [ 23.126126] sd 1:0:0:0: Attached scsi generic sg1 type 0 > > Feb 27 08:35:48 zaphkiel kernel: [ 23.203876] sd 6:0:0:0: Attached scsi generic sg2 type 0 > > Feb 27 08:35:48 zaphkiel kernel: [ 23.402184] sd 6:0:0:1: Attached scsi generic sg3 type 0 > > Feb 27 08:35:48 zaphkiel kernel: [ 23.725330] sd 6:0:0:2: Attached scsi generic sg4 type 0 > > Feb 27 08:35:48 zaphkiel kernel: [ 23.917899] sd 6:0:0:3: Attached scsi generic sg5 type 0 > > Feb 27 08:35:48 zaphkiel kernel: [ 24.492587] md: bind<sda2> > > Feb 27 08:35:48 zaphkiel kernel: [ 24.711804] md: bind<sda1> > > Feb 27 08:35:48 zaphkiel kernel: [ 24.821071] md: bind<sda4> > > These are parts of 3 different arrays that are starting to be assembled in > parallel. That must be udev running "mdadm -I /dev/sdaX" which will partly > assemble each array. when the 'sdb' bits come along "mdadm -I /dev/sdbX" > should complete the assembly. Yes, exactly. > > > Feb 27 08:35:48 zaphkiel kernel: [ 25.607192] md: bind<sdb2> > > Feb 27 08:35:48 zaphkiel kernel: [ 25.672335] md/raid1:md1: active with 2 out of 2 mirrors > > Feb 27 08:35:48 zaphkiel kernel: [ 25.723678] created bitmap (8 pages) for device md1 > > Feb 27 08:35:48 zaphkiel kernel: [ 25.771764] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 16379 bits > > And there is the first one: mdadm -I /dev/sdb2 must have been run by udev so > md1 got fully assembled. > > > > Feb 27 08:35:48 zaphkiel kernel: [ 25.854963] md/raid1:md0: active with 1 out of 2 mirrors > > But now md0 got assembled with waiting for sdb1. This must have been the > mdadm -IRs > in /etc/init.d/boot.md. The purpose of this is to find any partially > assembled arrays and complete the assembly. This is needed if, for example, > sdb has failed complete and would never be found. However it shouldn't > happen yet. It should wait until all the devices that have been discovered > have been processed. > A little early in the boot.md file you will see: > if [ "$MDADM_DEVICE_TIMEOUT" -gt 0 ] ; then > /sbin/udevadm settle --timeout="$MDADM_DEVICE_TIMEOUT" > fi > > so unless you have MDADM_DEVICE_TIMEOUT=0 in /etc/sysconfig/mdadm (which I > very much doubt) it should have already waited. But there is no sign that it > did. So maybe "udevadm settle" is broken? MDADM_DEVICE_TIMEOUT="60" As far as I can see, udevadm settle doesn't pick up all devices under some circumstances. The man page says, it waits up to --timeout seconds for the event queue to become empty. Obviously, the event queue must be incomplete then, given, that one md is assembled already in initrd (12 seconds ago), and all other parts physically exist since then. Probably, the events got eaten already in initrd already, given that timing. > > > Feb 27 08:35:48 zaphkiel kernel: [ 25.886903] md1: detected capacity change from 0 to 2146783232 > > Feb 27 08:35:48 zaphkiel kernel: [ 25.929142] created bitmap (6 pages) for device md0 > > Feb 27 08:35:48 zaphkiel kernel: [ 25.958684] md1: unknown partition table > > Feb 27 08:35:48 zaphkiel kernel: [ 25.995737] md: bind<sdb4> > > Feb 27 08:35:48 zaphkiel kernel: [ 26.012500] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 12047 bits > > Feb 27 08:35:48 zaphkiel kernel: [ 26.077270] md0: detected capacity change from 0 to 98689024 > > Feb 27 08:35:48 zaphkiel kernel: [ 26.168097] md/raid1:md3: active with 2 out of 2 mirrors > > And here comes sdb4, so "mdadm -I /dev/sdb4" must have been run by udev. And > it got in before the "mdadm -IRs" managed to find that md0 was incomplete, > and started it. > > > > Feb 27 08:35:48 zaphkiel kernel: [ 26.205258] md0: unknown partition table > > Feb 27 08:35:48 zaphkiel kernel: [ 26.242295] created bitmap (415 pages) for device md3 > > Feb 27 08:35:48 zaphkiel kernel: [ 26.304271] md3: bitmap initialized from disk: read 26/26 pages, set 1465 of 849320 bits > > Feb 27 08:35:48 zaphkiel kernel: [ 26.392578] md3: detected capacity change from 0 to 890575601664 > > Feb 27 08:35:48 zaphkiel kernel: [ 26.575527] md3: unknown partition table > > Feb 27 08:35:48 zaphkiel kernel: [ 26.763091] Adding 2096464k swap on /dev/md1. Priority:0 extents:1 across:2096464k > > Feb 27 08:35:48 zaphkiel kernel: [ 29.268318] XFS (md3): Mounting Filesystem > > Feb 27 08:35:48 zaphkiel kernel: [ 29.521844] EXT4-fs (md0): mounting ext2 file system using the ext4 subsystem > > Feb 27 08:35:48 zaphkiel kernel: [ 29.684176] EXT4-fs (md0): mounted filesystem without journal. Opts: acl,user_xattr > > Feb 27 08:35:48 zaphkiel kernel: [ 30.356255] XFS (md3): Ending clean mount > > > > Today, /dev/sdb1 of /dev/md0 vanished. Re-adding and checking doesn't revealed > > any error, but it would be cool to get to the bottom of this, as re-adding some > > lost partitions to the md every second day *is* a real PITA. > > > > What could I do to help resolving this issue? > > It really feels like a udev bug, but it is hard to be sure. > Could you edit /etc/init.d/boot.md and add > echo BEFORE mdadm -IRs > /dev/kmsg > just before the call the "$mdadm_BIN -IRs", > echo AFTER mdadm -IRs > /dev/kmsg > just after that call, and > echo AFTER mdadm -Asc > /dev/mksg > just after the call to "$mdadm_BIN -A -s -c $mdadm_CONFIG" > > And maybe put similar messages just before and after the > /sbin/udevadm settle .... > call. > > If you can then reproduce the problem, the extra logging might tell us > something useful. I've enabled initrd (linuxrc=trace) and boot.md (#!/bin/sh -x) debugging, hence we should see the full story next time. Thanks, Pete ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-02-28 10:49 ` Hans-Peter Jansen @ 2013-02-28 21:25 ` NeilBrown 2013-02-28 22:16 ` Hans-Peter Jansen 0 siblings, 1 reply; 19+ messages in thread From: NeilBrown @ 2013-02-28 21:25 UTC (permalink / raw) To: Hans-Peter Jansen; +Cc: Linux RAID [-- Attachment #1: Type: text/plain, Size: 977 bytes --] On Thu, 28 Feb 2013 11:49:53 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote: This is what I asked for: > > It really feels like a udev bug, but it is hard to be sure. > > Could you edit /etc/init.d/boot.md and add > > echo BEFORE mdadm -IRs > /dev/kmsg > > just before the call the "$mdadm_BIN -IRs", > > echo AFTER mdadm -IRs > /dev/kmsg > > just after that call, and > > echo AFTER mdadm -Asc > /dev/mksg > > just after the call to "$mdadm_BIN -A -s -c $mdadm_CONFIG" > > > > And maybe put similar messages just before and after the > > /sbin/udevadm settle .... > > call. > > > > If you can then reproduce the problem, the extra logging might tell us > > something useful. > This is not at all the same thing: > I've enabled initrd (linuxrc=trace) and boot.md (#!/bin/sh -x) debugging, hence > we should see the full story next time. > It might help, but I'm far from certain that it will be nearly as useful. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-02-28 21:25 ` NeilBrown @ 2013-02-28 22:16 ` Hans-Peter Jansen [not found] ` <4291349.FrQcKOnicQ@xrated> 0 siblings, 1 reply; 19+ messages in thread From: Hans-Peter Jansen @ 2013-02-28 22:16 UTC (permalink / raw) To: NeilBrown; +Cc: Linux RAID Am Freitag, 1. März 2013, 08:25:20 schrieb NeilBrown: > On Thu, 28 Feb 2013 11:49:53 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote: > > This is what I asked for: > > > It really feels like a udev bug, but it is hard to be sure. > > > Could you edit /etc/init.d/boot.md and add > > > > > > echo BEFORE mdadm -IRs > /dev/kmsg > > > > > > just before the call the "$mdadm_BIN -IRs", > > > > > > echo AFTER mdadm -IRs > /dev/kmsg > > > > > > just after that call, and > > > > > > echo AFTER mdadm -Asc > /dev/mksg > > > > > > just after the call to "$mdadm_BIN -A -s -c $mdadm_CONFIG" > > > > > > And maybe put similar messages just before and after the > > > > > > /sbin/udevadm settle .... > > > > > > call. > > > > > > If you can then reproduce the problem, the extra logging might tell us > > > something useful. > > This is not at all the same thing: > > I've enabled initrd (linuxrc=trace) and boot.md (#!/bin/sh -x) debugging, > > hence we should see the full story next time. > > It might help, but I'm far from certain that it will be nearly as useful. Okay, okay, I do it both ways now.. Let's see, when it triggers. > NeilBrown Cheers, Pete -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <4291349.FrQcKOnicQ@xrated>]
* Re: Persistent failures with simple md setup [not found] ` <4291349.FrQcKOnicQ@xrated> @ 2013-03-03 23:33 ` NeilBrown 2013-03-13 0:52 ` NeilBrown 1 sibling, 0 replies; 19+ messages in thread From: NeilBrown @ 2013-03-03 23:33 UTC (permalink / raw) To: Hans-Peter Jansen; +Cc: Linux RAID [-- Attachment #1: Type: text/plain, Size: 4939 bytes --] On Sun, 03 Mar 2013 20:31:57 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote: > Am Donnerstag, 28. Februar 2013, 23:16:01 schrieb Hans-Peter Jansen: > > Am Freitag, 1. März 2013, 08:25:20 schrieb NeilBrown: > > > On Thu, 28 Feb 2013 11:49:53 +0100 Hans-Peter Jansen <hpj@urpla.net> > > > wrote: > > > > > > This is what I asked for: > > > > > It really feels like a udev bug, but it is hard to be sure. > > > > > Could you edit /etc/init.d/boot.md and add > > > > > > > > > > echo BEFORE mdadm -IRs > /dev/kmsg > > > > > > > > > > just before the call the "$mdadm_BIN -IRs", > > > > > > > > > > echo AFTER mdadm -IRs > /dev/kmsg > > > > > > > > > > just after that call, and > > > > > > > > > > echo AFTER mdadm -Asc > /dev/mksg > > > > > > > > > > just after the call to "$mdadm_BIN -A -s -c $mdadm_CONFIG" > > > > > > > > > > And maybe put similar messages just before and after the > > > > > > > > > > /sbin/udevadm settle .... > > > > > > > > > > call. > > > > > > > > > > If you can then reproduce the problem, the extra logging might tell us > > > > > something useful. > > > > > > This is not at all the same thing: > > > > I've enabled initrd (linuxrc=trace) and boot.md (#!/bin/sh -x) > > > > debugging, > > > > hence we should see the full story next time. > > > > > > It might help, but I'm far from certain that it will be nearly as useful. > > I see, sh debug messages do not condense in /var/log/something, they're routed > somewhere else. I not sure, how I like that systemd implied sillyness. > > > Okay, okay, I do it both ways now.. > > > > Let's see, when it triggers. > > I'm not sure, how the basic kernel boot contribute to boiling this issue down, > but here we go: (not censored this time, hope it will go though..) Thanks. The interesting bit is here: > Mar 3 09:11:59 zaphkiel kernel: [ 28.781703] md: bind<sdb4> It seems that udev is running "mdadm -I" on sdb4 here > Mar 3 09:11:59 zaphkiel kernel: [ 28.799939] tveeprom 1-0050: audio processor is MSP3415 (idx 6) > Mar 3 09:11:59 zaphkiel kernel: [ 28.862699] sr 7:0:0:0: Attached scsi generic sg6 type 5 > Mar 3 09:11:59 zaphkiel kernel: [ 28.894667] tveeprom 1-0050: has radio > Mar 3 09:11:59 zaphkiel kernel: [ 28.917194] bttv0: Hauppauge eeprom indicates model#44354 > Mar 3 09:11:59 zaphkiel kernel: [ 28.984037] bttv0: tuner type=20 > Mar 3 09:11:59 zaphkiel kernel: [ 29.003670] md: bind<sdb1> and "mdadm -I sdb1" here > Mar 3 09:11:59 zaphkiel kernel: [ 29.026123] input: HDA Digital PCBeep as /devices/pci0000:00/0000:00:07.0/input/input5 > Mar 3 09:11:59 zaphkiel kernel: [ 29.098905] i2c-core: driver [msp3400] using legacy suspend method > Mar 3 09:11:59 zaphkiel kernel: [ 29.139227] AFTER udevadm settle --timeout=60 and now udev is "settled". This is much less than 60 seconds after the "BEFORE udevadm settle", so it didn't time out - it must have exhausted the queue. > Mar 3 09:11:59 zaphkiel kernel: [ 29.211092] i2c-core: driver [msp3400] using legacy resume method > Mar 3 09:11:59 zaphkiel kernel: [ 29.276147] msp3400 1-0040: MSP3415D-B3 found @ 0x80 (bt878 #0 [sw]) > Mar 3 09:11:59 zaphkiel kernel: [ 29.325203] md: bind<sdb2> But look - it seems that "mdadm -I /dev/sdb2" was just run - presumably by udev. > Mar 3 09:11:59 zaphkiel kernel: [ 29.391360] msp3400 1-0040: msp3400 supports nicam, BEFORE mdadm -IRs We managed to get two lines blended together here, but it is clear that this is just before "mdadm -IRs" runs. > Mar 3 09:11:59 zaphkiel kernel: [ 29.455183] mode is autodetect > Mar 3 09:11:59 zaphkiel kernel: [ 29.525553] md/raid1:md0: active with 1 out of 2 mirrors > And unsurprisingly, md0 is assembled with only on mirror because we have seen sdb1 but not sda1 > Mar 3 09:11:59 zaphkiel kernel: [ 29.572395] i2c-core: driver [tuner] using legacy suspend method > Mar 3 09:11:59 zaphkiel kernel: [ 29.642012] i2c-core: driver [tuner] using legacy resume method > Mar 3 09:11:59 zaphkiel kernel: [ 29.688786] md: bind<sda4> and here comes sda4. Presumably sda1 is attempted a little later, but as md0 is already assembled, mdadm refuses to do anything with it. So "mdadm -IRs" is being run too early, apparently because "udevadm settle" is exiting too early. However I have looked closely at the udev code, and run various tests, and cannot see how it could possibly exit early. I can imagine it exiting a bit late in some anomalous situations, but not early. The only way I can explain this is to suggest that something other than udev is running "mdadm -I". But I cannot imagine what would be doing that. If you put a "sleep 10" before the "mdadm -IRs", I suspect your problems would go away. It isn't really a nice solution, but it is the only one I can think of at the moment. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup [not found] ` <4291349.FrQcKOnicQ@xrated> 2013-03-03 23:33 ` NeilBrown @ 2013-03-13 0:52 ` NeilBrown 2013-03-15 22:43 ` Hans-Peter Jansen 1 sibling, 1 reply; 19+ messages in thread From: NeilBrown @ 2013-03-13 0:52 UTC (permalink / raw) To: Hans-Peter Jansen; +Cc: Linux RAID [-- Attachment #1: Type: text/plain, Size: 733 bytes --] hi, wonder if you could try one more test for me. With the other (echo BEFORE / echo AFTER etc) tracing still there, change the /sbin/udevadm settle --timeout=$MDADM_DEVICE_TIMEOUT to /sbin/udevadm --debug settle --timeout=$MDADM_DEVICE_TIMEOUT > /dev/kmsg 2>&1 I found there is a case where "udevadm settle" can exit before the queue is empty. It seems like a very unlikely scenario, but it seems clear that something "unlikely" is happening. I'm hoping to see timeout waiting for udev queue appear in the logs when this runs. If you could add that and post the 'dmesg' output if you ever get that message - or maybe even if you don't - that would be very helpful. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-03-13 0:52 ` NeilBrown @ 2013-03-15 22:43 ` Hans-Peter Jansen 2013-03-18 11:20 ` Hans-Peter Jansen 0 siblings, 1 reply; 19+ messages in thread From: Hans-Peter Jansen @ 2013-03-15 22:43 UTC (permalink / raw) To: NeilBrown; +Cc: Linux RAID Am Mittwoch, 13. März 2013, 11:52:35 schrieben Sie: > hi, > wonder if you could try one more test for me. > With the other (echo BEFORE / echo AFTER etc) tracing still there, change > the /sbin/udevadm settle --timeout=$MDADM_DEVICE_TIMEOUT > to > > /sbin/udevadm --debug settle --timeout=$MDADM_DEVICE_TIMEOUT > /dev/kmsg > 2>&1 > > I found there is a case where "udevadm settle" can exit before the queue is > empty. It seems like a very unlikely scenario, but it seems clear that > something "unlikely" is happening. > I'm hoping to see > timeout waiting for udev queue > > appear in the logs when this runs. > > If you could add that and post the 'dmesg' output if you ever get that > message - or maybe even if you don't - that would be very helpful. Did that right now and disabled the sleep 10 in front, that provisionally "solved" this issue BTW. Now waiting for another md issue to occur. Cheers, Pete -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-03-15 22:43 ` Hans-Peter Jansen @ 2013-03-18 11:20 ` Hans-Peter Jansen 2013-03-21 3:24 ` NeilBrown 0 siblings, 1 reply; 19+ messages in thread From: Hans-Peter Jansen @ 2013-03-18 11:20 UTC (permalink / raw) To: NeilBrown; +Cc: Linux RAID Am Freitag, 15. März 2013, 23:43:31 schrieb Hans-Peter Jansen: > Am Mittwoch, 13. März 2013, 11:52:35 schrieben Sie: > > hi, > > > > wonder if you could try one more test for me. > > With the other (echo BEFORE / echo AFTER etc) tracing still there, change > > > > the /sbin/udevadm settle --timeout=$MDADM_DEVICE_TIMEOUT > > > > to > > > > /sbin/udevadm --debug settle --timeout=$MDADM_DEVICE_TIMEOUT > > > /dev/kmsg > > > > 2>&1 > > > > I found there is a case where "udevadm settle" can exit before the queue > > is > > > > empty. It seems like a very unlikely scenario, but it seems clear that > > something "unlikely" is happening. > > > > I'm hoping to see > > > > timeout waiting for udev queue > > > > appear in the logs when this runs. > > > > If you could add that and post the 'dmesg' output if you ever get that > > message - or maybe even if you don't - that would be very helpful. > > Did that right now and disabled the sleep 10 in front, that provisionally > "solved" this issue BTW. Now waiting for another md issue to occur. Here we go: (not censored in any way, and I hope to get the "window" right: md0 was affected this time. Mar 18 09:02:20 zaphkiel kernel: [ 10.153098] sd 6:0:0:0: [sdc] 7741440 512-byte logical blocks: (3.96 GB/3.69 GiB) Mar 18 09:02:20 zaphkiel kernel: [ 10.202222] sd 6:0:0:0: [sdc] Write Protect is off Mar 18 09:02:20 zaphkiel kernel: [ 10.230965] sd 6:0:0:0: [sdc] Mode Sense: 03 00 00 00 Mar 18 09:02:20 zaphkiel kernel: [ 10.234226] sd 6:0:0:0: [sdc] No Caching mode page present Mar 18 09:02:20 zaphkiel kernel: [ 10.267155] sd 6:0:0:0: [sdc] Assuming drive cache: write through Mar 18 09:02:20 zaphkiel kernel: [ 10.309225] sd 6:0:0:0: [sdc] No Caching mode page present Mar 18 09:02:20 zaphkiel kernel: [ 10.342135] sd 6:0:0:0: [sdc] Assuming drive cache: write through Mar 18 09:02:20 zaphkiel kernel: [ 10.379763] sdc: sdc1 Mar 18 09:02:20 zaphkiel kernel: [ 10.400224] sd 6:0:0:0: [sdc] No Caching mode page present Mar 18 09:02:20 zaphkiel kernel: [ 10.433138] sd 6:0:0:0: [sdc] Assuming drive cache: write through Mar 18 09:02:20 zaphkiel kernel: [ 10.469953] sd 6:0:0:0: [sdc] Attached SCSI removable disk Mar 18 09:02:20 zaphkiel kernel: [ 10.502901] Freeing unused kernel memory: 488k freed Mar 18 09:02:20 zaphkiel kernel: [ 10.532672] usb 2-4: device descriptor read/64, error -71 Mar 18 09:02:20 zaphkiel kernel: [ 10.565302] Write protecting the kernel text: 5204k Mar 18 09:02:20 zaphkiel kernel: [ 10.594587] Write protecting the kernel read-only data: 3540k Mar 18 09:02:20 zaphkiel kernel: [ 10.629070] NX-protecting the kernel data: 5036k Mar 18 09:02:20 zaphkiel kernel: [ 10.699772] ACPI: acpi_idle registered with cpuidle Mar 18 09:02:20 zaphkiel kernel: [ 10.735081] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled Mar 18 09:02:20 zaphkiel kernel: [ 10.792858] SGI XFS Quota Management subsystem Mar 18 09:02:20 zaphkiel kernel: [ 10.821500] md: raid1 personality registered for level 1 Mar 18 09:02:20 zaphkiel kernel: [ 10.853390] usb 2-4: new high speed USB device number 5 using ehci_hcd Mar 18 09:02:20 zaphkiel kernel: [ 10.894261] BIOS EDD facility v0.16 2004-Jun-25, 3 devices found Mar 18 09:02:20 zaphkiel kernel: [ 11.157034] usb 2-4: device descriptor read/64, error -71 Mar 18 09:02:20 zaphkiel kernel: [ 11.385910] md: bind<sda3> Mar 18 09:02:20 zaphkiel kernel: [ 11.407964] md: bind<sdb3> Mar 18 09:02:20 zaphkiel kernel: [ 11.427609] bio: create slab <bio-1> at 1 Mar 18 09:02:20 zaphkiel kernel: [ 11.451733] md/raid1:md124: active with 2 out of 2 mirrors Mar 18 09:02:20 zaphkiel kernel: [ 11.484761] created bitmap (200 pages) for device md124 Mar 18 09:02:20 zaphkiel kernel: [ 11.527745] md124: bitmap initialized from disk: read 13/13 pages, set 0 of 409595 bits Mar 18 09:02:20 zaphkiel kernel: [ 11.589866] md124: detected capacity change from 0 to 107372728320 Mar 18 09:02:20 zaphkiel kernel: [ 11.628729] md124: unknown partition table Mar 18 09:02:20 zaphkiel kernel: [ 11.705023] usb 2-4: device descriptor read/64, error -71 Mar 18 09:02:20 zaphkiel kernel: [ 11.940026] usb 2-4: new high speed USB device number 6 using ehci_hcd Mar 18 09:02:20 zaphkiel kernel: [ 12.443021] usb 2-4: device not accepting address 6, error -71 Mar 18 09:02:20 zaphkiel kernel: [ 12.579024] usb 2-4: new high speed USB device number 7 using ehci_hcd Mar 18 09:02:20 zaphkiel kernel: [ 13.082029] usb 2-4: device not accepting address 7, error -71 Mar 18 09:02:20 zaphkiel kernel: [ 13.117018] hub 2-0:1.0: unable to enumerate USB device on port 4 Mar 18 09:02:20 zaphkiel kernel: [ 13.385793] md: raid0 personality registered for level 0 Mar 18 09:02:20 zaphkiel kernel: [ 13.424034] usb 4-2: new low speed USB device number 2 using ohci_hcd Mar 18 09:02:20 zaphkiel kernel: [ 13.471246] md: raid10 personality registered for level 10 Mar 18 09:02:20 zaphkiel kernel: [ 13.517387] async_tx: api initialized (async) Mar 18 09:02:20 zaphkiel kernel: [ 13.544243] xor: automatically using best checksumming function: pIII_sse Mar 18 09:02:20 zaphkiel kernel: [ 13.590008] pIII_sse : 7224.000 MB/sec Mar 18 09:02:20 zaphkiel kernel: [ 13.615137] xor: using function: pIII_sse (7224.000 MB/sec) Mar 18 09:02:20 zaphkiel kernel: [ 13.666017] raid6: int32x1 996 MB/s Mar 18 09:02:20 zaphkiel kernel: [ 13.705044] raid6: int32x2 1101 MB/s Mar 18 09:02:20 zaphkiel kernel: [ 13.744041] raid6: int32x4 789 MB/s Mar 18 09:02:20 zaphkiel kernel: [ 13.783015] raid6: int32x8 878 MB/s Mar 18 09:02:20 zaphkiel kernel: [ 13.822041] raid6: mmxx1 1972 MB/s Mar 18 09:02:20 zaphkiel kernel: [ 13.861017] raid6: mmxx2 3632 MB/s Mar 18 09:02:20 zaphkiel kernel: [ 13.883514] usb 4-2: New USB device found, idVendor=093a, idProduct=2510 Mar 18 09:02:20 zaphkiel kernel: [ 13.940023] raid6: sse1x1 2007 MB/s Mar 18 09:02:20 zaphkiel kernel: [ 13.962555] usb 4-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0 Mar 18 09:02:20 zaphkiel kernel: [ 14.022017] raid6: sse1x2 3406 MB/s Mar 18 09:02:20 zaphkiel kernel: [ 14.061017] usb 4-2: Product: USB OPTICAL MOUSE Mar 18 09:02:20 zaphkiel kernel: [ 14.088224] raid6: sse2x1 3433 MB/s Mar 18 09:02:20 zaphkiel kernel: [ 14.110741] usb 4-2: Manufacturer: PIXART Mar 18 09:02:20 zaphkiel kernel: [ 14.152588] raid6: sse2x2 4511 MB/s Mar 18 09:02:20 zaphkiel kernel: [ 14.175186] raid6: using algorithm sse2x2 (4511 MB/s) Mar 18 09:02:20 zaphkiel kernel: [ 14.207849] md: raid6 personality registered for level 6 Mar 18 09:02:20 zaphkiel kernel: [ 14.239745] md: raid5 personality registered for level 5 Mar 18 09:02:20 zaphkiel kernel: [ 14.272105] input: PIXART USB OPTICAL MOUSE as /devices/pci0000:00/0000:00:04.0/usb4/4-2/4-2:1.0/input/input1 Mar 18 09:02:20 zaphkiel kernel: [ 14.334075] md: raid4 personality registered for level 4 Mar 18 09:02:20 zaphkiel kernel: [ 14.371117] generic-usb 0003:093A:2510.0001: input,hidraw0: USB HID v1.11 Mouse [PIXART USB OPTICAL MOUSE] on usb-0000:00:04.0-2/input0 Mar 18 09:02:20 zaphkiel kernel: [ 14.485966] XFS (md124): Mounting Filesystem Mar 18 09:02:20 zaphkiel kernel: [ 14.701101] XFS (md124): Ending clean mount Mar 18 09:02:20 zaphkiel kernel: [ 14.731718] usb 4-4: new full speed USB device number 3 using ohci_hcd Mar 18 09:02:20 zaphkiel kernel: [ 14.959749] usb 4-4: not running at top speed; connect to a high speed hub Mar 18 09:02:20 zaphkiel kernel: [ 15.020945] usb 4-4: New USB device found, idVendor=05e3, idProduct=0760 Mar 18 09:02:20 zaphkiel kernel: [ 15.066896] usb 4-4: New USB device strings: Mfr=0, Product=3, SerialNumber=4 Mar 18 09:02:20 zaphkiel kernel: [ 15.109742] usb 4-4: Product: Flash Reader Mar 18 09:02:20 zaphkiel kernel: [ 15.134328] usb 4-4: SerialNumber: 03570 Mar 18 09:02:20 zaphkiel kernel: [ 15.159828] scsi7 : usb-storage 4-4:1.0 Mar 18 09:02:20 zaphkiel kernel: [ 16.191763] scsi 7:0:0:0: Direct-Access Generic STORAGE DEVICE 0125 PQ: 0 ANSI: 0 Mar 18 09:02:20 zaphkiel kernel: [ 16.256768] scsi 7:0:0:1: Direct-Access Generic STORAGE DEVICE 0125 PQ: 0 ANSI: 0 Mar 18 09:02:20 zaphkiel kernel: [ 16.332762] scsi 7:0:0:2: Direct-Access Generic STORAGE DEVICE 0125 PQ: 0 ANSI: 0 Mar 18 09:02:20 zaphkiel kernel: [ 16.401753] sd 7:0:0:0: [sdd] Attached SCSI removable disk Mar 18 09:02:20 zaphkiel kernel: [ 16.450765] scsi 7:0:0:3: Direct-Access Generic STORAGE DEVICE 0125 PQ: 0 ANSI: 0 Mar 18 09:02:20 zaphkiel kernel: [ 16.508754] sd 7:0:0:1: [sde] Attached SCSI removable disk Mar 18 09:02:20 zaphkiel kernel: [ 16.572754] sd 7:0:0:2: [sdf] Attached SCSI removable disk Mar 18 09:02:20 zaphkiel kernel: [ 16.625754] sd 7:0:0:3: [sdg] Attached SCSI removable disk Mar 18 09:02:20 zaphkiel kernel: [ 20.412626] forcedeth: Reverse Engineered nForce ethernet driver. Version 0.64. Mar 18 09:02:20 zaphkiel kernel: [ 20.460450] ACPI: PCI Interrupt Link [LMAC] enabled at IRQ 22 Mar 18 09:02:20 zaphkiel kernel: [ 20.502526] forcedeth 0000:00:0a.0: PCI INT A -> Link[LMAC] -> GSI 22 (level, low) -> IRQ 22 Mar 18 09:02:20 zaphkiel kernel: [ 20.555583] forcedeth 0000:00:0a.0: setting latency timer to 64 Mar 18 09:02:20 zaphkiel kernel: [ 20.608180] forcedeth 0000:00:0a.0: ifname eth0, PHY OUI 0x732 @ 3, addr 00:23:54:2b:d2:58 Mar 18 09:02:20 zaphkiel kernel: [ 20.661808] forcedeth 0000:00:0a.0: highdma csum pwrctl mgmt gbit lnktim msi desc-v3 Mar 18 09:02:20 zaphkiel kernel: [ 21.514743] 8139too: 8139too Fast Ethernet driver 0.9.28 Mar 18 09:02:20 zaphkiel kernel: [ 21.661650] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 16 Mar 18 09:02:20 zaphkiel kernel: [ 21.745044] 8139too 0000:01:06.0: PCI INT A -> Link[LNKA] -> GSI 16 (level, low) -> IRQ 16 Mar 18 09:02:20 zaphkiel kernel: [ 21.853853] 8139too 0000:01:06.0: eth1: RealTek RTL8139 at 0xf7a32c00, 00:08:54:0a:9b:53, IRQ 16 Mar 18 09:02:20 zaphkiel kernel: [ 22.792962] BEFORE udevadm settle --timeout=60 Mar 18 09:02:20 zaphkiel kernel: [ 22.879311] main: Mar 18 09:02:20 zaphkiel kernel: [ 22.935601] runtime dir '/run/udev' Mar 18 09:02:20 zaphkiel kernel: [ 22.962183] run_command: Mar 18 09:02:20 zaphkiel kernel: [ 23.021385] calling: settle Mar 18 09:02:20 zaphkiel kernel: [ 26.373950] i2c i2c-0: nForce2 SMBus adapter at 0x600 Mar 18 09:02:20 zaphkiel kernel: [ 26.439755] WARNING! power/level is deprecated; use power/control instead Mar 18 09:02:20 zaphkiel kernel: [ 26.482815] ACPI: resource nForce2_smbus [io 0x0700-0x073f] conflicts with ACPI region SM00 [io 0x700-0x73f] Mar 18 09:02:20 zaphkiel kernel: [ 26.579063] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver Mar 18 09:02:20 zaphkiel kernel: [ 26.649433] firewire_ohci 0000:04:00.0: PCI INT A -> Link[LN3A] -> GSI 17 (level, low) -> IRQ 17 Mar 18 09:02:20 zaphkiel kernel: [ 26.739746] input: Power Button as /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input2 Mar 18 09:02:20 zaphkiel kernel: [ 26.796477] firewire_ohci 0000:04:00.0: setting latency timer to 64 Mar 18 09:02:20 zaphkiel kernel: [ 26.801886] Linux video capture interface: v2.00 Mar 18 09:02:20 zaphkiel kernel: [ 26.849080] firewire_ohci: Added fw-ohci device 0000:04:00.0, OHCI v1.10, 4 IR + 4 IT contexts, quirks 0x10 Mar 18 09:02:20 zaphkiel kernel: [ 26.966186] pata_amd 0000:00:06.0: version 0.4.1 Mar 18 09:02:20 zaphkiel kernel: [ 26.966229] pata_amd 0000:00:06.0: setting latency timer to 64 Mar 18 09:02:20 zaphkiel kernel: [ 26.968773] scsi8 : pata_amd Mar 18 09:02:20 zaphkiel kernel: [ 26.992242] sd 0:0:0:0: Attached scsi generic sg0 type 0 Mar 18 09:02:20 zaphkiel kernel: [ 27.031690] scsi9 : pata_amd Mar 18 09:02:20 zaphkiel kernel: [ 27.064347] ata7: PATA max UDMA/133 cmd 0x1f0 ctl 0x3f6 bmdma 0xffa0 irq 14 Mar 18 09:02:20 zaphkiel kernel: [ 27.152035] ata8: PATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0xffa8 irq 15 Mar 18 09:02:20 zaphkiel kernel: [ 27.207504] IT8712 SuperIO detected. Mar 18 09:02:20 zaphkiel kernel: [ 27.248861] k8temp 0000:00:18.3: Temperature readouts might be wrong - check erratum #141 Mar 18 09:02:20 zaphkiel kernel: [ 27.358729] parport_pc 00:04: reported by Plug and Play ACPI Mar 18 09:02:20 zaphkiel kernel: [ 27.396357] sd 1:0:0:0: Attached scsi generic sg1 type 0 Mar 18 09:02:20 zaphkiel kernel: [ 27.435502] ata7.00: ATAPI: HL-DT-ST DVDRAM GSA-4167B, DL12, max UDMA/33 Mar 18 09:02:20 zaphkiel kernel: [ 27.479017] firewire_core: created device fw0: GUID 001e8c00018b93c2, S400 Mar 18 09:02:20 zaphkiel kernel: [ 27.522961] ACPI: Power Button [PWRB] Mar 18 09:02:20 zaphkiel kernel: [ 27.547160] parport0: PC-style at 0x378 Mar 18 09:02:20 zaphkiel kernel: [ 27.569182] ata7: nv_mode_filter: 0x739f&0x739f->0x739f, BIOS=0x7000 (0xc0000000) ACPI=0x701f (60:900:0x11) Mar 18 09:02:20 zaphkiel kernel: [ 27.572162] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input3 Mar 18 09:02:20 zaphkiel kernel: [ 27.619485] nvidia: module license 'NVIDIA' taints kernel. Mar 18 09:02:20 zaphkiel kernel: [ 27.652780] ata7.00: configured for UDMA/33 Mar 18 09:02:20 zaphkiel kernel: [ 27.681871] sd 6:0:0:0: Attached scsi generic sg2 type 0 Mar 18 09:02:20 zaphkiel kernel: [ 27.720249] scsi 8:0:0:0: CD-ROM HL-DT-ST DVDRAM GSA-4167B DL12 PQ: 0 ANSI: 5 Mar 18 09:02:20 zaphkiel kernel: [ 27.773178] , irq 7 Mar 18 09:02:20 zaphkiel kernel: [ 27.785284] Disabling lock debugging due to kernel taint Mar 18 09:02:20 zaphkiel kernel: [ 27.818744] [ Mar 18 09:02:20 zaphkiel kernel: [ 27.827966] sd 7:0:0:0: Attached scsi generic sg3 type 0 Mar 18 09:02:20 zaphkiel kernel: [ 27.862615] PCSPP Mar 18 09:02:20 zaphkiel kernel: [ 27.873974] ACPI: Power Button [PWRF] Mar 18 09:02:20 zaphkiel kernel: [ 27.903194] input: PC Speaker as /devices/platform/pcspkr/input/input4 Mar 18 09:02:20 zaphkiel kernel: [ 27.947382] ,TRISTATE] Mar 18 09:02:20 zaphkiel kernel: [ 27.967591] wmi: Mapper loaded Mar 18 09:02:20 zaphkiel kernel: [ 28.019498] sd 7:0:0:1: Attached scsi generic sg4 type 0 Mar 18 09:02:20 zaphkiel kernel: [ 28.086333] ppdev: user-space parallel port driver Mar 18 09:02:20 zaphkiel kernel: [ 28.145165] sd 7:0:0:2: Attached scsi generic sg5 type 0 Mar 18 09:02:20 zaphkiel kernel: [ 28.219780] sd 7:0:0:3: Attached scsi generic sg6 type 0 Mar 18 09:02:20 zaphkiel kernel: [ 28.294832] scsi 8:0:0:0: Attached scsi generic sg7 type 5 Mar 18 09:02:20 zaphkiel kernel: [ 28.359838] ata8: port disabled--ignoring Mar 18 09:02:20 zaphkiel kernel: [ 28.377526] snd_hda_intel 0000:00:07.0: power state changed by ACPI to D0 Mar 18 09:02:20 zaphkiel kernel: [ 28.460047] snd_hda_intel 0000:00:07.0: power state changed by ACPI to D0 Mar 18 09:02:20 zaphkiel kernel: [ 28.544146] ACPI: PCI Interrupt Link [LAZA] enabled at IRQ 21 Mar 18 09:02:20 zaphkiel kernel: [ 28.616023] snd_hda_intel 0000:00:07.0: PCI INT A -> Link[LAZA] -> GSI 21 (level, low) -> IRQ 21 Mar 18 09:02:20 zaphkiel kernel: [ 28.725762] hda_intel: Disabling MSI Mar 18 09:02:20 zaphkiel kernel: [ 28.749946] bttv: driver version 0.9.19 loaded Mar 18 09:02:20 zaphkiel kernel: [ 28.776705] snd_hda_intel 0000:00:07.0: setting latency timer to 64 Mar 18 09:02:20 zaphkiel kernel: [ 28.787103] ALSA hda_intel.c:1454 Enable delay in RIRB handling Mar 18 09:02:20 zaphkiel kernel: [ 28.804531] bttv: using 8 buffers with 2080k (520 pages) each for capture Mar 18 09:02:20 zaphkiel kernel: [ 28.864408] IR NEC protocol handler initialized Mar 18 09:02:20 zaphkiel kernel: [ 28.960457] IR RC5(x) protocol handler initialized Mar 18 09:02:20 zaphkiel kernel: [ 29.004015] bttv: Bt8xx card found (0). Mar 18 09:02:20 zaphkiel kernel: [ 29.088643] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 19 Mar 18 09:02:20 zaphkiel kernel: [ 29.123158] bttv 0000:01:07.0: PCI INT A -> Link[LNKB] -> GSI 19 (level, low) -> IRQ 19 Mar 18 09:02:20 zaphkiel kernel: [ 29.180972] IR RC6 protocol handler initialized Mar 18 09:02:20 zaphkiel kernel: [ 29.250091] bttv0: Bt878 (rev 17) at 0000:01:07.0, Mar 18 09:02:20 zaphkiel kernel: [ 29.278950] IR JVC protocol handler initialized Mar 18 09:02:20 zaphkiel kernel: [ 29.339358] irq: 19, latency: 64, mmio: 0xcfffe000 Mar 18 09:02:20 zaphkiel kernel: [ 29.400761] bttv0: detected: Hauppauge WinTV [card=10], PCI subsystem ID is 0070:13eb Mar 18 09:02:20 zaphkiel kernel: [ 29.447978] IR Sony protocol handler initialized Mar 18 09:02:20 zaphkiel kernel: [ 29.521989] IR MCE Keyboard/mouse protocol handler initialized Mar 18 09:02:20 zaphkiel kernel: [ 29.562022] bttv0: using: Hauppauge (bt878) [card=10,autodetected] Mar 18 09:02:20 zaphkiel kernel: [ 29.636072] lirc_dev: IR Remote Control driver registered, major 250 Mar 18 09:02:20 zaphkiel kernel: [ 29.679569] bttv0: gpio: en=00000000, out=00000000 in=00ffffdb [init] Mar 18 09:02:20 zaphkiel kernel: [ 29.683413] bttv0: Hauppauge/Voodoo msp34xx: reset line init [5] Mar 18 09:02:20 zaphkiel kernel: [ 29.725031] ALSA patch_realtek.c:1228 SKU: Nid=0x1d sku_cfg=0x4015e601 Mar 18 09:02:20 zaphkiel kernel: [ 29.725035] ALSA patch_realtek.c:1230 SKU: port_connectivity=0x1 Mar 18 09:02:20 zaphkiel kernel: [ 29.725038] ALSA patch_realtek.c:1231 SKU: enable_pcbeep=0x1 Mar 18 09:02:20 zaphkiel kernel: [ 29.725040] ALSA patch_realtek.c:1232 SKU: check_sum=0x00000005 Mar 18 09:02:20 zaphkiel kernel: [ 29.725042] ALSA patch_realtek.c:1233 SKU: customization=0x000000e6 Mar 18 09:02:20 zaphkiel kernel: [ 29.725045] ALSA patch_realtek.c:1234 SKU: external_amp=0x0 Mar 18 09:02:20 zaphkiel kernel: [ 29.725047] ALSA patch_realtek.c:1235 SKU: platform_type=0x0 Mar 18 09:02:20 zaphkiel kernel: [ 29.725049] ALSA patch_realtek.c:1236 SKU: swap=0x0 Mar 18 09:02:20 zaphkiel kernel: [ 29.725051] ALSA patch_realtek.c:1237 SKU: override=0x1 Mar 18 09:02:20 zaphkiel kernel: [ 29.742390] md: bind<sdb4> Mar 18 09:02:20 zaphkiel kernel: [ 29.779377] IR LIRC bridge handler initialized Mar 18 09:02:20 zaphkiel kernel: [ 29.857738] tveeprom 1-0050: Hauppauge model 44354, rev C142, serial# 2295831 Mar 18 09:02:20 zaphkiel kernel: [ 29.921654] md: bind<sda1> Mar 18 09:02:20 zaphkiel kernel: [ 29.946033] tveeprom 1-0050: tuner model is Temic 4009FR5 (idx 42, type 20) Mar 18 09:02:20 zaphkiel kernel: [ 30.020132] sr0: scsi3-mmc drive: 48x/48x writer dvd-ram cd/rw xa/form2 cdda tray Mar 18 09:02:20 zaphkiel kernel: [ 30.065051] tveeprom 1-0050: TV standards PAL(B/G) (eeprom 0x04) Mar 18 09:02:20 zaphkiel kernel: [ 30.103013] tveeprom 1-0050: audio processor is MSP3415 (idx 6) Mar 18 09:02:20 zaphkiel kernel: [ 30.145378] md: bind<sda2> Mar 18 09:02:20 zaphkiel kernel: [ 30.174031] tveeprom 1-0050: has radio Mar 18 09:02:20 zaphkiel kernel: [ 30.221049] bttv0: Hauppauge eeprom indicates model#44354 Mar 18 09:02:20 zaphkiel kernel: [ 30.269011] cdrom: Uniform CD-ROM driver Revision: 3.20 Mar 18 09:02:20 zaphkiel kernel: [ 30.300405] bttv0: tuner type=20 Mar 18 09:02:20 zaphkiel kernel: [ 30.337486] sr 8:0:0:0: Attached scsi CD-ROM sr0 Mar 18 09:02:20 zaphkiel kernel: [ 30.340058] nvidia 0000:02:00.0: PCI INT A -> Link[LN0A] -> GSI 19 (level, low) -> IRQ 19 Mar 18 09:02:20 zaphkiel kernel: [ 30.389781] i2c-core: driver [msp3400] using legacy suspend method Mar 18 09:02:20 zaphkiel kernel: [ 30.427379] md: bind<sda4> Mar 18 09:02:20 zaphkiel kernel: [ 30.447871] i2c-core: driver [msp3400] using legacy resume method Mar 18 09:02:20 zaphkiel kernel: [ 30.486398] nvidia 0000:02:00.0: setting latency timer to 64 Mar 18 09:02:20 zaphkiel kernel: [ 30.486403] vgaarb: device changed decodes: PCI:0000:02:00.0,olddecodes=io+mem,decodes=none:owns=io+mem Mar 18 09:02:20 zaphkiel kernel: [ 30.552124] msp3400 1-0040: MSP3415D-B3 found @ 0x80 (bt878 #0 [sw]) Mar 18 09:02:20 zaphkiel kernel: [ 30.591822] NVRM: loading NVIDIA UNIX x86 Kernel Module 304.64 Tue Oct 30 11:09:29 PDT 2012 Mar 18 09:02:20 zaphkiel kernel: [ 30.641940] msp3400 1-0040: msp3400 supports nicam, mode is autodetect Mar 18 09:02:20 zaphkiel kernel: [ 30.683731] Mar 18 09:02:20 zaphkiel kernel: [ 30.728574] i2c-core: driver [tuner] using legacy suspend method Mar 18 09:02:20 zaphkiel kernel: [ 30.767893] i2c-core: driver [tuner] using legacy resume method Mar 18 09:02:20 zaphkiel kernel: [ 30.811881] tuner 1-0061: Tuner -1 found with type(s) Radio TV. Mar 18 09:02:20 zaphkiel kernel: [ 30.847964] md/raid1:md3: active with 2 out of 2 mirrors Mar 18 09:02:20 zaphkiel kernel: [ 30.881959] created bitmap (415 pages) for device md3 Mar 18 09:02:20 zaphkiel kernel: [ 30.990096] input: HDA Digital PCBeep as /devices/pci0000:00/0000:00:07.0/input/input5 Mar 18 09:02:20 zaphkiel kernel: [ 31.038390] tuner-simple 1-0061: creating new instance Mar 18 09:02:20 zaphkiel kernel: [ 31.084560] tuner-simple 1-0061: type set to 20 (Temic PAL_BG (4009 FR5) or PAL_I (4069 FR5)) Mar 18 09:02:20 zaphkiel kernel: [ 31.140892] bttv0: registered device video0 Mar 18 09:02:20 zaphkiel kernel: [ 31.167942] bttv0: registered device vbi0 Mar 18 09:02:20 zaphkiel kernel: [ 31.192924] bttv0: registered device radio0 Mar 18 09:02:20 zaphkiel kernel: [ 31.223065] bttv0: PLL: 28636363 => 35468950 .. ok Mar 18 09:02:20 zaphkiel kernel: [ 31.312236] AFTER udevadm settle --timeout=60 Mar 18 09:02:20 zaphkiel kernel: [ 31.437261] md3: bitmap initialized from disk: read 26/26 pages, set 1464 of 849320 bits Mar 18 09:02:20 zaphkiel kernel: [ 31.536642] BEFORE mdadm -IRs Mar 18 09:02:20 zaphkiel kernel: [ 31.562744] md3: detected capacity change from 0 to 890575601664 Mar 18 09:02:20 zaphkiel kernel: [ 31.645517] md/raid1:md0: active with 1 out of 2 mirrors Mar 18 09:02:20 zaphkiel kernel: [ 31.709880] md3: unknown partition table Mar 18 09:02:20 zaphkiel kernel: [ 31.734035] md: bind<sdb2> Mar 18 09:02:20 zaphkiel kernel: [ 31.753744] created bitmap (6 pages) for device md0 Mar 18 09:02:20 zaphkiel kernel: [ 31.793486] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 12047 bits Mar 18 09:02:20 zaphkiel kernel: [ 31.874934] md/raid1:md1: active with 2 out of 2 mirrors Mar 18 09:02:20 zaphkiel kernel: [ 31.907025] md0: detected capacity change from 0 to 98689024 Mar 18 09:02:20 zaphkiel kernel: [ 32.004102] created bitmap (8 pages) for device md1 Mar 18 09:02:20 zaphkiel kernel: [ 32.078506] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 16379 bits Mar 18 09:02:20 zaphkiel kernel: [ 32.146550] md0: unknown partition table Mar 18 09:02:20 zaphkiel kernel: [ 32.170680] md1: detected capacity change from 0 to 2146783232 Mar 18 09:02:20 zaphkiel kernel: [ 32.209513] AFTER mdadm -IRs Mar 18 09:02:20 zaphkiel kernel: [ 32.452210] md1: unknown partition table Mar 18 09:02:20 zaphkiel kernel: [ 32.783780] Adding 2096464k swap on /dev/md1. Priority:-1 extents:1 across:2096464k Mar 18 09:02:20 zaphkiel kernel: [ 34.746669] AFTER mdadm -Asc Mar 18 09:02:20 zaphkiel kernel: [ 36.399439] XFS (md3): Mounting Filesystem Mar 18 09:02:20 zaphkiel kernel: [ 36.943112] EXT4-fs (md0): mounting ext2 file system using the ext4 subsystem Mar 18 09:02:20 zaphkiel kernel: [ 37.014070] XFS (md3): Ending clean mount Mar 18 09:02:20 zaphkiel kernel: [ 37.257681] EXT4-fs (md0): mounted filesystem without journal. Opts: acl,user_xattr Thanks, Pete -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-03-18 11:20 ` Hans-Peter Jansen @ 2013-03-21 3:24 ` NeilBrown 2013-04-10 13:28 ` Hans-Peter Jansen 2013-04-10 13:44 ` Hans-Peter Jansen 0 siblings, 2 replies; 19+ messages in thread From: NeilBrown @ 2013-03-21 3:24 UTC (permalink / raw) To: Hans-Peter Jansen; +Cc: Linux RAID [-- Attachment #1: Type: text/plain, Size: 1869 bytes --] On Mon, 18 Mar 2013 12:20:57 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote: > Am Freitag, 15. März 2013, 23:43:31 schrieb Hans-Peter Jansen: > > Am Mittwoch, 13. März 2013, 11:52:35 schrieben Sie: > > > hi, > > > > > > wonder if you could try one more test for me. > > > With the other (echo BEFORE / echo AFTER etc) tracing still there, change > > > > > > the /sbin/udevadm settle --timeout=$MDADM_DEVICE_TIMEOUT > > > > > > to > > > > > > /sbin/udevadm --debug settle --timeout=$MDADM_DEVICE_TIMEOUT > > > > /dev/kmsg > > > > > > 2>&1 > > > > > > I found there is a case where "udevadm settle" can exit before the queue > > > is > > > > > > empty. It seems like a very unlikely scenario, but it seems clear that > > > something "unlikely" is happening. > > > > > > I'm hoping to see > > > > > > timeout waiting for udev queue > > > > > > appear in the logs when this runs. > > > > > > If you could add that and post the 'dmesg' output if you ever get that > > > message - or maybe even if you don't - that would be very helpful. > > > > Did that right now and disabled the sleep 10 in front, that provisionally > > "solved" this issue BTW. Now waiting for another md issue to occur. > > Here we go: (not censored in any way, and I hope to get the "window" right: > md0 was affected this time. > Thanks! Unfortunately it is not as helpful as I hoped, but it does suggest that "udevadm settle" does sometimes appear to misbehave even if there aren't any problems with the md array. Could I ask for one more? Prefix the "udevadm settle " command with strace -o /tmp/udevadm.trace -s 500 (making sure that you have 'strace' installed) and then post the "/tmp/udevadm.trace" file. Hopefully that will at least allow me to rule out some possibilities. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-03-21 3:24 ` NeilBrown @ 2013-04-10 13:28 ` Hans-Peter Jansen 2013-04-10 13:44 ` Hans-Peter Jansen 1 sibling, 0 replies; 19+ messages in thread From: Hans-Peter Jansen @ 2013-04-10 13:28 UTC (permalink / raw) To: NeilBrown; +Cc: Linux RAID [-- Attachment #1: Type: text/plain, Size: 2179 bytes --] On Donnerstag, 21. März 2013 14:24:59 NeilBrown wrote: > On Mon, 18 Mar 2013 12:20:57 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote: > > Am Freitag, 15. März 2013, 23:43:31 schrieb Hans-Peter Jansen: > > > Am Mittwoch, 13. März 2013, 11:52:35 schrieben Sie: > > > > hi, > > > > > > > > wonder if you could try one more test for me. > > > > With the other (echo BEFORE / echo AFTER etc) tracing still there, > > > > change > > > > > > > > the /sbin/udevadm settle --timeout=$MDADM_DEVICE_TIMEOUT > > > > > > > > to > > > > > > > > /sbin/udevadm --debug settle --timeout=$MDADM_DEVICE_TIMEOUT > > > > > /dev/kmsg > > > > > > > > 2>&1 > > > > > > > > I found there is a case where "udevadm settle" can exit before the > > > > queue > > > > is > > > > > > > > empty. It seems like a very unlikely scenario, but it seems clear > > > > that > > > > something "unlikely" is happening. > > > > > > > > I'm hoping to see > > > > > > > > timeout waiting for udev queue > > > > > > > > appear in the logs when this runs. > > > > > > > > If you could add that and post the 'dmesg' output if you ever get > > > > that > > > > message - or maybe even if you don't - that would be very helpful. > > > > > > Did that right now and disabled the sleep 10 in front, that > > > provisionally > > > "solved" this issue BTW. Now waiting for another md issue to occur. > > > > Here we go: (not censored in any way, and I hope to get the "window" > > right: > > md0 was affected this time. > > Thanks! Unfortunately it is not as helpful as I hoped, but it does suggest > that "udevadm settle" does sometimes appear to misbehave even if there > aren't any problems with the md array. > > Could I ask for one more? > Prefix the "udevadm settle " command with > strace -o /tmp/udevadm.trace -s 500 > > (making sure that you have 'strace' installed) and then post the > "/tmp/udevadm.trace" file. Hopefully that will at least allow me to rule > out some possibilities. After reducing the timeout to 40 secs, we're able to catch a failure today with two degraded arrays: md0 and md1. Cheers, Pete [-- Attachment #2: udevadm.trace.gz --] [-- Type: application/x-gzip, Size: 5524 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-03-21 3:24 ` NeilBrown 2013-04-10 13:28 ` Hans-Peter Jansen @ 2013-04-10 13:44 ` Hans-Peter Jansen 2013-04-11 7:33 ` NeilBrown 1 sibling, 1 reply; 19+ messages in thread From: Hans-Peter Jansen @ 2013-04-10 13:44 UTC (permalink / raw) To: NeilBrown; +Cc: Linux RAID [-- Attachment #1: Type: text/plain, Size: 683 bytes --] On Donnerstag, 21. März 2013 14:24:59 NeilBrown wrote: > > Thanks! Unfortunately it is not as helpful as I hoped, but it does suggest > that "udevadm settle" does sometimes appear to misbehave even if there > aren't any problems with the md array. > > Could I ask for one more? > Prefix the "udevadm settle " command with > strace -o /tmp/udevadm.trace -s 500 > > (making sure that you have 'strace' installed) and then post the > "/tmp/udevadm.trace" file. Hopefully that will at least allow me to rule > out some possibilities. After reducing the timeout to 40 secs, we're able to catch a failure today with two degraded arrays: md0 and md1. Cheers, Pete [-- Attachment #2: udevadm.trace.gz --] [-- Type: application/x-gzip, Size: 5524 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-04-10 13:44 ` Hans-Peter Jansen @ 2013-04-11 7:33 ` NeilBrown 0 siblings, 0 replies; 19+ messages in thread From: NeilBrown @ 2013-04-11 7:33 UTC (permalink / raw) To: Hans-Peter Jansen; +Cc: Linux RAID [-- Attachment #1: Type: text/plain, Size: 2177 bytes --] On Wed, 10 Apr 2013 15:44:25 +0200 Hans-Peter Jansen <hpj@urpla.net> wrote: > On Donnerstag, 21. März 2013 14:24:59 NeilBrown wrote: > > > > Thanks! Unfortunately it is not as helpful as I hoped, but it does suggest > > that "udevadm settle" does sometimes appear to misbehave even if there > > aren't any problems with the md array. > > > > Could I ask for one more? > > Prefix the "udevadm settle " command with > > strace -o /tmp/udevadm.trace -s 500 > > > > (making sure that you have 'strace' installed) and then post the > > "/tmp/udevadm.trace" file. Hopefully that will at least allow me to rule > > out some possibilities. > > After reducing the timeout to 40 secs, we're able to catch a failure today > with two degraded arrays: md0 and md1. > > Cheers, > Pete Thanks. There is a similar bug report open at https://bugzilla.novell.com/show_bug.cgi?id=793954 I received a generally similar trace there and have been looking at yours and those and the code and not getting very far. The traces show the /run/udev/queue.bin getting bigger and bigger, then shrinking down to '8': %zgrep 'fstat64(4,' /tmp/udevadm.trace.gz | sed 's/.*st_size=\([0-9]*\),.*/\1/' This is expected. Whenever the queue becomes empty, the file is reset (normally 'add' and 'remove' records are simply appended). It is supposed to also be shrunk when it is bigger than 4K and over half of the file is wasted (due to add/remove pairs) but the calculation of wastage is broken and I think this never happens. Nevertheless there seems to be a correlation between the times when it fails and the max size of the queue.bin file. In the files from the buzilla, the times that it works, the queue.bin file never gets big. But that could just be an accident. In any case it is clear that udevadm is working correctly. Which seems to suggest that maybe 'udev' is messing with the queue.bin file. Either that or I'm misunderstand some other details and am entirely on the wrong path. But I've gone over the udev code again and again and I cannot fault it. So I'm still stumped. Thanks for your help though. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Persistent failures with simple md setup 2013-01-29 22:14 Persistent failures with simple md setup Hans-Peter Jansen 2013-01-30 9:07 ` Sebastian Riemer @ 2013-01-30 9:20 ` Roy Sigurd Karlsbakk 1 sibling, 0 replies; 19+ messages in thread From: Roy Sigurd Karlsbakk @ 2013-01-30 9:20 UTC (permalink / raw) To: Hans-Peter Jansen; +Cc: linux-raid > ~# cat /proc/mdstat > Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] > md3 : active raid1 sda4[0] > 869702736 blocks super 1.0 [2/1] [U_] > bitmap: 57/415 pages [228KB], 1024KB chunk > > md0 : active raid1 sda1[0] > 96376 blocks super 1.0 [2/1] [U_] > bitmap: 1/6 pages [4KB], 8KB chunk > > md1 : active (auto-read-only) raid1 sdb2[1] sda2[0] > 2096468 blocks super 1.0 [2/2] [UU] > bitmap: 0/8 pages [0KB], 128KB chunk > > md124 : active raid1 sdb3[1] sda3[0] > 104856180 blocks super 1.0 [2/2] [UU] > bitmap: 8/200 pages [32KB], 256KB chunk ... > for this case, but the behavior which partition is affected is random, > only md124 seems stable (the root fs). The strange md naming was the > result of an upgrade installation. The device details are attached as > well. > > It happens, that the active device even *switches* between boots, > which > is a perfect recipe for actually loosing data, hence this md doesn't > raise data security, it is the reason for loosing them. > > Could some kind soul tell me, what's going on here? This looks somewhat familiar. I had a similar issue after install Ubuntu 12.04 on my home server. IIRC udev/mdadm was out of sync, so that the raid was started before all drives were ready. This was fixed around ubuntu 12.04.1. But then again, since it finds some of the partitions, and not losing whole drives, this seems somewhat unlikely. Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 98013356 roy@karlsbakk.net http://blogg.karlsbakk.net/ GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med xenotyp etymologi. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2013-04-11 7:33 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-01-29 22:14 Persistent failures with simple md setup Hans-Peter Jansen 2013-01-30 9:07 ` Sebastian Riemer 2013-01-30 17:12 ` Hans-Peter Jansen 2013-02-04 20:43 ` Hans-Peter Jansen 2013-02-05 3:44 ` NeilBrown 2013-02-27 17:01 ` Hans-Peter Jansen 2013-02-28 3:40 ` NeilBrown 2013-02-28 10:49 ` Hans-Peter Jansen 2013-02-28 21:25 ` NeilBrown 2013-02-28 22:16 ` Hans-Peter Jansen [not found] ` <4291349.FrQcKOnicQ@xrated> 2013-03-03 23:33 ` NeilBrown 2013-03-13 0:52 ` NeilBrown 2013-03-15 22:43 ` Hans-Peter Jansen 2013-03-18 11:20 ` Hans-Peter Jansen 2013-03-21 3:24 ` NeilBrown 2013-04-10 13:28 ` Hans-Peter Jansen 2013-04-10 13:44 ` Hans-Peter Jansen 2013-04-11 7:33 ` NeilBrown 2013-01-30 9:20 ` Roy Sigurd Karlsbakk
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.