* SMART causes disks to go offline on an LSI SAS 1068 controller
From: Gabor Gombas @ 2009-09-14 14:29 UTC
To: smartmontools-support, linux-scsi

Hi,

I'm having problems when using smartmontools with SATA disks behind an LSI SAS controller. The machine is a Dell PowerEdge 1950-II, the controller in question:

02:08.0 SCSI storage controller [0100]: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS [1000:0054] (rev 01)
        Subsystem: Dell SAS 5/i Integrated Controller [1028:1f06]
        Flags: bus master, 66MHz, medium devsel, latency 72, IRQ 1270
        I/O ports at ec00 [disabled] [size=256]
        Memory at fc8fc000 (64-bit, non-prefetchable) [size=16K]
        Memory at fc8e0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at fc900000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 2
        Capabilities: [98] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Capabilities: [68] PCI-X non-bridge device
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=1
        Kernel driver in use: mptsas
        Kernel modules: mptsas

History:

- The machine was running with kernel 2.6.22 and smartmontools 5.37 & 5.38 (from Debian) for a long time. smartd occasionally complained about "Device: /dev/sdX, not capable of SMART self-check", but other than that the machine was stable. smartd configuration:

/dev/sda -d sat -a -s (L/../../4/03|S/../.././02|O/../../6/03) -m root -I 190 -I 194
/dev/sdb -d sat -a -s (L/../../4/03|S/../.././02|O/../../6/03) -m root -I 190 -I 194

  sda is a Samsung HD160JJ, sdb is a Seagate ST3160812AS (oh well).

- After switching to 2.6.26 (from Debian Lenny), running smartd started to cause the disks to go offline within a couple of hours after boot. Log sample:

Sep 7 08:50:36 gw kernel: [4917120.304690] mptscsih: ioc0: attempting task abort!
(sc=ffff81007ff26940) Sep 7 08:50:36 gw kernel: [4917120.304690] sd 0:0:1:0: [sdb] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00 Sep 7 08:50:40 gw kernel: [4917126.213130] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) Sep 7 08:50:40 gw kernel: [4917126.215970] mptsas: ioc0: removing sata device, channel 0, id 1, phy 1 Sep 7 08:50:40 gw kernel: [4917126.215974] port-0:1: mptsas: ioc0: delete port (1) Sep 7 08:50:40 gw kernel: [4917126.216570] sd 0:0:1:0: [sdb] Synchronizing SCSI cache Sep 7 08:50:40 gw kernel: [4917126.563597] mptscsih: ioc0: task abort: SUCCESS (sc=ffff81007ff26940) Sep 7 08:50:40 gw kernel: [4917126.563606] mptscsih: ioc0: attempting task abort! (sc=ffff81007ff26bc0) Sep 7 08:50:40 gw kernel: [4917126.563609] sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 01 49 f2 98 00 00 08 00 Sep 7 08:50:40 gw kernel: [4917126.563617] mptscsih: ioc0: task abort: SUCCESS (sc=ffff81007ff26bc0) Sep 7 08:50:40 gw kernel: [4917126.563623] mptscsih: ioc0: attempting target reset! (sc=ffff81007ff26940) Sep 7 08:50:40 gw kernel: [4917126.563625] sd 0:0:1:0: [sdb] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00 Sep 7 08:50:40 gw kernel: [4917126.897143] mptscsih: ioc0: target reset: SUCCESS (sc=ffff81007ff26940) Sep 7 08:50:40 gw kernel: [4917126.897143] mptscsih: ioc0: attempting bus reset! (sc=ffff81007ff26940) Sep 7 08:50:40 gw kernel: [4917126.897143] sd 0:0:1:0: [sdb] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00 Sep 7 08:50:44 gw kernel: [4917131.074580] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff81007ff26940) Sep 7 08:50:54 gw kernel: [4917145.159523] mptscsih: ioc0: attempting host reset! 
(sc=ffff81007ff26940) Sep 7 08:50:54 gw kernel: [4917145.163513] mptbase: ioc0: Initiating recovery Sep 7 08:51:10 gw kernel: [4917167.457273] mptscsih: ioc0: host reset: SUCCESS (sc=ffff81007ff26940) Sep 7 08:51:10 gw kernel: [4917167.457279] sd 0:0:1:0: Device offlined - not ready after error recovery Sep 7 08:51:10 gw kernel: [4917167.457282] sd 0:0:1:0: Device offlined - not ready after error recovery Sep 7 08:51:10 gw kernel: [4917167.457350] sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK Sep 7 08:51:10 gw kernel: [4917167.457357] end_request: I/O error, dev sdb, sector 21623448 Sep 7 08:51:10 gw kernel: [4917167.457364] raid1: Disk failure on sdb6, disabling device. Sep 7 08:51:10 gw kernel: [4917167.457365] raid1: Operation continuing on 1 devices. Sep 7 08:51:10 gw kernel: [4917167.457388] end_request: I/O error, dev sdb, sector 1959743 Sep 7 08:51:10 gw kernel: [4917167.457393] md: super_written gets error=-5, uptodate=0 Sep 7 08:51:10 gw kernel: [4917167.457398] raid1: Disk failure on sdb1, disabling device. Sep 7 08:51:22 gw kernel: [4917167.457399] raid1: Operation continuing on 1 devices. Sep 7 08:51:22 gw kernel: [4917167.457411] end_request: I/O error, dev sdb, sector 21478687 Sep 7 08:51:22 gw kernel: [4917167.457415] md: super_written gets error=-5, uptodate=0 Sep 7 08:51:22 gw kernel: [4917167.457420] raid1: Disk failure on sdb5, disabling device. Sep 7 08:51:22 gw kernel: [4917167.457421] raid1: Operation continuing on 1 devices. Sep 7 08:51:22 gw kernel: [4917167.461613] sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK Sep 7 08:51:22 gw kernel: [4917167.526799] raid1: Disk failure on sdb2, disabling device. Sep 7 08:51:22 gw kernel: [4917167.526801] raid1: Operation continuing on 1 devices. After such an error I have to manually remove and re-insert the drive to make the controller detect it again. - Upgrading to 2.6.30 (from Debian Sid) did not help. 
- Upgrading the controller firmware to the latest version available from Dell (the driver reports: FwRev=000a3300h) did not help.

- I've found this thread:

  http://marc.info/?l=smartmontools-support&m=122518510306493&w=2

  It claimed that a similar bug has been fixed in smartd in CVS HEAD as of 2008-10-30, so I've upgraded to smartmontools 5.38+svn2879-4 from Debian Sid (smartctl -V gives: smartctl 5.39 2009-08-29 r2879), but that also did not help.

Is this a kernel bug (2.6.22 at least did not drop the disks), or a bug in smartmontools?

Gabor

-- 
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
* Re: [smartmontools-support] SMART causes disks to go offline on an LSI SAS1068 controller - Dell SAS 5/iR
From: Tim Small @ 2009-10-27 17:30 UTC
To: Gabor Gombas; +Cc: smartmontools-support, linux-scsi, Linux-PowerEdge

Hello,

Just to say that I'm seeing this bug as well, with smartmontools 5.38 and smartctl 5.39 2009-10-10 r2955 on Debian lenny. The machine is a Dell PowerEdge 860. I'm guessing that this is either a firmware or driver issue.

02:08.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 01)
        Subsystem: Dell SAS 5/iR Adapter RAID Controller
        Flags: bus master, 66MHz, medium devsel, latency 72, IRQ 1275
        I/O ports at ec00 [disabled] [size=256]
        Memory at fe9fc000 (64-bit, non-prefetchable) [size=16K]
        Memory at fe9e0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at fea00000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 2
        Capabilities: [98] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Capabilities: [68] PCI-X non-bridge device
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=1
        Kernel driver in use: mptsas
        Kernel modules: mptsas

# modinfo mptsas
filename:       /lib/modules/2.6.26-2-openvz-amd64/kernel/drivers/message/fusion/mptsas.ko
version:        3.04.06
license:        GPL
description:    Fusion MPT SAS Host driver
author:         LSI Corporation

The errors look like this:

428.524463] mptscsih: ioc0: attempting task abort!
(sc=ffff81021b950940) 428.524471] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00 433.199851] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) 433.199851] mptsas: ioc0: removing sata device, channel 0, id 0, phy 0 433.199851] port-0:0: mptsas: ioc0: delete port (0) 433.199851] sd 0:0:0:0: [sda] Synchronizing SCSI cache 433.348856] mptscsih: ioc0: task abort: SUCCESS (sc=ffff81021b950940) 433.348868] mptscsih: ioc0: attempting task abort! (sc=ffff81021b950440) 433.348873] sd 0:0:0:0: [sda] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00 433.348885] mptscsih: ioc0: task abort: SUCCESS (sc=ffff81021b950440) 433.348893] mptscsih: ioc0: attempting target reset! (sc=ffff81021b950940) 433.348896] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00 433.605026] mptscsih: ioc0: target reset: SUCCESS (sc=ffff81021b950940) 433.605034] mptscsih: ioc0: attempting bus reset! (sc=ffff81021b950940) 433.605037] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00 434.157594] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff81021b950940) 444.546154] mptscsih: ioc0: attempting host reset! (sc=ffff81021b950940) 444.546162] mptbase: ioc0: Initiating recovery 461.540429] mptscsih: ioc0: host reset: SUCCESS (sc=ffff81021b950940) 461.540437] sd 0:0:0:0: Device offlined - not ready after error recovery 461.540440] sd 0:0:0:0: Device offlined - not ready after error recovery 461.540475] end_request: I/O error, dev sda, sector 15631039 461.540480] md: super_written gets error=-5, uptodate=0 461.540485] raid1: Disk failure on sda1, disabling device. and the drives are: Model Family: Seagate Barracuda ES Device Model: ST3250620NS Serial Number: 9QE3L9E0 Firmware Version: 3BKS and are in JBOD mode (+ sw RAID with md). 
lsiutil says:

Current active firmware version is 0.10.51
Firmware image's version is MPTFW-00.10.51.00-IE
LSI Logic x86 BIOS image's version is MPTBIOS-6.12.05.00 (2007.09.29)

... which is the latest on Dell's download pages for this server.

The kernel is 2.6.26-2-openvz-amd64 from Debian Lenny (same behaviour with non-openvz kernel).

Running smartd makes the drives disappear after a few hours, but doing this:

while true ; do smartctl -T permissive -d sat -a /dev/sda > /dev/null && echo -n . ; done

seems to knock them out in about a minute. Subjectively, 5.38 seemed to upset the controller a lot quicker than 5.39 r2955 does.

For good measure I'm currently stress-testing a PE1950 with a SAS 6/iR (SAS1068E) in the same way (however, this one is using RAID set up through the BIOS).

smartctl 5.39-pre needs '-T permissive' on the PE860, but 5.38 doesn't seem to require it.

Is it worth trying a newer mptsas driver?

Regards,
Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53
http://seoss.co.uk/ +44-(0)1273-808309
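The CDBs that the kernel prints before each task abort can be read field by field. What follows is only a minimal sketch, assuming the ATA PASS-THROUGH (16) byte layout from the SAT specification; the function name decode_ata16_cdb is made up for illustration and is not part of smartmontools or the kernel.

```shell
# Decode selected fields of an ATA PASS-THROUGH (16) CDB.
# Argument: the 16 CDB bytes as one space-separated hex string,
# exactly as they appear after "CDB: ATA command pass through(16):".
# Byte layout is assumed from the SAT spec (hypothetical helper).
decode_ata16_cdb() {
    # Split the string into positional parameters $1..$16.
    set -- $1
    [ "$#" -eq 16 ] || { echo "expected 16 bytes" >&2; return 1; }
    [ "$1" = "85" ] || { echo "not an ATA PASS-THROUGH (16) CDB" >&2; return 1; }

    # Byte 1, bits 4:1 = PROTOCOL; byte 2, bit 5 = CK_COND.
    protocol=$(( (0x$2 >> 1) & 0xf ))
    ck_cond=$(( (0x$3 >> 5) & 1 ))

    # Bytes 4, 6, 8, 10, 12 hold the low halves of FEATURES, COUNT and
    # LBA low/mid/high; byte 14 is the ATA COMMAND register.
    printf 'protocol=%d ck_cond=%d features=0x%s count=0x%s lba_low=0x%s lba_mid=0x%s lba_high=0x%s command=0x%s\n' \
        "$protocol" "$ck_cond" "$5" "$7" "$9" "${11}" "${13}" "${15}"
}

# Usage:
#   decode_ata16_cdb "85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00"
```

Fed the CDB from the PE860 log, this reports command 0xb0 with features 0xd5 (SMART READ LOG); fed the one from Gabor's log, it reports command 0xec (IDENTIFY DEVICE). Both are PIO data-in transfers (protocol 4) with CK_COND clear.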
* Apparent MPT ata pass-through bug SAS1068 and SAS1068E - WAS SMART causes disks to go offline on an LSI SAS1068 controller - Dell SAS 5/iR
From: Tim Small @ 2009-10-28 13:18 UTC
To: smartmontools-support, linux-scsi, Linux-PowerEdge; +Cc: Gabor Gombas

Hello,

On a Dell PowerEdge 1950 Debian 5.0 amd64 system (2.6.26-2-amd64), which includes one of these:

01:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
        Subsystem: Dell SAS 6/iR Integrated RAID Controller
        Flags: bus master, fast devsel, latency 0, IRQ 1270
        I/O ports at ec00 [size=256]
        Memory at fc5fc000 (64-bit, non-prefetchable) [size=16K]
        Memory at fc5e0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at fc600000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 2
        Capabilities: [68] Express Endpoint, MSI 00
        Capabilities: [98] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=1
        Capabilities: [100] Advanced Error Reporting <?>
        Kernel driver in use: mptsas
        Kernel modules: mptsas

filename:       /lib/modules/2.6.26-2-amd64/kernel/drivers/message/fusion/mptsas.ko
version:        3.04.06
license:        GPL
description:    Fusion MPT SAS Host driver
author:         LSI Corporation

.. and a couple of Western Digital SATA drives, I ran the following command:

while true ; do smartctl -a /dev/sg0 > /dev/null ; done

After approx 45 minutes this happened:

kernel: [5060492.926757] mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @602 - Controller disabled.

... and all the attached block devices were no longer available. The machine also runs mpt-status.

Regards,
Tim.
* RE: Apparent MPT ata pass-through bug SAS1068 and SAS1068E - WAS SMART causes disks to go offline on an LSI SAS1068 controller - Dell SAS 5/iR
From: Desai, Kashyap @ 2009-10-28 13:28 UTC
To: Tim Small, smartmontools-support, linux-scsi; +Cc: Gabor Gombas

Tim,

Can you try doing the same test after upgrading our driver to 3.04.13? You can find the relevant patches at

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fjejb%2Fscsi-misc-2.6.git&a=search&h=993340e8ab4e856bf0fc7818cdca1a92f6e8ed38&st=commit&s=kashyap

Thanks,
Kashyap
* Re: Apparent MPT ata pass-through bug SAS1068 and SAS1068E - WAS SMART causes disks to go offline on an LSI SAS1068 controller - Dell SAS 5/iR
From: Tim Small @ 2009-10-28 16:56 UTC
To: Desai, Kashyap; +Cc: smartmontools-support, linux-scsi, Linux-PowerEdge, Gabor Gombas

Desai, Kashyap wrote:
> Can you try doing the same test upgrading our driver to 3.04.13?
> You can find relevant patches at
>
> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fjejb%2Fscsi-misc-2.6.git&a=search&h=993340e8ab4e856bf0fc7818cdca1a92f6e8ed38&st=commit&s=kashyap
>

Have just compiled a kernel package from James' tree directly, rather than attempting to back-port and manually patch, so I'm now running 2.6.32-rc4 with mptsas 3.04.13.

I'm running the smartctl -a in a loop at the moment and will leave it running over-night, but with this kernel I get a pair of messages like this:

[ 1045.130560] scsi 2:0:0:0: [sg1] Sense Key : Recovered Error [current] [descriptor]
[ 1045.145751] Descriptor sense data with sense descriptors (in hex):
[ 1045.158107]         72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
[ 1045.171010]         00 4f 00 c2 00 50
[ 1045.178585] scsi 2:0:0:0: [sg1] Add. Sense: ATA pass through information available
[ 1045.280318] scsi 2:0:0:0: [sg1] Sense Key : Recovered Error [current] [descriptor]
[ 1045.284311] Descriptor sense data with sense descriptors (in hex):
[ 1045.284311]         72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
[ 1045.284311]         00 4f 00 c2 00 50
[ 1045.284311] scsi 2:0:0:0: [sg1] Add. Sense: ATA pass through information available

for every 'smartctl -a /dev/sg1' command which is run.

Cheers,
Tim.
* Re: Apparent MPT ata pass-through bug SAS1068 and SAS1068E - WAS SMART causes disks to go offline on an LSI SAS1068 controller - Dell SAS 5/iR
From: Douglas Gilbert @ 2009-10-28 21:10 UTC
To: Tim Small; +Cc: Desai, Kashyap, smartmontools-support, linux-scsi, Linux-PowerEdge, Gabor Gombas

Tim Small wrote:
> Desai, Kashyap wrote:
>> Can you try doing the same test upgrading our driver to 3.04.13?
>> You can find relevant patches at
>> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fjejb%2Fscsi-misc-2.6.git&a=search&h=993340e8ab4e856bf0fc7818cdca1a92f6e8ed38&st=commit&s=kashyap
>>
>>
>
> Have just compiled a kernel package from James' tree directly, rather
> than attempting to back-port and manually patch, so I'm now running
> 2.6.32-rc4 with mptsas 3.04.13. I'm running the smartctl -a in a loop
> at the moment and will leave it running over-night, but with this kernel
> I get a pair of messages like this:
>
> [ 1045.130560] scsi 2:0:0:0: [sg1] Sense Key : Recovered Error [current]
> [descriptor]
> [ 1045.145751] Descriptor sense data with sense descriptors (in hex):
> [ 1045.158107] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [ 1045.171010] 00 4f 00 c2 00 50
> [ 1045.178585] scsi 2:0:0:0: [sg1] Add. Sense: ATA pass through
> information available
> [ 1045.280318] scsi 2:0:0:0: [sg1] Sense Key : Recovered Error [current]
> [descriptor]
> [ 1045.284311] Descriptor sense data with sense descriptors (in hex):
> [ 1045.284311] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [ 1045.284311] 00 4f 00 c2 00 50
> [ 1045.284311] scsi 2:0:0:0: [sg1] Add. Sense: ATA pass through
> information available
>
> for every 'smartctl -a /dev/sg1' command which is run.

Tim,

This is _not_ an error.
If the CK_COND bit is set in the SCSI ATA PASS-THROUGH (12 or 16 byte) cdb and the ATA command succeeds then what is shown above is correct. The whole point is to get the ATA registers after the command is complete. The register values are placed in an ATA (status) return descriptor encapsulated in sense data with that sense key and those additional sense codes. The ATA return descriptor starts with the "09" value in the sense buffer shown above.

smartmontools needs to set CK_COND on some ATA commands (e.g. to get the SMART status) because the result can only be found in the ATA registers after completion.

Now it is annoying, distracting and wasteful to log the sense data in this particular situation. Perhaps the SCSI mid level error reporting should filter out that particular combination:

  Sense key: RECOVERED ERROR
  Additional sense: ATA PASS THROUGH INFORMATION AVAILABLE (0x0,0x1d)

References: sat-r09.pdf sat2r09.pdf

Doug Gilbert
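To make that layout concrete, here is a minimal sketch that pulls the ATA registers back out of the 22 sense bytes quoted above. It assumes the descriptor-format sense header (response code 0x72) and the ATA return descriptor layout from the SAT drafts Doug references; the function name decode_ata_return is made up for illustration. The "smart=" field is only meaningful when the command was SMART RETURN STATUS, where LBA mid/high of 0x4f/0xc2 means the thresholds were not exceeded and 0xf4/0x2c means they were.

```shell
# Decode descriptor-format sense data carrying an ATA return descriptor.
# Argument: the sense bytes as one space-separated hex string.
# Layout assumed from SAT (hypothetical helper, not a kernel or
# smartmontools interface).
decode_ata_return() {
    # Split the string into positional parameters $1..$22.
    set -- $1
    [ "$#" -ge 22 ] || { echo "need at least 22 sense bytes" >&2; return 1; }
    [ "$1" = "72" ] || { echo "not descriptor-format sense data" >&2; return 1; }
    [ "$9" = "09" ] || { echo "first descriptor is not an ATA return descriptor" >&2; return 1; }

    # Byte 1 low nibble = sense key; bytes 2 and 3 = ASC and ASCQ.
    sense_key=$(( 0x$2 & 0xf ))

    # Inside the descriptor (which starts at byte 8): byte 11 = ERROR,
    # byte 17 = LBA mid, byte 19 = LBA high, byte 21 = STATUS.
    case "${18}${20}" in
        4[fF][cC]2) smart=ok ;;
        [fF]42[cC]) smart=threshold_exceeded ;;
        *)          smart=unknown ;;
    esac

    printf 'sense_key=0x%x asc=0x%s ascq=0x%s error=0x%s lba_mid=0x%s lba_high=0x%s status=0x%s smart=%s\n' \
        "$sense_key" "$3" "$4" "${12}" "${18}" "${20}" "${22}" "$smart"
}

# Usage:
#   decode_ata_return "72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 00 4f 00 c2 00 50"
```

For the bytes in the quoted log this gives sense key 0x1 (RECOVERED ERROR), ASC/ASCQ 0x00/0x1d, a clean error register, status 0x50 and smart=ok, which fits the reading that the command succeeded and the "error" is just the register dump being returned.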
* Re: [smartmontools-support] Apparent MPT ata pass-through bug SAS1068 and SAS1068E - WAS SMART causes disks to go offline on an LSI SAS1068 controller - Dell SAS 5/iR
From: Tim Small @ 2009-10-29 8:59 UTC
To: dgilbert; +Cc: Gabor Gombas, smartmontools-support, Desai, Kashyap, linux-scsi, Linux-PowerEdge

Douglas Gilbert wrote:
> Now it is annoying, distracting and wasteful to log the sense
> data in this particular situation. Perhaps the SCSI mid level
> error reporting should filter out that particular combination:
> Sense key: RECOVERED ERROR
> Additional sense: ATA PASS THROUGH INFORMATION AVAILABLE (0x0,0x1d)
>
> References: sat-r09.pdf sat2r09.pdf
>

Would this best be implemented as a patch to scsi_print_sense_hdr(...) / scsi_cmd_print_sense_hdr(...) in drivers/scsi/constants.c?

Sorry - I've never touched any of the scsi code, but I can probably look into drafting a patch tomorrow?

Thanks,
Tim.
* Re: Apparent MPT ata pass-through bug SAS1068 and SAS1068E - WAS SMART causes disks to go offline on an LSI SAS1068 controller - Dell SAS 5/iR
From: Tim Small @ 2009-10-29 9:01 UTC
To: Desai, Kashyap; +Cc: Gabor Gombas, smartmontools-support, linux-scsi, Linux-PowerEdge

Tim Small wrote:
> 2.6.32-rc4 with mptsas 3.04.13. I'm running the smartctl -a in a loop
> at the moment and will leave it running over-night

Hasn't crashed yet (15 hrs). The following has been logged however, which looks to me like ATA pass-through isn't working right...

[ 22.414415] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 9, phy 0, sas_addr 0x1221000000000000
[ 22.466953] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 1, phy 1, sas_addr 0x1221000001000000
[ 22.519305] mptsas: ioc0: attaching raid volume, channel 1, id 0
[ 33.727405] Fusion MPT misc device (ioctl) driver 3.04.13
[ 33.738270] mptctl: Registered with Fusion MPT base driver
[ 33.749277] mptctl: /dev/mptctl @ (major,minor=10,220)
[ 5300.611795] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5300.629028] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5300.646254] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5300.663478] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5300.680700] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5300.697924] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5312.111795] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
[ 5312.131469] mptscsih: ioc0: attempting task abort!
(sc=ffff88012c5fc8c0) [ 5312.156831] mptscsih: ioc0: task abort: FAILED (sc=ffff88012c5fc8c0) [ 5312.169534] mptscsih: ioc0: attempting target reset! (sc=ffff88012c5fc8c0) [ 5312.195222] mptscsih: ioc0: target reset: FAILED (sc=ffff88012c5fc8c0) [ 5312.208276] mptscsih: ioc0: attempting bus reset! (sc=ffff88012c5fc8c0) [ 5316.612245] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88012c5fc8c0) [ 5328.112389] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) [ 5328.128508] mptscsih: ioc0: attempting host reset! (sc=ffff88012c5fc8c0) [12537.867482] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) [12537.885769] mptscsih: ioc0: attempting host reset! (sc=ffff88012d55c8c0) [12537.899173] mptbase: ioc0: Initiating recovery [12559.704264] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88012d55c8c0) [44184.424640] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) [44184.441866] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) [44195.924782] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) [44195.944449] mptscsih: ioc0: attempting task abort! (sc=ffff88012c403ac0) [44195.969799] mptscsih: ioc0: task abort: FAILED (sc=ffff88012c403ac0) [44195.982500] mptscsih: ioc0: attempting target reset! (sc=ffff88012c403ac0) [44196.008182] mptscsih: ioc0: target reset: FAILED (sc=ffff88012c403ac0) [44196.021230] mptscsih: ioc0: attempting bus reset! (sc=ffff88012c403ac0) [44200.425026] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88012c403ac0) [44211.925127] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) [44211.943416] mptscsih: ioc0: attempting host reset! 
(sc=ffff88012c403ac0)
[44211.956814] mptbase: ioc0: Initiating recovery
[44233.760010] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88012c403ac0)
[49878.447977] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[49889.948381] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
[49889.968080] mptscsih: ioc0: attempting task abort! (sc=ffff88003799acc0)
[49889.993425] mptscsih: ioc0: task abort: FAILED (sc=ffff88003799acc0)
[49890.006129] mptscsih: ioc0: attempting target reset! (sc=ffff88003799acc0)
[49890.031817] mptscsih: ioc0: target reset: FAILED (sc=ffff88003799acc0)
[49890.044869] mptscsih: ioc0: attempting bus reset! (sc=ffff88003799acc0)
[49894.448617] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88003799acc0)
[49905.948189] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[49905.966491] mptscsih: ioc0: attempting host reset! (sc=ffff88003799acc0)
[49905.979888] mptbase: ioc0: Initiating recovery

... I will impose a bit of extra IO load on the machine to see if that provokes more errors.

Thanks,
Tim.
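The LogInfo words in these logs can be split into the same fields the driver prints. This is a rough sketch only: the bit layout (bus type in bits 31:28, originator in 27:24, code in 23:16, sub-code in 15:0) is an assumption that happens to agree with every LogInfo line in this thread, and the only names filled in are the ones the logs themselves show. The function name decode_mpt_loginfo is made up for illustration.

```shell
# Split an MPT LogInfo word into its fields.
# Argument: the 32-bit value as printed in the log, e.g. 0x31140000.
# Bit layout is an assumption consistent with the log lines in this
# thread; names are limited to those that appear in the logs.
decode_mpt_loginfo() {
    v=$(( $1 ))
    bus=$(( (v >> 28) & 0xf ))
    orig=$(( (v >> 24) & 0xf ))
    code=$(( (v >> 16) & 0xff ))
    sub=$(( v & 0xffff ))

    # Originator 1 is printed as {PL} in the logs; others stay numeric.
    case "$orig" in
        1) orig_name=PL ;;
        *) orig_name=$orig ;;
    esac

    # Codes seen in this thread: 0x11, 0x13 and 0x14.
    case "$code" in
        17) code_name='Reset' ;;
        19) code_name='IO Not Yet Executed' ;;
        20) code_name='IO Executed' ;;
        *)  code_name='unknown' ;;
    esac

    printf 'bus_type=%d originator=%s code=0x%02x (%s) subcode=0x%04x\n' \
        "$bus" "$orig_name" "$code" "$code_name" "$sub"
}

# Usage:
#   decode_mpt_loginfo 0x31110d00
```

Piping the kernel log through something like this makes it easier to see the repeating pattern here: a burst of Reset (0x11) entries, then IO Not Yet Executed (0x13), then IO Executed (0x14) just before the host reset.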
* Re: [smartmontools-support] Apparent MPT ata pass-through bug SAS1068 and SAS1068E - WAS SMART causes disks to go offline on an LSI SAS1068 controller - Dell SAS 5/iR
  2009-10-29  9:01 ` Tim Small
@ 2009-10-29  9:55 ` Tim Small
  0 siblings, 0 replies; 9+ messages in thread
From: Tim Small @ 2009-10-29 9:55 UTC
To: Desai, Kashyap
Cc: Gabor Gombas, smartmontools-support, linux-scsi, Linux-PowerEdge

Tim Small wrote:
> ... I will impose a bit of extra IO load on the machine to see if that
> provokes more errors.
>

The answer would seem to be yes - whilst simultaneously running these
two commands:

while true ; do dd if=/dev/zero of=empty count=1M ; sync ; rm empty ; sync ; done

and:

while true ; do smartctl -a /dev/sg1 > /dev/null || echo failed && echo -n . ; done

... about 10% of the smartctl commands fail, and this sort of thing
gets logged:

[61729.829710] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61729.833705] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61729.833705] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61729.833705] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61729.833705] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61729.833705] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61729.833705] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61729.833705] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61729.833705] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61729.833705] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61729.833705] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61730.019141] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61741.334274] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
[61741.353972] mptscsih: ioc0: attempting task abort! (sc=ffff880037b6c880)
[61741.367368] scsi 2:0:0:0: [sg1] CDB: Inquiry: 12 00 00 00 24 00
[61741.379314] mptscsih: ioc0: task abort: FAILED (sc=ffff880037b6c880)
[61741.392017] mptscsih: ioc0: attempting target reset! (sc=ffff880037b6c880)
[61741.405757] scsi 2:0:0:0: [sg1] CDB: Inquiry: 12 00 00 00 24 00
[61741.417702] mptscsih: ioc0: target reset: FAILED (sc=ffff880037b6c880)
[61741.430752] mptscsih: ioc0: attempting bus reset! (sc=ffff880037b6c880)
[61741.443970] scsi 2:0:0:0: [sg1] CDB: Inquiry: 12 00 00 00 24 00
[61745.830347] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff880037b6c880)
[61757.329906] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[61757.348194] mptscsih: ioc0: attempting host reset! (sc=ffff880037b6c880)
[61757.361592] mptbase: ioc0: Initiating recovery
[61779.120762] mptscsih: ioc0: host reset: SUCCESS (sc=ffff880037b6c880)
[61795.240058] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61795.244054] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61806.744084] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
[61806.763772] mptscsih: ioc0: attempting task abort! (sc=ffff880037b6c380)
[61806.777179] scsi 2:0:0:0: [sg1] CDB: Inquiry: 12 00 00 00 24 00
[61806.789127] mptscsih: ioc0: task abort: FAILED (sc=ffff880037b6c380)
[61806.801833] mptscsih: ioc0: attempting target reset! (sc=ffff880037b6c380)
[61806.815575] scsi 2:0:0:0: [sg1] CDB: Inquiry: 12 00 00 00 24 00
[61806.827520] mptscsih: ioc0: target reset: FAILED (sc=ffff880037b6c380)
[61806.840575] mptscsih: ioc0: attempting bus reset! (sc=ffff880037b6c380)
[61806.853797] scsi 2:0:0:0: [sg1] CDB: Inquiry: 12 00 00 00 24 00
[61811.240162] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff880037b6c380)
[61822.739995] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[61822.758297] mptscsih: ioc0: attempting host reset! (sc=ffff880037b6c380)
[61822.771694] mptbase: ioc0: Initiating recovery
[61844.528012] mptscsih: ioc0: host reset: SUCCESS (sc=ffff880037b6c380)
[61865.400161] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61865.404157] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61865.404157] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61865.404157] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[61876.904450] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
[61876.924174] mptscsih: ioc0: attempting task abort! (sc=ffff8800c0218d80)
[61876.937577] scsi 2:0:0:0: [sg1] CDB: Inquiry: 12 00 00 00 24 00
[61876.949527] mptscsih: ioc0: task abort: FAILED (sc=ffff8800c0218d80)
[61876.962233] mptscsih: ioc0: attempting target reset! (sc=ffff8800c0218d80)
[61876.975974] scsi 2:0:0:0: [sg1] CDB: Inquiry: 12 00 00 00 24 00
[61876.987918] mptscsih: ioc0: target reset: FAILED (sc=ffff8800c0218d80)
[61877.000971] mptscsih: ioc0: attempting bus reset! (sc=ffff8800c0218d80)
[61877.014193] scsi 2:0:0:0: [sg1] CDB: Inquiry: 12 00 00 00 24 00
[61881.400528] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff8800c0218d80)
[61892.900633] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[61892.918924] mptscsih: ioc0: attempting host reset! (sc=ffff8800c0218d80)
[61892.932322] mptbase: ioc0: Initiating recovery
[61914.688765] mptscsih: ioc0: host reset: SUCCESS (sc=ffff8800c0218d80)
[61924.300535] INFO: task sync:15809 blocked for more than 120 seconds.
[61924.313245] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[61924.328907] sync D 0000000000000000 0 15809 9780 0x00000000
[61924.342681] ffffffff814ee8b0 0000000000000082 0000000000000000 000000005fb8f9b9
[61924.357538] 000000005fb8f9b9 0000000000000000 00000000000108a0 ffff8800379bdfd8
[61924.372387] 0000000000015980 0000000000015980 ffff88012e4ab040 ffff88012e4ab338
[61924.387241] Call Trace:
[61924.392145] [<ffffffffa01afcf5>] ? log_wait_commit+0xcf/0x137 [jbd]
[61924.404848] [<ffffffff8107cc8a>] ? autoremove_wake_function+0x0/0x59
[61924.417725] [<ffffffffa01c9c8c>] ? ext3_sync_fs+0x52/0x70 [ext3]
[61924.429906] [<ffffffff8116ae4d>] ? sync_quota_sb+0x59/0x133
[61924.441222] [<ffffffff81141bbc>] ? __sync_filesystem+0x5f/0xab
[61924.453057] [<ffffffff81141cb6>] ? sync_filesystems+0xae/0x110
[61924.464893] [<ffffffff81141d9a>] ? sys_sync+0x2c/0x56
[61924.475169] [<ffffffff81010e02>] ? system_call_fastpath+0x16/0x1b

... so I'm assuming that the same race occurs with ATA pass-through
commands, but error recovery is better with 2.6.32-rc4 + mptsas 3.04.13

Cheers,

Tim.
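
For anyone reading the mptbase lines in the log above: the LogInfo words can be split into fields by hand. What follows is only a minimal sketch, assuming the layout implied by the decodings the driver prints here (top 4 bits bus type, next 4 bits originator, next 8 bits code, low 16 bits subcode). decode_loginfo is a hypothetical helper, not part of the mpt driver or smartmontools, and it names only the originator and the three codes that actually appear in this thread.

```shell
#!/bin/sh
# Split a 32-bit MPT LogInfo word into its fields.
# ASSUMED layout, inferred from the log lines in this thread:
#   bits 31-28 bus type, 27-24 originator, 23-16 code, 15-0 subcode.
decode_loginfo() {
    word=$(( $1 ))
    bus=$(( (word >> 28) & 0xf ))
    originator=$(( (word >> 24) & 0xf ))
    code=$(( (word >> 16) & 0xff ))
    subcode=$(( word & 0xffff ))

    orig_name=unknown
    code_name=unknown
    # Only originator PL (value 1) shows up in this thread, so the
    # code names below are looked up for PL only.
    if [ "$originator" -eq 1 ]; then
        orig_name=PL
        case $code in
            17) code_name="Reset" ;;                # 0x11
            19) code_name="IO Not Yet Executed" ;;  # 0x13
            20) code_name="IO Executed" ;;          # 0x14
        esac
    fi

    printf 'bus=0x%x originator=%s code=0x%02x (%s) subcode=0x%04x\n' \
        "$bus" "$orig_name" "$code" "$code_name" "$subcode"
}

# The three distinct LogInfo words from the log above.
for w in 0x31110d00 0x31130000 0x31140000; do
    decode_loginfo "$w"
done
```

Any other originator or code is reported as "unknown" rather than guessed at.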
end of thread, other threads:[~2009-10-29 9:56 UTC | newest]

Thread overview: 9+ messages
2009-09-14 14:29 SMART causes disks to go offline on an LSI SAS 1068 controller Gabor Gombas
2009-10-27 17:30 ` [smartmontools-support] SMART causes disks to go offline on an LSI SAS1068 controller - Dell SAS 5/iR Tim Small
2009-10-28 13:18 ` Apparent MPT ata pass-through bug SAS1068 and SAS1068E - WAS " Tim Small
2009-10-28 13:28 ` Desai, Kashyap
2009-10-28 16:56 ` Tim Small
2009-10-28 21:10 ` Douglas Gilbert
2009-10-29 8:59 ` [smartmontools-support] " Tim Small
2009-10-29 9:01 ` Tim Small
2009-10-29 9:55 ` [smartmontools-support] " Tim Small