* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
From: bugzilla-daemon @ 2010-03-21 11:34 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=14831


Brian Sullivan <bexamous@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bexamous@gmail.com




--- Comment #11 from Brian Sullivan <bexamous@gmail.com>  2010-03-21 11:34:37 ---
I too am running into this bug.

Here is the firmware revision of the onboard LSI controller I am using:
Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC
/proc/mpt/ioc0    LSI Logic SAS1068E B2     105      01160000     0

I tried the 2.6.32 kernel from Ubuntu 10.04 and then updated to 2.6.33 from
mainline. I also tried updating the mptsas driver to the latest version from
LSI's site, v4.18.00.00. Nothing seemed to improve the issue.

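The problem for me has been that if I read SMART info fast enough, or for long
enough, the command eventually fails. The driver then escalates through a task
abort, a target reset, a bus reset and finally a host reset, and this takes a
while: in the log below, roughly 50 seconds pass between the first task abort
and the completed host reset.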
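
To put numbers on that escalation, the Python sketch below pairs each mptscsih
"attempting X!" line from the hddtemp log in this comment with its matching
"X: SUCCESS/FAILED" line and measures each step. It is a hypothetical
log-reading helper, not part of any driver or tool, and its regular
expressions assume the exact mptscsih wording shown in this log.

```python
import re

# mptscsih lines copied from the hddtemp log in this comment (CDB and
# LogInfo lines omitted). The patterns below assume this exact wording.
LOG = """\
[156291.890023] mptscsih: ioc0: attempting task abort! (sc=ffff880369e51000)
[156293.533080] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880369e51000)
[156303.531268] mptscsih: ioc0: attempting task abort! (sc=ffff880369e51000)
[156303.531283] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880369e51000)
[156303.531299] mptscsih: ioc0: attempting target reset! (sc=ffff880369e51000)
[156305.050176] mptscsih: ioc0: target reset: FAILED (sc=ffff880369e51000)
[156305.050185] mptscsih: ioc0: attempting bus reset! (sc=ffff880369e51000)
[156309.553552] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff880369e51000)
[156329.560014] mptscsih: ioc0: attempting task abort! (sc=ffff880369e51000)
[156331.297903] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880369e51000)
[156331.297907] mptscsih: ioc0: attempting host reset! (sc=ffff880369e51000)
[156342.470033] mptscsih: ioc0: host reset: SUCCESS (sc=ffff880369e51000)
"""

STEP = r"(task abort|target reset|bus reset|host reset)"
START = re.compile(r"\[\s*(\d+\.\d+)\] mptscsih: \w+: attempting " + STEP + "!")
END = re.compile(r"\[\s*(\d+\.\d+)\] mptscsih: \w+: " + STEP + r": (SUCCESS|FAILED)")


def escalation_steps(log):
    """Return (step, result, seconds) for each attempt/result pair."""
    steps, pending = [], None
    for line in log.splitlines():
        m = START.search(line)
        if m:
            pending = (float(m.group(1)), m.group(2))
            continue
        m = END.search(line)
        if m and pending and pending[1] == m.group(2):
            steps.append((m.group(2), m.group(3), float(m.group(1)) - pending[0]))
            pending = None
    return steps


def total_outage(log):
    """Seconds from the first to the last timestamp in the log."""
    stamps = [float(t) for t in re.findall(r"^\[\s*(\d+\.\d+)\]", log, re.M)]
    return stamps[-1] - stamps[0]


for step, result, secs in escalation_steps(LOG):
    print(f"{step:13s} {result:8s} {secs:7.3f}s")
print(f"total: {total_outage(LOG):.1f}s")
```

On this log the host reset alone accounts for about 11 seconds and the bus
reset for about 4.5, with idle gaps between the steps making up most of the
rest.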

I believe the pause is what sometimes causes drives to drop off the controller.
I am not sure what is to blame, but one workaround is to go into the LSI
controller's BIOS and set all the timeout values to 0. The default timeout
values seem to vary depending on which 1068E card you have and which firmware
is installed. After setting all timeout values to 0, I still have the problem
with ATA pass-through, but the drives no longer drop off the controller when I
hit the pass-through bug.

Also, I have both WDC and Hitachi drives. Both behave the same.

BTW, here are the errors I get when running hddtemp, basically the same as the OP's:
[156291.890023] mptscsih: ioc0: attempting task abort! (sc=ffff880369e51000)
[156291.890028] sd 7:0:12:0: [sdo] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[156293.532938] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[156293.533080] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880369e51000)
[156303.531268] mptscsih: ioc0: attempting task abort! (sc=ffff880369e51000)
[156303.531274] sd 7:0:12:0: [sdo] CDB: Test Unit Ready: 00 00 00 00 00 00
[156303.531283] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880369e51000)
[156303.531299] mptscsih: ioc0: attempting target reset! (sc=ffff880369e51000)
[156303.531302] sd 7:0:12:0: [sdo] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[156305.050176] mptscsih: ioc0: target reset: FAILED (sc=ffff880369e51000)
[156305.050185] mptscsih: ioc0: attempting bus reset! (sc=ffff880369e51000)
[156305.050189] sd 7:0:12:0: [sdo] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[156309.553552] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff880369e51000)
[156329.560014] mptscsih: ioc0: attempting task abort! (sc=ffff880369e51000)
[156329.560020] sd 7:0:12:0: [sdo] CDB: Test Unit Ready: 00 00 00 00 00 00
[156331.297762] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
[156331.297903] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880369e51000)
[156331.297907] mptscsih: ioc0: attempting host reset! (sc=ffff880369e51000)
[156342.470033] mptscsih: ioc0: host reset: SUCCESS (sc=ffff880369e51000)
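
The CDB that keeps timing out above can be decoded field by field. The Python
sketch below follows the SAT (SCSI/ATA Translation) layout of the 16-byte ATA
PASS-THROUGH command; the protocol and ATA command tables are deliberately
partial, and the helper name is hypothetical.

```python
# Partial SAT PROTOCOL table (CDB byte 1, bits 4..1).
PROTOCOLS = {3: "Non-data", 4: "PIO Data-In", 5: "PIO Data-Out", 6: "DMA"}

# Partial ATA command table (CDB byte 14).
ATA_COMMANDS = {0xB0: "SMART", 0xEC: "IDENTIFY DEVICE"}


def decode_ata_pt16(hex_cdb: str) -> dict:
    """Decode an ATA PASS-THROUGH(16) CDB written as space-separated hex bytes."""
    cdb = bytes.fromhex(hex_cdb)
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH(16) CDB")
    proto = (cdb[1] >> 1) & 0x0F
    return {
        "protocol": PROTOCOLS.get(proto, f"protocol {proto}"),
        "extend": cdb[1] & 0x01,           # 1 = 48-bit register set
        "ck_cond": (cdb[2] >> 5) & 0x01,   # 1 = return ATA registers in sense data
        "data_in": (cdb[2] >> 3) & 0x01,   # T_DIR: 1 = device to host
        "byt_blok": (cdb[2] >> 2) & 0x01,  # 1 = transfer length counted in blocks
        "t_length": cdb[2] & 0x03,         # 2 = length taken from the COUNT field
        "count": (cdb[5] << 8) | cdb[6],
        "device": cdb[13],
        "command": ATA_COMMANDS.get(cdb[14], f"0x{cdb[14]:02X}"),
    }


print(decode_ata_pt16("85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00"))
```

For the hddtemp CDB this gives protocol PIO Data-In, with CK_COND, T_DIR and
BYT_BLOK set, T_LENGTH 2 and ATA command 0xEC (IDENTIFY DEVICE).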

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

From: bugzilla-daemon @ 2010-03-30 14:53 UTC
  To: linux-scsi

--- Comment #12 from bpkroth@gmail.com  2010-03-30 14:52:52 ---
I'm also seeing the problem, also on a Dell server.  I'm running Debian
Squeeze.  I didn't see this issue with 2.6.29.

# cat /proc/mpt/summary 
ioc0: LSISAS1068E B3, FwRev=00192f00h, Ports=1, MaxQ=266, IRQ=32

# uname -a
Linux oberon 2.6.32-3-amd64 #1 SMP Wed Feb 24 18:07:42 UTC 2010 x86_64 GNU/Linux

# dmesg | egrep -i 'scsi 0.*(wdc|samsung)'
[   18.427778] scsi 0:0:0:0: Direct-Access     ATA      WDC WD1602ABKS-1 3B04 PQ: 0 ANSI: 5
[   18.432650] scsi 0:0:1:0: Direct-Access     ATA      WDC WD1602ABKS-1 3B04 PQ: 0 ANSI: 5
[   18.448696] scsi 0:0:2:0: Direct-Access     ATA      SAMSUNG HD103UJ  1118 PQ: 0 ANSI: 5
[   18.464756] scsi 0:0:3:0: Direct-Access     ATA      SAMSUNG HD103UJ  1118 PQ: 0 ANSI: 5
[   18.480824] scsi 0:0:4:0: Direct-Access     ATA      SAMSUNG HD103UJ  1118 PQ: 0 ANSI: 5
[   18.496910] scsi 0:0:5:0: Direct-Access     ATA      SAMSUNG HD103UJ  1118 PQ: 0 ANSI: 5

Please let me know if you need any additional information. I would like to be
able to know when my drives are starting to have issues so I can replace them
before they cause major problems. Right now I can't issue SMART queries without
the disks getting thrown out of the md array.

Thanks,
Brian

From: bugzilla-daemon @ 2010-04-29  7:25 UTC
  To: linux-scsi

--- Comment #13 from Brian Sullivan <bexamous@gmail.com>  2010-04-29 07:25:25 ---
This issue seems to get no attention. I would be happy to buy a different HBA,
but the 1068E is used everywhere and I can't figure out what to buy as an
alternative.

Is this a firmware bug? There are a billion firmware versions for the 1068E.
I'm stuck with an onboard controller and not sure if it's possible to update
its firmware. Would it be worth buying an add-in card to be able to try
different firmware versions?

Does LSI care about this?  If not, fine.  If so, is the problem you cannot
reproduce it?

If you google a bit you can find many people running into this bug.

Argh, I'm so frustrated.

From: bugzilla-daemon @ 2010-04-29  9:27 UTC
  To: linux-scsi

--- Comment #14 from amf <andy.fletcher@ukdedicated.com>  2010-04-29 09:27:19 ---
This isn't going to solve your MPTSAS problem, but it may help you decide to
move on.

I've found LSI to be fairly unresponsive on this matter, and the same goes for
my vendor (Dell, who lose interest the moment you mention Linux).

LSI have now brought out their new range of 6Gb/s cards, which use MPT2, so I
doubt they will take MPT development any further.

The good news is that if you get hold of an MPT2 card (LSISAS2008 chipset) you
should (at least by my testing) be able to migrate your RAID to this card using
the BIOS. The MPT2 drivers seem a whole lot better (I understand they're a
complete re-write, which speaks volumes about how good the MPT driver was,
IMHO). I've not been able to break them so far.

Performance also seems to be quite improved.

HTH.

From: bugzilla-daemon @ 2010-05-03 22:22 UTC
  To: linux-scsi

Ryan Kuester <rkuester@kspace.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rkuester@kspace.net




--- Comment #15 from Ryan Kuester <rkuester@kspace.net>  2010-05-03 22:22:06 ---
Take a look at my diagnosis here:
http://lkml.org/lkml/2010/4/26/335

It includes a rough-draft patch.  I'd be very interested in hearing reports of
whether this fixes this smartctl issue in others' environments as it has in
mine.

The reason I haven't proposed it as a real patch is that there's probably a
better location for that code.  Where I have it, it'll apply to every SCSI host
using the MPT Fusion framework, and if it's a hardware bug, perhaps we want it
to apply only to this specific LSI 1068 controller.

That said, I expect most requests hitting the device are already well-aligned,
so this wouldn't affect many requests even if it did apply to a
broader-than-necessary collection of hardware.
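
Why a broadly applied alignment fix should cost little can be illustrated
with a small model. The Python sketch below mimics the kind of check the Linux
block layer makes when mapping a user buffer: if the buffer address or length
is not a multiple of the queue's DMA alignment (kept as a mask, alignment
minus one, and raised by drivers with blk_queue_dma_alignment()), the data is
copied through an aligned bounce buffer instead of being DMAed in place. The
512-byte value and the helper names are hypothetical; this thread does not
state the alignment the controller actually needs, and this is not the code
of the linked patch.

```python
# Hypothetical alignment requirement, for illustration only.
ALIGNMENT = 512
ALIGN_MASK = ALIGNMENT - 1  # stored as a mask, as the block layer does


def needs_bounce(addr: int, length: int, mask: int = ALIGN_MASK) -> bool:
    """True if the buffer start or length violates the alignment mask,
    meaning the data would go through an aligned bounce buffer."""
    return ((addr | length) & mask) != 0


def partition(requests, mask: int = ALIGN_MASK):
    """Split (addr, length) pairs into directly mapped and bounced ones."""
    direct, bounced = [], []
    for addr, length in requests:
        (bounced if needs_bounce(addr, length, mask) else direct).append((addr, length))
    return direct, bounced


# Two page-aligned buffers and one that starts 16 bytes into a 512-byte unit.
reqs = [(0x7F0000001000, 4096), (0x7F0000002010, 512), (0x7F0000003000, 512)]
direct, bounced = partition(reqs)
print(f"{len(direct)} direct, {len(bounced)} bounced")
```

Only the misaligned buffer pays for a copy; already-aligned requests pass
through unchanged, which matches Ryan's expectation above.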

From: bugzilla-daemon @ 2010-05-04  5:16 UTC
  To: linux-scsi

--- Comment #16 from kdesai <kashyap.desai@lsi.com>  2010-05-04 05:16:28 ---
(In reply to comment #15)
> Take a look at my diagnosis here:
> http://lkml.org/lkml/2010/4/26/335
> 
> It includes a rough-draft patch.  I'd be very interested in hearing reports of
> whether this fixes this smartctl issue in others' environments as it has in
> mine.
I am doing my analysis and am also in touch with our firmware folks to
understand this issue. Using Ryan's diagnosis tool, I can see that the LSI
controller is not able to DMA for a particular alignment. I will update this
ASAP.
Thanks, Kashyap

From: bugzilla-daemon @ 2010-05-04  9:16 UTC
  To: linux-scsi

--- Comment #17 from Brian Sullivan <bexamous@gmail.com>  2010-05-04 09:15:45 ---
No way, the issue is fixed! After waiting for months, I lose hope and order an
mpt2sas controller. The next day, the issue is fixed. Argh! lol :)

Without the patch, running hddtemp in a loop on 15 drives would last maybe 5-10
seconds before the controller would crap out.

With the patch it's been going for at least 20 minutes now without issue. I put
a load on the controller too (~600MB/sec of reads in total) and it's still
stable. I'll let this run overnight, but really, this issue is fixed for me.
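
The reproduction described here (hddtemp in a tight loop across all drives)
can be scripted. The Python sketch below is a hypothetical harness, not
something posted in this thread: it runs `hddtemp <device>` serially through
subprocess and stops at the first failure. It assumes hddtemp exits non-zero
when a query fails, the device names are placeholders, and it does not
generate the concurrent ~600MB/sec read load mentioned above.

```python
import subprocess


def hddtemp_probe(dev: str, timeout: float = 30.0) -> bool:
    """Run `hddtemp <dev>` once.

    Assumption: a non-zero exit status, a timeout, or a missing hddtemp
    binary all count as a failed query."""
    try:
        result = subprocess.run(["hddtemp", dev], capture_output=True, timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0


def stress(devices, rounds, probe=hddtemp_probe):
    """Probe every device once per round, serially, stopping at the first
    failure. Returns (completed_rounds, failing_device or None)."""
    for r in range(rounds):
        for dev in devices:
            if not probe(dev):
                return r, dev
    return rounds, None


# Example with 15 placeholder device names:
#   stress([f"/dev/sd{c}" for c in "bcdefghijklmnop"], rounds=1000)
```

Passing the probe in as a parameter keeps the loop testable without real
hardware and makes it easy to swap in smartctl or another query.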



From: bugzilla-daemon @ 2010-05-05 10:35 UTC
  To: linux-scsi

Andrew Dunn <andrew.g.dunn.dod@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrew.g.dunn.dod@gmail.com




--- Comment #18 from Andrew Dunn <andrew.g.dunn.dod@gmail.com>  2010-05-05 10:35:17 ---
I anxiously await confirmation of this patch. This issue has been plaguing me
for quite a while. Just to verify: the mpt2sas controllers don't have this
problem? I was thinking of trying to get an AOC-USAS2-L8i
(http://www.supermicro.com/products/accessories/addon/AOC-USAS2-L8i.cfm?TYP=I)

From: bugzilla-daemon @ 2010-05-07  4:56 UTC
  To: linux-scsi

--- Comment #19 from Brian Sullivan <bexamous@gmail.com>  2010-05-07 04:55:40 ---
So... the patch seems to fix the ATA command pass-through problems. I let it
run for a day spamming hddtemp in a loop on all the drives, while at the same
time reading 600MB/sec or so. No problem. Again, without the patch it would
never last more than 10 seconds spamming all the drives at once.

IMO the ATA pass-through bug is fixed by this patch. I cannot cause a failure
using ATA pass-through.

All is not good news however....

With this bug fixed, I was going to start expanding an md array one disk at a
time. Unfortunately, sooner or later the controller seems to crap out. I don't
know what is at fault, but the mptsas driver's method of just blowing up and
blocking processes forever sucks.

I've tried this 4 times now, and each time I see some read errors, then the
resets fail, and eventually it gets to the point where it just keeps spamming
'sometask has been blocked for 120s'. I WISH this were a bad drive, but even a
bad drive shouldn't take down the system like this. Just to be sure, I've
swapped a few drives, and it doesn't really make a difference: each time a
different drive starts the failure sequence. I'm guessing it's unlikely that I
have a pile of bad drives.

I do have 16 drives, all attached via an HP SAS expander, so perhaps the
expander is at fault. I also have a backup Chenbro expander I could try, but
I'm too lazy to at the moment. I could also try ditching the expanders to see
if they are the cause of these problems, but again, too lazy at the moment. An
mpt2sas expander is being delivered Monday; I think my best bet is to ditch
this mptsas driver altogether. If that doesn't fix the problems, I'll go back
and try swapping expanders and whatnot.

Anyway, TL;DR: the ATA pass-through bug is fixed, mptsas still blows.

Here is the log from the current failures; I'm fairly sure this is unrelated
to the whole ATA pass-through problem:
May  6 17:52:09 nine kernel: [18838.207805] md: recovery of RAID array md127
May  6 17:52:09 nine kernel: [18838.207815] md: minimum _guaranteed_  speed:
1000 KB/sec/disk.
May  6 17:52:09 nine kernel: [18838.207818] md: using maximum available idle
IO bandwidth (but not more than 200000 KB/sec) for recovery.
May  6 17:52:09 nine kernel: [18838.207831] md: using 128k window, over a
total of 1953510784 blocks.
May  6 17:52:09 nine kernel: [18838.207833] md: resuming recovery of md127
from checkpoint.
May  6 20:51:21 nine kernel: [29589.980035] mptscsih: ioc0: attempting task
abort! (sc=ffff8803318f4900)
May  6 20:51:21 nine kernel: [29589.980041] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8e f6 00 00 01 00 00
May  6 20:51:28 nine kernel: [29596.503483] mptbase: ioc0:
LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
May  6 20:51:28 nine kernel: [29596.503747] mptscsih: ioc0: task abort:
SUCCESS (sc=ffff8803318f4900)
May  6 20:51:28 nine kernel: [29597.253319] mptbase: ioc0:
LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry},
SubCode(0x0000)
May  6 20:51:28 nine kernel: [29597.253329] mptscsih: ioc0: attempting task
abort! (sc=ffff8803318f4e00)
May  6 20:51:28 nine kernel: [29597.253332] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8e fc 00 00 01 00 00
May  6 20:51:28 nine kernel: [29597.253341] mptscsih: ioc0: task abort:
SUCCESS (sc=ffff8803318f4e00)
May  6 20:51:29 nine kernel: [29597.753599] mptbase: ioc0:
LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed},
SubCode(0x0000)
May  6 20:51:29 nine kernel: [29597.753608] mptscsih: ioc0: attempting task
abort! (sc=ffff8803318f4c00)
May  6 20:51:29 nine kernel: [29597.753610] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8f 02 00 00 01 00 00
May  6 20:51:29 nine kernel: [29597.753619] mptscsih: ioc0: task abort:
SUCCESS (sc=ffff8803318f4c00)
May  6 20:51:29 nine kernel: [29597.753622] mptscsih: ioc0: attempting task
abort! (sc=ffff8803318f5b00)
May  6 20:51:29 nine kernel: [29597.753624] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8f 0e 00 00 01 00 00
May  6 20:51:29 nine kernel: [29597.753633] mptscsih: ioc0: task abort:
SUCCESS (sc=ffff8803318f5b00)
May  6 20:51:29 nine kernel: [29597.753636] mptscsih: ioc0: attempting task
abort! (sc=ffff880331e3d900)
May  6 20:51:29 nine kernel: [29597.753638] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8f 14 00 00 00 08 00
May  6 20:51:29 nine kernel: [29597.753646] mptscsih: ioc0: task abort:
SUCCESS (sc=ffff880331e3d900)
May  6 20:51:29 nine kernel: [29597.753649] mptscsih: ioc0: attempting task
abort! (sc=ffff880331e3d400)
May  6 20:51:29 nine kernel: [29597.753651] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8f 14 08 00 00 68 00
May  6 20:51:29 nine kernel: [29597.753659] mptscsih: ioc0: task abort:
SUCCESS (sc=ffff880331e3d400)
May  6 20:51:29 nine kernel: [29597.753671] mptscsih: ioc0: attempting
target reset! (sc=ffff8803318f4900)
May  6 20:51:29 nine kernel: [29597.753673] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8e f6 00 00 01 00 00
May  6 20:51:29 nine kernel: [29597.753685] mptscsih: ioc0: target reset:
FAILED (sc=ffff8803318f4900)
May  6 20:51:29 nine kernel: [29597.753693] mptscsih: ioc0: attempting bus
reset! (sc=ffff8803318f4900)
May  6 20:51:29 nine kernel: [29597.753695] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8e f6 00 00 01 00 00
May  6 20:51:29 nine kernel: [29597.753712] mptscsih: ioc0: bus reset:
FAILED (sc=ffff8803318f4900)
May  6 20:51:29 nine kernel: [29597.753715] mptscsih: ioc0: attempting host
reset! (sc=ffff8803318f4900)
May  6 20:52:04 nine kernel: [29632.830020] mptscsih: ioc0: host reset:
SUCCESS (sc=ffff8803318f4900)
May  6 20:52:14 nine kernel: [29642.840021] sd 6:0:5:0: Device offlined -
not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840024] sd 6:0:5:0: Device offlined -
not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840026] sd 6:0:5:0: Device offlined -
not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840028] sd 6:0:5:0: Device offlined -
not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840030] sd 6:0:5:0: Device offlined -
not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840032] sd 6:0:5:0: Device offlined -
not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840076] sd 6:0:5:0: [sdh] Unhandled
error code
May  6 20:52:14 nine kernel: [29642.840082] sd 6:0:5:0: [sdh] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.840087] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8e f6 00 00 01 00 00
May  6 20:52:14 nine kernel: [29642.840112] raid5:md127: read error not
correctable (sector 1284435456 on sdh2).
May  6 20:52:14 nine kernel: [29642.840129] raid5:md127: read error not
correctable (sector 1284435464 on sdh2).
May  6 20:52:14 nine kernel: [29642.840133] raid5:md127: read error not
correctable (sector 1284435472 on sdh2).
May  6 20:52:14 nine kernel: [29642.840136] raid5:md127: read error not
correctable (sector 1284435480 on sdh2).
May  6 20:52:14 nine kernel: [29642.840139] raid5:md127: read error not
correctable (sector 1284435488 on sdh2).
May  6 20:52:14 nine kernel: [29642.840143] raid5:md127: read error not
correctable (sector 1284435496 on sdh2).
May  6 20:52:14 nine kernel: [29642.840149] raid5:md127: read error not
correctable (sector 1284435504 on sdh2).
May  6 20:52:14 nine kernel: [29642.840196] raid5:md127: read error not
correctable (sector 1284435512 on sdh2).
May  6 20:52:14 nine kernel: [29642.840199] raid5:md127: read error not
correctable (sector 1284435520 on sdh2).
May  6 20:52:14 nine kernel: [29642.840202] raid5:md127: read error not
correctable (sector 1284435528 on sdh2).
May  6 20:52:14 nine kernel: [29642.847676] sd 6:0:5:0: [sdh] Unhandled
error code
May  6 20:52:14 nine kernel: [29642.847678] sd 6:0:5:0: [sdh] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.847681] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8e fc 00 00 01 00 00
May  6 20:52:14 nine kernel: [29642.847745] sd 6:0:5:0: [sdh] Unhandled
error code
May  6 20:52:14 nine kernel: [29642.847746] sd 6:0:5:0: [sdh] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.847749] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8f 02 00 00 01 00 00
May  6 20:52:14 nine kernel: [29642.847812] sd 6:0:5:0: [sdh] Unhandled
error code
May  6 20:52:14 nine kernel: [29642.847813] sd 6:0:5:0: [sdh] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.847816] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8f 0e 00 00 01 00 00
May  6 20:52:14 nine kernel: [29642.847871] sd 6:0:5:0: [sdh] Unhandled
error code
May  6 20:52:14 nine kernel: [29642.847873] sd 6:0:5:0: [sdh] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.847875] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8f 14 00 00 00 08 00
May  6 20:52:14 nine kernel: [29642.847907] sd 6:0:5:0: [sdh] Unhandled
error code
May  6 20:52:14 nine kernel: [29642.847908] sd 6:0:5:0: [sdh] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.847911] sd 6:0:5:0: [sdh] CDB: Read(10):
28 00 4c 8f 14 08 00 00 68 00
May  6 20:52:19 nine kernel: [29647.840019] mptbase: ioc0: WARNING - Issuing
Reset from mpt_config!!
May  6 20:52:50 nine kernel: [29678.961260] ------------[ cut here
]------------
May  6 20:52:50 nine kernel: [29678.961268] WARNING: at
/home/kernel-ppa/mainline/build/kernel/workqueue.c:485
flush_cpu_workqueue+0x8c/0x90()
May  6 20:52:50 nine kernel: [29678.961271] Hardware name: empty
May  6 20:52:50 nine kernel: [29678.961273] Modules linked in: btrfs
zlib_deflate crc32c libcrc32c xfs exportfs mptctl binfmt_misc ppdev
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge
stp kvm_intel kvm snd_hda_codec_realtek snd_hda_intel snd_hda_codec
snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss
snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device
psmouse serio_raw ioatdma snd i5100_edac nvidia(P) dca soundcore
snd_page_alloc edac_core lp parport raid10 raid456 async_raid6_recov
async_pq raid6_pq async_xor ses enclosure xor async_memcpy async_tx raid1
raid0 multipath linear ahci e1000e mptsas mptscsih mptbase
scsi_transport_sas
May  6 20:52:50 nine kernel: [29678.961333] Pid: 321, comm: mpt/0 Tainted:
P           2.6.34-020634rc6-generic #020634rc6
May  6 20:52:50 nine kernel: [29678.961336] Call Trace:
May  6 20:52:50 nine kernel: [29678.961341]  [<ffffffff8107a9ac>] ?
flush_cpu_workqueue+0x8c/0x90
May  6 20:52:50 nine kernel: [29678.961346]  [<ffffffff8105f1ec>]
warn_slowpath_common+0x8c/0xc0
May  6 20:52:50 nine kernel: [29678.961350]  [<ffffffff8105f234>]
warn_slowpath_null+0x14/0x20
May  6 20:52:50 nine kernel: [29678.961353]  [<ffffffff8107a9ac>]
flush_cpu_workqueue+0x8c/0x90
May  6 20:52:50 nine kernel: [29678.961357]  [<ffffffff8106f981>] ?
try_to_del_timer_sync+0x51/0xe0
May  6 20:52:50 nine kernel: [29678.961360]  [<ffffffff8107aa74>]
flush_workqueue+0x44/0x70
May  6 20:52:50 nine kernel: [29678.961373]  [<ffffffffa004531c>]
mptsas_cleanup_fw_event_q+0x12c/0x160 [mptsas]
May  6 20:52:50 nine kernel: [29678.961378]  [<ffffffffa0048434>]
mptsas_ioc_reset+0x94/0x130 [mptsas]
May  6 20:52:50 nine kernel: [29678.961383]  [<ffffffff81033d39>] ?
default_spin_lock_flags+0x9/0x10
May  6 20:52:50 nine kernel: [29678.961389]  [<ffffffffa001222d>]
mpt_signal_reset+0x4d/0x60 [mptbase]
May  6 20:52:50 nine kernel: [29678.961394]  [<ffffffffa0018eb6>]
mpt_SoftResetHandler+0x1b6/0x3c0 [mptbase]
May  6 20:52:50 nine kernel: [29678.961399]  [<ffffffffa001bee7>]
mpt_config+0x307/0x640 [mptbase]
May  6 20:52:50 nine kernel: [29678.961404]  [<ffffffffa004c6f0>] ?
mptsas_firmware_event_work+0x0/0xe80 [mptsas]
May  6 20:52:50 nine kernel: [29678.961409]  [<ffffffffa001d0b1>]
mpt_findImVolumes+0xb1/0x600 [mptbase]
May  6 20:52:50 nine kernel: [29678.961415]  [<ffffffffa004c6f0>] ? mptsas_firmware_event_work+0x0/0xe80 [mptsas]
May  6 20:52:50 nine kernel: [29678.961419]  [<ffffffffa004cd88>] mptsas_firmware_event_work+0x698/0xe80 [mptsas]
May  6 20:52:50 nine kernel: [29678.961424]  [<ffffffff8100985b>] ? __switch_to+0xbb/0x2e0
May  6 20:52:50 nine kernel: [29678.961428]  [<ffffffff8105118e>] ? put_prev_entity+0x2e/0x80
May  6 20:52:50 nine kernel: [29678.961430]  [<ffffffff81051af6>] ? finish_task_switch+0x66/0xd0
May  6 20:52:50 nine kernel: [29678.961435]  [<ffffffffa004c6f0>] ? mptsas_firmware_event_work+0x0/0xe80 [mptsas]
May  6 20:52:50 nine kernel: [29678.961438]  [<ffffffff8107a10c>] run_workqueue+0xbc/0x190
May  6 20:52:50 nine kernel: [29678.961441]  [<ffffffff8107a65b>] worker_thread+0x9b/0x100
May  6 20:52:50 nine kernel: [29678.961444]  [<ffffffff8107edc0>] ? autoremove_wake_function+0x0/0x40
May  6 20:52:50 nine kernel: [29678.961447]  [<ffffffff8107a5c0>] ? worker_thread+0x0/0x100
May  6 20:52:50 nine kernel: [29678.961450]  [<ffffffff8107e9e6>] kthread+0x96/0xa0
May  6 20:52:50 nine kernel: [29678.961453]  [<ffffffff8100be64>] kernel_thread_helper+0x4/0x10
May  6 20:52:50 nine kernel: [29678.961456]  [<ffffffff8107e950>] ? kthread+0x0/0xa0
May  6 20:52:50 nine kernel: [29678.961458]  [<ffffffff8100be60>] ? kernel_thread_helper+0x0/0x10
May  6 20:52:50 nine kernel: [29678.961460] ---[ end trace 5b0b1793526edc2a ]---
May  6 20:53:20 nine kernel: [29709.040090] mptscsih: ioc0: attempting task abort! (sc=ffff880331812400)
May  6 20:53:20 nine kernel: [29709.040093] sd 6:0:15:0: [sdr] CDB: Write(10): 2a 00 00 00 00 47 00 00 02 00
May  6 20:53:50 nine kernel: [29739.040011] mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!
May  6 20:54:13 nine kernel: [29761.700122] md127_resync  D ffff880001f55740     0  6733      2 0x00000000
May  6 20:54:13 nine kernel: [29761.700130]  ffff8803318f3b90 0000000000000046 ffff8803318f3b50 ffff8803318f3fd8
May  6 20:54:13 nine kernel: [29761.700134]  ffff8803318eae20 0000000000015740 0000000000015740 ffff8803318f3fd8
May  6 20:54:13 nine kernel: [29761.700137]  0000000000015740 ffff8803318f3fd8 0000000000015740 ffff8803318eae20
May  6 20:54:13 nine kernel: [29761.700141] Call Trace:
May  6 20:54:13 nine kernel: [29761.700160]  [<ffffffffa00f20e2>] get_active_stripe+0x232/0x340 [raid456]
May  6 20:54:13 nine kernel: [29761.700167]  [<ffffffff810507e0>] ? default_wake_function+0x0/0x20
May  6 20:54:13 nine kernel: [29761.700172]  [<ffffffffa00f49ad>] sync_request+0x26d/0x2d0 [raid456]
May  6 20:54:13 nine kernel: [29761.700176]  [<ffffffffa00f1e8e>] ? raid5_unplug_device+0x7e/0xa0 [raid456]
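The sd driver prints the CDB of each command it gives up on, in hex. As a small aid for reading these logs, here is a minimal Python sketch (the names `decode_rw10` and `Rw10` are hypothetical) that decodes READ(10)/WRITE(10) CDBs such as the Write(10) to sdr above, using the standard layout: opcode in byte 0, a big-endian 32-bit LBA in bytes 2-5 and a 16-bit transfer length in bytes 7-8.

```python
from typing import NamedTuple

# Opcodes of the two 10-byte CDBs that appear in these logs.
RW10_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


class Rw10(NamedTuple):
    op: str
    lba: int     # logical block address, bytes 2-5 (big-endian)
    blocks: int  # transfer length in logical blocks, bytes 7-8 (big-endian)


def decode_rw10(cdb_hex: str) -> Rw10:
    """Decode a READ(10)/WRITE(10) CDB printed as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is accepted
    if len(cdb) != 10 or cdb[0] not in RW10_OPCODES:
        raise ValueError(f"not a READ(10)/WRITE(10) CDB: {cdb_hex!r}")
    return Rw10(
        op=RW10_OPCODES[cdb[0]],
        lba=int.from_bytes(cdb[2:6], "big"),
        blocks=int.from_bytes(cdb[7:9], "big"),
    )


if __name__ == "__main__":
    print(decode_rw10("2a 00 00 00 00 47 00 00 02 00"))  # the Write(10) to sdr
    print(decode_rw10("28 00 4c 8e f6 00 00 01 00 00"))  # a Read(10) to sdh
```

For the Write(10) to sdr this gives LBA 71 and 2 blocks; the `28 00 4c 8e f6 00 00 01 00 00` reads to sdh elsewhere in this thread decode to LBA 1284437504 and 256 blocks, which is 128 KiB per read assuming 512-byte logical blocks.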


On Wed, May 5, 2010 at 3:35 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=14831
>
>
> Andrew Dunn <andrew.g.dunn.dod@gmail.com> changed:
>
>           What    |Removed                     |Added
>
> ----------------------------------------------------------------------------
>                 CC|                            |
> andrew.g.dunn.dod@gmail.com
>
>
>
>
> --- Comment #18 from Andrew Dunn <andrew.g.dunn.dod@gmail.com>  2010-05-05
> 10:35:17 ---
> I anxiously await confirmation of this patch. This issue has been plaguing
> me
> for quite a while. Just for verification the mpt2sas controllers don't have
> problems with this? I was thinking of trying to get an AOC-USAS2-L8i
> (
> http://www.supermicro.com/products/accessories/addon/AOC-USAS2-L8i.cfm?TYP=I
> )
>
> --
> Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.
>

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
       [not found] <bug-14831-11613@https.bugzilla.kernel.org/>
                   ` (8 preceding siblings ...)
  2010-05-07  4:56 ` bugzilla-daemon
@ 2010-05-07  8:01 ` bugzilla-daemon
  2010-05-12  8:50 ` bugzilla-daemon
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 44+ messages in thread
From: bugzilla-daemon @ 2010-05-07  8:01 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #20 from kdesai <kashyap.desai@lsi.com>  2010-05-07 08:01:25 ---
(In reply to comment #19)
> So... patch seems to fix ATA command pass-through problems.  I let it go a
> day spamming hddtemp in a loop on all the drives, while at same time reading
> 600MB/sec or so.  No problem.  Again, without patch, it would never manage
> more than 10 seconds spamming all the drives at once.
> 
> IMO it seems like the ATA-Passthrough bug is fixed by this patch.  I cannot
> cause a failure using ATA-Passthrough.
> 
> All is not good news however....
> 
> With this bug fixed I was going to start expanding a md array one disk at a
> time.  Unfortunately sooner or later the controller seems to crap out.  I
> don't know what is at fault, but the mptsas drive's method of just blowing
> up and blocking processes forever sucks.
> 
> I've tried this 4 times now and each time I see some read errors, then task
> resets fail and eventually it gets to point it just keeps spamming 'sometask
> has been blocked for 120s'.  I WISH this was a bad drive, but even if it was
> a bad drive it shouldn't take down the system like this, but just to be sure
> I've been swapping a few drives and it doesn't really make a difference.
> Each time a different drive starts the fail sequence.  I'm guessing its
> unlikely I have a pile of bad drives.
> 
> I do have 16 drives all attached via a HP SAS Expander, perhaps the expander
> is at fault.  I also have a backup Chenbro Expander I could try...  but I'm
> too lazy to at the moment.  I could also try ditching the Expanders to see
> if that is the cause of these problems, but again too lazy at the moment.
> Monday a mpt2sas expander is being delivered, I think my best bet is to
> ditch this mptsas driver all together.  If that doesn't fix problems I'll
> then go back and try swapping Expanders and whatnot.
> 
> Anyways, TL;DR:  ATA-PassThrough bug is fixed, mptsas still blows.

The patch that sets the DMA boundary merely avoids the condition that causes
this issue; LSI Gen-1 controllers do not have a 512-byte DMA boundary
limitation. I have started an internal discussion with our firmware engineer
and will update you as soon as we find anything important.
> 
> Here log from current failures, fairly sure this is unrelated to the entire
> ATA-Passthrough problem:
> May  6 17:52:09 nine kernel: [18838.207805] md: recovery of RAID array md127
> May  6 17:52:09 nine kernel: [18838.207815] md: minimum _guaranteed_  speed:
> 1000 KB/sec/disk.
> May  6 17:52:09 nine kernel: [18838.207818] md: using maximum available idle
> IO bandwidth (but not more than 200000 KB/sec) for recovery.
> May  6 17:52:09 nine kernel: [18838.207831] md: using 128k window, over a
> total of 1953510784 blocks.
> May  6 17:52:09 nine kernel: [18838.207833] md: resuming recovery of md127
> from checkpoint.
> May  6 20:51:21 nine kernel: [29589.980035] mptscsih: ioc0: attempting task
> abort! (sc=ffff8803318f4900)
> May  6 20:51:21 nine kernel: [29589.980041] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8e f6 00 00 01 00 00
> May  6 20:51:28 nine kernel: [29596.503483] mptbase: ioc0:
> LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
> May  6 20:51:28 nine kernel: [29596.503747] mptscsih: ioc0: task abort:
> SUCCESS (sc=ffff8803318f4900)
> May  6 20:51:28 nine kernel: [29597.253319] mptbase: ioc0:
> LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry},
> SubCode(0x0000)
> May  6 20:51:28 nine kernel: [29597.253329] mptscsih: ioc0: attempting task
> abort! (sc=ffff8803318f4e00)
> May  6 20:51:28 nine kernel: [29597.253332] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8e fc 00 00 01 00 00
> May  6 20:51:28 nine kernel: [29597.253341] mptscsih: ioc0: task abort:
> SUCCESS (sc=ffff8803318f4e00)
> May  6 20:51:29 nine kernel: [29597.753599] mptbase: ioc0:
> LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed},
> SubCode(0x0000)
> May  6 20:51:29 nine kernel: [29597.753608] mptscsih: ioc0: attempting task
> abort! (sc=ffff8803318f4c00)
> May  6 20:51:29 nine kernel: [29597.753610] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8f 02 00 00 01 00 00
> May  6 20:51:29 nine kernel: [29597.753619] mptscsih: ioc0: task abort:
> SUCCESS (sc=ffff8803318f4c00)
> May  6 20:51:29 nine kernel: [29597.753622] mptscsih: ioc0: attempting task
> abort! (sc=ffff8803318f5b00)
> May  6 20:51:29 nine kernel: [29597.753624] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8f 0e 00 00 01 00 00
> May  6 20:51:29 nine kernel: [29597.753633] mptscsih: ioc0: task abort:
> SUCCESS (sc=ffff8803318f5b00)
> May  6 20:51:29 nine kernel: [29597.753636] mptscsih: ioc0: attempting task
> abort! (sc=ffff880331e3d900)
> May  6 20:51:29 nine kernel: [29597.753638] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8f 14 00 00 00 08 00
> May  6 20:51:29 nine kernel: [29597.753646] mptscsih: ioc0: task abort:
> SUCCESS (sc=ffff880331e3d900)
> May  6 20:51:29 nine kernel: [29597.753649] mptscsih: ioc0: attempting task
> abort! (sc=ffff880331e3d400)
> May  6 20:51:29 nine kernel: [29597.753651] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8f 14 08 00 00 68 00
> May  6 20:51:29 nine kernel: [29597.753659] mptscsih: ioc0: task abort:
> SUCCESS (sc=ffff880331e3d400)
> May  6 20:51:29 nine kernel: [29597.753671] mptscsih: ioc0: attempting
> target reset! (sc=ffff8803318f4900)
> May  6 20:51:29 nine kernel: [29597.753673] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8e f6 00 00 01 00 00
> May  6 20:51:29 nine kernel: [29597.753685] mptscsih: ioc0: target reset:
> FAILED (sc=ffff8803318f4900)
> May  6 20:51:29 nine kernel: [29597.753693] mptscsih: ioc0: attempting bus
> reset! (sc=ffff8803318f4900)
> May  6 20:51:29 nine kernel: [29597.753695] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8e f6 00 00 01 00 00
> May  6 20:51:29 nine kernel: [29597.753712] mptscsih: ioc0: bus reset:
> FAILED (sc=ffff8803318f4900)
> May  6 20:51:29 nine kernel: [29597.753715] mptscsih: ioc0: attempting host
> reset! (sc=ffff8803318f4900)
> May  6 20:52:04 nine kernel: [29632.830020] mptscsih: ioc0: host reset:
> SUCCESS (sc=ffff8803318f4900)
> May  6 20:52:14 nine kernel: [29642.840021] sd 6:0:5:0: Device offlined -
> not ready after error recovery
> May  6 20:52:14 nine kernel: [29642.840024] sd 6:0:5:0: Device offlined -
> not ready after error recovery
> May  6 20:52:14 nine kernel: [29642.840026] sd 6:0:5:0: Device offlined -
> not ready after error recovery
> May  6 20:52:14 nine kernel: [29642.840028] sd 6:0:5:0: Device offlined -
> not ready after error recovery
> May  6 20:52:14 nine kernel: [29642.840030] sd 6:0:5:0: Device offlined -
> not ready after error recovery
> May  6 20:52:14 nine kernel: [29642.840032] sd 6:0:5:0: Device offlined -
> not ready after error recovery
> May  6 20:52:14 nine kernel: [29642.840076] sd 6:0:5:0: [sdh] Unhandled
> error code
> May  6 20:52:14 nine kernel: [29642.840082] sd 6:0:5:0: [sdh] Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> May  6 20:52:14 nine kernel: [29642.840087] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8e f6 00 00 01 00 00
> May  6 20:52:14 nine kernel: [29642.840112] raid5:md127: read error not
> correctable (sector 1284435456 on sdh2).
> May  6 20:52:14 nine kernel: [29642.840129] raid5:md127: read error not
> correctable (sector 1284435464 on sdh2).
> May  6 20:52:14 nine kernel: [29642.840133] raid5:md127: read error not
> correctable (sector 1284435472 on sdh2).
> May  6 20:52:14 nine kernel: [29642.840136] raid5:md127: read error not
> correctable (sector 1284435480 on sdh2).
> May  6 20:52:14 nine kernel: [29642.840139] raid5:md127: read error not
> correctable (sector 1284435488 on sdh2).
> May  6 20:52:14 nine kernel: [29642.840143] raid5:md127: read error not
> correctable (sector 1284435496 on sdh2).
> May  6 20:52:14 nine kernel: [29642.840149] raid5:md127: read error not
> correctable (sector 1284435504 on sdh2).
> May  6 20:52:14 nine kernel: [29642.840196] raid5:md127: read error not
> correctable (sector 1284435512 on sdh2).
> May  6 20:52:14 nine kernel: [29642.840199] raid5:md127: read error not
> correctable (sector 1284435520 on sdh2).
> May  6 20:52:14 nine kernel: [29642.840202] raid5:md127: read error not
> correctable (sector 1284435528 on sdh2).
> May  6 20:52:14 nine kernel: [29642.847676] sd 6:0:5:0: [sdh] Unhandled
> error code
> May  6 20:52:14 nine kernel: [29642.847678] sd 6:0:5:0: [sdh] Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> May  6 20:52:14 nine kernel: [29642.847681] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8e fc 00 00 01 00 00
> May  6 20:52:14 nine kernel: [29642.847745] sd 6:0:5:0: [sdh] Unhandled
> error code
> May  6 20:52:14 nine kernel: [29642.847746] sd 6:0:5:0: [sdh] Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> May  6 20:52:14 nine kernel: [29642.847749] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8f 02 00 00 01 00 00
> May  6 20:52:14 nine kernel: [29642.847812] sd 6:0:5:0: [sdh] Unhandled
> error code
> May  6 20:52:14 nine kernel: [29642.847813] sd 6:0:5:0: [sdh] Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> May  6 20:52:14 nine kernel: [29642.847816] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8f 0e 00 00 01 00 00
> May  6 20:52:14 nine kernel: [29642.847871] sd 6:0:5:0: [sdh] Unhandled
> error code
> May  6 20:52:14 nine kernel: [29642.847873] sd 6:0:5:0: [sdh] Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> May  6 20:52:14 nine kernel: [29642.847875] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8f 14 00 00 00 08 00
> May  6 20:52:14 nine kernel: [29642.847907] sd 6:0:5:0: [sdh] Unhandled
> error code
> May  6 20:52:14 nine kernel: [29642.847908] sd 6:0:5:0: [sdh] Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> May  6 20:52:14 nine kernel: [29642.847911] sd 6:0:5:0: [sdh] CDB: Read(10):
> 28 00 4c 8f 14 08 00 00 68 00
> May  6 20:52:19 nine kernel: [29647.840019] mptbase: ioc0: WARNING - Issuing
> Reset from mpt_config!!
> May  6 20:52:50 nine kernel: [29678.961260] ------------[ cut here
> ]------------
> May  6 20:52:50 nine kernel: [29678.961268] WARNING: at
> /home/kernel-ppa/mainline/build/kernel/workqueue.c:485
> flush_cpu_workqueue+0x8c/0x90()
> May  6 20:52:50 nine kernel: [29678.961271] Hardware name: empty
> May  6 20:52:50 nine kernel: [29678.961273] Modules linked in: btrfs
> zlib_deflate crc32c libcrc32c xfs exportfs mptctl binfmt_misc ppdev
> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
> nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge
> stp kvm_intel kvm snd_hda_codec_realtek snd_hda_intel snd_hda_codec
> snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss
> snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device
> psmouse serio_raw ioatdma snd i5100_edac nvidia(P) dca soundcore
> snd_page_alloc edac_core lp parport raid10 raid456 async_raid6_recov
> async_pq raid6_pq async_xor ses enclosure xor async_memcpy async_tx raid1
> raid0 multipath linear ahci e1000e mptsas mptscsih mptbase
> scsi_transport_sas
> May  6 20:52:50 nine kernel: [29678.961333] Pid: 321, comm: mpt/0 Tainted:
> P           2.6.34-020634rc6-generic #020634rc6
> May  6 20:52:50 nine kernel: [29678.961336] Call Trace:
> May  6 20:52:50 nine kernel: [29678.961341]  [<ffffffff8107a9ac>] ?
> flush_cpu_workqueue+0x8c/0x90
> May  6 20:52:50 nine kernel: [29678.961346]  [<ffffffff8105f1ec>]
> warn_slowpath_common+0x8c/0xc0
> May  6 20:52:50 nine kernel: [29678.961350]  [<ffffffff8105f234>]
> warn_slowpath_null+0x14/0x20
> May  6 20:52:50 nine kernel: [29678.961353]  [<ffffffff8107a9ac>]
> flush_cpu_workqueue+0x8c/0x90
> May  6 20:52:50 nine kernel: [29678.961357]  [<ffffffff8106f981>] ?
> try_to_del_timer_sync+0x51/0xe0
> May  6 20:52:50 nine kernel: [29678.961360]  [<ffffffff8107aa74>]
> flush_workqueue+0x44/0x70
> May  6 20:52:50 nine kernel: [29678.961373]  [<ffffffffa004531c>]
> mptsas_cleanup_fw_event_q+0x12c/0x160 [mptsas]
> May  6 20:52:50 nine kernel: [29678.961378]  [<ffffffffa0048434>]
> mptsas_ioc_reset+0x94/0x130 [mptsas]
> May  6 20:52:50 nine kernel: [29678.961383]  [<ffffffff81033d39>] ?
> default_spin_lock_flags+0x9/0x10
> May  6 20:52:50 nine kernel: [29678.961389]  [<ffffffffa001222d>]
> mpt_signal_reset+0x4d/0x60 [mptbase]
> May  6 20:52:50 nine kernel: [29678.961394]  [<ffffffffa0018eb6>]
> mpt_SoftResetHandler+0x1b6/0x3c0 [mptbase]
> May  6 20:52:50 nine kernel: [29678.961399]  [<ffffffffa001bee7>]
> mpt_config+0x307/0x640 [mptbase]
> May  6 20:52:50 nine kernel: [29678.961404]  [<ffffffffa004c6f0>] ?
> mptsas_firmware_event_work+0x0/0xe80 [mptsas]
> May  6 20:52:50 nine kernel: [29678.961409]  [<ffffffffa001d0b1>]
> mpt_findImVolumes+0xb1/0x600 [mptbase]
> May  6 20:52:50 nine kernel: [29678.961415]  [<ffffffffa004c6f0>] ?
> mptsas_firmware_event_work+0x0/0xe80 [mptsas]
> May  6 20:52:50 nine kernel: [29678.961419]  [<ffffffffa004cd88>]
> mptsas_firmware_event_work+0x698/0xe80 [mptsas]
> May  6 20:52:50 nine kernel: [29678.961424]  [<ffffffff8100985b>] ?
> __switch_to+0xbb/0x2e0
> May  6 20:52:50 nine kernel: [29678.961428]  [<ffffffff8105118e>] ?
> put_prev_entity+0x2e/0x80
> May  6 20:52:50 nine kernel: [29678.961430]  [<ffffffff81051af6>] ?
> finish_task_switch+0x66/0xd0
> May  6 20:52:50 nine kernel: [29678.961435]  [<ffffffffa004c6f0>] ?
> mptsas_firmware_event_work+0x0/0xe80 [mptsas]
> May  6 20:52:50 nine kernel: [29678.961438]  [<ffffffff8107a10c>]
> run_workqueue+0xbc/0x190
> May  6 20:52:50 nine kernel: [29678.961441]  [<ffffffff8107a65b>]
> worker_thread+0x9b/0x100
> May  6 20:52:50 nine kernel: [29678.961444]  [<ffffffff8107edc0>] ?
> autoremove_wake_function+0x0/0x40
> May  6 20:52:50 nine kernel: [29678.961447]  [<ffffffff8107a5c0>] ?
> worker_thread+0x0/0x100
> May  6 20:52:50 nine kernel: [29678.961450]  [<ffffffff8107e9e6>]
> kthread+0x96/0xa0
> May  6 20:52:50 nine kernel: [29678.961453]  [<ffffffff8100be64>]
> kernel_thread_helper+0x4/0x10
> May  6 20:52:50 nine kernel: [29678.961456]  [<ffffffff8107e950>] ?
> kthread+0x0/0xa0
> May  6 20:52:50 nine kernel: [29678.961458]  [<ffffffff8100be60>] ?
> kernel_thread_helper+0x0/0x10
> May  6 20:52:50 nine kernel: [29678.961460] ---[ end trace 5b0b1793526edc2a
> ]---
> May  6 20:53:20 nine kernel: [29709.040090] mptscsih: ioc0: attempting task
> abort! (sc=ffff880331812400)
> May  6 20:53:20 nine kernel: [29709.040093] sd 6:0:15:0: [sdr] CDB:
> Write(10): 2a 00 00 00 00 47 00 00 02 00
> May  6 20:53:50 nine kernel: [29739.040011] mptscsih: ioc0: WARNING -
> Issuing Reset from mptscsih_IssueTaskMgmt!!
> May  6 20:54:13 nine kernel: [29761.700122] md127_resync  D
> ffff880001f55740     0  6733      2 0x00000000
> May  6 20:54:13 nine kernel: [29761.700130]  ffff8803318f3b90
> 0000000000000046 ffff8803318f3b50 ffff8803318f3fd8
> May  6 20:54:13 nine kernel: [29761.700134]  ffff8803318eae20
> 0000000000015740 0000000000015740 ffff8803318f3fd8
> May  6 20:54:13 nine kernel: [29761.700137]  0000000000015740
> ffff8803318f3fd8 0000000000015740 ffff8803318eae20
> May  6 20:54:13 nine kernel: [29761.700141] Call Trace:
> May  6 20:54:13 nine kernel: [29761.700160]  [<ffffffffa00f20e2>]
> get_active_stripe+0x232/0x340 [raid456]
> May  6 20:54:13 nine kernel: [29761.700167]  [<ffffffff810507e0>] ?
> default_wake_function+0x0/0x20
> May  6 20:54:13 nine kernel: [29761.700172]  [<ffffffffa00f49ad>]
> sync_request+0x26d/0x2d0 [raid456]
> May  6 20:54:13 nine kernel: [29761.700176]  [<ffffffffa00f1e8e>] ?
> raid5_unplug_device+0x7e/0xa0 [raid456]
> 
> 

For now, you can keep running with the DMA boundary alignment patch.
For this new issue, please send me the complete /var/log/messages with
debugging turned on:

echo 0x8188 > /sys/module/mptbase/parameters/mpt_debug_level
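Module parameters are exposed under /sys/module/<module>/parameters/<name>, so the same setting can also be applied from a script. The Python sketch below uses a hypothetical helper, `set_module_param`, that is not part of any mpt tool: it writes a value into a parameter file and reads back what the kernel reports. It assumes root privileges and that mptbase is loaded and exposes `mpt_debug_level` there.

```python
from pathlib import Path


def set_module_param(module: str, param: str, value: str,
                     sysfs_root: str = "/sys/module") -> str:
    """Write value to <sysfs_root>/<module>/parameters/<param> and return
    the parameter's contents as read back afterwards."""
    path = Path(sysfs_root) / module / "parameters" / param
    path.write_text(value + "\n")  # same bytes that `echo value > path` writes
    return path.read_text().strip()


if __name__ == "__main__":
    # Same effect as: echo 0x8188 > /sys/module/mptbase/parameters/mpt_debug_level
    # The kernel may report the value back in decimal (0x8188 == 33160).
    print(set_module_param("mptbase", "mpt_debug_level", "0x8188"))
```

The `sysfs_root` argument exists only so the helper can be exercised against an ordinary directory; on a real system the default is the one to use.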

Thanks,
Kashyap

* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
       [not found] <bug-14831-11613@https.bugzilla.kernel.org/>
                   ` (9 preceding siblings ...)
  2010-05-07  8:01 ` bugzilla-daemon
@ 2010-05-12  8:50 ` bugzilla-daemon
  2010-05-12  9:04 ` bugzilla-daemon
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 44+ messages in thread
From: bugzilla-daemon @ 2010-05-12  8:50 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #21 from Brian Sullivan <bexamous@gmail.com>  2010-05-12 08:50:32 ---
So apparently this bug affects mpt2sas too???

Parts of dmesg:
[    4.460541] mpt2sas0: LSISAS2008: FWVersion(02.00.50.00), ChipRevision(0x02), BiosVersion(07.01.00.00)
[    4.460543] mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[    4.460615] mpt2sas0: sending port enable !!
[   33.760036] mpt2sas0: sending diag reset !!
[   34.661882] eth0: no IPv6 routers present
[   34.710015] mpt2sas0: diag reset: SUCCESS
[   34.714397] mpt2sas0: attempting task abort! scmd(ffff88036f74bf00)
[   34.714404] sd 0:0:3:0: [sdg] CDB: Inquiry: 12 01 80 00 fe 00
[   34.714441] mpt2sas0: task abort: SUCCESS scmd(ffff88036f74bf00)
[   35.290527] mpt2sas0: LSISAS2008: FWVersion(02.00.50.00), ChipRevision(0x02), BiosVersion(07.01.00.00)
[   35.290532] mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[   35.290618] mpt2sas0: sending port enable !!
[   44.711262] mpt2sas0: attempting task abort! scmd(ffff88036f74bf00)
[   44.711264] sd 0:0:3:0: [sdg] CDB: Test Unit Ready: 00 00 00 00 00 00
[   44.711272] mpt2sas0: task abort: SUCCESS scmd(ffff88036f74bf00)
[   44.711274] mpt2sas0: attempting task abort! scmd(ffff88036e456a00)
[   44.711276] sd 0:0:9:0: [sdm] CDB: ATA command pass through(12)/Blank: a1 08 2e 00 01 00 00 00 00 ec 00 00
[   44.711285] mpt2sas0: task abort: SUCCESS scmd(ffff88036e456a00)
[   46.090185] mpt2sas0: port enable: SUCCESS
[   46.090299] mpt2sas0: _scsih_search_responding_sas_devices
[   46.091172] scsi target0:0:0: handle(0x000a), sas_addr(0x50014380048874cc), enclosure logical id(0x50014380048874e5), slot(47)
[   46.091259] scsi target0:0:1: handle(0x000b), sas_addr(0x50014380048874cd), enclosure logical id(0x50014380048874e5), slot(46)
[   46.091350] scsi target0:0:2: handle(0x000c), sas_addr(0x50014380048874ce), enclosure logical id(0x50014380048874e5), slot(45)
[   46.091437] scsi target0:0:3: handle(0x000d), sas_addr(0x50014380048874cf), enclosure logical id(0x50014380048874e5), slot(44)
[   46.091521] scsi target0:0:4: handle(0x000e), sas_addr(0x50014380048874d0), enclosure logical id(0x50014380048874e5), slot(51)
[   46.091612] scsi target0:0:5: handle(0x000f), sas_addr(0x50014380048874d1), enclosure logical id(0x50014380048874e5), slot(50)
[   46.091702] scsi target0:0:6: handle(0x0010), sas_addr(0x50014380048874d2), enclosure logical id(0x50014380048874e5), slot(49)
[   46.091789] scsi target0:0:7: handle(0x0011), sas_addr(0x50014380048874d3), enclosure logical id(0x50014380048874e5), slot(48)
[   46.091872] scsi target0:0:8: handle(0x0012), sas_addr(0x50014380048874d4), enclosure logical id(0x50014380048874e5), slot(55)
[   46.091964] scsi target0:0:9: handle(0x0013), sas_addr(0x50014380048874d5), enclosure logical id(0x50014380048874e5), slot(54)
[   46.092048] scsi target0:0:10: handle(0x0014), sas_addr(0x50014380048874d6), enclosure logical id(0x50014380048874e5), slot(53)
[   46.092134] scsi target0:0:11: handle(0x0015), sas_addr(0x50014380048874d7), enclosure logical id(0x50014380048874e5), slot(52)
[   46.092218] scsi target0:0:12: handle(0x0016), sas_addr(0x50014380048874e0), enclosure logical id(0x0000000000000000), slot(0)
[   46.092306] scsi target0:0:13: handle(0x0017), sas_addr(0x50014380048874e1), enclosure logical id(0x0000000000000000), slot(0)
[   46.092401] scsi target0:0:14: handle(0x0018), sas_addr(0x50014380048874e2), enclosure logical id(0x0000000000000000), slot(0)
[   46.092488] scsi target0:0:15: handle(0x0019), sas_addr(0x50014380048874e3), enclosure logical id(0x0000000000000000), slot(0)
[   46.092572] scsi target0:0:16: handle(0x001a), sas_addr(0x50014380048874e5), enclosure logical id(0x50014380048874e5), slot(0)
[   46.092658] mpt2sas0: _scsih_search_responding_raid_devices
[   46.092660] mpt2sas0: _scsih_search_responding_expanders
[   46.092753]  expander present: handle(0x0009), sas_addr(0x50014380048874e6)
[   54.711261] mpt2sas0: attempting task abort! scmd(ffff88036e456a00)
[   54.711265] sd 0:0:9:0: [sdm] CDB: Test Unit Ready: 00 00 00 00 00 00
[   54.711275] mpt2sas0: task abort: SUCCESS scmd(ffff88036e456a00)
[   54.711277] mpt2sas0: attempting task abort! scmd(ffff88036f02fc00)
[   54.711279] sd 0:0:14:0: [sdr] CDB: ATA command pass through(12)/Blank: a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.711290] mpt2sas0: task abort: SUCCESS scmd(ffff88036f02fc00)
[   54.711383] mpt2sas0: attempting task abort! scmd(ffff88036f72ed00)
[   54.711387] sd 0:0:15:0: [sds] CDB: ATA command pass through(12)/Blank: a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.711401] mpt2sas0: task abort: SUCCESS scmd(ffff88036f72ed00)
[   54.711479] mpt2sas0: attempting task abort! scmd(ffff88036f72fe00)
[   54.711487] sd 0:0:2:0: [sdf] CDB: Inquiry: 12 00 00 00 fe 00
[   54.711495] mpt2sas0: task abort: SUCCESS scmd(ffff88036f72fe00)
[   54.711566] mpt2sas0: attempting task abort! scmd(ffff88036cd99300)
[   54.711570] sd 0:0:5:0: [sdi] CDB: ATA command pass through(12)/Blank: a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.711585] mpt2sas0: task abort: SUCCESS scmd(ffff88036cd99300)
[   54.711651] mpt2sas0: attempting task abort! scmd(ffff88036cd99900)
[   54.711654] sd 0:0:7:0: [sdk] CDB: ATA command pass through(12)/Blank: a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.711664] mpt2sas0: task abort: SUCCESS scmd(ffff88036cd99900)
[   54.711781] mpt2sas0: attempting task abort! scmd(ffff8803721f9000)
[   54.711784] sd 0:0:12:0: [sdp] CDB: ATA command pass through(12)/Blank: a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.711794] mpt2sas0: task abort: SUCCESS scmd(ffff8803721f9000)
[   54.711867] mpt2sas0: attempting task abort! scmd(ffff88036f72fc00)
[   54.711871] sd 0:0:0:0: [sdd] CDB: ATA command pass through(12)/Blank: a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.711891] mpt2sas0: task abort: SUCCESS scmd(ffff88036f72fc00)
[   54.711981] mpt2sas0: attempting task abort! scmd(ffff88036e456b00)
[   54.711986] sd 0:0:4:0: [sdh] CDB: ATA command pass through(12)/Blank: a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.712030] mpt2sas0: task abort: SUCCESS scmd(ffff88036e456b00)
[   54.712097] mpt2sas0: attempting task abort! scmd(ffff88036cdd0d00)
[   54.712100] sd 0:0:6:0: [sdj] CDB: ATA command pass through(12)/Blank: a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.712110] mpt2sas0: task abort: SUCCESS scmd(ffff88036cdd0d00)
[   54.712176] mpt2sas0: attempting task abort! scmd(ffff88036f02ef00)
[   54.712181] sd 0:0:8:0: [sdl] CDB: ATA command pass through(12)/Blank: a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.712230] mpt2sas0: task abort: SUCCESS scmd(ffff88036f02ef00)
[   54.712310] mpt2sas0: attempting task abort! scmd(ffff88036cd99a00)
[   54.712313] sd 0:0:10:0: [sdn] CDB: ATA command pass through(12)/Blank: a1 08 2e 00 01 00 00 00 00 ec 00 00

Spam hddtemp on drives and bam:
[ 1161.151577] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.151580] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.151582] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.151948] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.151951] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.151953] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.152313] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.152316] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.152318] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.152684] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.152688] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.152690] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.153054] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.153058] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.153060] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.153418] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.153420] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.153422] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.153787] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.153790] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.153792] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.154151] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1171.151888] mpt2sas0: attempting task abort! scmd(ffff880342884200)
[ 1171.151892] sd 0:0:17:0: [sdm] CDB: Test Unit Ready: 00 00 00 00 00 00
[ 1171.151902] mpt2sas0: task abort: SUCCESS scmd(ffff880342884200)
[ 1171.151906] mpt2sas0: attempting host reset! scmd(ffff880342884200)
[ 1171.151908] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1171.151923] mpt2sas0: sending diag reset !!
[ 1172.110009] mpt2sas0: diag reset: SUCCESS
[ 1172.690466] mpt2sas0: LSISAS2008: FWVersion(02.00.50.00), ChipRevision(0x02), BiosVersion(07.01.00.00)
[ 1172.690469] mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[ 1172.690536] mpt2sas0: sending port enable !!
[ 1181.730641] mpt2sas0: port enable: SUCCESS
[ 1181.730759] mpt2sas0: _scsih_search_responding_sas_devices
[ 1181.731611] scsi target0:0:0: handle(0x000a), sas_addr(0x50014380048874cc), enclosure logical id(0x50014380048874e5), slot(47)
[ 1181.731698] scsi target0:0:1: handle(0x000b), sas_addr(0x50014380048874cd), enclosure logical id(0x50014380048874e5), slot(46)
[ 1181.731782] scsi target0:0:2: handle(0x000c), sas_addr(0x50014380048874ce), enclosure logical id(0x50014380048874e5), slot(45)
[ 1181.731865] scsi target0:0:3: handle(0x000d), sas_addr(0x50014380048874cf), enclosure logical id(0x50014380048874e5), slot(44)
[ 1181.731945] scsi target0:0:4: handle(0x000e), sas_addr(0x50014380048874d0), enclosure logical id(0x50014380048874e5), slot(51)
[ 1181.732034] scsi target0:0:5: handle(0x000f), sas_addr(0x50014380048874d1), enclosure logical id(0x50014380048874e5), slot(50)
[ 1181.732118] scsi target0:0:6: handle(0x0010), sas_addr(0x50014380048874d2), enclosure logical id(0x50014380048874e5), slot(49)
[ 1181.732203] scsi target0:0:7: handle(0x0011), sas_addr(0x50014380048874d3), enclosure logical id(0x50014380048874e5), slot(48)
[ 1181.732286] scsi target0:0:8: handle(0x0012), sas_addr(0x50014380048874d4), enclosure logical id(0x50014380048874e5), slot(55)
[ 1181.732371] scsi target0:0:17: handle(0x0013), sas_addr(0x50014380048874d5), enclosure logical id(0x50014380048874e5), slot(54)
[ 1181.732454] scsi target0:0:10: handle(0x0014), sas_addr(0x50014380048874d6), enclosure logical id(0x50014380048874e5), slot(53)
[ 1181.732538] scsi target0:0:11: handle(0x0015), sas_addr(0x50014380048874d7), enclosure logical id(0x50014380048874e5), slot(52)
[ 1181.732621] scsi target0:0:12: handle(0x0016), sas_addr(0x50014380048874e0), enclosure logical id(0x0000000000000000), slot(0)
[ 1181.732704] scsi target0:0:13: handle(0x0017), sas_addr(0x50014380048874e1), enclosure logical id(0x0000000000000000), slot(0)
[ 1181.732788] scsi target0:0:14: handle(0x0018), sas_addr(0x50014380048874e2), enclosure logical id(0x0000000000000000), slot(0)
[ 1181.732870] scsi target0:0:15: handle(0x0019), sas_addr(0x50014380048874e3), enclosure logical id(0x0000000000000000), slot(0)
[ 1181.732954] scsi target0:0:16: handle(0x001a), sas_addr(0x50014380048874e5), enclosure logical id(0x50014380048874e5), slot(0)
[ 1181.733043] mpt2sas0: _scsih_search_responding_raid_devices
[ 1181.733046] mpt2sas0: _scsih_search_responding_expanders
[ 1181.733138]  expander present: handle(0x0009), sas_addr(0x50014380048874e6)
[ 1181.733220] mpt2sas0: host reset: SUCCESS scmd(ffff880342884200)

The drives did not fall off, though I didn't keep the test running for very long.




On Fri, May 7, 2010 at 1:01 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=14831
>
>
>
>
>
> --- Comment #20 from kdesai <kashyap.desai@lsi.com>  2010-05-07 08:01:25 ---
> (In reply to comment #19)
> > So... the patch seems to fix the ATA command pass-through problems.  I let it
> > run for a day spamming hddtemp in a loop on all the drives, while at the same
> > time reading 600MB/sec or so.  No problem.  Again, without the patch it would
> > never manage more than 10 seconds of spamming all the drives at once.
> >
> > IMO the ATA pass-through bug is fixed by this patch.  I cannot cause a
> > failure using ATA pass-through.
> >
> > All is not good news, however....
> >
> > With this bug fixed I was going to start expanding an md array one disk at
> > a time.  Unfortunately, sooner or later the controller seems to crap out.  I
> > don't know what is at fault, but the mptsas driver's method of just blowing
> > up and blocking processes forever sucks.
> >
> > I've tried this 4 times now, and each time I see some read errors, then task
> > resets fail, and eventually it gets to the point where it just keeps spamming
> > 'some task has been blocked for 120s'.  I WISH this was a bad drive, but even
> > a bad drive shouldn't take down the system like this.  Just to be sure, I've
> > been swapping a few drives, and it doesn't really make a difference: each
> > time a different drive starts the fail sequence.  I'm guessing it's unlikely
> > I have a pile of bad drives.
> >
> > I do have 16 drives all attached via an HP SAS Expander; perhaps the
> > expander is at fault.  I also have a backup Chenbro Expander I could try...
> > but I'm too lazy to at the moment.  I could also try ditching the Expanders
> > to see if they are the cause of these problems, but again, too lazy at the
> > moment.  On Monday an mpt2sas controller is being delivered; I think my best
> > bet is to ditch this mptsas driver altogether.  If that doesn't fix the
> > problems, I'll go back and try swapping Expanders and whatnot.
> >
> > Anyway, TL;DR:  ATA pass-through bug is fixed, mptsas still blows.
>
> The patch that sets the DMA boundary merely avoids the condition that causes
> this issue; the LSI Gen-1 controller does not have a 512-byte DMA boundary
> limitation. I have started an internal discussion with our firmware engineer
> and will update you with findings as soon as anything important turns up.
> >
> > Here is the log from the current failures; I'm fairly sure this is unrelated
> > to the whole ATA pass-through problem:
> > May  6 17:52:09 nine kernel: [18838.207805] md: recovery of RAID array md127
> > May  6 17:52:09 nine kernel: [18838.207815] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> > May  6 17:52:09 nine kernel: [18838.207818] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> > May  6 17:52:09 nine kernel: [18838.207831] md: using 128k window, over a total of 1953510784 blocks.
> > May  6 17:52:09 nine kernel: [18838.207833] md: resuming recovery of md127 from checkpoint.
> > May  6 20:51:21 nine kernel: [29589.980035] mptscsih: ioc0: attempting task abort! (sc=ffff8803318f4900)
> > May  6 20:51:21 nine kernel: [29589.980041] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e f6 00 00 01 00 00
> > May  6 20:51:28 nine kernel: [29596.503483] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
> > May  6 20:51:28 nine kernel: [29596.503747] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8803318f4900)
> > May  6 20:51:28 nine kernel: [29597.253319] mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
> > May  6 20:51:28 nine kernel: [29597.253329] mptscsih: ioc0: attempting task abort! (sc=ffff8803318f4e00)
> > May  6 20:51:28 nine kernel: [29597.253332] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e fc 00 00 01 00 00
> > May  6 20:51:28 nine kernel: [29597.253341] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8803318f4e00)
> > May  6 20:51:29 nine kernel: [29597.753599] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
> > May  6 20:51:29 nine kernel: [29597.753608] mptscsih: ioc0: attempting task abort! (sc=ffff8803318f4c00)
> > May  6 20:51:29 nine kernel: [29597.753610] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 02 00 00 01 00 00
> > May  6 20:51:29 nine kernel: [29597.753619] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8803318f4c00)
> > May  6 20:51:29 nine kernel: [29597.753622] mptscsih: ioc0: attempting task abort! (sc=ffff8803318f5b00)
> > May  6 20:51:29 nine kernel: [29597.753624] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 0e 00 00 01 00 00
> > May  6 20:51:29 nine kernel: [29597.753633] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8803318f5b00)
> > May  6 20:51:29 nine kernel: [29597.753636] mptscsih: ioc0: attempting task abort! (sc=ffff880331e3d900)
> > May  6 20:51:29 nine kernel: [29597.753638] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 14 00 00 00 08 00
> > May  6 20:51:29 nine kernel: [29597.753646] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880331e3d900)
> > May  6 20:51:29 nine kernel: [29597.753649] mptscsih: ioc0: attempting task abort! (sc=ffff880331e3d400)
> > May  6 20:51:29 nine kernel: [29597.753651] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 14 08 00 00 68 00
> > May  6 20:51:29 nine kernel: [29597.753659] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880331e3d400)
> > May  6 20:51:29 nine kernel: [29597.753671] mptscsih: ioc0: attempting target reset! (sc=ffff8803318f4900)
> > May  6 20:51:29 nine kernel: [29597.753673] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e f6 00 00 01 00 00
> > May  6 20:51:29 nine kernel: [29597.753685] mptscsih: ioc0: target reset: FAILED (sc=ffff8803318f4900)
> > May  6 20:51:29 nine kernel: [29597.753693] mptscsih: ioc0: attempting bus reset! (sc=ffff8803318f4900)
> > May  6 20:51:29 nine kernel: [29597.753695] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e f6 00 00 01 00 00
> > May  6 20:51:29 nine kernel: [29597.753712] mptscsih: ioc0: bus reset: FAILED (sc=ffff8803318f4900)
> > May  6 20:51:29 nine kernel: [29597.753715] mptscsih: ioc0: attempting host reset! (sc=ffff8803318f4900)
> > May  6 20:52:04 nine kernel: [29632.830020] mptscsih: ioc0: host reset: SUCCESS (sc=ffff8803318f4900)
> > May  6 20:52:14 nine kernel: [29642.840021] sd 6:0:5:0: Device offlined - not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840024] sd 6:0:5:0: Device offlined - not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840026] sd 6:0:5:0: Device offlined - not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840028] sd 6:0:5:0: Device offlined - not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840030] sd 6:0:5:0: Device offlined - not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840032] sd 6:0:5:0: Device offlined - not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840076] sd 6:0:5:0: [sdh] Unhandled error code
> > May  6 20:52:14 nine kernel: [29642.840082] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.840087] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e f6 00 00 01 00 00
> > May  6 20:52:14 nine kernel: [29642.840112] raid5:md127: read error not correctable (sector 1284435456 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840129] raid5:md127: read error not correctable (sector 1284435464 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840133] raid5:md127: read error not correctable (sector 1284435472 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840136] raid5:md127: read error not correctable (sector 1284435480 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840139] raid5:md127: read error not correctable (sector 1284435488 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840143] raid5:md127: read error not correctable (sector 1284435496 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840149] raid5:md127: read error not correctable (sector 1284435504 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840196] raid5:md127: read error not correctable (sector 1284435512 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840199] raid5:md127: read error not correctable (sector 1284435520 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840202] raid5:md127: read error not correctable (sector 1284435528 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.847676] sd 6:0:5:0: [sdh] Unhandled error code
> > May  6 20:52:14 nine kernel: [29642.847678] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.847681] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e fc 00 00 01 00 00
> > May  6 20:52:14 nine kernel: [29642.847745] sd 6:0:5:0: [sdh] Unhandled error code
> > May  6 20:52:14 nine kernel: [29642.847746] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.847749] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 02 00 00 01 00 00
> > May  6 20:52:14 nine kernel: [29642.847812] sd 6:0:5:0: [sdh] Unhandled error code
> > May  6 20:52:14 nine kernel: [29642.847813] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.847816] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 0e 00 00 01 00 00
> > May  6 20:52:14 nine kernel: [29642.847871] sd 6:0:5:0: [sdh] Unhandled error code
> > May  6 20:52:14 nine kernel: [29642.847873] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.847875] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 14 00 00 00 08 00
> > May  6 20:52:14 nine kernel: [29642.847907] sd 6:0:5:0: [sdh] Unhandled error code
> > May  6 20:52:14 nine kernel: [29642.847908] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.847911] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 14 08 00 00 68 00
> > May  6 20:52:19 nine kernel: [29647.840019] mptbase: ioc0: WARNING - Issuing Reset from mpt_config!!
> > May  6 20:52:50 nine kernel: [29678.961260] ------------[ cut here ]------------
> > May  6 20:52:50 nine kernel: [29678.961268] WARNING: at /home/kernel-ppa/mainline/build/kernel/workqueue.c:485 flush_cpu_workqueue+0x8c/0x90()
> > May  6 20:52:50 nine kernel: [29678.961271] Hardware name: empty
> > May  6 20:52:50 nine kernel: [29678.961273] Modules linked in: btrfs zlib_deflate crc32c libcrc32c xfs exportfs mptctl binfmt_misc ppdev ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp kvm_intel kvm snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device psmouse serio_raw ioatdma snd i5100_edac nvidia(P) dca soundcore snd_page_alloc edac_core lp parport raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor ses enclosure xor async_memcpy async_tx raid1 raid0 multipath linear ahci e1000e mptsas mptscsih mptbase scsi_transport_sas
> > May  6 20:52:50 nine kernel: [29678.961333] Pid: 321, comm: mpt/0 Tainted: P           2.6.34-020634rc6-generic #020634rc6
> > May  6 20:52:50 nine kernel: [29678.961336] Call Trace:
> > May  6 20:52:50 nine kernel: [29678.961341]  [<ffffffff8107a9ac>] ? flush_cpu_workqueue+0x8c/0x90
> > May  6 20:52:50 nine kernel: [29678.961346]  [<ffffffff8105f1ec>] warn_slowpath_common+0x8c/0xc0
> > May  6 20:52:50 nine kernel: [29678.961350]  [<ffffffff8105f234>] warn_slowpath_null+0x14/0x20
> > May  6 20:52:50 nine kernel: [29678.961353]  [<ffffffff8107a9ac>] flush_cpu_workqueue+0x8c/0x90
> > May  6 20:52:50 nine kernel: [29678.961357]  [<ffffffff8106f981>] ? try_to_del_timer_sync+0x51/0xe0
> > May  6 20:52:50 nine kernel: [29678.961360]  [<ffffffff8107aa74>] flush_workqueue+0x44/0x70
> > May  6 20:52:50 nine kernel: [29678.961373]  [<ffffffffa004531c>] mptsas_cleanup_fw_event_q+0x12c/0x160 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961378]  [<ffffffffa0048434>] mptsas_ioc_reset+0x94/0x130 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961383]  [<ffffffff81033d39>] ? default_spin_lock_flags+0x9/0x10
> > May  6 20:52:50 nine kernel: [29678.961389]  [<ffffffffa001222d>] mpt_signal_reset+0x4d/0x60 [mptbase]
> > May  6 20:52:50 nine kernel: [29678.961394]  [<ffffffffa0018eb6>] mpt_SoftResetHandler+0x1b6/0x3c0 [mptbase]
> > May  6 20:52:50 nine kernel: [29678.961399]  [<ffffffffa001bee7>] mpt_config+0x307/0x640 [mptbase]
> > May  6 20:52:50 nine kernel: [29678.961404]  [<ffffffffa004c6f0>] ? mptsas_firmware_event_work+0x0/0xe80 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961409]  [<ffffffffa001d0b1>] mpt_findImVolumes+0xb1/0x600 [mptbase]
> > May  6 20:52:50 nine kernel: [29678.961415]  [<ffffffffa004c6f0>] ? mptsas_firmware_event_work+0x0/0xe80 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961419]  [<ffffffffa004cd88>] mptsas_firmware_event_work+0x698/0xe80 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961424]  [<ffffffff8100985b>] ? __switch_to+0xbb/0x2e0
> > May  6 20:52:50 nine kernel: [29678.961428]  [<ffffffff8105118e>] ? put_prev_entity+0x2e/0x80
> > May  6 20:52:50 nine kernel: [29678.961430]  [<ffffffff81051af6>] ? finish_task_switch+0x66/0xd0
> > May  6 20:52:50 nine kernel: [29678.961435]  [<ffffffffa004c6f0>] ? mptsas_firmware_event_work+0x0/0xe80 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961438]  [<ffffffff8107a10c>] run_workqueue+0xbc/0x190
> > May  6 20:52:50 nine kernel: [29678.961441]  [<ffffffff8107a65b>] worker_thread+0x9b/0x100
> > May  6 20:52:50 nine kernel: [29678.961444]  [<ffffffff8107edc0>] ? autoremove_wake_function+0x0/0x40
> > May  6 20:52:50 nine kernel: [29678.961447]  [<ffffffff8107a5c0>] ? worker_thread+0x0/0x100
> > May  6 20:52:50 nine kernel: [29678.961450]  [<ffffffff8107e9e6>] kthread+0x96/0xa0
> > May  6 20:52:50 nine kernel: [29678.961453]  [<ffffffff8100be64>] kernel_thread_helper+0x4/0x10
> > May  6 20:52:50 nine kernel: [29678.961456]  [<ffffffff8107e950>] ? kthread+0x0/0xa0
> > May  6 20:52:50 nine kernel: [29678.961458]  [<ffffffff8100be60>] ? kernel_thread_helper+0x0/0x10
> > May  6 20:52:50 nine kernel: [29678.961460] ---[ end trace 5b0b1793526edc2a ]---
> > May  6 20:53:20 nine kernel: [29709.040090] mptscsih: ioc0: attempting task abort! (sc=ffff880331812400)
> > May  6 20:53:20 nine kernel: [29709.040093] sd 6:0:15:0: [sdr] CDB: Write(10): 2a 00 00 00 00 47 00 00 02 00
> > May  6 20:53:50 nine kernel: [29739.040011] mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!
> > May  6 20:54:13 nine kernel: [29761.700122] md127_resync  D ffff880001f55740     0  6733      2 0x00000000
> > May  6 20:54:13 nine kernel: [29761.700130]  ffff8803318f3b90 0000000000000046 ffff8803318f3b50 ffff8803318f3fd8
> > May  6 20:54:13 nine kernel: [29761.700134]  ffff8803318eae20 0000000000015740 0000000000015740 ffff8803318f3fd8
> > May  6 20:54:13 nine kernel: [29761.700137]  0000000000015740 ffff8803318f3fd8 0000000000015740 ffff8803318eae20
> > May  6 20:54:13 nine kernel: [29761.700141] Call Trace:
> > May  6 20:54:13 nine kernel: [29761.700160]  [<ffffffffa00f20e2>] get_active_stripe+0x232/0x340 [raid456]
> > May  6 20:54:13 nine kernel: [29761.700167]  [<ffffffff810507e0>] ? default_wake_function+0x0/0x20
> > May  6 20:54:13 nine kernel: [29761.700172]  [<ffffffffa00f49ad>] sync_request+0x26d/0x2d0 [raid456]
> > May  6 20:54:13 nine kernel: [29761.700176]  [<ffffffffa00f1e8e>] ? raid5_unplug_device+0x7e/0xa0 [raid456]
> >
> >
>
> For now, you can continue with the patch for the DMA boundary alignment issue.
> For this new issue, please provide the complete /var/log/messages with
> debugging turned on:
>
> echo 0x8188 > /sys/module/mptbase/parameters/mpt_debug_level
>
> Thanks,
> Kashyap
> > On Wed, May 5, 2010 at 3:35 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=14831
> > >
> > >
> > > Andrew Dunn <andrew.g.dunn.dod@gmail.com> changed:
> > >
> > >            What    |Removed                     |Added
> > > ----------------------------------------------------------------------------
> > >                  CC|                            |andrew.g.dunn.dod@gmail.com
> > >
> > >
> > >
> > >
> > > --- Comment #18 from Andrew Dunn <andrew.g.dunn.dod@gmail.com>  2010-05-05 10:35:17 ---
> > > I anxiously await confirmation of this patch. This issue has been plaguing
> > > me for quite a while. Just to confirm: the mpt2sas controllers don't have
> > > this problem? I was thinking of trying to get an AOC-USAS2-L8i
> > > (http://www.supermicro.com/products/accessories/addon/AOC-USAS2-L8i.cfm?TYP=I).
> > >
>
>
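The quoted log contains two kinds of raw values that can be decoded by hand: the mptbase LogInfo words and the SCSI CDBs of the aborted reads. Below is a minimal sketch, assuming the LogInfo layout used by the mptbase driver's SAS log-info decoding (low to high bits: 16-bit subcode, 8-bit code, 4-bit originator, 4-bit bus type; the driver appears to treat bus type 3 as SAS) and the standard READ(10) layout (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8). The code table holds only the three PL codes that appear in this log, and `decode_loginfo` / `decode_read10` are hypothetical helper names, not driver functions.

```python
# Names for the 4-bit originator field of a LogInfo word.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}

# Only the PL (protocol layer) codes seen in this log; the driver knows many more.
PL_CODES = {
    0x13: "IO Not Yet Executed",
    0x14: "IO Executed",
    0x17: "IO Device Missing Delay Retry",
}


def decode_loginfo(value):
    """Split a 32-bit LogInfo word into (bus_type, originator, code, subcode).

    Assumed layout, low bits first: subcode:16, code:8, originator:4, bus_type:4.
    """
    bus_type = (value >> 28) & 0xF
    originator = ORIGINATORS.get((value >> 24) & 0xF, "unknown")
    raw_code = (value >> 16) & 0xFF
    subcode = value & 0xFFFF
    known = PL_CODES if originator == "PL" else {}
    code = known.get(raw_code, f"0x{raw_code:02x}")
    return bus_type, originator, code, subcode


def decode_read10(cdb_hex):
    """Return (lba, transfer_length_in_blocks) for a READ(10) CDB such as '28 00 ...'."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks
```

For example, `decode_loginfo(0x31170000)` returns `(3, 'PL', 'IO Device Missing Delay Retry', 0)`, matching the driver's own text for that log line, and `decode_read10("28 00 4c 8e f6 00 00 01 00 00")` returns LBA 0x4c8ef600 with a transfer length of 0x100 (256) blocks, i.e. 128 KiB assuming 512-byte sectors.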

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
From: bugzilla-daemon @ 2010-05-12  9:04 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #22 from Tim Small <tim@seoss.co.uk>  2010-05-12 09:03:57 ---
(In reply to comment #21)

> So apparently this bug affects mpt2sas too???

Hi Brian,

Can you give the drive makes and models too?  Some recent WD drives can lock up
on SMART commands with any controller.

Thanks,

Tim.

P.S. Please try to keep the quotes under control in comments: only quote the
relevant parts!


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
From: bugzilla-daemon @ 2010-05-12  9:33 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #23 from Brian Sullivan <bexamous@gmail.com>  2010-05-12 09:33:13 ---
>
> Can you give the drive makes and models too?  Some recent WD drives can
> lock-up on SMART with all controllers.


1,1  /dev/sdq unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46815978
2,1  /dev/sdp unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46783271
3,1  /dev/sdo unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46803203
4,1  /dev/sdn unmounted  WDC WD1000FYPS-0:1B01  WD-WCASJ0318455
1,2  /dev/sdi unmounted  WDC WD10EACS-00D:1A01  WD-WCAU42580399
2,2  /dev/sdh unmounted  WDC WD10EACS-00D:1A01  WD-WCAU42557087
3,2  /dev/sdg unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46812587
4,2  /dev/sdf unmounted  WDC WD15EADS-00H:0K05  WD-WCAUP0019266
1,3  /dev/sde unmounted  WDC WD20EADS-00S:0A01  WD-WCAVY2526737
2,3  /dev/sdd unmounted  WDC WD20EADS-00S:0A01  WD-WCAVY2252304
3,3  (empty)
4,3  (empty)
1,4  /dev/sdm unmounted  Hitachi HDS72202:A20N  JK1130YAGVZSRT
2,4  /dev/sdl unmounted  WDC WD20EADS-00S:0A01  WD-WCAVY1361632
3,4  /dev/sdk unmounted  WDC WD20EADS-00S:0A01  WD-WCAVY1338263
4,4  /dev/sdj unmounted  WDC WD20EADS-00S:0A01  WD-WCAVY1279474
1,5  /dev/sdr unmounted  WDC WD20EADS-00R:0A01  WD-WCAVY1985273
2,5  /dev/sds unmounted  WDC WD20EADS-00R:0A01  WD-WCAVY1861891
3,5  /dev/sdt unmounted  WDC WD20EADS-00R:0A01  WD-WCAVY1985283
4,5  /dev/sdu unmounted  WDC WD20EADS-00R:0A01  WD-WCAVY1831784

I have some more Hitachi drives, but I am pretty sure at some point I
compared WD to Hitachi and had problems with both.

Maybe next weekend, if I have some time, I'll try pulling all the WD drives
and adding some more Hitachi drives to see what happens.


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
From: bugzilla-daemon @ 2010-05-12 10:22 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #24 from Tim Small <tim@seoss.co.uk>  2010-05-12 10:22:08 ---
(In reply to comment #23)

> 3,1  /dev/sdo unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46803203
> 4,1  /dev/sdn unmounted  WDC WD1000FYPS-0:1B01  WD-WCASJ0318455
> 1,2  /dev/sdi unmounted  WDC WD10EACS-00D:1A01  WD-WCAU42580399
> 2,2  /dev/sdh unmounted  WDC WD10EACS-00D:1A01  WD-WCAU42557087
> 3,2  /dev/sdg unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46812587
> 4,2  /dev/sdf unmounted  WDC WD15EADS-00H:0K05  WD-WCAUP0019266

> Maybe next weekend if I have some time I'll try pulling all the WD drives
> and adding some more Hitachi and seeing what happens.

Perhaps you could stress-test the drives on a different controller e.g. AHCI,
PIIX, Silicon Image, or whatever?

Tim.
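Tim's suggestion can be combined with the reproduction described earlier in the thread (hammering every drive with SMART queries at once, as the hddtemp loop did). Below is a minimal sketch, assuming smartmontools is installed; it uses `smartctl -d sat -A` as a stand-in for hddtemp, and `query_smart` / `stress_smart` are hypothetical names. Following the smartctl(8) exit-status bitmask, only bits 0-2 (command-line parse error, device open failure, failed SMART/ATA command) are treated as failures, and a command that hangs past the timeout also counts as a failure. The sketch does not generate the background read load; that still has to be started separately.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def query_smart(device, run=subprocess.run, timeout=60):
    """Send one SMART attribute read (ATA pass-through via SAT) to `device`.

    Returns True on success. smartctl's exit status is a bitmask; only bits 0-2
    are treated as failures here. A command that hits the timeout also fails.
    """
    try:
        result = run(["smartctl", "-d", "sat", "-A", device],
                     capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return (result.returncode & 0b111) == 0


def stress_smart(devices, rounds, run=subprocess.run):
    """Query all devices concurrently for `rounds` rounds; return failures per device."""
    failures = {dev: 0 for dev in devices}
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        for _ in range(rounds):
            # Each round hits every drive at the same time, like a parallel hddtemp loop.
            outcomes = list(pool.map(lambda dev: (dev, query_smart(dev, run)), devices))
            for dev, ok in outcomes:
                if not ok:
                    failures[dev] += 1
    return failures
```

For example, `stress_smart(["/dev/sdb", "/dev/sdc"], rounds=100)` returns a dict mapping each device to its failure count. Running the same call with the drives behind the LSI HBA and then behind an AHCI port would separate controller problems from drive problems. The `run` parameter exists only so the loop can be exercised without real hardware.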


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
From: bugzilla-daemon @ 2010-05-24  9:05 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=14831


dujun@perabytes.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dujun@perabytes.com




--- Comment #25 from dujun@perabytes.com  2010-05-24 09:05:10 ---
https://bugzilla.kernel.org/show_bug.cgi?id=16021 
We have a similar problem, which I described in the bug report linked above.
Ryan told me about this patch and I have tested it on our system. It seems to
work, and the md RAID rebuild speed is around 60MB/s with 16 HDDs.

However, dd throughput while the RAID 5 array is rebuilding is much lower than
the original result: around 150MB/s for writes compared with almost 400MB/s
without the patch, and 600MB/s for reads compared with 800MB/s.

Is this caused by the bounce buffer for the alignment? Is there any way to
solve this problem in the LSI firmware so that we don't need forced alignment?

We will test the software RAID grow problem reported by Brian later.
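The thread calls the fix both a "DMA boundary" and a "512-byte alignment" patch, and the patch itself is not quoted here. Assuming it raises the request queue's DMA alignment mask to 511, the rule that decides whether a user buffer for a pass-through command gets copied through a bounce buffer can be modeled as below. The model follows my understanding of the block layer's alignment check (`blk_rq_aligned()`, which tests both the start address and the length against the alignment mask OR'd with the queue's padding mask); `needs_bounce` is a hypothetical name, this is an illustration rather than kernel code, and the 4-byte mask (0x3) is used only for comparison.

```python
def needs_bounce(addr, length, align_mask=512 - 1, pad_mask=0):
    """Return True if a user buffer would have to be copied through a bounce buffer.

    Model of the alignment rule: the buffer is mapped for DMA directly only
    when both its start address and its length are multiples of (mask + 1).
    """
    mask = align_mask | pad_mask
    return (addr & mask) != 0 or (length & mask) != 0


# Hypothetical user-space buffers: one page-aligned, one 8 bytes past a page boundary.
examples = {
    "page-aligned 4 KiB": (0x7F00_0000_1000, 4096),
    "offset 512 bytes": (0x7F00_0000_1008, 512),
}

for label, (addr, length) in examples.items():
    # Compare a 4-byte alignment mask (0x3) with the 512-byte mask (0x1ff).
    print(f"{label}: 4-byte mask -> bounce={needs_bounce(addr, length, 0x3)}, "
          f"512-byte mask -> bounce={needs_bounce(addr, length)}")
```

Only the second buffer switches from a direct mapping to a copy when the mask grows from 0x3 to 0x1ff. As far as I know, ordinary page-cache I/O such as a plain dd to the md device does not go through this user-buffer mapping path at all, so under this model the bounce copy alone would not obviously explain a dd slowdown.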

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
From: bugzilla-daemon @ 2010-05-26  8:08 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #26 from dujun@perabytes.com  2010-05-26 08:08:01 ---
It appears to us that growing the software RAID works without any problems
with the forced-alignment patch.


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
From: bugzilla-daemon @ 2010-06-07 20:00 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=14831


richard@sauce.co.nz changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |richard@sauce.co.nz




--- Comment #27 from richard@sauce.co.nz  2010-06-07 20:00:22 ---
Hi Dujun,

Can you add any more information about this performance drop, here and on
linux-scsi?

See discussion thread here: 

http://marc.info/?l=linux-scsi&m=127567915722288&w=2

Thanks,

Richard


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
From: bugzilla-daemon @ 2010-06-08  0:29 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #28 from dujun@perabytes.com  2010-06-08 00:29:49 ---
Hi Richard,
Sorry, I haven't been following the linux-scsi mailing list. Please note the
following.

There are 3 test environments, all running Linux 2.6.32 and the official LSI
driver 4.22.00.00:

1. LSI 1068e HBA chip on the mainboard connected to two LSI x12A expander
chips, then 16 1 TB WD SATA-II disks.
2. LSI 1068e HBA chip on the mainboard connected to one LSI x36 expander chip,
then 16 1 TB WD SATA-II disks.
3. LSI 1068e HBA chip on the mainboard and an LSI 1068e HBA in a PCI-e slot,
each connected to 8 1 TB WD SATA-II disks, 16 disks in total.

Without the 512-byte alignment patch, only test 1 has the reset problem; tests
2 and 3 work perfectly without any problem.

With the patch, all three setups passed stability testing. However, dd testing
showed that performance degraded in test 1. We set up the md RAID 5 with

  mdadm -C /dev/md10 -l 5 -n 16 /dev/sd[b-q]

and then tested the write speed of the md device with

  dd if=/dev/zero of=/dev/md10 bs=1M count=10960

Tests 2 and 3 show no performance penalty.

After several more days of investigation, swapping hardware one part at a time,
we found that some of the x12A expanders may be causing the problem. Most of
them work fine, just as in tests 2 and 3; only two or three caused the
performance issue.

So we conclude that the patch should be included in the next release. The
performance drop may be caused by the x12A expander. We are going to ask our
chip solution provider to investigate why some of the chips perform worse.

(In reply to comment #27)
> Hi Dujun,
> 
> Can you add any more information to this performance drop, here and on
> linux-scsi?
> 
> See discussion thread here: 
> 
> http://marc.info/?l=linux-scsi&m=127567915722288&w=2
> 
> Thanks,
> 
> Richard


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
From: bugzilla-daemon @ 2010-06-08  6:44 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #29 from Tim Small <tim@seoss.co.uk>  2010-06-08 06:44:45 ---
(In reply to comment #28)

> linux 2.6.32 and the LSI official driver 4.22.00.00:

The official Linux driver seems to be 3.04.15 -

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/message/fusion/mptbase.h;hb=HEAD#l69

so my understanding is that you should probably be testing with this driver,
rather than the 4.x LSI driver which hasn't made it into the Linux tree (yet). 
The 4.x driver might be useful for providing additional data points.

Some official word from LSI on this would be useful.

Thanks,

Tim.

From: bugzilla-daemon @ 2010-06-08  8:43 UTC

--- Comment #30 from kdesai <kashyap.desai@lsi.com>  2010-06-08 08:42:48 ---
(In reply to comment #29)
> (In reply to comment #28)
> 
> > linux 2.6.32 and the LSI official driver 4.22.00.00:
> 
> The official Linux driver seems to be 3.04.15 -
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/message/fusion/mptbase.h;hb=HEAD#l69
> 
> so my understanding is that you should probably be testing with this driver,
Yes. I would recommend using the 3.4.15 driver, since this bug was raised on
kernel.org.

> rather than the 4.x LSI driver which hasn't made it into the Linux tree (yet). 
> The 4.x driver might be useful for providing additional data points.
> 
> Some official word from LSI on this would be useful.
> 
> Thanks,
> 
> Tim.

From: bugzilla-daemon @ 2010-06-29 20:26 UTC

starlight@binnacle.cx changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |starlight@binnacle.cx




--- Comment #31 from starlight@binnacle.cx  2010-06-29 20:26:33 ---
FYI

Experiencing the same problem on a CentOS kernel with the latest LSI
driver and firmware:

LSI 2008
eight Seagate Momentus ST9500420AS SATA drives
LVM2 8x striped LV

CentOS 5.5 kernel 2.6.18-194.3.1.el5
MPT2BIOS 7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00
LSI mpt2sas 05.00.00.00

also

CentOS 5.4 kernel 2.6.18-164.10.1.el5
MPT2BIOS 7.03.00.00 (2009-10-12)
SAS2008-IR 4.00.00.00
distro mpt2sas version 01.101.00.00

-----

With 'smartd' running, the controller resets and drops the last drive
after one or two days.  The failure occurs during very light write
activity rather than heavy write activity.  The LV is used for
writing a very large log file to an 'ext4' file system.

With 'smartd' disabled, the system ran longer under mpt2sas v01.101,
but a somewhat different error corrupted the 'ext4' file system
after about three weeks.

EXT4-fs error (device dm-19): ext4_mb_generate_buddy: EXT4-fs: group 1168: 32768 blocks in bitmap, 32720 in gd
EXT4-fs error (device dm-19): ext4_mb_mark_diskspace_used: Allocating block 38273024 in system zone of 1168 group
.
.
.
mpt2sas0: attempting task abort! scmd(ffff810130449540)
sd 0:0:2:0: command: Read(10): 28 00 11 35 5b 07 00 00 08 00
mpt2sas0: task abort: SUCCESS scmd(ffff810130449540)
.
.
.

Too soon to tell if mpt2sas-05.00.00.00 is better.
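The task-abort message above prints the raw 10-byte CDB of the aborted command. As a sketch only (a hypothetical helper, not part of mpt2sas or any kernel tool), the standard SBC READ(10)/WRITE(10) layout can be decoded like this: byte 0 is the opcode, bytes 2-5 are the big-endian starting LBA, and bytes 7-8 are the big-endian transfer length in logical blocks.

```python
import struct

# SBC opcodes for the two 10-byte commands that appear in these logs.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_rw10(cdb_hex: str) -> tuple[str, int, int]:
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    Returns (command name, starting LBA, transfer length in logical blocks).
    """
    cdb = bytes(int(b, 16) for b in cdb_hex.split())
    if len(cdb) != 10 or cdb[0] not in OPCODES:
        raise ValueError(f"not a READ(10)/WRITE(10) CDB: {cdb_hex!r}")
    # Bytes 2..5 hold the logical block address, big-endian.
    (lba,) = struct.unpack(">I", cdb[2:6])
    # Bytes 7..8 hold the transfer length in blocks, big-endian.
    (blocks,) = struct.unpack(">H", cdb[7:9])
    return OPCODES[cdb[0]], lba, blocks


if __name__ == "__main__":
    # The Read(10) CDB from the task-abort message above.
    name, lba, blocks = decode_rw10("28 00 11 35 5b 07 00 00 08 00")
    print(f"{name}: LBA {lba:#x}, {blocks} blocks")
```

Run as-is, this prints "READ(10): LBA 0x11355b07, 8 blocks", i.e. an ordinary small read. The same helper applied to a Write(10) CDB such as "2a 00 11 51 68 0f 00 04 00 00" gives LBA 0x1151680f and 1024 blocks.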

From: bugzilla-daemon @ 2010-08-28 15:44 UTC

--- Comment #32 from starlight@binnacle.cx  2010-08-16 16:00:33 ---
Happened again with 'smartd' disabled and with the *latest*
kernel, LSI device driver and LSI IT (initiator target)
firmware.  It took 37 days of uptime to happen.
The failure was during moderate write activity rather than
the light activity seen with the 'smartd' pass-through
transactions.  Kernel messages attached.

CentOS 5.5 kernel 2.6.18-194.8.1.el5
MPT2BIOS-7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00
LSI driver mpt2sas-05.00.00.00

--- Comment #33 from starlight@binnacle.cx  2010-08-16 16:04:36 ---
Created an attachment (id=27469)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=27469)
kernel messages from failure

--- Comment #34 from starlight@binnacle.cx  2010-08-16 16:07:39 ---
Created an attachment (id=27470)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=27470)
kernel messages from corresponding boot

--- Comment #35 from starlight@binnacle.cx  2010-08-28 15:43:16 ---
Another controller failure, this time with logging_level=0x1F8.

From: bugzilla-daemon @ 2010-08-28 15:45 UTC

--- Comment #36 from starlight@binnacle.cx  2010-08-28 15:45:04 ---
Created an attachment (id=28171)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=28171)
boot-time information from 'lsiutil'

From: bugzilla-daemon @ 2010-08-28 15:46 UTC

--- Comment #37 from starlight@binnacle.cx  2010-08-28 15:45:49 ---
Created an attachment (id=28181)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=28181)
firmware events from boot and failure

From: bugzilla-daemon @ 2010-08-28 15:47 UTC

--- Comment #38 from starlight@binnacle.cx  2010-08-28 15:47:50 ---
Created an attachment (id=28191)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=28191)
boot-time messages with logging_level=0x1F8

From: bugzilla-daemon @ 2010-08-28 15:51 UTC

--- Comment #39 from starlight@binnacle.cx  2010-08-28 15:51:01 ---
Created an attachment (id=28201)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=28201)
kernel messages from failure with logging_level=0x1F8

From: bugzilla-daemon @ 2010-08-28 15:53 UTC

--- Comment #40 from starlight@binnacle.cx  2010-08-28 15:52:05 ---
The descriptions for the attachments in #38 and #39 are reversed.

From: bugzilla-daemon @ 2010-08-30 15:18 UTC

--- Comment #41 from kdesai <kashyap.desai@lsi.com>  2010-08-30 15:18:21 ---
(In reply to comment #40)
> descriptions for attachments in #38 and #39 are reversed

I have taken a close look at all the available logs for the configuration below.

kernel 5.5 2.6.18-194.8.1.el5
MPT2BIOS-7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00
LSI driver mpt2sas-05.00.00.00


Things are different in this case. It is not the same "smartd"-related issue
reported in this bugzilla.

I see some kind of hotplug action in this case (or maybe a connection issue
that created a hotplug-like situation).

1. See the snippet below from (https://bugzilla.kernel.org/attachment.cgi?id=28191)
--
Aug 27 14:23:34 X kernel: mpt2sas0: Device Status Change
Aug 27 14:23:34 X kernel:     handle(0x000f), sas address(0x4433221107000000)<6>mpt2sas0: SAS Topology Change List
Aug 27 14:23:34 X kernel: sd 0:0:7:0: device_blocked, handle(0x000f)
Aug 27 14:24:02 X kernel: mpt2sas0: attempting task abort! scmd(ffff81005a235cc0)
Aug 27 14:24:02 X kernel: sd 0:0:7:0: 
Aug 27 14:24:02 X kernel:         comma

---

The driver received the hotplug action "device delay removal" (this relates to
the LSI controller's Device Missing Delay parameter).
Check "/sys/class/scsi_host/host6/device_delay".

2. Very soon afterwards, I see some task aborts followed by a device delete event.
See the snippet below.

--
Aug 27 14:24:02 X kernel: mpt2sas0: attempting task abort! scmd(ffff81005a235cc0)
Aug 27 14:24:02 X kernel: sd 0:0:7:0:
Aug 27 14:24:02 X kernel:         command: Write(10): 2a 00 11 51 68 0f 00 04 00 00
Aug 27 14:24:02 X kernel: mpt2sas0: Device Status Change
Aug 27 14:24:02 X kernel: mpt2sas0: task abort: SUCCESS scmd(ffff81005a235cc0)
Aug 27 14:24:02 X kernel:
Aug 27 14:24:02 X kernel: mpt2sas0: updating handles for sas_host(0x5003048573212988)
Aug 27 14:24:02 X kernel:     handle(0x000f), sas address(0x4433221107000000)<6>
Aug 27 14:24:02 X kernel: mpt2sas0: Discovery: (stop)
Aug 27 14:24:02 X kernel: mpt2sas0: Discovery: (start)
Aug 27 14:24:02 X kernel: mpt2sas0: SAS Topology Change List
Aug 27 14:24:02 X kernel: mpt2sas0: tr_send:handle(0x000f), (open), smid(439), cb(7)
Aug 27 14:24:02 X kernel: mpt2sas0: Discovery: (stop)
Aug 27 14:24:02 X kernel: mpt2sas0: updating handles for sas_host(0x5003048573212988)
Aug 27 14:24:02 X kernel: mpt2sas0: tr_complete:handle(0x000f), (open) smid(439), ioc_status(0x0000), loginfo(0x00000000), completed(0)
Aug 27 14:24:02 X kernel: mpt2sas0: sc_send:handle(0x000f), (open), smid(540), cb(5)
Aug 27 14:24:02 X kernel: mpt2sas0: sc_complete:handle(0x000f), (open) smid(540), ioc_status(0x0000), loginfo(0x00000000)
Aug 27 14:24:02 X kernel: mpt2sas0: _scsih_remove_device: enter: handle(0x000f), sas_addr(0x4433221107000000)
Aug 27 14:24:02 X kernel: sd 0:0:7:0: device_unblocked, handle(0x000f)
Aug 27 14:24:02 X kernel: mpt2sas0: removing handle(0x000f), sas_addr(0x4433221107000000)
Aug 27 14:24:02 X kernel: mpt2sas0: _scsih_remove_device: exit: handle(0x000f), sas_addr(0x4433221107000000)
---

3. The driver then immediately receives a device ADD (see the snippet below).
--
Aug 27 14:24:02 X kernel: mpt2sas0: Discovery: (stop)
Aug 27 14:24:02 X kernel: mpt2sas0: REPORT_LUNS: handle(0x000f), retries(0)
Aug 27 14:24:02 X kernel: mpt2sas0:     ioc_status(0x0045), loginfo(0x00000000), rc(ready)
Aug 27 14:24:02 X kernel: mpt2sas0: TEST_UNIT_READY: handle(0x000f), lun(0)
Aug 27 14:24:02 X kernel: mpt2sas0:     ioc_status(0x0000), loginfo(0x00000000), rc(retry_ua)
Aug 27 14:24:02 X kernel: mpt2sas0:     [sense_key,asc,ascq]: [0x06,0x29,0x00]
Aug 27 14:24:02 X kernel: mpt2sas0: TEST_UNIT_READY: handle(0x000f), lun(0)
Aug 27 14:24:02 X kernel: mpt2sas0: attempting task abort! scmd(ffff81005a235cc0)
Aug 27 14:24:02 X kernel: scsi 0:0:7:0: 
Aug 27 14:24:02 X kernel:         command: Test Unit Ready: 00 00 00 00 00 00
Aug 27 14:24:02 X kernel: mpt2sas0: device been deleted! scmd(ffff81005a235cc0)
--

4. Finally, an HBA reset is executed, which removes device "scsi 0:0:7:0".
This means the device is not actually present in the firmware's device table
(this can be confirmed with lsiutil options 8 and 16).
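The rc(retry_ua) result in step 3 comes with the sense triple [0x06,0x29,0x00]. In SPC terms, sense key 0x6 is UNIT ATTENTION and ASC/ASCQ 0x29/0x00 is "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED", which fits a device that has just been reset and re-added. The sketch below is a hypothetical lookup helper for the bracketed format mpt2sas prints; its tables cover only a few codes and are not a complete SPC decoder.

```python
# Sense keys defined by SPC (partial list).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

# A handful of ASC/ASCQ pairs; a real decoder needs the full SPC table.
ASC_ASCQ = {
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
    (0x29, 0x01): "POWER ON OCCURRED",
    (0x3F, 0x0E): "REPORTED LUNS DATA HAS CHANGED",
}


def decode_sense_triple(text: str) -> str:
    """Decode a "[sense_key,asc,ascq]" string such as "[0x06,0x29,0x00]"."""
    key, asc, ascq = (int(field, 16) for field in text.strip("[] ").split(","))
    key_name = SENSE_KEYS.get(key, f"sense key {key:#04x}")
    detail = ASC_ASCQ.get((asc, ascq), f"ASC/ASCQ {asc:#04x}/{ascq:#04x}")
    return f"{key_name}: {detail}"


if __name__ == "__main__":
    print(decode_sense_triple("[0x06,0x29,0x00]"))
```

Codes missing from the tables fall back to their raw hex values instead of raising, so the helper can be run over a whole log without special-casing.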

In summary, this may be a completely different issue. Can we move it to a new
bugzilla entry, so that I can take a fresh look at it?

Thanks, Kashyap

From: bugzilla-daemon @ 2010-08-30 16:42 UTC

--- Comment #42 from starlight@binnacle.cx  2010-08-30 16:42:08 ---
Kashyap,

Thank you for looking at this problem in depth.

Since it is different, I certainly can create a new bugzilla for 
it.  I'll do that in the next day or so.

Do you have any ideas about what might be the cause?

One thing that crosses my mind is that the drives here are not 
enterprise Seagate Constellation drives, but are Seagate 
Momentus drives that have more aggressive power saving features 
intended for laptops.  We chose them because at the time they 
were much less expensive and the sequential read/write 
performance was the same.  Now the price differential is much 
smaller and we would have gone with Constellations.

Is there any chance that the Momentus drives require the 10 
second command time-out in the LSI BIOS config to be extended? 
This is just a random idea.  The drives were all active at the 
time of the event and so would not have been in power saving 
mode or been responding slowly to commands.

Another theory I have is that there might be a memory leak in 
the firmware and that when all free memory is exhausted, the 
controller "goes insane".  Is such a memory leak something that 
would be apparent in the tracing?

Finally, I should mention that I believe the configuration is
unusual.  A large, identical partition on the eight drives is 
configured as a software RAID0 volume.  I doubt that many people 
configure systems this way.  It might be stressing the
firmware/software in a unique fashion.

Thanks,

David





From: bugzilla-daemon @ 2010-08-31  8:17 UTC

--- Comment #43 from starlight@binnacle.cx  2010-08-31 08:17:07 ---
The recent activity I reported above has been filed as new bug 17551, as it is
apparently not related to the 'smartd' failure.

From: bugzilla-daemon @ 2010-09-28 19:07 UTC

Benjamin ESTRABAUD <be@mpstor.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |be@mpstor.com




--- Comment #44 from Benjamin ESTRABAUD <be@mpstor.com>  2010-09-28 19:07:00 ---
Is the fix suggested in previous comment #15
(http://lkml.org/lkml/2010/4/26/335) not the same as a proposed commit from
Yuri Tikonov in September 2009?

reference: (https://kerneltrap.org/mailarchive/linux-scsi/2009/9/1/6371653)

Is the one suggested in mptscsih.c more 'global' than the one suggested in
mptsas.c by Yuri? Or are they complementary/redundant/different?

From: bugzilla-daemon @ 2012-06-18 13:20 UTC

Alan <alan@lxorguk.ukuu.org.uk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |alan@lxorguk.ukuu.org.uk
         Resolution|                            |OBSOLETE




From: bugzilla-daemon @ 2012-06-18 13:20 UTC

Alan <alan@lxorguk.ukuu.org.uk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED




From: bugzilla-daemon @ 2010-01-13 12:30 UTC

--- Comment #10 from Tim Small <tim@seoss.co.uk>  2010-01-13 12:30:47 ---
Hi Aaron,

It's possible that this is an unrelated issue.  On one of the systems with the
MPT SAS controllers, I have moved a drive from an MPT SAS channel onto an Intel
631xESB/632xESB SATA channel, and the unreliable behaviour appeared to stop.

Do you have any other drives which you can test in place of the WD drives? 
Personally, I have found Hitachi SATA drives to be well engineered in recent
years from a SMART point of view.

If you'd like to open another bug, the script included in this bug might help
you reproduce the problems.  You could also try disabling NCQ and/or using a
different SATA controller (Silicon Image SiI 3132 based PCIe cards are
available very inexpensively) to see if this helps.
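On libata-managed SATA disks (such as those on the ICH10 AHCI controller above), NCQ is commonly turned off at runtime by setting the device's SCSI queue depth to 1 through sysfs, at /sys/block/<dev>/device/queue_depth. The following is a minimal sketch under that assumption; the sysfs_root parameter is a hypothetical addition so it can be tried against a scratch directory, and writing the real attribute needs root. Behaviour behind an MPT SAS HBA, where the HBA firmware rather than libata drives the SATA link, may differ.

```python
from pathlib import Path


def queue_depth_path(dev: str, sysfs_root: str = "/sys") -> Path:
    """Return the sysfs queue_depth attribute for a block device such as "sdb"."""
    return Path(sysfs_root) / "block" / dev / "device" / "queue_depth"


def set_queue_depth(dev: str, depth: int, sysfs_root: str = "/sys") -> int:
    """Write a new queue depth and return the value read back afterwards.

    On libata devices a depth of 1 turns NCQ off.
    """
    path = queue_depth_path(dev, sysfs_root)
    path.write_text(f"{depth}\n")
    return int(path.read_text().strip())


if __name__ == "__main__":
    import sys

    if len(sys.argv) != 2:
        sys.exit("usage: disable_ncq.py <block device, e.g. sdb>")
    print(f"queue_depth is now {set_queue_depth(sys.argv[1], 1)}")
```

Reading the value back after writing it shows whether the kernel accepted the new depth, since a driver may clamp or ignore it.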

Thanks,

Tim.

From: bugzilla-daemon @ 2010-01-12 23:15 UTC

Aaron Williams <aaron.w2@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aaron.w2@gmail.com




--- Comment #9 from Aaron Williams <aaron.w2@gmail.com>  2010-01-12 23:15:37 ---
I am seeing similar events with my current computer and with my last computer.
My setup consists of two WD Black Edition 1TB drives:

Model=WDC WD1001FALS-00J7B0, FwRev=05.00K05, SerialNo=WD-WMATV0705568
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
 BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1,2,3,4,5,6,7

00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller

I also have smartmond running to periodically query the drive. In my case I
have two drives running in a mirrored configuration and this will arbitrarily
kick one of the drives out of my RAID array (using md). I just spent the last
few days recovering from a RAID event that killed the entire array (due to this
problem I believe).

From: bugzilla-daemon @ 2010-01-11 11:59 UTC

--- Comment #8 from amf <andy.fletcher@ukdedicated.com>  2010-01-11 11:59:39 ---
We see this on numerous Dell hosts with the SAS6iR (based on the LSISAS1068E
chip), running the stock RHEL5 driver.

It's simple to reproduce with SMART commands, but we actually see huge issues
with drives dropping off cards even during heavy I/O and no SMART commands
involved at all. It seems to be all the worse when the disk reallocates a bad
sector. The disks are Dell-supplied and therefore 'enterprise' models capable
of TLER type behaviour.

I believe the SMART command method makes it easier to reproduce what may be a
problem not specifically related to SMART, but that's just my own feeling.

HTH.

From: bugzilla-daemon @ 2009-12-21 12:11 UTC

--- Comment #7 from Tim Small <tim@seoss.co.uk>  2009-12-21 12:11:31 ---
Created an attachment (id=24243)
 --> (http://bugzilla.kernel.org/attachment.cgi?id=24243)
Script to stress-test ATA command passthrough whilst write-loading a SATA
device.

This script uses dd to repeatedly write and remove a 1G zero-filled file
to/from a file-system whilst executing smartctl against the associated device
file.
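Attachment 24243 itself is not reproduced here. The following Python sketch only mirrors the description above: a background thread repeatedly writes and deletes a 1 GiB zero-filled file with dd, while the main loop runs smartctl -a (ATA pass-through, for a SATA disk behind a SAS HBA) against the same device. The mount point, device and run count are supplied on the command line, and the file name and the conv=fsync flag are our own assumptions. Run it only on a disposable test system, since the aim is to provoke controller resets.

```python
import os
import subprocess
import sys
import threading


def dd_command(path: str, size_mib: int = 1024) -> list[str]:
    """dd invocation writing a zero-filled file of size_mib MiB (1 GiB by default)."""
    # conv=fsync (a GNU dd option) is an assumption, so the data actually reaches the disk.
    return ["dd", "if=/dev/zero", f"of={path}", "bs=1M", f"count={size_mib}", "conv=fsync"]


def smartctl_command(device: str) -> list[str]:
    """smartctl invocation that reads all SMART information from the device."""
    return ["smartctl", "-a", device]


def write_loop(path: str, stop: threading.Event) -> None:
    """Write and delete the test file repeatedly until stop is set."""
    while not stop.is_set():
        subprocess.run(dd_command(path), check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        os.remove(path)


def main() -> None:
    if len(sys.argv) != 4:
        sys.exit("usage: stress.py <mountpoint> <device> <smartctl-runs>")
    mountpoint, device, runs = sys.argv[1], sys.argv[2], int(sys.argv[3])
    path = os.path.join(mountpoint, "stress-test.zero")  # hypothetical file name
    stop = threading.Event()
    writer = threading.Thread(target=write_loop, args=(path, stop), daemon=True)
    writer.start()
    try:
        for i in range(runs):
            # smartctl's exit status is a bitmask; non-zero is logged, not treated as fatal.
            rc = subprocess.run(smartctl_command(device),
                                stdout=subprocess.DEVNULL).returncode
            print(f"smartctl run {i}: exit status {rc}")
    finally:
        stop.set()
        writer.join()


if __name__ == "__main__":
    main()
```

Keeping the writer in a separate thread means the pass-through commands are issued while the write load is in flight, which is the combination reported above to trigger failures more quickly.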

From: bugzilla-daemon @ 2009-12-21 12:08 UTC

--- Comment #6 from Tim Small <tim@seoss.co.uk>  2009-12-21 12:08:50 ---
Hi Kashyap,

Thanks for your input.  Unfortunately, I can't test on the 1068, as the machine
is now in production (with SMART disabled!).

I do still have access to the 1068E and the 1064, and I will see if I can
borrow another 1068.

Could you try the attached script on your test system?  It carries out I/O to
the device which is under test, and seems to trigger failures much more quickly
as a result.

Thanks,

Tim.

From: bugzilla-daemon @ 2009-12-21  4:52 UTC

--- Comment #5 from kdesai <kashyap.desai@lsi.com>  2009-12-21 04:52:56 ---
Created an attachment (id=24240)
 --> (http://bugzilla.kernel.org/attachment.cgi?id=24240)
latest upstream fusion driver 3.4.14


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
  2009-12-18 11:25 [Bug 14831] New: " bugzilla-daemon
                   ` (3 preceding siblings ...)
  2009-12-18 15:32 ` bugzilla-daemon
@ 2009-12-21  4:51 ` bugzilla-daemon
  2009-12-21  4:52 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: bugzilla-daemon @ 2009-12-21  4:51 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #4 from kdesai <kashyap.desai@lsi.com>  2009-12-21 04:51:47 ---
I tried the same test with the setup detailed below, and it works fine for me.
Can you try upgrading the FW of your 1068 B0 card (mentioned in comment #2)?

I used a 1068 B0 card with Western Digital SATA drives (REV: 1E01).
FW version is 1.29.00.00-IE.
Card name is SAS3442X.

MPT driver version is 3.4.14 (the latest upstream driver); see the attachment
fusion_03_04_14.tgz for quick access to the fusion driver. You may need to
change some code to make it compile against your kernel.


Please give it a try and let me know your results.

The 1.29.00 FW is available at
http://lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/combo/sas3442x-r/index.html

Thanks,
Kashyap


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
  2009-12-18 11:25 [Bug 14831] New: " bugzilla-daemon
                   ` (2 preceding siblings ...)
  2009-12-18 15:31 ` bugzilla-daemon
@ 2009-12-18 15:32 ` bugzilla-daemon
  2009-12-21  4:51 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: bugzilla-daemon @ 2009-12-18 15:32 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=14831


Tim Small <tim@seoss.co.uk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|2.6.26 - 2.6.31             |2.6.26 -
                   |                            |2.6.32rc4-scsi-misc





* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
  2009-12-18 11:25 [Bug 14831] New: " bugzilla-daemon
  2009-12-18 12:44 ` [Bug 14831] " bugzilla-daemon
  2009-12-18 15:18 ` bugzilla-daemon
@ 2009-12-18 15:31 ` bugzilla-daemon
  2009-12-18 15:32 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: bugzilla-daemon @ 2009-12-18 15:31 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #3 from Tim Small <tim@seoss.co.uk>  2009-12-18 15:31:55 ---
Hmm, it failed as soon as I submitted the last comment (Intel S5000PSL on-board
controller)...

filename:       /lib/modules/2.6.26-2-amd64/kernel/drivers/message/fusion/mptsas.ko
version:        3.04.06

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev
 1.  /proc/mpt/ioc0    LSI Logic SAS1064E 04     105      01190100


Current active firmware version is 1.25.01
Firmware image's version is MPTFW-01.25.01.00-IT
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.22.00.00 (2008.04.10)


[    2.369101] ioc0: LSISAS1064E B2: Capabilities={Initiator}
[    2.371062] mptbase: ioc0: PCI-MSI enabled
[    2.371612] PCI: Setting latency timer of device 0000:04:00.0 to 64
[   18.377426] scsi0 : ioc0: LSISAS1064E B2, FwRev=01190100h, Ports=1, MaxQ=478, IRQ=1269
[   19.249357] scsi 0:0:0:0: Direct-Access     ATA      WDC WD3200BJKT-0 1A11 PQ: 0 ANSI: 5
[  881.982165] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108)
[  882.086359] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108)
[ 1514.521445] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108)
[ 1514.525947] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108)
[ 2051.568333] mptscsih: ioc0: attempting task abort! (sc=ffff8101190f6940)
[ 2051.568446] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
[ 2056.593202] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[ 2056.594064] mptsas: ioc0: removing sata device, channel 0, id 0, phy 0
[ 2056.594182]  port-0:0: mptsas: ioc0: delete port (0)
[ 2056.617086] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 2056.797030] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8101190f6940)
[ 2056.797166] mptscsih: ioc0: attempting task abort! (sc=ffff810124da15c0)
[ 2056.798195] sd 0:0:0:0: [sda] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[ 2056.799567] mptscsih: ioc0: task abort: SUCCESS (sc=ffff810124da15c0)
[ 2056.799697] mptscsih: ioc0: attempting target reset! (sc=ffff8101190f6940)
[ 2056.799821] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
[ 2057.137289] mptscsih: ioc0: target reset: SUCCESS (sc=ffff8101190f6940)
[ 2057.140469] mptscsih: ioc0: attempting bus reset! (sc=ffff8101190f6940)
[ 2057.140585] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
[ 2057.398218] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff8101190f6940)
[ 2068.601852] mptscsih: ioc0: attempting host reset! (sc=ffff8101190f6940)
[ 2068.606217] mptbase: ioc0: Initiating recovery
[ 2082.235819] mptscsih: ioc0: host reset: SUCCESS (sc=ffff8101190f6940)
[ 2082.235932] sd 0:0:0:0: Device offlined - not ready after error recovery
[ 2082.236054] sd 0:0:0:0: Device offlined - not ready after error recovery
[ 2082.236197] end_request: I/O error, dev sda, sector 19534911


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
  2009-12-18 11:25 [Bug 14831] New: " bugzilla-daemon
  2009-12-18 12:44 ` [Bug 14831] " bugzilla-daemon
@ 2009-12-18 15:18 ` bugzilla-daemon
  2009-12-18 15:31 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: bugzilla-daemon @ 2009-12-18 15:18 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #2 from Tim Small <tim@seoss.co.uk>  2009-12-18 15:18:37 ---
For the 1068 (Dell PE860):

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev
 1.  /proc/mpt/ioc0    LSI Logic SAS1068 B0      105      000a3300

Current active firmware version is 0.10.51
Firmware image's version is MPTFW-00.10.51.00-IE
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.12.05.00 (2007.09.29)


For the 1068E (Dell PE1950):
     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev
 1.  /proc/mpt/ioc0    LSI Logic SAS1068E 08     105      00192f00

Current active firmware version is 0.25.47
Firmware image's version is MPTFW-00.25.47.00-IE
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.22.03.00 (2008.08.06)
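
The "Firmware Rev" column in /proc/mpt appears to hold the same version as the dotted firmware string, packed one byte per field. For example, 000a3300 corresponds to 0.10.51 and 00192f00 to 0.25.47, and comment #3's 01190100 corresponds to 1.25.01. The minimal sketch below assumes that byte-per-field, high-byte-first packing, which is inferred only from these listings and not from LSI documentation. The helper name fw_rev_to_version is hypothetical.

```python
def fw_rev_to_version(fw_rev: str) -> str:
    """Convert a /proc/mpt 'Firmware Rev' word such as '000a3300' to dotted form.

    Assumes one byte per version field, most significant byte first
    (inferred from the listings in this thread, not from LSI documentation).
    """
    value = int(fw_rev, 16)
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError(f"not a 32-bit firmware word: {fw_rev!r}")
    # Split the 32-bit word into four bytes, high byte first.
    fields = value.to_bytes(4, "big")
    # Zero-pad each decimal field to two digits, as in 'MPTFW-00.10.51.00'.
    return ".".join(f"{b:02d}" for b in fields)


if __name__ == "__main__":
    for rev in ("000a3300", "00192f00", "01190100"):
        print(rev, "->", fw_rev_to_version(rev))
```

Under this assumption the three words decode to 00.10.51.00, 00.25.47.00 and 01.25.01.00, which match the MPTFW image strings reported in the thread. That makes it easy to compare a running controller against the 1.29.00 firmware suggested by LSI.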


I've just started a test on a SAS1064 (Intel S5000PSL).  I'll leave it on test
and report back here...

Thanks,

Tim.


* [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
  2009-12-18 11:25 [Bug 14831] New: " bugzilla-daemon
@ 2009-12-18 12:44 ` bugzilla-daemon
  2009-12-18 15:18 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: bugzilla-daemon @ 2009-12-18 12:44 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=14831





--- Comment #1 from kdesai <kashyap.desai@lsi.com>  2009-12-18 12:44:28 ---

Can you please provide the firmware version information?


end of thread

Thread overview: 44+ messages
     [not found] <bug-14831-11613@https.bugzilla.kernel.org/>
2010-03-21 11:34 ` [Bug 14831] mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets bugzilla-daemon
2010-03-30 14:53 ` bugzilla-daemon
2010-04-29  7:25 ` bugzilla-daemon
2010-04-29  9:27 ` bugzilla-daemon
2010-05-03 22:22 ` bugzilla-daemon
2010-05-04  5:16 ` bugzilla-daemon
2010-05-04  9:16 ` bugzilla-daemon
2010-05-05 10:35 ` bugzilla-daemon
2010-05-07  4:56 ` bugzilla-daemon
2010-05-07  8:01 ` bugzilla-daemon
2010-05-12  8:50 ` bugzilla-daemon
2010-05-12  9:04 ` bugzilla-daemon
2010-05-12  9:33 ` bugzilla-daemon
2010-05-12 10:22 ` bugzilla-daemon
2010-05-24  9:05 ` bugzilla-daemon
2010-05-26  8:08 ` bugzilla-daemon
2010-06-07 20:00 ` bugzilla-daemon
2010-06-08  0:29 ` bugzilla-daemon
2010-06-08  6:44 ` bugzilla-daemon
2010-06-08  8:43 ` bugzilla-daemon
2010-06-29 20:26 ` bugzilla-daemon
2010-08-28 15:44 ` bugzilla-daemon
2010-08-28 15:45 ` bugzilla-daemon
2010-08-28 15:46 ` bugzilla-daemon
2010-08-28 15:47 ` bugzilla-daemon
2010-08-28 15:51 ` bugzilla-daemon
2010-08-28 15:53 ` bugzilla-daemon
2010-08-30 15:18 ` bugzilla-daemon
2010-08-30 16:42 ` bugzilla-daemon
2010-08-31  8:17 ` bugzilla-daemon
2010-09-28 19:07 ` bugzilla-daemon
2012-06-18 13:20 ` bugzilla-daemon
2012-06-18 13:20 ` bugzilla-daemon
2009-12-18 11:25 [Bug 14831] New: " bugzilla-daemon
2009-12-18 12:44 ` [Bug 14831] " bugzilla-daemon
2009-12-18 15:18 ` bugzilla-daemon
2009-12-18 15:31 ` bugzilla-daemon
2009-12-18 15:32 ` bugzilla-daemon
2009-12-21  4:51 ` bugzilla-daemon
2009-12-21  4:52 ` bugzilla-daemon
2009-12-21 12:08 ` bugzilla-daemon
2009-12-21 12:11 ` bugzilla-daemon
2010-01-11 11:59 ` bugzilla-daemon
2010-01-12 23:15 ` bugzilla-daemon
2010-01-13 12:30 ` bugzilla-daemon
