* QLA2200 causes kernel bug
From: Thomas Georgiou @ 2009-08-06 15:28 UTC
To: linux-scsi
Whenever I have the qla2xxx module loaded, some kernel problem
eventually occurs. Sometimes it's an oops, a bug, or once, a crash.
The same thing has been seen to happen on 3 different machines all
with qla2200 cards in them.
Here is the latest backtrace:
[42151.610011] kernel BUG at drivers/scsi/scsi_transport_fc.c:3022!
[42151.610011] invalid opcode: 0000 [#1] SMP
[42151.610011] last sysfs file:
/sys/devices/pci0000:00/0000:00:06.0/0000:05:00.2/0000:0a:01.0/host1/rport-1:0-25/target1:0:25/fc_transport/target1:0:25/port_name
[42151.610011] CPU 3
[42151.610011] Modules linked in: iscsi_scst scst_disk scst_vdisk scst
qla2xxx
[42151.610011] Pid: 4846, comm: fc_dl_1 Not tainted 2.6.30.4dl380 #3
ProLiant DL380 G4
[42151.610011] RIP: 0010:[<ffffffff812e3c1a>] [<ffffffff812e3c1a>]
fc_timeout_deleted_rport+0x250/0x2df
[42151.610011] RSP: 0018:ffff8801965c7e70 EFLAGS: 00010202
[42151.610011] RAX: ffff8801967cb648 RBX: ffff88019a837ba0 RCX:
ffff8801965c7de0
[42151.610011] RDX: 0000000000000003 RSI: ffff8801968b89c0 RDI:
ffff8801965c7e60
[42152.490635] RBP: ffff8801965c7eb0 R08: ffffffff8185dd50 R09:
ffff8801965c7d70
[42152.490635] R10: ffff8801965c7da0 R11: ffff88019530ea80 R12:
ffff880196dd3400
[42152.490635] R13: ffff880196dd3598 R14: ffff880196cb4000 R15:
ffff88019a890800
[42152.490635] FS: 0000000000000000(0000) GS:ffff88002807f000(0000)
knlGS:0000000000000000
[42152.490635] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[42152.490635] CR2: 0000000001c6b628 CR3: 00000001974af000 CR4:
00000000000006e0
[42152.490635] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[42152.490635] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[42152.490635] Process fc_dl_1 (pid: 4846, threadinfo
ffff8801965c6000, task ffff8801967cb2c0)
[42152.490635] Stack:
[42152.490635] 0000000000000000 0000000000000202 ffff8801975d2000
ffffc20000082b00
[42152.490635] ffff880196dd3598 ffffc20000082b00 ffff8801967cb2c0
ffffffff812e39ca
[42152.490635] ffff8801965c7f20 ffffffff8104708c 0000000000000000
ffff8801967cb2c0
[42152.490635] Call Trace:
[42152.490635] [<ffffffff812e39ca>] ?
fc_timeout_deleted_rport+0x0/0x2df
[42152.490635] [<ffffffff8104708c>] worker_thread+0x113/0x1ac
[42152.490635] [<ffffffff81049f58>] ?
autoremove_wake_function+0x0/0x38
[42152.490635] [<ffffffff81046f79>] ? worker_thread+0x0/0x1ac
[42152.490635] [<ffffffff81046f79>] ? worker_thread+0x0/0x1ac
[42152.490635] [<ffffffff81049e05>] kthread+0x56/0x85
[42152.490635] [<ffffffff8100caba>] child_rip+0xa/0x20
[42152.490635] [<ffffffff81049daf>] ? kthread+0x0/0x85
[42152.490635] [<ffffffff8100cab0>] ? child_rip+0x0/0x20
[42152.490635] Code: e0 fb 83 c8 08 41 88 85 b0 fe ff ff 49 8b 7e 58
48 8b 75 c8 e8 e4 86 22 00 4c 89 e7 e8 7c d1 ff ff 41 83 bd 90 fe ff
ff 01 74 04 <0f> 0b eb fe 41 8b 87 d0 02 00 00 83 f8 02 74 16 83 f8 03
74 29
[42152.490635] RIP [<ffffffff812e3c1a>]
fc_timeout_deleted_rport+0x250/0x2df
[42152.490635] RSP <ffff8801965c7e70>
and
[43410.037930] qla2xxx 0000:0a:01.0: LIP reset occurred (f7ca).
[43414.941958] qla2xxx 0000:0a:01.0: LOOP DOWN detected (b88f 0 9581).
[43418.341988] qla2xxx 0000:0a:01.0: LIP occurred (f7b1).
[43418.403690] qla2xxx 0000:0a:01.0: LOOP UP detected (1 Gbps).
[43423.541948] qla2xxx 0000:0a:01.0: LIP reset occurred (f5b5).
[43424.584083] qla2xxx 0000:06:02.0: LIP reset occurred (f7f7).
[43425.650042] rport-0:0-0: blocked FC remote port time out: removing
target and saving binding
[43425.672504] qla2xxx 0000:06:02.0: LIP occurred (f7f7).
[43425.815119] ------------[ cut here ]------------
[43425.825014] kernel BUG at drivers/scsi/scsi_transport_fc.c:3022!
[43425.825014] invalid opcode: 0000 [#2] SMP
[43425.825014] last sysfs file:
/sys/devices/pci0000:00/0000:00:06.0/0000:05:00.2/0000:0a:01.0/host1/rport-1:0-22/target1:0:22/1:0:22:0/block/sdco/uevent
[43425.825014] CPU 1
[43425.825014] Modules linked in: iscsi_scst scst_disk scst_vdisk scst
qla2xxx
[43425.825014] Pid: 4211, comm: fc_dl_0 Tainted: G D
2.6.30.4dl380 #3 ProLiant DL380 G4
[43425.825014] RIP: 0010:[<ffffffff812e3c1a>] [<ffffffff812e3c1a>]
fc_timeout_deleted_rport+0x250/0x2df
[43425.825014] RSP: 0018:ffff88019840be70 EFLAGS: 00010202
[43425.825014] RAX: ffff88019840be60 RBX: ffff880197db9920 RCX:
ffff88019840bde0
[43425.825014] RDX: 0000000000000001 RSI: ffff880196cb9000 RDI:
ffffffffffffff10
[43425.825014] RBP: ffff88019840beb0 R08: ffffffff8185dd50 R09:
ffff88019840bd70
[43425.825014] R10: ffff88019840bda0 R11: ffff88008a7c61a0 R12:
ffff88019a885c00
[43425.825014] R13: ffff88019a885d98 R14: ffff880196cb0000 R15:
ffff88019a894400
[43425.825014] FS: 0000000000000000(0000) GS:ffff88002804d000(0000)
knlGS:0000000000000000
[43425.825014] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[43425.825014] CR2: 00007fe26b7a8098 CR3: 0000000198589000 CR4:
00000000000006e0
[43425.825014] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[43425.825014] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[43425.825014] Process fc_dl_0 (pid: 4211, threadinfo
ffff88019840a000, task ffff880198458740)
[43425.825014] Stack:
[43425.825014] 0000000100000000 0000000000000202 ffff8801975d2000
ffffc20000081180
[43425.825014] ffff88019a885d98 ffffc20000081180 ffff880198458740
ffffffff812e39ca
[43425.825014] ffff88019840bf20 ffffffff8104708c 0000000000000000
ffff880198458740
[43425.825014] Call Trace:
[43425.825014] [<ffffffff812e39ca>] ?
fc_timeout_deleted_rport+0x0/0x2df
[43425.825014] [<ffffffff8104708c>] worker_thread+0x113/0x1ac
[43425.825014] [<ffffffff81049f58>] ?
autoremove_wake_function+0x0/0x38
[43425.825014] [<ffffffff81046f79>] ? worker_thread+0x0/0x1ac
[43425.825014] [<ffffffff81046f79>] ? worker_thread+0x0/0x1ac
[43425.825014] [<ffffffff81049e05>] kthread+0x56/0x85
[43425.825014] [<ffffffff8100caba>] child_rip+0xa/0x20
[43425.825014] [<ffffffff81049daf>] ? kthread+0x0/0x85
[43425.825014] [<ffffffff8100cab0>] ? child_rip+0x0/0x20
[43425.825014] Code: e0 fb 83 c8 08 41 88 85 b0 fe ff ff 49 8b 7e 58
48 8b 75 c8 e8 e4 86 22 00 4c 89 e7 e8 7c d1 ff ff 41 83 bd 90 fe ff
ff 01 74 04 <0f> 0b eb fe 41 8b 87 d0 02 00 00 83 f8 02 74 16 83 f8 03
74 29
[43425.825014] RIP [<ffffffff812e3c1a>]
fc_timeout_deleted_rport+0x250/0x2df
[43425.825014] RSP <ffff88019840be70>
[43425.871192] ---[ end trace dc9543ab95173f0f ]---
The bug also occurs without scst compiled in.
I have filed a bug collecting other backtraces here:
http://bugzilla.kernel.org/show_bug.cgi?id=13873
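When several backtraces have been collected, it can help to check whether they all hit the same assertion. The following is only a rough Python sketch of that idea: it scans log text for "kernel BUG at file:line!" markers like the ones above and counts each location. The function name `count_bug_sites` is made up for illustration and is not part of any kernel tooling.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel BUG at drivers/scsi/scsi_transport_fc.c:3022!
BUG_RE = re.compile(r"kernel BUG at (?P<file>\S+):(?P<line>\d+)!")


def count_bug_sites(log_text):
    """Count 'kernel BUG at <file>:<line>!' markers by source location."""
    return Counter(
        (m.group("file"), int(m.group("line")))
        for m in BUG_RE.finditer(log_text)
    )
```

Calling `count_bug_sites(open("dmesg.log").read())` (with a hypothetical file name) returns a mapping from (file, line) to the number of times that BUG fired.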
Any ideas as to what might be causing this?
* Re: QLA2200 causes kernel bug
From: Andrew Vasquez @ 2009-08-06 16:49 UTC
To: Thomas Georgiou; +Cc: linux-scsi
On Thu, 06 Aug 2009, Thomas Georgiou wrote:
> Whenever I have the qla2xxx module loaded, some kernel problem
> eventually occurs. Sometimes it's an oops, a bug, or once, a crash.
> The same thing has been seen to happen on 3 different machines all
> with qla2200 cards in them.
>
> Here is the latest backtrace:
> [42151.610011] kernel BUG at drivers/scsi/scsi_transport_fc.c:3022!
> [42151.610011] invalid opcode: 0000 [#1] SMP
This looks similar to:
http://thread.gmane.org/gmane.linux.scsi/49853/focus=50297
with a proposed (though not-liked) solution here:
http://article.gmane.org/gmane.linux.scsi/50297
Could someone refresh my memory: why was there an issue with re-adding
rports after dev-loss-tmo triggered?
Thomas, could you forward the full messages file? I'm interested in
seeing what series of events led up to the BUG_ON(). The snippets
here and in the bugzilla only document the failing point BUG_ON().
Also, could you get a test run with the driver error-logging enabled
as well:
$ modprobe -v qla2xxx ql2xextended_error_logging=1
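After reloading the module, the setting can be read back to confirm it took effect. This is a minimal Python sketch, assuming the kernel exports the parameter under /sys/module/qla2xxx/parameters/ (worth checking on the running kernel). The helper name `read_module_param` and its `sysfs_root` argument are illustrative only.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys"):
    """Return a module parameter's current value as a string, or None.

    None means the module is not loaded or the parameter is not
    exported through sysfs.
    """
    path = Path(sysfs_root) / "module" / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

For example, `read_module_param("qla2xxx", "ql2xextended_error_logging")` should return "1" if the option above was applied.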
-- av
* Re: QLA2200 causes kernel bug
From: Thomas Georgiou @ 2009-08-06 17:12 UTC
To: Andrew Vasquez; +Cc: linux-scsi
Here is the dmesg log: http://www.tjhsst.edu/~2010tgeorgio/fclogs/dmesg3.log
Here is the complete syslog file, with more junk in it:
http://www.tjhsst.edu/~2010tgeorgio/fclogs/messages3.log
I will reboot the server and enable error logging.
Could these issues be caused by enclosures having conflicting SCSI IDs?
On Thu, Aug 6, 2009 at 12:49 PM, Andrew
Vasquez<andrew.vasquez@qlogic.com> wrote:
> Thomas, could you forward the full messages file? I'm interested in
> seeing what series of events led up to the BUG_ON(). The snippets
> here and in the bugzilla only document the failing point BUG_ON().
> Also, could you get a test run with the driver error-logging enabled
> as well:
>
> $ modprobe -v qla2xxx ql2xextended_error_logging=1
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: QLA2200 causes kernel bug
From: Thomas Georgiou @ 2009-08-07 3:40 UTC
To: Andrew Vasquez; +Cc: linux-scsi
Here is a new log with error logging enabled:
http://www.tjhsst.edu/~2010tgeorgio/fclogs/dmesg4.log
The enclosures do not have conflicting scsi ids.
On Thu, Aug 6, 2009 at 1:12 PM, Thomas Georgiou<tageorgiou@gmail.com> wrote:
> Here is the dmesg log: http://www.tjhsst.edu/~2010tgeorgio/fclogs/dmesg3.log
> Here is the complete syslog file, with more junk in it:
> http://www.tjhsst.edu/~2010tgeorgio/fclogs/messages3.log
> I will reboot the server and enable error logging.
>
> Could these issues be caused by enclosures having conflicting SCSI IDs?
>
> On Thu, Aug 6, 2009 at 12:49 PM, Andrew
> Vasquez<andrew.vasquez@qlogic.com> wrote:
>> Thomas, could you forward the full messages file? I'm interested in
>> seeing what series of events led up to the BUG_ON(). The snippets
>> here and in the bugzilla only document the failing point BUG_ON().
>> Also, could you get a test run with the driver error-logging enabled
>> as well:
>>
>> $ modprobe -v qla2xxx ql2xextended_error_logging=1
>
* Re: QLA2200 causes kernel bug
From: Andrew Vasquez @ 2009-08-07 7:01 UTC
To: Thomas Georgiou; +Cc: linux-scsi
On Thu, 06 Aug 2009, Thomas Georgiou wrote:
> Here is a new log with error logging enabled:
> http://www.tjhsst.edu/~2010tgeorgio/fclogs/dmesg4.log
>
> The enclosures do not have conflicting scsi ids.
>
Could you let me know what's going on around timestamp 1840:
[ 883.659657] bonnie++ used greatest stack depth: 3472 bytes left
[ 1840.109135] scsi(1): Asynchronous LIP RESET (f8f7).
[ 1840.109141] qla2xxx 0000:0a:01.0: LIP reset occurred (f8f7).
[ 1840.109477] scsi(1): qla2x00_reset_marker()
[ 1840.150025] scsi(1): fcport-0 - port retry count: 0 remaining
[ 1840.150030] scsi(1): fcport-1 - port retry count: 0 remaining
[ 1840.150034] scsi(1): fcport-2 - port retry count: 0 remaining
[ 1840.150037] scsi(1): fcport-3 - port retry count: 0 remaining
...
[ 1840.150113] scsi(1): fcport-26 - port retry count: 0 remaining
[ 1840.150116] scsi(1): fcport-27 - port retry count: 0 remaining
[ 1840.203564] scsi(1): LIP occurred (f7f7).
[ 1840.203569] qla2xxx 0000:0a:01.0: LIP occurred (f7f7).
[ 1840.234308] scsi(1): Asynchronous PORT UPDATE.
[ 1840.234313] scsi(1): Port database changed b88f 0000 003f.
[ 1841.100044] rport-1:0-0: blocked FC remote port time out: removing target and saving binding
[ 1841.100101] qla2x00_mailbox_command(1): **** FAILED. mbx0=4006, mbx1=12, mbx2=ffff, cmd=71 ****
[ 1841.100107] qla2x00_fabric_logout(1): failed=102 mbx1=12.
There seems to be a great deal of fabric disruption occurring on the
fibre, and it seems to be occurring on both 22xx ports. It also
appears that you've set port-down-timeout (dev-loss-tmo) to 0 seconds
(for faster failovers)?
Could you describe the topology? Would it be possible to isolate the
faults? I take it the constant stream of RESETs is not expected?
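To put a number on how often the loop is being disrupted on each port, the driver messages can be tallied per adapter. The following is only a sketch in Python, assuming the log lines follow the "qla2xxx <pci-address>: <event>" format seen in this thread. The function name `summarize_loop_events` is invented for illustration.

```python
import re
from collections import Counter

# Matches driver lines such as:
#   [ 1840.109141] qla2xxx 0000:0a:01.0: LIP reset occurred (f8f7).
# The "scsi(N): ..." debug lines are deliberately not matched, so each
# event is counted once.
EVENT_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+qla2xxx\s+(?P<dev>\S+):\s+"
    r"(?P<event>LIP reset occurred|LIP occurred|"
    r"LOOP DOWN detected|LOOP UP detected)"
)


def summarize_loop_events(lines):
    """Count loop events per (PCI device, event type)."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("event"))] += 1
    return counts
```

Passing an open log file (for example `summarize_loop_events(open("dmesg4.log"))`, assuming the log was saved locally under that name) gives a count per adapter and event, which makes it easier to see whether both 22xx ports are affected equally.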
-- av
* Re: QLA2200 causes kernel bug
From: Thomas Georgiou @ 2009-08-07 19:19 UTC
To: Andrew Vasquez; +Cc: linux-scsi
I am not sure what is happening at 1840.
The current topology is royal (the machine in this backtrace)
connected via 2 Fibre Channel connections directly to a PowerVault
224F JBOD. That JBOD is then connected via 2 more connections to another
224F, which is in turn connected to another machine, fiord (which has
also had problems).
I had royal connected to one 224F with 2 connections, with that JBOD
not connected to anything else, and it worked with no problems for
the time it was connected like that (2 days).
I have also tried connecting fiord and royal to two PowerVault 51F
switches in a redundant configuration and then the switches to the
224Fs. This also generated problems and was where most of the
backtraces in the bug reports came from.
I have set qlport_down_retry=1 for faster failover. Should I unset
it? A constant stream of RESETs is not expected.
On Fri, Aug 7, 2009 at 3:01 AM, Andrew Vasquez<andrew.vasquez@qlogic.com> wrote:
> Could you let me know what's going on around timestamp 1840:
> There seems to be a great deal of fabric disruption occurring on the
> fibre, and it seems to be occurring on both 22xx ports. It also
> appears that you've set port-down-timeout (dev-loss-tmo) to 0 seconds
> (for faster failovers)?
>
> Could you describe the topology? Would it be possible to isolate the
> faults? I take it the constant stream of RESETs is not expected?
* Re: QLA2200 causes kernel bug
From: Andrew Vasquez @ 2009-08-07 21:11 UTC
To: Thomas Georgiou; +Cc: linux-scsi
On Fri, 07 Aug 2009, Thomas Georgiou wrote:
> I am not sure what is happening at 1840.
>
> The current topology is royal (the machine in this backtrace)
> connected via 2 fibre channel connections directly to a Powervault
> 224F jbod. This is then connected via 2 connections again to another
> 224F, which is then connected to another machine, fiord (which also
> has had problems).
>
> I had royal connected to one 224f with 2 connections and did not
> connect that jbod to anything else, and it worked with no problems for
> the time it was connected like that (2 days).
>
OK, so it looks like there are two problems. First, I'd suggest you talk
with your JBOD vendor to see whether this daisy-chained configuration is
supported. Is the JBOD acting as a mini-hub in this configuration?
Either way, as can be seen from the logs, your storage device is
continually LIP/LIP-resetting, causing intermittent loss of visibility
to your storage, oftentimes for long enough to have the midlayer
begin its reaping of scsi-devices. Given the low seed value
for dev-loss-tmo (set via your qlport_down_retry usage), after
numerous LIPs you run into the second issue: the BUG_ON() triggering
within the FC transport, in the deferred execution of rport reaping in
fc_timeout_deleted_rport().
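To see which dev-loss-tmo value each remote port actually ended up with, the FC transport's per-rport sysfs attribute can be read. This is a small Python sketch, assuming the usual /sys/class/fc_remote_ports/rport-*/dev_loss_tmo layout. The function name `read_dev_loss_tmo` and its `sysfs_root` argument are illustrative only.

```python
from pathlib import Path


def read_dev_loss_tmo(sysfs_root="/sys"):
    """Map each FC remote port name to its dev_loss_tmo in seconds."""
    base = Path(sysfs_root) / "class" / "fc_remote_ports"
    result = {}
    if not base.is_dir():
        # No FC transport class present (module not loaded, or not Linux).
        return result
    for rport in sorted(base.iterdir()):
        try:
            text = (rport / "dev_loss_tmo").read_text().strip()
            result[rport.name] = int(text)
        except (OSError, ValueError):
            # Skip entries that vanish mid-scan or hold unexpected text.
            continue
    return result
```

Very small values in the result would be consistent with the short timeout described above. An rport that is being removed while the scan runs is simply skipped.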
> I have also tried connecting fiord and royal to two powervault 51f
> switches in a redundant configuration and then the switches to the
> 224Fs. This also generated problems and was where most of the
> backtraces in the bug reports came from.
Just for completeness, could you gather a similar set of driver logs
with error-logging enabled within this configuration?
> I have set qlport_down_retry=1 for faster failover.
Increasing it may help to avoid problem (2).
> Should I unset
> it? A constant stream of RESETs is not expected.
Regards,
Andrew Vasquez